1 of 32

BNN-PYNQ overlays

Yaman Umuroglu, NTNU

QPYNQ workshop at RISE SICS

Stockholm, Sweden

2 of 32

Outline

  1. Introduction to BNN-PYNQ (~10m)

  2. Existing overlays and networks + hands-on (~20m)

  3. New network on an existing overlay + hands-on (~15m)

  4. Making new overlays (~5m)

3 of 32

Part 1: Introduction to BNN-PYNQ

4 of 32

What is BNN-PYNQ?

  • Two overlays for the PYNQ FPGA
  • Five BNNs running on overlays + Jupyter notebooks
  • Tools to put new BNNs into overlays
  • Tools to build new overlays
  • All open-source on GitHub

5 of 32

What can BNN-PYNQ do?

Overlay                                   Performance                           Network         Accuracy
LFC: fully connected, 28x28 monochrome    168 kFPS (974 GOPS), 102 us latency   MNIST           98.4%
                                                                                Fashion-MNIST   85.5%
                                                                                NIST SD-19      79.2%
CNV: VGG-like, convolutional, 32x32 RGB   3 kFPS (341 GOPS), 1541 us latency    CIFAR-10        80.1%
                                                                                SVHN            96.7%
                                                                                GTSRB           97.7%

At 2-2.5 W of power consumption

(less if you don’t need so much performance)
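The throughput and power figures above imply a per-frame cost. A rough back-of-the-envelope sketch, assuming the upper 2.5 W figure applies to both overlays at full throughput (the slide only gives a 2-2.5 W range, so the energy numbers are an upper estimate):

```python
# Throughput figures are taken from the slide; 2.5 W is the upper end of the
# quoted power range and is an assumption for both overlays.
POWER_W = 2.5

overlays = {
    "LFC": {"fps": 168e3, "gops": 974},
    "CNV": {"fps": 3e3, "gops": 341},
}


def per_frame_costs(fps, gops, power_w=POWER_W):
    """Return (operations per frame, energy per frame in microjoules)."""
    ops_per_frame = gops * 1e9 / fps
    energy_uj = power_w / fps * 1e6
    return ops_per_frame, energy_uj


for name, spec in overlays.items():
    ops, energy = per_frame_costs(spec["fps"], spec["gops"])
    print(f"{name}: {ops / 1e6:.1f} Mops/frame, {energy:.1f} uJ/frame")
```

Under these assumptions LFC comes out at roughly 5.8 Mops and about 15 uJ per frame, while CNV needs over 100 Mops per frame.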

6 of 32

Typical HW Architecture for Inference

  • Execute one layer at a time on a compute array
  • This is what most other solutions do
    • CPUs, GPUs, Google TPU, …
  • Issues: latency, utilization (batch size ↑), off-chip accesses

[Diagram labels: main memory (off-chip); maximum-sized compute array for all layers (on-chip), built from homogeneous processing elements; on-chip feedback path]

7 of 32

FINN: Heterogeneous Streaming Architecture

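[Diagram: image → Layer 0 → Layer 1 → … → Layer N → result, on the FPGA, following the BNN topology. Example labels: layers with 1M ops and 10M ops get 1x PE and 10x PE for 1x FPS, or 10x PE and 100x PE for 10x FPS.]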

  • One hardware layer per BNN layer
  • Heterogeneous: Avoid “one-size-fits-all” penalties
  • Streaming: Maximize throughput, minimize latency
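The rate-balancing idea behind the PE labels in the diagram can be sketched numerically. This is a minimal model, assuming each PE completes one operation per clock cycle and that the slowest layer sets the throughput of the whole pipeline. The 100 MHz clock and the function names are illustrative assumptions, not values from BNN-PYNQ.

```python
def layer_fps(ops_per_frame, num_pe, clock_hz):
    """Frames per second one layer can sustain (assumes 1 op per PE per cycle)."""
    cycles_per_frame = ops_per_frame / num_pe
    return clock_hz / cycles_per_frame


def pipeline_fps(layers, clock_hz=100e6):
    """A streaming pipeline runs at the rate of its slowest layer."""
    return min(layer_fps(ops, pe, clock_hz) for ops, pe in layers)


# (ops per frame, number of PEs) for a two-layer example as in the diagram.
balanced_1x = [(1e6, 1), (10e6, 10)]
balanced_10x = [(1e6, 10), (10e6, 100)]
one_size_fits_all = [(1e6, 10), (10e6, 10)]  # same PE count for every layer

print(pipeline_fps(balanced_1x))
print(pipeline_fps(balanced_10x))
print(pipeline_fps(one_size_fits_all))
```

In this toy model the uniform allocation spends 20 PEs yet runs no faster than the balanced allocation with 11 PEs, because the small layer's extra PEs sit idle waiting for the large layer.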

8 of 32

More gory hardware details?

9 of 32

Whirlwind Tour of Folder Structure

All Jupyter notebooks will be copied to:

/home/xilinx/jupyter_notebooks/bnn

All source code and prebuilt overlays will be copied to:

/opt/python3.6/lib/python3.6/site-packages/bnn/

  • bnn.py -- Python API
  • bitstreams/ -- FPGA bitstreams for overlays
  • libraries/ -- precompiled drivers for overlays
  • params/ -- network parameters for overlays
  • src/ -- source code for HW, drivers and training scripts

10 of 32

Part 2: Existing Overlays and Networks

11 of 32

Overlay vs Network?

The Overlay is a...

  • hardware accelerator
    • FPGA bitstream (+driver)
  • w/ fixed topology
    • fixed input size
    • fixed layer shapes
    • fixed quantization
  • w/ fixed performance
  • parameters, dataset and accuracy come from the network

The Network is a...

  • trained QNN
    • fixed parameters
    • fixed topology
  • w/ fixed dataset
  • w/ fixed accuracy

12 of 32

The LFC Overlay

Fully binarized, including inputs and outputs

168 kFPS, 0.1 ms latency

Topology:

  input: 28x28 binary → fc:1024 + threshold → fc:1024 + threshold → fc:1024 + threshold → fc:64 + threshold → output: 64 binary

Available networks:

  • MNIST -- handwritten digits
  • Fashion-MNIST -- clothing
  • NIST SD-19 -- handwritten letters & digits
  • your network! ...
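The layer sizes on this slide are enough to estimate how much parameter storage the LFC overlay needs. A small sketch, assuming one bit per weight (fully binarized), ignoring threshold storage, and counting two operations (multiply and add) per weight:

```python
# Layer widths read off the LFC slide: 28x28 binary input, then four FC layers.
LFC_WIDTHS = [28 * 28, 1024, 1024, 1024, 64]


def fc_weight_counts(widths):
    """Number of weights in each fully connected layer (inputs x outputs)."""
    return [n_in * n_out for n_in, n_out in zip(widths, widths[1:])]


weights = fc_weight_counts(LFC_WIDTHS)
total_bits = sum(weights)          # 1 bit per weight when fully binarized
total_kib = total_bits / 8 / 1024
ops_per_frame = 2 * total_bits     # assumption: 1 multiply + 1 add per weight

print(weights)
print(f"{total_bits} weight bits = {total_kib:.0f} KiB")
print(f"{ops_per_frame / 1e6:.2f} Mops/frame")
```

The result, about 5.9 Mops per frame, is in the same ballpark as the quoted 974 GOPS divided by 168 kFPS.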

13 of 32

Clone and pip install BNN-PYNQ

From a Jupyter terminal (root by default), run:

pip3.6 install --upgrade git+https://github.com/maltanar/BNN-PYNQ.git

(already installed! may take a few minutes to reinstall)

Note: this is the “workshop version” with some extras. The mainstream version is at:

https://github.com/Xilinx/BNN-PYNQ

14 of 32

Hands-on: Minimal MNIST

Let’s go through the following Jupyter notebook:

bnn/minimal_mnist.ipynb

15 of 32

Under the Hood

[Diagram labels: Jupyter notebook; Python libraries; MLBP; Cortex-A9 CPU; DRAM; FPGA]

16 of 32

Load Overlay Bitstream

[Diagram labels: Jupyter notebook; Python libraries; MLBP; Cortex-A9 CPU; DRAM; FPGA]

[Overlay blocks: DMA in; fc:1024 + threshold (x3); fc:64 + threshold; DMA out; parameter memory banks; control/status]

17 of 32

Load Network Parameters

[Diagram labels: Jupyter notebook; Python libraries; MLBP; Cortex-A9 CPU; DRAM; FPGA]

[Overlay blocks: DMA in; fc:1024 + threshold (x3); fc:64 + threshold; DMA out; parameter memory banks; control/status]

18 of 32

Resize and Pack Input Images

[Diagram labels: Jupyter notebook; Python libraries; MLBP; Cortex-A9 CPU; DRAM; FPGA]

[Overlay blocks: DMA in; fc:1024 + threshold (x3); fc:64 + threshold; DMA out; parameter memory banks; control/status]

[Highlighted steps: resize; pack; N input images]
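The resize and pack steps can be illustrated in plain Python. This is only a sketch of the idea: the nearest-neighbour resize, the threshold of 128, the 64-bit word size and the least-significant-bit-first order are all assumptions made for the example, and the actual packing in BNN-PYNQ is done by the C++ host code.

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list of grey levels."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def binarize(img, threshold=128):
    """Flatten row-major; bright pixels become 1 (+1), dark pixels 0 (-1)."""
    return [1 if px >= threshold else 0 for row in img for px in row]


def pack_bits(bits, word_bits=64):
    """Pack a bit list into integers, first bit in the least significant position."""
    words = []
    for start in range(0, len(bits), word_bits):
        word = 0
        for offset, bit in enumerate(bits[start:start + word_bits]):
            word |= bit << offset
        words.append(word)
    return words


# Synthetic 56x56 image: bright left half, dark right half.
image = [[255] * 28 + [0] * 28 for _ in range(56)]
small = resize_nearest(image, 28, 28)
packed = pack_bits(binarize(small))
print(len(packed), hex(packed[0]))
```

A 28x28 binary image has 784 bits, so with 64-bit words it occupies 13 words, the last of which is only partly filled. How that tail is padded must match what the hardware expects.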

19 of 32

Run Accelerator with N Images

[Diagram labels: Jupyter notebook; Python libraries; MLBP; Cortex-A9 CPU; DRAM; FPGA]

[Overlay blocks: DMA in; fc:1024 + threshold (x3); fc:64 + threshold; DMA out; parameter memory banks; control/status]

[Data flow: N input images in; N output vectors out]
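What each "fc + threshold" stage computes on binarized data can be mimicked in software. The sketch below uses the XNOR-popcount formulation of a {-1, +1} dot product followed by a threshold comparison. It illustrates the arithmetic only, not the HLS implementation, and the bit encoding (+1 stored as bit 1) and all names are assumptions.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def pack(values):
    """Pack a list of +1/-1 values into an int, encoding +1 as bit 1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_fc_threshold(x_bits, weight_rows, thresholds, n_inputs):
    """One 'fc + threshold' stage on packed {-1, +1} data.

    XNOR marks the positions where input and weight agree. With m agreements
    out of n inputs, the +1/-1 dot product equals 2*m - n. The output is +1
    when the dot product reaches the neuron's threshold, else -1.
    """
    mask = (1 << n_inputs) - 1
    out = []
    for w_bits, t in zip(weight_rows, thresholds):
        matches = popcount(~(x_bits ^ w_bits) & mask)
        dot = 2 * matches - n_inputs
        out.append(1 if dot >= t else -1)
    return out


x = [1, -1, 1, 1]
W = [[1, -1, 1, 1], [-1, 1, -1, -1], [1, 1, -1, 1]]
y = binary_fc_threshold(pack(x), [pack(r) for r in W], [0, 0, 1], n_inputs=4)
print(y)
```

The three toy neurons have dot products of 4, -4 and 0 with the input, so with thresholds 0, 0 and 1 only the first one fires.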

20 of 32

The CNV Overlay

Binarized except first layer: 8-bit inputs

and last layer: 16-bit outputs

3 kFPS, 1.5 ms latency

Topology:

  input: 32x32, 3x8-bit RGB → cnv:3x3:64 + threshold → cnv:3x3:64 + threshold → maxpool:2x2 → cnv:3x3:128 + threshold → cnv:3x3:128 + threshold → maxpool:2x2 → cnv:3x3:256 + threshold → cnv:3x3:256 + threshold → fc:512 + threshold → fc:512 + threshold → fc:64 → output: 64, 16-bit

Available networks:

  • CIFAR-10 -- animals, vehicles..
  • GTSRB -- traffic signs
  • SVHN -- street view house numbers
  • your network! ...

21 of 32

Hands-on: CIFAR-10

Let’s go through the following Jupyter notebook:

bnn/Cifar10.ipynb

22 of 32

Part 3: A New Network on an Existing Overlay

23 of 32

Topology

New network topology must match overlay exactly

When training new nets, use lfc.py and cnv.py in bnn/src/training to guarantee correct topology for the overlays
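A quick sanity check before generating parameters is to compare the trained layer shapes against the overlay. The sketch below is hypothetical: it assumes each layer's weight shape is available as an (inputs, outputs) pair, and the LFC shapes are read off the overlay slide rather than taken from the training scripts.

```python
# Layer shapes (inputs, outputs) as shown on the LFC overlay slide.
LFC_SHAPES = [(784, 1024), (1024, 1024), (1024, 1024), (1024, 64)]


def topology_mismatches(weight_shapes, overlay_shapes):
    """Return a list of human-readable problems; empty means the shapes match."""
    problems = []
    if len(weight_shapes) != len(overlay_shapes):
        problems.append(
            f"expected {len(overlay_shapes)} layers, got {len(weight_shapes)}"
        )
    for i, (got, want) in enumerate(zip(weight_shapes, overlay_shapes)):
        if tuple(got) != tuple(want):
            problems.append(f"layer {i}: expected {want}, got {tuple(got)}")
    return problems


good = [(784, 1024), (1024, 1024), (1024, 1024), (1024, 64)]
bad = [(784, 512), (512, 512), (512, 64)]
print(topology_mismatches(good, LFC_SHAPES))
print(topology_mismatches(bad, LFC_SHAPES))
```

Using the provided lfc.py and cnv.py makes such a check unnecessary in the normal case; it is mainly useful when a network was trained with a modified script.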

24 of 32

Input and labels

CNV uses inputs in the range [-1, +1]

LFC uses binarized {-1, +1} inputs and outputs

Remember to rescale inputs, maybe enhance contrast

Use mnist.py and cifar10.py as templates
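The input conventions above can be sketched with a few helper functions. These are illustrative only: the linear min-max contrast stretch, the 0..255 input range and the zero threshold are assumptions, and mnist.py and cifar10.py remain the reference for what the training scripts actually do.

```python
def rescale_to_unit_range(pixels, lo=0, hi=255):
    """Linearly map values in [lo, hi] to [-1.0, +1.0] (CNV-style inputs)."""
    return [2.0 * (p - lo) / (hi - lo) - 1.0 for p in pixels]


def stretch_contrast(pixels):
    """Simple contrast enhancement: stretch the observed min..max to 0..255."""
    p_min, p_max = min(pixels), max(pixels)
    if p_max == p_min:
        return list(pixels)
    return [round(255 * (p - p_min) / (p_max - p_min)) for p in pixels]


def binarize_pm1(values, threshold=0.0):
    """Map rescaled values to {-1, +1} (LFC-style inputs)."""
    return [1 if v >= threshold else -1 for v in values]


raw = [40, 60, 130, 200]              # a low-contrast toy "image"
stretched = stretch_contrast(raw)     # now spans the full 0..255 range
cnv_input = rescale_to_unit_range(stretched)
lfc_input = binarize_pm1(cnv_input)
print(stretched, cnv_input, lfc_input)
```

Whatever preprocessing is chosen, the same steps have to be applied at inference time on the board, otherwise accuracy will not match the training results.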

25 of 32

Parameter Generation

Once network is trained, convert npz to packed weights

Almost identical procedure for same overlay, examples:

bnn/src/training/mnist-gen-binary-weights.py

bnn/src/training/cifar10-gen-binary-weights.py

Copy packed weight folder into bnn/params

26 of 32

Hands-on: Fashion-MNIST on LFC

Let’s go through the following Jupyter notebook:

bnn/new_params_for_overlay.ipynb

27 of 32

Part 4: Making New Overlays

28 of 32

Warning: Here Be Dragons

  • At the moment, making new overlays requires significant knowledge of the BNN-PYNQ internals

  • Future BNN-PYNQ releases will support new topologies much more flexibly (November 2017)

  • We will only briefly cover a few tips here

29 of 32

Where to Get Started?

Study the source code for existing overlays

  • bnn/src/
    • library/
      • driver/ -- low-level driver for communication
      • hls/ -- BNN hardware building blocks in Vivado HLS
      • host/ -- C++ to pack inputs, launch accelerator, get results
      • script/ -- scripts for FPGA synthesis
    • network/{cnv-pynq, lfc-pynq}
      • hw/ -- instantiation of hardware overlay
      • sw/ -- C++ functions called by Python for this overlay
    • training/

30 of 32

New Overlay Tips for Current Version

  • Training
    • Pick layers that already exist in HW library (valid padding..)
  • Hardware
    • Must fit into FPGA (LUT and BRAM constraints)
    • Watch out for padding
      • matrix rows / PE, matrix cols / SIMD, I/O bus width
    • Intra-layer FIFO sizes matter for performance
  • Software (driver)
    • Padding, endianness for bit packing etc. must match HW
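The padding tip for matrix rows / PE and matrix cols / SIMD can be made concrete with a small calculation. The sketch assumes that rows correspond to neurons and columns to inputs, and that each dimension is rounded up to the next multiple of its parallelism factor. The PE and SIMD values are illustrative, not the shipped overlay configuration.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def padded_matrix_shape(rows, cols, pe, simd):
    """Pad rows to a multiple of PE and columns to a multiple of SIMD."""
    return round_up(rows, pe), round_up(cols, simd)


def folding(rows, cols, pe, simd):
    """Number of time-multiplexed steps as (row folds, column folds)."""
    p_rows, p_cols = padded_matrix_shape(rows, cols, pe, simd)
    return p_rows // pe, p_cols // simd


# First LFC layer: 1024 neurons (rows) x 784 inputs (cols).
# PE=32 and SIMD=64 are example values chosen for this sketch.
print(padded_matrix_shape(1024, 784, pe=32, simd=64))
print(folding(1024, 784, pe=32, simd=64))
```

With these example values the 784 input columns are padded to 832, so the driver must pack 48 extra bits per row in exactly the way the hardware expects.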

31 of 32

Resources and Further Reading

32 of 32

Thank you for listening!
