1 of 32

BNN-PYNQ overlays

Yaman Umuroglu, NTNU

QPYNQ workshop at RISE SICS

Stockholm, Sweden

2 of 32

Outline

Introduction to BNN-PYNQ (~10m)

Existing overlays and networks + hands-on (~20m)

New network on an existing overlay + hands-on (~15m)

Making new overlays (~5m)


3 of 32

Part 1: Introduction to BNN-PYNQ


4 of 32

What is BNN-PYNQ?

Two overlays for the PYNQ FPGA
Five BNNs running on overlays + Jupyter notebooks
Tools to put new BNNs into overlays
Tools to build new overlays
All open-source on GitHub


5 of 32

What can BNN-PYNQ do?

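Overlay	Input	Throughput	Latency	Network	Accuracy
LFC (fully connected)	28x28 monochrome	168 kFPS (974 GOPS)	102 us	MNIST	98.4%
LFC (fully connected)	28x28 monochrome	168 kFPS (974 GOPS)	102 us	Fashion-MNIST	85.5%
LFC (fully connected)	28x28 monochrome	168 kFPS (974 GOPS)	102 us	NIST SD-19	79.2%
CNV (VGG-like, convolutional)	32x32 RGB	3 kFPS (341 GOPS)	1541 us	CIFAR-10	80.1%
CNV (VGG-like, convolutional)	32x32 RGB	3 kFPS (341 GOPS)	1541 us	SVHN	96.7%
CNV (VGG-like, convolutional)	32x32 RGB	3 kFPS (341 GOPS)	1541 us	GTSRB	97.7%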

At 2-2.5 W of power consumption (less if you don’t need so much performance)
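As a quick sanity check on the table above, the per-frame workload and energy can be derived from the quoted figures. This sketch assumes each overlay runs at its peak frame rate and draws the upper-bound 2.5 W; both are simplifications, and the names used are only for illustration.

```python
# Figures quoted on this slide: (frames per second, operations per second).
overlays = {
    "LFC": (168e3, 974e9),
    "CNV": (3e3, 341e9),
}
POWER_W = 2.5  # upper end of the quoted 2-2.5 W range (assumed constant)


def per_frame(fps, ops_per_s, power_w=POWER_W):
    """Return (operations per frame, energy per frame in joules)."""
    return ops_per_s / fps, power_w / fps


for name, (fps, ops) in overlays.items():
    ops_frame, joules = per_frame(fps, ops)
    print(f"{name}: {ops_frame / 1e6:.1f} Mops/frame, {joules * 1e6:.1f} uJ/frame")
```

Under these assumptions LFC needs roughly 6 Mops and CNV roughly 114 Mops per frame, which is why their frame rates differ so much despite GOPS figures of the same order.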


6 of 32

Typical HW Architecture for Inference

Execute one layer at a time on a compute array
This is what most other solutions do (CPUs, GPUs, Google TPU, …)

Issues: latency, utilization (needs a larger batch size), off-chip memory accesses

[Figure: off-chip main memory feeding an on-chip, maximum-sized compute array of homogeneous processing elements, used for all layers, with an on-chip feedback path]


7 of 32

FINN: Heterogeneous Streaming Architecture

[Figure: on the FPGA, image → Layer 0 → Layer 1 → … → Layer N → result, mirroring the BNN topology. Example: two layers with 1M ops and 10M ops get 1x PE and 10x PE for 1x FPS, or 10x PE and 100x PE for 10x FPS.]

One hardware layer per BNN layer

Heterogeneous: Avoid “one-size-fits-all” penalties

Streaming: Maximize throughput, minimize latency
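The balancing idea behind the heterogeneous design can be sketched in a few lines: in a streaming pipeline the slowest layer sets the frame rate, so compute resources are best assigned in proportion to each layer's work. The model below is a deliberate simplification (one operation per PE per cycle, a hypothetical function name, and an assumed 100 MHz clock), not the FINN tool flow.

```python
def pipeline_fps(ops_per_layer, pes_per_layer, clock_hz=100e6):
    """Frame rate of a streaming pipeline where each PE does one op per cycle.

    All layers work on different frames concurrently, so the layer that
    needs the most cycles per frame limits the whole pipeline.
    """
    cycles = [ops / pes for ops, pes in zip(ops_per_layer, pes_per_layer)]
    return clock_hz / max(cycles)


ops = [1_000_000, 10_000_000]            # per-frame work of two layers

balanced = pipeline_fps(ops, [1, 10])    # PEs proportional to ops
scaled = pipeline_fps(ops, [10, 100])    # same ratio, 10x the resources
uniform = pipeline_fps(ops, [10, 10])    # "one size fits all"

print(balanced, scaled, uniform)
```

In this toy model the uniform allocation spends 20 PEs yet is no faster than the balanced one with 11, because the first layer's extra PEs sit idle waiting for the second layer.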

8 of 32

More gory hardware details?

https://arxiv.org/abs/1612.07119


9 of 32

Whirlwind Tour of Folder Structure

All Jupyter notebooks will be copied to:

/home/xilinx/jupyter_notebooks/bnn

All source code and prebuilt overlays will be copied to:

/opt/python3.6/lib/python3.6/site-packages/bnn/

bnn.py -- Python API
bitstreams/ -- FPGA bitstreams for overlays
libraries/ -- precompiled drivers for overlays
params/ -- network parameters for overlays
src/ -- source code for HW, drivers and training scripts


10 of 32

Part 2: Existing Overlays and Networks


11 of 32

Overlay vs Network?

The Overlay is a...

hardware accelerator: FPGA bitstream (+driver)
w/ fixed topology: fixed input size, fixed layer shapes, fixed quantization
w/ fixed performance
parameters, dataset, accuracy? These come from the network

The Network is a...

trained QNN: fixed parameters, fixed topology
w/ fixed dataset
w/ fixed accuracy


12 of 32

The LFC Overlay

Fully binarized, including inputs and outputs

168 kFPS, 0.1 ms latency

Topology: input (28x28, binary) → fc:1024 → threshold → fc:1024 → threshold → fc:1024 → threshold → fc:64 → threshold → output (64, binary)

Available networks:

MNIST (handwritten digits)
Fashion-MNIST (clothing)
NIST SD-19 (handwritten letters & digits)
your network! ...
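A fully binarized fc + threshold stage can be modeled in software as follows. The sketch assumes the XNOR-popcount formulation described in the FINN paper, with {-1, +1} values encoded as bits {0, 1}. The function names and the tiny layer size are illustrative only; this is not the overlay's actual HLS code.

```python
def xnor_popcount(x_bits, w_bits, n):
    """Count positions where two n-bit vectors agree (packed in Python ints)."""
    mask = (1 << n) - 1
    return bin(~(x_bits ^ w_bits) & mask).count("1")


def binary_fc_threshold(x_bits, weight_rows, thresholds, n):
    """One fc + threshold stage: returns one output bit per neuron.

    With bit 1 meaning +1 and bit 0 meaning -1, the +/-1 dot product equals
    2 * popcount - n, so comparing the popcount against a per-neuron
    threshold is enough to produce the binarized activation.
    """
    return [
        1 if xnor_popcount(x_bits, w, n) >= t else 0
        for w, t in zip(weight_rows, thresholds)
    ]


# 4 inputs, 2 neurons (the LFC layers are far wider).
x = 0b1011
weights = [0b1011, 0b0100]
print(binary_fc_threshold(x, weights, thresholds=[2, 2], n=4))  # [1, 0]
```

The first neuron's weights agree with the input in all four positions and fire; the second neuron's weights disagree everywhere and do not.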


13 of 32

Clone and pip install BNN-PYNQ

From a Jupyter terminal (root by default), run:

pip3.6 install --upgrade git+https://github.com/maltanar/BNN-PYNQ.git

(already installed! may take a few minutes to reinstall)

Note: this is the “workshop version” with some extras. The mainline version is at:

https://github.com/Xilinx/BNN-PYNQ


14 of 32

Hands-on: Minimal MNIST

Let’s go through the following Jupyter notebook:

bnn/minimal_mnist.ipynb


15 of 32

Under the Hood

[Figure: system diagram with Cortex-A9 CPU, FPGA, DRAM, Python libraries, MLBP and Jupyter notebook]


16 of 32

Load Overlay Bitstream

[Figure: system diagram with Cortex-A9 CPU, FPGA, DRAM, Python libraries, MLBP and Jupyter notebook. FPGA contents: DMA in, fc:1024 + threshold (x3), fc:64 + threshold, DMA out, parameter memory banks, control/status.]


17 of 32

Load Network Parameters

[Figure: system diagram with Cortex-A9 CPU, FPGA, DRAM, Python libraries, MLBP and Jupyter notebook. FPGA contents: DMA in, fc:1024 + threshold (x3), fc:64 + threshold, DMA out, parameter memory banks, control/status.]


18 of 32

Resize and Pack Input Images

[Figure: system diagram with Cortex-A9 CPU, FPGA, DRAM, Python libraries, MLBP and Jupyter notebook. FPGA contents: DMA in, fc:1024 + threshold (x3), fc:64 + threshold, DMA out, parameter memory banks, control/status. Added in this step: resize, pack, N input images.]
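The resize and pack steps can be sketched in plain Python. This assumes nearest-neighbour resizing, a fixed binarization threshold of 128, and MSB-first packing into bytes. The real driver's resizing method, threshold, and bit order may differ and must match the hardware.

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2D list of grayscale pixels."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def binarize_and_pack(img, threshold=128):
    """Flatten, map each pixel to one bit, and pack 8 bits per byte (MSB first)."""
    bits = [1 if p >= threshold else 0 for row in img for p in row]
    bits += [0] * (-len(bits) % 8)          # pad to a whole number of bytes
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        packed.append(byte)
    return bytes(packed)


# A 56x56 test image: left half dark, right half bright.
image = [[0] * 28 + [255] * 28 for _ in range(56)]
small = resize_nearest(image, 28, 28)
packed = binarize_and_pack(small)
print(len(packed))  # 28 * 28 = 784 bits = 98 bytes
```

Packing turns each 28x28 image into 98 bytes, which is what makes streaming N images through a DMA channel cheap.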


19 of 32

Run Accelerator with N Images

[Figure: system diagram with Cortex-A9 CPU, FPGA, DRAM, Python libraries, MLBP and Jupyter notebook. FPGA contents: DMA in, fc:1024 + threshold (x3), fc:64 + threshold, DMA out, parameter memory banks, control/status. Added in this step: N input images, N output vectors.]


20 of 32

The CNV Overlay

Binarized except the first layer (8-bit inputs) and the last layer (16-bit outputs)

3 kFPS, 1.5 ms latency

Topology: input (32x32, 3x8-bit RGB) → cnv:3x3:64 → threshold → cnv:3x3:64 → threshold → cnv:3x3:128 → threshold → cnv:3x3:128 → threshold → maxpool:2x2 → cnv:3x3:256 → threshold → cnv:3x3:256 → threshold → fc:512 → threshold → fc:512 → threshold → fc:64 → output (64, 16-bit)

Available networks:

CIFAR-10 (animals, vehicles..)
GTSRB (traffic signs)
SVHN (street view house numbers)
your network! ...


21 of 32

Hands-on: CIFAR-10

Let’s go through the following Jupyter notebook:

bnn/Cifar10.ipynb


22 of 32

Part 3: A New Network on an Existing Overlay


23 of 32

Topology

New network topology must match overlay exactly

When training new nets, use lfc.py and cnv.py in bnn/src/training to guarantee correct topology for the overlays


24 of 32

Input and labels

CNV uses inputs in the range [-1, +1]

LFC uses binarized {-1, +1} inputs and outputs

Remember to rescale inputs, maybe enhance contrast

Use mnist.py and cifar10.py as templates
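The two input conventions can be illustrated as follows. The sketch assumes 8-bit source pixels, and the binarization threshold of 0 after rescaling is an assumption for illustration. The mnist.py and cifar10.py templates remain the reference for what the training scripts actually do.

```python
def rescale_to_unit_range(pixels):
    """Map 8-bit pixel values 0..255 linearly onto [-1.0, +1.0] (CNV-style)."""
    return [2.0 * p / 255.0 - 1.0 for p in pixels]


def binarize(values, threshold=0.0):
    """Map rescaled values onto {-1, +1} (LFC-style)."""
    return [1 if v >= threshold else -1 for v in values]


pixels = [0, 64, 128, 255]
scaled = rescale_to_unit_range(pixels)
print(scaled)
print(binarize(scaled))  # [-1, -1, 1, 1]
```

A low-contrast dataset may put most pixels on one side of the threshold, which is one reason contrast enhancement can help before binarizing.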


25 of 32

Parameter Generation

Once network is trained, convert npz to packed weights

Almost identical procedure for same overlay, examples:

bnn/src/training/mnist-gen-binary-weights.py

bnn/src/training/cifar10-gen-binary-weights.py

Copy packed weight folder into bnn/params
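Part of what such a conversion has to do is turn floating-point training results into the binary weights and thresholds the hardware expects. The sketch below follows the batch-norm folding described in the FINN paper and assumes a positive batch-norm scale. The function names are hypothetical, and it does not reproduce the file format written by the gen-binary-weights scripts.

```python
import math


def binarize_weights(weights):
    """Map real-valued weights to bits: 1 for >= 0 (+1), 0 for < 0 (-1)."""
    return [[1 if w >= 0 else 0 for w in row] for row in weights]


def batchnorm_to_threshold(mean, var, gamma, beta, eps=1e-4):
    """Fold batch norm followed by sign() into one threshold on the pre-activation.

    gamma * (a - mean) / sqrt(var + eps) + beta >= 0
    is equivalent to a >= mean - beta * sqrt(var + eps) / gamma
    when gamma > 0 (assumed here).
    """
    if gamma <= 0:
        raise ValueError("this sketch only handles gamma > 0")
    return mean - beta * math.sqrt(var + eps) / gamma


def to_popcount_threshold(threshold, fan_in):
    """Convert a threshold on the +/-1 dot product to one on the XNOR popcount.

    dot = 2 * popcount - fan_in, so dot >= t  <=>  popcount >= (t + fan_in) / 2.
    """
    return math.ceil((threshold + fan_in) / 2)


print(binarize_weights([[0.3, -0.7], [-0.1, 0.0]]))  # [[1, 0], [0, 1]]
t = batchnorm_to_threshold(mean=1.0, var=4.0, gamma=2.0, beta=-1.0, eps=0.0)
print(t, to_popcount_threshold(t, fan_in=8))  # 2.0 5
```

Folding batch norm into a single comparison is what lets each "threshold" block on the overlay diagrams replace a floating-point normalization at inference time.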


26 of 32

Hands-on: Fashion-MNIST on LFC

Let’s go through the following Jupyter notebook:

bnn/new_params_for_overlay.ipynb


27 of 32

Part 4: Making New Overlays


28 of 32

Warning: Here Be Dragons

At the moment, making new overlays requires significant knowledge of the BNN-PYNQ internals

Future BNN-PYNQ releases will support new topologies much more flexibly (November 2017)

We will only briefly cover a few tips here


29 of 32

Where to Get Started?

Study the source code for existing overlays

bnn/src/
    library/
        driver/ -- low-level driver for communication
        hls/ -- BNN hardware building blocks in Vivado HLS
        host/ -- C++ to pack inputs, launch accelerator, get results
        script/ -- scripts for FPGA synthesis
    network/{cnv-pynq, lfc-pynq}
        hw/ -- instantiation of hardware overlay
        sw/ -- C++ functions called by Python for this overlay
    training/


30 of 32

New Overlay Tips for Current Version

Training

Pick layers that already exist in HW library (valid padding..)

Hardware

Must fit into FPGA (LUT and BRAM constraints)
Watch out for padding: matrix rows must be a multiple of PE, matrix columns a multiple of SIMD, and data must align to the I/O bus width
Intra-layer FIFO sizes matter for performance

Software (driver)

Padding, endianness for bit packing etc. must match HW
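The padding constraint can be checked with a small helper before synthesis. It assumes the folding model from the FINN paper, in which a layer's weight matrix is processed in tiles of PE rows by SIMD columns, one tile per clock cycle. The function name and the example PE/SIMD values are made up for illustration and are not the settings of the shipped overlays.

```python
def fold_layer(rows, cols, pe, simd):
    """Return padded matrix shape and cycles per frame for one layer.

    Rows are padded up to a multiple of PE and columns to a multiple of
    SIMD, because each cycle handles a PE x SIMD tile of the matrix.
    """
    padded_rows = -(-rows // pe) * pe        # round up to a multiple of pe
    padded_cols = -(-cols // simd) * simd    # round up to a multiple of simd
    cycles = (padded_rows // pe) * (padded_cols // simd)
    return padded_rows, padded_cols, cycles


# 28x28 = 784 inputs feeding 1024 neurons, with hypothetical PE=32, SIMD=64.
print(fold_layer(rows=1024, cols=784, pe=32, simd=64))  # (1024, 832, 416)
```

Here the 784 columns are padded to 832, and the software that packs the weights has to apply exactly the same padding for the results to be correct.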


31 of 32

Resources and Further Reading

PYNQ

http://pynq.readthedocs.io/en/latest/

Vivado HLS

https://www.xilinx.com/video/hardware/getting-started-vivado-high-level-synthesis.html

Papers

FINN https://arxiv.org/abs/1612.07119
BinaryNet http://arxiv.org/abs/1602.02830


32 of 32

Thank you for listening!
