2 of 32

Overview

Background & Motivation
Related work

Region proposal
Cascade network

Mask-Net

The architecture
Promising features
Algorithm and hardware innovations

Experiment results
Design space exploration
Future research directions

4 of 32

Background & Motivation

Challenges for object detection on embedded systems with DNNs

Deep neural networks implemented on embedded systems face two challenges:

Limited computing and memory resources
Tight energy budget

5 of 32

Background & Motivation

Redundant computation: a large part of an image is background, and computing on these regions is unnecessary.

Sample image from the DAC-SDC^[1] dataset

Distribution of relative bounding box size in three different datasets

[1] Xiaowei Xu, Xinyi Zhang, Bei Yu, X Sharon Hu, Christopher Rowen, Jingtong Hu, and Yiyu Shi. DAC-SDC low power object detection challenge for UAV applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

7 of 32

Related Work: Region Proposal

Faster R-CNN	Mask R-CNN
Computationally expensive: needs deep convolution layers to extract enough features	Non-rectangular regions: not beneficial for hardware acceleration

The Mask R-CNN^[2] framework for instance segmentation and its extension

Faster R-CNN^[1] is a single, unified network for object detection. The RPN module serves as the ‘attention’ of this unified network.

[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems 28 (2015): 91-99.

[2] He, Kaiming, et al. "Mask R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2017.

8 of 32

Related Work: Cascade

A famous example: Cascade R-CNN

Four common cascade networks in object detection

Image credit: Cai, Zhaowei, and Nuno Vasconcelos. "Cascade R-CNN: Delving into high quality object detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

Pros	Cons
Higher quality detectors are only required to operate on higher quality hypotheses.	Not hardware-efficient: has many branches. Unsuitable for edge devices: fits dense DNNs better than lightweight DNNs.

10 of 32

The architecture of Mask-Net

11 of 32

The architecture of Mask-Net

Shared by new branch and backbone to extract preliminary features

12 of 32

The architecture of Mask-Net

Generate a mask with proposed regions

13 of 32

The architecture of Mask-Net

Only the proposed regions are computed to generate the bounding box

14 of 32

A case study: Mask-SkyNet

SkyNet^[1] is a hardware-efficient object detection and tracking backbone.

We choose SkyNet as the base model to design an FPGA accelerator.

Mask-SkyNet architecture

[1] Xiaofan Zhang, Haoming Lu, Cong Hao, Jiachen Li, Bowen Cheng, Yuhong Li, Kyle Rupnow, Jinjun Xiong, Thomas Huang, Honghui Shi, et al. SkyNet: a hardware-efficient method for object detection and tracking on embedded systems. Proceedings of Machine Learning and Systems, 2:216–229, 2020.

15 of 32

Promising features of Mask-Net

Generalizable

Can be applied to different object detection or tracking backbones, including SkyNet, UltraNet and ResNet-18.
Works well in different scenarios, including the DAC-SDC, UAV123 and OTB100 datasets.

Hardware friendly

The mask generation branch can reuse convolution modules in the backbone.
Mask shape regularization can help avoid complex control logic.
Channel shuffle can help reduce data movement between DRAM and on-chip memory.

Small Overhead

The mask generation branch’s computation cost is about 6% of the whole network.

18 of 32

Algorithm Innovations

Confidence mask and regions of interest generation

The confidence mask is generated by a sigmoid function.
Each patch's score in the mask is proportional to the probability that an object is present in that patch.

Mask generation process

The gate function
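
The mask generation and gate described on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the logit values, the 3x4 patch grid, and the 0.5 threshold are all assumptions (the deck states that the real threshold is chosen empirically).

```python
import math

def sigmoid(x):
    """Logistic function mapping a raw score to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def confidence_mask(logits):
    """Apply the sigmoid to every patch logit (a 2-D list of floats)."""
    return [[sigmoid(v) for v in row] for row in logits]

def gate(mask, threshold=0.5):
    """Binarize the confidence mask: 1 keeps a patch, 0 skips it.

    threshold=0.5 is a placeholder value chosen for this sketch.
    """
    return [[1 if score > threshold else 0 for score in row] for row in mask]

# Hypothetical 3x4 grid of patch logits from the mask generation branch.
logits = [
    [-4.0, -3.0, -2.5, -4.0],
    [-3.0,  2.0,  3.5, -2.0],
    [-4.0,  1.5,  0.2, -3.5],
]
binary = gate(confidence_mask(logits))
for row in binary:
    print(row)
```

Only the patches marked 1 would be passed on to the rest of the backbone; the others are treated as background.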

19 of 32

Algorithm Innovations

Two-stage training process

Stage 1: freeze the backbone weights and train only the new branch.
Stage 2: freeze the new branch weights and fine-tune the backbone.

Figure: Two-stage training process (labels: Stage 1, train the new branch; Stage 2, fine-tune the backbone; apply mask to backbone)

All-pass mechanism

If no patch's score exceeds the threshold, the whole image is computed, which improves the robustness of mask generation.
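
The all-pass fallback can be illustrated with a short sketch. The function name `apply_all_pass`, the 0.5 threshold, and the sample scores are hypothetical and chosen only for illustration.

```python
def apply_all_pass(mask, threshold=0.5):
    """Return a binary keep-mask, falling back to all ones when no patch passes.

    mask is a 2-D list of confidence scores in [0, 1].
    """
    binary = [[1 if s > threshold else 0 for s in row] for row in mask]
    if not any(any(row) for row in binary):
        # No patch exceeded the threshold: compute the whole image.
        return [[1] * len(row) for row in mask]
    return binary

low_scores = [[0.1, 0.2], [0.3, 0.05]]    # nothing passes the gate
mixed_scores = [[0.1, 0.9], [0.3, 0.05]]  # one patch passes

print(apply_all_pass(low_scores))
print(apply_all_pass(mixed_scores))
```

The fallback trades some extra computation on hard frames for not missing the object entirely when the mask branch is unsure.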

20 of 32

Hardware Innovations

Region of Interest Shape Regularization

Our FPGA accelerator is tile-based. If the regions of interest are not rectangular, an additional check is needed before loading and computing each tile.
Regularizing the regions of interest in the mask into rectangles avoids the complex control logic introduced by checking each patch's score.

Non-rectangular Shape Mask

Rectangular Shape Mask
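
One plausible way to regularize a mask is to replace the selected patches with their bounding rectangle. The slides do not specify the exact rule used in Mask-Net, so the sketch below is an assumption, and `regularize_to_rectangle` is a hypothetical helper name.

```python
def regularize_to_rectangle(binary):
    """Replace an arbitrary binary mask with its bounding rectangle.

    Returns (new_mask, (top, left, bottom, right)) with inclusive indices,
    or (all-zero mask, None) when no patch is selected.
    """
    rows = [r for r, row in enumerate(binary) if any(row)]
    if not rows:
        return [[0] * len(row) for row in binary], None
    cols = [c for row in binary for c, v in enumerate(row) if v]
    top, bottom = min(rows), max(rows)
    left, right = min(cols), max(cols)
    rect = [
        [1 if top <= r <= bottom and left <= c <= right else 0
         for c in range(len(row))]
        for r, row in enumerate(binary)
    ]
    return rect, (top, left, bottom, right)

# An L-shaped region of interest on a 4x4 patch grid.
l_shape = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
rect, box = regularize_to_rectangle(l_shape)
print(box)
for row in rect:
    print(row)
```

With a rectangular region, a tile-based accelerator only needs the four corner indices to decide which tiles to load, rather than a per-patch test.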

21 of 32

Hardware Innovations

Channel shuffle

Reduces data movement between DRAM and on-chip memory, which lowers computation cost.
Our channel shuffle method is the inverse of the channel shuffle in ShuffleNet.

Channel shuffle in Mask-Net

Channel shuffle in ShuffleNet^[1]

[1] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
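
To make "inverse process" concrete, the sketch below builds the ShuffleNet-style channel permutation (reshape to groups x n, transpose, flatten) and its inverse. It only illustrates the two permutations; how Mask-Net schedules the inverse shuffle in hardware is not described in the slides, and the helper names are hypothetical.

```python
def shuffle_order(num_channels, groups):
    """Channel order after a ShuffleNet-style shuffle.

    Element j of the result is the index of the input channel that ends up
    at output position j.
    """
    if num_channels % groups != 0:
        raise ValueError("num_channels must be divisible by groups")
    n = num_channels // groups
    return [g * n + i for i in range(n) for g in range(groups)]

def invert(order):
    """Inverse permutation: undoes the reordering described by order."""
    inverse = [0] * len(order)
    for position, source in enumerate(order):
        inverse[source] = position
    return inverse

def permute(channels, order):
    """Reorder a list of channels according to order."""
    return [channels[i] for i in order]

channels = ["c0", "c1", "c2", "c3", "c4", "c5"]
forward = shuffle_order(6, groups=2)
shuffled = permute(channels, forward)
restored = permute(shuffled, invert(forward))
print(forward)
print(shuffled)
print(restored)
```

The forward shuffle interleaves channels from different groups; the inverse gathers interleaved channels back into contiguous groups.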

24 of 32

Experiment Results

Mask quality

The experiment uses Mask-SkyNet on the DAC-SDC dataset.
84.3% of proposed regions completely cover the object, while only 1.2% are completely wrong.
Masked region proposals cause only 3% to 4% IoU loss.

Mask quality analysis

IoU loss comparison
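
For reference, the two quantities discussed on this slide can be computed as in the sketch below: IoU between a predicted and a ground-truth box, and a check for whether a proposed region completely covers the object. The (x1, y1, x2, y2) box format and the sample coordinates are assumptions made for illustration, not values from the experiments.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def fully_covers(region, box):
    """True when the proposed region contains the whole ground-truth box."""
    return (region[0] <= box[0] and region[1] <= box[1]
            and region[2] >= box[2] and region[3] >= box[3])

truth = (10, 10, 30, 30)      # hypothetical ground-truth box
predicted = (12, 10, 32, 30)  # hypothetical predicted box
region = (0, 0, 40, 40)       # hypothetical proposed region

print(iou(truth, predicted))
print(fully_covers(region, truth))
```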

25 of 32

Experiment Results

C/RTL co-simulation results

The network needs only 6% of the total inference time to correctly distinguish objects from background.

C/RTL co-simulation results from Vitis

26 of 32

Experiment Results

Software and hardware evaluation results

At the software level, we evaluate Mask-Net with three detection backbones (SkyNet, UltraNet and ResNet-18) on three datasets (DAC-SDC, OTB100 and UAV123). The weights and feature maps are 32-bit.
At the hardware level, we use the same datasets to test our Mask-SkyNet accelerator on a ZCU106 FPGA. The weights are 11 bits and the feature maps are 9 bits.

Software evaluation results (Table I)

Hardware evaluation results (Table II)

Resource utilization report (Table III)

73 of the 87 DSPs added come from the new branch

28 of 32

Design Space Exploration

Reasons for DSE

Time imbalance between different parts of the accelerator affects overall performance.
We want to allocate DSPs optimally across the parts of the accelerator to balance computation under a fixed number of DSPs.

DSE model

The model calculates the theoretical speedup after DSP redistribution under a fixed number of DSPs.

[Equations (1) to (4)]
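
The idea of redistributing a fixed DSP budget can be illustrated with a small exhaustive search. This is not the authors' DSE model, whose equations are not reproduced in the slide text. It assumes three accelerator parts, that each part's latency is its workload divided by its DSP count, and that overall time is set by the slowest part; the workload numbers and the budget of 100 DSPs are hypothetical.

```python
def slowest_part_time(work, dsps):
    """Overall time under the assumed model: the slowest part dominates."""
    return max(w / d for w, d in zip(work, dsps))

def best_allocation(work, total_dsps):
    """Try every split of total_dsps across three parts.

    Returns (time, (a, b, c)) minimizing the slowest part's latency,
    with every part receiving at least one DSP.
    """
    best = None
    for a in range(1, total_dsps - 1):
        for b in range(1, total_dsps - a):
            c = total_dsps - a - b
            t = slowest_part_time(work, (a, b, c))
            if best is None or t < best[0]:
                best = (t, (a, b, c))
    return best

work = [600, 100, 300]   # hypothetical workloads of the three parts
total = 100              # hypothetical fixed DSP budget

baseline = slowest_part_time(work, (34, 33, 33))  # near-even split
best_time, best_split = best_allocation(work, total)
speedup = baseline / best_time
print(best_split, best_time, speedup)
```

Under these assumptions the best split gives each part DSPs in proportion to its workload, so that all three parts finish at the same time.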

29 of 32

Design Space Exploration Results

DSP exploration space when the extra DSP count is 1309

The relationship between the number of extra DSPs, inference time and theoretical speedup

DSP distribution across the three parts for different numbers of total extra DSPs

31 of 32

Future Research Directions

Adaptive threshold

The gate threshold in Mask-Net is currently selected empirically from experiment results.
An adaptive threshold may select region proposals more efficiently and further reduce computation cost, especially for complex backgrounds.

Extend Mask-Net to different tasks

Object tracking
Image classification
Instance segmentation
