Mask-Net: A Hardware-efficient Object Detection Network with Masked Region Proposals

Hanqiu Chen*, Cong (Callie) Hao

Georgia Institute of Technology

* Work done during internship at Georgia Tech

Overview

  • Background & Motivation
  • Related work
    • Region proposal
    • Cascade network

  • Mask-Net
    • The architecture
    • Promising features
    • Algorithm and hardware innovations
  • Experiment results
  • Design space exploration
  • Future research directions

Background & Motivation

Challenges for object detection on embedded systems with DNNs

Deep Neural Networks

Implementation on embedded systems

  • Limited computing and memory resources

  • Tight energy budget

Challenges

Background & Motivation

Redundant computation: a large part of an image is background, so it is unnecessary to compute on these regions.

Sample image from DAC-SDC [1] dataset

The distribution of bounding box relative size in three different datasets

[1] Xiaowei Xu, Xinyi Zhang, Bei Yu, X Sharon Hu, Christopher Rowen, Jingtong Hu, and Yiyu Shi. DAC-SDC low power object detection challenge for UAV applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

Related Work: Region Proposal

Faster R-CNN

Mask R-CNN

Computationally expensive: needs deep convolution layers to extract enough features

Non-rectangular regions: not beneficial for hardware acceleration

The Mask R-CNN[2] framework for instance segmentation and its extension

Faster R-CNN[1] is a single, unified network for object detection. The RPN module serves as the ‘attention’ of this unified network.

[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems 28 (2015): 91-99.

[2] He, Kaiming, et al. "Mask R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2017.

Related Work: Cascade Network

  • A famous example: Cascade R-CNN

Four common cascade networks in object detection

Image credit: Cai, Zhaowei, and Nuno Vasconcelos. "Cascade R-CNN: Delving into high quality object detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

Pros

  • Higher quality detectors are only required to operate on higher quality hypotheses.

Cons

  • Not hardware efficient: has many branches.
  • Unsuitable for edge devices: fits dense DNNs better than lightweight DNNs.

The architecture of Mask-Net

  • Shared layers: used by both the new branch and the backbone to extract preliminary features.
  • New branch: generates a mask with proposed regions.
  • Backbone: computes only the proposed regions to generate the bounding box.

A case study: Mask-SkyNet

SkyNet[1] is a hardware-efficient object detection and tracking backbone.

We choose SkyNet as the base model to design an FPGA accelerator.

Mask-SkyNet architecture

[1] Xiaofan Zhang, Haoming Lu, Cong Hao, Jiachen Li, Bowen Cheng, Yuhong Li, Kyle Rupnow, Jinjun Xiong, Thomas Huang, Honghui Shi, et al. SkyNet: A hardware-efficient method for object detection and tracking on embedded systems. Proceedings of Machine Learning and Systems, 2:216–229, 2020.

Promising features of Mask-Net

  • Generalizable
    • Can be applied to different object detection or tracking backbones, including SkyNet, UltraNet and ResNet-18.
    • Works well in different scenarios, including the DAC-SDC, UAV123 and OTB100 datasets.

  • Hardware friendly
    • The mask generation branch can reuse convolution modules in the backbone.
    • Mask shape regularization can help avoid complex control logic.
    • Channel shuffle can help reduce data movement between DRAM and on-chip memory.
  • Small overhead
    • The mask generation branch’s computation cost is about 6% of the whole network.

Algorithm Innovations

  • Confidence mask and regions of interest generation
    • The confidence mask is generated by a sigmoid function.
    • Each patch’s score in the mask is proportional to the probability that an object is present.

Mask generation process

The gate function
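
The mask generation step can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names, the 3x3 grid of raw patch scores and the 0.5 threshold are illustrative assumptions, and the gate is assumed to be a hard threshold on the sigmoid output.

```python
import math

def sigmoid(x):
    """Standard logistic function, maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def confidence_mask(patch_scores):
    """Turn a 2-D grid of raw patch scores into per-patch confidences."""
    return [[sigmoid(v) for v in row] for row in patch_scores]

def gate(conf, threshold=0.5):
    """Assumed gate: 1 marks a patch as a region of interest, 0 skips it."""
    return [[1 if c > threshold else 0 for c in row] for row in conf]

# Hypothetical raw scores for a 3x3 grid of patches.
patch_scores = [[-3.0, -2.0, -4.0],
                [-1.0,  2.5,  1.5],
                [-2.0,  0.5, -3.0]]

mask = gate(confidence_mask(patch_scores), threshold=0.5)
for row in mask:
    print(row)
```

Only patches whose gate output is 1 would be passed on to the backbone; the rest are treated as background.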

Algorithm Innovations

  • Two-stage training process
    • Stage 1: fix the weights in the backbone and train only the new branch.
    • Stage 2: fix the weights of the new branch and fine-tune the backbone.

Two-stage training process ( Stage 1: train the new branch; apply mask to backbone; Stage 2: fine-tune the backbone )

  • All-pass mechanism
    • If no patch’s score exceeds the threshold, the whole image is computed to improve the robustness of mask generation.
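
The all-pass rule can be illustrated as a fallback around the gate. The sketch below is an assumption-laden illustration rather than the authors' code: `apply_all_pass` is a hypothetical helper, the confidences are made-up values, and 0.5 is an arbitrary threshold.

```python
def apply_all_pass(conf, threshold=0.5):
    """Gate a grid of patch confidences, falling back to the full image.

    If no patch exceeds the threshold, every patch is enabled so that the
    backbone processes the whole image instead of an empty region.
    """
    mask = [[1 if c > threshold else 0 for c in row] for row in conf]
    if not any(any(row) for row in mask):
        # All-pass: nothing was proposed, so compute everything.
        return [[1] * len(row) for row in conf]
    return mask

# Hypothetical confidences: one grid with a confident patch, one without.
with_object = [[0.1, 0.9], [0.3, 0.4]]
no_proposal = [[0.1, 0.2], [0.3, 0.4]]

print(apply_all_pass(with_object))
print(apply_all_pass(no_proposal))
```

The fallback trades extra computation on hard images for not missing an object when the mask branch is unsure.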

Hardware Innovations

  • Region of Interest Shape Regularization
    • Our FPGA accelerator is tile-based. If the regions of interest are not rectangular, an additional check is needed before loading and computing each tile.
    • Regularizing the regions of interest in the mask into rectangles avoids the complex control logic introduced by checking each patch’s score.

0 0 0 0 0
0 0 0 1 0
0 1 1 1 0
0 0 1 0 0
0 0 0 0 0

Non-rectangular Shape Mask

0 0 0 0 0
0 1 1 1 0
0 1 1 1 0
0 1 1 1 0
0 0 0 0 0

Rectangular Shape Mask
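
One plausible way to regularize a mask is to replace the active patches with their smallest enclosing rectangle. The sketch below assumes that rule and a single region of interest per image; the slides do not spell out the exact procedure, and `regularize_to_rectangle` is a hypothetical name. The 5x5 example uses the mask values from the slide, read in row order, which is an assumption about the figure's layout.

```python
def regularize_to_rectangle(mask):
    """Return a mask whose active area is the bounding rectangle of the input.

    Assumes one region of interest; an empty mask is returned unchanged.
    """
    height, width = len(mask), len(mask[0])
    rows = [r for r in range(height) if any(mask[r])]
    cols = [c for c in range(width) if any(mask[r][c] for r in range(height))]
    if not rows:
        return [row[:] for row in mask]
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    return [[1 if top <= r <= bottom and left <= c <= right else 0
             for c in range(width)]
            for r in range(height)]

non_rectangular = [[0, 0, 0, 0, 0],
                   [0, 0, 0, 1, 0],
                   [0, 1, 1, 1, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 0]]

rectangular = regularize_to_rectangle(non_rectangular)
for row in rectangular:
    print(row)
```

With a rectangular region, a tile-based accelerator only needs the rectangle's four bounds to decide which tiles to load, instead of testing every patch.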

Hardware Innovations

  • Channel shuffle
    • Reduce data movement between DRAM and on-chip memory to reduce computation cost.
    • Our channel shuffle method is the inverse process of that in ShuffleNet.

Channel shuffle in Mask-Net

Channel shuffle in ShuffleNet[1]
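
To make "inverse process" concrete, the sketch below writes the ShuffleNet channel shuffle as a permutation of channel indices and then undoes it. It only illustrates the two permutations: how Mask-Net places the inverse shuffle in its dataflow is not specified here, and the function names and the 6-channel, 2-group example are illustrative assumptions.

```python
def shufflenet_shuffle(channels, groups):
    """ShuffleNet-style shuffle: view as (groups, n), transpose, flatten."""
    n = len(channels) // groups
    return [channels[g * n + j] for j in range(n) for g in range(groups)]

def inverse_shuffle(channels, groups):
    """Inverse permutation: view as (n, groups), transpose, flatten."""
    n = len(channels) // groups
    return [channels[j * groups + g] for g in range(groups) for j in range(n)]

original = list(range(6))                  # channel indices 0..5
shuffled = shufflenet_shuffle(original, groups=2)
restored = inverse_shuffle(shuffled, groups=2)

print(shuffled)
print(restored)
```

Applied directly to interleaved channels, the inverse shuffle gathers the channels of each group into one contiguous block, which is the kind of layout that favors burst transfers between DRAM and on-chip buffers.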

[1] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.

Experiment Results

  • Mask quality
    • The experiment is done using Mask-SkyNet on the DAC-SDC dataset.
    • 84.3% of proposed regions completely cover the object, while only 1.2% are completely wrong.
    • Masked region proposals cause only a 3% to 4% IoU loss.

Mask quality analysis

IoU loss comparison

Experiment Results

  • C/RTL co-simulation results
    • Spending only 6% of the total inference time lets the network correctly distinguish between objects and background.

C/RTL co-simulation results from Vitis

Experiment Results

  • Software and hardware evaluation results
    • At the software level, we evaluate Mask-Net with three detection backbones (SkyNet, UltraNet and ResNet-18) on three datasets (DAC-SDC, OTB100 and UAV123). The weights and feature maps are 32-bit.
    • At the hardware level, we use the same datasets to test our Mask-SkyNet accelerator on a ZCU106 FPGA. The weights are 11 bits and the feature maps are 9 bits.

Software evaluation results ( table )

Hardware evaluation results ( table )

Resource utilization report ( table )

73 of the 87 DSPs added come from the new branch

Design Space Exploration

  • Reasons for DSE
    • Latency imbalance between different parts of the accelerator degrades the overall performance.
    • We want to allocate DSPs optimally to the different parts of the accelerator to balance the computation under a fixed number of DSPs.

  • DSE model
    • The model calculates the theoretical speedup after DSP redistribution under a fixed number of DSPs.

DSE model ( equations 1 to 4 )
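
The idea of redistributing a fixed DSP budget can be illustrated with a toy search. This is not the authors' DSE model: it assumes three accelerator parts, that each part's latency is its workload divided by its DSP count, and that the slowest part bounds the overall time. The workloads, base DSP counts and extra budget are made-up numbers, and `best_allocation` is a hypothetical helper.

```python
def best_allocation(work, base_dsps, extra):
    """Exhaustively split `extra` DSPs across three parts.

    Returns the allocation that minimizes the bottleneck latency and the
    resulting speedup over the base allocation (assumed latency model).
    """
    def bottleneck(dsps):
        return max(w / d for w, d in zip(work, dsps))

    baseline = bottleneck(base_dsps)
    best_time, best_dsps = baseline, tuple(base_dsps)
    for a in range(extra + 1):
        for b in range(extra - a + 1):
            c = extra - a - b
            dsps = (base_dsps[0] + a, base_dsps[1] + b, base_dsps[2] + c)
            t = bottleneck(dsps)
            if t < best_time:
                best_time, best_dsps = t, dsps
    return best_dsps, baseline / best_time

# Hypothetical workloads (arbitrary units) and DSP counts for three parts.
allocation, speedup = best_allocation(work=(600, 300, 100),
                                      base_dsps=(10, 10, 10),
                                      extra=30)
print(allocation, round(speedup, 2))
```

Under these assumptions the search gives most of the extra DSPs to the heaviest part, which is the balancing behavior the DSE is meant to capture.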

Design Space Exploration Results

DSP exploration space when extra DSP count is 1309

The relationship between the number of extra DSPs, inference time and theoretical speedup

The DSP distribution across the three parts for different numbers of total extra DSPs.

Future Research Directions

  • Adaptive threshold
    • The gate threshold in Mask-Net is currently selected empirically from experiment results.
    • An adaptive threshold may select region proposals more efficiently and further reduce the computation cost, especially with complex backgrounds.

  • Extend Mask-Net to different tasks
    • Object tracking
    • Image classification
    • Instance segmentation

Thank you!