1 of 70

S9551 | Mar 20, 2019 | 2:00 pm, RM 231

Turbo-boosting Neural Networks for Object Detection

Hongyang Li

The Chinese University of Hong Kong

2 of 70

Outline

Introduction to Object Detection

Pipeline overview
Dataset and evaluation
Popular methods
Existing problems

Solution: A Feature Intertwiner Module
Detection in Reality

Implementation on GPUs
Efficiency and accuracy tradeoff

Future of Object Detection

3 of 70

Hongyang

CUHK Ph.D. candidate /

Microsoft Intern

Research Timeline

Ph.D. student start

ImageNet Challenge (PAMI), Object Attributes (ICCV)

2015

Multi-bias Activation (ICML)

Recurrent Design for Detection (ICCV), COCO Loss (NIPS)

2016

2017

Zoom-out-and-in Network (IJCV), Capsule Nets (ECCV)

Feature Intertwiner (ICLR), Few-shot Learning (CVPR)

2018

2019

First-author Papers

4 of 70

Introduction to Object Detection

5 of 70

Object Detection: core and fundamental task in computer vision

He et al.

Mask-RCNN

ICCV 2017

Best paper

6 of 70

Object Detection is everywhere

OBJECT DETECTION

core and fundamental task in computer vision

7 of 70

How to solve it?

A naive solution: place many boxes on top of image/feature maps and classify them!

person

Not person
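
A minimal sketch of "placing many boxes": one box per scale and aspect ratio, centred on every cell of a feature map. The function name, stride, sizes and ratios are illustrative assumptions, not values from the talk.

```python
from itertools import product


def make_anchors(feat_h, feat_w, stride, sizes=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) boxes centred on every feature-map cell."""
    anchors = []
    for row, col in product(range(feat_h), range(feat_w)):
        # Centre of this cell in image coordinates.
        cx = (col + 0.5) * stride
        cy = (row + 0.5) * stride
        for size, ratio in product(sizes, ratios):
            # ratio = height / width; the area stays at size * size.
            w = size / ratio ** 0.5
            h = size * ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(feat_h=4, feat_w=6, stride=16)
print(len(anchors))  # 4 * 6 cells * 2 sizes * 3 ratios = 144
```

Each of these boxes is then classified (person / not person), which is why the number of boxes grows so quickly.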

8 of 70

How to solve it?

And yet challenges are:

person

1. Variations in shape/appearance/size (scale)

baseball

Helmet

Cotton Hat

2. Ambiguity in cluttered scenarios

9 of 70

How to solve it?

(a) Place as many anchors as possible, and

(b) make the network deeper and deeper.

(a) place anchors

(b) network design

10 of 70

Popular methods at a glance

Pipeline/system design

One-stage:

YOLO and variants

SSD and variants

Two-stage:

R-CNN family

(Fast RCNN, Faster RCNN, etc)

Component/structure/loss design

[1] Feature Pyramid Network

[2] Focal loss (RetinaNet)

[3] Online hard negative mining (OHEM)

[4] Zoom-out-and-in Network (ours)

[5] Recurrent Scale Approximation (ours)

[6] Feature Intertwiner (ours)

(a) place anchors

(b) network design

TODO: references are missing.

11 of 70

Pipeline: a roadmap of R-CNN family (two-stage detector)

P_l is the feature map output at level l; P_m is from a higher level m.

level m

level l

...

12 of 70

Pipeline: a roadmap of R-CNN family (two-stage detector)

P_l is the feature map output at level l; P_m is from a higher level m.

RoI

level m

level l

Small anchors cropped out of P_l

...

RoI output

(fixed size)

13 of 70

Pipeline: a roadmap of R-CNN family (two-stage detector)

P_l is the feature map output at level l; P_m is from a higher level m.

RoI

Person

detected!

level m

level l

...

14 of 70

Pipeline: a roadmap of R-CNN family (two-stage detector)

P_l is the feature map output at level l; P_m is from a higher level m.

RoI

Person

detected!

level m

level l

Large anchors cropped out of P_m

...

Why a fixed-size RoI output?
Why put small anchors on the low levels/stages and large ones on the higher levels?
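
One common way to send small boxes to low levels and large boxes to high levels is the assignment rule from the Feature Pyramid Network paper, k = floor(k0 + log2(sqrt(w*h) / 224)). The sketch below assumes that rule with its usual defaults (k0 = 4, levels 2 to 5); the talk does not state which rule it uses.

```python
import math


def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Map a box of size w x h (pixels) to a pyramid level."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    # Clamp to the levels that actually exist in the pyramid.
    return max(k_min, min(k_max, k))


for side in (32, 112, 224, 448, 900):
    print(side, fpn_level(side, side))
```

Halving the box side moves it one level down, so each level sees boxes of roughly the same size relative to its stride.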

15 of 70

Pipeline: a roadmap of R-CNN family (two-stage detector)

P_l is the feature map output at level l; P_m is from a higher level m.

RoI

Person

detected!

level m

level l

RPN loss

...

RPN = region proposal network

16 of 70

Side: what is RoI (region of interest) operation?

Person

detected!

RPN loss

...

RoI

Fixed size output

RoI*

*Achieved by pooling;

No learned parameters here

Many variants of RoI operations

Arbitrary size of feature map
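
A minimal sketch of a pooling-based RoI operation: a crop of arbitrary size is max-pooled into a fixed grid, with no learned parameters. This is plain adaptive max pooling on nested lists, written for illustration; variants such as RoIAlign sample with bilinear interpolation instead.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)


def roi_max_pool(feature, out_size):
    """Max-pool a 2-D list `feature` (H x W) to out_size x out_size."""
    h, w = len(feature), len(feature[0])
    pooled = []
    for i in range(out_size):
        # Output row i covers input rows [r0, r1).
        r0, r1 = (i * h) // out_size, ceil_div((i + 1) * h, out_size)
        row = []
        for j in range(out_size):
            c0, c1 = (j * w) // out_size, ceil_div((j + 1) * w, out_size)
            row.append(max(feature[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


large = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4 x 4 crop
small = [[1, 2], [3, 4]]                                   # 2 x 2 crop

print(roi_max_pool(large, 2))  # [[5, 7], [13, 15]]
print(roi_max_pool(small, 4))  # every input value is repeated in a 2 x 2 block
```

When the crop is smaller than the output grid, the operation can only repeat existing values; no new detail appears.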

17 of 70

R-CNN family (two-stage detector) vs. YOLO (one-stage detector)

RoI

...

Two stage:

R-CNN family

RPN loss

RPN: Two-class cls. problem

(object or not?)

K-class cls. problem

(dog, cat, etc)

Image size can vary

18 of 70

R-CNN family (two-stage detector) vs. YOLO (one-stage detector)

RoI

...

Multiple K-class classifiers

(dog, cat, etc)

Two stage:

R-CNN family

One stage:

YOLO/SSD

Image size can NOT vary

RPN loss

RPN: Two-class cls. problem

(object or not?)

K-class cls. problem

(dog, cat, etc)

Image size can vary

More accurate

Faster

19 of 70

Both R-CNN and SSD models have been widely adopted in academia and industry.

In this talk, we focus on the two-stage detector with RoI operation.

20 of 70

Datasets

COCO dataset

http://mscoco.org/

YouTube-8M dataset

https://research.google.com/youtube8m/

And many others

ImageNet, VisualGenome, Pascal VOC, KITTI, etc.

21 of 70

Evaluation - mean AP

prediction

Ground truth

If IoU (intersection / union)

= 0.65 > threshold,

Then current prediction is counted as Correct

For category person,

Get a set of Correct/incorrect predictions, compute the precision/recall.

Get the average precision (AP) from the precision/recall figure.

Done.

Average the AP over all categories:

that’s mAP (under the given IoU threshold).
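
The steps above can be sketched as follows. The matching of predictions to ground-truth boxes is assumed to be done already, and the AP here is the plain (non-interpolated) area under the precision/recall curve; benchmark toolkits such as COCO's use an interpolated variant, so numbers will differ slightly.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def average_precision(scored_hits, num_gt):
    """scored_hits: (confidence, is_correct) per prediction of one category."""
    hits = sorted(scored_hits, key=lambda p: p[0], reverse=True)
    tp, ap = 0, 0.0
    for rank, (_, correct) in enumerate(hits, start=1):
        if correct:
            tp += 1
            ap += tp / rank  # precision at this recall step
    return ap / num_gt if num_gt else 0.0


def mean_ap(ap_per_category):
    """mAP is the mean of the per-category APs."""
    return sum(ap_per_category) / len(ap_per_category)


print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # overlap 50, union 150
print(average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2))
```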

22 of 70

What is “uncomfortable” in current pipelines?

Assume RoI’s output is 20

RoI input 40 → 20

RoI input 7 → 20

Inaccurate features due to up-sampling!

Accurate features in down-sampling!

Large objects

Small objects

23 of 70

What percentage of objects suffer from this?

Table 3 in our paper.

Proposal assignment on each level before the RoI operation. ‘below #’ is the number of proposals whose size is below the RoI output size.

We define the small set as the anchors on the current level and the large set as all anchors above the current level.

24 of 70

2. Solution: A Feature Intertwiner Module

25 of 70

Our assumption

Visual feature

Semantic feature

The semantic features among instances (large or small) within the same class should be the same.

same!!!

26 of 70

Our motivation

Inaccurate maps/features

Intuition: let reliable features supervise/guide the learning of the less reliable ones.

Naive feature intertwiner concept:

Suppose we have two sets of features already -

one is from large objects and the other is from small ones.

27 of 70

The Feature Intertwiner

For current level l

Cls. loss

Reg. loss (bbox)

Make-up layer:

restores the information lost during the RoI operation and compensates for the details that small instances need.

(one conv. layer)

For small objects

28 of 70

The Feature Intertwiner

For current level l

Cls. loss

Reg. loss (bbox)

Intertwiner

Loss (e.g., L2 loss)

Input to Intertwiner

Critic layer:

transfer features to a larger channel size and reduce spatial size to one. (two conv. layers)

For large objects

29 of 70

The Feature Intertwiner

For current level l

Cls. loss

Reg. loss (bbox)

Intertwiner

loss

Input to Intertwiner

Total loss = (Intertwiner+cls.+reg.) for all levels
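
A minimal sketch of how the losses on this slide combine, assuming a mean-squared (L2) intertwiner loss and an unweighted sum over levels. The feature vectors, dictionary keys and numbers are hypothetical; any loss weighting used in the paper is not shown on the slide.

```python
def l2_loss(small_feat, large_feat):
    """Mean squared difference between two feature vectors."""
    return sum((s - t) ** 2 for s, t in zip(small_feat, large_feat)) / len(small_feat)


def total_loss(per_level):
    """Sum (intertwiner + cls. + reg.) over all pyramid levels."""
    return sum(lv["inter"] + lv["cls"] + lv["reg"] for lv in per_level)


# Small-object feature vs. the reliable feature from large objects.
inter = l2_loss([1.0, 2.0], [1.0, 4.0])

levels = [
    {"inter": inter, "cls": 0.5, "reg": 0.25},
    {"inter": 0.0, "cls": 1.0, "reg": 0.25},
]
print(inter, total_loss(levels))
```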

30 of 70

The Feature Intertwiner

Anchors are placed at various levels.

What if there are no large instances in this mini-batch,

for the current level?

We define the small set as the anchors on the current level and the large set as all anchors above the current level.

31 of 70

The Feature Intertwiner - class buffer

For level l

For all levels

We use a class buffer to store the accurate feature set from large instances.

How to generate the buffer?

One simple idea is to

Take the average of features of all large objects during training.

Feature

Intertwiner

Level 2

Level 3

...

Historical logger

Inter. loss
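
A minimal sketch of the "take the average" idea: a per-class running mean of the features of large objects, updated during training and read back as the target for small objects. The class and method names are hypothetical; features are plain lists here.

```python
class ClassBuffer:
    """Running average of large-object features, one vector per class."""

    def __init__(self, num_classes, dim):
        self.mean = [[0.0] * dim for _ in range(num_classes)]
        self.count = [0] * num_classes

    def update(self, cls_id, feat):
        """Fold one large-object feature into the running mean of its class."""
        self.count[cls_id] += 1
        n = self.count[cls_id]
        self.mean[cls_id] = [m + (f - m) / n
                             for m, f in zip(self.mean[cls_id], feat)]

    def target(self, cls_id):
        """Return a copy, so the target is a constant for the loss."""
        return list(self.mean[cls_id])


buf = ClassBuffer(num_classes=2, dim=2)
buf.update(0, [2.0, 4.0])
buf.update(0, [4.0, 8.0])
print(buf.target(0))  # [3.0, 6.0]
```

Because the buffer persists across mini-batches, a target exists even when the current batch contains no large instance of a class.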

32 of 70

Discussions on Feature Intertwiner

The intertwiner is proposed to optimize feature learning for the less reliable set. At test time, the green part is removed.
It can be seen as teacher-student guidance in the self-supervised domain.
Detaching the gradient update in the buffer gives better results: “soft targets”, similar to replay memory in RL.
The buffer is level-agnostic. Improvements are observed over all levels/sizes of objects.

Historical logger

Inter. loss

For inference

33 of 70

The Feature Intertwiner - choosing optimal feature maps

For level l

For all levels

How to choose the appropriate maps for large objects as input to the intertwiner?

One simple solution is to

(a) Use the feature map directly on current level.

This is inappropriate.

why?

Inter. loss

We define the small set as the anchors on the current level and the large set as all anchors above the current level.

34 of 70

The Feature Intertwiner - choosing optimal feature maps

How to choose the appropriate maps for large objects as input to the intertwiner?

Other options are

(b) Use the feature maps on a higher level.

We will empirically analyze these later.

35 of 70

The Feature Intertwiner - choosing optimal feature maps

How to choose the appropriate maps for large objects as input to the intertwiner?

Our final option is based on (c):

(d) build a better alignment between the upsampled feature map and the current map.

36 of 70

The Feature Intertwiner - choosing optimal feature maps

How to choose the appropriate maps for large objects as input to the intertwiner?

The approach is

Optimal transport (OT).

In a nutshell, OT optimally moves one distribution (P_m|l) onto the other (P_l).

Our final option is based on (c):

(d) build a better alignment between the upsampled feature map and the current map.

37 of 70

The Feature Intertwiner - choosing optimal feature maps

How to choose the appropriate maps for large objects as input to the intertwiner?

The approach is

Optimal transport (OT).

In a nutshell, OT optimally moves one distribution (P_m|l) onto the other (P_l).

Q is a cost matrix (distance)

P is a proxy matrix satisfying some constraint.

Our final option is based on (c):

(d) build a better alignment between the upsampled feature map and the current map.

38 of 70

The Feature Intertwiner - choosing optimal feature maps

How to choose the appropriate maps for large objects as input to the intertwiner?

How to compute

Optimal transport (OT).

Cost matrix

Sinkhorn iterate

OT loss

39 of 70

The Feature Intertwiner - choosing optimal feature maps

How to choose the appropriate maps for large objects as input to the intertwiner?

How to compute

Optimal transport (OT).

Components

Cost matrix

Sinkhorn iterate

OT loss

H->Q
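
The three components (cost matrix, Sinkhorn iterate, OT loss) can be sketched with the standard entropy-regularised Sinkhorn algorithm. This is a generic textbook version on two small histograms, not the talk's implementation; `eps`, `n_iters` and the toy inputs are assumptions, and Q and P follow the slide's naming for the cost matrix and the transport plan.

```python
import math


def sinkhorn(Q, a, b, eps=0.5, n_iters=200):
    """Approximate OT between histograms a and b under cost matrix Q."""
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-Q[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P and the resulting OT loss <P, Q>.
    P = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    loss = sum(P[i][j] * Q[i][j] for i in range(n) for j in range(m))
    return P, loss


Q = [[0.0, 1.0], [1.0, 0.0]]  # moving mass between the two bins costs 1
P, loss = sinkhorn(Q, a=[0.7, 0.3], b=[0.4, 0.6])
print([round(sum(row), 6) for row in P])  # row sums recover a
print(loss)
```

The plan P has row sums a and column sums b, which is the kind of constraint the slide refers to; a smaller `eps` brings the loss closer to the unregularised OT cost at the price of slower convergence.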

40 of 70

The Feature Intertwiner - choosing optimal feature maps

Why is optimal transport (OT) better than the others?

Hence, the final loss:

OT metric converges while other variants (KL or JS) don’t
Provides sensible cost functions when learning distributions supported by low-dim manifolds (p_l and p_m|l)

41 of 70

Summary of our method

42 of 70

Experiments

43 of 70

Setup

Evaluate our algorithm on the COCO dataset
Train set: trainval-35k; test set: minival
Network structure: ResNet-50 or ResNet-101 with FPN
Based on the Mask-RCNN framework without the segmentation branch
Evaluation metric: mean AP under different thresholds and object sizes

The remaining details are given in Sec. 6.5 of the paper.

44 of 70

Ablation on module design

Table 2 in the paper

gray background is the chosen default

Different anchor

placements

Observation #1:

Feature Intertwiner Module is better than baseline.

~2% mAP increase

Large objects also improve.

Why?

Does the intertwiner module work better?

45 of 70

Ablation on module design

Table 2 in the paper

gray background is the chosen default

Observation #2:

By optimizing the make-up layer, the linearly combined features further boost performance.

How does the intertwiner module affect feature learning?

Gradient flow

46 of 70

Ablation on module design

Table 2 in the paper

gray background is the chosen default

Observation #3:

Recording the full history of the large/reliable set achieves better results (and saves memory); one unified buffer is enough.

Does buffer size matter? Unified or level-based buffer?

How to design the buffer?

47 of 70

Ablation on OT unit

Table 1 in the paper

Different input sources for the reliable set

48 of 70

Visualization on samples within class

w/o intertwiner

with intertwiner

49 of 70

Comparison with the state of the art (I)

The most distinctive improvements are

Microwave, truck, cow, car, zebra

Zoom in

50 of 70

Comparison with the state of the art (I)

Figure 4 in the paper

Improvement per category after embedding the feature intertwiner

32.8 (baseline) vs 35.2 (ours)

Most small-sized objects get improved!

51 of 70

Comparison with the state of the art (I)

Dropped!

Some categories see a drop in performance

Couch, baseball bat, broccoli

Couch

The feature set of large couches is less accurate due to noise (from other classes).

52 of 70

Comparison with the state of the art (II)

Fast-RCNN

variants

36.8

44.2

Same backbone

39.1

SSD

33.2

Proposed

Table 4 in the paper

Single-model performance (bounding box AP)

53 of 70

Comparison with the state of the art (III)

54 of 70

This work is published at ICLR 2019

Paper:

https://openreview.net/forum?id=SyxZJn05YX

Code:

https://github.com/hli2020/feature_intertwiner

Check out our poster at GTC!

P9108

AI/Deep Learning Research

Near the gear store

55 of 70

3. Detection in Reality

56 of 70

Practical issues on multi-GPUs

Batch normalization

Standard implementations of BN in public frameworks (such as Caffe, MXNet, Torch, TF, PyTorch) are unsynchronized, which means that the data are normalized within each GPU.

https://hangzhang.org/PyTorch-Encoding/notes/syncbn.html

Synchronized BN
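
A toy illustration of the difference, assuming one channel whose activations are split across two GPUs. Unsynchronized BN normalizes each shard with its own mean/variance; synchronized BN first combines the per-device sums so that every device uses the global statistics. Function names and numbers are illustrative only.

```python
def mean_var(xs):
    """Mean and (biased) variance of one device's activations."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)


def synced_mean_var(shards):
    """Global statistics from per-device counts, sums and sums of squares."""
    n = sum(len(s) for s in shards)
    total = sum(sum(s) for s in shards)
    total_sq = sum(sum(x * x for x in s) for s in shards)
    mean = total / n
    return mean, total_sq / n - mean ** 2


shards = [[1.0, 3.0], [5.0, 7.0]]  # one channel on two GPUs

print([mean_var(s) for s in shards])  # per-GPU: (2.0, 1.0) and (6.0, 1.0)
print(synced_mean_var(shards))        # global:  (4.0, 5.0)
```

Only three numbers per channel (count, sum, sum of squares) need to be exchanged between devices to obtain the global statistics.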

57 of 70

Practical issues on multi-GPUs

Batch normalization

Does it matter? As long as the batch size (bs) on each GPU is not too small, unsynchronized BN is OK.

Note that bs in the “deeper” part is the # of RoIs/boxes on each card;

batch size in the backbone is the # of images!

Another rule of thumb: fix (freeze) BN in the backbone when fine-tuning the network on your task.

58 of 70

Practical issues on multi-GPUs

2. Wrap the loss computation into forward() on each card

Otherwise GPU 0 would take too much memory in some cases, causing memory imbalance and decreasing the utilization of the other GPUs.

loss

59 of 70

Practical issues on multi-GPUs

3. Different images must have the same size of targets as input

4. What if GPU utilization is low?

The dataloader is slow
Move operations to Tensors
…
Or change to another workstation
(utilization is often low during inference)

60 of 70

Trade-off between accuracy and efficiency

Additional model capacity increase in our method:

Critic/make-up layers
Buffer
OT module

But these new components add only a lightweight overhead.

FPN

SSD

Better

area

61 of 70

Trade-off between accuracy and efficiency

More facts:

Training: 8 GPUs, batch size=8, 3.4 days

Mem cost 9.6G/gpu, baseline 8.3G

Test (input 800 on Titan X):

325 ms/image, baseline 308 ms/image

FPN

SSD

Better

area

Mask-RCNN (39.2)

InterNet (42.5)
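
As a quick check of how light the overhead is, the relative increases can be computed from the memory and latency figures on this slide (the helper name is hypothetical):

```python
def relative_increase(ours, baseline):
    """Percentage increase of `ours` over `baseline`."""
    return 100.0 * (ours - baseline) / baseline


mem = relative_increase(9.6, 8.3)      # GB per GPU during training
latency = relative_increase(325, 308)  # ms per image at test time

print(f"memory: +{mem:.1f}%, latency: +{latency:.1f}%")
# memory: +15.7%, latency: +5.5%
```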

62 of 70

4. Future of Object Detection

63 of 70

Any alternatives to abandon the current anchor-based pipeline?

Idea:

Current solutions are all based on anchors (one-stage or two-stage).

Is a bounding box really accurate enough to detect all objects?

How about detecting objects with bottom-up approaches, like pixel-wise segmentation? In this way, we can work around the box detection pipeline.

Densely cluttered persons

64 of 70

Any alternatives to abandon the current anchor-based pipeline?

Some inspiration for addressing detection with a bottom-up approach:

CornerNet: Detecting Objects as Paired Keypoints, arXiv 1808.01244

Predict the top-left and bottom-right corners in the image

to generate the bounding boxes!

Extension: learn some skeleton pattern for each category as priors.

65 of 70

Any alternatives to abandon the current anchor-based pipeline?

Idea 2:

Solve detection via 3D structure of the world (not only with aid of context)

66 of 70

Take-away Messages

Object detection is a basic, core task underlying other high-level vision problems.
Feature engine (backbone) and detector design (domain knowledge) are important.
Beyond current pipeline (dense anchors):

solve detection via bottom-up approaches or 3D structure of objects.

4. Beyond detection only - one model to learn them all:

Multi-task: detection, segmentation, pose estimation, captioning,

Zero-shot detection, curriculum learning, ...

67 of 70

Thank you! Questions?

Collaborators:

Yu Liu

Bo Dai

Xiaoyang

Shaoshuai

Wanli

Xiaogang

Email: yangli@ee.cuhk.edu.hk

Slides at: http://www.ee.cuhk.edu.hk/~yangli/ twitter @francislee2020

69 of 70

Hongyang

CUHK Ph.D. candidate /

Microsoft Intern

Ph.D. student start

ImageNet Challenge (PAMI), Object Attributes (ICCV)

Multi-bias Activation (ICML)

Recurrent Design for Detection (ICCV), COCO Loss (NIPS)

Zoom-out-and-in Network (IJCV), Capsule Nets (ECCV)

Feature Intertwiner (ICLR), Few-shot Learning (CVPR)

2015

Research Timeline at The Chinese Univ. of Hong Kong

2015

2016

2017

2018

2019

70 of 70

Outline

Introduction to Object Detection

Pipeline overview
Dataset and evaluation
Popular methods
Existing problems

Solution: A Feature Intertwiner Module
Detection in Reality

Implementation on GPUs
Efficiency and accuracy tradeoff

Future of Object Detection

… 15 min

… 20 min

… 5 min

… 10 min

Core content

Core discussion
