S9551 | Mar 20, 2019 | 2:00 pm, RM 231
Turbo-boosting Neural Networks for Object Detection
Hongyang Li
The Chinese University of Hong Kong
Outline
Hongyang
CUHK Ph.D. candidate /
Microsoft Intern
Research Timeline
Ph.D. student start
ImageNet Challenge (PAMI), Object Attributes (ICCV)
2015
2015
Multi-bias Activation (ICML)
Recurrent Design for Detection (ICCV), COCO Loss (NIPS)
2016
2017
Zoom-out-and-in Network (IJCV), Capsule Nets (ECCV)
Feature Intertwiner (ICLR), Few-shot Learning (CVPR)
2018
2019
First-author Papers
Object Detection: core and fundamental task in computer vision
He et al.
Mask-RCNN
ICCV 2017
Best paper
Object Detection is everywhere
OBJECT DETECTION
core and fundamental task in computer vision
How to solve it?
A naive solution: place many boxes on top of image/feature maps and classify them!
person
Not person
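The naive solution above can be sketched as follows. This is a minimal illustration, not the exact scheme of any particular detector: the function name `generate_anchors`, the stride, and the chosen sizes and aspect ratios are all assumptions made for the example.

```python
from itertools import product


def generate_anchors(feat_h, feat_w, stride, sizes, ratios):
    """Return (x1, y1, x2, y2) boxes centred on every feature-map cell."""
    anchors = []
    for row, col in product(range(feat_h), range(feat_w)):
        # Centre of this cell in image coordinates.
        cx = (col + 0.5) * stride
        cy = (row + 0.5) * stride
        for size, ratio in product(sizes, ratios):
            # ratio = height / width; the box area stays size * size.
            w = size / ratio ** 0.5
            h = size * ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# Hypothetical 2 x 3 feature map with stride 16.
anchors = generate_anchors(feat_h=2, feat_w=3, stride=16,
                           sizes=[32, 64], ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 6 cells * 2 sizes * 3 ratios
```

Each anchor would then be passed to a classifier that answers "person" or "not person".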
How to solve it?
And yet the challenges are:
person
baseball
Helmet
Cotton Hat
2. Ambiguity in cluttered scenarios
How to solve it?
(a) Place as many anchors as possible, and
(b) make the network deeper and deeper.
(a) place anchors
(b) network design
Popular methods at a glance
Pipeline/system design
One-stage:
YOLO and variants
SSD and variants
Two-stage:
R-CNN family
(Fast R-CNN, Faster R-CNN, etc.)
Component/structure/loss design
[1] Feature Pyramid Network
[2] Focal loss (RetinaNet)
[3] Online hard example mining (OHEM)
[4] Zoom-out-and-in Network (ours)
[5] Recurrent Scale Approximation (ours)
[6] Feature Intertwiner (ours)
(a) place anchors
(b) network design
TODO: References missed.
Pipeline: a roadmap of R-CNN family (two-stage detector)
P_l is the feature map output at level l; P_m is from a higher level m.
level m
level l
...
RoI
level m
level l
Small anchors cropped out of P_l
...
RoI output
(fixed size)
RoI
Person
detected!
level m
level l
...
RoI
RoI
Person
detected!
level m
level l
Large anchors cropped out of P_m
...
RoI
RoI
Person
detected!
level m
level l
RPN loss
RPN loss
...
RPN = region proposal network
Aside: what is the RoI (region of interest) operation?
Person
detected!
RPN loss
RPN loss
...
RoI
RoI
Fixed size output
RoI*
*Achieved by pooling;
No learned parameters here
Many variants of RoI operations
Arbitrary size of feature map
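A minimal sketch of one RoI variant (max pooling) shows how a region of arbitrary size becomes a fixed-size output with no learned parameters. The helper names `_bin` and `roi_max_pool` and the binning rule are assumptions for illustration; real implementations (RoI pooling, RoI align, etc.) differ in how they place and sample bins.

```python
def _bin(start, length, idx, n_bins):
    """Half-open cell range [lo, hi) covered by output bin `idx`."""
    lo = start + (idx * length) // n_bins
    hi = start + -(-((idx + 1) * length) // n_bins)  # ceiling division
    return lo, max(hi, lo + 1)  # every bin covers at least one cell


def roi_max_pool(feature, roi, out_size):
    """Max-pool roi=(r0, c0, r1, c1), end-exclusive, of a 2-D list
    into an out_size x out_size grid."""
    r0, c0, r1, c1 = roi
    pooled = []
    for i in range(out_size):
        rs, re = _bin(r0, r1 - r0, i, out_size)
        row = []
        for j in range(out_size):
            cs, ce = _bin(c0, c1 - c0, j, out_size)
            row.append(max(feature[r][c]
                           for r in range(rs, re) for c in range(cs, ce)))
        pooled.append(row)
    return pooled


feature = [[r * 4 + c for c in range(4)] for r in range(4)]
large = roi_max_pool(feature, (0, 0, 4, 4), out_size=2)  # 4x4 -> 2x2
small = roi_max_pool(feature, (1, 1, 2, 2), out_size=2)  # 1x1 -> 2x2
print(large)  # [[5, 7], [13, 15]]
print(small)  # [[5, 5], [5, 5]]
```

In this sketch, a region smaller than the output simply repeats its cells, so the output carries no more detail than the input did.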
R-CNN family (two-stage detector) vs. YOLO (one-stage detector)
RoI
RoI
...
Two stage:
R-CNN family
RPN loss
RPN loss
RPN: Two-class cls. problem
(object or not?)
K-class cls. problem
(dog, cat, etc)
Image size can vary
R-CNN family (two-stage detector) vs. YOLO (one-stage detector)
RoI
RoI
...
...
Multiple K-class classifiers
(dog, cat, etc)
Two stage:
R-CNN family
One stage:
YOLO/SSD
Image size can NOT vary
RPN loss
RPN loss
RPN: Two-class cls. problem
(object or not?)
K-class cls. problem
(dog, cat, etc)
Image size can vary
More accurate
Faster
Both R-CNN and SSD models have been widely adopted in academia and industry.
In this talk, we focus on the two-stage detector with RoI operation.
Datasets
And many others
ImageNet, VisualGenome, Pascal VOC, KITTI, etc.
Evaluation - mean AP
prediction
Ground truth
If IoU (intersection over union) = 0.65 > threshold,
then the current prediction is counted as correct.
For the category person:
Collect the set of correct/incorrect predictions and compute the precision/recall.
Get the average precision (AP) from the precision/recall curve.
Done.
Average the AP over all categories:
that's mAP (at the given IoU threshold).
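The evaluation steps above can be sketched in a few lines. This is a simplified illustration: `iou` and `average_precision` are hypothetical helper names, and the AP here is the plain (non-interpolated) average of precision at each correct prediction. Benchmarks such as Pascal VOC and COCO use their own interpolated variants.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def average_precision(scored_hits, num_gt):
    """scored_hits: (confidence, is_correct) pairs for one category.
    num_gt: number of ground-truth boxes of that category."""
    ranked = sorted(scored_hits, key=lambda p: p[0], reverse=True)
    true_pos, ap = 0, 0.0
    for rank, (_, correct) in enumerate(ranked, start=1):
        if correct:
            true_pos += 1
            ap += true_pos / rank  # precision at this recall step
    return ap / num_gt if num_gt else 0.0


def mean_ap(per_class_ap):
    """mAP is the mean of the per-category APs."""
    return sum(per_class_ap) / len(per_class_ap)


overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
ap_person = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```

A prediction is marked correct only when its IoU with a ground-truth box exceeds the chosen threshold, so the same detections yield a different mAP at each threshold.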
What is “uncomfortable” in current pipelines?
Assume the RoI output size is 20.
Large objects: RoI input 40 → 20. Accurate features thanks to down-sampling!
Small objects: RoI input 7 → 20. Inaccurate features due to up-sampling!
What percentage of objects suffer from this?
Table 3 in our paper.
Proposal assignment on each level before the RoI operation. ‘below #’ indicates how many proposals are smaller than the RoI output size.
We define the small set as the anchors on the current level and the large set as all anchors above the current level.
2. Solution: A Feature Intertwiner Module
Our assumption
Visual feature
Semantic feature
The semantic features among instances (large or small) within the same class should be the same.
same!!!
Our motivation
Inaccurate maps/features
Intuition: let reliable features supervise/guide the learning of the less reliable ones.
Naive feature intertwiner concept:
Suppose we have two sets of features already -
one is from large objects and the other is from small ones.
The Feature Intertwiner
For current level l
Cls. loss
Reg. loss (bbox)
Make-up layer:
restores the information lost during the RoI operation and compensates for the missing details of small instances.
(one conv. layer)
For small objects
The Feature Intertwiner
For current level l
Cls. loss
Reg. loss (bbox)
Intertwiner
Loss (e.g., L2 loss)
Input to Intertwiner
Critic layer:
transfer features to a larger channel size and reduce spatial size to one. (two conv. layers)
For large objects
The Feature Intertwiner
For current level l
Cls. loss
Reg. loss (bbox)
Intertwiner
loss
Input to Intertwiner
Total loss = (Intertwiner+cls.+reg.) for all levels
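A minimal sketch of how the total loss is assembled, assuming the L2 form of the intertwiner loss mentioned on the slide. The function names, the dictionary layout, the averaging over instances, and the equal weighting of the three terms are all assumptions; the paper may weight or normalize them differently.

```python
def l2_distance(feat, target):
    """Squared L2 distance between two equal-length feature vectors."""
    return sum((f - t) ** 2 for f, t in zip(feat, target))


def intertwiner_loss(small_feats, class_buffer):
    """Mean distance between each small-instance feature and the
    reference (large-instance) feature of its class."""
    terms = [l2_distance(feat, class_buffer[label])
             for label, feat in small_feats if label in class_buffer]
    return sum(terms) / len(terms) if terms else 0.0


def total_loss(levels, class_buffer):
    """Sum (intertwiner + cls + reg) over all pyramid levels."""
    return sum(
        intertwiner_loss(lv["small_feats"], class_buffer) + lv["cls"] + lv["reg"]
        for lv in levels
    )


# Hypothetical values: one reference feature and two pyramid levels.
class_buffer = {"person": [1.0, 0.0]}
levels = [
    {"cls": 0.5, "reg": 0.2, "small_feats": [("person", [0.0, 0.0])]},
    {"cls": 0.3, "reg": 0.1, "small_feats": []},
]
loss = total_loss(levels, class_buffer)
```

In this sketch a level with no small instances contributes only its classification and regression losses.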
The Feature Intertwiner
Anchors are placed at various levels.
What if there are no large instances in this mini-batch for the current level?
The Feature Intertwiner - class buffer
For level l
For all levels
We use a class buffer to store the accurate feature set from large instances.
How to generate the buffer?
One simple idea is to take the average of the features of all large objects during training.
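The averaging idea can be sketched as a per-class running mean, which avoids storing every feature seen so far. The class name `ClassBuffer` and its methods are hypothetical; this illustrates the idea and is not the implementation used in the paper.

```python
class ClassBuffer:
    """Keeps a running mean of large-instance features per class."""

    def __init__(self):
        self.mean = {}   # label -> mean feature vector
        self.count = {}  # label -> number of features seen

    def update(self, label, feat):
        n = self.count.get(label, 0) + 1
        prev = self.mean.get(label, [0.0] * len(feat))
        # Incremental mean: m_n = m_{n-1} + (x_n - m_{n-1}) / n
        self.mean[label] = [m + (f - m) / n for m, f in zip(prev, feat)]
        self.count[label] = n

    def get(self, label):
        """Return the stored mean, or None if the class was never seen."""
        return self.mean.get(label)


buffer = ClassBuffer()
buffer.update("person", [1.0, 3.0])
buffer.update("person", [3.0, 5.0])
print(buffer.get("person"))  # [2.0, 4.0]
```

Because the buffer persists across mini-batches, a small instance can still be compared against its class reference when the current batch contains no large instance of that class.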
Feature
Intertwiner
Level 2
Level 3
...
Historical logger
Inter. loss
Discussions on Feature Intertwiner
Historical logger
Inter. loss
For inference
The Feature Intertwiner - choosing optimal feature maps
For level l
For all levels
How to choose the appropriate maps for large objects as input to the intertwiner?
One simple solution is to
(a) Use the feature map directly on current level.
This is inappropriate.
why?
Inter. loss
The Feature Intertwiner - choosing optimal feature maps
How to choose the appropriate maps for large objects as input to the intertwiner?
Other options are
(b) use the feature maps on higher level.
(c) upsample higher-level maps to current level, with learnable parameters (or not).
We will empirically analyze these later.
The Feature Intertwiner - choosing optimal feature maps
How to choose the appropriate maps for large objects as input to the intertwiner?
Our final option is based on (c)
(d) Build a better alignment between the upsampled feature map and the current map.
The Feature Intertwiner - choosing optimal feature maps
How to choose the appropriate maps for large objects as input to the intertwiner?
The approach is optimal transport (OT).
In a nutshell, OT optimally moves one distribution (P_m|l) to the other (P_l).
Q is a cost matrix (distance).
P is a proxy matrix satisfying some constraints.
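To make the roles of Q and P concrete, here is a small sketch of entropy-regularized OT solved with Sinkhorn iterations. It is a generic textbook version, not the paper's OT unit: uniform marginals, the regularization strength `reg`, and the iteration count are assumptions, and the function name `sinkhorn` is hypothetical.

```python
import math


def sinkhorn(Q, reg=0.1, n_iters=200):
    """Return the transport plan P and the OT loss <P, Q> for cost matrix Q."""
    n, m = len(Q), len(Q[0])
    a = [1.0 / n] * n  # source marginal (assumed uniform)
    b = [1.0 / m] * m  # target marginal (assumed uniform)
    K = [[math.exp(-q / reg) for q in row] for row in Q]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    P = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    ot_loss = sum(P[i][j] * Q[i][j] for i in range(n) for j in range(m))
    return P, ot_loss


# Toy cost: moving mass to the matching index is free, otherwise costs 1.
Q = [[0.0, 1.0],
     [1.0, 0.0]]
P, ot_loss = sinkhorn(Q)
```

The constraint on P in this version is that its rows and columns sum to the two marginals. For the toy cost above, almost all mass stays on the diagonal, so the OT loss is close to zero.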
The Feature Intertwiner - choosing optimal feature maps
How to choose the appropriate maps for large objects as input to the intertwiner?
How to compute optimal transport (OT):
[Figure: OT computation diagram. Labels: P_m, components F and H, cost matrix Q (H -> Q), Sinkhorn iterations, P, OT loss]
The Feature Intertwiner - choosing optimal feature maps
Why is optimal transport (OT) better than the others?
Hence, the final loss:
Summary of our method
Experiments
Setup
The remaining details are given in Sec. 6.5 of the paper.
Ablation on module design
Table 2 in the paper
gray background is the chosen default
Different anchor
placements
Observation #1:
Feature Intertwiner Module is better than baseline.
~2% mAP increase
Large objects also improve.
Why?
Does the intertwiner module work better?
Ablation on module design
Table 2 in the paper
gray background is the chosen default
Observation #2:
By optimizing the make-up layer, the linearly combined features further boost performance.
How does the intertwiner module affect feature learning?
Gradient flow
Ablation on module design
Table 2 in the paper
gray background is the chosen default
Observation #3:
Recording the full history of the large/reliable set achieves better results (and saves memory); one unified buffer is enough.
Does buffer size matter? Unified or level-based buffer?
How to design the buffer?
Ablation on OT unit
Table 1 in the paper
Different input sources for the reliable set
Visualization on samples within class
w/o intertwiner
with intertwiner
Comparison with the state of the art (I)
The most distinctive improvements are
Microwave, truck, cow, car, zebra
Zoom in
Comparison with the state of the art (I)
Figure 4 in the paper
Improvement per category after embedding the feature intertwiner
32.8 (baseline) vs 35.2 (ours)
Most small-sized objects get improved!
Comparison with the state of the art (I)
Dropped!
Some categories see a drop in performance
Couch, baseball bat, broccoli
Couch
The feature set for large couches is less accurate due to noise (from other classes).
Comparison with the state of the art (II)
Fast-RCNN
variants
36.8
44.2
Same backbone
39.1
SSD
33.2
Proposed
Table 4 in the paper
Single-model performance (bounding box AP)
Comparison with the state of the art (III)
This work is published at ICLR 2019
Check out our poster at GTC!
P9108
AI/Deep Learning Research
Near the gear store
3. Detection in Reality
Practical issues on multi-GPUs
Standard implementations of BN in public frameworks (such as Caffe, MXNet, Torch, TF, PyTorch) are unsynchronized, which means that the data are normalized within each GPU.
Synchronized BN
Practical issues on multi-GPUs
Does it matter? As long as the batch size (bs) on each GPU is not too small, unsynchronized BN is OK.
Note that bs in the “deeper” part is the # of RoIs/boxes on each card;
the batch size in the backbone is the # of images!
Another rule of thumb: freeze BN in the backbone when fine-tuning the network on your task.
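The difference between unsynchronized and synchronized BN statistics can be illustrated without any framework. In this sketch each inner list stands for the activations of one channel on one GPU. The helper names are hypothetical, and real synchronized BN exchanges these sums across devices with collective communication rather than Python loops.

```python
def mean_var(xs):
    """Per-device statistics, as used by unsynchronized BN."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)


def synced_mean_var(shards):
    """Global statistics built from per-device count, sum and sum of squares."""
    n = sum(len(s) for s in shards)
    total = sum(sum(s) for s in shards)
    total_sq = sum(sum(x * x for x in s) for s in shards)
    mean = total / n
    return mean, total_sq / n - mean ** 2  # Var[x] = E[x^2] - E[x]^2


# Two hypothetical GPUs holding two samples each.
shards = [[1.0, 2.0], [5.0, 6.0]]
per_gpu = [mean_var(s) for s in shards]
global_stats = synced_mean_var(shards)
print(per_gpu)       # [(1.5, 0.25), (5.5, 0.25)]
print(global_stats)  # (3.5, 4.25)
```

With only a few samples per card, the per-GPU statistics can be far from the global ones, which is why the per-card batch size matters.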
Practical issues on multi-GPUs
2. Wrap the loss computation into forward() on each card
Otherwise GPU 0 would take too much memory in some cases, causing memory imbalance and decreasing the utilization of the other GPUs.
Practical issues on multi-GPUs
3. Different images must have the same size of targets as input
4. What if the utilization of the GPUs is low?
Trade-off between accuracy and efficiency
Additional model capacity increase in our method:
But these new designs add only a light-weight overhead.
FPN
SSD
Better
area
Trade-off between accuracy and efficiency
More facts:
Training: 8 GPUs, batch size=8, 3.4 days
Mem cost 9.6G/gpu, baseline 8.3G
Test (input 800 on Titan X):
325 ms/image, baseline 308 ms/image
FPN
SSD
Better
area
Mask-RCNN (39.2)
InterNet (42.5)
4. Future of Object Detection
Any alternatives to abandon the current anchor-based pipeline?
Idea:
Current solutions are all based on anchors (one-stage or two-stage).
Is a bounding box really accurate enough to detect all objects?
How about detecting objects with bottom-up approaches, like pixel-wise segmentation? This way, we can work around the box detection pipeline.
Densely cluttered persons
Any alternatives to abandon the current anchor-based pipeline?
Some inspiration for using a bottom-up approach to address detection:
CornerNet: Detecting Objects as Paired Keypoints, arXiv 1808.01244
Predict the top-left and bottom-right corners in the image
to generate the bounding boxes!
Extension: learn some skeleton pattern for each category as priors.
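The corner-based idea above can be sketched as a grouping step: given predicted top-left and bottom-right corners, each with a scalar embedding, pair the corners whose embeddings are close and whose geometry is valid. This is a simplified illustration only. The function name `pair_corners`, the tuple layout, and the distance threshold are assumptions, and the actual CornerNet pipeline also involves heatmaps, offsets, and scoring.

```python
def pair_corners(top_lefts, bottom_rights, max_embed_dist=0.5):
    """Each corner is (x, y, embedding). Returns (x1, y1, x2, y2) boxes."""
    boxes = []
    for x1, y1, e1 in top_lefts:
        for x2, y2, e2 in bottom_rights:
            # The bottom-right corner must lie below and to the right,
            # and the two embeddings must be close to each other.
            if x2 > x1 and y2 > y1 and abs(e1 - e2) <= max_embed_dist:
                boxes.append((x1, y1, x2, y2))
    return boxes


# Hypothetical corners for two objects with distinct embeddings.
top_lefts = [(10, 10, 0.1), (50, 40, 2.0)]
bottom_rights = [(30, 60, 0.2), (90, 80, 2.1)]
boxes = pair_corners(top_lefts, bottom_rights)
print(boxes)  # [(10, 10, 30, 60), (50, 40, 90, 80)]
```

No anchors are involved: the boxes are assembled bottom-up from keypoints.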
Any alternatives to abandon the current anchor-based pipeline?
Idea 2:
Solve detection via the 3D structure of the world (not only with the aid of context)
Take-away Messages
solve detection via bottom-up approaches or 3D structure of objects.
4. Beyond detection only - one model to learn them all:
Multi-task: detection, segmentation, pose estimation, captioning,
Zero-shot detection, curriculum learning, ...
Thank you! Questions?
Collaborators:
Yu Liu
Bo Dai
Xiaoyang
Shaoshuai
Wanli
Xiaogang
Email: yangli@ee.cuhk.edu.hk
Slides at: http://www.ee.cuhk.edu.hk/~yangli/
Twitter: @francislee2020
Outline
… 15 min
… 20 min
… 5 min
… 10 min
Core content
Core discussion