CSE 5524: Object detection
HW 1 & 2
Midterm
Homework and quiz plan
Final project (30%)
Project first glance
Today (Chapter 50)
How do I know it is a zebra?
Convolutions
Linear receptive field
Exponential receptive field
(with pooling + down-sampling)
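A rough way to see the difference between the two regimes: the sketch below computes the receptive field of a stack of layers with the standard recurrence r = r + (k - 1) * j, j = j * s, where k is the kernel size, s the stride, and j the spacing between neighboring outputs measured in input pixels. The two layer stacks are illustrative assumptions, not taken from the slides.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of (kernel, stride) layers."""
    r, jump = 1, 1                  # one pixel sees itself; outputs are 1 pixel apart
    for kernel, stride in layers:
        r += (kernel - 1) * jump    # each layer widens the field by (k - 1) * spacing
        jump *= stride              # down-sampling increases the spacing
    return r

# Assumed toy stacks: 3x3 convs only vs. 3x3 conv followed by 2x2 stride-2 pooling.
conv_only = [(3, 1)] * 5
conv_pool = [(3, 1), (2, 2)] * 5

rf_conv_only = receptive_field(conv_only)
rf_conv_pool = receptive_field(conv_pool)
print(rf_conv_only, rf_conv_pool)
```

Without down-sampling every layer adds the same amount, so growth is linear in depth; with stride-2 pooling the spacing doubles at each stage, so growth is roughly exponential.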
CNN
A general architecture of CNN or visual transformers involves
What is the final FC layer doing?
[Koh et al., Concept Bottleneck Models, 2020]
Illustration
red
blue
Long beak
Long leg
Cardinal
Flamingo
Blue Jay
Image
CNN
Yes: 1; No: -1
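One way to read the illustration: if the features entering the final FC layer behave like concept detectors (red, blue, long beak, long leg) that output 1 for yes and -1 for no, then each class score is an inner product between the concept vector and that class's weight row. The weight values below are hypothetical and chosen by hand for illustration; they are not from Koh et al. or from the slides.

```python
import numpy as np

concepts = ["red", "blue", "long beak", "long leg"]
classes = ["Cardinal", "Flamingo", "Blue Jay"]

# Hypothetical FC weights: one row per class, one column per concept.
W = np.array([
    [ 1, -1, -1, -1],   # Cardinal: red, short beak, short legs
    [ 1, -1,  1,  1],   # Flamingo: reddish, long beak, long legs
    [-1,  1, -1, -1],   # Blue Jay: blue, short beak, short legs
], dtype=float)

def classify(concept_scores):
    """Return the class whose weight row best matches the concept vector."""
    logits = W @ np.asarray(concept_scores, dtype=float)
    return classes[int(np.argmax(logits))]

# A bird that is blue, not red, with a short beak and short legs.
prediction = classify([-1, 1, -1, -1])
print(prediction)
```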
Popular CNN architectures
Exemplar computer vision tasks
[C. Rieke, 2019]
Representative 2D recognition tasks
[Figure: four panels, a) to d), on a W x H image; class labels: Dog, Cat, Horse, Sheep]
Does segmentation need a new architecture?
Single spatial output!
Fully-convolutional network (FCN)
[Figure: image → CNN → feature map → vector after vectorization → matrix multiplication (inner product) → scores for Dog, Cat, Boat, Bird]
Fully-convolutional network (FCN)
[Figure: image → CNN → feature map; one filter grid per class: Dog, Cat, Boat, Bird]
Each row = a Conv filter
What if I input a larger image?
[Figure: CNN → feature map → Dog, Cat, Boat, Bird]
[Figure: larger image → CNN → larger feature map → grid of outputs for Dog, Cat, Boat, Bird]
Fully-convolutional network (FCN)
[Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015]
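The FC-to-convolution argument from the preceding slides can be checked numerically. The sketch below assumes a toy feature map of shape (C, 3, 3) and four classes (Dog, Cat, Boat, Bird); the sizes and random weights are illustrative. Each row of the FC weight matrix is reshaped into a (C, 3, 3) filter: on a 3 x 3 feature map it reproduces the FC output, and on a larger feature map it slides to give one score vector per location.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, num_classes = 2, 3, 3, 4        # assumed toy sizes

fc_weight = rng.normal(size=(num_classes, C * H * W))  # each row scores one class
filters = fc_weight.reshape(num_classes, C, H, W)      # each row viewed as a conv filter

def conv_valid(feature_map, filters):
    """Slide every filter over the feature map (stride 1, no padding)."""
    n, c, kh, kw = filters.shape
    _, fh, fw = feature_map.shape
    out = np.zeros((n, fh - kh + 1, fw - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = feature_map[:, i:i + kh, j:j + kw]
            out[:, i, j] = (filters * patch).sum(axis=(1, 2, 3))  # inner products
    return out

small = rng.normal(size=(C, H, W))
fc_scores = fc_weight @ small.reshape(-1)   # vectorize, then matrix multiply
conv_scores = conv_valid(small, filters)    # shape (4, 1, 1): a single output

large = rng.normal(size=(C, 5, 8))          # feature map of a larger image
score_map = conv_valid(large, filters)      # shape (4, 3, 6): a spatial output
print(conv_scores.shape, score_map.shape)
```

The same weights therefore work on any input at least as large as the training size, which is what makes the network "fully convolutional".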
U-Net
Help localization
Help context + semantics
[Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015]
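A minimal sketch of a skip connection, assuming nearest-neighbor up-sampling and channel-wise concatenation (U-Net itself uses learned up-convolutions, and the shapes here are made up). The high-resolution encoder features help localization; the up-sampled deep features carry context and semantics.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x up-sampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def skip_connect(encoder_feat, decoder_feat):
    """Concatenate high-res encoder features with up-sampled decoder features."""
    up = upsample2x(decoder_feat)
    assert up.shape[1:] == encoder_feat.shape[1:], "spatial sizes must match"
    return np.concatenate([encoder_feat, up], axis=0)

encoder_feat = np.zeros((16, 8, 8))   # shallow layer: fine spatial detail
decoder_feat = np.ones((32, 4, 4))    # deep layer: coarse but semantic
merged = skip_connect(encoder_feat, decoder_feat)
print(merged.shape)
```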
How to teach the model?
Today
Object detection
[class, u-center, v-center, width, height]
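The label format above can be converted to corner coordinates, which is handy for cropping and for computing overlaps. A small sketch, assuming u is the horizontal pixel axis and v the vertical one, and that the class is stored as a string; the example values are made up.

```python
def center_to_corners(box):
    """[class, u_center, v_center, width, height] -> [class, u_min, v_min, u_max, v_max]."""
    cls, u, v, w, h = box
    return [cls, u - w / 2, v - h / 2, u + w / 2, v + h / 2]

label = ["zebra", 120.0, 80.0, 60.0, 40.0]   # hypothetical annotation
corners = center_to_corners(label)
print(corners)
```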
Naïve way
ResNet classifier
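A sketch of why the naive approach is expensive, assuming it means cropping windows at many positions and sizes and running the classifier on each crop. The image size, window shapes, and stride are arbitrary choices for illustration, and the classifier call itself is left out.

```python
def sliding_windows(img_w, img_h, sizes, stride):
    """Yield (u_min, v_min, w, h) for every window that fits inside the image."""
    for w, h in sizes:
        for v in range(0, img_h - h + 1, stride):
            for u in range(0, img_w - w + 1, stride):
                yield (u, v, w, h)

sizes = [(64, 64), (128, 128), (128, 64), (64, 128)]   # assumed window shapes
windows = list(sliding_windows(640, 480, sizes, stride=16))
print(len(windows))   # each window would need one classifier forward pass
```

Even this coarse setting produces thousands of crops for one image, which motivates proposal-based detectors.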
R-CNN
[Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014]
Selective search for proposal generation
[Stanford CS 231b]
R-CNN
[Girshick, CVPR 2019 tutorial]
R-CNN
By offset = MLP(feature)
Proposal
Ground truth
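The offsets can be parameterized as in the R-CNN paper: shifts of the box center scaled by the proposal size, and log ratios of width and height. In the sketch below boxes are (x_center, y_center, w, h); the example numbers are made up, and the MLP that would predict the offsets from features is not shown.

```python
import math

def encode(proposal, gt):
    """Regression targets that map a proposal onto its ground-truth box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))

def decode(proposal, offsets):
    """Apply predicted offsets to a proposal to get the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = offsets
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))

proposal = (100.0, 100.0, 50.0, 80.0)
ground_truth = (110.0, 90.0, 60.0, 70.0)
targets = encode(proposal, ground_truth)   # what the regressor is trained to output
refined = decode(proposal, targets)        # recovers the ground-truth box
print(targets, refined)
```

Normalizing by the proposal size keeps the targets in a similar numeric range for small and large boxes.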
Fast R-CNN
ROI pooling
[Girshick, CVPR 2019 tutorial]
[Girshick, Fast R-CNN, ICCV 2015]
ROI pooling vs. ROI align
ROI Align
ROI Pooling
Making features extracted from different proposals the same size!
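A minimal sketch of ROI max pooling on a single-channel feature map, assuming the ROI is given in integer feature-map coordinates and is at least as large as the output grid. Bin edges are rounded to integers here; that quantization is what ROI Align avoids by sampling with bilinear interpolation (not shown).

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """Max-pool the region roi = (r0, c0, r1, c1), end-exclusive, to out_size x out_size."""
    r0, c0, r1, c1 = roi
    row_edges = np.linspace(r0, r1, out_size + 1).round().astype(int)
    col_edges = np.linspace(c0, c1, out_size + 1).round().astype(int)
    out = np.empty((out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            bin_ = feature_map[row_edges[i]:row_edges[i + 1],
                               col_edges[j]:col_edges[j + 1]]
            out[i, j] = bin_.max()
    return out

fmap = np.arange(48).reshape(6, 8)
pooled_a = roi_pool(fmap, (0, 0, 4, 4))   # 4 x 4 proposal
pooled_b = roi_pool(fmap, (1, 2, 6, 7))   # 5 x 5 proposal
print(pooled_a.shape, pooled_b.shape)     # both proposals give a 2 x 2 output
```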
Faster R-CNN
ROI pooling
[Girshick, CVPR 2019 tutorial]
[Ren et al., Faster R-CNN: Towards real-time object detection with region proposal networks, NIPS 2015]
How to teach a model to propose object locations?
Ground truth
[Figure: grid of feature-map locations; a single cell is marked 1]
There is a car centered around this location!
What size?
Consider “pre-defined” anchors
[Figure: five copies of the grid, each with a 1 in the same cell]
How to develop RPN (region proposal network)?
5 * 8 * K * (2 + 4)
[Ren et al., 2015]
Ground truth
Anchor
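The output size on this slide can be reproduced with a small sketch: a 5 x 8 grid of feature-map locations, K pre-defined anchors per location, and for each anchor 2 objectness scores plus 4 box offsets. The stride, the anchor shapes, and the ground-truth box below are assumptions for illustration. Ren et al. label an anchor positive when its IoU with a ground-truth box is high (above 0.7, or the highest for that box); the code simply finds the anchor with the highest IoU.

```python
import numpy as np

GRID_H, GRID_W, STRIDE = 5, 8, 16                 # stride is an assumed value
ANCHOR_SHAPES = [(32, 32), (64, 32), (32, 64)]    # assumed (w, h) pairs, so K = 3
K = len(ANCHOR_SHAPES)

def make_anchors():
    """Return an array of shape (GRID_H, GRID_W, K, 4) holding (x0, y0, x1, y1)."""
    anchors = np.zeros((GRID_H, GRID_W, K, 4))
    for i in range(GRID_H):
        for j in range(GRID_W):
            cx, cy = (j + 0.5) * STRIDE, (i + 0.5) * STRIDE   # cell center in pixels
            for k, (w, h) in enumerate(ANCHOR_SHAPES):
                anchors[i, j, k] = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    return anchors

def iou(boxes, gt):
    """IoU between each box in boxes (..., 4) and a single ground-truth box."""
    x0 = np.maximum(boxes[..., 0], gt[0])
    y0 = np.maximum(boxes[..., 1], gt[1])
    x1 = np.minimum(boxes[..., 2], gt[2])
    y1 = np.minimum(boxes[..., 3], gt[3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area = (boxes[..., 2] - boxes[..., 0]) * (boxes[..., 3] - boxes[..., 1])
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area + gt_area - inter)

anchors = make_anchors()
ground_truth = np.array([40.0, 40.0, 104.0, 72.0])   # hypothetical 64 x 32 car box
overlaps = iou(anchors, ground_truth)                # shape (5, 8, K)
best = np.unravel_index(np.argmax(overlaps), overlaps.shape)
num_outputs = GRID_H * GRID_W * K * (2 + 4)
print(best, overlaps[best], num_outputs)
```

The index `best` plays the role of the "1" in the grids: it names both the location and which of the K anchors should fire for this object.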
What do we learn from RPN?
Questions?
How to deal with object sizes?
[Lin et al., Feature Pyramid Networks for Object Detection, CVPR 2017]
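In the FPN paper, a region of width w and height h is assigned to pyramid level k = floor(k0 + log2(sqrt(w * h) / 224)) with k0 = 4, so small objects are read from high-resolution levels and large objects from coarse ones. A sketch of that rule; the clamp to levels 2 through 5 follows the paper's ResNet setup, and the example box sizes are made up.

```python
import math

def fpn_level(w, h, k0=4, k_min=2, k_max=5):
    """Pyramid level for a w x h region, following the rule in Lin et al. (2017)."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))

levels = [fpn_level(s, s) for s in (32, 112, 224, 448, 900)]
print(levels)
```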
Mask R-CNN
[Girshick, CVPR 2019 tutorial]
[He et al., Mask R-CNN, ICCV 2017]
Mask R-CNN: for instance segmentation
CNN: convolutional neural network
RPN: region proposal network
Bulldozer: 80%
Bus: 15%
Motorcycle: 5%
2-stage vs. 1-stage detectors
[Redmon et al., 2016]
2-stage detector
1-stage detector
Exemplar 1-stage detectors
[Liu et al., 2016]
SSD
YOLO
[Redmon et al., 2016]
Exemplar 1-stage detectors (RetinaNet)
[Lin et al., 2017]
Inference: choose few from many
[Pictures from “towards data science” post]
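Detectors score many overlapping boxes; a common way to keep only a few is greedy non-maximum suppression (NMS), sketched below. The slide text does not name the method, so treat NMS as an assumption here; the boxes, scores, and the 0.5 IoU threshold are made up.

```python
def iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already-kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (11, 9, 49, 51)]
scores = [0.9, 0.8, 0.7, 0.6]
kept = nms(boxes, scores)
print(kept)
```

In practice NMS is usually run per class, so overlapping boxes of different classes do not suppress each other.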
Example results
[Zhang et al., 2021]
New approach to object detection
Key names
Take home
LiDAR-based 3D perception
[Source: Graham Murdoch/Popular Science]
LiDAR:
LiDAR-based 3D perception
You can view the LiDAR point clouds from different angles
Frontal view
Bird’s-eye view (BEV)
Two major ways to process LiDAR point clouds
Voxel-based processing + 3D object detectors
[Yang et al., PIXOR: Real-time 3D Object Detection from Point Clouds, 2019]
[Axes: height, depth, left-right]
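A sketch of the voxel-based route: points are binned into a regular grid over left-right, depth, and height, giving an occupancy tensor that a convolutional detector can consume from the bird's-eye view. The ranges, the 0.5 m resolution, the axis order (x = left-right, y = depth, z = height), and the three sample points are assumptions for illustration, not PIXOR's actual settings.

```python
import numpy as np

def voxelize(points, x_range=(-4.0, 4.0), y_range=(0.0, 8.0),
             z_range=(-2.0, 2.0), resolution=0.5):
    """Binary occupancy grid of shape (height_bins, depth_bins, left_right_bins)."""
    lo = np.array([x_range[0], y_range[0], z_range[0]])
    hi = np.array([x_range[1], y_range[1], z_range[1]])
    dims = np.round((hi - lo) / resolution).astype(int)    # bins along x, y, z
    inside = np.all((points >= lo) & (points < hi), axis=1)
    idx = np.floor((points[inside] - lo) / resolution).astype(int)
    grid = np.zeros((dims[2], dims[1], dims[0]), dtype=np.uint8)
    grid[idx[:, 2], idx[:, 1], idx[:, 0]] = 1              # mark occupied voxels
    return grid

points = np.array([
    [0.2, 3.1, -1.5],    # (left-right, depth, height) in meters
    [0.3, 3.2, -1.4],    # falls in the same voxel as the first point
    [9.0, 1.0,  0.0],    # outside the left-right range, so it is dropped
])
bev = voxelize(points)
print(bev.shape, int(bev.sum()))
```

Treating the height bins as channels turns the grid into a 2D "image" seen from above, so standard 2D convolutions and detection heads apply.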