CSE 5524: Vision Transformer
HW 1 & 2
Final project (30%)
Today (Chapter 50 & 26)
Object detection
[class, u-center, v-center, width, height]
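This center-format box can be converted to corner coordinates for drawing or overlap computation; a minimal sketch (the function name is illustrative):

```python
def center_to_corners(box):
    """Convert [class, u_center, v_center, width, height] to (class, u1, v1, u2, v2)."""
    cls, uc, vc, w, h = box
    return cls, uc - w / 2, vc - h / 2, uc + w / 2, vc + h / 2

print(center_to_corners([3, 50.0, 40.0, 20.0, 10.0]))
# (3, 40.0, 35.0, 60.0, 45.0)
```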
R-CNN
[Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014]
[Girshick, CVPR 2019 tutorial]
R-CNN
Refine the box by a regressed offset: offset = MLP(feature)
Proposal
Ground truth
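The MLP regresses offsets that map the proposal box onto the ground truth. A minimal sketch of the standard parameterization used by R-CNN-style detectors, assuming (center_x, center_y, width, height) boxes:

```python
import math

def box_offsets(proposal, gt):
    """Regression targets (tx, ty, tw, th) from a proposal box to a ground-truth box.
    Boxes are (center_x, center_y, width, height)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    tx = (gx - px) / pw      # center shift, normalized by proposal size
    ty = (gy - py) / ph
    tw = math.log(gw / pw)   # log-scale change of width / height
    th = math.log(gh / ph)
    return tx, ty, tw, th

# A proposal identical to the ground truth has all-zero targets.
print(box_offsets((10, 10, 4, 4), (10, 10, 4, 4)))  # (0.0, 0.0, 0.0, 0.0)
```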
Fast R-CNN
ROI pooling
[Girshick, CVPR 2019 tutorial]
[Girshick, Fast R-CNN, ICCV 2015]
ROI pooling vs. ROI align
ROI Align
ROI Pooling
Making features extracted from different proposals the same size!
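A minimal NumPy sketch of ROI pooling: round the region to integer coordinates and max-pool it into a fixed-size grid. (ROI Align instead samples with bilinear interpolation, avoiding the rounding; that part is not shown here.)

```python
import numpy as np

def roi_pool(feature, roi, out_size=2):
    """Max-pool the region roi = (r1, c1, r2, c2) of a 2-D feature map
    into an out_size x out_size grid (coordinates rounded to integers)."""
    r1, c1, r2, c2 = (int(round(v)) for v in roi)
    region = feature[r1:r2, c1:c2]
    out = np.zeros((out_size, out_size))
    rs = np.linspace(0, region.shape[0], out_size + 1).astype(int)
    cs = np.linspace(0, region.shape[1], out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[rs[i]:rs[i+1], cs[j]:cs[j+1]].max()
    return out

feat = np.arange(36.0).reshape(6, 6)
print(roi_pool(feat, (0, 0, 4, 4)))  # always a 2x2 output, whatever the ROI size
```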
Faster R-CNN
ROI pooling
[Girshick, CVPR 2019 tutorial]
[Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS 2015]
Region proposal network (RPN)
FCN
……
FCN = fully convolutional neural network
Region proposal network (RPN)
FCN
……
Shared MLP
To each patch
Does each patch belong to an object?
[Figure: objectness map over the patch grid; each patch labeled yes (1) or no (0)]
Region proposal network (RPN)
FCN
……
Shared MLP
To each patch
Does each patch belong to an object?
Shared MLP
If yes, what is the size? Where is the center?
[Figure: per-patch objectness (yes/no) and box regression outputs: length, width, center-x, center-y]
How to deal with multiple predictions?
Inference: choose few from many
[Pictures from a Towards Data Science post]
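"Choose few from many" is standardly done with non-maximum suppression (NMS). A minimal sketch with corner-format boxes (the 0.5 IoU threshold is an illustrative default):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedily keep the highest-scoring box, drop boxes that overlap it, repeat."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = int(order[0])
        keep.append(i)
        order = [j for j in order[1:] if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- box 1 is suppressed by box 0
```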
Region proposal network (RPN): rethink
FCN
……
Shared MLP
To each patch
Does each patch belong to an object?
Shared MLP
If yes, what is the size? Where is the center?
[Figure: per-patch objectness (yes/no) and box regression outputs: length, width, center-x, center-y]
Can we leverage some “prior” knowledge about object sizes and locations?
Region proposal network (RPN): rethink
FCN
……
Shared MLP
To each patch
Does each patch belong to an object of certain size?
[Figure: per-patch yes (1) / no labels for one anchor size]
Region proposal network (RPN): rethink
FCN
……
Shared MLP
To each patch
Does each patch belong to an object of certain size?
[Figure: per-patch yes / no (0) labels for another anchor size]
Region proposal network (RPN): rethink
FCN
……
Shared MLP
To each patch
Does each patch belong to an object of certain size?
Shared MLP
If yes, what is the offset of the size and center?
[Figure: per-patch yes/no labels and regressed offsets: length, width, center-x, center-y]
How to develop an RPN (region proposal network)?
5 × 8 × K × (2 + 4) outputs: a 5×8 feature map with K anchors per location, each needing 2 objectness scores + 4 box offsets
[Ren et al., 2015]
Ground truth
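The count 5 * 8 * K * (2 + 4) can be checked directly; a sketch assuming K = 9 anchors per cell (3 scales × 3 aspect ratios, as in Faster R-CNN):

```python
H, W = 5, 8   # feature-map size from the slide
K = 9         # anchors per location (3 scales x 3 aspect ratios)

objectness_shape = (H, W, 2 * K)  # object / not-object score per anchor
regression_shape = (H, W, 4 * K)  # (x, y, w, h) offsets per anchor

total = H * W * K * (2 + 4)
print(total)  # 2160
```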
How to deal with object sizes?
[Lin et al., Feature Pyramid Networks for Object Detection, CVPR 2017]
2-stage vs. 1-stage detectors
[Redmon et al., 2016]
2-stage detector
1-stage detector
Exemplar 1-stage detectors
[Liu et al., 2016]
SSD
YOLO
[Redmon et al., 2016]
Exemplar 1-stage detectors (RetinaNet)
[Lin et al., 2017]
2-stage vs. 1-stage detectors
[Redmon et al., 2016]
Key names
Take home
LiDAR-based 3D perception
LiDAR-based 3D perception
[Source: Graham Murdoch/Popular Science]
LiDAR:
LiDAR-based 3D perception
You can view the LiDAR point clouds from different angles
Frontal view
Bird’s-eye view (BEV)
Two major ways to process LiDAR point clouds
Voxel-based processing + 3D object detectors
[Yang et al., PIXOR: Real-Time 3D Object Detection from Point Clouds, CVPR 2018]
[Figure: voxelized point cloud; axes: height, depth, left-right]
Transformers
CNN vs. Vision transformer
CNN
Convolution
Vision transformer
Transformer
Key difference
Limitation of CNN
Using a size-2 kernel
How about Fully connected layers?
Attention
Vision transformer
(1) Split an image into patches
(2) Vectorize each patch, encode each with a shared MLP, and add a "spatial" (positional) encoding
1-layer of Transformer Encoder
[Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021]
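Step (1) can be sketched with NumPy: split an H×W image into P×P patches and flatten each into one token vector (toy sizes; the paper's default is 16×16 patches on 224×224 images):

```python
import numpy as np

def patchify(image, p):
    """Split an (H, W, C) image into flattened p x p patches -> (N, p*p*C)."""
    h, w, c = image.shape
    patches = image.reshape(h // p, p, w // p, p, c)  # expose the block structure
    patches = patches.transpose(0, 2, 1, 3, 4)        # (rows, cols, p, p, c)
    return patches.reshape(-1, p * p * c)             # one vector per patch

img = np.zeros((224, 224, 3))
tokens = patchify(img, 16)
print(tokens.shape)  # (196, 768) -- 14*14 patches, each a 768-dim vector
```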
1-layer of transformer encoder
K
Q
V
key, query, value
“learnable” matrices
Relatedness of patch-5 to others (after softmax)
Weighted value vectors
Single-head case
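The single-head case above can be sketched in NumPy; the learnable matrices are random here, just to check shapes (in training they are learned parameters):

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head attention: X is (N, d) tokens; Wq, Wk, Wv are learnable (d, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])            # relatedness of each patch to others
    scores -= scores.max(axis=1, keepdims=True)       # for numerical stability
    A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
    return A @ V                                      # weighted value vectors

rng = np.random.default_rng(0)
N, d = 6, 8                                  # 6 tokens of dimension 8 (toy sizes)
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)  # (6, 8): one output vector per token
```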
Data type in attention: tokens
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
Operations on tokens: combination
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
Operations on tokens: parallel computation
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
The “F” function here is a “shared” MLP
Operations on tokens: overall
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
The attention layer
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
The attention layer
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
How to obtain A?
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
How to obtain A?
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
Each token will generate a key vector k
How to obtain A?
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
Softmax:
What if there is no explicit question? Self-attention
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
Self-attention expanded
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
Illustration
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
MLP vs. (multi-layer) transformer
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
Vision Transformer
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
Skip connection
Multi-head
Self-attention
Check textbook!
Properties: permutation invariant
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
Masked attention
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
I go to school
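For the four tokens above, a causal mask lets position i attend only to positions ≤ i; this is implemented by setting future scores to -inf before the softmax. A sketch with all-zero (uniform) scores:

```python
import numpy as np

tokens = ["I", "go", "to", "school"]
n = len(tokens)
scores = np.zeros((n, n))                          # stand-in attention scores
mask = np.triu(np.ones((n, n)), k=1).astype(bool)  # True above the diagonal = future
scores[mask] = -np.inf                             # block attention to future tokens
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(A.round(2))
# row i attends uniformly over tokens 0..i; future weights are exactly 0
```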
Position encoding
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
Position encoding
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
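One common choice is the sinusoidal encoding from the original Transformer (note ViT instead learns its position embeddings); a sketch:

```python
import numpy as np

def sinusoidal_positions(n_tokens, d):
    """pe[p, 2i] = sin(p / 10000^(2i/d)), pe[p, 2i+1] = cos(p / 10000^(2i/d))."""
    pos = np.arange(n_tokens)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / 10000 ** (2 * i / d)
    pe = np.zeros((n_tokens, d))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positions(196, 64)  # one encoding per patch token
print(pe.shape)  # (196, 64)
```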
CNN vs. Vision transformer
CNN
Convolutions
Vision transformer
Transformer
New approach to object detection