CSE 5539: Computer Vision
Today
Presentation
Presentation
Questions?
Today
Exemplar computer vision tasks
[C. Rieke, 2019]
Exemplar computer vision tasks
Retrieval, representation learning
Image generation
Vision and language
Neural Radiance Fields (NeRF)
Object-centric vs. scene-centric images
9
Object-centric images: ImageNet [image-level label]
Scene-centric images: MSCOCO [instance segments]
Classification on object-centric images
car
elephant
The progress of deep learning for classification
11
ImageNet-1K (ILSVRC)
Metric: Top-k accuracy
The progress of deep learning for classification
[Simonyan et al., 2015]
[Szegedy et al., 2015]
[Huang et al., 2017]
[He et al., 2016]
[Krizhevsky et al., 2012]
Top-5 error rate
General formulation for all these variants
13
Image (pixels)
Deep neural networks (DNN)
Homework:
- Dropout
- Batch norm
Convolution
A special computation between layers
15
Convolution
16
Feature map (nodes) at layer t (5-by-5):
0 0 0 0 1
0 0 0 1 1
0 0 1 1 1
0 1 1 1 1
1 1 1 1 1
“Filter” weights (3-by-3):
0 0 1
0 1 1
1 1 1
Inner product (element-wise multiplication and sum) of the filter with one 3-by-3 window (e.g., the top-left) gives the value 1 in the feature map at layer t+1.
Convolution
17
Sliding the same 3-by-3 filter over the 5-by-5 feature map (nodes) at layer t, the inner product at another window gives the value 6 in the feature map at layer t+1.
Convolution
18
Near the border of the feature map (nodes) at layer t, part of the 3-by-3 window falls outside the map. Zero-padding: set the missing values to be 0; the inner product then still produces a value (e.g., 1) in the feature map at layer t+1, so the output can keep the same spatial size.
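A minimal NumPy sketch of this sliding-window inner product, using the 5-by-5 map and 3-by-3 filter from the slides (the function name and padding argument are illustrative):

```python
import numpy as np

# Feature map at layer t and the 3-by-3 filter from the slides.
x = np.array([[0, 0, 0, 0, 1],
              [0, 0, 0, 1, 1],
              [0, 0, 1, 1, 1],
              [0, 1, 1, 1, 1],
              [1, 1, 1, 1, 1]], dtype=float)
w = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 1, 1]], dtype=float)

def conv2d(x, w, pad=0):
    """Slide the filter over x; each output value is the inner product
    (element-wise multiplication and sum) of the filter with one window."""
    if pad > 0:
        x = np.pad(x, pad)          # zero-padding: missing values are 0
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * w)
    return out

print(conv2d(x, w))          # 3-by-3 output; the top-left value is 1
print(conv2d(x, w, pad=1))   # 5-by-5 output with zero-padding ("same" size)
```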
Convolution example
19
0 0 0
1 1 1
0 0 0
1 1 1
0 0 0
1 1 1
Convolution
20
When the feature map (nodes) at layer t has 2 channels, the “filter” weights become 3-by-3-by-“2”: one 3-by-3 slice per input channel.
First channel of the feature map at layer t (5-by-5):
0 0 0 0 1
0 0 0 1 1
0 0 1 1 1
0 1 1 1 1
1 1 1 1 1
One filter slice (3-by-3):
0 0 1
0 1 1
1 1 1
The inner product over all 3-by-3-by-2 values gives a single value in the feature map at layer t+1.
Convolution
21
Two such “filter” weights, each 3-by-3-by-“2”, are applied to the same 2-channel feature map (nodes) at layer t: the first filter (one slice shown above, 0 0 1 / 0 1 1 / 1 1 1) and a second filter with a slice
1 1 1
0 0 0
1 1 1
Each inner product produces one channel of the feature map at layer t+1.
One filter for one output “channel” to capture a different “pattern” (e.g., edges, circles, eyes, etc.)
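A hedged PyTorch sketch of the multi-channel case: each filter spans all input channels and yields one output channel (sizes and names are illustrative):

```python
import torch
import torch.nn as nn

# Feature map at layer t: 2 input channels, 5-by-5 spatial size (batch of 1).
x = torch.randn(1, 2, 5, 5)

# 4 filters, each 3-by-3-by-2: every filter spans all input channels and
# produces one output channel (one "pattern detector").
conv = nn.Conv2d(in_channels=2, out_channels=4, kernel_size=3, padding=1)

y = conv(x)
print(y.shape)   # torch.Size([1, 4, 5, 5]): one 5-by-5 map per filter
```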
Convolution: properties
22
[Figure: responses of the filters at different spatial locations]
Top-left, top-right: has ears
Middle: has eyes
Convolutional neural networks (CNN)
23
Shared weights
Vectorization + FC layers
Max pooling + down-sampling
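A minimal PyTorch sketch of this recipe (convolutions with shared weights, max pooling + down-sampling, then vectorization + FC layers); the layer sizes and the 32-by-32 input are illustrative, not any particular published network:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Convolutions (shared weights) + max pooling/down-sampling,
    then vectorization + fully-connected (FC) layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # down-sample by 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                          # vectorization
            nn.Linear(64 * 8 * 8, 256), nn.ReLU(), # FC layers
            nn.Linear(256, num_classes),           # class scores
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = SmallCNN()(torch.randn(2, 3, 32, 32))
print(logits.shape)   # torch.Size([2, 10])
```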
Receptive field
24
Linear receptive field
Exponential receptive field
(with pooling + down-sampling)
Layers of feature maps (representations)
25
What does a large response at each layer/channel mean?
Representative CNN networks
[Krizhevsky et al., 2012]
[Simonyan et al., 2015]
26
Representative CNN networks
Representative CNN networks
[He et al, 2016]
[Huang et al, 2017]
28
Advantages:
Representative CNN networks
A general architecture involves
29
Training a CNN for classification
30
100: elephant
Minimize the empirical risk
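A hedged training-loop sketch that minimizes the empirical risk with a cross-entropy loss; `model` and `train_loader` are assumed to exist (e.g., a CNN and an ImageNet-style loader where label 100 = elephant):

```python
import torch
import torch.nn as nn

# model: any classifier mapping images to class scores (e.g., a CNN);
# train_loader: yields (image batch, integer label batch).
def train(model, train_loader, epochs=10, lr=1e-3):
    criterion = nn.CrossEntropyLoss()                  # per-example loss
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, labels in train_loader:
            logits = model(images)
            loss = criterion(logits, labels)           # empirical risk on the batch
            optimizer.zero_grad()
            loss.backward()                            # back-propagation
            optimizer.step()                           # gradient descent step
    return model
```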
Classification on object-centric images
Car
Person
Bike
Tree
Elephant
Giraffe
Tree
The diversity of deep learning models
32
Visual transformers
[Liu et al., 2021]
[Battaglia et al., 2018]
Graph neural networks
[Qi et al., 2017]
PointNet
[Zoph et al., 2017]
Neural architecture search
The diversity of deep learning algorithms
33
Meta-learning
[Finn et al., 2017]
Adversarial learning
[Ganin et al., 2016]
[He et al., 2020]
Contrastive learning
Today
Visual transformer
35
Image (pixels)
CNN vs. Visual transformer
36
CNN
Convolution
Visual transformer
Transformer
Visual transformer
37
(2) Vectorize each of them
+ encode each with a shared MLP
+ “spatial” encoding
(1) Split an image into patches
1-layer of Transformer Encoder
[Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021]
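A rough PyTorch sketch of steps (1)-(2), with a single shared linear layer standing in for the shared MLP and a learned positional (“spatial”) encoding; patch size 16 and dimension 768 follow the ViT-Base defaults, other names are illustrative:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """(1) Split an image into P-by-P patches; (2) vectorize each patch,
    encode it with a shared linear layer, and add a learned 'spatial' encoding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2          # 14 * 14 = 196
        self.patch_size = patch_size
        self.proj = nn.Linear(in_chans * patch_size * patch_size, dim)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, x):                                    # x: (B, 3, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        patches = x.unfold(2, P, P).unfold(3, P, P)          # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        return self.proj(patches) + self.pos                 # one token per patch

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 196, 768])
```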
Position encoding
[https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers]
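A small NumPy sketch of the fixed sinusoidal position encoding described at the link above (an alternative to learned encodings); names are illustrative:

```python
import numpy as np

def sinusoidal_position_encoding(num_positions, dim):
    """Fixed sinusoidal encoding: PE[p, 2i] = sin(p / 10000^(2i/dim)),
    PE[p, 2i+1] = cos(p / 10000^(2i/dim))."""
    positions = np.arange(num_positions)[:, None]            # (P, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))    # (dim/2,)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * freqs)
    pe[:, 1::2] = np.cos(positions * freqs)
    return pe

pe = sinusoidal_position_encoding(num_positions=196, dim=768)
print(pe.shape)   # (196, 768)
```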
1-layer of transformer encoder
39
K, Q, V (key, query, value): “learnable” matrices
Relatedness of patch-5 to the others (after softmax)
Weighted value vectors
Single-head case
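A minimal sketch of the single-head case: learnable K/Q/V matrices, softmax over the query-key dot products (the relatedness of one patch to the others), then a weighted sum of value vectors. All names and sizes are illustrative:

```python
import torch

def single_head_attention(X, Wq, Wk, Wv):
    """X: (N, d) patch tokens; Wq, Wk, Wv: 'learnable' (d, d_k) matrices.
    Row i of the softmax gives the relatedness of patch i to all patches;
    the output is the corresponding weighted sum of value vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                    # queries, keys, values
    scores = Q @ K.T / K.shape[-1] ** 0.5               # (N, N) dot products
    attn = torch.softmax(scores, dim=-1)                # relatedness weights
    return attn @ V                                     # weighted value vectors

d, d_k, N = 768, 64, 196
X = torch.randn(N, d)
Wq, Wk, Wv = (torch.randn(d, d_k) for _ in range(3))
print(single_head_attention(X, Wq, Wk, Wv).shape)       # torch.Size([196, 64])
```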
CNN vs. Visual transformer
40
CNN
Convolutions
Visual transformer
Transformer
Swin transformer
41
[Liu et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021]
ImageNet classification accuracy
42
[Liu et al., 2021]
Question: How to perform final classification?
43
Adding a classification token
[Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021]
44
Often called a [CLS] token which is learnable
1-layer of transformer encoder
K, Q, V (key, query, value): “learnable” matrices
Single-head case
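A hedged sketch of classification with a learnable [CLS] token, using PyTorch's stock transformer encoder as a stand-in for the ViT encoder (only 2 layers to keep the example small; names are illustrative):

```python
import torch
import torch.nn as nn

class ViTClassifierHead(nn.Module):
    """Prepend a learnable [CLS] token to the patch tokens; after the
    transformer encoder, classify from the [CLS] token's output embedding."""
    def __init__(self, dim=768, num_classes=1000, depth=2, heads=12):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable [CLS]
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                       # tokens: (B, N, dim)
        B = tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)       # (B, 1, dim)
        x = torch.cat([cls, tokens], dim=1)          # (B, N+1, dim)
        x = self.encoder(x)
        return self.head(x[:, 0])                    # classify from [CLS]

logits = ViTClassifierHead()(torch.randn(2, 196, 768))
print(logits.shape)   # torch.Size([2, 1000])
```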
45
1-layer of transformer encoder
K, Q, V (key, query, value): “learnable” matrices
Multi-head case
46
K, Q, V
Multi-head attention
[Mathilde Caron et al., Emerging Properties in Self-Supervised Vision Transformers, 2021] [DINO]
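A minimal sketch of the multi-head case: h independent sets of learnable K/Q/V matrices, each attending in its own subspace, with the concatenated outputs mixed by an output matrix. Names and sizes are illustrative:

```python
import torch

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """Multi-head case: h independent sets of learnable K/Q/V matrices,
    each attending in its own subspace; outputs are concatenated and
    mixed by Wo. Shapes: X (N, d); Wq/Wk/Wv (h, d, d_k); Wo (h*d_k, d)."""
    heads = []
    for q, k, v in zip(Wq, Wk, Wv):                   # one head at a time
        Q, K, V = X @ q, X @ k, X @ v
        attn = torch.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)
        heads.append(attn @ V)                        # (N, d_k) per head
    return torch.cat(heads, dim=-1) @ Wo              # (N, d)

N, d, h = 196, 768, 12
d_k = d // h
X = torch.randn(N, d)
Wq, Wk, Wv = (torch.randn(h, d, d_k) for _ in range(3))
Wo = torch.randn(h * d_k, d)
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)  # torch.Size([196, 768])
```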
Short summary
A general architecture of CNN or visual transformers involves
48
Today
Representative 2D recognition tasks
50
Dog, Cat, Horse, Sheep
[Figure: an H-by-W image; panels a)–d) illustrate the different recognition tasks]
Object- vs. scene-centric images
51
MSCOCO [scene-centric]:
ImageNet [object-centric]:
Object- vs. scene-centric images
52
Today
Semantic segmentation
54
New architecture?
55
Single spatial output!
Fully-convolutional network (FCN)
56
[Figure] Image → CNN → Feature map → Vector after vectorization → Matrix multiplication (inner product with one weight row per class) → scores for Dog, Cat, Boat, Bird
Fully-convolutional network (FCN)
57
[Figure] Image → CNN → Feature map → convolution instead of vectorization + FC: each row of the FC weight matrix = a Conv filter, so the same weights produce a spatial map of Dog, Cat, Boat, Bird scores
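A hedged PyTorch sketch of this "convolutionalization": the FC classifier's weight rows are reshaped into Conv filters, so the same weights can slide over larger inputs and emit a spatial score map (sizes are illustrative):

```python
import torch
import torch.nn as nn

# Classification head: vectorize a 7x7x512 feature map, then an FC layer
# produces 4 class scores (Dog, Cat, Boat, Bird).
C, K, num_classes = 512, 7, 4
fc = nn.Linear(C * K * K, num_classes)

# The same weights reinterpreted as convolution filters: each row of the FC
# weight matrix = one KxKxC Conv filter, so the "FC" can slide over larger
# feature maps and output a spatial map of class scores.
conv = nn.Conv2d(C, num_classes, kernel_size=K)
conv.weight.data = fc.weight.data.view(num_classes, C, K, K)
conv.bias.data = fc.bias.data

small = torch.randn(1, C, 7, 7)                   # original feature-map size
large = torch.randn(1, C, 16, 16)                 # feature map of a larger image
print(torch.allclose(conv(small).flatten(),
                     fc(small.flatten(1)).flatten(), atol=1e-4))   # True
print(conv(large).shape)                          # (1, 4, 10, 10) score maps
```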
Fully-convolutional network (FCN)
58
[Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015]
Up-sampling
59
Interpolation
Deconvolution
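A short PyTorch sketch contrasting the two up-sampling options (fixed bilinear interpolation vs. a learned transposed convolution); the sizes and the 21-class score map are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

score_map = torch.randn(1, 21, 16, 16)     # coarse per-class scores (e.g., 21 classes)

# (a) Interpolation: fixed bilinear up-sampling to the input resolution.
up_interp = F.interpolate(score_map, scale_factor=8, mode='bilinear',
                          align_corners=False)

# (b) Deconvolution (transposed convolution): learned up-sampling filters.
deconv = nn.ConvTranspose2d(21, 21, kernel_size=16, stride=8, padding=4)
up_learned = deconv(score_map)

print(up_interp.shape, up_learned.shape)   # both (1, 21, 128, 128)
```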
Fully-convolutional network (FCN)
60
Help localization
Help context + semantics
[Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015]
U-Net
61
Help localization
Help context + semantics
[Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015]
U-Net (a.k.a. the hourglass network)
62
Dilated (Atrous) convolution
Exponential receptive field: w/o down-sampling + up-sampling
w/ same # of parameters to learn
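A small PyTorch sketch: a dilated 3-by-3 convolution has the same number of weights as a standard one but covers a wider window, so the receptive field grows without down-sampling (sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Standard 3x3 conv: receptive field grows linearly with depth.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Dilated (atrous) 3x3 conv with dilation 2: same 3x3 weights, but the
# filter taps are spread out to cover a 5x5 window, so the receptive field
# grows faster without any down-sampling/up-sampling.
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(conv(x).shape, atrous(x).shape)             # both (1, 64, 32, 32)
print(conv.weight.shape == atrous.weight.shape)   # True: same # of parameters
```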
CRF to improve localization
64
CRF: similar and nearby pixels have the same class label
[Chen et al., DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, PAMI 2017]
Atrous Spatial Pyramid Pooling (ASPP) for multi-scale features
65
Example results
66
[Nirkin et al., HyperSeg, 2021]
Ground truth
[Zhao et al., Pyramid scene parsing network, 2017]
Today
Object detection
68
[class, u-center, v-center, width, height]
Naïve way
69
ResNet classifier
R-CNN
70
[Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014]
Selective search for proposal generation
71
[Stanford CS 231b]
R-CNN
72
[Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014]
[Girshick, CVPR 2019 tutorial]
R-CNN
Refine each proposal toward the ground truth by an offset = MLP(feature)
73
Proposal
Ground truth
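A hedged sketch of the regression targets the MLP is trained to predict, using the standard R-CNN offset parameterization between a proposal and its matched ground-truth box (the boxes here are illustrative (x_center, y_center, width, height) values):

```python
import torch

def box_regression_targets(proposal, gt):
    """R-CNN-style offsets between a proposal box and its ground-truth box,
    both given as (x_center, y_center, width, height). The MLP head is
    trained to predict these offsets from the proposal's feature."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    tx = (gx - px) / pw           # shift, normalized by proposal size
    ty = (gy - py) / ph
    tw = torch.log(gw / pw)       # log-scale change in width
    th = torch.log(gh / ph)       # log-scale change in height
    return torch.stack([tx, ty, tw, th])

proposal = torch.tensor([50., 60., 100., 80.])
gt = torch.tensor([55., 58., 120., 90.])
print(box_regression_targets(proposal, gt))
```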
R-CNN
74
Fast R-CNN
75
ROI pooling
[Girshick, CVPR 2019 tutorial]
[Girshick, Fast R-CNN, ICCV 2015]
ROI pooling vs. ROI align
76
ROI Align
ROI Pooling
Making features extracted from different proposals the same size!
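A short usage sketch of the two ops from torchvision (the feature-map size, stride 1/16, and proposal boxes are illustrative):

```python
import torch
from torchvision.ops import roi_align, roi_pool

feat = torch.randn(1, 256, 50, 50)           # backbone feature map (1/16 of image)
# One proposal per row: (batch_index, x1, y1, x2, y2) in image coordinates.
rois = torch.tensor([[0., 100., 120., 300., 260.],
                     [0.,  40.,  60., 500., 400.]])

# Both ops crop each proposal's region and resize it to a fixed 7x7 grid,
# so proposals of different sizes yield same-size features. ROI Align uses
# bilinear sampling instead of quantized bins, giving better localization.
pooled  = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1/16)
aligned = roi_align(feat, rois, output_size=(7, 7), spatial_scale=1/16,
                    sampling_ratio=2, aligned=True)
print(pooled.shape, aligned.shape)           # both (2, 256, 7, 7)
```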
Faster R-CNN
77
ROI pooling
[Girshick, CVPR 2019 tutorial]
[Ren et al., Faster r-cnn: Towards real-time object detection with region proposal networks, NIPS 2015]
How to develop RPN�(region proposal network)?
78
5 * 8 * K * (2 + 4): for a 5-by-8 feature map with K anchors per location, predict 2 objectness scores + 4 box offsets per anchor
[Ren et al., 2015]
Ground truth
Anchor
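A hedged sketch of anchor generation, reading the slide's 5 * 8 * K * (2 + 4) as a 5-by-8 feature map with K anchors per location, each needing 2 objectness scores + 4 box offsets; the scales, ratios, and stride are typical Faster R-CNN choices, not taken from the slide:

```python
import torch

def generate_anchors(feat_h=5, feat_w=8, stride=32,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Place K = len(scales) * len(ratios) anchor boxes at every location of
    the feature map. For each anchor, the RPN predicts 2 objectness scores
    and 4 box offsets, i.e. feat_h * feat_w * K * (2 + 4) numbers in total."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride   # anchor center
            for s in scales:
                for r in ratios:
                    w, h = s * r ** 0.5, s / r ** 0.5         # aspect ratio w/h = r
                    anchors.append([cx, cy, w, h])
    return torch.tensor(anchors)                              # (feat_h*feat_w*K, 4)

A = generate_anchors()
print(A.shape)   # torch.Size([360, 4]) for a 5-by-8 map with K = 9 anchors
```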
What do we learn from RPN?
79
Questions?
How to deal with object sizes?
81
[Lin et al., Feature Pyramid Networks for Object Detection, CVPR 2017]
Mask R-CNN
82
[Girshick, CVPR 2019 tutorial]
[He et al., Mask r-cnn, ICCV 2017]
Mask R-CNN: for instance segmentation
83
CNN: convolutional neural network
RPN: region proposal network
Bulldozer: 80%
Bus: 15%
Motorcycle: 5%
2-stage vs. 1-stage detectors
84
[Redmon et al., 2016]
2-stage detector
1-stage detector
Exemplar 1-stage detectors
85
[Liu et al., 2016]
SSD
YOLO
[Redmon et al., 2016]
Exemplar 1-stage detectors (Retina Net)
86
[Lin et al., 2017]
2-stage vs. 1-stage detectors
87
[Redmon et al., 2016]
Inference: choose few from many
88
[Pictures from “towards data science” post]
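A minimal sketch of non-maximum suppression (NMS), the usual way to choose few boxes from many at inference, using torchvision's nms; the boxes and scores are illustrative:

```python
import torch
from torchvision.ops import nms

# Many overlapping detections for the same object: keep few from many.
boxes = torch.tensor([[100., 100., 210., 210.],
                      [102.,  98., 208., 215.],
                      [105.,  95., 212., 208.],
                      [300., 300., 380., 390.]])   # (x1, y1, x2, y2)
scores = torch.tensor([0.9, 0.8, 0.75, 0.6])

# Non-maximum suppression: greedily keep the highest-scoring box and drop
# any remaining box whose IoU with it exceeds the threshold.
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)   # e.g., tensor([0, 3]): one box per object survives
```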
Example results
89
[Zhang, et al., 2021]
Key names
Take home
91