CSE 5524: ConvNet & Transformer
1
Midterm exam and HW #2
Homework and quiz plan
Final project (20 – 30%)
What will be in the midterm
What is this? How do you tell?
6
Filters to capture these “frequency” components
7
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
[Figure: the filters shown in the Fourier domain and in the spatial domain.]
Gabor filters
8
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
9
[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
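A minimal sketch of how such a filter can be constructed: a Gaussian envelope multiplied by a sinusoidal carrier. This is NumPy-only for illustration; the function name and parameter values (`sigma`, `theta`, `wavelength`) are chosen here, not taken from the slides.

```python
import numpy as np

def gabor_kernel(size=31, sigma=4.0, theta=0.0, wavelength=8.0, phase=0.0):
    """Real-valued Gabor kernel: a Gaussian envelope times a sinusoidal carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate coordinates so the carrier oscillates along direction `theta`.
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_rot / wavelength + phase)
    return envelope * carrier

# A small bank: four orientations at a single scale.
bank = [gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
```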
How many filters do I need? How well separated do they need to be?
10
It seems we need many!
11
What if we resize the image?
12
Questions?
Besides translation, scale is another factor we want to be invariant to
14
Typically, a single filter can only detect a pattern at one size
15
However, we can resize the image
16
Image pyramid
17
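A rough illustration of the idea, assuming a grayscale NumPy image and a simple 2x2 average pool for downsampling (real pipelines usually blur before subsampling):

```python
import numpy as np

def downsample_2x(img):
    """Halve each spatial dimension with 2x2 average pooling (a crude anti-alias)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def image_pyramid(img, levels=4):
    """Return [original, 1/2, 1/4, ...]; the same filter can then be run on every level."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid

levels = image_pyramid(np.random.rand(256, 256).astype(np.float32))
print([lvl.shape for lvl in levels])   # (256, 256), (128, 128), (64, 64), (32, 32)
```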
Feature pyramid networks
18
[Feature Pyramid Networks for Object Detection, CVPR 2017]
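A hedged PyTorch sketch of the core idea in the cited paper: 1x1 lateral convolutions, a top-down pathway with upsampling, and 3x3 smoothing. The `TinyFPN` name and channel counts are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Minimal feature pyramid: lateral 1x1 convs + top-down upsampling + 3x3 smoothing."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):            # feats: backbone maps, high-res -> low-res
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample the coarser map and add it to the finer lateral.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]

# Fake backbone features at strides 8, 16, 32 for a 256x256 input.
c3 = torch.randn(1, 256, 32, 32)
c4 = torch.randn(1, 512, 16, 16)
c5 = torch.randn(1, 1024, 8, 8)
p3, p4, p5 = TinyFPN()((c3, c4, c5))
print(p3.shape, p4.shape, p5.shape)   # all 256 channels, at 32x32, 16x16, 8x8
```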
Questions?
Revisit the bird image
20
How do we apply this?
Convolutional neural networks (CNN) for detection
21
Convolutional neural networks (CNN) for segmentation
22
CNN revisited
23
Shared weights
Vectorization + FC layers
Max pooling + down-sampling
Convolution revisited
24
“Filter” weights (3-by-3-by-“2”), i.e., two 3×3 slices

Feature map (nodes) at layer t (one 5×5 channel shown):
0 0 0 0 1
0 0 0 1 1
0 0 1 1 1
0 1 1 1 1
1 1 1 1 1

Filter slice 1 (3×3):
0 0 1
0 1 1
1 1 1

Filter slice 2 (3×3):
1 1 1
0 0 0
1 1 1

Inner product: at each position, multiply the filter with the patch under it element-wise and sum; the results form the feature map at layer t+1
One filter produces one output “channel”, capturing a different “pattern” (e.g., edges, circles, eyes, etc.)
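The inner-product view can be checked directly with the numbers on the slide. `conv2d_valid` below is an illustrative helper (stride 1, no padding); for a 3-by-3-by-2 filter, the inner products from the two slices would simply be summed.

```python
import numpy as np

# 5x5 feature map and one 3x3 filter slice from the slide.
fmap = np.array([[0, 0, 0, 0, 1],
                 [0, 0, 0, 1, 1],
                 [0, 0, 1, 1, 1],
                 [0, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1]], dtype=float)

filt = np.array([[0, 0, 1],
                 [0, 1, 1],
                 [1, 1, 1]], dtype=float)

def conv2d_valid(x, w):
    """Slide the filter over x; each output value is an inner product (no padding, stride 1)."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

print(conv2d_valid(fmap, filt))   # 3x3 output map for this filter slice
```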
Convolution with [?] filters
25
Strided (down-sampled) convolution
26
Strided (down-sampled) convolution
27
Dilated convolution
28
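A small PyTorch check of how stride and dilation change a 3x3 convolution; the channel counts here are arbitrary, chosen only for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                                        # batch, channels, H, W

conv    = nn.Conv2d(3, 8, kernel_size=3, padding=1)                  # same resolution
strided = nn.Conv2d(3, 8, kernel_size=3, padding=1, stride=2)        # down-samples by 2
dilated = nn.Conv2d(3, 8, kernel_size=3, padding=2, dilation=2)      # same resolution, wider receptive field

print(conv(x).shape)      # torch.Size([1, 8, 32, 32])
print(strided(x).shape)   # torch.Size([1, 8, 16, 16])
print(dilated(x).shape)   # torch.Size([1, 8, 32, 32]); the 3x3 taps now span a 5x5 window
```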
CNN revisited
29
Shared weights
Vectorization + FC layers
Max pooling + down-sampling
A simple CNN classifier
30
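One possible layout of such a classifier in PyTorch, combining convolution + ReLU + max pooling with vectorization and FC layers; the exact sizes are illustrative, not the ones on the slide.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Conv + ReLU + max-pool blocks, then flatten (vectorize) into FC layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32x32 -> 16x16
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                          # vectorization
            nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, num_classes),           # class scores (logits)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = SimpleCNN()(torch.randn(4, 3, 32, 32))
print(logits.shape)   # torch.Size([4, 10])
```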
Spatial output
31
Spatial output
32
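One common way to obtain a spatial output (assumed here, since the slide shows only a figure) is to replace the FC head with 1x1 convolutions, so the same classifier effectively slides over the whole feature map and produces class scores at every location.

```python
import torch
import torch.nn as nn

# Same kind of feature extractor as before, but the "FC" head becomes 1x1 convolutions,
# so we get a class-score map at every spatial location instead of a single vector.
features = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
head = nn.Sequential(
    nn.Conv2d(32, 128, 1), nn.ReLU(),
    nn.Conv2d(128, 10, 1),             # 10 class scores per location
)

x = torch.randn(1, 3, 128, 128)        # a larger input than the classifier was built for
scores = head(features(x))
print(scores.shape)                    # torch.Size([1, 10, 32, 32]) -- a spatial output map
```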
Popular CNN architectures
33
Popular CNN architectures
34
Popular CNN architectures
35
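For reference, standard implementations of popular architectures such as AlexNet, VGG, and ResNet can be instantiated from torchvision; this sketch assumes a recent torchvision where the `weights` argument replaced the old `pretrained` flag.

```python
import torch
from torchvision import models

# Randomly initialized here; pass a weights enum instead of None to load ImageNet weights.
nets = {
    "alexnet": models.alexnet(weights=None),
    "vgg16": models.vgg16(weights=None),
    "resnet18": models.resnet18(weights=None),
}
x = torch.randn(1, 3, 224, 224)
for name, net in nets.items():
    print(name, net(x).shape)   # each outputs 1000 ImageNet class scores
```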
Questions?
Vision transformer
37
CNN vs. Vision transformer
38
CNN: convolution layers
Vision transformer: transformer layers
Vision transformer
39
(1) Split an image into patches
(2) Vectorize each of them + encode each with a shared MLP + “spatial” encoding
1 layer of Transformer Encoder
[Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021]
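A compact PyTorch sketch of steps (1)-(2) plus one encoder layer. Patch size 16 and width 768 follow the "16x16 words" setting of the cited paper, but the class token and other details are omitted, so this is an illustration rather than the paper's model.

```python
import torch
import torch.nn as nn

# (1) split into patches, (2) flatten each patch and project it with a shared
# linear layer (the "shared MLP"), then add a learned position ("spatial") encoding.
image = torch.randn(1, 3, 224, 224)
patch, dim = 16, 768

patches = image.unfold(2, patch, patch).unfold(3, patch, patch)      # 1 x 3 x 14 x 14 x 16 x 16
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, -1)  # 1 x 196 x 768 (3*16*16)

project = nn.Linear(3 * patch * patch, dim)          # shared across all patches
pos_embed = nn.Parameter(torch.zeros(1, 14 * 14, dim))

tokens = project(patches) + pos_embed                # 1 x 196 x 768, ready for the encoder
encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
out = encoder_layer(tokens)
print(out.shape)   # torch.Size([1, 196, 768])
```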
Position encoding
[https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers]
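A NumPy sketch of the fixed sinusoidal encoding discussed at the linked page; note that ViT itself uses learned position embeddings, so this is the sin/cos variant from the original Transformer shown only for intuition.

```python
import numpy as np

def sinusoidal_position_encoding(num_positions, dim):
    """pe[pos, 2i] = sin(pos / 10000^(2i/dim)), pe[pos, 2i+1] = cos(pos / 10000^(2i/dim))."""
    positions = np.arange(num_positions)[:, None]                    # (P, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)    # (dim/2,)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * freqs)
    pe[:, 1::2] = np.cos(positions * freqs)
    return pe

pe = sinusoidal_position_encoding(num_positions=196, dim=768)
print(pe.shape)   # (196, 768) -- one distinct encoding vector per patch position
```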
1-layer of transformer encoder
41
K, Q, V: key, query, value (“learnable” matrices)
Relatedness of patch 5 to the others (after softmax)
Weighted value vectors
Single-head case
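The single-head case written out with PyTorch tensors; random projection matrices stand in for the “learnable” ones, and row 4 (0-indexed) of the attention matrix corresponds to the relatedness of patch 5 to the others.

```python
import torch
import torch.nn.functional as F

# Single-head self-attention over N patch tokens of dimension d.
N, d = 9, 64
tokens = torch.randn(N, d)

# "Learnable" projection matrices for query, key, value (random here for illustration).
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v

# Relatedness of every patch to every other patch: scaled dot products, then softmax.
attn = F.softmax(Q @ K.T / d ** 0.5, dim=-1)      # N x N, each row sums to 1

print(attn[4])        # how much "patch 5" attends to each patch

# Each output token is a weighted sum of the value vectors.
out = attn @ V                                    # N x d
print(out.shape)
```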
CNN vs. Vision transformer
42
CNN: convolution layers
Vision transformer: transformer layers