1 of 63

CSE 5524: Vision Transformer


2 of 63

HW 1 & 2

  • Grades are posted.
  • Regrading forms are posted on Carmen.

3 of 63

Final project (30%)

  • Team forming:
    • Teams of 2–3 students: same expectations regardless of team size

  • Plan:
    • Project proposal: 3/27 (3%)
    • Project presentation: 4/22 & 4/23 (10%)
    • Project report & code release: 4/25 (15%)

4 of 63

Today (Chapter 50 & 26)

  • Object detection recap & continued
  • Vision transformer
  • Midterm


5 of 63

Object detection

  • Properties:
    • Labels + bounding boxes
    • Localization
    • Multi-scale
    • Context
    • An “undetermined” number of objects per image

Box format: [class, u-center, v-center, width, height]

6 of 63

R-CNN


[Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014]

[Girshick, CVPR 2019 tutorial]

7 of 63

R-CNN

  • Box regression: predict center offsets (du, dv) and size offsets (dw, dh)
  • The offsets are predicted by an MLP: offset = MLP(feature)

[Figure: a proposal box regressed toward the ground-truth box]
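
For reference, the parameterization from the R-CNN paper predicts offsets of the ground-truth box G relative to the proposal P (written here in the slide’s u/v notation); the log on width/height makes the size targets scale-invariant:

    t_u = (G_u − P_u) / P_w        t_v = (G_v − P_v) / P_h
    t_w = log(G_w / P_w)           t_h = log(G_h / P_h)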

8 of 63

Fast R-CNN


ROI pooling

[Girshick, CVPR 2019 tutorial]

[Girshick, Fast R-CNN, ICCV 2015]

9 of 63

ROI pooling vs. ROI align

[Figure: ROI Pooling (quantized bins) vs. ROI Align (bilinear sampling at exactly computed locations)]

Goal of both: making the features extracted from different-sized proposals the same size!
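
A minimal sketch of both ops via torchvision (the spatial_scale of 1/16 is an assumed backbone stride, not from the slide):

    import torch
    from torchvision.ops import roi_pool, roi_align

    feat = torch.randn(1, 256, 50, 50)                    # (N, C, H, W) backbone features
    rois = torch.tensor([[0, 64.0, 64.0, 256.0, 192.0]])  # (batch_idx, x1, y1, x2, y2) in image coords

    pooled  = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1 / 16)
    aligned = roi_align(feat, rois, output_size=(7, 7), spatial_scale=1 / 16)
    print(pooled.shape, aligned.shape)                    # both torch.Size([1, 256, 7, 7])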

10 of 63

Faster R-CNN


ROI pooling

[Girshick, CVPR 2019 tutorial]

[Ren et al., Faster r-cnn: Towards real-time object detection with region proposal networks, NIPS 2015]

11 of 63

Region proposal network (RPN)

[Figure: an FCN maps the image to a grid of per-patch features]

FCN = fully convolutional neural network

12 of 63

Region proposal network (RPN)

[Figure: FCN features, with a shared MLP applied to each patch]

Shared MLP, applied to each patch: does the patch belong to an object? (a yes/no objectness map; all 1s in the illustration)

13 of 63

Region proposal network (RPN)

[Figure: FCN features, with two shared MLP heads applied to each patch]

  • Head 1: does the patch belong to an object? (yes/no)
  • Head 2: if yes, what is the size and where is the center? (length, width, center-x, center-y)

14 of 63

How to deal with multiple predictions?

15 of 63

Inference: choose few from many

  • Non-Maximum Suppression (NMS)


[Pictures from “towards data science” post]
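
A minimal NumPy sketch of greedy NMS (illustrative; a production version lives in torchvision.ops.nms):

    import numpy as np

    def nms(boxes, scores, iou_thresh=0.5):
        """boxes: (N, 4) as [x1, y1, x2, y2]; returns indices of kept boxes."""
        x1, y1, x2, y2 = boxes.T
        areas = (x2 - x1) * (y2 - y1)
        order = scores.argsort()[::-1]               # highest score first
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # IoU of the top-scoring box with all remaining boxes
            xx1 = np.maximum(x1[i], x1[order[1:]])
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
            iou = inter / (areas[i] + areas[order[1:]] - inter)
            order = order[1:][iou <= iou_thresh]     # suppress heavily overlapping boxes
        return keep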

16 of 63

Region proposal network (RPN): rethink

[Figure: the RPN from before, with an objectness head (yes/no per patch) plus a box head (length, width, center-x, center-y)]

Can we leverage some “prior” knowledge about object sizes and locations?

17 of 63

Region proposal network (RPN): rethink

[Figure: a shared MLP asks each patch: does it belong to an object of a certain size? Here the objectness map is all 1s]

18 of 63

Region proposal network (RPN): rethink

[Figure: the same question for a different object size; here the objectness map is all 0s]

19 of 63

Region proposal network (RPN): rethink

[Figure: two shared MLP heads applied to each patch]

  • Head 1: does the patch belong to an object of a certain size? (yes/no)
  • Head 2: if yes, what is the offset of the size and center? (length, width, center-x, center-y offsets)

20 of 63

How to develop RPN (region proposal network)?

Output size: 5 × 8 × K × (2 + 4), i.e., a 5 × 8 feature map with K anchors per location; each anchor predicts 2 objectness scores and 4 box offsets.

[Ren et al., 2015]

[Figure: anchors of different sizes compared against the ground-truth box]
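
A minimal RPN-head sketch in PyTorch (my paraphrase of the Faster R-CNN design, not the authors’ code; the 5 × 8 feature map matches the slide’s example):

    import torch
    import torch.nn as nn

    class RPNHead(nn.Module):
        def __init__(self, in_channels=256, num_anchors=9):  # K = 9 anchors per location
            super().__init__()
            self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
            self.cls = nn.Conv2d(in_channels, num_anchors * 2, 1)  # 2 objectness scores per anchor
            self.reg = nn.Conv2d(in_channels, num_anchors * 4, 1)  # 4 box offsets per anchor

        def forward(self, feat):
            h = torch.relu(self.conv(feat))
            return self.cls(h), self.reg(h)

    head = RPNHead()
    scores, offsets = head(torch.randn(1, 256, 5, 8))
    print(scores.shape, offsets.shape)  # (1, 18, 5, 8), (1, 36, 5, 8)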

21 of 63

How to deal with object sizes?


[Lin et al., Feature Pyramid Networks for Object Detection, CVPR 2017]

22 of 63

2-stage vs. 1-stage detectors

  • Other names: single-shot, single-pass, … (e.g., YOLO, SSD)
  • Difference: no ROI pooling/align

[Redmon et al., 2016]

[Figure: 2-stage detector vs. 1-stage detector pipelines]

23 of 63

Exemplar 1-stage detectors

[Figure: SSD [Liu et al., 2016] and YOLO [Redmon et al., 2016] architectures]

24 of 63

Exemplar 1-stage detectors (RetinaNet)


[Lin et al., 2017]

25 of 63

2-stage vs. 1-stage detectors

  • Pros for 1-stage:
    • Faster!

  • Cons for 1-stage:
    • Severe class imbalance: far more negative (background) locations than positives; handling object scale is also harder (see the focal-loss sketch below)


[Redmon et al., 2016]
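
An aside not on the slide: RetinaNet [Lin et al., 2017] counters this negative-heavy imbalance with the focal loss, which down-weights easy (already well-classified) examples. A minimal binary-case sketch:

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
        p = torch.sigmoid(logits)
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** gamma * ce).mean()

    loss = focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())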

26 of 63

Key names

  • Tsung-Yi Lin
  • Ross Girshick
  • Kaiming He
  • Piotr Dollár

27 of 63

Take home

  • Good tutorials online:
    • CVPR 2017-2022, ECCV 2018-2020, ICCV 2017-2021 [search tutorial or workshop]
    • ICML/NeurIPS/ICLR 2018-2021 [search tutorial or workshop]

  • Good framework:
    • PyTorch: Torchvision
    • PyTorch: Detectron2

  • Good source code:
    • Papers with code: https://paperswithcode.com/


28 of 63

LiDAR-based 3D perception

29 of 63

LiDAR-based 3D perception


[Source: Graham Murdoch/Popular Science]

LiDAR:

  • Light Detection And Ranging sensor
  • Produces accurate 3D point clouds of the environment

30 of 63

LiDAR-based 3D perception

You can view the LiDAR point clouds from different angles

[Figure: the same point cloud rendered in the frontal view and in the bird’s-eye view (BEV)]

31 of 63

Two major ways to process LiDAR point clouds

  • Point-wise processing
    • PointNet [Qi et al., 2017]
    • PointNet++ [Qi et al., 2017]
    • PointRCNN [Shi et al., 2019]

  • Voxel-based processing: turn points into a tensor (e.g., W × D × H × F; see the occupancy sketch below)
    • PointPillars [Lang et al., 2019]
    • VoxelNet [Zhou et al., 2017]
    • PIXOR [Yang et al., 2018]
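
A minimal occupancy-voxelization sketch (illustrative; the ranges and voxel size are assumptions, not taken from any cited paper):

    import numpy as np

    def voxelize_occupancy(points, x_range=(0, 70), y_range=(-40, 40),
                           z_range=(-3, 1), voxel=0.1):
        """points: (N, 3) LiDAR xyz -> binary occupancy grid of shape (W, D, H)."""
        lo = np.array([x_range[0], y_range[0], z_range[0]])
        hi = np.array([x_range[1], y_range[1], z_range[1]])
        shape = np.round((hi - lo) / voxel).astype(int)
        idx = np.floor((points - lo) / voxel).astype(int)
        keep = np.all((idx >= 0) & (idx < shape), axis=1)   # drop points outside the grid
        grid = np.zeros(shape, dtype=np.float32)
        grid[idx[keep, 0], idx[keep, 1], idx[keep, 2]] = 1.0
        return grid

    grid = voxelize_occupancy(np.random.uniform(-5, 5, (1000, 3)))
    print(grid.shape)  # (700, 800, 40)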


32 of 63

Voxel-based processing + 3D object detectors

  • Occupancy (PIXOR): encode the 3D points as a 3D occupancy tensor viewed from the bird’s-eye view

[Yang et al., PIXOR: Real-time 3D Object Detection from Point Clouds, CVPR 2018]

[Figure: BEV occupancy tensor; depth and left-right form the spatial axes, height is encoded along the channels]

33 of 63

Transformers

34 of 63

CNN vs. Vision transformer

[Figure: a CNN is a stack of convolution layers; a vision transformer is a stack of transformer layers]

35 of 63

Key difference

  • Replace “convolutional” layers with “attention” layers

36 of 63

Limitation of CNN

  • Assumption: locality (filters with small kernel sizes)

  • Challenge: aggregating global/holistic information about the image is hard, because the receptive field grows slowly with depth (see the arithmetic below)

[Figure: stacking layers with size-2 kernels]

How about fully connected layers?
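
A quick back-of-the-envelope (my numbers, not the slide’s): with k × k kernels and stride 1, the receptive field after L layers spans 1 + L(k − 1) pixels.

    k = 2, image width 224:  1 + L(2 − 1) = 224  =>  L = 223 layers to “see” one full row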

37 of 63

Attention

  • “Selectively” combine/compare neurons!

38 of 63

Vision transformer


(1) Split an image into patches

(2) Vectorize each of them + encode each with a shared MLP + add a “spatial” encoding

(3) Feed the resulting tokens into a 1-layer Transformer Encoder

[Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021]
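
A minimal patch-embedding sketch in PyTorch (my paraphrase, not the paper’s code; a Conv2d with kernel = stride = patch size performs the split + shared linear encoding in one shot):

    import torch
    import torch.nn as nn

    B, C, H, W, P, D = 1, 3, 224, 224, 16, 768     # patch size 16, embedding dim 768
    x = torch.randn(B, C, H, W)

    # steps (1) + (2): split into P x P patches, vectorize, and encode
    to_tokens = nn.Conv2d(C, D, kernel_size=P, stride=P)
    tokens = to_tokens(x).flatten(2).transpose(1, 2)   # (B, 196, 768): one token per patch

    # add a learnable "spatial" (position) encoding per patch
    pos = nn.Parameter(torch.zeros(1, tokens.shape[1], D))
    tokens = tokens + pos
    print(tokens.shape)  # torch.Size([1, 196, 768])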

39 of 63

1-layer of transformer encoder


  • Each patch token x_i is mapped to a query, key, and value by the “learnable” matrices Q, K, V:  q_i = Q x_i,  k_i = K x_i,  v_i = V x_i
  • Relatedness of patch-5 to the others (after softmax):  a_5j = softmax_j(q_5 · k_j)
  • Output for patch-5 = the weighted value vectors:  y_5 = Σ_j a_5j v_j

(Single-head case)
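
A minimal single-head self-attention sketch (illustrative; the 1/√d scaling follows the standard transformer formulation):

    import torch

    def self_attention(x, Wq, Wk, Wv):
        """x: (N, d) tokens; Wq/Wk/Wv: (d, d) learnable matrices."""
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / k.shape[-1] ** 0.5   # scaled dot-product relatedness
        attn = torch.softmax(scores, dim=-1)    # each row sums to 1
        return attn @ v                         # weighted value vectors

    d = 64
    x = torch.randn(10, d)                      # 10 patch tokens
    y = self_attention(x, *(torch.randn(d, d) for _ in range(3)))
    print(y.shape)  # torch.Size([10, 64])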

40 of 63

Data type in attention: tokens

  • Each token “encapsulates” a group of information

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

41 of 63

Operations on tokens: combination

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

42 of 63

Operations on tokens: parallel computation

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

The “F” function here is a “shared” MLP

43 of 63

Operations on tokens: overall

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

44 of 63

The attention layer

  • The combination weights are not fixed

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

45 of 63

The attention layer

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

46 of 63

How to obtain A?

  • An “explicit” query-guided attention

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

47 of 63

How to obtain A?

  • An “explicit” query-guided attention

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

Each token will generate a key vector k

48 of 63

How to obtain A?

  • An “explicit” query-guided attention

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

Softmax: a_i = exp(q · k_i) / Σ_j exp(q · k_j)

49 of 63

What if there is no explicit question? Self-attention

  • Each token generates its own query (question), key, and value using the “shared learnable” W matrices

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

50 of 63

Self-attention expanded

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

51 of 63

Illustration

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

52 of 63

MLP vs. (multi-layer) transformer

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

53 of 63

Vision Transformer

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

[Figure: one encoder block: multi-head self-attention with a skip connection, followed by an MLP with a skip connection]

Check the textbook!
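
A minimal encoder-block sketch in PyTorch (pre-norm variant; an assumption, since details vary across implementations):

    import torch
    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self, d=768, heads=12):
            super().__init__()
            self.norm1 = nn.LayerNorm(d)
            self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(d)
            self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

        def forward(self, x):                    # x: (B, N, d) tokens
            h = self.norm1(x)
            x = x + self.attn(h, h, h)[0]        # multi-head self-attention + skip connection
            x = x + self.mlp(self.norm2(x))      # MLP + skip connection
            return x

    x = torch.randn(1, 196, 768)
    print(Block()(x).shape)  # torch.Size([1, 196, 768])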

54 of 63

Properties: permutation invariant

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

55 of 63

Masked attention

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

Example: generating “I go to school” token by token; each token may only attend to the tokens before it
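
A minimal causal-mask sketch (illustrative):

    import torch

    N, d = 4, 8                                  # e.g., the 4 tokens of "I go to school"
    q = k = v = torch.randn(N, d)
    scores = q @ k.T / d ** 0.5
    mask = torch.tril(torch.ones(N, N, dtype=torch.bool))   # lower-triangular = allowed
    scores = scores.masked_fill(~mask, float("-inf"))       # block future positions
    out = torch.softmax(scores, dim=-1) @ v
    print(out.shape)  # torch.Size([4, 8])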

56 of 63

Position encoding

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

57 of 63

Position encoding

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
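
One common choice is the sinusoidal encoding from the original Transformer (note: ViT instead learns its position embeddings); a sketch:

    import numpy as np

    def sinusoidal_pe(num_pos, d):
        """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same angle)."""
        pos = np.arange(num_pos)[:, None]
        i = np.arange(0, d, 2)[None, :]
        angles = pos / 10000 ** (i / d)
        pe = np.zeros((num_pos, d))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    print(sinusoidal_pe(196, 768).shape)  # (196, 768)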

58 of 63

CNN vs. Vision transformer

[Figure (recap): a CNN is a stack of convolution layers; a vision transformer is a stack of transformer layers]

59 of 63

New approach to object detection

  • DEtection TRansformer (DETR) [Carion et al., End-to-End Object Detection with Transformers, ECCV 2020]
