1 of 91

CSE 5539: Computer Vision

2 of 91

Today

  • Basic deep learning blocks for computer vision
    • Convolutional neural nets
    • Visual transformers
  • Applications: 2D recognition
    • 2D semantic segmentation
    • 2D object detection

3 of 91

Presentation

  • Roughly 60 mins
  • Cover 1-2 papers and some background
  • The survey (must use LaTeX) must include more papers (e.g., >10 papers)
  • (Recommended) Please come to the office hours before your presentation
  • Outline:
    • 10-min introduction
    • 40-min approaches
    • 10-min experiments

4 of 91

Presentation

5 of 91

Questions?

6 of 91

Today

  • Basic deep learning blocks for computer vision
    • Convolutional neural nets
    • Visual transformers
  • Applications: 2D recognition
    • 2D semantic segmentation
    • 2D object detection

7 of 91

Exemplar computer vision tasks

[C. Rieke, 2019]

8 of 91

Exemplar computer vision tasks

Retrieval, representation learning

Image generation

Vision and language

Neural Radiance Fields (NeRF)

9 of 91

Object-centric vs. scene-centric images


Object-centric images:

  • Contain a single class of objects
  • The object size is usually large
  • The background is simple

Scene-centric images:

  • Contain multiple classes of objects
  • The object sizes can vary
  • The background is cluttered
  • Objects may be occluded

ImageNet [image-level labels]:

  • 1K classes (~1M images)
  • 21K classes (~14M images)

MSCOCO [instance-level labels]:

  • 82 classes (~0.3M images)

10 of 91

Classification on object-centric images

  • Single object class (not multi-label cases)

  • Properties to capture:
    • Translation invariance
    • Flip invariance (e.g., left-right flip)
    • Scale invariance

car

elephant

11 of 91

The progress of deep learning for classification


ImageNet-1K (ILSVRC)

  • 1,000 object classes
  • 1,000 training images/class
  • Each image contains just one class of object!

Metric: Top-k accuracy

  • For each image, return a list of top-k possible classes
  • If the true class is within the list, the classification is correct
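A minimal sketch of this metric in PyTorch (shapes and names are illustrative):

```python
import torch

def top_k_accuracy(logits, labels, k=5):
    # indices of the k highest-scoring classes per image: (N, k)
    topk = logits.topk(k, dim=1).indices
    # correct if the true class appears anywhere in the top-k list
    hits = (topk == labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

logits = torch.randn(8, 1000)          # fake scores for 8 images
labels = torch.randint(0, 1000, (8,))  # fake ground-truth classes
print(top_k_accuracy(logits, labels, k=5))
```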

12 of 91

The progress of deep learning for classification

[Chart: top-5 error rate on ILSVRC over the years, for AlexNet [Krizhevsky et al., 2012], VGG [Simonyan et al., 2015], GoogLeNet [Szegedy et al., 2015], ResNet [He et al., 2016], and DenseNet [Huang et al., 2017]]

13 of 91

General formulation for all these variants

[Figure: image (pixels) x passes through a stack of parameterized layers to produce class scores, i.e., y = f_L(... f_2(f_1(x)))]

14 of 91

Deep neural networks (DNN)

Homework:

  • Dropout
  • Batch norm

15 of 91

Convolution

A special computation between layers

  • A node is not directly affected by “all nodes in the previous layer” (local connectivity)
  • The network “weights” on the edges can be “re-used” (weight sharing)

16 of 91

Convolution

[Figure: a binary feature map (nodes) at layer t; a 3-by-3 “filter” of weights slides over it, and each entry of the feature map at layer t+1 is the inner product (element-wise multiplication and sum) of the filter with one 3-by-3 window; the highlighted window gives 1]

17 of 91

Convolution

[Figure: the same feature map (nodes) at layer t and 3-by-3 “filter”, at another window position; the inner product there gives 6 in the feature map at layer t+1]

18 of 91

Convolution

Zero-padding: set the missing values to be 0

[Figure: the 3-by-3 “filter” centered at a border position of the feature map at layer t; the missing inputs are zero-padded, and the inner product gives 1]
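To make the sliding-window computation concrete, here is a minimal NumPy sketch of single-channel convolution with zero-padding (an illustration, not an optimized implementation; the filter values are made up):

```python
import numpy as np

def conv2d(fmap, filt, pad=1):
    fmap = np.pad(fmap, pad)            # zero-padding: missing values become 0
    fh, fw = filt.shape
    out = np.zeros((fmap.shape[0] - fh + 1, fmap.shape[1] - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # inner product: element-wise multiplication and sum
            out[i, j] = (fmap[i:i + fh, j:j + fw] * filt).sum()
    return out

fmap = np.array([[0, 0, 0, 1, 1, 1],
                 [0, 0, 0, 1, 1, 1],
                 [0, 0, 0, 1, 1, 1]], dtype=float)
filt = np.array([[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]], dtype=float)   # an edge-detecting filter (illustrative)
print(conv2d(fmap, filt))
```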

19 of 91

Convolution example

Example input (a vertical edge pattern):

0 0 0 1 1 1
0 0 0 1 1 1
0 0 0 1 1 1

20 of 91

Convolution

“Filter” weights (3-by-3) vs. “filter” weights (3-by-3-by-“2”)

[Figure: with a 2-channel feature map (nodes) at layer t, the filter spans all input channels; the inner product over a 3-by-3-by-2 window gives one entry of the feature map at layer t+1]

21 of 91

Convolution

“Filter” weights (3-by-3-by-“2”)

[Figure: a second 3-by-3-by-“2” filter, with weights including

1 1 1
0 0 0
1 1 1

produces a second output channel in the feature map at layer t+1]

One filter per output “channel”, each capturing a different “pattern” (e.g., edges, circles, eyes, etc.); see the sketch below.
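In PyTorch, this stacking of filters is exactly what nn.Conv2d does: one filter per output channel, each spanning all input channels (sizes below are illustrative):

```python
import torch
import torch.nn as nn

# 2 input channels -> 4 output channels, 3-by-3 filters, zero-padding 1
conv = nn.Conv2d(in_channels=2, out_channels=4, kernel_size=3, padding=1)
print(conv.weight.shape)      # torch.Size([4, 2, 3, 3]): 4 filters of size 3-by-3-by-"2"

x = torch.randn(1, 2, 6, 6)   # one 6x6 feature map with 2 channels
print(conv(x).shape)          # torch.Size([1, 4, 6, 6]): one output map per filter
```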

22 of 91

Convolution: properties

  • Process nearby pixels together
  • Translation invariant: “local patterns” can show up at different pixel locations
  • Can process arbitrary-size images

Top-left, top-right: has ears

Middle: has eyes

23 of 91

Convolutional neural networks (CNN)


Shared weights

Vectorization + FC layers

Max pooling + down-sampling

  • Remove redundancy
  • Translation-invariant
  • Enlarge receptive field (see the sketch below)
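A minimal sketch of such a network in PyTorch, combining convolution (shared weights), ReLU, max pooling + down-sampling, vectorization, and FC layers (all sizes illustrative):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),   # halves H and W; enlarges the receptive field
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),                            # vectorization
    nn.Linear(32 * 8 * 8, 10),               # FC layer for 10 classes (assumes 32x32 inputs)
)
print(net(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```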

24 of 91

Receptive field


Linear receptive field

Exponential receptive field

(with pooling + down-sampling)

25 of 91

Layers of feature maps (representations)


What does a large response at each layer/channel mean?

26 of 91

Representative CNN networks

  • AlexNet

[Krizhevsky et al., 2012]

  • VGGnet

[Simonyan et al., 2015]


  • A block: computation
  • Edge: nodes/tensors

27 of 91

Representative CNN networks

  • GoogLeNet [Szegedy et al., 2014]
  • Inception

28 of 91

Representative CNN networks

  • ResNet

[He et al, 2016]

  • DenseNet

[Huang et al, 2017]

  • A block: computation
  • Edge: nodes/tensors

Advantages:

  • Easier optimization (skip connections create shortcut paths)
  • Collect more information from earlier layers

29 of 91

Representative CNN networks

A general architecture involves

  • Multiple layers of convolutions + ReLU (nonlinearity) + pooling + striding
  • These result in a (final) feature map
    • Positions on the map correspond to the image
  • The map goes through FC layers (MLP)
  • Usually, we keep the network up to the feature map (see the sketch below)
    • For feature extraction
    • For down-stream tasks
    • For image-to-image search
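A common way to keep a Torchvision network up to its feature map, assuming a ResNet-18 backbone (illustrative; load pretrained weights in practice):

```python
import torch
import torchvision

resnet = torchvision.models.resnet18()
# drop the final average-pool + FC layers to keep the feature map
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

fmap = backbone(torch.randn(1, 3, 224, 224))
print(fmap.shape)   # torch.Size([1, 512, 7, 7]): positions correspond to image regions
```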


30 of 91

Training a CNN for classification

  • Given labeled training images (x_i, y_i), e.g., an image x_i with class label y_i = 100 ("elephant")
  • Minimize the empirical risk: $\min_\theta \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i; \theta),\, y_i\big)$
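A minimal sketch of empirical risk minimization with cross-entropy in PyTorch (the tiny model and fake batch stand in for a real CNN and dataset):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1000))  # stand-in for a CNN
loss_fn = nn.CrossEntropyLoss()                # softmax + negative log-likelihood
opt = torch.optim.SGD(model.parameters(), lr=0.1)

images = torch.randn(8, 3, 32, 32)             # fake mini-batch
labels = torch.randint(0, 1000, (8,))          # e.g., class 100 = "elephant"

for step in range(10):                         # minimize the empirical risk
    loss = loss_fn(model(images), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```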

31 of 91

Classification on scene-centric images

  • Properties:
    • Multiple object classes
    • Scale invariance
    • Occlusion (partially observable), cluttering, cropping

Car

Person

Bike

Tree

Elephant

Giraffe

Tree

32 of 91

The diversity of deep learning models


  • Visual transformers [Liu et al., 2021]
  • Graph neural networks [Battaglia et al., 2018]
  • PointNet [Qi et al., 2017]
  • Neural architecture search [Zoph et al., 2017]

33 of 91

The diversity of deep learning algorithms


  • Meta-learning [Finn et al., 2017]
  • Adversarial learning [Ganin et al., 2016]
  • Contrastive learning [He et al., 2020]

34 of 91

Today

  • Basic deep learning blocks for computer vision
    • Convolutional neural nets
    • Visual transformers
  • Applications: 2D recognition
    • 2D semantic segmentation
    • 2D object detection

35 of 91

Visual transformer

[Figure: image (pixels) passes through a stack of transformer layers to produce class scores, mirroring the CNN formulation]

36 of 91

CNN vs. Visual transformer

[Figure: a CNN stacks convolution layers; a visual transformer stacks transformer layers]

37 of 91

Visual transformer

(1) Split an image into patches

(2) Vectorize each of them + encode each with a shared MLP + “spatial” encoding

(3) Feed the patch tokens to a 1-layer Transformer Encoder

[Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021]
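A minimal sketch of steps (1) and (2), assuming 224x224 inputs, 16x16 patches, and a linear patch encoder with a learnable position embedding (sizes follow the ViT-Base convention):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)
P = 16                                           # patch size: "16x16 words"
# (1) split the image into non-overlapping P x P patches, (2) vectorize each
patches = img.unfold(2, P, P).unfold(3, P, P)    # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * P * P)

embed = nn.Linear(3 * P * P, 768)                # shared encoder for every patch
pos = nn.Parameter(torch.zeros(1, 14 * 14, 768)) # learnable "spatial" encoding
tokens = embed(patches) + pos
print(tokens.shape)                              # torch.Size([1, 196, 768])
```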

38 of 91

Position encoding

[https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers]
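A sketch of the sinusoidal encoding the linked post explains (note that ViT itself learns its position embeddings; this fixed variant comes from the original Transformer):

```python
import numpy as np

def sinusoidal_positions(n_tokens, dim):
    # each position gets a unique vector of sines/cosines at varying frequencies
    pos = np.arange(n_tokens)[:, None]            # (n_tokens, 1)
    i = np.arange(dim // 2)[None, :]              # (1, dim/2)
    angles = pos / np.power(10000, 2 * i / dim)
    enc = np.zeros((n_tokens, dim))
    enc[:, 0::2] = np.sin(angles)                 # even dims: sine
    enc[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return enc

print(sinusoidal_positions(196, 768).shape)       # (196, 768)
```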

39 of 91

1-layer of transformer encoder

K, Q, V: key, query, value (“learnable” matrices)

[Figure: each patch token is projected into query, key, and value vectors; the query of patch-5 is compared with all keys to get the relatedness of patch-5 to others (after softmax), and the output for patch-5 is the weighted sum of the value vectors]

Single-head case
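The whole single-head computation fits in a few lines (a sketch; real implementations batch this and project the outputs back to the token dimension):

```python
import torch
import torch.nn.functional as F

def single_head_attention(x, Wq, Wk, Wv):
    # x: (n_tokens, d); Wq, Wk, Wv: "learnable" (d, d_k) matrices
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # relatedness of each patch to all others (after softmax), scaled by sqrt(d_k)
    attn = F.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)   # (n, n)
    return attn @ V                                          # weighted value vectors

x = torch.randn(9, 64)                                       # 9 patch tokens
Wq, Wk, Wv = (torch.randn(64, 64) for _ in range(3))
print(single_head_attention(x, Wq, Wk, Wv).shape)            # torch.Size([9, 64])
```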

40 of 91

CNN vs. Visual transformer

[Figure: a CNN stacks convolution layers; a visual transformer stacks transformer layers]

41 of 91

Swin transformer


  • Consider smaller patches and local “transformer”
  • Produce feature maps of different resolutions, like CNNs

[Liu et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021]

42 of 91

ImageNet classification accuracy


[Liu et al., 2021]

43 of 91

Question: How to perform final classification?


44 of 91

Adding a classification token

[Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021]

Often called a [CLS] token, which is learnable
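A sketch of prepending a learnable [CLS] token and classifying from its output (PyTorch's built-in encoder layer stands in for the full ViT encoder):

```python
import torch
import torch.nn as nn

d = 768
cls = nn.Parameter(torch.zeros(1, 1, d))       # the learnable [CLS] token
patch_tokens = torch.randn(1, 196, d)          # from the patch-embedding step

tokens = torch.cat([cls, patch_tokens], dim=1) # (1, 197, d)
encoder = nn.TransformerEncoderLayer(d_model=d, nhead=12, batch_first=True)
head = nn.Linear(d, 1000)
logits = head(encoder(tokens)[:, 0])           # classify from the [CLS] output
print(logits.shape)                            # torch.Size([1, 1000])
```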

45 of 91

1-layer of transformer encoder

K, Q, V: key, query, value (“learnable” matrices)

[Figure: the single-head attention computation, revisited]

Single-head case

46 of 91

1-layer of transformer encoder

K, Q, V: key, query, value (“learnable” matrices)

[Figure: the multi-head case: several sets of K, Q, V matrices run the attention computation in parallel, and their outputs are combined]

Multi-head case

47 of 91

Multi-head attention

[Figure: self-attention maps from different heads]

[Mathilde Caron et al., Emerging Properties in Self-Supervised Vision Transformers, 2021] [DINO]

48 of 91

Short summary

A general architecture of CNN or visual transformers involves

  • Multiple layers of computations + nonlinearity + (pooling + striding)
  • These result in a (final) feature map
  • The map goes through FC layers (MLP)
  • Usually, we keep the network up to the feature map

49 of 91

Today

  • Basic deep learning blocks for computer vision
    • Convolutional neural nets
    • Visual transformers
  • Applications: 2D recognition
    • 2D semantic segmentation
    • 2D object detection

50 of 91

Representative 2D recognition tasks

  • “Same” input: images

  • “Different” outputs:
    • a) A C-dim class probability vector
    • b) A set of bounding boxes, each with box location and class probability
    • c) A W x H x C feature map
    • d) A combination of b) and c)

  • “Different” labeled training data

[Figure: outputs a)-d) illustrated on an image containing Dog, Cat, Horse, and Sheep, with a W x H spatial map for c)]

51 of 91

Object-centric vs. scene-centric images

MSCOCO [scene-centric]:

  • Instance-level label
  • 82 classes (~0.3M images)

ImageNet [object-centric]:

  • Image-level class label
  • 1K classes (~1M images)
  • 21K classes (~14M images)

52 of 91

Object-centric vs. scene-centric images

  • Object-centric images usually contain a single class of objects.
  • Object frequency and semantic cues in different kinds of images can be different!


53 of 91

Today

  • Basic deep learning blocks for computer vision
    • Convolutional neural nets
    • Visual transformers
  • Applications: 2D recognition
    • 2D semantic segmentation
    • 2D object detection

54 of 91

Semantic segmentation

  • Every “pixel” to have a class label

  • Properties:
  • High-resolution output
  • Context
  • Localization
  • Multi-scale


55 of 91

New architecture?


Single spatial output!

56 of 91

Fully-convolutional network (FCN)

[Figure: CNN → feature map → vector after vectorization → matrix multiplication (inner product with each row) → class scores for Dog, Cat, Boat, Bird]

57 of 91

Fully-convolutional network (FCN)

[Figure: the same network, but each row of the classifier matrix is used as a conv filter applied at every position of the feature map, giving per-location scores for Dog, Cat, Boat, Bird]
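A sketch of this conversion in PyTorch: the FC weights are reused as 1x1 conv filters, so the classifier runs at every map position (sizes illustrative):

```python
import torch
import torch.nn as nn

n_classes, d = 4, 512                  # Dog/Cat/Boat/Bird on a 512-dim feature map
fc = nn.Linear(d, n_classes)

# each row of the FC weight matrix becomes one 1x1 conv filter
conv = nn.Conv2d(d, n_classes, kernel_size=1)
conv.weight.data = fc.weight.data.view(n_classes, d, 1, 1)
conv.bias.data = fc.bias.data

fmap = torch.randn(1, d, 7, 7)
print(conv(fmap).shape)                # torch.Size([1, 4, 7, 7]): a score map per class
```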

58 of 91

Fully-convolutional network (FCN)


[Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015]

59 of 91

Up-sampling


Interpolation

Deconvolution
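Both options in PyTorch: fixed interpolation vs. a learnable deconvolution (transposed convolution); sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 4, 7, 7)            # coarse class-score map

# interpolation: fixed, no learnable parameters
up1 = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)

# deconvolution: learnable up-sampling
deconv = nn.ConvTranspose2d(4, 4, kernel_size=4, stride=4)
up2 = deconv(x)

print(up1.shape, up2.shape)            # both torch.Size([1, 4, 28, 28])
```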

60 of 91

Fully-convolutional network (FCN)

Skip connections from shallow, high-resolution layers help localization; deep layers help context + semantics

[Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015]

61 of 91

U-Net

Skip connections from the encoder help localization; the deep bottleneck helps context + semantics

[Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015]

62 of 91

U-Net (aka, Hourglass network)


63 of 91

Dilated (Atrous) convolution

Exponential receptive field: w/o down-sampling + up-sampling

w/ same # of parameters to learn
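In PyTorch, dilation is a single argument to nn.Conv2d; the 3x3 filter keeps its 9 parameters while its taps spread out (a sketch):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)
conv = nn.Conv2d(16, 16, kernel_size=3, padding=1, dilation=1)    # covers a 3x3 area
atrous = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)  # same 9 weights, 5x5 area
print(conv(x).shape, atrous(x).shape)  # both torch.Size([1, 16, 32, 32])
# stacking dilations 1, 2, 4, ... grows the receptive field exponentially
# without down-sampling + up-sampling
```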

64 of 91

CRF to improve localization

  • DeepLab


CRF: encourage similar and nearby pixels to take the same class label

[Chen et al., DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, PAMI 2017]

65 of 91

Atrous Spatial Pyramid Pooling (ASPP) for multi-scale features

66 of 91

Example results


[Nirkin et al., HyperSeg, 2021]

Ground truth

[Zhao et al., Pyramid scene parsing network, 2017]

67 of 91

Today

  • Basic deep learning blocks for computer vision
    • Convolutional neural nets
    • Visual transformers
  • Applications: 2D recognition
    • 2D semantic segmentation
    • 2D object detection

68 of 91

Object detection

  • Properties:
    • Labels + bounding boxes
    • Localization
    • Multi-scale
    • Context
    • “Undetermined” number of objects

[class, u-center, v-center, width, height]

69 of 91

Naïve way

  • Sliding window
    • Time-consuming
    • What window size?

ResNet classifier

70 of 91

R-CNN

  • Objectness proposal
  • CNN classifier
  • Box refinement


[Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014]

71 of 91

Selective search for proposal generation

  • Step 1:
    • Not deep learning
    • super-pixel-based segmentation

  • Step 2:
    • Recursively combine similar regions into larger ones

  • Step 3:
    • Box fitting

[Stanford CS 231b]

72 of 91

R-CNN


[Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014]

[Girshick, CVPR 2019 tutorial]

73 of 91

R-CNN

  • Box regression:
    • center offsets (du, dv)
    • size offsets (dw, dh)
    • predicted as offset = MLP(feature)

Proposal

Ground truth
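A sketch of the standard R-CNN box parameterization: the MLP is trained to predict these offsets, and inference inverts them (boxes here are (u-center, v-center, width, height); the numbers are made up):

```python
import torch

def encode(proposal, gt):
    pu, pv, pw, ph = proposal
    gu, gv, gw, gh = gt
    # center offsets are scaled by proposal size; sizes use log-space ratios
    return torch.stack([(gu - pu) / pw, (gv - pv) / ph,
                        torch.log(gw / pw), torch.log(gh / ph)])

def decode(proposal, offset):
    pu, pv, pw, ph = proposal
    du, dv, dw, dh = offset
    return torch.stack([pu + du * pw, pv + dv * ph,
                        pw * torch.exp(dw), ph * torch.exp(dh)])

p = torch.tensor([50., 50., 20., 40.])   # proposal
g = torch.tensor([55., 48., 24., 36.])   # ground truth
print(decode(p, encode(p, g)))           # recovers the ground-truth box
```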

74 of 91

R-CNN

  • Problems:
    • Slow: every proposal goes through a “full” CNN
    • Mis-detection: the proposal algorithm is not trained jointly with the rest

75 of 91

Fast R-CNN


ROI pooling

[Girshick, CVPR 2019 tutorial]

[Girshick, Fast R-CNN, ICCV 2015]

76 of 91

ROI pooling vs. ROI align


ROI Align

ROI Pooling

Making features extracted from different proposals the same size!
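Torchvision ships ROI Align directly; different-size proposals come out as same-size features (the image size and scale below are assumptions for the example):

```python
import torch
from torchvision.ops import roi_align

fmap = torch.randn(1, 256, 50, 50)            # backbone feature map
# proposals as (batch_index, x1, y1, x2, y2) in input-image coordinates
rois = torch.tensor([[0., 100., 80., 260., 200.],
                     [0.,  20.,  30.,  90., 120.]])
# assume a 400-pixel image mapped to a 50-pixel feature map: scale = 1/8
feats = roi_align(fmap, rois, output_size=(7, 7), spatial_scale=50 / 400)
print(feats.shape)                            # torch.Size([2, 256, 7, 7])
```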

77 of 91

Faster R-CNN


ROI pooling

[Girshick, CVPR 2019 tutorial]

[Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS 2015]

78 of 91

How to develop RPN (region proposal network)?

5 * 8 * K * (2 + 4): 5 * 8 feature-map locations, K anchors per location, and (2 objectness scores + 4 box offsets) per anchor

[Ren et al., 2015]

[Figure: a ground-truth box vs. anchor boxes]
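A simplified sketch of the RPN head (the real one applies a 3x3 conv first, then two sibling 1x1 heads, but the output bookkeeping is the same):

```python
import torch
import torch.nn as nn

K = 9                                   # anchors per feature-map location
fmap = torch.randn(1, 256, 5, 8)        # a 5 x 8 feature map, as in the slide
head = nn.Conv2d(256, K * (2 + 4), kernel_size=1)
out = head(fmap)
# 5 * 8 locations, each predicting K * (2 objectness scores + 4 box offsets)
print(out.shape)                        # torch.Size([1, 54, 5, 8])
```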

79 of 91

What do we learn from RPN?

  • “How to encode your labeled data so that your CNN can learn from them and predict them” is important!

  • Inference: predict these “values” and transform them back into bounding boxes!

80 of 91

Questions?

81 of 91

How to deal with object sizes?


[Lin et al., Feature Pyramid Networks for Object Detection, CVPR 2017]

82 of 91

Mask R-CNN


[Girshick, CVPR 2019 tutorial]

[He et al., Mask R-CNN, ICCV 2017]

83 of 91

Mask R-CNN: for instance segmentation


CNN: convolutional neural network

RPN: region proposal network

Bulldozer: 80%

Bus: 15%

Motorcycle: 5%

84 of 91

2-stage vs. 1-stage detectors

  • Other names: single-shot, single-pass, … (e.g., YOLO, SSD)
  • Difference: no ROI pooling/align


[Redmon et al., 2016]

2-stage detector

1-stage detector

85 of 91

Exemplar 1-stage detectors


[Liu et al., 2016]

SSD

YOLO

[Redmon et al., 2016]

86 of 91

Exemplar 1-stage detectors (RetinaNet)

[Lin et al., 2017]

87 of 91

2-stage vs. 1-stage detectors

  • Pros for 1-stage:
    • Faster!

  • Cons for 1-stage:
    • Too many negative (background) locations
    • Harder to handle object scale

[Redmon et al., 2016]

88 of 91

Inference: choose few from many

  • Non-Maximum Suppression (NMS)
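Torchvision provides NMS directly; overlapping lower-scoring boxes are suppressed (boxes are (x1, y1, x2, y2); values made up):

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],      # heavy overlap with box 0
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.8, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)                                      # tensor([0, 2])
```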


[Pictures from “towards data science” post]

89 of 91

Example results


[Zhang, et al., 2021]

90 of 91

Key names

  • Tsung-Yi Lin
  • Ross Girshick
  • Kaiming He
  • Piotr Dollar

91 of 91

Take home

  • Good tutorials online:
    • CVPR 2017-2022, ECCV 2018-2020, ICCV 2017-2021 [search tutorial or workshop]
    • ICML/NeurIPS/ICLR 2018-2021 [search tutorial or workshop]

  • Good framework:
    • PyTorch: Torchvision
    • PyTorch: Detectron2

  • Good source code:
    • Papers with code: https://paperswithcode.com/
