1 of 91

CSE 5539: Computer Vision

2 of 91

Today

  • Basic deep learning blocks for computer vision
    • Convolutional neural nets
    • Visual transformers
  • Applications: 2D recognition
    • 2D semantic segmentation
    • 2D object detection

3 of 91

Presentation

  • Roughly 60 mins
  • Cover 1-2 papers and some background
  • The survey (must use LaTeX) must include more papers (e.g., >10 papers)
  • (Recommended) Please come to the office hours before your presentation
  • Outline:
    • 10-min introduction
    • 40-min approaches
    • 10-min experiments

4 of 91

Presentation

5 of 91

Questions?

6 of 91

Today

  • Basic deep learning blocks for computer vision
    • Convolutional neural nets
    • Visual transformers
  • Applications: 2D recognition
    • 2D semantic segmentation
    • 2D object detection

7 of 91

Exemplar computer vision tasks

[C. Rieke, 2019]

8 of 91

Exemplar computer vision tasks

Retrieval, representation learning

Image generation

Vision and language

Neural Radiance Fields (NeRF)

9 of 91

Object-centric vs. scene-centric images


Object-centric images:

  • Contain a single class of objects
  • The object size is usually large
  • The background is simple

Scene-centric images:

  • Contain multiple classes of objects
  • The object sizes can vary
  • The background is cluttered
  • Objects may be occluded

ImageNet [image-level labels]:

  • 1K classes (~1M images)
  • 21K classes (~14M images)

MSCOCO [instance-level labels]:

  • 82 classes (~0.3M images)

10 of 91

Classification on object-centric images

  • Single object class (not multi-label cases)

  • Properties to capture:
    • Translation invariance
    • Flip invariance (e.g., left-right flip)
    • Scale invariance

car

elephant

11 of 91

The progress of deep learning for classification


ImageNet-1K (ILSVRC)

  • 1,000 object classes
  • 1,000 training images/class
  • Each image contains just one class of object!

Metric: Top-k accuracy

  • For each image, return a list of top-k possible classes
  • If the true class is within the list, the classification is correct
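A minimal sketch of this metric in PyTorch (shapes and names are illustrative):

```python
import torch

def top_k_accuracy(logits, labels, k=5):
    # indices of the k highest-scoring classes per image: (N, k)
    topk = logits.topk(k, dim=1).indices
    # correct if the true class appears anywhere in the top-k list
    hits = (topk == labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

logits = torch.randn(8, 1000)          # fake scores for 8 images
labels = torch.randint(0, 1000, (8,))  # fake ground-truth classes
print(top_k_accuracy(logits, labels, k=5))
```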

12 of 91

The progress of deep learning for classification

[Chart: top-5 error rate on ILSVRC over the years, for AlexNet [Krizhevsky et al., 2012], VGG [Simonyan et al., 2015], GoogLeNet [Szegedy et al., 2015], ResNet [He et al., 2016], and DenseNet [Huang et al., 2017]]

13 of 91

General formulation for all these variants

[Figure: image (pixels) x passes through a stack of parameterized layers to produce class scores, i.e., y = f_L(... f_2(f_1(x)))]

14 of 91

Deep neural networks (DNN)

Homework:

  • Dropout
  • Batch norm

15 of 91

Convolution

A special computation between layers

  • A node is not directly affected by “all nodes in the previous layer” (local connectivity)
  • The network “weights” on the edges can be “re-used” (weight sharing)

16 of 91

Convolution

[Figure: a binary feature map (nodes) at layer t; a 3-by-3 “filter” of weights slides over it, and each entry of the feature map at layer t+1 is the inner product (element-wise multiplication and sum) of the filter with one 3-by-3 window; the highlighted window gives 1]

17 of 91

Convolution

[Figure: the same feature map (nodes) at layer t and 3-by-3 “filter”, at another window position; the inner product there gives 6 in the feature map at layer t+1]

18 of 91

Convolution

Zero-padding: set the missing values to be 0

[Figure: the 3-by-3 “filter” centered at a border position of the feature map at layer t; the missing inputs are zero-padded, and the inner product gives 1]
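To make the sliding-window computation concrete, here is a minimal NumPy sketch of single-channel convolution with zero-padding (an illustration, not an optimized implementation; the filter values are made up):

```python
import numpy as np

def conv2d(fmap, filt, pad=1):
    fmap = np.pad(fmap, pad)            # zero-padding: missing values become 0
    fh, fw = filt.shape
    out = np.zeros((fmap.shape[0] - fh + 1, fmap.shape[1] - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # inner product: element-wise multiplication and sum
            out[i, j] = (fmap[i:i + fh, j:j + fw] * filt).sum()
    return out

fmap = np.array([[0, 0, 0, 1, 1, 1],
                 [0, 0, 0, 1, 1, 1],
                 [0, 0, 0, 1, 1, 1]], dtype=float)
filt = np.array([[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]], dtype=float)   # an edge-detecting filter (illustrative)
print(conv2d(fmap, filt))
```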

19 of 91

Convolution example

Example input (a vertical edge pattern):

0 0 0 1 1 1
0 0 0 1 1 1
0 0 0 1 1 1

20 of 91

Convolution

“Filter” weights (3-by-3) vs. “filter” weights (3-by-3-by-“2”)

[Figure: with a 2-channel feature map (nodes) at layer t, the filter spans all input channels; the inner product over a 3-by-3-by-2 window gives one entry of the feature map at layer t+1]

21 of 91

Convolution

“Filter” weights (3-by-3-by-“2”)

[Figure: a second 3-by-3-by-“2” filter, with weights including

1 1 1
0 0 0
1 1 1

produces a second output channel in the feature map at layer t+1]

One filter per output “channel”, each capturing a different “pattern” (e.g., edges, circles, eyes, etc.); see the sketch below.
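In PyTorch, this stacking of filters is exactly what nn.Conv2d does: one filter per output channel, each spanning all input channels (sizes below are illustrative):

```python
import torch
import torch.nn as nn

# 2 input channels -> 4 output channels, 3-by-3 filters, zero-padding 1
conv = nn.Conv2d(in_channels=2, out_channels=4, kernel_size=3, padding=1)
print(conv.weight.shape)      # torch.Size([4, 2, 3, 3]): 4 filters of size 3-by-3-by-"2"

x = torch.randn(1, 2, 6, 6)   # one 6x6 feature map with 2 channels
print(conv(x).shape)          # torch.Size([1, 4, 6, 6]): one output map per filter
```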

22 of 91

Convolution: properties

  • Process nearby pixels together
  • Translation invariant: “local patterns” can show up at different pixel locations
  • Can process arbitrary-size images

Top-left, top-right: has ears

Middle: has eyes

23 of 91

Convolutional neural networks (CNN)


Shared weights

Vectorization + FC layers

Max pooling + down-sampling

  • Remove redundancy
  • Translation-invariant
  • Enlarge receptive field (see the sketch below)
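A minimal sketch of such a network in PyTorch, combining convolution (shared weights), ReLU, max pooling + down-sampling, vectorization, and FC layers (all sizes illustrative):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),   # halves H and W; enlarges the receptive field
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),                            # vectorization
    nn.Linear(32 * 8 * 8, 10),               # FC layer for 10 classes (assumes 32x32 inputs)
)
print(net(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```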

24 of 91

Receptive field


Linear receptive field

Exponential receptive field

(with pooling + down-sampling)

25 of 91

Layers of feature maps (representations)


What does a large response at each layer/channel mean?

26 of 91

Representative CNN networks

  • AlexNet

[Krizhevsky et al., 2012]

  • VGGnet

[Simonyan et al., 2015]


  • A block: computation
  • Edge: nodes/tensors

27 of 91

Representative CNN networks

  • GoogLeNet [Szegedy et al., 2014]
  • Inception

28 of 91

Representative CNN networks

  • ResNet

[He et al, 2016]

  • DenseNet

[Huang et al, 2017]

  • A block: computation
  • Edge: nodes/tensors

Advantages:

  • Easier optimization (skip connections create shortcut paths)
  • Collect more information from earlier layers

29 of 91

Representative CNN networks

A general architecture involves

  • Multiple layers of convolutions + ReLU (nonlinearity) + pooling + striding
  • These result in a (final) feature map
    • Positions on the map correspond to the image
  • The map goes through FC layers (MLP)
  • Usually, we keep the network up to the feature map (see the sketch below)
    • For feature extraction
    • For down-stream tasks
    • For image-to-image search
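A common way to keep a Torchvision network up to its feature map, assuming a ResNet-18 backbone (illustrative; load pretrained weights in practice):

```python
import torch
import torchvision

resnet = torchvision.models.resnet18()
# drop the final average-pool + FC layers to keep the feature map
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

fmap = backbone(torch.randn(1, 3, 224, 224))
print(fmap.shape)   # torch.Size([1, 512, 7, 7]): positions correspond to image regions
```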


30 of 91

Training a CNN for classification

  • Given labeled training images (x_i, y_i), e.g., an image x_i with class label y_i = 100 ("elephant")
  • Minimize the empirical risk: $\min_\theta \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i; \theta),\, y_i\big)$
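A minimal sketch of empirical risk minimization with cross-entropy in PyTorch (the tiny model and fake batch stand in for a real CNN and dataset):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1000))  # stand-in for a CNN
loss_fn = nn.CrossEntropyLoss()                # softmax + negative log-likelihood
opt = torch.optim.SGD(model.parameters(), lr=0.1)

images = torch.randn(8, 3, 32, 32)             # fake mini-batch
labels = torch.randint(0, 1000, (8,))          # e.g., class 100 = "elephant"

for step in range(10):                         # minimize the empirical risk
    loss = loss_fn(model(images), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```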

31 of 91

Classification on scene-centric images

  • Properties:
    • Multiple object classes
    • Scale invariance
    • Occlusion (partially observable), cluttering, cropping

Car

Person

Bike

Tree

Elephant

Giraffe

Tree

32 of 91

The diversity of deep learning models


  • Visual transformers [Liu et al., 2021]
  • Graph neural networks [Battaglia et al., 2018]
  • PointNet [Qi et al., 2017]
  • Neural architecture search [Zoph et al., 2017]

33 of 91

The diversity of deep learning algorithms


  • Meta-learning [Finn et al., 2017]
  • Adversarial learning [Ganin et al., 2016]
  • Contrastive learning [He et al., 2020]

34 of 91

Today

  • Basic deep learning blocks for computer vision
    • Convolutional neural nets
    • Visual transformers
  • Applications: 2D recognition
    • 2D semantic segmentation
    • 2D object detection

35 of 91

Visual transformer

[Figure: image (pixels) passes through a stack of transformer layers to produce class scores, mirroring the CNN formulation]

36 of 91

CNN vs. Visual transformer

[Figure: a CNN stacks convolution layers; a visual transformer stacks transformer layers]

37 of 91

Visual transformer

(1) Split an image into patches

(2) Vectorize each of them + encode each with a shared MLP + “spatial” encoding

(3) Feed the patch tokens to a 1-layer Transformer Encoder

[Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021]
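A minimal sketch of steps (1) and (2), assuming 224x224 inputs, 16x16 patches, and a linear patch encoder with a learnable position embedding (sizes follow the ViT-Base convention):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)
P = 16                                           # patch size: "16x16 words"
# (1) split the image into non-overlapping P x P patches, (2) vectorize each
patches = img.unfold(2, P, P).unfold(3, P, P)    # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * P * P)

embed = nn.Linear(3 * P * P, 768)                # shared encoder for every patch
pos = nn.Parameter(torch.zeros(1, 14 * 14, 768)) # learnable "spatial" encoding
tokens = embed(patches) + pos
print(tokens.shape)                              # torch.Size([1, 196, 768])
```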

38 of 91

Position encoding

[https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers]
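A sketch of the sinusoidal encoding the linked post explains (note that ViT itself learns its position embeddings; this fixed variant comes from the original Transformer):

```python
import numpy as np

def sinusoidal_positions(n_tokens, dim):
    # each position gets a unique vector of sines/cosines at varying frequencies
    pos = np.arange(n_tokens)[:, None]            # (n_tokens, 1)
    i = np.arange(dim // 2)[None, :]              # (1, dim/2)
    angles = pos / np.power(10000, 2 * i / dim)
    enc = np.zeros((n_tokens, dim))
    enc[:, 0::2] = np.sin(angles)                 # even dims: sine
    enc[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return enc

print(sinusoidal_positions(196, 768).shape)       # (196, 768)
```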

39 of 91

1-layer of transformer encoder

K, Q, V: key, query, value (“learnable” matrices)

[Figure: each patch token is projected into query, key, and value vectors; the query of patch-5 is compared with all keys to get the relatedness of patch-5 to others (after softmax), and the output for patch-5 is the weighted sum of the value vectors]

Single-head case
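The whole single-head computation fits in a few lines (a sketch; real implementations batch this and project the outputs back to the token dimension):

```python
import torch
import torch.nn.functional as F

def single_head_attention(x, Wq, Wk, Wv):
    # x: (n_tokens, d); Wq, Wk, Wv: "learnable" (d, d_k) matrices
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # relatedness of each patch to all others (after softmax), scaled by sqrt(d_k)
    attn = F.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)   # (n, n)
    return attn @ V                                          # weighted value vectors

x = torch.randn(9, 64)                                       # 9 patch tokens
Wq, Wk, Wv = (torch.randn(64, 64) for _ in range(3))
print(single_head_attention(x, Wq, Wk, Wv).shape)            # torch.Size([9, 64])
```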

40 of 91

CNN vs. Visual transformer

[Figure: a CNN stacks convolution layers; a visual transformer stacks transformer layers]

41 of 91

Swin transformer


  • Consider smaller patches and local “transformer”
  • Produce feature maps of different resolutions, like CNNs

[Liu et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021]

42 of 91

ImageNet classification accuracy


[Liu et al., 2021]

43 of 91

Question: How to perform final classification?


44 of 91

Adding a classification token

[Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021]

Often called a [CLS] token, which is learnable
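A sketch of prepending a learnable [CLS] token and classifying from its output (PyTorch's built-in encoder layer stands in for the full ViT encoder):

```python
import torch
import torch.nn as nn

d = 768
cls = nn.Parameter(torch.zeros(1, 1, d))       # the learnable [CLS] token
patch_tokens = torch.randn(1, 196, d)          # from the patch-embedding step

tokens = torch.cat([cls, patch_tokens], dim=1) # (1, 197, d)
encoder = nn.TransformerEncoderLayer(d_model=d, nhead=12, batch_first=True)
head = nn.Linear(d, 1000)
logits = head(encoder(tokens)[:, 0])           # classify from the [CLS] output
print(logits.shape)                            # torch.Size([1, 1000])
```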

45 of 91

1-layer of transformer encoder

K, Q, V: key, query, value (“learnable” matrices)

[Figure: the single-head attention computation, revisited]

Single-head case

46 of 91

1-layer of transformer encoder

K, Q, V: key, query, value (“learnable” matrices)

[Figure: the multi-head case: several sets of K, Q, V matrices run the attention computation in parallel, and their outputs are combined]

Multi-head case

47 of 91

Multi-head attention

[Figure: self-attention maps from different heads]

[Mathilde Caron et al., Emerging Properties in Self-Supervised Vision Transformers, 2021] [DINO]

48 of 91

Short summary

A general architecture of CNN or visual transformers involves

  • Multiple layers of computations + nonlinearity + (pooling + striding)
  • These result in a (final) feature map
  • The map goes through FC layers (MLP)
  • Usually, we keep the network up to the feature map

49 of 91

Today

  • Basic deep learning blocks for computer vision
    • Convolutional neural nets
    • Visual transformers
  • Applications: 2D recognition
    • 2D semantic segmentation
    • 2D object detection

50 of 91

Representative 2D recognition tasks

  • “Same” input: images

  • “Different” outputs:
    • a) A C-dim class probability vector
    • b) A set of bounding boxes, each with box location and class probability
    • c) A W x H x C feature map
    • d) A combination of b) and c)

  • “Different” labeled training data

[Figure: outputs a)-d) illustrated on an image containing Dog, Cat, Horse, and Sheep, with a W x H spatial map for c)]

51 of 91

Object-centric vs. scene-centric images

MSCOCO [scene-centric]:

  • Instance-level label
  • 82 classes (~0.3M images)

ImageNet [object-centric]:

  • Image-level class label
  • 1K classes (~1M images)
  • 21K classes (~14M images)

52 of 91

Object-centric vs. scene-centric images

  • Object-centric images usually contain a single class of objects.
  • Object frequency and semantic cues in different kinds of images can be different!


53 of 91

Today

  • Basic deep learning blocks for computer vision
    • Convolutional neural nets
    • Visual transformers
  • Applications: 2D recognition
    • 2D semantic segmentation
    • 2D object detection

54 of 91

Semantic segmentation

  • Every “pixel” to have a class label

  • Properties:
  • High-resolution output
  • Context
  • Localization
  • Multi-scale


55 of 91

New architecture?


Single spatial output!

56 of 91

Fully-convolutional network (FCN)

[Figure: CNN → feature map → vector after vectorization → matrix multiplication (inner product with each row) → class scores for Dog, Cat, Boat, Bird]

57 of 91

Fully-convolutional network (FCN)

[Figure: the same network, but each row of the classifier matrix is used as a conv filter applied at every position of the feature map, giving per-location scores for Dog, Cat, Boat, Bird]
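A sketch of this conversion in PyTorch: the FC weights are reused as 1x1 conv filters, so the classifier runs at every map position (sizes illustrative):

```python
import torch
import torch.nn as nn

n_classes, d = 4, 512                  # Dog/Cat/Boat/Bird on a 512-dim feature map
fc = nn.Linear(d, n_classes)

# each row of the FC weight matrix becomes one 1x1 conv filter
conv = nn.Conv2d(d, n_classes, kernel_size=1)
conv.weight.data = fc.weight.data.view(n_classes, d, 1, 1)
conv.bias.data = fc.bias.data

fmap = torch.randn(1, d, 7, 7)
print(conv(fmap).shape)                # torch.Size([1, 4, 7, 7]): a score map per class
```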

58 of 91

Fully-convolutional network (FCN)


[Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015]

59 of 91

Up-sampling


Interpolation

Deconvolution
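Both options in PyTorch: fixed interpolation vs. a learnable deconvolution (transposed convolution); sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 4, 7, 7)            # coarse class-score map

# interpolation: fixed, no learnable parameters
up1 = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)

# deconvolution: learnable up-sampling
deconv = nn.ConvTranspose2d(4, 4, kernel_size=4, stride=4)
up2 = deconv(x)

print(up1.shape, up2.shape)            # both torch.Size([1, 4, 28, 28])
```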

60 of 91

Fully-convolutional network (FCN)

Skip connections from shallow, high-resolution layers help localization; deep layers help context + semantics

[Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015]

61 of 91

U-Net

Skip connections from the encoder help localization; the deep bottleneck helps context + semantics

[Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015]

62 of 91

U-Net (aka, Hourglass network)


63 of 91

Dilated (Atrous) convolution

Exponential receptive field: w/o down-sampling + up-sampling

w/ same # of parameters to learn
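In PyTorch, dilation is a single argument to nn.Conv2d; the 3x3 filter keeps its 9 parameters while its taps spread out (a sketch):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)
conv = nn.Conv2d(16, 16, kernel_size=3, padding=1, dilation=1)    # covers a 3x3 area
atrous = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)  # same 9 weights, 5x5 area
print(conv(x).shape, atrous(x).shape)  # both torch.Size([1, 16, 32, 32])
# stacking dilations 1, 2, 4, ... grows the receptive field exponentially
# without down-sampling + up-sampling
```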

64 of 91

CRF to improve localization

  • DeepLab


CRF: encourage similar and nearby pixels to take the same class label

[Chen et al., DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, PAMI 2017]

65 of 91

Atrous Spatial Pyramid Pooling (ASPP) for multi-scale features

66 of 91

Example results


[Nirkin et al., HyperSeg, 2021]

Ground truth

[Zhao et al., Pyramid scene parsing network, 2017]

67 of 91

Today

  • Basic deep learning blocks for computer vision
    • Convolutional neural nets
    • Visual transformers
  • Applications: 2D recognition
    • 2D semantic segmentation
    • 2D object detection

68 of 91

Object detection

  • Properties:
    • Labels + bounding boxes
    • Localization
    • Multi-scale
    • Context
    • “Undetermined” number of objects

[class, u-center, v-center, width, height]

69 of 91

Naïve way

  • Sliding window
    • Time-consuming
    • What window size?

ResNet classifier

70 of 91

R-CNN

  • Objectness proposal
  • CNN classifier
  • Box refinement


[Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014]

71 of 91

Selective search for proposal generation

  • Step 1:
    • Not deep learning
    • super-pixel-based segmentation

  • Step 2:
    • Recursively combine similar regions into larger ones

  • Step 3:
    • Box fitting

[Stanford CS 231b]

72 of 91

R-CNN


[Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014]

[Girshick, CVPR 2019 tutorial]

73 of 91

R-CNN

  • Box regression:
    • center offsets (du, dv)
    • size offsets (dw, dh)
    • predicted as offset = MLP(feature)

Proposal

Ground truth
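A sketch of the standard R-CNN box parameterization: the MLP is trained to predict these offsets, and inference inverts them (boxes here are (u-center, v-center, width, height); the numbers are made up):

```python
import torch

def encode(proposal, gt):
    pu, pv, pw, ph = proposal
    gu, gv, gw, gh = gt
    # center offsets are scaled by proposal size; sizes use log-space ratios
    return torch.stack([(gu - pu) / pw, (gv - pv) / ph,
                        torch.log(gw / pw), torch.log(gh / ph)])

def decode(proposal, offset):
    pu, pv, pw, ph = proposal
    du, dv, dw, dh = offset
    return torch.stack([pu + du * pw, pv + dv * ph,
                        pw * torch.exp(dw), ph * torch.exp(dh)])

p = torch.tensor([50., 50., 20., 40.])   # proposal
g = torch.tensor([55., 48., 24., 36.])   # ground truth
print(decode(p, encode(p, g)))           # recovers the ground-truth box
```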

74 of 91

R-CNN

  • Problems:
    • Slow: every proposal goes through a “full” CNN
    • Mis-detection: the proposal algorithm is not trained jointly with the rest

75 of 91

Fast R-CNN


ROI pooling

[Girshick, CVPR 2019 tutorial]

[Girshick, Fast R-CNN, ICCV 2015]

76 of 91

ROI pooling vs. ROI align


ROI Align

ROI Pooling

Making features extracted from different proposals the same size!
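Torchvision ships ROI Align directly; different-size proposals come out as same-size features (the image size and scale below are assumptions for the example):

```python
import torch
from torchvision.ops import roi_align

fmap = torch.randn(1, 256, 50, 50)            # backbone feature map
# proposals as (batch_index, x1, y1, x2, y2) in input-image coordinates
rois = torch.tensor([[0., 100., 80., 260., 200.],
                     [0.,  20.,  30.,  90., 120.]])
# assume a 400-pixel image mapped to a 50-pixel feature map: scale = 1/8
feats = roi_align(fmap, rois, output_size=(7, 7), spatial_scale=50 / 400)
print(feats.shape)                            # torch.Size([2, 256, 7, 7])
```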

77 of 91

Faster R-CNN


ROI pooling

[Girshick, CVPR 2019 tutorial]

[Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS 2015]

78 of 91

How to develop RPN (region proposal network)?

5 * 8 * K * (2 + 4): 5 * 8 feature-map locations, K anchors per location, and (2 objectness scores + 4 box offsets) per anchor

[Ren et al., 2015]

[Figure: a ground-truth box vs. anchor boxes]
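A simplified sketch of the RPN head (the real one applies a 3x3 conv first, then two sibling 1x1 heads, but the output bookkeeping is the same):

```python
import torch
import torch.nn as nn

K = 9                                   # anchors per feature-map location
fmap = torch.randn(1, 256, 5, 8)        # a 5 x 8 feature map, as in the slide
head = nn.Conv2d(256, K * (2 + 4), kernel_size=1)
out = head(fmap)
# 5 * 8 locations, each predicting K * (2 objectness scores + 4 box offsets)
print(out.shape)                        # torch.Size([1, 54, 5, 8])
```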

79 of 91

What do we learn from RPN?

  • “How to encode your labeled data so that your CNN can learn from them and predict them” is important!

  • Inference: predict these “values” and transform them back into bounding boxes!

80 of 91

Questions?

81 of 91

How to deal with object sizes?


[Lin et al., Feature Pyramid Networks for Object Detection, CVPR 2017]

82 of 91

Mask R-CNN


[Girshick, CVPR 2019 tutorial]

[He et al., Mask R-CNN, ICCV 2017]

83 of 91

Mask R-CNN: for instance segmentation


CNN: convolutional neural network

RPN: region proposal network

Bulldozer: 80%

Bus: 15%

Motorcycle: 5%

84 of 91

2-stage vs. 1-stage detectors

  • Other names: single-shot, single-pass, … (e.g., YOLO, SSD)
  • Difference: no ROI pooling/align


[Redmon et al., 2016]

2-stage detector

1-stage detector

85 of 91

Exemplar 1-stage detectors


[Liu et al., 2016]

SSD

YOLO

[Redmon et al., 2016]

86 of 91

Exemplar 1-stage detectors (RetinaNet)

[Lin et al., 2017]

87 of 91

2-stage vs. 1-stage detectors

  • Pros for 1-stage:
    • Faster!

  • Cons for 1-stage:
    • Too many negative (background) locations
    • Harder to handle object scale

[Redmon et al., 2016]

88 of 91

Inference: choose few from many

  • Non-Maximum Suppression (NMS)
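Torchvision provides NMS directly; overlapping lower-scoring boxes are suppressed (boxes are (x1, y1, x2, y2); values made up):

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],      # heavy overlap with box 0
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.8, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)                                      # tensor([0, 2])
```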


[Pictures from “towards data science” post]

89 of 91

Example results


[Zhang, et al., 2021]

90 of 91

Key names

  • Tsung-Yi Lin
  • Ross Girshick
  • Kaiming He
  • Piotr Dollar

91 of 91

Take home

  • Good tutorials online:
    • CVPR 2017-2022, ECCV 2018-2020, ICCV 2017-2021 [search tutorial or workshop]
    • ICML/NeurIPS/ICLR 2018-2021 [search tutorial or workshop]

  • Good framework:
    • PyTorch: Torchvision
    • PyTorch: Detectron2

  • Good source code:
    • Papers with code: https://paperswithcode.com/
