Introduction to convnets
CS 20: TensorFlow for Deep Learning Research
Lecture 6
1/31/2017
Agenda
Computer Vision
Convolutional Neural Networks
Convolution
Pooling
Feature Visualization
Slides adapted from Justin Johnson
Used with permission.
Convolutional Neural Networks:
Deep Learning with Images
Computer Vision - A bit of history
https://dspace.mit.edu/bitstream/handle/1721.1/6125/AIM-100.pdf
https://xkcd.com/1425/
Object Segmentation
Figure credit: Dai, He, and Sun, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, CVPR 2016
Pose Estimation
Figure credit: Cao et al, “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”, arXiv 2016
Image Captioning
Figure credit: Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015
Dense Image Captioning
Figure credit: Johnson*, Karpathy*, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016
Visual Question Answering
Figure credit: Agrawal et al, “VQA: Visual Question Answering”, ICCV 2015 (left)
Zhu et al, “Visual7W: Grounded Question Answering in Images”, CVPR 2016 (right)
Image Super-resolution
Figure credit: Ledig et al, “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network”, arXiv 2016
Art generation
Gatys, Ecker, and Bethge, “Image Style Transfer using Convolutional Neural Networks”, CVPR 2016 (left)
Mordvintsev, Olah, and Tyka, “Inceptionism: Going Deeper into Neural Networks” (upper right)
Johnson, Alahi, and Fei-Fei: “Perceptual Losses for Real-Time Style Transfer and Super-Resolution”, ECCV 2016 (bottom left)
Convolutional Neural Networks
Recall: fully connected neural network
[Diagram] x (C1) -> Matrix Multiply by w1 (C1×C2) -> s (C2) -> Nonlinearity -> a (C2) -> Matrix Multiply by w2 (C2×C3) -> ŷ (C3)
Recall: fully connected neural network
[Diagram] input: 32x32x3 image -> stretch to 3072 x 1 column; weights: 10 x 3072; activation: 10 x 1
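In code, the "stretch" is just a reshape followed by a matrix multiply. A minimal numpy sketch (random values stand in for a real image and learned weights):

import numpy as np

image = np.random.rand(32, 32, 3)   # 32x32x3 input image
x = image.reshape(3072, 1)          # stretch to a 3072 x 1 column
W = np.random.rand(10, 3072)        # weight matrix
activation = W.dot(x)               # 10 x 1 activations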
Convolutional Neural Network
[Diagram] x (C1×H×W) -> Convolution with w1 (C2×C1×k×k) -> s (C2×H×W) -> Nonlinearity -> a (C2×H×W) -> Pooling -> p (C2×H/2×W/2) -> Fully Connected with w2 ((C2·H·W/4)×C3) -> ŷ (C3)
Convolution
Image courtesy Apple
Convolving “filters” is not a new idea
Sobel operator:
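For instance, the Sobel kernel for horizontal gradients can be applied with an off-the-shelf 2D convolution. A minimal sketch using scipy (the random image is a stand-in):

import numpy as np
from scipy.signal import convolve2d

# Sobel kernel: responds to horizontal intensity changes (vertical edges)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

image = np.random.rand(64, 64)                   # stand-in grayscale image
edges = convolve2d(image, sobel_x, mode='same')  # output same size as input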
Convolution Layer
[Diagram] 32x32x3 image (width 32, height 32, depth 3)
Slide credit: CS231n Lecture 7
Convolution Layer
32x32x3 image
Slide credit: CS231n Lecture 7
5x5x3 filter
Convolve the filter with the image
i.e. “slide over the image spatially, computing dot products”
Filters always extend the full depth of the input volume
Convolution Layer
Slide credit: CS231n Lecture 7
32x32x3 image
5x5x3 filter
1 number:
the result of taking a dot product between the filter and a small 5x5x3 chunk of the image
(i.e. 5*5*3 = 75-dimensional dot product + bias)
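Concretely, each output value is one dot product plus a bias. A minimal numpy sketch (random values stand in for a real image and filter):

import numpy as np

image = np.random.rand(32, 32, 3)    # input volume
filt = np.random.rand(5, 5, 3)       # one filter, spanning the full depth
bias = 0.1

patch = image[0:5, 0:5, :]           # one 5x5x3 chunk of the image
value = np.sum(patch * filt) + bias  # 5*5*3 = 75-dimensional dot product + bias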
Convolution Layer
Slide credit: CS231n Lecture 7
32x32x3 image
5x5x3 filter
convolve (slide) over all spatial locations
one activation map (28 x 28 x 1)
[Diagram: Input, Filter, Output, Padding]
Convolution Layer
Slide credit: CS231n Lecture 7
32x32x3 image
5x5x3 filter
convolve (slide) over all spatial locations
consider a second, green filter
activation maps (28 x 28 x 2)
Convolution Layer
For example, if we had 6 5x5 filters, we'd get 6 separate activation maps (each 28 x 28).
We stack these up to get a new "image" of size 28x28x6!
Slide credit: CS231n Lecture 7
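A naive (slow but explicit) numpy sketch of this stacking, assuming the 32x32x3 input and six random 5x5x3 filters from the example:

import numpy as np

image = np.random.rand(32, 32, 3)
filters = np.random.rand(6, 5, 5, 3)   # six 5x5x3 filters

out = np.zeros((28, 28, 6))            # 32 - 5 + 1 = 28
for f in range(6):                     # one activation map per filter
    for i in range(28):
        for j in range(28):
            patch = image[i:i+5, j:j+5, :]
            out[i, j, f] = np.sum(patch * filters[f])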
ConvNet is a sequence of Convolution Layers, interspersed with activation functions
32x32x3 -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (e.g. 10 5x5x6 filters) -> 24x24x10 -> CONV, ReLU -> ….
Slide credit: CS231n Lecture 7
Two key insights
1. Composing high-complexity features out of low-complexity features is more efficient than learning high-complexity features directly.
e.g.: having a "circle" detector is useful for detecting faces… and basketballs
2. If a feature is useful to compute at (x, y), it is useful to compute at (x', y') as well, so filter weights can be shared across positions (see the parameter count below).
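A back-of-the-envelope parameter count for the running 32x32x3 -> 28x28x6 example shows what weight sharing buys:

# Fully connected map from 32x32x3 inputs to 28x28x6 outputs:
fc_params = (32 * 32 * 3) * (28 * 28 * 6)   # 14,450,688 weights

# Convolutional map with six shared 5x5x3 filters (plus biases):
conv_params = 6 * (5 * 5 * 3 + 1)           # 456 parameters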
example 5x5 filters
(32 total)
We call the layer convolutional because it is related to the convolution of two signals: element-wise multiplication and sum of a filter and the signal (image).
one filter => one activation map
Figure copyright Andrej Karpathy.
A closer look at spatial dimensions:
[Diagram: 7x7 input grid; the 3x3 filter slides across one position at a time]
7x7 input (spatially)
assume 3x3 filter
=> 5x5 output
7x7 input (spatially)
assume 3x3 filter
applied with stride 2
=> 3x3 output!
7x7 input (spatially)
assume 3x3 filter
applied with stride 3?
doesn't fit!
cannot apply 3x3 filter on 7x7 input with stride 3.
[Diagram: N x N input, F x F filter]
Output size:
(N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
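The formula is easy to encode directly; a small helper (names are illustrative) that also flags the stride-3 case:

def conv_output_size(N, F, stride):
    """Spatial output size: (N - F) / stride + 1, no padding."""
    if (N - F) % stride != 0:
        raise ValueError("filter doesn't fit: adjust stride or padding")
    return (N - F) // stride + 1

conv_output_size(7, 3, 1)   # 5
conv_output_size(7, 3, 2)   # 3
conv_output_size(7, 3, 3)   # raises ValueError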
In practice: Common to zero pad the border
[Diagram: 7x7 input zero-padded with a 1-pixel border, giving 9x9]
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
(recall: (N - F) / stride + 1, now with N = 9)
7x7 output!
in general, common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F-1)/2 pixels. (will preserve size spatially)
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
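In TensorFlow, padding='SAME' applies this kind of zero padding for you. A minimal TF 1.x-style sketch with random filter weights:

import tensorflow as tf

x = tf.placeholder(tf.float32, [1, 7, 7, 1])     # one 7x7 single-channel image
w = tf.Variable(tf.random_normal([3, 3, 1, 1]))  # one 3x3 filter
y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
print(y.shape)  # (1, 7, 7, 1): spatial size preserved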
Remember back to…
E.g. 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially!
(32 -> 28 -> 24 ...). Shrinking too quickly doesn't work well in practice.
32x32x3 -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (e.g. 10 5x5x6 filters) -> 24x24x10 -> CONV, ReLU -> ….
Common settings: number of filters K = powers of 2, e.g. 32, 64, 128, 512
TensorFlow Padding Options
e.g. input width = 13, filter width = 6, stride = 5
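TensorFlow's two padding modes resolve this example differently; its documented output-size formulas, sketched as a helper:

import math

def output_width(input_width, filter_width, stride, padding):
    """Output width under TensorFlow's padding modes."""
    if padding == 'VALID':   # no padding: the filter must fit entirely
        return math.ceil((input_width - filter_width + 1) / stride)
    if padding == 'SAME':    # zero-pad so only the stride matters
        return math.ceil(input_width / stride)

print(output_width(13, 6, 5, 'VALID'))  # 2
print(output_width(13, 6, 5, 'SAME'))   # 3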
Pooling Layer
Slide credit: CS231n Lecture 7
Max Pooling
Slide credit: CS231n Lecture 7
Single depth slice (x, y):
1 | 1 | 2 | 4
5 | 6 | 7 | 8
3 | 2 | 1 | 0
1 | 2 | 3 | 4
max pool with 2x2 filters and stride 2 =>
6 | 8
3 | 4
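The same computation in numpy, reshaping the 4x4 slice into non-overlapping 2x2 blocks (a minimal sketch):

import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

# 2x2 max pool, stride 2: max over each non-overlapping 2x2 block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 8]
                #  [3 4]]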
Max Pooling
Slide credit: CS231n Lecture 7
Common settings:
F = 2, S = 2
F = 3, S = 2
Case study: LeNet-5 [LeCun et al., 1998]
Slide credit: CS231n Lecture 7
Conv filters were 5x5, applied at stride 1
Subsampling (Pooling) layers were 2x2 applied at stride 2
i.e. architecture is [CONV-POOL-CONV-POOL-CONV-FC]
Case study: AlexNet [Krizhevsky et al. 2012]
Slide credit: CS231n Lecture 7
Full (simplified) AlexNet architecture
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
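The listed spatial sizes all follow from the output-size formula with padding, (N - F + 2P) / S + 1; a quick check:

def out_size(N, F, S, P=0):
    return (N - F + 2 * P) // S + 1

N = 227
N = out_size(N, 11, 4)       # CONV1 -> 55
N = out_size(N, 3, 2)        # POOL1 -> 27
N = out_size(N, 5, 1, P=2)   # CONV2 -> 27
N = out_size(N, 3, 2)        # POOL2 -> 13
N = out_size(N, 3, 1, P=1)   # CONV3/4/5 -> 13
N = out_size(N, 3, 2)        # POOL3 -> 6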
Case study: VGGNet [Simonyan and Zisserman, 2014]
Only 3x3 CONV stride 1, pad 1, and 2x2 MAX POOL stride 2
best model: improved top-5 error from 11.2% (ILSVRC 2013) to 7.3%
Slide credit: CS231n Lecture 7
Case study: GoogLeNet [Szegedy et al., 2014]
Slide credit: CS231n Lecture 7
Inception module
ILSVRC 2014 winner (6.7% top 5 error)
Case study: ResNet [He et al., 2015]
Slide credit: CS231n Lecture 7
spatial dimension only 56x56!
Case study: ResNet [He et al., 2015]
Slide credit: CS231n Lecture 7
ILSVRC 2015 winner
(3.6% top 5 error)
(slide from Kaiming He’s ICCV 2015 presentation)
2-3 weeks of training on an 8-GPU machine
at runtime: faster than a VGGNet!
(even though it has 8x more layers)
Case study: ResNet [He et al., 2015]
Slide credit: CS231n Lecture 7
(slide from Kaiming He’s ICCV 2015 presentation)
Visualizing ConvNet Features
What’s going on inside ConvNets?
Input Image: 3 x 224 x 224
Class Scores: 1000 numbers
What are the intermediate features looking for?
Krizhevsky et al, “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS 2012. Figure reproduced with permission.
Visualizing CNN features: Look at filters
Slide credit: CS231n Lecture 9
conv1
First layers: networks learn similar features
Slide credit: CS231n Lecture 9
Visualizing CNN features: Look at filters
Slide credit: CS231n Lecture 9
Filters from ConvNetJS CIFAR-10 model
Filters from higher layers are harder to interpret: they operate on the previous layer’s activations, not on pixels.
Visualizing CNN features: (guided) backprop
Slide credit: CS231n Lecture 9
1. Choose an image
2. Choose a layer and a neuron in a CNN
Question: How does the chosen neuron respond to the image?
Visualizing CNN features: (guided) backprop
1. Forward the image to the chosen layer
2. Set gradient of chosen layer to all zero, except 1 for the chosen neuron
3. Backprop to image
Guided backpropagation: when backpropagating through each ReLU, pass only positive gradients (and only where the forward activation was positive) instead
Zeiler and Fergus, “Visualizing and Understanding Convolutional Networks”, ECCV 2014
Springenberg et al., “Striving for Simplicity: The All Convolutional Net”, ICLR Workshop 2015
Slide credit: CS231n Lecture 9
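In TF 1.x, one common way to get the modified ReLU backward pass is a gradient override; a sketch (the name "GuidedRelu" is arbitrary, and the network itself is omitted):

import tensorflow as tf

@tf.RegisterGradient("GuidedRelu")
def _guided_relu_grad(op, grad):
    # Pass gradient only where the forward activation was positive
    # AND the incoming gradient is positive.
    return tf.where(op.outputs[0] > 0.0,
                    tf.maximum(grad, 0.0),
                    tf.zeros_like(grad))

g = tf.get_default_graph()
with g.gradient_override_map({"Relu": "GuidedRelu"}):
    pass  # build the network here; tf.gradients then uses GuidedRelu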
Visualizing CNN features: (guided) backprop
Visualization of patterns learned by the layer conv6 (top) and layer conv9 (bottom) of the network trained on ImageNet.
Each row corresponds to one filter.
The visualization using “guided backpropagation” is based on the top 10 image patches activating this filter taken from the ImageNet dataset.
Springenberg et al., “Striving for Simplicity: The All Convolutional Net”, ICLR Workshop 2015
Slide credit: CS231n Lecture 9
Visualizing CNN features: Gradient ascent
Slide credit: CS231n Lecture 9
(Guided) backprop:
Find the part of an image that a neuron responds to
Gradient ascent:
Generate a synthetic image that maximally activates a neuron
I* = arg max_I f(I) + R(I)
where f(I) is the neuron value and R(I) is a natural image regularizer
Visualizing CNN features: Gradient ascent
Simonyan et al, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps”, ICLR Workshop 2014
Maximize S_c(I), the score for class c (before Softmax):
1. Initialize image to zeros
Repeat:
2. Forward image to compute current scores
3. Set gradient of scores to be 1 for target class, 0 for others
4. Backprop to get gradient on image
5. Make a small update to the image
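A TF 1.x-style sketch of this loop, assuming a hypothetical model_fn that maps an image batch to pre-softmax class scores and a chosen target_class:

import numpy as np
import tensorflow as tf

images = tf.placeholder(tf.float32, [1, 224, 224, 3])
scores = model_fn(images)                        # hypothetical: [1, 1000] scores
grad = tf.gradients(scores[0, target_class], images)[0]

img = np.zeros([1, 224, 224, 3], np.float32)     # 1. start from a zero image
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):
        g = sess.run(grad, {images: img})        # 2-4. forward, then backprop to image
        img += 1.0 * g                           # 5. small gradient-ascent step
        img *= 0.99                              # crude natural-image regularizer R(I)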
Visualizing CNN features: Gradient ascent
Simonyan et al, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps”, ICLR Workshop 2014
Visualizing CNN features: Gradient ascent
Yosinski et al, “Understanding Neural Networks Through Deep Visualization”, ICML DL Workshop 2015
Better image regularizers give prettier results:
Visualizing CNN features: Gradient ascent
Yosinski et al, “Understanding Neural Networks Through Deep Visualization”, ICML DL Workshop 2015
Use the same approach to visualize intermediate features
Visualizing CNN features: Gradient ascent
Nguyen et al, “Multifaceted Feature Visualization: Uncovering the Different Types of Features Learned By Each Neuron in Deep Neural Networks”, ICML Visualization for Deep Learning Workshop 2016
You can add even more tricks to get nicer results
Take-aways
Convolutional networks are tailor-made for computer vision tasks.
They exploit two key insights: complex features can be composed from simple ones, and a feature that is useful at one position is useful everywhere (so weights can be shared).
“Understanding” what a convnet learns is non-trivial, but some clever approaches exist.
Next class