Introduction to convnets
CS 20: TensorFlow for Deep Learning Research
Lecture 6
1/31/2017
Agenda
Computer Vision
Convolutional Neural Networks
Convolution
Pooling
Feature Visualization
Slides adapted from Justin Johnson
Used with permission.
Convolutional Neural Networks:
Deep Learning with Images
Computer Vision - A bit of history
https://dspace.mit.edu/bitstream/handle/1721.1/6125/AIM-100.pdf
https://xkcd.com/1425/
Object Segmentation
Figure credit: Dai, He, and Sun, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, CVPR 2016
Pose Estimation
Figure credit: Cao et al, “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”, arXiv 2016
Image Captioning
Figure credit: Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015
Dense Image Captioning
Figure credit: Johnson*, Karpathy*, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016
Visual Question Answering
Figure credit: Agrawal et al, “VQA: Visual Question Answering”, ICCV 2015 (left)
Zhu et al, “Visual7W: Grounded Question Answering in Images”, CVPR 2016 (right)
Image Super-resolution
Figure credit: Ledig et al, “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network”, arXiv 2016
Art generation
Gatys, Ecker, and Bethge, “Image Style Transfer using Convolutional Neural Networks”, CVPR 2016 (left)
Mordvintsev, Olah, and Tyka, “Inceptionism: Going Deeper into Neural Networks” (upper right)
Johnson, Alahi, and Fei-Fei: “Perceptual Losses for Real-Time Style Transfer and Super-Resolution”, ECCV 2016 (bottom left)
Convolutional Neural Networks
Recall: fully connected neural network
[Diagram] x (C1) -> Matrix Multiply by w1 (C1×C2) -> s (C2) -> Nonlinearity -> a (C2) -> Matrix Multiply by w2 (C2×C3) -> ŷ (C3)
Recall: fully connected neural network
[Diagram] input: 32x32x3 image -> stretch to 3072 x 1 column; weights: 10 x 3072; activation: 10 x 1
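In code, the "stretch" is just a reshape followed by a matrix multiply. A minimal numpy sketch (random values stand in for a real image and learned weights):

import numpy as np

image = np.random.rand(32, 32, 3)   # 32x32x3 input image
x = image.reshape(3072, 1)          # stretch to a 3072 x 1 column
W = np.random.rand(10, 3072)        # weight matrix
activation = W.dot(x)               # 10 x 1 activations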
Convolutional Neural Network
[Diagram] x (C1×H×W) -> Convolution with w1 (C2×C1×k×k) -> s (C2×H×W) -> Nonlinearity -> a (C2×H×W) -> Pooling -> p (C2×H/2×W/2) -> Fully Connected with w2 ((C2·H·W/4)×C3) -> ŷ (C3)
Convolution
Image courtesy Apple
Convolving “filters” is not a new idea
Sobel operator:
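For instance, the Sobel kernel for horizontal gradients can be applied with an off-the-shelf 2D convolution. A minimal sketch using scipy (the random image is a stand-in):

import numpy as np
from scipy.signal import convolve2d

# Sobel kernel: responds to horizontal intensity changes (vertical edges)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

image = np.random.rand(64, 64)                   # stand-in grayscale image
edges = convolve2d(image, sobel_x, mode='same')  # output same size as input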
Convolution Layer
[Diagram] 32x32x3 image (width 32, height 32, depth 3)
Slide credit: CS231n Lecture 7
Convolution Layer
32x32x3 image
Slide credit: CS231n Lecture 7
5x5x3 filter
Convolve the filter with the image
i.e. “slide over the image spatially, computing dot products”
Filters always extend the full depth of the input volume
Convolution Layer
Slide credit: CS231n Lecture 7
32x32x3 image
5x5x3 filter
1 number:
the result of taking a dot product between the filter and a small 5x5x3 chunk of the image
(i.e. 5*5*3 = 75-dimensional dot product + bias)
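Concretely, each output value is one dot product plus a bias. A minimal numpy sketch (random values stand in for a real image and filter):

import numpy as np

image = np.random.rand(32, 32, 3)    # input volume
filt = np.random.rand(5, 5, 3)       # one filter, spanning the full depth
bias = 0.1

patch = image[0:5, 0:5, :]           # one 5x5x3 chunk of the image
value = np.sum(patch * filt) + bias  # 5*5*3 = 75-dimensional dot product + bias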
Convolution Layer
Slide credit: CS231n Lecture 7
32x32x3 image
5x5x3 filter
convolve (slide) over all spatial locations
one activation map (28 x 28 x 1)
[Diagram: Input, Filter, Output, Padding]
Convolution Layer
Slide credit: CS231n Lecture 7
32x32x3 image
5x5x3 filter
convolve (slide) over all spatial locations
consider a second, green filter
activation maps (28 x 28 x 2)
Convolution Layer
For example, if we had 6 5x5 filters, we'd get 6 separate activation maps (each 28 x 28).
We stack these up to get a new "image" of size 28x28x6!
Slide credit: CS231n Lecture 7
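A naive (slow but explicit) numpy sketch of this stacking, assuming the 32x32x3 input and six random 5x5x3 filters from the example:

import numpy as np

image = np.random.rand(32, 32, 3)
filters = np.random.rand(6, 5, 5, 3)   # six 5x5x3 filters

out = np.zeros((28, 28, 6))            # 32 - 5 + 1 = 28
for f in range(6):                     # one activation map per filter
    for i in range(28):
        for j in range(28):
            patch = image[i:i+5, j:j+5, :]
            out[i, j, f] = np.sum(patch * filters[f])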
ConvNet is a sequence of Convolution Layers, interspersed with activation functions
32x32x3 -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (e.g. 10 5x5x6 filters) -> 24x24x10 -> CONV, ReLU -> ….
Slide credit: CS231n Lecture 7
Two key insights
1. Composing high-complexity features out of low-complexity features is more efficient than learning high-complexity features directly.
e.g.: having a "circle" detector is useful for detecting faces… and basketballs
2. If a feature is useful to compute at (x, y), it is useful to compute at (x', y') as well, so filter weights can be shared across positions (see the parameter count below).
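A back-of-the-envelope parameter count for the running 32x32x3 -> 28x28x6 example shows what weight sharing buys:

# Fully connected map from 32x32x3 inputs to 28x28x6 outputs:
fc_params = (32 * 32 * 3) * (28 * 28 * 6)   # 14,450,688 weights

# Convolutional map with six shared 5x5x3 filters (plus biases):
conv_params = 6 * (5 * 5 * 3 + 1)           # 456 parameters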
example 5x5 filters
(32 total)
We call the layer convolutional because it is related to the convolution of two signals: element-wise multiplication and sum of a filter and the signal (image).
one filter => one activation map
Figure copyright Andrej Karpathy.
A closer look at spatial dimensions:
[Diagram: 7x7 input grid; the 3x3 filter slides across one position at a time]
7x7 input (spatially)
assume 3x3 filter
=> 5x5 output
7x7 input (spatially)
assume 3x3 filter
applied with stride 2
=> 3x3 output!
7x7 input (spatially)
assume 3x3 filter
applied with stride 3?
doesn't fit!
cannot apply 3x3 filter on 7x7 input with stride 3.
[Diagram: N x N input, F x F filter]
Output size:
(N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
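The formula is easy to encode directly; a small helper (names are illustrative) that also flags the stride-3 case:

def conv_output_size(N, F, stride):
    """Spatial output size: (N - F) / stride + 1, no padding."""
    if (N - F) % stride != 0:
        raise ValueError("filter doesn't fit: adjust stride or padding")
    return (N - F) // stride + 1

conv_output_size(7, 3, 1)   # 5
conv_output_size(7, 3, 2)   # 3
conv_output_size(7, 3, 3)   # raises ValueError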
In practice: Common to zero pad the border
[Diagram: 7x7 input zero-padded with a 1-pixel border, giving 9x9]
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
(recall: (N - F) / stride + 1, now with N = 9)
7x7 output!
in general, common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F-1)/2 pixels. (will preserve size spatially)
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
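In TensorFlow, padding='SAME' applies this kind of zero padding for you. A minimal TF 1.x-style sketch with random filter weights:

import tensorflow as tf

x = tf.placeholder(tf.float32, [1, 7, 7, 1])     # one 7x7 single-channel image
w = tf.Variable(tf.random_normal([3, 3, 1, 1]))  # one 3x3 filter
y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
print(y.shape)  # (1, 7, 7, 1): spatial size preserved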
Remember back to…
E.g. 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially!
(32 -> 28 -> 24 ...). Shrinking too quickly doesn't work well in practice.
32x32x3 -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (e.g. 10 5x5x6 filters) -> 24x24x10 -> CONV, ReLU -> ….
Common settings: number of filters K = powers of 2, e.g. 32, 64, 128, 512
TensorFlow Padding Options
e.g. input width = 13, filter width = 6, stride = 5
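TensorFlow's two padding modes resolve this example differently; its documented output-size formulas, sketched as a helper:

import math

def output_width(input_width, filter_width, stride, padding):
    """Output width under TensorFlow's padding modes."""
    if padding == 'VALID':   # no padding: the filter must fit entirely
        return math.ceil((input_width - filter_width + 1) / stride)
    if padding == 'SAME':    # zero-pad so only the stride matters
        return math.ceil(input_width / stride)

print(output_width(13, 6, 5, 'VALID'))  # 2
print(output_width(13, 6, 5, 'SAME'))   # 3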
Pooling Layer
Slide credit: CS231n Lecture 7
Max Pooling
Slide credit: CS231n Lecture 7
Single depth slice (x, y):
1 | 1 | 2 | 4
5 | 6 | 7 | 8
3 | 2 | 1 | 0
1 | 2 | 3 | 4
max pool with 2x2 filters and stride 2 =>
6 | 8
3 | 4
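The same computation in numpy, reshaping the 4x4 slice into non-overlapping 2x2 blocks (a minimal sketch):

import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

# 2x2 max pool, stride 2: max over each non-overlapping 2x2 block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 8]
                #  [3 4]]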
Max Pooling
Slide credit: CS231n Lecture 7
Common settings:
F = 2, S = 2
F = 3, S = 2
Case study: LeNet-5 [LeCun et al., 1998]
Slide credit: CS231n Lecture 7
Conv filters were 5x5, applied at stride 1
Subsampling (Pooling) layers were 2x2 applied at stride 2
i.e. architecture is [CONV-POOL-CONV-POOL-CONV-FC]
Case study: AlexNet [Krizhevsky et al. 2012]
Slide credit: CS231n Lecture 7
Full (simplified) AlexNet architecture
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
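The listed spatial sizes all follow from the output-size formula with padding, (N - F + 2P) / S + 1; a quick check:

def out_size(N, F, S, P=0):
    return (N - F + 2 * P) // S + 1

N = 227
N = out_size(N, 11, 4)       # CONV1 -> 55
N = out_size(N, 3, 2)        # POOL1 -> 27
N = out_size(N, 5, 1, P=2)   # CONV2 -> 27
N = out_size(N, 3, 2)        # POOL2 -> 13
N = out_size(N, 3, 1, P=1)   # CONV3/4/5 -> 13
N = out_size(N, 3, 2)        # POOL3 -> 6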
Case study: VGGNet [Simonyan and Zisserman, 2014]
Only 3x3 CONV stride 1, pad 1, and 2x2 MAX POOL stride 2
best model: improved top-5 error from 11.2% (ILSVRC 2013) to 7.3%
Slide credit: CS231n Lecture 7
Case study: GoogLeNet [Szegedy et al., 2014]
Slide credit: CS231n Lecture 7
Inception module
ILSVRC 2014 winner (6.7% top 5 error)
Case study: ResNet [He et al., 2015]
Slide credit: CS231n Lecture 7
spatial dimension only 56x56!
Case study: ResNet [He et al., 2015]
Slide credit: CS231n Lecture 7
ILSVRC 2015 winner
(3.6% top 5 error)
(slide from Kaiming He’s ICCV 2015 presentation)
2-3 weeks of training on an 8-GPU machine
at runtime: faster than a VGGNet!
(even though it has 8x more layers)
Case study: ResNet [He et al., 2015]
Slide credit: CS231n Lecture 7
(slide from Kaiming He’s ICCV 2015 presentation)
Visualizing ConvNet Features
What’s going on inside ConvNets?
Input Image: 3 x 224 x 224
Class Scores: 1000 numbers
What are the intermediate features looking for?
Krizhevsky et al, “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS 2012. Figure reproduced with permission.
Visualizing CNN features: Look at filters
Slide credit: CS231n Lecture 9
conv1
First layers: networks learn similar features
Slide credit: CS231n Lecture 9
Visualizing CNN features: Look at filters
Slide credit: CS231n Lecture 9
Filters from ConvNetJS CIFAR-10 model
Filters from higher layers are harder to interpret: they operate on the previous layer’s activations, not on pixels.
Visualizing CNN features: (guided) backprop
Slide credit: CS231n Lecture 9
1. Choose an image
2. Choose a layer and a neuron in a CNN
Question: How does the chosen neuron respond to the image?
Visualizing CNN features: (guided) backprop
1. Forward the image to the chosen layer
2. Set gradient of chosen layer to all zero, except 1 for the chosen neuron
3. Backprop to image
Guided backpropagation: when backpropagating through each ReLU, pass only positive gradients (and only where the forward activation was positive) instead
Zeiler and Fergus, “Visualizing and Understanding Convolutional Networks”, ECCV 2014
Springenberg et al., “Striving for Simplicity: The All Convolutional Net”, ICLR Workshop 2015
Slide credit: CS231n Lecture 9
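In TF 1.x, one common way to get the modified ReLU backward pass is a gradient override; a sketch (the name "GuidedRelu" is arbitrary, and the network itself is omitted):

import tensorflow as tf

@tf.RegisterGradient("GuidedRelu")
def _guided_relu_grad(op, grad):
    # Pass gradient only where the forward activation was positive
    # AND the incoming gradient is positive.
    return tf.where(op.outputs[0] > 0.0,
                    tf.maximum(grad, 0.0),
                    tf.zeros_like(grad))

g = tf.get_default_graph()
with g.gradient_override_map({"Relu": "GuidedRelu"}):
    pass  # build the network here; tf.gradients then uses GuidedRelu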
Visualizing CNN features: (guided) backprop
Visualization of patterns learned by the layer conv6 (top) and layer conv9 (bottom) of the network trained on ImageNet.
Each row corresponds to one filter.
The visualization using “guided backpropagation” is based on the top 10 image patches activating this filter taken from the ImageNet dataset.
Springenberg et al., “Striving for Simplicity: The All Convolutional Net”, ICLR Workshop 2015
Slide credit: CS231n Lecture 9
Visualizing CNN features: Gradient ascent
Slide credit: CS231n Lecture 9
(Guided) backprop:
Find the part of an image that a neuron responds to
Gradient ascent:
Generate a synthetic image that maximally activates a neuron
I* = arg max_I f(I) + R(I)
where f(I) is the neuron value and R(I) is a natural image regularizer
Visualizing CNN features: Gradient ascent
Simonyan et al, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps”, ICLR Workshop 2014
Maximize S_c(I), the score for class c (before Softmax):
1. Initialize image to zeros
Repeat:
2. Forward image to compute current scores
3. Set gradient of scores to be 1 for target class, 0 for others
4. Backprop to get gradient on image
5. Make a small update to the image
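A TF 1.x-style sketch of this loop, assuming a hypothetical model_fn that maps an image batch to pre-softmax class scores and a chosen target_class:

import numpy as np
import tensorflow as tf

images = tf.placeholder(tf.float32, [1, 224, 224, 3])
scores = model_fn(images)                        # hypothetical: [1, 1000] scores
grad = tf.gradients(scores[0, target_class], images)[0]

img = np.zeros([1, 224, 224, 3], np.float32)     # 1. start from a zero image
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):
        g = sess.run(grad, {images: img})        # 2-4. forward, then backprop to image
        img += 1.0 * g                           # 5. small gradient-ascent step
        img *= 0.99                              # crude natural-image regularizer R(I)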
Visualizing CNN features: Gradient ascent
Simonyan et al, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps”, ICLR Workshop 2014
Visualizing CNN features: Gradient ascent
Yosinski et al, “Understanding Neural Networks Through Deep Visualization”, ICML DL Workshop 2015
Better image regularizers give prettier results:
Visualizing CNN features: Gradient ascent
Yosinski et al, “Understanding Neural Networks Through Deep Visualization”, ICML DL Workshop 2015
Use the same approach to visualize intermediate features
Visualizing CNN features: Gradient ascent
Nguyen et al, “Multifaceted Feature Visualization: Uncovering the Different Types of Features Learned By Each Neuron in Deep Neural Networks”, ICML Visualization for Deep Learning Workshop 2016
You can add even more tricks to get nicer results
Take-aways
Convolutional networks are tailor-made for computer vision tasks.
They exploit two key insights: complex features can be composed from simple ones, and a feature that is useful at one position is useful everywhere (so weights can be shared).
“Understanding” what a convnet learns is non-trivial, but some clever approaches exist.
Next class