Convolutional Neural Networks
Dinesh K. Vishwakarma, Ph.D.
1
4/6/2022
Dr. Dinesh Kumar Vishwakarma, Professor, Department of Information Technology, Delhi Technological University, Delhi
Email: dinesh@dtu.ac.in
Web page: http://www.dtu.ac.in/Web/Departments/InformationTechnology/faculty/dkvishwakarma.php
Biometric Research Laboratory
http://www.dtu.ac.in/Web/Departments/InformationTechnology/lab_and_infra/bml/
What is a CNN?
A bit of history...
This image by Rocky Acosta is licensed under CC-BY 3.0
A bit of history...
Rumelhart et al., 1986: the first time back-propagation became popular, expressed in recognizably modern math
Illustration of Rumelhart et al., 1986 by Lane McIntosh
A bit of history...
[Hinton and Salakhutdinov 2006]
Reinvigorated research in Deep Learning
Illustration of Hinton and Salakhutdinov 2006 by Lane McIntosh
First Strong Results
Illustration of Dahl et al. 2012 by Lane McIntosh, copyright CS231n 2017
Figures copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
A bit of history:
LeNet-5 [LeCun et al., 1998]
Implemented on CPU
A bit of history…
ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky, Sutskever, Hinton, NIPS 2012]. Citations: 59,454 (as of 02.04.2020)
“AlexNet”
The success of AlexNet sparked a revolution
Fast-forward to today: ConvNets are everywhere
Classification / Retrieval
Figures copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
Fast-forward to today: ConvNets are everywhere
Figures copyright Shaoqing Ren, Kaiming He, Ross Girschick, Jian Sun, 2015. Reproduced with permission.
[Faster R-CNN: Ren, He, Girshick, Sun 2015]
Detection / Segmentation
[Farabet et al., 2012]
Figures copyright Clement Farabet, 2012. Reproduced with permission.
Fast-forward to today: ConvNets are everywhere
NVIDIA Tesla line
(these are the GPUs on rye01.stanford.edu)
Note that for embedded systems a typical setup would involve NVIDIA Tegras, with integrated GPU and ARM-based CPU cores.
self-driving cars
Photo by Lane McIntosh. Copyright CS231n 2017.
This image by GBPublic_PR is licensed under CC-BY 2.0
Fast-forward to today: ConvNets are everywhere
[Taigman et al. 2014]
[Simonyan et al. 2014]
Figures copyright Simonyan et al., 2014. Reproduced with permission.
Activations of inception-v3 architecture [Szegedy et al. 2015] to image of Emma McIntosh, used with permission. Figure and architecture not from Taigman et al. 2014.
Illustration by Lane McIntosh, photos of Katie Cumnock used with permission.
Fast-forward to today: ConvNets are everywhere
[Toshev, Szegedy 2014]
[Guo et al. 2014]
Images are examples of pose estimation, not actually from Toshev & Szegedy 2014. Copyright Lane McIntosh.
Figures copyright Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard Lewis, and Xiaoshi Wang, 2014. Reproduced with permission.
Fast-forward to today: ConvNets are everywhere
[Levy et al. 2016]
[Sermanet et al. 2011] [Ciresan et al.]
Photos by Lane McIntosh. Copyright CS231n 2017.
[Dieleman et al. 2014]
From left to right: public domain by NASA, usage permitted by ESA/Hubble, public domain by NASA, and public domain.
Figure copyright Levy et al. 2016. Reproduced with permission.
Image Captioning
[Vinyals et al., 2015] [Karpathy and Fei-Fei, 2015]
Example generated captions (quality ranging from no errors, to minor errors, to somewhat related):
“A white teddy bear sitting in the grass”
“A man riding a wave on top of a surfboard”
“A man in a baseball uniform throwing a ball”
“A cat sitting on a suitcase on the floor”
“A woman is holding a cat in her hand”
“A woman standing on a beach holding a surfboard”
Captions generated by Justin Johnson using Neuraltalk2
All images are CC0 Public domain: https://pixabay.com/en/luggage-antique-cat-1643010/
https://pixabay.com/en/teddy-plush-bears-cute-teddy-bear-1623436/ https://pixabay.com/en/surf-wave-summer-sport-litoral-1668716/ https://pixabay.com/en/woman-female-model-portrait-adult-983967/ https://pixabay.com/en/handstand-lake-meditation-496008/ https://pixabay.com/en/baseball-player-shortstop-infield-1045263/
CNN Fundamentals
Fundamental Architecture of CNN
Why a CNN and not a Feedforward Neural Network?
Convolution Layer
32x32x3 image (width x height x depth) -> preserve spatial structure
Convolution operation
Convolution Layer
32x32x3 image, 5x5x3 filter
Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”.
Filters always extend the full depth of the input volume.
Convolution Layer
32x32x3 image, 5x5x3 filter
1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)
Convolution Layer
32x32x3 image, 5x5x3 filter
convolve (slide) over all spatial locations
-> one 28x28x1 activation map
consider a second, green filter
Convolution Layer
convolving a second 5x5x3 filter over all spatial locations gives a second 28x28x1 activation map
For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps:
We stack these up to get a “new image” of size 28x28x6!
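The sliding-filter computation described above can be written out in a few lines of plain Python. This is a minimal sketch, not an efficient implementation; the function name `conv_forward` and the constant-valued demo inputs are illustrative, not from the slides.

```python
# Naive forward pass of one conv layer: slide each 5x5x3 filter over a
# 32x32x3 input at stride 1, taking a 75-dimensional dot product per position.
def conv_forward(image, filters, bias):
    H, W, C = len(image), len(image[0]), len(image[0][0])
    f = len(filters[0])                      # each filter is f x f x C
    out_h, out_w = H - f + 1, W - f + 1      # (N - F)/stride + 1, stride 1
    maps = []
    for k, filt in enumerate(filters):
        amap = [[0.0] * out_w for _ in range(out_h)]
        for i in range(out_h):
            for j in range(out_w):
                s = bias[k]
                for a in range(f):
                    for b in range(f):
                        for c in range(C):
                            s += image[i + a][j + b][c] * filt[a][b][c]
                amap[i][j] = s
        maps.append(amap)
    return maps                              # one activation map per filter

image = [[[1.0] * 3 for _ in range(32)] for _ in range(32)]      # 32x32x3
filters = [[[[0.01] * 3 for _ in range(5)] for _ in range(5)]    # 6 filters,
           for _ in range(6)]                                    # each 5x5x3
bias = [0.0] * 6
maps = conv_forward(image, filters, bias)
print(len(maps), len(maps[0]), len(maps[0][0]))  # 6 28 28: six 28x28 maps
```

Stacking the six maps depth-wise gives exactly the 28x28x6 output volume described above.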
Preview: ConvNet is a sequence of Convolution Layers, interspersed with activation functions
32x32x3 -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6
Preview: ConvNet is a sequence of Convolutional Layers, interspersed with activation functions
32x32x3 -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (e.g. 10 5x5x6 filters) -> 24x24x10 -> CONV, ReLU -> ….
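The shrinking spatial sizes in this stack are a quick piece of arithmetic (stride 1, no padding, so each 5x5 conv shrinks the input by F - 1 = 4):

```python
# Each stride-1, unpadded 5x5 conv maps N -> N - F + 1, so 32 -> 28 -> 24.
N, F = 32, 5
sizes = [N]
for _ in range(2):        # two conv layers from the stack above
    N = N - F + 1
    sizes.append(N)
print(sizes)              # [32, 28, 24]
```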
Preview
[Zeiler and Fergus 2013]
Visualization of VGG-16 by Lane McIntosh. VGG-16 architecture from [Simonyan and Zisserman 2014].
one filter => one activation map
example 5x5 filters (32 total)
We call the layer convolutional because it is related to convolution of two signals:
elementwise multiplication and sum of a filter and the signal (image)
Figure copyright Andrej Karpathy.
A closer look at spatial dimensions:
32x32x3 image, 5x5x3 filter
convolve (slide) over all spatial locations
-> 28x28x1 activation map
A closer look at spatial dimensions:
7x7 input (spatially) assume 3x3 filter
=> 5x5 output
A closer look at spatial dimensions:
7x7 input (spatially) assume 3x3 filter applied with stride 2
=> 3x3 output!
A closer look at spatial dimensions:
7x7 input (spatially) assume 3x3 filter applied with stride 3?
Doesn’t fit! A 3x3 filter cannot be applied to a 7x7 input with stride 3.
Output size:
(N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
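The output-size rule is easy to wrap in a small helper; a sketch, with the function name `conv_output_size` being my own:

```python
def conv_output_size(N, F, stride, pad=0):
    """Spatial output size of a conv layer: (N - F + 2*pad) / stride + 1."""
    size = (N - F + 2 * pad) / stride + 1
    if size != int(size):
        raise ValueError("filter does not fit: fractional size %.2f" % size)
    return int(size)

print(conv_output_size(7, 3, 1))   # 5
print(conv_output_size(7, 3, 2))   # 3
# conv_output_size(7, 3, 3) raises: (7 - 3)/3 + 1 = 2.33 does not fit
```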
In practice: Common to zero pad the border
(figure: the 7x7 input padded with a one-pixel border of zeros, giving a 9x9 padded input)
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
(recall:)
(N - F) / stride + 1
In practice: Common to zero pad the border
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
7x7 output!
In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2 (this preserves the spatial size).
e.g. F = 3 => zero pad with 1; F = 5 => zero pad with 2; F = 7 => zero pad with 3
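The same-size padding rule can be checked numerically; a small sketch of the arithmetic, nothing more:

```python
# At stride 1, zero-padding with P = (F - 1) // 2 preserves the 7x7 spatial size.
N = 7
for F in (3, 5, 7):
    P = (F - 1) // 2
    out = (N - F + 2 * P) // 1 + 1
    print("F = %d, pad = %d -> %dx%d output" % (F, P, out, out))
```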
Remember back to…
E.g. a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially (32 -> 28 -> 24 ...). Shrinking too fast is not good; it doesn’t work well.
Examples time:
Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2
Output volume size:
(32+2*2-5)/1+1 = 32 spatially, so
32x32x10
Examples time:
Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2
Number of parameters in this layer? Each filter has 5*5*3 + 1 = 76 params (the +1 is for the bias) => 76*10 = 760
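The same count as a two-line calculation:

```python
# 10 filters, each 5x5x3 with one bias term.
F, C_in, K = 5, 3, 10
params_per_filter = F * F * C_in + 1     # 75 weights + 1 bias = 76
total_params = params_per_filter * K
print(params_per_filter, total_params)   # 76 760
```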
Dinesh K. Vishwakarma, Ph.D.
53
4/6/2022
To summarize, the Conv Layer. Common settings (K = number of filters, F = filter size, S = stride, P = zero-padding):
- K = powers of 2, e.g. 32, 64, 128, 512
- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = ? (whatever fits)
- F = 1, S = 1, P = 0
(btw, 1x1 convolution layers make perfect sense)
56x56x64 -> 1x1 CONV with 32 filters -> 56x56x32
(each filter has size 1x1x64, and performs a 64-dimensional dot product)
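A 1x1 convolution is just a per-position dot product across depth. Here is a small pure-Python sketch; a 4x4 spatial extent stands in for 56x56 to keep the demo quick, and the random values are illustrative only:

```python
import random

random.seed(0)
H = W = 4                  # stand-in for the 56x56 spatial extent
C_in, C_out = 64, 32
x = [[[random.random() for _ in range(C_in)] for _ in range(W)]
     for _ in range(H)]                                              # H x W x 64
w = [[random.random() for _ in range(C_in)] for _ in range(C_out)]   # 32 1x1x64 filters
y = [[[sum(x[i][j][c] * w[k][c] for c in range(C_in)) for k in range(C_out)]
      for j in range(W)] for i in range(H)]                          # H x W x 32
print(len(y), len(y[0]), len(y[0][0]))                               # 4 4 32
```

The spatial size is untouched; only the depth changes, which is why 1x1 convs are used to mix or reduce channels.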
The brain/neuron view of CONV Layer
32x32x3 image, 5x5x3 filter
1 number: the result of taking a dot product between the filter and this part of the image (i.e. 5*5*3 = 75-dimensional dot product)
It’s just a neuron with local connectivity...
An activation map is a 28x28 sheet of neuron outputs:
“5x5 filter” -> “5x5 receptive field for each neuron”
E.g. with 5 filters, CONV layer consists of neurons arranged in a 3D grid (28x28x5)
There will be 5 different neurons all looking at the same region in the input volume.
Pooling Layer
MAX and AVERAGE POOLING
Max and Average pool with 2x2 filters and stride 2
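Both pooling operations can be sketched directly on a small grid (the function names are my own):

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2-D grid (nested lists)."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

def avg_pool_2x2(x):
    """2x2 average pooling with stride 2."""
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

x = [[1, 1, 2, 4],
     [5, 6, 7, 8],
     [3, 2, 1, 0],
     [1, 2, 3, 4]]
print(max_pool_2x2(x))   # [[6, 8], [3, 4]]
print(avg_pool_2x2(x))   # [[3.25, 5.25], [2.0, 2.0]]
```

Pooling halves each spatial dimension here but leaves the depth unchanged; it is applied to every activation map independently.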
Fully Connected Layer
Flatten the final output and feed it to a regular NN for classification purposes
32x32x3 image -> stretch to a 3072 x 1 input vector
10 x 3072 weight matrix W -> 10 x 1 activation
1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product)
Each neuron looks at the full input volume
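The FC layer above amounts to one matrix-vector product; a sketch with random values, where the variable names are illustrative:

```python
import random

random.seed(0)
x = [random.random() for _ in range(32 * 32 * 3)]   # flattened 3072-vector
W = [[random.random() for _ in range(len(x))] for _ in range(10)]   # 10 x 3072
b = [0.0] * 10
scores = [sum(wi * xi for wi, xi in zip(row, x)) + bk
          for row, bk in zip(W, b)]                 # one score per class
print(len(scores))                                  # 10
```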
[ConvNetJS demo: training on CIFAR-10]
Summary
[(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX
where N is usually up to ~5, M is large, 0 <= K <= 2.
- but recent advances such as ResNet/GoogLeNet challenge this paradigm
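A shape trace of this generic pattern, here with N = 2, M = 2, K = 1 and assumed layer settings (F = 3, S = 1, P = 1 convs with 32 filters; 2x2 stride-2 pools), just to watch the sizes flow:

```python
def conv(shape, F=3, K=32, S=1, P=1):
    """Output shape of a conv layer: spatial (N - F + 2P)/S + 1, depth K."""
    H, W, _ = shape
    return ((H - F + 2 * P) // S + 1, (W - F + 2 * P) // S + 1, K)

def pool(shape, F=2, S=2):
    """Output shape of a pool layer: spatial (N - F)/S + 1, depth unchanged."""
    H, W, C = shape
    return ((H - F) // S + 1, (W - F) // S + 1, C)

shape = (32, 32, 3)            # CIFAR-10-sized input
for _ in range(2):             # M = 2 blocks of [(CONV-RELU)*N - POOL]
    for _ in range(2):         # N = 2 conv layers per block
        shape = conv(shape)
    shape = pool(shape)
print(shape)                   # (8, 8, 32)
flat = shape[0] * shape[1] * shape[2]
print(flat)                    # 2048 inputs to the FC-RELU / softmax layers
```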