1 of 66

Convolutional Neural Networks

Dinesh K. Vishwakarma, Ph.D.

1

4/6/2022

Dr. Dinesh Kumar Vishwakarma, Professor, Department of Information Technology, Delhi Technological University, Delhi

Email: dinesh@dtu.ac.in

Web page: http://www.dtu.ac.in/Web/Departments/InformationTechnology/faculty/dkvishwakarma.php

Biometric Research Laboratory

http://www.dtu.ac.in/Web/Departments/InformationTechnology/lab_and_infra/bml/

2 of 66

What is CNN?


  • Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm.
  • It can take an input image, assign importance (learnable weights and biases) to various aspects/objects in the image, and differentiate one from another.
  • The pre-processing required in a ConvNet is much lower as compared to other classification algorithms.
  • While in primitive methods filters are hand-engineered, with enough training, CNNs have the ability to learn these filters/characteristics.
  • The architecture of a CNN is analogous to that of the connectivity pattern of Neurons in the Human Brain and was inspired by the organization of the Visual Cortex.

3 of 66

A bit of history...


 

This image by Rocky Acosta is licensed under CC-BY 3.0

4 of 66

A bit of history...


Rumelhart et al., 1986: First time back-propagation became popular

recognizable math

Illustration of Rumelhart et al., 1986 by Lane McIntosh

5 of 66

A bit of history...


[Hinton and Salakhutdinov 2006]

Reinvigorated research in Deep Learning

Illustration of Hinton and Salakhutdinov 2006 by Lane McIntosh

6 of 66

First Strong Results


  • Acoustic Modeling using Deep Belief Networks, Abdel-rahman Mohamed, George Dahl, Geoffrey Hinton, 2010
  • Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, George Dahl, Dong Yu, Li Deng, Alex Acero, 2012.
  • Imagenet classification with deep convolutional neural networks, Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton, 2012

Illustration of Dahl et al. 2012 by Lane McIntosh, copyright CS231n 2017

Figures copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

7 of 66

A bit of history:


  • LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.

LeNet-5

Implemented on CPU

8 of 66

A bit of history…


ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky, Sutskever, Hinton, 2012], NIPS. Citations: 59,454 (as of 02.04.2020)

  • It is a deeper and much wider version of LeNet, and it won the difficult ImageNet competition by a large margin.
  • AlexNet scaled LeNet up into a much larger network that can learn more complex objects and object hierarchies.
  • Used NVIDIA GTX 580 GPUs to reduce training time.

“AlexNet”

The success of AlexNet sparked a small revolution.

9 of 66


Fast-forward to today: ConvNets are everywhere

Classification Retrieval

Figures copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

10 of 66

Fast-forward to today: ConvNets are everywhere


Figures copyright Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, 2015. Reproduced with permission.

[Faster R-CNN: Ren, He, Girshick, Sun 2015]

Detection Segmentation

[Farabet et al., 2012]

Figures copyright Clement Farabet, 2012. Reproduced with permission.

11 of 66

Fast-forward to today: ConvNets are everywhere


NVIDIA Tesla line

(these are the GPUs on rye01.stanford.edu)

Note that for embedded systems a typical setup would involve NVIDIA Tegras, with integrated GPU and ARM-based CPU cores.

self-driving cars

Photo by Lane McIntosh. Copyright CS231n 2017.

This image by GBPublic_PR is licensed under CC-BY 2.0

12 of 66

Fast-forward to today: ConvNets are everywhere


[Taigman et al. 2014]

[Simonyan et al. 2014]

Figures copyright Simonyan et al., 2014. Reproduced with permission.

Activations of inception-v3 architecture [Szegedy et al. 2015] to image of Emma McIntosh, used with permission. Figure and architecture not from Taigman et al. 2014.

Illustration by Lane McIntosh, photos of Katie Cumnock used with permission.

13 of 66

Fast-forward to today: ConvNets are everywhere


[Toshev, Szegedy 2014]

[Guo et al. 2014]

Images are examples of pose estimation, not actually from Toshev & Szegedy 2014. Copyright Lane McIntosh.

Figures copyright Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard Lewis, and Xiaoshi Wang, 2014. Reproduced with permission.

14 of 66

Fast-forward to today: ConvNets are everywhere


[Levy et al. 2016]

[Sermanet et al. 2011] [Ciresan et al.]

Photos by Lane McIntosh. Copyright CS231n 2017.

[Dieleman et al. 2014]

From left to right: public domain by NASA, usage permitted by ESA/Hubble, public domain by NASA, and public domain.

Figure copyright Levy et al. 2016. Reproduced with permission.

15 of 66

Image Captioning

[Vinyals et al., 2015] [Karpathy and Fei-Fei, 2015]

Example captions generated by Justin Johnson using Neuraltalk2 (rated from "no errors" to "minor errors" to "somewhat related"):

  • A white teddy bear sitting in the grass
  • A man riding a wave on top of a surfboard
  • A man in a baseball uniform throwing a ball
  • A cat sitting on a suitcase on the floor
  • A woman is holding a cat in her hand
  • A woman standing on a beach holding a surfboard

16 of 66


CNN Fundamentals

17 of 66

Fundamental Architecture of CNN


18 of 66

Fundamental Architecture of CNN


19 of 66

Why CNN not Feedforward Neural Network?

 


20 of 66

Convolution Layer


32x32x3 image (width 32, height 32, depth 3) -> preserve spatial structure

21 of 66

Convolution operation


22 of 66

Convolution operation…


23 of 66

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products".

24 of 66

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products".

Filters always extend the full depth of the input volume.

25 of 66

Convolution Layer


32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias).

26 of 66

Convolution Layer

32x32x3 image, 5x5x3 filter -> convolve (slide) over all spatial locations -> 28x28x1 activation map
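This sliding dot-product can be sketched in a few lines of numpy (a minimal illustration with random data; the function name and code are mine, not from the slides):

```python
import numpy as np

def conv2d_single_filter(image, filt, bias=0.0):
    """Slide one filter over the image (stride 1, no padding), taking a
    dot product + bias at each spatial location -> one activation map."""
    H, W, D = image.shape              # e.g. 32 x 32 x 3
    F = filt.shape[0]                  # filter is F x F x D, e.g. 5 x 5 x 3
    out = np.empty((H - F + 1, W - F + 1))
    for i in range(H - F + 1):
        for j in range(W - F + 1):
            chunk = image[i:i+F, j:j+F, :]           # 5x5x3 chunk of the image
            out[i, j] = np.sum(chunk * filt) + bias  # 75-dimensional dot product
    return out

image = np.random.rand(32, 32, 3)
filt = np.random.rand(5, 5, 3)
print(conv2d_single_filter(image, filt).shape)  # (28, 28)
```

Stacking the maps from several such filters along the depth axis gives the multi-channel output volume described on the following slides.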

27 of 66

consider a second, green filter

Convolution Layer

32x32x3 image, a second (green) 5x5x3 filter -> convolve (slide) over all spatial locations -> a second 28x28x1 activation map

28 of 66

For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps:


We stack these up to get a “new image” of size 28x28x6!

29 of 66

Preview: ConvNet is a sequence of Convolution Layers, interspersed with activation functions

32x32x3 -> [CONV, ReLU, e.g. 6 5x5x3 filters] -> 28x28x6

30 of 66

Preview: ConvNet is a sequence of Convolutional Layers, interspersed with activation functions

32x32x3 -> [CONV, ReLU, e.g. 6 5x5x3 filters] -> 28x28x6 -> [CONV, ReLU, e.g. 10 5x5x6 filters] -> 24x24x10 -> [CONV, ReLU] -> ….

31 of 66

Preview


[Zeiler and Fergus 2013]

Visualization of VGG-16 by Lane McIntosh. VGG-16 architecture from [Simonyan and Zisserman 2014].

32 of 66

one filter => one activation map


example 5x5 filters (32 total)

We call the layer convolutional because it is related to convolution of two signals:

elementwise multiplication and sum of a filter and the signal (image)

Figure copyright Andrej Karpathy.

33 of 66

A closer look at spatial dimensions:

32x32x3 image, 5x5x3 filter -> convolve (slide) over all spatial locations -> 28x28x1 activation map

34 of 66

A closer look at spatial dimensions:

7x7 input (spatially), assume 3x3 filter

=> 5x5 output

39 of 66

A closer look at spatial dimensions:

7x7 input (spatially), assume 3x3 filter applied with stride 2

=> 3x3 output!

42 of 66

A closer look at spatial dimensions:

7x7 input (spatially), assume 3x3 filter applied with stride 3?

doesn’t fit! Cannot apply a 3x3 filter on a 7x7 input with stride 3.

44 of 66


Output size:

(N - F) / stride + 1

e.g. N = 7, F = 3:

stride 1 => (7 - 3)/1 + 1 = 5

stride 2 => (7 - 3)/2 + 1 = 3

stride 3 => (7 - 3)/3 + 1 = 2.33 :\
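The rule above translates directly into a small helper (function name mine; a sketch, not code from the slides):

```python
def output_size(N, F, stride):
    """Spatial output size for an N x N input and F x F filter, no padding."""
    if (N - F) % stride != 0:
        raise ValueError("filter doesn't fit cleanly with this stride")
    return (N - F) // stride + 1

print(output_size(7, 3, 1))  # 5
print(output_size(7, 3, 2))  # 3
# output_size(7, 3, 3) raises ValueError: (7 - 3)/3 + 1 is not an integer
```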

45 of 66

In practice: Common to zero pad the border


e.g. input 7x7

3x3 filter, applied with stride 1

pad with 1 pixel border => what is the output?

(recall:)

(N - F) / stride + 1


47 of 66

In practice: Common to zero pad the border


e.g. input 7x7

3x3 filter, applied with stride 1

pad with 1 pixel border => what is the output?

7x7 output!

in general, common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2. (will preserve size spatially)

e.g. F = 3 => zero pad with 1; F = 5 => zero pad with 2; F = 7 => zero pad with 3
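With zero padding P on each border, the rule becomes (N - F + 2P)/S + 1. A quick check (helper name mine) that P = (F - 1)/2 at stride 1 preserves spatial size:

```python
def padded_output_size(N, F, S, P):
    """Spatial output size with zero padding of P pixels on each border."""
    return (N - F + 2 * P) // S + 1

for F in (3, 5, 7):
    pad = (F - 1) // 2          # F=3 -> pad 1, F=5 -> pad 2, F=7 -> pad 3
    print(F, pad, padded_output_size(7, F, 1, pad))  # output stays 7
```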

0

0

0

0

0

0

0

0

0

0

48 of 66

Remember back to…


E.g. 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially! (32 -> 28 -> 24 ...). Shrinking too fast is not good, doesn’t work well.

32x32x3 -> [CONV, ReLU, e.g. 6 5x5x3 filters] -> 28x28x6 -> [CONV, ReLU, e.g. 10 5x5x6 filters] -> 24x24x10 -> [CONV, ReLU] -> ….

49 of 66


Examples time

  • Input volume: 32x32x3
  • 10 5x5 filters with stride 1, pad 2
  • Output volume size: ?

50 of 66


Examples time:

Input volume: 32x32x3

10 5x5 filters with stride 1, pad 2

Output volume size:

(32+2*2-5)/1+1 = 32 spatially, so

32x32x10

51 of 66


Examples time:

Input volume: 32x32x3

10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?

52 of 66

Examples time:


Input volume: 32x32x3

10 5x5 filters with stride 1, pad 2

Number of parameters in this layer? Each filter has 5*5*3 + 1 = 76 params (the +1 is the bias), so 76*10 = 760 in total.
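The same count in code (helper name mine):

```python
def conv_layer_params(F, depth, num_filters):
    """Each filter has F*F*depth weights plus one bias."""
    return (F * F * depth + 1) * num_filters

print(conv_layer_params(5, 3, 10))  # 760
```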

53 of 66

To summarize, the Conv Layer:

  • Accepts a volume of size W1×H1×D1
  • Requires four hyperparameters:
    • Number of filters K,
    • their spatial extent F,
    • the stride S,
    • the amount of zero padding P.
  • Produces a volume of size W2×H2×D2 where:
    • W2=(W1−F+2P)/S+1
    • H2=(H1−F+2P)/S+1 (i.e. width and height are computed equally by symmetry)
    • D2=K
  • With parameter sharing, it introduces F⋅F⋅D1 weights per filter, for a total of (F⋅F⋅D1)⋅K weights and K biases.
  • In the output volume, the d-th depth slice (of size W2×H2 ) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offset by d-th bias.

Common settings:

K = (powers of 2, e.g. 32, 64, 128, 512)

- F = 3, S = 1, P = 1

- F = 5, S = 1, P = 2

- F = 5, S = 2, P = ? (whatever fits)

- F = 1, S = 1, P = 0

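The summary above can be collected into one function (a sketch; the function name is mine):

```python
def conv_layer_shape(W1, H1, D1, K, F, S, P):
    """Output volume and learnable parameter count of a CONV layer."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K                             # one activation map per filter
    n_params = (F * F * D1) * K + K    # shared weights + K biases
    return (W2, H2, D2), n_params

print(conv_layer_shape(32, 32, 3, K=10, F=5, S=1, P=2))  # ((32, 32, 10), 760)
```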

54 of 66

(btw, 1x1 convolution layers make perfect sense)


56x56x64 input -> 1x1 CONV with 32 filters -> 56x56x32 output

(each filter has size 1x1x64, and performs a 64-dimensional dot product)
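A 1x1 convolution is just a per-location dot product over depth; a minimal numpy sketch with random data (code mine, shapes from the slide):

```python
import numpy as np

x = np.random.rand(56, 56, 64)     # input volume
filters = np.random.rand(32, 64)   # 32 filters, each of size 1x1x64

# at every spatial location, each filter takes a 64-dimensional dot product:
out = np.einsum('hwd,kd->hwk', x, filters)
print(out.shape)  # (56, 56, 32)
```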

55 of 66

The brain/neuron view of CONV Layer

32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and this part of the image (i.e. 5*5*3 = 75-dimensional dot product).

It’s just a neuron with local connectivity...

57 of 66

The brain/neuron view of CONV Layer

An activation map is a 28x28 sheet of neuron outputs:

  1. Each is connected to a small region in the input
  2. All of them share parameters

“5x5 filter” -> “5x5 receptive field for each neuron”

58 of 66

The brain/neuron view of CONV Layer

E.g. with 5 filters, the CONV layer consists of neurons arranged in a 3D grid (28x28x5).

There will be 5 different neurons all looking at the same region in the input volume.

59 of 66

Pooling Layer


  • Makes the representations smaller and more manageable
  • Operates over each activation map independently.
  • Pooling layer is responsible for reducing the spatial size of the Convolved Feature.
  • This is to decrease the computational power required to process the data through dimensionality reduction.
  • Furthermore, it is useful for extracting dominant features that are rotationally and positionally invariant, thus helping the model train effectively.
  • Pooling is of two types: Max and Average.

60 of 66

MAX and AVERAGE POOLING


  • Max Pooling returns the maximum value from the portion of the image covered by the Kernel.
  • Average Pooling returns the average of all the values from the portion of the image covered by the Kernel.
  • Max Pooling also acts as a noise suppressant: it discards noisy activations altogether, performing de-noising along with dimensionality reduction.
  • Average Pooling, on the other hand, performs dimensionality reduction while only smoothing noise. In practice, Max Pooling often works better than Average Pooling.
  • The Conv Layer and the Pooling Layer together form the i-th layer of a CNN. More such layers can be stacked to capture finer details, at the cost of higher computational complexity.

Max and Average pool with 2x2 filters and stride 2
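A minimal numpy sketch of 2x2 max/average pooling with stride 2 (illustrative code of mine, not from the slides; the input values are made up):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Pool one activation map with a size x size window (assumes size == stride
    and spatial dims divisible by the stride)."""
    H, W = x.shape
    out = np.empty((H // stride, W // stride))
    for i in range(0, H - size + 1, stride):
        for j in range(0, W - size + 1, stride):
            win = x[i:i+size, j:j+size]
            out[i // stride, j // stride] = win.max() if mode == "max" else win.mean()
    return out

x = np.array([[1., 1., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(pool2d(x, mode="max"))  # [[6. 8.]
                              #  [3. 4.]]
```

Note how each map is pooled independently, and the spatial size halves while the depth (number of maps) is unchanged.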

61 of 66

Fully Connected Layer


Flatten the final output and feed it to a regular NN for classification purposes

62 of 66

Fully Connected Layer


32x32x3 image -> stretch to a 3072 x 1 input vector

Weights W: 10 x 3072; activation: 10 x 1

1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product)

Each neuron looks at the full input volume
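The FC computation is a single matrix-vector product; a minimal numpy sketch with random weights (code mine):

```python
import numpy as np

x = np.random.rand(32, 32, 3).reshape(3072)  # stretch 32x32x3 image to 3072 x 1
W = np.random.rand(10, 3072)                 # 10 neurons, each seeing all 3072 inputs
b = np.random.rand(10)

activation = W @ x + b   # each entry is a 3072-dimensional dot product
print(activation.shape)  # (10,)
```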

63 of 66

Fully Connected Layer (FC layer)


  • Contains neurons that connect to the entire input volume, as in ordinary Neural Networks

64 of 66

[ConvNetJS demo: training on CIFAR-10]


65 of 66

Summary


  • ConvNets stack CONV,POOL,FC layers
  • Trend towards smaller filters and deeper architectures
  • Trend towards getting rid of POOL/FC layers (just CONV)
  • Typical architectures look like

[(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX

where N is usually up to ~5, M is large, 0 <= K <= 2.

- but recent advances such as ResNet/GoogLeNet challenge this paradigm

66 of 66

Reference

  • Fei-Fei Li & Justin Johnson & Serena Yeung, Lecture Series
  • https://towardsdatascience.com/
