1 of 52

Deep Learning with TensorFlow

Lecture 5. Convolutional Neural Networks

2 of 52

Lecture 5

Convolutional Neural Nets

3 of 52

Agenda

  • What CNNs Can Do
  • Why CNNs
  • Convolutional Neural Network Basics
  • Convolutional Filters and Weight-Sharing
  • Basic CNN Architecture
  • Data Augmentation

4 of 52

What Can CNNs Do?

  • Image Classification
  • Object Detection
  • Neural Style Transfer
  • Face Recognition
  • Image Synthesis

5 of 52

Image Classification

6 of 52

Object Detection

7 of 52

Neural Style Transfer

8 of 52

Neural Style Transfer

9 of 52

Face Recognition

10 of 52

Why CNNs?

11 of 52

Different lighting, contrast, viewpoints

12 of 52

Number of Parameters

  • A 1000×1000 pixel image with three channels gives 3 million input features.
  • If the first fully connected layer has 1000 neurons, a network with just that one layer would already have approximately 3 billion parameters.
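As a rough check of the arithmetic: 1000 × 1000 pixels × 3 channels = 3,000,000 input features, and a fully connected layer of 1000 neurons on top of them needs 3,000,000 × 1000 = 3,000,000,000 ≈ 3 billion weights (plus 1000 biases).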

13 of 52

Traditional Approaches

14 of 52

Traditional Approaches

15 of 52

CNN Basics

16 of 52

CNN Principles

  • Translation invariance - Our network should respond similarly to the same patch, regardless of where it appears in the image.
  • Locality principle - Our network should focus on local regions, without regard for the contents of the image in distant regions.

17 of 52

What Is a CNN?

  • The convolutional neural network (CNN) is a specialized feedforward neural network designed to process multi-dimensional data, e.g., images
  • A CNN is a special case of neural networks which uses convolution instead of full matrix multiplication in the hidden layers
  • A CNN architecture is typically comprised of convolutional layers, pooling (subsampling) layers, and fully-connected layers (LeNet-5)

18 of 52

What is a Convolution?

  • In each convolutional layer of a CNN, a convolution operation with a predefined filter size and stride is applied to the input. Each such filter is called a "kernel" and is somewhat equivalent to a "neuron" in an MLP
  • An activation function is applied after each convolution to produce the output
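A minimal Keras sketch of one such layer (the filter count, kernel size, and input shape below are illustrative choices, not taken from the slides):

    import tensorflow as tf

    # One convolutional layer: 32 filters (kernels) of size 3x3,
    # each followed by a ReLU activation.
    conv = tf.keras.layers.Conv2D(filters=32, kernel_size=3, activation="relu")

    x = tf.random.normal((1, 28, 28, 3))    # a batch with one 28x28 RGB image
    y = conv(x)
    print(y.shape)                          # (1, 26, 26, 32): one feature map per filter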

19 of 52

Convolution Operation

  • A single unit of a convolutional layer is only connected to a small receptive field of its input, where the weights of its connections define a filter
  • The convolution operation is used to slide the filter across the input, producing activations at each receptive field

20 of 52

Convolution Operation

  • The shaded portions are the first output element and the input and kernel array elements used in its computation:

0×0+1×1+3×2+4×3=19
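A minimal NumPy sketch of this computation, assuming the 3×3 input [[0,1,2],[3,4,5],[6,7,8]] and 2×2 kernel [[0,1],[2,3]] that produce the value shown:

    import numpy as np

    def corr2d(X, K):
        # Slide kernel K over X and sum the elementwise products at each position.
        h, w = K.shape
        Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
        for i in range(Y.shape[0]):
            for j in range(Y.shape[1]):
                Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
        return Y

    X = np.arange(9).reshape(3, 3)   # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    K = np.arange(4).reshape(2, 2)   # [[0, 1], [2, 3]]
    print(corr2d(X, K))              # [[19. 25.] [37. 43.]] -- first element is 19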

21 of 52

Parameters Defining Convolution

Convolutions are defined by two key parameters:

  • Size of the patches (filters) extracted from the inputs. These are typically 3×3 (most frequently used) or 5×5
  • Depth of the output feature map - The number of filters computed by the convolution

22 of 52

Main Concepts Behind Convolutional Neural Networks

  • Sparse-connectivity: A single element in the feature map is connected to only a small patch of pixels. (This is very different from connecting to the whole input image, in the case of multilayer perceptrons.)
  • Parameter-sharing: The same weights are used for different patches of the input image.
  • Many layers: combining extracted local patterns into global patterns

23 of 52

Weight Sharing

  • A "feature detector" (filter, kernel) slides over the inputs to generate a feature map
  • Rationale: A feature detector that works well in one region may also work well in another region
  • A big reduction in parameters to fit
  • Multiple "feature detectors" (kernels) are used to create multiple feature maps

24 of 52

Edge Detection Example
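A minimal sketch of a vertical edge detector (the image and kernel values here are illustrative, not the ones from the slide):

    import tensorflow as tf

    # A 6x6 single-channel image: bright (1.0) on the left, dark (0.0) on the right.
    img = tf.reshape(tf.constant([[1., 1., 1., 0., 0., 0.]] * 6), (1, 6, 6, 1))

    # Hand-crafted vertical edge detection kernel.
    k = tf.reshape(tf.constant([[1., 0., -1.],
                                [1., 0., -1.],
                                [1., 0., -1.]]), (3, 3, 1, 1))

    edges = tf.nn.conv2d(img, k, strides=1, padding="VALID")
    print(tf.squeeze(edges))   # non-zero only in the columns where the edge lies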

25 of 52

How Do We Choose the Weights of the Filters?

We use Gradient Descent and Backpropagation to find the optimal weights of the different filters.
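A toy illustration of this idea (not from the slides): a single 3×3 kernel learned by gradient descent to reproduce the output of a hand-crafted edge detector.

    import tensorflow as tf

    # Target activations: the output of the hand-crafted edge detector above.
    img = tf.reshape(tf.constant([[1., 1., 1., 0., 0., 0.]] * 6), (1, 6, 6, 1))
    true_k = tf.reshape(tf.constant([[1., 0., -1.]] * 3), (3, 3, 1, 1))
    target = tf.nn.conv2d(img, true_k, strides=1, padding="VALID")

    # Learn a 3x3 kernel from scratch by minimizing the squared error.
    k = tf.Variable(tf.random.normal((3, 3, 1, 1), stddev=0.1))
    opt = tf.keras.optimizers.SGD(learning_rate=0.05)
    for step in range(200):
        with tf.GradientTape() as tape:
            pred = tf.nn.conv2d(img, k, strides=1, padding="VALID")
            loss = tf.reduce_mean((pred - target) ** 2)
        opt.apply_gradients([(tape.gradient(loss, k), k)])
    print(float(loss))   # close to zero: the learned kernel reproduces the target maps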

26 of 52

Padding

  • Padding can increase the height and width of the output.
  • Padding works by adding an appropriate number of rows and columns on each side of the input feature map
  • This is often used to give the output the same height and width as the input to avoid undesirable shrinkage of the output
  • Padding ensures that all pixels are used equally frequently.
  • Typically we pick symmetric padding on both sides of the input height and width.

27 of 52

Padding

28 of 52

Padding

Some padding terminologies:

  • "valid" padding: no padding
  • "same" padding: padding so that the output dimension is the same as the input
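A quick shape check of the two modes in Keras (the input size and filter count are illustrative):

    import tensorflow as tf

    x = tf.random.normal((1, 28, 28, 3))
    valid = tf.keras.layers.Conv2D(8, 3, padding="valid")(x)
    same = tf.keras.layers.Conv2D(8, 3, padding="same")(x)
    print(valid.shape)   # (1, 26, 26, 8): no padding, output shrinks by kernel_size - 1
    print(same.shape)    # (1, 28, 28, 8): zero-padding keeps height and width unchanged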

29 of 52

Stride

  • The distance between two successive windows is a parameter of the convolution, called its stride, which defaults to 1
  • It is possible to have strided convolutions: convolutions with a stride higher than 1.
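A strided convolution in Keras, downsampling by a factor of about 2 (sizes illustrative):

    import tensorflow as tf

    x = tf.random.normal((1, 28, 28, 3))
    y = tf.keras.layers.Conv2D(8, kernel_size=3, strides=2, padding="same")(x)
    print(y.shape)   # (1, 14, 14, 8): stride 2 halves the spatial dimensions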

30 of 52

Stride of 2 Example

31 of 52

Pooling

  • Pooling is an exceedingly simple operation: it does exactly what its name indicates, aggregating results over a window of values
  • A big difference from convolution is that max-pooling is usually done with 2×2 windows and stride 2, to downsample the feature maps by a factor of 2
  • Pooling does not have any learnable parameters
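Max-pooling with a 2×2 window and stride 2 in Keras (input size illustrative):

    import tensorflow as tf

    x = tf.random.normal((1, 28, 28, 8))
    y = tf.keras.layers.MaxPool2D(pool_size=2, strides=2)(x)
    print(y.shape)   # (1, 14, 14, 8): downsampled by 2, with no learnable parameters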

32 of 52

Pooling

33 of 52

Size Before and After Convolutions

34 of 52

Convolutions with Color Channels

35 of 52

Convolutions with Color Channels

Convolving a 6×6×3 input with a 3×3×3 filter gives a 4×4 2D output
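The same case checked in Keras, using a single filter so that the output has one channel:

    import tensorflow as tf

    x = tf.random.normal((1, 6, 6, 3))                       # a 6x6 RGB input
    y = tf.keras.layers.Conv2D(filters=1, kernel_size=3)(x)  # one 3x3x3 filter
    print(y.shape)                                           # (1, 4, 4, 1): a 4x4 2D output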

36 of 52

Convolutions with Color Channels

37 of 52

Convolution Operation with Multiple Filters

  • Multiple filters can be used in a convolution layer to detect multiple features.
  • The output of the layer then will have the same number of channels as the number of filters in the layer.

38 of 52

Convolution Operation with Multiple Filters

39 of 52

One Convolution Layer

Finally, to make up a convolution layer, a bias is added and an activation function such as ReLU or tanh is applied.

40 of 52

One Convolution Layer: Number of Parameters

  • If you have 10 filters of size 3×3×3, how many parameters does this layer have?
  • Does the number of parameters depend on the input size?

41 of 52

One Convolution Layer: Number of Parameters

If you have 10 filters of size 3×3×3, how many parameters does this layer have?

Weights per filter = 3 × 3 × 3 = 27

Bias per filter = 1

Total parameters per filter = 28

Total parameters for 10 filters = 10 × 28 = 280

The count depends only on the filter size and the number of filters, not on the input size.
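The same count can be verified with Keras (the 32×32 input size below is an arbitrary assumption needed only to build the layer):

    import tensorflow as tf

    layer = tf.keras.layers.Conv2D(filters=10, kernel_size=3)
    layer.build(input_shape=(None, 32, 32, 3))   # 3 input channels -> each filter is 3x3x3
    print(layer.count_params())                  # 280 = 10 * (27 weights + 1 bias)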

42 of 52

Basic CNN Architecture

43 of 52

LeNet CNN Architecture

Handwritten digit recognition, developed by Yann LeCun in the 1990s.

At a high level, LeNet (LeNet-5) consists of two parts: (i) a convolutional encoder consisting of two convolutional layers; and (ii) a dense block consisting of three fully connected layers.

The basic units in each convolutional block are a convolutional layer, a sigmoid activation function, and a subsequent average pooling operation.

Note that while ReLUs and max-pooling work better, these discoveries had not yet been made at the time.

Each convolutional layer uses a 5×5 kernel.
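A Keras sketch along these lines (layer sizes follow the classic LeNet-5 description; exact details vary between presentations, and a 28×28 grayscale input is assumed):

    import tensorflow as tf

    # LeNet-5 style model: two conv + average-pooling blocks, then three dense layers.
    lenet = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(6, kernel_size=5, activation="sigmoid", padding="same"),
        tf.keras.layers.AvgPool2D(pool_size=2, strides=2),
        tf.keras.layers.Conv2D(16, kernel_size=5, activation="sigmoid"),
        tf.keras.layers.AvgPool2D(pool_size=2, strides=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120, activation="sigmoid"),
        tf.keras.layers.Dense(84, activation="sigmoid"),
        tf.keras.layers.Dense(10),   # one output per digit class
    ])
    lenet.summary()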

44 of 52

Hidden Layers

45 of 52

Le Net

46 of 52

Le Net

47 of 52

Illustrated Example

48 of 52

Data Augmentation

49 of 52

Data Augmentation

  • Image augmentation generates random images based on existing training data to improve the generalization ability of models.
  • In order to obtain deterministic results at prediction time, we usually apply image augmentation only to training examples and do not use random augmentation during prediction.
  • Deep learning frameworks provide many different image augmentation methods, which can be applied simultaneously.
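A minimal sketch using Keras preprocessing layers (the specific transforms and parameters are illustrative):

    import tensorflow as tf

    # Random transforms are applied only when training=True; at prediction time
    # the layers pass images through unchanged.
    augment = tf.keras.Sequential([
        tf.keras.layers.RandomFlip("horizontal"),
        tf.keras.layers.RandomRotation(0.1),
        tf.keras.layers.RandomZoom(0.1),
    ])

    x = tf.random.normal((1, 32, 32, 3))
    print(augment(x, training=True).shape)    # randomly transformed copy of the batch
    print(augment(x, training=False).shape)   # unchanged during prediction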

50 of 52

Data Augmentation

51 of 52

Questions?

52 of 52

References