
Convolutional Neural Networks

TJ Machine Learning


CNNs over Neural Networks

  • CNNs train weights and biases similarly to standard neural networks
  • CNNs convolve filters (kernels) over the input to extract information
    • Filter values are updated during backpropagation
  • Plain NNs flatten images into a single vector
    • Works fine for small, low-information images
    • Weights and biases are updated during backpropagation


Conceptual Idea of a Convolutional Neural Network

  • In an image, pixels that are close together are related
  • Pixels that are far away from each other are not likely to be related
  • In a fully connected network, the network must learn which pixels are related to which others

  • By using kernels and the convolution operation, we effectively define many receptive fields (collections of nearby pixels) from which the network can more easily learn spatial relations in the image


What is a Convolution?

  • The convolution operation has widespread applications in scientific and engineering disciplines such as signal processing, acoustics, and differential equations
  • For our purposes, a convolution is computed by sliding a kernel over an image and, at each position, summing the element-wise products of the kernel and the image patch it covers
    • Note that there is a lot more mathematical theory behind convolutions

[Figure: a kernel (dark blue) slides over an image (light blue); the output is the feature map]


Why use Convolutional Layers?

  • Take advantage of spatial structuring in an image
    • Pixels close together are more correlated than pixels far apart
  • By sliding a kernel across an image, we consider only small regions of the image at a time
  • We address the problem of translations
    • Translation invariance (more precisely, equivariance): a kernel will respond to the feature it is looking for wherever that feature appears in the image


Why use Convolutional Layers?

  • Convolutional layers can handle larger input sizes
    • A 224x224x3 image would mean an input layer of about 150,000 neurons. Adding a single fully-connected hidden layer of half that size would require over 11 billion parameters (see the quick check below)
  • Using kernels is what dramatically reduces the number of weights needed from layer to layer
  • The 152-layer deep ResNet architecture, which runs on 224x224x3 input images, has roughly 60 million parameters
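A quick back-of-the-envelope check of those numbers in Python (a rough sketch; the "half that size" hidden layer is taken literally here):

    # Parameter count of one fully-connected hidden layer on a 224x224x3 image.
    input_neurons = 224 * 224 * 3                  # 150,528 input values
    hidden_neurons = input_neurons // 2            # a hidden layer of "half that size"
    dense_params = input_neurons * hidden_neurons + hidden_neurons  # weights + biases
    print(f"{dense_params:,}")                     # 11,329,414,656 -- over 11 billion

    # A single 3x3 kernel over the same RGB image needs only 3*3*3 weights + 1 bias,
    # no matter how large the image is.
    print(3 * 3 * 3 + 1)                           # 28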


Example of Convolution

[Figure: Image matrix * Kernel matrix = ?]


Example of Convolution

[Figure: Image * Kernel = 142, the first entry of the feature map]

This calculation was done using a stride length of one. We will go deeper into this in the next lecture

Yellow Entry = 4x1 + 2x2 + 0x3 + 3x4 + 2x5 + 3x6 + 2x7 + 1x8 + 8x9 = 142


Example of Convolution

[Figure: Image * Kernel = the full 2x2 feature map]

142  189
225  230

These calculations were done using a stride length of one. We will go deeper into this later

Blue Entry = 2x1 + 0x2 + 0x3 + 2x4 + 3x5 + 8x6 + 1x7 + 8x8 + 5x9 = 189

Green Entry = 3x1 + 2x2 + 3x3 + 2x4 + 1x5 + 8x6 + 4x7 + 6x8 + 8x9 = 225

Red Entry = 2x1 + 3x2 + 8x3 + 1x4 + 8x5 + 5x6 + 6x7 + 8x8 + 2x9 = 230
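As a sanity check, here is a short NumPy sketch of this stride-1 convolution. The 4x4 image and 3x3 kernel below are reconstructed from the worked entries above, so treat them as illustrative:

    import numpy as np

    # Image and kernel reconstructed from the worked entries above.
    image = np.array([[4, 2, 0, 0],
                      [3, 2, 3, 8],
                      [2, 1, 8, 5],
                      [4, 6, 8, 2]])
    kernel = np.array([[1, 2, 3],
                       [4, 5, 6],
                       [7, 8, 9]])

    # Slide a 3x3 window over the image with stride 1 and, at each position,
    # sum the element-wise products of the window and the kernel.
    out_h = image.shape[0] - kernel.shape[0] + 1
    out_w = image.shape[1] - kernel.shape[1] + 1
    feature_map = np.zeros((out_h, out_w), dtype=int)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + 3, j:j + 3]
            feature_map[i, j] = np.sum(patch * kernel)

    print(feature_map)   # [[142 189]
                         #  [225 230]]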


Hand-Designed Kernels

You can also think of a kernel as identifying the areas of an image that are most similar to it (e.g., the Sobel kernel being used for edge detection)

A box-blur kernel blurs an image by replacing each pixel value with the average of that pixel and the pixels around it
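For illustration (not part of the slides), a minimal NumPy/SciPy sketch that applies a 3x3 box-blur kernel and a Sobel kernel to a small random grayscale image:

    import numpy as np
    from scipy.signal import correlate2d  # CNN-style "convolution" (no kernel flip)

    img = np.random.rand(8, 8)            # a small grayscale image, values made up

    # Box blur: replace each pixel with the average of itself and its 8 neighbors.
    blur = np.full((3, 3), 1 / 9)

    # Sobel kernel: responds strongly to horizontal changes in intensity (vertical edges).
    sobel_x = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]])

    blurred = correlate2d(img, blur, mode="same")
    edges = correlate2d(img, sobel_x, mode="same")
    print(blurred.shape, edges.shape)      # (8, 8) (8, 8)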


Another Example

[Figure: another example of a hand-designed kernel applied to an image]


Kernels in a Convolutional Neural Network

  • In a CNN, we don’t use hand-designed kernels. We treat kernel values the same way we treat weight values in a traditional neural network and learn them using backpropagation
    • After we perform a convolution operation, we also apply a nonlinearity function like the ReLU function (just like in traditional neural networks)
  • We can apply multiple kernels to a single layer. We just stack the resulting matrices on top of each other.

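A minimal PyTorch sketch of this idea, assuming PyTorch is available; the number of kernels and the image size are arbitrary choices:

    import torch
    import torch.nn as nn

    # One convolutional layer with 8 learnable 3x3 kernels over a 1-channel image,
    # followed by the ReLU nonlinearity. The kernel values start random and are
    # updated by backpropagation, just like weights in a dense network.
    conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)
    relu = nn.ReLU()

    x = torch.randn(1, 1, 28, 28)     # (batch, channels, height, width)
    features = relu(conv(x))
    print(features.shape)             # torch.Size([1, 8, 26, 26]) -- 8 stacked feature maps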


Convolution over Multiple Channels

  • If our input “image” has multiple channels (e.g., it is RGB, or it is the stack of feature maps produced by multiple kernels), our kernels must have the same number of channels as the input

[Figure: a 3-channel (3D) image convolved with a 3-channel (3D) kernel produces a single 2D feature map]
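A quick shape check of the same idea, again a PyTorch sketch with made-up sizes:

    import torch
    import torch.nn as nn

    rgb_image = torch.randn(1, 3, 32, 32)   # a 3-channel (RGB) input

    # Each kernel must also have 3 channels; it still produces a single 2D
    # feature map because its per-channel results are summed together.
    one_kernel = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)
    print(one_kernel(rgb_image).shape)       # torch.Size([1, 1, 30, 30])
    print(one_kernel.weight.shape)           # torch.Size([1, 3, 3, 3]) -- (out, in, h, w)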


Full CNN Structure

  • A feature map is the output of a kernel convolved over an image
  • A convolutional layer is simply multiple kernels run over the same input
    • Number of feature maps is determined by number of kernels
  • For images that have multiple channels (28,28,3 RGB image), we can stack kernels to make a filter
  • By feeding each convolutional layer the outputs of the prior layer, we structure our CNN so it can combine features into progressively more abstract ones


Activation Function

  • How will our convolutional layers learn what to detect?
  • Standard Neural Network activation functions:
    • Sigmoid
    • Tanh
  • Vanishing Gradient Problem:
    • Gradients become negligibly small when inputs are very large or very small
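A tiny illustration (not from the slides) of why sigmoid saturates: its gradient peaks at 0.25 and collapses for large inputs.

    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1 - s)

    # The gradient is largest at 0 and becomes negligibly small for large |x|,
    # which is what stalls learning in deep sigmoid/tanh networks.
    for x in [0.0, 2.0, 5.0, 10.0]:
        print(x, sigmoid_grad(x))
    # 0.0   0.25
    # 2.0   ~0.105
    # 5.0   ~0.0066
    # 10.0  ~0.000045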


ReLU

  • For CNNs, the standard activation function is ReLU: Rectified Linear Unit


ReLU

  • The slope of ReLU is one for any positive input and zero for negative inputs, so gradients of active units do not vanish
  • ReLU is applied directly to the output feature map
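ReLU itself is just max(0, x) applied element-wise; a one-line NumPy sketch with made-up feature-map values:

    import numpy as np

    def relu(x):
        # max(0, x), applied element-wise to the feature map
        return np.maximum(0, x)

    feature_map = np.array([[-3.0, 1.5],
                            [ 0.2, -0.7]])
    print(relu(feature_map))   # [[0.  1.5]
                               #  [0.2 0. ]]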


Pooling Layers

  • To further reduce computational time, we add pooling layers
  • These layers pool together values that are close together
  • The most common type of pooling is max pooling, where the highest value within each region is selected and passed on, reducing the size of the feature map
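A minimal sketch of 2x2 max pooling with stride 2 (the feature-map values are made up):

    import numpy as np

    feature_map = np.array([[1, 3, 2, 0],
                            [5, 6, 1, 2],
                            [7, 2, 4, 8],
                            [1, 0, 3, 4]])

    # 2x2 max pooling with stride 2: keep the largest value in each 2x2 block,
    # halving the height and width of the feature map.
    pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print(pooled)   # [[6 2]
                    #  [7 8]]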


Fully-Connected Tail

We understand now that convolutional layers produce stacks of feature maps, which, after some additional processing with ReLU and max pooling layers, are fed as inputs to successive convolutional layers


Full CNN Structure

How do we turn a stack of feature maps into a prediction?

  • Turn back to the fully-connected layer
  • The CNN should have a good sense of the features that exist in the image
  • Dimensionality has been significantly reduced


Flattening

[Figure: the final stack of feature maps is flattened into a single vector before the fully-connected layers]
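A minimal end-to-end sketch in PyTorch of the structure described above; all layer sizes are arbitrary illustrative choices, not values from the slides:

    import torch
    import torch.nn as nn

    # The conv/pool body extracts features; Flatten turns the final stack of
    # feature maps into one long vector for the fully-connected head.
    model = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 26x26 -> 13x13
        nn.Conv2d(8, 16, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),  # 13x13 -> 11x11 -> 5x5
        nn.Flatten(),                                                 # 16 * 5 * 5 = 400 values
        nn.Linear(16 * 5 * 5, 10),                                    # 10 class scores
    )

    x = torch.randn(1, 1, 28, 28)
    print(model(x).shape)   # torch.Size([1, 10])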


Softmax

  • Another type of activation function
  • Used at the end of the dense (fully-connected) layers
  • Converts the numerical outputs of a neural network into probabilities, with higher numbers translating into higher probabilities
    • Sigmoid is a two-class form of softmax
  • We use softmax layers for categorical data
    • To turn a categorical label into the desired training output of the CNN, we create an array with as many entries as we have classes, set the entry for the correct class to 1, and set all other entries to 0
    • This is called one-hot encoding
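A small NumPy sketch of softmax and one-hot encoding (the three-class logits are made up):

    import numpy as np

    def softmax(z):
        # Subtract the max for numerical stability, then normalize.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.1])   # raw network outputs for 3 classes
    print(softmax(logits))               # [0.659 0.242 0.099] -- sums to 1

    # One-hot encoding of the correct label (class 1 of 3)
    label = 1
    one_hot = np.eye(3)[label]
    print(one_hot)                       # [0. 1. 0.]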


CNN Hyperparameters

Basic hyperparameters to tune in a CNN:

  • Kernel Size
  • Number of Kernels
  • Stride
  • Padding


Kernel Size

  • Changing the kernel size impacts the computational cost in forward and backward propagation and the scale of the features learned.
  • Smaller kernels learn smaller patterns whereas larger kernels are more difficult to train but can extract more spatial information.
  • Typically, odd numbers are used for kernel sizes so that at each step the center is a specific pixel and not an area between pixels.
    • Common sizes are 3x3, 5x5


Number of Kernels

  • In CNN models there are often many kernels per convolutional layer
    • 16, 32, 64, or 128 kernels are most commonly used for a single layer
  • Each convolution kernel acts as a different filter, creating a channel/feature map that represents something different
    • Kernels could be filtering for top edges, bottom edges, diagonal lines, and so on
    • In much deeper networks, kernels could be filtering for higher-level features such as eyes or bird wings
  • A higher number of convolutional kernels creates more channels/feature maps, meaning more data flowing through the network and more memory use


Stride Length

  • In the examples above, we slid the kernel over every possible position in the input
  • This means our kernel had a stride length of 1: at every step we move the kernel by one pixel
  • However, two kernel locations may overlap so much that calculating both is repetitive
    • This is especially true for larger kernels
    • Unnecessary computations

Solution: Increase stride length
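A quick check of how the stride changes the output size, using the standard formula output = (input + 2*padding - kernel) // stride + 1:

    def conv_output_size(input_size, kernel_size, stride=1, padding=0):
        # Spatial size of a convolution's output along one dimension.
        return (input_size + 2 * padding - kernel_size) // stride + 1

    print(conv_output_size(7, 3, stride=1))    # 5 -- the kernel visits every position
    print(conv_output_size(7, 3, stride=2))    # 3 -- overlapping positions are skipped
    print(conv_output_size(224, 3, stride=2))  # 111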


[Figure: the same kernel slid over an image with stride length 1 vs. stride length 2]


Zero Padding

  • If a kernel is centered on a bordering pixel, part of it hangs off the edge of the image, so valuable edge information can be missed

  • To fix this, we use a technique called zero padding: we surround the image with a border of zeros so the kernel can be centered on every original pixel
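A small NumPy sketch of zero padding, using the top-left 3x3 patch from the earlier worked example:

    import numpy as np

    image = np.array([[4, 2, 0],
                      [3, 2, 3],
                      [2, 1, 8]])

    # Zero padding: add a border of zeros so a 3x3 kernel can be centered on
    # every original pixel, keeping the output the same size as the input.
    padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
    print(padded)
    # [[0 0 0 0 0]
    #  [0 4 2 0 0]
    #  [0 3 2 3 0]
    #  [0 2 1 8 0]
    #  [0 0 0 0 0]]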


Transfer Learning

  • A machine learning technique where a model developed for one task is used as the starting point for another task
  • Since CNNs are generally more generic in early layers and become more task-specific towards the end, this can help reduce computational time and increase accuracy
  • A common approach is to start from a model pretrained on the ImageNet 1000-class image database, so the model already knows how to distinguish between many classes
    • ResNet, EfficientNet, Inception, etc.
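A minimal transfer-learning sketch with torchvision (assuming it is installed); the frozen backbone and the 10-class head are illustrative choices, not prescribed by the slides:

    import torch.nn as nn
    from torchvision import models

    # Start from a ResNet-18 pretrained on ImageNet (1000 classes).
    # The weights argument requires a recent torchvision version.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Freeze the early, generic feature-extracting layers...
    for param in model.parameters():
        param.requires_grad = False

    # ...and replace the final fully-connected layer with a new head sized
    # for our own (hypothetical) 10-class task. Only this new layer's
    # parameters will be updated during training.
    model.fc = nn.Linear(model.fc.in_features, 10)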
