1 of 113

Artificial intelligence for quality control with active infrared thermography
Introduction to Deep Learning

ING-IND/14, 2 CFU

Roberto Marani - April 26th, 2023


Introduction to machine learning

Definition by Tom Mitchell (1998):

Machine Learning is the study of algorithms that:

  • improve their performance P
  • at some task T
  • with experience E

A well-defined learning task is given by <P,T,E>.

“Learning is any process by which a system improves performance from experience” (Herbert Simon)


Introduction to machine learning

Supervised learning

Unsupervised learning


Evaluation


Computer Vision Tasks


No-Free-Lunch Theorem

  • Wolpert (2002) - The Supervised Learning No-Free-Lunch Theorems
  • The derived classification models for supervised learning are simplifications of reality
    • The simplifications are based on certain assumptions
    • The assumptions fail in some situations
      • E.g., due to the inability to perfectly estimate ML model parameters from limited data

  • In summary, the No-Free-Lunch Theorem states:
    • No single classifier works best for all possible problems
    • Since we need to make assumptions to generalize


Evaluation

Performance on test data is a good indicator of generalization

The test accuracy is more important than the training accuracy


Use case

Inspection of a calibrated plate of GFRP (315 × 290)

  • In-depth defects: Ø 7.85, Ø 14.1, Ø 20.2; hole depths 15.7, 12.4, 9.82, 7.08, 4.35
  • Surface defects: Ø 17.44, Ø 13.3, Ø 9.54, Ø 8.3, Ø 7.85
  • Sound region


Outlines

  • Machine learning vs deep learning
  • Neural networks
  • Training Neural Networks
    • Loss Function
    • Optimization
    • Regularization
    • Searching for the best
    • Architectures
      • For time series
      • For fixed-size data


Machine learning vs deep learning

  • Conventional machine learning methods rely on human-designed feature representations
    • ML then reduces to optimizing weights to produce the best final prediction


Machine learning vs deep learning

  • Deep learning (DL) is a machine learning subfield that uses multiple layers for learning data representations
    • DL is exceptionally effective at learning patterns


Machine learning vs deep learning

  • DL applies a multi-layer process for learning rich hierarchical features (i.e., data representations)
    • Input image pixels → Edges → Textures → Parts → Objects

Low-Level Features

Mid-Level Features

Output

High-Level Features

Trainable Classifier


Why is deep learning useful?

  • DL provides a flexible, learnable framework for representing visual, textual, and linguistic information
    • Can learn in a supervised or unsupervised manner
  • DL represents an effective end-to-end learning system
  • Requires large amounts of training data
  • Since about 2010, DL has outperformed other ML techniques
    • First in vision and speech, then NLP, and other applications


DL Frameworks



Neural Networks

  • Handwritten digit recognition (MNIST dataset)
    • The intensity of each pixel is considered an input element
    • Output is the class of the digit

Input

16 x 16 = 256

……

……

y1

y2

y10

Each dimension represents the confidence of a digit

is 1

is 2

is 0

……

0.1

0.7

0.2

The image is “2”

Output


Neural Networks

  • Handwritten digit recognition: the trained network acts as a function (the “Machine”) that maps the input image to the output scores y1, …, y10 and returns “2”

Neural Networks

  • NNs consist of hidden layers with neurons (i.e., computational units)
  • A single neuron maps a set of inputs into an output number, 𝑓: ℝᴷ → ℝ:
    z = w₁x₁ + ⋯ + w_K x_K + b, output a = f(z)
    where w₁, …, w_K are the weights, b is the bias, and f is the activation function


Neural Networks

  • A NN with one hidden layer and one output layer

 

 

 

 

Weights

Biases

Activation functions

4 + 2 = 6 neurons (not counting inputs)

[3 × 4] + [4 × 2] = 20 weights

4 + 2 = 6 biases

26 learnable parameters
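As a sketch, the counting rule above (weights are products of consecutive layer sizes, one bias per non-input neuron) can be automated; the helper name `count_params` and the `[3, 4, 2]` layer-size list are illustrative choices, not from the slides:

```python
def count_params(layer_sizes):
    """Learnable parameters of a fully-connected NN; layer_sizes lists the
    number of neurons per layer, inputs first (e.g. [3, 4, 2])."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])            # one bias per non-input neuron
    return weights, biases

print(count_params([3, 4, 2]))  # (20, 6) -> 26 learnable parameters
```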


Deep Neural Networks

  • Deep NNs have many hidden layers
    • Fully-connected (dense) layers (Multi-Layer Perceptron or MLP)
    • Each neuron is connected to all neurons in the succeeding layer
    • It can be expressed in a matrix form

Output Layer

Hidden Layers

Input Layer

Input

Output

Layer 1

……

……

Layer 2

……

Layer L

……

……

……

……

……

y1

y2

yM


Deep Neural Networks

Example

Using sigmoid activations: input (1, −1), first-layer weights (1, −2; −1, 1) and biases (1, 0) give z = (1·1 + (−1)·(−2) + 1, 1·(−1) + (−1)·1 + 0) = (4, −2) → σ(z) ≈ (0.98, 0.12)

Propagating (0.98, 0.12) through the remaining layers yields (0.86, 0.11) and finally (0.62, 0.83): the network maps the input (1, −1) to the output (0.62, 0.83)

Deep Neural Networks

Matrix operation in multilayer NN

x → a¹ → a² → ⋯ → y

a¹ = σ(W¹x + b¹)
a² = σ(W²a¹ + b²)
⋯
y = σ(Wᴸaᴸ⁻¹ + bᴸ)

In nested form: y = f(x) = σ(Wᴸ ⋯ σ(W²σ(W¹x + b¹) + b²) ⋯ + bᴸ)
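The layer-by-layer matrix recipe can be sketched in plain Python (lists standing in for vectors and matrices; the function names are illustrative, and the printed values reuse the slides' example weights):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, b, x):
    # One dense layer: a = sigma(W x + b); W is a list of rows, one per neuron.
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def forward(params, x):
    # params = [(W1, b1), (W2, b2), ..., (WL, bL)]: apply the layers in order.
    a = x
    for W, b in params:
        a = layer(W, b, a)
    return a

# The slides' example: input (1, -1), weights (1, -2; -1, 1), biases (1, 0).
print([round(v, 2) for v in layer([[1, -2], [-1, 1]], [1, 0], [1, -1])])  # [0.98, 0.12]
```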


Classification layer

  • In multi-class classification tasks, the output layer is typically a softmax layer
    • I.e., it employs a softmax activation function
    • If a layer with a sigmoid activation function is used as the output layer instead, the predictions by the NN may not be easy to interpret
      • Note that an output layer with sigmoid activations can still be used for binary classification

  • Softmax outputs lie in the range [0,1] and sum to 1, so they can be read as class probabilities

A layer with sigmoid activations: z = (3, −3, 1) → (0.95, 0.05, 0.73) — the outputs do not sum to 1

A softmax layer: z = (3, −3, 1) → eᶻ = (20, 0.05, 2.7) → normalized: (0.88, ≈0, 0.12)
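The sigmoid-vs-softmax comparison can be reproduced numerically; a small sketch (the function names are my own):

```python
import math

def sigmoid_layer(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def softmax(z):
    m = max(z)                              # subtract max(z) for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

z = [3, -3, 1]
print([round(v, 2) for v in sigmoid_layer(z)])  # [0.95, 0.05, 0.73] -- does not sum to 1
print([round(v, 2) for v in softmax(z)])        # [0.88, 0.0, 0.12]  -- sums to 1
```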


Activation functions

  • Non-linear activations are needed to learn complex (non-linear) data representations
    • Otherwise, stacked linear layers collapse into a single linear function (e.g., W₂(W₁x) = Wx)
    • NNs with many layers (and neurons) can approximate more complex functions
      • Figure: more neurons improve the representation (but may overfit)


Activation functions

Sigmoid function σ

  • It takes a real-valued number and “squashes” it into the range between 0 and 1
    • The output can be interpreted as the firing rate of a biological neuron
      • Not firing = 0; Fully firing = 1
    • When the neuron’s activation is 0 or 1, sigmoid neurons saturate
      • Gradients at these regions are almost zero (almost no signal will flow)
    • Sigmoid activations are less common in modern NNs

σ(x) = 1 / (1 + e⁻ˣ)


Activation functions

Tanh function:

  • It takes a real-valued number and “squashes” it into a range between -1 and 1
  • Like sigmoid, tanh neurons saturate
  • Unlike sigmoid, the output is zero-centered
    • It is therefore preferred over sigmoid
  • Tanh is a scaled sigmoid: tanh(x) = 2·σ(2x) − 1

tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)


Activation functions

ReLU (Rectified Linear Unit):

  • It takes a real-valued number and thresholds it at zero: f(x) = max(0,x)
  • Most modern deep NNs use ReLU activations
    • ReLU is fast to compute compared to sigmoid and tanh
    • Accelerates the convergence of gradient descent
      • Due to linear, non-saturating form
    • Mitigates the vanishing gradient problem (the gradient is 1 for all positive inputs)

 

 

 

However, ReLU can cause “dying” units: the weights may update in a way that a neuron’s gradient becomes zero and the neuron never activates again on any input


Activation functions

Leaky ReLU activation

  • It is a variant of ReLU
    • Instead of being 0 for 𝑥 < 0, a leaky ReLU has a small slope in the negative region: f(x) = x for x ≥ 0, f(x) = αx for x < 0 (e.g., α = 0.01, or similar)
    • This resolves the dying ReLU problem
      • Most current works still use ReLU
      • With a proper setting of the learning rate, the problem of dying ReLU can be avoided

 


Activation functions

Linear function

  • The output signal is proportional to the input signal to the neuron
    • If the value of the constant c is 1, it is also called identity activation function
    • This activation type is used in regression problems
      • E.g., the last layer can have linear activation function, in order to output a real number (and not a class membership)
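The activation functions discussed above can be written down directly; a minimal sketch (the helper names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))       # squashes into (0, 1)

def tanh(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0     # tanh as a scaled sigmoid, range (-1, 1)

def relu(x):
    return max(0.0, x)                      # thresholds at zero

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x        # small slope for negative inputs

def linear(x, c=1.0):
    return c * x                            # identity activation when c = 1
```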

 

 



Training Neural Networks

Training a network means determining the parameters of each of its layers, given a specific architecture

  • The network parameters 𝜃 include the weight matrices and bias vectors from all layers

    • Often, the model parameters 𝜃 are referred to as weights

  • Training a model to learn a set of parameters 𝜃 that are optimal (according to a criterion) is one of the greatest challenges in ML

 


Training Neural Networks

Data Preprocessing

It is a fundamental task to help training in reaching convergence

  • Mean subtraction
    • Subtract the mean for each individual data dimension (feature) to obtain zero-centered data
  • Normalization
    • Divide each feature by its standard deviation
      • To obtain a standard deviation of 1 for each data dimension (feature)
    • Or, scale the data within the range [0,1] or [-1, 1]
      • E.g., image pixel intensities are divided by 255 to be scaled in the [0,1] range
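A minimal sketch of mean subtraction plus normalization, assuming a small list-of-lists dataset (the helper name `standardize` is illustrative):

```python
def standardize(data):
    """Mean subtraction + normalization: zero-center every feature and
    divide it by its (population) standard deviation; data is a list of rows."""
    n, dims = len(data), len(data[0])
    means = [sum(row[d] for row in data) / n for d in range(dims)]
    stds = [(sum((row[d] - means[d]) ** 2 for row in data) / n) ** 0.5
            for d in range(dims)]
    return [[(row[d] - means[d]) / stds[d] for d in range(dims)] for row in data]
```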


Training Neural Networks

To train a network it is necessary to define a loss function (objective or cost function)

  • ℒ(𝜃) calculates the difference (error) between the model prediction and the true label
  • E.g., ℒ(𝜃) can be a mean-squared error, a cross-entropy value, etc.

E.g., prediction scores (y1, y2, y3, …) = (0.2, 0.3, 0.5, …) are compared against the one-hot target (1, 0, 0, …) for the true label “1”: the cost measures their distance

Training Neural Networks

Training formalization

For a training set of N images:

    • Calculate the total loss over all images: ℒ(𝜃) = Σₙ₌₁ᴺ ℓₙ(𝜃), where ℓₙ(𝜃) is the loss on the n-th image

    • Find the optimal parameters 𝜃 that minimize the total loss ℒ(𝜃)

Each input x₁, x₂, …, x_N is passed through the same NN to obtain y₁, y₂, …, y_N, and the per-example losses ℓₙ are accumulated.
Which function can work best?



Loss function for classification

Training examples

Pairs of 𝑁 inputs 𝑥𝑖 and ground-truth class labels 𝑦𝑖

Output layer

Softmax activations (to map to a probability)

Loss function

Cross-entropy

ℒ(𝜃) = −(1/N) Σᵢ Σₖ yᵢₖ log ŷᵢₖ

where yᵢₖ are the GT labels, ŷᵢₖ the model-predicted probabilities, i = 1, …, N indexes the examples, and k the classes
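A sketch of the cross-entropy computation for one-hot GT labels and softmax outputs (the `eps` term, added here only to avoid log(0), is an implementation detail, not from the slides):

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy: y_true holds one-hot GT labels (N x K),
    y_pred the softmax probabilities (N x K); eps guards against log(0)."""
    total = 0.0
    for yi, pi in zip(y_true, y_pred):
        total -= sum(t * math.log(p + eps) for t, p in zip(yi, pi))
    return total / len(y_true)

# A confident correct prediction costs little; a 50/50 guess costs log 2.
print(round(cross_entropy([[1, 0]], [[0.5, 0.5]]), 4))  # 0.6931
```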


Loss function for regression

Training examples

Pairs of 𝑁 inputs 𝑥𝑖 and ground-truth output values 𝑦𝑖

Output layer

Linear or sigmoid activation

Loss function

Mean Squared Error: ℒ(𝜃) = (1/N) Σᵢ (yᵢ − ŷᵢ)²

Mean Absolute Error: ℒ(𝜃) = (1/N) Σᵢ |yᵢ − ŷᵢ|



Optimizing the loss function

Almost all DL models these days are trained with a variant of the gradient descent (GD) algorithm

  • GD applies iterative refinement of the network parameters 𝜃
  • GD updates 𝜃 in the direction opposite to the gradient of the loss with respect to the NN parameters, 𝛻ℒ(𝜃) = [∂ℒ/∂𝜃ᵢ]
    • The gradient 𝛻ℒ(𝜃) gives the direction of the fastest increase of the loss function ℒ(𝜃) when the parameters 𝜃 are changed

 

 

 


Gradient Descent Algorithm

  1. Randomly initialize the model parameters 𝜃₀
  2. Compute the gradient of the loss function at the current parameters: 𝛻ℒ(𝜃₀)
  3. Update the parameters: 𝜃_new = 𝜃₀ − α𝛻ℒ(𝜃₀)
    • Where α is the learning rate
  4. Go to step 2 and repeat (until a terminating criterion is reached)
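Steps 1–4 can be sketched as a loop; a toy example on a convex quadratic loss, assuming the gradient function is supplied by hand (in practice it comes from backpropagation):

```python
def gradient_descent(grad, theta0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Start from theta0 and repeatedly move against the gradient
    until it (almost) vanishes."""
    theta = list(theta0)
    for _ in range(max_iter):
        g = grad(theta)                                      # step 2
        theta = [t - alpha * gi for t, gi in zip(theta, g)]  # step 3
        if max(abs(gi) for gi in g) < tol:                   # terminating criterion
            break
    return theta

# Toy loss L(a, b) = (a - 3)^2 + b^2, whose gradient is (2(a - 3), 2b).
opt = gradient_descent(lambda th: [2 * (th[0] - 3), 2 * th[1]], [0.0, 5.0])
print([round(v, 4) for v in opt])  # [3.0, 0.0]
```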

 

 

 

 

 

 


Gradient Descent Algorithm

Gradient descent algorithm stops when a local minimum of the loss is reached

    • GD does not guarantee reaching a global minimum
    • Empirical evidence suggests that GD works well for NNs

 

 


Gradient Descent Algorithm

For most tasks, the loss function ℒ(𝜃) is highly complex (and non-convex)

    • Random initialization in NNs results in different initial parameters 𝜃0 every time the NN is trained
      • Gradient descent may reach different minima at every run
      • Therefore, NN will produce different predicted outputs
    • No algorithm can guarantee reaching a global minimum for an arbitrary loss function

 

 

 


Backpropagation

Modern NNs employ the backpropagation (“backward propagation”) method for calculating the gradients of the loss function 𝛻ℒ(𝜃) = [∂ℒ/∂𝜃ᵢ]

  • For training NNs, forward propagation (forward pass) refers to passing the inputs 𝑥 through the hidden layers to obtain the model outputs (predictions) 𝑦
    • The loss ℒ(y, ŷ) is then calculated

  • Backpropagation traverses the network in reverse order, from the outputs 𝑦 backward toward the inputs 𝑥 to calculate the gradients of the loss 𝛻ℒ(𝜃)
    • The chain rule is used for calculating the partial derivatives of the loss function with respect to the parameters 𝜃 in the different layers of the network

  • Automatic calculation of the gradients (automatic differentiation) is available in all current deep learning libraries to simplify the network implementation
    • No need to derive the partial derivatives of the loss function by hand


GD optimization

Mini-batch gradient descent

The loss is computed on small batches of the training dataset (it is wasteful to process the full training set for a single parameter update)

  • Mini-batch GD results in much faster training
  • It works because the gradient from a mini-batch is a good approximation of the gradient from the entire training set

Approach

  1. Compute the loss ℒ(𝜃) on a mini-batch of images, update the parameters 𝜃, and repeat until all images are used
  2. At the next epoch, shuffle the training data, and repeat the above process

Stochastic GD 🡪 A mini-batch has the size of a single example

  • Less used as it can lead to huge fluctuations in the loss function at each step
  • SGD typically refers to GD applied to mini-batches of inputs
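The mini-batch bookkeeping (shuffle once per epoch, then slice) can be sketched as a generator; the names and the seeding scheme are illustrative:

```python
import random

def minibatches(data, batch_size, seed=0):
    """Shuffle the example indices, then yield consecutive slices:
    every example is used exactly once per epoch."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)   # reshuffle each epoch with a new seed
    for start in range(0, len(idx), batch_size):
        yield [data[i] for i in idx[start:start + batch_size]]

batches = list(minibatches(list(range(10)), batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```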


GD optimization

The GD algorithm can be very slow at plateaus, and it can get stuck at saddle points

  • Very slow at a plateau
  • Stuck at a local minimum
  • Stuck at a saddle point


GD with Momentum

Gradient descent with momentum uses the momentum of the gradient for parameter optimization

Movement = Negative of Gradient + Momentum

Even where the gradient is 0 (e.g., a plateau or local minimum), the accumulated momentum keeps the parameters moving: the real movement combines the negative of the gradient with the momentum term


GD with Momentum

The GD with Momentum updates the parameters 𝜃 in the direction of the weighted average of the past gradients

At iteration 𝑡

  • Standard GD: 𝜃ₜ = 𝜃ₜ₋₁ − α𝛻ℒ(𝜃ₜ₋₁)
    • Where 𝜃ₜ₋₁ are the parameters from the previous iteration 𝑡−1

  • GDM: 𝜃ₜ = 𝜃ₜ₋₁ − Vₜ
    • Where: Vₜ = βVₜ₋₁ + α𝛻ℒ(𝜃ₜ₋₁)
    • Vₜ is called momentum
      • It accumulates the gradients from the past several steps
      • This term is analogous to the momentum of a heavy ball rolling down the hill
    • β is referred to as the coefficient of momentum
      • A typical value of β is 0.9
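One GDM update, written as the two formulas above (scalar-per-parameter lists; the function name is illustrative):

```python
def gdm_step(theta, velocity, grad, alpha=0.01, beta=0.9):
    """One GD-with-momentum update:
    V_t = beta * V_{t-1} + alpha * grad;  theta_t = theta_{t-1} - V_t."""
    velocity = [beta * v + alpha * g for v, g in zip(velocity, grad)]
    theta = [t - v for t, v in zip(theta, velocity)]
    return theta, velocity

theta, V = gdm_step([1.0], [0.0], grad=[2.0])
print(theta, V)  # [0.98] [0.02]
```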


GD with Nesterov Accelerated Momentum

  • Update term: 𝜃ₜ = 𝜃ₜ₋₁ − Vₜ
    • Where: Vₜ = βVₜ₋₁ + α𝛻ℒ(𝜃ₜ₋₁ − βVₜ₋₁)
    • The look-ahead point 𝜃ₜ₋₁ − βVₜ₋₁ predicts where the parameters will move next, so the gradient is evaluated there

 

GD with momentum

GD with Nesterov momentum


Adaptive Moment Estimation (Adam)
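The slide's figure cannot be reproduced here, but the standard Adam update (Kingma & Ba, 2015) can be sketched: it keeps a momentum-like average m of the gradient and an average v of its square, corrects both for initialization bias, and scales the step per parameter. The defaults below follow the common convention (α = 0.001, β₁ = 0.9, β₂ = 0.999):

```python
import math

def adam_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]   # bias-corrected estimates
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [th - alpha * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v
```

Note the adaptive behaviour: on the very first step the parameter moves by roughly α regardless of the raw gradient magnitude.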

 


Optimizer comparison

Animation from: https://imgur.com/s25RsOr


Learning rate

  • The gradient tells us the direction in which the loss has the steepest rate of increase, but it does not tell us how far along the opposite direction we should step
  • Choosing the learning rate (also called the step size) is one of the most important hyper-parameter settings for NN training

LR too small

LR too large


Learning rate

  • Training with different learning rates can result in different loss values:
    • High learning rate: the loss may increase (diverge) or plateau quickly at a high value
    • Low learning rate: the loss decreases too slowly (takes many epochs to reach a solution)


Scheduling the learning rate

Learning rate scheduling is applied to change the values of the learning rate during the training

  • Annealing: reducing the learning rate over time
    • Approach 1: reduce the learning rate by some factor every few epochs
      • Typical values: reduce the learning rate by a half every 5 epochs, or divide by 10 every 20 epochs
    • Approach 2: exponential or cosine decay gradually reduce the learning rate over time
    • Approach 3: reduce the learning rate by a constant (e.g., by half) whenever the validation loss stops improving
  • Warmup:
    1. Gradually increase the learning rate at the start of the training
    2. Then let the learning rate cool down (decay) until the end of the training
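Approach 1 (step decay) is one line of arithmetic; a sketch with illustrative defaults (halve every 5 epochs):

```python
def step_decay(lr0, epoch, drop=0.5, every=5):
    """Reduce the learning rate by a factor `drop` every `every` epochs."""
    return lr0 * drop ** (epoch // every)

print([step_decay(0.1, e) for e in (0, 4, 5, 10)])  # [0.1, 0.1, 0.05, 0.025]
```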

Decay profiles: exponential, cosine, and warmup



Regularization

Regularization is a set of techniques to:

Prevent overfitting 🡪 Improve accuracy when facing new data

Underfitting

  • The model is too “simple” to represent all the relevant class characteristics
  • E.g., model with too few parameters
  • Produces high error on the training set and high error on the validation set

Overfitting

  • The model is too “complex” and fits irrelevant characteristics (noise) in the data
  • E.g., model with too many parameters
  • Produces low error on the training set and high error on the validation set


Regularization

Overfitting

A model with high capacity fits the noise in the data instead of the underlying relationship


L2 regularization

The loss is augmented with a penalty on the squared magnitude of the weights, ℒ_reg(𝜃) = ℒ(𝜃) + λ‖𝜃‖₂² (a.k.a. weight decay): large weights are discouraged, which limits overfitting


L1 regularization

The penalty is the absolute magnitude of the weights, ℒ_reg(𝜃) = ℒ(𝜃) + λ‖𝜃‖₁: it drives many weights to exactly zero, yielding sparse solutions


Dropout regularization

Randomly drop units (along with their connections) during training

  • Each unit is dropped with a fixed probability p (the dropout rate), independently of the other units
  • The hyper-parameter p needs to be chosen (tuned)
    • Often, between 20% and 50% of the units are dropped
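A sketch of (inverted) dropout, where p is the probability of dropping a unit and survivors are rescaled by 1/(1−p) so the expected activation is unchanged at test time (the rescaling convention is the common modern one, not stated on the slide):

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each unit with probability p during training; scale survivors
    by 1/(1 - p). At test time the layer is the identity."""
    if not training:
        return list(activations)
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]

random.seed(0)
out = dropout([1.0] * 1000, p=0.5)
# roughly half the units are zeroed; the survivors become 2.0
```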


Dropout regularization

This technique, using mini-batches, is similar to ensemble learning

  • Every mini-batch trains a slightly-different network

mini-batch 1

mini-batch 2

mini-batch 3

mini-batch n

……


Early stopping

  • During model training, use a validation set
    • E.g., validation/train ratio of about 25% to 75%
  • Stop when the validation accuracy (or loss) has not improved after n epochs
    • The parameter n is called patience
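The patience rule can be sketched offline on a list of validation losses (the function name and the return convention — the epoch index at which training stops — are my own):

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch at which training stops: the first epoch whose
    best-so-far validation loss has not improved for `patience` epochs."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

# Losses improve until 0.30, then stall: stop 3 epochs after the best one.
stop = early_stopping([0.9, 0.5, 0.3, 0.31, 0.32, 0.33, 0.2], patience=3)
print(stop)  # 5
```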

Figure: training vs. validation loss curves; training stops where the validation curve stops improving



Tuning the hyper-parameters

  • Training NNs can involve setting many hyper-parameters

  • The most common hyper-parameters include:
    • Number of layers, and number of neurons per layer
    • Initial learning rate
    • Learning rate decay schedule (e.g., decay constant)
    • Optimizer type

  • Other hyper-parameters may include:
    • Regularization parameters (ℓ_2 penalty, dropout rate)
    • Batch size
    • Activation functions
    • Loss function

  • Hyper-parameter tuning can be time-consuming for larger NNs
  • Grid search
    • Check all values in a range with a step value
  • Random search
    • Randomly sample values for the parameter
    • Often preferred to grid search
  • Bayesian hyper-parameter optimization


Ensemble Learning

Ensemble learning is training multiple classifiers separately and combining their predictions

  • Ensemble learning often outperforms individual classifiers
  • Better results are obtained with higher model variety in the ensemble

  • Bagging (bootstrap aggregating)
    • Randomly draw subsets from the training set (i.e., bootstrap samples)
    • Train separate classifiers on each subset of the training set
    • Perform classification based on the average vote of all classifiers

  • Boosting
    • Train a classifier, and apply weights on the training set (apply higher weights on misclassified examples, focus on “hard examples”)
    • Train new classifier, reweight training set according to prediction error
    • Repeat
    • Perform classification based on weighted vote of the classifiers


k-fold Cross-Validation

Typically used when the training dataset is small


Batch Normalization

Batch normalization standardizes the activations of a layer over each mini-batch (zero mean, unit variance per feature), then applies a learnable scale γ and shift β; it stabilizes and speeds up training



Architectures

Deep learning models can result from different architectures, depending on:

  • The task
    • Classification
    • Segmentation
    • Regression
  • The domain of the input
    • Still data
      • Complete signals
      • Images
    • Time series
      • Evolving signals
      • Videos

  • The architecture is the structure of the network to be then trained
    • Number of layers
    • Type of layers (with their internal parameters): convolutional, pooling, batchnorm, activation, …
    • Interconnection topology



Working on Time Series

Recurrent NNs are used for modeling sequential data and data with varying length of inputs and outputs

  • Videos, text, speech, DNA sequences, human skeletal data

  • RNNs introduce recurrent connections between the neurons
    • This allows processing sequential data one element at a time by selectively passing information across a sequence
    • Memory of the previous inputs is stored in the model’s internal state and affects the model predictions
    • Can capture correlations in sequential data

  • RNNs use backpropagation-through-time for training
  • RNNs are more sensitive to the vanishing gradient problem than CNNs
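A minimal sketch of the recurrence (scalar hidden state, tanh activation; the weight values are illustrative) showing that the same weights are reused at every step and that input order matters:

```python
import math

def rnn_step(h_prev, x_t, w_h, w_x, b):
    """One recurrent step: h_t = tanh(w_h * h_prev + w_x * x_t + b).
    The hidden state carries memory of all previous inputs."""
    return math.tanh(w_h * h_prev + w_x * x_t + b)

def rnn(xs, w_h=0.5, w_x=1.0, b=0.0, h0=0.0):
    h = h0
    for x in xs:
        h = rnn_step(h, x, w_h, w_x, b)   # same weights reused at every step
    return h

# Input order matters: [1, 0] and [0, 1] give different final states.
print(rnn([1.0, 0.0]) != rnn([0.0, 1.0]))  # True
```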


RNNs

h₀ →(x₁)→ h₁ →(x₂)→ h₂ →(x₃)→ h₃ → OUTPUT

At each step the hidden state is updated from the previous state and the current input, hₜ = f(hₜ₋₁, xₜ); the final state produces the output


RNNs

Application          Input                                  Output
Image Captioning     image                                  “A person riding a motorbike on dirt road”
Sentiment Analysis   “Awesome movie. Highly recommended.”   Positive
Machine Translation  “Happy Diwali”                         “शुभ दीपावली”


Bidirectional RNNs

  • Bidirectional RNNs incorporate both forward and backward passes through sequential data
    • The output may not only depend on the previous elements in the sequence, but also on future elements in the sequence
    • It resembles two RNNs stacked on top of each other
    • The output at each step combines both past and future elements of the sequence


LSTM Networks

Long Short-Term Memory (LSTM) networks are a variant of RNNs

  • LSTM mitigates the vanishing/exploding gradient problem
    • Solution: a Memory Cell, updated at each step in the sequence

  • Three gates control the flow of information to and from the Memory Cell
    • Input Gate: protects the current step from irrelevant inputs
    • Output Gate: prevents the current step from passing irrelevant information to later steps
    • Forget Gate: limits information passed from one cell to the next

  • Most modern RNN models use either LSTM units or other more advanced types of recurrent units (e.g., GRU units)


LSTM Networks

LSTM cell

  • Input gate, output gate, forget gate, memory cell
  • LSTM can learn long-term correlations within data sequences



Working on fixed-size data

  • Convolutional Neural Networks are a type of Feed-Forward Neural Network used in:
    • Image analysis
    • Natural language processing
    • Other complex image classification problems

  • A CNN contains hidden convolutional layers, which form the base of ConvNets

  • Features refer to minute details in the image data like:
    • Edges
    • Borders
    • Shapes
    • Textures
    • Etc.

  • At a higher level, convolutional layers detect these patterns in the image data with the help of filters
    • The lower-level details are captured by the first few convolutional layers

  • The deeper the network goes, the more sophisticated the pattern searching becomes


Convolutional Neural Networks (CNNs)


Convolutional Neural Networks (CNNs)

  • A filter can be thought of as a relatively small matrix for which we decide the number of rows and columns
    • The values of this filter matrix are initialized with random numbers
    • When the convolutional layer receives pixel values of input data, the filter convolves over each patch of the input matrix

  • The output of the convolutional layer is usually passed through the ReLU activation function to bring non-linearity to the model
    • It takes the feature map and replaces all the negative values with zero

  • Pooling is a very important step in the ConvNets
    • It reduces the computation
    • It makes the model tolerant of distortions and variations

  • Finally, a fully connected dense layer takes the flattened feature matrix and produces the prediction required by the use case


Convolutional Neural Networks (CNNs)

Aims behind the use of CNNs

Fully connected networks (and layers) are maybe unnecessarily complex for images; starting from them, we target:

  • Small models
  • Reduced number of weights
  • Shared weights


Convolutional Neural Networks (CNNs)

Why convolutions?

  • Some patterns are much smaller than the image
    • A small region can be represented with fewer parameters (need for a small detector)
  • These patterns can appear anywhere in the image
    • Such small detectors must move around the image

We can define a small beak detector

The beak detector should move


Convolutional Neural Networks (CNNs)

A convolutional layer can match these requisites

  • It is made of filters (filter bank) that do convolutional operations

Acts as a filter

Beak detector


Convolutional Neural Networks (CNNs)

2D convolutions

  • A convolution captures useful features (e.g., edge detector)

A 3×3 filter, e.g. the edge-detecting kernel

0  1  0
1 -4  1
0  1  0

is convolved over the input image to produce the convolved image (edges highlighted)
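A sketch of the 2-D "valid" convolution used in these slides (strictly speaking cross-correlation, as is customary in CNNs), demonstrated on the 6 × 6 binary image and 3 × 3 filter from the stride example:

```python
def conv2d(image, kernel, stride=1):
    """'Valid' 2-D convolution (cross-correlation, as used in CNNs):
    slide the kernel over the image, taking a dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(0, len(image) - kh + 1, stride):
        row = []
        for c in range(0, len(image[0]) - kw + 1, stride):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

img = [[1, 0, 0, 0, 0, 1], [0, 1, 0, 0, 1, 0], [0, 0, 1, 1, 0, 0],
       [1, 0, 0, 0, 1, 0], [0, 1, 0, 0, 1, 0], [0, 0, 1, 0, 1, 0]]
k = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
print(conv2d(img, k)[0])            # [3, -1, -3, -1]
print(conv2d(img, k, stride=2)[0])  # [3, -3]
```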


Convolutional Neural Networks (CNNs)

Stride

  • The stride parameter sets the jump of the kernel at each translation

6 × 6 image:

1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

3 × 3 filter kernel:

 1 -1 -1
-1  1 -1
-1 -1  1

Stride = 1: the dot product of the kernel with the first 3 × 3 patch gives 3; sliding one column to the right gives −1.

Stride = 2: the kernel jumps two columns at a time, so the first two outputs are 3 and −3.


Convolutional Neural Networks (CNNs)

Multiple filters can form a feature map

Convolving Filter 1

 1 -1 -1
-1  1 -1
-1 -1  1

over the 6 × 6 image with stride = 1 produces a 4 × 4 output:

 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

The same filter weights are shared across all positions of the image.


Convolutional Neural Networks (CNNs)

Multiple filters can form a feature map

Convolving Filter 2

-1  1 -1
-1  1 -1
-1  1 -1

over the same image (stride = 1) gives a second 4 × 4 output:

-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

The two 4 × 4 images (a 4 × 4 × 2 matrix) form the feature map.


Convolutional Neural Networks (CNNs)

  • In CNNs, the small regions "seen" by the filters are called local receptive fields
  • The depth of each feature map corresponds to the number of convolutional filters used at each layer
  • The depth of the architecture is given by the number of layers

Figure: input image → Layer 1 feature map (produced by Filter 1 and Filter 2, with weights w1, …, w8) → Layer 2 feature map


Convolutional Neural Networks (CNNs)

Color images are made of 3 channels

  • Each channel has its own filter: for a 3-channel color image, each filter is a 3 × 3 × 3 tensor, with one 3 × 3 kernel per channel
  • The three per-channel responses are summed into a single value of the output feature map

Roberto Marani

Slide 91
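A minimal NumPy sketch of the idea (random values stand in for a real color image; summing the per-channel responses into a single map follows the standard CNN convention):

```python
import numpy as np

def conv2d(image, kernel):
    # valid 2-D cross-correlation of one channel with one 3x3 filter slice
    kh, kw = kernel.shape
    return np.array([[np.sum(image[i:i+kh, j:j+kw] * kernel)
                      for j in range(image.shape[1] - kw + 1)]
                     for i in range(image.shape[0] - kh + 1)])

rng = np.random.default_rng(0)
color_image = rng.integers(0, 2, size=(6, 6, 3))  # 6 x 6 image, 3 channels
filter_1 = rng.integers(-1, 2, size=(3, 3, 3))    # one 3 x 3 slice per channel

# Each channel is convolved with its own filter slice; the per-channel
# responses are summed into one 4 x 4 output map.
response = sum(conv2d(color_image[:, :, c], filter_1[:, :, c]) for c in range(3))
```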

92 of 113

Convolutional Neural Networks (CNNs)

A convolutional layer can be represented as a fully-connected layer

  • The output has to be flattened to have the same representation

[Figure: the 6 × 6 image enters the convolution layer; the resulting feature maps are flattened into a single vector, which is the representation a fully-connected layer works on.]

93 of 113

Convolutional Neural Networks (CNNs)

A convolutional layer can be represented as a fully-connected layer

[Figure: the 6 × 6 image is flattened into 36 inputs numbered 1–36; the first output value (3) connects only to the 9 inputs covered by the 3 × 3 filter, i.e. pixels 1, 2, 3, 7, 8, 9, 13, 14, 15.]

Each output connects to only 9 inputs, not to all of them: fewer parameters than a fully-connected layer.

94 of 113

Convolutional Neural Networks (CNNs)

A convolutional layer can be represented as a fully-connected layer

[Figure: neighboring output values (e.g. 3 and -1) connect to overlapping sets of the 36 inputs, and their connections use the same 9 filter weights.]

The weights (the unknowns) are the same for every position: even fewer parameters to be learned.
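The savings described on the last two slides can be made concrete with plain arithmetic (our own illustration for the 6 × 6 image and 3 × 3 filter):

```python
inputs  = 6 * 6    # flattened 6 x 6 image: 36 inputs
outputs = 4 * 4    # 4 x 4 output map: 16 neurons

fully_connected = inputs * outputs   # every output sees every input
sparse          = 9 * outputs        # each output sees only 9 inputs
shared          = 9                  # weight sharing: just the 9 filter values
```

The counts drop from 576 (fully connected) to 144 (local connections) to 9 (local connections with shared weights).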

95 of 113

Convolutional Neural Networks (CNNs)

  • Convolutional layers increase the size of the data flowing into the network, depending on the number of filters in the bank
  • A pooling layer subsamples the feature maps to reduce the input size

Subsampling

This is a bird even after subsampling

A smaller image (or feature map) needs fewer parameters to be characterized


96 of 113

Convolutional Neural Networks (CNNs)

  • Pooling layers reduce the spatial size of the feature maps
  • Reduce the number of parameters, prevent overfitting

Max pooling

  • Reports the maximum output within a rectangular neighborhood

Average pooling

  • Reports the average output of a rectangular neighborhood

Input matrix (4 × 4):

1 3 5 3
4 2 3 1
3 1 1 3
0 1 0 4

MaxPool with a 2 × 2 filter and a stride of 2 gives the output matrix (2 × 2):

4 5
3 4
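A minimal NumPy sketch reproducing this example (max_pool is our own helper, not a library function):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Report the maximum within each size x size window, moving by `stride`."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    return np.array([[x[i*stride:i*stride+size, j*stride:j*stride+size].max()
                      for j in range(ow)] for i in range(oh)])

x = np.array([[1, 3, 5, 3],
              [4, 2, 3, 1],
              [3, 1, 1, 3],
              [0, 1, 0, 4]])
pooled = max_pool(x)   # [[4, 5], [3, 4]]
```

Replacing `.max()` with `.mean()` turns this into average pooling.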

97 of 113

Convolutional Neural Networks (CNNs)

Example of a feature extraction architecture

  • After 2 convolutional layers, a max-pooling layer reduces the size of the feature maps (typically by a factor of 2)
  • A fully-connected layer and a softmax layer perform the classification
    • The fully-connected layer works on flattened feature vectors
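The softmax stage can be sketched in a few lines of NumPy (a minimal illustration; the logits below are made-up numbers standing in for a real fully-connected output):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

classes = ["Living Room", "Bedroom", "Kitchen", "Bathroom", "Outdoor"]
logits = np.array([2.0, 1.0, 0.5, 0.2, 3.0])   # hypothetical fully-connected output
probs = softmax(logits)                         # probabilities summing to 1
prediction = classes[int(np.argmax(probs))]     # class with the largest logit
```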

[Figure: a VGG-style stack of conv layers (64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512 filters) interleaved with max-pooling layers, followed by a fully-connected layer that classifies the image as Living Room, Bedroom, Kitchen, Bathroom, or Outdoor.]

98 of 113

Convolutional Neural Networks (CNNs)

Be careful with the data size!

Example (conv layers: 3 × 3 kernel, stride = 1; max pool: 2 × 2 pool size, stride = 2):

  • Input color image: 640 × 480 × 3
  • 1st feature map: (640:2) × (480:2) × 64 = 320 × 240 × 64 → 64 × 9 × 3 = 1,728 learnables
  • 2nd feature map: 160 × 120 × 128 → 128 × 64 × 9 = 73,728 learnables
  • 3rd feature map: 80 × 60 × 256 → 256 × 128 × 9 = 294,912 learnables
  • 4th feature map: 40 × 30 × 512 → 512 × 256 × 9 = 1,179,648 learnables
  • Flattened vector: 614,400 × 1 → 614,400 × 5 weights + 1 × 5 biases in the fully-connected layer

Total: ~4.62M learnables

The output of a convolution has size [(Input_Size − Kernel_Size + 2·Padding_Size) / Stride] + 1 along each spatial dimension (with suitable padding, the output size can equal the input size).

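The sizing rule above can be checked with a small helper function (our own, for illustration):

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    # [(Input_Size - Kernel_Size + 2*Padding_Size) / Stride] + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

# 3x3 conv, stride 1, padding 1: the output size equals the input size
same = conv_output_size(640, 3, padding=1, stride=1)

# 2x2 max pooling with stride 2 halves each spatial dimension
pooled = conv_output_size(640, 2, padding=0, stride=2)

# Learnables of the first conv layer: 64 filters x 3x3 kernel x 3 channels
conv1_learnables = 64 * 3 * 3 * 3
```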

99 of 113

Example of CNNs


100 of 113

Example of CNNs

LeNet-5

  • 60,000 parameters
  • Trained on greyscale 32×32 digit images
  • Aim: Recognize digits in the range [0, 9]


101 of 113

Example of CNNs

AlexNet

  • 60 million parameters
  • Tested with the ImageNet dataset (benchmark dataset with 1000 classes)
  • Default AlexNet accepts color images with dimensions 224×224


102 of 113

Example of CNNs

VGG-16

  • 138 million parameters
  • It outputs one of the 1000 classes of the ImageNet dataset
  • Default VGG-16 accepts color images with dimensions 224×224


103 of 113

Example of CNNs

Inception v1

  • First tackled the problem of vanishing/exploding gradients with
    • Two auxiliary classifiers connected to intermediate layers
      • Discarded during testing
      • Contribute to the training (they enter the computation of the loss)
    • Inception modules, which process the input in parallel and then concatenate the outputs of convolutions of different sizes
  • 7 million parameters
  • It outputs one of the 1000 classes of the ImageNet dataset
  • Default Inception accepts color images with dimensions 224×224


104 of 113

Example of CNNs

ResNet-50

  • Tackles the degradation problem
    • As the network depth increases, accuracy gets saturated and then degrades rapidly
    • Mitigate the problem of vanishing gradients during training
  • Solution: bottleneck residual blocks:
    • Identity block: consists of 3 convolution layers with 1×1, 3×3, and 1×1 kernel sizes, all of which are equipped with BN. The ReLU activation function is applied to the first two layers, while the input of the identity block is added to the last layer before applying ReLU.
    • Convolution block: same as identity block, but the input of the convolution block is first passed through a convolution layer with 1×1 kernel size and BN before being added to the last convolution layer of the main series.
  • 26 million parameters
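As a rough conceptual sketch (not the actual ResNet-50 implementation: the 1×1, 3×3, 1×1 convolutions are reduced here to matrix products on a feature vector, and batch normalization is omitted), the identity block computes:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def identity_block(x, w1, w2, w3):
    """Bottleneck identity block, sketched on a 1-D feature vector."""
    out = relu(x @ w1)       # first 1x1 conv + ReLU
    out = relu(out @ w2)     # 3x3 conv + ReLU
    out = out @ w3           # last 1x1 conv (no activation yet)
    return relu(out + x)     # add the skip connection, then apply ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=8)            # input features (8 "channels")
w1 = rng.normal(size=(8, 2))      # bottleneck: 8 -> 2
w2 = rng.normal(size=(2, 2))
w3 = rng.normal(size=(2, 8))      # expand back: 2 -> 8
y = identity_block(x, w1, w2, w3)
```

The skip connection `out + x` is what lets gradients flow directly through deep stacks of such blocks.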


105 of 113

Transfer learning

  • Transfer learning is commonly used in deep learning applications
    • Take a pretrained network and use it as a starting point to learn a new task
    • Fine-tuning a network with transfer learning is usually much faster and easier than training a network from scratch with randomly initialized weights
    • It is possible to quickly transfer learned features to a new task using a smaller number of training images


106 of 113

Deconvolutional Neural Networks (DNNs)

  • Deconvolutional Neural Networks are CNNs that work in reverse.

  • To return to the original size, DNNs use upsampling and transpose convolutional layers
    • Upsampling has no trainable parameters: it just repeats the rows and columns of the image data by the corresponding factors
    • A transpose convolutional layer applies a convolution and an upsampling at the same time
      • It is represented as Conv2DTranspose(number of filters, filter size, stride)
      • Stride = 1 → no upsampling (the output has the same size as the input)
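The parameter-free upsampling described above can be sketched with NumPy's repeat (our own minimal example):

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# Repeat each row, then each column, by a factor of 2: no weights to learn
up = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)
```

Each input value becomes a 2 × 2 block, turning the 2 × 2 input into a 4 × 4 output.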


107 of 113

Encoder – Decoder Architectures

  • The cascade of a CNN and a DNN creates a new (encoded) representation of the input image and then decodes it into a new semantic space
  • Useful for image segmentation


108 of 113

Encoder – Decoder Architectures

  • Encoder:
    • The encoder branch is also known as the backbone
    • Takes an input image and generates a high-dimensional feature vector
    • Aggregates features at multiple levels
    • It can be borrowed from pre-trained classification networks (e.g., AlexNet, VGG-19, …)

  • Decoder:
    • Takes a high-dimensional feature vector and generates a semantic segmentation mask
    • Decodes the features aggregated by the encoder at multiple levels


109 of 113

Encoder – Decoder Architectures

  • Several encoder-decoder architectures exist:
    • DeepLab: The process of encoding and decoding exploits atrous convolutions, which introduce a spacing value within the kernel elements to process dilated areas. Convolutions work on a wider field of view, without increasing the computational cost
    • UNet: Nested connections link the encoder and decoder sub-networks with skip pathways to have semantically similar feature maps on both sides of the network
    • MANet: It introduces channel attention mechanisms to fuse the local feature maps captured by the backbones, with the global channel weights

  • In any case, the encoding structure (also known as backbone) comes from pretrained CNNs
    • VGG
    • ResNet
    • Inception
    • EfficientNet
  • Its weights can be initialized according to the dataset used for pretraining


110 of 113

CNN example

  • Live demo: Lecture_6_CNN_with_Actual_Data.m


111 of 113

  • Current research topics
  • Exam summary

Next steps


112 of 113

Roberto Marani

Researcher

National Research Council of Italy (CNR)

Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing (STIIMA)

via Amendola 122/D-O, 70126 Bari, Italy

+39 080 592 94 58

roberto.marani@stiima.cnr.it

robertomarani.com

cnr.it/people/roberto.marani

stiima.cnr.it/ricercatori/roberto-marani/


113 of 113

Credits

  • Special Topics: Adversarial Machine Learning, Alex Vakanski
  • The Essential Guide to Neural Network Architectures, v7labs.com
  • 5 Popular CNN Architectures Clearly Explained and Visualized - towarddatascience.com
  • Transfer Learning Using Pretrained Network, Mathworks, Matlab Help
  • Encoder-Decoder Networks for Semantic Segmentation, Sachin Mehta
  • Smaller network: CNN, Ming Li
