Artificial intelligence for quality control with active infrared thermography
Introduction to Deep Learning
ING-IND/14, 2 CFU
Roberto Marani - April 26th, 2023
Introduction to machine learning
Definition by Tom Mitchell (1998):
Machine Learning is the study of algorithms that improve their performance P at some task T with experience E.
A well-defined learning task is given by <P, T, E>.
"Learning is any process by which a system improves performance from experience" (Herbert Simon)
Roberto Marani
Slide 2
Introduction to machine learning
Supervised learning
Unsupervised learning
Evaluation
Computer Vision Tasks
No-Free-Lunch Theorem
Evaluation
Performance on test data is a good indicator of generalization
The test accuracy is more important than the training accuracy
Use case
Inspection of a calibrated plate of GFRP (approx. 315 × 290 mm) with flat-bottom holes.
Hole diameters: Ø 7.85, Ø 14.1, Ø 20.2 mm and Ø 17.44, Ø 13.3, Ø 9.54, Ø 8.3, Ø 7.85 mm
Hole depths: 15.7, 12.4, 9.82, 7.08, 4.35 mm
Labeled regions: in-depth defects, surface defects, sound region
Machine learning vs deep learning
Deep networks learn a hierarchy of representations: Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier → Output.
Why is deep learning useful?
DL Frameworks
Neural Networks
Example: handwritten digit recognition. The input is a 16 × 16 image, i.e. 256 values; the output is y1 … y10, where each dimension represents the confidence that the digit is 1, 2, …, 0. For example, with y1 = 0.1, y2 = 0.7, …, y10 = 0.2, the image is classified as "2".
Neural Networks
The network as a whole is a machine that maps the input image to the label "2" through the output scores y1 … y10.
Neural Networks
A single neuron multiplies its inputs by weights, adds a bias, and applies an activation function: output = f(Σᵢ wᵢxᵢ + b).
Neural Networks
Weights, biases and activation functions define the network. Example with 3 inputs, 4 hidden neurons and 2 output neurons:
4 + 2 = 6 neurons (not counting inputs)
[3 × 4] + [4 × 2] = 20 weights
4 + 2 = 6 biases
26 learnable parameters
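The parameter count above can be sketched in a few lines of Python; the layer sizes follow the slide's 3-4-2 example.

```python
# Parameter count of a fully connected network: weights connect every
# pair of adjacent layers, and each non-input neuron has one bias.
def count_parameters(layer_sizes):
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

print(count_parameters([3, 4, 2]))  # 20 weights + 6 biases = 26
```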
Deep Neural Networks
A deep network stacks an Input Layer, several Hidden Layers (Layer 1, Layer 2, …, Layer L) and an Output Layer producing y1, y2, …, yM.
Deep Neural Networks
Example with sigmoid activations: inputs (1, −1); first neuron with weights (1, −2) and bias 1, second neuron with weights (−1, 1) and bias 0.
Pre-activations: 1·1 + (−1)·(−2) + 1 = 4 and 1·(−1) + (−1)·1 + 0 = −2
Sigmoid outputs: σ(4) ≈ 0.98 and σ(−2) ≈ 0.12
Deep Neural Networks
Example (continued): propagating (0.98, 0.12) through two more layers with the weights shown in the diagram gives the intermediate activations (0.86, 0.11) and the final outputs (0.62, 0.83).
Deep Neural Networks
Matrix operation in a multilayer NN
With weight matrices W1, W2, …, WL, bias vectors b1, b2, …, bL and an element-wise activation σ, each layer transforms its input:
a1 = σ(W1 x + b1)
a2 = σ(W2 a1 + b2)
…
y = σ(WL aL−1 + bL)
so the whole network computes the composed function
y = f(x) = σ(WL ⋯ σ(W2 σ(W1 x + b1) + b2) ⋯ + bL)
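As a sketch, this layer-by-layer composition can be written in plain Python with a sigmoid activation; the single-layer call below reuses the weights and biases of the worked example, recovering (0.98, 0.12).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(Ws, bs, x):
    # y = sigma(WL * sigma(... sigma(W1 x + b1) ...) + bL)
    a = x
    for W, b in zip(Ws, bs):
        a = [sigmoid(sum(w * v for w, v in zip(row, a)) + bi)
             for row, bi in zip(W, b)]
    return a

# One layer: weights (1, -2) with bias 1 and (-1, 1) with bias 0,
# applied to the input (1, -1).
y = forward([[[1.0, -2.0], [-1.0, 1.0]]], [[1.0, 0.0]], [1.0, -1.0])
print([round(v, 2) for v in y])  # [0.98, 0.12]
```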
Classification layer
A layer with sigmoid activations maps the scores (3, −3, 1) independently to (0.95, 0.05, 0.73): the outputs do not sum to 1.
A softmax layer instead exponentiates and normalizes: e³ ≈ 20, e⁻³ ≈ 0.05, e¹ ≈ 2.7, giving the probabilities (0.88, ≈0, 0.12).
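A minimal softmax sketch reproducing the slide's numbers; subtracting the maximum score before exponentiating is a standard numerical-stability trick not shown on the slide.

```python
import math

def softmax(scores):
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 2) for p in softmax([3.0, -3.0, 1.0])])  # [0.88, 0.0, 0.12]
```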
Activation functions
Sigmoid function: σ(z) = 1 / (1 + e⁻ᶻ)
Tanh function: tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)
ReLU (Rectified Linear Unit): f(z) = max(0, z)
ReLU can cause "dying" neurons: the weights may update so that the gradient becomes zero and the neuron never activates again on any input.
Leaky ReLU activation: f(z) = z for z > 0, α·z otherwise (small α > 0), which avoids dead neurons.
Linear function: f(z) = z
Training Neural Networks
Training a network means determining the parameters of each of its layers, given a specific architecture.
Training Neural Networks
Data Preprocessing
It is a fundamental step that helps training reach convergence.
Training Neural Networks
To train a network it is necessary to define a loss function (objective or cost function)
Example: the prediction scores (y1, y2, …, y10) = (0.2, 0.3, …, 0.5) are compared with the one-hot encoding (1, 0, 0, …) of the true label "1"; the cost measures the distance between prediction and ground truth.
Training Neural Networks
Training formalization
For a training set of N images x1, x2, x3, …, xN, the same network NN maps each input xi to a prediction yi. Training asks: which function (i.e. which set of parameters) works best?
Loss function for classification
Training examples
Pairs of 𝑁 inputs 𝑥𝑖 and ground-truth class labels 𝑦𝑖
Output layer
Softmax activations (to map to a probability)
Loss function
Cross-entropy: ℒ = −(1/N) Σᵢ Σₖ yᵢₖ log(ŷᵢₖ)
where yᵢₖ are the GT labels, ŷᵢₖ the model-predicted probabilities, i runs over the N examples and k over the classes.
Loss function for regression
Training examples
Pairs of 𝑁 inputs 𝑥𝑖 and ground-truth output values 𝑦𝑖
Output layer
Linear or sigmoid activation
Loss function
Mean Squared Error: MSE = (1/N) Σᵢ (yᵢ − ŷᵢ)²
Mean Absolute Error: MAE = (1/N) Σᵢ |yᵢ − ŷᵢ|
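Both regression losses in a minimal sketch:

```python
def mse(y_true, y_pred):
    # mean of squared errors
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    # mean of absolute errors
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(mse([1.0, 2.0], [1.5, 1.5]), mae([1.0, 2.0], [1.5, 1.5]))  # 0.25 0.5
```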
Optimizing the loss function
Almost all DL models these days are trained with a variant of the gradient descent (GD) algorithm
Gradient Descent Algorithm
Gradient Descent Algorithm
Gradient descent algorithm stops when a local minimum of the loss is reached
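A minimal gradient-descent sketch on a toy convex loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3); the learning rate and step count are illustrative.

```python
# Gradient descent: theta <- theta - lr * grad(theta), repeated.
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

theta = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
print(round(theta, 4))  # 3.0, the minimum of the loss
```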
Gradient Descent Algorithm
For most tasks, the loss function ℒ(𝜃) is highly complex (and non-convex)
Backpropagation
Modern NNs employ the backpropagation ("backward propagation") method to calculate the gradients of the loss function, ∇ℒ(θ) = (∂ℒ/∂θᵢ)ᵢ.
GD optimization
Mini-batch gradient descent
The loss is computed on small batches of the training dataset (it is wasteful to perform a full training set analysis to update a single parameter)
Approach
Stochastic GD → a mini-batch has the size of a single example
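A sketch of how a training set can be split into shuffled mini-batches; the fixed seed is only for reproducibility of the example.

```python
import random

def minibatches(data, batch_size, seed=0):
    # shuffle the indices once per epoch, then yield consecutive slices
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    for start in range(0, len(idx), batch_size):
        yield [data[i] for i in idx[start:start + batch_size]]

batches = list(minibatches(list(range(10)), batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```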
GD optimization
The GD algorithm can be very slow at plateaus, and it can get stuck at local minima or saddle points.
GD with Momentum
Gradient descent with momentum uses the momentum of past gradients for the parameter update: real movement = negative of gradient + momentum. Even at a point where the gradient is 0, the accumulated momentum keeps the parameters moving.
GD with Momentum
The GD with Momentum updates the parameters θ in the direction of the weighted average of the past gradients. At iteration t:
v_t = β v_{t−1} + (1 − β) ∇ℒ(θ_{t−1})
θ_t = θ_{t−1} − η v_t
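The momentum update can be sketched on a toy convex loss L(θ) = (θ − 3)², with gradient 2(θ − 3); β and the learning rate are illustrative values.

```python
# Gradient descent with momentum: the velocity v is an exponential
# moving average of past gradients, and the parameter follows -v.
def momentum_descent(grad, theta0, lr=0.1, beta=0.9, steps=300):
    theta, v = theta0, 0.0
    for _ in range(steps):
        v = beta * v + (1.0 - beta) * grad(theta)  # weighted average of past gradients
        theta -= lr * v                            # move along the averaged direction
    return theta

theta = momentum_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
print(round(theta, 3))  # 3.0, the minimum of the loss
```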
GD with Nesterov Accelerated Momentum
GD with momentum evaluates the gradient at the current position; GD with Nesterov momentum evaluates it at the look-ahead position reached after the momentum step, and then corrects.
Adaptive Moment Estimation (Adam)
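Adam keeps exponential moving averages of both the gradient (first moment) and its square (second moment), with bias correction. A sketch on the same toy quadratic loss; b1 and b2 are the commonly quoted defaults, while the learning rate and step count are illustrative.

```python
import math

def adam(grad, theta0, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=2000):
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = b1 * m + (1 - b1) * g        # 1st moment: mean of gradients
        v = b2 * v + (1 - b2) * g * g    # 2nd moment: uncentered variance
        m_hat = m / (1 - b1 ** t)        # bias correction for the zero init
        v_hat = v / (1 - b2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

theta = adam(lambda t: 2.0 * (t - 3.0), theta0=0.0)
print(theta)  # close to the optimum at 3
```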
Optimizer comparison
Animation from: https://imgur.com/s25RsOr
Learning rate
If the LR is too small, convergence is very slow; if the LR is too large, the loss oscillates or diverges.
Scheduling the learning rate
Learning rate scheduling is applied to change the values of the learning rate during the training
Exponential
Cosine
Warmup
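The three schedules can be sketched as functions of the step index; lr0, decay, total_steps and warmup_steps are assumed hyper-parameters.

```python
import math

def exponential(step, lr0=0.1, decay=0.96):
    # multiplicative decay at every step
    return lr0 * decay ** step

def cosine(step, lr0=0.1, total_steps=100):
    # smooth annealing from lr0 down to 0
    return 0.5 * lr0 * (1 + math.cos(math.pi * step / total_steps))

def warmup(step, lr0=0.1, warmup_steps=10):
    # linear ramp-up, then constant
    return lr0 * min(1.0, step / warmup_steps)

print(round(cosine(0), 3), round(cosine(100), 3), round(warmup(5), 3))  # 0.1 0.0 0.05
```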
Regularization
Regularization is a set of techniques to prevent overfitting → improve accuracy when facing new data. A good model lies between underfitting and overfitting.
Regularization
Overfitting
A model with high capacity fits the noise in the data instead of the underlying relationship
L2 regularization
L1 regularization
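The penalty terms that the two regularizers add to the loss, as a sketch; λ is the regularization strength, an assumed hyper-parameter.

```python
def l2_penalty(weights, lam=0.01):
    # L2: penalizes the squared magnitude of the weights
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam=0.01):
    # L1: penalizes the absolute magnitude, encouraging sparsity
    return lam * sum(abs(w) for w in weights)

weights = [0.5, -1.0, 2.0]
print(round(l2_penalty(weights), 4), round(l1_penalty(weights), 4))  # 0.0525 0.035
```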
Dropout regularization
Randomly drop units (along with their connections) during training
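A sketch of (inverted) dropout at training time; rescaling the surviving units by 1/(1 − p) keeps the expected activation unchanged, so nothing needs rescaling at test time. The fixed seed is only for reproducibility.

```python
import random

def dropout(activations, p, rng):
    # drop each unit with probability p, rescale the survivors
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * scale for a in activations]

out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=random.Random(0))
print(out)  # roughly half the units zeroed, survivors doubled
```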
Because each mini-batch (mini-batch 1, 2, 3, …, n) trains a different randomly thinned subnetwork, dropout is similar to ensemble learning over many subnetworks.
Early stopping
Stop training when the validation loss stops improving, even if the training loss is still decreasing.
Tuning the hyper-parameters
Ensemble Learning
Ensemble learning trains multiple classifiers separately and combines their predictions.
k-fold Cross-Validation
Typically used when the training dataset is small
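A sketch of the k-fold index split (contiguous folds without shuffling, for clarity): each fold serves once as the validation set while the rest is used for training.

```python
def kfold_indices(n_samples, k):
    # split 0..n_samples-1 into k contiguous validation folds
    folds = []
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i not in val]
        folds.append((train, val))
        start += size
    return folds

for train, val in kfold_indices(6, 3):
    print(val)  # [0, 1] then [2, 3] then [4, 5]
```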
Batch Normalization
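As a sketch for a mini-batch of scalar activations: batch normalization subtracts the batch mean, divides by the batch standard deviation, and applies a learnable scale γ and shift β (here left at their neutral values).

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    # normalize, then scale and shift with the learnable parameters
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
print(round(sum(out), 6))  # 0.0: the normalized batch has zero mean
```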
Architectures
Deep learning models can result from different architectures, depending on:
Working on Time Series
Recurrent NNs are used for modeling sequential data and data with varying length of inputs and outputs
RNNs
An unrolled RNN carries a hidden state across time: starting from h0, each input x1, x2, x3 updates the state to h1, h2, h3, and the final state produces the OUTPUT.
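A minimal single-unit recurrent cell, showing how the hidden state is carried across time steps; the weights are illustrative values, not trained ones.

```python
import math

# h_t = tanh(w_x * x_t + w_h * h_{t-1} + b), one scalar unit.
def rnn_final_state(inputs, w_x=0.5, w_h=0.8, b=0.0, h0=0.0):
    h = h0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)  # state carries information forward
    return h  # the final hidden state feeds the output layer

print(round(rnn_final_state([1.0, 0.5, -0.25]), 4))
```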
RNNs
Application examples (input → output):
Image Captioning: image → "A person riding a motorbike on dirt road"
Sentiment Analysis: "Awesome movie. Highly recommended." → Positive
Machine Translation: "Happy Diwali" → "शुभ दीपावली"
Bidirectional RNNs
A bidirectional RNN processes the sequence in both directions, so each output can depend on both past and future elements.
LSTM Networks
Long Short-Term Memory (LSTM) networks are a variant of RNNs
LSTM Networks
LSTM cell
Working on fixed-size data
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks (CNNs)
Aims behind the use of CNNs
Fully connected networks (and layers) applied directly to images are maybe unnecessarily complex; CNNs target the same tasks with fewer parameters by exploiting image structure.
Convolutional Neural Networks (CNNs)
Why convolutions? A pattern such as a beak occupies only a small region of the image, so we can define a small beak detector; the same detector should move across the image, because the beak can appear anywhere.
Convolutional Neural Networks (CNNs)
A convolutional layer can match these requisites: it acts as a filter (e.g. a beak detector) that is slid over the whole image.
Convolutional Neural Networks (CNNs)
2D convolutions
A small filter matrix slides over the input matrix. For example, the 3 × 3 Laplacian filter
0  1  0
1 −4  1
0  1  0
convolved with the input image produces the convolved (edge-enhanced) image.
Convolutional Neural Networks (CNNs)
Stride
6 × 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
3 × 3 filter kernel:
 1 −1 −1
−1  1 −1
−1 −1  1
With stride = 1, the dot product between the kernel and the first two image patches gives 3 and −1; with stride = 2 the filter jumps two pixels at a time, and the first two outputs are 3 and −3.
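A sketch of the 2D convolution (implemented, as in most DL frameworks, as cross-correlation) with a configurable stride, using the 6 × 6 image and 3 × 3 filter from the slide:

```python
def conv2d(image, kernel, stride=1):
    # slide the kernel over the image and take dot products
    k = len(kernel)
    out = []
    for r in range(0, len(image) - k + 1, stride):
        row = []
        for c in range(0, len(image[0]) - k + 1, stride):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out

image = [[1, 0, 0, 0, 0, 1], [0, 1, 0, 0, 1, 0], [0, 0, 1, 1, 0, 0],
         [1, 0, 0, 0, 1, 0], [0, 1, 0, 0, 1, 0], [0, 0, 1, 0, 1, 0]]
kernel = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
print(conv2d(image, kernel)[0])      # [3, -1, -3, -1]
print(conv2d(image, kernel, 2)[0])   # [3, -3]
```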
Convolutional Neural Networks (CNNs)
Multiple filters can form a feature map
Applying Filter 1
 1 −1 −1
−1  1 −1
−1 −1  1
with stride = 1 to the 6 × 6 image yields the 4 × 4 response
 3 −1 −3 −1
−3  1  0 −3
−3 −3  0  1
 3 −2 −2 −1
The filter weights are shared across all image positions.
Convolutional Neural Networks (CNNs)
Multiple filters can form a feature map
Applying Filter 2
−1  1 −1
−1  1 −1
−1  1 −1
with stride = 1 to the same image yields a second 4 × 4 response
−1 −1 −1 −1
−1 −1 −2  1
−1 −1 −2  1
−1  0 −4  3
Together the two 4 × 4 images (a 4 × 4 × 2 matrix) form the feature map.
Convolutional Neural Networks (CNNs)
[Figure: convolutional layers can be stacked. Filter 1 (weights w1 … w4) maps the input image to the Layer 1 feature map; Filter 2 (weights w5 … w8) maps that to the Layer 2 feature map.]
Convolutional Neural Networks (CNNs)
Color images are made of 3 channels
A color image is a stack of three 6 × 6 channel matrices, so each filter must also have depth 3: Filter 1 and Filter 2 each consist of one 3 × 3 kernel per channel, and a filter's per-channel responses are summed into a single output map.
Convolutional Neural Networks (CNNs)
A convolutional layer can be represented as a fully-connected layer
Flattening the 6 × 6 image into a 36-dimensional input vector, the convolution layer becomes a sparsely connected layer: each output neuron is wired only to the 9 inputs covered by its 3 × 3 filter position.
Convolutional Neural Networks (CNNs)
A convolutional layer can be represented as a fully-connected layer
With the pixels numbered 1 … 36, the output of Filter 1 at the first position (value 3) connects only to inputs 1, 2, 3, 7, 8, 9, 13, 14, 15: 9 inputs instead of 36, i.e. fewer parameters.
Convolutional Neural Networks (CNNs)
A convolutional layer can be represented as a fully-connected layer
The next output (value −1) connects to inputs 2, 3, 4, 8, 9, 10, 14, 15, 16, and its 9 weights (the unknowns) are the same as before: weight sharing means even fewer parameters to be learned.
Convolutional Neural Networks (CNNs)
Subsampling
This is a bird even after subsampling: a smaller image (or feature map) needs fewer parameters to be characterized.
Convolutional Neural Networks (CNNs)
Max pooling / Average pooling
Example: MaxPool with a 2 × 2 filter and a stride of 2.
Input matrix:
1 3 5 3
4 2 3 1
3 1 1 3
0 1 0 4
Output matrix:
4 5
3 4
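The pooling example as a sketch:

```python
# 2x2 max pooling with stride 2: keep the maximum of each block.
def max_pool(matrix, size=2, stride=2):
    out = []
    for r in range(0, len(matrix) - size + 1, stride):
        out.append([max(matrix[r + i][c + j]
                        for i in range(size) for j in range(size))
                    for c in range(0, len(matrix[0]) - size + 1, stride)])
    return out

x = [[1, 3, 5, 3], [4, 2, 3, 1], [3, 1, 1, 3], [0, 1, 0, 4]]
print(max_pool(x))  # [[4, 5], [3, 4]]
```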
Convolutional Neural Networks (CNNs)
Example of a feature extraction architecture
A stack of conv layers with increasing channel counts (64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512), interleaved with max-pool layers, feeds a fully connected layer that scores the scene classes: Living Room, Bedroom, Kitchen, Bathroom, Outdoor.
Convolutional Neural Networks (CNNs)
Be careful with the data size!
Input color image: 640 × 480 × 3
Conv layers: 3 × 3 kernel, stride = 1; Max Pool: 2 × 2 pool size, stride = 2
1st feature map: (640 ÷ 2) × (480 ÷ 2) × 64 = 320 × 240 × 64 → 64 × 9 × 3 = 1728 learnables
2nd feature map: 160 × 120 × 128 → 128 × 64 × 9 = 73,728 learnables
3rd feature map: 80 × 60 × 256 → 256 × 128 × 9 = 294,912 learnables
4th feature map: 40 × 30 × 512 → 512 × 256 × 9 = 1,179,648 learnables
Flattened vector: 40 × 30 × 512 = 614,400 × 1, fully connected to the 5 classes (Living Room, Bedroom, Kitchen, Bathroom, Outdoor): 614,400 × 5 weights + 1 × 5 biases
Total: ~4.62 M learnable parameters
The output of a convolution has spatial size [(Input_Size − Kernel_Size + 2 × Padding_Size) / Stride] + 1 (with a 3 × 3 kernel, stride 1 and padding 1 the output size equals the input size).
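The output-size formula as a one-line helper:

```python
# Spatial output size of a convolution:
# floor((input - kernel + 2*padding) / stride) + 1
def conv_output_size(input_size, kernel, stride=1, padding=0):
    return (input_size - kernel + 2 * padding) // stride + 1

print(conv_output_size(640, 3, stride=1, padding=1))  # 640 ("same" convolution)
print(conv_output_size(6, 3, stride=1))               # 4
print(conv_output_size(6, 3, stride=2))               # 2
```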
Example of CNNs
LeNet-5
AlexNet
VGG-19
Inception v1
ResNet-50
Transfer learning
Deconvolutional Neural Networks (DNNs)
Encoder – Decoder Architectures
CNN example
Next steps
Roberto Marani
Researcher
National Research Council of Italy (CNR)
Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing (STIIMA)
via Amendola 122/D-O, 70126 Bari, Italy
+39 080 592 94 58
roberto.marani@stiima.cnr.it
robertomarani.com
cnr.it/people/roberto.marani
stiima.cnr.it/ricercatori/roberto-marani/
Credits