1 of 113

Artificial intelligence for quality control with active infrared thermography
Introduction to Deep Learning

ING-IND/14, 2 CFU

Roberto Marani - April 26th, 2023


Introduction to machine learning

Definition by Tom Mitchell (1998):

Machine Learning is the study of algorithms that:

  • improve their performance P
  • at some task T
  • with experience E

A well-defined learning task is given by <P,T,E>.

“Learning is any process by which a system improves performance from experience” (Herbert Simon)


Introduction to machine learning

Supervised learning

Unsupervised learning


Evaluation


Computer Vision Tasks


No-Free-Lunch Theorem

  • Wolpert (2002) - The Supervised Learning No-Free-Lunch Theorems
  • The derived classification models for supervised learning are simplifications of reality
    • The simplifications are based on certain assumptions
    • The assumptions fail in some situations
      • E.g., due to the inability to perfectly estimate ML model parameters from limited data

  • In summary, the No-Free-Lunch Theorem states:
    • No single classifier works best for all possible problems
    • Since we need to make assumptions to generalize


Evaluation

Performance on test data is a good indicator of generalization

The test accuracy is more important than the training accuracy


Use case

Inspection of a calibrated plate of GFRP (315 × 290)

  • In-depth defects: Ø 7.85, Ø 14.1, Ø 20.2; hole depths 15.7, 12.4, 9.82, 7.08, 4.35
  • Surface defects: Ø 17.44, Ø 13.3, Ø 9.54, Ø 8.3, Ø 7.85
  • Sound region


Outlines

  • Machine learning vs deep learning
  • Neural networks
  • Training Neural Networks
    • Loss Function
    • Optimization
    • Regularization
    • Searching for the best
    • Architectures
      • For time series
      • For fixed-size data


Machine learning vs deep learning

  • Conventional machine learning methods rely on human-designed feature representations
    • ML then reduces to optimizing weights to produce the best final prediction


Machine learning vs deep learning

  • Deep learning (DL) is a machine learning subfield that uses multiple layers for learning data representations
    • DL is exceptionally effective at learning patterns


Machine learning vs deep learning

  • DL applies a multi-layer process for learning rich hierarchical features (i.e., data representations)
    • Input image pixels → Edges → Textures → Parts → Objects

Low-Level Features

Mid-Level Features

Output

High-Level Features

Trainable Classifier


Why is deep learning useful?

  • DL provides a flexible, learnable framework for representing visual, textual, and linguistic information
    • Can learn in a supervised or unsupervised manner
  • DL represents an effective end-to-end learning system
  • Requires large amounts of training data
  • Since about 2010, DL has outperformed other ML techniques
    • First in vision and speech, then NLP, and other applications


DL Frameworks



Neural Networks

  • Handwritten digit recognition (MNIST dataset)
    • The intensity of each pixel is considered an input element
    • Output is the class of the digit

Input

16 x 16 = 256

……

……

y1

y2

y10

Each dimension represents the confidence of a digit

is 1

is 2

is 0

……

0.1

0.7

0.2

The image is “2”

Output


Neural Networks

  • Handwritten digit recognition: the trained network acts as a function (the “Machine”) that maps the input image to the output scores y1, …, y10 and returns “2”

Neural Networks

  • NNs consist of hidden layers with neurons (i.e., computational units)
  • A single neuron maps a set of inputs into an output number, 𝑓: ℝᴷ → ℝ:
    z = w₁x₁ + ⋯ + w_K x_K + b, output a = f(z)
    where w₁, …, w_K are the weights, b is the bias, and f is the activation function


Neural Networks

  • A NN with one hidden layer and one output layer

 

 

 

 

Weights

Biases

Activation functions

4 + 2 = 6 neurons (not counting inputs)

[3 × 4] + [4 × 2] = 20 weights

4 + 2 = 6 biases

26 learnable parameters
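As a sketch, the counting rule above (weights are products of consecutive layer sizes, one bias per non-input neuron) can be automated; the helper name `count_params` and the `[3, 4, 2]` layer-size list are illustrative choices, not from the slides:

```python
def count_params(layer_sizes):
    """Learnable parameters of a fully-connected NN; layer_sizes lists the
    number of neurons per layer, inputs first (e.g. [3, 4, 2])."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])            # one bias per non-input neuron
    return weights, biases

print(count_params([3, 4, 2]))  # (20, 6) -> 26 learnable parameters
```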


Deep Neural Networks

  • Deep NNs have many hidden layers
    • Fully-connected (dense) layers (Multi-Layer Perceptron or MLP)
    • Each neuron is connected to all neurons in the succeeding layer
    • It can be expressed in a matrix form

Output Layer

Hidden Layers

Input Layer

Input

Output

Layer 1

……

……

Layer 2

……

Layer L

……

……

……

……

……

y1

y2

yM


Deep Neural Networks

Example

Using sigmoid activations: input (1, −1), first-layer weights (1, −2; −1, 1) and biases (1, 0) give z = (1·1 + (−1)·(−2) + 1, 1·(−1) + (−1)·1 + 0) = (4, −2) → σ(z) ≈ (0.98, 0.12)

Propagating (0.98, 0.12) through the remaining layers yields (0.86, 0.11) and finally (0.62, 0.83): the network maps the input (1, −1) to the output (0.62, 0.83)

Deep Neural Networks

Matrix operation in multilayer NN

x → a¹ → a² → ⋯ → y

a¹ = σ(W¹x + b¹)
a² = σ(W²a¹ + b²)
⋯
y = σ(Wᴸaᴸ⁻¹ + bᴸ)

In nested form: y = f(x) = σ(Wᴸ ⋯ σ(W²σ(W¹x + b¹) + b²) ⋯ + bᴸ)
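The layer-by-layer matrix recipe can be sketched in plain Python (lists standing in for vectors and matrices; the function names are illustrative, and the printed values reuse the slides' example weights):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, b, x):
    # One dense layer: a = sigma(W x + b); W is a list of rows, one per neuron.
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def forward(params, x):
    # params = [(W1, b1), (W2, b2), ..., (WL, bL)]: apply the layers in order.
    a = x
    for W, b in params:
        a = layer(W, b, a)
    return a

# The slides' example: input (1, -1), weights (1, -2; -1, 1), biases (1, 0).
print([round(v, 2) for v in layer([[1, -2], [-1, 1]], [1, 0], [1, -1])])  # [0.98, 0.12]
```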


Classification layer

  • In multi-class classification tasks, the output layer is typically a softmax layer
    • I.e., it employs a softmax activation function
    • If a layer with a sigmoid activation function is used as the output layer instead, the predictions by the NN may not be easy to interpret
      • Note that an output layer with sigmoid activations can still be used for binary classification

  • Softmax outputs lie in the range [0,1] and sum to 1, so they can be read as class probabilities

A layer with sigmoid activations: z = (3, −3, 1) → (0.95, 0.05, 0.73) — the outputs do not sum to 1

A softmax layer: z = (3, −3, 1) → eᶻ = (20, 0.05, 2.7) → normalized: (0.88, ≈0, 0.12)
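The sigmoid-vs-softmax comparison can be reproduced numerically; a small sketch (the function names are my own):

```python
import math

def sigmoid_layer(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def softmax(z):
    m = max(z)                              # subtract max(z) for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

z = [3, -3, 1]
print([round(v, 2) for v in sigmoid_layer(z)])  # [0.95, 0.05, 0.73] -- does not sum to 1
print([round(v, 2) for v in softmax(z)])        # [0.88, 0.0, 0.12]  -- sums to 1
```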


Activation functions

  • Non-linear activations are needed to learn complex (non-linear) data representations
    • Otherwise, stacked linear layers collapse into a single linear function (e.g., W₂(W₁x) = Wx)
    • NNs with many layers (and neurons) can approximate more complex functions
      • Figure: more neurons improve the representation (but may overfit)


Activation functions

Sigmoid function σ

  • It takes a real-valued number and “squashes” it into the range between 0 and 1
    • The output can be interpreted as the firing rate of a biological neuron
      • Not firing = 0; Fully firing = 1
    • When the neuron’s activation is 0 or 1, sigmoid neurons saturate
      • Gradients at these regions are almost zero (almost no signal will flow)
    • Sigmoid activations are less common in modern NNs

σ(x) = 1 / (1 + e⁻ˣ)


Activation functions

Tanh function:

  • It takes a real-valued number and “squashes” it into a range between -1 and 1
  • Like sigmoid, tanh neurons saturate
  • Unlike sigmoid, the output is zero-centered
    • It is therefore preferred over sigmoid
  • Tanh is a scaled sigmoid: tanh(x) = 2·σ(2x) − 1

tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)


Activation functions

ReLU (Rectified Linear Unit):

  • It takes a real-valued number and thresholds it at zero: f(x) = max(0,x)
  • Most modern deep NNs use ReLU activations
    • ReLU is fast to compute compared to sigmoid and tanh
    • Accelerates the convergence of gradient descent
      • Due to linear, non-saturating form
    • Mitigates the vanishing gradient problem (the gradient is 1 for all positive inputs)

 

 

 

However, ReLU can cause “dying” units: the weights may update in a way that a neuron’s gradient becomes zero and the neuron never activates again on any input


Activation functions

Leaky ReLU activation

  • It is a variant of ReLU
    • Instead of being 0 for 𝑥 < 0, a leaky ReLU has a small slope in the negative region: f(x) = x for x ≥ 0, f(x) = αx for x < 0 (e.g., α = 0.01, or similar)
    • This resolves the dying ReLU problem
      • Most current works still use ReLU
      • With a proper setting of the learning rate, the problem of dying ReLU can be avoided

 


Activation functions

Linear function

  • The output signal is proportional to the input signal to the neuron
    • If the value of the constant c is 1, it is also called identity activation function
    • This activation type is used in regression problems
      • E.g., the last layer can have linear activation function, in order to output a real number (and not a class membership)
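The activation functions discussed above can be written down directly; a minimal sketch (the helper names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))       # squashes into (0, 1)

def tanh(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0     # tanh as a scaled sigmoid, range (-1, 1)

def relu(x):
    return max(0.0, x)                      # thresholds at zero

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x        # small slope for negative inputs

def linear(x, c=1.0):
    return c * x                            # identity activation when c = 1
```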

 

 



Training Neural Networks

Training a network means determining the parameters of each of its layers, given a specific architecture

  • The network parameters 𝜃 include the weight matrices and bias vectors from all layers

    • Often, the model parameters 𝜃 are referred to as weights

  • Training a model to learn a set of parameters 𝜃 that are optimal (according to a criterion) is one of the greatest challenges in ML

 


Training Neural Networks

Data Preprocessing

It is a fundamental task to help training in reaching convergence

  • Mean subtraction
    • Subtract the mean for each individual data dimension (feature) to obtain zero-centered data
  • Normalization
    • Divide each feature by its standard deviation
      • To obtain a standard deviation of 1 for each data dimension (feature)
    • Or, scale the data within the range [0,1] or [-1, 1]
      • E.g., image pixel intensities are divided by 255 to be scaled in the [0,1] range
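A minimal sketch of mean subtraction plus normalization, assuming a small list-of-lists dataset (the helper name `standardize` is illustrative):

```python
def standardize(data):
    """Mean subtraction + normalization: zero-center every feature and
    divide it by its (population) standard deviation; data is a list of rows."""
    n, dims = len(data), len(data[0])
    means = [sum(row[d] for row in data) / n for d in range(dims)]
    stds = [(sum((row[d] - means[d]) ** 2 for row in data) / n) ** 0.5
            for d in range(dims)]
    return [[(row[d] - means[d]) / stds[d] for d in range(dims)] for row in data]
```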


Training Neural Networks

To train a network it is necessary to define a loss function (objective or cost function)

  • ℒ(𝜃) calculates the difference (error) between the model prediction and the true label
  • E.g., ℒ(𝜃) can be a mean-squared error, a cross-entropy value, etc.

E.g., prediction scores (y1, y2, y3, …) = (0.2, 0.3, 0.5, …) are compared against the one-hot target (1, 0, 0, …) for the true label “1”: the cost measures their distance

Training Neural Networks

Training formalization

For a training set of N images:

    • Calculate the total loss over all images: ℒ(𝜃) = Σₙ₌₁ᴺ ℓₙ(𝜃), where ℓₙ(𝜃) is the loss on the n-th image

    • Find the optimal parameters 𝜃 that minimize the total loss ℒ(𝜃)

Each input x₁, x₂, …, x_N is passed through the same NN to obtain y₁, y₂, …, y_N, and the per-example losses ℓₙ are accumulated.
Which function can work best?



Loss function for classification

Training examples

Pairs of 𝑁 inputs 𝑥𝑖 and ground-truth class labels 𝑦𝑖

Output layer

Softmax activations (to map to a probability)

Loss function

Cross-entropy

ℒ(𝜃) = −(1/N) Σᵢ Σₖ yᵢₖ log ŷᵢₖ

where yᵢₖ are the GT labels, ŷᵢₖ the model-predicted probabilities, i = 1, …, N indexes the examples, and k the classes
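A sketch of the cross-entropy computation for one-hot GT labels and softmax outputs (the `eps` term, added here only to avoid log(0), is an implementation detail, not from the slides):

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy: y_true holds one-hot GT labels (N x K),
    y_pred the softmax probabilities (N x K); eps guards against log(0)."""
    total = 0.0
    for yi, pi in zip(y_true, y_pred):
        total -= sum(t * math.log(p + eps) for t, p in zip(yi, pi))
    return total / len(y_true)

# A confident correct prediction costs little; a 50/50 guess costs log 2.
print(round(cross_entropy([[1, 0]], [[0.5, 0.5]]), 4))  # 0.6931
```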


Loss function for regression

Training examples

Pairs of 𝑁 inputs 𝑥𝑖 and ground-truth output values 𝑦𝑖

Output layer

Linear or sigmoid activation

Loss function

Mean Squared Error: ℒ(𝜃) = (1/N) Σᵢ (yᵢ − ŷᵢ)²

Mean Absolute Error: ℒ(𝜃) = (1/N) Σᵢ |yᵢ − ŷᵢ|



Optimizing the loss function

Almost all DL models these days are trained with a variant of the gradient descent (GD) algorithm

  • GD applies iterative refinement of the network parameters 𝜃
  • GD updates 𝜃 in the direction opposite to the gradient of the loss with respect to the NN parameters, 𝛻ℒ(𝜃) = [∂ℒ/∂𝜃ᵢ]
    • The gradient 𝛻ℒ(𝜃) gives the direction of the fastest increase of the loss function ℒ(𝜃) when the parameters 𝜃 are changed

 

 

 


Gradient Descent Algorithm

  1. Randomly initialize the model parameters 𝜃₀
  2. Compute the gradient of the loss function at the current parameters: 𝛻ℒ(𝜃₀)
  3. Update the parameters: 𝜃_new = 𝜃₀ − α𝛻ℒ(𝜃₀)
    • Where α is the learning rate
  4. Go to step 2 and repeat (until a terminating criterion is reached)
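Steps 1–4 can be sketched as a loop; a toy example on a convex quadratic loss, assuming the gradient function is supplied by hand (in practice it comes from backpropagation):

```python
def gradient_descent(grad, theta0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Start from theta0 and repeatedly move against the gradient
    until it (almost) vanishes."""
    theta = list(theta0)
    for _ in range(max_iter):
        g = grad(theta)                                      # step 2
        theta = [t - alpha * gi for t, gi in zip(theta, g)]  # step 3
        if max(abs(gi) for gi in g) < tol:                   # terminating criterion
            break
    return theta

# Toy loss L(a, b) = (a - 3)^2 + b^2, whose gradient is (2(a - 3), 2b).
opt = gradient_descent(lambda th: [2 * (th[0] - 3), 2 * th[1]], [0.0, 5.0])
print([round(v, 4) for v in opt])  # [3.0, 0.0]
```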

 

 

 

 

 

 


Gradient Descent Algorithm

Gradient descent algorithm stops when a local minimum of the loss is reached

    • GD does not guarantee reaching a global minimum
    • Empirical evidence suggests that GD works well for NNs

 

 


Gradient Descent Algorithm

For most tasks, the loss function ℒ(𝜃) is highly complex (and non-convex)

    • Random initialization in NNs results in different initial parameters 𝜃0 every time the NN is trained
      • Gradient descent may reach different minima at every run
      • Therefore, NN will produce different predicted outputs
    • No algorithm can guarantee reaching a global minimum for an arbitrary loss function

 

 

 


Backpropagation

Modern NNs employ the backpropagation (“backward propagation”) method for calculating the gradients of the loss function 𝛻ℒ(𝜃) = [∂ℒ/∂𝜃ᵢ]

  • For training NNs, forward propagation (forward pass) refers to passing the inputs 𝑥 through the hidden layers to obtain the model outputs (predictions) 𝑦
    • The loss ℒ(y, ŷ) is then calculated

  • Backpropagation traverses the network in reverse order, from the outputs 𝑦 backward toward the inputs 𝑥 to calculate the gradients of the loss 𝛻ℒ(𝜃)
    • The chain rule is used for calculating the partial derivatives of the loss function with respect to the parameters 𝜃 in the different layers of the network

  • Automatic calculation of the gradients (automatic differentiation) is available in all current deep learning libraries to simplify the network implementation
    • No need to derive the partial derivatives of the loss function by hand


GD optimization

Mini-batch gradient descent

The loss is computed on small batches of the training dataset (it is wasteful to process the full training set for a single parameter update)

  • Mini-batch GD results in much faster training
  • It works because the gradient from a mini-batch is a good approximation of the gradient from the entire training set

Approach

  1. Compute the loss ℒ(𝜃) on a mini-batch of images, update the parameters 𝜃, and repeat until all images are used
  2. At the next epoch, shuffle the training data, and repeat the above process

Stochastic GD 🡪 A mini-batch has the size of a single example

  • Less used as it can lead to huge fluctuations in the loss function at each step
  • SGD typically refers to GD applied to mini-batches of inputs
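The mini-batch bookkeeping (shuffle once per epoch, then slice) can be sketched as a generator; the names and the seeding scheme are illustrative:

```python
import random

def minibatches(data, batch_size, seed=0):
    """Shuffle the example indices, then yield consecutive slices:
    every example is used exactly once per epoch."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)   # reshuffle each epoch with a new seed
    for start in range(0, len(idx), batch_size):
        yield [data[i] for i in idx[start:start + batch_size]]

batches = list(minibatches(list(range(10)), batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```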


GD optimization

The GD algorithm can be very slow at plateaus, and it can get stuck at saddle points

  • Very slow at a plateau
  • Stuck at a local minimum
  • Stuck at a saddle point


GD with Momentum

Gradient descent with momentum uses the momentum of the gradient for parameter optimization

Movement = Negative of Gradient + Momentum

Even where the gradient is 0 (e.g., a plateau or local minimum), the accumulated momentum keeps the parameters moving: the real movement combines the negative of the gradient with the momentum term


GD with Momentum

The GD with Momentum updates the parameters 𝜃 in the direction of the weighted average of the past gradients

At iteration 𝑡

  • Standard GD: 𝜃ₜ = 𝜃ₜ₋₁ − α𝛻ℒ(𝜃ₜ₋₁)
    • Where 𝜃ₜ₋₁ are the parameters from the previous iteration 𝑡−1

  • GDM: 𝜃ₜ = 𝜃ₜ₋₁ − Vₜ
    • Where: Vₜ = βVₜ₋₁ + α𝛻ℒ(𝜃ₜ₋₁)
    • Vₜ is called momentum
      • It accumulates the gradients from the past several steps
      • This term is analogous to the momentum of a heavy ball rolling down the hill
    • β is referred to as the coefficient of momentum
      • A typical value of β is 0.9
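One GDM update, written as the two formulas above (scalar-per-parameter lists; the function name is illustrative):

```python
def gdm_step(theta, velocity, grad, alpha=0.01, beta=0.9):
    """One GD-with-momentum update:
    V_t = beta * V_{t-1} + alpha * grad;  theta_t = theta_{t-1} - V_t."""
    velocity = [beta * v + alpha * g for v, g in zip(velocity, grad)]
    theta = [t - v for t, v in zip(theta, velocity)]
    return theta, velocity

theta, V = gdm_step([1.0], [0.0], grad=[2.0])
print(theta, V)  # [0.98] [0.02]
```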


GD with Nesterov Accelerated Momentum

  • Update term: 𝜃ₜ = 𝜃ₜ₋₁ − Vₜ
    • Where: Vₜ = βVₜ₋₁ + α𝛻ℒ(𝜃ₜ₋₁ − βVₜ₋₁)
    • The look-ahead point 𝜃ₜ₋₁ − βVₜ₋₁ predicts where the parameters will move next, so the gradient is evaluated there

 

GD with momentum

GD with Nesterov momentum


Adaptive Moment Estimation (Adam)
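The slide's figure cannot be reproduced here, but the standard Adam update (Kingma & Ba, 2015) can be sketched: it keeps a momentum-like average m of the gradient and an average v of its square, corrects both for initialization bias, and scales the step per parameter. The defaults below follow the common convention (α = 0.001, β₁ = 0.9, β₂ = 0.999):

```python
import math

def adam_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]   # bias-corrected estimates
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [th - alpha * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v
```

Note the adaptive behaviour: on the very first step the parameter moves by roughly α regardless of the raw gradient magnitude.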

 


Optimizer comparison

Animation from: https://imgur.com/s25RsOr


Learning rate

  • The gradient tells us the direction in which the loss has the steepest rate of increase, but it does not tell us how far along the opposite direction we should step
  • Choosing the learning rate (also called the step size) is one of the most important hyper-parameter settings for NN training

LR too small

LR too large


Learning rate

  • Training with different learning rates can result in different loss values:
    • High learning rate: the loss may increase (diverge) or plateau quickly at a high value
    • Low learning rate: the loss decreases too slowly (takes many epochs to reach a solution)


Scheduling the learning rate

Learning rate scheduling is applied to change the values of the learning rate during the training

  • Annealing: reducing the learning rate over time
    • Approach 1: reduce the learning rate by some factor every few epochs
      • Typical values: reduce the learning rate by a half every 5 epochs, or divide by 10 every 20 epochs
    • Approach 2: exponential or cosine decay gradually reduce the learning rate over time
    • Approach 3: reduce the learning rate by a constant (e.g., by half) whenever the validation loss stops improving
  • Warmup:
    1. Gradually increase the learning rate at the start of the training
    2. Then let the learning rate cool down (decay) until the end of the training
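Approach 1 (step decay) is one line of arithmetic; a sketch with illustrative defaults (halve every 5 epochs):

```python
def step_decay(lr0, epoch, drop=0.5, every=5):
    """Reduce the learning rate by a factor `drop` every `every` epochs."""
    return lr0 * drop ** (epoch // every)

print([step_decay(0.1, e) for e in (0, 4, 5, 10)])  # [0.1, 0.1, 0.05, 0.025]
```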

Decay profiles: exponential, cosine, and warmup



Regularization

Regularization is a set of techniques to:

Prevent overfitting 🡪 Improve accuracy when facing new data

Underfitting

  • The model is too “simple” to represent all the relevant class characteristics
  • E.g., model with too few parameters
  • Produces high error on the training set and high error on the validation set

Overfitting

  • The model is too “complex” and fits irrelevant characteristics (noise) in the data
  • E.g., model with too many parameters
  • Produces low error on the training set and high error on the validation set


Regularization

Overfitting

A model with high capacity fits the noise in the data instead of the underlying relationship


L2 regularization

The loss is augmented with a penalty on the squared magnitude of the weights, ℒ_reg(𝜃) = ℒ(𝜃) + λ‖𝜃‖₂² (a.k.a. weight decay): large weights are discouraged, which limits overfitting


L1 regularization

The penalty is the absolute magnitude of the weights, ℒ_reg(𝜃) = ℒ(𝜃) + λ‖𝜃‖₁: it drives many weights to exactly zero, yielding sparse solutions


Dropout regularization

Randomly drop units (along with their connections) during training

  • Each unit is dropped with a fixed probability p (the dropout rate), independently of the other units
  • The hyper-parameter p needs to be chosen (tuned)
    • Often, between 20% and 50% of the units are dropped
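A sketch of (inverted) dropout, where p is the probability of dropping a unit and survivors are rescaled by 1/(1−p) so the expected activation is unchanged at test time (the rescaling convention is the common modern one, not stated on the slide):

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each unit with probability p during training; scale survivors
    by 1/(1 - p). At test time the layer is the identity."""
    if not training:
        return list(activations)
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]

random.seed(0)
out = dropout([1.0] * 1000, p=0.5)
# roughly half the units are zeroed; the survivors become 2.0
```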


Dropout regularization

This technique, using mini-batches, is similar to ensemble learning

  • Every mini-batch trains a slightly-different network

mini-batch 1

mini-batch 2

mini-batch 3

mini-batch n

……


Early stopping

  • During model training, use a validation set
    • E.g., validation/train ratio of about 25% to 75%
  • Stop when the validation accuracy (or loss) has not improved after n epochs
    • The parameter n is called patience
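The patience rule can be sketched offline on a list of validation losses (the function name and the return convention — the epoch index at which training stops — are my own):

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch at which training stops: the first epoch whose
    best-so-far validation loss has not improved for `patience` epochs."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

# Losses improve until 0.30, then stall: stop 3 epochs after the best one.
stop = early_stopping([0.9, 0.5, 0.3, 0.31, 0.32, 0.33, 0.2], patience=3)
print(stop)  # 5
```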

Figure: training vs. validation loss curves; training stops where the validation curve stops improving



Tuning the hyper-parameters

  • Training NNs can involve setting many hyper-parameters

  • The most common hyper-parameters include:
    • Number of layers, and number of neurons per layer
    • Initial learning rate
    • Learning rate decay schedule (e.g., decay constant)
    • Optimizer type

  • Other hyper-parameters may include:
    • Regularization parameters (ℓ_2 penalty, dropout rate)
    • Batch size
    • Activation functions
    • Loss function

  • Hyper-parameter tuning can be time-consuming for larger NNs
  • Grid search
    • Check all values in a range with a step value
  • Random search
    • Randomly sample values for the parameter
    • Often preferred to grid search
  • Bayesian hyper-parameter optimization


Ensemble Learning

Ensemble learning is training multiple classifiers separately and combining their predictions

  • Ensemble learning often outperforms individual classifiers
  • Better results are obtained with higher model variety in the ensemble

  • Bagging (bootstrap aggregating)
    • Randomly draw subsets from the training set (i.e., bootstrap samples)
    • Train separate classifiers on each subset of the training set
    • Perform classification based on the average vote of all classifiers

  • Boosting
    • Train a classifier, and apply weights on the training set (apply higher weights on misclassified examples, focus on “hard examples”)
    • Train new classifier, reweight training set according to prediction error
    • Repeat
    • Perform classification based on weighted vote of the classifiers


k-fold Cross-Validation

Typically used when the training dataset is small


Batch Normalization

Batch normalization standardizes the activations of a layer over each mini-batch (zero mean, unit variance per feature), then applies a learnable scale γ and shift β; it stabilizes and speeds up training



Architectures

Deep learning models can result from different architectures, depending on:

  • The task
    • Classification
    • Segmentation
    • Regression
  • The domain of the input
    • Still data
      • Complete signals
      • Images
    • Time series
      • Evolving signals
      • Videos

  • The architecture is the structure of the network to be then trained
    • Number of layers
    • Type of layers (with their internal parameters): convolutional, pooling, batchnorm, activation, …
    • Interconnection topology



Working on Time Series

Recurrent NNs are used for modeling sequential data and data with varying length of inputs and outputs

  • Videos, text, speech, DNA sequences, human skeletal data

  • RNNs introduce recurrent connections between the neurons
    • This allows processing sequential data one element at a time by selectively passing information across a sequence
    • Memory of the previous inputs is stored in the model’s internal state and affects the model predictions
    • Can capture correlations in sequential data

  • RNNs use backpropagation-through-time for training
  • RNNs are more sensitive to the vanishing gradient problem than CNNs
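A minimal sketch of the recurrence (scalar hidden state, tanh activation; the weight values are illustrative) showing that the same weights are reused at every step and that input order matters:

```python
import math

def rnn_step(h_prev, x_t, w_h, w_x, b):
    """One recurrent step: h_t = tanh(w_h * h_prev + w_x * x_t + b).
    The hidden state carries memory of all previous inputs."""
    return math.tanh(w_h * h_prev + w_x * x_t + b)

def rnn(xs, w_h=0.5, w_x=1.0, b=0.0, h0=0.0):
    h = h0
    for x in xs:
        h = rnn_step(h, x, w_h, w_x, b)   # same weights reused at every step
    return h

# Input order matters: [1, 0] and [0, 1] give different final states.
print(rnn([1.0, 0.0]) != rnn([0.0, 1.0]))  # True
```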


RNNs

h₀ →(x₁)→ h₁ →(x₂)→ h₂ →(x₃)→ h₃ → OUTPUT

At each step the hidden state is updated from the previous state and the current input, hₜ = f(hₜ₋₁, xₜ); the final state produces the output


RNNs

Application          Input                                  Output
Image Captioning     image                                  “A person riding a motorbike on dirt road”
Sentiment Analysis   “Awesome movie. Highly recommended.”   Positive
Machine Translation  “Happy Diwali”                         “शुभ दीपावली”


Bidirectional RNNs

  • Bidirectional RNNs incorporate both forward and backward passes through sequential data
    • The output may not only depend on the previous elements in the sequence, but also on future elements in the sequence
    • It resembles two RNNs stacked on top of each other
    • The output at each step combines both past and future elements of the sequence


LSTM Networks

Long Short-Term Memory (LSTM) networks are a variant of RNNs

  • LSTM mitigates the vanishing/exploding gradient problem
    • Solution: a Memory Cell, updated at each step in the sequence

  • Three gates control the flow of information to and from the Memory Cell
    • Input Gate: protects the current step from irrelevant inputs
    • Output Gate: prevents the current step from passing irrelevant information to later steps
    • Forget Gate: limits information passed from one cell to the next

  • Most modern RNN models use either LSTM units or other more advanced types of recurrent units (e.g., GRU units)


LSTM Networks

LSTM cell

  • Input gate, output gate, forget gate, memory cell
  • LSTM can learn long-term correlations within data sequences



Working on fixed-size data

  • Convolutional Neural Networks are a type of Feed-Forward Neural Network used in:
    • Image analysis
    • Natural language processing
    • Other complex image classification problems

  • A CNN contains hidden convolutional layers, which form the base of ConvNets

  • Features refer to minute details in the image data like:
    • Edges
    • Borders
    • Shapes
    • Textures
    • Etc.

  • At a higher level, convolutional layers detect these patterns in the image data with the help of filters
    • The lower-level details are captured by the first few convolutional layers

  • The deeper the network goes, the more sophisticated the pattern searching becomes


Convolutional Neural Networks (CNNs)


Convolutional Neural Networks (CNNs)

  • A filter can be thought of as a relatively small matrix for which we decide the number of rows and columns
    • The values of this filter matrix are initialized with random numbers
    • When the convolutional layer receives pixel values of input data, the filter convolves over each patch of the input matrix

  • The output of the convolutional layer is usually passed through the ReLU activation function to bring non-linearity to the model
    • It takes the feature map and replaces all the negative values with zero

  • Pooling is a very important step in the ConvNets
    • It reduces the computation
    • It makes the model tolerant of distortions and variations

  • Finally, a fully connected dense layer takes the flattened feature matrix and produces the prediction required by the use case


Convolutional Neural Networks (CNNs)

Aims behind the use of CNNs

Fully connected networks (and layers) are maybe unnecessarily complex for images; starting from them, we target:

  • Small models
  • Reduced number of weights
  • Shared weights


Convolutional Neural Networks (CNNs)

Why convolutions?

  • Some patterns are much smaller than the image
    • A small region can be represented with fewer parameters (need for a small detector)
  • These patterns can appear anywhere in the image
    • Such small detectors must move around the image

We can define a small beak detector

The beak detector should move


Convolutional Neural Networks (CNNs)

A convolutional layer can match these requisites

  • It is made of filters (filter bank) that do convolutional operations

Acts as a filter

Beak detector


Convolutional Neural Networks (CNNs)

2D convolutions

  • A convolution captures useful features (e.g., edge detector)

A 3×3 filter, e.g. the edge-detecting kernel

0  1  0
1 -4  1
0  1  0

is convolved over the input image to produce the convolved image (edges highlighted)
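A sketch of the 2-D "valid" convolution used in these slides (strictly speaking cross-correlation, as is customary in CNNs), demonstrated on the 6 × 6 binary image and 3 × 3 filter from the stride example:

```python
def conv2d(image, kernel, stride=1):
    """'Valid' 2-D convolution (cross-correlation, as used in CNNs):
    slide the kernel over the image, taking a dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(0, len(image) - kh + 1, stride):
        row = []
        for c in range(0, len(image[0]) - kw + 1, stride):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

img = [[1, 0, 0, 0, 0, 1], [0, 1, 0, 0, 1, 0], [0, 0, 1, 1, 0, 0],
       [1, 0, 0, 0, 1, 0], [0, 1, 0, 0, 1, 0], [0, 0, 1, 0, 1, 0]]
k = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
print(conv2d(img, k)[0])            # [3, -1, -3, -1]
print(conv2d(img, k, stride=2)[0])  # [3, -3]
```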


Convolutional Neural Networks (CNNs)

Stride

  • The stride parameter sets the jump of the kernel at each translation

6 × 6 image:

1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

3 × 3 filter kernel:

 1 -1 -1
-1  1 -1
-1 -1  1

Stride = 1: the dot product of the kernel with the first 3 × 3 patch gives 3; sliding one column to the right gives −1.

Stride = 2: the kernel jumps two columns at a time, so the first two outputs are 3 and −3.


Convolutional Neural Networks (CNNs)

Multiple filters can form a feature map

Convolving Filter 1

 1 -1 -1
-1  1 -1
-1 -1  1

over the 6 × 6 image with stride = 1 produces a 4 × 4 output:

 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

The same filter weights are shared across all positions of the image.


Convolutional Neural Networks (CNNs)

Multiple filters can form a feature map

Convolving Filter 2

-1  1 -1
-1  1 -1
-1  1 -1

over the same image (stride = 1) gives a second 4 × 4 output:

-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

The two 4 × 4 images (a 4 × 4 × 2 matrix) form the feature map.


Convolutional Neural Networks (CNNs)

  • In CNNs, the small regions "seen" by the filters are called local receptive fields
  • The depth of each feature map corresponds to the number of convolutional filters used at each layer
  • The depth of the architecture is given by the number of layers

Figure: input image → Layer 1 feature map (produced by Filter 1 and Filter 2, with weights w1, …, w8) → Layer 2 feature map


Convolutional Neural Networks (CNNs)

Color images are made of 3 channels

  • Each channel has its own filter: for a 3-channel color image, each filter is a 3 × 3 × 3 tensor, with one 3 × 3 kernel per channel
  • The three per-channel responses are summed into a single value of the output feature map

Roberto Marani

Slide 91
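A minimal NumPy sketch of the idea (random values stand in for a real color image; summing the per-channel responses into a single map follows the standard CNN convention):

```python
import numpy as np

def conv2d(image, kernel):
    # valid 2-D cross-correlation of one channel with one 3x3 filter slice
    kh, kw = kernel.shape
    return np.array([[np.sum(image[i:i+kh, j:j+kw] * kernel)
                      for j in range(image.shape[1] - kw + 1)]
                     for i in range(image.shape[0] - kh + 1)])

rng = np.random.default_rng(0)
color_image = rng.integers(0, 2, size=(6, 6, 3))  # 6 x 6 image, 3 channels
filter_1 = rng.integers(-1, 2, size=(3, 3, 3))    # one 3 x 3 slice per channel

# Each channel is convolved with its own filter slice; the per-channel
# responses are summed into one 4 x 4 output map.
response = sum(conv2d(color_image[:, :, c], filter_1[:, :, c]) for c in range(3))
```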

92 of 113

Convolutional Neural Networks (CNNs)

A convolutional layer can be represented as a fully-connected layer

  • The output has to be flattened to have the same representation

[Figure: the 6 × 6 image enters the convolution layer; the resulting feature maps are flattened into a single vector, which is the representation a fully-connected layer works on.]

93 of 113

Convolutional Neural Networks (CNNs)

A convolutional layer can be represented as a fully-connected layer

[Figure: the 6 × 6 image is flattened into 36 inputs numbered 1–36; the first output value (3) connects only to the 9 inputs covered by the 3 × 3 filter, i.e. pixels 1, 2, 3, 7, 8, 9, 13, 14, 15.]

Each output connects to only 9 inputs, not to all of them: fewer parameters than a fully-connected layer.

94 of 113

Convolutional Neural Networks (CNNs)

A convolutional layer can be represented as a fully-connected layer

[Figure: neighboring output values (e.g. 3 and -1) connect to overlapping sets of the 36 inputs, and their connections use the same 9 filter weights.]

The weights (the unknowns) are the same for every position: even fewer parameters to be learned.
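The savings described on the last two slides can be made concrete with plain arithmetic (our own illustration for the 6 × 6 image and 3 × 3 filter):

```python
inputs  = 6 * 6    # flattened 6 x 6 image: 36 inputs
outputs = 4 * 4    # 4 x 4 output map: 16 neurons

fully_connected = inputs * outputs   # every output sees every input
sparse          = 9 * outputs        # each output sees only 9 inputs
shared          = 9                  # weight sharing: just the 9 filter values
```

The counts drop from 576 (fully connected) to 144 (local connections) to 9 (local connections with shared weights).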

95 of 113

Convolutional Neural Networks (CNNs)

  • Convolutional layers increase the size of the data flowing into the network, depending on the number of filters in the bank
  • A pooling layer subsamples the feature maps to reduce the input size

Subsampling

This is a bird even after subsampling

A smaller image (or feature map) needs fewer parameters to be characterized


96 of 113

Convolutional Neural Networks (CNNs)

  • Pooling layers reduce the spatial size of the feature maps
  • Reduce the number of parameters, prevent overfitting

Max pooling

  • Reports the maximum output within a rectangular neighborhood

Average pooling

  • Reports the average output of a rectangular neighborhood

Input matrix (4 × 4):

1 3 5 3
4 2 3 1
3 1 1 3
0 1 0 4

MaxPool with a 2 × 2 filter and a stride of 2 gives the output matrix (2 × 2):

4 5
3 4
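A minimal NumPy sketch reproducing this example (max_pool is our own helper, not a library function):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Report the maximum within each size x size window, moving by `stride`."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    return np.array([[x[i*stride:i*stride+size, j*stride:j*stride+size].max()
                      for j in range(ow)] for i in range(oh)])

x = np.array([[1, 3, 5, 3],
              [4, 2, 3, 1],
              [3, 1, 1, 3],
              [0, 1, 0, 4]])
pooled = max_pool(x)   # [[4, 5], [3, 4]]
```

Replacing `.max()` with `.mean()` turns this into average pooling.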

97 of 113

Convolutional Neural Networks (CNNs)

Example of a feature extraction architecture

  • After 2 convolutional layers, a max-pooling layer reduces the size of the feature maps (typically by a factor of 2)
  • A fully-connected layer and a softmax layer perform the classification
    • The fully-connected layer works on flattened feature vectors
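The softmax stage can be sketched in a few lines of NumPy (a minimal illustration; the logits below are made-up numbers standing in for a real fully-connected output):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

classes = ["Living Room", "Bedroom", "Kitchen", "Bathroom", "Outdoor"]
logits = np.array([2.0, 1.0, 0.5, 0.2, 3.0])   # hypothetical fully-connected output
probs = softmax(logits)                         # probabilities summing to 1
prediction = classes[int(np.argmax(probs))]     # class with the largest logit
```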

[Figure: a VGG-style stack of conv layers (64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512 filters) interleaved with max-pooling layers, followed by a fully-connected layer that classifies the image as Living Room, Bedroom, Kitchen, Bathroom, or Outdoor.]

98 of 113

Convolutional Neural Networks (CNNs)

Be careful with the data size!

Example (conv layers: 3 × 3 kernel, stride = 1; max pool: 2 × 2 pool size, stride = 2):

  • Input color image: 640 × 480 × 3
  • 1st feature map: (640:2) × (480:2) × 64 = 320 × 240 × 64 → 64 × 9 × 3 = 1,728 learnables
  • 2nd feature map: 160 × 120 × 128 → 128 × 64 × 9 = 73,728 learnables
  • 3rd feature map: 80 × 60 × 256 → 256 × 128 × 9 = 294,912 learnables
  • 4th feature map: 40 × 30 × 512 → 512 × 256 × 9 = 1,179,648 learnables
  • Flattened vector: 614,400 × 1 → 614,400 × 5 weights + 1 × 5 biases in the fully-connected layer

Total: ~4.62M learnables

The output of a convolution has size [(Input_Size − Kernel_Size + 2·Padding_Size) / Stride] + 1 along each spatial dimension (with suitable padding, the output size can equal the input size).

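The sizing rule above can be checked with a small helper function (our own, for illustration):

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    # [(Input_Size - Kernel_Size + 2*Padding_Size) / Stride] + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

# 3x3 conv, stride 1, padding 1: the output size equals the input size
same = conv_output_size(640, 3, padding=1, stride=1)

# 2x2 max pooling with stride 2 halves each spatial dimension
pooled = conv_output_size(640, 2, padding=0, stride=2)

# Learnables of the first conv layer: 64 filters x 3x3 kernel x 3 channels
conv1_learnables = 64 * 3 * 3 * 3
```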

99 of 113

Example of CNNs


100 of 113

Example of CNNs

LeNet-5

  • 60,000 parameters
  • Trained on greyscale 32×32 digit images
  • Aim: Recognize digits in the range [0, 9]


101 of 113

Example of CNNs

AlexNet

  • 60 million parameters
  • Tested with the ImageNet dataset (benchmark dataset with 1000 classes)
  • Default AlexNet accepts color images with dimensions 224×224


102 of 113

Example of CNNs

VGG-16

  • 138 million parameters
  • It outputs one of the 1000 classes of the ImageNet dataset
  • Default VGG-16 accepts color images with dimensions 224×224


103 of 113

Example of CNNs

Inception v1

  • First tackled the problem of vanishing/exploding gradients with
    • Two auxiliary classifiers connected to intermediate layers
      • Discarded during testing
      • Contribute to the training (they enter the computation of the loss)
    • Inception modules, which process the input in parallel and then concatenate the outputs of convolutions of different sizes
  • 7 million parameters
  • It outputs one of the 1000 classes of the ImageNet dataset
  • Default Inception accepts color images with dimensions 224×224


104 of 113

Example of CNNs

ResNet-50

  • Tackles the degradation problem
    • As the network depth increases, accuracy gets saturated and then degrades rapidly
    • Mitigate the problem of vanishing gradients during training
  • Solution: bottleneck residual blocks:
    • Identity block: consists of 3 convolution layers with 1×1, 3×3, and 1×1 kernel sizes, all of which are equipped with BN. The ReLU activation function is applied to the first two layers, while the input of the identity block is added to the last layer before applying ReLU.
    • Convolution block: same as identity block, but the input of the convolution block is first passed through a convolution layer with 1×1 kernel size and BN before being added to the last convolution layer of the main series.
  • 26 million parameters
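As a rough conceptual sketch (not the actual ResNet-50 implementation: the 1×1, 3×3, 1×1 convolutions are reduced here to matrix products on a feature vector, and batch normalization is omitted), the identity block computes:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def identity_block(x, w1, w2, w3):
    """Bottleneck identity block, sketched on a 1-D feature vector."""
    out = relu(x @ w1)       # first 1x1 conv + ReLU
    out = relu(out @ w2)     # 3x3 conv + ReLU
    out = out @ w3           # last 1x1 conv (no activation yet)
    return relu(out + x)     # add the skip connection, then apply ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=8)            # input features (8 "channels")
w1 = rng.normal(size=(8, 2))      # bottleneck: 8 -> 2
w2 = rng.normal(size=(2, 2))
w3 = rng.normal(size=(2, 8))      # expand back: 2 -> 8
y = identity_block(x, w1, w2, w3)
```

The skip connection `out + x` is what lets gradients flow directly through deep stacks of such blocks.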


105 of 113

Transfer learning

  • Transfer learning is commonly used in deep learning applications
    • Take a pretrained network and use it as a starting point to learn a new task
    • Fine-tuning a network with transfer learning is usually much faster and easier than training a network from scratch with randomly initialized weights
    • It is possible to quickly transfer learned features to a new task using a smaller number of training images


106 of 113

Deconvolutional Neural Networks (DNNs)

  • Deconvolutional Neural Networks are CNNs that work in reverse.

  • To return to the original size, DNNs use upsampling and transpose convolutional layers
    • Upsampling has no trainable parameters: it just repeats the rows and columns of the image data by the corresponding factors
    • A transpose convolutional layer applies a convolution and an upsampling at the same time
      • It is represented as Conv2DTranspose(number of filters, filter size, stride)
      • Stride = 1 → no upsampling (the output has the same size as the input)
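The parameter-free upsampling described above can be sketched with NumPy's repeat (our own minimal example):

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# Repeat each row, then each column, by a factor of 2: no weights to learn
up = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)
```

Each input value becomes a 2 × 2 block, turning the 2 × 2 input into a 4 × 4 output.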


107 of 113

Encoder – Decoder Architectures

  • The cascade of a CNN and a DNN creates a new (encoded) representation of the input image and then decodes it into a new semantic space
  • Useful for image segmentation


108 of 113

Encoder – Decoder Architectures

  • Encoder:
    • The encoder branch is also known as the backbone
    • Takes an input image and generates a high-dimensional feature vector
    • Aggregates features at multiple levels
    • It can be borrowed from pre-trained classification networks (e.g., AlexNet, VGG-19, …)

  • Decoder:
    • Takes a high-dimensional feature vector and generates a semantic segmentation mask
    • Decodes the features aggregated by the encoder at multiple levels


109 of 113

Encoder – Decoder Architectures

  • Several encoder-decoder architectures exist:
    • DeepLab: The process of encoding and decoding exploits atrous convolutions, which introduce a spacing value within the kernel elements to process dilated areas. Convolutions work on a wider field of view, without increasing the computational cost
    • UNet: Nested connections link the encoder and decoder sub-networks with skip pathways to have semantically similar feature maps on both sides of the network
    • MANet: It introduces channel attention mechanisms to fuse the local feature maps captured by the backbones, with the global channel weights

  • In any case, the encoding structure (also known as backbone) comes from pretrained CNNs
    • VGG
    • ResNet
    • Inception
    • EfficientNet
  • Its weights can be initialized according to the dataset used for pretraining


110 of 113

CNN example

  • Live demo: Lecture_6_CNN_with_Actual_Data.m


111 of 113

  • Current research topics
  • Exam summary

Next steps


112 of 113

Roberto Marani

Researcher

National Research Council of Italy (CNR)

Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing (STIIMA)

via Amendola 122/D-O, 70126 Bari, Italy

+39 080 592 94 58

roberto.marani@stiima.cnr.it

robertomarani.com

cnr.it/people/roberto.marani

stiima.cnr.it/ricercatori/roberto-marani/


113 of 113

Credits

  • Special Topics: Adversarial Machine Learning, Alex Vakanski
  • The Essential Guide to Neural Network Architectures, v7labs.com
  • 5 Popular CNN Architectures Clearly Explained and Visualized - towarddatascience.com
  • Transfer Learning Using Pretrained Network, Mathworks, Matlab Help
  • Encoder-Decoder Networks for Semantic Segmentation, Sachin Mehta
  • Smaller network: CNN, Ming Li
