1 of 108

Artificial intelligence for quality control with active infrared thermography

Introduction to Machine Learning

ING-IND/14, 2 CFU

Roberto Marani - April 21st, 2023

2 of 108

Linear regression as an indicator

Principle

Since the surface temperature of sound regions decays linearly on a log-log scale, any deviation from a linear T–t dependence indicates a defect

[Plots of log(T) versus log(t) for a sound region and a defective region]


3 of 108

Linear regression as an indicator

  • If the model is appropriate (i.e., no defects), the residuals approximate independent random errors
  • If the model is not appropriate (i.e., with defects), the residuals follow a specific pattern (that is, residual data points do not appear to have a random scatter)
  • A non-random pattern in the residuals therefore indicates that the model does not properly fit the data

One measure of goodness of fit is the coefficient of determination (or R2)

  • This statistic indicates how closely the values obtained from the fitted model match the dependent variable that the model is intended to predict

R2 = 1 – SSresid / SStotal

  • SSresid is the sum of the squared residuals from the regression
  • SStotal is the sum of the squared differences from the mean of the dependent variable (total sum of squares).

  • R2 ∈ [0,1] indicates how well the linear model explains the variance of the dependent variable y
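As a rough illustration of how this indicator can be computed, the sketch below fits a line to log(T) versus log(t) and evaluates R2 from the two sums of squares defined above. It is a minimal Python sketch with a synthetic cooling curve, not the course implementation (the accompanying scripts are MATLAB .m files).

```python
import numpy as np

# Synthetic, purely illustrative cooling curve: T ~ t^(-1/2) plus noise.
t = np.linspace(0.1, 10.0, 200)                      # time samples
T = 5.0 * t ** -0.5 + 0.05 * np.random.randn(t.size)  # surface temperature

x, y = np.log(t), np.log(T)
slope, intercept = np.polyfit(x, y, 1)   # linear model in log-log scale
y_hat = slope * x + intercept

ss_resid = np.sum((y - y_hat) ** 2)      # sum of squared residuals
ss_total = np.sum((y - y.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_resid / ss_total
print(f"R^2 = {r2:.4f}")                 # close to 1 for a sound (linear) region
```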


4 of 108

Linear regression as an indicator

R2 discrimination has several flaws:

  • Even in the case of theoretical data, it is difficult to discriminate between surface defects and sound regions
  • Strongly dependent on the number of samples that can be acquired
    • Frequency of the acquisition
    • Duration of the monitoring
  • In processing actual data, measurement errors can alter the estimation of R2

Need for more sophisticated approaches (machine learning!)


5 of 108

Use case

Inspection of a calibrated plate of GFRP (glass-fiber-reinforced polymer)

[Drawing of the calibrated plate (overall dimensions 315 × 290): holes with diameters Ø 7.85, Ø 14.1, Ø 20.2 and Ø 7.85, Ø 8.3, Ø 9.54, Ø 13.3, Ø 17.44; hole depths 4.35, 7.08, 9.82, 12.4, 15.7; regions labeled as in-depth defects, surface defects, and sound region]


6 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks



8 of 108

Introduction to machine learning

Definition by Tom Mitchell (1998):

Machine Learning is the study of algorithms that:

  • improve their performance P
  • at some task T
  • with experience E

A well-defined learning task is given by <P,T,E>.

“Learning is any process by which a system improves performance from experience” (Herbert Simon)


9 of 108

Introduction to machine learning

Why ML?

  • Develop systems that can automatically adapt and customize themselves to individual users
    • Personalized news or mail filter

  • Discover new knowledge from large databases (data mining)
    • Market basket analysis (e.g. diapers and beer)

  • Ability to mimic humans and replace certain monotonous tasks which require some intelligence
    • like recognizing handwritten characters

  • Develop systems that are too difficult/expensive to construct manually because they require specific detailed skills or knowledge tuned to a specific task (knowledge engineering bottleneck)


10 of 108

Introduction to machine learning

Why now?

  • Flood of available data (especially with the advent of the Internet)
  • Increasing computational power
  • Growing progress in available algorithms and theory developed by researchers
  • Increasing support from industries


11 of 108

Introduction to machine learning

When do we need ML?

  • Human expertise does not exist (navigating on Mars)
  • Humans cannot explain their expertise (speech recognition)
  • Models must be customized (personalized medicine)
  • Models are based on huge amounts of data (genomics)


12 of 108

Introduction to machine learning

History of ML

  • 1950s
    • Samuel’s checker player
    • Selfridge’s Pandemonium
  • 1960s
    • Neural networks: Perceptron
    • Pattern recognition
    • Learning in the limit theory
    • Minsky and Papert prove limitations of Perceptron
  • 1970s
    • Symbolic concept induction
    • Winston’s arch learner
    • Expert systems and the knowledge acquisition bottleneck
    • Quinlan’s ID3
    • Michalski’s AQ and soybean diagnosis
    • Scientific discovery with BACON
    • Mathematical discovery with AM


13 of 108

Introduction to machine learning

History of ML

  • 1980s
    • Advanced decision tree and rule learning
    • Explanation-based Learning (EBL)
    • Learning and planning and problem solving
    • Utility problem
    • Analogy
    • Cognitive architectures
    • Resurgence of neural networks (connectionism, backpropagation)
    • Valiant’s PAC Learning Theory
    • Focus on experimental methodology
  • 1990s
    • Data mining
    • Adaptive software agents and web applications
    • Text learning
    • Reinforcement learning (RL)
    • Inductive Logic Programming (ILP)
    • Ensembles: Bagging, Boosting, and Stacking
    • Bayes Net learning


14 of 108

Introduction to machine learning

History of ML

  • 2000s
    • Support vector machines & kernel methods
    • Graphical models
    • Statistical relational learning
    • Transfer learning
    • Sequence labeling
    • Collective classification and structured outputs
    • Computer Systems Applications (Compilers, Debugging, Graphics, Security)
    • E-mail management
    • Personalized assistants that learn
    • Learning in robotics and vision
  • 2010s - today
    • Deep learning systems
    • Learning for big data
    • Bayesian methods
    • Multi-task & lifelong learning
    • Applications to vision, speech, social networks, Human-Machine Interaction, etc.
    • Natural Language Processing
    • ???


15 of 108

Introduction to machine learning

Possible applications:

  • Web search
  • Computational biology
  • Finance
  • E-commerce
  • Space exploration
  • Robotics
  • Information extraction
  • Social networks
  • Debugging software
  • Manufacturing quality control
  • [Your favorite area]

State of the art:

  • Autonomous cars
  • Natural language processing
  • Scene labeling
  • Automatic speech recognition



19 of 108

Introduction to machine learning

Types of learning

  • Supervised (inductive) learning
    • Given: training data + target outputs (labels)
  • Unsupervised learning
    • Given: training data (without target outputs)
  • Semi-supervised learning
    • Given: training data + a few target outputs
  • Reinforcement learning
    • Rewards from a sequence of actions


20 of 108

Introduction to machine learning

Types of learning

Supervised learning

Semi-supervised learning

Unsupervised learning


21 of 108

Introduction to machine learning

Supervised learning

Unsupervised learning


22 of 108

Introduction to machine learning

Classification vs. regression

Given (x1, y1), (x2, y2), ..., (xn, yn), learn a function y = f(x) to predict y given x

Y is categorical 🡪 classification

Y is real-valued 🡪 regression


23 of 108

Designing a learning system

  • Choose the training experience
  • Choose exactly what is to be learned (the target function)
  • Choose how to represent the target function
  • Choose a learning algorithm to infer the target function from the experience


24 of 108

Designing a learning system

Training vs test set

  • We generally assume that the training and test examples are independently drawn from the same overall distribution of data (“independent and identically distributed”)

  • If examples are not independent, requires collective classification

  • If test distribution is different, requires transfer learning


25 of 108

Designing a learning system

ML in practice

  • Understand domain, prior knowledge, and goals
  • Data integration, selection, cleaning, pre-processing, etc.
  • Learn models
  • Interpret results
  • Consolidate and deploy discovered knowledge

These steps are repeated in a loop



27 of 108

Evaluation

Performance on test data is a good indicator of generalization

The test accuracy is more important than the training accuracy
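As a minimal illustration (not the course implementation), the Python sketch below holds out part of a synthetic dataset and compares training and test accuracy for a deliberately flexible model; the gap between the two numbers is the generalization problem discussed on the next slides.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for real features/labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # prone to overfitting
print("train accuracy:", model.score(X_tr, y_tr))   # typically close to 1.0
print("test  accuracy:", model.score(X_te, y_te))   # the number that matters
```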


28 of 108

Generalization problem

Simple decision boundary

Complex decision boundary

In classification


29 of 108

Generalization problem

Simple model

Complex model

In regression


30 of 108

Generalization problem

Over-fitting and under-fitting



33 of 108

Comparing models: k-fold cross-validation

Cross-validation in training
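A hedged sketch of k-fold cross-validation with scikit-learn (an assumed library choice; the data are synthetic): each of the 5 folds is used once for validation while the model is trained on the remaining folds, and the fold scores are averaged to compare models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data; each of the k=5 folds is held out once for validation.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("fold accuracies:", scores)
print("mean = %.3f, std = %.3f" % (scores.mean(), scores.std()))
```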


34 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


35 of 108

Reducing dimensionality

Why do we need features?

In any ML problem, data needs to be preprocessed to:

  • Clean possible sources of error
  • Make it comparable
  • Reduce the dimensionality to have more efficient solutions

Features are measurable properties or characteristics of a phenomenon. They need to be:

  • Informative
  • Discriminating
  • Independent
  • In the correct number

Compact representation of an image (or an object within)


36 of 108

Reducing dimensionality

Examples

In character recognition

  • Histograms counting the number of black pixels along horizontal and vertical directions
  • Number of internal holes
  • Stroke detection

In speech recognition

  • Noise ratios
  • Length of sounds
  • Relative power
  • Filter matches

In spam detection algorithms

  • Presence or absence of certain email headers
  • Email structure
  • Language
  • Frequency of specific terms
  • Grammatical correctness of the text

In computer vision

  • Large number of possible features, such as edges and objects


37 of 108

Feature in image processing

Examples

  • Boundary descriptors
    • Chain code
    • Shape number
    • Polygonal approximation

  • Skeleton based
    • Medial axis transform

  • Histogram based
    • Standard deviations
    • Skew
    • Energy

  • Texture features


38 of 108

Feature in image processing

Histogram of Oriented Gradients (HOG)

  • It is used in computer vision and image processing for the purpose of object detection
  • The technique counts occurrences of gradient orientations in localized portions of an image, focusing on the structure or the shape of an object
  • It uses magnitude as well as the angle of the gradient to compute the features

Steps

  1. Resize the image to a specific size (128x64 pixels)
  2. Considering a block of 3x3 pixels, the gradients Gx and Gy are calculated for each pixel
  3. The magnitude and angle of each pixel are calculated as magnitude = √(Gx² + Gy²) and angle = arctan(Gy / Gx)

  • The resulting image is divided into blocks of 8x8 pixels
  • Magnitude and angle values of the blocks are collected in 9 bins (histogram)
  • Histograms are further processed and rearranged to constitute a 36-feature vector for each block
  • Feature vectors are normalized to reduce the effects of contrast changes
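The sketch below runs the pipeline just described using scikit-image's hog function (an assumed implementation choice, not the course code); the 128x64 input is random and merely stands in for a real detection window.

```python
import numpy as np
from skimage.feature import hog

# Random 128x64 "image" as a placeholder for a real detection window.
image = np.random.rand(128, 64)

features = hog(
    image,
    orientations=9,            # 9 histogram bins per cell
    pixels_per_cell=(8, 8),    # 8x8-pixel cells
    cells_per_block=(2, 2),    # 2x2 cells -> 4 x 9 = 36 values per block
    block_norm='L2-Hys',       # per-block normalization against contrast changes
)
print(features.shape)          # (3780,) for a 128x64 window with these settings
```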



40 of 108

Feature in image processing

Scale-Invariant Feature Transform (SIFT)

  • Computer vision algorithm to detect and describe local features in images
  • Invariant against:
    • Illumination
    • Scale
    • Rotation
    • Perspective


41 of 108

Feature in image processing

Scale-Invariant Feature Transform (SIFT)

Feature vector:

  • Location
  • Scale level where the gradient is maximum (or minimum)
  • Strength and orientation
  • Gradient sign

Steps

  1. The image is progressively blurred out
  2. At each step, the image size is halved (the set of blurred images at the same size is called an octave)
  3. Differences of Gaussian-blurred images (DoG) are computed at each scale
  4. DoG images are explored to find local maxima/minima (key points)
  5. Bad key points (low contrast) are deleted
  6. For each key point, gradient magnitudes and orientations are collected
  7. Feature vectors are made of these entries
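For illustration only, OpenCV ships a SIFT implementation of these steps; the snippet below (assuming opencv-python ≥ 4.4 and a synthetic grayscale image) extracts keypoints and their 128-dimensional descriptors.

```python
import cv2
import numpy as np

# Synthetic grayscale image standing in for a real photograph.
image = (np.random.rand(256, 256) * 255).astype(np.uint8)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

print("keypoints found:", len(keypoints))
if descriptors is not None:
    print("descriptor shape:", descriptors.shape)   # (n_keypoints, 128)
```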


42 of 108

Feature in image processing

Applications

Object detection

Image stitching


43 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


44 of 108

Unsupervised learning

  • Supervised learning uses labeled data pairs (x, y) to learn a function f : X→Y
    • But, what if we don’t have labels?

  • No labels = unsupervised learning
  • Only some points are labeled = semi-supervised learning
    • Labels may be expensive to obtain, so we only get a few

  • Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery


45 of 108

Clustering data


46 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks



50 of 108

K-means

https://mubaris.com/posts/kmeans-clustering/


51 of 108

K-means

  • Very sensitive to the initial points
    • Do many runs of K-Means, each with different initial centroids
    • Seed the centroids using a better method than randomly choosing the centroids
      • e.g., Farthest-first sampling

  • Must manually choose k
    • Learn the optimal k for the clustering
      • Note that this requires a performance measure

  • Easy implementation
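One possible way to address these points with scikit-learn (an assumption, not the course code): run K-Means with several initializations and compare a clustering-quality score over different values of k. The blob data and the silhouette score used below are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# n_init repeats the algorithm with different centroids; k-means++ is a
# smarter-than-random seeding, similar in spirit to farthest-first sampling.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10,
                    random_state=0).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")
```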


52 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


53 of 108

Hierarchical clustering

An algorithm that groups similar objects into groups (clusters) to create a set of exclusive clusters, whose objects are broadly similar to each other

It starts by treating each observation as a separate cluster

Then, it repeatedly executes the following steps:

  1. Identify the two clusters that are closest together
  2. Merge the two most similar clusters

This iterative process continues until all the clusters are merged
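A minimal sketch of this agglomerative procedure using SciPy (an assumed library; the data are two synthetic blobs): linkage records the merge history, i.e. the dendrogram, and fcluster 'cuts' it into a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),     # two synthetic groups of points
               rng.normal(3, 0.5, (20, 2))])

# Agglomerative clustering: repeatedly merge the two closest clusters.
Z = linkage(X, method='ward', metric='euclidean')   # merge history (the dendrogram)
# scipy.cluster.hierarchy.dendrogram(Z) would plot the tree.

# 'Cut' the dendrogram to obtain a chosen number of clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```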


54 of 108

Hierarchical clustering

New structure of information: the dendrogram

  • Pairs of the closest clusters are merged (agglomerative approach)
  • Euclidean distance between feature vectors

[Dendrogram figure: possible cuts at different levels yield 1, 2, 3, or 5 classes]



56 of 108

Hierarchical clustering

  • Hierarchical clustering typically works by sequentially merging similar clusters
  • This is known as agglomerative hierarchical clustering

  • In theory, it can also be done by initially grouping all the observations into one cluster, and then successively splitting these clusters
  • This is known as divisive hierarchical clustering
  • Divisive clustering is rarely done in practice


57 of 108

Hierarchical clustering

  • No assumptions about the number of clusters
    • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level

  • Hierarchical clustering may correspond to meaningful taxonomies
    • Examples in biological sciences (e.g., phylogeny reconstruction, etc), web (e.g., product catalogs), etc.


58 of 108

Implementation of Unsupervised Clustering

Lecture_5_Unsupervised_Clustering.m


59 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


60 of 108

Supervised learning

  • Supervised learning uses labeled data pairs (x, y) to learn a function f : X→Y
    • We have labels, so we can learn from experience

  • Learning can be viewed as using direct or indirect experience to approximate a chosen target function
  • Function approximation can be viewed as a search through a space of hypotheses (representations of functions) for one that best fits a set of training data
  • Different learning methods assume different hypothesis spaces (representation languages) and/or employ different search techniques


61 of 108

Example: Detection of mail spam

  • The aim is the classification of all emails the user does not want to receive and has not asked to receive
  • Input:
    • % of spam emails that were filtered
    • % of non-spam emails that were incorrectly filtered-out
    • Mail content
  • Labeled database: all emails that were labeled by users


62 of 108

Example: Detection of mail spam

  • Working principle:
    • During training, the learner tries to understand how to filter out spam
    • Minimization of the error on known examples

Known examples depending on the number of recipients


63 of 108

Example: Detection of mail spam

  • Linear separation

High error

Best separation

The model will use this learned separation to predict spam in new emails


64 of 108

Example: Detection of mail spam

  • Increasing the number of features

How can we separate the dataspace to have a new improved model?


65 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


66 of 108

k-Nearest Neighbor (kNN)

  • Basic idea
    • It classifies new cases based on a similarity measure
      • A new instance is classified by a majority vote of its k nearest neighbors
    • Several similarity measures (“closest”) can be defined (e.g., Euclidean distance function)

  • Advantages
    • Easy to implement
    • Easy to understand
    • Can be applied to data of any distribution
    • Good classification if datasets are large

  • Disadvantages
    • Need to define the index k
    • Time-consuming in prediction (need for distance computation)



68 of 108

k-Nearest Neighbor (kNN)

  • How to choose k?
    • If k is too small, it is sensitive to noise points
    • Larger k works well
    • Too large k may include majority points from other classes
    • In principle k < sqrt(n) (n is the number of examples)
    • A trial-and-error approach can tune the k parameter
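A simple trial-and-error sketch of this tuning (synthetic data; scikit-learn is an assumed implementation choice): score a few candidate values of k by cross-validation and keep the best one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Trial-and-error tuning of k, scored by 5-fold cross-validation.
for k in (1, 3, 5, 7, 11, 15):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k:2d}: mean CV accuracy = {acc:.3f}")
```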


69 of 108

k-Nearest Neighbor (kNN)

Improvements

  • Feature weighting
    • Scale each input feature by its importance
    • The distance is weighted (e.g., weighted Euclidean distance D(a,b) = √(∑i wi (ai − bi)²))
    • Need for prior knowledge about feature significance
    • Weights can be learned using cross-validation

  • Feature normalization
    • Distance between neighbors could be dominated by some attributes with relatively large numbers
    • Amplitude normalization brings all features onto a comparable scale (e.g., a bounded dynamic range between 0 and 1)


70 of 108

k-Nearest Neighbor (kNN)

Possible distance functions

  • Given:
    • an mx-by-n data matrix X, treated as mx row vectors x1, x2, ..., xmx (each 1-by-n)
    • an my-by-n data matrix Y, treated as my row vectors y1, y2, ..., ymy (each 1-by-n)
    • the distance between xs and yt can be:

  • Euclidean distance

d² = (xs − yt)(xs − yt)′

  • Standardized Euclidean distance (V is an n-by-n diagonal matrix of scaling factors S(j)², i.e. weights)

d² = (xs − yt) V⁻¹ (xs − yt)′

  • Mahalanobis distance (C is the covariance matrix)

d² = (xs − yt) C⁻¹ (xs − yt)′

  • Manhattan (city block) distance

d = ∑j ∣xsj − ytj∣

  • Minkowski distance

dᵖ = ∑j ∣xsj − ytj∣ᵖ

  • Chebychev distance

d = maxj { ∣xsj − ytj∣ }
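For reference, SciPy's cdist can evaluate all of these metrics between the rows of two matrices; the sketch below uses small random matrices purely for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.random((5, 3))   # mx-by-n
Y = rng.random((4, 3))   # my-by-n

# Pairwise distances between the rows of X and Y for several metrics above.
print(cdist(X, Y, metric='euclidean'))
print(cdist(X, Y, metric='seuclidean'))    # standardized Euclidean
print(cdist(X, Y, metric='mahalanobis'))   # uses the data covariance by default
print(cdist(X, Y, metric='cityblock'))     # Manhattan
print(cdist(X, Y, metric='minkowski', p=3))
print(cdist(X, Y, metric='chebyshev'))
```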


71 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


72 of 108

Support Vector Machine (SVM)

  • Advantages
    • Good generalization capabilities
    • Works well with few training instances
    • Finds the globally best model
    • Efficient algorithms

  • Disadvantages
    • Natively handles only binary classification (labels are yi ∈ {-1, 1})
    • Relatively slow with huge data sets
    • Need to choose the kernel (and tune it)


73 of 108

Support Vector Machine (SVM)

  • Input
    • x ∈ R^(d+1), with x0 = 1, where d is the number of input features
  • Output (known labels)
    • yi ∈ {-1, 1}

  • Objective
    • Find the model parameters ϴ ∈ R^(d+1) so that a prediction is made as follows
      • h(x) = sign(ϴᵀx) = sign(<ϴ,x>) (the sign function realizes the classification)
      • <u,v> is the inner (dot) product: u·v = ∑i uivi
      • The error on the i-th prediction is defined as yi − h(xi)

  • The set of points where <ϴ,x> = 0 is known as the hyperplane (the separation of the input feature space that defines the final binary classification)


74 of 108

Support Vector Machine (SVM)

Example


75 of 108

Support Vector Machine (SVM)

Separators


76 of 108

Support Vector Machine (SVM)

What if we have noise?

Increasing noise 🡪 One separator remains


77 of 108

Support Vector Machine (SVM)

Basic idea

  • Use fat separators
  • Increase the margin until it is maximized
    • Fewer possible models


78 of 108

Support Vector Machine (SVM)

Practical example (features x1 and x2)

ϴ = [1,0]

Hyperplane: x1 = 0 (i.e., the x2 axis)

Margin = 2

The sign function applied to the feature x1 gives the best prediction, since the margin is maximum

Note that x2 carries no information for this problem


79 of 108

Support Vector Machine (SVM)

What if the input data are not linearly separable?

The basic idea is to use a new function (kernel) that can map the input data in a new feature space where features are linearly separable

Non-linearly separable

Linearly separable

Φ = (·)2


80 of 108

Support Vector Machine (SVM)

Kernel trick

  • Instead of using the direct formulation, find kernel K such that

K(xi,xj) = <Φ(xi), Φ(xj)>

  • Computing K(xi,xj) should be more efficient than computing Φ(xi) and Φ(xj) separately
  • Use K(xi,xj) in SVM algorithm rather than <xi,xj>

Possible kernels

  • Polynomial kernel: K(xi,xj) = <xi, xj>^d (as before)
  • Gaussian kernel: K(xi,xj) = exp( −‖xi − xj‖² / (2σ²) )
  • Sigmoid kernel: K(xi,xj) = tanh(α xiᵀxj + c)
  • Cosine similarity, chi-squared, tree, graph, wavelet, string, …
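An illustrative comparison of some of these kernels on data that is not linearly separable (scikit-learn's SVC is an assumed implementation; 'rbf' corresponds to the Gaussian kernel, and the concentric-circles dataset is synthetic):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: not separable by a linear boundary.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.08, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):   # rbf = Gaussian kernel
    clf = SVC(kernel=kernel, degree=3, gamma='scale').fit(X_tr, y_tr)
    print(f"{kernel:8s} kernel: test accuracy = {clf.score(X_te, y_te):.3f}")
```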


81 of 108

Support Vector Machine (SVM)

Multi-class classification with SVMs

  • Many SVM packages already have multi-class classification built-in
  • Alternatively, use a one-vs-all approach
    • Define K parameters ϴ(k), k = 1, …, K
    • Predict the k-th class by looking at the largest (ϴ(k))ᵀx


82 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


83 of 108

Decision trees

  • Basic idea
    • A classification scheme that generates a tree and a set of rules from a given dataset
    • Inputs are processed by following the rules down to a final decision (classification)
    • Each branch node represents a choice between several alternatives, and each leaf node represents a decision

  • Advantages
    • Decision trees can generate understandable rules
    • Can handle both numerical and categorical attributes
    • Provide a clear indication of which fields are most important for prediction or classification

  • Disadvantages
    • The process of growing a decision tree is computationally expensive
      • At each node, all candidate splitting-fields must be examined to find the best split
    • Some decision trees can only deal with binary-valued target classes


84 of 108

Decision trees

  • A decision tree is a tree with the following properties:
    • An inner node represents an attribute
    • An edge represents a test on the attribute of the parent node
    • A leaf represents one of the classes
  • Construction of a decision tree
    • Based on the training data
    • Top-Down strategy


85 of 108

Decision trees

Example: When can they play a tennis game?

  • Columns are the features (predictors)
  • Rows are the instances {xi,yi}
  • y ∈ {can play, can’t play}
  • In a decision tree, each leaf node represents a rule.


86 of 108

Decision trees

Example

  • We have five leaf nodes
  • In a decision tree, each leaf node represents a rule

  • Rules:
  • If it is sunny and the humidity is NORMAL then play
  • If it is sunny and the humidity is HIGH, then do not play.
  • If it is overcast, then play.
  • If it is rainy and WEAKLY windy, then play.
  • If it is rainy and STRONGLY windy, then don't play

Categorical attributes (can also be numerical)
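To make the example concrete, the sketch below trains a tree on a small hypothetical play-tennis table (the rows are invented so that they are consistent with the rules above, not taken from the course data); categorical attributes are one-hot encoded, and criterion='entropy' anticipates the information-gain criterion discussed later.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical table in the spirit of the example above (values invented).
data = pd.DataFrame({
    'outlook':  ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'overcast', 'sunny', 'rainy'],
    'humidity': ['high',  'normal', 'high',    'high',  'normal', 'normal',  'high',  'normal'],
    'wind':     ['weak',  'weak',   'strong',  'strong', 'weak',  'weak',    'strong', 'weak'],
    'play':     ['no',    'yes',    'yes',     'no',     'yes',   'yes',     'no',     'yes'],
})

X = pd.get_dummies(data[['outlook', 'humidity', 'wind']])   # one-hot encode categorical attributes
y = data['play']

tree = DecisionTreeClassifier(criterion='entropy').fit(X, y)  # entropy-based splits
print(export_text(tree, feature_names=list(X.columns)))       # the learned rules
```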


87 of 108

Decision trees

Example with numerical attributes


88 of 108

Decision trees

Decision boundary

  • Decision trees divide the feature space into axis-parallel (hyper-)rectangles
  • Each rectangular region is labeled with one label
    • or a probability distribution over labels


89 of 108

Decision trees

How to select the attributes (and their order)

  • Principle stated by William of Ockham (1285-1347)
    • “non sunt multiplicanda entia praeter necessitatem”
    • Entities are not to be multiplied beyond necessity
    • AKA Occam’s Razor, Law of Economy, or Law of Parsimony

  • The simplest consistent explanation is the best

  • The smallest decision tree that correctly classifies all the training examples is the best
    • Too hard to find in closed form
    • Try to construct something pretty small


90 of 108

Decision trees

How to select the attributes (and their order)

  • Random: Select any attribute at random
  • Least-Values: Choose the attribute with the smallest number of possible values
  • Most-Values: Choose the attribute with the largest number of possible values
  • Max-Gain: Choose the attribute that has the largest expected information gain
    • attribute that results in the smallest expected size of subtrees rooted in its children

Which test is more informative?


91 of 108

Decision trees

Overfitting

  • If a decision tree is fully grown, it may lose some generalization capability
  • This is overfitting

  • Overfitting Due to Presence of Noise: Mislabeled instances may contradict the class labels of other similar records
  • Overfitting Due to Lack of Representative Instances: Lack of representative instances in the training data can prevent refinement of the learning algorithm

Solutions:

  • Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data
  • Allow the tree to overfit the data, and then post-prune the tree
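A hedged sketch of both remedies with scikit-learn (assumed library; the noisy data are synthetic): limiting depth and leaf size stops growth early, while cost-complexity pruning (ccp_alpha) plays the role of pruning an overgrown tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y injects label noise, the situation where overfitting hurts most.
X, y = make_classification(n_samples=600, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full   = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                  # fully grown
early  = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                                random_state=0).fit(X_tr, y_tr)                  # stop growing earlier
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)  # cost-complexity pruning

for name, model in [('fully grown', full), ('early stop', early), ('pruned', pruned)]:
    print(f"{name:12s} train={model.score(X_tr, y_tr):.3f}  test={model.score(X_te, y_te):.3f}")
```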


92 of 108

Decision trees

Applications

  • As accurate as human experts:
    • A study for diagnosing breast cancer:
      • Humans correctly classify the examples 65% of the time
      • The decision tree classified 72% correctly
    • British Petroleum designed a decision tree for gas-oil separation for offshore oil platforms
    • Cessna designed an airplane flight controller

  • Microsoft's first release of Kinect Body Tracking used decision forests (ensembles of decision trees) to skeletonize humans


93 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


94 of 108

Neural networks

Basic idea

  • An Artificial Neural Network (ANN) is an efficient computing system inspired by the analogy with biological neural networks
  • Every neuron is connected with another neuron through a connection link
  • Each connection link is associated with a weight that has information about the input signal
  • Each neuron has an internal state, which is called an activation signal
  • Output signals, which are produced after combining the input signals and activation rules, may be sent to other units


95 of 108

Neural networks

Biological neuron

  • A neuron is a cell that carries electrical impulses
  • Neurons are the basic units of the nervous system and its most important part is the brain
  • Every neuron is made of a cell body (also called a soma), dendrites and an axon
    • Dendrites and axons are nerve fibers
  • There are about 86 billion neurons in the human brain, which comprises roughly 10% of all brain cells
  • Neurons are connected to one another through tiny gaps called synapses
    • These gaps can be chemical or electrical synapses and pass the signal from one neuron to the next



97 of 108

Neural networks

ANN vs. BNN

Criteria | BNN | ANN
Processing | Massively parallel | Massively parallel, but less than BNN
Learning | Can tolerate ambiguity | Very precise, structured, and formatted data is required to tolerate ambiguity
Fault tolerance | Performance degrades with even partial damage | Robust resiliency (can be fault tolerant)
Storage capacity | Stores the information in the synapse | Stores the information in memory locations



99 of 108

Neural networks

ANN vs. BNN

  • The dendrites of a biological neuron are analogous to the weighted inputs (based on their synaptic interconnections) of an artificial neural network
  • The cell body is analogous to the artificial neuron unit, which also comprises the summation and threshold units
  • The axon carries the output and is analogous to the output unit of an artificial neural network

  • ANNs are modeled using the working of basic biological neurons



101 of 108

Neural networks

Model of an artificial neuron

  • Artificial neural networks can be viewed as a weighted directed graph in which artificial neurons are nodes and the directed weighted edges connect the input neurons to the output ones

  • The Artificial Neural Network receives input from the external world in the form of vectors

  • Each input is multiplied by its corresponding weights
    • Weights are the information used by the neural network to solve a problem (represents the strength of the interconnection between neurons)

  • The weighted inputs are all summed up inside the computing unit (artificial neuron)

  • A bias is added to shift the system response


102 of 108

Neural networks

Model of an artificial neuron

  • The sum corresponds to any numerical value ranging from 0 to infinity

  • In order to limit the response to a desired range, a threshold is set up, i.e. the result is passed through an activation function
    • The activation function is set to get the desired output:
      • Binary — The output has only two values either 0 or 1 depending on a threshold. For instance, if the net weighted input is greater than 1, the output is assumed to be 1 otherwise 0
      • Sigmoidal Hyperbolic — This function has an S-shaped curve. Here the tan hyperbolic function is used to approximate output from the net input

f(x) = 1 / (1 + exp(−𝝈x)) (𝝈 is the steepness parameter)
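As a toy illustration of the neuron model described above (weights, bias, and inputs are invented numbers, not data from the course):

```python
import numpy as np

def neuron(x, w, b, sigma=1.0):
    """Single artificial neuron: weighted sum plus bias, passed through a sigmoid."""
    z = np.dot(w, x) + b                      # combine weighted inputs and bias
    return 1.0 / (1.0 + np.exp(-sigma * z))   # sigmoid activation, steepness sigma

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8,  0.1, 0.4])   # connection weights
b = -0.5                         # bias

print(neuron(x, w, b))                         # continuous activation in (0, 1)
print(1.0 if np.dot(w, x) + b > 1 else 0.0)    # binary (threshold) activation from the slide
```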


103 of 108

Neural networks

Network architecture

  • Input layer: Units (artificial neurons) that receive input from the world

  • Output layer: Units that produce the final prediction (the number of outputs is equal to the number of target classes)

  • Hidden layer: Units in between. They transform the input into something that output units can use in some way

  • Most neural networks are fully connected (each hidden neuron is connected to every neuron of the input and output layers) 🡪 the weights can be written in matrix form
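A minimal fully connected network along these lines, sketched with scikit-learn's MLPClassifier (an assumed library choice; one hidden layer, logistic activation, synthetic two-class data):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)   # a non-linear problem
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fully connected network: 2 inputs -> one hidden layer of 16 units -> output.
net = MLPClassifier(hidden_layer_sizes=(16,), activation='logistic',
                    max_iter=2000, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", net.score(X_te, y_te))
```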


104 of 108

Neural networks

  • Advantages
    • Can also solve non-linear problems
    • Implemented in many applications
      • Function approximation, or regression analysis, including time series prediction, fitness approximation and modeling
      • Classification, including pattern and sequence recognition, novelty detection and sequential decision making
      • Data processing, including filtering, clustering, blind source separation and compression.
      • Robotics (including directing manipulators) and control

  • Disadvantages
    • The number of hidden units has to be tuned to make the network capable but avoiding overfitting
    • There is no control over what the network is doing during the classification


105 of 108

Implementation of supervised learning for defect classification

Lecture_5_Physical_Feature_Extraction.m

Lecture_5_Supervised_Classification.m


106 of 108

  • From machine learning to deep learning
  • Convolutional Neural Networks
  • Test on actual data

Next steps


107 of 108

Roberto Marani

Researcher

National Research Council of Italy (CNR)

Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing (STIIMA)

via Amendola 122/D-O, 70126 Bari, Italy

+39 080 592 94 58

roberto.marani@stiima.cnr.it

robertomarani.com

cnr.it/people/roberto.marani

stiima.cnr.it/ricercatori/roberto-marani/


108 of 108

Credits

  • Introduction to Machine Learning, Eric Eaton
  • Introduction to Machine Learning, Lior Rokach
  • Machine learning, Rajat Sharma
  • Implementation of SIFT in CUDA, Nitin Ramchandani
  • K Nearest Neighbors, Tilani Gunawardena
  • Machine Learning with DECISION TREES, Ramandeep Kaur et al.
  • Decision Tree, R. Akerkar
  • Artificial Neural Network, Arafat Shaikh
