1 of 108

Artificial intelligence for quality control with active infrared thermography

Introduction to Machine Learning

ING-IND/14, 2 CFU

Roberto Marani - April 21st, 2023

2 of 108

Linear regression as an indicator

Principle

Since the surface temperature of sound regions decays linearly on a log-log scale, any deviation from a linear T–t dependence indicates a defect

[Plots of log(T) versus log(t) for a sound region and a defective region]


3 of 108

Linear regression as an indicator

  • If the model is appropriate (i.e., no defects), the residuals approximate independent random errors
  • If the model is not appropriate (i.e., with defects), the residuals follow a specific pattern (that is, residual data points do not appear to have a random scatter)
  • A non-random pattern in the residuals therefore indicates that the model does not properly fit the data

One measure of goodness of fit is the coefficient of determination (or R2)

  • This statistic indicates how closely the values obtained from the fitted model match the dependent variable that the model is intended to predict

R2 = 1 – SSresid / SStotal

  • SSresid is the sum of the squared residuals from the regression
  • SStotal is the sum of the squared differences from the mean of the dependent variable (total sum of squares).

  • R2 ∈ [0,1] indicates how well the linear model explains the variance of the dependent variable y
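As a rough illustration of how this indicator can be computed, the sketch below fits a line to log(T) versus log(t) and evaluates R2 from the two sums of squares defined above. It is a minimal Python sketch with a synthetic cooling curve, not the course implementation (the accompanying scripts are MATLAB .m files).

```python
import numpy as np

# Synthetic, purely illustrative cooling curve: T ~ t^(-1/2) plus noise.
t = np.linspace(0.1, 10.0, 200)                      # time samples
T = 5.0 * t ** -0.5 + 0.05 * np.random.randn(t.size)  # surface temperature

x, y = np.log(t), np.log(T)
slope, intercept = np.polyfit(x, y, 1)   # linear model in log-log scale
y_hat = slope * x + intercept

ss_resid = np.sum((y - y_hat) ** 2)      # sum of squared residuals
ss_total = np.sum((y - y.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_resid / ss_total
print(f"R^2 = {r2:.4f}")                 # close to 1 for a sound (linear) region
```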


4 of 108

Linear regression as an indicator

R2 discrimination has several flaws:

  • Even in the case of theoretical data, it is difficult to discriminate between surface defects and sound regions
  • Strongly dependent on the number of samples that can be acquired
    • Frequency of the acquisition
    • Duration of the monitoring
  • In processing actual data, measurement errors can alter the estimation of R2

Need for more sophisticated approaches (machine learning!)


5 of 108

Use case

Inspection of a calibrated plate of GFRP (glass-fiber-reinforced polymer)

[Drawing of the calibrated plate (overall dimensions 315 × 290): holes with diameters Ø 7.85, Ø 14.1, Ø 20.2 and Ø 7.85, Ø 8.3, Ø 9.54, Ø 13.3, Ø 17.44; hole depths 4.35, 7.08, 9.82, 12.4, 15.7; regions labeled as in-depth defects, surface defects, and sound region]


6 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks



8 of 108

Introduction to machine learning

Definition by Tom Mitchell (1998):

Machine Learning is the study of algorithms that:

  • improve their performance P
  • at some task T
  • with experience E

A well-defined learning task is given by <P,T,E>.

“Learning is any process by which a system improves performance from experience” (Herbert Simon)


9 of 108

Introduction to machine learning

Why ML?

  • Develop systems that can automatically adapt and customize themselves to individual users
    • Personalized news or mail filter

  • Discover new knowledge from large databases (data mining)
    • Market basket analysis (e.g. diapers and beer)

  • Ability to mimic humans and replace certain monotonous tasks which require some intelligence
    • like recognizing handwritten characters

  • Develop systems that are too difficult/expensive to construct manually because they require specific detailed skills or knowledge tuned to a specific task (knowledge engineering bottleneck)


10 of 108

Introduction to machine learning

Why now?

  • Flood of available data (especially with the advent of the Internet)
  • Increasing computational power
  • Growing progress in available algorithms and theory developed by researchers
  • Increasing support from industries


11 of 108

Introduction to machine learning

When do we need ML?

  • Human expertise does not exist (navigating on Mars)
  • Humans cannot explain their expertise (speech recognition)
  • Models must be customized (personalized medicine)
  • Models are based on huge amounts of data (genomics)


12 of 108

Introduction to machine learning

History of ML

  • 1950s
    • Samuel’s checker player
    • Selfridge’s Pandemonium
  • 1960s
    • Neural networks: Perceptron
    • Pattern recognition
    • Learning in the limit theory
    • Minsky and Papert prove limitations of Perceptron
  • 1970s
    • Symbolic concept induction
    • Winston’s arch learner
    • Expert systems and the knowledge acquisition bottleneck
    • Quinlan’s ID3
    • Michalski’s AQ and soybean diagnosis
    • Scientific discovery with BACON
    • Mathematical discovery with AM


13 of 108

Introduction to machine learning

History of ML

  • 1980s
    • Advanced decision tree and rule learning
    • Explanation-based Learning (EBL)
    • Learning and planning and problem solving
    • Utility problem
    • Analogy
    • Cognitive architectures
    • Resurgence of neural networks (connectionism, backpropagation)
    • Valiant’s PAC Learning Theory
    • Focus on experimental methodology
  • 1990s
    • Data mining
    • Adaptive software agents and web applications
    • Text learning
    • Reinforcement learning (RL)
    • Inductive Logic Programming (ILP)
    • Ensembles: Bagging, Boosting, and Stacking
    • Bayes Net learning


14 of 108

Introduction to machine learning

History of ML

  • 2000s
    • Support vector machines & kernel methods
    • Graphical models
    • Statistical relational learning
    • Transfer learning
    • Sequence labeling
    • Collective classification and structured outputs
    • Computer Systems Applications (Compilers, Debugging, Graphics, Security)
    • E-mail management
    • Personalized assistants that learn
    • Learning in robotics and vision
  • 2010s - today
    • Deep learning systems
    • Learning for big data
    • Bayesian methods
    • Multi-task & lifelong learning
    • Applications to vision, speech, social networks, Human-Machine Interaction, etc.
    • Natural Language Processing
    • ???


15 of 108

Introduction to machine learning

Possible applications:

  • Web search
  • Computational biology
  • Finance
  • E-commerce
  • Space exploration
  • Robotics
  • Information extraction
  • Social networks
  • Debugging software
  • Manufacturing quality control
  • [Your favorite area]

State of the art:

  • Autonomous cars
  • Natural language processing
  • Scene labeling
  • Automatic speech recognition



19 of 108

Introduction to machine learning

Types of learning

  • Supervised (inductive) learning
    • Given: training data + target outputs (labels)
  • Unsupervised learning
    • Given: training data (without target outputs)
  • Semi-supervised learning
    • Given: training data + a few target outputs
  • Reinforcement learning
    • Rewards from a sequence of actions


20 of 108

Introduction to machine learning

Types of learning

Supervised learning

Semi-supervised learning

Unsupervised learning


21 of 108

Introduction to machine learning

Supervised learning

Unsupervised learning


22 of 108

Introduction to machine learning

Classification vs. regression

Given (x1, y1), (x2, y2), ..., (xn, yn), learn a function y = f(x) to predict y given x

Y is categorical 🡪 classification

Y is real-valued 🡪 regression


23 of 108

Designing a learning system

  • Choose the training experience
  • Choose exactly what is to be learned (the target function)
  • Choose how to represent the target function
  • Choose a learning algorithm to infer the target function from the experience


24 of 108

Designing a learning system

Training vs test set

  • We generally assume that the training and test examples are independently drawn from the same overall distribution of data (“independent and identically distributed”)

  • If examples are not independent, requires collective classification

  • If test distribution is different, requires transfer learning


25 of 108

Designing a learning system

ML in practice

  • Understand domain, prior knowledge, and goals
  • Data integration, selection, cleaning, pre-processing, etc.
  • Learn models
  • Interpret results
  • Consolidate and deploy discovered knowledge

These steps are repeated in a loop



27 of 108

Evaluation

Performance on test data is a good indicator of generalization

The test accuracy is more important than the training accuracy
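As a minimal illustration (not the course implementation), the Python sketch below holds out part of a synthetic dataset and compares training and test accuracy for a deliberately flexible model; the gap between the two numbers is the generalization problem discussed on the next slides.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for real features/labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # prone to overfitting
print("train accuracy:", model.score(X_tr, y_tr))   # typically close to 1.0
print("test  accuracy:", model.score(X_te, y_te))   # the number that matters
```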


28 of 108

Generalization problem

Simple decision boundary

Complex decision boundary

In classification


29 of 108

Generalization problem

Simple model

Complex model

In regression


30 of 108

Generalization problem

Over-fitting and under-fitting



33 of 108

Comparing models: k-fold cross-validation

Cross-validation in training
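A hedged sketch of k-fold cross-validation with scikit-learn (an assumed library choice; the data are synthetic): each of the 5 folds is used once for validation while the model is trained on the remaining folds, and the fold scores are averaged to compare models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data; each of the k=5 folds is held out once for validation.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("fold accuracies:", scores)
print("mean = %.3f, std = %.3f" % (scores.mean(), scores.std()))
```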


34 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


35 of 108

Reducing dimensionality

Why do we need features?

In any ML problem, data needs to be preprocessed to:

  • Clean possible sources of error
  • Make it comparable
  • Reduce the dimensionality to have more efficient solutions

Features are measurable properties or characteristics of a phenomenon. They need to be:

  • Informative
  • Discriminating
  • Independent
  • In the correct number

Compact representation of an image (or an object within)


36 of 108

Reducing dimensionality

Examples

In character recognition

  • Histograms counting the number of black pixels along horizontal and vertical directions
  • Number of internal holes
  • Stroke detection

In speech recognition

  • Noise ratios
  • Length of sounds
  • Relative power
  • Filter matches

In spam detection algorithms

  • Presence or absence of certain email headers
  • Email structure
  • Language
  • Frequency of specific terms
  • Grammatical correctness of the text

In computer vision

  • Large number of possible features, such as edges and objects


37 of 108

Feature in image processing

Examples

  • Boundary descriptors
    • Chain code
    • Shape number
    • Polygonal approximation

  • Skeleton based
    • Medial axis transform

  • Histogram based
    • Standard deviations
    • Skew
    • Energy

  • Texture features


38 of 108

Feature in image processing

Histogram of Oriented Gradients (HOG)

  • It is used in computer vision and image processing for the purpose of object detection
  • The technique counts occurrences of gradient orientations in localized portions of an image, focusing on the structure or the shape of an object
  • It uses magnitude as well as the angle of the gradient to compute the features

Steps

  1. Resize the image to a specific size (128x64 pixels)
  2. Considering a block of 3x3 pixels, the gradients Gx and Gy are calculated for each pixel
  3. The magnitude and angle of each pixel are calculated as magnitude = √(Gx² + Gy²) and angle = arctan(Gy / Gx)

  • The resulting image is divided into blocks of 8x8 pixels
  • Magnitude and angle values of the blocks are collected in 9 bins (histogram)
  • Histograms are further processed and rearranged to constitute a 36-feature vector for each block
  • Feature vectors are normalized to reduce the effects of contrast changes
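The sketch below runs the pipeline just described using scikit-image's hog function (an assumed implementation choice, not the course code); the 128x64 input is random and merely stands in for a real detection window.

```python
import numpy as np
from skimage.feature import hog

# Random 128x64 "image" as a placeholder for a real detection window.
image = np.random.rand(128, 64)

features = hog(
    image,
    orientations=9,            # 9 histogram bins per cell
    pixels_per_cell=(8, 8),    # 8x8-pixel cells
    cells_per_block=(2, 2),    # 2x2 cells -> 4 x 9 = 36 values per block
    block_norm='L2-Hys',       # per-block normalization against contrast changes
)
print(features.shape)          # (3780,) for a 128x64 window with these settings
```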



40 of 108

Feature in image processing

Scale-Invariant Feature Transform (SIFT)

  • Computer vision algorithm to detect and describe local features in images
  • Invariant against:
    • Illumination
    • Scale
    • Rotation
    • Perspective


41 of 108

Feature in image processing

Scale-Invariant Feature Transform (SIFT)

Feature vector:

  • Location
  • Scale level where the gradient is maximum (or minimum)
  • Strength and orientation
  • Gradient sign

Steps

  1. The image is progressively blurred out
  2. At each step, the image size is halved (the set of blurred images at the same size is called an octave)
  3. Differences of Gaussian-blurred images (DoG) are computed at each scale
  4. DoG images are explored to find local maxima/minima (key points)
  5. Bad key points (low contrast) are deleted
  6. For each key point, gradient magnitudes and orientations are collected
  7. Feature vectors are made of these entries
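For illustration only, OpenCV ships a SIFT implementation of these steps; the snippet below (assuming opencv-python ≥ 4.4 and a synthetic grayscale image) extracts keypoints and their 128-dimensional descriptors.

```python
import cv2
import numpy as np

# Synthetic grayscale image standing in for a real photograph.
image = (np.random.rand(256, 256) * 255).astype(np.uint8)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

print("keypoints found:", len(keypoints))
if descriptors is not None:
    print("descriptor shape:", descriptors.shape)   # (n_keypoints, 128)
```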


42 of 108

Feature in image processing

Applications

Object detection

Image stitching


43 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


44 of 108

Unsupervised learning

  • Supervised learning uses labeled data pairs (x, y) to learn a function f : X→Y
    • But, what if we don’t have labels?

  • No labels = unsupervised learning
  • Only some points are labeled = semi-supervised learning
    • Labels may be expensive to obtain, so we only get a few

  • Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery


45 of 108

Clustering data


46 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks



50 of 108

K-means

https://mubaris.com/posts/kmeans-clustering/


51 of 108

K-means

  • Very sensitive to the initial points
    • Do many runs of K-Means, each with different initial centroids
    • Seed the centroids using a better method than randomly choosing the centroids
      • e.g., Farthest-first sampling

  • Must manually choose k
    • Learn the optimal k for the clustering
      • Note that this requires a performance measure

  • Easy implementation
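One possible way to address these points with scikit-learn (an assumption, not the course code): run K-Means with several initializations and compare a clustering-quality score over different values of k. The blob data and the silhouette score used below are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# n_init repeats the algorithm with different centroids; k-means++ is a
# smarter-than-random seeding, similar in spirit to farthest-first sampling.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10,
                    random_state=0).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")
```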


52 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


53 of 108

Hierarchical clustering

An algorithm that groups similar objects into groups (clusters) to create a set of exclusive clusters, whose objects are broadly similar to each other

It starts by treating each observation as a separate cluster

Then, it repeatedly executes the following steps:

  1. Identify the two clusters that are closest together
  2. Merge the two most similar clusters

This iterative process continues until all the clusters are merged
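A minimal sketch of this agglomerative procedure using SciPy (an assumed library; the data are two synthetic blobs): linkage records the merge history, i.e. the dendrogram, and fcluster 'cuts' it into a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),     # two synthetic groups of points
               rng.normal(3, 0.5, (20, 2))])

# Agglomerative clustering: repeatedly merge the two closest clusters.
Z = linkage(X, method='ward', metric='euclidean')   # merge history (the dendrogram)
# scipy.cluster.hierarchy.dendrogram(Z) would plot the tree.

# 'Cut' the dendrogram to obtain a chosen number of clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```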


54 of 108

Hierarchical clustering

New structure of information: the dendrogram

  • Pairs of the closest clusters are merged (agglomerative approach)
  • Euclidean distance between feature vectors

[Dendrogram figure: possible cuts at different levels yield 1, 2, 3, or 5 classes]



56 of 108

Hierarchical clustering

  • Hierarchical clustering typically works by sequentially merging similar clusters
  • This is known as agglomerative hierarchical clustering

  • In theory, it can also be done by initially grouping all the observations into one cluster, and then successively splitting these clusters
  • This is known as divisive hierarchical clustering
  • Divisive clustering is rarely done in practice


57 of 108

Hierarchical clustering

  • No assumptions about the number of clusters
    • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level

  • Hierarchical clustering may correspond to meaningful taxonomies
    • Examples in biological sciences (e.g., phylogeny reconstruction, etc), web (e.g., product catalogs), etc.


58 of 108

Implementation of Unsupervised Clustering

Lecture_5_Unsupervised_Clustering.m


59 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


60 of 108

Supervised learning

  • Supervised learning uses labeled data pairs (x, y) to learn a function f : X→Y
    • We have labels, so we can learn from experience

  • Learning can be viewed as using direct or indirect experience to approximate a chosen target function
  • Function approximation can be viewed as a search through a space of hypotheses (representations of functions) for one that best fits a set of training data
  • Different learning methods assume different hypothesis spaces (representation languages) and/or employ different search techniques


61 of 108

Example: Detection of mail spam

  • The aim is the classification of all emails the user does not want to receive and has not asked to receive
  • Input:
    • % of spam emails that were filtered
    • % of non-spam emails that were incorrectly filtered-out
    • Mail content
  • Labeled database: all emails that were labeled by users


62 of 108

Example: Detection of mail spam

  • Working principle:
    • During training, the learner tries to understand how to filter out spam
    • Minimization of the error on known examples

Known examples depending on the number of recipients


63 of 108

Example: Detection of mail spam

  • Linear separation

High error

Best separation

The model will use this learned separation to predict spam in new emails


64 of 108

Example: Detection of mail spam

  • Increasing the number of features

How can we separate the dataspace to have a new improved model?


65 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


66 of 108

k-Nearest Neighbor (kNN)

  • Basic idea
    • It classifies new cases based on a similarity measure
      • A new instance is classified by a majority vote of its k nearest neighbors
    • Several similarity measures (“closest”) can be defined (e.g., Euclidean distance function)

  • Advantages
    • Easy to implement
    • Easy to understand
    • Can be applied to data of any distribution
    • Good classification if datasets are large

  • Disadvantages
    • Need to define the index k
    • Time-consuming in prediction (need for distance computation)



68 of 108

k-Nearest Neighbor (kNN)

  • How to choose k?
    • If k is too small, it is sensitive to noise points
    • Larger k works well
    • Too large k may include majority points from other classes
    • In principle k < sqrt(n) (n is the number of examples)
    • A trial-and-error approach can tune the k parameter
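A simple trial-and-error sketch of this tuning (synthetic data; scikit-learn is an assumed implementation choice): score a few candidate values of k by cross-validation and keep the best one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Trial-and-error tuning of k, scored by 5-fold cross-validation.
for k in (1, 3, 5, 7, 11, 15):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k:2d}: mean CV accuracy = {acc:.3f}")
```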


69 of 108

k-Nearest Neighbor (kNN)

Improvements

  • Feature weighting
    • Scale each input feature by its importance
    • The distance is weighted (e.g., weighted Euclidean distance D(a,b) = √(∑i wi (ai − bi)²))
    • Need for prior knowledge about feature significance
    • Weights can be learned using cross-validation

  • Feature normalization
    • Distance between neighbors could be dominated by some attributes with relatively large numbers
    • Amplitude normalization brings all features onto a comparable scale (e.g., a bounded dynamic range between 0 and 1)


70 of 108

k-Nearest Neighbor (kNN)

Possible distance functions

  • Given:
    • an mx-by-n data matrix X, treated as mx row vectors x1, x2, ..., xmx (each 1-by-n)
    • an my-by-n data matrix Y, treated as my row vectors y1, y2, ..., ymy (each 1-by-n)
    • the distance between xs and yt can be:

  • Euclidean distance

d² = (xs − yt)(xs − yt)′

  • Standardized Euclidean distance (V is an n-by-n diagonal matrix of scaling factors S(j)², i.e. weights)

d² = (xs − yt) V⁻¹ (xs − yt)′

  • Mahalanobis distance (C is the covariance matrix)

d² = (xs − yt) C⁻¹ (xs − yt)′

  • Manhattan (city block) distance

d = ∑j ∣xsj − ytj∣

  • Minkowski distance

dᵖ = ∑j ∣xsj − ytj∣ᵖ

  • Chebychev distance

d = maxj { ∣xsj − ytj∣ }
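For reference, SciPy's cdist can evaluate all of these metrics between the rows of two matrices; the sketch below uses small random matrices purely for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.random((5, 3))   # mx-by-n
Y = rng.random((4, 3))   # my-by-n

# Pairwise distances between the rows of X and Y for several metrics above.
print(cdist(X, Y, metric='euclidean'))
print(cdist(X, Y, metric='seuclidean'))    # standardized Euclidean
print(cdist(X, Y, metric='mahalanobis'))   # uses the data covariance by default
print(cdist(X, Y, metric='cityblock'))     # Manhattan
print(cdist(X, Y, metric='minkowski', p=3))
print(cdist(X, Y, metric='chebyshev'))
```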


71 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


72 of 108

Support Vector Machine (SVM)

  • Advantages
    • Good generalization capabilities
    • Works well with few training instances
    • Finds the globally best model
    • Efficient algorithms

  • Disadvantages
    • Natively handles only binary classification (labels are yi ∈ {-1, 1})
    • Relatively slow with huge data sets
    • Need to choose the kernel (and tune it)


73 of 108

Support Vector Machine (SVM)

  • Input
    • x ∈ R^(d+1), with x0 = 1, where d is the number of input features
  • Output (known labels)
    • yi ∈ {-1, 1}

  • Objective
    • Find the model parameters ϴ ∈ R^(d+1) so that a prediction is made as follows
      • h(x) = sign(ϴᵀx) = sign(<ϴ,x>) (the sign function realizes the classification)
      • <u,v> is the inner (dot) product: u·v = ∑i uivi
      • The error on the i-th prediction is defined as yi − h(xi)

  • The set of points where <ϴ,x> = 0 is known as the hyperplane (the separation of the input feature space that defines the final binary classification)


74 of 108

Support Vector Machine (SVM)

Example


75 of 108

Support Vector Machine (SVM)

Separators


76 of 108

Support Vector Machine (SVM)

What if we have noise?

Increasing noise 🡪 One separator remains


77 of 108

Support Vector Machine (SVM)

Basic idea

  • Use fat separators
  • Increase the margin until it is maximized
    • Fewer possible models


78 of 108

Support Vector Machine (SVM)

Practical example (features x1 and x2)

ϴ = [1,0]

Hyperplane: x1 = 0 (i.e., the x2 axis)

Margin = 2

The sign function applied to the feature x1 gives the best prediction, since the margin is maximum

Note that x2 carries no information for this problem


79 of 108

Support Vector Machine (SVM)

What if the input data are not linearly separable?

The basic idea is to use a new function (kernel) that can map the input data in a new feature space where features are linearly separable

Non-linearly separable

Linearly separable

Φ = (·)2


80 of 108

Support Vector Machine (SVM)

Kernel trick

  • Instead of using the direct formulation, find kernel K such that

K(xi,xj) = <Φ(xi), Φ(xj)>

  • Computing K(xi,xj) should be more efficient than computing Φ(xi) and Φ(xj) separately
  • Use K(xi,xj) in SVM algorithm rather than <xi,xj>

Possible kernels

  • Polynomial kernel: K(xi,xj) = <xi, xj>^d (as before)
  • Gaussian kernel: K(xi,xj) = exp( −‖xi − xj‖² / (2σ²) )
  • Sigmoid kernel: K(xi,xj) = tanh(α xiᵀxj + c)
  • Cosine similarity, chi-squared, tree, graph, wavelet, string, …
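An illustrative comparison of some of these kernels on data that is not linearly separable (scikit-learn's SVC is an assumed implementation; 'rbf' corresponds to the Gaussian kernel, and the concentric-circles dataset is synthetic):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: not separable by a linear boundary.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.08, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):   # rbf = Gaussian kernel
    clf = SVC(kernel=kernel, degree=3, gamma='scale').fit(X_tr, y_tr)
    print(f"{kernel:8s} kernel: test accuracy = {clf.score(X_te, y_te):.3f}")
```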


81 of 108

Support Vector Machine (SVM)

Multi-class classification with SVMs

  • Many SVM packages already have multi-class classification built-in
  • Alternatively, use a one-vs-all approach
    • Define K parameters ϴ(k), k = 1, …, K
    • Predict the k-th class by looking at the largest (ϴ(k))ᵀx


82 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


83 of 108

Decision trees

  • Basic idea
    • A classification scheme that generates a tree and a set of rules from a given dataset
    • Inputs are processed by following the rules down to a final decision (classification)
    • Each branch node represents a choice between several alternatives, and each leaf node represents a decision

  • Advantages
    • Decision trees can generate understandable rules
    • Can handle both numerical and categorical attributes
    • Provide a clear indication of which fields are most important for prediction or classification

  • Disadvantages
    • The process of growing a decision tree is computationally expensive
      • At each node, all candidate splitting-fields must be examined to find the best split
    • Some decision trees can only deal with binary-valued target classes


84 of 108

Decision trees

  • A decision tree is a tree with the following properties:
    • An inner node represents an attribute
    • An edge represents a test on the attribute of the parent node
    • A leaf represents one of the classes
  • Construction of a decision tree
    • Based on the training data
    • Top-Down strategy


85 of 108

Decision trees

Example: When can they play a tennis game?

  • Columns are the features (predictors)
  • Rows are the instances {xi,yi}
  • y ∈ {can play, can’t play}
  • In a decision tree, each leaf node represents a rule.


86 of 108

Decision trees

Example

  • We have five leaf nodes
  • In a decision tree, each leaf node represents a rule

  • Rules:
  • If it is sunny and the humidity is NORMAL then play
  • If it is sunny and the humidity is HIGH, then do not play.
  • If it is overcast, then play.
  • If it is rainy and WEAKLY windy, then play.
  • If it is rainy and STRONGLY windy, then don't play

Categorical attributes (can also be numerical)
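To make the example concrete, the sketch below trains a tree on a small hypothetical play-tennis table (the rows are invented so that they are consistent with the rules above, not taken from the course data); categorical attributes are one-hot encoded, and criterion='entropy' anticipates the information-gain criterion discussed later.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical table in the spirit of the example above (values invented).
data = pd.DataFrame({
    'outlook':  ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'overcast', 'sunny', 'rainy'],
    'humidity': ['high',  'normal', 'high',    'high',  'normal', 'normal',  'high',  'normal'],
    'wind':     ['weak',  'weak',   'strong',  'strong', 'weak',  'weak',    'strong', 'weak'],
    'play':     ['no',    'yes',    'yes',     'no',     'yes',   'yes',     'no',     'yes'],
})

X = pd.get_dummies(data[['outlook', 'humidity', 'wind']])   # one-hot encode categorical attributes
y = data['play']

tree = DecisionTreeClassifier(criterion='entropy').fit(X, y)  # entropy-based splits
print(export_text(tree, feature_names=list(X.columns)))       # the learned rules
```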


87 of 108

Decision trees

Example with numerical attributes


88 of 108

Decision trees

Decision boundary

  • Decision trees divide the feature space into axis-parallel (hyper-)rectangles
  • Each rectangular region is labeled with one label
    • or a probability distribution over labels


89 of 108

Decision trees

How to select the attributes (and their order)

  • Principle stated by William of Ockham (1285-1347)
    • “non sunt multiplicanda entia praeter necessitatem”
    • Entities are not to be multiplied beyond necessity
    • AKA Occam’s Razor, Law of Economy, or Law of Parsimony

  • The simplest consistent explanation is the best

  • The smallest decision tree that correctly classifies all the training examples is the best
    • Too hard to find in closed form
    • Try to construct something pretty small


90 of 108

Decision trees

How to select the attributes (and their order)

  • Random: Select any attribute at random
  • Least-Values: Choose the attribute with the smallest number of possible values
  • Most-Values: Choose the attribute with the largest number of possible values
  • Max-Gain: Choose the attribute that has the largest expected information gain
    • attribute that results in the smallest expected size of subtrees rooted in its children

Which test is more informative?


91 of 108

Decision trees

Overfitting

  • If a decision tree is fully grown, it may lose some generalization capability
  • This is overfitting

  • Overfitting Due to Presence of Noise: Mislabeled instances may contradict the class labels of other similar records
  • Overfitting Due to Lack of Representative Instances: Lack of representative instances in the training data can prevent refinement of the learning algorithm

Solutions:

  • Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data
  • Allow the tree to overfit the data, and then post-prune the tree
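A hedged sketch of both remedies with scikit-learn (assumed library; the noisy data are synthetic): limiting depth and leaf size stops growth early, while cost-complexity pruning (ccp_alpha) plays the role of pruning an overgrown tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y injects label noise, the situation where overfitting hurts most.
X, y = make_classification(n_samples=600, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full   = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                  # fully grown
early  = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                                random_state=0).fit(X_tr, y_tr)                  # stop growing earlier
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)  # cost-complexity pruning

for name, model in [('fully grown', full), ('early stop', early), ('pruned', pruned)]:
    print(f"{name:12s} train={model.score(X_tr, y_tr):.3f}  test={model.score(X_te, y_te):.3f}")
```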


92 of 108

Decision trees

Applications

  • As accurate as human experts:
    • A study for diagnosing breast cancer:
      • Humans correctly classify the examples 65% of the time
      • The decision tree classified 72% correctly
    • British Petroleum designed a decision tree for gas-oil separation for offshore oil platforms
    • Cessna designed an airplane flight controller

  • Microsoft's first release of Kinect Body Tracking used decision forests (ensembles of decision trees) to skeletonize humans


93 of 108

Outline

  • Introduction to machine learning
  • Feature extraction
  • Unsupervised learning
    • K-means
    • Hierarchical clustering
  • Supervised learning
    • K Nearest Neighbors
    • Support vector machines
    • Decision trees
    • Neural networks


94 of 108

Neural networks

Basic idea

  • An Artificial Neural Network (ANN) is an efficient computing system inspired by the analogy with biological neural networks
  • Every neuron is connected with another neuron through a connection link
  • Each connection link is associated with a weight that has information about the input signal
  • Each neuron has an internal state, which is called an activation signal
  • Output signals, which are produced after combining the input signals and activation rules, may be sent to other units


95 of 108

Neural networks

Biological neuron

  • A neuron is a cell that carries electrical impulses
  • Neurons are the basic units of the nervous system and its most important part is the brain
  • Every neuron is made of a cell body (also called a soma), dendrites and an axon
    • Dendrites and axons are nerve fibers
  • There are about 86 billion neurons in the human brain, which comprises roughly 10% of all brain cells
  • Neurons are connected to one another through tiny gaps called synapses
    • These gaps can be chemical or electrical synapses and pass the signal from one neuron to the next



97 of 108

Neural networks

ANN vs. BNN

Criteria | BNN | ANN
Processing | Massively parallel | Massively parallel, but less than BNN
Learning | Can tolerate ambiguity | Very precise, structured, and formatted data is required to tolerate ambiguity
Fault tolerance | Performance degrades with even partial damage | Robust resiliency (can be fault tolerant)
Storage capacity | Stores the information in the synapse | Stores the information in memory locations



99 of 108

Neural networks

ANN vs. BNN

  • The dendrites of a biological neuron are analogous to the weighted inputs (based on their synaptic interconnections) of an artificial neural network
  • The cell body is analogous to the artificial neuron unit, which also comprises the summation and threshold units
  • The axon carries the output and is analogous to the output unit of an artificial neural network

  • ANNs are modeled using the working of basic biological neurons



101 of 108

Neural networks

Model of an artificial neuron

  • Artificial neural networks can be viewed as a weighted directed graph in which artificial neurons are nodes and the directed weighted edges connect the input neurons to the output ones

  • The Artificial Neural Network receives input from the external world in the form of vectors

  • Each input is multiplied by its corresponding weights
    • Weights are the information used by the neural network to solve a problem (represents the strength of the interconnection between neurons)

  • The weighted inputs are all summed up inside the computing unit (artificial neuron)

  • A bias is added to shift the system response


102 of 108

Neural networks

Model of an artificial neuron

  • The sum corresponds to any numerical value ranging from 0 to infinity

  • In order to limit the response to a desired range, a threshold is set up, i.e. the result is passed through an activation function
    • The activation function is set to get the desired output:
      • Binary — The output has only two values either 0 or 1 depending on a threshold. For instance, if the net weighted input is greater than 1, the output is assumed to be 1 otherwise 0
      • Sigmoidal Hyperbolic — This function has an S-shaped curve. Here the tan hyperbolic function is used to approximate output from the net input

f(x) = 1 / (1 + exp(−𝝈x)) (𝝈 is the steepness parameter)
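As a toy illustration of the neuron model described above (weights, bias, and inputs are invented numbers, not data from the course):

```python
import numpy as np

def neuron(x, w, b, sigma=1.0):
    """Single artificial neuron: weighted sum plus bias, passed through a sigmoid."""
    z = np.dot(w, x) + b                      # combine weighted inputs and bias
    return 1.0 / (1.0 + np.exp(-sigma * z))   # sigmoid activation, steepness sigma

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8,  0.1, 0.4])   # connection weights
b = -0.5                         # bias

print(neuron(x, w, b))                         # continuous activation in (0, 1)
print(1.0 if np.dot(w, x) + b > 1 else 0.0)    # binary (threshold) activation from the slide
```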


103 of 108

Neural networks

Network architecture

  • Input layer: Units (artificial neurons) that receive input from the world

  • Output layer: Units that produce the final prediction (the number of outputs is equal to the number of target classes)

  • Hidden layer: Units in between. They transform the input into something that output units can use in some way

  • Most neural networks are fully connected (each hidden neuron is connected to every neuron of the input and output layers) 🡪 the weights can be written in matrix form
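A minimal fully connected network along these lines, sketched with scikit-learn's MLPClassifier (an assumed library choice; one hidden layer, logistic activation, synthetic two-class data):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)   # a non-linear problem
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fully connected network: 2 inputs -> one hidden layer of 16 units -> output.
net = MLPClassifier(hidden_layer_sizes=(16,), activation='logistic',
                    max_iter=2000, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", net.score(X_te, y_te))
```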


104 of 108

Neural networks

  • Advantages
    • Can also solve non-linear problems
    • Implemented in many applications
      • Function approximation, or regression analysis, including time series prediction, fitness approximation and modeling
      • Classification, including pattern and sequence recognition, novelty detection and sequential decision making
      • Data processing, including filtering, clustering, blind source separation and compression.
      • Robotics (including directing manipulators) and control

  • Disadvantages
    • The number of hidden units has to be tuned to make the network capable but avoiding overfitting
    • There is no control over what the network is doing during the classification


105 of 108

Implementation of supervised learning for defect classification

Lecture_5_Physical_Feature_Extraction.m

Lecture_5_Supervised_Classification.m


106 of 108

  • From machine learning to deep learning
  • Convolutional Neural Networks
  • Test on actual data

Next steps


107 of 108

Roberto Marani

Researcher

National Research Council of Italy (CNR)

Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing (STIIMA)

via Amendola 122/D-O, 70126 Bari, Italy

+39 080 592 94 58

roberto.marani@stiima.cnr.it

robertomarani.com

cnr.it/people/roberto.marani

stiima.cnr.it/ricercatori/roberto-marani/


108 of 108

Credits

  • Introduction to Machine Learning, Eric Eaton
  • Introduction to Machine Learning, Lior Rokach
  • Machine learning, Rajat Sharma
  • Implementation of SIFT in CUDA, Nitin Ramchandani
  • K Nearest Neighbors, Tilani Gunawardena
  • Machine Learning with DECISION TREES, Ramandeep Kaur et al.
  • Decision Tree, R. Akerkar
  • Artificial Neural Network, Arafat Shaikh
