1 of 48

ME5990

Seminar: Historical Terms in Machine Learning (Pattern Recognition)

2 of 48

Outline

  • Supervised Learning
    • K-nearest neighbors
    • Principal Component Analysis
    • Perceptron
    • Support Vector Machine
    • Ensemble Methods

3 of 48

K-nearest neighbors

  • Non-parametric classification
  • First developed in 1951 by Evelyn Fix and Joseph Hodges 
  • Can my neighbor vote?

https://brilliantmaps.com/2020-county-election-map/

4 of 48

A Problem to Start With

  • Suppose we are given a dataset with 2 classes:

  • Can we propose a voting mechanism to determine which class the green point belongs to?

 

 

[Figure: two-class scatter plot with an unlabeled green point at (0,1)]

5 of 48

K: number of neighbors

  • Measure the distance from the green point to a labeled point, e.g. the Euclidean distance between (0,1) and (-1,2): d = √((0-(-1))² + (1-2)²) = √2

[Figure: distance d between the green point (0,1) and the neighbor (-1,2)]

6 of 48

Some other distances
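The slide only names the alternatives; as a small illustration (mine, not from the slides), here is how a few common distance metrics between the two points labeled earlier, (0,1) and (-1,2), could be computed in Python:

    import numpy as np

    # Illustrative only: common distance metrics between the two points
    # labeled on the previous slide, (0, 1) and (-1, 2).
    p = np.array([0.0, 1.0])
    q = np.array([-1.0, 2.0])

    euclidean = np.sqrt(np.sum((p - q) ** 2))           # L2 distance: sqrt(2) ~ 1.414
    manhattan = np.sum(np.abs(p - q))                    # L1 (city-block) distance: 2
    chebyshev = np.max(np.abs(p - q))                    # L-infinity distance: 1
    minkowski3 = np.sum(np.abs(p - q) ** 3) ** (1 / 3)   # Minkowski distance with p = 3

    print(euclidean, manhattan, chebyshev, minkowski3)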

7 of 48

K: number of neighbors

[Figure: distances from the green point (0,1) to the labeled points, including (-1,2)]

8 of 48

K: number of neighbors

  • K neighbors: the K closest points

  • The green point belongs to the red class because, among its 3 nearest neighbors:
    • 3 out of 3 are red
    • 0 out of 3 are blue

 

 

 

[Figure: K = 3 nearest neighbors of the green point (0,1); all three are red]

9 of 48

K: number of neighbors

  • K neighbors: the K closest points

  • The green point belongs to the blue class because, among its 3 nearest neighbors:
    • 1 out of 3 is red
    • 2 out of 3 are blue

 

 

 

[Figure: nearest neighbors of the green point (0,1); two blue, one red]

10 of 48

The value of K matters

  • For different values of K, the classification result may differ

11 of 48

Decision Boundary

  • The decision boundary is the surface separating the regions of feature space assigned to each class

https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/

12 of 48

K-nearest Neighbor

  • If you are near red, you are red
  • If you are near blue, you are blue

  • The people you surround yourself with matter
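A minimal sketch of the idea in Python (the toy points, labels, and helper name are hypothetical, not from the slides):

    import numpy as np

    def knn_predict(train_X, train_y, query, k=3):
        """Classify `query` by majority vote among its k nearest training points."""
        dists = np.linalg.norm(train_X - query, axis=1)        # Euclidean distance to every training point
        nearest = np.argsort(dists)[:k]                        # indices of the k closest points
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        return labels[np.argmax(counts)]                       # the majority label wins

    # Toy data: three "red" and three "blue" points
    train_X = np.array([[-1, 2], [-2, 1], [0, 3], [2, -1], [3, 0], [2, 2]])
    train_y = np.array(["red", "red", "red", "blue", "blue", "blue"])

    print(knn_predict(train_X, train_y, np.array([0, 1]), k=1))   # only the single nearest point votes
    print(knn_predict(train_X, train_y, np.array([0, 1]), k=5))   # five neighbors vote

Changing k changes who gets to vote, which is exactly why the value of K matters.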

13 of 48

Outline

  • Supervised Learning
    • K-nearest neighbors
    • Naïve Bayes Classifier
    • Principal Component Analysis
    • Perceptron
    • Support Vector Machine
    • Ensemble Methods

14 of 48

Outline

  • Supervised Learning
    • K-nearest neighbors
    • Naïve Bayes Classifier
    • Principal Component Analysis
    • Perceptron
    • Support Vector Machine
    • Ensemble Methods

15 of 48

Principal Component Analysis

  • Which feature is more important?

  • The direction along which the data "spread the most" is the most important
    • Looks like it is the “lucky color code”

16 of 48

PCA

  • The original features can be projected onto a new set of features along the direction of widest spread

17 of 48

PCA technique

  • What is the maximum principal stress?

18 of 48

PCA

  • You were probably taught to solve this problem with Mohr's circle

19 of 48

PCA

  •  

20 of 48

PCA

  •  

21 of 48

PCA cookbook (via eigenvector decoupling)

  • Step 1: Center the data (subtract the mean of each feature)
  • Step 2: Form the covariance matrix of the centered data
  • Step 3: Find its eigenvalues and eigenvectors; the eigenvector with the largest eigenvalue is the first principal direction
  • Step 4: Project the data onto the leading eigenvectors

22 of 48

Demonstration of PCA

  • Data points: (-1,-1), (1,1), (3,3)

23 of 48

Demonstration of PCA

  • Center the data: the mean of (-1,-1), (1,1), (3,3) is (1,1); the centered points are (-2,-2), (0,0), (2,2)

24 of 48

Demonstration of PCA

  • Sample covariance of the centered points: [[4, 4], [4, 4]]; its eigenvalues are 8 and 0, with principal direction (1,1)/√2

25 of 48

Demonstration of PCA

  • Projected onto the principal direction, the points (-1,-1), (1,1), (3,3) map to -2√2, 0, 2√2; the extra point (0, 0.2) maps to about -1.27

26 of 48

PCA

  • Too many features and you don't know which one is most important? Do principal component analysis!
  • PCA uses the same concept as the maximum principal stress problem
  • Don't use Mohr's circle anymore; use eigenvalues and eigenvectors!

(https://amesweb.info/StressStrainTransformations/PlaneStressPrincipalStressCal/PrincipalStressCalculator.aspx)

(https://www.emathhelp.net/en/calculators/linear-algebra/eigenvalue-and-eigenvector-calculator/?i=%5B%5B100%2C40%5D%2C%5B40%2C70%5D%5D)
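A minimal NumPy sketch (my illustration, not slide content) of the eigenvalue route, reusing the demonstration points (-1,-1), (1,1), (3,3) and the extra point (0, 0.2) from the earlier slides:

    import numpy as np

    # Demonstration points from the PCA slides
    X = np.array([[-1.0, -1.0],
                  [ 1.0,  1.0],
                  [ 3.0,  3.0]])

    X_centered = X - X.mean(axis=0)           # subtract the mean (1, 1)
    cov = np.cov(X_centered, rowvar=False)    # sample covariance matrix: [[4, 4], [4, 4]]

    eigvals, eigvecs = np.linalg.eigh(cov)    # eigen-decomposition of a symmetric matrix
    order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    print(eigvals)          # approximately [8, 0]: all of the variance lies along one direction
    print(eigvecs[:, 0])    # first principal direction, proportional to (1, 1)/sqrt(2) up to sign

    # Project the data and the extra point (0, 0.2) onto the first principal component
    scores = X_centered @ eigvecs[:, 0]
    extra = (np.array([0.0, 0.2]) - X.mean(axis=0)) @ eigvecs[:, 0]
    print(scores, extra)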

27 of 48

Outline

  • Supervised Learning
    • K-nearest neighbors
    • Naïve Bayes Classifier
    • Principal Component Analysis
    • Perceptron
    • Support Vector Machine
    • Ensemble Methods

28 of 48

Perceptron

  • One of the oldest classifiers
    • The underlying neuron model dates to 1943, before modern electronic computers were built

Mark I Perceptron machine, image from Wikipedia

29 of 48

Decision boundary for LDF

  • A linear discriminant function (LDF): g(x) = wᵀx + w0; the decision boundary is the set of points where g(x) = 0

Image from: https://www.csd.uwo.ca/~oveksler/Courses/CS434a_541a/Lecture9.pdf

30 of 48

LDF for 2 classes

  • Example discriminant: g(x) = x1 - x2; decide class 1 if g(x) > 0 and class 2 if g(x) < 0

[Figure: the line g(x) = 0 with points (0,1) and (1,0) on opposite sides]

Point    g(x)
(0,1)    -1 < 0
(1,0)     1 > 0
(2,1)     ?

31 of 48

Hyperplane

  • The decision boundary g(x) = wᵀx + w0 = 0 is a hyperplane; the weight vector w is normal to it

[Figure: hyperplane separating the points (0,1) and (1,0)]

32 of 48

LDF: augmented feature vector

  • Augment the features with a constant 1: y = (1, x1, ..., xd) and a = (w0, w1, ..., wd), so that g(x) = aᵀy

33 of 48

LDF: Augmented feature vector

  •  

34 of 48

Demonstration of “augmented feature vector”

  •  

35 of 48

Summary

  • Traditionally, the perceptron is a linear discriminant for binary classification; the labels are encoded as (+1, -1).
  • The perceptron now goes by a new name: the fully connected layer
  • Multi-layer perceptron: a feedforward network built from multiple fully connected layers
    • Proposed as an extension of the perceptron in the 1960s
    • A practical training method was only developed in the 1970s
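A minimal sketch of the classic perceptron learning rule (the toy data and function name are hypothetical), using an augmented feature vector and (+1, -1) labels as described above:

    import numpy as np

    def train_perceptron(X, y, epochs=100, lr=1.0):
        """Classic perceptron rule: update the weights whenever a sample is misclassified."""
        X_aug = np.hstack([np.ones((len(X), 1)), X])   # augmented feature vector: prepend a constant 1
        a = np.zeros(X_aug.shape[1])                   # augmented weight vector [w0, w1, ..., wd]
        for _ in range(epochs):
            errors = 0
            for xi, yi in zip(X_aug, y):
                if yi * np.dot(a, xi) <= 0:            # misclassified (labels are +1 / -1)
                    a += lr * yi * xi                  # nudge the boundary toward the sample
                    errors += 1
            if errors == 0:                            # all training points classified correctly
                break
        return a

    # Toy linearly separable data, labels encoded as +1 / -1
    X = np.array([[0.0, 1.0], [0.5, 2.0], [1.0, 0.0], [2.0, 0.5]])
    y = np.array([-1, -1, +1, +1])

    a = train_perceptron(X, y)
    g = lambda x: np.dot(a, np.r_[1.0, x])             # g(x) = a . [1, x]; the sign gives the class
    print(a, g(np.array([2.0, 1.0])))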

36 of 48

Outline

  • Supervised Learning
    • K-nearest neighbors
    • Naïve Bayes Classifier
    • Principal Component Analysis
    • Perceptron
    • Support Vector Machine
    • Ensemble Methods

37 of 48

Support Vector Machine

  • Another powerful linear classifier
  • Can be used to classify multiple classes


38 of 48

Support vector machine

  • We want our separating line to lie at the "center" of the two classes
    • What is the "center"?
  • We define the margin by expanding the line until it hits a point
  • We wish the margin to be as large as possible

 

 

 

 

39 of 48

Support vector machine

  • What is a support vector?

  • What does "we want the margin as large as possible" mean?
    • We want the distance from the support vectors to the boundary to be as large as possible!

 

 

[Figure: the support vectors are the labeled points closest to the decision boundary]

40 of 48

Support vector machine: some maths

  • For a boundary wᵀx + w0 = 0 with labels yi ∈ {+1, -1}, require yi(wᵀxi + w0) ≥ 1 for every training point
  • The margin is then m = 2/||w||, so maximizing the margin is the same as minimizing ||w||²

[Figure: margin of width m between the two supporting hyperplanes]

41 of 48

One-vs-one, one-vs-rest

  • One-vs-one: train a binary SVM for every pair of classes and predict by voting

42 of 48

One-vs-one, one-vs-rest

  • One-vs-rest: train one binary SVM per class against all remaining classes and predict with the most confident one

https://www.baeldung.com/cs/svm-multiclass-classification

43 of 48

Support Vector Machine

  • A really powerful linear discriminant method
  • Reported to perform well on handwriting recognition, on par with LeNet
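A brief sketch using scikit-learn (the library choice and toy data are mine, not the slides'): it fits a linear SVM, reports the support vectors and margin width, and handles three classes with a one-vs-rest wrapper:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.multiclass import OneVsRestClassifier

    # Toy two-class data
    X = np.array([[0.0, 1.0], [0.5, 2.0], [1.0, 2.5],
                  [1.0, 0.0], [2.0, 0.5], [2.5, 1.0]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e3)        # a large C approximates a hard margin
    clf.fit(X, y)

    w = clf.coef_[0]                         # normal vector of the separating hyperplane
    print(clf.support_vectors_)              # the points closest to the boundary
    print(2.0 / np.linalg.norm(w))           # margin width m = 2 / ||w||

    # Multi-class: SVC is one-vs-one internally; an explicit one-vs-rest wrapper also works
    X3 = np.array([[0, 0], [0, 1], [3, 0], [3, 1], [0, 4], [1, 4]], dtype=float)
    y3 = np.array([0, 0, 1, 1, 2, 2])
    ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X3, y3)
    print(ovr.predict([[3.0, 0.5], [0.5, 3.5]]))   # queries near the class-1 and class-2 clusters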

44 of 48

Outline

  • Supervised Learning
    • K-nearest neighbors
    • Principal Component Analysis
    • Perceptron
    • Support Vector Machine
    • Ensemble Methods

45 of 48

Ensemble methods

  • Bagging (Breiman 1994,...): Bootstrap aggregating
  • Random forests (Breiman 2001,...)
  • Boosting (Freund and Schapire 1995, Friedman et al. 1998,...)

Predict the class label of unseen data by aggregating the predictions of a set of classifiers learned from the training data

46 of 48

General Idea

47 of 48

Ensemble methods

All ensemble methods consist of three steps (sketched in code below):

  • Step 1: Generate multiple sub-datasets via a re-sampling technique
    • Bootstrapping (sampling with replacement)
    • Pasting (sampling without replacement)
  • Step 2: Train a classifier on each sub-dataset
    • Use the same type of model for every sub-dataset (e.g. SVM, Naïve Bayes)
  • Step 3: Aggregate (combine) all classifiers
    • Voting is the most popular choice for classification problems
    • Averaging the model outputs (especially for regression problems)
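A minimal sketch of these three steps (my own illustration, assuming scikit-learn decision trees as the base model): bootstrap re-sampling, one classifier per sub-dataset, and aggregation by majority vote:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def bagging_fit(X, y, n_models=10):
        """Steps 1-2: draw bootstrap samples and train one classifier per sample."""
        models = []
        n = len(X)
        for _ in range(n_models):
            idx = rng.integers(0, n, size=n)                   # bootstrap: sample n indices with replacement
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        """Step 3: aggregate by majority vote across all classifiers."""
        votes = np.array([m.predict(X) for m in models])       # shape (n_models, n_queries)
        return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])

    # Toy data: two noisy clusters centered at (0, 0) and (3, 3)
    X = rng.normal(size=(40, 2)) + np.repeat([[0.0, 0.0], [3.0, 3.0]], 20, axis=0)
    y = np.repeat([0, 1], 20)

    models = bagging_fit(X, y)
    print(bagging_predict(models, np.array([[0.2, 0.1], [2.8, 3.1]])))   # queries near each cluster center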

48 of 48

Summary on Bagging

  • If the majority vote is correct, the ensemble is guaranteed to pick the correct answer
  • If the majority vote is wrong, the ensemble is guaranteed to pick the wrong answer