1 of 26

Lecture 22: The Optimal Hyperplane and Overfitting


Instructor: Ercan Atam

Institute for Data Science & Artificial Intelligence

Course: DSAI 512-Machine Learning

2 of 26


List of contents for this lecture

  • Why is the fattest hyperplane the best?

  • Nonlinear transforms

3 of 26


Relevant readings for this lecture

  • e-Chapter 8 of Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin, “Learning from Data”, AMLBook, 2012.

  • Chapter 6 (Section 6.5) of Hui Jiang, “Machine Learning Fundamentals: A Concise Introduction”, Cambridge University Press, 2022.

4 of 26


Recap: the optimal hyperplane

The optimal hyperplane

  • The fattest hyperplane that separates the data

  • Tolerates the most measurement error
  • Can we efficiently find the fattest separator?
  • Is a fat separator better than a thin one?

The algorithm

PLA depends on the ordering of the data (e.g. a random ordering), so different orderings give different separating hyperplanes.

Support vectors: the data points that sit on the cushion.

Using only the support vectors, the classifier does not change.

Quadratic Programming: the optimal hyperplane solves

    \min_{b,\,w}\ \tfrac{1}{2}\, w^{\mathsf T} w \quad \text{subject to} \quad y_n\!\left(w^{\mathsf T} x_n + b\right) \ge 1, \quad n = 1, \dots, N.
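A minimal numerical sketch of this program (the toy data are my own; scipy's general-purpose SLSQP solver stands in for a dedicated QP solver such as cvxopt):

# Hard-margin SVM as a quadratic program: minimize (1/2) w^T w
# subject to y_n (w^T x_n + b) >= 1 for every data point.
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (made up for illustration): X is N x d, y in {-1, +1}.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [2.0, 2.0], [3.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])
d = X.shape[1]

def objective(u):
    # u packs [b, w]; the bias b is not penalized.
    w = u[1:]
    return 0.5 * w @ w

def margin_constraints(u):
    # One inequality y_n (w^T x_n + b) - 1 >= 0 per data point.
    b, w = u[0], u[1:]
    return y * (X @ w + b) - 1.0

res = minimize(objective, x0=np.zeros(d + 1), method="SLSQP",
               constraints={"type": "ineq", "fun": margin_constraints})
b, w = res.x[0], res.x[1:]
print("w =", w, ", b =", b, ", margin =", 1.0 / np.linalg.norm(w))

# Support vectors are the points whose constraint is active, i.e. the
# points that sit exactly on the cushion: y_n (w^T x_n + b) = 1.
on_cushion = np.isclose(y * (X @ w + b), 1.0, atol=1e-4)
print("support vectors:\n", X[on_cushion])

On this toy set the active constraints pick out the support vectors; removing any of the remaining points leaves the solution unchanged, as noted above.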

5 of 26

5

Link to regularization

The optimal hyperplane performs a kind of “automatic” regularization:

                 optimal hyperplane    regularization
    minimize     w^T w                 E_in
    subject to   E_in = 0              w^T w ≤ C

(Note: this is another form of regularization, different from the form we used earlier.)
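Written out side by side (a sketch in the book's notation; the budget C is a generic constant of my choosing):

    \text{optimal hyperplane:}\quad \min_{b,\,w}\ \tfrac{1}{2}\, w^{\mathsf T} w \ \ \text{s.t.}\ \ E_{\text{in}}(b, w) = 0
    \qquad
    \text{regularization:}\quad \min_{w}\ E_{\text{in}}(w) \ \ \text{s.t.}\ \ w^{\mathsf T} w \le C

The two problems swap objective and constraint: one fixes the fit and minimizes the weights, the other budgets the weights and minimizes the fit error.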

6 of 26


Evidence that larger margin is better

 

7 of 26


Experimental evidence that larger margin is better (1)

 

Experiment 1

[Figure: histogram of the relative frequency of margins achieved by random separating hyperplanes, compared with the margin of the SVM (optimal hyperplane).]

Observations:

  • A bigger margin is generally better.

  • The biggest margin is not necessarily the best (data points other than the support vectors can play a role in fine-tuning).
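A rough re-creation of this experiment (the data model and sampling scheme below are assumptions of mine, not necessarily those behind the figure):

# Sample many random separating hyperplanes for one separable data set,
# record their margins, and compare with the best margin found.
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs (made up for illustration).
X = np.vstack([rng.normal(-3.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

def margin(w, b):
    # Geometric margin of the hyperplane (w, b); negative if it misclassifies.
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

margins = []
while len(margins) < 1000:
    w, b = rng.normal(size=2), rng.normal()
    m = margin(w, b)
    if m > 0:                        # keep only hyperplanes that separate the data
        margins.append(m)

margins = np.array(margins)
print("median margin of a random separator:", np.median(margins))
print("best sampled margin (approaches the SVM margin):", margins.max())

Plotting a histogram of `margins` reproduces the flavor of the figure above: most random separators have small margins, while the maximum-margin hyperplane sits at the far tail.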

 

8 of 26


Experimental evidence that larger margin is better (2)

Experiment 2

  • Assume a flat target function which is +1 above the x-axis and -1 below it.

  • Create a random data set, and for this data set find a random separator and the optimal hyperplane (the SVM, drawn with dashed lines).

  • Now create many random data sets, and for each data set look at where a random hyperplane and the optimal hyperplane end up.

[Figure: a sample data set with a random separator and the optimal hyperplane, together with a histogram (frequency) of outcomes.]

9 of 26


Experimental evidence that larger margin is better (3)

Experiment 2 (continued)

[Figure: histogram (frequency) of where the random hyperplanes and the optimal hyperplanes fall across many random data sets.]
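A sketch of this experiment under my own modeling assumptions (uniform inputs, the flat target above, rejection-sampled separators, and the fattest of a sample of candidates standing in for the optimal hyperplane):

# For many random data sets from a flat target (+1 above the x-axis,
# -1 below), compare the out-of-sample error of an arbitrary separator
# with that of the fattest separator among a sample of candidates.
import numpy as np

rng = np.random.default_rng(1)

def flat_target(X):
    y = np.sign(X[:, 1])          # +1 above the x-axis, -1 below
    y[y == 0] = 1
    return y

def random_separator(X, y):
    # Rejection-sample a hyperplane that separates this data set.
    while True:
        w, b = rng.normal(size=2), rng.normal()
        if np.all(y * (X @ w + b) > 0):
            return w, b

def e_out(w, b, n=2000):
    # Estimate out-of-sample error on fresh points from the same target.
    Xt = rng.uniform(-1, 1, (n, 2))
    return np.mean(np.sign(Xt @ w + b) != flat_target(Xt))

rand_err, fat_err = [], []
for _ in range(100):
    X = rng.uniform(-1, 1, (20, 2))
    y = flat_target(X)
    seps = [random_separator(X, y) for _ in range(30)]
    margins = [np.min(y * (X @ w + b)) / np.linalg.norm(w) for w, b in seps]
    rand_err.append(e_out(*seps[0]))                       # an arbitrary separator
    fat_err.append(e_out(*seps[int(np.argmax(margins))]))  # fattest of the sample

print("mean E_out, random separator :", np.mean(rand_err))
print("mean E_out, fattest separator:", np.mean(fat_err))

The fattest separators cluster around the target's symmetry axis far more tightly than arbitrary separators do, which is the effect the histograms above illustrate.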

10 of 26


Fat hyperplanes shatter fewer points (1)

 

 

11 of 26


Fat hyperplanes shatter fewer points (2)

Thin hyperplanes can implement all 8 dichotomies of the three points (the other 4 dichotomies are obtained by negating the weights and the bias).

Example:

12 of 26


Fat hyperplanes shatter fewer points (3)

Example (continued):

Thin (zero thickness) separators can shatter the three points shown above. As we increase the thickness of the separator, we will soon not be able to implement some dichotomies.

13 of 26


Fat hyperplanes shatter fewer points (4)

Example (continued):

  • This example illustrates why thick hyperplanes implement fewer of the possible dichotomies.

  • Note that the hyperplanes only “look thick” because the data points are close together.

  • If we moved the data further apart, then even a thick hyperplane could implement all the dichotomies.

  • What matters is the thickness of the hyperplane relative to the spacing of the data, as the sketch below illustrates.
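As a small check of this claim (the three points and the thickness values rho below are my own choice), the sketch maximizes the margin for each of the 8 dichotomies and counts how many remain realizable as the required thickness grows:

# Count dichotomies of three points realizable by a separator of
# thickness (margin) at least rho, via a hard-margin QP per dichotomy.
import itertools
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [0.5, 0.1], [0.2, 0.4]])  # three nearby points

def max_margin(y):
    # Largest geometric margin over hyperplanes implementing dichotomy y.
    obj = lambda u: 0.5 * u[1:] @ u[1:]               # u = [b, w1, w2]
    cons = {"type": "ineq", "fun": lambda u: y * (X @ u[1:] + u[0]) - 1.0}
    res = minimize(obj, np.zeros(3), method="SLSQP", constraints=cons)
    if not res.success:
        return -np.inf                                 # not linearly separable
    norm_w = np.linalg.norm(res.x[1:])
    return np.inf if norm_w < 1e-9 else 1.0 / norm_w   # w = 0: one-class dichotomy

for rho in (0.0, 0.18, 0.25):
    count = sum(max_margin(np.array(dich)) >= rho
                for dich in itertools.product([-1, 1], repeat=3))
    print(f"thickness rho = {rho}: {count} of 8 dichotomies realizable")

As rho grows past the spacing of the points, the count drops from all 8 dichotomies toward just the two one-class labelings: the fat separator can no longer isolate a single point from its close neighbors.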

14 of 26


Fat hyperplanes shatter fewer points (5)

 

Theorem (fat hyperplanes). For data points lying in a ball of radius R, hyperplanes of thickness (margin) at least ρ have VC dimension at most $\lceil R^2 / \rho^2 \rceil + 1$, independently of the dimension of the space. See the book for the proof.

15 of 26


Fat hyperplanes shatter fewer points (6)

16 of 26


 

17 of 26


 

Proof:

 

 

18 of 26


 

19 of 26


Summary of hyperplanes and generalization

 

Algorithm for selecting the separating hyperplane:

  • General

  • PLA

  • SVM (optimal hyperplane)

20 of 26


Non-separable data

Two ways to handle it:

  • tolerate error

  • nonlinear transform

21 of 26


Nonlinear transform and SVM (1)

Non-linearly separable

22 of 26


Nonlinear transform and SVM (2)

Non-linearly separable

23 of 26


Example
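A minimal sketch of the idea (the feature map and data below are my own illustration, not necessarily the lecture's example): data separable by a circle in X-space becomes linearly separable after the transform z = (x1^2, x2^2).

# Circle-separable data in X-space becomes linearly separable in Z-space
# under the nonlinear transform z = (x1^2, x2^2).
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.5, 1, -1)   # target: inside/outside a circle

Z = X**2                                             # the transform z = (x1^2, x2^2)

# In Z-space the circle x1^2 + x2^2 = 0.5 is the line z1 + z2 = 0.5,
# so the explicit hyperplane (w, b) below separates the transformed data.
w, b = np.array([-1.0, -1.0]), 0.5
print("misclassified after transform:", int(np.sum(np.sign(Z @ w + b) != y)))

Running the hard-margin QP from the earlier sketch on (Z, y), instead of using this explicit (w, b), would give the maximum-margin separator in Z-space, which maps back to a circular boundary in X-space.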

 

24 of 26


Summary of nonlinear transform + SVM

                        SVM                  SVM + nonlinear transform
    complexity control  # support vectors    # support vectors
    boundary            linear               sophisticated

                        perceptrons          perceptron + nonlinear transform
    complexity control  d + 1                d̃ + 1
    boundary            linear               sophisticated

(where d̃ is the dimension of the feature space after the transform)

25 of 26


Going to even higher dimensions

26 of 26

References (utilized for preparation of lecture notes or Matlab code)
