Lecture 22: The Optimal Hyperplane and Overfitting
Instructor: Ercan Atam
Institute for Data Science & Artificial Intelligence
Course: DSAI 512-Machine Learning
List of contents for this lecture
Relevant readings for this lecture
Cambridge University Press, 2022.
Recap: the optimal hyperplane

- PLA returns some separating hyperplane; which one depends on the ordering of the data (e.g. random).
- The optimal hyperplane is the separating hyperplane with the largest margin (cushion) to the data.
- Support vectors: the data points that sit on the cushion. Using only the support vectors, the classifier does not change.
- The algorithm: the optimal hyperplane is obtained by solving a Quadratic Programming problem.
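The quadratic program can be sketched numerically. Below is a minimal illustration (the toy data set and the choice of SciPy's SLSQP solver are my own, not the lecture's code) of solving: minimize (1/2) w^T w subject to y_n (w^T x_n + b) >= 1.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data set (illustrative choice, not from the lecture).
X = np.array([[1.0, 0.0], [3.0, 1.0], [-1.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Variables v = (w1, w2, b); objective is (1/2) w^T w.
def objective(v):
    w = v[:2]
    return 0.5 * w @ w

# One margin constraint y_n (w^T x_n + b) >= 1 per data point.
constraints = [
    {"type": "ineq", "fun": lambda v, i=i: y[i] * (v[:2] @ X[i] + v[2]) - 1.0}
    for i in range(len(X))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
margin = 1.0 / np.linalg.norm(w)  # geometric margin of the optimal hyperplane

# Support vectors: points whose margin constraint is active (equal to 1).
sv_mask = np.isclose(y * (X @ w + b), 1.0, atol=1e-3)
print("w =", w, "b =", b, "margin =", margin)
```

For this data the two points closest to the boundary end up as the support vectors, and the margin equals 1/||w||.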
Link to regularization

The optimal hyperplane performs a kind of "automatic" regularization:

                      minimize    subject to
optimal hyperplane:   w^T w       E_in = 0
regularization:       E_in        w^T w <= C

(Note: this is another form of regularization, different from the form we used earlier.)
Evidence that larger margin is better
Experimental evidence that larger margin is better (1)

Experiment 1

[Figure: histograms of the relative frequency of different margins (from 0 to 1) for a random separating hyperplane vs. the SVM.]

Observations:
Experimental evidence that larger margin is better (2)

Experiment 2
- Generate data sets from a target function which is +1 above the x-axis and -1 below it.
- For each data set, find a random separator and the optimal hyperplane (SVM, shown with dashed lines).
- For each data set, look at where a random hyperplane and the optimal hyperplane are.

[Figure: frequency histogram.]
Experimental evidence that larger margin is better (3)

Experiment 2 (continued)

[Figure: frequency histogram.]
Fat hyperplanes shatter fewer points (1)
Fat hyperplanes shatter fewer points (2)
Thin hyperplanes can implement all 8 dichotomies (the other 4 dichotomies are obtained by negating the weights and bias).
Example:
Fat hyperplanes shatter fewer points (3)
Example (continued):
Thin (zero-thickness) separators can shatter the three points shown above. As we increase the thickness of the separator, some dichotomies soon become impossible to implement.
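This shrinking number of dichotomies can be checked numerically. A small sketch (the three points and thresholds are my own construction, not the lecture's figure): for each of the 8 dichotomies of three points, solve the max-margin problem and count how many dichotomies a separator of a given thickness can still implement.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

# Three non-collinear points (illustrative choice).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def max_margin(y):
    """Largest achievable margin for labeling y, via the hard-margin QP."""
    cons = [
        {"type": "ineq", "fun": lambda v, i=i: y[i] * (v[:2] @ X[i] + v[2]) - 1.0}
        for i in range(len(X))
    ]
    res = minimize(lambda v: 0.5 * v[:2] @ v[:2], x0=np.zeros(3), constraints=cons)
    return 1.0 / max(np.linalg.norm(res.x[:2]), 1e-9)

def num_dichotomies(thickness):
    """How many of the 8 dichotomies a separator of this thickness implements."""
    labels = itertools.product([-1.0, 1.0], repeat=3)
    return sum(1 for y in labels if max_margin(np.array(y)) >= thickness)

print(num_dichotomies(0.01))  # thin separator: all 8 dichotomies
print(num_dichotomies(0.40))  # fat separator: only 6 remain
```

As the thickness grows, dichotomies that require a small margin (here, isolating the corner point) drop out first, exactly as the slide argues.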
Fat hyperplanes shatter fewer points (4)
Example (continued):
Fat hyperplanes shatter fewer points (5)
See the book for the proof.
Fat hyperplanes shatter fewer points (6)
Proof:
Summary of hyperplanes and generalization

Algorithms for selecting a separating hyperplane:
- General
- PLA
- SVM (optimal hyperplane)
Non-separable data

Two ways to handle data that is not linearly separable:
- tolerate errors (soft margin)
- apply a nonlinear transform
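The first option (tolerating errors, i.e. a soft margin) can be sketched as minimizing a hinge loss plus a margin-promoting penalty by subgradient descent. The toy data, penalty weight, and step size below are my own illustrative choices, not the lecture's formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overlapping classes: no hyperplane separates them exactly.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(1.0, 1.0, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# Soft-margin objective: lam * ||w||^2 + mean hinge loss, via subgradient descent.
w, b = np.zeros(2), 0.0
lam, lr = 0.01, 0.1
for _ in range(2000):
    viol = y * (X @ w + b) < 1.0                    # margin-violating points
    grad_w = 2 * lam * w - (X[viol] * y[viol][:, None]).sum(axis=0) / len(X)
    grad_b = -y[viol].sum() / len(X)
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean(np.sign(X @ w + b) == y)
print("training accuracy:", accuracy)
```

The separator tolerates a few misclassified points instead of failing outright, which is the point of the soft margin.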
Nonlinear transform and SVM (1)
Non-linearly separable
Nonlinear transform and SVM (2)
Non-linearly separable
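As a concrete sketch of the second option (the toy data and the transform are my own choices): circular data that is not linearly separable in x-space becomes separable after the transform z = (1, x1^2, x2^2), where a plain perceptron (PLA) finds a separating hyperplane.

```python
import numpy as np

# Toy data: inner circle labeled +1, outer circle labeled -1
# (not linearly separable in x-space).
angles = np.linspace(0.0, 2 * np.pi, 20, endpoint=False)
inner = 0.5 * np.c_[np.cos(angles), np.sin(angles)]
outer = 1.5 * np.c_[np.cos(angles), np.sin(angles)]
X = np.vstack([inner, outer])
y = np.hstack([np.ones(20), -np.ones(20)])

# Nonlinear transform: z = (1, x1^2, x2^2).
Z = np.c_[np.ones(len(X)), X[:, 0] ** 2, X[:, 1] ** 2]

# Perceptron learning algorithm (PLA) in z-space.
w = np.zeros(3)
for _ in range(1000):
    mistakes = np.where(np.sign(Z @ w) != y)[0]
    if len(mistakes) == 0:
        break
    i = mistakes[0]           # correct the first misclassified point
    w = w + y[i] * Z[i]

errors = np.sum(np.sign(Z @ w) != y)
print("training errors in z-space:", errors)
```

The linear boundary in z-space corresponds to a circular boundary back in x-space; the same transform works with the optimal hyperplane in place of PLA.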
Example
Summary of nonlinear transform + SVM

                                    boundary        complexity control
perceptron                          linear          no
perceptron + nonlinear transform    sophisticated   no
SVM                                 linear          yes
SVM + nonlinear transform           sophisticated   yes
Going to even higher dimensions
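One caution when going to higher dimensions: the number of features in a degree-Q polynomial transform of d inputs grows combinatorially. A quick sketch of the count (a standard combinatorial fact; the helper name is mine):

```python
from math import comb

def num_poly_features(d, Q):
    """Number of monomials of degree <= Q in d variables, constant included."""
    return comb(d + Q, Q)

print(num_poly_features(2, 2))   # 6: 1, x1, x2, x1^2, x1*x2, x2^2
print(num_poly_features(10, 5))  # 3003 features for a modest 10-D input
```

This explosion in dimension is why the complexity control that the margin provides becomes essential when combining SVM with nonlinear transforms.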
References (utilized for preparation of lecture notes or Matlab code)