Cross Validation, Regularization
Different methods for ensuring the generalizability of our models to unseen data.
Data 100/Data 200, Spring 2022 @ UC Berkeley
Josh Hug and Lisa Yan
1
Lecture 15
Plan for Next Three Lectures: Model Selection
2
Model Selection Basics: Cross Validation; Regularization (today)
Probability I: Random Variables; Estimators
Probability II: Bias and Variance; Inference/Multicollinearity
(Lifecycle context: Question & Problem Formulation → Data Acquisition → Exploratory Data Analysis → Prediction and Inference → Reports, Decisions, and Solutions)
Today’s Roadmap
Lecture 15, Data 100 Spring 2022
Cross Validation
Regularization
3
Review: Error vs. Complexity
As we increase the complexity of our model, training error decreases, while the variance of our model (and its tendency to overfit) increases.
4
Review: Collecting More Data to Detect Overfitting
Suppose we collect 9 new data points (shown in orange). We can compute the MSE of our original models on these new points without refitting.
5
(Figure: the candidate models plotted over the original 35 data points and the new 9 data points. Which looks best?)
Review: Collecting More Data to Detect Overfitting
Which model do you like best? And why?
6
Review: Collecting More Data to Detect Overfitting
The order 2 model seems best to me.
7
Review: Collecting More Data to Detect Overfitting
Suppose we have 7 models and don’t know which is best.
We could wait for more data and see which of our 7 models does best on the new points.
8
Idea 1: The Holdout Method
The simplest approach for avoiding overfitting is to keep some of our data secret from ourselves.
Example:
9
Holdout Set Demo
The code below splits our data into two sets of size 25 and 10.
10
Used for Training
Used for Evaluation
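A minimal sketch of such a split (not the lecture's exact code) is below; the arrays `x` and `y` are hypothetical stand-ins for the 35 data points, and we shuffle with scikit-learn before splitting with numpy.

```python
import numpy as np
from sklearn.utils import shuffle

# Hypothetical stand-in for the 35 data points used in lecture.
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=35)
y = 2 * x**2 + rng.normal(scale=2, size=35)

# Shuffle first so the validation set is not a contiguous block of the raw data.
x_shuffled, y_shuffled = shuffle(x, y, random_state=42)

# First 25 points are used for training, the remaining 10 for evaluation.
x_train, x_valid = np.split(x_shuffled, [25])
y_train, y_valid = np.split(y_shuffled, [25])
```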
Reflection: Shuffling
Question: Why did I shuffle first?
11
Reflection: Shuffling
Question: Why did I shuffle first?
Without shuffling, I’d be using a large contiguous block of the original data as my validation set, which may not be representative if the rows are ordered in some way.
Shuffling prevents this problem.
12
Hold Out Method Demo Step 1: Generating Models of Various Orders
First, we train the 7 models on our training set of 25 points, yielding the MSEs shown below. As before, training MSE decreases monotonically with model order.
13
Hold Out Method Demo Step 1: Generating Models of Various Orders (visualization)
Below, we show the order 0, 1, 2, and 6 models trained on our 25 training points.
14
Note: Our degree 6 model looks different than before. No surprise since variance is high and we’re using a different data set.
Hold Out Method Demo Step 2: Evaluating on the Validation Set (a.k.a. Dev Set)
Then we compute the MSE on our 10 dev set points (in orange) for all 7 models, without refitting on these points. Models are fit only on the 25 training points.
15
Evaluation: Validation set MSE is best for degree = 2!
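A sketch of these two steps (train the 7 polynomial models on the 25 training points, then score them on the 10 dev points) might look like the following; it continues from the hypothetical split above and is not the lecture's exact code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

train_mses, valid_mses = [], []
for order in range(7):  # 7 models: polynomial degree 0 through 6
    # Design matrix with columns [1, x, x^2, ..., x^order].
    X_train = np.vander(x_train, N=order + 1, increasing=True)
    X_valid = np.vander(x_valid, N=order + 1, increasing=True)

    # Fit ONLY on the 25 training points; the bias column replaces the intercept.
    model = LinearRegression(fit_intercept=False)
    model.fit(X_train, y_train)

    train_mses.append(mean_squared_error(y_train, model.predict(X_train)))
    valid_mses.append(mean_squared_error(y_valid, model.predict(X_valid)))

# Training MSE decreases monotonically with order; validation MSE does not.
best_order = int(np.argmin(valid_mses))  # degree 2 in the lecture's example
```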
Plotting Training and Validation MSE
16
Idealized Picture of Training and Validation Error
As we increase the complexity of our model, training error decreases, while validation error typically decreases at first and then increases once the model begins to overfit.
We pick the model complexity that minimizes validation set error.
17
Hyperparameter: Terminology
In machine learning, a hyperparameter is a value that controls the learning process itself.
We use the training data to choose our model's parameters, and a validation set (or cross validation) to choose hyperparameters.
18
K-Fold Cross Validation
Lecture 15, Data 100 Spring 2022
Cross Validation
Regularization
19
Another View of The Holdout Method
To determine the quality of a particular hyperparameter value: train a model using that hyperparameter on the training set, then evaluate that model's error on the validation set.
For example, imagine we are trying to pick between three values of a hyperparameter alpha.
20
Best! Use this one.
Another View of The Holdout Method
In the Holdout Method, we set aside the validation set at the beginning, and our choice is fixed.
In the example below, the last 20% of the data is used as the validation set.
21
Thought Experiment
If we decided (arbitrarily) to use non-overlapping contiguous chunks of 20% of the data, there are 5 possible “chunks” of data we could use as our validation set, as shown below.
22
Use first 20% and last 60% to train, remaining 20% as validation set.
Thought Experiment
If we decided (arbitrarily) to use non-overlapping contiguous chunks of 20% of the data, there are 5 possible “chunks” of data we could use as our validation set, as shown below.
23
Use folds 1, 3, 4, and 5 to train, and use fold 2 as validation set.
K-Fold Cross Validation
In the k-fold cross-validation approach, we split our data into k equally sized groups (often called folds).
Given k folds, to determine the quality of a particular hyperparameter value: for each fold, train a model on the other k − 1 folds, compute its error on the held-out fold, and then average the k resulting errors (see the code sketch below).
Example for k = 5:
24
Use folds 1, 3, 4, and 5 to train, and use fold 2 as validation set.
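A sketch of this procedure, assuming a numpy feature matrix `X`, a response vector `y`, and a Ridge model whose `alpha` is the hyperparameter being tuned (all illustrative choices, not the lecture's code):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def cv_mse(X, y, alpha, k=5):
    """Average validation MSE over k folds for one candidate hyperparameter value."""
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    fold_mses = []
    for train_idx, valid_idx in kf.split(X):
        model = Ridge(alpha=alpha)
        model.fit(X[train_idx], y[train_idx])      # train on the other k - 1 folds
        preds = model.predict(X[valid_idx])        # evaluate on the held-out fold
        fold_mses.append(mean_squared_error(y[valid_idx], preds))
    return np.mean(fold_mses)

# The quality of each alpha is its average MSE across the k folds.
alphas = [0.01, 0.1, 1, 10]
cv_scores = {a: cv_mse(X, y, a) for a in alphas}
best_alpha = min(cv_scores, key=cv_scores.get)
```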
5-Fold Cross Validation Demo
Given k folds, to determine the quality of a particular hyperparameter, e.g. alpha = 0.1:
25
5-Fold Cross Validation Demo
Given k folds, to determine the quality of a particular hyperparameter, e.g. alpha = 0.1:
26
Test Your Understanding: How Many MSEs?
Suppose we pick k = 3 and we have 4 possible hyperparameter values 𝛼=[0.01, 0.1, 1, 10].
27
Test Your Understanding: How Many MSEs?
Suppose we pick k = 3 and we have 4 possible hyperparameter values 𝛼=[0.01, 0.1, 1, 10]. We compute 3 × 4 = 12 fold-level MSEs in total, which we then average within each 𝛼 to get 4 cross-validation MSEs.
28
Test Your Understanding: Selecting Alpha
Which 𝛼 should we pick?
What fold (or folds) should we use as our training set for computing our final model parameters 𝜃?
29
Test Your Understanding: Selecting Alpha
Which 𝛼 should we pick? 0.1
What fold (or folds) should we use as our training set for computing our final model parameters 𝜃? All of them: once 𝛼 is chosen, we refit the model on the entire dataset.
30
Picking K
Typical choices of k are 5, 10, and N, where N is the amount of data.
Ultimately, the tradeoff is between the quality of our estimate of generalization error (larger k means each model trains on more data) and computation time (larger k means more models to fit).
31
K-Fold Cross Validation and Hold Out Method in sklearn
As an example, the code below performs a GridSearchCV using 5-fold cross validation to find the optimal hyperparameters for a model called “scaled_ridge_model”; a sketch of such code appears below.
Can also do the Hold Out method in sklearn:
32
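A sketch of both approaches; the definition of `scaled_ridge_model` as a StandardScaler + Ridge pipeline and the grid of alphas are assumptions rather than the lecture's exact code.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# One plausible definition of "scaled_ridge_model" (assumed, not from the slides).
scaled_ridge_model = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# 5-fold cross validation over a grid of candidate alphas.
grid = GridSearchCV(
    scaled_ridge_model,
    param_grid={"ridge__alpha": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="neg_mean_squared_error",  # sklearn scorers follow "higher is better"
)
grid.fit(X, y)
print(grid.best_params_)

# The Hold Out method: a single train/validation split.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
```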
Cross Validation Summary
When selecting between models, we want to pick the one that we believe would generalize best on unseen data. Generalization is estimated with a “cross validation score”*.
Two techniques to compute a “cross validation score”:
33
*Equivalently, I could have said “cross validation loss” instead of “cross validation score”.
Extra: Exhaustive Cross Validation
We don’t have to use non-overlapping contiguous chunks! We could use any split of the data into training and validation portions, e.g. the one shown below.
Iterating over ALL possible such permutations is known as “exhaustive cross validation.” We will not discuss this in our course.
34
Use first 20% and last 60% to train, remaining 20% as validation set.
Test Sets
Lecture 15, Data 100 Spring 2022
Cross Validation
Regularization
35
Providing an Estimate of Model Generalization to the World
Suppose we’re researchers building a state of the art regression model.
Now we want to report this model out to the world so it can be compared to other models.
36
Why Are Validation Set Errors Biased?
Analogy:
Comments?
37
Test Sets
Test sets can be something that we generate ourselves, or a common data set whose true outputs are kept hidden from the people building models.
In real world machine learning competitions, competing teams share a common test set.
38
Creating a Test Set Ourselves
We can do this easily in code. As before, we shuffle using scikit-learn, then split using numpy.
This time we pass np.split two split points instead of one, which divides the data into a Training, Validation, and Test set; a sketch is shown below.
39
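A sketch of a three-way split; the sizes 25/5/5 are illustrative rather than the lecture's exact numbers.

```python
import numpy as np
from sklearn.utils import shuffle

# Shuffle first, as before, so none of the three sets is a contiguous block.
x_shuffled, y_shuffled = shuffle(x, y, random_state=42)

# Two split points instead of one: 25 training, 5 validation, 5 test points.
x_train, x_valid, x_test = np.split(x_shuffled, [25, 30])
y_train, y_valid, y_test = np.split(y_shuffled, [25, 30])
```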
Test Set Terminology in Real World Practice
Warning: The terms “test set” and “validation set” are sometimes used interchangeably.
In practice, you may not need a test set at all!
40
Validation Sets and Test Sets in Real World Practice
Shared validation sets and test sets are used as standard benchmarks to compare ML algorithms.
Note: Since the competition uses test set scores to compare image classification algorithms, the best test set score is no longer an unbiased estimate of the best algorithm’s performance.
41
Idealized Training, Validation, and Test Error
As we increase the complexity of our model, training error decreases, while validation and test error eventually increase once the model overfits.
42
L2 Regularization (Ridge)
Lecture 15, Data 100 Spring 2022
Cross Validation
Regularization
43
Note: I might try to re-record and adjust these slides. Live lecture fell short of what I wanted to do!
Earlier
We saw how we can select model complexity by choosing the hyperparameter that minimizes validation error. This validation error can be computed using the Holdout Method or K-Fold Cross Validation.
44
Earlier
For the example below, our hyperparameter was the polynomial degree.
45
A More Complex Example
Suppose we have a dataset with 9 features.
46
Tweaking Complexity via Feature Selection
With 9 features, there are 2⁹ = 512 different models, since each feature can be included or excluded. One approach:
Runtime is exponential in the number of parameters!
47
Tweaking Complexity via Feature Selection
Alternate Idea: What if we use all of the features, but only a little bit?
48
Recall: Gradient Descent
Imagine we have a two parameter model.
49
Constraining Gradient Descent
Suppose we arbitrarily decide that gradient descent can never land outside of the green ball.
50
Test Your Understanding
Suppose we arbitrarily decide that gradient descent can never land outside of the green ball.
51
Constraining Gradient Descent
Gradient descent ends up at the best point it can reach inside the green ball, instead of the unconstrained optimum.
52
Adjusting the Allowed Space
We can change the size of our arbitrary boundary.
53
Adjusting the Allowed Space
We can change the size of our arbitrary boundary.
54
Philosophical Question
How are “ball radius” and “complexity” related?
55
Philosophical Question: Your Answer
How are “ball radius” and “complexity” related?
56
Philosophical Question: My Answer
The ball radius is a complexity control parameter.
57
Test Your Understanding
Let’s return to our 9 feature model from before (d = 9).
58
Test Your Understanding Answer
Let’s return to our 9 feature model from before (d = 9).
59
Answer: It depends!
Test Your Understanding Answer
If the ball is very tiny, our gradient descent is stuck near the origin.
60
Test Your Understanding Answer
Traditionally the “ball restriction” only applies to non-intercept terms.
61
Test Your Understanding
Back to our 9 feature model from before (d = 9).
62
Test Your Understanding Answer
Back to our 9 feature model from before (d = 9).
63
Test Your Understanding Answer
64
Training and Validation Errors vs. Ball Size for Our 9D Model
For very small ball size: the parameters are forced to stay near the origin, so the model underfits and both training and validation error are high.
For very large ball size: the constraint no longer binds and we recover ordinary least squares; training error is low, but validation error can rise as the model overfits.
65
L2 Regularization
Constraining our model’s parameters to a ball around the origin is called L2 Regularization.
Ordinary least squares. Find thetas that minimize:
Ordinary least squares with L2 regularization. Find thetas that minimize:
66
Such that θ1 through θd live inside a ball of radius Q.
L2 Regularization
Constraining our model’s parameters to a ball around the origin is called L2 Regularization.
Ordinary least squares. Find thetas that minimize:
Ordinary least squares with L2 regularization. Find thetas that minimize:
67
such that
Note, intercept term not included!
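Written out (a reconstruction in standard notation, where d is the number of non-intercept features, ŷᵢ is the model's prediction for the i-th point, and the ball of radius Q corresponds to the constraint ∑θⱼ² ≤ Q²; the intercept θ₀ is not included in the sum):

```latex
\text{OLS:}\qquad \min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2

\text{OLS + L2 regularization:}\qquad
\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2
\quad\text{such that}\quad \sum_{j=1}^{d}\theta_j^2 \le Q^2
```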
L2 Regularized Least Squares in sklearn
We can run least squares with an L2 regularization term by using the “Ridge” class.
Coefficients we get back:
68
Note: sklearn’s “alpha” parameter is inversely related to the ball radius: larger alpha corresponds to a smaller ball!
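A minimal sketch of fitting Ridge with a large and a tiny alpha, assuming a training design matrix `X_train` and response `y_train` as before (the alpha values are illustrative):

```python
from sklearn.linear_model import Ridge

# Large alpha -> strong regularization (small ball): coefficients shrink toward 0.
ridge_big_alpha = Ridge(alpha=10_000)
ridge_big_alpha.fit(X_train, y_train)
print(ridge_big_alpha.coef_)       # non-intercept coefficients (regularized)
print(ridge_big_alpha.intercept_)  # intercept (not regularized)

# Tiny alpha -> weak regularization (huge ball): coefficients approach the OLS solution.
ridge_tiny_alpha = Ridge(alpha=1e-6)
ridge_tiny_alpha.fit(X_train, y_train)
print(ridge_tiny_alpha.coef_)
```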
L2 Regularized Least Squares in sklearn
We can run least squares with an L2 regularization term by using the “Ridge” class.
For a tiny alpha, the coefficients are larger:
69
Note: sklearn’s “alpha” parameter is inversely related to the ball radius: larger alpha corresponds to a smaller ball!
L2 Regularized Least Squares in sklearn
We can run least squares with an L2 regularization term by using the “Ridge” class. For a tiny alpha, the coefficients are also about the same as a standard OLS model’s coefficients!
70
Green ball includes the OLS solution!
Figure (from lab 8)
In lab8, you’ll run an experiment for different values of alpha. The resulting plot is shown below.
71
Two Questions
Let’s address two quick questions:
72
Terminology Note
Why does sklearn use the word “Ridge”?
Because least squares with an L2 regularization term is also called “Ridge Regression”.
73
Quick Detour into EECS127 (not super important but worth mentioning)
In 127, you’ll learn (through the magic of Lagrangian Duality) that the two problems below are equivalent:
Problem 1: Find thetas that minimize:
Problem 2: Find thetas that minimize:
74
such that
Intuitively, this extra term on the right penalizes large thetas.
The “objective function” that gradient descent is minimizing now has an extra term.
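As a sketch, with λ ≥ 0 playing the role of the regularization strength (sklearn's alpha), the two equivalent problems can be written:

```latex
\text{Problem 1 (constrained):}\quad
\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2
\quad\text{such that}\quad \sum_{j=1}^{d}\theta_j^2 \le Q^2

\text{Problem 2 (penalized):}\quad
\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2
\ +\ \lambda\sum_{j=1}^{d}\theta_j^2
```

Each choice of Q corresponds to some λ and vice versa, which is why shrinking the ball and increasing the penalty have the same effect.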
Mathematical Note
Ridge Regression has a closed form solution which we will not derive.
75
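With the mean-squared-error objective written above, the closed form is the following standard result (if the squared loss is summed rather than averaged, the factor of n in front of λ disappears):

```latex
\hat{\theta}_{\text{ridge}} \;=\; \bigl(X^{\top}X + n\lambda I\bigr)^{-1}X^{\top}Y
```

Here X is the design matrix, Y the response vector, and I the identity matrix. Unlike the OLS solution, this inverse always exists when λ > 0, since XᵀX + nλI is positive definite.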
Scaling Data for Regularization
Lecture 15, Data 100 Spring 2022
Cross Validation
Regularization
76
One Issue With Our Approach
Our data from before has features of quite different numerical scale!
77
The coefficient on weight² will tend to be much smaller than the other coefficients, simply because that feature’s values are much larger, so a uniform penalty on the thetas treats the features unequally.
Coefficients from Earlier
78
Making Things Fair
Ideally, our data should all be on the same scale.
No demo in lecture; a minimal sketch is shown below. You’ll do this in Lab 8 using the “StandardScaler” transformer.
79
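A minimal sketch of the standardization step, assuming the same train/validation arrays as before; note the scaler is fit on the training data only.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn each feature's mean/std on training data only
X_valid_scaled = scaler.transform(X_valid)      # reuse those statistics for the validation data

# Now a single alpha penalizes every (standardized) feature on an equal footing.
scaled_ridge = Ridge(alpha=1.0)
scaled_ridge.fit(X_train_scaled, y_train)
```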
L1 Regularization (LASSO)
Lecture 15, Data 100 Spring 2022
Cross Validation
Regularization
80
L1 Regularization
We can also use other shapes.
81
L1 Regularization in Equation Form
Constraining the parameters to a diamond-shaped region (the L1 ball) instead of a sphere is known as L1 regularization. Expressed mathematically in the two equivalent forms below:
Problem 1: Find thetas that minimize:
Problem 2: Find thetas that minimize:
82
such that
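A reconstruction of the two forms (again the intercept is not included in the sum, and λ and Q are related as before):

```latex
\text{Problem 1 (constrained):}\quad
\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2
\quad\text{such that}\quad \sum_{j=1}^{d}\lvert\theta_j\rvert \le Q

\text{Problem 2 (penalized):}\quad
\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2
\ +\ \lambda\sum_{j=1}^{d}\lvert\theta_j\rvert
```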
L1 Regularized OLS in sklearn
In sklearn, we use the Lasso module.
83
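A minimal sketch (the alpha value is illustrative; in practice it is chosen by cross validation):

```python
from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train_scaled, y_train)

# Many coefficients come back exactly 0 -- LASSO effectively performs feature selection.
print(lasso_model.coef_)
print((lasso_model.coef_ == 0).sum(), "coefficients set to zero")
```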
LASSO and “Feature Selection”
The optimal parameters for a LASSO model tend to include a lot of zeroes! In other words, LASSO effectively selects only a subset of the features.
Intuitive (??) reason: the corners of the L1 ball lie on the coordinate axes, so the constrained optimum often lands at a corner, where one or more θⱼ are exactly zero.
84
Summary of Regression Methods
Our regression models are summarized below.
85
Name | Model | Loss | Reg. | Objective | Solution
OLS | ŷ = xᵀθ | Squared loss | None | (1/n) ∑ (yᵢ − ŷᵢ)² | θ̂ = (XᵀX)⁻¹XᵀY (if XᵀX is invertible)
Ridge Regression | ŷ = xᵀθ | Squared loss | L2 | (1/n) ∑ (yᵢ − ŷᵢ)² + λ ∑ θⱼ² | θ̂ = (XᵀX + nλI)⁻¹XᵀY
LASSO | ŷ = xᵀθ | Squared loss | L1 | (1/n) ∑ (yᵢ − ŷᵢ)² + λ ∑ |θⱼ| | No closed form
Cross Validation, Regularization
Content credit: Josh Hug, Joseph Gonzalez, Suraj Rampure
86
Lecture 15