Cross Validation, Regularization
Different methods for ensuring the generalizability of our models to unseen data.
Data 100/Data 200, Spring 2023 @ UC Berkeley
Narges Norouzi and Lisa Yan
Content credit: Acknowledgments
Lecture 15
Plan for Next Three Lectures: Model Selection
[Data science lifecycle: Question & Problem Formulation → Data Acquisition → Exploratory Data Analysis → Prediction and Inference → Reports, Decisions, and Solutions]
Model Selection Basics: Cross Validation and Regularization (today)
Probability I: Random Variables and Estimators
Probability II: Bias and Variance, Inference/Multicollinearity
Today’s Roadmap
Lecture 15, Data 100 Spring 2023
Cross Validation
Regularization
Review: Error vs. Complexity
As we increase the complexity of our model, training error decreases, variance increases, and error on new data typically decreases and then increases.
Review: Dataset
Today we will use the mpg dataset from the seaborn library.
The dataset has 392 rows and 9 columns. Our task is to use some of the columns and their transformations to predict the value of the mpg column.
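A minimal sketch of loading it (assuming the standard copy of the dataset bundled with seaborn):

import seaborn as sns

# Load the mpg dataset and drop rows with missing values, leaving 392 rows and 9 columns.
vehicle_data = sns.load_dataset("mpg").dropna()
vehicle_data.shape   # (392, 9)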
Review: Collecting More Data to Detect Overfitting
Suppose we use a sample of 35 points to fit our regression models, and then collect 9 new data points (shown in orange). We can compute the MSE of our original models on the new orange points without refitting.
[Figure: the candidate models fit to the original 35 data points, plotted along with the 9 new data points (orange). Which looks best?]
Review: Collecting More Data to Detect Overfitting
Which model do you like best? And why?
Review: Collecting More Data to Detect Overfitting
The order 2 model seems best to me.
Review: Collecting More Data to Detect Overfitting
Suppose we have 7 models and don’t know which is best.
We could wait for more data and see which of our 7 models does best on the new points.
Idea 1: The Holdout Method
The simplest approach for avoiding overfitting is to keep some of our data secret from ourselves.
Example: split our 35 sample points into a training set and a held-out validation set (demo below).
Holdout Set Demo
The code below splits our data into two sets of size 25 and 10.
import numpy as np
from sklearn.utils import shuffle

# Shuffle, then split: the first 25 rows are used for training, the remaining 10 for evaluation.
training_set, validation_set = np.split(shuffle(vehicle_data_sample_35), [25])
Reflection: Shuffling
Question: Why did I shuffle first?
Reflection: Shuffling
Question: Why did I shuffle first?
Without shuffling, the validation set would be one large contiguous block of the data (the last 10 rows), which may not be representative if the rows are ordered. Shuffling prevents this problem.
Hold Out Method Demo Step 1: Generating Models of Various Orders
First, we train the 7 models from the experiment, now using only our training set of 25 points, yielding the MSEs shown below. As before, training MSE decreases monotonically with model order.
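A minimal sketch of this step, assuming (hypothetically) that the models use the horsepower column of training_set as the single input feature:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

x_train = training_set["horsepower"].to_numpy()   # assumed feature column
y_train = training_set["mpg"].to_numpy()

def design_matrix(x, order):
    # Columns 1, x, x^2, ..., x^order.
    return np.vander(x, N=order + 1, increasing=True)

fits, train_mses = [], []
for order in range(7):
    model = LinearRegression(fit_intercept=False)   # the first column already acts as the intercept
    model.fit(design_matrix(x_train, order), y_train)
    fits.append(model)
    train_mses.append(mean_squared_error(y_train, model.predict(design_matrix(x_train, order))))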
Hold Out Method Demo Step 1: Generating Models of Various Orders (Visualization)
Below, we show the order 0, 1, 2, and 6 models trained on our 25 training points.
Note: our degree 6 model looks different from before. No surprise, since its variance is high and we are now fitting on a different (smaller) dataset.
Hold Out Method Demo Step 2: Evaluating on the Validation Set (a.k.a. Dev Set)
Then we compute the MSE on our 10 validation points (in orange) for all 7 models, without refitting on these orange points; the models are fit only on the 25 training points.
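Continuing the sketch above (same hypothetical feature), the evaluation step could look like:

x_val = validation_set["horsepower"].to_numpy()
y_val = validation_set["mpg"].to_numpy()

val_mses = []
for order, model in enumerate(fits):
    # Reuse the already-fit model; no refitting on the validation points.
    val_mses.append(mean_squared_error(y_val, model.predict(design_matrix(x_val, order))))

best_order = int(np.argmin(val_mses))   # the order with the lowest validation MSE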
Evaluation: Validation set MSE is best for degree = 2!
Plotting Training and Validation MSE
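A minimal matplotlib sketch, continuing the hypothetical running example above, of how this plot could be produced:

import matplotlib.pyplot as plt

orders = range(7)
plt.plot(orders, train_mses, marker="o", label="training MSE")
plt.plot(orders, val_mses, marker="o", label="validation MSE")
plt.xlabel("polynomial order")
plt.ylabel("MSE")
plt.legend()
plt.show()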
Idealized Picture of Training and Validation Error
As we increase the complexity of our model, training error decreases, while validation error first decreases and then increases.
We pick the model complexity that minimizes validation set error.
Hyperparameter: Terminology
In machine learning, a hyperparameter is a value that controls the learning process itself.
We use the training set to fit the model parameters, and a validation set (or cross validation) to select hyperparameter values such as the polynomial degree.
K-Fold Cross Validation
Lecture 15, Data 100 Spring 2023
Cross Validation
Regularization
Another View of The Holdout Method
To determine the quality of a particular hyperparameter value: train a model on the training set using that value, then compute that model's error on the validation set.
Example: imagine we are trying to pick between three values of a hyperparameter α. We train one model per value and choose the value whose model achieves the lowest validation error.
Another View of The Holdout Method
In the Holdout Method, we set aside the validation set at the beginning, and our choice is fixed.
In the example below, the last 20% of the data is used as the validation set.
Thought Experiment
If we decided (arbitrarily) to use non-overlapping contiguous chunks of 20% of the data, there are 5 possible “chunks” of data we could use as our validation set, as shown below.
For example: use the first 20% and last 60% of the data to train, and the remaining 20% as the validation set.
Equivalently: use folds 1, 3, 4, and 5 to train, and use fold 2 as the validation set.
K-Fold Cross Validation
In the k-fold cross-validation approach, we split our data into k equally-sized groups (often called folds).
Given k folds, to determine the quality of a particular hyperparameter value: for each fold i, train a model on the other k − 1 folds and compute its error on fold i; the hyperparameter's quality is the average of the k resulting validation errors.
Example for k = 5:
5-Fold Cross Validation Demo
Given k folds, to determine the quality of a particular hyperparameter value, e.g. α = 0.1, we compute a validation MSE on each fold in turn and average the k results.
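A minimal sketch of this procedure with sklearn's KFold, assuming a hypothetical feature DataFrame X, a target Series y, and Ridge as the model whose alpha we are tuning:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cv_quality(alpha, X, y, k=5):
    """Average validation MSE of Ridge(alpha=alpha) across k folds."""
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    fold_mses = []
    for train_idx, val_idx in kf.split(X):
        model = Ridge(alpha=alpha)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        fold_mses.append(mean_squared_error(y.iloc[val_idx], model.predict(X.iloc[val_idx])))
    return np.mean(fold_mses)

# e.g. cv_quality(0.1, X, y) estimates the quality of alpha = 0.1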
Test Your Understanding: How Many MSEs?
Suppose we pick k = 3 and we have 4 possible hyperparameter values α = [0.01, 0.1, 1, 10]. How many total MSE values will we compute to get the quality of α = 10?
Test Your Understanding: How Many MSEs?
Suppose we pick k = 3 and we have 4 possible hyperparameter values α = [0.01, 0.1, 1, 10].
Answer: 3. To get the quality of α = 10 we compute one MSE per fold, i.e. k = 3 MSEs. (Evaluating all 4 candidate values would take 4 × 3 = 12 MSEs.)
Test Your Understanding: Selecting Alpha
Which 𝛼 should we pick?
What fold (or folds) should we use as our training set for computing our final model parameters 𝜃?
Test Your Understanding: Selecting Alpha
Which 𝛼 should we pick? 0.1
What fold (or folds) should we use as our training set for computing our final model parameters θ? All of them: once we've chosen α, we refit the model on all of the data.
Picking K
Typical choices of k are 5, 10, and N, where N is the amount of data.
Ultimately, the tradeoff is between the quality of the generalization estimate and computation time: larger k means more training data in each fold, but k times as many model fits.
Cross Validation Summary
When selecting between models, we want to pick the one that we believe would generalize best on unseen data. Generalization is estimated with a “cross validation score”*.
Two techniques to compute a “cross validation score”: the holdout method and k-fold cross validation.
*Equivalently, I could have said “cross validation loss” instead of “cross validation score”.
Test Sets
Lecture 15, Data 100 Spring 2023
Cross Validation
Regularization
Providing an Estimate of Model Generalization to the World
Suppose we’re researchers building a state-of-the-art regression model.
Now we want to report this model out to the world so it can be compared to other models.
Why Are Validation Set Errors Biased?
Because we select the hyperparameter that does best on the validation set, the chosen model's validation error is an optimistically biased estimate of its error on truly unseen data: we have effectively “used up” the validation set in making our choice.
Test Sets
Test sets can be something that we generate ourselves, or they can be a common dataset whose true outputs are kept hidden.
In real world machine learning competitions, competing teams share a common test set.
Creating a Test Set Ourselves
We can do this easily in code. As before, we shuffle using scikit-learn then split using numpy.
We use np.split, now providing two split indices instead of one. For example, the code below splits the data into training, validation, and test sets of size 25, 5, and 5.
# Shuffle, then split at indices 25 and 30 into training, validation, and test sets.
train_set, val_set, test_set = np.split(shuffle(vehicle_data_sample_35), [25, 30])
Test Set Terminology in Real World Practice
Warning: The terms “test set” and “validation set” are sometimes used interchangeably.
In practice, you may not need a test set at all!
Validation Sets and Test Sets in Real World Practice
Shared validation and test sets are used as standard benchmarks for comparing ML algorithms.
Note: since the competition uses test set scores to compare image classification algorithms, the best test set score is no longer an unbiased estimate of the best algorithm's performance.
Idealized Training, Validation, and Test Error
As we increase the complexity of our model, training error decreases, while validation and test error first decrease and then increase.
L2 Regularization (Ridge)
Lecture 15, Data 100 Spring 2023
Cross Validation
Regularization
Earlier
We saw how we can select model complexity by choosing the hyperparameter that minimizes validation error. This validation error can be computed using the Holdout Method or K-Fold Cross Validation.
Earlier
For the example below, our hyperparameter was the polynomial degree.
A More Complex Example
Suppose we have a dataset with 9 features.
These 9 features live in the DataFrame vehicle_data_with_squared_features.
Tweaking Complexity via Feature Selection
With 9 features, there are 2⁹ = 512 different models, one for each subset of features. One approach: fit a model for each subset and use cross validation to pick the best, as sketched below.
Runtime is exponential in the number of parameters!
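An illustrative sketch of that exhaustive search, scoring every non-empty feature subset with 5-fold cross validation (the DataFrame and target names are assumed from earlier slides):

from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

features = list(vehicle_data_with_squared_features.columns)
best_subset, best_mse = None, np.inf
for r in range(1, len(features) + 1):
    for subset in combinations(features, r):
        X = vehicle_data_with_squared_features[list(subset)]
        # cross_val_score reports negative MSE, so negate it to get MSE.
        mse = -cross_val_score(LinearRegression(), X, vehicle_data["mpg"],
                               scoring="neg_mean_squared_error", cv=5).mean()
        if mse < best_mse:
            best_subset, best_mse = subset, mse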
Tweaking Complexity via Feature Selection
Alternate Idea: What if we use all of the features, but only a little bit?
Recall: Gradient Descent
Imagine we have a two-parameter model.
Constraining Gradient Descent
We can decide that gradient descent can never land outside of the green ball (to force small parameters).
Idea for reducing model complexity: What if we use all of the features, but only a little bit?
Constraining Gradient Descent
Gradient descent ends up at the best parameter value within the ball instead (a point on the ball's boundary).
Adjusting the Allowed Space
We can change the size of our arbitrary boundary.
Philosophical Question
How are “ball radius” and “complexity” related?
Philosophical Question: My Answer
The ball radius is a complexity control parameter: a larger radius allows larger parameters and more complex models, while a smaller radius forces the parameters toward 0 and yields simpler models.
Test Your Understanding
Let’s return to our 9 feature model from before (d = 9).
Based on the concept of reducing the complexity of the model by constraining the parameters to be within a ball centered at the origin, what kind of model will we have if we pick a very small radius?
Test Your Understanding Answer
Answer: It depends!
Test Your Understanding Answer
If the ball is very tiny, gradient descent is stuck near the origin: all parameters are approximately 0, so the model predicts ŷ ≈ 0 for every input.
Test Your Understanding Answer
Traditionally, the “ball restriction” applies only to the non-intercept terms, so with a very small radius the model predicts the constant ŷ = θ₀.
Training and Validation Errors vs. Ball Size for Our 9D Model
For a very small ball size: the parameters are heavily constrained, so both training error and validation error are high (underfitting).
For a very large ball size: the constraint has no effect and we recover ordinary least squares; training error is low, but validation error can be high (overfitting).
L2 Regularization
Constraining our model’s parameters to a ball around the origin is called L2 Regularization.
Ordinary least squares: find the θ that minimizes
$\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
Ordinary least squares with L2 regularization: find the θ that minimizes
$\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ such that $\sum_{j=1}^{d}\theta_j^2 \le Q$, i.e. θ₁ through θ_d must live inside a ball around the origin whose size is controlled by Q.
Note: the intercept term θ₀ is not included in the constraint; only θ₁ through θ_d are regularized.
Quick Detour into EECS127 (not tested in this class but worth mentioning)
In 127, you’ll learn (through the magic of Lagrangian Duality) that the two problems below are equivalent:
Problem 1: find the θ that minimizes $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ such that $\sum_{j=1}^{d}\theta_j^2 \le Q$.
Problem 2: find the θ that minimizes $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha\sum_{j=1}^{d}\theta_j^2$.
Intuitively, the extra term on the right penalizes large θ values.
The “objective function” that gradient descent is minimizing now has an extra term.
Covered up until this slide on 3/7. Will continue in the next lecture
L2 Regularized Least Squares in sklearn
We can run least squares with an L2 regularization term by using the “Ridge” class.
The coefficients we get back (ridge_model.coef_) are all very small, since this large alpha strongly penalizes large parameters.
Note: sklearn's alpha parameter corresponds to the α in the L2-regularized objective above.
from sklearn.linear_model import Ridge

# Large alpha = strong regularization; coefficients are pushed toward 0.
ridge_model = Ridge(alpha=10000)
ridge_model.fit(vehicle_data_with_squared_features, vehicle_data["mpg"])
ridge_model.coef_
L2 Regularized Least Squares in sklearn
For a tiny alpha, the coefficients are larger:
from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha=10**-5)
ridge_model.fit(vehicle_data_with_squared_features, vehicle_data["mpg"])
L2 Regularized Least Squares in sklearn
For a tiny alpha, the coefficients are also about the same as a standard OLS model's coefficients!
Green ball includes the OLS solution!
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(vehicle_data_with_squared_features, vehicle_data["mpg"])
Figure (from lab 8)
In Lab 8, you'll run an experiment for different values of alpha and plot the results.
Terminology Note
Why does sklearn use the word “Ridge”?
Because least squares with an L2 regularization term is also called “Ridge Regression”.
Why does sklearn use a hyperparameter (alpha) that acts like the inverse of the ball radius? Because sklearn solves the penalized form (Problem 2 above): a large alpha corresponds to a small ball, and a small alpha to a large ball.
Mathematical Note
Ridge regression has a closed-form solution, which we will not derive. For the mean-squared-error objective above:
$\hat{\theta} = (\mathbb{X}^{\top}\mathbb{X} + n\alpha I)^{-1}\mathbb{X}^{\top}\mathbb{Y}$, where $I$ is the identity matrix.
After lecture edit:
Note: This formula assumes that we regularize the intercept term (in practice we should not).
To avoid regularization on the intercept term, the matrix “I” should be replaced with the identity matrix with the first element (at index (0, 0)) replaced with 0.
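A minimal numpy sketch of this closed-form solution, assuming a design matrix X whose first column is all 1s (the intercept) and a target vector y:

import numpy as np

def ridge_closed_form(X, y, alpha):
    n, d = X.shape
    I = np.eye(d)
    I[0, 0] = 0          # do not regularize the intercept term
    return np.linalg.solve(X.T @ X + n * alpha * I, X.T @ y)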
Scaling Data for Regularization
Lecture 15, Data 100 Spring 2023
Cross Validation
Regularization
One Issue With Our Approach
Our data from before have features on quite different numerical scales!
As a result, the θ for weight² will tend to be much smaller than the other parameters (since that feature's values are much larger), so the regularization penalty affects the features unevenly.
Coefficients from Earlier
81
3726997
Making Things Fair
Ideally, our features should all be on the same scale, so that the regularization penalty treats them equally.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Standardize each feature to mean 0 and standard deviation 1 before regularizing.
ss = StandardScaler()
rescaled_df = pd.DataFrame(ss.fit_transform(vehicle_data_with_squared_features),
                           columns=ss.get_feature_names_out())
ridge_model = Ridge(alpha=10000)
ridge_model.fit(rescaled_df, vehicle_data["mpg"])
ridge_model.coef_
L1 Regularization (LASSO)
Lecture 15, Data 100 Spring 2023
Cross Validation
Regularization
L1 Regularization
We can also constrain the parameters using other shapes, e.g. a diamond-shaped region instead of a ball.
L1 Regularization in Equation Form
Constraining θ to lie within an L1 ball (a diamond-shaped region in 2D) is known as L1 regularization. It can be expressed mathematically in the two equivalent forms below:
Problem 1: find the θ that minimizes $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ such that $\sum_{j=1}^{d}|\theta_j| \le Q$.
Problem 2: find the θ that minimizes $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha\sum_{j=1}^{d}|\theta_j|$.
L1 Regularized OLS in sklearn
In sklearn, we use the Lasso class.
from sklearn.linear_model import Lasso

# L1-regularized least squares; with this alpha, several coefficients typically come back exactly 0.
lasso_model = Lasso(alpha=10)
lasso_model.fit(vehicle_data_with_squared_features, vehicle_data["mpg"])
lasso_model.coef_
LASSO and “Feature Selection”
The optimal parameters for a LASSO model tend to include a lot of zeroes! In other words, LASSO effectively selects only a subset of the features.
Intuitive reason: the L1 constraint region has sharp corners that lie on the coordinate axes (where some parameters are exactly 0), and the minimizer tends to land on one of these corners.
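A small follow-up sketch, continuing from the fitted lasso_model above, that counts the zeroed-out coefficients:

import numpy as np

num_zero = int(np.sum(lasso_model.coef_ == 0))
print(f"{num_zero} of {lasso_model.coef_.size} coefficients are exactly 0")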
Summary of Regression Methods
Our regression models are summarized below.
Name | Model | Loss | Reg. | Objective | Solution
OLS | Ŷ = Xθ | Squared loss | None | (1/n)‖Y − Xθ‖² | θ̂ = (XᵀX)⁻¹XᵀY (if X is full column rank)
Ridge Regression | Ŷ = Xθ | Squared loss | L2 | (1/n)‖Y − Xθ‖² + α Σⱼ θⱼ² | θ̂ = (XᵀX + nαI)⁻¹XᵀY
LASSO | Ŷ = Xθ | Squared loss | L1 | (1/n)‖Y − Xθ‖² + α Σⱼ |θⱼ| | No closed form
Cross Validation, Regularization
Lecture 15