1 of 86

Cross Validation, Regularization

Different methods for ensuring the generalizability of our models to unseen data.

Data 100/Data 200, Spring 2022 @ UC Berkeley

Josh Hug and Lisa Yan

1

Lecture 15

2 of 86

Plan for Next Three Lectures: Model Selection

2

Model Selection Basics: Cross Validation and Regularization (today)

Probability I: Random Variables, Estimators

Probability II: Bias and Variance, Inference/Multicollinearity

[Background diagram shows the data science lifecycle: Question & Problem Formulation → Data Acquisition → Exploratory Data Analysis → Prediction and Inference → Reports, Decisions, and Solutions]

3 of 86

Today’s Roadmap

Lecture 15, Data 100 Spring 2022

Cross Validation

  • The Holdout Method
  • K-Fold Cross Validation
  • Test Sets

Regularization

  • L2 Regularization (Ridge)
  • Scaling Data for Regularization
  • L1 Regularization (LASSO)

3

4 of 86

Review: Error vs. Complexity

As we increase the complexity of our model:

  • Training error decreases.
  • Variance increases.

4

5 of 86

Review: Collecting More Data to Detect Overfitting

Suppose we collect 9 new data points (shown in orange). We can compute the MSE of our original models on these new orange points without refitting.

5

[Plots: each model's MSE on the original 35 data points vs. the new 9 data points. Which is best?]

6 of 86

Review: Collecting More Data to Detect Overfitting

Which model do you like best? And why?

6

[Plots: each model's MSE on the original 35 data points vs. the new 9 data points. Which is best?]

7 of 86

Review: Collecting More Data to Detect Overfitting

The order 2 model seems best to me.

  • Performs best on data that has not yet been seen.

7

[Plots: each model's MSE on the original 35 data points vs. the new 9 data points. Which is best?]

8 of 86

Review: Collecting More Data to Detect Overfitting

Suppose we have 7 models and don’t know which is best.

  • Can’t necessarily trust the training error. We may have overfit!

We could wait for more data and see which of our 7 models does best on the new points.

  • Unfortunately, that means we need to wait for more data. May be very expensive or time consuming.
  • “Will see an alternate approach next week.”
    • As promised! Let’s do it.

8

9 of 86

Idea 1: The Holdout Method

The simplest approach for avoiding overfitting is to keep some of our data secret from ourselves.

Example:

  • Previous approach: We fit 7 models on all 35 of the available data points. Then waited for 9 new data points to decide which is best.
  • Holdout Method: We train our models on only 25 of the 35 available data points. Then we evaluate the models’ performance on the remaining 10 data points.
    • Data used to train is called the “training set”.
    • Held out data is often called the “validation set” or “development set” or “dev set”. These terms are all synonymous and used by different authors.

9

10 of 86

Holdout Set Demo

The code below splits our data into two sets of size 25 and 10.

10

[The first 25 rows are used for training; the remaining 10 are used for evaluation.]
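A minimal sketch of this split, assuming the 35 rows live in a DataFrame called “vehicle_data” (an illustrative name, not necessarily the one from lecture):

    import numpy as np
    from sklearn.utils import shuffle

    # Shuffle first (see the next slide for why), then take the first 25 rows
    # for training and the remaining 10 for evaluation.
    shuffled = shuffle(vehicle_data, random_state=42)
    train_set, dev_set = np.split(shuffled, [25])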

11 of 86

Reflection: Shuffling

Question: Why did I shuffle first?

11

12 of 86

Reflection: Shuffling

Question: Why did I shuffle first?

I’m using a large contiguous block of data as my validation set.

  • If the data is sorted by some feature, e.g. vehicle MPG, then my model will perform poorly on this unseen data.
    • Model will have never seen a high MPG vehicle.

Shuffling prevents this problem.

  • Alternate mathematically equivalent approach: Picking 10 samples randomly.

12

13 of 86

Hold Out Method Demo Step 1: Generating Models of Various Orders

First we rerun the experiment, now training our 7 models on the training set of 25 points, yielding the MSEs shown below. As before, training MSE decreases monotonically with model order.

13

14 of 86

Hold Out Method Demo Step 1: Generating Models of Various Orders (visualization)

Below, we show the order 0, 1, 2, and 6 models trained on our 25 training points.

14

Note: Our degree 6 model looks different than before. No surprise since variance is high and we’re using a different data set.

15 of 86

Hold Out Method Demo Step 2: Evaluating on the Validation Set (a.k.a. Dev Set)

Then we compute the MSE on our 10 dev set points (in orange) for all 7 models, without refitting on these orange points. Models are fit only on the 25 training points.

15

Evaluation: Validation set MSE is best for degree = 2!
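A sketch of both steps, continuing from the 25/10 split above. The column names “hp” and “mpg” are assumptions about the demo dataset, not necessarily the exact ones used in lecture:

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X_train, y_train = train_set[["hp"]], train_set["mpg"]
    X_dev, y_dev = dev_set[["hp"]], dev_set["mpg"]

    for degree in range(7):  # the 7 models: polynomial orders 0 through 6
        model = Pipeline([
            ("poly", PolynomialFeatures(degree=degree)),  # degree 0 works because include_bias=True
            ("lin", LinearRegression()),
        ])
        model.fit(X_train, y_train)  # parameters come from the 25 training points only
        train_mse = mean_squared_error(y_train, model.predict(X_train))
        dev_mse = mean_squared_error(y_dev, model.predict(X_dev))  # no refitting on the dev set
        print(f"degree {degree}: train MSE = {train_mse:.2f}, validation MSE = {dev_mse:.2f}")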

16 of 86

Plotting Training and Validation MSE

16

17 of 86

Idealized Picture of Training and Validation Error

As we increase the complexity of our model:

  • Training error decreases.
  • Variance increases.

  • Typically, error on validation data decreases, then increases.

We pick the model complexity that minimizes validation set error.

17

18 of 86

Hyperparameter: Terminology

In machine learning, a hyperparameter is a value that controls the learning process itself.

  • For our example today, we built seven models, each of which had a hyperparameter called degree or k that controlled the order of our polynomial.

We use:

  • The training set to select parameters.
  • The validation set (a.k.a. development set) (a.k.a. cross validation set) to select hyperparameters, or more generally, between different competing models.

18

19 of 86

K-Fold Cross Validation

Lecture 15, Data 100 Spring 2022

Cross Validation

  • The Holdout Method
  • K-Fold Cross Validation
  • Test Sets

Regularization

  • L2 Regularization (Ridge)
  • Scaling Data for Regularization
  • L1 Regularization (LASSO)

19

20 of 86

Another View of The Holdout Method

To determine the quality of a particular hyperparameter:

  • Train model on ONLY the training set. Quality is model’s error on ONLY the validation set.

For example, imagine we are trying to pick between three values of a hyperparameter alpha.

20

Best! Use this one.

21 of 86

Another View of The Holdout Method

In the Holdout Method, we set aside the validation set at the beginning, and our choice is fixed.

  • Train model on ONLY the training set. Quality is model’s error on ONLY the validation set.

Example below where the last 20% is used as the validation set.

21

22 of 86

Thought Experiment

If we decided (arbitrarily) to use non-overlapping contiguous chunks of 20% of the data, there are 5 possible “chunks” of data we could use as our validation set, as shown below.

22

Use first 20% and last 60% to train, remaining 20% as validation set.

23 of 86

Thought Experiment

If we decided (arbitrarily) to use non-overlapping contiguous chunks of 20% of the data, there are 5 possible “chunks” of data we could use as our validation set, as shown below.

  • The common term for these chunks is a “fold”.
    • For example, for chunks of size 20%, we have 5 folds.

23

Use folds 1, 3, 4, and 5 to train, and use fold 2 as validation set.

24 of 86

K-Fold Cross Validation

In the k-fold cross-validation approach, we split our data into k equally sized groups (often called folds).

Given k folds, to determine the quality of a particular hyperparameter:

  • Pick a fold, which we’ll call the validation fold. Train model on all but this fold. Compute error on the validation fold.
  • Repeat the step above for all k possible choices of validation fold.
  • Quality is the average of the k validation fold errors (see the code sketch below).

Example for k = 5:

24

Use folds 1, 3, 4, and 5 to train, and use fold 2 as validation set.
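A minimal sketch of this procedure using sklearn’s KFold splitter. The Ridge model, the alpha values, and the arrays X and y are placeholders for whatever model, hyperparameter, and data are being scored:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold

    def cv_quality(alpha, X, y, k=5):
        """Average validation-fold MSE for one hyperparameter value."""
        fold_errors = []
        for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=42).split(X):
            model = Ridge(alpha=alpha)
            model.fit(X[train_idx], y[train_idx])    # train on the other k-1 folds
            preds = model.predict(X[val_idx])        # evaluate on the held-out fold
            fold_errors.append(mean_squared_error(y[val_idx], preds))
        return np.mean(fold_errors)                  # quality = average of the k fold errors

    # e.g. compare cv_quality(0.01, X, y), cv_quality(0.1, X, y), cv_quality(1, X, y), ...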

25 of 86

5-Fold Cross Validation Demo

Given k folds, to determine the quality of a particular hyperparameter, e.g. alpha = 0.1:

  • Pick a fold, which we’ll call the validation fold. Train model on all but this fold. Compute error on the validation fold.
  • Repeat the step above for all k possible choices of validation fold.
  • Quality is the average of the k validation fold errors.

25

26 of 86

5-Fold Cross Validation Demo

Given k folds, to determine the quality of a particular hyperparameter, e.g. alpha = 0.1:

  • Pick a fold, which we’ll call the validation fold. Train model on all but this fold. Compute error on the validation fold.
  • Repeat the step above for all k possible choices of validation fold.
  • Quality is the average of the k validation fold errors.

26

27 of 86

Test Your Understanding: How Many MSEs?

Suppose we pick k = 3 and we have 4 possible hyperparameter values 𝛼=[0.01, 0.1, 1, 10].

  • How many total MSE values will we compute to get the quality of 𝛼=10?
  • How many total MSE values will we compute to find the best 𝛼?

27

28 of 86

Test Your Understanding: How Many MSEs?

Suppose we pick k = 3 and we have 4 possible hyperparameter values 𝛼=[0.01, 0.1, 1, 10].

  • How many total MSE values will we compute to get the quality of 𝛼=10? 3
  • How many total MSE values will we compute to find the best 𝛼? 12

28

29 of 86

Test Your Understanding: Selecting Alpha

Which 𝛼 should we pick?

What fold (or folds) should we use as our training set for computing our final model parameters 𝜃?

29

30 of 86

Test Your Understanding: Selecting Alpha

Which 𝛼 should we pick? 0.1

What fold (or folds) should we use as our training set for computing our final model parameters 𝜃?

  • There’s no reason to prefer any fold over any other. In practice, it’s best to train the final model on all of the data, i.e. use all 3 folds.

30

31 of 86

Picking K

Typical choices of k are 5, 10, and N, where N is the number of data points.

  • k = N is also known as “leave-one-out cross validation”, and will typically give you the best results.
    • In this approach, each validation set is only one point.
    • Every point gets a chance to be used as the validation set.
  • k = N is also very expensive, requiring you to fit a huge number of models.

Ultimately, the tradeoff is between the quality of the estimate (larger k) and computation time.

31

32 of 86

K-Fold Cross Validation and Hold Out Method in sklearn

As an example, the code below uses GridSearchCV with 5-fold cross validation to find the optimal hyperparameters for a model called “scaled_ridge_model”.

  • The hyperparameters to try are stored in a dictionary called “parameters_to_try”.
  • The loss function is given by the “scoring” parameter.
  • The number of folds is given by “cv = 5”.
  • You’ll do this in Lab 8.

Can also do the Hold Out method in sklearn:
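A sketch consistent with the description above; the alpha grid is illustrative, and “scaled_ridge_model” is assumed to be a StandardScaler + Ridge pipeline (which is why the parameter name is prefixed with “ridge__”):

    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    scaled_ridge_model = make_pipeline(StandardScaler(), Ridge())
    parameters_to_try = {"ridge__alpha": [0.01, 0.1, 1, 10]}     # illustrative grid

    grid = GridSearchCV(scaled_ridge_model, parameters_to_try,
                        scoring="neg_mean_squared_error",  # sklearn maximizes scores, hence "neg"
                        cv=5)                              # 5-fold cross validation
    grid.fit(X_train, y_train)                             # assumes X_train, y_train already exist
    print(grid.best_params_)

    # Hold Out method: set aside a single fixed validation set instead of folding.
    X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)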

32

33 of 86

Cross Validation Summary

When selecting between models, we want to pick the one that we believe would generalize best on unseen data. Generalization is estimated with a “cross validation score”*.

  • When selecting between models, keep the model with the best score.

Two techniques to compute a “cross validation score”:

  • The Holdout Method: Break data into a separate training set and validation set.
    • Use training set to fit parameters (thetas) for the model.
    • Use validation set to score the model.
    • Also called “Simple Cross Validation” in some sources.
  • K-Fold Cross Validation: Break data into K contiguous non-overlapping “folds”.
    • Perform K rounds of Simple Cross Validation, except:
      • Each fold gets to be the validation set exactly once.
      • The final score of a model is the average validation score across the K trials.

33

*Equivalently, I could have said “cross validation loss” instead of “cross validation score”.

34 of 86

Extra: Exhaustive Cross Validation

We don’t have to use non-overlapping contiguous chunks! Could use

  • Every 5th data point as validation set.
  • Data in positions 0-2%, 20-22%, 40-42%, 60-62%, 80-82% as validation set.
  • Etc.

Iterating over ALL possible such permutations is known as “exhaustive cross validation.” We will not discuss this in our course.

34


35 of 86

Test Sets

Lecture 15, Data 100 Spring 2022

Cross Validation

  • The Holdout Method
  • K-Fold Cross Validation
  • Test Sets

Regularization

  • L2 Regularization (Ridge)
  • Scaling Data for Regularization
  • L1 Regularization (LASSO)

35

36 of 86

Providing an Estimate of Model Generalization to the World

Suppose we’re researchers building a state of the art regression model.

  • After months of work and after comparing billions of candidate models, we find the model with the best validation set loss.

Now we want to report this model out to the world so it can be compared to other models.

  • Our validation set loss is not an unbiased estimator of its performance!
  • Instead, we’ll run our model just one more time on a special test set, that we’ve never seen or used for any purpose whatsoever.

36

37 of 86

Why Are Validation Set Errors Biased?

Analogy:

  • Imagine we have a golf ball hitting competition. Whoever can hit the ball the farthest wins.
  • Suppose we have the best 10000000 golfers in the world play a tournament. There are probably many roughly equal players near the top.
  • When we’re done, we want to provide an unbiased estimate of our best golfer’s distance in yards.
  • Using the tournament results may be biased, as the winner may have gotten just a bit lucky (maybe they had favorable wind during their rounds).
  • Better unbiased estimate: Have the winner play one more trial and report their score.

Comments?

37

38 of 86

Test Sets

Test sets can be something that we generate ourselves. Or they can be a shared data set whose correct labels are withheld.

In real world machine learning competitions, competing teams share a common test set.

  • To avoid accidental or intentional overfitting, the correct predictions for the test set are never seen by the competitors.

38

39 of 86

Creating a Test Set Ourselves

We can do this easily in code. As before, we shuffle using scikit-learn, then split using NumPy.

This time we use np.split with two split points instead of one, which divides the data into a training set, a validation set, and a test set (see the sketch below).

  • Recall that a validation set is just another name for a development set.
  • Training set used to pick parameters.
  • Validation set used to pick hyperparameters (or pick between different models).
  • Test set used to provide an unbiased MSE at the end.
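A minimal sketch of the three-way split; the split points (25/5/5 out of 35 rows) and the DataFrame name “vehicle_data” are illustrative rather than the exact values from the demo:

    import numpy as np
    from sklearn.utils import shuffle

    shuffled = shuffle(vehicle_data, random_state=42)
    # Two split points instead of one: rows [0, 25) train, [25, 30) validation, [30, 35) test.
    train_set, dev_set, test_set = np.split(shuffled, [25, 30])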

39

40 of 86

Test Set Terminology in Real World Practice

Warning: The terms “test set” and “validation set” are sometimes used interchangeably.

  • You’ll see authors saying things like “then we used a test set to select hyperparameters”.
    • While this violates my personal definition of test set, it’s clear to me what they meant, namely: “we used a holdout set to select hyperparameters”.
    • This is a terminological confusion, not a procedural error! They didn’t do anything wrong.
      • Imagine they said “We used a bloop blop set”. Same thing, just a weird name.
    • The real error would be if they later claimed that the loss on this “test set” was unbiased. Since validation set error and the error on a truly unseen test set are typically very close, this terminology error is very minor.

In practice, you may not need a test set at all!

  • If all you need to do is pick the best model, and you don’t care about providing a numerical measure of model quality, you don’t need a test set.

40

41 of 86

Validation Sets and Test Sets in Real World Practice

Shared validation sets and test sets are used as standard benchmarks to compare ML algorithms.

  • Example: ImageNet is a dataset / competition used to compare different image classification algorithms (e.g. this image is of a “Dog”).
    • 1,281,167 training images. Images and correct label provided.
    • 50,000 validation images. Images and correct label provided.
    • 100,000 test images. Images provided, but no correct label.
  • When writing papers, researchers report their performance on the validation images.
    • This set is a “validation set” with respect to the entire global research community.
    • Research groups cannot report the test error because they cannot compute it!
  • When ImageNet was a competition, the test set was used to rank different algorithms.
    • Researchers provide their predictions for the test set to a central server.
    • Server (which knows the labels) reports back a test set score.
    • Best test set score wins.

Note: Since the competition uses the test set scores to compare image classification algorithms, the best test set score is no longer an unbiased estimate of the best algorithm’s performance.

41

42 of 86

Idealized Training, Validation, and Test Error

As we increase the complexity of our model:

  • Training error decreases.
  • Variance increases.

  • Typically, validation error decreases, then increases.
  • The test error is essentially the same thing as the validation error! The only difference is that we are much more restrictive about computing the test error.
    • Don’t get to see the whole curve!

42

43 of 86

L2 Regularization (Ridge)

Lecture 15, Data 100 Spring 2022

Cross Validation

  • The Holdout Method
  • K-Fold Cross Validation
  • Test Sets

Regularization

  • L2 Regularization (Ridge)
  • Scaling Data for Regularization
  • L1 Regularization (LASSO)

43

Note: I might try to re-record and adjust these slides. Live lecture fell short of what I wanted to do!

44 of 86

Earlier

We saw how we can select model complexity by choosing the hyperparameter that minimizes validation error. This validation error can be computed using the Holdout Method or K-Fold Cross Validation.

44

45 of 86

Earlier

For the example below, our hyperparameter was the polynomial degree.

  • Tweaking the “complexity” is simple: just increase or decrease the degree.

45

46 of 86

A More Complex Example

Suppose we have a dataset with 9 features.

  • We want to decide which of the 9 features to include in our linear regression.

46

47 of 86

Tweaking Complexity via Feature Selection

With 9 features, there are 2⁹ = 512 different models. One approach:

  • For each of the 2⁹ linear regression models, compute the validation MSE.
  • Pick the model that has the lowest validation MSE.

Runtime is exponential in the number of parameters!

47

48 of 86

Tweaking Complexity via Feature Selection

Alternate Idea: What if we use all of the features, but only a little bit?

  • Let’s see a simple example for a 2 feature model.
  • Will return to this 9 feature model later.

48

49 of 86

Recall: Gradient Descent

Imagine we have a two parameter model.

  • Optimal parameters are given by theta hat.
  • Gradient descent will find these parameters during training.

49

50 of 86

Constraining Gradient Descent

Suppose we arbitrarily decide that gradient descent can never land outside of the green ball.

50

51 of 86

Test Your Understanding

Suppose we arbitrarily decide that gradient descent can never land outside of the green ball.

  • Where will gradient descent terminate?

51

52 of 86

Constraining Gradient Descent

Gradient descent ends up at a point on the boundary of the green ball instead.

  • Different than our unconstrained solution.
  • Not optimal, but closer to origin!

52

53 of 86

Adjusting the Allowed Space

We can change the size of our arbitrary boundary.

53

54 of 86

Adjusting the Allowed Space

We can change the size of our arbitrary boundary.

54

55 of 86

Philosophical Question

How are “ball radius” and “complexity” related?

  • Bigger ball = more complex model?
  • Smaller ball = more complex model?

55

56 of 86

Philosophical Question: Your Answer

How are “ball radius” and “complexity” related?

  • Bigger ball = more complex model?
  • Smaller ball = more complex model?

56

57 of 86

Philosophical Question: My Answer

The ball radius is a complexity control parameter.

  • Smaller radius = less complex model.

57

58 of 86

Test Your Understanding

Let’s return to our 9 feature model from before (d = 9).

  • If we pick a very small ball radius, what kind of model will we have?

  1. A model that only returns zero.
  2. A constant model.
  3. Ordinary least squares.
  4. Something else.

58

59 of 86

Test Your Understanding Answer

Let’s return to our 9 feature model from before (d = 9).

  • If we pick a very small ball radius, what kind of model will we have?

  • A model that only returns zero.
  • A constant model.
  • Ordinary least squares.
  • Something else.

59

Answer: It depends!

60 of 86

Test Your Understanding Answer

If the ball is very tiny, our gradient descent is stuck near the origin.

  • If all parameters are zero including intercept, model always outputs zero.
  • If all parameters are zero except intercept, model is a constant model (returns mean of the observations).

60

61 of 86

Test Your Understanding Answer

Traditionally the “ball restriction” only applies to non-intercept terms.

  • θ0 is allowed to be any value, not stuck in a ball.
  • If all parameters are zero except intercept, model is a constant model (returns mean of the observations).

61

62 of 86

Test Your Understanding

Back to our 9 feature model from before (d = 9).

  • If we pick a very large ball radius, what kind of model will we have?

  • A model that only returns zero.
  • A constant model.
  • Ordinary least squares.
  • Something else.

62

63 of 86

Test Your Understanding Answer

Back to our 9 feature model from before (d = 9).

  • If we pick a very large ball radius, what kind of model will we have?

  • A model that only returns zero.
  • A constant model.
  • Ordinary least squares.
  • Something else.

63

64 of 86

Test Your Understanding Answer

  • For very large ball sizes, the restriction has no effect.
  • The ball includes the OLS solution!

64

65 of 86

Training and Validation Errors vs. Ball Size for Our 9D Model

For very small ball size:

  • Model behaves like a constant model. Can’t actually use our 9 features!
  • High training error, low variance, high validation error.

For very large ball size:

  • Model behaves like OLS.
  • If we have tons of features, this results in overfitting. Low training error, high variance, high validation error.

65

66 of 86

L2 Regularization

Constraining our model’s parameters to a ball around the origin is called L2 Regularization.

  • The smaller the ball, the simpler the model.

Ordinary least squares. Find thetas that minimize:

Ordinary least squares with L2 regularization. Find thetas that minimize:

66

Such that θ1 through θd live inside a ball of radius Q.

67 of 86

L2 Regularization

Constraining our model’s parameters to a ball around the origin is called L2 Regularization.

  • The smaller the ball, the simpler the model.

Ordinary least squares. Find thetas that minimize:

Ordinary least squares with L2 regularization. Find thetas that minimize:

67

such that

Note, intercept term not included!
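Written out in the course’s usual notation (n data points, d non-intercept features, ball “radius” Q), the two objectives are roughly:

    % Ordinary least squares:
    \hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \mathbb{X}_i^{\top} \theta \right)^2

    % OLS with L2 regularization (constrained form); theta_0 is not constrained:
    \hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \mathbb{X}_i^{\top} \theta \right)^2
        \quad \text{such that} \quad \sum_{j=1}^{d} \theta_j^2 \le Q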

68 of 86

L2 Regularized Least Squares in sklearn

We can run least squares with an L2 regularization term by using the “Ridge” class.

Coefficients we get back:

68

Note: sklearn’s “alpha” parameter is proportional to the inverse of the ball radius!

  • Large alpha means small ball.
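A sketch of the call, with an arbitrary large alpha and hypothetical design matrix X and response y:

    from sklearn.linear_model import Ridge

    ridge_model = Ridge(alpha=10000)   # large alpha = small ball = heavily shrunk coefficients
    ridge_model.fit(X, y)              # assumes X, y already exist
    print(ridge_model.coef_, ridge_model.intercept_)   # the intercept is not regularized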

69 of 86

L2 Regularized Least Squares in sklearn

We can run least squares with an L2 regularization term by using the “Ridge” class.

For a tiny alpha, the coefficients are larger:

69

Note: sklearn’s “alpha” parameter is proportional to the inverse of the ball radius!

  • Small alpha means large ball.

70 of 86

L2 Regularized Least Squares in sklearn

We can run least squares with an L2 regularization term by using the “Ridge” class. For a tiny alpha, the coefficients are also about the same as a standard OLS model’s coefficients!

70

Green ball includes the OLS solution!

71 of 86

Figure (from lab 8)

In Lab 8, you’ll run an experiment for different values of alpha. The resulting plot is shown below.

  • Note: Since alpha is the inverse of the ball radius, the complexity is higher on the left!

71

72 of 86

Two Questions

Let’s address two quick questions:

  • Why does sklearn use the word “Ridge”?
  • Why does sklearn use a hyperparameter which is the inverse of the ball radius?

72

73 of 86

Terminology Note

Why does sklearn use the word “Ridge”?

Because least squares with an L2 regularization term is also called “Ridge Regression”.

  • Term is historical. Doesn’t really matter.

73

74 of 86

Quick Detour into EECS127 (not super important but worth mentioning)

In 127, you’ll learn (through the magic of Lagrangian Duality) that the two problems below are equivalent:

Problem 1: Find thetas that minimize:

Problem 2: Find thetas that minimize:

74

such that

Intuitively, this extra term on the right penalizes large thetas.

The “objective function” that gradient descent is minimizing now has an extra term.
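In the notation from earlier (with the regularization strength α playing the role of the Lagrange multiplier, so each Q corresponds to some α), the two problems are roughly:

    % Problem 1 (constrained form):
    \min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \mathbb{X}_i^{\top} \theta \right)^2
        \quad \text{such that} \quad \sum_{j=1}^{d} \theta_j^2 \le Q

    % Problem 2 (penalized form):
    \min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \mathbb{X}_i^{\top} \theta \right)^2 + \alpha \sum_{j=1}^{d} \theta_j^2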

75 of 86

Mathematical Note

Ridge Regression has a closed form solution which we will not derive.

  • Note: The solution exists even if the feature matrix has collinearity between its columns.

75

Identity matrix
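For reference, the closed form looks like the following, where I is the identity matrix; the exact constant multiplying I (written here as nα) depends on whether the objective averages or sums the squared errors:

    \hat{\theta}_{\text{ridge}} = \left( \mathbb{X}^{\top} \mathbb{X} + n \alpha I \right)^{-1} \mathbb{X}^{\top} \mathbb{Y}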

76 of 86

Scaling Data for Regularization

Lecture 15, Data 100 Spring 2022

Cross Validation

  • The Holdout Method
  • K-Fold Cross Validation
  • Test Sets

Regularization

  • L2 Regularization (Ridge)
  • Scaling Data for Regularization
  • L1 Regularization (LASSO)

76

77 of 86

One Issue With Our Approach

Our data from before has features of quite different numerical scale!

  • Optimal theta for hp will probably be much further from origin than theta for weight^2.

77

Theta will tend to be smaller for weight^2 than other parameters

78 of 86

Coefficients from Earlier

78

79 of 86

Making Things Fair

Ideally, our data should all be on the same scale.

  • One approach: Standardize the data, i.e. replace everything with its Z-score.

No demo in lecture. You’ll do this in Lab 8 using the “StandardScaler” transformer (a sketch follows below).

  • Resulting model coefficients will be all on the same scale.
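A generic sketch of that workflow (a StandardScaler + Ridge pipeline, not the exact Lab 8 code):

    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # StandardScaler replaces each feature with its Z-score: (x - mean) / std.
    scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1))
    scaled_ridge.fit(X_train, y_train)               # assumes X_train, y_train already exist
    print(scaled_ridge.named_steps["ridge"].coef_)   # coefficients now on a comparable scale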

79

80 of 86

L1 Regularization (LASSO)

Lecture 15, Data 100 Spring 2022

Cross Validation

  • The Holdout Method
  • K-Fold Cross Validation
  • Test Sets

Regularization

  • L2 Regularization (Ridge)
  • Scaling Data for Regularization
  • L1 Regularization (LASSO)

80

81 of 86

L1 Regularization

We can also use other shapes.

  • Example: the L1 ball of “radius” Q, i.e. the set of thetas whose absolute values sum to at most Q. In 2D this is a square rotated 45° (a “diamond”).

81

82 of 86

L1 Regularization in Equation Form

Constraining the thetas to this diamond-shaped region is known as L1 regularization. It can be expressed mathematically in the two equivalent forms below:

Problem 1: Find thetas that minimize:

Problem 2: Find thetas that minimize:

82

such that
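In the same notation as before, with the constraint applying only to the non-intercept thetas:

    % Problem 1 (constrained form):
    \min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \mathbb{X}_i^{\top} \theta \right)^2
        \quad \text{such that} \quad \sum_{j=1}^{d} \lvert \theta_j \rvert \le Q

    % Problem 2 (penalized form):
    \min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \mathbb{X}_i^{\top} \theta \right)^2 + \alpha \sum_{j=1}^{d} \lvert \theta_j \rvert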

83 of 86

L1 Regularized OLS in sklearn

In sklearn, we use the Lasso module.

  • Note: Performing OLS with L1 regularization is also called LASSO regression.
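A sketch of the call, with an illustrative alpha and hypothetical X and y; on real data, several of the printed coefficients typically come out exactly zero:

    from sklearn.linear_model import Lasso

    lasso_model = Lasso(alpha=1)
    lasso_model.fit(X, y)        # assumes X, y already exist
    print(lasso_model.coef_)     # LASSO tends to zero out some coefficients entirely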

83

84 of 86

LASSO and “Feature Selection”

The optimal parameters for a LASSO model tend to include a lot of zeroes! In other words, LASSO effectively selects only a subset of the features.

Intuitive (??) reason:

  • Imagine expanding a 3D version of this constraint region (an octahedron) until it intersects a balloon. It is more likely to intersect at a corner or edge than at a face, and the corners sit on the axes, where some thetas are exactly zero. (I’m pretty sure this is true? Convince me!)

84

85 of 86

Summary of Regression Methods

Our regression models are summarized below.

  • The “Objective” column gives the function that our gradient descent optimizer minimizes.
  • Note that this table uses lambda instead of alpha for regularization strength. Both are common.

85

Name             | Model   | Loss         | Reg. | Objective                              | Solution
OLS              | ŷ = Xθ  | Squared loss | None | (1/n) Σᵢ (yᵢ − Xᵢᵀθ)²                  | Closed form: θ̂ = (XᵀX)⁻¹XᵀY, when XᵀX is invertible
Ridge Regression | ŷ = Xθ  | Squared loss | L2   | (1/n) Σᵢ (yᵢ − Xᵢᵀθ)² + λ Σⱼ θⱼ²        | Closed form (see the earlier slide)
LASSO            | ŷ = Xθ  | Squared loss | L1   | (1/n) Σᵢ (yᵢ − Xᵢᵀθ)² + λ Σⱼ |θⱼ|       | No closed form

86 of 86

Cross Validation, Regularization

Content credit: Josh Hug, Joseph Gonzalez, Suraj Rampure

86

Lecture 15