1 of 26

Cross-Validation, Regularization

(Reading: 15.3, Ch 16)

(Slides adapted from Sandrine Dudoit and Joey Gonzalez)

UC Berkeley Data 100 Summer 2019

Sam Lau

Learning goals:

  • Learn how to perform K-fold CV and its benefits over a held-out validation set.
  • Understand L2 and L1 regularization and how to use regularization to manage the bias-variance tradeoff.

2 of 26

Announcements

  • HW5 out, due tomorrow
  • HW6 out tomorrow, due Tuesday
  • Yesterday's screencast froze, but the audio is still there
    • If you leave a comment on the YT video with the slide numbers and times, I can update the description, e.g.
    • 00:00 - Slide 1, 01:30 - Slide 2, etc.

3 of 26

Cross-Validation

4 of 26

Simple Validation

The sample is split into three sets, each with its own error and purpose:

  • Training Set → Training Error: used to fit a model.
  • Validation Set → Validation Error: used to choose a model.
  • Test Set → Test Error: used to report final accuracy.

5 of 26

Assessing Model Risk

  • Training Set → Training Error: used to fit a model. Minimizes empirical risk.
  • Validation Set → Validation Error: used to choose a model. Estimates population risk.
  • Test Set → Test Error: used to report final accuracy. A “clean” estimate of population risk.

6 of 26

Model Selection

  • Given models: f1, f2, f3, …
    • E.g. f1 is linear, f2 is a degree-2 polynomial, f3 is linear with fewer features, etc.
  • Fit θ for each model by minimizing the training error.
  • Compute the validation error for each model.
  • Pick the model with the lowest validation error.
    • This is model selection.
  • Now, report the test error of the chosen model.

7 of 26

K-Fold CV

  • Intuition: The validation error will not always be close to the true risk. (Sometimes we are just unlucky!)
    • To address this, compute multiple validation errors for each model.
  • K-Fold cross-validation (see the sketch after this list):
    • Set aside a test set from the sample.
    • Split the rest of the sample into K equal-sized partitions.
    • Use K − 1 partitions to train and the remaining partition as the validation set.
    • Repeat K times; the average of the K errors is the validation error.
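A minimal sketch of this procedure in Python with scikit-learn, assuming a linear model and synthetic data (the dataset, seed, and K = 5 here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data standing in for the sample (a test set would already be held out).
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=120)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_errors = []
for train_idx, val_idx in kf.split(X):
    # Train on K - 1 partitions, validate on the remaining one.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

# The validation error is the average over the K folds.
print(np.mean(fold_errors))
```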

8 of 26

3-Fold CV

[Diagram] The sample is split into a test set and the rest of the sample. The rest of the sample is divided into three partitions, and each fold uses a different partition as the validation set:

  • Fold 1: Validation Set | Training Set | Training Set
  • Fold 2: Training Set | Validation Set | Training Set
  • Fold 3: Training Set | Training Set | Validation Set

We repeat this entire process for each model we want to try out.

9 of 26

K-Fold CV Analysis

  • K usually chosen to be 5 or 10.
  • Advantages:
    • Makes use of more of the data for training (data is often scarce)
    • Repeated error estimates mitigate the variance from any single split
    • Can construct confidence intervals for the validation error
  • Disadvantages:
    • More computationally expensive

(Demo)
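The original demo is not reproduced here; a hedged sketch of the kind of computation it might show, using scikit-learn's cross_val_score on illustrative data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=120)

# scikit-learn negates MSE so that higher scores are better; flip the sign back.
errors = -cross_val_score(LinearRegression(), X, y,
                          cv=5, scoring="neg_mean_squared_error")

# Mean validation error with a rough +/- 2 SD interval across folds.
print(f"{errors.mean():.4f} +/- {2 * errors.std():.4f}")
```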

10 of 26

Estimating Risk, Bias, and Variance

  • CV lets us see bias and variance!
  • Training errors show model bias.
  • Validation errors estimate risk; confidence intervals across folds show model variance.

11 of 26

Break!

Fill out Attendance:

http://bit.ly/at-d100

12 of 26

Regularization

13 of 26

Weighty Issues

Large model weights create complicated models.

Idea: Penalize large weights to produce simpler models.

14 of 26

Regularization

  • Regularization (aka shrinkage) adds a penalty on the model weights to the loss function.
  • MSE loss with L2 regularization:

    L(θ) = (1/n) Σᵢ (yᵢ − fθ(xᵢ))² + λ Σⱼ θⱼ²

    • The first term is the same ol’ loss as usual; the second is the penalty for the θ values.
    • λ: regularization parameter (non-negative)
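As a concrete sketch, the same loss in Python for a linear model fθ(x) = x·θ (the function name and data shapes are illustrative):

```python
import numpy as np

def l2_regularized_mse(theta, X, y, lam):
    """MSE loss plus an L2 penalty on the weights; lam is the (non-negative) lambda."""
    residuals = y - X @ theta            # y_i - f_theta(x_i) for a linear model
    mse = np.mean(residuals ** 2)        # same ol' loss as usual
    penalty = lam * np.sum(theta ** 2)   # penalty for the theta values
    return mse + penalty
```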

15 of 26

Ridge and Lasso Regression

  • Ridge regression: linear model with L2 regularization
    • Penalty: λ Σⱼ θⱼ² (the squared L2 norm of θ)
  • Lasso regression: linear model with L1 regularization
    • Penalty: λ Σⱼ |θⱼ| (the L1 norm of θ)

(Demo)
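A minimal sketch of fitting both in scikit-learn; note that scikit-learn names the regularization parameter alpha rather than λ, and the data and alpha values here are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty on the weights
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty on the weights

print(ridge.coef_)  # typically all non-zero, but shrunk toward 0
print(lasso.coef_)  # some weights may be exactly 0
```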

16 of 26

Regularization Parameter

  • λ is the regularization parameter.
  • Higher values penalize model weights more.
  • Discuss:
    • What happens when λ = 0?
    • What happens when λ = ∞?
    • Does this change between L2 and L1 regularization?

[Plots: example fits as λ increases, one panel for L2 regularization and one for L1]

17 of 26

What happens when...

  • λ = 0?
    • No regularization, back to linear model
  • λ = ∞?
    • Flat line, all model weights = 0
  • Does this change between L2 and L1 regularization?
    • No
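A quick check of the two extremes with scikit-learn's Ridge (alpha plays the role of λ; the extreme values and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

# alpha near 0: essentially the plain linear model.
print(Ridge(alpha=1e-8).fit(X, y).coef_)
print(LinearRegression().fit(X, y).coef_)

# Huge alpha: weights driven toward 0, so predictions approach a flat line.
print(Ridge(alpha=1e8).fit(X, y).coef_)
```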

18 of 26

Don’t regularize the bias

  • Notice that we don’t regularize the bias term!
  • Discuss: why not?
    • The bias term doesn’t add complexity to the model; it only shifts predictions up or down.

19 of 26

Normalize Data Before Using Regularization

  • Before using regularization, normalize the data
    • Subtract the mean and scale the data to lie between -1 and 1.
  • Discuss: what happens if we don’t do this?
    • Features recorded on small scales need large weights to have the same effect, so they get an artificially large penalty. (A sketch follows below.)
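A hedged sketch with scikit-learn, using StandardScaler (which subtracts the mean and scales to unit variance, a close stand-in for the -1 to 1 scaling described above); the feature scales here are contrived for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Features on wildly different scales (illustrative).
X = rng.normal(size=(100, 3)) * np.array([1.0, 1000.0, 0.001])
y = X @ np.array([2.0, 0.001, 500.0]) + rng.normal(scale=0.3, size=100)

# Normalize each feature before the penalty is applied.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)
```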

20 of 26

Exercise to take home:

  • Prove that the stochastic gradient descent update rule for ridge regression is the update sketched below.
  • (Lasso is a bit trickier but also doable.)
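For reference, a hedged reconstruction of the update one would expect to derive, assuming a linear model fθ(x) = xᵀθ, MSE loss on a single sampled point (xᵢ, yᵢ), and learning rate α (the slide's original equation is not preserved, so this is an assumption):

```latex
\theta^{(t+1)} = \theta^{(t)} - \alpha \Bigl( -2\bigl(y_i - x_i^\top \theta^{(t)}\bigr)\, x_i + 2\lambda\, \theta^{(t)} \Bigr)
```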

21 of 26

Why two kinds of regularization?

  • Intuitive, hand-wavy explanation:
  • L2 regularization typically yields all non-zero weights.
    • Makes sense when we think many small factors contribute to the outcome.
  • L1 regularization will set some model weights = 0, depending on how big λ is (see the sketch after this list).
    • L1 regularization lets us perform feature selection.
    • Makes sense when we think a few major factors contribute to the outcome.
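A small illustrative check on synthetic data where only two of six factors actually matter (all values assumed):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
true_theta = np.array([4.0, 0.0, 0.0, -3.0, 0.0, 0.0])  # two major factors
y = X @ true_theta + rng.normal(scale=0.5, size=200)

print(Ridge(alpha=1.0).fit(X, y).coef_)  # everything non-zero
print(Lasso(alpha=0.5).fit(X, y).coef_)  # irrelevant weights pushed to exactly 0
```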

22 of 26

A more sophisticated explanation

Suppose we have a linear model with two parameters and no intercept term.

As we tweak the two parameters, loss changes.

Without regularization, we just pick θ̂.

[Figure: contours of the loss over (θ1, θ2), with the unregularized minimizer θ̂ marked]

23 of 26

A more sophisticated explanation

Regularization balances loss with the regularization penalty.

For L2 regularization, we have circular contours for the penalty. Why?

[Figure: loss contours over (θ1, θ2) together with circular L2 penalty contours; both θ̂ and θ̂ with L2 regularization are marked]

24 of 26

A more sophisticated explanation

For L1 regularization, we have diamond-shaped contours for the penalty. Why?

Notice that this sets one parameter = 0!

This idea extends to multiple dimensions.

[Figure: loss contours over (θ1, θ2) together with diamond-shaped L1 penalty contours; both θ̂ and θ̂ with L1 regularization are marked, and the L1 optimum lands on an axis]

25 of 26

A tuning knob for bias-variance

  • Regularization gives us yet another way to manage the bias-variance tradeoff.
    • Increase λ = more bias, less variance
    • Decrease λ = less bias, more variance
  • How do we pick λ?
    • Cross-validation!
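One hedged way to do this search with scikit-learn's GridSearchCV (the alpha grid and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 4))
y = X @ np.array([3.0, 0.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=150)

# 5-fold CV over a grid of candidate regularization strengths.
search = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 13)},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```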

26 of 26

Summary

  • K-Fold cross-validation lets us estimate model bias, model variance, and overall risk.
    • We use CV to perform model and feature selection.
  • Regularization gives us a way to add complexity to our models while avoiding overfitting.
    • We use CV to tune the regularization amount.