
2 of 89

Cross Validation, Regularization

Different methods for ensuring the generalizability of our models to unseen data.

Data 100/Data 200, Spring 2023 @ UC Berkeley

Narges Norouzi and Lisa Yan

Content credit: Acknowledgments

2

Lecture 15

3 of 89

Plan for Next Three Lectures: Model Selection

3

Model Selection Basics: Cross Validation; Regularization (today)

Probability I: Random Variables; Estimators

Probability II: Bias and Variance; Inference/Multicollinearity

[Diagram: the data science lifecycle: Question & Problem Formulation → Data Acquisition → Exploratory Data Analysis → Prediction and Inference → Reports, Decisions, and Solutions.]

4 of 89

Today’s Roadmap

Lecture 15, Data 100 Spring 2023

Cross Validation

  • The Holdout Method
  • K-Fold Cross Validation
  • Test Sets

Regularization

  • L2 Regularization (Ridge)
  • Scaling Data for Regularization
  • L1 Regularization (LASSO)

4


5 of 89

Review: Error vs. Complexity

As we increase the complexity of our model:

  • Training error decreases.
  • Variance increases.

5


6 of 89

Review: Dataset

Today we will use the mpg dataset from the seaborn library.

The dataset has 392 rows and 9 columns. Our task is to use some of the columns, and transformations of them, to predict the value of the mpg column.
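For reference, the dataset can be loaded as follows. This is a minimal sketch assuming we drop rows with missing values (seaborn's raw mpg table has 398 rows, 6 of which are missing horsepower).

import seaborn as sns

# Load the mpg dataset bundled with seaborn, then drop incomplete rows.
vehicle_data = sns.load_dataset("mpg").dropna()

vehicle_data.shape   # (392, 9)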

6


7 of 89

Review: Collecting More Data to Detect Overfitting

Suppose we use 35 samples to fit the regression models and then collect 9 new (orange) data points. We can compute the MSE of our original models on the new orange points without refitting.

7

[Plots: the models fit on the original 35 data points, shown alongside the 9 new data points. Which looks best?]

8 of 89

Review: Collecting More Data to Detect Overfitting

Which model do you like best? And why?

8


9 of 89

Review: Collecting More Data to Detect Overfitting

The order 2 model seems best to me.

  • Performs best on data that has not yet been seen.

9


10 of 89

Review: Collecting More Data to Detect Overfitting

Suppose we have 7 models and don’t know which is best.

  • Can’t necessarily trust the training error. We may have overfit!

We could wait for more data and see which of our 7 models does best on the new points.

  • Unfortunately, that means we need to wait for more data. May be very expensive or time-consuming.
  • “Will see an alternate approach next week.”
    • As promised! Let’s do it.

10


11 of 89

Idea 1: The Holdout Method

The simplest approach for avoiding overfitting is to keep some of our data secret from ourselves.

Example:

  • Previous approach: We fit 7 models on all 35 of the available data points. Then waited for 9 new data points to decide which is best.
  • Holdout Method: We train our models on only 25 of the 35 available data points. Then we evaluate the models’ performance on the remaining 10 data points.
    • Data used to train is called the “training set”.
    • Held out data is often called the “validation set” or “development set” or “dev set”. These terms are all synonymous and used by different authors.

11


12 of 89

Holdout Set Demo

The code below splits our data into two sets of size 25 and 10.

12

Used for Training

Used for Evaluation

import numpy as np
from sklearn.utils import shuffle

# Shuffle the 35 rows, then split: the first 25 for training, the remaining 10 for validation.
training_set, validation_set = np.split(shuffle(vehicle_data_sample_35), [25])


13 of 89

Reflection: Shuffling

Question: Why did I shuffle first?

13

from sklearn.utils import shuffle

training_set, validation_set = np.split(shuffle(vehicle_data_sample_35), [25])


14 of 89

Why did we shuffle the data before selecting the training and validation sets?


15 of 89

Reflection: Shuffling

Question: Why did I shuffle first?

I’m using a large contiguous block of data as my validation set.

  • If the data is sorted by something, e.g. vehicle MPG, then my model will perform poorly on this unseen data.
    • Model will have never seen a high MPG vehicle.

Shuffling prevents this problem.

  • Mathematically equivalent alternative: pick the 10 validation samples at random (see the sketch at the end of this slide).

15

from sklearn.utils import shuffle

training_set, validation_set = np.split(shuffle(vehicle_data_sample_35), [25])
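That random-sampling alternative can be sketched with scikit-learn's train_test_split, which shuffles and samples in one call (the random_state value here is an arbitrary choice):

from sklearn.model_selection import train_test_split

# Randomly hold out 10 of the 35 rows as the validation set.
training_set, validation_set = train_test_split(
    vehicle_data_sample_35, test_size=10, random_state=42
)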


16 of 89

Hold Out Method Demo Step 1: Generating Models of Various Orders

First we train our 7 models on the training set of 25 points, yielding the MSEs shown below. As before, training MSE decreases monotonically with model order.
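A sketch of this step. It assumes, as in the demo, that each model is a polynomial in a single horsepower feature named "horsepower" (the exact column name in the lecture notebook may differ):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

models = []
training_mses = []
for degree in range(7):   # polynomial orders 0 through 6
    # Expand the feature to the given degree, then fit ordinary least squares.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(training_set[["horsepower"]], training_set["mpg"])
    models.append(model)
    training_mses.append(mean_squared_error(
        training_set["mpg"], model.predict(training_set[["horsepower"]])))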

16


17 of 89

Hold Out Method Demo Step 1: Generating Models of Various Orders (Visualization)

Below, we show the order 0, 1, 2, and 6 models trained on our 25 training points.

17

Note: Our degree 6 model looks different than before. No surprise since variance is high and we’re using a different data set.


18 of 89

Hold Out Method Demo Step 2: Evaluating on the Validation Set (a.k.a. Dev Set)

Then we compute the MSE of all 7 models on our 10 validation points (in orange), without refitting on these orange data points. The models are fit only on the 25 training points.

18

Evaluation: Validation set MSE is best for degree = 2!
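Continuing the sketch from step 1: we score the already-fitted models on the 10 held-out points, without calling fit again.

# Validation MSE for each model; the models were fit only on the training set.
validation_mses = [
    mean_squared_error(validation_set["mpg"],
                       model.predict(validation_set[["horsepower"]]))
    for model in models
]

best_degree = validation_mses.index(min(validation_mses))   # degree 2 in lecture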


19 of 89

Plotting Training and Validation MSE

19


20 of 89

Idealized Picture of Training and Validation Error

As we increase the complexity of our model:

  • Training error decreases.
  • Variance increases.

  • Typically, error on validation data decreases, then increases.

We pick the model complexity that minimizes validation set error.

20


21 of 89

Hyperparameter: Terminology

In machine learning, a hyperparameter is a value that controls the learning process itself.

  • For our example today, we built seven models, each of which had a hyperparameter called degree or k that controlled the order of our polynomial.

We use:

  • The training set to select parameters.
  • The validation set (a.k.a. development set) (a.k.a. cross validation set) to select hyperparameters, or more generally, between different competing models.

21


22 of 89

K-Fold Cross Validation

Lecture 15, Data 100 Spring 2023

Cross Validation

  • The Holdout Method
  • K-Fold Cross Validation
  • Test Sets

Regularization

  • L2 Regularization (Ridge)
  • Scaling Data for Regularization
  • L1 Regularization (LASSO)

22


23 of 89

Another View of The Holdout Method

To determine the quality of a particular hyperparameter:

  • Train model on ONLY the training set. Quality is model’s error on ONLY the validation set.

Example: imagine we are trying to pick between three values of a hyperparameter 𝛼.

23

Best! Use this one.


24 of 89

Another View of The Holdout Method

In the Holdout Method, we set aside the validation set at the beginning, and our choice is fixed.

  • Train model on ONLY the training set. Quality is model’s error on ONLY the validation set.

Example below where the last 20% is used as the validation set.

24


25 of 89

Thought Experiment

If we decided (arbitrarily) to use non-overlapping contiguous chunks of 20% of the data, there are 5 possible “chunks” of data we could use as our validation set, as shown below.

25

Use first 20% and last 60% to train, remaining 20% as validation set.


26 of 89

Thought Experiment

If we decided (arbitrarily) to use non-overlapping contiguous chunks of 20% of the data, there are 5 possible “chunks” of data we could use as our validation set, as shown below.

  • The common term for these chunks is a “fold”.
    • For example, for chunks of size 20%, we have 5 folds.

26

Use folds 1, 3, 4, and 5 to train, and use fold 2 as validation set.


27 of 89

K-Fold Cross Validation

In the k-fold cross-validation approach, we split our data into k equally-sized groups (often called folds).

Given k folds, to determine the quality of a particular hyperparameter:

  • Pick a fold, which we’ll call the validation fold. Train model on all but this fold. Compute error on the validation fold.
  • Repeat the step above for all k possible choices of validation fold.
  • Quality is the average of the k validation fold errors (see the code sketch at the end of this slide).

Example for k = 5:

27

Use folds 1, 3, 4, and 5 to train, and use fold 2 as validation set.
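A minimal sketch of this procedure with scikit-learn's KFold splitter. It assumes numpy arrays X and y, and uses Ridge (whose hyperparameter alpha appears in the demo on the next slides) as the model being scored:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cv_quality(alpha, X, y, k=5):
    # Average validation-fold MSE for the hyperparameter value alpha.
    fold_errors = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=42).split(X):
        model = Ridge(alpha=alpha).fit(X[train_idx], y[train_idx])   # train on the other k-1 folds
        fold_errors.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
    return np.mean(fold_errors)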


28 of 89

5-Fold Cross Validation Demo

Given k folds, to determine the quality of a particular hyperparameter, e.g. 𝛼 = 0.1:

  • Pick a fold, which we’ll call the validation fold. Train model on all but this fold. Compute error on the validation fold.
  • Repeat the step above for all k possible choices of validation fold.
  • Quality is the average of the k validation fold errors.

28


29 of 89

5-Fold Cross Validation Demo

Given k folds, to determine the quality of a particular hyperparameter, e.g. alpha = 0.1:

  • Pick a fold, which we’ll call the validation fold. Train model on all but this fold. Compute error on the validation fold.
  • Repeat the step above for all k possible choices of validation fold.
  • Quality is the average of the k validation fold errors.
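The same quality score can be computed in one call with cross_val_score, sketched here for alpha = 0.1 (again assuming a feature matrix X and target y; sklearn reports the negative MSE, so we flip the sign):

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Five validation-fold MSEs for alpha = 0.1, one per fold.
fold_scores = cross_val_score(Ridge(alpha=0.1), X, y,
                              cv=5, scoring="neg_mean_squared_error")
quality = -fold_scores.mean()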

29


30 of 89

Test Your Understanding: How Many MSEs?

Suppose we pick k = 3 and we have 4 possible hyperparameter values 𝛼=[0.01, 0.1, 1, 10].

  • How many total MSE values will we compute to get the quality of 𝛼=10?
  • How many total MSE values will we compute to find the best 𝛼?

30


31 of 89

Suppose we pick k = 3 and we have 4 possible hyperparameter values alpha=[0.01, 0.1, 1, 10]. How many total MSE values will we compute to get the quality of alpha=10?


32 of 89

Test Your Understanding: How Many MSEs?

Suppose we pick k = 3 and we have 4 possible hyperparameter values 𝛼=[0.01, 0.1, 1, 10].

  • How many total MSE values will we compute to get the quality of 𝛼=10?
  • How many total MSE values will we compute to find the best 𝛼?

32

  • Quality of 𝛼 = 10: 3 MSE values (one per fold).
  • Finding the best 𝛼: 4 candidates × 3 folds = 12 MSE values in total.

[Figure: the validation-fold MSEs for 𝛼 = 0.1 and 𝛼 = 10.]

33 of 89

Test Your Understanding: Selecting Alpha

Which 𝛼 should we pick?

What fold (or folds) should we use as our training set for computing our final model parameters 𝜃?

33

[Figure: the validation-fold MSEs for 𝛼 = 0.1 and 𝛼 = 10 from the previous slide.]

34 of 89

Which alpha should we pick?


35 of 89

Test Your Understanding: Selecting Alpha

Which 𝛼 should we pick? 0.1

What fold (or folds) should we use as our training set for computing our final model parameters 𝜃?

  • There’s no reason to prefer any fold over any other. In practice, it is best to train the final model on all of the data, i.e. use all 3 folds.
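In code this is just one final fit on everything; a sketch, assuming Ridge with the chosen 𝛼 and the full feature matrix X and target y:

from sklearn.linear_model import Ridge

# Cross validation chose alpha = 0.1; refit on all of the data (all 3 folds).
final_model = Ridge(alpha=0.1).fit(X, y)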

35


36 of 89

Picking K

Typical choices of k are 5, 10, and N, where N is the amount of data.

  • k=N is also known as “leave one out cross validation”, and will typically give you the best results.
    • In this approach, each validation set is only one point.
    • Every point gets a chance to get used as the validation set.
  • k=N is also very expensive, requiring you to fit a huge number of models.

Ultimately, choosing k is a tradeoff between the quality of the estimate and computation time.
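Leave-one-out cross validation has a dedicated splitter in scikit-learn; a sketch under the same X, y assumptions (it fits one model per data point, hence the expense):

from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Each "fold" is a single held-out observation.
loo_scores = cross_val_score(Ridge(alpha=0.1), X, y,
                             cv=LeaveOneOut(), scoring="neg_mean_squared_error")
loo_mse = -loo_scores.mean()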

36


37 of 89

Cross Validation Summary

When selecting between models, we want to pick the one that we believe would generalize best on unseen data. Generalization is estimated with a “cross validation score”*.

  • When selecting between models, keep the model with the best score.

Two techniques to compute a “cross validation score”:

  • The Holdout Method: Break data into a separate training set and validation set.
    • Use training set to fit parameters (thetas) for the model.
    • Use validation set to score the model.
    • Also called “Simple Cross Validation” in some sources.
  • k-Fold Cross Validation: Break data into k contiguous non-overlapping “folds”.
    • Perform k rounds of Simple Cross Validation, except:
      • Each fold gets to be the validation set exactly once.
      • The final score of a model is the average validation score across the k trials.
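Putting this workflow into practice, scikit-learn's GridSearchCV runs k-fold cross validation for every candidate hyperparameter value and keeps the best one. A sketch, again assuming a feature matrix X and target y:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(Ridge(),
                      param_grid={"alpha": [0.01, 0.1, 1, 10]},
                      cv=3,                                   # 3 folds, as in the earlier example
                      scoring="neg_mean_squared_error")
search.fit(X, y)            # 4 alphas x 3 folds = 12 validation MSEs
search.best_params_         # the winning alpha
search.best_estimator_      # refit on all of the data by default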

37

*Equivalently, I could have said “cross validation loss” instead of “cross validation score”.


38 of 89

Test Sets

Lecture 15, Data 100 Spring 2023

Cross Validation

  • The Holdout Method
  • K-Fold Cross Validation
  • Test Sets

Regularization

  • L2 Regularization (Ridge)
  • Scaling Data for Regularization
  • L1 Regularization (LASSO)

38


39 of 89

Providing an Estimate of Model Generalization to the World

Suppose we’re researchers building a state-of-the-art regression model.

  • After months of work and after comparing billions of candidate models, we find the model with the best validation set loss.

Now we want to report this model out to the world so it can be compared to other models.

  • Our validation set loss is not an unbiased estimator of its performance!
  • Instead, we’ll run our model just one more time on a special test set, that we’ve never seen or used for any purpose whatsoever.

39


40 of 89

Why Are Validation Set Errors Biased?

Analogy:

  • Imagine we have a golf ball hitting competition. Whoever can hit the ball the farthest wins.
  • Suppose we have the best 10000000 golfers in the world play a tournament. There are probably many roughly equal players near the top.
  • When we’re done, we want to provide an unbiased estimate of our best golfer’s distance in yards.
  • Using the tournament results may be biased, as the winner may have gotten just a bit lucky (maybe they had favorable wind during their rounds).
  • Better unbiased estimate: Have the winner play one more trial and report their score.

40


41 of 89

Test Sets

Test sets can be something that we generate ourselves, or they can be a common dataset whose correct labels are withheld.

In real world machine learning competitions, competing teams share a common test set.

  • To avoid accidental or intentional overfitting, the correct predictions for the test set are never seen by the competitors.

41


42 of 89

Creating a Test Set Ourselves

We can do this easily in code. As before, we shuffle using scikit-learn and then split using numpy.

This time we pass np.split two indices instead of one. For example, the code below splits the data into a training, validation, and test set.

  • Recall that a validation set is just another name for a development set.
  • Training set used to pick parameters.
  • Validation set used to pick hyperparameters (or pick between different models).
  • Test set used to provide an unbiased MSE at the end.

42

# Splitting the data into training, validation, and test sets
import numpy as np
from sklearn.utils import shuffle

# 25 rows for training, 5 for validation, and 5 for the test set.
train_set, val_set, test_set = np.split(shuffle(vehicle_data_sample_35), [25, 30])


43 of 89

Test Set Terminology in Real World Practice

Warning: The terms “test set” and “validation set” are sometimes used interchangeably.

  • You’ll see authors saying things like “then we used a test set to select hyperparameters”.
    • While this violates my personal definition of test set, it’s clear to me what they meant, namely: “we used a holdout set to select hyperparameters”.
    • This is a terminological confusion, not a procedural error! They didn’t do anything wrong.
      • Imagine they said “We used a bloop blop set”. Same thing, just a weird name.
    • The error would be if they later claimed that the loss on this set was unbiased. Since “validation set” error and truly unseen “test set” error are typically very close, this terminology slip is very minor.

In practice, you may not need a test set at all!

  • If all you need to do is pick the best model, and you don’t care about providing a numerical measure of model quality, you don’t need a test set.

43


44 of 89

Validation Sets and Test Sets in Real World Practice

Standard validation sets and test sets are used as standard benchmarks to compare ML algorithms.

  • Example: ImageNet is a dataset / competition used to compare different image classification and localization algorithms on 1000 object classes (e.g. “cat”).
    • 1,281,167 training images. Images and correct label provided.
    • 50,000 validation images. Images and correct label provided.
    • 100,000 test images. Images provided, but no correct label.
  • When writing papers, researchers report their performance on the validation images.
    • This set is a “validation set” with respect to the entire global research community.
    • Research groups cannot report the test error because they cannot compute it!
  • When ImageNet was a competition, the test set was used to rank different algorithms.
    • Researchers provide their predictions for the test set to a central server.
    • Server (which knows the labels) reports back a test set score.
    • Best test set score wins.

Note: Since the competition uses the test set scores to compare image classification algorithms, the best test set score is no longer an unbiased estimate of the best algorithm’s performance.

44


45 of 89

Idealized Training, Validation, and Test Error

As we increase the complexity of our model:

  • Training error decreases.
  • Variance increases.

  • Typically, validation error decreases, then increases.
  • The test error is essentially the same thing as the validation error! The only difference is that we are much more restrictive about computing the test error.
    • Don’t get to see the whole curve!

45


46 of 89

L2 Regularization (Ridge)

Lecture 15, Data 100 Spring 2023

Cross Validation

  • The Holdout Method
  • K-Fold Cross Validation
  • Test Sets

Regularization

  • L2 Regularization (Ridge)
  • Scaling Data for Regularization
  • L1 Regularization (LASSO)

46


47 of 89

Earlier

We saw how we can select model complexity by choosing the hyperparameter that minimizes validation error. This validation error can be computed using the Holdout Method or K-Fold Cross Validation.

47


48 of 89

Earlier

For the example below, our hyperparameter was the polynomial degree.

  • Tweaking the “complexity” is simple, just increase or decrease the degree.

48


49 of 89

A More Complex Example

Suppose we have a dataset with 9 features.

  • We want to decide which of the 9 features to include in our linear regression.

49

vehicle_data_with_squared_features


50 of 89

Tweaking Complexity via Feature Selection

With 9 features, there are 2⁹ = 512 different models. One approach:

  • For each of the 2⁹ linear regression models, compute the validation MSE.
  • Pick the model that has the lowest validation MSE.

Runtime is exponential in the number of parameters!

50


51 of 89

Tweaking Complexity via Feature Selection

Alternate Idea: What if we use all of the features, but only a little bit?

  • Let’s see a simple example for a 2 feature model.
  • Will return to this 9 feature model later.

51


52 of 89

Recall: Gradient Descent

Imagine we have a two-parameter model.

  • Optimal parameters are given by the minimizer of the loss surface, θ̂ = (θ̂1, θ̂2).
  • Gradient descent will find these parameters during training.

52


53 of 89

Constraining Gradient Descent

We can decide that gradient descent can never land outside of the green ball (to force small parameters).

53

Idea for reducing model complexity: What if we use all of the features, but only a little bit?


54 of 89

Test Your Understanding

We can decide that gradient descent can never land outside of the green ball (to force small parameters).

  • Where will gradient descent terminate?

54


55 of 89

Constraining Gradient Descent

Gradient descent ends up at the constrained optimum instead.

  • Different from our unconstrained solution.
  • Not optimal, but closer to the origin!

55


56 of 89

Adjusting the Allowed Space

We can change the size of our arbitrary boundary.

56


57 of 89

Adjusting the Allowed Space

We can change the size of our arbitrary boundary.

57


58 of 89

Philosophical Question

How are “ball radius” and “complexity” related?

58


59 of 89

Philosophical Question: Your Answer

How are “ball radius” and “complexity” related?

  • Bigger ball = more complex model?
  • Smaller ball = more complex model?

59


60 of 89

Philosophical Question: My Answer

The ball radius is a complexity control parameter.

  • Smaller radius = less complex model.

60


61 of 89

Test Your Understanding

Let’s return to our 9 feature model from before (d = 9).

  • If we pick a very small ball radius, what kind of model will we have?

  1. A model that only returns zero.
  2. A constant model.
  3. Ordinary least squares.
  4. Something else.

61


62 of 89

Based on the concept of reducing the complexity of the model by constraining the parameters to be within a ball centered at the origin, what kind of model will we have if we pick a very small radius?


63 of 89

Test Your Understanding Answer

Let’s return to our 9 feature model from before (d = 9).

  • If we pick a very small ball radius, what kind of model will we have?

  • A model that only returns zero.
  • A constant model.
  • Ordinary least squares.
  • Something else.

63

Answer: It depends!


64 of 89

Test Your Understanding Answer

If the ball is very tiny, our gradient descent is stuck near the origin.

  • If all parameters are zero including intercept, model always outputs zero.
  • If all parameters are zero except intercept, model is a constant model (returns mean of the observations).

64


65 of 89

Test Your Understanding Answer

Traditionally the “ball restriction” only applies to non-intercept terms.

  • θ0 is allowed to be any value, not stuck in a ball.
  • If all parameters are zero except intercept, model is a constant model (returns mean of the observations).

65


66 of 89

Test Your Understanding

Back to our 9 feature model from before (d = 9).

  • If we pick a very large ball radius, what kind of model will we have?

  • A model that only returns zero.
  • A constant model.
  • Ordinary least squares.
  • Something else.

66


67 of 89

Test Your Understanding Answer

Back to our 9 feature model from before (d = 9).

  • If we pick a very large ball radius, what kind of model will we have?

  • A model that only returns zero.
  • A constant model.
  • Ordinary least squares.
  • Something else.

67


68 of 89

Test Your Understanding Answer

  • For very large ball sizes, the restriction has no effect.
  • The ball includes the OLS solution!

68


69 of 89

Training and Validation Errors vs. Ball Size for Our 9D Model

For very small ball size:

  • Model behaves like a constant model. Can’t actually use our 9 features!
  • High training error, low variance, high validation error.

For very large ball size:

  • Model behaves like OLS.
  • If we have tons of features, results in overfitting. Low training error, high variance, high validation error.

69


70 of 89

L2 Regularization

Constraining our model’s parameters to a ball around the origin is called L2 Regularization.

  • The smaller the ball, the simpler the model.

Ordinary least squares. Find thetas that minimize:

  \( \frac{1}{n} \lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2 \)

Ordinary least squares with L2 regularization. Find thetas that minimize:

  \( \frac{1}{n} \lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2 \)   such that θ1 through θd live inside a ball of radius Q.

70


71 of 89

L2 Regularization

Constraining our model’s parameters to a ball around the origin is called L2 Regularization.

  • The smaller the ball, the simpler the model.

Ordinary least squares. Find thetas that minimize:

  \( \frac{1}{n} \lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2 \)

Ordinary least squares with L2 regularization. Find thetas that minimize:

  \( \frac{1}{n} \lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2 \)   such that \( \sum_{j=1}^{d} \theta_j^2 \le Q^2 \)

Note: the intercept term θ0 is not included in the constraint!

71


72 of 89

Quick Detour into EECS127 (not tested in this class but worth mentioning)

In 127, you’ll learn (through the magic of Lagrangian Duality) that the two problems below are equivalent:

Problem 1: Find thetas that minimize:

  \( \frac{1}{n} \lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2 \)   such that \( \sum_{j=1}^{d} \theta_j^2 \le Q^2 \)

Problem 2: Find thetas that minimize:

  \( \frac{1}{n} \lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2 + \alpha \sum_{j=1}^{d} \theta_j^2 \)

Intuitively, the extra term on the right penalizes large thetas.

The “objective function” that gradient descent is minimizing now has an extra term.

72

Covered up until this slide on 3/7. Will continue in the next lecture


73 of 89

L2 Regularized Least Squares in sklearn

We can run least squares with an L2 regularization term by using the “Ridge” class.

Coefficients we get back:

73

Note: sklearn’s “alpha” parameter is equivalent to the 𝛼 in the linear regression with L2 regularizer equation.

  • Alpha is inversely related to the ball radius! Large alpha means small ball.

from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=10000)

ridge_model.fit(vehicle_data_with_squared_features, vehicle_data["mpg"])


74 of 89

L2 Regularized Least Squares in sklearn

We can run least squares with an L2 regularization term by using the “Ridge” class.

For a tiny alpha, the coefficients are larger:

74

from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=10**-5)

ridge_model.fit(vehicle_data_with_squared_features, vehicle_data["mpg"])

Note: sklearn’s “alpha” parameter is equivalent to the 𝛼 in the linear regression with L2 regularizer equation.

  • Alpha is inversely related to the ball radius! Large alpha means small ball.


75 of 89

L2 Regularized Least Squares in sklearn

We can run least squares with an L2 regularization term by using the “Ridge” class. For a tiny alpha, the coefficients are also about the same as a standard OLS model’s coefficients!

75

Green ball includes the OLS solution!

from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()

linear_model.fit(vehicle_data_with_squared_features, vehicle_data["mpg"])


76 of 89

Figure (from lab 8)

In Lab 8, you’ll run an experiment for different values of alpha. The resulting plot is shown below.

  • Note: Since alpha is inversely related to the ball radius, complexity is higher on the left (small alpha)!

76


77 of 89

Terminology Note

Why does sklearn use the word “Ridge”?

Because least squares with an L2 regularization term is also called “Ridge Regression”.

  • Term is historical. Doesn’t really matter.

77

Why does sklearn use a hyperparameter which is the inverse of the ball radius?


78 of 89

Mathematical Note

Ridge Regression has a closed form solution which we will not derive.

  • Note: The solution exists even if the feature matrix has collinearity between its columns.

  \( \hat{\theta}_{\text{ridge}} = \left( \mathbb{X}^T \mathbb{X} + n\alpha I \right)^{-1} \mathbb{X}^T \mathbb{Y} \)   (here I is the identity matrix)

78

After lecture edit:

Note: This formula assumes that we regularize the intercept term (in practice we should not).

To avoid regularization on the intercept term, the matrix “I” should be replaced with the identity matrix with the first element (at index (0, 0)) replaced with 0.
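A small numerical sketch of that closed form (matching the version above that regularizes every parameter, including the intercept column of X):

import numpy as np

def ridge_closed_form(X, y, alpha):
    # Minimizer of (1/n) * ||y - X theta||^2 + alpha * sum(theta_j^2).
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(d), X.T @ y)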


79 of 89

Scaling Data for Regularization

Lecture 15, Data 100 Spring 2023

Cross Validation

  • The Holdout Method
  • K-Fold Cross Validation
  • Test Sets

Regularization

  • L2 Regularization (Ridge)
  • Scaling Data for Regularization
  • L1 Regularization (LASSO)

79


80 of 89

One Issue With Our Approach

Our data from before has features on quite different numerical scales!

  • Optimal theta for hp will probably be much further from origin than theta for weight^2.

80

Theta will tend to be smaller for weight^2 than other parameters


81 of 89

Coefficients from Earlier

81


82 of 89

Making Things Fair

Ideally, our data should all be on the same scale.

  • One approach: Standardize the data, i.e. replace everything with its Z-score.

  • Resulting model coefficients will be all on the same scale.

82

import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Standardize every feature column (replace each value with its Z-score).
ss = StandardScaler()
rescaled_df = pd.DataFrame(
    ss.fit_transform(vehicle_data_with_squared_features),
    columns=ss.get_feature_names_out(),
)

# Fit ridge regression on the standardized features.
ridge_model = Ridge(alpha=10000)
ridge_model.fit(rescaled_df, vehicle_data["mpg"])
ridge_model.coef_
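A quick sanity check on the standardized features, sketched below: every column should now have mean roughly 0 and standard deviation roughly 1, so the ridge penalty treats the features on an equal footing.

# Each standardized column has mean ~0 and standard deviation ~1.
print(rescaled_df.mean().round(2))
print(rescaled_df.std().round(2))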


83 of 89

L1 Regularization (LASSO)

Lecture 15, Data 100 Spring 2023

Cross Validation

  • The Holdout Method
  • K-Fold Cross Validation
  • Test Sets

Regularization

  • L2 Regularization (Ridge)
  • Scaling Data for Regularization
  • L1 Regularization (LASSO)

83


84 of 89

L1 Regularization

We can also use other shapes.

  • Example: Hypercube of radius Q.

84


85 of 89

L1 Regularization in Equation Form

Using a hypercube is known as L1 regularization. Expressed mathematically in the two equivalent forms below:

Problem 1: Find thetas that minimize:

  \( \frac{1}{n} \lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2 \)   such that \( \sum_{j=1}^{d} \lvert \theta_j \rvert \le Q \)

Problem 2: Find thetas that minimize:

  \( \frac{1}{n} \lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2 + \alpha \sum_{j=1}^{d} \lvert \theta_j \rvert \)

85


86 of 89

L1 Regularized OLS in sklearn

In sklearn, we use the Lasso module.

  • Note: Performing OLS with L1 regularization is also called LASSO regression.

86

from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha = 10)

lasso_model.fit(vehicle_data_with_squared_features, vehicle_data["mpg"])

lasso_model.coef_


87 of 89

LASSO and “Feature Selection”

The optimal parameters for a LASSO model tend to include a lot of zeroes! In other words, LASSO effectively selects only a subset of the features.

Intuitive reason:

  • Imagine expanding a 3D cube until it intersects a balloon. It is more likely to intersect at a corner or an edge than at a face (especially in high dimensions).

87

from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha = 10)

lasso_model.fit(vehicle_data_with_squared_features, vehicle_data["mpg"])

lasso_model.coef_
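To see the feature selection directly, we can count which coefficients LASSO drove exactly to zero; a small sketch continuing the code above:

import numpy as np

# Columns whose LASSO coefficient is exactly zero have been dropped from the model.
kept_features = vehicle_data_with_squared_features.columns[lasso_model.coef_ != 0]
num_zeroed = np.sum(lasso_model.coef_ == 0)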


88 of 89

Summary of Regression Methods

Our regression models are summarized below.

  • The “Objective” column gives the function that our gradient descent optimizer minimizes.
  • Note that this table uses lambda instead of alpha for regularization strength. Both are common.

88

Name: OLS
Model: \( \hat{\mathbb{Y}} = \mathbb{X}\theta \)
Loss: Squared loss
Reg.: None
Objective: \( \frac{1}{n} \lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2 \)
Solution: \( \hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y} \) (if \( \mathbb{X}^T\mathbb{X} \) is invertible)

Name: Ridge Regression
Model: \( \hat{\mathbb{Y}} = \mathbb{X}\theta \)
Loss: Squared loss
Reg.: L2
Objective: \( \frac{1}{n} \lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2 + \lambda \sum_{j=1}^{d} \theta_j^2 \)
Solution: \( \hat{\theta} = (\mathbb{X}^T\mathbb{X} + n\lambda I)^{-1}\mathbb{X}^T\mathbb{Y} \)

Name: LASSO
Model: \( \hat{\mathbb{Y}} = \mathbb{X}\theta \)
Loss: Squared loss
Reg.: L1
Objective: \( \frac{1}{n} \lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2 + \lambda \sum_{j=1}^{d} \lvert \theta_j \rvert \)
Solution: No closed form

89 of 89

Cross Validation, Regularization

89

Lecture 15