1 of 126

DS100: Fall 2019

Lecture 21 (Josh Hug): Midterm Review

  • SVD / PCA (Lectures 10, 11)
  • Linear Models (~Lectures 14, 15, 16)
  • Regularization and Bias-Variance Tradeoff (Lectures 17, 18)
  • Gradient Descent (Lectures 18, 19)
  • Logistic Regression (Lectures 19, 20)

2 of 126

Pre-announcement

Innovative Design:

  • Want to bring more people into the design community.
  • Participate in a design-a-thon.
    • Artwork for non-profit.
    • Interact with professionals.
  • Tickets are $10, apparently it costs money.
    • But you get food and material.
  • cmyk.innovativedesign.club

3 of 126

PCA and SVD

4 of 126

Singular Value Decomposition and PCA

Given a data matrix, SVD decomposes the matrix into U, Σ, and V^T.

  • The columns of U are orthonormal, as are the columns of V, and Σ is a diagonal matrix containing the singular values of the matrix.
  • Columns of UΣ are the principal components.
  • Can think of V as the matrix which transforms between X and UΣ.

X (data matrix):

  width   length   area   perimeter
  8       6        48     28
  2       4        8      12
  1       3        3      8
  ...
  2       6        12     16

UΣ (principal components):

  PC1     PC2     PC3
  -56     4.1     -0.77
  -1.4    -5.6    1.6
  -7.4    -5.1    1.5
  ...
  -19     -6.7    3.07

V^T:

  -0.15   -0.13   -0.81    -0.55
  -0.19   -0.18   0.59     -0.76
  -0.71   0.71    0.0008   0.0008

X = UΣ × V^T
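As a minimal sketch of the relationship on this slide, the NumPy snippet below uses only the four fully visible rows of the rectangle table as toy data (the elided rows are not reconstructed) and centers the columns first. The variable names are illustrative choices, not names from the lecture.

```python
import numpy as np

# Toy data: the four fully visible rows of the rectangle table
# (width, length, area, perimeter). The elided rows are omitted.
X = np.array([
    [8.0, 6.0, 48.0, 28.0],
    [2.0, 4.0, 8.0, 12.0],
    [1.0, 3.0, 3.0, 8.0],
    [2.0, 6.0, 12.0, 16.0],
])

# Center each column before the decomposition.
X_centered = X - X.mean(axis=0)

# Thin SVD: X_centered = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Principal components computed two ways: U Σ, or the data projected onto V.
pcs_from_u = U * s               # scales column i of U by s[i]
pcs_from_v = X_centered @ Vt.T

# Multiplying the principal components by V^T recovers the centered data.
X_rebuilt = pcs_from_u @ Vt
```

Both routes to the principal components agree, which is one way to read "V transforms between X and UΣ".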

5 of 126

Singular Value Interpretation (Informally)

Informally, the ith singular value tells us how valuable the ith principal component will be in reconstructing our original data.

  • First principal component does most of the work.
  • Next two principal components contribute about equally.
  • Fourth principal component does nothing (4th singular value is zero).

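Centered data matrix:

  width   length   area     perimeter
  2.97    1.35     24.78    8.64
  -3.03   -0.65    -15.22   -7.36
  -4.03   -1.65    -20.22   -11.36
  ...
  -3.03   1.35     -11.22   -3.36

[Figure: the centered data matrix written as U Σ V^T, with the diagonal matrix Σ shown explicitly.]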

6 of 126

Singular Value Interpretation (More Formally)

Formally, the ith singular value tells us how much of the variance is captured by the ith principal component.

  • The amount of variance captured by the ith principal component is equal to (ith singular value)² / N.

Variance captured by 1st PC: 197.39² / 100 = 389.63

Variance captured by 2nd PC: 27.43² / 100 = 7.52

Variance captured by 3rd PC: 23.26² / 100 = 5.41

Centered data matrix:

  width   length   area     perimeter
  2.97    1.35     24.78    8.64
  -3.03   -0.65    -15.22   -7.36
  -4.03   -1.65    -20.22   -11.36
  ...
  -3.03   1.35     -11.22   -3.36

Total variance: 402.56
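The variance bookkeeping above can be checked numerically. The sketch below generates 100 synthetic rectangles (the lecture's actual data is not available, so the uniform ranges and the seed are assumptions) and compares the variance per principal component with the total variance of the centered columns.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed; the data is synthetic

n = 100
width = rng.uniform(1, 10, size=n)
length = rng.uniform(1, 10, size=n)
# Columns: width, length, area, perimeter.
X = np.column_stack([width, length, width * length, 2 * width + 2 * length])
X_centered = X - X.mean(axis=0)

# Singular values only (no U or V needed here).
s = np.linalg.svd(X_centered, compute_uv=False)

# Variance captured by the ith principal component: (ith singular value)^2 / N.
variance_per_pc = s**2 / n

# Total variance: sum of the per-column variances (np.var divides by N).
total_variance = X_centered.var(axis=0).sum()

print(variance_per_pc)
print(variance_per_pc.sum(), total_variance)
```

Because perimeter is an exact linear combination of width and length, the fourth singular value comes out as zero up to floating-point error, matching the "fourth principal component does nothing" observation.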

7 of 126

Principal Component Analysis

PCA is the process of (linearly) transforming data into a new coordinate system such that the greatest variance occurs along the first dimension, the second most along the second dimension, and so on.

  • Surfboard example on HW5: PCA rotates our data so that it is axis aligned.

8 of 126

Principal Component Analysis in Data Science

Two primary uses for PCA (that we’ve seen):

  • Scatter plotting high dimensional data in 2D (2D representation).
  • Inspection of rows of V^T (esp. row 1) to see what features are most important.

9 of 126

Principal Component Analysis in Data Science

Reminder: It is important to center your data before using PCA.

  • Otherwise the first principal component will effectively point from the origin towards the center of the data.

Scaling data to have unit variance can also be useful.

  • Left: Unscaled data: High variance variable appears in both pcs. Structure observed is a non-useful artifact.
  • Right: Scaled data: Structure disappears, and we see no useful trends.

10 of 126

Principal Component Analysis in Data Science

Reminder: It is important to center your data before using PCA.

  • Otherwise the first principal component will effectively point from the origin towards the center of the data.

Scaling data to have unit variance can also be useful.

  • But not always. For example, on the surfboard data, if we don’t scale our data, we get the figure on the left, which is still in the original units for our measurements. Can easily read off width and length of board.

11 of 126

Regression and Loss Functions

12 of 126

Data Generation Process

Terms:

  • Data Generation Process: The real world phenomenon from which data is collected.
  • Model: A theory of the data generation process.

Two types of models so far:

  • Linear Regression: Given data vector, try to predict a scalar output.
  • Logistic Regression: Given data vector, try to predict a probability.
    • Combine with thresholding to yield a classification.

13 of 126

Example Linear Regression Task

Given the table below, try to predict the salary of each person.

  • Each column is a feature of the person.

Linear regression: The prediction and the correct answer are both real values.

  Name      Age   Gender   City        Political Affiliation   Rent/own?   Children
  Nico      45    M        São Paulo   PSDB                    Rent        NaN
  Claudia   65    F        Paris       REM                     Own         3
  Lin       20    F        Oakland     Dem                     NaN         0
  Renzo     35    M        Laredo      Dem                     Rent        0

14 of 126

Example Logistic Regression Task

Given the table below, try to predict, for each person, the probability that they identify with the United States Democratic Party.

Logistic regression: The prediction and the correct answer are both real values.

  • Prediction is a probability. Correct answer is either 0 or 1.

  Name      Age   Gender   City        Salary   Rent/own?   Children
  Nico      45    M        São Paulo   11572    Rent        NaN
  Claudia   65    F        Paris       156262   Own         3
  Lin       20    F        Oakland     20189    NaN         0
  Renzo     35    M        Laredo      9573     Rent        0

15 of 126

Loss Functions

A loss function quantifies the accuracy of a prediction. Common loss functions we’ve discussed:

  • L1 loss.
  • L2 loss.

When we train models in DS100, we try to minimize the average loss.

  • Average L1 loss (mean absolute error): $\frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$

  • Average L2 loss (mean squared error): $\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Average loss is also called the “empirical risk”. More on this later.

16 of 126

Linear Models

Almost all of the regression models we’ve used in our class have been linear.

  • Model output is linear in the parameters and features.
  • Given a column vector of features $x$, can write a single prediction as: $\hat{y} = x^T \beta$

  • Or a vector of predictions as: $\hat{y} = X \beta$

Above, $X$ is a matrix of all the data we use to make our predictions.

  • $x$ is a column vector of features.

17 of 126

Linear Models

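For a fixed $X$ and $y$, average loss in a model is a function of only the parameter vector.

We use $\hat{\beta}$ to represent the optimal value of the parameters. That is: $\hat{\beta} = \arg\min_{\beta} L(\beta)$, where $L(\beta)$ is the average loss.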

Reminder: argmin means find the parameter vector beta that minimizes the right hand side.

18 of 126

Linear Models

Since our model is linear in theta, the average loss is convex for both L1 and L2.

  • Convex functions have a single global minimum. Minimum is where the gradient is zero (i.e. the bottom of the metaphorical bowl).
    • If we evaluate the gradient at the optimal parameter vector, we’ll get zero.

19 of 126

The L2 Loss for Linear Models and the Normal Equation

In lecture, we showed that if we use the L2 loss, then the optimal parameters are given by the Normal Equation.

  • Proven using geometric reasoning.
    • Will discuss this geometric reasoning behind OLS later.

For L2 loss: $\hat{\beta} = (X^T X)^{-1} X^T y$
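A minimal sketch of the normal equation, $\hat{\beta} = (X^T X)^{-1} X^T y$, on synthetic data. The data-generating line (intercept 3, slope 2, unit noise) and the seed are made-up assumptions, and the linear system is solved directly rather than by forming the inverse.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed; synthetic data
n = 50
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=n)

# Design matrix with a bias column and one feature.
X = np.column_stack([np.ones(n), x])

# Normal equation: solve (X^T X) beta = X^T y instead of inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# At the optimum, the gradient of the average L2 loss is zero.
residuals = y - X @ beta_hat
gradient = -2 / n * X.T @ residuals

print(beta_hat)         # estimated intercept and slope
print(gradient)         # approximately zero
print(residuals.sum())  # approximately zero, since X has a bias column
```

The residuals summing to zero is a consequence of including the bias column: the residual vector is orthogonal to every column of X, including the column of ones.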

20 of 126

Nonlinear Models

If our model is nonlinear in theta, we cannot use the normal equation.

  • Must resort to a technique like gradient descent (see hw 7 for more).
  • Loss function may not be convex, so result might not be optimal.
    • Gradient descent output may depend on the starting location.

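Gradient descent update rule: $\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta L(\theta^{(t)})$, where $\alpha$ is the learning rate.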
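To illustrate the dependence on the starting location, here is a small sketch of gradient descent on a one-dimensional non-convex function. The function $L(\theta) = \theta^4 - 3\theta^2 + \theta$ is a made-up stand-in (it is not the HW7 model), and the learning rate and step count are arbitrary choices.

```python
def loss(theta):
    # Hypothetical non-convex loss with two local minima.
    return theta**4 - 3 * theta**2 + theta

def loss_gradient(theta):
    return 4 * theta**3 - 6 * theta + 1

def gradient_descent(theta_start, learning_rate=0.01, num_steps=1000):
    theta = theta_start
    for _ in range(num_steps):
        # Update rule: step against the gradient.
        theta = theta - learning_rate * loss_gradient(theta)
    return theta

end_from_left = gradient_descent(-2.0)
end_from_right = gradient_descent(2.0)

print(end_from_left, loss(end_from_left))
print(end_from_right, loss(end_from_right))
```

The two runs settle in different local minima with different loss values, so only one of them finds the better solution.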

21 of 126

Nonlinear Models: Example [adapted from HW7]

Suppose we are trying to fit the data below with the model shown.

Lots of local minima!

22 of 126

Nonlinear Models: Example [adapted from HW7]

Suppose we are trying to fit the data below with the model shown.

[Figure: a gradient descent run on the loss surface, with “start” and “end” points marked.]

23 of 126

Nonlinear Models: Example [adapted from HW7]

Suppose we are trying to fit the data below with the model shown.

[Figure: a gradient descent run on the loss surface, with “start” and “end” points marked.]

24 of 126

The L2 Loss for Linear Models and the Normal Equation

Since linear models are so powerful, easy to optimize, relatively easy to understand, and have a nice geometric interpretation, we used them exclusively for regression except for HW7.

  • Note: Nonlinear models are used in the real world! Won’t discuss them in much detail in our course.

For L2 loss: $\hat{\beta} = (X^T X)^{-1} X^T y$

25 of 126

Feature Functions

It may not be possible to build an accurate linear model that uses “raw data”.

...

Why am I using theta now instead of beta? Just to get you used to both, since both letters are used to represent parameters.

26 of 126

Feature Functions

It may not be possible to build an accurate linear model that uses “raw data”.

...

Optimal theta for L2 loss, computed using normal equation.

27 of 126

Feature Functions

It may not be possible to build an accurate linear model that uses “raw data”.

  • But with a good choice of feature functions, we can get good predictions.

...

Optimal theta for L2 loss, computed using normal equation.

28 of 126

Feature Engineering

Feature engineering lets us:

  • Capture domain knowledge (e.g. periodicity or relationships).
  • Encode non-numeric features, e.g.
    • indicator functions, bag-of-words encodings, one-hot encodings, etc.
  • Express non-linear relationships.
  • Some authors use X or to represent “raw” data, and to represent feature-engineered data.

29 of 126

One Hot Encoding

One Hot Encoding is a specific feature engineering technique.

  • Naive version converts a categorical variable (with N categories) into N dummy variables that are either 0 or 1.

30 of 126

One Hot Encoding

One Hot Encoding is a specific feature engineering technique.

  • In practice, we often pick one category as the “default” category, so that we end up with N - 1 dummy variables.
    • Why? If we include a bias term, then Nth dummy variable is redundant (= bias minus sum of other variables). If we include all N variables AND a bias term, design matrix is not full rank.
    • Important to include a bias term if you drop a dummy variable.

31 of 126

One Hot Encoding

One Hot Encoding is a specific feature engineering technique.

  • Someone asked in class: “What if you have multiple categories?” Suppose you had {democrat, republican, other} and {freshman, sophomore, junior, senior}.
    • In the naive approach: Create 7 total dummy variables.
    • If you include a bias term, you should drop 1 dummy variable from each category. You’d end up with 7 - 2 = 5 dummy variables.
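The dummy-variable counts above can be reproduced with pandas. This is a sketch on a made-up four-row table; the column names `party` and `year` are illustrative, and `drop_first=True` drops one level per categorical column (the first in sorted order), which stands in for picking a “default” category.

```python
import pandas as pd

# Made-up table with the two categorical variables from the example.
df = pd.DataFrame({
    "party": ["democrat", "republican", "other", "democrat"],
    "year": ["freshman", "sophomore", "junior", "senior"],
})

# Naive approach: one dummy variable per category, 3 + 4 = 7 columns.
naive = pd.get_dummies(df)

# Drop one dummy per variable (for use with a bias term): 7 - 2 = 5 columns.
dropped = pd.get_dummies(df, drop_first=True)

print(list(naive.columns))
print(list(dropped.columns))
```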

32 of 126

One Hot Encoding (Obsolete Slide)

One Hot Encoding is a specific feature engineering technique.

  • Naive version converts a categorical variable (with N categories) into N variables that are either 0 or 1.
  • In practice, we often pick one category as the “default” category, so that we end up with N - 1 variables.
    • Why? Nth feature is redundant. Also if we include a bias term in our model, will result in a design matrix that is not full rank.

33 of 126

Feature Engineering

Feature engineering lets us:

  • Capture domain knowledge (e.g. periodicity or relationships).
  • Encode non-numeric features, e.g.
    • indicator functions, bag-of-words encodings, one-hot encodings, etc.
  • Express non-linear relationships.

34 of 126

The OLS Model

35 of 126

Linear Modeling

Suppose we want to build a linear model that predicts the acceleration of a car from its weight in our mpg dataset.

  • Looks like a hard regression task. The correlation looks poor. We’ll see that nonetheless, a linear model with enough features can do very well.

36 of 126

Simplest Model: The Mean Model

Our simplest possible model is to represent the acceleration by a summary statistic. Choice of summary statistic depends on choice of loss.

  • L2 loss: Use the mean (shown below).
  • L1 loss: Use the median.
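A quick numerical check of the mean/median claim, using a made-up set of five values with one outlier. The sketch searches a grid of constant predictions (step 0.01, an arbitrary resolution) and reports which constant minimizes each average loss.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up values with an outlier
candidates = np.linspace(0, 100, 10001)     # constant predictions to try

# Average loss of each constant prediction, computed by broadcasting.
errors = y[:, None] - candidates[None, :]
mse = (errors**2).mean(axis=0)
mae = np.abs(errors).mean(axis=0)

best_l2 = candidates[mse.argmin()]
best_l1 = candidates[mae.argmin()]

print(best_l2, y.mean())      # L2 minimizer vs. the mean
print(best_l1, np.median(y))  # L1 minimizer vs. the median
```

The outlier pulls the L2 answer far from most of the data, while the L1 answer stays put.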

37 of 126

Data 8 Approach: Simple Linear Regression

In Data 8 we considered the simple linear regression model: $\hat{y} = \beta_0 + \beta_1 x$

  • Again, choice of beta depends on choice of loss function.
  • If we choose L2 as our loss function, we get the regression line below.

38 of 126

HW6 Model

In HW6 we considered a model with a slope but no y-intercept: $\hat{y} = \beta x$

  • Yet again, choice of beta depends on choice of loss function.
  • If we choose L2 as our loss function, we get the regression line below.

39 of 126

Linear Model

All three of these are a special case of the general linear regression model.

  • Ordinary Least Squares: Linear regression model optimized for L2 loss.
  • Which is which?

40 of 126

Linear Model

All three of these are a special case of the general linear regression model.

  • Ordinary Least Squares: Linear regression model optimized for L2 loss.
  • Which is which?

41 of 126

Linear Model

All three of these are a special case of the general linear regression model.

  • Ordinary Least Squares: Linear regression model optimized for L2 loss.
  • Which is which?

42 of 126

Linear Model

For which of these models are the residuals guaranteed to sum to zero?

43 of 126

Linear Model

For which of these models are the residuals guaranteed to sum to zero?

44 of 126

Geometric Interpretation of OLS

45 of 126

Improving Our Model

Suppose we want to do better than the model below.

  • Can add more quantitative features to our design matrix like “horsepower” or “squared horsepower”.
  • Can also add qualitative features using e.g. one-hot encoding.

46 of 126

Thought Experiment

Suppose we use the design matrix shown.

  • First column is weight. Second column is bias.
  • Third column is weight plus some totally random noise.

Will this model do better, do worse, or do about the same?

47 of 126

Thought Experiment

Suppose we use the design matrix shown.

  • First column is weight. Second column is bias.
  • Third column is weight plus some totally random noise.

Will this model do better, do worse, or do about the same?

48 of 126

Thought Experiment

Suppose we use the design matrix shown.

  • First column is weight. Second column is bias.
  • Third column is weight plus some totally random noise.

Will this model do better, do worse, or do about the same?

49 of 126

Thought Experiment

Why is this model better?

  • Give a geometric argument.

50 of 126

Thought Experiment

Suppose we add four random features?

  • Will the performance be even better?

51 of 126

Thought Experiment

Suppose we add four random features?

  • Will the performance be even better? Yes.

52 of 126

Thought Experiment

What if we add even more random features? What eventually happens? Does MSE level off, or does it keep getting better and better?

53 of 126

Thought Experiment

What if we add even more random features? What eventually happens? Does MSE level off, or does it keep getting better and better?

54 of 126

Thought Experiment

What if we add even more random features? What eventually happens? Does MSE level off, or does it keep getting better and better?

  • Once we have 28 random features (for a total of 30 features), we end up with zero MSE.
  • Why?

55 of 126

Thought Experiment

What if we add even more random features? What eventually happens? Does MSE level off, or does it keep getting better and better?

  • Once we have 28 random features (for a total of 30 features), we end up with zero MSE.
  • Why?

56 of 126

Linear Algebra

If we have 30 linearly independent (for example, orthogonal) columns in our data matrix $X$, we can trivially fit any 30 observations we want.

  • The span of our 30 columns is the entire 30 dimensional space.
  • Thus the closest point in the span is the set of observations themselves.

Note: Being able to truly understand this slide is why linear algebra is so important.

57 of 126

Alternate View

In this case, since span{$X$} is the entire possible space of observations, $y$ lies within the span.

  • Thus $\hat{y}$ is equal to $y$.
  • The residual vector is just the zero vector.
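The thought experiment can be replayed on synthetic data. In the sketch below, the 30 “cars” are made up (weight in thousands of pounds and an invented linear relationship with noise), since the mpg dataset itself is not reproduced here; the random columns are pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed; all data below is synthetic
n = 30
weight = rng.uniform(1.5, 5.0, size=n)            # stand-in for car weight
accel = 20 - 2 * weight + rng.normal(0, 2, n)     # stand-in for acceleration
noise_cols = rng.normal(size=(n, 28))             # 28 purely random features

def training_mse(num_random):
    # Design matrix: weight, bias, and the first `num_random` noise columns.
    X = np.column_stack([weight, np.ones(n), noise_cols[:, :num_random]])
    theta, *_ = np.linalg.lstsq(X, accel, rcond=None)
    return np.mean((accel - X @ theta) ** 2)

counts = [0, 1, 4, 10, 20, 28]
mse_by_count = {k: training_mse(k) for k in counts}
for k, mse in mse_by_count.items():
    print(k, mse)
```

Each added column can only enlarge the span, so training MSE never goes up, and with 30 columns for 30 observations it reaches zero up to floating-point error. None of this says anything about error on new data.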

58 of 126

Linear Models and Overfitting

Having an excessive number of features in our model can lead to overfitting. Two techniques for keeping model complexity under control:

  • Feature selection (picking only a subset of features).
  • Regularization.

Will return to these, but first let’s consider overfitting in a more precise way.

59 of 126

Bias and Variance

60 of 126

Feature Functions

Earlier we showed how some clever feature engineering can allow for accurate modeling, even on seemingly nonlinear data.

...

Optimal theta for L2 loss, computed using normal equation.

61 of 126

Fundamental Limits

Our prediction below is very good, but isn’t perfect!

  • In this case, it’s because the data was generated with some random noise added.

62 of 126

Fundamental Limits

It’s not enough to simply grab a bunch of data points and run them through the normal equation.

  • Tricky issue: How do we decide how complex of a model to use?
    • Too simple of a model: Will not capture essential features.
    • Too complex: Will overfit to noise.

63 of 126

Too Simple of a Model

This model can’t capture the complexity of the data.

64 of 126

A Better Model

This model does a better job capturing the complexity of the data.

65 of 126

An Overly Complex Model

This model has even lower error, but captures small fluctuations that are unlikely to be meaningful.

66 of 126

Losses of Each Model

67 of 126

Philosophical Question

Our “overly complex” model overfit to the noise. But we showed it had the lowest L2 loss, so why don’t we just use this model anyway?

68 of 126

Training vs. Real Data

Our “overly complex” model overfit to the noise. But we showed it had the lowest L2 loss, so why don’t we just use this model anyway?

  • Because we’re going to use it to make more predictions on other data!

Particularly bad prediction!

69 of 126

Training vs. Test Error

70 of 126

Training Error

Naively, building more complex models drives down our error, but...

71 of 126

Training vs. Test Error, Philosophical Question

Want a level of model complexity that minimizes our test error.

  • So why not just repeatedly run an experiment on the test data?

72 of 126

Training vs. Test Error, Philosophical Question

Want a level of model complexity that minimizes our test error.

  • So why not just repeatedly run an experiment on the test data?
    • Because we’ll be using our model to make even MORE predictions on MORE data. Don’t want to overfit to the noise in the current test data.

Thus, we have a conundrum: How do we minimize the error for something we cannot see?

  • We can deal with this by conducting a probabilistic analysis of the L2 Loss.

73 of 126

Probabilistic Models of Bias and Variance

74 of 126

Thought Experiment

Imagine we design a feature matrix, then collect information about 500 random diners and use this data to train an OLS tip model.

Then imagine we collect information about 1 more diner, and compute the loss of our model’s prediction.

  • Generate prediction:
  • Compute the loss on the prediction:

The loss will be some number.

Training data

75 of 126

Thought Experiment

Imagine we design a feature matrix, then collect information about 500 random diners and use this data to train an OLS tip model.

Then imagine we collect information about 1 more diner, and compute the loss of our model’s prediction.

  • Generate prediction:
  • Compute the loss on the prediction:

The loss will be some number. What are some reasons loss might be nonzero?

Training data

76 of 126

Thought Experiment

Imagine we design a feature matrix, then collect information about 500 random diners and use this data to train an OLS tip model.

Then imagine we collect information about 1 more diner, and compute the loss of our model’s prediction. Why might loss be non-zero?

  • That one person might be having a good day.
    • We missed the “good day” feature?
  • The server was really good or bad.
    • We missed the “server” quality feature.

Training data

77 of 126

Thought Experiment

Imagine we design a feature matrix, then collect information about 500 random diners and use this data to train an OLS tip model.

Then imagine we collect information about 1 more diner, and compute the loss of our model’s prediction. Why might loss be non-zero?

  • They might have added wrong. Under a “free will” model of the universe, there may be some randomness we can’t capture.

Training data

78 of 126

Thought Experiment

Imagine we design a feature matrix, then collect information about 500 random diners and use this data to train an OLS tip model.

Then imagine we collect information about 1 more diner, and compute the loss of our model’s prediction. Why might loss be non-zero?

  • Model might be missing important features (bias).
  • There might be some randomness in how people pick their tip (noise).
  • We might have gotten unlucky with our sample and had large measurement error and/or a non-representative sample (model variance).
  • Our model may have fit random noise (model variance).

Training data

79 of 126

Thought Experiment

Imagine we design a feature matrix, then collect information about 500 random diners and use this data to train an OLS tip model.

Then imagine we collect information about 1 more diner, and compute the loss of our model’s prediction.

  • Generate prediction:
  • Compute the loss on the prediction:

We can repeat this same process over and over, each time yielding a different loss. The “risk” is simply the expected value of this loss.

Training data

80 of 126

Probabilistic View of Bias - Variance Tradeoff

First key idea: Assume any observations are created by a “true function” plus some zero-mean noise. Thus, observations are a random variable.

  • Example: x is every possible thing we could know about the people at a table, the weather, the server, etc. y is the resulting tip. We assume that the tip is given by true function h(x) + some noise.

81 of 126

Probabilistic View of Bias - Variance Tradeoff

Second key idea: The thetas that we compute for our model are based on noisy training observations. Thus, our $\hat{\theta}$ is also a random variable.

  • Example: Data about our 500 diners had some randomness, so the resulting model we build is a random variable.
  • We assume nothing about the nature of the randomness of our training observations.

Example if our model is linear and we’re training by minimizing L2 Loss.

Training data

82 of 126

Probabilistic View of Bias - Variance Tradeoff

Let f be our model that produces a prediction based on data.

Since the loss depends on two random variables, loss is also a random variable.

Since loss is an R.V., we can compute its expectation:

  • Would be great to minimize this!

83 of 126

Probabilistic View of Bias - Variance Tradeoff

Now taking into account randomness, our goal would be to minimize this expectation, which we call the risk:

It is not possible to compute the risk (or true risk).

  • However, the average loss computed on data that was not used for training approximates the risk. Average loss on such data often called the “empirical risk”.

84 of 126

Probabilistic View of Bias - Variance Tradeoff

Now taking into account randomness, our goal would be to minimize this expectation, which we call the risk:

Minimizing this expectation would mean minimizing our loss on even unseen “test data”.

  • Using our definitions, we can break down the expectation above, resulting in the “bias variance tradeoff”.
  • What the result shows us is that the risk (i.e. expected loss) is made of two key terms that tend to work against each other.
  • Note: Hard to understand. You probably won’t really understand it until you’ve taken a more advanced course like 189.

85 of 126

Probabilistic View of Bias - Variance Tradeoff

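$E[(y - f_{\hat{\theta}}(x))^2] = \underbrace{\sigma^2}_{\text{Noise}} + \underbrace{(h(x) - E[f_{\hat{\theta}}(x)])^2}_{\text{Bias}^2} + \underbrace{E[(E[f_{\hat{\theta}}(x)] - f_{\hat{\theta}}(x))^2]}_{\text{Model Variance}}$

Here $\sigma^2$ is the variance of the zero-mean noise, $h$ is the true function, and $f_{\hat{\theta}}$ is the fitted model.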

86 of 126

Bias

Bias: The expected deviation between the predicted and true values.

  • Left term: What would the true function generate for the current x?
  • Right term: If you were able to train huge numbers of models using noisy data generated by the true function, what is the average prediction they’d make for the current x?

A low complexity model is going to have large bias.

  • h(x) might be fairly complicated, i.e. observations are a complicated function on the x values.
  • If f is relatively simple, it won’t be able to capture that relationship.

87 of 126

Model Variance

Model variance: Variability in predicted value across different training datasets.

  • We saw on the previous slide that the left term is the average prediction for a given x.
  • This entire expression is how much a specific prediction for x is expected to deviate from the average prediction from x.

A high complexity model tends to have high variance:

  • Prediction of a model trained on one noisy dataset will lead to highly specific predictions for a given x.
  • The average prediction will not track that one specific noisy dataset.
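The bias and model variance definitions can be estimated by simulation. The sketch below assumes a true function $h(x) = \sin(x)$, Gaussian noise with standard deviation 0.3, and polynomial models fit with `np.polyfit`; all of these are illustrative choices, not the lecture's setup. It trains many models on fresh noisy datasets and looks at their predictions at a single point `x0`.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

def h(x):
    # Assumed "true function" for this simulation.
    return np.sin(x)

sigma = 0.3                                # assumed noise standard deviation
x_train = np.linspace(0, 2 * np.pi, 30)    # fixed training inputs
x0 = np.pi / 2                             # point at which we predict
n_datasets = 1000                          # number of simulated training sets

def bias_sq_and_variance(degree):
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        # Fresh noisy observations of the same true function.
        y_train = h(x_train) + rng.normal(0, sigma, size=x_train.size)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coeffs, x0)
    bias_sq = (h(x0) - preds.mean()) ** 2  # (true value - average prediction)^2
    variance = preds.var()                 # spread around the average prediction
    return bias_sq, variance

simple = bias_sq_and_variance(1)    # straight line
flexible = bias_sq_and_variance(5)  # degree-5 polynomial

print("degree 1: bias^2 = %.4f, variance = %.4f" % simple)
print("degree 5: bias^2 = %.4f, variance = %.4f" % flexible)
```

In this setup the straight line has the larger squared bias and the degree-5 polynomial has the larger model variance, which is the tradeoff in miniature.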

88 of 126

Fundamental Limits [Figure from Scikit Learn [Link]]

Bias: Expected deviation between predicted value and true value.

  • Underfitting means we expect to miss important details about h(x).

Estimated Model Variance: Variability in predicted value across different datasets.

  • Overfitting means we expect to fit noise that is not part of h(x).

[Figure labels: “High Bias” (underfit panel) and “High Expected Model Variance” (overfit panel).]

89 of 126

Model Complexity, Bias, and Variance

Questions that arise:

  • How can we smoothly adjust model complexity?
  • For each level of complexity, how can we estimate our test error without seeing test data?

90 of 126

Regularization

91 of 126

Theta Calculation

Without taking into account model complexity, we had before that:

But minimizing the loss on our training data may lead to overfitting.

Ways to reduce model complexity:

  • Remove features entirely from our feature matrix.
  • Make some thetas zero or very small. We’ll use this approach.

92 of 126

Theta Calculation With Constraints

We can cast this as a “constrained optimization” problem.

Example of a complexity function:

such that:

93 of 126

Theta Calculation With Constraints (Example: L0 Ball)

We can cast this as a “constrained optimization” problem.

Yields the best model with two or fewer non-zero parameters.

  • Computationally intractable (just have to try all combinations).

such that: $\|\theta\|_0 \le 2$ (at most two non-zero parameters)

94 of 126

Theta Calculation With Constraints (Example: L1 Ball)

We can cast this as a “constrained optimization” problem.

Yields the best model where the sum of the abs(thetas) is less than 205.5.

  • Computationally tractable, but we won’t discuss. See EECS127/227.
  • Also called LASSO regression.

such that: $\sum_j |\theta_j| \le 205.5$

95 of 126

Theta Calculation With Constraints (Example: L2 Ball)

We can cast this as a “constrained optimization” problem.

Yields the best model where the sum of the squares of thetas is less than 40,000.

  • Computationally tractable, but we won’t discuss. See EECS127/227.
  • Also called Ridge Regression.

such that: $\sum_j \theta_j^2 \le 40{,}000$

96 of 126

Geometric Interpretation of Constrained Optimization Problem

97 of 126

Unconstrained Version of Regularization

Can convert constrained optimization problem into unconstrained version.

(See EECS127 for more)

such that:

Think of the added regularization term as a penalty for complex models.

EECS127 Magic

98 of 126

More Generically

Given a complexity function R(θ), we find the theta that minimizes the sum of the loss and λR(θ).

Example:

  • R(θ) = “sum of the squares of the theta” would be the previous slide.
  • R(θ) = “infinity if more than two thetas are non-zero, and zero otherwise”: would result in the best model that uses at most two features (this is effectively the constrained L0-ball problem).

99 of 126

The Role of Lambda

Lambda helps keep the model in check.

If we pick λ = 0, the model is not constrained at all, and we simply have standard linear regression.

If we pick a large λ, then our parameters are constrained and tend towards zero.

100 of 126

Nice Solution for Ridge Regression (Bonus)

Ridge Regression has a nice analytic solution.

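  • If R(θ) = $\sum_j \theta_j^2$, then the optimal theta is given by: $\hat{\theta} = (X^T X + n \lambda I)^{-1} X^T y$, where $I$ is the identity matrix. (The factor $n$ appears because the loss is the average squared error; with a summed squared error it is absent.)

We didn’t prove this in class, but it has been mentioned in lecture / discussion.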
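A minimal sketch of the ridge solution, assuming the objective is the average squared error plus $\lambda \sum_j \theta_j^2$, in which case the closed form is $(X^T X + n\lambda I)^{-1} X^T y$. The data is synthetic, there is no bias column (so nothing needs to be excluded from the penalty), and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed; synthetic data
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(0, 1, size=n)

def ridge_fit(X, y, lam):
    # Minimizes mean((y - X theta)^2) + lam * sum(theta^2).
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

def objective_gradient(theta, X, y, lam):
    # Gradient of the regularized objective above.
    n = X.shape[0]
    return -2 / n * X.T @ (y - X @ theta) + 2 * lam * theta

lambdas = (0.0, 0.1, 10.0)
thetas = {lam: ridge_fit(X, y, lam) for lam in lambdas}
for lam, theta in thetas.items():
    print(lam, np.round(theta, 3), np.linalg.norm(theta))
```

With λ = 0 this reduces to ordinary least squares, and the parameter norm shrinks towards zero as λ grows.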

101 of 126

Regularization and Data Scaling

Reminder: It is very important to scale your data when using regularization.

  • Otherwise beta values for relatively small valued features will be artificially constrained to be small.
  • We do not need to scale the bias term (not included in our regularization computation).

102 of 126

Cross Validation

103 of 126

Model Complexity, Bias, and Variance

Questions that arise:

  • How can we smoothly adjust model complexity? Solved: Regularization.
  • For each level of complexity, how can we estimate our test error without seeing test data?

104 of 126

The Holdout Method (not discussed in DS100)

One approach: Set aside some training data as “validation data”. For each possible λ, optimize the function below for the training data, then compute the loss on the validation data. Use the λ with lowest validation loss.

Essentially, we use the validation set as a way to evaluate the quality of a hyperparameter.

105 of 126

K-Fold Cross Validation

Alternate strategy for hyperparameter quality. Split train data into k-folds, then:

  • Train model on all but 1 fold. Compute error on remaining fold, called the validation fold.
  • Repeat the step above for all k possible choices of validation fold.
  • Quality is the average of the k validation errors.
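The k-fold procedure above can be sketched in a few lines of NumPy. This version tunes the ridge penalty λ on synthetic data; the helper names (`ridge_fit`, `k_fold_mse`), the candidate λ values, and the choice k = 5 are all assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution for average squared error + lam * sum(theta^2).
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

def k_fold_mse(X, y, lam, k=5, seed=0):
    n = X.shape[0]
    # Shuffle the row indices, then split them into k folds.
    indices = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]  # the validation fold
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        theta = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((y[val_idx] - X[val_idx] @ theta) ** 2))
    # Quality of this lambda: the average of the k validation errors.
    return float(np.mean(errors))

rng = np.random.default_rng(1)  # arbitrary seed; synthetic training data
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(0, 1, size=100)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
cv_errors = {lam: k_fold_mse(X, y, lam) for lam in lambdas}
best_lam = min(cv_errors, key=cv_errors.get)
print(cv_errors)
print("best lambda:", best_lam)
```

Using the same seed for every λ keeps the folds identical across candidates, so the comparison between hyperparameters is not muddied by different splits.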

106 of 126

Model Complexity, Bias, and Variance

Questions that arise:

  • How can we smoothly adjust model complexity? Regularization.
  • For each level of complexity, how can we estimate our test error without seeing test data? Cross-validation.
    • Mimics the idea of hidden data.

107 of 126

Bootstrapping: A Cousin of Cross Validation

We also discussed briefly the idea of “bootstrapping” (more of a data8 topic).

  • Cross Validation gives us a sense of how well our model generalizes.
    • Gives us MSE for different scenarios (e.g. different lambda values).
  • Bootstrapping gives us a sense of how much our model parameters will vary.
    • Gives us a distribution of parameters.

108 of 126

Regularized Linear Modeling Summary

Our most common technique in this course is to use a regularized linear model.

  1. Pick a model, loss function, and regularization term.
  2. Split your data into training and test sets (e.g. 90%, 10%).
  3. Use only the training data when designing, training, and tuning the model.
    1. Use cross validation to tune hyperparameters during this phase.
    2. Do not look at the test data.
  4. Commit to your final model and train once more using only the training data.
  5. Test the final model using the test data. If accuracy is not acceptable return to (3). (Get more test data if possible.)
  6. Train on all available data and ship it!

109 of 126

Modeling More Generally

Logistic Regression

110 of 126

Regularized Linear Modeling Summary

Our most common technique in this course is to use a regularized linear model.

  • Pick a model, loss function, and regularization term.

Models we’ve seen in this class:

  • Linear model.
  • Logistic model.
  • Nonlinear sinusoidal model (HW7).

Loss functions:

  • L1 (absolute loss).
  • L2 (squared loss).
  • Cross entropy loss.

Regularization terms:

  • No regularization.
  • L1 regularization (LASSO).
  • L2 regularization (Ridge).

111 of 126

Linear vs. Logistic Regression

In a linear regression model with p features, our goal is to predict a quantitative variable (i.e., some real number) from those features.

  • Our output can be any real number.

In a logistic regression model with p features, our goal is to predict a categorical variable from those features.

  • The output of logistic regression is always between 0 and 1, i.e. it is quantitative!
    • Gives probability under our model that the category is 1.
  • To convert probability into a classification, we use thresholding.
    • Example: If P > 0.4, return 1, otherwise return 0.

112 of 126

Model, Loss, and Regularization

Our three choices of model, loss, and regularization term can be made independently.

Examples of bad combinations:

  • A logistic model with squared loss can yield bad results.
  • It turns out logistic model with no regularization is also problematic, will see this on Tuesday.

113 of 126

Logistic Regression with L2 Loss - Not a Good Idea

A logistic model with squared loss can yield bad results.

In lecture, we saw the examples below, where we had large flat regions.

114 of 126

Logistic Regression with L2 Loss - Not a Good Idea

A logistic model with squared loss can yield bad results.

We can also construct even more pathological examples.

  • Gradient descent gets pulled towards infinite beta.

115 of 126

Logistic Regression with Cross Entropy Loss

A logistic model with cross entropy loss yields a loss surface that is always convex.

  • Easier to optimize.

See lecture 20 for more motivation on why we picked cross entropy loss specifically.
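As a rough sketch of a logistic model trained with average cross entropy loss, the snippet below runs plain gradient descent on synthetic one-feature data. The true parameters, learning rate, step count, and 0.5 threshold are made-up choices, and no regularization is applied.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy(theta, X, y):
    # Average cross entropy loss of the logistic model on (X, y).
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)  # arbitrary seed; synthetic data
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])   # bias column + one feature
true_theta = np.array([-0.5, 2.0])     # made-up parameters
y = (rng.uniform(size=n) < sigmoid(X @ true_theta)).astype(float)

theta = np.zeros(2)
initial_loss = cross_entropy(theta, X, y)
for _ in range(2000):
    # Gradient of the average cross entropy loss.
    grad = X.T @ (sigmoid(X @ theta) - y) / n
    theta = theta - 0.5 * grad
final_loss = cross_entropy(theta, X, y)

# Model outputs are probabilities; thresholding turns them into classes.
probabilities = sigmoid(X @ theta)
predictions = (probabilities >= 0.5).astype(int)

print(initial_loss, final_loss, theta)
```

Because the loss surface is convex, gradient descent with a suitably small step size heads towards the single best parameter vector regardless of where it starts.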

116 of 126

Precision and Recall

Confusion matrix (rows: prediction, columns: truth):

                    Truth = 1   Truth = 0
  Prediction = 1    TP          FP
  Prediction = 0    FN          TN

Recall = TP / (TP + FN): Of all observations that were actually 1, what proportion did we predict to be 1?

How good is our classifier at detecting points belonging to class 1? Penalizes false negatives.

Precision = TP / (TP + FP): Of all observations that were predicted to be 1, what proportion were actually 1?

How precise is our classifier? Penalizes false positives.

Accuracy = (TP + TN) / n: What proportion of points did our classifier classify correctly?

Doesn’t tell the full story, especially in cases with high class imbalance.
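The three metrics can be computed directly from the confusion-matrix counts. The function below is a small illustrative helper (its name and the example labels are made up); it returns NaN when a metric is undefined, for instance precision when nothing is predicted to be 1.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Confusion-matrix counts.
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    # Precision and recall are undefined when their denominators are zero.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    accuracy = (tp + tn) / y_true.size
    return {"precision": precision, "recall": recall, "accuracy": accuracy}

# Made-up labels: 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)

# Predicting 0 for everything: precision is undefined, recall is 0.
all_negative = classification_metrics(y_true, [0] * 8)
print(all_negative)
```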

117 of 126

Precision and Recall vs. Threshold

Threshold = 0:

  • Everything classified as malignant.
  • Accuracy: 37% (because 37% of examples are malignant)
  • Precision: 37% (there are no true or false negatives)
  • Recall: 100% (every malignant example is caught).

118 of 126

Precision and Recall vs. Threshold

Threshold = 0.2:

  • Everything above 20% is classified as malignant.
  • Accuracy: 81% (81% of the stars are on the correct side of the line).
  • Precision: 67% (67% of the _____________ are _____________)?
  • Recall: 91% (91% of the ______________ are ______________)?

119 of 126

Precision and Recall vs. Threshold

Threshold = 0.2:

  • Everything above 20% is classified as malignant.
  • Accuracy: 81% (81% of the stars are on the correct side of the line).
  • Precision: 67% (67% of the stars above the line are red).
  • Recall: 91% (91% of the red stars are above the line).

120 of 126

Precision and Recall vs. Threshold

Threshold = 0.5:

  • Everything above 50% is classified as malignant.
  • Accuracy: 87% (87% of the stars are on the correct side of the line).
  • Precision: 86% (86% of the stars above the line are red).
  • Recall: 76% (76% of the red stars are above the line).

121 of 126

Precision and Recall vs. Threshold

Threshold = 0.8:

  • Everything above 80% is classified as malignant.
  • Accuracy: 85% (85% of the stars are on the correct side of the line).
  • Precision: 97% (97% of the stars above the line are red).
  • Recall: 61% (61% of the red stars are above the line).

122 of 126

Precision and Recall vs. Threshold

Threshold = 1:

  • Everything classified as benign.
  • Accuracy: 63% (63% of the stars are on the correct side of the line).
  • Precision: Undefined (There are no stars above the line).
  • Recall: 0% (0% of the red stars are above the line).

123 of 126

Precision/Recall Curve

Can plot Precision and Recall as a curve.

  • To be consistent with lecture 20, an earlier version of this slide had recall on y-axis and precision on x-axis. It’s often the other way around (as below).

[Figure: precision-recall curve, with points labeled by threshold from T = 0.1 through T = 0.9.]

124 of 126

Accuracy Curve

125 of 126

Extra

126 of 126

Note to Self: Possible Things to Add

  • Total sum of squares, explained variance.
  • Rank and non-invertibility.
  • Bootstrap.
  • MSE + logistic regression and outliers.