DS100: Fall 2019
Lecture 21 (Josh Hug): Midterm Review
Pre-announcement
Innovative Design:
PCA and SVD
Singular Value Decomposition and PCA
Given a data matrix, SVD decomposes the matrix into U, Σ, and V^T.
Data matrix:
width | length | area | perimeter |
8 | 6 | 48 | 28 |
2 | 4 | 8 | 12 |
1 | 3 | 3 | 8 |
...
2 | 6 | 12 | 16 |
= UΣ (principal components):
PC1 | PC2 | PC3 |
-56 | 4.1 | -0.77 |
-1.4 | -5.6 | 1.6 |
-7.4 | -5.1 | 1.5 |
...
-19 | -6.7 | 3.07 |
x V^T:
-0.15 | -0.13 | -0.81 | -0.55 |
-0.19 | -0.18 | 0.59 | -0.76 |
-0.71 | 0.71 | 0.0008 | 0.0008 |
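A minimal sketch of this factorization in NumPy. The rectangle data is randomly generated for illustration (an assumption, not the lecture's dataset); only the column layout of width, length, area, and perimeter follows the table.

```python
import numpy as np

# Hypothetical rectangles: columns are width, length, area, perimeter.
rng = np.random.default_rng(0)
width = rng.integers(1, 10, size=100).astype(float)
length = rng.integers(1, 10, size=100).astype(float)
X = np.column_stack([width, length, width * length, 2 * width + 2 * length])

# Thin SVD: U is 100x4, s holds the singular values, Vt is 4x4.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The principal components are the columns of U @ diag(s).
pcs = U * s

# Multiplying the principal components by V^T recovers the data matrix.
X_rebuilt = pcs @ Vt
print(np.allclose(X, X_rebuilt))

# perimeter = 2*width + 2*length, so the matrix has rank 3 and the
# fourth singular value is numerically zero.
print(s)
```

Because one column is a linear combination of two others, only three principal components carry information, which matches the three PC columns shown in the table.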
Singular Value Interpretation (Informally)
Informally, the ith singular value tells us how valuable the ith principal component will be in reconstructing our original data.
width | length | area | perimeter |
2.97 | 1.35 | 24.78 | 8.64 |
-3.03 | -0.65 | -15.22 | -7.36 |
-4.03 | -1.65 | -20.22 | -11.36 |
...
-3.03 | 1.35 | -11.22 | -3.36 |
U
VT
Σ =
Singular Value Interpretation (More Formally)
Formally, the ith singular value tells us how much of the variance is captured by the ith principal component.
Variance captured by 1st PC: 197.39² / 100 = 389.63
Variance captured by 2nd PC: 27.43² / 100 = 7.52
Variance captured by 3rd PC: 23.26² / 100 = 5.41
width | length | area | perimeter |
2.97 | 1.35 | 24.78 | 8.64 |
-3.03 | -0.65 | -15.22 | -7.36 |
-4.03 | -1.65 | -20.22 | -11.36 |
...
-3.03 | 1.35 | -11.22 | -3.36 |
Total variance: 402.56
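The arithmetic above (singular value squared, divided by the number of rows) can be checked numerically. This is a sketch on randomly generated rectangle data, which is an assumption; it will not reproduce the lecture's numbers, only the relationship between singular values and variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100  # number of rows, matching the division by 100 above
width = rng.integers(1, 10, size=n).astype(float)
length = rng.integers(1, 10, size=n).astype(float)
X = np.column_stack([width, length, width * length, 2 * width + 2 * length])

# Center each column before computing the SVD.
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance captured by the i-th principal component: s_i^2 / n.
variance_captured = s**2 / n

# Total variance: sum of the per-column variances (dividing by n, not n - 1).
total_variance = X_centered.var(axis=0).sum()

print(variance_captured)
print(total_variance, variance_captured.sum())
```

The captured variances add up to the total variance, just as 389.63 + 7.52 + 5.41 = 402.56 in the example above.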
Principal Component Analysis
PCA is the process of (linearly) transforming data into a new coordinate system such that the greatest variance occurs along the first dimension, the second most along the second dimension, and so on.
Principal Component Analysis in Data Science
Two primary uses for PCA (that we’ve seen):
Principal Component Analysis in Data Science
Reminder: It is important to center your data before using PCA.
Scaling data to have unit variance can also be useful.
Regression and Loss Functions
Data Generation Process
Terms:
Two types of models so far:
Example Linear Regression Task
Given the table below, try to predict the salary of each person.
Linear regression: The prediction and the correct answer are both real values.
Name | Age | Gender | City | Political Affiliation | Rent/own? | Children |
Nico | 45 | M | São Paulo | PSDB | Rent | NaN |
Claudia | 65 | F | Paris | REM | Own | 3 |
Lin | 20 | F | Oakland | Dem | NaN | 0 |
Renzo | 35 | M | Laredo | Dem | Rent | 0 |
Example Logistic Regression Task
Given the table below, try to predict, for each person, the probability that they identify with the United States Democratic Party.
Logistic regression: The prediction is a probability between 0 and 1, and the correct answer is a category (here, 0 or 1).
Name | Age | Gender | City | Salary | Rent/own? | Children |
Nico | 45 | M | São Paulo | 11572 | Rent | NaN |
Claudia | 65 | F | Paris | 156262 | Own | 3 |
Lin | 20 | F | Oakland | 20189 | NaN | 0 |
Renzo | 35 | M | Laredo | 9573 | Rent | 0 |
Loss Functions
A loss function quantifies how far a prediction is from the correct answer. Common loss functions we’ve discussed:
When we train models in DS100, we try to minimize the average loss.
Average loss is also called the “empirical risk”. More on this later.
Average loss: (1/n) Σᵢ loss(yᵢ, ŷᵢ)
Linear Models
Almost all of the regression models we’ve used in our class have been linear.
Above, X is a matrix of all the data we use to make our predictions.
Linear Models
For a fixed X and y, average loss in a model is a function of only the parameter vector.
We use β̂ to represent the optimal value of the parameters. That is:
Reminder: argmin means find the parameter vector β that minimizes the right-hand side.
Linear Models
Since our model is linear in theta, the average loss is convex for both L1 and L2.
The L2 Loss for Linear Models and the Normal Equation
In lecture, we showed that if we use the L2 loss, then the optimal parameters are given by the Normal Equation.
For L2 Loss
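The Normal Equation gives the optimal parameters as β̂ = (XᵀX)⁻¹Xᵀy. Below is a minimal sketch on hypothetical data (the data and variable names are assumptions for illustration). It solves the linear system rather than forming the inverse explicitly, which is the usual numerical practice.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=n)  # hypothetical data

# Design matrix with an intercept column and one feature.
X = np.column_stack([np.ones(n), x])

# Normal equation: solve (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)

# At the optimum, the residuals are orthogonal to every column of X.
print(X.T @ (y - X @ beta_hat))
```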
Nonlinear Models
If our model is nonlinear in theta, we cannot use the normal equation.
Gradient descent update rule:
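The usual update rule is θ(t+1) = θ(t) − α ∇L(θ(t)), where α is the learning rate. The sketch below applies it to a hypothetical one-parameter model sin(θx). The model, the data, the learning rate, and the starting point are all assumptions made for illustration, not the HW7 setup.

```python
import numpy as np

x = np.linspace(0, 5, 50)
y = np.sin(2.0 * x)  # hypothetical noiseless data; the true theta is 2

def avg_loss(theta):
    # Average L2 loss of the model sin(theta * x).
    return np.mean((np.sin(theta * x) - y) ** 2)

def gradient(theta):
    # Derivative of avg_loss with respect to theta (chain rule).
    return np.mean(2 * (np.sin(theta * x) - y) * x * np.cos(theta * x))

def gradient_descent(theta, alpha=0.05, steps=500):
    # Repeatedly step against the gradient.
    for _ in range(steps):
        theta = theta - alpha * gradient(theta)
    return theta

theta_hat = gradient_descent(theta=1.8)
print(theta_hat, avg_loss(theta_hat))
```

Because this loss is not convex in θ, the result depends on the starting point; a start far from 2 may settle in a different local minimum.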
Nonlinear Models: Example [adapted from HW7]
Suppose we are trying to fit the data below with the model shown.
Lots of local minima!
Nonlinear Models: Example [adapted from HW7]
Suppose we are trying to fit the data below with the model shown.
start
end
The L2 Loss for Linear Models and the Normal Equation
Since linear models are so powerful, easy to optimize, relatively easy to understand, and have a nice geometric interpretation, we used them exclusively for regression except for HW7.
For L2 Loss
Feature Functions
It may not be possible to build an accurate linear model that uses “raw data”.
...
Why am I using theta now instead of beta? Just to get you used to both, since both letters are used to represent parameters.
Optimal theta for L2 loss, computed using normal equation.
Feature Engineering
Feature engineering lets us:
One Hot Encoding
One Hot Encoding is a specific feature engineering technique.
One Hot Encoding (Obsolete Slide)
One Hot Encoding is a specific feature engineering technique.
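A minimal sketch of one hot encoding using only NumPy, applied to the City values from the example tables earlier. The variable names are illustrative assumptions.

```python
import numpy as np

city = np.array(["São Paulo", "Paris", "Oakland", "Laredo"])

# np.unique returns the sorted categories and, for each row,
# the index of that row's category.
categories, codes = np.unique(city, return_inverse=True)

# One column per category: 1 where the row belongs to it, 0 elsewhere.
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, so a categorical column becomes a set of numeric columns that a linear model can use.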
The OLS Model
Linear Modeling
Suppose we want to build a linear model that predicts the acceleration of a car from its weight in our mpg dataset.
Simplest Model: The Mean Model
Our simplest possible model is to represent the acceleration by a summary statistic. Choice of summary statistic depends on choice of loss.
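To illustrate how the summary statistic depends on the loss: the mean minimizes average L2 loss, and the median minimizes average L1 loss. The sketch below checks this by brute force on a few made-up acceleration values (hypothetical numbers, not taken from the mpg dataset).

```python
import numpy as np

y = np.array([12.0, 15.5, 11.0, 19.0, 14.5])  # hypothetical accelerations

def avg_l2_loss(c, y):
    return np.mean((y - c) ** 2)

def avg_l1_loss(c, y):
    return np.mean(np.abs(y - c))

# Try every constant prediction on a fine grid (step 0.01).
candidates = np.linspace(y.min(), y.max(), 801)
best_l2 = candidates[np.argmin([avg_l2_loss(c, y) for c in candidates])]
best_l1 = candidates[np.argmin([avg_l1_loss(c, y) for c in candidates])]

print(best_l2, y.mean())      # both close to the mean
print(best_l1, np.median(y))  # both close to the median
```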
Data 8 Approach: Simple Linear Regression
In Data 8, we considered the simple linear regression model:
HW6 Model
In HW6 we considered a model with a slope but no y-intercept:
Linear Model
All three of these are special cases of the general linear regression model.
Linear Model
For which of these models are the residuals guaranteed to sum to zero?
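A quick numerical check of this question, assuming each model is fit by minimizing L2 loss. With L2 loss, the residuals are orthogonal to every column of the design matrix, so they sum to zero whenever the design matrix contains a column of ones. The data below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.uniform(1, 5, size=n)
y = 20.0 - 2.0 * x + rng.normal(0, 1, size=n)  # hypothetical data

def residual_sum(X, y):
    # Fit by least squares, then add up the residuals.
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum(y - X @ theta)

ones = np.ones(n)
mean_model = residual_sum(ones.reshape(-1, 1), y)   # intercept only
slr_model = residual_sum(np.column_stack([ones, x]), y)  # intercept and slope
no_intercept = residual_sum(x.reshape(-1, 1), y)    # slope only

print(mean_model, slr_model, no_intercept)
```

The first two sums are zero up to floating-point error; the slope-only model carries no such guarantee.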
Geometric Interpretation of OLS
Improving Our Model
Suppose we want to do better than the model below.
Thought Experiment
Suppose we use the design matrix shown.
Will this model do better, do worse, or do about the same?
Thought Experiment
Why is this model better?
Thought Experiment
Suppose we add four random features?
Thought Experiment
What if we add even more random features? What eventually happens? Does MSE level off, or does it keep getting better and better?
Linear Algebra
If we have 30 orthogonal columns in our data matrix X, we can trivially fit any 30 observations we want.
Note: Being able to truly understand this slide is why linear algebra is so important.
Alternate View
In this case, since span{X} is the entire possible space of observations, y lies within the span.
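The thought experiment and the linear algebra view can be checked numerically. The sketch below uses 30 hypothetical observations and keeps appending random columns to the design matrix. Training MSE never increases, and it reaches zero (up to floating-point error) once there are 30 linearly independent columns. Random columns are linearly independent with probability 1, which is all the argument needs; they are not exactly orthogonal.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=n)  # hypothetical data

def training_mse(X, y):
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ theta) ** 2)

# Start with an intercept and one real feature.
X = np.column_stack([np.ones(n), x])
mses = [training_mse(X, y)]

# Add one random feature at a time until X is 30 x 30.
while X.shape[1] < n:
    X = np.column_stack([X, rng.normal(size=n)])
    mses.append(training_mse(X, y))

print(mses[0], mses[len(mses) // 2], mses[-1])
```

The zero training error at the end says nothing about how well the model would predict new observations, since the added features are pure noise.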
Linear Models and Overfitting
Having an excessive number of features in our model can lead to overfitting. Two techniques for keeping model complexity under control:
Will return to these, but first let’s consider overfitting in a more precise way.
Bias and Variance
Feature Functions
Earlier we showed how some clever feature engineering can allow for accurate modeling, even on seemingly nonlinear data.
...
Optimal theta for L2 loss, computed using normal equation.
Fundamental Limits
Our prediction below is very good, but isn’t perfect!
Fundamental Limits
It’s not enough to simply grab a bunch of data points and run them through the normal equation.
Too Simple of a Model
This model can’t capture the complexity of the data.
A Better Model
This model does a better job capturing the complexity of the data.
An Overly Complex Model
This model has even lower error, but captures small fluctuations that are unlikely to be meaningful.
Losses of Each Model
Philosophical Question
Our “overly complex” model overfit to the noise. But we proved it had the lowest L2 loss, so why don’t we just use this model anyway?
Training vs. Real Data
Our “overly complex” model overfit to the noise, even though it had the lowest L2 loss on the training data.
Particularly bad prediction!
Training vs. Test Error
Training Error
Naively, building more complex models drives down our error, but...
Training vs. Test Error, Philosophical Question
Want a level of model complexity that minimizes our test error.
Thus, we have a conundrum: How do we minimize the error for something we cannot see?
Probabilistic Models of Bias and Variance
Thought Experiment
Imagine we design a feature matrix, then collect information about 500 random diners and use this data to train an OLS tip model.
Then imagine we collect information about 1 more diner, and compute the loss of our model’s prediction.
The loss will be some number. What are some reasons the loss might be nonzero?
We can repeat this same process over and over, each time yielding a different loss. The “risk” is simply the expected value of this loss.
Probabilistic View of Bias - Variance Tradeoff
First key idea: Assume any observations are created by a “true function” plus some zero-mean noise. Thus, observations are a random variable.
Probabilistic View of Bias - Variance Tradeoff
Second key idea: The thetas that we compute for our model are based on noisy training observations. Thus, our θ̂ is also a random variable.
Example: our model is linear and we train by minimizing L2 loss.
Training data
Probabilistic View of Bias - Variance Tradeoff
Let f be our model that produces a prediction based on data.
Since the loss depends on two random variables, loss is also a random variable.
Since loss is an R.V., we can compute its expectation:
Probabilistic View of Bias - Variance Tradeoff
Now taking into account randomness, our goal would be to minimize this expectation, which we call the risk:
It is not possible to compute the risk (or true risk).
Minimizing this expectation would mean minimizing our loss on even unseen “test data”.
Probabilistic View of Bias - Variance Tradeoff
Risk = Noise + Bias² + Model Variance
Bias
Bias: The expected deviation between the predicted and true values.
A low complexity model is going to have large bias.
Model Variance
Model variance: Variability in predicted value across different training datasets.
A high complexity model tends to have high variance:
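A small simulation of these two definitions. Everything here is an assumption made for illustration: the "true function" is sin, the noise is Gaussian, the training inputs are held fixed across trials, and the low and high complexity models are polynomials of degree 1 and 9. The sketch retrains each model on many noisy training sets and looks at its predictions at a single test input.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(5)
true_fn = np.sin   # assumed true function
sigma = 0.3        # assumed noise standard deviation
x_train = np.linspace(0, 2 * np.pi, 30)
x0 = np.pi / 2     # the single test input
trials = 500

def simulate(degree):
    # Retrain on a fresh noisy training set each trial and predict at x0.
    preds = np.empty(trials)
    for t in range(trials):
        y_train = true_fn(x_train) + rng.normal(0, sigma, size=x_train.size)
        model = Polynomial.fit(x_train, y_train, degree)
        preds[t] = model(x0)
    bias_sq = (preds.mean() - true_fn(x0)) ** 2
    model_variance = preds.var()
    return bias_sq, model_variance

bias_sq_simple, var_simple = simulate(degree=1)
bias_sq_complex, var_complex = simulate(degree=9)

print("degree 1:", bias_sq_simple, var_simple)
print("degree 9:", bias_sq_complex, var_complex)

# Estimated risk at x0 under squared loss: noise + bias^2 + model variance.
print(sigma**2 + bias_sq_simple + var_simple)
print(sigma**2 + bias_sq_complex + var_complex)
```

In this setup the straight line has the larger squared bias and the degree 9 polynomial has the larger model variance.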
Fundamental Limits [Figure from Scikit Learn [Link]]
Bias: Expected deviation between predicted value and true value.
Estimated Model Variance: Variability in predicted value across different datasets.
[Figure regions: High Bias, High Expected Model Variance]
Model Complexity, Bias, and Variance
Questions that arise:
Regularization
Theta Calculation
Previously, without taking model complexity into account, we had:
But minimizing the loss on our training data may lead to overfitting.
Ways to reduce model complexity:
Theta Calculation With Constraints
We can cast this as a “constrained optimization” problem.
Example of a complexity function:
such that:
Theta Calculation With Constraints (Example: L0 Ball)
We can cast this as a “constrained optimization” problem.
Yields the best model with two or fewer non-zero parameters.
such that:
Theta Calculation With Constraints (Example: L1 Ball)
We can cast this as a “constrained optimization” problem.
Yields the best model where the sum of the abs(thetas) is less than 205.5.
such that:
Theta Calculation With Constraints (Example: L2 Ball)
We can cast this as a “constrained optimization” problem.
Yields the best model where the sum of the squares of thetas is less than 40,000.
such that:
Geometric Interpretation of Constrained Optimization Problem
Unconstrained Version of Regularization
We can convert the constrained optimization problem into an unconstrained version.
(See EECS127 for more)
such that:
Think of λR(θ) as a penalty for complex models.
EECS127 Magic
More Generically
Given a complexity function R(θ), we find the theta that minimizes the sum of the loss and λR(θ).
Example:
The Role of Lambda
Lambda helps keep the model in check.
If we pick λ = 0, the model is not constrained at all, and we simply have standard linear regression.
If we pick a large λ, then our parameters are constrained and tend towards zero.
Nice Solution for Ridge Regression (Bonus)
Ridge Regression has a nice analytic solution.
We didn’t prove this in class, but it has been mentioned in lecture / discussion.
Identity matrix
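A sketch of the closed-form ridge solution on hypothetical data. It assumes the objective is the average squared loss plus λ times the sum of squared parameters, in which case the solution is θ̂ = (XᵀX + nλI)⁻¹Xᵀy. If the loss is not divided by n, the nλ term becomes λ. The sketch also penalizes every parameter, which is a simplification.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 60, 5
X = rng.normal(size=(n, p))
true_theta = np.array([3.0, -2.0, 0.0, 1.0, 0.5])  # hypothetical
y = X @ true_theta + rng.normal(0, 1, size=n)

def ridge(X, y, lam):
    # Minimizes (1/n) * ||y - X theta||^2 + lam * ||theta||^2.
    n_rows, n_cols = X.shape
    return np.linalg.solve(X.T @ X + n_rows * lam * np.eye(n_cols), X.T @ y)

lambdas = [0.0, 0.1, 1.0, 10.0]
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(lam, norm)
```

With λ = 0 this is ordinary least squares, and the norm of the parameter vector shrinks as λ grows.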
Regularization and Data Scaling
Reminder: It is very important to scale your data when using regularization.
Cross Validation
The Holdout Method (not discussed in DS100)
One approach: Set aside some training data as “validation data”. For each possible λ, optimize the function below for the training data, then compute the loss on the validation data. Use the λ with lowest validation loss.
Essentially, we use the validation set as a way to evaluate the quality of a hyperparameter.
K-Fold Cross Validation
An alternate strategy for evaluating hyperparameter quality. Split the training data into k folds, then:
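One common way to carry this out, sketched here for choosing λ in ridge regression: for each candidate λ, hold out each fold in turn, train on the remaining folds, compute the loss on the held-out fold, and average over the k folds. The data, the candidate λ values, and the helper names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0, 1, size=n)  # hypothetical data

def ridge(X, y, lam):
    # Minimizes (1/n) * ||y - X theta||^2 + lam * ||theta||^2.
    n_rows, n_cols = X.shape
    return np.linalg.solve(X.T @ X + n_rows * lam * np.eye(n_cols), X.T @ y)

def k_fold_mse(X, y, lam, k=5):
    # Rows are already in random order here, so contiguous folds are fine.
    indices = np.arange(len(y))
    errors = []
    for fold in np.array_split(indices, k):
        train = np.setdiff1d(indices, fold)
        theta = ridge(X[train], y[train], lam)
        errors.append(np.mean((y[fold] - X[fold] @ theta) ** 2))
    return np.mean(errors)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
cv_errors = [k_fold_mse(X, y, lam) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(cv_errors))]

print(cv_errors)
print(best_lambda)
```

Every row serves as validation data exactly once, so no data has to be permanently set aside as it is in the holdout method.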
Bootstrapping: A Cousin of Cross Validation
We also briefly discussed the idea of “bootstrapping” (more of a Data 8 topic).
Regularized Linear Modeling Summary
Our most common technique in this course is to use a regularized linear model.
Modeling More Generally
Logistic Regression
…
Models we’ve seen in this class:
Loss functions:
Regularization terms:
Linear vs. Logistic Regression
In a linear regression model with p features, our goal is to predict a quantitative variable (i.e., some real number) from those features.
In a logistic regression model with p features, our goal is to predict a categorical variable from those features.
Model, Loss, and Regularization
Our three choices of model, loss, and regularization term can be made independently.
Examples of bad combinations:
Logistic Regression with L2 Loss - Not a Good Idea
A logistic model with squared loss can yield bad results.
In lecture, we saw the examples below, where we had large flat regions.
Logistic Regression with L2 Loss - Not a Good Idea
A logistic model with squared loss can yield bad results.
We can also construct even more pathological examples.
Logistic Regression with Cross Entropy Loss
A logistic model with cross entropy loss yields a loss surface that is always convex.
See lecture 20 for more motivation on why we picked cross entropy loss specifically.
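A sketch of the logistic model and the average cross entropy loss, with a spot check of convexity: for a convex function, the loss at the midpoint of two parameter vectors is no larger than the average of the losses at the two endpoints. The data and parameter vectors are randomly generated assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def avg_cross_entropy(theta, X, y):
    # Predicted probability of class 1 for each row.
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(8)
X = rng.normal(size=(50, 3))     # hypothetical features
y = rng.integers(0, 2, size=50)  # hypothetical 0/1 labels

theta_a = rng.normal(size=3)
theta_b = rng.normal(size=3)
midpoint = (theta_a + theta_b) / 2

loss_mid = avg_cross_entropy(midpoint, X, y)
loss_avg = (avg_cross_entropy(theta_a, X, y)
            + avg_cross_entropy(theta_b, X, y)) / 2

print(loss_mid, loss_avg)
```

A single pair of points does not prove convexity; it only illustrates the inequality that a convex loss satisfies everywhere.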
Precision and Recall
Prediction \ Truth | 1 | 0 |
1 | TP | FP |
0 | FN | TN |
Recall = TP / (TP + FN): Of all observations that were actually 1, what proportion did we predict to be 1?
How good is our classifier at detecting points belonging to class 1? Penalizes false negatives.
Precision = TP / (TP + FP): Of all observations that were predicted to be 1, what proportion were actually 1?
How precise is our classifier? Penalizes false positives.
Accuracy = (TP + TN) / (TP + FP + FN + TN): What proportion of points did our classifier classify correctly?
Doesn’t tell the full story, especially in cases with high class imbalance.
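These three metrics can be computed directly from the confusion matrix counts. The labels and predictions below are made up for illustration.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 0])  # hypothetical truth
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 0])  # hypothetical predictions

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(tp, fp, fn, tn)               # 3 1 2 4
print(recall, precision, accuracy)  # 0.6 0.75 0.7
```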
Precision and Recall vs. Threshold
Threshold = 0:
Precision and Recall vs. Threshold
Threshold = 0.2:
Precision and Recall vs. Threshold
Threshold = 0.5:
Precision and Recall vs. Threshold
Threshold = 0.8:
Precision and Recall vs. Threshold
Threshold = 1:
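A sketch of how precision and recall change with the threshold, using made-up labels and predicted probabilities. It assumes we predict 1 when the probability is at least the threshold. Precision is reported as NaN when nothing is predicted to be 1, since it is undefined in that case.

```python
import numpy as np

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
probs = np.array([0.05, 0.15, 0.25, 0.35, 0.45,
                  0.55, 0.65, 0.75, 0.85, 0.95])

def precision_recall(y_true, probs, threshold):
    y_pred = (probs >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    # Precision is undefined when there are no positive predictions.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn)
    return precision, recall

thresholds = [0.0, 0.2, 0.5, 0.8, 1.0]
results = [precision_recall(y_true, probs, t) for t in thresholds]
for t, (precision, recall) in zip(thresholds, results):
    print(t, precision, recall)
```

Raising the threshold can only remove positive predictions, so recall never increases; precision often rises but is not guaranteed to.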
Precision/Recall Curve
Can plot Precision and Recall as a curve.
[Figure: precision-recall curve with points labeled by threshold, from T = 0.1 to T = 0.9]
Accuracy Curve
Extra
Note to Self: Possible Things to Add