Gradient Descent, Feature Engineering
Finishing Optimization. Transforming Data to Improve Our Models.
Data 100/Data 200, Spring 2022 @ UC Berkeley
Josh Hug and Lisa Yan
Lecture 13
Plan for Lectures 12 and 13: Model Implementation
Model Implementation I: sklearn, Gradient Descent
Model Implementation II: Gradient Descent, Feature Engineering (today)
[Figure: the data science lifecycle — Question & Problem Formulation, Data Acquisition, Exploratory Data Analysis, Prediction and Inference, Reports, Decisions, and Solutions]
Today’s Roadmap
Lecture 13, Data 100 Spring 2022
Gradient Descent Wrap-Up
Feature Engineering
Stochastic Gradient Descent
Review: Gradient Descent
Gradient descent algorithm: nudge θ in negative gradient direction until θ converges.
For a model with one parameter (equivalently: input data is one dimensional):
θ: model weight. L: loss function. α: learning rate (ours was constant, but other techniques have α decrease over time). y: true values from the training data.
Update rule: $\theta^{(t+1)} = \theta^{(t)} - \alpha \frac{d}{d\theta} L(\theta^{(t)}, y)$ — the next value for θ is the current θ minus the learning rate times the gradient of the loss function evaluated at the current θ.
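As a concrete illustration, here is a minimal sketch of this update loop in Python. The example loss, its derivative, the step count, and all names are illustrative assumptions, not code from the lecture:

```python
def gradient_descent(grad, theta0, alpha=0.01, n_steps=1000):
    """Repeatedly nudge theta in the negative gradient direction.

    grad:   function returning dL/dtheta at the current theta
    theta0: initial guess for theta
    alpha:  constant learning rate
    """
    theta = theta0
    for _ in range(n_steps):
        theta = theta - alpha * grad(theta)
    return theta

# Example: minimize L(theta) = (theta - 3)^2, whose derivative is 2(theta - 3).
theta_hat = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
print(theta_hat)  # approximately 3
```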
Review: Gradient Descent
Gradient descent algorithm: nudge θ in negative gradient direction until θ converges.
For a model with multiple parameters:
θ: vector of model weights. L: loss function. α: learning rate (ours was constant, but other techniques have α decrease over time). y: true values from the training data.
Update rule: $\vec{\theta}^{(t+1)} = \vec{\theta}^{(t)} - \alpha \nabla_{\vec{\theta}} L(\vec{\theta}^{(t)}, y)$ — the next value for θ is the current θ minus the learning rate times the gradient of the loss function evaluated at the current θ.
Gradient Descent
By repeating this process over and over, you can find a local minimum of the function being optimized.
Batch Gradient Descent
The algorithm we derived in the last class is more verbosely known as “batch gradient descent”.
Impractical in some circumstances: every update computes the gradient of the loss over the entire dataset. Imagine you have billions of data points.
$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta L(\theta^{(t)}, y)$, where the gradient of the loss function is evaluated at the current θ using all of the data.
Mini-Batch Gradient Descent
In mini-batch gradient descent, we only use a subset of the data when computing the gradient. For example, with mini-batches of 10% of the data, each update is
$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta L(\theta^{(t)}, B)$, where the gradient of the loss function is evaluated at the current θ using only the current mini-batch B.
Question: Once we've used every batch once, are we done? Not unless we were lucky!
Question: So what should we do next? Go through the data again.
Each complete pass through the data is called a training epoch.
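A minimal sketch of this idea in Python. The batch size, learning rate, and the form of grad are illustrative assumptions, not the lecture's code; shuffling between epochs is discussed below:

```python
import numpy as np

def minibatch_gradient_descent(X, y, grad, theta0, alpha=0.01,
                               batch_size=32, n_epochs=10):
    """grad(theta, X_batch, y_batch) returns the gradient of the
    average loss computed on the given mini-batch only."""
    theta = theta0
    n = len(y)
    for _ in range(n_epochs):           # one epoch = one full pass over the data
        idx = np.random.permutation(n)  # shuffle the data between epochs
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            theta = theta - alpha * grad(theta, X[batch], y[batch])
    return theta
```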
Interpreting a Gradient Computed from a Mini-Batch
Of note: The gradient we compute using only 10% of our data is not the true gradient! It is a noisy estimate; with randomly chosen batches, it points in the right direction on average.
Batch Size and Sampling
In our example, we used 10% of the dataset as the size of each mini-batch.
Additionally, rather than going in the order of our original dataset, we typically shuffle the data in between training epochs.
Stochastic Gradient Descent
In the most extreme case, we choose a batch size of 1.
Mini-batch gradient descent with a batch size of 1 is called “stochastic gradient descent” (SGD).
Gradient Descent
Stochastic Gradient Descent
Convexity
Gradient Descent Only Finds Local Minima
As we saw, the gradient descent procedure can get stuck in a local minimum.
If a function has a special property called “convexity”, then gradient descent (with a suitable learning rate) is guaranteed to find the global minimum.
Convexity
Formally, f is convex iff, for all a, b in the domain of f and all $t \in [0, 1]$:
$$t f(a) + (1-t) f(b) \geq f\big(ta + (1-t)b\big)$$
In words: the line segment between any two points on the curve lies on or above the curve.
Convexity and Avoidance of Local Minima
For a convex function f, any local minimum is also a global minimum.
Our arbitrary curve from before is not convex:
Not all points on the curve are below the line segment connecting its endpoints!
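As a quick worked example (not from the slides), we can verify the definition for $f(x) = x^2$:
$$t a^2 + (1-t) b^2 - \big(ta + (1-t)b\big)^2 = t(1-t)(a-b)^2 \geq 0,$$
so $f(x) = x^2$ is convex.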
Convexity and Optimization Difficulty
Fitting a Linear Parabolic Model
A Challenge for Linear Models
The plot below shows fuel efficiency (mpg) vs. engine power (hp) for many different models of car.
Simple Linear Regression on MPG Data
If we create a simple linear regression model with hp as our only feature, we obviously can’t capture the nonlinear relationship between mpg and hp.
MSE: 23.94
Fitting a… Parabola?
Just eyeballing this data, it seems that a quadratic model might do a better job on the range of data given.
With these two visual observations (marked on the plot), we can compute the equation for the parabola. Details not shown! The same equation can be written in two different ways. MSE: 21.
Our Model Is Nonlinear in X
Here, we observe that our model is of the form $\hat{y} = \theta_1 + \theta_2 x + \theta_3 x^2$: it is nonlinear in the input x.
The Wrong Approach
The wrong approach would be to entirely abandon our definition of a linear model and try to invent new fitting techniques and libraries for squared models.
Staying Linear with Nonlinear Transformations
Rather than having to create an entirely new conceptual framework, a better solution is simply to add a new squared feature to our model.
If we do this, we can just use the same linear model framework from before!
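For example, a minimal sketch with sklearn. The dataset loading and column names are assumptions (the lecture's data resembles the seaborn "mpg" dataset); this is not the lecture's code:

```python
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Assumption: a dataset resembling the lecture's, with mpg and horsepower columns.
df = sns.load_dataset("mpg").dropna(subset=["horsepower", "mpg"])
df["hp^2"] = df["horsepower"] ** 2  # the new squared feature

# Same linear model framework as before, now with two features.
model = LinearRegression().fit(df[["horsepower", "hp^2"]], df["mpg"])
print(model.intercept_, model.coef_)
```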
Results of Linear Regression on Our Nonlinear Features
Comparing Our Models
My eyeballed parabolic model: MSE 21.
Our linear regression model using hp and hp²: MSE 18.98.
Feature Engineering
Feature Engineering
Feature Engineering is the process of transforming the raw features into more informative features that can be used in modeling or EDA tasks.
Feature engineering allows you to capture domain knowledge, express nonlinear relationships with simple linear models (as we just did with hp²), and encode non-numeric features for use in models (coming up: one hot encoding).
Feature Function
A feature function $\Phi$ takes our original d-dimensional input and transforms it into a p-dimensional input: $\Phi: \mathbb{R}^d \to \mathbb{R}^p$.
Example: Our feature function earlier took our 1-dimensional input and transformed it into a 2-dimensional input.
p is often much greater than d.
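A tiny sketch of such a feature function in Python (names illustrative, not from the lecture):

```python
import numpy as np

def phi(x):
    """Map a 1-dimensional input to the 2-dimensional feature vector (x, x^2)."""
    return np.array([x, x ** 2])

print(phi(150.0))  # [   150.  22500.]
```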
Transformed Data and Linear Models
As we saw in our example earlier, adding a squared feature allowed us to capture a parabolic relationship.
Note that the equation for a linear model that is trained on transformed data is sometimes written using the symbol φ instead of x: $\hat{y} = \theta_1 \phi_1 + \dots + \theta_p \phi_p = \vec{\theta} \cdot \vec{\phi}$.
Feature Functions
Designing feature functions is a major part of data science and machine learning.
Let’s see an example video where Professor Joey Gonzalez takes a 2-dimensional input and transforms it into a 15-dimensional input, allowing him to fit a rather complex surface using a linear model.
High Dimensional Feature Engineering Example (Joey Gonzalez)
One Hot Encoding
Regression Using Non-Numeric Features
We can also perform regression on non-numeric features. For example, for the tips dataset from last lecture, we might want to use the day of the week.
Using Non-Numeric Features: One Hot Encoding
One approach is to use what is known as a “one hot encoding”: for each possible value of the categorical feature (here, the days Thur, Fri, Sat, and Sun), we add a binary column that is 1 if the observation has that value and 0 otherwise.
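A minimal sketch of the encoding (assuming the seaborn tips dataset and pd.get_dummies; the lecture may have used a different tool such as sklearn's OneHotEncoder):

```python
import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")

# One binary column per day of the week: Thur, Fri, Sat, Sun.
day_dummies = pd.get_dummies(tips["day"])
X = pd.concat([tips[["total_bill", "size"]], day_dummies], axis=1)
print(X.head())
```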
Fitting a Model
If we fit a linear model, the result is a 6 dimensional model.
The resulting prediction is of the form $\hat{\text{tip}} = \theta_1 \cdot \text{total\_bill} + \theta_2 \cdot \text{size} + \theta_3 \cdot \text{Thur} + \theta_4 \cdot \text{Fri} + \theta_5 \cdot \text{Sat} + \theta_6 \cdot \text{Sun}$, with one fitted coefficient per feature.
Test Your Understanding
With the 6-dimensional model above: what tip would the model predict for a party of 3 with a $50 check eating on a Thursday? (Plug in total_bill = 50, size = 3, Thur = 1, and 0 for the other day columns.)
Verifying in Python
We can check the model's prediction for the party of 3 with a $50 check on a Thursday by fitting the model in code and calling predict.
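The lecture's verification code isn't reproduced in this transcript; a sketch under the same assumptions as the encoding above (fit_intercept=False is an assumption, consistent with the four day columns playing the role of the intercept):

```python
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression

tips = sns.load_dataset("tips")
X = pd.concat([tips[["total_bill", "size"]],
               pd.get_dummies(tips["day"])], axis=1)

# No separate intercept: the four day columns absorb it.
model = LinearRegression(fit_intercept=False).fit(X, tips["tip"])

# Party of 3, $50 check, Thursday.
new_point = pd.DataFrame([[50, 3, 1, 0, 0, 0]], columns=X.columns)
print(model.predict(new_point))
```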
Interpreting the 6 Dimensional Model
It turns out the MSE for this 6 dimensional model is 1.01.
This model makes slightly better predictions on this training set than the model without day features, but it likely does not represent the true nature of the data generating process.
An Alternate Approach
Another approach is to fit a separate model to each condition (here, one model per day of the week).
High Order Polynomial Example
Cubic Fit
Let’s return to where we started today: Creating higher order features for the mpg dataset. An interesting question arises: What happens if we add a feature corresponding to the horsepower cubed?
Let’s try it out:
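A sketch of adding the cubed feature (same assumed dataset and column names as before; not the lecture's code):

```python
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = sns.load_dataset("mpg").dropna(subset=["horsepower", "mpg"])
df["hp^2"] = df["horsepower"] ** 2
df["hp^3"] = df["horsepower"] ** 3  # the new cubed feature

features = df[["horsepower", "hp^2", "hp^3"]]
model = LinearRegression().fit(features, df["mpg"])
print(mean_squared_error(df["mpg"], model.predict(features)))
```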
Cubic Fit Results
We observe a small improvement in MSE.
Going Even Higher Order
As we increase model complexity, MSE drops from 60.76 to 23.94 to … 18.43.
The code that I used to generate these models is in the lecture notebook; it uses two out-of-scope syntax concepts. See the notebook for today if you're curious.
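The notebook's code isn't reproduced in this transcript; here is a hypothetical reconstruction of fitting models of increasing degree in a loop (data loading and names are assumptions):

```python
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = sns.load_dataset("mpg").dropna(subset=["horsepower", "mpg"])
hp = df[["horsepower"]].to_numpy()
mpg = df["mpg"].to_numpy()

for degree in range(8):
    # Features hp^1, ..., hp^degree; degree 0 gets a constant column
    # (an empty feature matrix would break fit).
    if degree == 0:
        X = np.ones_like(hp)
    else:
        X = np.hstack([hp ** k for k in range(1, degree + 1)])
    model = LinearRegression().fit(X, mpg)
    print(degree, mean_squared_error(mpg, model.predict(X)))
```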
Variance and Training Error
Error vs. Complexity
As we increase the complexity of our model, we see that the error on our training data (also called the Training Error) decreases.
Going Even Higher Order
As we increase model complexity, MSE drops from 60.76 to 23.94 to … 18.43. At the same time, the fit curve grows increasingly erratic and sensitive to the data.
Example on a Subset of the Data
On top, we see the results of fitting two very similar datasets using an order-2 model ($\theta_1 + \theta_2 x + \theta_3 x^2$). The resulting fits (model parameters) are close.
On bottom, we see the results of fitting the same datasets using an order-6 model ($\theta_1 + \dots + \theta_7 x^6$). We see very different predictions, especially for hp around 170.
In ML, this sensitivity to the data is known as “variance”.
Error vs. Complexity
As we increase the complexity of our model, training error decreases, but the model's variance (its sensitivity to the particular data it was trained on) increases.
Overfitting
Four Parameter Model with Four Data Points
Interesting fact: Given N data points, we can always find a polynomial of degree N−1 that goes through all those points (as long as no point is directly above any other, i.e., all x values are distinct).
Example: There exist $\theta_1, \theta_2, \theta_3, \theta_4$ such that $\hat{y} = \theta_1 + \theta_2 x + \theta_3 x^2 + \theta_4 x^3$ goes through all four of these points.
Just solve the system of equations: one equation $\theta_1 + \theta_2 x_i + \theta_3 x_i^2 + \theta_4 x_i^3 = y_i$ per data point $(x_i, y_i)$.
Reminder: Solving a System of Linear Equations is Equivalent to Matrix Inversion
Solving our linear equations is equivalent to a matrix inversion.
Specifically, we're solving $\mathbb{Y} = \mathbb{X}\theta$, where $\mathbb{Y}$ is the predictions, $\mathbb{X}$ is the features, and $\theta$ is the parameters; with a square, invertible $\mathbb{X}$, the solution is $\theta = \mathbb{X}^{-1}\mathbb{Y}$.
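A sketch in numpy (the four data points are made up for illustration):

```python
import numpy as np

# Four made-up points with distinct x values.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 1.0, 4.0, 1.0])

# Feature matrix with columns 1, x, x^2, x^3 (a Vandermonde matrix).
X = np.vander(x, N=4, increasing=True)

theta = np.linalg.solve(X, y)  # solves X @ theta = y
print(X @ theta)               # reproduces y exactly (up to float error)
```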
In sklearn
Can also do this in sklearn:
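The slide's sklearn code isn't shown in this transcript; a sketch of one way to do it (fit_intercept=False because the constant column is already a feature):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 1.0, 4.0, 1.0])
X = np.vander(x, N=4, increasing=True)  # columns 1, x, x^2, x^3

model = LinearRegression(fit_intercept=False).fit(X, y)
print(model.predict(X))  # passes through all four points
```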
The Danger of Overfitting
This principle generalizes. If we have 100 data points with only a single feature, we can always generate 99 more features from the original feature, then fit a 100-parameter model that perfectly fits our data.
The problem we’re facing here is “overfitting”. Our model is effectively just memorizing existing data and cannot handle new situations at all.
To get a better handle on this problem, let’s build a model that perfectly fits 6 randomly chosen vehicles from our fuel efficiency dataset.
Model Sensitivity in Action
No matter which vehicles we pick, we'll almost always get an essentially perfect fit (the exception: samples where two vehicles share the same hp, since no polynomial can pass through one point directly above another).
Comparing a Fit On Our Six Data Points with the Full Data Set
Consider the model on the left, generated from a sample of six data points. When overlaid on our full data set, we see that our predictions are terrible.
Detecting Overfitting
Our 35 Samples
Consider fitting models on a sample of only 35 data points.
Fitting Various Degree Models
If we fit models of degree 0 through 7 to these 35 points, the MSE is as shown below.
Visualizing the Models
Below we show the order 0, 1, 2, and 6 models.
An Intuitively Overfit Model
Intuitively, the degree 6 model below feels like it is overfit.
Collecting More Data to Prove a Model is Overfit
Suppose we collect 9 new data points (shown in orange alongside the original 35). We can compute the MSE of our original models on these new points without refitting.
Collecting More Data to Prove a Model is Overfit
Suppose we have 7 models and don’t know which is best.
We could wait for more data and see which of our 7 models does best on the new points.
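A sketch of this evaluation: fit once on the training points, then compute MSE on the new points without refitting. The data here is made up (stand-ins for the lecture's 35 original and 9 new points):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Made-up stand-ins: 35 training points and 9 new points from the same process.
x_train = rng.uniform(0, 10, 35)
y_train = 3 * x_train + rng.normal(0, 2, 35)
x_new = rng.uniform(0, 10, 9)
y_new = 3 * x_new + rng.normal(0, 2, 9)

def featurize(x, degree):
    # Columns 1, x, ..., x^degree; fit_intercept=False since 1 is included.
    return np.vander(x, N=degree + 1, increasing=True)

for degree in range(8):
    model = LinearRegression(fit_intercept=False).fit(
        featurize(x_train, degree), y_train)
    mse_new = mean_squared_error(y_new, model.predict(featurize(x_new, degree)))
    print(degree, round(mse_new, 2))
```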
Gradient Descent, Feature Engineering
Content credit: Josh Hug, Joseph Gonzalez
Lecture 13