Gradient Descent, Feature Engineering
Finishing Optimization. Transforming Data to Improve Our Models.
Data 100/Data 200, Spring 2022 @ UC Berkeley
Josh Hug and Lisa Yan
Lecture 13
Plan for Lectures 12 and 13: Model Implementation
Model Implementation I: sklearn, Gradient Descent
Model Implementation II: Gradient Descent, Feature Engineering (today)
[Figure: the data science lifecycle — Question & Problem Formulation, Data Acquisition, Exploratory Data Analysis, Prediction and Inference, Reports, Decisions, and Solutions]
Today’s Roadmap
Lecture 13, Data 100 Spring 2022
Gradient Descent Wrap-Up
Feature Engineering
Stochastic Gradient Descent
Review: Gradient Descent
Gradient descent algorithm: nudge θ in negative gradient direction until θ converges.
For a model with one parameter (equivalently: input data is one dimensional):
θ: model weight. L: loss function. α: learning rate (ours was constant, but other techniques have α decrease over time). y: true values from the training data.
Update rule: $\theta^{(t+1)} = \theta^{(t)} - \alpha \frac{d}{d\theta} L(\theta^{(t)}, y)$ — the next value for θ is the current θ minus the learning rate times the gradient of the loss function evaluated at the current θ.
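As a concrete illustration, here is a minimal sketch of this update loop in Python. The example loss, its derivative, the step count, and all names are illustrative assumptions, not code from the lecture:

```python
def gradient_descent(grad, theta0, alpha=0.01, n_steps=1000):
    """Repeatedly nudge theta in the negative gradient direction.

    grad:   function returning dL/dtheta at the current theta
    theta0: initial guess for theta
    alpha:  constant learning rate
    """
    theta = theta0
    for _ in range(n_steps):
        theta = theta - alpha * grad(theta)
    return theta

# Example: minimize L(theta) = (theta - 3)^2, whose derivative is 2(theta - 3).
theta_hat = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
print(theta_hat)  # approximately 3
```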
Review: Gradient Descent
Gradient descent algorithm: nudge θ in negative gradient direction until θ converges.
For a model with multiple parameters:
θ: vector of model weights. L: loss function. α: learning rate (ours was constant, but other techniques have α decrease over time). y: true values from the training data.
Update rule: $\vec{\theta}^{(t+1)} = \vec{\theta}^{(t)} - \alpha \nabla_{\vec{\theta}} L(\vec{\theta}^{(t)}, y)$ — the next value for θ is the current θ minus the learning rate times the gradient of the loss function evaluated at the current θ.
Gradient Descent
By repeating this process over and over, you can find a local minimum of the function being optimized.
Batch Gradient Descent
The algorithm we derived in the last class is more verbosely known as “batch gradient descent”.
Impractical in some circumstances: every update computes the gradient of the loss over the entire dataset. Imagine you have billions of data points.
$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta L(\theta^{(t)}, y)$, where the gradient of the loss function is evaluated at the current θ using all of the data.
Mini-Batch Gradient Descent
In mini-batch gradient descent, we only use a subset of the data when computing the gradient. For example, with mini-batches of 10% of the data, each update is
$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta L(\theta^{(t)}, B)$, where the gradient of the loss function is evaluated at the current θ using only the current mini-batch B.
Question: Once we've used every batch once, are we done? Not unless we were lucky!
Question: So what should we do next? Go through the data again.
Each complete pass through the data is called a training epoch.
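A minimal sketch of this idea in Python. The batch size, learning rate, and the form of grad are illustrative assumptions, not the lecture's code; shuffling between epochs is discussed below:

```python
import numpy as np

def minibatch_gradient_descent(X, y, grad, theta0, alpha=0.01,
                               batch_size=32, n_epochs=10):
    """grad(theta, X_batch, y_batch) returns the gradient of the
    average loss computed on the given mini-batch only."""
    theta = theta0
    n = len(y)
    for _ in range(n_epochs):           # one epoch = one full pass over the data
        idx = np.random.permutation(n)  # shuffle the data between epochs
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            theta = theta - alpha * grad(theta, X[batch], y[batch])
    return theta
```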
Interpreting a Gradient Computed from a Mini-Batch
Of note: The gradient we compute using only 10% of our data is not the true gradient! It is a noisy estimate; with randomly chosen batches, it points in the right direction on average.
Batch Size and Sampling
In our example, we used 10% of the dataset as the size of each mini-batch.
Additionally, rather than going in the order of our original dataset, we typically shuffle the data in between training epochs.
Stochastic Gradient Descent
In the most extreme case, we choose a batch size of 1.
Mini-batch gradient descent with a batch size of 1 is called “stochastic gradient descent” (SGD).
Gradient Descent
Stochastic Gradient Descent
Convexity
Gradient Descent Only Finds Local Minima
As we saw, the gradient descent procedure can get stuck in a local minimum.
If a function has a special property called “convexity”, then gradient descent (with a suitable learning rate) is guaranteed to find the global minimum.
Convexity
Formally, f is convex iff, for all a, b in the domain of f and all $t \in [0, 1]$:
$$t f(a) + (1-t) f(b) \geq f\big(ta + (1-t)b\big)$$
In words: the line segment between any two points on the curve lies on or above the curve.
Convexity and Avoidance of Local Minima
For a convex function f, any local minimum is also a global minimum.
Our arbitrary curve from before is not convex:
Not all points on the curve are below the line segment connecting its endpoints!
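As a quick worked example (not from the slides), we can verify the definition for $f(x) = x^2$:
$$t a^2 + (1-t) b^2 - \big(ta + (1-t)b\big)^2 = t(1-t)(a-b)^2 \geq 0,$$
so $f(x) = x^2$ is convex.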
Convexity and Optimization Difficulty
Fitting a Linear Parabolic Model
A Challenge for Linear Models
The plot below shows fuel efficiency (mpg) vs. engine power (hp) for many different models of car.
Simple Linear Regression on MPG Data
If we create a simple linear regression model with hp as our only feature, we obviously can’t capture the nonlinear relationship between mpg and hp.
MSE: 23.94
Fitting a… Parabola?
Just eyeballing this data, it seems that a quadratic model might do a better job on the range of data given.
With these two visual observations (marked on the plot), we can compute the equation for the parabola. Details not shown! The same equation can be written in two different ways. MSE: 21.
Our Model Is Nonlinear in X
Here, we observe that our model is of the form $\hat{y} = \theta_1 + \theta_2 x + \theta_3 x^2$: it is nonlinear in the input x.
The Wrong Approach
The wrong approach would be to entirely abandon our definition of a linear model and try to invent new fitting techniques and libraries for squared models.
Staying Linear with Nonlinear Transformations
Rather than having to create an entirely new conceptual framework, a better solution is simply to add a new squared feature to our model.
If we do this, we can just use the same linear model framework from before!
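For example, a minimal sketch with sklearn. The dataset loading and column names are assumptions (the lecture's data resembles the seaborn "mpg" dataset); this is not the lecture's code:

```python
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Assumption: a dataset resembling the lecture's, with mpg and horsepower columns.
df = sns.load_dataset("mpg").dropna(subset=["horsepower", "mpg"])
df["hp^2"] = df["horsepower"] ** 2  # the new squared feature

# Same linear model framework as before, now with two features.
model = LinearRegression().fit(df[["horsepower", "hp^2"]], df["mpg"])
print(model.intercept_, model.coef_)
```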
Results of Linear Regression on Our Nonlinear Features
Comparing Our Models
My eyeballed parabolic model: MSE 21.
Our linear regression model using hp and hp²: MSE 18.98.
Feature Engineering
Feature Engineering
Feature Engineering is the process of transforming the raw features into more informative features that can be used in modeling or EDA tasks.
Feature engineering allows you to capture domain knowledge, express nonlinear relationships with simple linear models (as we just did with hp²), and encode non-numeric features for use in models (coming up: one hot encoding).
Feature Function
A feature function $\Phi$ takes our original d-dimensional input and transforms it into a p-dimensional input: $\Phi: \mathbb{R}^d \to \mathbb{R}^p$.
Example: Our feature function earlier took our 1-dimensional input and transformed it into a 2-dimensional input.
p is often much greater than d.
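A tiny sketch of such a feature function in Python (names illustrative, not from the lecture):

```python
import numpy as np

def phi(x):
    """Map a 1-dimensional input to the 2-dimensional feature vector (x, x^2)."""
    return np.array([x, x ** 2])

print(phi(150.0))  # [   150.  22500.]
```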
Transformed Data and Linear Models
As we saw in our example earlier, adding a squared feature allowed us to capture a parabolic relationship.
Note that the equation for a linear model that is trained on transformed data is sometimes written using the symbol φ instead of x: $\hat{y} = \theta_1 \phi_1 + \dots + \theta_p \phi_p = \vec{\theta} \cdot \vec{\phi}$.
Feature Functions
Designing feature functions is a major part of data science and machine learning.
Let’s see an example video where Professor Joey Gonzalez takes a 2-dimensional input and transforms it into a 15-dimensional input, allowing him to fit a rather complex surface using a linear model.
High Dimensional Feature Engineering Example (Joey Gonzalez)
One Hot Encoding
Regression Using Non-Numeric Features
We can also perform regression on non-numeric features. For example, for the tips dataset from last lecture, we might want to use the day of the week.
Using Non-Numeric Features: One Hot Encoding
One approach is to use what is known as a “one hot encoding”: for each possible value of the categorical feature (here, the days Thur, Fri, Sat, and Sun), we add a binary column that is 1 if the observation has that value and 0 otherwise.
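A minimal sketch of the encoding (assuming the seaborn tips dataset and pd.get_dummies; the lecture may have used a different tool such as sklearn's OneHotEncoder):

```python
import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")

# One binary column per day of the week: Thur, Fri, Sat, Sun.
day_dummies = pd.get_dummies(tips["day"])
X = pd.concat([tips[["total_bill", "size"]], day_dummies], axis=1)
print(X.head())
```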
Fitting a Model
If we fit a linear model, the result is a 6 dimensional model.
The resulting prediction is of the form $\hat{\text{tip}} = \theta_1 \cdot \text{total\_bill} + \theta_2 \cdot \text{size} + \theta_3 \cdot \text{Thur} + \theta_4 \cdot \text{Fri} + \theta_5 \cdot \text{Sat} + \theta_6 \cdot \text{Sun}$, with one fitted coefficient per feature.
Test Your Understanding
With the 6-dimensional model above: what tip would the model predict for a party of 3 with a $50 check eating on a Thursday? (Plug in total_bill = 50, size = 3, Thur = 1, and 0 for the other day columns.)
Verifying in Python
We can check the model's prediction for the party of 3 with a $50 check on a Thursday by fitting the model in code and calling predict.
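The lecture's verification code isn't reproduced in this transcript; a sketch under the same assumptions as the encoding above (fit_intercept=False is an assumption, consistent with the four day columns playing the role of the intercept):

```python
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression

tips = sns.load_dataset("tips")
X = pd.concat([tips[["total_bill", "size"]],
               pd.get_dummies(tips["day"])], axis=1)

# No separate intercept: the four day columns absorb it.
model = LinearRegression(fit_intercept=False).fit(X, tips["tip"])

# Party of 3, $50 check, Thursday.
new_point = pd.DataFrame([[50, 3, 1, 0, 0, 0]], columns=X.columns)
print(model.predict(new_point))
```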
Interpreting the 6 Dimensional Model
It turns out the MSE for this 6 dimensional model is 1.01.
This model makes slightly better predictions on this training set than the model without day features, but it likely does not represent the true nature of the data generating process.
An Alternate Approach
Another approach is to fit a separate model to each condition (here, one model per day of the week).
High Order Polynomial Example
Cubic Fit
Let’s return to where we started today: Creating higher order features for the mpg dataset. An interesting question arises: What happens if we add a feature corresponding to the horsepower cubed?
Let’s try it out:
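A sketch of adding the cubed feature (same assumed dataset and column names as before; not the lecture's code):

```python
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = sns.load_dataset("mpg").dropna(subset=["horsepower", "mpg"])
df["hp^2"] = df["horsepower"] ** 2
df["hp^3"] = df["horsepower"] ** 3  # the new cubed feature

features = df[["horsepower", "hp^2", "hp^3"]]
model = LinearRegression().fit(features, df["mpg"])
print(mean_squared_error(df["mpg"], model.predict(features)))
```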
Cubic Fit Results
We observe a small improvement in MSE.
Going Even Higher Order
As we increase model complexity, MSE drops from 60.76 to 23.94 to … 18.43.
The code that I used to generate these models is in the lecture notebook; it uses two out-of-scope syntax concepts. See the notebook for today if you're curious.
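The notebook's code isn't reproduced in this transcript; here is a hypothetical reconstruction of fitting models of increasing degree in a loop (data loading and names are assumptions):

```python
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = sns.load_dataset("mpg").dropna(subset=["horsepower", "mpg"])
hp = df[["horsepower"]].to_numpy()
mpg = df["mpg"].to_numpy()

for degree in range(8):
    # Features hp^1, ..., hp^degree; degree 0 gets a constant column
    # (an empty feature matrix would break fit).
    if degree == 0:
        X = np.ones_like(hp)
    else:
        X = np.hstack([hp ** k for k in range(1, degree + 1)])
    model = LinearRegression().fit(X, mpg)
    print(degree, mean_squared_error(mpg, model.predict(X)))
```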
Variance and Training Error
Error vs. Complexity
As we increase the complexity of our model, we see that the error on our training data (also called the Training Error) decreases.
Going Even Higher Order
As we increase model complexity, MSE drops from 60.76 to 23.94 to … 18.43. At the same time, the fit curve grows increasingly erratic and sensitive to the data.
Example on a Subset of the Data
On top, we see the results of fitting two very similar datasets using an order-2 model ($\theta_1 + \theta_2 x + \theta_3 x^2$). The resulting fits (model parameters) are close.
On bottom, we see the results of fitting the same datasets using an order-6 model ($\theta_1 + \dots + \theta_7 x^6$). We see very different predictions, especially for hp around 170.
In ML, this sensitivity to the data is known as “variance”.
Error vs. Complexity
As we increase the complexity of our model, training error decreases, but the model's variance (its sensitivity to the particular data it was trained on) increases.
Overfitting
Four Parameter Model with Four Data Points
Interesting fact: Given N data points, we can always find a polynomial of degree N−1 that goes through all those points (as long as no point is directly above any other, i.e., all x values are distinct).
Example: There exist $\theta_1, \theta_2, \theta_3, \theta_4$ such that $\hat{y} = \theta_1 + \theta_2 x + \theta_3 x^2 + \theta_4 x^3$ goes through all four of these points.
Just solve the system of equations: one equation $\theta_1 + \theta_2 x_i + \theta_3 x_i^2 + \theta_4 x_i^3 = y_i$ per data point $(x_i, y_i)$.
Reminder: Solving a System of Linear Equations is Equivalent to Matrix Inversion
Solving our linear equations is equivalent to a matrix inversion.
Specifically, we're solving $\mathbb{Y} = \mathbb{X}\theta$, where $\mathbb{Y}$ is the predictions, $\mathbb{X}$ is the features, and $\theta$ is the parameters; with a square, invertible $\mathbb{X}$, the solution is $\theta = \mathbb{X}^{-1}\mathbb{Y}$.
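A sketch in numpy (the four data points are made up for illustration):

```python
import numpy as np

# Four made-up points with distinct x values.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 1.0, 4.0, 1.0])

# Feature matrix with columns 1, x, x^2, x^3 (a Vandermonde matrix).
X = np.vander(x, N=4, increasing=True)

theta = np.linalg.solve(X, y)  # solves X @ theta = y
print(X @ theta)               # reproduces y exactly (up to float error)
```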
In sklearn
Can also do this in sklearn:
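The slide's sklearn code isn't shown in this transcript; a sketch of one way to do it (fit_intercept=False because the constant column is already a feature):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 1.0, 4.0, 1.0])
X = np.vander(x, N=4, increasing=True)  # columns 1, x, x^2, x^3

model = LinearRegression(fit_intercept=False).fit(X, y)
print(model.predict(X))  # passes through all four points
```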
The Danger of Overfitting
This principle generalizes. If we have 100 data points with only a single feature, we can always generate 99 more features from the original feature, then fit a 100-parameter model that perfectly fits our data.
The problem we’re facing here is “overfitting”. Our model is effectively just memorizing existing data and cannot handle new situations at all.
To get a better handle on this problem, let’s build a model that perfectly fits 6 randomly chosen vehicles from our fuel efficiency dataset.
Model Sensitivity in Action
No matter which vehicles we pick, we'll almost always get an essentially perfect fit (the exception: samples where two vehicles share the same hp, since no polynomial can pass through one point directly above another).
Comparing a Fit On Our Six Data Points with the Full Data Set
Consider the model on the left, generated from a sample of six data points. When overlaid on our full data set, we see that our predictions are terrible.
Detecting Overfitting
Our 35 Samples
Consider fitting models on a sample of only 35 data points.
Fitting Various Degree Models
If we fit models of degree 0 through 7 to these 35 points, the MSE is as shown below.
Visualizing the Models
Below we show the order 0, 1, 2, and 6 models.
An Intuitively Overfit Model
Intuitively, the degree 6 model below feels like it is overfit.
Collecting More Data to Prove a Model is Overfit
Suppose we collect 9 new data points (shown in orange alongside the original 35). We can compute the MSE of our original models on these new points without refitting.
Collecting More Data to Prove a Model is Overfit
Suppose we have 7 models and don’t know which is best.
We could wait for more data and see which of our 7 models does best on the new points.
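A sketch of this evaluation: fit once on the training points, then compute MSE on the new points without refitting. The data here is made up (stand-ins for the lecture's 35 original and 9 new points):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Made-up stand-ins: 35 training points and 9 new points from the same process.
x_train = rng.uniform(0, 10, 35)
y_train = 3 * x_train + rng.normal(0, 2, 35)
x_new = rng.uniform(0, 10, 9)
y_new = 3 * x_new + rng.normal(0, 2, 9)

def featurize(x, degree):
    # Columns 1, x, ..., x^degree; fit_intercept=False since 1 is included.
    return np.vander(x, N=degree + 1, increasing=True)

for degree in range(8):
    model = LinearRegression(fit_intercept=False).fit(
        featurize(x_train, degree), y_train)
    mse_new = mean_squared_error(y_new, model.predict(featurize(x_new, degree)))
    print(degree, round(mse_new, 2))
```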
Gradient Descent, Feature Engineering
Content credit: Josh Hug, Joseph Gonzalez
Lecture 13