Gradient Descent
(Reading: Ch 11)
(Slides adapted from Sandrine Dudoit and Joey Gonzalez)
UC Berkeley Data 100 Summer 2019
Sam Lau
Learning goals:
Announcements
So Far:
Gradients
Derivatives
Partial Derivative
Gradients
You Try:
Find the gradient of f(θ, x) w.r.t θ. Then find the gradient w.r.t x.
You Try:
Notice how the gradient looks like an ordinary derivative?
This happens a lot! But you should always start by assembling the vector of partial derivatives, as we just did.
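To see that each component of the gradient really is a partial derivative, here is a minimal sketch using a made-up example function (not one from the slides), checked against finite differences:

```python
import numpy as np

# Hypothetical example function: f(θ) = θ0² + 3·θ0·θ1
def f(theta):
    return theta[0] ** 2 + 3 * theta[0] * theta[1]

def grad_f(theta):
    # Gradient assembled as a vector of partial derivatives:
    # ∂f/∂θ0 = 2θ0 + 3θ1,   ∂f/∂θ1 = 3θ0
    return np.array([2 * theta[0] + 3 * theta[1], 3 * theta[0]])

def numerical_grad(f, theta, eps=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(theta, dtype=float)
    for i in range(len(theta)):
        e = np.zeros_like(theta, dtype=float)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

theta = np.array([1.0, 2.0])
print(grad_f(theta))             # analytic gradient: [8. 3.]
print(numerical_grad(f, theta))  # should agree closely
```

A numerical check like this is a handy way to catch sign errors when you derive gradients by hand.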
How to Interpret Gradients
Gradient Descent
Using the Gradient
Using the Gradient
Batch Gradient Descent
θ^(t+1) = θ^(t) − ⍺ · ∇_θ L(θ^(t), y)

θ: Model weights
L: Loss function
⍺: Learning rate, usually a small constant
y: True values from the training data

Next value for θ: θ^(t+1)
Gradient of loss w.r.t. θ: ∇_θ L(θ^(t), y)
Learning rate: ⍺
Gradient Descent Algorithm
Gradient Descent for MSE
(demo)
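In the spirit of the demo, here is a minimal sketch of gradient descent for MSE, assuming the simplest model (a constant model that predicts ŷ = θ for every point, so dL/dθ = −2·mean(y − θ)):

```python
import numpy as np

# Constant model: predict ŷ = θ for every point.
# MSE loss: L(θ) = mean((y − θ)²), with gradient dL/dθ = −2·mean(y − θ).
def minimize_mse(y, theta0=0.0, alpha=0.1, n_steps=100):
    theta = theta0
    for _ in range(n_steps):
        grad = -2 * np.mean(y - theta)  # gradient of the loss w.r.t. θ
        theta = theta - alpha * grad    # gradient descent update
    return theta

y = np.array([1.0, 2.0, 6.0])
print(minimize_mse(y))  # converges to the mean of y, 3.0
```

The minimizer of MSE for a constant model is the mean of y, which is what the iterates converge to.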
The Learning Rate
(demo)
You Try:
Derive the gradient descent rule for a linear model with two model weights, no bias term, and MSE loss. Start by taking the gradient for a single data point.
𝓁 means loss for a single point; L means average loss for the dataset.
You Try:
The gradient for the entire dataset is the average of the gradients for each point, so we can run GD as-is.
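This averaging claim can be checked numerically. A sketch for a two-weight linear model with no bias and MSE loss, where each per-point gradient is ∇𝓁ᵢ(θ) = −2(yᵢ − xᵢ·θ)·xᵢ (the data here are randomly generated, not the demo's):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))   # 5 points, 2 weights, no bias
y = rng.normal(size=5)
theta = np.array([0.5, -1.0])

# Per-point gradients of 𝓁_i(θ) = (y_i − x_i·θ)²
per_point = np.array([-2 * (y[i] - X[i] @ theta) * X[i] for i in range(5)])
avg_grad = per_point.mean(axis=0)

# Gradient of the average loss L directly: ∇L(θ) = −(2/n)·Xᵀ(y − Xθ)
full_grad = -2 / len(y) * X.T @ (y - X @ theta)
print(np.allclose(avg_grad, full_grad))  # True
```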
In Matrix Form
There’s usually a succinct matrix form for gradients. You should be able to convert back and forth.
Once we have matrix form, we can use p dimensions (not just 2)
Last line is bonus. Don’t stress if you can’t find it.
Update Rule
(demo)
Although we started off with only two variables, matrix notation extends our derivation to any number of variables.
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD)
Stochastic GD:
Batch GD:
SGD Algorithm
Each individual update is called an iteration.
Each full pass through the data is called an epoch.
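The iteration/epoch distinction can be sketched in code. This is a minimal SGD loop for a linear model under MSE (made-up data, fixed learning rate), where each inner-loop update uses one point's gradient:

```python
import numpy as np

def sgd(X, y, alpha=0.01, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):              # one epoch = full pass over the data
        for i in rng.permutation(len(y)):  # shuffle the order each epoch
            # Single-point gradient: no average over the dataset
            grad_i = -2 * (y[i] - X[i] @ theta) * X[i]
            theta = theta - alpha * grad_i  # one iteration = one update
    return theta

X = np.array([[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]])
y = X @ np.array([2.0, -1.0])
print(sgd(X, y))  # meanders toward [2, -1]
```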
SGD Example
SGD update rules look like batch GD update rules, just without averaging over the whole dataset.
(demo)
Why does SGD perform well?
Batch GD always moves toward the minimum:
Stochastic GD meanders
Convexity
Gradient Descent Only Finds Local Minima
Gradient Descent Only Finds Local Minima
Convexity
Convexity
Convexity
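The standard definition behind these slides can be written out as:

```latex
% A function f is convex if, for all points a, b and all t in [0, 1]:
f\bigl(t\,a + (1 - t)\,b\bigr) \;\le\; t\,f(a) + (1 - t)\,f(b)
% Geometrically: the line segment between any two points on the graph of f
% lies on or above the graph. For a convex loss, every local minimum is a
% global minimum, so gradient descent cannot get stuck in a bad valley.
```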
Gradient Descent Considerations
Choosing a Loss Function for GD
Huber Loss
Huber Loss
Linear when θ is far from the point
Quadratic (smooth) when θ is close to the point
Huber Loss
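A sketch of the Huber loss for a single point, with transition parameter δ marking where it switches from quadratic to linear (this common parameterization is an assumption; the slides' δ may differ by a constant):

```python
import numpy as np

def huber(theta, y, delta=1.0):
    resid = y - theta
    quad = 0.5 * resid ** 2                      # quadratic near the point
    lin = delta * (np.abs(resid) - 0.5 * delta)  # linear far from the point
    return np.where(np.abs(resid) <= delta, quad, lin)

# Residual 0.5 is inside δ=1 (quadratic); residual 3.0 is outside (linear)
print(huber(0.0, np.array([0.5, 3.0])))  # [0.125 2.5]
```

Because the two pieces meet with matching value and slope at |y − θ| = δ, the derivative is continuous and bounded, which is what makes Huber loss friendly for gradient descent.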
Why Use Gradient Descent?
Remember the Data Science Lifecycle?
Formulate Question or Problem
Acquire and Clean Data
Exploratory Data Analysis
Draw Conclusions from Predictions and Inference
Reports, Decisions, and Solutions
Summary
(demo)