1 of 44

Gradient Descent

(Reading: Ch 11)

(Slides adapted from Sandrine Dudoit and Joey Gonzalez)

UC Berkeley Data 100 Summer 2019

Sam Lau

Learning goals:

  • See how gradients generalize derivatives.
  • Learn the batch and stochastic gradient descent algorithms.
  • Understand how GD is affected by learning rate, convexity, and loss function.

2 of 44

Announcements

  • HW4 due today
  • HW5 out today, due Friday

3 of 44

So Far:

  • The modeling pipeline involves picking a model, picking a loss function, and fitting the model by minimizing the loss.
  • So far, we fit models by taking the derivative of the loss, setting the derivative equal to 0, then solving for the parameters.
  • This doesn’t work for complicated models or loss functions!
    • E.g. the normal equations take too long to solve.
  • Today: Learn gradient descent, a general technique for loss minimization.

4 of 44

Gradients

5 of 44

Derivatives

  • The derivative of a univariate function gives the slope at a given point.
  • We can also interpret the slope as:
    • If I nudge x, how does y change?
  • Desmos demo

6 of 44

Partial Derivative

  • What about multivariable functions?
  • Intuition: Pretend that the multivariable function is univariate (hold the other variables fixed), then take the derivative as normal. This is a partial derivative.
  • Desmos demo

7 of 44

Gradients

  • The gradient extends the derivative to multiple variables.
    • It is a vector-valued function (it always outputs a vector).
    • The gradient of f(θ) w.r.t. θ is a vector. Each element is the partial derivative of f with respect to one component of θ.
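
To make this concrete, here is a minimal numeric sketch (the function f(θ) = θ₀² + 3θ₁ is made up for illustration): the gradient assembles both partial derivatives into one vector, and nudging each component of θ recovers them numerically.

```python
import numpy as np

# Hypothetical example function: f(theta) = theta_0^2 + 3*theta_1.
def f(theta):
    return theta[0] ** 2 + 3 * theta[1]

# Its gradient, assembled as a vector of partial derivatives:
# df/dtheta_0 = 2*theta_0, df/dtheta_1 = 3.
def grad_f(theta):
    return np.array([2 * theta[0], 3.0])

theta = np.array([1.5, -2.0])
print(grad_f(theta))  # [3. 3.]

# Numerical check: nudge each component of theta and see how f changes.
eps = 1e-6
approx = np.array([(f(theta + eps * np.eye(2)[i]) - f(theta)) / eps
                   for i in range(2)])
print(approx)  # approximately [3. 3.]
```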

8 of 44

You Try:

Find the gradient of f(θ, x) w.r.t θ. Then find the gradient w.r.t x.

9 of 44

You Try:

Notice how each component of the gradient looks like an ordinary derivative?

This happens a lot! But you should always start by assembling the vector of partial derivatives, like we just did.

10 of 44

How to Interpret Gradients

  • Soon we will take gradients of loss functions.
  • You should read these gradients as:
    • If I nudge the 1st model weight, what happens to the loss?
    • If I nudge the 2nd, what happens to the loss?
    • Etc.
  • Since the gradient generalizes the derivative, we will say “gradient” from now on instead of “derivative”.

11 of 44

Gradient Descent

12 of 44

Using the Gradient

  • Goal: Fit model by minimizing loss.
  • Recall that loss is a function of model weights.
  • So the gradient w.r.t. θ tells us which way θ should “move”.
    • Left plot: θ should move in the positive direction.
    • Right plot: θ should move in the negative direction.

13 of 44

Using the Gradient

  • Left plot: the gradient is negative, and we want θ to increase.
  • Right plot: the gradient is positive, and we want θ to decrease.
  • Idea: Move θ in the negative direction of the gradient.

14 of 44

Batch Gradient Descent

  • Gradient descent algorithm: nudge θ in negative gradient direction until θ converges.
  • Batch gradient descent update rule:

    θ^(t+1) = θ^(t) − ⍺ ∇_θ L(θ^(t), y)

    θ: model weights (θ^(t+1) is the next value for θ)
    L: loss function (∇_θ L is the gradient of the loss w.r.t. θ)
    ⍺: learning rate, usually a small constant
    y: true values from the training data

15 of 44

Gradient Descent Algorithm

  • Initialize model weights to all zeros.
    • Also common: initialize using small random numbers.
  • Update the model weights using the update rule:

    θ^(t+1) = θ^(t) − ⍺ ∇_θ L(θ^(t), y)

  • Repeat until the model weights don’t change (convergence).
    • At this point, we have θ̂, our minimizing model weights.
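
As a minimal sketch in code (assuming a one-dimensional θ and a user-supplied gradient function; the names grad_loss, alpha, n_iters, and tol are illustrative, not from the slides):

```python
# Minimal batch gradient descent sketch for a scalar weight theta.
# grad_loss(theta, y) must return the gradient of the loss over the
# full dataset; alpha, n_iters, and tol are illustrative defaults.
def gradient_descent(grad_loss, y, alpha=0.1, n_iters=1000, tol=1e-8):
    theta = 0.0                        # initialize weights to zero
    for _ in range(n_iters):
        step = alpha * grad_loss(theta, y)
        theta = theta - step           # move against the gradient
        if abs(step) < tol:            # converged: theta barely changed
            break
    return theta                       # our estimate theta-hat
```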

16 of 44

Gradient Descent for MSE

  • For the constant model with MSE loss, we have:

    L(θ, y) = (1/n) Σᵢ (yᵢ − θ)²,  so  ∇_θ L(θ, y) = −(2/n) Σᵢ (yᵢ − θ)

  • So the update rule is:

    θ^(t+1) = θ^(t) + ⍺ · (2/n) Σᵢ (yᵢ − θ^(t))

(demo)
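
A runnable version of this rule on made-up data (the y values are arbitrary): GD should recover the mean, since the mean minimizes MSE for the constant model.

```python
import numpy as np

y = np.array([1.0, 3.0, 2.0, 8.0])  # made-up data
alpha, theta = 0.1, 0.0

for _ in range(200):
    grad = -2 * np.mean(y - theta)  # gradient of MSE for constant model
    theta -= alpha * grad

print(theta, y.mean())  # both approximately 3.5
```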

17 of 44

The Learning Rate

  • ⍺ is called the learning rate of gradient descent.
    • Higher values can converge faster since θ changes more each iteration.
    • But if ⍺ is too high, GD might never converge!
    • If ⍺ is too low, GD takes many iterations to converge.

(demo)
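
A sketch of all three regimes on the constant-model example above (the data and ⍺ values are made up):

```python
import numpy as np

y = np.array([1.0, 3.0, 2.0, 8.0])

for alpha in [0.01, 0.5, 1.5]:
    theta = 0.0
    for _ in range(50):
        theta -= alpha * (-2 * np.mean(y - theta))
    print(alpha, theta)
# alpha=0.01: still far from 3.5 after 50 iterations (too slow)
# alpha=0.5:  converges to 3.5 (here in a single step, since 1 - 2*alpha = 0)
# alpha=1.5:  theta blows up, since |1 - 2*alpha| = 2 > 1 (never converges)
```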

18 of 44

You Try:

Derive the gradient descent rule for a linear model with two model weights, no bias term, and MSE loss. Start by taking the gradient for a single data point.

𝓁 means loss for a single point; L means average loss for the dataset.

19 of 44

You Try:

The gradient for the entire dataset is the average of the gradients for each point, so we can run GD as-is.
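
The exercise’s equations were lost from these notes, but for a two-weight, no-bias linear model with MSE loss the derivation takes this standard form:

$$
\ell(\theta, x, y) = \left(y - \theta_1 x_1 - \theta_2 x_2\right)^2 = \left(y - x^\top \theta\right)^2
$$

$$
\nabla_\theta\, \ell =
\begin{bmatrix}
-2x_1\left(y - x^\top\theta\right) \\
-2x_2\left(y - x^\top\theta\right)
\end{bmatrix}
= -2\left(y - x^\top\theta\right)x,
\qquad
\nabla_\theta L = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta\, \ell(\theta, x_i, y_i)
$$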

20 of 44

In Matrix Form

There’s usually a succinct matrix form for gradients. You should be able to convert back and forth.

Once we have the matrix form, we can use p dimensions (not just 2).

The last line is a bonus. Don’t stress if you can’t derive it.
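
The slide’s matrix equations didn’t survive extraction; the standard matrix form for MSE loss with an n × p design matrix X is:

$$
L(\theta) = \frac{1}{n}\,\lVert y - X\theta \rVert_2^2,
\qquad
\nabla_\theta L(\theta) = -\frac{2}{n}\, X^\top \left(y - X\theta\right)
$$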

21 of 44

Update Rule

θ^(t+1) = θ^(t) + ⍺ · (2/n) Xᵀ(y − Xθ^(t))

(demo)

Although we started off with only two variables, matrix notation extends our derivation to any number of variables.
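
A sketch of this update rule in code, on synthetic data (n, p, true_theta, and ⍺ are all made up for illustration):

```python
import numpy as np

# Batch GD for linear regression in matrix form, on synthetic data.
rng = np.random.default_rng(42)
n, p = 100, 3
X = rng.normal(size=(n, p))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=n)

alpha, theta = 0.1, np.zeros(p)
for _ in range(1000):
    grad = -(2 / n) * X.T @ (y - X @ theta)  # matrix-form gradient
    theta -= alpha * grad

print(theta)  # approximately [1.0, -2.0, 0.5]
```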

22 of 44

Break!

Fill out Attendance:

http://bit.ly/at-d100

23 of 44

Stochastic Gradient Descent

24 of 44

Stochastic Gradient Descent (SGD)

  • We can perform each update using one data point instead of all the data points.

  Stochastic GD:  θ^(t+1) = θ^(t) − ⍺ ∇_θ 𝓁(θ^(t), yᵢ)

  Batch GD:  θ^(t+1) = θ^(t) − ⍺ ∇_θ L(θ^(t), y)

  • SGD usually converges faster than BGD in practice!

25 of 44

SGD Algorithm

  1. Initialize model weights to zero.
  2. Shuffle the input data.
  3. For each point yᵢ in the shuffled input data, perform the update:

     θ^(t+1) = θ^(t) − ⍺ ∇_θ 𝓁(θ^(t), yᵢ)

  4. Repeat steps 2 and 3 until convergence.

Each individual update is called an iteration.

Each full pass through the data is called an epoch.
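
A minimal sketch of this algorithm for the constant model with MSE, on made-up data:

```python
import numpy as np

y = np.array([1.0, 3.0, 2.0, 8.0])  # made-up data
alpha, theta = 0.1, 0.0
rng = np.random.default_rng(0)

for epoch in range(100):            # one full pass over the data = an epoch
    for y_i in rng.permutation(y):  # reshuffle at the start of each epoch
        grad = -2 * (y_i - theta)   # gradient of the single-point loss
        theta -= alpha * grad       # one iteration
print(theta)  # hovers near y.mean() == 3.5 rather than converging exactly
```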

26 of 44

SGD Example

  • BGD update rule for the constant model, MSE loss:

    θ^(t+1) = θ^(t) + ⍺ · (2/n) Σᵢ (yᵢ − θ^(t))

  • SGD update rule for the constant model, MSE loss:

    θ^(t+1) = θ^(t) + ⍺ · 2 (yᵢ − θ^(t))

SGD update rules look like BGD update rules without taking the average.

(demo)

27 of 44

Why does SGD perform well?

  • Intuition: SGD “does more” with each data point.
    • E.g. for a sample of 1 million points:
    • BGD requires ≥1 million calculations for a single update.
    • SGD will have updated the weights 1 million times!
  • However:
    • BGD always moves toward the optimum.
    • SGD sometimes moves away from it because it only considers one point at a time.
    • SGD still generally does better in spite of this.

28 of 44

Batch GD always moves toward the minimum:

29 of 44

Stochastic GD meanders

30 of 44

Convexity

31 of 44

Gradient Descent Only Finds Local Minima

  • If the loss function has multiple local minima, GD is not guaranteed to find the global minimum.
  • Suppose we have this loss curve:

32 of 44

Gradient Descent Only Finds Local Minima

  • Here’s how GD runs:
  • GD can converge to the local minimum at θ = −15 even though the global minimum is at θ = 18.

33 of 44

Convexity

  • For a convex function f, any local minimum is also a global minimum.
    • If the loss function is convex, gradient descent will always find the globally optimal minimizer.
  • Formally, f is convex iff:

    f(t·a + (1 − t)·b) ≤ t·f(a) + (1 − t)·f(b)  for all a, b and all t ∈ [0, 1]
34 of 44

Convexity

  • In plain terms: if I draw a line between any two points on the curve, all values on the curve between them need to be on or below the line.
  • E.g. MSE loss is convex:
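
As a quick numeric illustration (not a proof; the data points are made up), the convexity inequality holds for MSE as a function of θ:

```python
import numpy as np

y = np.array([1.0, 3.0, 2.0, 8.0])     # made-up data
mse = lambda theta: np.mean((y - theta) ** 2)

rng = np.random.default_rng(1)
for _ in range(5):
    a, b = rng.uniform(-10, 10, size=2)
    t = rng.uniform()
    lhs = mse(t * a + (1 - t) * b)
    rhs = t * mse(a) + (1 - t) * mse(b)
    print(lhs <= rhs + 1e-12)           # True: the curve lies below the chord
```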

35 of 44

Convexity

  • But this loss function is not convex:

36 of 44

Gradient Descent Considerations

37 of 44

Choosing a Loss Function for GD

  • Often we pick a loss function that “behaves well” for GD.
  • MSE has larger gradients farther from the minimum.
    • So GD will take large steps at first, then small steps.
  • MAE is not differentiable at its minimum.
    • So GD will seldom truly converge to the minimum!

38 of 44

Huber Loss

  • What if we want the outlier-robust properties of MAE?
  • Use the Huber loss:

    L_𝛿(θ, y) = ½ (y − θ)²           if |y − θ| ≤ 𝛿
                𝛿 (|y − θ| − 𝛿/2)    otherwise

  • Behaves like MAE loss when points are far from θ.
    • But like MSE loss when points are close to θ.

39 of 44

Huber Loss

Linear when θ is far from the point; smooth (quadratic) when θ is close to the point.

40 of 44

Huber Loss

  • 𝛿 is a parameter that determines where Huber loss transitions from MSE to MAE.
  • No analytic solution (unlike MSE and MAE)
  • But differentiable everywhere, so we can use GD!
  • Try taking the derivative and running GD for practice.
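
One way that practice run could look (a sketch; the data, 𝛿, and ⍺ are made up), using the piecewise gradient of the Huber loss for the constant model:

```python
import numpy as np

y = np.array([1.0, 3.0, 2.0, 100.0])  # 100.0 is an outlier
delta, alpha, theta = 1.0, 0.1, 0.0

def grad_huber(theta):
    r = y - theta
    # -r where |r| <= delta (MSE-like), -delta*sign(r) otherwise (MAE-like)
    return np.mean(np.where(np.abs(r) <= delta, -r, -delta * np.sign(r)))

for _ in range(2000):
    theta -= alpha * grad_huber(theta)

print(theta)  # approximately 2.5: barely moved by the outlier (mean is 26.5)
```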

41 of 44

Why Use Gradient Descent?

  • The beauty of GD is that it works for many types of models and loss functions, as long as you can take the gradient.
    • Usually, taking the gradient is much easier than solving for the minimum analytically!
    • Also, GD often finds the solution faster than computing the analytic solution, even when one exists (e.g. linear regression).
  • Modeling recipe:
    • Pick model
    • Pick loss function
    • Fit model by running gradient descent

42 of 44

Remember the Data Science Lifecycle?

  • Formulate Question or Problem
  • Acquire and Clean Data
  • Exploratory Data Analysis
  • Draw Conclusions from Predictions and Inference
  • Reports, Decisions, and Solutions

43 of 44

Summary

  • Gradients generalize derivatives.
  • Gradient descent is a useful computational method for minimizing loss functions.
  • Stochastic GD is used more often than batch GD because of its faster convergence.
  • Convex loss functions behave well with GD methods.
  • We now have all the pieces needed for modeling!

44 of 44
