1 of 26

Stochastic Gradient Descent

Prof. Seungchul Lee

Industrial AI Lab.

2 of 26

Gradient Descent

  • We will cover the gradient descent algorithm and its variants:
    • Batch Gradient Descent
    • Stochastic Gradient Descent
    • Mini-batch Gradient Descent

  • We will explore the concept of these three gradient descent algorithms with a logistic regression model in TensorFlow

  • Limitations of Gradient Descent
    • Adaptive learning rate


3 of 26

Batch Gradient Descent (= Gradient Descent)


4 of 26

Batch Gradient Descent

  • Batch gradient descent updates the parameters at every step using the gradient of the loss averaged over the entire training set
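A standard formulation of the batch update (here θ denotes the model parameters, α the learning rate, and ℓᵢ the loss on training example i):

```latex
\theta \leftarrow \theta - \alpha \,\nabla_\theta J(\theta)
       = \theta - \alpha \,\frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \ell_i(\theta)
```

Every update touches all N examples, which is accurate but expensive for large datasets.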


5 of 26

Stochastic Gradient Descent (SGD)

  • Stochastic gradient descent (SGD): update the parameters based on the gradient for a randomly selected single training example:

    • SGD takes steps in a noisy direction, but moves downhill on average.

  • Mathematical justification: if you sample a training example at random, the stochastic gradient is an unbiased estimate of the batch gradient:
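With J(θ) = (1/N) Σᵢ ℓᵢ(θ) and the index i sampled uniformly at random, this unbiasedness can be written as:

```latex
\mathbb{E}_{i \sim \mathrm{Unif}\{1,\dots,N\}}\!\left[\nabla_\theta \ell_i(\theta)\right]
  = \frac{1}{N}\sum_{i=1}^{N}\nabla_\theta \ell_i(\theta)
  = \nabla_\theta J(\theta)
```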


6 of 26

SGD is Sometimes Better

  • There is no guarantee that this will always happen.
  • However, the noisy SGD gradients can occasionally help escape local optima.


7 of 26

Mini-batch Gradient Descent

  • Mini-batch gradient descent updates the parameters using the gradient averaged over a small random subset (a mini-batch) of the training examples, a compromise between batch and stochastic updates
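A standard formulation of the mini-batch update, with a random mini-batch 𝓑 of m examples:

```latex
\theta \leftarrow \theta - \alpha \,\frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \nabla_\theta \ell_i(\theta),
\qquad \mathcal{B} \subset \{1,\dots,N\},\quad |\mathcal{B}| = m \ll N
```

The mini-batch gradient is still an unbiased estimate of the batch gradient, but with lower variance than the single-example SGD estimate.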


8 of 26

Implementation with TensorFlow


9 of 26

Batch Gradient Descent with TensorFlow

  • We now explore the Python code for these three gradient descent algorithms, applied to a logistic regression model.


10 of 26

Batch Gradient Descent with TensorFlow
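The slide's original TensorFlow code is not reproduced here; as a stand-in, the following is a minimal NumPy sketch of full-batch gradient descent on a logistic regression model (the toy two-blob dataset, learning rate, and iteration count are illustrative assumptions, not the lecture's values):

```python
import numpy as np

# Toy binary classification data: two Gaussian blobs (illustrative assumption)
rng = np.random.default_rng(0)
N = 200
X = np.vstack([rng.normal(-1, 1, (N // 2, 2)), rng.normal(1, 1, (N // 2, 2))])
y = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(500):
    # Gradient of the average cross-entropy loss over ALL N examples
    p = sigmoid(X @ w + b)
    w -= lr * X.T @ (p - y) / N
    b -= lr * np.mean(p - y)

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

Each iteration uses every example exactly once, so the update direction is the exact (noise-free) gradient of the training loss.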


11 of 26

Stochastic Gradient Descent (SGD) with TensorFlow
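Again as a stand-in for the slide's TensorFlow code, here is a NumPy sketch of SGD on the same logistic regression model, updating on one randomly ordered example at a time (data, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

# Same toy two-blob dataset as the batch example (illustrative assumption)
rng = np.random.default_rng(0)
N = 200
X = np.vstack([rng.normal(-1, 1, (N // 2, 2)), rng.normal(1, 1, (N // 2, 2))])
y = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.05
for epoch in range(20):
    for i in rng.permutation(N):         # one random example per update
        p = sigmoid(X[i] @ w + b)
        w -= lr * (p - y[i]) * X[i]      # cross-entropy gradient for example i
        b -= lr * (p - y[i])

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

Each epoch performs N noisy updates instead of one exact update, which is why SGD often makes faster initial progress per pass over the data.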


12 of 26

Mini-batch Gradient Descent with TensorFlow
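As before, the slide's TensorFlow code is not reproduced; this NumPy sketch shows the mini-batch variant on the same logistic regression model (the batch size of 32 and other constants are illustrative assumptions):

```python
import numpy as np

# Same toy two-blob dataset (illustrative assumption)
rng = np.random.default_rng(0)
N, batch_size = 200, 32
X = np.vstack([rng.normal(-1, 1, (N // 2, 2)), rng.normal(1, 1, (N // 2, 2))])
y = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(100):
    idx = rng.permutation(N)
    for start in range(0, N, batch_size):
        batch = idx[start:start + batch_size]
        p = sigmoid(X[batch] @ w + b)
        # Average gradient over the mini-batch only
        w -= lr * X[batch].T @ (p - y[batch]) / len(batch)
        b -= lr * np.mean(p - y[batch])

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

Mini-batches also map naturally onto vectorized hardware, which is the main reason this variant dominates in practice.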


13 of 26

Limitations of Gradient Descent


14 of 26

Setting the Learning Rate

  • How can we set the learning rate?

  • A small learning rate converges slowly and can get stuck in false local minima
  • A large learning rate can overshoot, become unstable, and diverge

  • Idea 1
    • Try lots of different learning rates and see what works “just right”
  • Idea 2
    • Do something smarter! Design an adaptive learning rate that “adapts” to the landscape
    • Temporal and spatial


15 of 26

SGD Learning Rate (= Step Size)

  • The gradient descent method makes a strong assumption about the magnitude of the “local curvature” to fix the step size, and about its isotropy so that the same step size makes sense in all directions.


Source: Dr. Francois Fleuret at EPFL

16 of 26

SGD Learning Rate: Spatial

  • We assign the same learning rate to all features


[Figure: two loss-surface contour plots, one isotropic ("Nice": all features are equally important) and one anisotropic ("Harder!")]


17 of 26

SGD Learning Rate: Spatial

  • Isotropic curvature: the same step size makes sense in all directions
  • Nice: all features are equally important



18 of 26

SGD Learning Rate: Spatial

  • Anisotropic curvature: no single step size fits all directions, so setting the learning rate is harder




23 of 26

SGD Learning Rate: Temporal

  • Typical strategy:
    • Use a large learning rate early in training so you can get close to the optimum
    • Gradually decay the learning rate to reduce the fluctuations
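This strategy can be sketched as a simple decay schedule; the exponential form and its constants below are illustrative assumptions, not the lecture's values:

```python
def exp_decay(lr0, step, decay_rate=0.95, decay_steps=100):
    """Exponentially decay the learning rate: large early, small late."""
    return lr0 * decay_rate ** (step / decay_steps)

# The learning rate shrinks smoothly as training progresses
lrs = [exp_decay(0.1, s) for s in (0, 500, 5000)]
```

Frameworks ship equivalents of this pattern (e.g. TensorFlow's `tf.keras.optimizers.schedules.ExponentialDecay`), so in practice the schedule is passed to the optimizer rather than hand-rolled.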


24 of 26

Adaptive Gradient Learning Rate Methods

  • Adaptive methods such as Adagrad, RMSProp, and Adam automatically scale the learning rate for each parameter using the history of its gradients
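A minimal sketch of the per-parameter idea, using Adagrad's accumulated squared gradients (the quadratic test objective, learning rate, and step count are illustrative assumptions):

```python
import numpy as np

def adagrad(grad_fn, theta0, lr=0.5, eps=1e-8, steps=200):
    """Adagrad: divide each coordinate's step by the root of its
    accumulated squared gradients, giving a per-parameter learning rate."""
    theta = np.asarray(theta0, dtype=float)
    G = np.zeros_like(theta)                  # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        G += g * g
        theta -= lr * g / (np.sqrt(G) + eps)  # per-coordinate step size
    return theta

# Anisotropic quadratic: gradient is steep in x, shallow in y
grad = lambda th: np.array([10.0 * th[0], 0.1 * th[1]])
theta = adagrad(grad, [1.0, 1.0])
```

Because each coordinate is normalized by its own gradient history, both coordinates make progress at a similar rate despite the 100x difference in curvature, which is exactly the "spatial" adaptivity the previous slides motivated.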


25 of 26

Adaptive Gradient Learning Rate Methods


26 of 26

Adaptive Learning Rate Methods


Source: 6.S191 Intro. to Deep Learning at MIT