1 of 26

Stochastic Gradient Descent

Prof. Seungchul Lee

Industrial AI Lab.

2 of 26

Gradient Descent

  • We will cover the gradient descent algorithm and its variants:
    • Batch Gradient Descent
    • Stochastic Gradient Descent
    • Mini-batch Gradient Descent

  • We will explore the concept of these three gradient descent algorithms with a logistic regression model in TensorFlow

  • Limitations of Gradient Descent
    • Adaptive learning rate


3 of 26

Batch Gradient Descent (= Gradient Descent)


4 of 26

Batch Gradient Descent

  • Batch gradient descent updates the parameters at every step using the gradient of the loss averaged over the entire training set
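A standard formulation of the batch update (here θ denotes the model parameters, α the learning rate, and ℓᵢ the loss on training example i):

```latex
\theta \leftarrow \theta - \alpha \,\nabla_\theta J(\theta)
       = \theta - \alpha \,\frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \ell_i(\theta)
```

Every update touches all N examples, which is accurate but expensive for large datasets.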


5 of 26

Stochastic Gradient Descent (SGD)

  • Stochastic gradient descent (SGD): update the parameters based on the gradient for a randomly selected single training example:

    • SGD takes steps in a noisy direction, but moves downhill on average.

  • Mathematical justification: if you sample a training example at random, the stochastic gradient is an unbiased estimate of the batch gradient:
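With J(θ) = (1/N) Σᵢ ℓᵢ(θ) and the index i sampled uniformly at random, this unbiasedness can be written as:

```latex
\mathbb{E}_{i \sim \mathrm{Unif}\{1,\dots,N\}}\!\left[\nabla_\theta \ell_i(\theta)\right]
  = \frac{1}{N}\sum_{i=1}^{N}\nabla_\theta \ell_i(\theta)
  = \nabla_\theta J(\theta)
```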


6 of 26

SGD is Sometimes Better

  • There is no guarantee that this will always happen.
  • However, the noisy SGD gradients can occasionally help escape local optima.


7 of 26

Mini-batch Gradient Descent

  • Mini-batch gradient descent updates the parameters using the gradient averaged over a small random subset (a mini-batch) of the training examples, a compromise between batch and stochastic updates
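A standard formulation of the mini-batch update, with a random mini-batch 𝓑 of m examples:

```latex
\theta \leftarrow \theta - \alpha \,\frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \nabla_\theta \ell_i(\theta),
\qquad \mathcal{B} \subset \{1,\dots,N\},\quad |\mathcal{B}| = m \ll N
```

The mini-batch gradient is still an unbiased estimate of the batch gradient, but with lower variance than the single-example SGD estimate.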


8 of 26

Implementation with TensorFlow


9 of 26

Batch Gradient Descent with TensorFlow

  • We now explore the Python code for these three gradient descent algorithms, applied to a logistic regression model.


10 of 26

Batch Gradient Descent with TensorFlow
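The slide's original TensorFlow code is not reproduced here; as a stand-in, the following is a minimal NumPy sketch of full-batch gradient descent on a logistic regression model (the toy two-blob dataset, learning rate, and iteration count are illustrative assumptions, not the lecture's values):

```python
import numpy as np

# Toy binary classification data: two Gaussian blobs (illustrative assumption)
rng = np.random.default_rng(0)
N = 200
X = np.vstack([rng.normal(-1, 1, (N // 2, 2)), rng.normal(1, 1, (N // 2, 2))])
y = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(500):
    # Gradient of the average cross-entropy loss over ALL N examples
    p = sigmoid(X @ w + b)
    w -= lr * X.T @ (p - y) / N
    b -= lr * np.mean(p - y)

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

Each iteration uses every example exactly once, so the update direction is the exact (noise-free) gradient of the training loss.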


11 of 26

Stochastic Gradient Descent (SGD) with TensorFlow
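Again as a stand-in for the slide's TensorFlow code, here is a NumPy sketch of SGD on the same logistic regression model, updating on one randomly ordered example at a time (data, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

# Same toy two-blob dataset as the batch example (illustrative assumption)
rng = np.random.default_rng(0)
N = 200
X = np.vstack([rng.normal(-1, 1, (N // 2, 2)), rng.normal(1, 1, (N // 2, 2))])
y = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.05
for epoch in range(20):
    for i in rng.permutation(N):         # one random example per update
        p = sigmoid(X[i] @ w + b)
        w -= lr * (p - y[i]) * X[i]      # cross-entropy gradient for example i
        b -= lr * (p - y[i])

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

Each epoch performs N noisy updates instead of one exact update, which is why SGD often makes faster initial progress per pass over the data.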


12 of 26

Mini-batch Gradient Descent with TensorFlow
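As before, the slide's TensorFlow code is not reproduced; this NumPy sketch shows the mini-batch variant on the same logistic regression model (the batch size of 32 and other constants are illustrative assumptions):

```python
import numpy as np

# Same toy two-blob dataset (illustrative assumption)
rng = np.random.default_rng(0)
N, batch_size = 200, 32
X = np.vstack([rng.normal(-1, 1, (N // 2, 2)), rng.normal(1, 1, (N // 2, 2))])
y = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(100):
    idx = rng.permutation(N)
    for start in range(0, N, batch_size):
        batch = idx[start:start + batch_size]
        p = sigmoid(X[batch] @ w + b)
        # Average gradient over the mini-batch only
        w -= lr * X[batch].T @ (p - y[batch]) / len(batch)
        b -= lr * np.mean(p - y[batch])

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

Mini-batches also map naturally onto vectorized hardware, which is the main reason this variant dominates in practice.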


13 of 26

Limitations of Gradient Descent


14 of 26

Setting the Learning Rate

  • How can we set the learning rate?

  • A small learning rate converges slowly and can get stuck in false local minima
  • A large learning rate can overshoot, become unstable, and diverge

  • Idea 1
    • Try lots of different learning rates and see what works “just right”
  • Idea 2
    • Do something smarter! Design an adaptive learning rate that “adapts” to the landscape
    • Temporal and spatial


15 of 26

SGD Learning Rate (= Step Size)

  • The gradient descent method makes a strong assumption about the magnitude of the “local curvature” to fix the step size, and about its isotropy so that the same step size makes sense in all directions.


Source: Dr. Francois Fleuret at EPFL

16 of 26

SGD Learning Rate: Spatial

  • We assign the same learning rate to all features


[Figure: two loss-surface contour plots, one isotropic ("Nice": all features are equally important) and one anisotropic ("Harder!")]


17 of 26

SGD Learning Rate: Spatial

  • Isotropic curvature: the same step size makes sense in all directions
  • Nice: all features are equally important



18 of 26

SGD Learning Rate: Spatial

  • Anisotropic curvature: no single step size fits all directions, so setting the learning rate is harder




23 of 26

SGD Learning Rate: Temporal

  • Typical strategy:
    • Use a large learning rate early in training so you can get close to the optimum
    • Gradually decay the learning rate to reduce the fluctuations
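This strategy can be sketched as a simple decay schedule; the exponential form and its constants below are illustrative assumptions, not the lecture's values:

```python
def exp_decay(lr0, step, decay_rate=0.95, decay_steps=100):
    """Exponentially decay the learning rate: large early, small late."""
    return lr0 * decay_rate ** (step / decay_steps)

# The learning rate shrinks smoothly as training progresses
lrs = [exp_decay(0.1, s) for s in (0, 500, 5000)]
```

Frameworks ship equivalents of this pattern (e.g. TensorFlow's `tf.keras.optimizers.schedules.ExponentialDecay`), so in practice the schedule is passed to the optimizer rather than hand-rolled.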


24 of 26

Adaptive Gradient Learning Rate Methods

  • Adaptive methods such as Adagrad, RMSProp, and Adam automatically scale the learning rate for each parameter using the history of its gradients
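A minimal sketch of the per-parameter idea, using Adagrad's accumulated squared gradients (the quadratic test objective, learning rate, and step count are illustrative assumptions):

```python
import numpy as np

def adagrad(grad_fn, theta0, lr=0.5, eps=1e-8, steps=200):
    """Adagrad: divide each coordinate's step by the root of its
    accumulated squared gradients, giving a per-parameter learning rate."""
    theta = np.asarray(theta0, dtype=float)
    G = np.zeros_like(theta)                  # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        G += g * g
        theta -= lr * g / (np.sqrt(G) + eps)  # per-coordinate step size
    return theta

# Anisotropic quadratic: gradient is steep in x, shallow in y
grad = lambda th: np.array([10.0 * th[0], 0.1 * th[1]])
theta = adagrad(grad, [1.0, 1.0])
```

Because each coordinate is normalized by its own gradient history, both coordinates make progress at a similar rate despite the 100x difference in curvature, which is exactly the "spatial" adaptivity the previous slides motivated.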


25 of 26

Adaptive Gradient Learning Rate Methods


26 of 26

Adaptive Learning Rate Methods


Source: 6.S191 Intro. to Deep Learning at MIT