Lecture 11: Training Neural Networks, Part V
Erik Learned-Miller and TAs
Adapted from slides of Fei-Fei Li, Andrej Karpathy & Justin Johnson
8 Oct 2019
Parameter Updates
Training a neural network, main loop:
So far the parameter update in this loop has been the simple gradient descent update; now we will complicate it.
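To make the loop concrete, here is a minimal, self-contained numpy sketch of the main loop with the simple gradient descent update. The quadratic toy objective and the names loss_and_grad, x, and learning_rate are stand-ins for a real network's loss and parameters, not the slide's actual code.

import numpy as np

# Toy stand-in for a network's forward/backward pass on a sampled mini-batch:
# returns the loss and the gradient of the loss w.r.t. the parameters x.
def loss_and_grad(x, rng):
    noise = 0.1 * rng.standard_normal(x.shape)   # mini-batch noise
    grad = x + noise                             # gradient of 0.5 * ||x||^2, perturbed
    loss = 0.5 * float(x @ x)
    return loss, grad

rng = np.random.default_rng(0)
x = rng.standard_normal(10)            # parameters
learning_rate = 0.1

for step in range(100):
    loss, dx = loss_and_grad(x, rng)   # "forward/backward" on a batch
    x += -learning_rate * dx           # simple gradient descent update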
[Animations of different parameter update methods. Image credits: Alec Radford]
Suppose loss function is steep vertically but shallow horizontally:
Q: What is the trajectory along which we converge towards the minimum with SGD?
A: Very slow progress along the flat (shallow) direction, jitter along the steep one.
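A small numeric illustration of this behavior, using a hypothetical quadratic with curvature 2.0 in the steep direction and 0.02 in the shallow one; the learning rate of 0.9 is chosen to make the jitter visible.

import numpy as np

# f(x, y) = 0.5 * (0.02 * x^2 + 2.0 * y^2): shallow in x, steep in y
def grad(p):
    return np.array([0.02 * p[0], 2.0 * p[1]])

p = np.array([10.0, 1.0])
learning_rate = 0.9

for step in range(20):
    p += -learning_rate * grad(p)
    # x (shallow) shrinks by only 1.8% per step; y (steep) flips sign every step
    print(step, p)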
Momentum update
- Physical interpretation: a ball rolling down the loss surface, with friction (coefficient mu).
- mu is usually ~0.5, 0.9, or 0.99 (sometimes annealed over time, e.g. from 0.5 -> 0.99).
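A minimal numpy sketch of the momentum update, using the mu / v / learning_rate names from the slides; the toy gradient function is a stand-in for a network's backward pass.

import numpy as np

def grad(x):
    return x                    # toy gradient of 0.5 * ||x||^2 (stand-in)

x = np.array([10.0, 1.0])       # parameters
v = np.zeros_like(x)            # velocity, initialized at zero
mu = 0.9                        # friction / momentum coefficient
learning_rate = 0.1

for step in range(100):
    dx = grad(x)
    v = mu * v - learning_rate * dx   # integrate velocity
    x += v                            # integrate position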
SGD vs. Momentum
Notice momentum overshooting the target, but overall getting to the minimum much faster.
Nesterov Momentum update
Ordinary momentum update: [diagram: the actual step is the sum of the momentum step and the gradient step evaluated at the current point]
Nesterov Momentum update
Momentum update: [diagram: actual step = momentum step + gradient step taken at the current point]
Nesterov momentum update: [diagram: actual step = momentum step + "lookahead" gradient step, evaluated after the momentum step (a bit different than the original gradient)]
Nesterov: the only difference is that the gradient in the velocity update is evaluated at the lookahead point θ + μv rather than at θ.
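The two updates, reconstructed in standard notation (the slide shows them as images); μ is the momentum coefficient and ε the learning rate.

% Ordinary momentum update
v_t = \mu\, v_{t-1} - \varepsilon \nabla f(\theta_{t-1}), \qquad
\theta_t = \theta_{t-1} + v_t

% Nesterov momentum update: the gradient is taken at the lookahead point
v_t = \mu\, v_{t-1} - \varepsilon \nabla f(\theta_{t-1} + \mu\, v_{t-1}), \qquad
\theta_t = \theta_{t-1} + v_t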
Nesterov Momentum update
Slightly inconvenient… the gradient is needed at θ + μv, but usually we only compute the loss and gradient at the current parameter vector θ.
Variable transform and rearranging saves the day: let φ = θ + μv.
Replace all thetas with phis, rearrange, and obtain an update expressed entirely in terms of φ and ∇f(φ) (see the sketch below).
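A minimal numpy sketch of the commonly used rearranged form, assuming x already plays the role of the transformed variable φ, so dx is the gradient evaluated at the current (lookahead) parameters; the toy gradient is a stand-in.

import numpy as np

def grad(x):
    return x                    # toy stand-in for the gradient at phi

x = np.array([10.0, 1.0])       # plays the role of phi = theta + mu * v
v = np.zeros_like(x)
mu = 0.9
learning_rate = 0.1

for step in range(100):
    dx = grad(x)                # gradient at the lookahead position
    v_prev = v
    v = mu * v - learning_rate * dx
    x += -mu * v_prev + (1 + mu) * v   # Nesterov update in the transformed variable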
Nesterov Momentum update: another view
nag = Nesterov Accelerated Gradient
AdaGrad update
Added element-wise scaling of the gradient based on the historical sum of squares in each dimension
[Duchi et al., 2011]
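A minimal numpy sketch of the AdaGrad update; the cache name follows the slides' convention, and the toy gradient and hyperparameter values are stand-ins.

import numpy as np

def grad(x):
    return x                     # toy stand-in gradient

x = np.array([10.0, 1.0])
cache = np.zeros_like(x)         # per-dimension sum of squared gradients
learning_rate = 0.1
eps = 1e-7                       # avoids division by zero

for step in range(100):
    dx = grad(x)
    cache += dx ** 2
    x += -learning_rate * dx / (np.sqrt(cache) + eps)   # element-wise scaling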
AdaGrad update
Q: What happens with AdaGrad? (Each dimension gets its own effective step size: progress along steep directions, where squared gradients accumulate quickly, is damped, while progress along flat directions is relatively amplified.)
AdaGrad update
Q2: What happens to the step size over long time? (The cache only grows, so the effective step size decays toward zero and learning can stop too early.)
RMSProp update
[Tieleman and Hinton, 2012]
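A minimal numpy sketch of the RMSProp update: the only change from AdaGrad is that the cache becomes a leaky (exponentially decayed) accumulator. The decay_rate value here is a typical choice, not prescribed by the slide.

import numpy as np

def grad(x):
    return x                     # toy stand-in gradient

x = np.array([10.0, 1.0])
cache = np.zeros_like(x)
learning_rate = 0.01
decay_rate = 0.99                # typically something like 0.9, 0.99, or 0.999
eps = 1e-7

for step in range(100):
    dx = grad(x)
    cache = decay_rate * cache + (1 - decay_rate) * dx ** 2   # leaky accumulator
    x += -learning_rate * dx / (np.sqrt(cache) + eps)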
Introduced in a slide in Geoff Hinton’s Coursera class, lecture 6.
Cited by several papers as: Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
[Animation: optimizer trajectories, comparing adagrad and rmsprop]
Adam update
[Kingma and Ba, 2014]
(incomplete, but close)
[annotations: the first update is momentum-like, the second is RMSProp-like]
Looks a bit like RMSProp with momentum.
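A minimal numpy sketch of this incomplete form of Adam (no bias correction yet); the beta1 / beta2 values are typical choices and the toy gradient is a stand-in.

import numpy as np

def grad(x):
    return x                     # toy stand-in gradient

x = np.array([10.0, 1.0])
m = np.zeros_like(x)             # first moment (momentum-like)
v = np.zeros_like(x)             # second moment (RMSProp-like)
beta1, beta2 = 0.9, 0.995
learning_rate = 0.01
eps = 1e-7

for step in range(100):
    dx = grad(x)
    m = beta1 * m + (1 - beta1) * dx               # momentum-like update
    v = beta2 * v + (1 - beta2) * (dx ** 2)        # RMSProp-like update
    x += -learning_rate * m / (np.sqrt(v) + eps)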
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
Q: Which one of these learning rates is best to use?
=> Learning rate decay over time!
step decay: e.g. decay the learning rate by half every few epochs.
exponential decay: alpha = alpha_0 * exp(-k t)
1/t decay: alpha = alpha_0 / (1 + k t)
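A small sketch of the three schedules; alpha_0, k, and the step-decay settings are hypothetical values for illustration.

import numpy as np

alpha_0 = 0.1   # initial learning rate
k = 0.1         # decay rate

def step_decay(epoch, drop_every=10, factor=0.5):
    # e.g. halve the learning rate every `drop_every` epochs
    return alpha_0 * factor ** (epoch // drop_every)

def exponential_decay(t):
    return alpha_0 * np.exp(-k * t)

def one_over_t_decay(t):
    return alpha_0 / (1 + k * t)

for epoch in range(0, 50, 10):
    print(epoch, step_decay(epoch), exponential_decay(epoch), one_over_t_decay(epoch))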
Second order optimization methods
second-order Taylor expansion:
Solving for the critical point we obtain the Newton parameter update:
Q: what is nice about this update?
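The expansion and the resulting Newton step, reconstructed in standard notation (the slide shows these as images); H denotes the Hessian of J at θ₀.

% Second-order Taylor expansion of the loss around theta_0
J(\theta) \approx J(\theta_0)
  + (\theta - \theta_0)^{\top} \nabla_{\theta} J(\theta_0)
  + \tfrac{1}{2} (\theta - \theta_0)^{\top} H (\theta - \theta_0)

% Setting the gradient of this quadratic model to zero (the critical point)
% gives the Newton parameter update:
\theta^{*} = \theta_0 - H^{-1} \nabla_{\theta} J(\theta_0)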
Notice: no hyperparameters! (e.g. no learning rate)
Q2: Why is this impractical for training Deep Neural Nets?
Second order optimization methods
Quasi-Newton methods (e.g. BFGS): instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).
L-BFGS (limited-memory BFGS): does not form/store the full inverse Hessian.
Adam update (complete form, with bias correction)
[Kingma and Ba, 2014]
[annotations: momentum-like first moment, RMSProp-like second moment, plus bias correction (only relevant in the first few iterations, when t is small)]
The bias correction compensates for the fact that m and v are initialized at zero and need some time to “warm up”.
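A minimal numpy sketch of the complete Adam update with bias correction; hyperparameter values are typical choices and the toy gradient is a stand-in. Note that t starts at 1, so the correction factors 1 - beta^t start well below 1 and quickly approach 1.

import numpy as np

def grad(x):
    return x                     # toy stand-in gradient

x = np.array([10.0, 1.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
beta1, beta2 = 0.9, 0.995
learning_rate = 0.01
eps = 1e-7

for t in range(1, 101):                      # t starts at 1
    dx = grad(x)
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * (dx ** 2)
    mb = m / (1 - beta1 ** t)                # bias-corrected first moment
    vb = v / (1 - beta2 ** t)                # bias-corrected second moment
    x += -learning_rate * mb / (np.sqrt(vb) + eps)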
L-BFGS
Works well in the full-batch, deterministic setting: i.e. if you have a single, deterministic f(x), then L-BFGS will probably work very nicely.
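As an illustration of the deterministic, full-batch case, here is a short example using SciPy's L-BFGS-B implementation on a fixed quadratic objective (assumes scipy is installed; the objective and starting point are arbitrary):

import numpy as np
from scipy.optimize import minimize

A = np.diag([1.0, 10.0])         # a fixed, deterministic quadratic bowl

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

x0 = np.array([5.0, -3.0])
res = minimize(f, x0, jac=grad_f, method='L-BFGS-B')
print(res.x, res.fun)            # reaches the minimum at the origin in a few iterations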
In practice: Adam is a good default choice in most cases. If you can afford to do full-batch updates, then try out L-BFGS (and don't forget to disable all sources of noise).
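To tie the pieces together, a hedged end-to-end sketch combining mini-batches, the Adam update, and step decay on a toy least-squares problem; the data, model, and hyperparameter values are all stand-ins.

import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem as a stand-in for a real network and dataset.
X = rng.standard_normal((1000, 20))
true_w = rng.standard_normal(20)
y = X @ true_w + 0.01 * rng.standard_normal(1000)

w = np.zeros(20)
m, v = np.zeros_like(w), np.zeros_like(w)
beta1, beta2, eps = 0.9, 0.995, 1e-7
base_lr = 0.05

t = 0
for epoch in range(20):
    lr = base_lr * 0.5 ** (epoch // 5)           # step decay: halve every 5 epochs
    for i in range(0, 1000, 100):                # mini-batches of 100 examples
        xb, yb = X[i:i + 100], y[i:i + 100]
        dw = xb.T @ (xb @ w - yb) / 100          # gradient of 0.5 * mean squared error
        t += 1
        m = beta1 * m + (1 - beta1) * dw
        v = beta2 * v + (1 - beta2) * dw ** 2
        mb, vb = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
        w += -lr * mb / (np.sqrt(vb) + eps)      # Adam update

print(np.mean((X @ w - y) ** 2))                 # final training MSE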