Lecture 11: Training Neural Networks, Part V
Erik Learned-Miller and TAs
Adapted from slides of Fei-Fei Li, Andrej Karpathy & Justin Johnson
8 Oct 2019
Parameter Updates
Training a neural network, main loop:
So far the parameter update in this loop has been the simple gradient descent update; now we will complicate it.
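To make the loop concrete, here is a minimal, self-contained numpy sketch of the main loop with the simple gradient descent update. The quadratic toy objective and the names loss_and_grad, x, and learning_rate are stand-ins for a real network's loss and parameters, not the slide's actual code.

import numpy as np

# Toy stand-in for a network's forward/backward pass on a sampled mini-batch:
# returns the loss and the gradient of the loss w.r.t. the parameters x.
def loss_and_grad(x, rng):
    noise = 0.1 * rng.standard_normal(x.shape)   # mini-batch noise
    grad = x + noise                             # gradient of 0.5 * ||x||^2, perturbed
    loss = 0.5 * float(x @ x)
    return loss, grad

rng = np.random.default_rng(0)
x = rng.standard_normal(10)            # parameters
learning_rate = 0.1

for step in range(100):
    loss, dx = loss_and_grad(x, rng)   # "forward/backward" on a batch
    x += -learning_rate * dx           # simple gradient descent update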
[Animations of different parameter update methods. Image credits: Alec Radford]
Suppose loss function is steep vertically but shallow horizontally:
Q: What is the trajectory along which we converge towards the minimum with SGD?
A: Very slow progress along the flat (shallow) direction, jitter along the steep one.
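A small numeric illustration of this behavior, using a hypothetical quadratic with curvature 2.0 in the steep direction and 0.02 in the shallow one; the learning rate of 0.9 is chosen to make the jitter visible.

import numpy as np

# f(x, y) = 0.5 * (0.02 * x^2 + 2.0 * y^2): shallow in x, steep in y
def grad(p):
    return np.array([0.02 * p[0], 2.0 * p[1]])

p = np.array([10.0, 1.0])
learning_rate = 0.9

for step in range(20):
    p += -learning_rate * grad(p)
    # x (shallow) shrinks by only 1.8% per step; y (steep) flips sign every step
    print(step, p)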
Momentum update
- Physical interpretation: a ball rolling down the loss surface, with friction (coefficient mu).
- mu is usually ~0.5, 0.9, or 0.99 (sometimes annealed over time, e.g. from 0.5 -> 0.99).
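A minimal numpy sketch of the momentum update, using the mu / v / learning_rate names from the slides; the toy gradient function is a stand-in for a network's backward pass.

import numpy as np

def grad(x):
    return x                    # toy gradient of 0.5 * ||x||^2 (stand-in)

x = np.array([10.0, 1.0])       # parameters
v = np.zeros_like(x)            # velocity, initialized at zero
mu = 0.9                        # friction / momentum coefficient
learning_rate = 0.1

for step in range(100):
    dx = grad(x)
    v = mu * v - learning_rate * dx   # integrate velocity
    x += v                            # integrate position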
SGD vs. Momentum
Notice momentum overshooting the target, but overall getting to the minimum much faster.
Nesterov Momentum update
Ordinary momentum update: [diagram: the actual step is the sum of the momentum step and the gradient step evaluated at the current point]
Nesterov Momentum update
Momentum update: [diagram: actual step = momentum step + gradient step taken at the current point]
Nesterov momentum update: [diagram: actual step = momentum step + "lookahead" gradient step, evaluated after the momentum step (a bit different than the original gradient)]
Nesterov: the only difference is that the gradient in the velocity update is evaluated at the lookahead point θ + μv rather than at θ.
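The two updates, reconstructed in standard notation (the slide shows them as images); μ is the momentum coefficient and ε the learning rate.

% Ordinary momentum update
v_t = \mu\, v_{t-1} - \varepsilon \nabla f(\theta_{t-1}), \qquad
\theta_t = \theta_{t-1} + v_t

% Nesterov momentum update: the gradient is taken at the lookahead point
v_t = \mu\, v_{t-1} - \varepsilon \nabla f(\theta_{t-1} + \mu\, v_{t-1}), \qquad
\theta_t = \theta_{t-1} + v_t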
Nesterov Momentum update
Slightly inconvenient… the gradient is needed at θ + μv, but usually we only compute the loss and gradient at the current parameter vector θ.
Variable transform and rearranging saves the day: let φ = θ + μv.
Replace all thetas with phis, rearrange, and obtain an update expressed entirely in terms of φ and ∇f(φ) (see the sketch below).
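A minimal numpy sketch of the commonly used rearranged form, assuming x already plays the role of the transformed variable φ, so dx is the gradient evaluated at the current (lookahead) parameters; the toy gradient is a stand-in.

import numpy as np

def grad(x):
    return x                    # toy stand-in for the gradient at phi

x = np.array([10.0, 1.0])       # plays the role of phi = theta + mu * v
v = np.zeros_like(x)
mu = 0.9
learning_rate = 0.1

for step in range(100):
    dx = grad(x)                # gradient at the lookahead position
    v_prev = v
    v = mu * v - learning_rate * dx
    x += -mu * v_prev + (1 + mu) * v   # Nesterov update in the transformed variable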
Nesterov Momentum update: another view
nag = Nesterov Accelerated Gradient
AdaGrad update
Added element-wise scaling of the gradient based on the historical sum of squares in each dimension
[Duchi et al., 2011]
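A minimal numpy sketch of the AdaGrad update; the cache name follows the slides' convention, and the toy gradient and hyperparameter values are stand-ins.

import numpy as np

def grad(x):
    return x                     # toy stand-in gradient

x = np.array([10.0, 1.0])
cache = np.zeros_like(x)         # per-dimension sum of squared gradients
learning_rate = 0.1
eps = 1e-7                       # avoids division by zero

for step in range(100):
    dx = grad(x)
    cache += dx ** 2
    x += -learning_rate * dx / (np.sqrt(cache) + eps)   # element-wise scaling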
AdaGrad update
Q: What happens with AdaGrad? (Each dimension gets its own effective step size: progress along steep directions, where squared gradients accumulate quickly, is damped, while progress along flat directions is relatively amplified.)
AdaGrad update
Q2: What happens to the step size over long time? (The cache only grows, so the effective step size decays toward zero and learning can stop too early.)
RMSProp update
[Tieleman and Hinton, 2012]
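A minimal numpy sketch of the RMSProp update: the only change from AdaGrad is that the cache becomes a leaky (exponentially decayed) accumulator. The decay_rate value here is a typical choice, not prescribed by the slide.

import numpy as np

def grad(x):
    return x                     # toy stand-in gradient

x = np.array([10.0, 1.0])
cache = np.zeros_like(x)
learning_rate = 0.01
decay_rate = 0.99                # typically something like 0.9, 0.99, or 0.999
eps = 1e-7

for step in range(100):
    dx = grad(x)
    cache = decay_rate * cache + (1 - decay_rate) * dx ** 2   # leaky accumulator
    x += -learning_rate * dx / (np.sqrt(cache) + eps)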
Introduced in a slide in Geoff Hinton’s Coursera class, lecture 6.
Cited by several papers as: Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
[Animation: optimizer trajectories, comparing adagrad and rmsprop]
Adam update
[Kingma and Ba, 2014]
(incomplete, but close)
[annotations: the first update is momentum-like, the second is RMSProp-like]
Looks a bit like RMSProp with momentum.
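A minimal numpy sketch of this incomplete form of Adam (no bias correction yet); the beta1 / beta2 values are typical choices and the toy gradient is a stand-in.

import numpy as np

def grad(x):
    return x                     # toy stand-in gradient

x = np.array([10.0, 1.0])
m = np.zeros_like(x)             # first moment (momentum-like)
v = np.zeros_like(x)             # second moment (RMSProp-like)
beta1, beta2 = 0.9, 0.995
learning_rate = 0.01
eps = 1e-7

for step in range(100):
    dx = grad(x)
    m = beta1 * m + (1 - beta1) * dx               # momentum-like update
    v = beta2 * v + (1 - beta2) * (dx ** 2)        # RMSProp-like update
    x += -learning_rate * m / (np.sqrt(v) + eps)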
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
Q: Which one of these learning rates is best to use?
=> Learning rate decay over time!
step decay: e.g. decay the learning rate by half every few epochs.
exponential decay: alpha = alpha_0 * exp(-k t)
1/t decay: alpha = alpha_0 / (1 + k t)
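A small sketch of the three schedules; alpha_0, k, and the step-decay settings are hypothetical values for illustration.

import numpy as np

alpha_0 = 0.1   # initial learning rate
k = 0.1         # decay rate

def step_decay(epoch, drop_every=10, factor=0.5):
    # e.g. halve the learning rate every `drop_every` epochs
    return alpha_0 * factor ** (epoch // drop_every)

def exponential_decay(t):
    return alpha_0 * np.exp(-k * t)

def one_over_t_decay(t):
    return alpha_0 / (1 + k * t)

for epoch in range(0, 50, 10):
    print(epoch, step_decay(epoch), exponential_decay(epoch), one_over_t_decay(epoch))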
Second order optimization methods
second-order Taylor expansion:
Solving for the critical point we obtain the Newton parameter update:
Q: what is nice about this update?
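The expansion and the resulting Newton step, reconstructed in standard notation (the slide shows these as images); H denotes the Hessian of J at θ₀.

% Second-order Taylor expansion of the loss around theta_0
J(\theta) \approx J(\theta_0)
  + (\theta - \theta_0)^{\top} \nabla_{\theta} J(\theta_0)
  + \tfrac{1}{2} (\theta - \theta_0)^{\top} H (\theta - \theta_0)

% Setting the gradient of this quadratic model to zero (the critical point)
% gives the Newton parameter update:
\theta^{*} = \theta_0 - H^{-1} \nabla_{\theta} J(\theta_0)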
Notice: no hyperparameters! (e.g. no learning rate)
Q2: Why is this impractical for training Deep Neural Nets?
Second order optimization methods
Quasi-Newton methods (e.g. BFGS): instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).
L-BFGS (limited-memory BFGS): does not form/store the full inverse Hessian.
Adam update (complete form, with bias correction)
[Kingma and Ba, 2014]
[annotations: momentum-like first moment, RMSProp-like second moment, plus bias correction (only relevant in the first few iterations, when t is small)]
The bias correction compensates for the fact that m and v are initialized at zero and need some time to “warm up”.
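A minimal numpy sketch of the complete Adam update with bias correction; hyperparameter values are typical choices and the toy gradient is a stand-in. Note that t starts at 1, so the correction factors 1 - beta^t start well below 1 and quickly approach 1.

import numpy as np

def grad(x):
    return x                     # toy stand-in gradient

x = np.array([10.0, 1.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
beta1, beta2 = 0.9, 0.995
learning_rate = 0.01
eps = 1e-7

for t in range(1, 101):                      # t starts at 1
    dx = grad(x)
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * (dx ** 2)
    mb = m / (1 - beta1 ** t)                # bias-corrected first moment
    vb = v / (1 - beta2 ** t)                # bias-corrected second moment
    x += -learning_rate * mb / (np.sqrt(vb) + eps)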
L-BFGS
Works well in the full-batch, deterministic setting: i.e. if you have a single, deterministic f(x), then L-BFGS will probably work very nicely.
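As an illustration of the deterministic, full-batch case, here is a short example using SciPy's L-BFGS-B implementation on a fixed quadratic objective (assumes scipy is installed; the objective and starting point are arbitrary):

import numpy as np
from scipy.optimize import minimize

A = np.diag([1.0, 10.0])         # a fixed, deterministic quadratic bowl

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

x0 = np.array([5.0, -3.0])
res = minimize(f, x0, jac=grad_f, method='L-BFGS-B')
print(res.x, res.fun)            # reaches the minimum at the origin in a few iterations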
In practice: Adam is a good default choice in most cases. If you can afford to do full-batch updates, then try out L-BFGS (and don't forget to disable all sources of noise).
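To tie the pieces together, a hedged end-to-end sketch combining mini-batches, the Adam update, and step decay on a toy least-squares problem; the data, model, and hyperparameter values are all stand-ins.

import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem as a stand-in for a real network and dataset.
X = rng.standard_normal((1000, 20))
true_w = rng.standard_normal(20)
y = X @ true_w + 0.01 * rng.standard_normal(1000)

w = np.zeros(20)
m, v = np.zeros_like(w), np.zeros_like(w)
beta1, beta2, eps = 0.9, 0.995, 1e-7
base_lr = 0.05

t = 0
for epoch in range(20):
    lr = base_lr * 0.5 ** (epoch // 5)           # step decay: halve every 5 epochs
    for i in range(0, 1000, 100):                # mini-batches of 100 examples
        xb, yb = X[i:i + 100], y[i:i + 100]
        dw = xb.T @ (xb @ w - yb) / 100          # gradient of 0.5 * mean squared error
        t += 1
        m = beta1 * m + (1 - beta1) * dw
        v = beta2 * v + (1 - beta2) * dw ** 2
        mb, vb = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
        w += -lr * mb / (np.sqrt(vb) + eps)      # Adam update

print(np.mean((X @ w - y) ** 2))                 # final training MSE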