1 of 39

Optimization

Rowel Atienza

rowel@eee.upd.edu.ph

University of the Philippines

2 of 39

Optimization

Finding the parameters, 𝞱, of a neural network that significantly reduce the loss function L(𝞱)

Measured in terms of a performance measure, P, on the entire training set and some regularization terms

P is what makes optimization in machine learning different from pure optimization, where minimizing the objective is itself the end goal.

3 of 39

Optimization

Loss function from an empirical distribution p’data (over the training set)

L(θ) = E_{(x,y)~p̂_data} L(f(x;θ), y)

f(x;𝜽) per sample prediction

y is the label

Usually, we want to minimize L* over the true data-generating distribution p_data

L*(θ) = E_{(x,y)~p_data} L(f(x;θ), y)

4 of 39

Empirical Risk Minimization

Empirical Risk Minimization:

L(θ) = E_{(x,y)~p̂_data} L(f(x;θ), y) = (1/m) Σ_{i=1}^{m} L(f(x^(i);θ), y^(i))

m is the number of samples

In some cases, prone to overfitting if training runs indefinitely

Use a validation set to detect the onset of overfitting
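
A small NumPy sketch of the empirical risk above, using a linear model with squared-error loss as a stand-in for f and L (all names and data are illustrative, not from the slides):

import numpy as np

# Hypothetical data and linear model f(x; theta) = x @ theta
m = 100
x = np.random.randn(m, 3)
theta = np.random.randn(3)
y = np.random.randn(m)

per_sample_loss = (x @ theta - y) ** 2       # L(f(x^(i); theta), y^(i))
empirical_risk = per_sample_loss.mean()      # (1/m) * sum over the m samples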

5 of 39

Minibatch Stochastic

Using the entire training set (known as deterministic or batch gradient descent) is expensive and gives less than linear returns: the gradient estimate improves only with the square root of the number of samples. Note that the gradient itself is a random variable.

Using a minibatch (a small random subset of the entire training set) offers many advantages:

Suitable for parallelization; GPU memory limits batch size

GPUs perform better with power-of-2 batch sizes, typically 32 to 256

Small batches offer a regularizing effect and can improve generalization error

Small batch sizes increase gradient variance, so the learning rate must be reduced

Shuffling the data so that minibatches are independent improves training

6 of 39

Challenges

Ill-conditioning of the Hessian matrix H: an issue for both convex and non-convex optimization

A gradient step −εg, by the second-order Taylor expansion of the loss, adds ½ε²gᵀHg − εgᵀg to the cost:

f(x₀ − εg) ≈ f(x₀) − εgᵀg + ½ε²gᵀHg

If ½ε²gᵀHg > εgᵀg, the loss increases instead of decreasing, so we must choose a small ε to keep training stable (see the numeric sketch below).

This results in a slow learning process
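
A numeric sketch of this criterion; the gradient, Hessian, and step sizes below are made-up illustrative values:

import numpy as np

g = np.array([1.0, 1.0])
H = np.diag([1.0, 100.0])          # ill-conditioned Hessian (condition number 100)

def cost_change(eps):
    # Predicted change in loss for a step of -eps * g (2nd-order Taylor expansion)
    return 0.5 * eps**2 * g @ H @ g - eps * g @ g

print(cost_change(0.1))    # > 0: the loss increases, the step is too large
print(cost_change(0.01))   # < 0: the loss decreases, the small step is stable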

7 of 39

Challenges

Convex loss function: any local minimum is a global minimum

Non-convex: any local minimum is not necessarily a global minimum. We settle for a local minimum that satisfies our performance metrics.

8 of 39

Challenges

Saddle points: found in high-dimensional models

The Hessian matrix has both positive and negative eigenvalues

Saddle points are common in high-dimensional spaces; local minima are more common in low-dimensional spaces

Can often be escaped by SGD; gradient descent is designed to move downhill and does not explicitly seek critical points, unlike Newton's method

Newton's method encounters difficulty dealing with saddle points and maxima

Saddle-free Newton method can overcome saddle points - research in progress

(Figure: a saddle point)

9 of 39

Challenges

Cliff

Gradient descent proposes a very large parameter change

thus overshooting the minimum - exploding gradient

Solution is to use gradient clipping - capping the gradient norm (see the sketch below)

A common problem in recurrent neural networks
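
A minimal sketch of gradient clipping by norm, one common form of capping (the threshold and gradient values are illustrative):

import numpy as np

def clip_by_norm(g, max_norm=1.0):
    # Rescale the gradient if its norm exceeds max_norm, keeping its direction
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g

g = np.array([30.0, -40.0])            # an exploding gradient near a cliff
print(clip_by_norm(g, max_norm=5.0))   # [ 3. -4.]  (norm capped at 5)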

10 of 39

Challenges

Long-term dependencies (e.g., RNN, LSTM)

Performing the same computation many times

Applying the same W t times

Wᵗ = (V diag(λ) V⁻¹)ᵗ = V diag(λ)ᵗ V⁻¹

if |λ| < 1, the term vanishes as t increases

if |λ| > 1, the term explodes as t increases

Gradients are influenced by diag(λ) (see the numeric sketch below)

Collectively known as the vanishing or exploding gradients problem
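
A small numeric sketch of this effect (made-up eigenvalues):

import numpy as np

lam = np.array([0.9, 1.1])     # eigenvalues of W
for t in [1, 10, 50, 100]:
    print(t, lam ** t)         # entries of diag(lambda)^t
# 0.9^100 ~ 2.7e-5 (vanishes); 1.1^100 ~ 1.4e4 (explodes)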

11 of 39

Challenges

Inexact gradients due to noisy or biased estimates

Deep learning uses sampling-based estimates

Local and Global Structure

Optimization does not necessarily lead to a critical point (global minimum, local minimum, or saddle point).

Most of the time, training only reaches points with near-zero gradient that give acceptable performance

12 of 39

Challenges

Wrong side of the mountain: gradient descent will not find the minimum

Solution: algorithm for choosing the initial points (initial parameters)

Bad initial points start the search on the wrong side of the mountain

13 of 39

Parameter Initialization

Initial weights determine whether the objective function converges or not

They also affect network generalization and optimization

Modern initialization strategies are simple and heuristic

Optimization for neural networks is not yet well understood

Initialize weights and biases with different random values (symmetry breaking) - no two units connected to the same inputs should have the same initial weights

Initialize weights from a high-entropy distribution such as a Gaussian or uniform distribution

14 of 39

Parameter Initialization

Large weights are good for optimization and break symmetry

But… they can saturate activation functions and cause gradient explosion, especially in RNNs

Small weights are good for regularization

Optimal weights are good for optimization

Weights must be big enough for gradients to propagate

15 of 39

Parameter Initialization - Weights

Easiest and Default

Random values with std = 0.01

Difficult to converge on deep models

16 of 39

Parameter Initialization - Weights

Glorot Initialization

Assume a network layer with m inputs, n outputs

W_{i,j} ~ U(−√(6/(m+n)), +√(6/(m+n))) - Glorot uniform

W_{i,j} ~ N(0, σ = √(2/(m+n))) - Glorot normal

Balances the variance of activations and gradients across layers

Assumes linear activations; suited to shallow models

[Glorot & Bengio, AISTATS 2010, http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf]

17 of 39

Parameter Initialization - Weights

He Initialization

Assume a network layer with m inputs

W_{i,j} ~ U(−√(6/m), +√(6/m)) - He uniform

W_{i,j} ~ N(0, σ = √(2/m)) - He normal

Applicable to non-linear activations (ReLU, PReLU); Deep models

[He et al., http://arxiv.org/abs/1502.01852]
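
A corresponding sketch of He initialization, which uses only the fan-in m (same caveats as the Glorot sketch; the uniform limit √(6/m) follows the common convention):

import numpy as np

def he_uniform(m, n):
    limit = np.sqrt(6.0 / m)
    return np.random.uniform(-limit, limit, size=(m, n))

def he_normal(m, n):
    std = np.sqrt(2.0 / m)
    return np.random.normal(0.0, std, size=(m, n))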

18 of 39

Parameter Initialization - Biases

Set biases to zero (only for linear activation)

Use small values (eg 0.1) for ReLU activation

Set to 1 for the LSTM forget gate

For an output layer whose labels have a highly skewed categorical distribution with marginal probabilities c_i, solve softmax(b) = c for the bias b

where c is the vector of marginal class probabilities estimated from the training labels; with small initial weights, the bias alone determines the output distribution
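
Since softmax(b) = c is solved (up to an additive constant) by b = log(c), a hypothetical sketch with made-up class frequencies:

import numpy as np

c = np.array([0.90, 0.07, 0.03])   # marginal class probabilities from the labels
b = np.log(c)                      # output-layer bias; softmax(b) recovers c

e = np.exp(b - b.max())
print(e / e.sum())                 # [0.9  0.07 0.03]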

19 of 39

Optimization Algorithms

Stochastic Gradient Descent

with Momentum

Adaptive Gradient (AdaGrad)

RMSProp

with Momentum

Adaptive Moment (Adam)

20 of 39

Stochastic Gradient Descent

Instead of using the whole training set, we use a minibatch of m iid samples

Learning Rate:

Gradually decrease the learning rate during training, since after some time the noise in the minibatch gradient becomes more significant than the true gradient

Apply learning rate decay until iteration k = τ, after which it is kept constant at ε_τ (see the sketch below)

ε_k = (1 − k/τ) ε_0 + (k/τ) ε_τ

ε is usually chosen by trial and error while observing the learning curves

Typical values: 0.001 to 0.01
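
A sketch of this linear decay schedule (τ and the endpoint learning rates are arbitrary example values):

def lr_schedule(k, tau=1000, eps0=0.01, eps_tau=0.001):
    # Linear decay for k <= tau, then constant at eps_tau
    if k >= tau:
        return eps_tau
    a = k / tau
    return (1 - a) * eps0 + a * eps_tau

print(lr_schedule(0), lr_schedule(500), lr_schedule(2000))   # 0.01 0.0055 0.001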

21 of 39

Stochastic Gradient Descent Algorithm

Require: Learning rate schedule ε_1, ε_2, …, ε_k

Require: Initial Parameter 𝛉

k ← 1

while stopping criterion is not met do

Sample a minibatch { x(1), ..., x(m) } with corresponding targets { y(1), ..., y(m) }

Compute gradient estimate: g ← (1/m) ∇_θ Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))

Apply update: θ ← θ − ε_k g

k = k + 1

end while
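
A NumPy sketch of a single SGD step, using a linear model with squared-error loss as a stand-in for f and L (all names and data here are illustrative, not from the slides):

import numpy as np

def sgd_step(theta, x_batch, y_batch, eps):
    # Gradient of the mean squared error for a linear model f(x; theta) = x @ theta
    m = len(y_batch)
    g = (2.0 / m) * x_batch.T @ (x_batch @ theta - y_batch)
    return theta - eps * g

theta = np.zeros(3)
x, y = np.random.randn(32, 3), np.random.randn(32)   # one minibatch of size 32
theta = sgd_step(theta, x, y, eps=0.01)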

22 of 39

Stochastic Gradient Descent

Sufficient conditions for the convergence of SGD:

Σ_{k=1}^{∞} ε_k = ∞

Σ_{k=1}^{∞} ε_k² < ∞

(e.g., ε_k ∝ 1/k satisfies both conditions)

23 of 39

Stochastic Gradient Descent - Convergence Rate

Theoretically, the excess error L(θ) − min_θ L(θ) has a convergence lower bound of O(1/k) for convex problems. Converging faster than O(1/k) does not improve the generalization error and amounts to overfitting the training set.

Batch gradient descent generally converges better than SGD. One useful technique is to gradually increase the batch size during training.

24 of 39

Momentum on SGD for Speed Improvement

v ← 𝛼v - 𝝐g

𝞱 ← 𝞱 + v

where v is the velocity that accumulates the gradients g; v carries the influence of past gradients

α is the momentum coefficient in [0, 1); typical values are 0.5, 0.9, and 0.99. The larger α is relative to ε, the bigger the influence of past gradients (a snowballing effect)

If successive gradients always agree (always point in the same direction), the terminal velocity is ε‖g‖/(1 − α)

With α = 0.9, the gradient step is effectively multiplied by 1/(1 − 0.9) = 10 (the momentum effect)

25 of 39

SGD Algorithm with Momentum

Require: Learning rate schedule ε_1, ε_2, …, ε_k

Require: Initial Parameter 𝛉

k ← 1

while stopping criterion is not met do

Sample a minibatch { x(1), ..., x(m) } with corresponding targets { y(1), ..., y(m) }

Compute gradient estimate: g ← (1/m) ∇_θ Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))

Compute velocity: v ← 𝛼v - 𝝐g

Apply update: 𝞱 ← 𝞱 + v

k = k + 1

end while
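
A sketch of the momentum update as a small helper; the gradient g is assumed to come from the current minibatch:

import numpy as np

def momentum_step(theta, v, grad, eps=0.01, alpha=0.9):
    # v accumulates an exponentially decaying sum of past gradients
    v = alpha * v - eps * grad
    theta = theta + v
    return theta, v

theta, v = np.zeros(3), np.zeros(3)
g = np.array([1.0, -2.0, 0.5])        # gradient from the current minibatch
theta, v = momentum_step(theta, v, g)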

26 of 39

Momentum on SGD for Speed Improvement

Nesterov momentum: the gradient of the loss is evaluated after the current momentum step is applied, i.e., at the look-ahead point

θ → θ + αv

27 of 39

SGD Algorithm with Nesterov Momentum

Require: Learning rate schedule ε_1, ε_2, …, ε_k

Require: Initial Parameter 𝛉

k ← 1

while stopping criterion is not met do

Sample a minibatch { x(1), ..., x(m) } with corresponding targets { y(1), ..., y(m) }

Compute gradient estimate: g ← (1/m) ∇_θ Σ_{i=1}^{m} L(f(x^(i); θ + αv), y^(i))

Compute velocity: v ← 𝛼v - 𝝐g

Apply update: 𝞱 ← 𝞱 + v

k = k + 1

end while
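
A sketch of the Nesterov variant, assuming a hypothetical grad_fn that returns the minibatch gradient at a given parameter value:

def nesterov_step(theta, v, grad_fn, eps=0.01, alpha=0.9):
    # Evaluate the gradient at the look-ahead point theta + alpha * v
    g = grad_fn(theta + alpha * v)
    v = alpha * v - eps * g
    return theta + v, v

theta, v = 5.0, 0.0
for _ in range(100):
    theta, v = nesterov_step(theta, v, grad_fn=lambda t: 2.0 * t)   # loss = t^2
print(theta)   # approaches the minimum at 0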

28 of 39

Adaptive Learning Rates

AdaGrad (Adaptive Gradient) [Duchi et al. 2011]: each parameter's learning rate is scaled inversely proportional to the square root of the sum of all its historical squared gradient values

r ← r + g ⊙ g

Δθ ← −(ε / (δ + √r)) ⊙ g

θ ← θ + Δθ

where δ is a small constant (e.g., 10⁻⁷)

Effective for some deep learning models but not all; accumulating squared gradients from the very start of training can cause an excessive decrease in the effective learning rate

29 of 39

AdaGrad Algorithm

Require: Learning rate ε, initial parameter θ, small constant δ = 10⁻⁷ for numerical stability

r ← 0

while stopping criterion is not met do

Sample a minibatch { x(1), ..., x(m) } with corresponding targets { y(1), ..., y(m) }

Compute gradient estimate: g ← (1/m) ∇_θ Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))

Accumulate squared gradient: r ← r + g ⊙ g

Compute update: Δθ ← −(ε / (δ + √r)) ⊙ g (division and square root applied element-wise)

Apply update: θ ← θ + Δθ

end while
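
A NumPy sketch of one AdaGrad step (the state r is carried between calls; values are illustrative):

import numpy as np

def adagrad_step(theta, r, grad, eps=0.01, delta=1e-7):
    r = r + grad * grad                             # accumulate squared gradients
    delta_theta = -eps * grad / (delta + np.sqrt(r))
    return theta + delta_theta, r

theta, r = np.zeros(3), np.zeros(3)
g = np.array([0.5, -1.0, 2.0])
theta, r = adagrad_step(theta, r, g)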

30 of 39

Adaptive Learning Rates

RMSProp [Hinton 2012] is like AdaGrad but replaces gradient accumulation with exponentially weighted moving average; Suitable for nonconvex optimization

r ← ρr + (1 − ρ) g ⊙ g

Δθ ← −(ε / √(δ + r)) ⊙ g

θ ← θ + Δθ

where δ is a small constant (e.g., 10⁻⁶) and ρ is the decay rate (e.g., 0.9)

Discards history from the extreme past. Effective and practical for deep neural networks.

On nonconvex problems, AdaGrad might shrink the learning rate too much before reaching a locally convex bowl. RMSProp behaves like AdaGrad initialized within that bowl and converges rapidly once it finds it.

31 of 39

RMSProp Algorithm

Require: Learning rate ε, initial parameter θ, small constant δ = 10⁻⁶ for numerical stability, decay rate ρ (e.g., 0.9)

r ← 0

while stopping criterion is not met do

Sample a minibatch { x(1), ..., x(m) } with corresponding targets { y(1), ..., y(m) }

Compute gradient estimate: g ← (1/m) ∇_θ Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))

Accumulate squared gradient: r ← ρr + (1 − ρ) g ⊙ g

Compute parameter update: Δθ ← −(ε / √(δ + r)) ⊙ g (ε / √(δ + r) applied element-wise)

Apply update: θ ← θ + Δθ

end while
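
The corresponding RMSProp sketch, replacing the running sum of AdaGrad with an exponential moving average:

import numpy as np

def rmsprop_step(theta, r, grad, eps=0.001, rho=0.9, delta=1e-6):
    r = rho * r + (1 - rho) * grad * grad              # moving average of g*g
    delta_theta = -eps * grad / np.sqrt(delta + r)
    return theta + delta_theta, r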

32 of 39

Adaptive Learning Rates

RMSProp with Nesterov Momentum

r ← ρr + (1 − ρ) g ⊙ g

v ← αv − (ε / √r) ⊙ g

θ ← θ + v

where δ is a small constant (e.g., 10⁻⁶)

𝜌 is the decay rate

𝛼 is momentum coefficient

33 of 39

RMSProp Algorithm with Nesterov Momentum

Require: Learning rate ε, initial parameter θ, small constant δ = 10⁻⁶ for numerical stability, decay rate ρ

r ← 0

while stopping criterion is not met do

Sample a minibatch { x(1), ..., x(m) } with corresponding targets { y(1), ..., y(m) }

Compute gradient estimate: g ← (1/m) ∇_θ Σ_{i=1}^{m} L(f(x^(i); θ + αv), y^(i))

Accumulate squared gradient: r ← ρr + (1 − ρ) g ⊙ g

Compute velocity update: v ← αv − (ε / √r) ⊙ g (ε / √r applied element-wise)

Apply update: 𝞱 ← 𝞱 + v

end while

34 of 39

Adaptive Learning Rates

Adam (Adaptive Moments) [Kingma and Ba 2014]

First moment: s ← ρ_1 s + (1 − ρ_1) g,  s' ← s / (1 − ρ_1^t)

Built-in 1st-order momentum & correction

Second moment: r ← ρ_2 r + (1 − ρ_2) g ⊙ g,  r' ← r / (1 − ρ_2^t)

Built-in 2nd-order momentum & correction

Δθ ← −ε s' / (δ + √r'),  θ ← θ + Δθ

where δ is a small constant for numerical stabilization (e.g., 10⁻⁸), ρ_1, ρ_2 ∈ [0, 1) (suggested: ρ_1 = 0.9, ρ_2 = 0.999), t is the time step, and the suggested ε is 0.001

35 of 39

Adam Algorithm

Require: Learning rate ε (e.g., 0.001), initial parameter θ, small constant δ for numerical stabilization (e.g., 10⁻⁸), decay rates ρ_1, ρ_2 ∈ [0, 1) (suggested: ρ_1 = 0.9, ρ_2 = 0.999)

r ← 0, s ← 0, t ← 0

while stopping criterion is not met do

Sample a minibatch { x(1), ..., x(m) } with corresponding targets { y(1), ..., y(m) }

Compute gradient estimate: g ← (1/m) ∇_θ Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)); t ← t + 1

Compute 1st moment and correction: s ← ρ_1 s + (1 − ρ_1) g,  s' ← s / (1 − ρ_1^t)

Compute 2nd moment and correction: r ← ρ_2 r + (1 − ρ_2) g ⊙ g,  r' ← r / (1 − ρ_2^t)

Compute and apply update: Δθ ← −ε s' / (δ + √r'),  θ ← θ + Δθ

end while
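
A NumPy sketch of one Adam step with bias correction, using the suggested hyperparameters above (the helper name and example values are illustrative):

import numpy as np

def adam_step(theta, s, r, t, grad, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    t = t + 1
    s = rho1 * s + (1 - rho1) * grad                  # 1st moment
    r = rho2 * r + (1 - rho2) * grad * grad           # 2nd moment
    s_hat = s / (1 - rho1 ** t)                       # bias corrections
    r_hat = r / (1 - rho2 ** t)
    theta = theta - eps * s_hat / (delta + np.sqrt(r_hat))
    return theta, s, r, t

theta, s, r, t = np.zeros(3), np.zeros(3), np.zeros(3), 0
g = np.array([0.1, -0.2, 0.3])
theta, s, r, t = adam_step(theta, s, r, t, g)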

36 of 39

Batch Normalization

Applied layer-wise

Maintains zero mean and unit variance for the activation outputs

Batch normalization reduces the amount by which the hidden unit values shift around (internal covariate shift)

Allows the use of larger learning rates in deep models without causing instability

Applied to any input or hidden layer

H' = (H − μ) / σ (row-wise operation; one sample per row)

where the matrix H holds a minibatch of activations of the layer to normalize

μ is the vector of per-unit means and σ the vector of per-unit standard deviations (applied with broadcasting)

37 of 39

Batch Normalization

At training time,

μ = (1/m) Σ_i H_{i,:}   (1)

σ = √(δ + (1/m) Σ_i (H − μ)_i²)   (2)

δ = 10⁻⁸ for numerical stability

Because μ and σ are computed from H itself, back-propagating through these operations discourages updates that act simply by shifting the mean or scaling the variance of the activations

In practice, running averages of μ and σ are stored (for use at inference time) so we do not have to compute (1) and (2) every time.
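
A minimal NumPy sketch of the training-time normalization in (1) and (2); a full batch-normalization layer would also include learned scale and shift parameters, omitted here:

import numpy as np

def batch_norm_train(H, delta=1e-8):
    # H: minibatch of activations, one sample per row
    mu = H.mean(axis=0)                                     # per-unit mean   (1)
    sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))   # per-unit std    (2)
    return (H - mu) / sigma

H = np.random.randn(32, 4) * 5.0 + 3.0
H_norm = batch_norm_train(H)
print(H_norm.mean(axis=0).round(3), H_norm.std(axis=0).round(3))   # ~0 and ~1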

38 of 39

Reference

Deep Learning, Ian Goodfellow and Yoshua Bengio and Aaron Courville, MIT Press, 2016, http://www.deeplearningbook.org

39 of 39

End