Optimization
Rowel Atienza
rowel@eee.upd.edu.ph
University of the Philippines
Optimization
Finding the parameters, θ, of a neural network that significantly reduce the loss function L(θ)
Measured in terms of a performance measure, P, on the entire training set and some regularization terms
P is what makes optimization in machine learning different from pure optimization, where minimizing the objective is itself the end goal.
Optimization
Loss function from an empirical distribution p̂data (over the training set):
L(θ) = E(x,y)~p̂data L(f(x; θ), y)
f(x; θ) is the per-sample prediction
y is the label
Usually, we want to minimize L* over the true data-generating distribution pdata:
L*(θ) = E(x,y)~pdata L(f(x; θ), y)
Empirical Risk Minimization
Empirical Risk Minimization:
L(θ) = E(x,y)~p̂data L(f(x; θ), y) = (1/m) Σi=1..m L(f(x(i); θ), y(i))
m is the number of samples
In some cases, prone to overfitting if training runs indefinitely
use validation set to detect start of overfitting
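As a rough illustration (my own, not from the reference), the empirical risk is just the average per-sample loss; a minimal NumPy sketch with an assumed squared-error loss and a toy linear model:

import numpy as np

def empirical_risk(f, theta, X, Y):
    # L(theta) = (1/m) * sum_i loss(f(x_i; theta), y_i), here squared error
    preds = np.array([f(x, theta) for x in X])
    return np.mean((preds - Y) ** 2)

f = lambda x, theta: x @ theta            # toy per-sample prediction f(x; theta)
X, Y = np.random.randn(100, 3), np.random.randn(100)
print(empirical_risk(f, np.zeros(3), X, Y))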
Minibatch Stochastic
Using the entire training set, known as deterministic (batch) gradient descent, is expensive and gives less-than-linear returns: the gradient estimate is good, but the return on additional samples is sub-linear; note that the gradient itself is a random variable
Use of minibatch stochastic (small subset of entire training set) offers many advantages:
Suitable for parallelization; GPU memory limits batch size
GPUs perform better with power-of-2 batch sizes, typically 32 to 256
Small batches offer a regularizing effect; this improves generalization error
Small batch sizes increase gradient variance; the learning rate must be reduced
Shuffling samples so that minibatches are independent improves training (see the sketch below)
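A minimal sketch (my own) of shuffled minibatch sampling; the batch size of 32 is just an example value:

import numpy as np

def minibatches(X, Y, batch_size=32, rng=np.random.default_rng(0)):
    # shuffle once per epoch so successive minibatches are independent draws
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], Y[batch]

X, Y = np.random.randn(1000, 4), np.random.randn(1000)
for xb, yb in minibatches(X, Y):
    pass  # compute the gradient estimate on (xb, yb) here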
Challenges
Ill-conditioning of the Hessian matrix, H: an issue for both convex and non-convex optimization
A gradient step, −εg, by the second-order Taylor expansion of the loss, adds ½ε²gᵀHg − εgᵀg to the cost:
f(x0 − εg) ≈ f(x0) − εgᵀg + ½ε²gᵀHg
If ½ε²gᵀHg > εgᵀg, the loss increases instead of decreasing, so we must choose a small ε to ensure stable training.
This results in a slow learning process (see the check below)
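A small numerical check (my own illustration) on a quadratic loss f(x) = ½xᵀHx with an ill-conditioned H: for a quadratic the expansion is exact, and the step increases the loss exactly when ½ε²gᵀHg exceeds εgᵀg:

import numpy as np

H = np.diag([1.0, 1000.0])            # ill-conditioned Hessian
f = lambda x: 0.5 * x @ H @ x         # quadratic loss
x0 = np.array([1.0, 1.0])
g = H @ x0                            # gradient at x0

for eps in (0.01, 0.001):
    curvature = 0.5 * eps**2 * g @ H @ g
    descent = eps * g @ g
    # change in loss equals curvature - descent; positive means the loss went up
    print(eps, f(x0 - eps * g) - f(x0), curvature - descent)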
Challenges
Convex loss function: any local minimum is a global minimum
Non-convex: a local minimum is not necessarily a global minimum. We settle for a local minimum that satisfies our performance metrics.
Challenges
Saddle points: found in high-dimensional models
The Hessian matrix has both positive and negative eigenvalues
Saddle points are common in high-dimensional spaces; local minima dominate in low-dimensional spaces
Often easily overcome by SGD: SGD is designed to move downhill, not to seek critical points, unlike Newton's method
Newton's method has difficulty dealing with saddle points and maxima
Saddle-free Newton methods can overcome saddle points (research in progress)
(Figure: saddle point)
Challenges
Cliff
Near a cliff, gradient descent proposes a large change,
thus missing the minimum (exploding gradient)
Solution: gradient clipping, i.e. capping the gradient (see the sketch below)
A common problem in recurrent neural networks
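A minimal sketch of gradient clipping by global norm (my own illustration; max_norm is an assumed hyperparameter):

import numpy as np

def clip_by_norm(g, max_norm=1.0):
    # rescale the gradient if its L2 norm exceeds max_norm
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g

g = np.array([30.0, -40.0])               # a cliff-like gradient with norm 50
print(clip_by_norm(g, max_norm=5.0))      # [3., -4.], norm capped at 5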
Challenges
Long Term Dependencies (eg RNN, LSTM)
Performing the same computation many times
Applying the same W t-times
W^t = (V diag(λ) V⁻¹)^t = V diag(λ)^t V⁻¹
if |λ| < 1, the term vanishes as t increases
if |λ| > 1, the term explodes as t increases
Gradients are influenced by diag(λ)
Collectively known as the vanishing or exploding gradients problem
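A quick NumPy illustration (my own) of how repeated multiplication by W = V diag(λ) V⁻¹ scales components by λ^t, so they vanish or explode as t grows:

import numpy as np

lam = np.array([0.9, 1.1])                       # |λ| < 1 vanishes, |λ| > 1 explodes
V = np.array([[1.0, 1.0], [0.0, 1.0]])
W = V @ np.diag(lam) @ np.linalg.inv(V)

for t in (1, 10, 50):
    Wt = np.linalg.matrix_power(W, t)            # equals V diag(λ)^t V^{-1}
    print(t, sorted(np.linalg.eigvals(Wt)))      # eigenvalues are λ^t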
Challenges
Inexact gradients due to noisy or biased estimates
Deep learning uses sampling-based estimates
Local and Global Structure
Optimization does not necessarily lead to a critical point (global minimum, local minimum, or saddle).
Most of the time, training only reaches points of near-zero gradient with acceptable resulting performance
Challenges
Wrong side of the mountain: gradient descent will not find the minimum
Bad initial points send the objective function to the wrong side of the mountain
Solution: an algorithm for choosing the initial points (initial parameters)
Parameter Initialization
Initial weights determine if the objective function will converge or not
Also affects network generalization and optimization
Modern initialization strategies are simple and heuristic
Optimization for neural networks is not yet well understood
Initialize weights and biases with different random values (symmetry breaking): no two units connected to the same inputs should have the same initial weights
Initialize weights with a high-entropy distribution such as a Gaussian or uniform distribution
Parameter Initialization
Large weights are good for optimization and break symmetry
But they can saturate activation functions and cause gradient explosion, especially in RNNs
Small weights are good for regularization
Optimal weights are good for optimization:
big enough for gradients to propagate
Parameter Initialization - Weights
Easiest and Default
Random values with std = 0.01
Difficult to converge for deep models
Parameter Initialization - Weights
Glorot Initialization
Assume a network layer with m inputs, n outputs
Wi,j ~ U(-sqrt(6/(m+n)), +sqrt(6/(m+n))) - Glorot uniform
Wi,j ~ N(w; 0, sqrt(2/(m+n))) - Glorot normal
Balance of activation and gradient variance
Assumes linear activations; suited to shallow models
[Glorot & Bengio, AISTATS 2010, http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf]
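A minimal NumPy sketch of the Glorot formulas above for a layer with m inputs and n outputs (my own illustration):

import numpy as np

def glorot_uniform(m, n, rng=np.random.default_rng(0)):
    limit = np.sqrt(6.0 / (m + n))               # U(-limit, +limit)
    return rng.uniform(-limit, limit, size=(m, n))

def glorot_normal(m, n, rng=np.random.default_rng(0)):
    std = np.sqrt(2.0 / (m + n))                 # N(0, std)
    return rng.normal(0.0, std, size=(m, n))

W = glorot_uniform(256, 128)
print(W.std())                                   # close to sqrt(2/(m+n))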
Parameter Initialization - Weights
He Initialization
Assume a network layer with m inputs
Wi,j ~ U(-sqrt(6/m), +sqrt(6/m)) - He uniform
Wi,j ~ N(w; 0,sqrt(2/m)) - He normal
Applicable to non-linear activations (ReLU, PReLU); Deep models
[He et al., http://arxiv.org/abs/1502.01852]
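And a corresponding sketch of He initialization, which uses only the fan-in m (my own illustration, using the sqrt(6/m) uniform bound):

import numpy as np

def he_uniform(m, n, rng=np.random.default_rng(0)):
    limit = np.sqrt(6.0 / m)                      # depends only on fan-in m
    return rng.uniform(-limit, limit, size=(m, n))

def he_normal(m, n, rng=np.random.default_rng(0)):
    return rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n))

W = he_normal(512, 256)                           # suited to ReLU/PReLU layers
print(W.std())                                    # close to sqrt(2/m)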
Parameter Initialization - Biases
Set biases to zero (only for linear activation)
Use small values (eg 0.1) for ReLU activation
Set the LSTM forget gate bias to 1
For an output layer with a highly skewed categorical distribution ci, solve softmax(b) = c for the bias b
where c is the vector of marginal class frequencies ci of the training labels; with small weights, the bias alone determines the right output label (see the sketch below)
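Since softmax(b) = c is solved (up to an additive constant) by b = log c, the bias can be set from the class frequencies; a small check of this (my own, with made-up frequencies):

import numpy as np

c = np.array([0.90, 0.07, 0.03])          # skewed class frequencies from the labels
b = np.log(c)                             # output-layer bias initialization

softmax = lambda z: np.exp(z) / np.exp(z).sum()
print(softmax(b))                         # recovers c when the weights are near zero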
Optimization Algorithms
Stochastic Gradient Descent
with Momentum
Adaptive Gradient (AdaGrad)
RMSProp
with Momentum
Adaptive Moment (Adam)
Stochastic Gradient Descent
Instead of using the whole training set, we use a minibatch of m iid samples
Learning Rate:
Gradually decrease the learning rate during training, since after some time the gradient noise becomes more significant than the true gradient
Apply learning rate decay until iteration k = τ, after which it is kept constant at ετ:
εk = (1 − k/τ)ε0 + (k/τ)ετ
ε0 is usually chosen by trial and error while observing the learning curves
typical values: 0.01 to 0.001 (see the sketch below)
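A sketch of this linear decay schedule (my own; the example values of ε0, ετ, and τ are assumptions):

def lr_schedule(k, eps0=0.01, eps_tau=0.001, tau=100):
    # linear decay until iteration tau, then constant at eps_tau
    if k >= tau:
        return eps_tau
    frac = k / tau
    return (1.0 - frac) * eps0 + frac * eps_tau

print([round(lr_schedule(k), 4) for k in (0, 50, 100, 200)])   # 0.01, 0.0055, 0.001, 0.001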
Stochastic Gradient Descent Algorithm
Require: Learning rate schedule ε1, ε2, …, εk
Require: Initial parameter θ
k ← 1
while stopping criterion is not met do
Sample a minibatch { x(1), ..., x(m) } with corresponding targets { y(1), ..., y(m) }
Compute gradient estimate: g ← (1/m) ∇θ Σi L(f(x(i); θ), y(i))
Apply update: θ ← θ − εk g
k = k + 1
end while
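A minimal NumPy sketch of this loop (my own), on an assumed toy linear-regression loss with an analytic minibatch gradient:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
Y = X @ true_theta + 0.01 * rng.normal(size=1000)

theta = np.zeros(3)
eps_k = 0.01                                            # fixed here; could follow a decay schedule
for k in range(2000):
    idx = rng.choice(len(X), size=32, replace=False)    # sample a minibatch
    xb, yb = X[idx], Y[idx]
    g = 2.0 / len(xb) * xb.T @ (xb @ theta - yb)        # gradient of the MSE loss
    theta = theta - eps_k * g                           # apply update
print(theta)                                            # approaches true_theta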
Stochastic Gradient Descent
Guarantee of convergence of SGD:
Σk=1..∞ εk = ∞
Σk=1..∞ εk² < ∞
Stochastic Gradient Descent - Convergence Rate
Theoretically, the excess error L(θ) − minθ L(θ) cannot decrease faster than O(1/k) for convex problems. Converging faster than O(1/k) would not improve generalization error; it would amount to overfitting.
Generally, batch gradient descent converges better than SGD. A useful technique is to increase the batch size gradually during training.
Momentum on SGD for Speed Improvement
v ← αv − εg
θ ← θ + v
where v is the velocity, which accumulates the gradients g; v includes the influence of past gradients
α ∈ [0, 1) is the momentum coefficient; typical values are 0.5, 0.9, and 0.99; the larger α is relative to ε, the bigger the influence of past gradients, similar to a snowballing effect
If successive gradients always agree, the terminal velocity is ε||g||/(1 − α)
α = 0.9 multiplies the effective step by 1/(1 − α) = 10 (momentum effect)
SGD Algorithm with Momentum
Require: Learning rate schedule ε1, ε2, …, εk
Require: Initial parameter θ, momentum coefficient α, initial velocity v
k ← 1
while stopping criterion is not met do
Sample a minibatch { x(1), ..., x(m) } with corresponding targets { y(1), ..., y(m) }
Compute gradient estimate: g ← (1/m) ∇θ Σi L(f(x(i); θ), y(i))
Compute velocity: v ← αv − εg
Apply update: θ ← θ + v
k = k + 1
end while
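The velocity update itself is two lines; a sketch (my own) on an assumed toy quadratic loss with minimizer at 0.5:

import numpy as np

theta = np.zeros(3)
v = np.zeros_like(theta)            # velocity accumulator
alpha, eps = 0.9, 0.01

def grad(theta):                    # placeholder gradient of sum(theta**2 - theta)
    return 2.0 * theta - 1.0

for _ in range(200):
    g = grad(theta)
    v = alpha * v - eps * g         # accumulate velocity
    theta = theta + v               # apply update
print(theta)                        # converges toward [0.5, 0.5, 0.5]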
Momentum on SGD for Speed Improvement
Nesterov momentum: the gradient of the loss is evaluated after the current momentum is applied,
at the interim point θ̃ ← θ + αv
SGD Algorithm with Nesterov Momentum
Require: Learning rate schedule ε1, ε2, …, εk
Require: Initial parameter θ, momentum coefficient α, initial velocity v
k ← 1
while stopping criterion is not met do
Sample a minibatch { x(1), ..., x(m) } with corresponding targets { y(1), ..., y(m) }
Compute gradient estimate at the interim point: g ← (1/m) ∇θ Σi L(f(x(i); θ + αv), y(i))
Compute velocity: v ← αv − εg
Apply update: θ ← θ + v
k = k + 1
end while
Adaptive Learning Rates
AdaGrad (Adaptive Gradient) [Duchi et al., 2011]: the learning rate is scaled inversely proportional to the square root of the sum of all historical squared values of the gradient
r ← r + g ⊙ g
Δθ ← −(ε/(δ + √r)) ⊙ g
θ ← θ + Δθ
where δ is a small constant (e.g. 10⁻⁷)
Effective for some deep learning models but not all; accumulation of squared gradients from the beginning of training can cause an excessive decrease in the learning rate
AdaGrad Algorithm
Require: Learning rate ε, initial parameter θ, small constant δ = 10⁻⁷ for numerical stability
r β 0
while stopping criterion is not met do
Sample a minibatch { x(1), ..., x(m) } with corresponding targets { y(1), ..., y(m) }
Compute gradient estimate: g ← (1/m) ∇θ Σi L(f(x(i); θ), y(i))
Accumulate squared gradient: r ← r + g ⊙ g
Compute update: Δθ ← −(ε/(δ + √r)) ⊙ g (division and square root applied element-wise)
Apply update: θ ← θ + Δθ
end while
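A sketch of the AdaGrad update rule (my own illustration, on an assumed toy quadratic gradient):

import numpy as np

theta = np.array([1.0, 1.0])
r = np.zeros_like(theta)                          # accumulated squared gradients
eps, delta = 0.1, 1e-7

def grad(theta):                                  # placeholder gradient of 0.5 * sum(theta**2)
    return theta

for _ in range(500):
    g = grad(theta)
    r = r + g * g                                 # accumulate squared gradient
    theta = theta + (-(eps / (delta + np.sqrt(r))) * g)   # element-wise scaled step
print(theta)                                      # shrinks toward the minimum at 0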
Adaptive Learning Rates
RMSProp [Hinton 2012] is like AdaGrad but replaces gradient accumulation with exponentially weighted moving average; Suitable for nonconvex optimization
r ← ρr + (1 − ρ)g ⊙ g
Δθ ← −(ε/√(δ + r)) ⊙ g
θ ← θ + Δθ
where δ is a small constant (e.g. 10⁻⁶); ρ is the decay rate (e.g. 0.9)
Discards history from the extreme past; effective and practical for deep neural nets
On nonconvex problems, AdaGrad may decrease the learning rate too much before reaching a locally convex bowl. RMSProp is like AdaGrad but converges rapidly once it finds the locally convex bowl.
RMSProp Algorithm
Require: Learning rate ε, initial parameter θ, small constant δ = 10⁻⁶ for numerical stability, decay rate ρ (e.g. 0.9)
r β 0
while stopping criterion is not met do
Sample a minibatch { x(1), ..., x(m) } with corresponding targets { y(1), ..., y(m) }
Compute gradient estimate: g ← (1/m) ∇θ Σi L(f(x(i); θ), y(i))
Accumulate squared gradient: r ← ρr + (1 − ρ)g ⊙ g
Compute parameter update: Δθ ← −(ε/√(δ + r)) ⊙ g (ε/√(δ + r) applied element-wise)
Apply update: θ ← θ + Δθ
end while
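A sketch of the RMSProp update rule (my own illustration) with ρ = 0.9:

import numpy as np

theta = np.array([1.0, 1.0])
r = np.zeros_like(theta)
eps, delta, rho = 0.01, 1e-6, 0.9

def grad(theta):                                  # placeholder gradient of 0.5 * sum(theta**2)
    return theta

for _ in range(1000):
    g = grad(theta)
    r = rho * r + (1.0 - rho) * g * g             # exponentially weighted moving average
    theta = theta + (-(eps / np.sqrt(delta + r)) * g)
print(theta)                                      # approaches the minimum at 0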
Adaptive Learning Rates
RMSProp with Nesterov Momentum
r ← ρr + (1 − ρ)g ⊙ g
v ← αv − (ε/√r) ⊙ g
θ ← θ + v
where δ is a small constant (e.g. 10⁻⁶)
ρ is the decay rate
α is the momentum coefficient
RMSProp Algorithm with Nesterov Momentum
Require: Learning rate ε, initial parameter θ, small constant δ = 10⁻⁶ for numerical stability, decay rate ρ, momentum coefficient α, initial velocity v
r β 0
while stopping criterion is not met do
Sample a minibatch { x(1), ..., x(m) } with corresponding targets { y(1), ..., y(m) }
Compute gradient estimate at the interim point: g ← (1/m) ∇θ Σi L(f(x(i); θ + αv), y(i))
Accumulate squared gradient: r ← ρr + (1 − ρ)g ⊙ g
Compute velocity update: v ← αv − (ε/√r) ⊙ g (ε/√r applied element-wise)
Apply update: θ ← θ + v
end while
Adaptive Learning Rates
Adam (Adaptive Moments) [Kingma and Ba 2014]
first moment: s ← ρ1 s + (1 − ρ1)g,  ŝ ← s/(1 − ρ1^t)
Built-in 1st-order momentum with bias correction
second moment: r ← ρ2 r + (1 − ρ2)g ⊙ g,  r̂ ← r/(1 − ρ2^t)
Built-in 2nd-order momentum with bias correction
Δθ ← −ε ŝ/(δ + √r̂),  θ ← θ + Δθ
where δ is a small constant for numerical stabilization (e.g. 10⁻⁸), ρ1 and ρ2 ∈ [0, 1) (suggested: ρ1 = 0.9, ρ2 = 0.999), t is the time step, and the suggested learning rate is ε = 0.001
Adam Algorithm
Require: Learning rate ε (e.g. 0.001), initial parameter θ, small constant δ = 10⁻⁸ for numerical stabilization, ρ1 and ρ2 ∈ [0, 1) (suggested: ρ1 = 0.9, ρ2 = 0.999)
r β 0, s β 0, t β 0
while stopping criterion is not met do
Sample a minibatch { x(1), ..., x(m) } with corresponding targets { y(1), ..., y(m) }
Compute gradient estimate: g ← (1/m) ∇θ Σi L(f(x(i); θ), y(i)); t ← t + 1
Compute 1st moment and correction: s ← ρ1 s + (1 − ρ1)g,  ŝ ← s/(1 − ρ1^t)
Compute 2nd moment and correction: r ← ρ2 r + (1 − ρ2)g ⊙ g,  r̂ ← r/(1 − ρ2^t)
Compute and apply update: Δθ ← −ε ŝ/(δ + √r̂),  θ ← θ + Δθ
end while
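A sketch of the Adam update rule (my own illustration) with the suggested defaults:

import numpy as np

theta = np.array([1.0, 1.0])
s = np.zeros_like(theta)                          # first-moment estimate
r = np.zeros_like(theta)                          # second-moment estimate
eps, delta, rho1, rho2 = 0.001, 1e-8, 0.9, 0.999

def grad(theta):                                  # placeholder gradient of 0.5 * sum(theta**2)
    return theta

for t in range(1, 5001):
    g = grad(theta)
    s = rho1 * s + (1.0 - rho1) * g               # first moment
    r = rho2 * r + (1.0 - rho2) * g * g           # second moment
    s_hat = s / (1.0 - rho1 ** t)                 # bias corrections
    r_hat = r / (1.0 - rho2 ** t)
    theta = theta + (-eps * s_hat / (delta + np.sqrt(r_hat)))
print(theta)                                      # approaches the minimum at 0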
Batch Normalization
Applied layer-wise
To maintain zero mean and unit variance for activation outputs
Batch normalization reduces the amount by which the hidden unit values shift around (internal covariate shift)
Allows use of larger learning rates in deep models without causing instability
Applied to any input or hidden layer
H′ = (H − μ)/σ (row-wise operation, one sample is one row)
where the matrix H is a minibatch of activations of the layer to normalize
μ is the vector of means of each unit, σ the vector of standard deviations of each unit (broadcast operation)
Batch Normalization
At training time,
μ = (1/m) Σi Hi,:   (1)
σ = sqrt(δ + (1/m) Σi (H − μ)i²)   (2)
δ = 10⁻⁸ for numerical stability
Crucially, back-propagating through these operations discourages updates that act simply by proposing large changes to μ or σ
In practice, μ and σ are stored as running averages so we do not have to compute (1) and (2) all the time (e.g. at test time); see the sketch below
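A minimal sketch of the training-time forward pass with running averages (my own illustration; the momentum value for the running averages is an assumption):

import numpy as np

def batch_norm_train(H, running_mu, running_sigma, momentum=0.99, delta=1e-8):
    # H holds a minibatch of activations, one sample per row
    mu = H.mean(axis=0)                                      # (1) per-unit mean
    sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))    # (2) per-unit std
    H_norm = (H - mu) / sigma                                # row-wise normalization
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_sigma = momentum * running_sigma + (1 - momentum) * sigma
    return H_norm, running_mu, running_sigma

H = np.random.randn(32, 8) * 3.0 + 5.0
Hn, mu, sigma = batch_norm_train(H, np.zeros(8), np.ones(8))
print(Hn.mean(axis=0).round(3), Hn.std(axis=0).round(3))     # about 0 and 1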
Reference
Deep Learning, Ian Goodfellow and Yoshua Bengio and Aaron Courville, MIT Press, 2016, http://www.deeplearningbook.org
End