1 of 33

Introduction to optimization

Walid Krichene
Nassma Summer School - 06/24/2019

2 of 33

Learning from examples

[Diagram] An unknown process y = g(x) generates examples. Data collection produces the training data (x, y). Training/learning adjusts the model ŷ = h(x) to reduce the error y - ŷ on the training data. The trained model h* is then used for inference/serving: given a new input x, it outputs a prediction ŷ.

3 of 33

Linear regression

Model: ŷ = w·x + b, a linear function of the input x with parameters (w, b).

We "train" the model so that it fits the observed data, e.g. by choosing (w, b) to minimize the squared error Σᵢ (yᵢ - ŷᵢ)².

4 of 33

Empirical Risk Minimization

Empirical Risk: R̂(h) = (1/n) Σᵢ ℓ(h(xᵢ), yᵢ), the average loss over the n training examples.

Population Risk: R(h) = E(x,y)[ℓ(h(x), y)], the expected loss over the data distribution.

Notation: ℓ is a loss function measuring the discrepancy between the prediction h(x) and the label y.

5 of 33

Gradient Descent

Stochastic Gradient Descent

Optimization for Neural Networks

Other techniques / heuristics

6 of 33

Gradient Descent

7 of 33

Taylor Expansion

A function can be approximated (locally) by its Taylor expansion

8 of 33

Taylor Expansion: higher dimensions

A function f: ℝⁿ → ℝ can be approximated near a point x by f(x + d) ≈ f(x) + ∇f(x)ᵀd + ½ dᵀ∇²f(x) d, where ∇f(x) is the gradient and ∇²f(x) the Hessian.
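As a quick numerical illustration (not from the slides), the sketch below compares the first-order approximation to the true value for a small displacement; the test function f is an arbitrary assumption.

```python
import numpy as np

def f(x):
    # Arbitrary smooth test function: f(x) = ||x||^2 + sin(x_0)
    return np.dot(x, x) + np.sin(x[0])

def grad_f(x):
    g = 2.0 * x
    g[0] += np.cos(x[0])
    return g

x = np.array([1.0, -0.5])
d = np.array([1e-2, 2e-2])          # small displacement

exact = f(x + d)
first_order = f(x) + grad_f(x) @ d  # f(x) + <grad f(x), d>
print(exact, first_order)           # close for small ||d||; error is O(||d||^2)
```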

9 of 33

Gradient descent

Gradient descent: minimize the 1st-order approximation in a neighborhood of the current point. This gives the update x_{k+1} = x_k - η ∇f(x_k), where η > 0 is the step size.

Augustin L. Cauchy

1789 - 1857
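A minimal sketch of this update rule, assuming a simple quadratic objective as the example to minimize:

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size=0.1, num_steps=100):
    """Iterate x_{k+1} = x_k - step_size * grad_f(x_k)."""
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        x = x - step_size * grad_f(x)
    return x

# Example: minimize f(x) = ||x - b||^2, whose gradient is 2 (x - b).
b = np.array([3.0, -1.0])
x_star = gradient_descent(lambda x: 2.0 * (x - b), x0=[0.0, 0.0])
print(x_star)  # converges to b
```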

10 of 33

Convex functions

  • Important class of functions for which we have guarantees.
  • First-order approx. is a lower bound
  • Stationary points are global minima
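In symbols (standard facts, stated here for reference), the last two bullets correspond to:

```latex
% First-order characterization of convexity: the first-order (Taylor)
% approximation at any point x is a global lower bound on f.
f(y) \;\ge\; f(x) + \nabla f(x)^\top (y - x) \qquad \text{for all } x, y.

% Consequence: if x^\star is a stationary point, i.e. \nabla f(x^\star) = 0,
% the bound gives f(y) \ge f(x^\star) for all y, so x^\star is a global minimum.
```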

11 of 33

Convergence guarantees

For smooth convex functions (Lipschitz gradient with constant L), if the step size is η = 1/L, then f(x_k) - f(x*) ≤ L ‖x_0 - x*‖² / (2k), i.e. the error decreases at the rate O(1/k).

For strongly convex functions with condition number κ and the same step size, convergence is linear: f(x_k) - f(x*) ≤ (1 - 1/κ)ᵏ (f(x_0) - f(x*)).

12 of 33

Convergence guarantees

13 of 33

Another Interpretation of Gradient Descent

Gradient descent can be viewed as a discretization of the gradient flow ẋ(t) = -∇f(x(t)): a forward-Euler step of size η recovers the update x_{k+1} = x_k - η ∇f(x_k). This continuous-time view can help in the design and analysis of algorithms.

14 of 33

Example: energy function analysis

Define the energy function E(t) = f(x(t)) - f(x*).

Take the time derivative: dE/dt = ∇f(x(t))·ẋ(t).

Then, along the gradient flow ẋ(t) = -∇f(x(t)):

dE/dt = -‖∇f(x(t))‖² ≤ 0, so the objective value is non-increasing along the trajectory.

15 of 33

Stochastic Gradient Descent

16 of 33

Gradient estimation

  • Computing the exact gradient can be expensive when the number of examples is large.
  • Approximation: compute the sum on a batch of examples; this gives an unbiased estimator of the gradient (see the sketch below).
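A minimal sketch of mini-batch SGD, assuming a linear model with squared-error loss as the example problem:

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, step_size=0.01, num_steps=1000, rng=None):
    """SGD on the squared error of a linear model y ~ X @ w.

    At each step the gradient is estimated on a random batch of examples,
    which gives an unbiased estimate of the full gradient.
    """
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        residual = Xb @ w - yb
        grad = 2.0 * Xb.T @ residual / batch_size  # batch estimate of the gradient
        w -= step_size * grad
    return w
```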

17 of 33

Stochastic optimization

  • Theory developed in the 40s and 50s: Langevin equation and Stochastic Gradient Descent

Robbins and Monro

Assume f is convex, the stochastic gradients are unbiased with bounded variance, and the step sizes satisfy Σₜ ηₜ = ∞ and Σₜ ηₜ² < ∞. Then the iterates converge to a minimizer.

Herbert Robbins

1915-2001

Paul Langevin

1872-1946

18 of 33

Stochastic optimization

Stochastic optimization with different levels of noise

19 of 33

Variance reduction

How to reduce the variance of the gradient estimate?

Increase batch size

Variance reduction methods

  • "Cache" the contribution of each example to the gradient (SAG/SAGA)
  • Periodically recompute the full gradient (SVRG)
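A minimal sketch of the SVRG idea, again assuming a linear least-squares problem for illustration:

```python
import numpy as np

def svrg(X, y, step_size=0.01, num_epochs=10, inner_steps=None, rng=None):
    """SVRG on the squared error of a linear model y ~ X @ w."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    inner_steps = inner_steps or n
    w = np.zeros(d)

    def grad_i(w, i):                                  # gradient of example i
        return 2.0 * (X[i] @ w - y[i]) * X[i]

    for _ in range(num_epochs):
        w_snap = w.copy()
        full_grad = 2.0 * X.T @ (X @ w_snap - y) / n   # full gradient at the snapshot
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Variance-reduced estimate: unbiased, and its variance vanishes
            # as w and w_snap both approach the optimum.
            g = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w -= step_size * g
    return w
```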

20 of 33

Optimization for Neural Networks

21 of 33

Optimizing the objective function

Gradient descent converges to a stationary point, i.e. a point where ∇f(x) = 0.

A stationary point can be a:

  • maximum
  • saddle point
  • local minimum
  • global minimum

22 of 33

Back-Propagation

and automatic differentiation

Computing gradients

  • Forward pass: compute the output of the network, layer by layer.
  • Backward pass: compute the gradients, layer by layer in reverse order.
    Recursive formula (chain rule): the gradient with respect to one layer's activations is written as a function of the gradient with respect to the next layer's activations.
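A minimal sketch of the two passes for a small two-layer network with squared-error loss; the architecture, activation, and loss are illustrative assumptions:

```python
import numpy as np

def forward_backward(x, y, W1, W2):
    """Forward pass, then backward pass applying the chain rule layer by layer."""
    # Forward pass: compute the output.
    z1 = W1 @ x                 # pre-activation of layer 1
    a1 = np.tanh(z1)            # activation of layer 1
    y_hat = W2 @ a1             # network output
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass: propagate gradients from the output back to the weights.
    d_yhat = y_hat - y                      # dL/dy_hat
    dW2 = np.outer(d_yhat, a1)              # dL/dW2
    d_a1 = W2.T @ d_yhat                    # dL/da1, expressed via dL/dy_hat
    d_z1 = d_a1 * (1.0 - np.tanh(z1) ** 2)  # chain rule through tanh
    dW1 = np.outer(d_z1, x)                 # dL/dW1
    return loss, dW1, dW2
```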

23 of 33

Neural network optimization

Illustration on TensorFlow Playground

24 of 33

Other techniques / heuristics

25 of 33

Regularization

The norm of the weights can be a proxy for the "complexity" of the model. Intuition: large weights mean increased sensitivity to small changes of the inputs and a more complex decision boundary.

Penalize a large norm of the weights by minimizing L(w) + (λ/2) ‖w‖², where (λ/2) ‖w‖² is the regularization term.

Simple to implement: the gradient becomes ∇L(w) + λ w.
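A minimal sketch of the resulting update (the loss gradient and the coefficient λ, here called weight_decay, are placeholders):

```python
import numpy as np

def regularized_step(w, grad_loss, step_size=0.01, weight_decay=1e-3):
    """One gradient step on L(w) + (lambda/2) ||w||^2.

    The gradient of the regularized objective is grad L(w) + lambda * w,
    so the extra term simply shrinks the weights toward zero ("weight decay").
    """
    return w - step_size * (grad_loss(w) + weight_decay * w)
```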

26 of 33

Regularization

Effect of regularization (after 1000 epochs)

27 of 33

Batch normalization

For each layer with input x, compute the batch mean μ and variance σ², and define the normalized input x̂ = (x - μ) / √(σ² + ε); the layer then outputs γ x̂ + β with learned parameters γ, β.

Typically one can use a higher learning rate.
Serving: freeze the normalization statistics at the end of training (e.g. use a weighted/running average, or recompute them on the full training set).
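A minimal sketch of the training-time batch-normalization transform; the (batch, features) input shape is an assumption:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize a batch of layer inputs x with shape (batch, features),
    then rescale and shift with learned parameters gamma and beta."""
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalized input
    return gamma * x_hat + beta
```

At serving time, the frozen statistics would be used in place of `mean` and `var`.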

28 of 33

Dropout

With probability p, ignore the output of a neuron (and rescale the output of the layer so its expectation is unchanged).

Intuition: the network cannot rely too heavily on the output of any single subset of neurons, which makes it more robust.
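A minimal sketch of (inverted) dropout at training time, with drop probability p as above:

```python
import numpy as np

def dropout_train(a, p, rng=None):
    """Zero each activation with probability p, and rescale the survivors
    by 1 / (1 - p) so the expected output is unchanged (inverted dropout)."""
    rng = rng or np.random.default_rng()
    mask = rng.random(a.shape) >= p   # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)
```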

29 of 33

Hyperparameter tuning

30 of 33

Hyperparameter tuning

Hyperparameters:

  • Learning rate
  • Batch size (also sampling distribution)
  • Dropout rate
  • Regularization coefficient
  • Initialization
  • Architecture (number of layers, width of layers)

31 of 33

Hyperparameter tuning

  • Difference with other parameters: hard to compute gradients; no automatic differentiation.
  • Can use cross-validation (split the data into training, validation, and test sets).
  • Tools for automated hyper-parameter tuning (Bayesian optimization), e.g. Vizier in TensorFlow.

[Diagram] Data split: Training | Validation | Test
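A minimal sketch of hyperparameter selection by random search on a held-out validation set; the search ranges and the train_fn/eval_fn helpers are hypothetical placeholders:

```python
import numpy as np

def random_search(train_fn, eval_fn, num_trials=20, rng=None):
    """Sample hyperparameters, train on the training set, score on the
    validation set, and keep the best configuration."""
    rng = rng or np.random.default_rng(0)
    best_score, best_params = -np.inf, None
    for _ in range(num_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform sampling
            "batch_size": int(rng.choice([32, 64, 128, 256])),
            "dropout_rate": rng.uniform(0.0, 0.5),
            "l2_coefficient": 10 ** rng.uniform(-6, -2),
        }
        model = train_fn(**params)   # hypothetical: fit on the training set
        score = eval_fn(model)       # hypothetical: score on the validation set
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```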

32 of 33

Summary

Stochastic gradient descent is at the core of machine learning

Guarantees for some classes of functions

Hyperparameter tuning is important

Many heuristics to accelerate training

Advanced techniques (some covered in the next lecture): constraints, second-order methods, momentum, variance reduction, etc.

33 of 33

Thank you