Introduction to optimization
Walid Krichene
Nassma Summer School - 06/24/2019
Learning from examples
[Diagram: the learning pipeline.
Data collection: an unknown process y = g(x) generates training data (x, y).
Training/learning: the model ŷ = h(x) produces a prediction ŷ for each input x, and the error y − ŷ is used to adjust the model.
Inference/serving: the trained model h* maps new inputs x to predictions ŷ.]
Linear regression
Model: ŷ = w·x + b (a linear function of the input, with weight w and bias b)
We “train” the model so that it fits the observed data
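As a concrete illustration (not from the slides), here is a minimal sketch of fitting the linear model ŷ = w·x + b to observed data by least squares; the synthetic data and variable names are assumptions made for this example.

```python
import numpy as np

# Synthetic observations from an unknown process y = g(x), plus noise (assumed for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 0.5 + 0.1 * rng.normal(size=100)

# Fit the linear model y_hat = w * x + b by ordinary least squares.
X = np.stack([x, np.ones_like(x)], axis=1)   # design matrix with a bias column
w, b = np.linalg.lstsq(X, y, rcond=None)[0]

y_hat = w * x + b
print("w =", w, "b =", b, "training MSE =", np.mean((y - y_hat) ** 2))
```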
Empirical Risk Minimization
Empirical Risk
Population Risk
Notation
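The formulas on this slide did not survive extraction; the standard definitions, for a loss ℓ, hypothesis h, training set (x_1, y_1), …, (x_n, y_n), and data distribution D, are:

```latex
\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(x_i), y_i\big)
\quad \text{(empirical risk: average loss on the training set)}

R(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\ell\big(h(x), y\big)\big]
\quad \text{(population risk: expected loss on new data)}
```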
Gradient Descent
Stochastic Gradient Descent
Optimization for Neural Networks
Other techniques / heuristics
Gradient Descent
Taylor Expansion
A function can be approximated (locally) by its Taylor expansion
Taylor Expansion: higher dimensions
A function of several variables can be approximated (locally) by its Taylor expansion, which now involves the gradient and the Hessian
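The expansions themselves are missing from the extracted text; the standard first- and second-order statements (notation assumed) are:

```latex
% One dimension:
f(x + \delta) \approx f(x) + f'(x)\,\delta + \tfrac{1}{2} f''(x)\,\delta^2

% Higher dimensions, with gradient \nabla f and Hessian \nabla^2 f:
f(x + \delta) \approx f(x) + \nabla f(x)^\top \delta + \tfrac{1}{2}\,\delta^\top \nabla^2 f(x)\,\delta
```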
Gradient descent
Gradient descent: minimize the 1st-order approximation in a neighborhood of the current point, which gives the update x_{t+1} = x_t − η ∇f(x_t)
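A minimal sketch of this update rule in code, applied to a simple quadratic objective; the objective, step size, and number of iterations are assumptions made for illustration.

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size=0.1, num_steps=100):
    """Iterate x_{t+1} = x_t - step_size * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - step_size * grad_f(x)
    return x

# Example: f(x) = 0.5 * ||A x - b||^2, whose gradient is A^T (A x - b).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A.T @ (A @ x - b)

x_hat = gradient_descent(grad_f, x0=[0.0, 0.0], step_size=0.05, num_steps=500)
print(x_hat, np.linalg.solve(A, b))  # the iterate should be close to the exact minimizer
```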
Augustin L. Cauchy
1789 - 1857
Convex functions
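The definition is not in the extracted text; the standard one (for f defined on a convex domain) is:

```latex
f\big(\lambda x + (1 - \lambda) y\big) \le \lambda f(x) + (1 - \lambda) f(y)
\quad \text{for all } x, y \text{ and } \lambda \in [0, 1];

\text{equivalently, if } f \text{ is differentiable:}\quad
f(y) \ge f(x) + \nabla f(x)^\top (y - x).
```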
Convergence guarantees
For smooth convex functions (Lipschitz gradient), if the step size satisfies η ≤ 1/L, then f(x_t) − f(x*) ≤ ‖x_0 − x*‖² / (2ηt), i.e. the objective converges at rate O(1/t).
For strongly convex functions with condition number κ and step size η = 1/L, convergence is linear: ‖x_t − x*‖² ≤ (1 − 1/κ)^t ‖x_0 − x*‖².
Convergence guarantees
Another Interpretation of Gradient Descent
Gradient descent can be viewed as a discretization of gradient flow, the ODE ẋ(t) = −∇f(x(t)): a forward-Euler step of size η recovers the update x_{t+1} = x_t − η ∇f(x_t). This viewpoint can help in the design and analysis of algorithms.
Example: energy function analysis
Define an energy function E(t) along the trajectory x(t)
Take its time derivative and show that it is non-increasing
Then deduce convergence (and a rate) for the flow
Gradient flow
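The equations of this example did not survive extraction. One common version of the argument, with an assumed energy function E (the slide may use a different one), runs as follows for the gradient flow ẋ(t) = −∇f(x(t)) and a convex f:

```latex
E(t) = t\,\big(f(x(t)) - f(x^\star)\big) + \tfrac{1}{2}\,\|x(t) - x^\star\|^2

\dot{E}(t) = f(x(t)) - f(x^\star)
  + t\,\langle \nabla f(x(t)),\, \dot{x}(t) \rangle
  + \langle x(t) - x^\star,\, \dot{x}(t) \rangle

\phantom{\dot{E}(t)} = f(x(t)) - f(x^\star)
  - t\,\|\nabla f(x(t))\|^2
  - \langle \nabla f(x(t)),\, x(t) - x^\star \rangle \;\le\; 0
  \quad \text{(by convexity of } f\text{)}

\text{Hence } E(t) \le E(0), \text{ which gives }
f(x(t)) - f(x^\star) \le \frac{\|x(0) - x^\star\|^2}{2t}.
```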
Stochastic Gradient Descent
Gradient estimation
Stochastic optimization
Robbins and Monro
Assume f is convex, the gradient estimates are unbiased with bounded variance, and the step sizes satisfy Σ_t η_t = ∞ and Σ_t η_t² < ∞. Then x_t converges to a minimizer of f (almost surely).
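A minimal sketch of SGD with mini-batch gradient estimates and decaying step sizes η_t = η_0 / (1 + t), which satisfy the two Robbins-Monro conditions above; the least-squares objective and all constants are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (assumed for illustration).
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def sgd(X, y, batch_size=32, eta0=0.5, num_steps=2000):
    w = np.zeros(X.shape[1])
    for t in range(num_steps):
        # Unbiased estimate of the full gradient, computed on a random mini-batch.
        idx = rng.integers(0, len(y), size=batch_size)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        eta_t = eta0 / (1.0 + t)   # decaying step sizes (Robbins-Monro conditions)
        w = w - eta_t * grad
    return w

w_hat = sgd(X, y)
print("distance to the true weights:", np.linalg.norm(w_hat - w_true))
```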
Herbert Robbins
1915-2001
Paul Langevin
1872-1946
Stochastic optimization
Stochastic optimization with different levels of noise
Variance reduction
How to reduce the variance of the gradient estimate?
Increase the batch size: averaging B independent per-example gradients divides the variance of the estimate by B (see the sketch below)
Variance reduction methods
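A small sketch of the batch-size effect: averaging B independent per-example gradients divides the variance of the estimate by roughly B. The data, model, and batch sizes below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem and a fixed parameter vector (assumed for illustration).
n, d = 10_000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
w = np.zeros(d)

def minibatch_gradient(batch_size):
    idx = rng.integers(0, n, size=batch_size)
    return X[idx].T @ (X[idx] @ w - y[idx]) / batch_size

for batch_size in [1, 10, 100, 1000]:
    grads = np.stack([minibatch_gradient(batch_size) for _ in range(500)])
    # Total empirical variance of the estimate: decreases roughly as 1 / batch_size.
    print(batch_size, grads.var(axis=0).sum())
```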
Optimization for Neural Networks
Optimizing the objective function
Gradient descent
Stationary point: ∇f(x) = 0
Back-Propagation and automatic differentiation
Computing gradients of the loss as a function of the model parameters, by applying the chain rule layer by layer
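A minimal sketch of back-propagation through a one-hidden-layer network: the forward pass stores the intermediate activations, and the backward pass applies the chain rule layer by layer. The architecture, loss, and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer network: y_hat = W2 @ tanh(W1 @ x), squared loss.
x = rng.normal(size=3)
y = rng.normal(size=1)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))

# Forward pass (store the intermediate activations).
z1 = W1 @ x          # pre-activation of the hidden layer
h1 = np.tanh(z1)     # hidden activations
y_hat = W2 @ h1
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: propagate dloss/d(output) back through each layer.
d_yhat = y_hat - y                      # dloss/dy_hat
dW2 = np.outer(d_yhat, h1)              # dloss/dW2
d_h1 = W2.T @ d_yhat                    # dloss/dh1
d_z1 = d_h1 * (1.0 - np.tanh(z1) ** 2)  # dloss/dz1 (tanh derivative)
dW1 = np.outer(d_z1, x)                 # dloss/dW1

# Finite-difference check on one entry of W1 (sanity check of the gradient).
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
loss_p = 0.5 * np.sum((W2 @ np.tanh(W1p @ x) - y) ** 2)
print(dW1[0, 0], (loss_p - loss) / eps)   # the two numbers should match closely
```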
Neural network optimization
Illustration on TensorFlow Playground
Other techniques / heuristics
Regularization
The norm of the weights can be a proxy for the "complexity" of the model. Intuition: large weights increase sensitivity to small changes of the inputs and lead to a more complex decision boundary
Penalize a large norm of the weights: minimize R̂(w) + λ‖w‖², where λ‖w‖² is the regularization term
Simple to implement: the gradient becomes ∇R̂(w) + 2λw
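A minimal sketch of one update step with this penalty (often called weight decay); the toy loss, λ, and step size are assumptions made for illustration.

```python
import numpy as np

def regularized_step(w, grad_risk, step_size=0.1, lam=0.1):
    """One gradient step on  R_hat(w) + lam * ||w||^2."""
    grad = grad_risk(w) + 2.0 * lam * w   # gradient of the regularization term is 2*lam*w
    return w - step_size * grad

# Toy quadratic empirical risk 0.5 * ||w - [3, -2]||^2 (assumed for illustration).
grad_risk = lambda w: w - np.array([3.0, -2.0])

w = np.zeros(2)
for _ in range(200):
    w = regularized_step(w, grad_risk)
print(w)  # shrunk toward zero relative to the unregularized minimizer [3, -2]
```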
Regularization
Effect of regularization (after 1000 epochs)
Batch normalization
For each layer with input x, define x̂ = (x − μ) / √(σ² + ε), where μ and σ² are the mean and variance of x over the mini-batch, and output γ x̂ + β with learned parameters γ, β
Typically allows a higher learning rate. Serving: freeze the normalization parameters at the end of training (e.g. use a weighted average of the batch statistics, or recompute them on the full training set)
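A minimal sketch of this normalization with a running (weighted) average of the batch statistics that is frozen and reused at serving time; the momentum and ε values are assumptions made for illustration.

```python
import numpy as np

class BatchNorm:
    """Normalize each feature to zero mean / unit variance over the mini-batch,
    then apply a learned scale (gamma) and shift (beta)."""

    def __init__(self, num_features, eps=1e-5, momentum=0.9):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.eps = eps
        self.momentum = momentum
        # Running statistics, used (frozen) at serving time.
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def __call__(self, x, training=True):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Weighted (exponential moving) average of the batch statistics.
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm(4)
batch = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(32, 4))
out = bn(batch, training=True)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```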
Dropout
With probability p, ignore the output of a neuron (and scale the output of the layer accordingly)
Intuition: the network becomes more robust, since it cannot rely on the output of any particular subset of neurons
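A minimal sketch of (inverted) dropout, where surviving activations are rescaled by 1/(1 − p) during training so that no extra scaling is needed at serving time; the value of p is an assumption made for illustration.

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=np.random.default_rng()):
    """Zero each activation with probability p; rescale the survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return activations                      # at serving time, use the layer as-is
    mask = rng.random(activations.shape) >= p   # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)

h = np.ones(10)
print(dropout(h, p=0.5, training=True))    # roughly half zeros, survivors scaled to 2.0
print(dropout(h, p=0.5, training=False))   # unchanged at serving time
```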
Hyperparameter tuning
Hyperparameter tuning
Hyperparameters: learning rate (step size), batch size, regularization coefficient λ, dropout probability p, network architecture (number of layers, layer sizes), ...
Hyperparameter tuning
Data split: Training / Validation / Test. Fit the model parameters on the training set, choose hyperparameters on the validation set, and report final performance on the test set.
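A minimal sketch of the tuning loop: sample hyperparameter settings, train on the training set, keep the setting with the best validation score, and touch the test set only once at the end. The search space and the train_and_evaluate helper are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_and_evaluate(learning_rate, batch_size, l2, train_set, val_set):
    """Hypothetical stand-in: train a model with these hyperparameters on train_set
    and return its score on val_set. Replace with real training code."""
    return rng.random()  # placeholder score so the sketch runs end-to-end

def random_search(train_set, val_set, num_trials=20):
    best_score, best_params = -np.inf, None
    for _ in range(num_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),       # log-uniform step size
            "batch_size": int(rng.choice([32, 64, 128, 256])),
            "l2": 10 ** rng.uniform(-6, -2),                   # regularization coefficient
        }
        score = train_and_evaluate(**params, train_set=train_set, val_set=val_set)
        if score > best_score:
            best_score, best_params = score, params
    # Evaluate the chosen hyperparameters on the test set only once, at the very end.
    return best_params, best_score

print(random_search(train_set=None, val_set=None))
```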
Summary
Stochastic gradient descent is at the core of machine learning
Guarantees for some classes of functions
Hyperparameter tuning is important
Many heuristics to accelerate training
Advanced techniques (some covered in the next lecture): constraints, second-order methods, momentum, variance reduction, etc.
Thank you