10-405/605 - Recitation 9
Credits: Daniel, Keshav Narayan, Siruo Zou, Jayesh Singla, Li Chen
Today’s Recitation
SGD Recap
Stochastic Gradient Descent
for i in range(n):
    w = w - lr * grad(loss, data[i], w)     # one gradient on a single (shuffled) example per update
Gradient Descent
for i in range(n):
    w = w - lr * grad(loss, all_data, w)    # one gradient over the full dataset per update
SGD Recap
Stochastic Gradient Descent
Gradient Descent
In most cases, SGD can reach a low-accuracy target (an approximate minimizer) much faster than full-batch gradient descent.
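To make the trade-off concrete, here is a minimal NumPy sketch (not from the slides; the toy least-squares problem and all names are illustrative) comparing the two update rules:

import numpy as np

# Toy least-squares problem (made up for illustration): minimize the average of (x_i @ w - y_i)^2
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Gradient Descent: 100 steps, each touching all n examples
w_gd = np.zeros(d)
for _ in range(100):
    grad = 2 * X.T @ (X @ w_gd - y) / n           # full-dataset gradient (n example-gradients)
    w_gd -= 0.1 * grad

# Stochastic Gradient Descent: 2000 steps, each touching a single example
w_sgd = np.zeros(d)
for _ in range(2000):
    i = rng.integers(n)                           # sample one example
    g_i = 2 * X[i] * (X[i] @ w_sgd - y[i])        # unbiased estimate of the full gradient
    w_sgd -= 0.01 * g_i                           # smaller step, since single-example gradients are noisier

# GD used 100 * n = 100,000 example-gradients; SGD used only 2,000,
# yet both land close to w_true (SGD a bit less accurately).
print("GD  error:", np.linalg.norm(w_gd - w_true))
print("SGD error:", np.linalg.norm(w_sgd - w_true))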
SGD Recap
Credit: Yuanzhi Li, 10725 Convex Optimization
E_i[ ||∇f_i(w) - ∇f(w)||^2 ] is known as the variance (of the stochastic gradient).
[Figure: stochastic gradient directions with high variance ("Bad") vs. low variance ("Good")]
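A quick NumPy check of that quantity (an illustrative sketch, not slide code; the toy data below are made up): the per-example gradients scatter around the full gradient, and their average squared distance from it is the variance.

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)

per_example = 2 * X * (X @ w - y)[:, None]    # row i is the gradient on example i
full_grad = per_example.mean(axis=0)          # gradient of the average loss

# E_i || grad_i - full_grad ||^2 : the variance of the stochastic gradient
variance = np.mean(np.sum((per_example - full_grad) ** 2, axis=1))
print("stochastic-gradient variance:", variance)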
SGD Recap
Mini Batch Gradient Descent
for b in batches:
    w = w - lr * grad(loss, b, w)    # gradient averaged over the examples in mini-batch b
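A fuller sketch of one mini-batch epoch, assuming the same toy least-squares setup as above (the function and variable names are illustrative):

import numpy as np

def minibatch_sgd_epoch(w, X, y, lr=0.05, batch_size=32, rng=None):
    # One epoch of mini-batch SGD for least squares (illustrative sketch).
    if rng is None:
        rng = np.random.default_rng()
    perm = rng.permutation(len(X))                 # shuffle the examples once per epoch
    for start in range(0, len(X), batch_size):
        b = perm[start:start + batch_size]         # indices of the current mini-batch
        Xb, yb = X[b], y[b]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(b)   # gradient averaged over the batch
        w = w - lr * grad
    return w

Averaging the gradient over batch_size examples lowers its variance relative to single-example SGD, at a modest extra cost per step.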
SGD with Momentum
“Noisy” derivatives for SGD: each step estimates the gradient from only a small batch, which might not be the best descent direction.
Momentum:
Keep a “moving” (exponentially weighted) average of the past gradients and step along it; the average adapts as new data arrive and smooths out the noise.
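A minimal sketch of one momentum update (the function name and defaults are illustrative; the exponentially weighted average follows the description above):

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    # v is an exponential moving average of past gradients, so a single noisy
    # mini-batch gradient cannot swing the update direction very far.
    v = beta * v + (1 - beta) * grad   # update the moving average
    w = w - lr * v                     # step along the smoothed direction
    return w, v

Some implementations instead accumulate v = beta * v + grad and fold the (1 - beta) factor into the learning rate; the smoothing effect is the same up to a rescaling of lr.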
ADAM - The “default” optimizer today
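For reference, one step of the Adam update from Kingma & Ba (2015); the function name and the standard default hyperparameters shown here are for illustration:

import numpy as np

def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update; t is the 1-based step count.
    m = beta1 * m + (1 - beta1) * grad            # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-coordinate adaptive step size
    return w, m, v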
Learning Rate Tuning
Learning Rate Decay: Step
[Figure: training loss curve; "Reduce learning rate" marks the points where the LR is dropped.]
Step: Reduce learning rate at a few points. E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.
Slide Credit: Justin Johnson
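A sketch of the step schedule described above (the function name and defaults are illustrative; the milestones match the ResNet example):

def step_decay(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    # Multiply the learning rate by gamma at every milestone epoch.
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# e.g. LR is 0.1 for epochs 0-29, 0.01 for 30-59, 0.001 for 60-89, 0.0001 afterwards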
Learning Rate Decay: Cosine
Step: Reduce learning rate at a few fixed points. E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.
Cosine: α_t = (α_0 / 2) (1 + cos(π t / T)); decay the learning rate smoothly from α_0 at epoch 0 to 0 at epoch T.
Loshchilov and Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts”, ICLR 2017
Radford et al, “Improving Language Understanding by Generative Pre-Training”, 2018
Feichtenhofer et al, “SlowFast Networks for Video Recognition”, ICCV 2019
Radosavovic et al, “On Network Design Spaces for Visual Recognition”, ICCV 2019
Child et al, “Generating Long Sequences with Sparse Transformers”, arXiv 2019
Slide Credit: Justin Johnson
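A sketch of the cosine schedule (illustrative names; the formula is the SGDR-style decay, without restarts):

import math

def cosine_decay(t, T, base_lr=0.1):
    # alpha_t = 0.5 * alpha_0 * (1 + cos(pi * t / T)): smooth decay from base_lr to 0.
    return 0.5 * base_lr * (1 + math.cos(math.pi * t / T))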
Learning Rate Decay: Linear
Cosine: α_t = (α_0 / 2) (1 + cos(π t / T))
Linear: α_t = α_0 (1 - t / T)
Devlin et al, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2018
Liu et al, “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, 2019
Yang et al, “XLNet: Generalized Autoregressive Pretraining for Language Understanding”, NeurIPS 2019
Slide Credit: Justin Johnson
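A sketch of the linear schedule (illustrative names; BERT-style training decays the learning rate roughly this way):

def linear_decay(t, T, base_lr=1e-4):
    # alpha_t = alpha_0 * (1 - t / T): decay linearly from base_lr to 0 at step T.
    return base_lr * (1 - t / T)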
Learning Rate Warmup
Vaswani et al, “Attention Is All You Need”, NeurIPS 2017
Warmup: for the first k steps, gradually increase the learning rate, then decay it over the remaining steps.
Motivation: large-scale transformers have high variance early in training; warmup helps stabilize training.
Slide Credit: Xinyue Chen
Scheduler in “Attention is All You Need.”
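A sketch of that scheduler, following the formula in “Attention Is All You Need” (d_model = 512 and 4000 warmup steps are the paper's values; the function name is illustrative):

def transformer_lr(step, d_model=512, warmup_steps=4000):
    # LR grows linearly for the first warmup_steps steps, then decays
    # proportionally to 1 / sqrt(step).
    step = max(step, 1)   # avoid 0 ** -0.5 at the very first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)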
Slide Credit: Justin Johnson
[Figure: training vs. validation accuracy curves; the train/val gap is increasing.]
Overfitting is not always a bad thing
Choosing Hyperparameters
Step 1: Check initial loss. Turn off weight decay and sanity-check the loss at initialization (e.g., about log(C) for a softmax over C classes).
Step 2: Overfit a small sample. Try to reach ~100% training accuracy on a small subset of the data (a few mini-batches), adjusting the architecture, learning rate, and initialization as needed.
Step 3: Find LR that makes loss go down
Use the architecture from the previous step, use all training data, turn on a small weight decay, and find a learning rate that makes the loss drop significantly within ~100 iterations.
Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4
Slide Credit: Justin Johnson
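A sketch of that sweep (train_step is a hypothetical placeholder for one update of whatever model you are tuning; nothing here comes from the slides):

def try_learning_rates(train_step, init_w, lrs=(1e-1, 1e-2, 1e-3, 1e-4), iters=100):
    # For each candidate LR, run ~100 training iterations and record the final loss.
    # train_step(w, lr) is a placeholder: it should do one update and return (w, loss).
    results = {}
    for lr in lrs:
        w = init_w
        loss = float("inf")
        for _ in range(iters):
            w, loss = train_step(w, lr)
        results[lr] = loss
    return results   # pick the largest LR whose loss clearly drops without blowing up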
Choosing Hyperparameters - A principled approach
- Model p(y|λ): fit a model of the validation metric y as a function of the hyperparameter configuration λ, and use it to decide which configurations to try next.
- Uses successive halving: give many configurations a small training budget, keep only the best fraction, and repeat with a larger budget (see the sketch below).
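A minimal sketch of successive halving (evaluate is a hypothetical placeholder that trains a configuration for a given budget and returns its validation loss):

def successive_halving(configs, evaluate, min_budget=1, eta=2):
    # Evaluate every config on a small budget, keep the best 1/eta fraction,
    # multiply the budget by eta, and repeat until one config remains.
    budget = min_budget
    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: evaluate(c, budget))  # best (lowest loss) first
        configs = scored[: max(1, len(configs) // eta)]              # drop the rest
        budget *= eta                                                # give survivors more budget
    return configs[0]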
Coding Example
Reference
Thank you