1 of 22

10-405/605 - Recitation 9

Credits: Daniel, Keshav Narayan, Siruo Zou

Jayesh Singla, Li Chen

2 of 22

Today’s Recitation

  • SGD Recap

  • Optimize SGD

  • Learning Rate Tuning

  • Coding Example

3 of 22

SGD Recap

Stochastic Gradient Descent: loop over the examples (for i in range(n):) and take one cheap, noisy update per example.

Gradient Descent: loop over all n examples (for i in range(n):) to accumulate the full gradient, then take a single update.
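A minimal sketch of the two updates, assuming a per-example gradient function grad_fn(w, x_i, y_i) (the name is illustrative, not from the slides):

    import numpy as np

    def gd_step(w, X, y, grad_fn, lr):
        # Gradient descent: average the gradient over ALL n examples, then take one step.
        g = np.mean([grad_fn(w, X[i], y[i]) for i in range(len(X))], axis=0)
        return w - lr * g

    def sgd_epoch(w, X, y, grad_fn, lr):
        # Stochastic gradient descent: one pass makes n cheap, noisy steps, one per example.
        for i in range(len(X)):
            w = w - lr * grad_fn(w, X[i], y[i])
        return w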

4 of 22

SGD Recap

Stochastic Gradient Descent

  • Computationally cheap for one step
  • High variance
  • More steps to converge

Gradient Descent

  • Computationally expensive for one step
  • Low variance
  • Fewer steps to converge

In most cases, SGD reaches a low-accuracy solution (a rough approximation of the minimizer) much faster than full gradient descent.

5 of 22

SGD Recap

Credit: Yuanzhi Li, 10725 Convex Optimization

$\mathbb{E}_i\big[\|\nabla f_i(w) - \nabla f(w)\|^2\big]$ is known as the variance (of the stochastic gradient estimate).

[Figure: SGD behavior with high-variance (“bad”) vs. low-variance (“good”) stochastic gradients]

6 of 22

SGD Recap

Mini-Batch Gradient Descent

Average the gradient over a small batch of examples per step (for b in batches:), trading off the cost of a full-batch step against the variance of a single-example step.
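A minimal mini-batch gradient descent sketch; the batch size, epoch count, and grad_fn name are illustrative assumptions, not from the slides:

    import numpy as np

    def minibatch_gd(w, X, y, grad_fn, lr=0.1, batch_size=32, epochs=10, seed=0):
        # grad_fn(w, X_batch, y_batch) is assumed to return the gradient averaged over the batch.
        rng = np.random.default_rng(seed)
        n = len(X)
        for _ in range(epochs):
            idx = rng.permutation(n)                      # reshuffle the data each epoch
            for start in range(0, n, batch_size):
                b = idx[start:start + batch_size]
                w = w - lr * grad_fn(w, X[b], y[b])       # one step per mini-batch
        return w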

7 of 22

SGD with Momentum

“Noisy” derivatives for SGD: each gradient is estimated from a small batch, so a single step may not point in the best descent direction.

Momentum: keep a “moving” (exponentially weighted) average of the gradient sequence and use it as the update direction; the average adapts as new data arrives.
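A minimal sketch of the momentum update under the moving-average convention (some implementations use v = beta * v + g instead; the function and variable names are illustrative):

    def sgd_momentum_step(w, v, g, lr=0.01, beta=0.9):
        # v: exponentially weighted moving average of past gradients g.
        v = beta * v + (1 - beta) * g
        w = w - lr * v
        return w, v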

8 of 22

ADAM - The “default” optimizer today
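A minimal sketch of the Adam update with its usual default hyperparameters (beta1=0.9, beta2=0.999, eps=1e-8); variable names are illustrative:

    import numpy as np

    def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # m, v: moving averages of the gradient and squared gradient; t: step count (starting at 1).
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)        # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v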

9 of 22

Learning Rate Tuning

10 of 22

Learning Rate Decay: Step

Reduce learning rate

Step: Reduce learning rate at a few points. E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.
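A small helper implementing this kind of step schedule; the milestones and decay factor follow the ResNet example above:

    def step_lr(base_lr, epoch, milestones=(30, 60, 90), gamma=0.1):
        # Multiply the base LR by gamma once for every milestone already reached.
        lr = base_lr
        for m in milestones:
            if epoch >= m:
                lr *= gamma
        return lr

    # step_lr(0.1, 10) -> 0.1, step_lr(0.1, 30) -> 0.01, step_lr(0.1, 90) -> 1e-4 (approximately)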

Slide Credit: Justin Johnson

11 of 22

Learning Rate Decay: Cosine


Cosine: decay the learning rate along half a cosine cycle, $\alpha_t = \tfrac{1}{2}\alpha_0\,(1 + \cos(t\pi/T))$, for step t out of T total steps.
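A one-line sketch of the cosine schedule above (base_lr is the initial rate alpha_0, T the total number of steps):

    import math

    def cosine_lr(base_lr, t, T):
        # Half-cosine decay from base_lr at t = 0 down to 0 at t = T.
        return 0.5 * base_lr * (1 + math.cos(math.pi * t / T))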

Loshchilov and Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts”, ICLR 2017
Radford et al., “Improving Language Understanding by Generative Pre-Training”, 2018
Feichtenhofer et al., “SlowFast Networks for Video Recognition”, ICCV 2019
Radosavovic et al., “On Network Design Spaces for Visual Recognition”, ICCV 2019
Child et al., “Generating Long Sequences with Sparse Transformers”, arXiv 2019

Slide Credit: Justin Johnson

12 of 22

Learning Rate Decay: Linear

Linear: decay the learning rate linearly to zero, $\alpha_t = \alpha_0\,(1 - t/T)$.

Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2018
Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, 2019
Yang et al., “XLNet: Generalized Autoregressive Pretraining for Language Understanding”, NeurIPS 2019

Slide Credit: Justin Johnson


13 of 22

Learning Rate Warmup

Warmup: for the first k steps, gradually increase the learning rate, then decay it over the remaining steps.

Motivation: large-scale transformers exhibit high variance early in training; warmup helps stabilize training.

Slide Credit: Xinyue Chen

Scheduler in “Attention is All You Need.”
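That schedule combines linear warmup with inverse-square-root decay; a minimal sketch using the paper’s base values for d_model and warmup_steps:

    def transformer_lr(step, d_model=512, warmup_steps=4000):
        # Linear warmup for the first warmup_steps, then decay proportional to 1/sqrt(step).
        step = max(step, 1)
        return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)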

14 of 22

Slide Credit: Justin Johnson

15 of 22

Slide Credit: Justin Johnson

16 of 22

Slide Credit: Justin Johnson

[Figure: training vs. validation curves; the train/val gap is increasing, a sign of overfitting]

17 of 22

Overfitting is not always a bad thing


  • Large-scale models are over-parameterized models
  • A high-capacity model that is able to overfit the training data may perform better than a low-capacity model that cannot
    • If you can overfit to the train data, your model is powerful enough to capture some patterns in the data
  • Once you can overfit, add regularization, dropout, etc. to control the overfitting
  • Early stopping based on the performance on validation set

18 of 22

Choosing Hyperparameters


Step 1: Check initial loss

Step 2: Overfit a small sample

Step 3: Find LR that makes loss go down

Use the architecture from the previous step, train on all the training data with a small weight decay turned on, and find a learning rate that makes the loss drop significantly within ~100 iterations.

Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4
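A minimal sketch of such a sweep; the noisy-quadratic objective below is a purely illustrative stand-in for ~100 real training iterations:

    import numpy as np

    def loss_after_k_steps(lr, k=100, dim=10, seed=0):
        # Run k SGD steps on a noisy quadratic 0.5 * ||w||^2 and report the final loss.
        rng = np.random.default_rng(seed)
        w = np.ones(dim)
        for _ in range(k):
            g = w + 0.1 * rng.standard_normal(dim)   # noisy gradient of the quadratic
            w = w - lr * g
        return 0.5 * float(w @ w)

    for lr in (1e-1, 1e-2, 1e-3, 1e-4):
        print(f"lr={lr:.0e}  loss after 100 steps: {loss_after_k_steps(lr):.4f}")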

Slide Credit: Justin Johnson

19 of 22

Choosing Hyperparameters - A principled approach

  • Bayesian hyperparameter optimisation

    - Model the validation metric as p(y|λ) for a hyperparameter setting λ, and use the model to pick the next configuration to try

    - Uses successive halving to drop poorly performing configurations early
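A minimal successive-halving sketch, assuming a user-supplied evaluate(config, budget) that returns a validation loss (all names here are illustrative); for example, configs could be candidate learning rates and budget a number of training epochs:

    import numpy as np

    def successive_halving(configs, evaluate, budget=1, eta=2, rounds=3):
        # Evaluate every config on a small budget, keep the best 1/eta fraction,
        # multiply the budget by eta, and repeat.
        survivors = list(configs)
        for _ in range(rounds):
            scores = [evaluate(c, budget) for c in survivors]
            keep = max(1, len(survivors) // eta)
            order = np.argsort(scores)               # lower validation loss is better
            survivors = [survivors[i] for i in order[:keep]]
            budget *= eta
        return survivors[0]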

20 of 22

Coding Example

  • Task: Image Classification

  • Data: CIFAR10

  • Model: CNN
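A sketch of the kind of setup the coding example covers (a small CNN on CIFAR-10 in PyTorch); the actual architecture and hyperparameters in the recitation notebook may differ:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # CIFAR-10: 32x32 RGB images, 10 classes.
    train_set = datasets.CIFAR10(root="data", train=True, download=True,
                                 transform=transforms.ToTensor())
    train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

    model = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(64 * 8 * 8, 10),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    for images, labels in train_loader:              # one epoch of training
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()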

21 of 22

Reference

22 of 22

Thank you