1 of 29

The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers

Preetum Nakkiran, Behnam Neyshabur, Hanie Sedghi

Harvard University, Google Brain

Presenters: Chris Zhang, Dami Choi, Anqi (Joyce) Yang

2 of 29

Generalization theory

Goal: Understand when and why trained models have small test error

Classical approach: decompose the quantity we want to study (the test error) into a term we can empirically measure (the training error) plus a term left to analyze (the generalization gap).

Problem 1: Modern methods can often achieve zero training error, so the decomposition becomes trivial (the gap is simply the test error).

Problem 2: Since our current bounds on the generalization gap are all expressed in terms of function capacity, they turn vacuous when applied to overparameterized neural networks.
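Written out (a minimal LaTeX sketch in standard notation, not the paper's exact symbols), the classical decomposition and why both problems bite:

```latex
% Classical decomposition of test error (standard notation, assumed for illustration)
\[
\underbrace{\mathrm{TestError}(f)}_{\text{to study}}
  = \underbrace{\mathrm{TrainError}(f)}_{\text{empirically measure}}
  + \underbrace{\big[\mathrm{TestError}(f) - \mathrm{TrainError}(f)\big]}_{\text{generalization gap: analyze}}
\]
% Problem 1: when TrainError(f) = 0, the identity collapses to TestError = gap.
% Problem 2: capacity-based bounds on the gap are vacuous for overparameterized networks.
```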

3 of 29

Deep Bootstrap Framework

Key idea: Consider an alternative decomposition of the test error

4 of 29

Given some data-generating distribution and a learning algorithm (a fixed architecture + optimizer + hyperparameters), compare:

Real World (training with reused/bootstrapped data): the model after training for t steps on a fixed dataset of size n.

Ideal World (training always with fresh data): the model after training for the same t steps, drawing fresh samples from the distribution at every step.
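A minimal runnable sketch of the two worlds on a toy synthetic distribution (illustrative only, not the authors' code; in the paper, CIFAR-5m plays the role of the fresh-sample pool): the architecture, optimizer, and number of steps t are identical, and the only difference is whether batches reuse a fixed dataset of size n or are drawn fresh at every step.

```python
# Illustrative sketch (not the authors' code): Real World vs Ideal World training
# on a toy synthetic distribution, so both "worlds" are concrete and runnable.
import torch
import torch.nn as nn

def sample_from_distribution(k):
    """Toy data-generating distribution: 2-class Gaussian blobs in 20 dims."""
    y = torch.randint(0, 2, (k,))
    x = torch.randn(k, 20) + 2.0 * y.float().unsqueeze(1)
    return x, y

def train(get_batch, num_steps=1000, lr=0.1):
    """Same architecture, optimizer, and step count in both worlds;
    only `get_batch` differs."""
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(num_steps):
        x, y = get_batch(step)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model

n = 5000                                        # Real World sample size
data_x, data_y = sample_from_distribution(n)    # drawn once, then reused

def real_world_batch(step, bs=128):
    idx = torch.randint(0, n, (bs,))            # reuse the fixed dataset
    return data_x[idx], data_y[idx]

def ideal_world_batch(step, bs=128):
    return sample_from_distribution(bs)         # fresh samples at every step

f_real  = train(real_world_batch)               # Real World model after t steps
f_ideal = train(ideal_world_batch)              # Ideal World model after t steps
```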

5 of 29

Online learning: What is the test error after training for t steps on completely fresh data?

How fast does the model optimize the population loss in the Ideal World?

6 of 29

Online learning: What is the test error after training for t steps on completely fresh data?

How fast does the model optimize the population loss in the Ideal World?

Bootstrap Error: How much does the Real World differ from the Ideal World?

What effect does reusing data, rather than having access to unlimited fresh data, have on the test error?
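Putting the two pieces together, the alternative decomposition reads (a paraphrase in LaTeX; f_t is the Real World model and f_t^iid the Ideal World model after t steps):

```latex
% Deep Bootstrap decomposition of Real World test error after t steps (paraphrased notation)
\[
\mathrm{TestError}(f_t)
  = \underbrace{\mathrm{TestError}\big(f_t^{\mathrm{iid}}\big)}_{\text{online learning: Ideal World optimization}}
  + \underbrace{\big[\mathrm{TestError}(f_t) - \mathrm{TestError}\big(f_t^{\mathrm{iid}}\big)\big]}_{\text{bootstrap error}}
\]
```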

7 of 29

Main result of the paper

In typical deep learning setups, the bootstrap error is observed to be very small!

8 of 29

What small bootstrap error looks like…

9 of 29

For different architectures as well…

10 of 29

Even when overfitting … !

11 of 29

What does this imply?

To try to understand performance in

  • Offline
  • Real world, finite-data
  • Overparameterized

settings ...

12 of 29

What does this imply?

To try to understand performance in

  • Offline
  • Real world, finite-data
  • Overparameterized

settings ...

We can study the

  • Online
  • Ideal world, infinite-data
  • Underparameterized

setting (borrowing literature from online learning community, etc.)

13 of 29

What does this imply?

Summary: Good online learners are good offline generalizers

14 of 29

Main Experiment

15 of 29

Measuring Bootstrap Error Empirically

Problem: Typically we don't have infinite data. How do we simulate the Ideal World for tasks we care about (e.g., CIFAR-10)?

[Figure: example images from CIFAR-10 and CIFAR-5m]

As a proxy, the authors construct:

  • CIFAR-5m: use a SOTA generative model + classifier to generate and label 5 million CIFAR-10-like samples
  • ImageNet-DogBird: a binary dog-vs-bird classification task within ImageNet (155k samples in total)

16 of 29

Measuring Bootstrap Error Empirically

  • Metric: soft-error instead of hard error (sketched below)
  • Measure the bootstrap error until Real World training converges
    • below 1% training soft-error
  • Real World and Ideal World execute the exact same training code
    • optimizer, data augmentation, learning-rate schedule, hyperparameters, etc.
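A small sketch of how the metric and stopping criterion could be computed (my paraphrase; function names are illustrative, not the authors' code): soft-error treats the softmax output as a sampled prediction, so an example's error contribution is one minus the probability assigned to its true class.

```python
# Illustrative sketch of the soft-error metric and the stopping criterion.
import torch
import torch.nn.functional as F

def soft_error(model, x, y):
    """Soft-error: expected error if the prediction is *sampled* from the
    softmax distribution, i.e. 1 - p_model(true class | x), averaged."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)
        return (1.0 - probs[torch.arange(len(y)), y]).mean().item()

def bootstrap_error(f_real, f_ideal, test_x, test_y):
    """Bootstrap error: Real World test soft-error minus Ideal World test soft-error."""
    return soft_error(f_real, test_x, test_y) - soft_error(f_ideal, test_x, test_y)

def has_converged(model, train_x, train_y, threshold=0.01):
    """Real World stopping criterion: train soft-error below 1%."""
    return soft_error(model, train_x, train_y) < threshold
```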

17 of 29

Bootstrap error is small for various architectures and learning rates

CIFAR-5m

ImageNet-DogBird

18 of 29

Some phenomena through the lens of the Bootstrap Framework

19 of 29

Model Selection in different data-size regimes

Deep learning theory focuses on overparameterized networks

But datasets are now large enough to take us into the underparameterized regime

  • JFT: 300 million images
  • Instagram: 1 billion images
  • GPT-3: internet-scale text datasets

20 of 29

Model Selection in different data-size regimes

Why are the same techniques (architectures and training methods) used in practice in both over- and under-parameterized regimes?

A priori, one might expect:

  • Overparameterized regime: architecture matters for generalization reasons
  • Underparameterized regime: architecture matters for optimization reasons

21 of 29

Model Selection in different data-size regimes

Bootstrap framework: if the bootstrap error is small, then architectures which perform well in the online setting (underparameterized regime) also generalize well in the offline setting (overparameterized regime).

Hence it is not too surprising that the same methods work well in both regimes

22 of 29

Implicit Bias via Explicit Optimization

One line of work studies the effect of implicit bias of gradient descent.

  • Consider a CNN and an MLP whose function class subsumes the CNN's
  • Both find local minima with similar training accuracy

Traditional explanation: SGD implicitly biases the CNN toward better-generalizing local minima than the MLP

23 of 29

Implicit Bias via Explicit Optimization

Deep bootstrap framework:

Instead of studying which empirical minima SGD converges upon, �study why SGD quickly optimizes the population loss.

Generalization is captured by the fact that the CNN optimizes the population loss much faster than the MLP.

24 of 29

Ablations

25 of 29

What affects the bootstrap error?

  • Real World sample size
  • Data augmentation
  • Test metric

26 of 29

Effect of sample size

  • Increasing the sample size => the stopping time extends
  • The bootstrap error stays more or less small until the stopping time

27 of 29

Effect of data augmentation

  • Data augmentation affects the stopping time more than the bootstrap gap
  • It can even hurt performance in the Ideal World

28 of 29

Effect of test metric

  • The bootstrap gap is not small for all test metrics
  • A huge gap for the test loss makes sense (see the toy illustration after this list)
    • Training to convergence in the Real World causes the network weights to grow => the softmax distribution concentrates
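A tiny toy calculation (my own illustration, not from the paper) of why the loss gap can blow up even when the error metric barely moves: scaling the logits, as happens when weights grow late in Real World training, concentrates the softmax, so the loss on a misclassified example explodes.

```python
# Toy illustration: growing logits concentrate the softmax distribution,
# inflating cross-entropy loss on mistakes while the error metric barely moves.
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0], [1.0, 2.0]])   # one correct, one wrong prediction
labels = torch.tensor([0, 0])

for scale in [1.0, 5.0, 20.0]:                    # weight growth ~ logit scaling
    scaled = scale * logits
    loss = F.cross_entropy(scaled, labels).item()
    probs = F.softmax(scaled, dim=-1)
    err = (probs.argmax(dim=-1) != labels).float().mean().item()
    print(f"scale={scale:5.1f}  loss={loss:8.3f}  hard-error={err:.2f}")

# The hard error stays at 0.50 for every scale, but the loss grows roughly
# linearly in the scale, because the wrong example's true-class probability
# shrinks toward zero.
```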

29 of 29

Summary

  • New framework to understand generalization in deep learning.
  • Generalization in offline learning => optimization in online learning

Limitations

  • The gap is not universally small for all models and tasks
  • Lack of theory or explanation
    • We don't understand why the gap is small given that the Real World model is overfitting the training set (~99% training soft-accuracy)
    • We attempted to understand this in a toy setting (colab notebook link)