The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers
Preetum Nakkiran, Behnam Neyshabur, Hanie Sedghi
Harvard University, Google Brain
Presenters: Chris Zhang, Dami Choi, Anqi (Joyce) Yang
Generalization theory
Goal: Understand when and why trained models have small test error
To study it, classical theory decomposes:
Test Error = Train Error + Generalization Gap
Empirically measure the train error; analyze (bound) the generalization gap
Problem 1: Modern methods can often achieve zero training error, so the decomposition becomes trivial (test error = generalization gap)
Problem 2: Current bounds on the generalization gap are expressed in terms of function capacity, so they turn vacuous when applied to overparameterized neural networks
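Written out explicitly (standard notation, not taken from the slides), the classical decomposition and its two failure modes are:

```latex
\[
\underbrace{L_{\mathcal{D}}(f)}_{\text{test error}}
  \;=\;
\underbrace{L_{S}(f)}_{\text{train error (measured)}}
  \;+\;
\underbrace{\big[\, L_{\mathcal{D}}(f) - L_{S}(f) \,\big]}_{\text{generalization gap (analyzed)}}
\]
% Problem 1: when the model interpolates, L_S(f) = 0 and the identity
% collapses to "test error = generalization gap", which says nothing new.
% Problem 2: capacity-based bounds on the gap exceed 1 for
% overparameterized networks, i.e. they are vacuous.
```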
Deep Bootstrap Framework
Key idea: Consider alternative test error decomposition
Model after training for t steps on a dataset of size n
(Real World: training with reused/bootstrapped data)
Model after training for t steps, drawing fresh samples at every step
(Ideal World: training always with fresh data)
Given some data-generating distribution and a learning algorithm (a fixed architecture + optimizer + hyperparameters)
Online learning: What is the test error after training for t steps on completely fresh data?
How fast does the model optimize the population loss in the ideal world?
Bootstrap Error: How much does the real world differ from the ideal world?
What effect does reusing data instead of having access to unlimited fresh data have on test error?
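In symbols (following the paper's notation, with f_t the Real-World model and f_t^iid the Ideal-World model), the alternative decomposition is:

```latex
\[
\underbrace{\mathrm{TestError}(f_t)}_{\text{Real World: $t$ steps, $n$ reused samples}}
  \;=\;
\underbrace{\mathrm{TestError}\!\big(f_t^{\mathrm{iid}}\big)}_{\text{Ideal World: $t$ steps, fresh samples}}
  \;+\;
\underbrace{\Big[\, \mathrm{TestError}(f_t) - \mathrm{TestError}\!\big(f_t^{\mathrm{iid}}\big) \,\Big]}_{\text{bootstrap error}}
\]
```

The identity is trivial algebraically; the paper's content is the empirical claim that the last term is small.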
Main result of the paper
In typical deep learning setups, the bootstrap error is seen to be very small!
What small bootstrap error looks like…
For different architectures as well…
Even when overfitting … !
What does this imply?
To try to understand performance in offline (real-world, finite-data) settings...
...we can study the online (ideal-world) setting (borrowing literature from the online learning community, etc.)
Summary: Good online learners are good offline generalizers
Main Experiment
Measuring Bootstrap Error Empirically
Problem: Typically we don't have infinite data. How do we simulate the Ideal World for tasks we care about (e.g., CIFAR-10)?
As a proxy for CIFAR-10, the authors construct CIFAR-5m: millions of synthetic CIFAR-10-like images sampled from a generative model trained on CIFAR-10, treated as a source of fresh data for the Ideal World.
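A minimal sketch of how the bootstrap error could be measured, assuming hypothetical helpers make_model(), real_world_loader (a fixed dataset of n samples, reused), ideal_world_loader (streaming fresh CIFAR-5m samples), and test_loader; this illustrates the idea and is not the authors' code:

```python
import torch
import torch.nn.functional as F

# make_model, real_world_loader, ideal_world_loader, and test_loader are
# assumed to be defined elsewhere (hypothetical names, not from the paper).

def train_for_steps(model, loader, num_steps, lr=0.1, device="cuda"):
    """Run exactly `num_steps` SGD steps, drawing minibatches from `loader`."""
    model = model.to(device)
    model.train()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    it = iter(loader)
    for _ in range(num_steps):
        try:
            x, y = next(it)
        except StopIteration:
            # Real World: the finite dataset is exhausted and reused.
            # Ideal World: the stream of fresh samples never runs out.
            it = iter(loader)
            x, y = next(it)
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    return model

@torch.no_grad()
def test_error(model, loader, device="cuda"):
    """Fraction of misclassified test examples."""
    model.eval()
    wrong, total = 0, 0
    for x, y in loader:
        pred = model(x.to(device)).argmax(dim=1).cpu()
        wrong += (pred != y).sum().item()
        total += y.numel()
    return wrong / total

# Same architecture, optimizer, and number of steps t in both worlds;
# only the data stream differs.
t = 10_000
real_model  = train_for_steps(make_model(), real_world_loader,  num_steps=t)
ideal_model = train_for_steps(make_model(), ideal_world_loader, num_steps=t)

bootstrap_error = test_error(real_model, test_loader) - test_error(ideal_model, test_loader)
print(f"bootstrap error at step {t}: {bootstrap_error:.4f}")
```

Because everything except the data stream is held fixed, the gap in test error isolates the effect of reusing data.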
Measuring Bootstrap Error Empirically
Bootstrap error is small for various architectures and learning rates, on both CIFAR-5m and ImageNet-DogBird
Some phenomena through the lens of the Bootstrap Framework
Model Selection in different data-size regimes
Deep learning theory focuses on overparameterized networks
But data is big enough nowadays to take us into the underparameterized regime
Why are the same techniques (architectures and training methods) used in practice in both over- and under-parameterized regimes?
Intuitively, a priori, one might expect the two regimes to call for different techniques.
Bootstrap framework: if the bootstrap error is small, then...
...architectures which perform well in the online setting (underparameterized regime)...
...also generalize well in the offline setting (overparameterized regime)
Hence it is not too surprising that the same methods work well in both regimes
Implicit Bias via Explicit Optimization
One line of work studies the effect of implicit bias of gradient descent.
Traditional explanation: SGD implicitly biases CNNs toward better-generalizing minima compared to MLPs
Deep bootstrap framework:
Instead of studying which empirical minima SGD converges to, study why SGD optimizes the population loss quickly.
Generalization is captured by the fact that CNNs optimize the population loss much faster than MLPs
Ablations
What affects the bootstrap error?
Effect of sample size
Effect of data augmentation
Effect of test metric
Summary
Limitations