1 of 25

About Ping

  • 5th Year PhD student at University of Maryland - College Park
  • Research Interests
    • Reliability of machine learning models in extreme scenarios
    • Security of machine learning models
    • Fundamentals of neural networks

2 of 25

Loss landscape is all you need: Neural Network Generalization Can Be Explained Without the Implicit Bias of Gradient Descent

Ping-yeh Chiang, Renkun Ni, David Yu Miller, Arpit Bansal, Jonas Geiping, Micah Goldblum, Tom Goldstein

3 of 25

Deep Learning Puzzle

Increasing the model capacity increases the model performance

(Nakkiran, 2021)

4 of 25

Deep Learning Puzzle

Increasing the model capacity increases the model performance

100 Million Parameters

100 Billion Parameters

5 of 25

Why is it puzzling? Large models include many solutions that overfit.

[Figure: for both a 100-million-parameter and a 100-billion-parameter model, the parameter space contains good models and bad (overfitting) models, with a gradient-based optimizer selecting among them]

(Galanti & Poggio, 2022; Arora et al., 2019; Advani et al., 2020; Liu et al., 2020)

6 of 25

How do we know that the number of bad models has increased?

[Figure: x-axis: more model parameters]

7 of 25

What would be a crazy optimizer to test?

  • Zeroth-order optimizer
  • Guess & Check
    • We randomly sample parameters uniformly within an L∞ ball of radius 1, i.e., each weight drawn from (-1, 1)
    • If the sampled parameters fit the training data perfectly, we consider the model trained (see the sketch below)

[Figure: successive random parameter draws: first try, second try, third try]
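A minimal sketch of the Guess & Check procedure described above, assuming a PyTorch model and in-memory training tensors; the function name and the retry budget are illustrative, not the authors' implementation.

```python
import torch

def guess_and_check(model, x_train, y_train, max_tries=1_000_000):
    """Resample all parameters uniformly in (-1, 1) until the model fits the
    training set perfectly; return the model and the number of tries used."""
    with torch.no_grad():
        for attempt in range(1, max_tries + 1):
            # "Guess": draw every weight i.i.d. from the L-infinity ball of radius 1
            for p in model.parameters():
                p.uniform_(-1.0, 1.0)
            # "Check": accept only if training accuracy is 100%
            preds = model(x_train).argmax(dim=1)
            if bool((preds == y_train).all()):
                return model, attempt
    raise RuntimeError("no interpolating parameter draw found within the budget")
```

In practice the accepted models can then be grouped by their normalized training loss, as in the loss-bin analysis later in the talk.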

8 of 25

The Guess & Check model performs surprisingly well!

[Figure: x-axis: more model parameters]

9 of 25

Our proposed alternative hypothesis - Volume Hypothesis

[Figure: parameter space containing regions of good models and bad models]
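Stated informally (a paraphrase, not a formula from the slides): because Guess & Check samples parameters uniformly, the probability that an accepted interpolating model generalizes well is simply the fraction of interpolating volume occupied by good models,

Pr[good model | fits training data] = Vol(good ∩ fits) / Vol(fits),

so if well-generalizing solutions dominate that volume, random search generalizes without any gradient-based implicit bias.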

10 of 25

Experimental Set-up

  • Non-gradient based optimizer
    • Guess & Check
  • Dataset
    • 2-class MNIST / CIFAR-10 with LeNet
    • Training set varies from 2 to 32 training samples
    • Test set contains 10,000 examples
  • Model
    • LeNet with 10,000+ parameters (see the sketch below)
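A minimal sketch of this set-up, assuming PyTorch/torchvision: a 2-class MNIST subset with a handful of training examples and a small LeNet-style network whose widths land a bit above 10,000 parameters. The exact architecture and class pair are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

def two_class_subset(dataset, classes=(0, 1), n_per_class=8):
    """Collect n_per_class images for each of the two chosen classes."""
    xs, ys, counts = [], [], {c: 0 for c in classes}
    for img, label in dataset:
        if label in classes and counts[label] < n_per_class:
            xs.append(img)
            ys.append(classes.index(label))  # relabel as 0/1
            counts[label] += 1
        if all(counts[c] >= n_per_class for c in classes):
            break
    return torch.stack(xs), torch.tensor(ys)

class SmallLeNet(nn.Module):
    """LeNet-style convnet with roughly 10k parameters (for 28x28 inputs)."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 4 * 4, 32), nn.ReLU(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

mnist_train = datasets.MNIST("data", train=True, download=True,
                             transform=transforms.ToTensor())
x_train, y_train = two_class_subset(mnist_train, classes=(0, 1), n_per_class=8)
model = SmallLeNet()
print(sum(p.numel() for p in model.parameters()))  # roughly 10.9k parameters
```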

11 of 25

This model is very large for this training data regime!

  • LeNet with 10,000+ parameters
  • The model can achieve 100% training accuracy while still misclassifying all 10,000 test examples!

12 of 25

2-class MNIST - Guess & Check generalizes surprisingly well

Observation

  • Performance of G&C is very similar to SGD
  • G&C models are much better than linear models
  • Models continue to perform better as the loss decreases, even though training accuracy has already reached 100%

13 of 25

2-class MNIST - Guess & Check generalizes surprisingly well

Important details

  • Training losses are calculated after L2 weight normalization of each layer (see the sketch below)
    • Without normalization, G&C can obtain arbitrarily low loss: once the classification is correct, the loss can be driven down simply by scaling the weights up. We always calculate the training loss for SGD in this normalized manner as well
  • It is important to evaluate performance within separate loss bins
    • If we did not separate the models by loss bin, SGD would appear to perform much better, since the models found with G&C are dominated by solutions with higher losses
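A minimal sketch of the normalization described above, assuming a feed-forward PyTorch model: each layer's weight tensor is rescaled to unit L2 norm before the loss is computed, so the loss can no longer be driven down by simply scaling weights up. How biases are treated is not specified on the slide; this version leaves them untouched and is only illustrative.

```python
import copy
import torch
import torch.nn.functional as F

def normalized_training_loss(model, x_train, y_train):
    """Cross-entropy training loss after per-layer L2 weight normalization."""
    model = copy.deepcopy(model)  # keep the original weights untouched
    with torch.no_grad():
        for module in model.modules():
            w = getattr(module, "weight", None)
            if w is not None:
                w.div_(w.norm(p=2) + 1e-12)  # unit L2 norm for this layer's weights
        logits = model(x_train)
        return F.cross_entropy(logits, y_train).item()
```

The same normalized loss is used to bin both G&C and SGD models, so the comparison across loss bins is apples to apples.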

14 of 25

2-class CIFAR - Guess & Check generalizes surprisingly well

Observation

  • Again, performance of G&C is very similar to SGD
  • Models continue to perform better as the loss decreases, even though training accuracy has already reached 100%

15 of 25

Model performance improves as we increase width!

16 of 25

Larger number of samples + Larger number of classes

  1. We used Pattern Search and Random Greedy Search for larger numbers of samples (sketched below)
    1. Pattern Search moves only a single coordinate at a time
    2. Random Greedy randomly searches the neighborhood of the existing best solution
  2. Generalization performance is similar.
    • Although these experiments do not directly support the volume hypothesis, we thought it important to at least test some examples with larger sample sizes to be more thorough
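A minimal sketch of the two zeroth-order searches named above, operating on a flat parameter vector with a loss callback; step sizes, neighborhood radius, and acceptance rules are assumptions, and the original implementations may differ.

```python
import torch

def pattern_search_step(theta, loss_fn, step=0.05):
    """Try +/- step on one coordinate at a time; keep any move that lowers the loss."""
    best_loss = loss_fn(theta)
    for i in range(theta.numel()):
        for delta in (step, -step):
            candidate = theta.clone()
            candidate[i] += delta
            cand_loss = loss_fn(candidate)
            if cand_loss < best_loss:
                theta, best_loss = candidate, cand_loss
    return theta, best_loss

def random_greedy_step(theta, loss_fn, radius=0.05, n_candidates=64):
    """Sample random perturbations around the current best solution; keep the best one."""
    best_theta, best_loss = theta, loss_fn(theta)
    for _ in range(n_candidates):
        candidate = theta + radius * (2 * torch.rand_like(theta) - 1)
        cand_loss = loss_fn(candidate)
        if cand_loss < best_loss:
            best_theta, best_loss = candidate, cand_loss
    return best_theta, best_loss
```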

17 of 25

Summary

  1. Guess & Check performs similarly to SGD-trained models, confirming that volume bias alone allows neural networks to generalize in the low-data regime
  2. In addition, we observe some common phenomena in Guess & Check trained models
    1. Larger models perform better
    2. Lower-loss solutions perform better despite all achieving 100% training accuracy
  3. Finally, we tested non-gradient-based optimizers on medium data sizes and confirmed similar generalization levels.

18 of 25

Future Work - Does volume bias also exist in the large-data regime?

How do we test the volume hypothesis in settings with larger numbers of examples and parameters?

19 of 25

Future Work - Can we characterize the volume bias of neural networks?

For example, why are image recognition models inherently susceptible to adversarial examples? Is it due to volume bias?

The left decision boundary has a volume more than 4 orders of magnitude larger than the one on the right -> maybe models are inherently non-robust due to the architecture? (A rough way to compare such volumes is sketched below.)
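One hedged way to make the volume comparison concrete (an illustration of the idea, not the measurement behind the slide's figure): estimate, by Monte Carlo over random parameter draws, what fraction of draws reproduces each decision boundary, approximated here by exact agreement with reference labels on a probe set.

```python
import torch

def relative_volume(model, x_probe, reference_labels, n_samples=100_000):
    """Fraction of uniform(-1, 1) parameter draws whose predictions on the
    probe set exactly match reference_labels (a crude volume proxy)."""
    hits = 0
    with torch.no_grad():
        for _ in range(n_samples):
            for p in model.parameters():
                p.uniform_(-1.0, 1.0)
            preds = model(x_probe).argmax(dim=1)
            hits += int(bool((preds == reference_labels).all()))
    return hits / n_samples
```

Comparing this fraction for two decision boundaries gives their relative volumes under the uniform sampling distribution.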

20 of 25

Future Work - How do larger models have better volume bias?

  1. What is the bias inherent in larger models and why is it better?
  2. How can we translate the bias within large models to smaller models?

[Figure: two parameter spaces, each containing good and bad models; what bias does this induce?]

21 of 25

Future Work - How does loss level change the volume bias of a network?

  1. Why do lower-loss solutions have better generalization?

[Figure: solutions grouped into loss bins 1 through 4]

22 of 25

Any Questions?

23 of 25

Thank you!

Micah Goldblum

Tom Goldstein

Arpit Bansal

Renkun Ni

Jonas Geiping

David Miller

24 of 25

Citations

Nakkiran, Preetum, et al. "Deep double descent: Where bigger models and more data hurt." Journal of Statistical Mechanics: Theory and Experiment 2021.12 (2021): 124003.

Galanti, Tomer, and Tomaso Poggio. "SGD noise and implicit low-rank bias in deep neural networks." March 2022. URL: https://cbmm.mit.edu/publications/sgd-noise-and-implicit-low-rank-bias-deep-neural-networks

Keskar, Nitish Shirish, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. "On large-batch training for deep learning: Generalization gap and sharp minima." CoRR abs/1609.04836 (2016). URL: http://arxiv.org/abs/1609.04836

25 of 25

Appendix