1 of 25

About Ping

  • 5th Year PhD student at University of Maryland - College Park
  • Research Interests
    • Reliability of machine learning models in extreme scenarios
    • Security of machine learning models
    • Fundamentals of neural networks

2 of 25

Loss landscape is all you need: Neural Network Generalization Can Be Explained Without the Implicit Bias of Gradient Descent

Ping-yeh Chiang, Renkun Ni, David Yu Miller, Arpit Bansal, Jonas Geiping, Micah Goldblum, Tom Goldstein

3 of 25

Deep Learning Puzzle

Increasing the model capacity increases the model performance

(Nakkiran, 2021)

4 of 25

Deep Learning Puzzle

Increasing the model capacity increases the model performance

100 Million Parameters

100 Billion Parameters

5 of 25

Why is it puzzling? Large models include many solutions that overfit.

[Figure: for both a 100-million-parameter and a 100-billion-parameter model, the parameter space contains good models and bad (overfitting) models, with a gradient-based optimizer selecting among them]

(Galanti & Poggio, 2022; Arora et al., 2019; Advani et al., 2020; Liu et al., 2020)

6 of 25

How do we know that the number of bad models has increased?

[Figure: x-axis: more model parameters]

7 of 25

What would be a crazy optimizer to test?

  • Zeroth-order optimizer
  • Guess & Check
    • We randomly sample parameters uniformly within an L∞ ball of radius 1, i.e., each weight drawn from (-1, 1)
    • If the sampled parameters fit the training data perfectly, we consider the model trained (see the sketch below)

[Figure: successive random parameter draws: first try, second try, third try]
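A minimal sketch of the Guess & Check procedure described above, assuming a PyTorch model and in-memory training tensors; the function name and the retry budget are illustrative, not the authors' implementation.

```python
import torch

def guess_and_check(model, x_train, y_train, max_tries=1_000_000):
    """Resample all parameters uniformly in (-1, 1) until the model fits the
    training set perfectly; return the model and the number of tries used."""
    with torch.no_grad():
        for attempt in range(1, max_tries + 1):
            # "Guess": draw every weight i.i.d. from the L-infinity ball of radius 1
            for p in model.parameters():
                p.uniform_(-1.0, 1.0)
            # "Check": accept only if training accuracy is 100%
            preds = model(x_train).argmax(dim=1)
            if bool((preds == y_train).all()):
                return model, attempt
    raise RuntimeError("no interpolating parameter draw found within the budget")
```

In practice the accepted models can then be grouped by their normalized training loss, as in the loss-bin analysis later in the talk.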

8 of 25

The Guess & Check model performs surprisingly well!

[Figure: x-axis: more model parameters]

9 of 25

Our proposed alternative hypothesis - Volume Hypothesis

[Figure: parameter space containing regions of good models and bad models]
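Stated informally (a paraphrase, not a formula from the slides): because Guess & Check samples parameters uniformly, the probability that an accepted interpolating model generalizes well is simply the fraction of interpolating volume occupied by good models,

Pr[good model | fits training data] = Vol(good ∩ fits) / Vol(fits),

so if well-generalizing solutions dominate that volume, random search generalizes without any gradient-based implicit bias.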

10 of 25

Experimental Set-up

  • Non-gradient based optimizer
    • Guess & Check
  • Dataset
    • 2-class MNIST / CIFAR-10 with LeNet
    • Training set varies from 2 to 32 training samples
    • Test set contains 10,000 examples
  • Model
    • LeNet with 10,000+ parameters (see the sketch below)
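A minimal sketch of this set-up, assuming PyTorch/torchvision: a 2-class MNIST subset with a handful of training examples and a small LeNet-style network whose widths land a bit above 10,000 parameters. The exact architecture and class pair are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

def two_class_subset(dataset, classes=(0, 1), n_per_class=8):
    """Collect n_per_class images for each of the two chosen classes."""
    xs, ys, counts = [], [], {c: 0 for c in classes}
    for img, label in dataset:
        if label in classes and counts[label] < n_per_class:
            xs.append(img)
            ys.append(classes.index(label))  # relabel as 0/1
            counts[label] += 1
        if all(counts[c] >= n_per_class for c in classes):
            break
    return torch.stack(xs), torch.tensor(ys)

class SmallLeNet(nn.Module):
    """LeNet-style convnet with roughly 10k parameters (for 28x28 inputs)."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 4 * 4, 32), nn.ReLU(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

mnist_train = datasets.MNIST("data", train=True, download=True,
                             transform=transforms.ToTensor())
x_train, y_train = two_class_subset(mnist_train, classes=(0, 1), n_per_class=8)
model = SmallLeNet()
print(sum(p.numel() for p in model.parameters()))  # roughly 10.9k parameters
```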

11 of 25

This model is very large for this training data regime!

  • LeNet with 10,000+ parameters
  • The model can achieve 100% training accuracy while still misclassifying all 10,000 test examples!

12 of 25

2-class MNIST - Guess & Check generalizes surprisingly well

Observation

  • Performance of G&C is very similar to SGD
  • G&C models are much better than linear models
  • Models continue to perform better as the loss decreases, even though training accuracy has already reached 100%

13 of 25

2-class MNIST - Guess & Check generalizes surprisingly well

Important details

  • Training losses are calculated after L2 weight normalization of each layer (see the sketch below)
    • Without normalization, G&C can obtain arbitrarily low loss: once the classification is correct, the loss can be driven down simply by scaling the weights up. We always calculate the training loss for SGD in this normalized manner as well
  • It is important to evaluate performance within separate loss bins
    • If we did not separate the models by loss bin, SGD would appear to perform much better, since the models found with G&C are dominated by solutions with higher losses
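A minimal sketch of the normalization described above, assuming a feed-forward PyTorch model: each layer's weight tensor is rescaled to unit L2 norm before the loss is computed, so the loss can no longer be driven down by simply scaling weights up. How biases are treated is not specified on the slide; this version leaves them untouched and is only illustrative.

```python
import copy
import torch
import torch.nn.functional as F

def normalized_training_loss(model, x_train, y_train):
    """Cross-entropy training loss after per-layer L2 weight normalization."""
    model = copy.deepcopy(model)  # keep the original weights untouched
    with torch.no_grad():
        for module in model.modules():
            w = getattr(module, "weight", None)
            if w is not None:
                w.div_(w.norm(p=2) + 1e-12)  # unit L2 norm for this layer's weights
        logits = model(x_train)
        return F.cross_entropy(logits, y_train).item()
```

The same normalized loss is used to bin both G&C and SGD models, so the comparison across loss bins is apples to apples.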

14 of 25

2-class CIFAR - Guess & Check generalizes surprisingly well

Observation

  • Again, performance of G&C is very similar to SGD
  • Models continue to perform better as the loss decreases, even though training accuracy has already reached 100%

15 of 25

Model performance improves as we increase width!

16 of 25

Larger number of samples + Larger number of classes

  1. We used Pattern Search and Random Greedy Search for larger numbers of samples (sketched below)
    1. Pattern Search moves only a single coordinate at a time
    2. Random Greedy randomly searches the neighborhood of the existing best solution
  2. Generalization performance is similar.
    • Although these experiments do not directly support the volume hypothesis, we thought it important to at least test some examples with larger sample sizes to be more thorough
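A minimal sketch of the two zeroth-order searches named above, operating on a flat parameter vector with a loss callback; step sizes, neighborhood radius, and acceptance rules are assumptions, and the original implementations may differ.

```python
import torch

def pattern_search_step(theta, loss_fn, step=0.05):
    """Try +/- step on one coordinate at a time; keep any move that lowers the loss."""
    best_loss = loss_fn(theta)
    for i in range(theta.numel()):
        for delta in (step, -step):
            candidate = theta.clone()
            candidate[i] += delta
            cand_loss = loss_fn(candidate)
            if cand_loss < best_loss:
                theta, best_loss = candidate, cand_loss
    return theta, best_loss

def random_greedy_step(theta, loss_fn, radius=0.05, n_candidates=64):
    """Sample random perturbations around the current best solution; keep the best one."""
    best_theta, best_loss = theta, loss_fn(theta)
    for _ in range(n_candidates):
        candidate = theta + radius * (2 * torch.rand_like(theta) - 1)
        cand_loss = loss_fn(candidate)
        if cand_loss < best_loss:
            best_theta, best_loss = candidate, cand_loss
    return best_theta, best_loss
```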

17 of 25

Summary

  1. Guess & Check performs similarly to SGD-trained models, confirming that volume bias alone allows neural networks to generalize in the low-data regime
  2. In addition, we observe some common phenomena in Guess & Check trained models
    1. Larger models perform better
    2. Lower-loss solutions perform better despite all achieving 100% training accuracy
  3. Finally, we tested non-gradient-based optimizers on medium data sizes and confirmed similar generalization levels.

18 of 25

Future Work - Does volume bias also exist in the large-data regime?

How do we test the volume hypothesis in settings with larger numbers of examples and parameters?

19 of 25

Future Work - Can we characterize the volume bias of neural networks?

For example, why are image recognition models inherently susceptible to adversarial examples? Is it due to volume bias?

The left decision boundary has a volume more than 4 orders of magnitude larger than the one on the right -> maybe models are inherently non-robust due to the architecture? (A rough way to compare such volumes is sketched below.)
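One hedged way to make the volume comparison concrete (an illustration of the idea, not the measurement behind the slide's figure): estimate, by Monte Carlo over random parameter draws, what fraction of draws reproduces each decision boundary, approximated here by exact agreement with reference labels on a probe set.

```python
import torch

def relative_volume(model, x_probe, reference_labels, n_samples=100_000):
    """Fraction of uniform(-1, 1) parameter draws whose predictions on the
    probe set exactly match reference_labels (a crude volume proxy)."""
    hits = 0
    with torch.no_grad():
        for _ in range(n_samples):
            for p in model.parameters():
                p.uniform_(-1.0, 1.0)
            preds = model(x_probe).argmax(dim=1)
            hits += int(bool((preds == reference_labels).all()))
    return hits / n_samples
```

Comparing this fraction for two decision boundaries gives their relative volumes under the uniform sampling distribution.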

20 of 25

Future Work - How do larger models have better volume bias?

  1. What is the bias inherent in larger models and why is it better?
  2. How can we translate the bias within large models to smaller models?

[Figure: two parameter spaces, each containing good and bad models; what bias does this induce?]

21 of 25

Future Work - How does loss level change the volume bias of a network?

  1. Why do lower-loss solutions have better generalization?

[Figure: solutions grouped into loss bins 1 through 4]

22 of 25

Any Questions?

23 of 25

Thank you!

Micah Goldblum

Tom Goldstein

Arpit Bansal

Renkun Ni

Jonas Geiping

David Miller

24 of 25

Citations

Nakkiran, Preetum, et al. "Deep double descent: Where bigger models and more data hurt." Journal of Statistical Mechanics: Theory and Experiment 2021.12 (2021): 124003.

Galanti, Tomer, and Tomaso Poggio. "SGD noise and implicit low-rank bias in deep neural networks." March 2022. URL: https://cbmm.mit.edu/publications/sgd-noise-and-implicit-low-rank-bias-deep-neural-networks

Keskar, Nitish Shirish, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. "On large-batch training for deep learning: Generalization gap and sharp minima." CoRR abs/1609.04836 (2016). URL: http://arxiv.org/abs/1609.04836

25 of 25

Appendix