1 of 10

Exploring Generalization in DL

  • What drives generalization in deep learning?
  • Various measures and how they ensure generalization
    • VC dimensionality
    • Norm of weights (and variants)
  • Connection between sharpness and PAC-Bayes

2 of 10

What should a measure do?

  • The measure cannot be uniform across all functions representable by a given architecture
  • An appropriate complexity measure must be sufficient to ensure generalization
  • Want it to capture empirical phenomena:
    • A network trained on noiseless data has lower complexity than one trained on noisy data
    • Network complexity decreases as the number of parameters increases (matching the observed drop in test error)
    • Complexity correlates with generalization across different networks that reach zero training error

3 of 10

Network size

  • Naive capacity of a model is linear in the total number of parameters
  • The VC dimension of a feedforward ReLU network can be bounded (roughly O(#params · depth · log #params) for piecewise-linear activations)

  • Cannot explain the reduction in generalization error as the number of parameters grows
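As a rough illustration of why parameter counting fails here, a sketch assuming the known O(W·L·log W) VC upper bound for piecewise-linear networks (helper names are hypothetical; constant factors omitted):

```python
import math

def num_params(layer_widths):
    # total weights + biases of a fully connected network
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_widths, layer_widths[1:]))

def vc_upper_bound(layer_widths):
    # VC dimension of a ReLU (piecewise-linear) net is O(W * L * log W),
    # with W = #parameters and L = #layers; constant factor omitted
    W = num_params(layer_widths)
    L = len(layer_widths) - 1
    return W * L * math.log2(W)

# the bound only grows with parameter count, so it cannot explain
# why test error often *drops* as networks get wider
print(vc_upper_bound([784, 32, 10]), vc_upper_bound([784, 1024, 10]))
```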

4 of 10

Norms and Margins

  • Regularizing a norm of the weights is a long-established way to improve generalization
  • Different norms can be considered, and normalization helps combat the rescaling problem
  • The 0-1 loss is insensitive to rescaling of the weights, but cross-entropy is sensitive
    • To drive cross-entropy to zero, the weight norms must grow to infinity
    • Comparing norms of networks trained to different points of the optimization is therefore invalid
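A minimal numerical check of the scaling claim (toy logits, not from the slides):

```python
import numpy as np

def cross_entropy(logits, label):
    # numerically stable cross-entropy of a single example
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

logits = np.array([2.0, -1.0])  # class 0 is predicted either way
for c in [1.0, 10.0, 100.0]:
    # scaling the logits (e.g. by rescaling the weights of a ReLU net)
    # leaves the 0-1 loss unchanged but drives cross-entropy toward 0
    print(c, cross_entropy(c * logits, 0))
```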

(1) l2 norm
(2) l1 path norm
(3) l2 path norm
(4) spectral norm
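The four norm-based measures listed above can be sketched for a net given as a list of weight matrices of shape (fan_out, fan_in); function names are hypothetical and biases are ignored for brevity:

```python
import numpy as np

def l2_norm(weights):
    # (1) overall l2 norm of all weights
    return np.sqrt(sum((W ** 2).sum() for W in weights))

def path_norm(weights, p):
    # (2)/(3) lp path norm: sum over input-output paths of the product of
    # |w|^p along the path, computed by pushing an all-ones vector through |W|^p
    v = np.ones(weights[0].shape[1])
    for W in weights:
        v = np.abs(W) ** p @ v
    return float(v.sum()) ** (1.0 / p)

def spectral_norm(weights):
    # (4) product of per-layer spectral norms (largest singular values)
    return float(np.prod([np.linalg.svd(W, compute_uv=False)[0]
                          for W in weights]))
```

The path norms conveniently sidestep layer-wise rescaling: multiplying one layer by c and dividing the next by c leaves every path product unchanged.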

5 of 10

Norms and Margins: Empirical check

6 of 10

Lipschitz continuity and Robustness

  • Capacity can be bounded in terms of the Lipschitz constant C_M of the network and the diameter diam(X) of the input metric space
  • Tightly connected to this is the idea of robustness with respect to a partition of the input space
  • Critique: the resulting bounds are exponential in the input dimensionality and in the depth of the network
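One standard upper bound on the Lipschitz constant of a ReLU net is the product of the layers' spectral norms, since ReLU itself is 1-Lipschitz; a sketch with made-up weights (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(2, 8))]  # toy 2-layer ReLU net

def relu_net(x):
    h = np.maximum(Ws[0] @ x, 0.0)  # ReLU is 1-Lipschitz
    return Ws[1] @ h

# C_M upper bound: product of the per-layer spectral norms
C_M = float(np.prod([np.linalg.svd(W, compute_uv=False)[0] for W in Ws]))

# empirical sanity check on a random pair of inputs
x, y = rng.normal(size=4), rng.normal(size=4)
lhs = np.linalg.norm(relu_net(x) - relu_net(y))
assert lhs <= C_M * np.linalg.norm(x - y)
```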

7 of 10

Sharpness

  • Sharpness definition proposed by Keskar et al. – robustness to adversarial perturbations in parameter space
    • Does not capture the random/true labels training phenomenon
    • Sensitive to rescaling of the parameters
  • PAC-Bayes connection:
    • Makes it clear that sharpness must be balanced by another measure (e.g. a norm)
    • With a zero-mean Gaussian prior and Gaussian parameter perturbations, the bound combines expected sharpness with the norm of the weights
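The expected-sharpness term from the PAC-Bayes view can be sketched by Monte Carlo over Gaussian parameter perturbations; a linear model with squared loss stands in for the training loss here, and all names are illustrative:

```python
import numpy as np

def train_loss(w, X, y):
    # squared loss of a linear model; stand-in for the training loss L(w)
    return float(((X @ w - y) ** 2).mean())

def expected_sharpness(w, X, y, sigma=0.1, n_samples=200, seed=0):
    # E_{u ~ N(0, sigma^2 I)}[L(w + u)] - L(w): the expected-sharpness
    # term that appears in the PAC-Bayes bound with a zero-mean Gaussian
    # prior and Gaussian parameter perturbations
    rng = np.random.default_rng(seed)
    base = train_loss(w, X, y)
    perturbed = [train_loss(w + rng.normal(scale=sigma, size=w.shape), X, y)
                 for _ in range(n_samples)]
    return float(np.mean(perturbed)) - base
```

In the bound this term is traded off against a KL term that grows with the weight norm (for these Gaussians, ||w||² / 2σ²), which is why sharpness alone is not a sufficient measure.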

8 of 10

PAC-Bayes Sharpness: Empirical check

9 of 10

Empirical evaluation

  • CIFAR10 with confusion sets of varying sizes
  • MNIST with a fully connected network of varying width
  • Conclusion:
    • None of the considered generalization measures captures all the desired phenomena

10 of 10

Bounds on sharpness

  • Conditions that affect the sharpness of a network:
    • C1. Weak interactions between layers can lead to high sharpness
    • C2. Changes in the activations are small when the perturbation is small
    • C3. No hidden units with too high a magnitude, which could lead to significantly different output when the unit is active
  • Under these conditions, a bound on the sharpness can be derived