1 of 10

Exploring Generalization in DL

  • What drives generalization in deep learning?
  • Various measures and how they ensure generalization
    • VC dimensionality
    • Norm of weights (and variants)
  • Connection between sharpness and PAC-Bayes

2 of 10

What should a measure do?

  • The measure cannot be uniform across all functions representable by a given architecture
  • An appropriate complexity measure must be sufficient to ensure generalization
  • Want it to capture empirical phenomena:
    • A network trained on noiseless data has lower complexity than one trained on noisy data
    • Network complexity decreases as the number of parameters increases (matching the observed drop in test error)
    • Complexity correlates with generalization across different networks that reach zero training error

3 of 10

Network size

  • Naive capacity of a model is linear in the total number of parameters
  • The VC dimension of a feedforward ReLU network can be bounded (roughly O(#params · depth · log #params) for piecewise-linear activations)

  • Cannot explain the reduction in generalization error as the number of parameters grows
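As a rough illustration of why parameter counting fails here, a sketch assuming the known O(W·L·log W) VC upper bound for piecewise-linear networks (helper names are hypothetical; constant factors omitted):

```python
import math

def num_params(layer_widths):
    # total weights + biases of a fully connected network
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_widths, layer_widths[1:]))

def vc_upper_bound(layer_widths):
    # VC dimension of a ReLU (piecewise-linear) net is O(W * L * log W),
    # with W = #parameters and L = #layers; constant factor omitted
    W = num_params(layer_widths)
    L = len(layer_widths) - 1
    return W * L * math.log2(W)

# the bound only grows with parameter count, so it cannot explain
# why test error often *drops* as networks get wider
print(vc_upper_bound([784, 32, 10]), vc_upper_bound([784, 1024, 10]))
```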

4 of 10

Norms and Margins

  • Regularizing a norm of the weights is a long-established way to improve generalization
  • Different norms can be considered, and normalization helps combat the rescaling problem
  • The 0-1 loss is insensitive to rescaling of the weights, but cross-entropy is sensitive
    • To drive cross-entropy to zero, the weight norms must grow to infinity
    • Comparing norms of networks trained to different points of the optimization is therefore invalid
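A minimal numerical check of the scaling claim (toy logits, not from the slides):

```python
import numpy as np

def cross_entropy(logits, label):
    # numerically stable cross-entropy of a single example
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

logits = np.array([2.0, -1.0])  # class 0 is predicted either way
for c in [1.0, 10.0, 100.0]:
    # scaling the logits (e.g. by rescaling the weights of a ReLU net)
    # leaves the 0-1 loss unchanged but drives cross-entropy toward 0
    print(c, cross_entropy(c * logits, 0))
```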

(1) l2 norm
(2) l1 path norm
(3) l2 path norm
(4) spectral norm
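The four norm-based measures listed above can be sketched for a net given as a list of weight matrices of shape (fan_out, fan_in); function names are hypothetical and biases are ignored for brevity:

```python
import numpy as np

def l2_norm(weights):
    # (1) overall l2 norm of all weights
    return np.sqrt(sum((W ** 2).sum() for W in weights))

def path_norm(weights, p):
    # (2)/(3) lp path norm: sum over input-output paths of the product of
    # |w|^p along the path, computed by pushing an all-ones vector through |W|^p
    v = np.ones(weights[0].shape[1])
    for W in weights:
        v = np.abs(W) ** p @ v
    return float(v.sum()) ** (1.0 / p)

def spectral_norm(weights):
    # (4) product of per-layer spectral norms (largest singular values)
    return float(np.prod([np.linalg.svd(W, compute_uv=False)[0]
                          for W in weights]))
```

The path norms conveniently sidestep layer-wise rescaling: multiplying one layer by c and dividing the next by c leaves every path product unchanged.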

5 of 10

Norms and Margins: Empirical check

6 of 10

Lipschitz continuity and Robustness

  • Capacity can be bounded in terms of the Lipschitz constant C_M of the network and the diameter diam(X) of the input metric space
  • Tightly connected to this is the idea of robustness with respect to a partition of the input space
  • Critique: the resulting bounds are exponential in the input dimensionality and in the depth of the network
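One standard upper bound on the Lipschitz constant of a ReLU net is the product of the layers' spectral norms, since ReLU itself is 1-Lipschitz; a sketch with made-up weights (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(2, 8))]  # toy 2-layer ReLU net

def relu_net(x):
    h = np.maximum(Ws[0] @ x, 0.0)  # ReLU is 1-Lipschitz
    return Ws[1] @ h

# C_M upper bound: product of the per-layer spectral norms
C_M = float(np.prod([np.linalg.svd(W, compute_uv=False)[0] for W in Ws]))

# empirical sanity check on a random pair of inputs
x, y = rng.normal(size=4), rng.normal(size=4)
lhs = np.linalg.norm(relu_net(x) - relu_net(y))
assert lhs <= C_M * np.linalg.norm(x - y)
```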

7 of 10

Sharpness

  • Sharpness definition proposed by Keskar et al. – robustness to adversarial perturbations in parameter space
    • Does not capture the random/true labels training phenomenon
    • Sensitive to rescaling of the parameters
  • PAC-Bayes connection:
    • Makes it clear that sharpness must be balanced by another measure (e.g. a norm)
    • With a zero-mean Gaussian prior and Gaussian parameter perturbations, the bound combines expected sharpness with the norm of the weights
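The expected-sharpness term from the PAC-Bayes view can be sketched by Monte Carlo over Gaussian parameter perturbations; a linear model with squared loss stands in for the training loss here, and all names are illustrative:

```python
import numpy as np

def train_loss(w, X, y):
    # squared loss of a linear model; stand-in for the training loss L(w)
    return float(((X @ w - y) ** 2).mean())

def expected_sharpness(w, X, y, sigma=0.1, n_samples=200, seed=0):
    # E_{u ~ N(0, sigma^2 I)}[L(w + u)] - L(w): the expected-sharpness
    # term that appears in the PAC-Bayes bound with a zero-mean Gaussian
    # prior and Gaussian parameter perturbations
    rng = np.random.default_rng(seed)
    base = train_loss(w, X, y)
    perturbed = [train_loss(w + rng.normal(scale=sigma, size=w.shape), X, y)
                 for _ in range(n_samples)]
    return float(np.mean(perturbed)) - base
```

In the bound this term is traded off against a KL term that grows with the weight norm (for these Gaussians, ||w||² / 2σ²), which is why sharpness alone is not a sufficient measure.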

8 of 10

PAC-Bayes Sharpness: Empirical check

9 of 10

Empirical evaluation

  • CIFAR10 with confusion sets of varying sizes
  • MNIST with a fully connected network of varying width
  • Conclusion:
    • None of the considered generalization measures captures all the desired phenomena

10 of 10

Bounds on sharpness

  • Conditions that affect the sharpness of a network:
    • C1. Weak interactions between layers can lead to high sharpness
    • C2. Changes in the activations are small when the perturbation is small
    • C3. No hidden units with too high a magnitude, which could lead to significantly different output when the unit is active
  • Under these conditions, a bound on the sharpness can be derived