1 of 16

Data Geometry and DL - Lecture 8

Network pruning and the "Lottery ticket hypothesis"

Survey of the main approaches to pruning (paper)

2 of 16

Network pruning – biological parallels, computational motivations

  • Biological brain structure: the more neurons a brain has, the sparser its connectivity (paper).
  • Biological brain growth:
    • A human brain starts sparse → early phase of densification → massive pruning
    • Then remains at a relatively stable sparsity level.
    • Fully-grown brains change up to 40% of their synapses each day.
  • Concretely, sparse models are easier to store, and often lead to computational savings.
  • Open question: how to pick optimal architecture towards a given task?
  • Open question: how to use sparsity for explainability/interpretability?

(concrete ideas: CNNs/equivariance, curvature, Neural Architecture Search, ...)

3 of 16

Pruning works well → The Lottery Ticket Hypothesis

  • The Lottery Ticket Hypothesis (paper) says that the following actually works:
    • Train a dense network, prune the lowest-magnitude weights, reset the surviving weights to their initialization values, and retrain.
    • The resulting sparse subnetwork (the "winning ticket") can match the accuracy of the original dense network when trained in isolation.
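A minimal sketch of the corresponding iterative magnitude-pruning loop with rewinding to the initial weights; the tiny MLP, the random data, the 20% per-round pruning rate and the number of rounds are assumptions purely for illustration:

    import copy
    import torch
    import torch.nn as nn

    def make_masks(model):
        # prune weight matrices only, not biases
        return {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

    def apply_masks(model, masks):
        with torch.no_grad():
            for n, p in model.named_parameters():
                if n in masks:
                    p.mul_(masks[n])

    def prune_by_magnitude(model, masks, frac):
        # lower the mask on the smallest-magnitude surviving weights, per layer
        new_masks = {}
        for n, p in model.named_parameters():
            if n not in masks:
                continue
            m = masks[n]
            surviving = p.detach().abs()[m.bool()]
            k = int(frac * surviving.numel())
            if k > 0:
                thresh = surviving.sort().values[k - 1]
                m = m * (p.detach().abs() > thresh).float()
            new_masks[n] = m
        return new_masks

    def train(model, masks, x, y, steps=200):
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            apply_masks(model, masks)        # keep pruned weights at zero

    # toy setup
    torch.manual_seed(0)
    x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    init_state = copy.deepcopy(model.state_dict())     # initial weights, kept for rewinding
    masks = make_masks(model)

    for _ in range(3):                                  # iterative pruning rounds
        train(model, masks, x, y)
        masks = prune_by_magnitude(model, masks, frac=0.2)
        model.load_state_dict(init_state)               # rewind surviving weights to init
        apply_masks(model, masks)                       # the "ticket": mask + initial weights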

4 of 16

Pruning basics

  • Different mask criteria (paper)

  • Different protocols for the surviving weights (see the sketch below):
    • “Reinit”: back to their initialization values
    • “Reshuffle”: reshuffle the initial weights, preserving only the initial weight distribution
    • “Constant”: values set to +a, 0, -a

  • Different pruning schedules: e.g. one-shot vs. iterative (prune a fraction, retrain, repeat)
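A minimal numpy sketch of the three reset protocols above, applied to a single weight matrix under an assumed magnitude-based mask; the sizes, the 20% keep ratio and the fake "trained" weights are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    w_init = rng.normal(0.0, 0.1, size=(64, 64))               # weights at initialization
    w_final = w_init + rng.normal(0.0, 0.05, size=(64, 64))    # stand-in for trained weights

    # magnitude mask: keep the largest 20% of the trained weights
    k = int(0.2 * w_final.size)
    thresh = np.sort(np.abs(w_final).ravel())[-k]
    mask = np.abs(w_final) >= thresh

    # "Reinit": surviving weights go back to their initialization values
    w_reinit = np.where(mask, w_init, 0.0)

    # "Reshuffle": surviving weights drawn by permuting the initial values,
    # so only the initial weight distribution is preserved
    w_reshuffle = np.where(mask, rng.permutation(w_init.ravel()).reshape(w_init.shape), 0.0)

    # "Constant": surviving weights set to +a or -a (sign taken from the init), pruned ones to 0
    a = np.std(w_init)
    w_constant = np.where(mask, a * np.sign(w_init), 0.0)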

5 of 16

Time effects in pruning – More biological parallels

  • Critical importance of initial phases of development (paper)

Recall Lecture 7: first I(Y;T) grows, then I(X;T) diminishes

  • Connectivity/sensitivity quantification: Fisher Information Matrix
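A minimal sketch of one way to turn this into per-weight sensitivity scores: a diagonal empirical Fisher estimate; the toy model, the data and the threshold are assumptions for illustration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    x, y = torch.randn(128, 20), torch.randint(0, 2, (128,))
    model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))

    # diagonal empirical Fisher: average of squared per-example log-likelihood gradients
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for i in range(x.shape[0]):
        model.zero_grad()
        log_probs = F.log_softmax(model(x[i:i + 1]), dim=-1)
        F.nll_loss(log_probs, y[i:i + 1]).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2 / x.shape[0]

    # weights with tiny Fisher values barely affect the predictions -> pruning candidates
    scores = torch.cat([f.flatten() for f in fisher.values()])
    print("fraction of near-insensitive weights:", (scores < 1e-6).float().mean().item())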

6 of 16

Time effects in pruning – Early-Bird lottery tickets

  • Main narrative: connectivity fixed first, then fine-tuned
  • Consequence – “Early-Bird (EB) tickets” (paper): the winning tickets can be drawn very early in training, with aggressive low-cost training algorithms
  • Networks first learn low-complexity (lower-frequency) functional components before absorbing high-frequency features
  • The precise scaling behavior of the scheduling rate / precision / aggressiveness is not well studied or understood
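A minimal sketch of the kind of early-stopping criterion behind EB tickets: compare pruning masks across consecutive epochs and draw the ticket once the mask stabilizes; the weight snapshots, the keep fraction and the stopping threshold are assumptions for illustration:

    import numpy as np

    def magnitude_mask(weights, keep_frac=0.3):
        """Binary mask keeping the largest-magnitude fraction of weights."""
        flat = np.abs(weights).ravel()
        thresh = np.sort(flat)[-int(keep_frac * flat.size)]
        return (np.abs(weights) >= thresh).astype(int)

    def mask_distance(m1, m2):
        """Normalized Hamming distance between two binary masks."""
        return np.mean(m1 != m2)

    # pretend these are snapshots of the same weight matrix at successive training epochs
    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 64))
    snapshots = [w + 0.5 * (0.5 ** t) * rng.normal(size=w.shape) for t in range(10)]

    masks = [magnitude_mask(s) for s in snapshots]
    for t in range(1, len(masks)):
        d = mask_distance(masks[t - 1], masks[t])
        print(f"epoch {t}: mask distance {d:.3f}")
        if d < 0.01:                      # connectivity has stabilized
            print("early-bird ticket drawn at epoch", t)
            break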

7 of 16

Data-oblivious pruning (summary)

  1. Pruning based on weight masks (already seen) (see e.g. this paper)
  2. Replace an N×N weight (sub)matrix by a lower-dimensional one, e.g. a low-rank factorization (adding weights/biases as needed) (or this paper); see the sketch below
  3. Structured sparsity: respect network operations/architecture (necessary for computational gains)

  • Compressing weights/network representations (for storage)
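A minimal numpy sketch of point 2: replacing an N×N weight matrix by a rank-k factorization obtained from a truncated SVD; the sizes, the rank and the synthetic fast-decaying spectrum are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    N, k = 256, 16

    # synthetic weight matrix with a fast-decaying singular spectrum
    # (trained layers are often approximately low-rank; assumed here for illustration)
    U0, _, Vt0 = np.linalg.svd(rng.normal(size=(N, N)))
    W = (U0 * (1.0 / (1.0 + np.arange(N)))) @ Vt0

    # truncated SVD: W ~ A @ B with A of shape N x k and B of shape k x N
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]
    B = Vt[:k, :]

    # the dense map x -> W x is replaced by two thin maps x -> A (B x)
    x = rng.normal(size=(N,))
    err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
    print(f"parameters: {W.size} -> {A.size + B.size}, relative error {err:.3f}")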

8 of 16

Data-dependent pruning. Saliency measures, incomplete list (1/2)

  A. Loss dependence on each weight
  B. Hessian of the loss, assuming we have reached a critical point (paper, paper)
  C. Movement pruning: select weights that move away from zero (paper)
  D. Connection sensitivity (paper), sketched below:
    1. we want the optimal loss on the data at a fixed sparsity level
    2. so we measure the influence of each weight on that loss

  E. If a connection (a source neuron multiplied by the weight) has little variance across the training examples, it can be removed and its contribution added to the bias
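A minimal sketch of a connection-sensitivity score in the spirit of item D: the saliency of each parameter is |w · dL/dw|, computed from a single batch at initialization; the toy model, the data and the 20% keep ratio are assumptions for illustration:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x, y = torch.randn(128, 20), torch.randint(0, 2, (128,))
    model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))

    # one backward pass on a batch, at initialization
    nn.CrossEntropyLoss()(model(x), y).backward()

    # saliency of each connection: |weight * gradient|, normalized over the whole network
    scores = {n: (p.detach() * p.grad).abs() for n, p in model.named_parameters()}
    total = sum(s.sum() for s in scores.values())
    scores = {n: s / total for n, s in scores.items()}

    # keep only the top 20% most sensitive connections
    keep = int(0.2 * sum(s.numel() for s in scores.values()))
    flat = torch.cat([s.flatten() for s in scores.values()])
    thresh = flat.sort(descending=True).values[keep - 1]
    masks = {n: (s >= thresh).float() for n, s in scores.items()}
    print({n: int(m.sum()) for n, m in masks.items()})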

9 of 16

Data-dependent pruning. Saliency measures, incomplete list (2/2)

F. Hebbian learning: correlated neurons connect, uncorrelated ones can be cut out

G. Remove redundant populations (especially impactful at the “fine-tuning” stage):

    • Compare eigenvalues of (layerwise) covariance matrices to uniform distribution (paper)
    • Or use mutual information between layers

H. Use a gradient-based scheme with respect to a binary gating function (sketched below)

→ during training, differentiate through the gating to determine weight importance
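A minimal sketch of item H using a straight-through estimator: the forward pass uses hard 0/1 gates per weight, while gradients flow to real-valued gate scores. The straight-through trick, the toy regression task and the sparsity penalty are assumptions for illustration, not the exact scheme of the cited work:

    import torch
    import torch.nn as nn

    class STEGate(torch.autograd.Function):
        """Hard 0/1 gate in the forward pass, identity gradient in the backward pass."""
        @staticmethod
        def forward(ctx, score):
            return (score > 0).float()

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output

    torch.manual_seed(0)
    x, y = torch.randn(128, 20), torch.randn(128, 1)
    w = nn.Parameter(0.1 * torch.randn(20, 1))
    gate_score = nn.Parameter(0.01 * torch.ones(20, 1))   # learned importance per weight

    opt = torch.optim.SGD([w, gate_score], lr=0.1)
    for _ in range(200):
        opt.zero_grad()
        gate = STEGate.apply(gate_score)                  # binary mask, yet trainable
        loss = ((x @ (w * gate) - y) ** 2).mean() + 0.01 * gate.sum()   # sparsity pressure
        loss.backward()
        opt.step()

    print("open gates:", int((gate_score > 0).sum()), "of", gate_score.numel())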

10 of 16

Sparsity versus depth: how to measure “global” saliency?

  • Most methods are “local”, so pruning is biased towards “shallow layers”
  • Deep neurons have “global effects” that are not captured by local sensitivity measures

  • Possible solution: Add layerwise neuron budgets
  • Possible solution: Rethink the whole framework with “Sparse neural networks”
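A minimal numpy sketch of the layerwise-budget idea: a single global magnitude threshold concentrates pruning in layers with small weights, while a per-layer budget keeps the same fraction everywhere; the layer shapes, weight scales and keep fraction are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    # pretend layer weights; deeper layers here have smaller typical magnitudes
    layers = {
        "layer1": rng.normal(0.0, 1.0, size=(64, 64)),
        "layer2": rng.normal(0.0, 0.3, size=(64, 64)),
        "layer3": rng.normal(0.0, 0.1, size=(64, 64)),
    }
    keep_frac = 0.2

    # global threshold: one cutoff over all weights pooled together
    all_w = np.concatenate([np.abs(w).ravel() for w in layers.values()])
    global_thresh = np.sort(all_w)[-int(keep_frac * all_w.size)]
    global_kept = {n: float(np.mean(np.abs(w) >= global_thresh)) for n, w in layers.items()}

    # layerwise budget: each layer keeps the same fraction of its own weights
    local_kept = {}
    for n, w in layers.items():
        thresh = np.sort(np.abs(w).ravel())[-int(keep_frac * w.size)]
        local_kept[n] = float(np.mean(np.abs(w) >= thresh))

    print("global threshold keeps:", global_kept)   # layer3 is almost entirely pruned
    print("layerwise budgets keep:", local_kept)    # every layer keeps ~20%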

11 of 16

Sparse Evolutionary Training

  1. Initialize the network as an Erdos-Renyi random sparse graph
  2. Remove a fraction of connections based on weight magnitude
  3. Randomly add connections to restore the total number, and train

(paper)
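A minimal numpy sketch of one Sparse Evolutionary Training step on a single layer's connectivity mask; the density, the prune fraction and the placeholder for the training epoch are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    shape = (128, 128)
    density = 0.1          # Erdos-Renyi-style sparse initialization: ~10% of connections exist
    zeta = 0.3             # fraction of surviving connections removed per evolution step

    # 1. initialize a random sparse mask and weights
    mask = rng.random(shape) < density
    w = np.where(mask, rng.normal(0.0, 0.1, size=shape), 0.0)

    # ... in the real algorithm, the sparse layer is trained for an epoch here ...

    n_target = int(mask.sum())

    # 2. remove the zeta fraction of surviving connections with the smallest magnitude
    surviving = np.abs(w[mask])
    thresh = np.sort(surviving)[int(zeta * surviving.size)]
    mask &= np.abs(w) >= thresh

    # 3. add the same number of connections back at random empty positions, then train again
    n_new = n_target - int(mask.sum())
    empty = np.flatnonzero(~mask)
    new_idx = rng.choice(empty, size=n_new, replace=False)
    mask.flat[new_idx] = True
    w = np.where(mask, w, 0.0)                            # zero out the removed connections
    w.flat[new_idx] = rng.normal(0.0, 0.1, size=n_new)    # fresh weights for new connections

    print("connections:", int(mask.sum()), "out of", w.size)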

12 of 16

Towards sparse neural networks

  • Every dense network is an element of an exponentially large equivalence class whose members generate the same output for every input (permutations of neurons do not matter).

  • Sparsity/dimension interplay:
    • Increase dimension “n”, decrease overlap radius “theta”

  • ReLU replacement: k-winners

(paper)
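A simplified numpy sketch of a k-winners activation replacing ReLU: only the k largest pre-activations in a layer fire, enforcing a fixed representation sparsity; the layer size and k are illustrative assumptions, and the cited approach may include further ingredients beyond this sketch:

    import numpy as np

    def k_winners(pre_activations, k):
        """Keep the k largest pre-activations per example, zero out the rest (ReLU-like clamp included)."""
        out = np.zeros_like(pre_activations)
        top = np.argpartition(pre_activations, -k, axis=-1)[..., -k:]
        np.put_along_axis(out, top, np.take_along_axis(pre_activations, top, axis=-1), axis=-1)
        return np.maximum(out, 0.0)

    rng = np.random.default_rng(0)
    batch = rng.normal(size=(4, 100))      # pre-activations of a 100-unit layer
    sparse = k_winners(batch, k=10)        # at most 10% of units fire per example
    print((sparse > 0).sum(axis=-1))       # -> roughly [10 10 10 10]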

13 of 16

Extending the sparsity paradigm: Novelty-oriented search (paper)

  • Optimizing a single loss function is not well suited to some seemingly simple tasks
  • The internal state space is much larger than the physical dimension space (e.g. maze navigation)
  • Novelty metrics essentially favor finding previously unobserved behaviors (see the sketch below)
  • Different from exhaustive search (memory is limited; similar behaviors are forgotten)
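A minimal numpy sketch of a common novelty score: the mean distance of a candidate's behavior descriptor to its k nearest neighbors in an archive of previously seen behaviors; the 2-D descriptors, the archive and k are illustrative assumptions:

    import numpy as np

    def novelty(behavior, archive, k=5):
        """Mean Euclidean distance to the k nearest behaviors seen so far."""
        dists = np.linalg.norm(archive - behavior, axis=1)
        return np.sort(dists)[:k].mean()

    rng = np.random.default_rng(0)
    archive = rng.normal(size=(200, 2))          # behavior descriptors, e.g. final (x, y) in a maze
    candidates = {"similar": archive[0] + 0.01, "novel": np.array([8.0, 8.0])}

    for name, b in candidates.items():
        print(name, "novelty:", round(novelty(b, archive), 3))
    # the novel behavior scores much higher and would be kept / favored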

14 of 16

Final comments

  • Sparsification often does not achieve the same gains as architectures developed “by hand” later (excluding cases of structured sparsification).
  • Several breakthroughs are “manually defined sparsifiers”
    • Bottleneck layers (autoencoders, SqueezeNet)
    • Depthwise separable convolutions (MobileNets).
    • (Such optimized networks resist sparsification: (1), (2))
  • Sparse (deep) Double Descent (paper)
  • Real data has a sloppy spectrum (paper) = sloppiness of A; this is tied to the effective DNN dimension at a minimum
  • Inertial Manifold Theory = Dynamical Systems study (paper)
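To illustrate the “manually defined sparsifiers” point above, a minimal parameter count comparing a standard convolution with a depthwise separable one; the channel counts and kernel size are illustrative assumptions:

    import torch.nn as nn

    c_in, c_out, k = 128, 256, 3

    standard = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1)

    depthwise_separable = nn.Sequential(
        nn.Conv2d(c_in, c_in, kernel_size=k, padding=1, groups=c_in),  # depthwise: one filter per channel
        nn.Conv2d(c_in, c_out, kernel_size=1),                         # pointwise: 1x1 channel mixing
    )

    def count(m):
        return sum(p.numel() for p in m.parameters())

    print("standard conv parameters:", count(standard))                    # ~ c_in * c_out * k * k
    print("depthwise separable parameters:", count(depthwise_separable))   # ~ c_in * k * k + c_in * c_out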

15 of 16

Summary of important directions

16 of 16