1 of 16

Data Geometry and DL - Lecture 8

Network pruning and the "Lottery ticket hypothesis"

Survey of the main approaches to pruning (paper)

2 of 16

Network pruning – biological parallels, computational motivations

  • Biological brain structure: the more neurons a brain has, the sparser its connectivity (paper).
  • Biological brain growth:
    • A human brain starts sparse → early phase of densification → massive pruning
    • Then remains at a relatively stable sparsity level.
    • Fully-grown brains change up to 40% of their synapses each day.
  • Concretely, sparse models are easier to store, and often lead to computational savings.
  • Open question: how to pick optimal architecture towards a given task?
  • Open question: how to use sparsity for explainability/interpretability?

(concrete ideas: CNNs/equivariance, curvature, Neural Architecture Search, ...)

3 of 16

Pruning works well → The Lottery Ticket Hypothesis

  • The Lottery Ticket Hypothesis (paper) says that the following actually works:
    • Train a dense network, prune the lowest-magnitude weights, reset the surviving weights to their initialization values, and retrain.
    • The resulting sparse subnetwork (the "winning ticket") can match the accuracy of the original dense network when trained in isolation.
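A minimal sketch of the corresponding iterative magnitude-pruning loop with rewinding to the initial weights; the tiny MLP, the random data, the 20% per-round pruning rate and the number of rounds are assumptions purely for illustration:

    import copy
    import torch
    import torch.nn as nn

    def make_masks(model):
        # prune weight matrices only, not biases
        return {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

    def apply_masks(model, masks):
        with torch.no_grad():
            for n, p in model.named_parameters():
                if n in masks:
                    p.mul_(masks[n])

    def prune_by_magnitude(model, masks, frac):
        # lower the mask on the smallest-magnitude surviving weights, per layer
        new_masks = {}
        for n, p in model.named_parameters():
            if n not in masks:
                continue
            m = masks[n]
            surviving = p.detach().abs()[m.bool()]
            k = int(frac * surviving.numel())
            if k > 0:
                thresh = surviving.sort().values[k - 1]
                m = m * (p.detach().abs() > thresh).float()
            new_masks[n] = m
        return new_masks

    def train(model, masks, x, y, steps=200):
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            apply_masks(model, masks)        # keep pruned weights at zero

    # toy setup
    torch.manual_seed(0)
    x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    init_state = copy.deepcopy(model.state_dict())     # initial weights, kept for rewinding
    masks = make_masks(model)

    for _ in range(3):                                  # iterative pruning rounds
        train(model, masks, x, y)
        masks = prune_by_magnitude(model, masks, frac=0.2)
        model.load_state_dict(init_state)               # rewind surviving weights to init
        apply_masks(model, masks)                       # the "ticket": mask + initial weights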

4 of 16

Pruning basics

  • Different mask criteria (paper)

  • Different protocols for the surviving weights (see the sketch below):
    • “Reinit”: back to their initialization values
    • “Reshuffle”: reshuffle the initial weights, preserving only the initial weight distribution
    • “Constant”: values set to +a, 0, -a

  • Different pruning schedules: e.g. one-shot vs. iterative (prune a fraction, retrain, repeat)
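A minimal numpy sketch of the three reset protocols above, applied to a single weight matrix under an assumed magnitude-based mask; the sizes, the 20% keep ratio and the fake "trained" weights are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    w_init = rng.normal(0.0, 0.1, size=(64, 64))               # weights at initialization
    w_final = w_init + rng.normal(0.0, 0.05, size=(64, 64))    # stand-in for trained weights

    # magnitude mask: keep the largest 20% of the trained weights
    k = int(0.2 * w_final.size)
    thresh = np.sort(np.abs(w_final).ravel())[-k]
    mask = np.abs(w_final) >= thresh

    # "Reinit": surviving weights go back to their initialization values
    w_reinit = np.where(mask, w_init, 0.0)

    # "Reshuffle": surviving weights drawn by permuting the initial values,
    # so only the initial weight distribution is preserved
    w_reshuffle = np.where(mask, rng.permutation(w_init.ravel()).reshape(w_init.shape), 0.0)

    # "Constant": surviving weights set to +a or -a (sign taken from the init), pruned ones to 0
    a = np.std(w_init)
    w_constant = np.where(mask, a * np.sign(w_init), 0.0)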

5 of 16

Time effects in pruning – More biological parallels

  • Critical importance of initial phases of development (paper)

Recall Lecture 7: first I(Y;T) grows, then I(X;T) diminishes

  • Connectivity/sensitivity quantification: Fisher Information Matrix
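A minimal sketch of one way to turn this into per-weight sensitivity scores: a diagonal empirical Fisher estimate; the toy model, the data and the threshold are assumptions for illustration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    x, y = torch.randn(128, 20), torch.randint(0, 2, (128,))
    model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))

    # diagonal empirical Fisher: average of squared per-example log-likelihood gradients
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for i in range(x.shape[0]):
        model.zero_grad()
        log_probs = F.log_softmax(model(x[i:i + 1]), dim=-1)
        F.nll_loss(log_probs, y[i:i + 1]).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2 / x.shape[0]

    # weights with tiny Fisher values barely affect the predictions -> pruning candidates
    scores = torch.cat([f.flatten() for f in fisher.values()])
    print("fraction of near-insensitive weights:", (scores < 1e-6).float().mean().item())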

6 of 16

Time effects in pruning – Early-Bird lottery tickets

  • Main narrative: connectivity fixed first, then fine-tuned
  • Consequence – “Early-Bird (EB) tickets” (paper): the winning tickets can be drawn very early in training, with aggressive low-cost training algorithms
  • Networks first learn low-complexity (lower-frequency) functional components before absorbing high-frequency features
  • The precise scaling behavior of the scheduling rate / precision / aggressiveness is not well studied or understood
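A minimal sketch of the kind of early-stopping criterion behind EB tickets: compare pruning masks across consecutive epochs and draw the ticket once the mask stabilizes; the weight snapshots, the keep fraction and the stopping threshold are assumptions for illustration:

    import numpy as np

    def magnitude_mask(weights, keep_frac=0.3):
        """Binary mask keeping the largest-magnitude fraction of weights."""
        flat = np.abs(weights).ravel()
        thresh = np.sort(flat)[-int(keep_frac * flat.size)]
        return (np.abs(weights) >= thresh).astype(int)

    def mask_distance(m1, m2):
        """Normalized Hamming distance between two binary masks."""
        return np.mean(m1 != m2)

    # pretend these are snapshots of the same weight matrix at successive training epochs
    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 64))
    snapshots = [w + 0.5 * (0.5 ** t) * rng.normal(size=w.shape) for t in range(10)]

    masks = [magnitude_mask(s) for s in snapshots]
    for t in range(1, len(masks)):
        d = mask_distance(masks[t - 1], masks[t])
        print(f"epoch {t}: mask distance {d:.3f}")
        if d < 0.01:                      # connectivity has stabilized
            print("early-bird ticket drawn at epoch", t)
            break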

7 of 16

Data-oblivious pruning (summary)

  1. Pruning based on weight masks (already seen) (see e.g. this paper)
  2. Replace an N×N weight (sub)matrix by a lower-dimensional one, e.g. a low-rank factorization (adding weights/biases as needed) (or this paper); see the sketch below
  3. Structured sparsity: respect network operations/architecture (necessary for computational gains)

  • Compressing weights/network representations (for storage)
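A minimal numpy sketch of point 2: replacing an N×N weight matrix by a rank-k factorization obtained from a truncated SVD; the sizes, the rank and the synthetic fast-decaying spectrum are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    N, k = 256, 16

    # synthetic weight matrix with a fast-decaying singular spectrum
    # (trained layers are often approximately low-rank; assumed here for illustration)
    U0, _, Vt0 = np.linalg.svd(rng.normal(size=(N, N)))
    W = (U0 * (1.0 / (1.0 + np.arange(N)))) @ Vt0

    # truncated SVD: W ~ A @ B with A of shape N x k and B of shape k x N
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]
    B = Vt[:k, :]

    # the dense map x -> W x is replaced by two thin maps x -> A (B x)
    x = rng.normal(size=(N,))
    err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
    print(f"parameters: {W.size} -> {A.size + B.size}, relative error {err:.3f}")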

8 of 16

Data-dependent pruning. Saliency measures, incomplete list (1/2)

  A. Loss dependence on each weight
  B. Hessian of the loss, assuming we have reached a critical point (paper, paper)
  C. Movement pruning: select weights that move away from zero (paper)
  D. Connection sensitivity (paper), sketched below:
    1. we want the optimal loss on the data at a fixed sparsity level
    2. so we measure the influence of each weight on that loss

  E. If a connection (a source neuron multiplied by the weight) has little variance across the training examples, it can be removed and its contribution added to the bias
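A minimal sketch of a connection-sensitivity score in the spirit of item D: the saliency of each parameter is |w · dL/dw|, computed from a single batch at initialization; the toy model, the data and the 20% keep ratio are assumptions for illustration:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x, y = torch.randn(128, 20), torch.randint(0, 2, (128,))
    model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))

    # one backward pass on a batch, at initialization
    nn.CrossEntropyLoss()(model(x), y).backward()

    # saliency of each connection: |weight * gradient|, normalized over the whole network
    scores = {n: (p.detach() * p.grad).abs() for n, p in model.named_parameters()}
    total = sum(s.sum() for s in scores.values())
    scores = {n: s / total for n, s in scores.items()}

    # keep only the top 20% most sensitive connections
    keep = int(0.2 * sum(s.numel() for s in scores.values()))
    flat = torch.cat([s.flatten() for s in scores.values()])
    thresh = flat.sort(descending=True).values[keep - 1]
    masks = {n: (s >= thresh).float() for n, s in scores.items()}
    print({n: int(m.sum()) for n, m in masks.items()})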

9 of 16

Data-dependent pruning. Saliency measures, incomplete list (2/2)

F. Hebbian learning: correlated neurons connect, uncorrelated ones can be cut out

G. Remove redundant populations (especially impactful at the “fine-tuning” stage):

    • Compare eigenvalues of (layerwise) covariance matrices to uniform distribution (paper)
    • Or use mutual information between layers

H. Use a gradient-based scheme with respect to a binary gating function (sketched below)

→ during training, differentiate through the gating to determine weight importance
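A minimal sketch of item H using a straight-through estimator: the forward pass uses hard 0/1 gates per weight, while gradients flow to real-valued gate scores. The straight-through trick, the toy regression task and the sparsity penalty are assumptions for illustration, not the exact scheme of the cited work:

    import torch
    import torch.nn as nn

    class STEGate(torch.autograd.Function):
        """Hard 0/1 gate in the forward pass, identity gradient in the backward pass."""
        @staticmethod
        def forward(ctx, score):
            return (score > 0).float()

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output

    torch.manual_seed(0)
    x, y = torch.randn(128, 20), torch.randn(128, 1)
    w = nn.Parameter(0.1 * torch.randn(20, 1))
    gate_score = nn.Parameter(0.01 * torch.ones(20, 1))   # learned importance per weight

    opt = torch.optim.SGD([w, gate_score], lr=0.1)
    for _ in range(200):
        opt.zero_grad()
        gate = STEGate.apply(gate_score)                  # binary mask, yet trainable
        loss = ((x @ (w * gate) - y) ** 2).mean() + 0.01 * gate.sum()   # sparsity pressure
        loss.backward()
        opt.step()

    print("open gates:", int((gate_score > 0).sum()), "of", gate_score.numel())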

10 of 16

Sparsity versus depth: how to measure “global” saliency?

  • Most methods are “local”, so pruning is biased towards “shallow layers”
  • Deep neurons have “global effects” that are not captured by local sensitivity measures

  • Possible solution: Add layerwise neuron budgets
  • Possible solution: Rethink the whole framework with “Sparse neural networks”
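A minimal numpy sketch of the layerwise-budget idea: a single global magnitude threshold concentrates pruning in layers with small weights, while a per-layer budget keeps the same fraction everywhere; the layer shapes, weight scales and keep fraction are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    # pretend layer weights; deeper layers here have smaller typical magnitudes
    layers = {
        "layer1": rng.normal(0.0, 1.0, size=(64, 64)),
        "layer2": rng.normal(0.0, 0.3, size=(64, 64)),
        "layer3": rng.normal(0.0, 0.1, size=(64, 64)),
    }
    keep_frac = 0.2

    # global threshold: one cutoff over all weights pooled together
    all_w = np.concatenate([np.abs(w).ravel() for w in layers.values()])
    global_thresh = np.sort(all_w)[-int(keep_frac * all_w.size)]
    global_kept = {n: float(np.mean(np.abs(w) >= global_thresh)) for n, w in layers.items()}

    # layerwise budget: each layer keeps the same fraction of its own weights
    local_kept = {}
    for n, w in layers.items():
        thresh = np.sort(np.abs(w).ravel())[-int(keep_frac * w.size)]
        local_kept[n] = float(np.mean(np.abs(w) >= thresh))

    print("global threshold keeps:", global_kept)   # layer3 is almost entirely pruned
    print("layerwise budgets keep:", local_kept)    # every layer keeps ~20%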

11 of 16

Sparse Evolutionary Training

  1. Initialize the network as an Erdos-Renyi random sparse graph
  2. Remove a fraction of connections based on weight magnitude
  3. Randomly add connections to restore the total number, and train

(paper)
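A minimal numpy sketch of one Sparse Evolutionary Training step on a single layer's connectivity mask; the density, the prune fraction and the placeholder for the training epoch are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    shape = (128, 128)
    density = 0.1          # Erdos-Renyi-style sparse initialization: ~10% of connections exist
    zeta = 0.3             # fraction of surviving connections removed per evolution step

    # 1. initialize a random sparse mask and weights
    mask = rng.random(shape) < density
    w = np.where(mask, rng.normal(0.0, 0.1, size=shape), 0.0)

    # ... in the real algorithm, the sparse layer is trained for an epoch here ...

    n_target = int(mask.sum())

    # 2. remove the zeta fraction of surviving connections with the smallest magnitude
    surviving = np.abs(w[mask])
    thresh = np.sort(surviving)[int(zeta * surviving.size)]
    mask &= np.abs(w) >= thresh

    # 3. add the same number of connections back at random empty positions, then train again
    n_new = n_target - int(mask.sum())
    empty = np.flatnonzero(~mask)
    new_idx = rng.choice(empty, size=n_new, replace=False)
    mask.flat[new_idx] = True
    w = np.where(mask, w, 0.0)                            # zero out the removed connections
    w.flat[new_idx] = rng.normal(0.0, 0.1, size=n_new)    # fresh weights for new connections

    print("connections:", int(mask.sum()), "out of", w.size)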

12 of 16

Towards sparse neural networks

  • Every dense network is an element of an exponentially large equivalence class whose members generate the same output for every input (permutations of neurons do not matter).

  • Sparsity/dimension interplay:
    • Increase dimension “n”, decrease overlap radius “theta”

  • ReLU replacement: k-winners

(paper)
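A simplified numpy sketch of a k-winners activation replacing ReLU: only the k largest pre-activations in a layer fire, enforcing a fixed representation sparsity; the layer size and k are illustrative assumptions, and the cited approach may include further ingredients beyond this sketch:

    import numpy as np

    def k_winners(pre_activations, k):
        """Keep the k largest pre-activations per example, zero out the rest (ReLU-like clamp included)."""
        out = np.zeros_like(pre_activations)
        top = np.argpartition(pre_activations, -k, axis=-1)[..., -k:]
        np.put_along_axis(out, top, np.take_along_axis(pre_activations, top, axis=-1), axis=-1)
        return np.maximum(out, 0.0)

    rng = np.random.default_rng(0)
    batch = rng.normal(size=(4, 100))      # pre-activations of a 100-unit layer
    sparse = k_winners(batch, k=10)        # at most 10% of units fire per example
    print((sparse > 0).sum(axis=-1))       # -> roughly [10 10 10 10]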

13 of 16

Extending the sparsity paradigm: Novelty-oriented search (paper)

  • Optimizing a single loss function is not well suited to some seemingly simple tasks
  • The internal state space is much larger than the physical dimension space (e.g. maze navigation)
  • Novelty metrics essentially favor finding previously unobserved behaviors (see the sketch below)
  • Different from exhaustive search (memory is limited; similar behaviors are forgotten)
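A minimal numpy sketch of a common novelty score: the mean distance of a candidate's behavior descriptor to its k nearest neighbors in an archive of previously seen behaviors; the 2-D descriptors, the archive and k are illustrative assumptions:

    import numpy as np

    def novelty(behavior, archive, k=5):
        """Mean Euclidean distance to the k nearest behaviors seen so far."""
        dists = np.linalg.norm(archive - behavior, axis=1)
        return np.sort(dists)[:k].mean()

    rng = np.random.default_rng(0)
    archive = rng.normal(size=(200, 2))          # behavior descriptors, e.g. final (x, y) in a maze
    candidates = {"similar": archive[0] + 0.01, "novel": np.array([8.0, 8.0])}

    for name, b in candidates.items():
        print(name, "novelty:", round(novelty(b, archive), 3))
    # the novel behavior scores much higher and would be kept / favored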

14 of 16

Final comments

  • Sparsification often does not achieve the same gains as architectures developed “by hand” later (excluding cases of structured sparsification).
  • Several breakthroughs are “manually defined sparsifiers”
    • Bottleneck layers (autoencoders, SqueezeNet)
    • Depthwise separable convolutions (MobileNets).
    • (Such optimized networks resist sparsification: (1), (2))
  • Sparse (deep) Double Descent (paper)
  • Real data has a sloppy spectrum (paper) = sloppiness of A; this is tied to the effective DNN dimension at a minimum
  • Inertial Manifold Theory = Dynamical Systems study (paper)
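To illustrate the “manually defined sparsifiers” point above, a minimal parameter count comparing a standard convolution with a depthwise separable one; the channel counts and kernel size are illustrative assumptions:

    import torch.nn as nn

    c_in, c_out, k = 128, 256, 3

    standard = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1)

    depthwise_separable = nn.Sequential(
        nn.Conv2d(c_in, c_in, kernel_size=k, padding=1, groups=c_in),  # depthwise: one filter per channel
        nn.Conv2d(c_in, c_out, kernel_size=1),                         # pointwise: 1x1 channel mixing
    )

    def count(m):
        return sum(p.numel() for p in m.parameters())

    print("standard conv parameters:", count(standard))                    # ~ c_in * c_out * k * k
    print("depthwise separable parameters:", count(depthwise_separable))   # ~ c_in * k * k + c_in * c_out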

15 of 16

Summary of important directions

16 of 16