Dynamic Sparse Training
Interdisciplinary Research Achievement
Dynamic Sparse Training (DST) methods achieve state-of-the-art results in sparse neural network training, matching the generalization of dense models while enabling sparse training and inference. Although the resulting models are highly sparse and theoretically cheaper to train, achieving speedups with unstructured sparsity on real-world hardware is challenging. This work proposes a sparse-to-sparse DST method that learns a variant of structured N:M sparsity by imposing a constant fan-in constraint.
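As a rough illustration of the constant fan-in idea, the sketch below (not the authors' implementation; the function names and sparsity level are made up for illustration) contrasts unstructured, layer-wise magnitude pruning with a mask that keeps the same number of weights per output neuron:

```python
# Illustrative sketch only: contrasts unstructured layer-wise magnitude pruning
# with a constant fan-in mask that keeps the same number of weights per neuron.
import torch

def unstructured_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the top (1 - sparsity) fraction of weights by magnitude, layer-wide."""
    n_keep = int(weight.numel() * (1.0 - sparsity))
    threshold = weight.abs().flatten().topk(n_keep).values.min()
    return (weight.abs() >= threshold).float()

def constant_fan_in_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the same number of weights per output neuron (row) by magnitude."""
    fan_in = weight.shape[1]
    k = max(1, int(round(fan_in * (1.0 - sparsity))))   # weights kept per neuron
    kept = weight.abs().topk(k, dim=1).indices          # per-row top-k columns
    return torch.zeros_like(weight).scatter_(1, kept, 1.0)

if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(8, 16)                              # 8 neurons, fan-in 16
    print(constant_fan_in_mask(w, 0.75).sum(dim=1))     # exactly 4 per neuron
    print(unstructured_mask(w, 0.75).sum(dim=1))        # varies per neuron
```

Because every neuron keeps exactly the same number of connections, the kept weights can be packed into a small dense block, which is what makes the condensed representation described below possible.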
Impact on Artificial Intelligence
Dynamic structured-sparse training is achieved without harming model generalization. Through both theoretical analysis and empirical results, the work demonstrates: state-of-the-art sparse-to-sparse structured DST performance on a variety of network architectures; a condensed representation with a reduced parameter and memory footprint; and reduced inference time compared to dense models with a naive PyTorch CPU implementation of the condensed representation.
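To make the condensed representation concrete, here is a minimal sketch of what such a constant fan-in layer might look like; this is an assumption for illustration, not the paper's implementation. It stores only the k kept weights per output neuron together with the indices of the inputs they connect to, and computes outputs by gathering just those inputs:

```python
# Minimal sketch of a condensed constant fan-in linear layer (illustrative, not
# the paper's implementation): only k weights per output neuron are stored,
# together with the indices of the inputs they connect to.
import torch

class CondensedLinear(torch.nn.Module):
    def __init__(self, weight: torch.Tensor, mask: torch.Tensor):
        super().__init__()
        out_features, _ = weight.shape
        k = int(mask[0].sum().item())                            # constant fan-in k
        cols = mask.nonzero(as_tuple=False)[:, 1].view(out_features, k)
        self.register_buffer("indices", cols)                    # (out, k) input indices
        self.register_buffer("values", weight.gather(1, cols))   # (out, k) kept weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gathered = x[:, self.indices]                            # (batch, out, k)
        return (gathered * self.values).sum(dim=-1)              # (batch, out)

if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(8, 16)
    kept = w.abs().topk(4, dim=1).indices                        # constant fan-in of 4
    mask = torch.zeros_like(w).scatter_(1, kept, 1.0)
    layer = CondensedLinear(w, mask)
    x = torch.randn(2, 16)
    assert torch.allclose(layer(x), x @ (w * mask).t(), atol=1e-5)
    print("stored weights:", layer.values.numel(), "vs dense:", w.numel())   # 32 vs 128
```

The stored parameter count scales with the kept fan-in rather than the full layer width, which is the source of the reduced memory footprint noted above.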
Impact on Fundamental Interactions
This project stems from prior work on sparse neural network training, in particular building upon RigL, the current state-of-the-art dynamic sparse training method. The goal is to constrain the sparsity pattern, i.e., make it more structured, without losing model performance. A natural constraint is to impose an equal number of incoming connections per neuron (constant fan-in). Through a theoretical analysis, the group computed the output-norm variance of a neural network layer at initialization (a quantity observed to be indicative of training stability) under a variety of sparsity constraints and found that the constant fan-in constraint does not increase this variance, indicating that it would not destabilize training. This constraint was then implemented on top of RigL; the group conducted extensive experiments, analyzed the model empirically by inspecting its weights and activations, and adjusted the dynamic sparse training method where necessary to improve model performance.
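As a rough illustration of how such a constraint can be maintained during dynamic sparse training, the sketch below shows one RigL-style prune-and-grow step applied per output neuron. This is a simplified assumption of the mechanism, not the exact SRigL update; the function name, `n_update`, and the zero-initialization of grown weights follow the general RigL recipe rather than the paper's specifics.

```python
# Simplified, illustrative prune-and-grow step under a constant fan-in
# constraint (a RigL-style update per output neuron; not the exact SRigL
# algorithm). Each row drops its n_update smallest-magnitude active weights and
# grows the n_update inactive connections with the largest dense-gradient
# magnitude, so the per-neuron fan-in never changes.
import torch

def prune_and_grow(weight, mask, dense_grad, n_update: int):
    """weight, mask, dense_grad: (out_features, in_features) tensors."""
    new_mask = mask.clone()
    # Prune: drop the n_update smallest-magnitude active weights in each row.
    active_mag = weight.abs().masked_fill(mask == 0, float("inf"))
    drop = active_mag.topk(n_update, dim=1, largest=False).indices
    new_mask.scatter_(1, drop, 0.0)
    # Grow: activate the n_update inactive connections with the largest |gradient|.
    inactive_grad = dense_grad.abs().masked_fill(mask == 1, float("-inf"))
    grow = inactive_grad.topk(n_update, dim=1).indices
    new_mask.scatter_(1, grow, 1.0)
    # Dropped weights are zeroed; newly grown connections start at zero.
    return weight * mask * new_mask, new_mask

if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(8, 16)
    kept = w.abs().topk(4, dim=1).indices                 # constant fan-in of 4
    mask = torch.zeros_like(w).scatter_(1, kept, 1.0)
    grad = torch.randn_like(w)                            # stand-in dense gradient
    w, mask = prune_and_grow(w * mask, mask, grad, n_update=1)
    print(mask.sum(dim=1))                                # still 4 per neuron
```

Because pruning and growing are both performed row-wise with the same budget, every update leaves the per-neuron fan-in unchanged, so the condensed layout remains valid throughout training.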
Outlook
The goal of the project is to achieve optimal dynamic structured-sparse training. We hope this work will motivate future research into additional fine-grained structured sparsity schemes within the DST framework.
Mike Lasby [2], Anna Golubeva [1], Utku Evci [3], Mihai Nica [4,5], Yani A. Ioannou [2]
Figure: Constant fan-in pruning keeps the most salient weights per neuron, whereas unstructured pruning keeps the most salient weights per layer. A constant fan-in weight matrix has the same number of non-zero elements per column (here, 2), allowing a condensed representation. While pruning may remove salient weights and thereby affect generalization, SRigL learns structure and weights concurrently.
The NSF Institute for Artificial Intelligence and Fundamental Interactions (IAIFI) is supported by the National Science Foundation under Cooperative Agreement PHY-2019786.
1. IAIFI Fellow / MIT
2. University of Calgary
3. Google Research
4. University of Guelph
5. Vector Institute for AI