Page 1 of 20

Let's Make Neural Networks Efficient

Sathyaprakash Narayanan

ECE-x83: Special Topics in Engineering [Brain Inspired Machine Learning]

Jason Eshraghian

Page 2 of 20

So what’s the catch?

Scaling these models up is a huge pain:

- binarized activations are hard to deal with; you are drastically limiting the ability of individual neurons to represent information

For equivalent tasks, non-spiking networks are easier to train and often converge to a better loss

- recurrent, time-varying neurons are expensive to train with BPTT (memory grows linearly with the number of time steps)

- Sparsity only makes sense if your hardware knows to skip “0” operations. GPUs, by default, do not know to do this.

Solutions?

- Computation is cheap, memory access is expensive. Maybe we focus on sparsity instead of binarization. I.e., silicon’s already optimized for computation. Use it.

- Real-time learning techniques?

- Dynamical weights

Major takeaway: don’t trust me. Go build cooler shit.

Stolen from Class slides: Week 2 :P

Von Neumann Bottleneck “Memory is Expensive”

More Data Movement → More Memory References → More Energy

How should we make deep learning more efficient?

Page 3 of 20

Sparsity is the Key!!!!!!!!

Page 4 of 20

Von Neumann Bottleneck “Memory is Expensive”

More Data Movement → More Memory References → More Energy


Rough Energy Cost for Various Operations in 45nm, 0.9V:

Operation              Energy [pJ]
32-bit int ADD         0.1
32-bit float ADD       0.9
32-bit Register File   1
32-bit int MULT        3.1
32-bit float MULT      3.7
32-bit SRAM Cache      5
32-bit DRAM Memory     640

[Figure: bar chart of relative energy cost on a log scale (1 to 10000); a 32-bit DRAM access costs roughly 200x a 32-bit MULT.]

This image is in the public domain

Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]

What’s the first step towards Sparsity?

How should we make deep learning more efficient?

Page 5 of 20

Pruning Happens in the Human Brain


[Figure: number of synapses per neuron vs. age: roughly 2,500 per neuron in newborns, peaking around 15,000 at 2-4 years old, then pruned back during adolescence to about 7,000 in adults.]

Slide Inspiration: Alila Medical Media

Data Sources: [1], [2]

Do We Have Brain to Spare? [Drachman DA, Neurology 2004]

Peter Huttenlocher (1931–2013) [Walsh, C. A., Nature 2013]

Page 6 of 20

Neural Network Pruning


Make the neural network smaller by removing synapses and neurons (see the pruning sketch below)

Optimal Brain Damage [LeCun et al., NeurIPS 1989]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]

[Figure: Dense Model → Sparse Model (before vs. after pruning)]
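As a concrete illustration of the dense-to-sparse step above, here is a minimal magnitude-pruning sketch in PyTorch, in the spirit of Han et al. (2015). The toy MLP and the 90% sparsity level are illustrative assumptions, not values from the slides.

```python
# Minimal magnitude-pruning sketch using PyTorch's built-in pruning utilities.
# The toy MLP and the 90% sparsity level are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))

# Remove the 90% smallest-magnitude weights in every Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)
        prune.remove(module, "weight")  # bake the mask into the weight tensor

total = sum(m.weight.numel() for m in model.modules() if isinstance(m, nn.Linear))
zeros = sum(int((m.weight == 0).sum()) for m in model.modules() if isinstance(m, nn.Linear))
print(f"Global weight sparsity: {zeros / total:.1%}")
```

In practice, Han et al. interleave pruning with fine-tuning (train, prune, retrain) so accuracy recovers after each pruning round.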

Page 7 of 20

Let's make SNNs (and NNs in general) actually sparse!!

Wrote a package to fix this :P

SNN Notebook: https://colab.research.google.com/drive/1nNil4aj0GJxGfFQka8By9dTESVjLlied?usp=sharing

Tutorial: see the DL comparison in sconce/tutorials.
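For the SNN case, a minimal sketch of what "making SNNs sparse" can look like: an snnTorch LIF network whose synaptic (Linear) weights are magnitude-pruned exactly as above. Layer sizes, beta, the number of time steps, and the sparsity level are illustrative assumptions; see the linked notebook and sconce/tutorials for the actual walkthrough.

```python
# Sketch: magnitude pruning applied to a spiking network built with snnTorch.
# Layer sizes, beta, num_steps, and the 90% sparsity level are assumptions.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
import snntorch as snn

class SpikingMLP(nn.Module):
    def __init__(self, beta=0.9):
        super().__init__()
        self.fc1 = nn.Linear(784, 300)
        self.lif1 = snn.Leaky(beta=beta)
        self.fc2 = nn.Linear(300, 10)
        self.lif2 = snn.Leaky(beta=beta)

    def forward(self, x, num_steps=25):
        mem1, mem2 = self.lif1.init_leaky(), self.lif2.init_leaky()
        spk_out = []
        for _ in range(num_steps):               # static input repeated over time
            spk1, mem1 = self.lif1(self.fc1(x), mem1)
            spk2, mem2 = self.lif2(self.fc2(spk1), mem2)
            spk_out.append(spk2)
        return torch.stack(spk_out)

net = SpikingMLP()
# Only the synaptic weights (Linear layers) carry parameters worth pruning;
# the LIF neurons themselves are stateful but weight-free.
for m in net.modules():
    if isinstance(m, nn.Linear):
        prune.l1_unstructured(m, name="weight", amount=0.9)
```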

Page 8 of 20

Is that all we could do?

Quantization (see the sketch after this list)

Knowledge-Distillation

Sparsity Aware Engine/Model Computations

CUDA Optimizations

Hardware Aware Optimizations

Neural-Architecture Search
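As a taste of the quantization bullet above, a minimal post-training sketch using PyTorch's built-in dynamic quantization; the two-layer model here is a placeholder, and the other techniques in the list need considerably more machinery.

```python
# Minimal post-training quantization sketch: dynamic INT8 quantization of the
# Linear layers with stock PyTorch. The MLP is a placeholder model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10)).eval()

quantized = torch.quantization.quantize_dynamic(
    model,                 # fp32 model
    {nn.Linear},           # which module types to quantize
    dtype=torch.qint8,     # int8 weights; activations are quantized on the fly
)

x = torch.randn(1, 784)
print(quantized(x).shape)  # same interface as the fp32 model, ~4x smaller Linear weights
```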

Page 9 of 20

Sconce (Note: any PyTorch model can be used)

Go and smash the star on the GitHub repo (satabios/sconce)… I can't get you extra credit, but I promise you'll be in my stargazers list 🥹

Page 10 of 20

Didn’t want to bore you further

Page 11 of 20

References


  • Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020]
  • Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]
  • Optimal Brain Damage [LeCun et al., NeurIPS 1989]
  • Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
  • Efficient Methods and Hardware for Deep Learning [Han S., Stanford University]
  • Peter Huttenlocher (1931–2013) [Walsh, C. A., Nature 2013]
  • Exploring the Granularity of Sparsity in Convolutional Neural Networks [Mao et al., CVPR Workshops 2017]
  • Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
  • AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
  • Learning Structured Sparsity in Deep Neural Networks [Wen et al., NeurIPS 2016]
  • Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
  • Pruning Convolutional Filters with First Order Taylor Series Ranking [Wang M.]
  • Importance Estimation for Neural Network Pruning [Molchanov et al., CVPR 2019]
  • Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures [Hu et al., arXiv 2017]
  • Pruning Convolutional Neural Networks for Resource Efficient Inference [Molchanov et al., ICLR 2017]
  • Channel Pruning for Accelerating Very Deep Neural Networks [He et al., ICCV 2017]
  • ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression [Luo et al., ICCV 2017]
  • SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot [Frantar & Alistarh, arXiv 2023]

Page 12 of 20

sconce v0.99

Page 13 of 20

Auto-Sensitivity Scan

  • Reduce CPM (Compute, Power, Memory) for a given pruning/quantization/etc. budget
  • while causing the least/negligible amount of accuracy degradation (see the sensitivity-scan sketch below)

* - Future features (I too have only 24 hrs in a day :P)
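A sketch of what an auto-sensitivity scan can look like, per the bullets above: prune each layer in isolation at increasing ratios and record the accuracy drop relative to the unpruned baseline. `evaluate` and `val_loader` are assumed helpers; this illustrates the idea rather than sconce's internal implementation.

```python
# Per-layer sensitivity scan sketch: prune one layer at a time at increasing
# sparsity ratios and record the degradation. `evaluate(model, loader)` and
# `val_loader` are assumed helpers, not part of sconce's API.
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

def sensitivity_scan(model, val_loader, evaluate, ratios=(0.1, 0.3, 0.5, 0.7, 0.9)):
    baseline = evaluate(model, val_loader)
    results = {}
    for name, module in model.named_modules():
        if not isinstance(module, (nn.Conv2d, nn.Linear)):
            continue
        results[name] = []
        for ratio in ratios:
            probe = copy.deepcopy(model)                   # prune a throwaway copy
            target = dict(probe.named_modules())[name]
            prune.l1_unstructured(target, "weight", amount=ratio)
            acc = evaluate(probe, val_loader)
            results[name].append((ratio, baseline - acc))  # degradation vs. baseline
    # Downstream: pick the largest ratio per layer that stays under a degradation budget.
    return results
```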

Page 14 of 20

sconce v1.1

Page 15 of 20

Altruism is all you need!!! Don't just be self-attentive

Page 16 of 20

Make Kernels Aware of the Future Kernel Spaces

Code: https://github.com/satabios/sconce/blob/main/tutorials/Pruning.ipynb

Citations:

  • EIE: https://arxiv.org/abs/1602.01528
  • Multi-scale channel importance sorting and spatial attention mechanism for retinal vessels segmentation
  • EACP: An effective automatic channel pruning for neural networks
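One way to read "make kernels aware of the future kernel spaces" is to score a layer's output channels by how strongly the next layer's kernels consume them, in the spirit of ThiNet/EACP. The scoring rule below is an illustrative assumption, not sconce's actual criterion.

```python
# Sketch: score a conv layer's output channels by their own magnitude times how
# strongly the *next* layer's kernels use them. Illustrative only.
import torch
import torch.nn as nn

def successive_channel_scores(conv_k: nn.Conv2d, conv_next: nn.Conv2d) -> torch.Tensor:
    # conv_k.weight:    [C_out, C_in,  kH, kW]
    # conv_next.weight: [C_next, C_out, kH, kW] -- its input channels are conv_k's outputs
    own = conv_k.weight.detach().abs().sum(dim=(1, 2, 3))             # [C_out]
    downstream = conv_next.weight.detach().abs().sum(dim=(0, 2, 3))   # [C_out]
    return own * downstream  # channels that are large *and* consumed downstream rank highest

conv1, conv2 = nn.Conv2d(16, 32, 3), nn.Conv2d(32, 64, 3)
scores = successive_channel_scores(conv1, conv2)
keep = scores.topk(k=16).indices   # e.g. keep the top half of conv1's output channels
```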

Page 17 of 20

Channel-Based Activation Aware Pruning

But Channel-Wise!!

  • Register hooks to fetch the output (O/P) feature maps
  • Run the model through a calibration dataset
  • Prune channels (kernel weights) based on their activations (see the sketch below)
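A sketch of the three steps above: register forward hooks on the convolutions of interest, run a small calibration set, and accumulate a per-channel activation score that can then drive channel pruning. `calib_loader` is an assumed calibration DataLoader; this illustrates the recipe, not sconce's exact code.

```python
# Hook the output feature maps, run a calibration pass, and rank channels by
# mean activation magnitude. `calib_loader` is an assumed DataLoader.
import torch
import torch.nn as nn

@torch.no_grad()
def channel_activation_scores(model, conv_names, calib_loader, device="cpu"):
    scores, handles = {name: 0.0 for name in conv_names}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # output: [N, C, H, W] -> accumulate per-channel mean |activation|
            scores[name] = scores[name] + output.abs().mean(dim=(0, 2, 3))
        return hook

    modules = dict(model.named_modules())
    for name in conv_names:
        handles.append(modules[name].register_forward_hook(make_hook(name)))

    model.eval().to(device)
    for x, _ in calib_loader:           # calibration pass, labels unused
        model(x.to(device))
    for h in handles:
        h.remove()
    return scores                        # low-scoring channels are pruning candidates
```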

Page 18 of 20

Activation-Aware QAT

But Channel-Wise!!

  • Apply channel-wise scaling on the feature maps based on activation saliency (see the sketch below)
  • Then run quantization-aware training (QAT)
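A heavily simplified, AWQ-flavoured sketch of the scaling step: derive a per-input-channel scale from activation saliency, fold it into the layer's weights, and hand the inverse scale back to the producer of the activations before running standard QAT. The exponent `alpha` and the random saliency statistic are assumptions for illustration, not the full AWQ algorithm.

```python
# AWQ-flavoured sketch: fold activation-saliency scales into a Linear layer's
# weights; the inverse scale would be folded into the previous layer. Simplified.
import torch
import torch.nn as nn

@torch.no_grad()
def apply_activation_aware_scaling(linear: nn.Linear, act_saliency: torch.Tensor, alpha=0.5):
    # act_saliency: mean |activation| per input channel, from a calibration pass
    s = act_saliency.clamp(min=1e-5) ** alpha   # smooth the scale with exponent alpha
    linear.weight.mul_(s)                       # W' = W * s  (per input channel)
    return 1.0 / s                              # fold this into the activation producer

lin = nn.Linear(256, 512)
saliency = torch.rand(256)                      # placeholder for calibration statistics
inv_scale = apply_activation_aware_scaling(lin, saliency)
# After scaling, proceed with standard QAT (fake-quantize weights/activations, fine-tune).
```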

Page 19 of 20

Complete Flow

  • Sort channels based on successive channels
  • Activation-aware pruning (WANDA-like; a scoring sketch follows below)
  • Activation-aware quantization (AWQ-like)
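For the WANDA-like step, a sketch of the scoring rule: each weight is ranked by |W| times the norm of the activation feeding it, and the lowest-scoring weights are dropped per output row. Illustrative only; the Wanda and AWQ papers/repos are the reference implementations, and `act_norm` stands in for real calibration statistics.

```python
# WANDA-like scoring sketch: score = |W_ij| * ||x_j||, then zero the lowest-
# scoring weights in each output row. `act_norm` is a placeholder statistic.
import torch
import torch.nn as nn

@torch.no_grad()
def wanda_like_prune(linear: nn.Linear, act_norm: torch.Tensor, sparsity=0.5):
    # act_norm: [in_features] L2 norm of each input channel over a calibration set
    score = linear.weight.abs() * act_norm              # [out, in] importance scores
    k = int(linear.in_features * sparsity)
    idx = score.topk(k, dim=1, largest=False).indices   # k lowest-scoring weights per row
    mask = torch.ones_like(linear.weight)
    mask.scatter_(1, idx, 0.0)
    linear.weight.mul_(mask)

lin = nn.Linear(256, 512)
act_norm = torch.rand(256)        # placeholder calibration statistic
wanda_like_prune(lin, act_norm, sparsity=0.5)
```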

Page 20 of 20

Possible Additions