Page 1 of 20

Let's Make Neural Networks Efficient

Sathyaprakash Narayanan

ECE-x83: Special Topics in Engineering [Brain Inspired Machine Learning]

Jason Eshraghian

Page 2 of 20

So what’s the catch?

Scaling these models up is a huge pain:

- binarized activations are hard to deal with; you are drastically limiting the ability of individual neurons to represent information

For equivalent tasks, non-spiking networks are easier to train and often converge to a better loss

- recurrent, time-varying neurons are expensive to train with BPTT (memory grows linearly with the number of time steps)

- Sparsity only makes sense if your hardware knows to skip “0” operations. GPUs, by default, do not know to do this.

Solutions?

- Computation is cheap, memory access is expensive. Maybe we focus on sparsity instead of binarization. I.e., silicon’s already optimized for computation. Use it.

- Real-time learning techniques?

- Dynamical weights

Major takeaway: don’t trust me. Go build cooler shit.

Stolen from Class slides: Week 2 :P

Von Neumann Bottleneck “Memory is Expensive”

More Data Movement → More Memory References → More Energy

How should we make deep learning more efficient?

Page 3 of 20

Sparsity is the Key!!!!!!!!

Page 4 of 20

Von Neumann Bottleneck “Memory is Expensive”

More Data Movement → More Memory References → More Energy


Rough Energy Cost for Various Operations in 45nm, 0.9V:

Operation              Energy [pJ]
32-bit int ADD         0.1
32-bit float ADD       0.9
32-bit Register File   1
32-bit int MULT        3.1
32-bit float MULT      3.7
32-bit SRAM Cache      5
32-bit DRAM Memory     640

[Figure: bar chart of relative energy cost on a log scale (1 to 10000); a 32-bit DRAM access costs roughly 200x a 32-bit MULT.]

This image is in the public domain

Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]

What’s the first step towards Sparsity?

How should we make deep learning more efficient?

Page 5 of 20

Pruning Happens in the Human Brain


[Figure: number of synapses per neuron vs. age: roughly 2,500 per neuron in newborns, peaking around 15,000 at 2-4 years old, then pruned back during adolescence to about 7,000 in adults.]

Slide Inspiration: Alila Medical Media

Data Sources: [1], [2]

Do We Have Brain to Spare? [Drachman DA, Neurology 2004]

Peter Huttenlocher (1931–2013) [Walsh, C. A., Nature 2013]

Page 6 of 20

Neural Network Pruning


Make the neural network smaller by removing synapses and neurons (see the pruning sketch below)

Optimal Brain Damage [LeCun et al., NeurIPS 1989]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]

[Figure: Dense Model → Sparse Model (before vs. after pruning)]
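As a concrete illustration of the dense-to-sparse step above, here is a minimal magnitude-pruning sketch in PyTorch, in the spirit of Han et al. (2015). The toy MLP and the 90% sparsity level are illustrative assumptions, not values from the slides.

```python
# Minimal magnitude-pruning sketch using PyTorch's built-in pruning utilities.
# The toy MLP and the 90% sparsity level are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))

# Remove the 90% smallest-magnitude weights in every Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)
        prune.remove(module, "weight")  # bake the mask into the weight tensor

total = sum(m.weight.numel() for m in model.modules() if isinstance(m, nn.Linear))
zeros = sum(int((m.weight == 0).sum()) for m in model.modules() if isinstance(m, nn.Linear))
print(f"Global weight sparsity: {zeros / total:.1%}")
```

In practice, Han et al. interleave pruning with fine-tuning (train, prune, retrain) so accuracy recovers after each pruning round.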

Page 7 of 20

Let's make SNNs (and NNs in general) actually sparse!!

Wrote a package to fix this :P

SNN Notebook: https://colab.research.google.com/drive/1nNil4aj0GJxGfFQka8By9dTESVjLlied?usp=sharing

Tutorial: see the DL comparison in sconce/tutorials.
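For the SNN case, a minimal sketch of what "making SNNs sparse" can look like: an snnTorch LIF network whose synaptic (Linear) weights are magnitude-pruned exactly as above. Layer sizes, beta, the number of time steps, and the sparsity level are illustrative assumptions; see the linked notebook and sconce/tutorials for the actual walkthrough.

```python
# Sketch: magnitude pruning applied to a spiking network built with snnTorch.
# Layer sizes, beta, num_steps, and the 90% sparsity level are assumptions.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
import snntorch as snn

class SpikingMLP(nn.Module):
    def __init__(self, beta=0.9):
        super().__init__()
        self.fc1 = nn.Linear(784, 300)
        self.lif1 = snn.Leaky(beta=beta)
        self.fc2 = nn.Linear(300, 10)
        self.lif2 = snn.Leaky(beta=beta)

    def forward(self, x, num_steps=25):
        mem1, mem2 = self.lif1.init_leaky(), self.lif2.init_leaky()
        spk_out = []
        for _ in range(num_steps):               # static input repeated over time
            spk1, mem1 = self.lif1(self.fc1(x), mem1)
            spk2, mem2 = self.lif2(self.fc2(spk1), mem2)
            spk_out.append(spk2)
        return torch.stack(spk_out)

net = SpikingMLP()
# Only the synaptic weights (Linear layers) carry parameters worth pruning;
# the LIF neurons themselves are stateful but weight-free.
for m in net.modules():
    if isinstance(m, nn.Linear):
        prune.l1_unstructured(m, name="weight", amount=0.9)
```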

Page 8 of 20

Is that all we could do?

Quantization (see the sketch after this list)

Knowledge-Distillation

Sparsity Aware Engine/Model Computations

CUDA Optimizations

Hardware Aware Optimizations

Neural-Architecture Search
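As a taste of the quantization bullet above, a minimal post-training sketch using PyTorch's built-in dynamic quantization; the two-layer model here is a placeholder, and the other techniques in the list need considerably more machinery.

```python
# Minimal post-training quantization sketch: dynamic INT8 quantization of the
# Linear layers with stock PyTorch. The MLP is a placeholder model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10)).eval()

quantized = torch.quantization.quantize_dynamic(
    model,                 # fp32 model
    {nn.Linear},           # which module types to quantize
    dtype=torch.qint8,     # int8 weights; activations are quantized on the fly
)

x = torch.randn(1, 784)
print(quantized(x).shape)  # same interface as the fp32 model, ~4x smaller Linear weights
```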

Page 9 of 20

Sconce (Note: any PyTorch model can be used)

Go and smash the star on the GitHub repo (satabios/sconce)… I can't get you extra credit, but I promise you'll be in my stargazers list 🥹

Page 10 of 20

Didn’t want to bore you further

Page 11 of 20

References


  • Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020]
  • Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]
  • Optimal Brain Damage [LeCun et al., NeurIPS 1989]
  • Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
  • Efficient Methods and Hardware for Deep Learning [Han S., Stanford University]
  • Peter Huttenlocher (1931–2013) [Walsh, C. A., Nature 2013]
  • Exploring the Granularity of Sparsity in Convolutional Neural Networks [Mao et al., CVPR Workshops 2017]
  • Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
  • AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
  • Learning Structured Sparsity in Deep Neural Networks [Wen et al., NeurIPS 2016]
  • Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
  • Pruning Convolutional Filters with First Order Taylor Series Ranking [Wang M.]
  • Importance Estimation for Neural Network Pruning [Molchanov et al., CVPR 2019]
  • Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures [Hu et al., arXiv 2017]
  • Pruning Convolutional Neural Networks for Resource Efficient Inference [Molchanov et al., ICLR 2017]
  • Channel Pruning for Accelerating Very Deep Neural Networks [He et al., ICCV 2017]
  • ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression [Luo et al., ICCV 2017]
  • SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot [Frantar & Alistarh, arXiv 2023]

Page 12 of 20

sconce v0.99

Page 13 of 20

Auto-Sensitivity Scan

  • Reduce CPM (Compute, Power, Memory) for a given pruning/quantization/etc. budget
  • while causing the least/negligible amount of accuracy degradation (see the sensitivity-scan sketch below)

* - Future features (I too have only 24 hrs in a day :P)
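A sketch of what an auto-sensitivity scan can look like, per the bullets above: prune each layer in isolation at increasing ratios and record the accuracy drop relative to the unpruned baseline. `evaluate` and `val_loader` are assumed helpers; this illustrates the idea rather than sconce's internal implementation.

```python
# Per-layer sensitivity scan sketch: prune one layer at a time at increasing
# sparsity ratios and record the degradation. `evaluate(model, loader)` and
# `val_loader` are assumed helpers, not part of sconce's API.
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

def sensitivity_scan(model, val_loader, evaluate, ratios=(0.1, 0.3, 0.5, 0.7, 0.9)):
    baseline = evaluate(model, val_loader)
    results = {}
    for name, module in model.named_modules():
        if not isinstance(module, (nn.Conv2d, nn.Linear)):
            continue
        results[name] = []
        for ratio in ratios:
            probe = copy.deepcopy(model)                   # prune a throwaway copy
            target = dict(probe.named_modules())[name]
            prune.l1_unstructured(target, "weight", amount=ratio)
            acc = evaluate(probe, val_loader)
            results[name].append((ratio, baseline - acc))  # degradation vs. baseline
    # Downstream: pick the largest ratio per layer that stays under a degradation budget.
    return results
```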

Page 14 of 20

sconce v1.1

Page 15 of 20

Altruism is all you need!!! Don't just be self-attentive

Page 16 of 20

Make Kernels Aware of the Future Kernel Spaces

Code: https://github.com/satabios/sconce/blob/main/tutorials/Pruning.ipynb

Citations:

  • EIE: https://arxiv.org/abs/1602.01528
  • Multi-scale channel importance sorting and spatial attention mechanism for retinal vessels segmentation
  • EACP: An effective automatic channel pruning for neural networks
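One way to read "make kernels aware of the future kernel spaces" is to score a layer's output channels by how strongly the next layer's kernels consume them, in the spirit of ThiNet/EACP. The scoring rule below is an illustrative assumption, not sconce's actual criterion.

```python
# Sketch: score a conv layer's output channels by their own magnitude times how
# strongly the *next* layer's kernels use them. Illustrative only.
import torch
import torch.nn as nn

def successive_channel_scores(conv_k: nn.Conv2d, conv_next: nn.Conv2d) -> torch.Tensor:
    # conv_k.weight:    [C_out, C_in,  kH, kW]
    # conv_next.weight: [C_next, C_out, kH, kW] -- its input channels are conv_k's outputs
    own = conv_k.weight.detach().abs().sum(dim=(1, 2, 3))             # [C_out]
    downstream = conv_next.weight.detach().abs().sum(dim=(0, 2, 3))   # [C_out]
    return own * downstream  # channels that are large *and* consumed downstream rank highest

conv1, conv2 = nn.Conv2d(16, 32, 3), nn.Conv2d(32, 64, 3)
scores = successive_channel_scores(conv1, conv2)
keep = scores.topk(k=16).indices   # e.g. keep the top half of conv1's output channels
```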

Page 17 of 20

Channel-Based Activation Aware Pruning

But Channel-Wise!!

  • Register hooks to fetch the output (O/P) feature maps
  • Run the model through a calibration dataset
  • Prune channels (kernel weights) based on their activations (see the sketch below)
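A sketch of the three steps above: register forward hooks on the convolutions of interest, run a small calibration set, and accumulate a per-channel activation score that can then drive channel pruning. `calib_loader` is an assumed calibration DataLoader; this illustrates the recipe, not sconce's exact code.

```python
# Hook the output feature maps, run a calibration pass, and rank channels by
# mean activation magnitude. `calib_loader` is an assumed DataLoader.
import torch
import torch.nn as nn

@torch.no_grad()
def channel_activation_scores(model, conv_names, calib_loader, device="cpu"):
    scores, handles = {name: 0.0 for name in conv_names}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # output: [N, C, H, W] -> accumulate per-channel mean |activation|
            scores[name] = scores[name] + output.abs().mean(dim=(0, 2, 3))
        return hook

    modules = dict(model.named_modules())
    for name in conv_names:
        handles.append(modules[name].register_forward_hook(make_hook(name)))

    model.eval().to(device)
    for x, _ in calib_loader:           # calibration pass, labels unused
        model(x.to(device))
    for h in handles:
        h.remove()
    return scores                        # low-scoring channels are pruning candidates
```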

Page 18 of 20

Activation-Aware QAT

But Channel-Wise!!

  • Apply channel-wise scaling on the feature maps based on activation saliency (see the sketch below)
  • Then run quantization-aware training (QAT)
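A heavily simplified, AWQ-flavoured sketch of the scaling step: derive a per-input-channel scale from activation saliency, fold it into the layer's weights, and hand the inverse scale back to the producer of the activations before running standard QAT. The exponent `alpha` and the random saliency statistic are assumptions for illustration, not the full AWQ algorithm.

```python
# AWQ-flavoured sketch: fold activation-saliency scales into a Linear layer's
# weights; the inverse scale would be folded into the previous layer. Simplified.
import torch
import torch.nn as nn

@torch.no_grad()
def apply_activation_aware_scaling(linear: nn.Linear, act_saliency: torch.Tensor, alpha=0.5):
    # act_saliency: mean |activation| per input channel, from a calibration pass
    s = act_saliency.clamp(min=1e-5) ** alpha   # smooth the scale with exponent alpha
    linear.weight.mul_(s)                       # W' = W * s  (per input channel)
    return 1.0 / s                              # fold this into the activation producer

lin = nn.Linear(256, 512)
saliency = torch.rand(256)                      # placeholder for calibration statistics
inv_scale = apply_activation_aware_scaling(lin, saliency)
# After scaling, proceed with standard QAT (fake-quantize weights/activations, fine-tune).
```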

Page 19 of 20

Complete Flow

  • Sort channels based on successive channels
  • Activation-aware pruning (WANDA-like; a scoring sketch follows below)
  • Activation-aware quantization (AWQ-like)
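For the WANDA-like step, a sketch of the scoring rule: each weight is ranked by |W| times the norm of the activation feeding it, and the lowest-scoring weights are dropped per output row. Illustrative only; the Wanda and AWQ papers/repos are the reference implementations, and `act_norm` stands in for real calibration statistics.

```python
# WANDA-like scoring sketch: score = |W_ij| * ||x_j||, then zero the lowest-
# scoring weights in each output row. `act_norm` is a placeholder statistic.
import torch
import torch.nn as nn

@torch.no_grad()
def wanda_like_prune(linear: nn.Linear, act_norm: torch.Tensor, sparsity=0.5):
    # act_norm: [in_features] L2 norm of each input channel over a calibration set
    score = linear.weight.abs() * act_norm              # [out, in] importance scores
    k = int(linear.in_features * sparsity)
    idx = score.topk(k, dim=1, largest=False).indices   # k lowest-scoring weights per row
    mask = torch.ones_like(linear.weight)
    mask.scatter_(1, idx, 0.0)
    linear.weight.mul_(mask)

lin = nn.Linear(256, 512)
act_norm = torch.rand(256)        # placeholder calibration statistic
wanda_like_prune(lin, act_norm, sparsity=0.5)
```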

Page 20 of 20

Possible Additions