1 of 36

How to win the lottery!

Presentation about model compression

by

William Gazali

2 of 36

Contents

  • Background
  • Motivation
  • Main topic of discussion
    • The lottery ticket hypothesis (Base knowledge ICLR 2019)
    • Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning (Recent application CVPR 2024)
  • Conclusion
  • QnA

3 of 36

The Pareto Principle

“80% of the land in Italy is owned by 20% of the population” - Vilfredo Pareto

4 of 36

Background

5 of 36

Model Compression

  • High accuracy at a fraction of the network's size

  • There are four main types:
    • Pruning and Quantization
    • Compact architecture design
    • Low rank decomposition
    • Knowledge distillation

Menghani, G. (2023). Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better. ACM Computing Surveys, 55(12), 1–37.

6 of 36

Pruning and Quantization

Quantization

  • Reducing the number of bits used to represent the network's weights and activations

Pruning

  • Aims to remove unimportant connections or neurons (or set them to 0), as sketched below
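A minimal sketch of both ideas on a single weight tensor, assuming PyTorch; the 20% pruning fraction and the symmetric 8-bit fake-quantization are illustrative choices, not values from the survey.

    import torch

    w = torch.randn(256, 256)                    # a layer's weight tensor

    # Pruning: zero out the weights closest to zero (here the bottom 20% by magnitude)
    threshold = torch.quantile(w.abs(), 0.20)
    mask = (w.abs() > threshold).float()
    w_pruned = w * mask

    # Quantization: represent each weight with 8 bits (symmetric fake-quantization)
    scale = w.abs().max() / 127.0
    w_int8 = torch.clamp(torch.round(w / scale), -127, 127)   # values representable in int8
    w_dequant = w_int8 * scale                                 # what inference would see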

Menghani, G. (2023). Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better. ACM Computing Surveys, 55(12), 1–37.

7 of 36

Compact Architecture Design

  • A form of architectural design rather than post-hoc compression
  • Aims to create a network that is efficient yet robust (e.g., EfficientNet)

Mingxing Tan, & Quoc V. Le. (2020). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.

8 of 36

Low Rank Decomposition

  • Aims to approximate a layer with smaller, lower-dimensional layers (see the sketch below)
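As a concrete illustration (not from the slide), a dense layer W can be factored into two thinner layers with a truncated SVD; the target rank r = 32 below is an arbitrary choice.

    import torch

    W = torch.randn(512, 1024)                   # original dense layer: 512 * 1024 parameters
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)

    r = 32                                       # target rank (illustrative)
    A = U[:, :r] * S[:r]                         # 512 x r
    B = Vh[:r, :]                                # r x 1024

    W_approx = A @ B                             # y = A @ (B @ x) replaces y = W @ x
    print((W - W_approx).norm() / W.norm())      # relative approximation error
    print((A.numel() + B.numel()) / W.numel())   # parameter ratio, roughly 0.09 here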

Yu Cheng, Duo Wang, Pan Zhou, & Tao Zhang. (2020). A Survey of Model Compression and Acceleration for Deep Neural Networks.

9 of 36

Knowledge Distillation

  • Involves a larger teacher network that teaches a smaller student network
  • Mostly used so that the student network learns to predict like the teacher network (see the sketch below)
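A minimal sketch of the usual distillation loss, mixing softened teacher targets with the hard-label loss; the temperature T = 4 and weight alpha = 0.5 are illustrative values, not taken from the survey.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Soft targets: the student mimics the teacher's softened output distribution
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(teacher_logits / T, dim=1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: the usual cross-entropy on the true labels
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard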

Menghani, G. (2023). Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better. ACM Computing Surveys, 55(12), 1–37.

10 of 36

Why is it important?

11 of 36

Motivation

  • Industrial applications
  • Fits for use on edge devices
  • We can fully utilize a network's potential
  • etc.

Bohan Liu, Zijie Zhang, Peixiong He, Zhensen Wang, Yang Xiao, Ruimeng Ye, Yang Zhou, Wei-Shinn Ku, & Bo Hui. (2024). A Survey of Lottery Ticket Hypothesis.

12 of 36

The lottery ticket hypothesis

The Motivation

  • Left: early-stopping iteration vs. percentage of weights remaining
  • Right: accuracy at the early-stopping iteration vs. percentage of weights remaining
  • Winning tickets can preserve accuracy!

13 of 36

The lottery ticket hypothesis - Definition

The Lottery Ticket Hypothesis. A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations.
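Paraphrasing the paper's formal statement in symbols: if the dense network f(x; θ0) reaches test accuracy a after j training iterations, then

    \exists\, m \in \{0,1\}^{|\theta|}:\quad f(x;\, m \odot \theta_0)\ \text{reaches accuracy}\ a' \ge a\ \text{within}\ j' \le j\ \text{iterations, with}\ \|m\|_0 \ll |\theta|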

14 of 36

The lottery ticket hypothesis - The Idea

Consider a network being trained:

𝒇(𝒙; θ), where θ are the network's weights

Consider a masked network:

𝒇(𝒙; 𝑚⊙θ), where 𝑚 ∈ {0, 1}^|θ| is a binary mask and 𝑚⊙θ are the masked weights

15 of 36

Identifying winning tickets

In a nutshell

  • Randomly initialize the network 𝒇(𝒙;θ0)
  • Train the network for j iterations, arriving at parameters θj
  • Prune p% of the parameters in θj, creating the mask m
  • Reset the remaining parameters to their values in θ0, creating the winning ticket 𝒇(𝒙;𝑚⊙θ0)

This can be done in one shot, but the paper recommends iterative pruning over multiple train-prune-reset rounds (a minimal sketch follows below)
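A minimal PyTorch sketch of this loop, assuming a hypothetical train(model, steps) helper; the pruning fraction, number of rounds, and skipping of biases are illustrative choices rather than the paper's exact settings.

    import copy
    import torch

    def find_winning_ticket(model, train, prune_frac=0.2, rounds=5, steps=50_000):
        """Iterative magnitude pruning with rewinding to the initial weights."""
        theta0 = copy.deepcopy(model.state_dict())                 # snapshot of theta_0
        masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

        for _ in range(rounds):
            train(model, steps)                                    # train to theta_j (hypothetical helper)
            for name, p in model.named_parameters():
                if p.dim() < 2:                                    # skip biases / norm layers
                    continue
                alive = p[masks[name].bool()].abs()
                thresh = torch.quantile(alive, prune_frac)         # prune p% of the surviving weights
                masks[name] *= (p.abs() > thresh).float()
            model.load_state_dict(theta0)                          # rewind to theta_0
            with torch.no_grad():
                for name, p in model.named_parameters():
                    p *= masks[name]                               # winning ticket f(x; m ⊙ theta_0)
        return model, masks

In practice the mask also has to be enforced while training the next round (e.g., via hooks or torch.nn.utils.prune), which this sketch omits for brevity.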

16 of 36

Pruning a fully connected layer

Following the algorithm from the previous slide

  • Layer-wise magnitude pruning is used
    • i.e., within each layer, rank the weights by absolute value and remove the p% closest to zero
  • Tested with LeNet on MNIST

17 of 36

Pruning a fully connected layer

18 of 36

Pruning a convolutional network

  • Same strategy as before
  • Conv-2, Conv-4, and Conv-6 are used, which are scaled-down VGG-style networks

19 of 36

Pruning a convolutional network

  • Pruning combined with dropout increases accuracy further

20 of 36

Pruning a VGG and Resnet architecture

  • They use VGG-19 and ResNet-18 on CIFAR-10
  • Instead of layer-wise pruning, they use global pruning: one threshold computed over all layers at once (see the sketch below)
  • This is because some layers have far more parameters than others, and layer-wise pruning can turn the smaller layers into bottlenecks
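A minimal sketch of the global variant, assuming PyTorch: every prunable weight is pooled into one score vector and a single threshold is taken over all of them, instead of one threshold per layer; the 20% fraction is illustrative.

    import torch

    def global_magnitude_mask(model, fraction=0.2):
        """One magnitude threshold shared by all layers, rather than per-layer thresholds."""
        scores = torch.cat([p.abs().flatten()
                            for p in model.parameters() if p.dim() >= 2])
        k = max(1, int(fraction * scores.numel()))
        thresh = scores.kthvalue(k).values           # k-th smallest magnitude = global threshold
        return {name: (p.abs() > thresh).float()
                for name, p in model.named_parameters() if p.dim() >= 2}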

21 of 36

Pruning a VGG and Resnet architecture

22 of 36

Limitation

  • This work only considers small datasets such as MNIST and CIFAR-10
  • Very resource intensive (iterative pruning requires training the network many times)
  • The resulting unstructured sparsity may not be well optimized for modern hardware and libraries

23 of 36

Recent works utilizing this theory

24 of 36

Spectral foresight pruning

Problem:

  • Tries to solve the same Pruning-at-Initialization (PaI) problem
  • The usual pruning strategy (train, prune, retrain) is expensive
  • Builds on the weaknesses of earlier approaches that utilize Neural Tangent Kernel (NTK) theory
  • Can it also prune a pretrained network?

25 of 36

Background (Saliency score)

An alternative way to compute the mask m

  • Use a saliency score, i.e., the significance of each parameter with respect to some property F of the network; formally (see below):
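The formula itself did not survive the export; a common form of such a saliency score (the SNIP/SynFlow family, presumably the shape intended here) rates each parameter by how much the property F changes with it, then keeps the top-scoring ones:

    s(\theta_i) \;=\; \left|\, \frac{\partial F(\theta)}{\partial \theta_i} \cdot \theta_i \,\right|, \qquad m_i = \mathbb{1}\big[\, s(\theta_i)\ \text{is among the top-}k\ \text{scores} \,\big]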

26 of 36

Background (NTK)

Neural Tangent Kernel

  • Encapsulates the relationship between a network's parameters and its output

  • Θ(x, x′) = ∇θ𝒇(x; θ)ᵀ ∇θ𝒇(x′; θ) is the NTK of the network and describes how sensitive the output is to changes along the network's gradients
  • It has also been shown that faster convergence correlates with the NTK's eigenvalues (see below)!
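For reference (standard NTK theory rather than content recovered from this slide): with a squared-error loss in the linearized regime, the training residual decays along the NTK's eigendirections, so larger eigenvalues mean faster convergence in those directions:

    \Theta(x, x') = \nabla_\theta f(x;\theta)^{\top}\, \nabla_\theta f(x';\theta), \qquad f_t(X) - y \;\approx\; e^{-\eta\, \Theta\, t}\,\big(f_0(X) - y\big)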

27 of 36

How do we combine them?

  • Many works have failed to make use of a network's NTK
  • Computing the exact NTK is only feasible with a small amount of data
  • The time complexity of computing it is high, scaling with:

N : amount of data

K : output size

[FP] : cost of a single forward pass

28 of 36

Methodology

  • How can the NTK eigenvalues be calculated efficiently?

29 of 36

Methodology

  • Create two copies of f, namely g and h
  • g takes the squared input, with all parameters set to 1 and the same activations as f
  • h takes simplified all-ones data, with all-ones activations

30 of 36

Methodology

  • The final loss is defined as:

  • Perform global masking based on the top-scoring parameters
  • Update the mask m
  • Repeat for the next pruning round

31 of 36

Results

  • Tested on 3 different networks (ResNet-20, VGG-16, ResNet-18)
  • Tested on 3 different datasets (CIFAR-10, CIFAR-100, Tiny-ImageNet)

32 of 36

Results on the NTK eigenspectrum

  • The pruned network's NTK eigenspectrum closely matches that of the dense network

33 of 36

Results on pretrained models

34 of 36

Conclusion

  • Model pruning, although expensive, can achieve significant results
  • Recent approaches focus on making it cheaper
  • It can achieve a high level of sparsity while maintaining accuracy

35 of 36

Q n A

36 of 36

Thank You