How to win the lottery!
A presentation on model compression
by
William Gazali
Contents
The Pareto Principle
“80% of the land in Italy is owned by 20% of the population” - Vilfredo Pareto
Background
Model Compression
Menghani, G. (2023). Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better. ACM Computing Surveys, 55(12), 1–37.
Pruning and Quantization
Quantization
Pruning
Menghani, G. (2023). Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better. ACM Computing Surveys, 55(12), 1–37.
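A minimal PyTorch sketch of both ideas; the layer size, the 80% sparsity target, and the symmetric int8 scheme are illustrative assumptions, not the survey's prescription.

import torch
import torch.nn as nn

# A toy layer to compress (hypothetical example).
layer = nn.Linear(64, 32)
w = layer.weight.data

# Unstructured magnitude pruning: zero the 80% of weights with the
# smallest absolute value, keeping only the largest 20%.
k = int(0.8 * w.numel())
threshold = w.abs().flatten().kthvalue(k).values
w.mul_((w.abs() > threshold).float())

# Post-training quantization: map the float32 weights to int8 with a
# single symmetric scale, then dequantize to inspect the rounding error.
scale = w.abs().max() / 127.0
w_int8 = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
w_dequant = w_int8.float() * scale
print((w - w_dequant).abs().max())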
Compact Architecture Design
Tan, M., & Le, Q. V. (2020). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.
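EfficientNet's compound scaling rule fits in a few lines; the base coefficients below are the ones reported in the paper, while phi is a hypothetical compute budget.

# Compound scaling: depth, width, and input resolution grow together,
# driven by one coefficient phi, under alpha * beta^2 * gamma^2 ≈ 2
# so that total FLOPs grow by roughly 2^phi.
alpha, beta, gamma = 1.2, 1.1, 1.15
phi = 3  # hypothetical budget (roughly EfficientNet-B3)
depth_mult = alpha ** phi   # more layers
width_mult = beta ** phi    # more channels
res_mult = gamma ** phi     # larger input images
print(f"depth x{depth_mult:.2f}, width x{width_mult:.2f}, resolution x{res_mult:.2f}")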
Low Rank Decomposition
Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2020). A Survey of Model Compression and Acceleration for Deep Neural Networks.
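A minimal truncated-SVD sketch of the idea, assuming a plain weight matrix; the rank r = 32 is a hypothetical choice.

import torch

# Approximate W (m x n) with two thin rank-r factors: storage and matmul
# cost drop from m*n to (m + n)*r.
W = torch.randn(256, 512)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
r = 32
A = U[:, :r] * S[:r]   # (m x r), singular values folded into the left factor
B = Vh[:r, :]          # (r x n)
W_approx = A @ B
print((W - W_approx).norm() / W.norm())  # relative approximation error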
Knowledge Distillation
Menghani, G. (2023). Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better. ACM Computing Surveys, 55(12), 1–37.
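A minimal sketch of the classic soft-target distillation loss (Hinton et al.); the temperature T and mixing weight alpha are hypothetical hyperparameters.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft part: KL divergence between the temperature-softened teacher
    # and student distributions, scaled by T^2 to keep gradients stable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard part: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard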
Why is it important?
Motivation
Liu, B., Zhang, Z., He, P., Wang, Z., Xiao, Y., Ye, R., Zhou, Y., Ku, W.-S., & Hui, B. (2024). A Survey of Lottery Ticket Hypothesis.
The lottery ticket hypothesis
The Motivation
The lottery ticket hypothesis - Definition
The Lottery Ticket Hypothesis. A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations.
The lottery ticket hypothesis - The Idea
Consider a dense network to be trained:
f(x; θ), where θ are the network weights.
Now consider a masked network:
f(x; m ⊙ θ), where m ∈ {0, 1}^|θ| is a fixed binary mask and m ⊙ θ are the masked weights.
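A minimal sketch of f(x; m ⊙ θ) for a single layer; the ~20% keep ratio is an illustrative assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.Linear(64, 32)
theta0 = layer.weight.detach().clone()       # initialization θ0, kept for rewinding
m = (torch.rand_like(theta0) > 0.8).float()  # fixed binary mask, ~20% of weights kept

def masked_forward(x):
    # f(x; m ⊙ θ): the mask stays fixed while θ is trained as usual.
    return F.linear(x, m * layer.weight, layer.bias)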
Identifying winning tickets
In a nutshell
Pruning can be done in one shot, but the paper recommends iterative pruning, sketched below: prune a small fraction of the remaining weights per round over several rounds.
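A condensed sketch of that loop; theta0 is a dict of cloned initial parameters, and train is a hypothetical helper that optimizes the model while holding masked weights at zero.

import torch

def find_winning_ticket(model, theta0, train, rounds=5, p=0.2):
    # Start from an all-ones mask (nothing pruned yet).
    masks = {n: torch.ones_like(w) for n, w in model.named_parameters()
             if "weight" in n}
    for _ in range(rounds):
        train(model, masks)  # train to convergence under the current mask
        for n, w in model.named_parameters():
            if n not in masks:
                continue
            alive = w[masks[n].bool()].abs()
            if alive.numel() < 2:
                continue
            # Prune the p% smallest-magnitude surviving weights...
            thresh = alive.kthvalue(max(1, int(p * alive.numel()))).values
            masks[n] *= (w.abs() > thresh).float()
            # ...and rewind the survivors to their initialization θ0.
            w.data.copy_(theta0[n] * masks[n])
    return masks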
Pruning a fully connected layer
Following the algorithm from the previous slide
Pruning a convolutional network
Pruning VGG and ResNet architectures
Limitations
Recent work building on this hypothesis
Spectral foresight pruning
Problem:
Background (Saliency score)
An alternative way to compute the mask m
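For concreteness, a SNIP-style saliency sketch (one common pruning-at-initialization score, s = |θ · ∂L/∂θ|); model, loss_fn, and the mini-batch (x, y) are hypothetical inputs, and this stands in for the paper's exact score.

import torch

def saliency_masks(model, loss_fn, x, y, keep=0.2):
    # One backward pass on a single mini-batch, before any training.
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    scores = [(p * g).abs() for p, g in zip(model.parameters(), grads)]
    flat = torch.cat([s.flatten() for s in scores])
    # Keep the top fraction of weights by saliency, prune the rest.
    k = max(1, int((1 - keep) * flat.numel()))
    thresh = flat.kthvalue(k).values
    return [(s > thresh).float() for s in scores]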
Background (NTK)
Neural Tangent Kernel
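A brute-force sketch of the empirical NTK, Θ(x1, x2) = J(x1) J(x2)^T, where J is the Jacobian of the network outputs with respect to θ; this is only feasible for tiny models.

import torch

def empirical_ntk(model, x1, x2):
    def jacobian(x):
        out = model(x).flatten()  # N * K output entries
        rows = []
        for o in out:
            g = torch.autograd.grad(o, list(model.parameters()),
                                    retain_graph=True)
            rows.append(torch.cat([gi.flatten() for gi in g]))
        return torch.stack(rows)  # (N*K) x |θ|
    return jacobian(x1) @ jacobian(x2).T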
How do we combine them?
N: number of training samples
K: output dimension (number of classes)
FP: cost of a single forward pass
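In this notation, the exact empirical NTK needs about one backward pass per (sample, output) pair, so its cost scales as N · K · FP; a hypothetical CIFAR-10-sized example:

# Rough exact-NTK cost in the slide's notation: ~N * K passes of cost FP.
N, K = 50_000, 10  # hypothetical dataset size and number of classes
FP = 1             # cost of one pass (arbitrary unit)
print(N * K * FP)  # 500,000 passes: why a cheaper estimate is needed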
Methodology
Results
Results on the NTK eigenspectrum
Results on pretrained models
Conclusion
Q & A
Thank You