1 of 36

How to win the lottery!

Presentation about model compression

by

William Gazali

2 of 36

Contents

  • Background
  • Motivation
  • Main topic of discussion
    • The lottery ticket hypothesis (Base knowledge ICLR 2019)
    • Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning (Recent application CVPR 2024)
  • Conclusion
  • QnA

3 of 36

The Pareto Principle

“80% of the land in Italy is owned by 20% of the population” - Vilfredo Pareto

4 of 36

Background

5 of 36

Model Compression

  • High accuracy at a fraction of the network's size

  • There are four main types:
    • Pruning and Quantization
    • Compact architecture design
    • Low rank decomposition
    • Knowledge distillation

Menghani, G. (2023). Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better. ACM Computing Surveys, 55(12), 1–37.

6 of 36

Pruning and Quantization

Quantization

  • Reducing the number of bits used to represent the network's weights and activations

Pruning

  • Aims to remove unimportant connections or neurons (or set them to 0), as sketched below
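A minimal sketch of both ideas on a single weight tensor, assuming PyTorch; the 20% pruning fraction and the symmetric 8-bit fake-quantization are illustrative choices, not values from the survey.

    import torch

    w = torch.randn(256, 256)                    # a layer's weight tensor

    # Pruning: zero out the weights closest to zero (here the bottom 20% by magnitude)
    threshold = torch.quantile(w.abs(), 0.20)
    mask = (w.abs() > threshold).float()
    w_pruned = w * mask

    # Quantization: represent each weight with 8 bits (symmetric fake-quantization)
    scale = w.abs().max() / 127.0
    w_int8 = torch.clamp(torch.round(w / scale), -127, 127)   # values representable in int8
    w_dequant = w_int8 * scale                                 # what inference would see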

Menghani, G. (2023). Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better. ACM Computing Surveys, 55(12), 1–37.

7 of 36

Compact Architecture Design

  • A form of architectural design rather than post-hoc compression
  • Aims to create a network that is efficient yet robust (e.g., EfficientNet)

Mingxing Tan, & Quoc V. Le. (2020). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.

8 of 36

Low Rank Decomposition

  • Aims to approximate a layer with smaller, lower-dimensional layers (see the sketch below)
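As a concrete illustration (not from the slide), a dense layer W can be factored into two thinner layers with a truncated SVD; the target rank r = 32 below is an arbitrary choice.

    import torch

    W = torch.randn(512, 1024)                   # original dense layer: 512 * 1024 parameters
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)

    r = 32                                       # target rank (illustrative)
    A = U[:, :r] * S[:r]                         # 512 x r
    B = Vh[:r, :]                                # r x 1024

    W_approx = A @ B                             # y = A @ (B @ x) replaces y = W @ x
    print((W - W_approx).norm() / W.norm())      # relative approximation error
    print((A.numel() + B.numel()) / W.numel())   # parameter ratio, roughly 0.09 here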

Yu Cheng, Duo Wang, Pan Zhou, & Tao Zhang. (2020). A Survey of Model Compression and Acceleration for Deep Neural Networks.

9 of 36

Knowledge Distillation

  • Involves a larger teacher network that teaches a smaller student network
  • Mostly used so that the student network learns to predict like the teacher network (see the sketch below)
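A minimal sketch of the usual distillation loss, mixing softened teacher targets with the hard-label loss; the temperature T = 4 and weight alpha = 0.5 are illustrative values, not taken from the survey.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Soft targets: the student mimics the teacher's softened output distribution
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(teacher_logits / T, dim=1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: the usual cross-entropy on the true labels
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard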

Menghani, G. (2023). Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better. ACM Computing Surveys, 55(12), 1–37.

10 of 36

Why is it important?

11 of 36

Motivation

  • Industrial applications
  • Fits for use on edge devices
  • We can fully utilize a network's potential
  • etc.

Bohan Liu, Zijie Zhang, Peixiong He, Zhensen Wang, Yang Xiao, Ruimeng Ye, Yang Zhou, Wei-Shinn Ku, & Bo Hui. (2024). A Survey of Lottery Ticket Hypothesis.

12 of 36

The lottery ticket hypothesis

The Motivation

  • Left: early-stopping iteration vs. percentage of weights remaining
  • Right: accuracy at the early-stopping iteration vs. percentage of weights remaining
  • Winning tickets can preserve accuracy!

13 of 36

The lottery ticket hypothesis - Definition

The Lottery Ticket Hypothesis. A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations.
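Paraphrasing the paper's formal statement in symbols: if the dense network f(x; θ0) reaches test accuracy a after j training iterations, then

    \exists\, m \in \{0,1\}^{|\theta|}:\quad f(x;\, m \odot \theta_0)\ \text{reaches accuracy}\ a' \ge a\ \text{within}\ j' \le j\ \text{iterations, with}\ \|m\|_0 \ll |\theta|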

14 of 36

The lottery ticket hypothesis - The Idea

Consider a network being trained:

𝒇(𝒙; θ), where θ are the network's weights

Consider a masked network:

𝒇(𝒙; 𝑚⊙θ), where 𝑚 ∈ {0, 1}^|θ| is a binary mask and 𝑚⊙θ are the masked weights

15 of 36

Identifying winning tickets

In a nutshell

  • Randomly initialize the network 𝒇(𝒙;θ0)
  • Train the network for j iterations, arriving at parameters θj
  • Prune p% of the parameters in θj, creating the mask m
  • Reset the remaining parameters to their values in θ0, creating the winning ticket 𝒇(𝒙;𝑚⊙θ0)

This can be done in one shot, but the paper recommends iterative pruning over multiple train-prune-reset rounds (a minimal sketch follows below)
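A minimal PyTorch sketch of this loop, assuming a hypothetical train(model, steps) helper; the pruning fraction, number of rounds, and skipping of biases are illustrative choices rather than the paper's exact settings.

    import copy
    import torch

    def find_winning_ticket(model, train, prune_frac=0.2, rounds=5, steps=50_000):
        """Iterative magnitude pruning with rewinding to the initial weights."""
        theta0 = copy.deepcopy(model.state_dict())                 # snapshot of theta_0
        masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

        for _ in range(rounds):
            train(model, steps)                                    # train to theta_j (hypothetical helper)
            for name, p in model.named_parameters():
                if p.dim() < 2:                                    # skip biases / norm layers
                    continue
                alive = p[masks[name].bool()].abs()
                thresh = torch.quantile(alive, prune_frac)         # prune p% of the surviving weights
                masks[name] *= (p.abs() > thresh).float()
            model.load_state_dict(theta0)                          # rewind to theta_0
            with torch.no_grad():
                for name, p in model.named_parameters():
                    p *= masks[name]                               # winning ticket f(x; m ⊙ theta_0)
        return model, masks

In practice the mask also has to be enforced while training the next round (e.g., via hooks or torch.nn.utils.prune), which this sketch omits for brevity.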

16 of 36

Pruning a fully connected layer

Following the algorithm from the previous slide

  • Layer-wise magnitude pruning is used
    • i.e., within each layer, rank the weights by absolute value and remove the p% closest to zero
  • Tested with LeNet on MNIST

17 of 36

Pruning a fully connected layer

18 of 36

Pruning a convolutional network

  • Same strategy as before
  • Conv-2, Conv-4, and Conv-6 are used, which are scaled-down VGG-style networks

19 of 36

Pruning a convolutional network

  • Pruning combined with dropout increases accuracy further

20 of 36

Pruning a VGG and Resnet architecture

  • They use VGG-19 and ResNet-18 on CIFAR-10
  • Instead of layer-wise pruning, they use global pruning: one threshold computed over all layers at once (see the sketch below)
  • This is because some layers have far more parameters than others, and layer-wise pruning can turn the smaller layers into bottlenecks
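A minimal sketch of the global variant, assuming PyTorch: every prunable weight is pooled into one score vector and a single threshold is taken over all of them, instead of one threshold per layer; the 20% fraction is illustrative.

    import torch

    def global_magnitude_mask(model, fraction=0.2):
        """One magnitude threshold shared by all layers, rather than per-layer thresholds."""
        scores = torch.cat([p.abs().flatten()
                            for p in model.parameters() if p.dim() >= 2])
        k = max(1, int(fraction * scores.numel()))
        thresh = scores.kthvalue(k).values           # k-th smallest magnitude = global threshold
        return {name: (p.abs() > thresh).float()
                for name, p in model.named_parameters() if p.dim() >= 2}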

21 of 36

Pruning a VGG and Resnet architecture

22 of 36

Limitation

  • This work only considers small datasets such as MNIST and CIFAR-10
  • Very resource intensive (iterative pruning requires training the network many times)
  • The resulting unstructured sparsity may not be well optimized for modern hardware and libraries

23 of 36

Recent works utilizing this theory

24 of 36

Spectral foresight pruning

Problem:

  • Tries to solve the same Pruning-at-Initialization (PaI) problem
  • The usual pruning strategy (train, prune, retrain) is expensive
  • Builds on the weaknesses of earlier approaches that utilize Neural Tangent Kernel (NTK) theory
  • Can it also prune a pretrained network?

25 of 36

Background (Saliency score)

An alternative way to compute the mask m

  • Use a saliency score, i.e., the significance of each parameter with respect to some property F of the network; formally (see below):
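The formula itself did not survive the export; a common form of such a saliency score (the SNIP/SynFlow family, presumably the shape intended here) rates each parameter by how much the property F changes with it, then keeps the top-scoring ones:

    s(\theta_i) \;=\; \left|\, \frac{\partial F(\theta)}{\partial \theta_i} \cdot \theta_i \,\right|, \qquad m_i = \mathbb{1}\big[\, s(\theta_i)\ \text{is among the top-}k\ \text{scores} \,\big]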

26 of 36

Background (NTK)

Neural Tangent Kernel

  • Encapsulates the relationship between a network's parameters and its output

  • Θ(x, x′) = ∇θ𝒇(x; θ)ᵀ ∇θ𝒇(x′; θ) is the NTK of the network and describes how sensitive the output is to changes along the network's gradients
  • It has also been shown that faster convergence correlates with the NTK's eigenvalues (see below)!
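For reference (standard NTK theory rather than content recovered from this slide): with a squared-error loss in the linearized regime, the training residual decays along the NTK's eigendirections, so larger eigenvalues mean faster convergence in those directions:

    \Theta(x, x') = \nabla_\theta f(x;\theta)^{\top}\, \nabla_\theta f(x';\theta), \qquad f_t(X) - y \;\approx\; e^{-\eta\, \Theta\, t}\,\big(f_0(X) - y\big)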

27 of 36

How do we combine them?

  • Many works have failed to make use of a network's NTK
  • Computing the exact NTK is only feasible with a small amount of data
  • The time complexity of computing it is high, scaling with:

N : amount of data

K : output size

[FP] : cost of a single forward pass

28 of 36

Methodology

  • How can the NTK eigenvalues be calculated efficiently?

29 of 36

Methodology

  • Create two copies of f, namely g and h
  • g takes the squared input, with all parameters set to 1 and the same activations as f
  • h takes simplified all-ones data, with all-ones activations

30 of 36

Methodology

  • The final loss is defined as:

  • Perform global masking based on the top-scoring parameters
  • Update the mask m
  • Repeat for the next pruning round

31 of 36

Results

  • Tested on 3 different networks (ResNet-20, VGG-16, ResNet-18)
  • Tested on 3 different datasets (CIFAR-10, CIFAR-100, Tiny-ImageNet)

32 of 36

Results on the NTK eigenspectrum

  • The pruned network's NTK eigenspectrum closely matches that of the dense network

33 of 36

Results on pretrained models

34 of 36

Conclusion

  • Model pruning, although expensive, can achieve significant results
  • Recent approaches focus on making it cheaper
  • It can achieve a high level of sparsity while maintaining accuracy

35 of 36

Q n A

36 of 36

Thank You