
PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning

Arun Mallya, Svetlana Lazebnik

Presenters: Mostafa ElAraby, Dishank Bansal


Outline


  • Introduction
  • Related Works
  • PackNet
  • Experiments and Results
  • Conclusion
  • Discussion


Introduction


  • Task-aware at both train and test time
  • Does not require storing data from previous tasks
  • Avoids catastrophic forgetting: no performance drop on previous tasks
  • Iterative pruning and re-training to learn new tasks

Key Theme

Use the redundant parameters in a network to learn new tasks, instead of removing them.


Related Works


  • Network Expansion methods:
    • Keep adding new layers and learning them for new tasks.
    • The size of the network keeps increasing with the number of tasks.
    • In PackNet, by contrast, the only overhead is storing the masks, which is far smaller than adding layers.


Related Works


  • LwF
    • Preserves the network's responses on older tasks by using a distillation loss during training on the new task.
    • If the new task differs a lot from previous tasks, the decision boundary of the network can change substantially, leading to forgetting.
  • EWC
    • Penalises changes to weights that are important to previous tasks (the standard penalty is reproduced after this list).
    • The soft constraint can still allow important weights to change, leading to some degree of forgetting.

  • DSD: Dense-Sparse-Dense Training
    • Sparsifying and retraining the weights of a network serves as a form of regularization and improves performance on the same task.
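
For reference, the EWC penalty mentioned above has the following standard form (from the EWC paper, not from these slides), where L_B is the loss on the new task B, F_i the Fisher information of parameter θ_i, θ*_{A,i} the weights learned on the previous task A, and λ the penalty strength:

```latex
% Standard EWC objective when learning task B after task A
\mathcal{L}(\theta) \;=\; \mathcal{L}_B(\theta) \;+\; \sum_i \frac{\lambda}{2}\, F_i \left(\theta_i - \theta^{*}_{A,i}\right)^2
```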


Approach


  • During training on task t
    • Weights from previous tasks are used in the forward pass, but are kept fixed (not updated during backprop).
    • Only weights that are free, i.e. not used by previous tasks, are updated (see the sketch after this list).
  • Pruning for task t
    • Prune some fraction of the weights used for training task t.
    • Note that weights used by previous tasks cannot be pruned at the task-t step.
  • After pruning, performance on task t drops, so the network is retrained to compensate for the pruned weights.
  • Repeat until no more weights can be pruned without loss of performance.
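
A minimal PyTorch-style sketch of this masked update, assuming a per-parameter tensor `owner` in which 0 marks a free weight and k marks a weight assigned to task k; the names `owner` and `masked_sgd_step` are illustrative, not the authors' implementation:

```python
import torch

@torch.no_grad()
def masked_sgd_step(params, owners, current_task, lr=1e-2):
    """SGD step that only touches weights that are free or belong to the current task."""
    for p, owner in zip(params, owners):
        if p.grad is None:
            continue
        trainable = (owner == 0) | (owner == current_task)  # boolean mask over weights
        p -= lr * p.grad * trainable.to(p.dtype)            # frozen weights get a zero update
```

Freezing is implemented here by simply zeroing the update on non-trainable weights; the forward pass still uses all weights, as described above.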


Approach: Overhead & Inference


  • Store a mask indicating which parameters are active for a particular task.
  • ⌈log2(N)⌉ bits are needed to encode the mask per parameter, where N is the total number of tasks.
  • Since each parameter can be used by at most N tasks, the encoding must represent integers up to N: a parameter records the first task it was trained for and is then reused (frozen) by all later tasks. Hence ⌈log2(N)⌉ bits per parameter (a small sketch follows this list).
  • Say N=4; then each parameter needs 2 bits, i.e. xy:
    • x=0, y=0 => used for tasks t=1,2,3,4
    • x=0, y=1 => used for tasks t=2,3,4
    • x=1, y=0 => used for tasks t=3,4
    • x=1, y=1 => used for task t=4
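
An illustrative encoding/decoding sketch of this scheme (assumed helper names, not the authors' code): each parameter stores the index of the task that trained it, and at test time for task t every weight trained at a task ≤ t is active.

```python
import math

def bits_per_parameter(num_tasks: int) -> int:
    """ceil(log2 N) bits suffice to store the owning task index per parameter."""
    return math.ceil(math.log2(num_tasks))

def is_active(trained_at_task: int, inference_task: int) -> bool:
    """At test time for task t, use every weight trained at a task <= t."""
    return trained_at_task <= inference_task

assert bits_per_parameter(4) == 2                        # the N=4 example above: 2 bits (xy)
assert is_active(trained_at_task=2, inference_task=3)    # the x=0, y=1 weight is used for task 3
```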


Approach: Pruning


  • At each round of pruning, a fixed percentage of weights is pruned.
  • Weights to be pruned are selected based on their absolute magnitude.
  • The lowest-magnitude weights are pruned.
  • This is one-shot pruning, since the weights are pruned all at once.
  • Incremental pruning can also be used, as it achieves better performance (a sketch of one pruning round follows this list).
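
A sketch of one round of magnitude pruning under the same illustrative `owner` convention as before (weights of earlier tasks are never eligible); this is an assumption-laden sketch, not the released code:

```python
import torch

def prune_current_task(weight, owner, current_task, prune_frac=0.5):
    """Prune the lowest-magnitude fraction of the weights trained for the current task."""
    eligible = owner == current_task
    if not bool(eligible.any()):
        return
    # Threshold = the prune_frac quantile of |w| over the eligible weights only.
    threshold = torch.quantile(weight[eligible].abs(), prune_frac)
    pruned = eligible & (weight.abs() <= threshold)
    weight.data[pruned] = 0.0   # zero out the lowest-magnitude weights of this task
    owner[pruned] = 0           # release them back to the free pool for future tasks
```

The network is then retrained on task t with these pruned weights (and all previously owned weights) held fixed, to recover the accuracy lost by pruning.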


Datasets

Experiments



Training Setting



Baselines


01 Classifier

  • Trains on ImageNet initially
  • Freezes the feature extractor
  • Trains only the classifier for each upcoming task (a minimal sketch of this baseline follows the list)

02 Individual Networks

  • Trains an individual network for each task

03 Learning without Forgetting (LwF)

  • Trains on the initial task
  • Adds new parameters in the last layer for each new upcoming task
  • Regularizes using the outputs of the old task's parameters on the current task's data

04 Joint Training

  • Trains all parameters jointly
  • Relies on the datasets having the same input size
  • Sampling should be balanced between the different tasks
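
A minimal sketch of baseline 01 (classifier only), assuming a generic pretrained backbone; the helper name `make_classifier_only` and the single linear head are illustrative assumptions, not the authors' exact setup:

```python
import torch.nn as nn

def make_classifier_only(backbone: nn.Module, feat_dim: int, num_classes: int) -> nn.Module:
    # Freeze the pretrained (ImageNet) feature extractor.
    for p in backbone.parameters():
        p.requires_grad = False
    # Only this fresh classification head is trained for the new task.
    head = nn.Linear(feat_dim, num_classes)
    return nn.Sequential(backbone, head)
```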


Fine-Grained Tasks: Vanilla VGG-16


  • Beats LwF on all tasks
  • Pruning 75% of the parameters for the first task and then 50% for the upcoming tasks is better than pruning 50% from the start, because more parameters are left for all upcoming tasks
  • Pruning 75% of the parameters leaves more parameters for the upcoming task, making it slightly better on the next task at the cost of inferior performance on the previous one


Large Datasets: Vanilla VGG-16


  • Joint training is considered an upper bound in continual learning, as it optimizes all parameters by interleaving samples from both tasks while balancing them (requiring 60 epochs vs. 10 epochs for ours).
  • Pruning 75% of the parameters leaves more parameters for the upcoming task, making it slightly better on the next task at the cost of inferior performance on the previous one.


Change in Error: LwF vs. PackNet


  • In LwF, performance on previous tasks starts decreasing whenever a new task is added.
  • In PackNet, performance on previous tasks stays fixed regardless of the number of newly added tasks, since the parameters of previous tasks are frozen.


Results on VGG-16 with BatchNorm


  • The authors decided to prune 50% of the parameters for the first task and 75% for any upcoming task.
  • Biases and the running mean and variance of the batch norm layers are frozen after training on the initial task, for the sake of simplicity.


Results on ResNet-50


  • With ResNet-50, PackNet is slightly better on the Flowers task than training an individual network, possibly because the large ResNet overfits on the Flowers task.
  • Flowers is the smallest task, and PackNet uses fewer available parameters after pruning, which helps avoid the overfitting issue.


Results on DenseNet-121


  • The chosen pruning ratio represents a trade-off between the loss of accuracy on the current task and the capacity left for upcoming tasks.
  • The authors pre-select a fixed ratio for all upcoming tasks except the first one, which sometimes gets a higher pruning ratio.


Effect of Training Order


  • The order of training matters in PackNet.
  • Pruning the initial network gives more parameters to the first task than to upcoming tasks.
  • Each newly added task has fewer parameters available than the previous ones.
  • Challenging or unrelated tasks should be added first, as the capacity of the network decreases for upcoming tasks.


Effect of Pruning Ratios


  • Re-training after pruning helps restore the performance lost due to the sudden change in the network's connectivity.
  • The authors show that even aggressive pruning (90%) results in only a small increase in error.
  • Modifying a small fraction of the parameters is enough to obtain good accuracy.


Sharing Parameters (biases)


  • Learning separate biases for the convolutional and batch norm layers does not show any noticeable improvement in performance.
  • Sharing the biases reduces the storage overhead, since biases do not need to be stored for each task.


Effect of Training All Layers


  • The authors study the effect of fine-tuning different subsets of the network's layers.
  • Fine-tuning the fully connected layers decreases the error compared to the baseline that fine-tunes only the classification layer.
  • Fine-tuning all layers, including the convolutional layers, provides the biggest decrease in error.
  • In PackNet, the pruning ratio of each layer can be controlled (task specific).


Filter-Based Pruning


  • The authors compare sparsifying portions of each filter with pruning entire filters.
  • Pruning entire filters eliminates ~30% of the filters in VGG-16, which is aggressive; most of the pruned filters come from the convolutional layer just before the fully connected layers, and re-connecting sparse filters adds implementation complexity.


Conclusion


  • Able to use redundant parameters in the network to learn new tasks.
  • No forgetting, since the parameters for previous tasks are fixed while training on the current task.
  • More efficient than learning tasks on individual networks, since a single network can leverage the similarity of features across tasks.
  • Applicable to different network architectures.

Limitations of the Work

  1. The number of possible tasks is limited by the capacity of the network; at some point the parameters left for training will not be enough for upcoming tasks.
  2. Might not work with efficient networks where there are no redundant parameters.
  3. The number of tasks that can be added is correlated with task similarity, but no experiments are shown for this. For example, if all the tasks are completely different, the network might learn fewer tasks than if the tasks were related: since capacity is constant, related tasks can make more use of previously learned weights than unrelated tasks.
  4. No adaptive pruning, i.e. pruning that adapts to the similarity of tasks.


Improvements


  • Adaptive pruning based on the similarity of tasks.
  • Network expansion after the full capacity of the network has been used.