1 of 35

Learning Without Forgetting (LwF)

Presenters: Irene Tenison, Sai Aravind Sreeramadas

2 of 35

  • Introduction
    • Objective
    • LwF setting
    • Other relevant methods
  • Learning Without Forgetting (LwF)
    • Method
  • Experiments
  • Design choices
  • Summary
  • Discussion

3 of 35

Introduction - Objective

  • To learn new tasks by sharing parameters from old tasks, without suffering from catastrophic forgetting (performance degradation on old tasks as new tasks are learnt) and without access to the old tasks’ training data (which may be unrecorded, proprietary, or too cumbersome to store).

  • Classified as a regularization-based method for continual learning

4 of 35

LwF Setting

  • Old task data will not be available during training of the new task

  • Three sets of parameters:
    • 𝛳s - shared parameters
    • 𝛳o - task-specific parameters of previous tasks
    • 𝛳n - task-specific parameters of the new task

[Figure: network diagram with shared parameters 𝛳s feeding the task-specific heads 𝛳o and 𝛳n]

5 of 35

Relevant Methods - Feature Extraction

  • For each new task,
    • Pre-trained CNN extracts the features
    • Classifiers (from random initialization) are trained on these features

  • 𝛳s and 𝛳o are unchanged and 𝛳n is trained

  • Performance can be improved by fine tuning

6 of 35

Relevant Methods - Fine Tuning

  • For each new task,
    • The existing CNN is modified (trained with a low learning rate)
    • The output layer (randomly initialized) is trained

  • 𝛳o is unchanged; 𝛳s and 𝛳n are optimized during training

  • Variations: duplicating and fine-tuning, fine-tuning only the FC layers

7 of 35

Relevant Methods - Joint Training (Multitask learning)

  • All parameters (𝛳s, 𝛳o, and 𝛳n) are optimized during training

  • Requires data from all tasks to be available simultaneously

  • Serves as an upper bound for LwF’s performance

8 of 35

Comparison of Methods

9 of 35

LwF Goal

Given a CNN with shared parameters 𝛳s and task-specific parameters of previous tasks 𝛳o, the goal of LwF is to add task-specific parameters 𝛳n for the new task and to learn parameters (𝛳s, 𝛳o, 𝛳n) that work well on both the old and the new tasks, using data from the new task only.

10 of 35

Method

Starts With:

𝛳s and 𝛳o: shared parameters and task-specific parameters of the old tasks

Xn, Yn: data of the new task

[Figure: the original network, shared parameters 𝛳s with the old-task head 𝛳o]

11 of 35

Method

Starts With:

𝛳s and 𝛳o: shared parameters and task-specific parameters of the old tasks

Xn, Yn: data of the new task

Initialize:

Yo ← CNN(Xn; 𝛳s, 𝛳o): record the old network’s responses on the new-task images Xn

𝛳n ← randomly initialized parameters for the new task

[Figure: Xn passed through 𝛳s and the old-task head 𝛳o to produce the recorded responses Yo]
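A minimal PyTorch sketch of this recording step, assuming hypothetical stand-ins `shared`, `old_head`, and `new_task_loader`: the frozen old network is run once over the new-task images to store its old-task outputs Yo, which later serve as distillation targets.

```python
import torch

@torch.no_grad()
def record_old_responses(shared, old_head, new_task_loader):
    """Run the current (old) network over the new-task data Xn and store
    its old-task softmax outputs Yo. No old-task data is needed."""
    responses = []
    for x_n, _ in new_task_loader:   # the new-task labels Yn are ignored here
        responses.append(torch.softmax(old_head(shared(x_n)), dim=1))
    return torch.cat(responses)
```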

12 of 35

Training

- Weight decay of 0.0005

- Loss balance weight λo on the old-task (distillation) loss
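For reference, these hyperparameters enter the combined LwF objective, with R the weight-decay regularizer and λo the balance weight on the old-task loss (the losses L_old and L_new are defined on the next slides):

```latex
(\theta_s^*, \theta_o^*, \theta_n^*) \;\leftarrow\;
  \operatorname*{arg\,min}_{\hat\theta_s,\,\hat\theta_o,\,\hat\theta_n}
  \Big(
      \lambda_o \, \mathcal{L}_{old}\big(Y_o, \hat Y_o\big)
    + \mathcal{L}_{new}\big(Y_n, \hat Y_n\big)
    + \mathcal{R}\big(\hat\theta_s, \hat\theta_o, \hat\theta_n\big)
  \Big)
```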

13 of 35

Loss - new task

  • Multinomial logistic loss for multiclass classification:

    \mathcal{L}_{new}(y_n, \hat{y}_n) = - y_n \cdot \log \hat{y}_n

    where y_n is the one-hot ground-truth label vector and \hat{y}_n is the softmax output of the new network

  • Encourages the predictions to be consistent with the ground truth
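A minimal PyTorch sketch of this loss for single-label classification (multi-label datasets such as VOC would need a per-label variant); the function name, shapes, and example values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def new_task_loss(new_logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Multinomial logistic loss: cross-entropy between the softmax of the
    # new head's logits and the ground-truth class indices (one-hot y_n).
    return F.cross_entropy(new_logits, targets)

# Example: a batch of 4 samples over 5 new-task classes.
logits = torch.randn(4, 5)
labels = torch.tensor([0, 2, 1, 4])
print(new_task_loss(logits, labels))
```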

14 of 35

Loss - old task

  • Encourages the output probabilities \hat{y}_o to stay close to the outputs y_o recorded from the original network

  • Knowledge Distillation loss (used where one network approximates the output of another):

    \mathcal{L}_{old}(y_o, \hat{y}_o) = - \sum_{i=1}^{l} y_o'^{(i)} \log \hat{y}_o'^{(i)}

    where l is the number of labels, y_o'^{(i)} is the modified (temperature-scaled) version of the recorded y_o^{(i)}, and \hat{y}_o'^{(i)} is the modified version of the current \hat{y}_o^{(i)}:

    y_o'^{(i)} = \frac{(y_o^{(i)})^{1/T}}{\sum_j (y_o^{(j)})^{1/T}}, \qquad
    \hat{y}_o'^{(i)} = \frac{(\hat{y}_o^{(i)})^{1/T}}{\sum_j (\hat{y}_o^{(j)})^{1/T}}

    with temperature T (T = 2 in the paper)
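A hedged PyTorch sketch of this loss; `recorded_probs` stands for the stored softmax outputs y_o, and the temperature trick relies on softmax(logits / T) being the renormalized probabilities raised to 1/T. Names and example values are assumptions.

```python
import torch

def distillation_loss(old_logits: torch.Tensor,
                      recorded_probs: torch.Tensor,
                      T: float = 2.0,
                      eps: float = 1e-8) -> torch.Tensor:
    # Temperature-scale the recorded probabilities: y'^(i) proportional to (y^(i))^(1/T).
    y_prime = recorded_probs.pow(1.0 / T)
    y_prime = y_prime / y_prime.sum(dim=1, keepdim=True)
    # softmax(logits / T) equals raising the softmax to 1/T and renormalizing.
    y_hat_prime = torch.softmax(old_logits / T, dim=1)
    # Cross-entropy between the two temperature-scaled distributions.
    return -(y_prime * torch.log(y_hat_prime + eps)).sum(dim=1).mean()

# Example: 4 samples, 10 old-task classes.
old_logits = torch.randn(4, 10)
recorded = torch.softmax(torch.randn(4, 10), dim=1)
print(distillation_loss(old_logits, recorded))
```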

15 of 35

LwF:
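A minimal PyTorch-style sketch of one LwF update combining the two losses above; all module names, layer sizes, and hyperparameter values are illustrative assumptions rather than the paper’s exact AlexNet setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the LwF components (sizes are arbitrary, not AlexNet):
#   shared ~ theta_s,  old_head ~ theta_o,  new_head ~ theta_n
shared = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
old_head = nn.Linear(256, 10)   # old-task classifier
new_head = nn.Linear(256, 5)    # new-task classifier

params = list(shared.parameters()) + list(old_head.parameters()) + list(new_head.parameters())
opt = torch.optim.SGD(params, lr=1e-3, momentum=0.9, weight_decay=5e-4)

lambda_o, T = 1.0, 2.0          # assumed loss balance weight and distillation temperature

def lwf_step(x_new, y_new, y_o_recorded):
    """One LwF update on a batch of new-task data.
    y_o_recorded are the old-task softmax outputs recorded before training."""
    feats = shared(x_new)
    loss_new = F.cross_entropy(new_head(feats), y_new)

    # Knowledge-distillation loss against the recorded responses.
    y_prime = y_o_recorded.pow(1.0 / T)
    y_prime = y_prime / y_prime.sum(dim=1, keepdim=True)
    y_hat_prime = torch.softmax(old_head(feats) / T, dim=1)
    loss_old = -(y_prime * torch.log(y_hat_prime + 1e-8)).sum(dim=1).mean()

    loss = loss_new + lambda_o * loss_old   # weight decay is handled by SGD
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example call with random data: batch of 8 images, 5 new classes, 10 old classes.
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 5, (8,))
y_o_rec = torch.softmax(torch.randn(8, 10), dim=1)
print(lwf_step(x, y, y_o_rec))
```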

16 of 35

Implementation details:

  • CNN architecture - AlexNet (VGG for some experiments)
  • Initialisation of 𝛳n - Xavier initialisation (see the snippet after this list)
  • Training mechanism - uses the same preprocessing and training practices as AlexNet
  • What’s different - a smaller learning rate than usual is used for LwF training
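A small illustrative snippet for the 𝛳n initialization; the 4096 → 20 shape is only an assumed AlexNet-fc7-to-VOC example.

```python
import torch.nn as nn

# New-task head (theta_n): Xavier/Glorot initialization, zero bias.
new_head = nn.Linear(4096, 20)
nn.init.xavier_uniform_(new_head.weight)
nn.init.zeros_(new_head.bias)
```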


17 of 35

Methods being compared with

  • Fine-tuning
  • LFL (Less Forgetting Learning)
  • Fine-tuning FC
  • Feature extraction
  • Joint training (multitask learning)

18 of 35

Experiments

1) Single new task scenario - all classes of the new dataset are added at once

2) Multiple new task scenario - subgroups of a dataset are added one by one

3) Influence of dataset size - only a subset of the new task’s data is used

19 of 35

Experiments

Data

  • All the tasks are classification tasks
  • Old task → ImageNet or Places365
  • New tasks → Pascal VOC 2012 (‘VOC’), Caltech-UCSD Birds (‘CUB’), MIT indoor scene classification (‘Scenes’), MNIST

20 of 35

1- Single new task scenario

  • Compare the results of adding one new task, across different task pairs and different methods
  • Most results are reported on the validation set
  • For the test set, only Places365 → VOC is reported
  • VGG is used for some task pairs

21 of 35

22 of 35

Observations

  • Dissimilar new tasks (e.g., CUB, MNIST) degrade old-task performance more
  • For similar tasks, LwF performs well on most of the new tasks
  • While LwF retains notable performance on the old tasks, it is still outperformed there by joint training and feature extraction

23 of 35

Multiple new task scenario

Multiple groups of classes are added to the classification task, one group at a time.

For this experiment, the VOC dataset is split into subgroups (e.g., animals, places, rooms), which are added and trained on gradually.

The same procedure is applied to the Scenes dataset.

24 of 35

Multiple new task scenario

25 of 35

Observations

  • LwF outperforms fine-tuning, feature extraction, LFL, and fine-tuning FC on most newly added tasks
  • It underperforms joint training on the old tasks

26 of 35

3- Influence of dataset size

27 of 35

Design choices and alternatives

  • Choice of task-specific layers
    • Add more layers specific to the new task and train the model
    • This did not show any improvement
  • Network expansion
    • Expand the final fully connected layers with additional, randomly initialized hidden units to give the new task more capacity
  • Effect of a lower learning rate for the shared parameters (see the sketch after this list)
    • Simply lowering the learning rate of the shared parameters 𝛳s does not preserve old-task performance
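As a sketch of the “lower learning rate for 𝛳s” baseline (modules and values are assumptions), per-parameter-group learning rates can be set as below; the experiment’s point is that this alone, without the distillation loss, does not preserve old-task performance.

```python
import torch
import torch.nn as nn

shared = nn.Linear(4096, 4096)   # stand-in for theta_s
new_head = nn.Linear(4096, 20)   # stand-in for theta_n

optimizer = torch.optim.SGD(
    [
        {"params": shared.parameters(), "lr": 2e-4},    # lower LR for shared params
        {"params": new_head.parameters(), "lr": 2e-3},  # normal LR for the new head
    ],
    momentum=0.9,
    weight_decay=5e-4,
)
```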

28 of 35

29 of 35

30 of 35

Design choices and alternatives

  • L2 soft-constrained weights (see the sketch after this list)
    • Penalize deviation of the weights from their old-task values, acting as a regularizer
    • It was outperformed by LwF
  • Choice of response-preserving loss
    • Other alternatives were tried: cross-entropy, L1, L2
    • Of these, the knowledge distillation loss gave the best performance
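A hedged sketch of the L2 soft-constraint alternative (function and variable names are hypothetical): deviation of the current parameters from a snapshot taken before new-task training is penalized, which the authors found weaker than the distillation loss.

```python
import torch

def l2_soft_constraint(model: torch.nn.Module, anchor: dict, strength: float = 1e-3) -> torch.Tensor:
    """Penalize ||theta - theta_anchor||^2, where `anchor` holds copies of the
    shared/old parameters recorded before training on the new task."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        penalty = penalty + (p - anchor[name]).pow(2).sum()
    return strength * penalty

# Snapshot taken once, before new-task training:
# anchor = {n: p.detach().clone() for n, p in model.named_parameters()}
```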

31 of 35

32 of 35

Additional Experiments

Tracking with MDNet using LwF

Similar to the multiple-new-task classification setting seen earlier, LwF is applied to tracking, treating it as a classification task (MDNet).

Incrementally adding new objects to track is a real use case for the LwF technique.

33 of 35

Advantages and Disadvantages

Advantages :

  • Classification performance
  • Computational efficiency
  • Simplicity of deployment

Disadvantages :

  • If new-task images are very different from old-task images, performance on the old tasks degrades
  • If the distribution of images differs across tasks, performance may also decrease

34 of 35

Takeaways

  1. A simple idea showing that preserving the outputs for old tasks can counter catastrophic forgetting
  2. Across multiple settings, results are better when the tasks share some similarity, and poorer when they do not

35 of 35

Discussion