1 of 35

Learning Without Forgetting (LwF)

Presenters: Irene Tenison, Sai Aravind Sreeramadas

2 of 35

  • Introduction
    • Objective
    • LwF setting
    • Other relevant methods
  • Learning Without Forgetting (LwF)
    • Method
  • Experiments
  • Design choices
  • Summary
  • Discussion

3 of 35

Introduction - Objective

  • To learn new tasks by sharing parameters from old tasks, without suffering from catastrophic forgetting (performance degradation on old tasks as new tasks are learnt) and without access to the old tasks’ training data (which may be unrecorded, proprietary, or too cumbersome to store).

  • Classified as a regularization-based method for continual learning

4 of 35

LwF Setting

  • Old task data will not be available during training of the new task

  • Three sets of parameters:
    • 𝛳s - shared parameters
    • 𝛳o - task-specific parameters of previous tasks
    • 𝛳n - task-specific parameters of the new task

[Figure: network diagram with shared parameters 𝛳s feeding the task-specific heads 𝛳o and 𝛳n]

5 of 35

Relevant Methods - Feature Extraction

  • For each new task,
    • Pre-trained CNN extracts the features
    • Classifiers (from random initialization) are trained on these features

  • 𝛳s and 𝛳o are unchanged and 𝛳n is trained

  • Performance can be improved by fine tuning

6 of 35

Relevant Methods - Fine Tuning

  • For each new task,
    • The existing CNN is modified (trained with a low learning rate)
    • The output layer (randomly initialized) is trained

  • 𝛳o is unchanged; 𝛳s and 𝛳n are optimized during training

  • Variations: duplicating and fine-tuning, fine-tuning only the FC layers

7 of 35

Relevant Methods - Joint Training (Multitask learning)

  • All parameters (𝛳s, 𝛳o, and 𝛳n) are optimized during training

  • Requires data from all tasks to be available simultaneously

  • Serves as an upper bound for LwF’s performance

8 of 35

Comparison of Methods

9 of 35

LwF Goal

Given a CNN with shared parameters 𝛳s and task-specific parameters of previous tasks 𝛳o, the goal of LwF is to add task-specific parameters 𝛳n for the new task and to learn parameters (𝛳s, 𝛳o, 𝛳n) that work well on both the old and the new tasks, using data from the new task only.

10 of 35

Method

Starts With:

𝛳s and 𝛳o: shared parameters and task-specific parameters of the old tasks

Xn, Yn: data of the new task

[Figure: the original network, shared parameters 𝛳s with the old-task head 𝛳o]

11 of 35

Method

Starts With:

𝛳s and 𝛳o: shared parameters and task-specific parameters of the old tasks

Xn, Yn: data of the new task

Initialize:

Yo ← CNN(Xn; 𝛳s, 𝛳o): record the old network’s responses on the new-task images Xn

𝛳n ← randomly initialized parameters for the new task

[Figure: Xn passed through 𝛳s and the old-task head 𝛳o to produce the recorded responses Yo]
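A minimal PyTorch sketch of this recording step, assuming hypothetical stand-ins `shared`, `old_head`, and `new_task_loader`: the frozen old network is run once over the new-task images to store its old-task outputs Yo, which later serve as distillation targets.

```python
import torch

@torch.no_grad()
def record_old_responses(shared, old_head, new_task_loader):
    """Run the current (old) network over the new-task data Xn and store
    its old-task softmax outputs Yo. No old-task data is needed."""
    responses = []
    for x_n, _ in new_task_loader:   # the new-task labels Yn are ignored here
        responses.append(torch.softmax(old_head(shared(x_n)), dim=1))
    return torch.cat(responses)
```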

12 of 35

Training

- Weight decay of 0.0005

- Loss balance weight λo on the old-task (distillation) loss
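For reference, these hyperparameters enter the combined LwF objective, with R the weight-decay regularizer and λo the balance weight on the old-task loss (the losses L_old and L_new are defined on the next slides):

```latex
(\theta_s^*, \theta_o^*, \theta_n^*) \;\leftarrow\;
  \operatorname*{arg\,min}_{\hat\theta_s,\,\hat\theta_o,\,\hat\theta_n}
  \Big(
      \lambda_o \, \mathcal{L}_{old}\big(Y_o, \hat Y_o\big)
    + \mathcal{L}_{new}\big(Y_n, \hat Y_n\big)
    + \mathcal{R}\big(\hat\theta_s, \hat\theta_o, \hat\theta_n\big)
  \Big)
```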

13 of 35

Loss - new task

  • Multinomial logistic loss for multiclass classification:

    \mathcal{L}_{new}(y_n, \hat{y}_n) = - y_n \cdot \log \hat{y}_n

    where y_n is the one-hot ground-truth label vector and \hat{y}_n is the softmax output of the new network

  • Encourages the predictions to be consistent with the ground truth
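A minimal PyTorch sketch of this loss for single-label classification (multi-label datasets such as VOC would need a per-label variant); the function name, shapes, and example values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def new_task_loss(new_logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Multinomial logistic loss: cross-entropy between the softmax of the
    # new head's logits and the ground-truth class indices (one-hot y_n).
    return F.cross_entropy(new_logits, targets)

# Example: a batch of 4 samples over 5 new-task classes.
logits = torch.randn(4, 5)
labels = torch.tensor([0, 2, 1, 4])
print(new_task_loss(logits, labels))
```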

14 of 35

Loss - old task

  • Encourages the output probabilities \hat{y}_o to stay close to the outputs y_o recorded from the original network

  • Knowledge Distillation loss (used where one network approximates the output of another):

    \mathcal{L}_{old}(y_o, \hat{y}_o) = - \sum_{i=1}^{l} y_o'^{(i)} \log \hat{y}_o'^{(i)}

    where l is the number of labels, y_o'^{(i)} is the modified (temperature-scaled) version of the recorded y_o^{(i)}, and \hat{y}_o'^{(i)} is the modified version of the current \hat{y}_o^{(i)}:

    y_o'^{(i)} = \frac{(y_o^{(i)})^{1/T}}{\sum_j (y_o^{(j)})^{1/T}}, \qquad
    \hat{y}_o'^{(i)} = \frac{(\hat{y}_o^{(i)})^{1/T}}{\sum_j (\hat{y}_o^{(j)})^{1/T}}

    with temperature T (T = 2 in the paper)
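A hedged PyTorch sketch of this loss; `recorded_probs` stands for the stored softmax outputs y_o, and the temperature trick relies on softmax(logits / T) being the renormalized probabilities raised to 1/T. Names and example values are assumptions.

```python
import torch

def distillation_loss(old_logits: torch.Tensor,
                      recorded_probs: torch.Tensor,
                      T: float = 2.0,
                      eps: float = 1e-8) -> torch.Tensor:
    # Temperature-scale the recorded probabilities: y'^(i) proportional to (y^(i))^(1/T).
    y_prime = recorded_probs.pow(1.0 / T)
    y_prime = y_prime / y_prime.sum(dim=1, keepdim=True)
    # softmax(logits / T) equals raising the softmax to 1/T and renormalizing.
    y_hat_prime = torch.softmax(old_logits / T, dim=1)
    # Cross-entropy between the two temperature-scaled distributions.
    return -(y_prime * torch.log(y_hat_prime + eps)).sum(dim=1).mean()

# Example: 4 samples, 10 old-task classes.
old_logits = torch.randn(4, 10)
recorded = torch.softmax(torch.randn(4, 10), dim=1)
print(distillation_loss(old_logits, recorded))
```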

15 of 35

LwF:
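A minimal PyTorch-style sketch of one LwF update combining the two losses above; all module names, layer sizes, and hyperparameter values are illustrative assumptions rather than the paper’s exact AlexNet setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the LwF components (sizes are arbitrary, not AlexNet):
#   shared ~ theta_s,  old_head ~ theta_o,  new_head ~ theta_n
shared = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
old_head = nn.Linear(256, 10)   # old-task classifier
new_head = nn.Linear(256, 5)    # new-task classifier

params = list(shared.parameters()) + list(old_head.parameters()) + list(new_head.parameters())
opt = torch.optim.SGD(params, lr=1e-3, momentum=0.9, weight_decay=5e-4)

lambda_o, T = 1.0, 2.0          # assumed loss balance weight and distillation temperature

def lwf_step(x_new, y_new, y_o_recorded):
    """One LwF update on a batch of new-task data.
    y_o_recorded are the old-task softmax outputs recorded before training."""
    feats = shared(x_new)
    loss_new = F.cross_entropy(new_head(feats), y_new)

    # Knowledge-distillation loss against the recorded responses.
    y_prime = y_o_recorded.pow(1.0 / T)
    y_prime = y_prime / y_prime.sum(dim=1, keepdim=True)
    y_hat_prime = torch.softmax(old_head(feats) / T, dim=1)
    loss_old = -(y_prime * torch.log(y_hat_prime + 1e-8)).sum(dim=1).mean()

    loss = loss_new + lambda_o * loss_old   # weight decay is handled by SGD
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example call with random data: batch of 8 images, 5 new classes, 10 old classes.
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 5, (8,))
y_o_rec = torch.softmax(torch.randn(8, 10), dim=1)
print(lwf_step(x, y, y_o_rec))
```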

16 of 35

Implementation details:

  • CNN architecture - AlexNet (VGG for some experiments)
  • Initialisation of 𝛳n - Xavier initialisation (see the snippet after this list)
  • Training mechanism - uses the same preprocessing and training practices as AlexNet
  • What’s different - a smaller learning rate than usual is used for LwF training
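A small illustrative snippet for the 𝛳n initialization; the 4096 → 20 shape is only an assumed AlexNet-fc7-to-VOC example.

```python
import torch.nn as nn

# New-task head (theta_n): Xavier/Glorot initialization, zero bias.
new_head = nn.Linear(4096, 20)
nn.init.xavier_uniform_(new_head.weight)
nn.init.zeros_(new_head.bias)
```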


17 of 35

Methods being compared with

  • Fine-tuning
  • LFL (Less Forgetting Learning)
  • Fine-tuning FC
  • Feature extraction
  • Joint training (multitask learning)

18 of 35

Experiments

1) Single new task scenario - all classes of the new dataset are added at once

2) Multiple new task scenario - subgroups of a dataset are added one by one

3) Influence of dataset size - only a subset of the new task’s data is used

19 of 35

Experiments

Data

  • All the tasks are classification tasks
  • Old task → ImageNet or Places365
  • New tasks → Pascal VOC 2012 (‘VOC’), Caltech-UCSD Birds (‘CUB’), MIT indoor scene classification (‘Scenes’), MNIST

20 of 35

1- Single new task scenario

  • Compare the results of adding one new task, across different task pairs and different methods
  • Most results are reported on the validation set
  • For the test set, only Places365 → VOC is reported
  • VGG is used for some task pairs

21 of 35

22 of 35

Observations

  • Dissimilar new tasks (e.g., CUB, MNIST) degrade old-task performance more
  • For similar tasks, LwF performs well on most of the new tasks
  • While LwF retains notable performance on the old tasks, it is still outperformed there by joint training and feature extraction

23 of 35

Multiple new task scenario

Multiple groups of classes are added to the classification task, one group at a time.

For this experiment, the VOC dataset is split into subgroups (e.g., animals, places, rooms), which are added and trained on gradually.

The same procedure is applied to the Scenes dataset.

24 of 35

Multiple new task scenario

25 of 35

Observations

  • LwF outperforms fine-tuning, feature extraction, LFL, and fine-tuning FC on most newly added tasks
  • It underperforms joint training on the old tasks

26 of 35

3- Influence of dataset size

27 of 35

Design choices and alternatives

  • Choice of task-specific layers
    • Add more layers specific to the new task and train the model
    • This did not show any improvement
  • Network expansion
    • Expand the final fully connected layers with additional, randomly initialized hidden units to give the new task more capacity
  • Effect of a lower learning rate for the shared parameters (see the sketch after this list)
    • Simply lowering the learning rate of the shared parameters 𝛳s does not preserve old-task performance
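As a sketch of the “lower learning rate for 𝛳s” baseline (modules and values are assumptions), per-parameter-group learning rates can be set as below; the experiment’s point is that this alone, without the distillation loss, does not preserve old-task performance.

```python
import torch
import torch.nn as nn

shared = nn.Linear(4096, 4096)   # stand-in for theta_s
new_head = nn.Linear(4096, 20)   # stand-in for theta_n

optimizer = torch.optim.SGD(
    [
        {"params": shared.parameters(), "lr": 2e-4},    # lower LR for shared params
        {"params": new_head.parameters(), "lr": 2e-3},  # normal LR for the new head
    ],
    momentum=0.9,
    weight_decay=5e-4,
)
```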

28 of 35

29 of 35

30 of 35

Design choices and alternatives

  • L2 soft-constrained weights (see the sketch after this list)
    • Penalize deviation of the weights from their old-task values, acting as a regularizer
    • It was outperformed by LwF
  • Choice of response-preserving loss
    • Other alternatives were tried: cross-entropy, L1, L2
    • Of these, the knowledge distillation loss gave the best performance
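A hedged sketch of the L2 soft-constraint alternative (function and variable names are hypothetical): deviation of the current parameters from a snapshot taken before new-task training is penalized, which the authors found weaker than the distillation loss.

```python
import torch

def l2_soft_constraint(model: torch.nn.Module, anchor: dict, strength: float = 1e-3) -> torch.Tensor:
    """Penalize ||theta - theta_anchor||^2, where `anchor` holds copies of the
    shared/old parameters recorded before training on the new task."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        penalty = penalty + (p - anchor[name]).pow(2).sum()
    return strength * penalty

# Snapshot taken once, before new-task training:
# anchor = {n: p.detach().clone() for n, p in model.named_parameters()}
```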

31 of 35

32 of 35

Additional Experiments

Tracking with MDNet using LwF

Similar to the multiple-new-task classification setting seen earlier, LwF is applied to tracking, treating it as a classification task (MDNet).

Incrementally adding new objects to track is a real use case for the LwF technique.

33 of 35

Advantages and Disadvantages

Advantages :

  • Classification performance
  • Computational efficiency
  • Simplicity of deployment

Disadvantages :

  • If new-task images are very different from old-task images, performance on the old tasks degrades
  • If the distribution of images differs across tasks, performance may also decrease

34 of 35

Takeaways

  1. A simple idea showing that preserving the outputs for old tasks can counter catastrophic forgetting
  2. Across multiple settings, results are better when the tasks share some similarity, and poorer when they do not

35 of 35

Discussion