Using Hindsight to Anchor Past Knowledge in Continual Learning

Arslan Chaudhry, Albert Gordo, Puneet K. Dokania, Philip Torr, David Lopez-Paz

Presented by: Arian Khorasani, Nader Asadi

Outline

  • Introduction and Background
  • Method
  • Experiments
  • Conclusion
  • Discussion and Questions

Introduction and Background

Online Class Incremental Learning

  • The model is trained from a stream of data
    • Different classes occur at different times
  • At all times during the lifetime of the model, predictions on examples from any task may be requested.
  • Bounded computational requirements and memory footprint.
  • The learner is allowed only a single pass over the data (see the sketch below).
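
A minimal sketch of this protocol in PyTorch, assuming a classifier model and an iterable stream of (x, y, task_id) minibatches (all names here are illustrative):

```python
import torch
import torch.nn.functional as F

def train_online(model, stream, lr=0.1):
    """Single pass over the stream: every minibatch is seen exactly once."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y, task_id in stream:      # classes/tasks arrive sequentially
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()                    # no second pass, no growing footprint
    return model
```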

State-of-the-Art Solutions

  • Regularization-based
    • Reduce forgetting by restricting updates to the model parameters that were important for previous tasks.
    • When the number of tasks is large, the regularization from past tasks becomes obsolete, leading to representation drift.
  • Modular approaches
    • Add new modules to the learner as new tasks are learned.
    • The memory complexity of these approaches scales with the number of tasks.
  • Memory-based
    • Store a few examples from past tasks in an “episodic memory”, to be revisited when training on a new task (a common buffering scheme is sketched below).
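
One standard way to fill such a memory under the single-pass constraint is reservoir sampling, which maintains a uniform sample of the stream in fixed space. A minimal sketch (illustrative, not the authors' exact implementation):

```python
import random

class EpisodicMemory:
    """Fixed-capacity buffer maintained with reservoir sampling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []   # stored (x, y, task_id) examples
        self.n_seen = 0    # total stream examples observed so far

    def add(self, example):
        self.n_seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            # keep each seen example with probability capacity / n_seen
            j = random.randrange(self.n_seen)
            if j < self.capacity:
                self.buffer[j] = example

    def sample(self, k):
        return random.sample(self.buffer, min(k, len(self.buffer)))
```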

Motivation and Contributions

  • Since memory is usually very small, the performance of the predictor becomes sensitive to the choice of samples stored in the episodic memory.
  • HAL (Hindsight Anchor Learning) addresses this issue with synthetically constructed anchors that maximize the forgetting loss on the current task.
  • Synthetic anchors lie close to the model’s decision boundary; keeping predictions invariant on such anchors preserves performance on past tasks.

Continual Learning Setup

Minimize the following multi-task error:

$\min_\theta \; \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x,y) \sim P_t} \big[ \ell\big(f(x, t; \theta), y\big) \big],$

where each datapoint in the stream is experienced only once.

Evaluation statistics, with $a_{i,j}$ denoting the test accuracy on task $j$ after training on task $i$ (both are computed in the sketch below):

  • Final average accuracy: $A_T = \frac{1}{T} \sum_{j=1}^{T} a_{T,j}$
  • Final maximum forgetting: $F_T = \frac{1}{T-1} \sum_{j=1}^{T-1} \max_{l \in \{1, \dots, T-1\}} \big(a_{l,j} - a_{T,j}\big)$
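
Both statistics can be read off the matrix of per-task test accuracies. A minimal sketch, assuming acc[i, j] stores the accuracy on task j after training on task i:

```python
import numpy as np

def final_metrics(acc):
    """acc: (T, T) array, acc[i, j] = accuracy on task j after task i."""
    avg_accuracy = acc[-1].mean()                  # A_T
    # For each earlier task, forgetting is its best past accuracy
    # minus its accuracy at the end of training.
    forgetting = (acc[:-1, :-1].max(axis=0) - acc[-1, :-1]).mean()  # F_T
    return avg_accuracy, forgetting
```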

Method

Replay Problems

Experience replay performs the update

$\theta \leftarrow \theta - \alpha \nabla_\theta \, \ell\big(\theta, B_t \cup B_{\mathcal{M}}\big),$

where $\mathcal{M}$ is the episodic memory, $B_t$ a new minibatch observation from the current task, and $B_{\mathcal{M}} \sim \mathcal{M}$ a minibatch sampled from the memory.

Problems with replay:

  • The replay memory revisits only a small amount of data; behaviour outside of the replay samples is not guaranteed.
  • With small memories, the choice of samples to store becomes more important (the update rule above is sketched below).
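
For reference, a minimal sketch of the ER update above in PyTorch (er_step is a hypothetical helper; batches are (inputs, labels) pairs, with the memory batch drawn from a buffer like the one sketched earlier):

```python
import torch
import torch.nn.functional as F

def er_step(model, opt, batch_t, batch_mem):
    """One ER update on the union of current and memory minibatches."""
    x = torch.cat([batch_t[0], batch_mem[0]])
    y = torch.cat([batch_t[1], batch_mem[1]])
    loss = F.cross_entropy(model(x), y)   # l(theta, B_t ∪ B_M)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```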

HAL

First, let’s assume that we have the anchor points $e_{t'}$ (one per class for each past task $t' < t$), which are synthetically generated samples that maximize forgetting on their task. Then HAL performs the two-step update:

  1. $\tilde{\theta} \leftarrow \theta - \alpha \nabla_\theta \, \ell\big(\theta, B_t \cup B_{\mathcal{M}}\big)$
  2. $\theta \leftarrow \theta - \alpha \nabla_\theta \Big[ \ell\big(\theta, B_t \cup B_{\mathcal{M}}\big) + \lambda \sum_{t' < t} \big( f(e_{t'}, t'; \theta) - f(e_{t'}, t'; \tilde{\theta}) \big)^2 \Big]$

Equations:

  1. The usual experience replay parameter update over a minibatch of the current task and the memory, applied only tentatively.
  2. Trades off the minimization of:
    1. The loss value at the current minibatch and the episodic memory.
    2. Changes in predictions at the anchor points of all past tasks.

The second rule not only updates the network, but also reduces forgetting on the synthetic anchors (see the sketch below).
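
A minimal sketch of this two-step rule in PyTorch, assuming single-head outputs for brevity; batch_t and batch_mem are (inputs, labels) pairs, anchors is a list of anchor input tensors, and lam and lr are illustrative hyperparameters:

```python
import copy
import torch
import torch.nn.functional as F

def hal_step(model, opt, batch_t, batch_mem, anchors, lam=0.1, lr=0.1):
    """Two-step HAL update: tentative ER step, then anchored ER step."""
    x = torch.cat([batch_t[0], batch_mem[0]])
    y = torch.cat([batch_t[1], batch_mem[1]])

    # Step 1: tentative ER update on a throwaway copy (gives theta_tilde).
    tmp = copy.deepcopy(model)
    tmp_opt = torch.optim.SGD(tmp.parameters(), lr=lr)
    tmp_opt.zero_grad()
    F.cross_entropy(tmp(x), y).backward()
    tmp_opt.step()

    # Step 2: ER loss plus a penalty on prediction drift at the anchors.
    er_loss = F.cross_entropy(model(x), y)
    drift = 0.0
    for e in anchors:
        with torch.no_grad():
            target = tmp(e)               # f(e; theta_tilde)
        drift = drift + ((model(e) - target) ** 2).sum()
    opt.zero_grad()
    (er_loss + lam * drift).backward()
    opt.step()
```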

HAL - Anchor Generation

Ideal anchor generation:

$e_t \leftarrow \arg\max_{e} \; \ell\big(f(e, t; \theta_T), y_t\big) - \ell\big(f(e, t; \theta_t), y_t\big),$

where $\theta_t$ is the parameter vector obtained after training on task $t$, and $\theta_T$ is the final parameter vector after training on all $T$ tasks.

However, the above approach requires:

  1. Access to the entire distribution of task $t$ to compute the maximization.
  2. Access to all future task distributions to compute the final parameter vector $\theta_T$.

HAL - Anchor Generation

Problem 1: Access to the entire dataset of task t

  • Learn the desired anchor by initializing it at random and applying $k$ gradient-ascent updates on the loss for a given label.

Problem 2: Access to $\theta_T$.

  • Instead of measuring the forgetting that would happen after the model is trained on future tasks, we measure the forgetting that happens when the model is fine-tuned on past tasks, i.e., on the episodic memory (see the sketch below).
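
Putting the two fixes together, anchor construction might look like the following sketch (all names illustrative; the paper additionally keeps anchors close to the task's mean embedding, omitted here for brevity):

```python
import copy
import torch
import torch.nn.functional as F

def build_anchor(model, memory_loader, y_anchor, input_shape,
                 k=100, eta=0.1, ft_lr=0.01):
    """Hindsight anchor for the task that just finished (sketch).

    1) Fine-tune a copy of the model on the episodic memory, simulating
       the drift that future training would cause (Problem 2).
    2) Start from a random input and take k gradient-ascent steps to
       maximize its loss under the fine-tuned model (Problem 1).
    """
    ft = copy.deepcopy(model)
    opt = torch.optim.SGD(ft.parameters(), lr=ft_lr)
    for x, y in memory_loader:            # one pass over the memory
        opt.zero_grad()
        F.cross_entropy(ft(x), y).backward()
        opt.step()

    e = torch.randn(1, *input_shape, requires_grad=True)
    for _ in range(k):
        loss = F.cross_entropy(ft(e), y_anchor)
        grad, = torch.autograd.grad(loss, e)
        with torch.no_grad():
            e += eta * grad               # ascend: maximize forgetting
    return e.detach()
```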

Offline HAL (?)

What if we did not have the constraint of doing only a single pass over the data?

  • Then we would not need to use past data to simulate forgetting in the future.

Possible approach (similar to [1]):

  1. Train an auxiliary network, f(t), on the data of task t.
  2. Use f(t) and f(t-1), a snapshot of the model after task t-1, to compute anchors for f(t-1).
  3. Use the memory buffer, the data of task t, and the computed anchors to train f(t).

References:

  1. Liu, Yaoyao, et al. "Mnemonics training: Multi-class incremental learning without forgetting." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

Experiments

Datasets and tasks used in these experiments:

  • Permuted MNIST
    • Contains 23 tasks, each with 1000 samples from 10 classes.
  • Rotated MNIST
    • Contains 23 tasks, each with 1000 samples from 10 classes.
  • Split CIFAR
    • Contains 23 tasks, each with 1000 samples from 5 classes.
  • Split miniImageNet
    • Contains 20 tasks, each with 250 samples from 5 classes.

Baselines

The following baselines are compared to the model:

  • Fine-Tune
  • iCaRL
  • EWC
  • VCL
  • AGEM
  • MER

Experiments

[Eight slides of results plots were shown here; the figures did not survive the export.]

Conclusion

  • Introduced a bilevel optimization objective for continual learning.
  • Introduced one anchor point per class per task.
  • Anchors are learned via gradient-based optimization.
  • HAL improves the performance of experience-replay-based CL.
  • Achieved a new state of the art on four standard CL benchmarks.

Questions and Discussion

Final Reviews and Discussion

Is simple backprop a good way to generate anchors?

  • The generated images do not fall within the natural distribution of images.
  • Extensive training on such samples can diminish the performance of the model.

There are several approaches to alleviate this issue:

  • Use image-prior regularization such as total variation (a minimal sketch follows the references below).
    • This improves the results, but the samples still do not look natural.
    • We could also exploit the feature statistics stored in batch-norm layers as a prior [1].
  • Use a generative model, such as a VAE, and add perturbations in the latent space of the generative model instead of the input space.

References:

  • Yin, Hongxu, et al. "Dreaming to Distill: Data-Free Knowledge Transfer via DeepInversion." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
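
For concreteness, a minimal sketch of the total-variation penalty mentioned above (a standard formulation, not tied to the HAL paper):

```python
import torch

def total_variation(img):
    """Simple TV penalty for a batch of images of shape (B, C, H, W)."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dh + dw

# During anchor generation one would subtract a weighted TV term from the
# objective being maximized, e.g.:
#   loss = F.cross_entropy(ft(e), y_anchor) - tv_weight * total_variation(e)
```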

Questions

Marie St-Laurent

Interesting approach, and it seems to work really well! I feel that the authors downplay how similar their approach is to Maximally Interfered Retrieval (MIR) (https://arxiv.org/pdf/1908.04742.pdf), possibly to play up the novelty of their own technique: the differences seem to be nuances, while the overall idea of protecting the points most likely to be forgotten once a new task is learned is similar across the two approaches. Could you discuss these similarities and differences, and possibly address why HAL performs better than MIR? Thanks! (Hope it's not too much to ask!)

Guillaume Lam

The last paragraph of their third section makes it seem as if HAL would have quite an advantage over MIR. However, their results seem to show that they haven't distanced themselves that much from MER, ER-Ring, and MIR. Would you say that HAL is addressing an issue that isn't addressed by the others, or addressing a certain issue better? Thank you!

Questions

Max Schwarzer

Looking over fig. 2 we can see that there are many class boundaries that have no anchor points. Is there a reason to only use a single anchor point per class, and would it be straightforward to modify the algorithm to use multiple? The total effective "replay" size would still be tiny even with two or three per class per task.

Pierre-André Brousseau

This was an interesting read. The paper does not actually show what the constructed anchor samples look like. For a given example, like Split CIFAR, are the anchor points still images that a human would recognize and find useful?

Questions

Reza

I am wondering what you think the influence of the episodic memory size is when it comes to approximating the future by simulating the past. I am thinking that, when we look at CL at scale and expand the size of the episodic memory (more past data), this method would reach a plateau or become less effective, because effectively we would encourage it to overfit on the past data?

Darshan Patil

How does using more than one anchor affect performance?

Yusong Wu

Does building e_t by applying Eq. 9 k times require a lot of computation? Also, I notice that the anchor building in HAL happens after the task ends, so is it technically only suitable for task-incremental learning, and will it not work for class-incremental learning?
