Using Hindsight to Anchor Past Knowledge in Continual Learning
Arslan Chaudhry, Albert Gordo, Puneet K. Dokania, Philip Torr, David Lopez-Paz
Presented by: Arian Khorasani, Nader Asadi
Outline
Introduction and Background
Online Class Incremental Learning
State-of-the-Art Solutions
Motivation and Contributions
Continual Learning Setup
Minimize the following multi-task error:

$$\min_{\theta} \; \sum_{t=1}^{T} \mathbb{E}_{(x, y) \sim \mathcal{D}_t} \left[ \ell\big(f(x; \theta), y\big) \right],$$

where each datapoint in the stream is experienced only once.
Evaluation statistics: average accuracy after training on the final task, and forgetting (the drop from a task's best past accuracy to its final accuracy).
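As a hedged illustration (variable names are mine, not the paper's), average accuracy and forgetting can both be computed from a matrix `acc[i][j]` holding the test accuracy on task `j` after training on task `i`:

```python
def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the last task."""
    T = len(acc)
    return sum(acc[T - 1]) / T

def forgetting(acc):
    """Mean drop from a task's best past accuracy to its final accuracy."""
    T = len(acc)
    drops = []
    for j in range(T - 1):  # the last task cannot have been forgotten yet
        best = max(acc[i][j] for i in range(T - 1))
        drops.append(best - acc[T - 1][j])
    return sum(drops) / len(drops)

# acc[i][j]: accuracy on task j after training task i (toy numbers)
acc = [
    [0.9, 0.0, 0.0],
    [0.8, 0.9, 0.0],
    [0.7, 0.8, 0.9],
]
print(round(average_accuracy(acc), 3))  # 0.8
print(round(forgetting(acc), 3))        # 0.15
```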
Method
Replay Problems
$$\theta \leftarrow \theta - \alpha \nabla_\theta \, \ell\big(\theta; B_t \cup B_{\mathcal{M}}\big),$$

where $\mathcal{M}$ is the episodic memory, $B_t$ a new minibatch observation, and $B_{\mathcal{M}} \sim \mathcal{M}$ a minibatch sampled from memory.
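As a toy illustration of such a replay step (the memory policy, model, and names here are mine, not the paper's implementation), using reservoir sampling to fill the episodic memory and a scalar parameter for the model:

```python
import random

def reservoir_update(memory, capacity, example, seen):
    """Reservoir sampling: memory holds a uniform sample of the stream."""
    if len(memory) < capacity:
        memory.append(example)
    else:
        j = random.randint(0, seen - 1)
        if j < capacity:
            memory[j] = example

def er_step(theta, batch, memory, grad_fn, lr=0.1, mem_bs=2):
    """One ER update: gradient on B_t together with B_M sampled from M."""
    b_mem = random.sample(memory, min(mem_bs, len(memory))) if memory else []
    joint = batch + b_mem
    return theta - lr * grad_fn(theta, joint)

random.seed(0)
mem = []
for i, x in enumerate([1.0, 2.0, 3.0, 4.0, 5.0]):
    reservoir_update(mem, 3, x, i + 1)

# Toy objective: estimate the mean, loss = mean of (theta - x)^2
grad = lambda th, data: sum(2 * (th - x) for x in data) / len(data)
theta = er_step(0.0, [1.0], mem, grad)
print(len(mem), theta)
```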
Problems with replay:
HAL
First, let’s assume that we have the anchor points $e_\tau$, which are synthetically generated samples that maximize forgetting on task $\tau$. Then:
Equations:

$$\tilde{\theta} \leftarrow \theta - \alpha \nabla_\theta \, \ell\big(\theta; B_t \cup B_{\mathcal{M}}\big)$$

$$\theta \leftarrow \theta - \alpha \nabla_\theta \left[ \ell\big(\theta; B_t \cup B_{\mathcal{M}}\big) + \lambda \sum_{\tau < t} \big( f(e_\tau; \theta) - f(e_\tau; \tilde{\theta}) \big)^2 \right]$$

The second rule not only updates the network on the current and memory minibatches, but also reduces forgetting on the synthetic anchors by penalizing any change in the model's outputs on them.
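The two update rules can be made concrete with a toy 1-D linear model $f(x; w) = w \cdot x$ under squared loss (all names and the model itself are illustrative, not the paper's network):

```python
def f(w, x):
    """Toy 1-D model."""
    return w * x

def loss_grad(w, batch):
    """d/dw of the mean squared error over a batch of (x, y) pairs."""
    return sum(2 * (f(w, x) - y) * x for x, y in batch) / len(batch)

def hal_step(w, batch, mem_batch, anchors, alpha=0.1, lam=0.1):
    joint = batch + mem_batch
    # Rule 1: tentative parameters after one step on B_t union B_M.
    w_tmp = w - alpha * loss_grad(w, joint)
    # Rule 2: same gradient plus a penalty on how much the model's
    # output on each anchor e changes between w and w_tmp.
    anchor_grad = sum(2 * (f(w, e) - f(w_tmp, e)) * e for e in anchors)
    return w - alpha * (loss_grad(w, joint) + lam * anchor_grad)

w0, batch = 1.0, [(1.0, 2.0)]
print(hal_step(w0, batch, [], []))     # no anchors: plain SGD step
print(hal_step(w0, batch, [], [0.5]))  # anchor term slightly shifts the step
```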
HAL - Anchor Generation
Ideal anchor generation (gradient ascent on the forgetting of task $t$):

$$e_t \leftarrow e_t + \gamma \nabla_{e_t} \left[ \ell\big(f(e_t; \theta_T), y_t\big) - \ell\big(f(e_t; \theta_t), y_t\big) \right],$$

where $\theta_t$ is the parameter vector obtained after training on task $t$, and $\theta_T$ the one obtained after training on all $T$ tasks.
However, the above approach requires:
HAL - Anchor Generation
Problem 1: Access to the entire dataset of task $t$.
Problem 2: Access to $\theta_T$, the parameters that only exist after training on all future tasks.
HAL
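Both requirements are side-stepped by simulating the future with a model fine-tuned on the episodic memory, plus a term keeping anchors close to the mean embedding of their task. A toy 1-D sketch of this style of anchor update (every specific here, including the linear model, is illustrative rather than the paper's exact update):

```python
def anchor_step(e, y, w_future, mean_emb, gamma=0.05, lam=0.1):
    """One gradient-ascent step on a 1-D anchor e with label y.

    Ascends the squared loss of a 'future' model w_future (standing in
    for a copy of the network fine-tuned on the episodic memory) while
    a quadratic term keeps e near the task's running mean embedding.
    """
    loss_grad_e = 2 * (w_future * e - y) * w_future  # d/de (w_future*e - y)^2
    reg_grad_e = 2 * (e - mean_emb)                  # d/de (e - mean_emb)^2
    return e + gamma * (loss_grad_e - lam * reg_grad_e)

# Starting from e = 1.0, the anchor moves to increase the future loss.
e0 = 1.0
e1 = anchor_step(e0, y=0.0, w_future=1.0, mean_emb=1.0)
print(e0, e1)
```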
Offline HAL (?)
What if we did not have the constraint of doing only a single pass over the data?
Possible approach (similar to [1]):
References:
Experiments
Datasets and tasks used in these experiments:
Baselines
The following baselines are compared to the model:
Conclusion
Questions and Discussion
Final Reviews and Discussion
Is simple backprop a good way to generate anchors?
Final Reviews and Discussion
There are several approaches to alleviate this issue:
Questions
Marie St-Laurent
Interesting approach, and it seems to work really well! I feel that the authors downplay how similar their approach is to Maximally Interfered Retrieval (MIR) https://arxiv.org/pdf/1908.04742.pdf to play up the novelty of their own technique, possibly (differences seem to be nuances, while the overall idea to protect points most likely to be forgotten after a new task is learned is similar across their approaches). Could you discuss these similarities/differences, and possibly address why HAL performs better than MIR? Thanks! (hope it's not too much to ask!!)
Guillaume Lam
The last paragraph of their third section makes it seem as if HAL would have quite an advantage over MIR. However, their results seem to show that they haven't distanced themselves that much from MER, ER-Ring, and MIR. Would you say that HAL is addressing an issue that isn't addressed by the others, or addressing a certain issue better? Thank you!
Questions
Max Schwarzer
Looking over fig. 2 we can see that there are many class boundaries that have no anchor points. Is there a reason to only use a single anchor point per class, and would it be straightforward to modify the algorithm to use multiple? The total effective "replay" size would still be tiny even with two or three per class per task.
Pierre-André Brousseau
This was an interesting read. The paper does not show practically what the built anchor samples are. For a given example, like Split Cifar, are the anchor points still images that a human would recognize and find useful?
Questions
Reza
I am wondering what you think the influence of the episodic memory size is when it comes to approximating the future by simulating the past. I am thinking that when we look at CL at scale and expand the size of the episodic memory (more past data), would this method reach a plateau or become less effective, because effectively we would encourage it to over-fit on the past data?
Darshan Patil
How does using more than one anchor affect performance?
Yusong Wu
Does building e_t by applying Eq. 9 k times require a lot of computation? Also, I notice that the anchor building in HAL is computed after the task ends, so is it technically only suitable for task-incremental learning, and will it not work for class-incremental learning?