September 16th 2024
Lifelong Continual Learning in Dynamic Environments: a Tutorial
Matteo Tiezzi
Motivations and Definitions
Motivations and definitions
Evolution has empowered humans with strong adaptability to continually acquire, update, accumulate and exploit knowledge
Image Credits: economist.com
Non-stationarity vs. offline learning
This contrasts with common Machine Learning assumptions:
This is rarely true!
The “data-centric” AI era as a static paradigm: Batch/Offline Learning
Assumption: Data sampled from a stationary (unknown) distribution
Static paradigm: Batch/Offline Learning
Retraining GPTs can cost up to millions of euros
Data distribution shift
Training costs
Motivations and definitions
Evolution has empowered humans with strong adaptability to continually acquire, update, accumulate and exploit knowledge
What is Lifelong/Continual Learning?
Why do we need Continual Learning?
Data privacy and storage
Data privacy, copyright, accountability, liability…
Continual Learning: Can we design models that do not need to store data collections, but learn online/on the fly?
Gradient Descent (GD) biological plausibility: humans don’t learn well from randomly sampled data
Motivations and definitions: Continual Lifelong Learning
Continual Learning goal: learn from a stream of data (usually simplified as Tasks) incrementally and behave as if all the samples were observed simultaneously.
The CLEAR Benchmark: Continual LEArning on Real-World Imagery, NeurIPS 2021
SplitMNIST dataset, Task-Incremental setting
From Offline to Continual/Lifelong Learning with SGD
Accuracy Metric
Catastrophic forgetting
Learn one class at a time
(without past classes)
Only the class from the current Task is correctly predicted!
Task 1
Task 2
Assumption: Data sampled from a fixed (unknown) distribution
What could be gained by adopting Continual Learning?
G. Pasquale, et al. "Teaching iCub to recognize objects using deep Convolutional Neural Networks." Machine Learning for Interactive Systems. PMLR, 2015.
G. Pasquale, C. Ciliberto, L. Rosasco and L. Natale, "Object identification from few examples by improving the invariance of a Deep Convolutional Neural Network," 2016 IROS
On the Fly Object Recognition on the R1, Your Personal Humanoid - IIT
Why do we need Continual Learning: a Practical Example and some keywords
Other domains: Continual Learning in Graph Representation Learning
Yuan, Qiao, et al. "Continual graph learning: A survey." arXiv preprint arXiv:2301.12230 (2023).
Zhang, Xikun, Dongjin Song, and Dacheng Tao. "Continual Learning on Graphs: Challenges, Solutions, and Opportunities." arXiv preprint arXiv:2402.11565 (2024).
Nomenclature and related paradigms
As content is provided incrementally over a lifetime, continual learning is also referred to as lifelong learning or incremental learning in much of the literature, without a strict distinction
Unconsolidated Nomenclature
Related Paradigms
The most general notion
Lifelong Learning: A longstanding goal
Image Credits to V. Lomonaco and M. Mundt (see last slide)
CL: Many surveys on the topic
A hot topic - Conference on Lifelong Learning Agents (3rd edition) - lifelong-ml.cc
Desiderata, metrics, settings
Continual Learning desiderata and constraints
Many desiderata, often competing with each other:
Continual Learning desiderata and constraints (Cont.)
Minimizing catastrophic forgetting and interference.
Illustration of catastrophic forgetting
Credits: Hadsell, Raia, et al. "Embracing change: Continual learning in deep neural networks." Trends in cognitive sciences 24.12 (2020): 1028-1040.
Illustration of the perfect learning dynamics
Continual Learning desiderata and constraints (Cont.)
Minimizing catastrophic forgetting and interference.
Illustration of catastrophic forgetting
Notation: Evaluation Metrics
Notation: Metrics
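As a compact reference for the notation used in the following slides (following the convention of the Wang et al. survey cited throughout; exact definitions vary slightly across papers), let a_{k,j} denote the classification accuracy on the j-th task after sequentially training up to the k-th task:

```latex
% a_{k,j}: accuracy on task j after sequentially training on tasks 1..k
\text{Average Accuracy:}\quad \mathrm{AA}_k \;=\; \frac{1}{k}\sum_{j=1}^{k} a_{k,j}
\qquad
\text{Forgetting Measure:}\quad \mathrm{FM}_k \;=\; \frac{1}{k-1}\sum_{j=1}^{k-1}\;\max_{i\in\{1,\dots,k-1\}}\bigl(a_{i,j}-a_{k,j}\bigr)
```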
Continual Learning desiderata and constraints (Cont.)
Maintaining plasticity
Illustration of declining plasticity
Continual Learning desiderata and constraints (Cont.)
Maximizing forward and backward transfer
Illustration of forward transfer (FWT)
FWT evaluates the average influence of all old tasks on the current k-th task
classification accuracy of a randomly initialized reference model trained only on the j-th task
Learning on Task 1 also helps on future ones
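With the same notation, a common formulation of forward transfer (e.g., the one adopted in the Wang et al. survey), where \tilde{a}_j is the accuracy of the randomly initialized reference model trained only on the j-th task, is:

```latex
\mathrm{FWT}_k \;=\; \frac{1}{k-1}\sum_{j=2}^{k}\bigl(a_{j,j} - \tilde{a}_j\bigr)
```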
Continual Learning desiderata and constraints (Cont.)
Maximizing forward and backward transfer
Illustration of forward and backward transfer (BWT)
BWT evaluates the average influence of learning the k-th task on all old tasks
Learning on Task 2 also improves performance on previous Tasks
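Correspondingly, backward transfer averages how much learning the k-th task changed the accuracy on each previously learned task (negative values indicate forgetting):

```latex
\mathrm{BWT}_k \;=\; \frac{1}{k-1}\sum_{j=1}^{k-1}\bigl(a_{k,j} - a_{j,j}\bigr)
```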
Summary: stability, plasticity, generalizability
Stability-plasticity tradeoff
[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).
Task-specificity vs. generalizability tradeoff
Settings/scenarios
CL practitioners commonly define three possible settings (the taxonomy is not fully consolidated):
What classes are these, given that they belong to Task i?
What classes are these?
Same classes, different input distributions (domains)
Settings/scenarios summary
CL practitioners commonly define three possible settings (the taxonomy is not fully consolidated):
Each task/domain has a different input distribution
Disjoint output space (e.g. different classes for each task)
Mai, Zheda, et al. "Online continual learning in image classification: An empirical survey." Neurocomputing 469 (2022): 28-51.
Output spaces are disjoint and separated by task-IDs
Task-ID known both at training and test-time
(Finer-grained taxonomy) Scenarios
Additional scenarios
[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).
Samples are seen only once (1 epoch)
Datasets and Benchmarks
Datasets: Classics
SplitMNIST
SplitCIFAR
Permuted MNIST (domain incremental)
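As an illustration of how these classic streams are typically built, a minimal sketch follows (the use of torchvision is an assumption of this example, not part of the original benchmark releases): SplitMNIST partitions the 10 digits into disjoint groups of classes, one group per task, while Permuted MNIST keeps all classes but applies a different fixed pixel permutation per task.

```python
import torch
from torchvision import datasets, transforms

def split_mnist_tasks(n_tasks=5, root="data"):
    """SplitMNIST sketch: partition the 10 digit classes into n_tasks disjoint
    groups (e.g. (0,1), (2,3), ... for 5 tasks), one group per task."""
    mnist = datasets.MNIST(root=root, train=True, download=True,
                           transform=transforms.ToTensor())
    per_task = 10 // n_tasks
    tasks = []
    for t in range(n_tasks):
        keep = set(range(t * per_task, (t + 1) * per_task))
        idx = [i for i, y in enumerate(mnist.targets.tolist()) if y in keep]
        tasks.append(torch.utils.data.Subset(mnist, idx))
    return tasks

# Permuted MNIST (domain-incremental) instead keeps all 10 classes and applies a
# different fixed random permutation of the 784 flattened pixels for each task.
```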
Datasets Taxonomy
Verwimp, Eli, et al. "CLAD: A realistic continual learning benchmark for autonomous driving." Neural Networks 161 (2023): 659-669.
Strict boundaries
Towards real-world streams: desiderata
Popular CL benchmarks:
Real World Streams:
Real vs. Virtual Drifts: Streaming Protocols for Evaluation
Virtual drift Issues
Credits to “Continual Learning Beyond Catastrophic Forgetting in Class-Incremental Scenarios” CoLLAs 2023 tutorial, V. Lomonaco and A. Carta
Real drift
Real drift datasets: CLEAR and CLOC
Z. Lin et al. “The CLEAR Benchmark: Continual LEArning on Real-World Imagery” NeurIPS 2021
Z. Cai et al. “Online Continual Learning with Natural Distribution Shifts: An Empirical Study with Visual Data.” ICCV ‘21
Credits to “Continual Learning Beyond Catastrophic Forgetting in Class-Incremental Scenarios” CoLLAs 2023 tutorial, V. Lomonaco and A. Carta
Temporal Coherence - CORe50 and SAILenv
E. Meloni, L. Pasqualini, M. Tiezzi, et al. "SAILenv: Learning in virtual visual environments made simple." 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021.
Lomonaco, Vincenzo, and Davide Maltoni. "Core50: a new dataset and benchmark for continuous object recognition." Conference on robot learning. PMLR, 2017.
Meloni, Enrico, et al. "Evaluating continual learning algorithms by generating 3d virtual environments." International Workshop on Continual Semi-Supervised Learning. 2021
CL datasets for sequential models: a less explored direction
Cossu, Andrea, et al. "Continual learning for recurrent neural networks: an empirical evaluation." Neural Networks 143 (2021): 607-627.
Incrementally increase the sequence lengths or add new classes
Mainly tested on RNNs and LSTMs
Methods
Continual Learning methods: a Taxonomy
[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).
Regularization-based Approaches
Intuition:
[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).
Regularization-based Approaches: Elastic Weight Consolidation (EWC) (Kirkpatrick 2017)
TL;DR: Quadratic penalty to regularize the update of model parameters that were important to past tasks.
Task B loss
Fisher matrix: measures the amount of information that an observable random variable carries about an unknown parameter θ of the distribution that models that variable
the optimal value of j-th parameter after learning task A
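Putting the annotated ingredients together (the task-B loss, the diagonal Fisher information F, and the task-A optimum θ*_A), the EWC objective of Kirkpatrick et al. (2017) can be written as:

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{B}(\theta) \;+\; \sum_{j} \frac{\lambda}{2}\, F_{j}\,\bigl(\theta_{j} - \theta^{*}_{A,j}\bigr)^{2}
```

where λ controls how strongly the parameters that were important for task A are anchored.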
Regularization-based Approaches: Learning without Forgetting (LwF) (Li 2017)
TL;DR: Knowledge distillation (using predictions from the output head of the old tasks) to preserve knowledge
Standard cross-entropy on new task data
Predicted value for the new task with Xn
Predicted value for the old task with Xn
Computed before learning on current task: output for the old task with Xn
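Combining the annotations above, the LwF objective is, in simplified form (omitting the temperature smoothing and the weight regularizer of the original paper), the new-task cross-entropy plus a distillation term that ties the old head's current outputs on X_n to the outputs recorded before training on the current task:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}}\bigl(y_{n},\, \hat{y}_{n}\bigr) \;+\; \lambda_{o}\,\mathcal{L}_{\mathrm{KD}}\bigl(y_{o},\, \hat{y}_{o}\bigr)
```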
Replay-based Approaches
Intuition:
[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).
Replay-based Approaches: Experience Replay
Reservoir sampling: every sample has the same probability, mem_sz/n, of being stored in the memory buffer
mem_sz is the size of the buffer
n is the #samples observed up to now
Jeffrey S Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985.
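A minimal sketch of this update for one incoming sample at a time (variable names mirror the slide; this is the standard Vitter reservoir step, not code from the cited works):

```python
import random

def reservoir_update(buffer, mem_sz, n_seen, sample):
    """Reservoir sampling: after n samples have been observed, each of them is
    kept in the buffer with probability mem_sz / n."""
    if len(buffer) < mem_sz:
        buffer.append(sample)            # buffer not full yet: always store
    else:
        j = random.randint(0, n_seen)    # uniform index in {0, ..., n_seen}
        if j < mem_sz:
            buffer[j] = sample           # overwrite a random slot
    return buffer

# Usage on a stream:
# buffer = []
# for n_seen, sample in enumerate(stream):
#     reservoir_update(buffer, mem_sz=200, n_seen=n_seen, sample=sample)
```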
Replay-based Approaches: Maximally Interfered Retrieval (MIR)
MIR: retrieve the replay samples that are maximally interfered (i.e., whose loss increases the most) under a virtual parameter update computed on the incoming mini-batch
R. Aljundi, E. Belilovsky, T. Tuytelaars, L. Charlin, M. Caccia, M. Lin, L. Page-Caccia, Online continual learning with maximal interfered retrieval, in: Advances in Neural Information Processing Systems 32, 2019,
Top-k samples achieving maximal loss increase
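An illustrative sketch of the retrieval step, assuming a PyTorch classifier and a plain-SGD virtual update (function and variable names are mine, not the authors'):

```python
import copy
import torch
import torch.nn.functional as F

def mir_retrieve(model, lr, incoming_x, incoming_y, buf_x, buf_y, k):
    """Sketch of Maximally Interfered Retrieval (Aljundi et al., 2019)."""
    # Loss of buffer candidates BEFORE the virtual update
    with torch.no_grad():
        loss_pre = F.cross_entropy(model(buf_x), buf_y, reduction='none')

    # Virtual SGD step on a copy of the model, driven by the incoming mini-batch
    virtual = copy.deepcopy(model)
    loss_new = F.cross_entropy(virtual(incoming_x), incoming_y)
    grads = torch.autograd.grad(loss_new, virtual.parameters())
    with torch.no_grad():
        for p, g in zip(virtual.parameters(), grads):
            p -= lr * g

    # Loss of the same candidates AFTER the virtual update
    with torch.no_grad():
        loss_post = F.cross_entropy(virtual(buf_x), buf_y, reduction='none')

    # Top-k samples whose loss increased the most are "maximally interfered"
    top = torch.topk(loss_post - loss_pre, k).indices
    return buf_x[top], buf_y[top]
```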
Replay-based Approaches: Dark Experience Replay (DER & DER++)
DER, TL;DR: logit replay. Preserve old training samples together with their logits, and perform logit matching throughout the optimization trajectory
Buzzega, Pietro, et al. "Dark experience for general continual learning: a strong, simple baseline." Advances in neural information processing systems 33 (2020): 15920-15930.
Minimizing the Euclidean distance between the logits stored in memory and the current logits on the memory sample.
Additional term on buffer datapoints and their labels
Also logits are stored in memory
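A sketch of the resulting objective, assuming a PyTorch classifier (α and β follow the paper's naming; reusing a single buffer batch for both buffer terms is a simplification, since DER++ samples two independent buffer batches):

```python
import torch
import torch.nn.functional as F

def derpp_loss(model, x, y, buf_x, buf_y, buf_logits, alpha=0.5, beta=0.5):
    """DER++ sketch: cross-entropy on the current batch, MSE logit matching on
    buffer samples (DER term), plus cross-entropy on buffer labels ('++' term)."""
    loss = F.cross_entropy(model(x), y)                    # current task data
    loss += alpha * F.mse_loss(model(buf_x), buf_logits)   # match stored logits
    loss += beta * F.cross_entropy(model(buf_x), buf_y)    # stored labels
    return loss
```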
Replay-based Approaches: Greedy Sampler and Dumb Learner (GDumb)
GDumb: greedily updates the memory buffer, keeping a balanced class distribution. At test time, it trains a model from scratch using only the memory buffer.
Prabhu, Ameya, Philip HS Torr, and Puneet K. Dokania. "Gdumb: A simple approach that questions our progress in continual learning." Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer International Publishing, 2020.
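An approximate sketch of the greedy balanced sampler (the exact eviction rule in the paper differs slightly; names are mine):

```python
import random

def gdumb_update(buffer, mem_sz, sample, label):
    """Greedy class-balanced buffer: store the new sample if there is room, or if
    its class is under-represented, evicting a sample from the largest class."""
    counts = {}
    for _, lb in buffer:
        counts[lb] = counts.get(lb, 0) + 1
    if len(buffer) < mem_sz:
        buffer.append((sample, label))
    elif counts.get(label, 0) < max(counts.values()):
        largest = max(counts, key=counts.get)   # most represented class
        victim = random.choice([i for i, (_, lb) in enumerate(buffer) if lb == largest])
        buffer[victim] = (sample, label)
    return buffer

# At test time, GDumb trains a model from scratch on `buffer` only.
```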
Architecture-based Approaches
Intuition:
[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).
Architecture-based Approaches: Continual Neural Units (Tiezzi et al. 2024)
Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. Continual Neural Computation. In ECML-PKDD 2024
Architecture-based Approaches: Continual Neural Units (Tiezzi et al. 2024)
A Continual Neural Unit generates its own weight vector as a function of the input region in which the neuron is expected to operate
Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. (2024, August). Continual Neural Computation. In ECML-PKDD 2024
Classic Neural Unit ➙ Continual Neural Unit
2. Matrix storing multiple weight vectors (memory units), one per row
1. Function that generates the weight vector to use, as a function of x
3. Matrix storing multiple keys, used for indexing purposes (compared to x)
Continual Neural Unit (1/4)
K: Learnable Keys
M: Learnable Memory Units
Neuron Output (previous slide)
Continual Neural Unit (2/4)
Attention scores by sparse softmax-based attention, custom similarity function
Continual Neural Unit (3/4)
Weight vector generated by blending memories M as a function of the attention scores
Continual Neural Unit (4/4)
Benefits in Continual Learning?
Only the top-𝛅 memory units participate in the generation of the weight vector
Learning with Continual Neural Units
Max-isolation: Winner-Take-All (WTA) updates of memory units and keys
Current input (possibly projected by ψ, if any)
Winning key
Customizable positive coefficient
Similarity function (attention)
+ criteria to replace unused keys, based on how recently and how many times they were the winning key
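To make the mechanism of the previous slides concrete, here is a rough, illustrative sketch of the forward pass of a single continual neural unit (shapes, the dot-product similarity, and the top-δ selection are assumptions of this sketch and do not reproduce the exact formulation of the paper):

```python
import torch
import torch.nn.functional as F

class ContinualNeuralUnit(torch.nn.Module):
    """Sketch: a key matrix K indexes a matrix M of memory (weight) vectors; the
    weight vector used for an input x is a blend of the top-delta memories
    selected by key/input similarity."""

    def __init__(self, in_dim, n_mem=16, delta=2):
        super().__init__()
        self.K = torch.nn.Parameter(torch.randn(n_mem, in_dim))  # learnable keys
        self.M = torch.nn.Parameter(torch.randn(n_mem, in_dim))  # memory units
        self.delta = delta

    def forward(self, x):                       # x: (batch, in_dim)
        scores = x @ self.K.t()                 # key/input similarity
        top = scores.topk(self.delta, dim=1)    # keep only the top-delta memories
        alpha = F.softmax(top.values, dim=1)    # sparse attention scores
        w = (alpha.unsqueeze(-1) * self.M[top.indices]).sum(dim=1)  # blended weights
        return (w * x).sum(dim=1)               # neuron output: w(x)^T x
```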
2D Experiment
Each star is a key in K
In this "cloud" there are several data points, streamed one after the other; color = class ID
Optimization-based Approaches
[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).
Representation-based Approaches
Intuition:
Zhou, Da-Wei, et al. "Continual learning with pre-trained models: A survey." arXiv preprint arXiv:2401.16386 (2024).
Representation-based Approaches: Learning to Prompt for CL (L2P, Wang et al. 2022)
TL;DR: prepend a set of learnable parameters (i.e., prompts) to the current input. Prompts are gathered from a prompt pool via a key-query matching strategy.
Top-N key selection
optimize the keys during learning
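An illustrative sketch of the key-query prompt selection (tensor shapes, the cosine similarity, and the key loss used to pull selected keys toward their queries are assumptions of this sketch):

```python
import torch
import torch.nn.functional as F

def select_prompts(query, prompt_keys, prompt_pool, top_n=5):
    """L2P-style selection sketch: a frozen-backbone feature acts as a query; the
    top-N most similar keys pick the prompts to prepend to the input tokens."""
    # query: (batch, d); prompt_keys: (pool, d); prompt_pool: (pool, L_p, d)
    sim = F.cosine_similarity(query.unsqueeze(1), prompt_keys.unsqueeze(0), dim=-1)
    idx = sim.topk(top_n, dim=1).indices          # (batch, top_n) selected keys
    prompts = prompt_pool[idx]                    # (batch, top_n, L_p, d)
    # keys are optimized by pulling the selected keys toward their queries
    key_loss = (1.0 - sim.gather(1, idx)).mean()
    return prompts, key_loss
```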
Representation-based Approaches: Memory Head for CL (MH, Tiezzi et al. 2024)
TL;DR: neurons route their computation across a learnable key-value mechanism, exhibiting a dynamic behaviour that depends on their own input. Only the units (weights) that are relevant for processing the observed sample are blended.
Tiezzi, M., Becattini, F., Marullo, S., & Melacci, S. (2024, August). Memory Head for Pre-Trained Backbones in Continual Learning. In CoLLAs 2024
Learnable Keys
Learnable Memory Units
Attention scores
obtained by blending memories M, indexed by keys K, as a function of their input
Backbone features
Concluding remarks
Issues and opportunities
Computational costs, Real-Time computation, Memory constraints
[1] Y. Ghunaim et al. “Real-Time Evaluation in Online Continual Learning: A New Hope.” CVPR ’23
A. Prabhu et al. “Computationally Budgeted Continual Learning: What Does Matter?” CVPR ‘23
Model selection issues in CL
[1] A. Chaudhry et al. “Efficient Lifelong Learning with A-GEM.” 2019
Main messages
The goal of Continual Learning is to understand how to design machine learning models that learn over time
Main messages: Unsolved CL questions and Future directions
Verwimp, Eli, et al. "Continual learning: Applications and the road forward." arXiv preprint arXiv:2311.11908 (2023).
Credits
Thanks for your attention!
Acknowledgments and Credits
Credits to the wonderful talks by R. Pascanu (“Embracing Change: Continual Learning in Deep Neural Networks”, UCL Centre for AI. Recorded on the 5th May 2021), V. Lomonaco (“Continual Learning Beyond Catastrophic Forgetting in Class-Incremental Scenarios”, CoLLas 2023 tutorial, and ContinualAI CL course), M. Mundt (ContinualAI lectures) and D. Abati (Intro to CL @CVPR 2020)