1 of 77

September 16th 2024

Lifelong Continual Learning in Dynamic Environments: a Tutorial

Matteo Tiezzi

2 of 77

Motivations and Definitions

3 of 77

Motivations and definitions

Evolution has empowered humans with a strong adaptability to continually acquire, update, accumulate and exploit knowledge

  • without forgetting previous concepts and abilities
  • in a non-stationary world

Image Credits: economist.com

4 of 77

Non-stationarity vs offline learning

This contrasts with common Machine Learning assumptions:

  • Fixed problem domain
  • static/stationary dataset and learning environment
    • Gradient-based learning is successful when data samples are independent and identically distributed (IID)
        • dataset is balanced, shuffled and randomly sampled
  • No (or little) constraints on data storage or model size

This is rarely true!

  • Health, Robotics, Language

5 of 77

The “data-centric” AI era as a static paradigm: Batch/Offline Learning

  • TASK Cat breed classification
  • GOAL Learn a function (e.g., neural net) for pattern class-membership prediction

Assumption Data sampled from a stationary (unknown) distribution

STEPS

  1. Data collection
  2. Offline Learning
    • SGD on an empirical risk function
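The steps above can be sketched in a few lines. This is a minimal illustration with hypothetical toy data (a one-parameter linear model): SGD on an empirical risk, with the dataset shuffled each epoch so samples are drawn in an (approximately) IID fashion, as the stationarity assumption requires.

```python
import random

random.seed(0)
data = [(float(x), 2.0 * x) for x in range(20)]  # ground truth: y = 2x

w, lr = 0.0, 0.001
for epoch in range(100):
    random.shuffle(data)              # balanced, shuffled, randomly sampled
    for x, y in data:
        grad = 2.0 * (w * x - y) * x  # d/dw of the squared error
        w -= lr * grad

print(round(w, 3))  # converges to the true slope 2.0
```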

6 of 77

Static paradigm: Batch/Offline Learning

  • Stationarity is an issue: new data with a new underlying distribution

  1. New dataset collection
    • distribution shift?
  2. New training is required
    • from scratch? Time? Cost?

Retraining GPTs can cost up to millions of euros

Data distribution shift

7 of 77

Training costs

8 of 77

Motivations and definition

Evolution has empowered humans with a strong adaptability to continually acquire, update, accumulate and exploit knowledge

  • We expect artificial intelligence (AI) systems to adapt in a similar way

  • Lifelong or “Continual Learning” goal: understand how to design machine learning models that learn over time
    • in non-stationary environments (e.g. data streams)
    • on a constrained budget (memory/compute/real-time requirements)
    • avoiding retraining from scratch

9 of 77

What is Lifelong/Continual Learning?

  • The goal of Continual Learning is to understand how to design machine learning models that learn over time
    • on a constrained budget (memory/compute/real-time requirements)
    • with non-stationary data
      • coming from a stream

10 of 77

Why do we need Continual Learning?

  • 50 GB/s streaming data
  • ~30240 TB of data after one week
  • Impossible to retrain the robot from scratch every time with Batch/Offline Learning

11 of 77

Data privacy and storage

Data privacy, copyright, accountability, liability…

Continual Learning: Can we design models that do not need to store data collections, but learn online/on the fly?

12 of 77

GD biological plausibility: Humans don’t learn well from randomly sampled data

  • Offline Learning
    • SGD, shuffled batches

  • Are you able to learn high school subjects by sampling one page at random from different textbooks?

13 of 77

Motivations and definition: Continual Lifelong Learning

Continual Learning goal: learn from a stream of data (usually simplified as Tasks) incrementally and behave as if all the samples were observed simultaneously.

  • Tasks:
    • new classes, data distributions, skills, different environments, etc.
    • Note: strict task boundaries are a simplification adopted by practitioners

The CLEAR Benchmark: Continual LEArning on Real-World Imagery, NeurIPS 2021

SplitMNIST dataset, Task-Incremental setting

14 of 77

From Offline to Continual/Lifelong Learning with SGD

Accuracy Metric

Catastrophic forgetting

Learn one class at a time

(without past classes)

Only the class from the current Task is correctly predicted!

Task 1

Task 2

Assumption Data sampled from a fixed (unknown) distribution

15 of 77

What could be gained by adopting Continual Learning?

  • Applications that continually adapt to track a varying problem
    • evolving (social) networks, molecule interactions/reactions, epidemiological models, language evolution

  • Robots that acquire and expand skills over time

  • Novel techniques that could help Deep Learning methods efficiency, even in the stationary setting

G. Pasquale, et al. "Teaching iCub to recognize objects using deep Convolutional Neural Networks." Machine Learning for Interactive Systems. PMLR, 2015.

G. Pasquale, C. Ciliberto, L. Rosasco and L. Natale, "Object identification from few examples by improving the invariance of a Deep Convolutional Neural Network," 2016 IROS

On the Fly Object Recognition on the R1, Your Personal Humanoid - IIT

16 of 77

Why do we need Continual Learning: a Practical Example and some keywords

  • Household chores robot
    • for any home, many tasks
      • cannot be preprogrammed in factory to generalize to every environment/domain
  • Need to expand skill-set over time
    • task and environment variations
      • ‘tidying’ a table or shelving books
      • ‘laundry’ may require sorting socks or ironing shirts.

17 of 77

Why do we need Continual Learning: a Practical Example and some keywords

  • Need to adapt and not forget

  • When learning related tasks (e.g., vacuuming, sweeping, and mopping), the robot should show:
    • forward transfer
      • Learning the current task fosters subsequent tasks
    • backward transfer
      • Learning the current task fosters better performance on previous tasks

  • Limited capacity to:
    • store data, increase its model size, or increase processing time.

18 of 77

Other domains: Continual Learning in Graph Representation Learning

  • Graphs where features, node targets, topology, etc. evolve over time
    • Social networks, Molecular properties, knowledge graphs etc.

Yuan, Qiao, et al. "Continual graph learning: A survey." arXiv preprint arXiv:2301.12230 (2023).

Zhang, Xikun, Dongjin Song, and Dacheng Tao. "Continual Learning on Graphs: Challenges, Solutions, and Opportunities." arXiv preprint arXiv:2402.11565 (2024).

19 of 77

Nomenclature and related paradigms

As contents are provided incrementally over a lifetime, continual learning is also referred to as lifelong learning or incremental learning in much of the literature, without a strict distinction

Unconsolidated Nomenclature

  • Continual Learning
  • Continuous Learning
  • Lifelong learning
  • Incremental learning

Related Paradigms

  • Multi-task learning
  • Meta-learning/learning to learn
  • Transfer Learning & Domain Adaptation
  • Online/Streaming Learning
  • Time series forecasting

The most general notion

20 of 77

Lifelong Learning: A longstanding goal

Image Credits to V. Lomonaco and M. Mundt (see last slide)

21 of 77

CL: Many surveys on the topic

22 of 77

A hot topic - Conference on Lifelong Learning Agents (3rd edition) - lifelong-ml.cc

23 of 77

Desiderata, metrics, settings

24 of 77

Continual Learning desiderata and constraints

Many desiderata, often competing with one another:

  1. Minimal resources
    • Limit the access to previous tasks data
    • finite storage for previous experience and cannot interact with previously seen tasks
      • or no storage at all
  2. Minimal increase in model capacity and computation
    • The approach must be scalable: it cannot add a new model for each subsequent task.
  3. Fast adaptation and recovery
    • adaptation to novel tasks or domain shifts
    • fast recovery when presented with past tasks

25 of 77

Continual Learning desiderata and constraints (Cont.)

Minimizing catastrophic forgetting and interference.

    • Training on new tasks should not significantly reduce performance on previously learned tasks

Illustration of catastrophic forgetting

Credits: Hadsell, Raia, et al. "Embracing change: Continual learning in deep neural networks." Trends in cognitive sciences 24.12 (2020): 1028-1040.

Illustration of the perfect learning dynamics

26 of 77

Continual Learning desiderata and constraints (Cont.)

Minimizing catastrophic forgetting and interference.

    • Training on new tasks should not significantly reduce performance on previously learned tasks

Illustration of catastrophic forgetting

Notation: Evaluation Metrics

  • Average Accuracy
    • a_{k,j} ∈ [0, 1] is the accuracy on the test set of the j-th task after incremental learning of the k-th task (j ≤ k).
  • Average Incremental Accuracy
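These two metrics can be sketched directly from the accuracy matrix. The matrix values below are hypothetical, used only to exercise the definitions: acc[k-1][j-1] = a_{k,j}, the accuracy on task j's test set after incremental learning of task k.

```python
acc = [
    [0.95],              # after task 1
    [0.70, 0.93],        # after task 2 (task 1 partly forgotten)
    [0.55, 0.68, 0.94],  # after task 3
]

def average_accuracy(acc, k):
    """AA_k: mean accuracy over all k tasks seen so far (1-indexed k)."""
    return sum(acc[k - 1]) / k

def average_incremental_accuracy(acc, k):
    """AIA_k: the mean of AA_1, ..., AA_k."""
    return sum(average_accuracy(acc, i) for i in range(1, k + 1)) / k

print(round(average_accuracy(acc, 3), 3))              # 0.723
print(round(average_incremental_accuracy(acc, 3), 3))  # 0.829
```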

27 of 77

Continual Learning desiderata and constraints (Cont.)

Minimizing catastrophic forgetting and interference.

    • Training on new tasks should not significantly reduce performance on previously learned tasks

Illustration of catastrophic forgetting

Notation: Metrics

  • Forgetting of a task
    • the difference between its maximum performance obtained in the past and its current performance
  • Forgetting Measure
    • at the k-th task is the average forgetting of all old tasks
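A minimal sketch of these definitions, with the same kind of hypothetical accuracy matrix acc[k-1][j-1] = a_{k,j} used for the accuracy metrics:

```python
acc = [
    [0.95],
    [0.70, 0.93],
    [0.55, 0.68, 0.94],
]

def forgetting(acc, k, j):
    """Forgetting of task j after task k (1-indexed): the difference between
    its maximum past performance and its current performance."""
    best_past = max(acc[i - 1][j - 1] for i in range(j, k))
    return best_past - acc[k - 1][j - 1]

def forgetting_measure(acc, k):
    """Forgetting Measure at task k: average forgetting of all old tasks."""
    return sum(forgetting(acc, k, j) for j in range(1, k)) / (k - 1)

print(round(forgetting(acc, 3, 1), 2))       # 0.4
print(round(forgetting_measure(acc, 3), 3))  # 0.325
```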

28 of 77

Continual Learning desiderata and constraints (Cont.)

Maintaining plasticity

    • The model should be able to keep learning effectively as new tasks are observed
    • Failure causes: insufficient model capacity, excessive regularization

Illustration of declining plasticity

29 of 77

Continual Learning desiderata and constraints (Cont.)

Maximizing forward and backward transfer

    • Learning a task should improve related tasks, both past and future, in terms of both learning efficiency and performance

Illustration of forward transfer (FWT)

FWT evaluates the average influence of all old tasks on the current k-th task

classification accuracy of a randomly initialized reference model trained only on the j-th task

Learning on Task 1 also helps in future ones

30 of 77

Continual Learning desiderata and constraints (Cont.)

Maximizing forward and backward transfer

    • Learning a task should improve related tasks, both past and future, in terms of both learning efficiency and performance

Illustration of forward and backward transfer (BWT)

BWT evaluates the average influence of learning the k-th task on all old tasks

Learning on Task 2 also improves performance on previous Tasks
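BWT and FWT can be sketched from the accuracy matrix under commonly used (e.g. GEM-style) definitions. Both the accuracy matrix and the reference accuracies below are hypothetical: ref[j-1] is the accuracy of a randomly initialized reference model trained only on the j-th task.

```python
acc = [
    [0.95],
    [0.70, 0.93],
    [0.55, 0.68, 0.94],
]
ref = [0.94, 0.92, 0.93]  # from-scratch reference, one model per task

def backward_transfer(acc, k):
    """BWT_k: average influence of learning task k on all old tasks.
    Negative values indicate (catastrophic) forgetting."""
    return sum(acc[k - 1][j - 1] - acc[j - 1][j - 1]
               for j in range(1, k)) / (k - 1)

def forward_transfer(acc, ref, k):
    """FWT_k: average gain on each task over its from-scratch reference."""
    return sum(acc[j - 1][j - 1] - ref[j - 1]
               for j in range(2, k + 1)) / (k - 1)

print(round(backward_transfer(acc, 3), 3))    # -0.325 (forgetting)
print(round(forward_transfer(acc, ref, 3), 2))  # 0.01 (mild forward transfer)
```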

31 of 77

Summary: stability, plasticity, generalizability

Stability-plasticity tradeoff

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

Task specificity generalizability tradeoff

32 of 77

Settings/scenarios

CL practitioners commonly define three possible settings (the taxonomy is not fully consolidated):

  1. Task-Incremental:
    • different input distributions, disjoint output spaces, separated by task-ID; task-ID known at training and test time

What classes are these, given that they belong to Task i?

What classes are these?

Same classes, different input distributions (domains)

  2. Class-Incremental:
    • different input distributions, disjoint output space

  3. Domain-Incremental:
    • different input distributions

33 of 77

Settings/scenarios summary

CL practitioners commonly define three possible settings (the taxonomy is not fully consolidated):

Each task/domain has a different input distribution

Disjoint output space (e.g. different classes for each task)

Mai, Zheda, et al. "Online continual learning in image classification: An empirical survey." Neurocomputing 469 (2022): 28-51.

Output spaces are disjoint and separated by task-IDs

Task-ID known both at training and test-time

34 of 77

(Finer-grained taxonomy) Scenarios

Additional scenarios

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

Samples are seen only once (1 epoch)

  • General Continual Learning (GCL): the model observes incremental data in an online fashion, without explicit task boundaries

35 of 77

Datasets and Benchmarks

36 of 77

Datasets: Classics

SplitMNIST

SplitCIFAR

Permuted MNIST (domain incremental)

37 of 77

Datasets Taxonomy

Verwimp, Eli, et al. "Clad: A realistic continual learning benchmark for autonomous driving." Neural Networks 161 (2023): 659-669.

Strict boundaries

38 of 77

Towards real-world streams: desiderata

CL popular benchmarks:

  • Sharp Virtual drift
  • New classes
  • Balanced data
  • No temporal consistency

Real World Streams:

  • Gradual and sharp drifts
  • New domains and classes appear over time
  • Imbalanced distributions
  • Temporal consistency (e.g. video frames)

39 of 77

Evaluation with Real vs Virtual Drifts: Streaming Protocols for evaluation

Virtual drift Issues

  • sampling bias
  • Evaluation on a static test set
  • a.k.a. most of the CL research
    • violating the main assumption in CL, namely no access to previous task data

Credits to “Continual Learning Beyond Catastrophic Forgetting in Class-Incremental Scenarios” CoLLAs 2023 tutorial, V. Lomonaco and A. Carta

Real drift

  • Concept drift
    • e.g.: politician roles and affiliations to political party
  • Evaluation on the next data (e.g. prequential evaluation, a.k.a. test-then-train)
  • Not a lot of research in CL right now
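A prequential loop is easy to sketch. Below is a minimal test-then-train example on a hypothetical drifting stream (the labeling rule flips halfway, i.e. a real concept drift): each incoming sample is first used for testing, then for training.

```python
import random

random.seed(0)

def stream(n=1000):
    for t in range(n):
        x = random.uniform(-1.0, 1.0)
        # real concept drift halfway through: the labeling rule flips
        y = int(x > 0) if t < n // 2 else int(x < 0)
        yield x, y

w, b, lr = 0.0, 0.0, 0.1
correct, n = 0, 0
for x, y in stream():
    pred = int(w * x + b > 0.5)   # test on the incoming sample ...
    correct += int(pred == y)
    err = (w * x + b) - y         # ... then train on it (LMS update)
    w -= lr * err * x
    b -= lr * err
    n += 1

print(n, round(correct / n, 2))   # prequential (online) accuracy
```

The online accuracy dips right after the drift and recovers as the model adapts, which is exactly what a static test set cannot show.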

40 of 77

Real drifts datasets: CLEAR and CLOC

  • Real-world images with smooth temporal evolution
  • Large unlabeled dataset (~7.8M images)
  • Prequential evaluation
  • Scenario: domain-incremental and semi-supervised

Z. Lin et al. “The CLEAR Benchmark: Continual LEArning on Real-World Imagery” NeurIPS 2021

Z. Cai et al. “Online Continual Learning with Natural Distribution Shifts: An Empirical Study with Visual Data.” ICCV ‘21

Credits to “Continual Learning Beyond Catastrophic Forgetting in Class-Incremental Scenarios” CoLLAs 2023 tutorial, V. Lomonaco and A. Carta

41 of 77

Temporal Coherence - CORe50 and SAILenv

  • Temporally coherent streams (videos)
    • dynamic environments
    • Many classes with several movement patterns
    • Many scenarios are possible: batch, online, with repetitions

M. Meloni,L. Pasqualini, M. Tiezzi et al. "Sailenv: Learning in virtual visual environments made simple." 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021.

Lomonaco, Vincenzo, and Davide Maltoni. "Core50: a new dataset and benchmark for continuous object recognition." Conference on robot learning. PMLR, 2017.

Meloni, Enrico, et al. "Evaluating continual learning algorithms by generating 3d virtual environments." International Workshop on Continual Semi-Supervised Learning. 2021

42 of 77

CL datasets for sequential models: a less explored direction

Cossu, Andrea, et al. "Continual learning for recurrent neural networks: an empirical evaluation." Neural Networks 143 (2021): 607-627.

Incrementally increase the sequence lengths or add new classes

Mainly tested on RNNs and LSTMs

43 of 77

Methods

44 of 77

Continual Learning methods: a Taxonomy

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

45 of 77

Regularization-based Approaches

Intuition:

  • Regularization on some selected (here is the hard part) parameters
    • Seminal methods:
      • Elastic Weight Consolidation (EWC)
      • Synaptic Intelligence (SI)
  • Distillation to mitigate parameter drift:
    • Learning without Forgetting (LwF)

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

46 of 77

Regularization-based Approaches: Elastic Weight Consolidation (EWC) (Kirkpatrick 2017)

TL;DR: Quadratic penalty to regularize the update of model parameters that were important to past tasks.

  • Parameter importance approximated by the diagonal Fisher Information Matrix F
    • e.g. given two tasks A and B seen in sequence, the loss function in EWC is:

L(θ) = L_B(θ) + Σ_j (λ/2) F_j (θ_j − θ*_{A,j})²

where L_B is the Task B loss; F_j is the j-th diagonal entry of the Fisher matrix, which measures the amount of information that an observable random variable carries about an unknown parameter θ of the distribution that models it; and θ*_{A,j} is the optimal value of the j-th parameter after learning task A.
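The penalty term is a one-liner; this is a minimal sketch with hypothetical parameter, Fisher, and λ values: parameters that were important for task A (large Fisher entries) are anchored to their old optima.

```python
theta      = [0.8, -0.3, 1.1]   # current parameters, while learning task B
theta_star = [1.0,  0.0, 1.0]   # optimal parameters after task A
fisher     = [2.0,  0.1, 0.5]   # diagonal Fisher: per-parameter importance
lam = 1.0                       # regularization strength

penalty = sum(lam / 2.0 * f * (t - ts) ** 2
              for f, t, ts in zip(fisher, theta, theta_star))
# the full EWC loss would be: loss_B(theta) + penalty
print(round(penalty, 4))  # 0.047
```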

47 of 77

Regularization-based Approaches: Learning without Forgetting (LwF) (Li 2017)

TL;DR: Knowledge distillation (using predictions from the output head of the old tasks) to preserve knowledge

  • New task (Xn, Yn)
  • Teacher model: model after learning the last task
  • Student model: model trained with the current task

L(θ) = L_CE(Y_n, Ŷ_n) + λ_o · L_KD(Y_o, Ŷ_o)

where:
  • L_CE(Y_n, Ŷ_n) is the standard cross-entropy on new task data
  • Ŷ_n is the value predicted for the new task on Xn
  • Ŷ_o is the value predicted for the old task on Xn
  • Y_o is the output for the old task on Xn, computed before learning on the current task
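A minimal sketch of the distillation term with hypothetical logits: the old-task head's outputs, recorded before training on the new task, act as soft targets (softened by a temperature T) for the current student model on new-task inputs Xn.

```python
import math

def softmax_T(logits, T):
    """Temperature-softened softmax (T > 1 smooths the distribution)."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    p = softmax_T(teacher_logits, T)  # recorded before the current task
    q = softmax_T(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

loss_kd = distill_loss([2.0, 0.5, -1.0], [1.8, 0.7, -0.9])
# the full LwF loss would be: CE(new-task labels) + lambda_o * loss_kd
```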

48 of 77

Replay-based Approaches

Intuition:

  • Use two mini-batches: the incoming one, and one from a memory buffer
  • Model is updated using both mini-batches
  • Three directions:
    • Experience Replay
    • Generative Replay
    • Feature Replay

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

49 of 77

Replay-based Approaches: Experience Replay

Reservoir sampling: every sample has the same probability, mem_sz/n, to be stored in the memory buffer

mem_sz is the size of the buffer

n is the #samples observed up to now

Jeffrey S Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985.
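The buffer update rule is a few lines; after n samples, every one of them had the same probability mem_sz / n of being kept:

```python
import random

random.seed(0)
mem_sz = 5
buffer = []
for n, sample in enumerate(range(100), start=1):
    if len(buffer) < mem_sz:
        buffer.append(sample)        # fill the buffer first
    else:
        i = random.randrange(n)      # uniform index in [0, n)
        if i < mem_sz:
            buffer[i] = sample       # replace with probability mem_sz / n

print(len(buffer))  # 5
```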

50 of 77

Replay-based Approaches: Maximally Interfered Retrieval (MIR)

MIR: retrieves the replay samples that are maximally interfered with (i.e., whose loss increases the most) by a virtual parameter update computed on the incoming mini-batch

R. Aljundi, E. Belilovsky, T. Tuytelaars, L. Charlin, M. Caccia, M. Lin, L. Page-Caccia, Online continual learning with maximal interfered retrieval, in: Advances in Neural Information Processing Systems 32, 2019,

Top-k samples achieving maximal loss increase
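The retrieval step can be sketched on a toy 1-D linear model y = w·x (all values hypothetical): take a "virtual" SGD step on the incoming mini-batch, then retrieve the memory samples whose loss would increase the most under that update.

```python
def loss(w, x, y):
    return (w * x - y) ** 2

def virtual_update(w, batch, lr=0.1):
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad

w = 1.0
incoming = [(1.0, 2.0), (2.0, 4.0)]            # pulls w towards 2
memory = [(1.0, 1.0), (3.0, 3.0), (0.5, 2.0)]  # buffer of old samples

w_virtual = virtual_update(w, incoming)
increase = {s: loss(w_virtual, *s) - loss(w, *s) for s in memory}
top_k = sorted(memory, key=lambda s: -increase[s])[:2]  # maximally interfered
print(top_k)  # the samples fit by w = 1 that the virtual update hurts most
```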

51 of 77

Replay-based Approaches: Dark Experience Replay (DER & DER++)

DER, TL;DR: logit replay. Preserve old training samples together with their logits, and perform logit matching throughout the optimization trajectory

Buzzega, Pietro, et al. "Dark experience for general continual learning: a strong, simple baseline." Advances in neural information processing systems 33 (2020): 15920-15930.

Minimizing the Euclidean distance between the logit from memory and the current logits on the memory sample.

Additional term on buffer datapoints and their labels

Also logits are stored in memory
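A minimal sketch of the two replay terms on hypothetical values: match the logits stored in the buffer with an MSE term (DER); DER++ adds a cross-entropy term on the buffer labels as well.

```python
import math

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_entropy(logits, label):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]

stored_logits  = [2.1, -0.3, 0.4]  # saved in memory together with the sample
current_logits = [1.8,  0.0, 0.2]  # model output on that buffer sample now
alpha, beta = 0.5, 0.5

der_term   = alpha * mse(current_logits, stored_logits)  # DER
derpp_term = beta * cross_entropy(current_logits, 0)     # DER++ extra term
# the full loss would be: CE(incoming batch) + der_term (+ derpp_term)
```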

52 of 77

Replay-based Approaches: Greedy Sampler and Dumb Learner (GDumb)

GDumb: greedily updates the memory buffer keeping a balanced class distribution. At inference, it trains a model from scratch using the memory buffer only.

Prabhu, Ameya, Philip HS Torr, and Puneet K. Dokania. "Gdumb: A simple approach that questions our progress in continual learning." Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer International Publishing, 2020.
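The greedy balanced sampler can be sketched as follows (a simplified illustration, not GDumb's exact code): keep the class distribution in the buffer as even as possible, evicting from the currently largest class when a smaller class needs room.

```python
from collections import Counter

def gdumb_update(buffer, sample, label, mem_sz):
    if len(buffer) < mem_sz:
        buffer.append((sample, label))
        return
    counts = Counter(y for _, y in buffer)
    biggest = max(counts, key=counts.get)
    if counts[label] < counts[biggest]:
        # evict one sample of the over-represented class
        idx = next(i for i, (_, y) in enumerate(buffer) if y == biggest)
        buffer[idx] = (sample, label)

buffer = []
for i in range(90):            # heavily imbalanced stream: class 0 ...
    gdumb_update(buffer, i, 0, mem_sz=10)
for i in range(10):            # ... then a few samples of class 1
    gdumb_update(buffer, i, 1, mem_sz=10)

print(Counter(y for _, y in buffer))  # balanced: 5 per class
```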

53 of 77

Architecture-based Approaches

Intuition:

  • constructing task-specific parameters instead of learning all incremental tasks with a shared set of parameters
    • Parameter allocation:
      • fixed architecture, select paths/identify important neurons for current task (e.g. Piggyback, HAT, SupSup methods)
        • sparsity
      • alternative: dynamically expand the network if the capacity is not sufficient

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

54 of 77

Architecture-based Approaches

Intuition:

  • Modular Network:
    • parallel sub-networks or sub-modules to learn incremental tasks in a differentiated manner
      • Progressive Networks (Rusu et al. 2016):
        • introduce an identical sub-network for each task + knowledge transfer from other sub-networks
      • Expert Gate (Aljundi 2017)/ Routing Net. (Collier 2020):
        • Mixture of Experts (MoE) to learn incremental tasks, expanding one expert as each task is introduced

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

55 of 77

Architecture-based Approaches: Continual Neural Units (Tiezzi et al. 2024)

Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. Continual Neural Computation. In ECML-PKDD 2024

56 of 77

Architecture-based Approaches: Continual Neural Units (Tiezzi et al. 2024)

A Continual Neural Unit generates its own weight vector as a function of the input region in which the neuron is expected to operate

  • Given a set of weight vectors (memory units), a few of them are blended to generate the "right" weight vector to use, as a function of x
  • The other weight vectors of the set are left untouched

Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. (2024, August). Continual Neural Computation. In ECML-PKDD 2024

57 of 77

Classic Neural Unit ➙ Continual Neural Unit

1. Function that generates the weight vector to use as a function of x

2. Matrix storing multiple weight vectors (memory units), one per row

3. Matrix storing multiple keys, used for indexing purposes (compared to x)

A Continual Neural Unit generates its own weight vector as a function of the input region in which the neuron is expected to operate

  • Given a set of weight vectors (memory units), a few of them are blended to generate the "right" weight vector to use, as a function of x
  • The other weight vectors of the set are left untouched

Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. (2024, August). Continual Neural Computation. In ECML-PKDD 2024

58 of 77

Continual Neural Unit (1/4)

K: Learnable Keys

M: Learnable Memory Units

  • Attention is exploited to generate the weight vector w which drives the neural computation

Neuron Output (previous slide)

Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. (2024, August). Continual Neural Computation. In ECML-PKDD 2024

59 of 77

Continual Neural Unit (2/4)

  • Attention is exploited to generate the weight vector w which drives the neural computation

Attention scores by sparse softmax-based attention, custom similarity function

Neuron Output (previous slide)

Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. (2024, August). Continual Neural Computation. In ECML-PKDD 2024

K: Learnable Keys

M: Learnable Memory Units

60 of 77

Continual Neural Unit (3/4)

Weight vector generated by blending memories M as a function of the attention scores

  • Attention is exploited to generate the weight vector w which drives the neural computation

Neuron Output (previous slide)

Attention scores by sparse softmax-based attention, custom similarity function

Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. (2024, August). Continual Neural Computation. In ECML-PKDD 2024

K: Learnable Keys

M: Learnable Memory Units

61 of 77

Continual Neural Unit (4/4)

  • Attention is exploited to generate the weight vector w which drives the neural computation

Attention scores by sparse softmax-based attention, custom similarity function

Benefits in Continual Learning?

Only the top-δ memory units participate in the generation of the weight vector

  • The other ones are not considered at all (parameter isolation), preventing catastrophic forgetting
  • The information stored in the δ units is merged, promoting information transfer

Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. (2024, August). Continual Neural Computation. In ECML-PKDD 2024
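A toy sketch of the forward computation (all keys, memory units, and δ below are hypothetical values, not the paper's parametrization): similarity between the input x and the learnable keys K selects the top-δ memory units in M, which are blended by a sparse softmax into the weight vector w; the neuron then computes w·x as usual.

```python
import math

K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one key per memory unit
M = [[2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]  # memory units = candidate weights
delta = 2                                  # only top-delta units participate

def cnu_forward(x, K, M, delta):
    # similarity: negative squared distance between x and each key
    sims = [-sum((k - xi) ** 2 for k, xi in zip(key, x)) for key in K]
    top = sorted(range(len(K)), key=lambda i: -sims[i])[:delta]
    exps = {i: math.exp(sims[i]) for i in top}  # sparse softmax over top-delta
    z = sum(exps.values())
    w = [sum(exps[i] / z * M[i][d] for i in top) for d in range(len(x))]
    return sum(wd * xd for wd, xd in zip(w, x)), top

y, used = cnu_forward([0.9, 0.1], K, M, delta)
print(used)  # the unit with key [-1, 0] is untouched (parameter isolation)
```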

62 of 77

Learning with Continual Neural Units

Max-isolation: Winner-Take-All (WTA) updates of memory units and keys

  • The most-responding (winning) key is identified, and only the corresponding memory unit is updated by gradient descent (WTA)
  • The winning key is updated by an online K-means-like criterion that pushes the key toward the current input of the data stream

Current input (possibly projected by ψ, if any)

Winning key

Customizable positive coefficient

Similarity function (attention)

+ criteria to replace unused keys, based on how recently and how often they have been the winning key
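The key update can be sketched in one line (β is the customizable positive coefficient; values are hypothetical):

```python
def wta_key_update(key, x, beta=0.1):
    # online K-means-like step: push the winning key toward the input x
    return [k + beta * (xi - k) for k, xi in zip(key, x)]

key = [0.0, 0.0]
for _ in range(100):                  # repeatedly wins for inputs near (1, 2)
    key = wta_key_update(key, [1.0, 2.0])
print([round(k, 2) for k in key])     # [1.0, 2.0]
```

The key drifts into the input region it is responsible for, which is exactly what anchors each memory unit to its region of the input space.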

63 of 77

2D Experiment

  • Class and domain incremental, online learning (one sample at a time, single pass); each class is a bi-modal Gaussian, whose two modes are considered sequentially

Each star is a key in K

In this "cloud" there are several data points, streamed one after the other; color = class ID

68 of 77

Optimization-based Approaches

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

69 of 77

Representation-based Approaches

Intuition:

  • exploit the strengths of (pre-trained) representations for continual learning
    • self-supervised learning to help CL
    • pre-training for downstream CL
      • Foundational models, fixed backbones, employ Adapters or prompt-based approaches
    • continual pre-training
      • data required for pre-training is typically collected in an incremental manner
      • LLMs etc.

Zhou, Da-Wei, et al. "Continual learning with pre-trained models: A survey." arXiv preprint arXiv:2401.16386 (2024).

70 of 77

Representation-based Approaches: Learning to Prompt for CL (L2P, Wang et al. 2022)

TL;DR: prepend a set of learnable parameters (i.e., prompts) to the current input. Prompts are gathered from a prompt pool via a key-query matching strategy.

Top-N key selection

optimize the keys during learning

71 of 77

Representation-based Approaches: Memory Head for CL (MH, Tiezzi et al. 2024)

TL;DR: neurons route their computation across a learnable key-value mechanism, exhibiting a dynamic behaviour depending on their own input. Only the units (weights) that are relevant to process the observed sample are blended.

Tiezzi, M., Becattini, F., Marullo, S., & Melacci, S. (2024, August). Memory Head for Pre-Trained Backbones in Continual Learning. In CoLLAs 2024

Learnable Keys

Learnable Memory Units

Attention scores

obtained by blending memories M, indexed by keys K, in function of their input

Backbone features

  • No replay buffers!
  • No task-id required!

72 of 77

Concluding remarks

Issues and opportunities

73 of 77

Computational costs, Real-Time computation, Memory constraints

  • When compared in real-world settings, simple Experience Replay outperforms all CL methods:
    • current evaluations: no constraint on training time and computation
    • Evaluation in [1] w.r.t. computational costs:
      • the stream does not wait for the model to complete training before revealing the next data
        • skip data if the model is not fast enough

[1] Y. Ghunaim et al. “Real-Time Evaluation in Online Continual Learning: A New Hope.” CVPR ’23

A. Prabhu et al. “Computationally Budgeted Continual Learning: What Does Matter?” CVPR ‘23

74 of 77

Model selection issues in CL

  • Most researchers perform a full hyperparameter selection on the entire validation stream
    • i.e. also on validation data from previous tasks
    • violating the main assumption in CL - namely no access to previous task data
    • it’s suboptimal because optimal parameters may vary over time

  • Possible solution
    • use only the first part of the validation stream for hyperparameter selection [1]

[1] A. Chaudhry et al. “Efficient Lifelong Learning with A-GEM.” 2019

75 of 77

Main messages

The goal of Continual Learning is to understand how to design machine learning models that learn over time

  • on a constrained budget (memory/compute/real-time requirements)
  • with non-stationary data
  • much wider than «class-incremental learning» or «finetuning a pretrained model»

76 of 77

Main messages: Unsolved CL questions and Future directions

  • Push towards more realistic settings
    • Toy data is fine for research, toy settings not so much
    • CL metrics can be misleading and very easy to abuse
    • Continual hyperparameter selection (and robustness)
    • Compute-bounded continual learning

Verwimp, Eli, et al. "Continual learning: Applications and the road forward." arXiv preprint arXiv:2311.11908 (2023).

77 of 77

Credits

Thanks for your attention!

Acknowledgments and Credits

Credits to the wonderful talks by R. Pascanu (“Embracing Change: Continual Learning in Deep Neural Networks”, UCL Centre for AI. Recorded on the 5th May 2021), V. Lomonaco (“Continual Learning Beyond Catastrophic Forgetting in Class-Incremental Scenarios”, CoLLas 2023 tutorial, and ContinualAI CL course), M. Mundt (ContinualAI lectures) and D. Abati (Intro to CL @CVPR 2020)