1 of 77

September 16th 2024

Lifelong Continual Learning in Dynamic Environments: a Tutorial

Matteo Tiezzi

2 of 77

Motivations and Definitions

3 of 77

Motivations and definitions

Evolution has empowered humans with a strong adaptability to continually acquire, update, accumulate and exploit knowledge

  • without forgetting previous concepts and abilities
  • in a non-stationary world

Image Credits: economist.com

4 of 77

Non-stationarity vs offline learning

This contrasts with common Machine Learning assumptions:

  • Fixed problem domain
  • static/stationary dataset and learning environment
    • Gradient-based learning is successful when data samples are independent and identically distributed (IID)
        • dataset is balanced, shuffled and randomly sampled
  • No (or little) constraints on data storage or model size

This is rarely true!

  • Health, Robotics, Language

5 of 77

The “data-centric” AI era as a static paradigm: Batch/Offline Learning

  • TASK Cat breed classification
  • GOAL Learn a function (e.g., neural net) for pattern class-membership prediction

Assumption Data sampled from a stationary (unknown) distribution

STEPS

  1. Data collection
  2. Offline Learning
    • SGD on an empirical risk function
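The steps above can be sketched in a few lines. This is a minimal illustration with hypothetical toy data (a one-parameter linear model): SGD on an empirical risk, with the dataset shuffled each epoch so samples are drawn in an (approximately) IID fashion, as the stationarity assumption requires.

```python
import random

random.seed(0)
data = [(float(x), 2.0 * x) for x in range(20)]  # ground truth: y = 2x

w, lr = 0.0, 0.001
for epoch in range(100):
    random.shuffle(data)              # balanced, shuffled, randomly sampled
    for x, y in data:
        grad = 2.0 * (w * x - y) * x  # d/dw of the squared error
        w -= lr * grad

print(round(w, 3))  # converges to the true slope 2.0
```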

6 of 77

Static paradigm: Batch/Offline Learning

  • Stationarity is an issue: new data with a new underlying distribution

  1. New dataset collection
    • distribution shift?
  2. New training is required
    • from scratch? Time? Cost?

Retraining GPTs can cost up to millions of euros

Data distribution shift

7 of 77

Training costs

8 of 77

Motivations and definition

Evolution has empowered humans with a strong adaptability to continually acquire, update, accumulate and exploit knowledge

  • We expect artificial intelligence (AI) systems to adapt in a similar way

  • Lifelong or “Continual Learning” goal: understand how to design machine learning models that learn over time
    • in non-stationary environments (e.g. data streams)
    • on a constrained budget (memory/compute/real-time requirements)
    • avoiding retraining from scratch

9 of 77

What is Lifelong/Continual Learning?

  • The goal of Continual Learning is to understand how to design machine learning models that learn over time
    • on a constrained budget (memory/compute/real-time requirements)
    • with non-stationary data
      • coming from a stream

10 of 77

Why do we need Continual Learning?

  • 50 GB/s streaming data
  • ~30240 TB of data after one week
  • Impossible to retrain the robot from scratch every time with Batch/Offline Learning

11 of 77

Data privacy and storage

Data privacy, copyright, accountability, liability…

Continual Learning: Can we design models that do not need to store data collections, but learn online/on the fly?

12 of 77

GD biological plausibility: Humans don’t learn well from randomly sampled data

  • Offline Learning
    • SGD, shuffled batches

  • Are you able to learn high school subjects by sampling one page at random from different textbooks?

13 of 77

Motivations and definition: Continual Lifelong Learning

Continual Learning goal: learn from a stream of data (usually simplified as Tasks) incrementally and behave as if all the samples were observed simultaneously.

  • Tasks:
    • new classes, data distributions, skills, different environments, etc.
    • Note: strict task boundaries are a simplification adopted by practitioners

The CLEAR Benchmark: Continual LEArning on Real-World Imagery, NeurIPS 2021

SplitMNIST dataset, Task-Incremental setting

14 of 77

From Offline to Continual/Lifelong Learning with SGD

Accuracy Metric

Catastrophic forgetting

Learn one class at a time

(without past classes)

Only the class from the current Task is correctly predicted!

Task 1

Task 2

Assumption Data sampled from a fixed (unknown) distribution

15 of 77

What could be gained by adopting Continual Learning?

  • Applications that continually adapt to track a varying problem
    • evolving (social) networks, molecule interactions/reactions, epidemiological models, language evolution

  • Robots that acquire and expand skills over time

  • Novel techniques that could help Deep Learning methods efficiency, even in the stationary setting

G. Pasquale, et al. "Teaching iCub to recognize objects using deep Convolutional Neural Networks." Machine Learning for Interactive Systems. PMLR, 2015.

G. Pasquale, C. Ciliberto, L. Rosasco and L. Natale, "Object identification from few examples by improving the invariance of a Deep Convolutional Neural Network," 2016 IROS

On the Fly Object Recognition on the R1, Your Personal Humanoid - IIT

16 of 77

Why do we need Continual Learning: a Practical Example and some keywords

  • Household chores robot
    • for any home, many tasks
      • cannot be preprogrammed in factory to generalize to every environment/domain
  • Need to expand skill-set over time
    • task and environment variations
      • ‘tidying’ a table or shelving books
      • ‘laundry’ may require sorting socks or ironing shirts.

17 of 77

Why do we need Continual Learning: a Practical Example and some keywords

  • Need to adapt and not forget

  • When learning related tasks (e.g., vacuuming, sweeping, and mopping), the robot should show:
    • forward transfer
      • Learning the current task fosters subsequent tasks
    • backward transfer
      • Learning the current task fosters better performance on previous tasks

  • Limited capacity to:
    • store data, increase its model size, or increase processing time.

18 of 77

Other domains: Continual Learning in Graph Representation Learning

  • Graphs where features, node targets, topology, etc. evolve over time
    • Social networks, Molecular properties, knowledge graphs etc.

Yuan, Qiao, et al. "Continual graph learning: A survey." arXiv preprint arXiv:2301.12230 (2023).

Zhang, Xikun, Dongjin Song, and Dacheng Tao. "Continual Learning on Graphs: Challenges, Solutions, and Opportunities." arXiv preprint arXiv:2402.11565 (2024).

19 of 77

Nomenclature and related paradigms

As contents are provided incrementally over a lifetime, continual learning is also referred to as lifelong learning or incremental learning in much of the literature, without a strict distinction

Unconsolidated Nomenclature

  • Continual Learning
  • Continuous Learning
  • Lifelong learning
  • Incremental learning

Related Paradigms

  • Multi-task learning
  • Meta-learning/learning to learn
  • Transfer Learning & Domain Adaptation
  • Online/Streaming Learning
  • Time series forecasting

The most general notion

20 of 77

Lifelong Learning: A longstanding goal

Image Credits to V. Lomonaco and M. Mundt (see last slide)

21 of 77

CL: Many surveys on the topic

22 of 77

A hot topic - Conference on Lifelong Learning Agents (3rd edition) - lifelong-ml.cc

23 of 77

Desiderata, metrics, settings

24 of 77

Continual Learning desiderata and constraints

Many desiderata, often competing with one another:

  1. Minimal resources
    • Limit the access to previous tasks data
    • finite storage for previous experience and cannot interact with previously seen tasks
      • or no storage at all
  2. Minimal increase in model capacity and computation
    • The approach must be scalable: it cannot add a new model for each subsequent task.
  3. Fast adaptation and recovery
    • adaptation to novel tasks or domain shifts
    • fast recovery when presented with past tasks

25 of 77

Continual Learning desiderata and constraints (Cont.)

Minimizing catastrophic forgetting and interference.

    • Training on new tasks should not significantly reduce performance on previously learned tasks

Illustration of catastrophic forgetting

Credits: Hadsell, Raia, et al. "Embracing change: Continual learning in deep neural networks." Trends in cognitive sciences 24.12 (2020): 1028-1040.

Illustration of the perfect learning dynamics

26 of 77

Continual Learning desiderata and constraints (Cont.)

Minimizing catastrophic forgetting and interference.

    • Training on new tasks should not significantly reduce performance on previously learned tasks

Illustration of catastrophic forgetting

Notation: Evaluation Metrics

  • Average Accuracy
    • a_{k,j} ∈ [0, 1] is the accuracy on the test set of the j-th task after incremental learning of the k-th task (j ≤ k).
  • Average Incremental Accuracy
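These two metrics can be sketched directly from the accuracy matrix. The matrix values below are hypothetical, used only to exercise the definitions: acc[k-1][j-1] = a_{k,j}, the accuracy on task j's test set after incremental learning of task k.

```python
acc = [
    [0.95],              # after task 1
    [0.70, 0.93],        # after task 2 (task 1 partly forgotten)
    [0.55, 0.68, 0.94],  # after task 3
]

def average_accuracy(acc, k):
    """AA_k: mean accuracy over all k tasks seen so far (1-indexed k)."""
    return sum(acc[k - 1]) / k

def average_incremental_accuracy(acc, k):
    """AIA_k: the mean of AA_1, ..., AA_k."""
    return sum(average_accuracy(acc, i) for i in range(1, k + 1)) / k

print(round(average_accuracy(acc, 3), 3))              # 0.723
print(round(average_incremental_accuracy(acc, 3), 3))  # 0.829
```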

27 of 77

Continual Learning desiderata and constraints (Cont.)

Minimizing catastrophic forgetting and interference.

    • Training on new tasks should not significantly reduce performance on previously learned tasks

Illustration of catastrophic forgetting

Notation: Metrics

  • Forgetting of a task
    • the difference between its maximum performance obtained in the past and its current performance
  • Forgetting Measure
    • at the k-th task is the average forgetting of all old tasks
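A minimal sketch of these definitions, with the same kind of hypothetical accuracy matrix acc[k-1][j-1] = a_{k,j} used for the accuracy metrics:

```python
acc = [
    [0.95],
    [0.70, 0.93],
    [0.55, 0.68, 0.94],
]

def forgetting(acc, k, j):
    """Forgetting of task j after task k (1-indexed): the difference between
    its maximum past performance and its current performance."""
    best_past = max(acc[i - 1][j - 1] for i in range(j, k))
    return best_past - acc[k - 1][j - 1]

def forgetting_measure(acc, k):
    """Forgetting Measure at task k: average forgetting of all old tasks."""
    return sum(forgetting(acc, k, j) for j in range(1, k)) / (k - 1)

print(round(forgetting(acc, 3, 1), 2))       # 0.4
print(round(forgetting_measure(acc, 3), 3))  # 0.325
```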

28 of 77

Continual Learning desiderata and constraints (Cont.)

Maintaining plasticity

    • The model should be able to keep learning effectively as new tasks are observed
    • Failure causes: insufficient model capacity, excessive regularization

Illustration of declining plasticity

29 of 77

Continual Learning desiderata and constraints (Cont.)

Maximizing forward and backward transfer

    • Learning a task should improve related tasks, both past and future, in terms of both learning efficiency and performance

Illustration of forward transfer (FWT)

FWT evaluates the average influence of all old tasks on the current k-th task

classification accuracy of a randomly initialized reference model trained only on the j-th task

Learning on Task 1 also helps in future ones

30 of 77

Continual Learning desiderata and constraints (Cont.)

Maximizing forward and backward transfer

    • Learning a task should improve related tasks, both past and future, in terms of both learning efficiency and performance

Illustration of forward and backward transfer (BWT)

BWT evaluates the average influence of learning the k-th task on all old tasks

Learning on Task 2 also improves performance on previous Tasks
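BWT and FWT can be sketched from the accuracy matrix under commonly used (e.g. GEM-style) definitions. Both the accuracy matrix and the reference accuracies below are hypothetical: ref[j-1] is the accuracy of a randomly initialized reference model trained only on the j-th task.

```python
acc = [
    [0.95],
    [0.70, 0.93],
    [0.55, 0.68, 0.94],
]
ref = [0.94, 0.92, 0.93]  # from-scratch reference, one model per task

def backward_transfer(acc, k):
    """BWT_k: average influence of learning task k on all old tasks.
    Negative values indicate (catastrophic) forgetting."""
    return sum(acc[k - 1][j - 1] - acc[j - 1][j - 1]
               for j in range(1, k)) / (k - 1)

def forward_transfer(acc, ref, k):
    """FWT_k: average gain on each task over its from-scratch reference."""
    return sum(acc[j - 1][j - 1] - ref[j - 1]
               for j in range(2, k + 1)) / (k - 1)

print(round(backward_transfer(acc, 3), 3))    # -0.325 (forgetting)
print(round(forward_transfer(acc, ref, 3), 2))  # 0.01 (mild forward transfer)
```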

31 of 77

Summary: stability, plasticity, generalizability

Stability-plasticity tradeoff

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

Task specificity generalizability tradeoff

32 of 77

Settings/scenarios

CL practitioners commonly define three possible settings (the taxonomy is not fully consolidated):

  1. Task-Incremental:
    • different input distributions, disjoint output spaces, separated by task-ID; task-ID known at training and test time

What classes are these, given that they belong to Task i?

What classes are these?

Same classes, different input distributions (domains)

  2. Class-Incremental:
    • different input distributions, disjoint output space

  3. Domain-Incremental:
    • different input distributions

33 of 77

Settings/scenarios summary

CL practitioners commonly define three possible settings (the taxonomy is not fully consolidated):

Each task/domain has a different input distribution

Disjoint output space (e.g. different classes for each task)

Mai, Zheda, et al. "Online continual learning in image classification: An empirical survey." Neurocomputing 469 (2022): 28-51.

Output spaces are disjoint and separated by task-IDs

Task-ID known both at training and test-time

34 of 77

(Finer-grained taxonomy) Scenarios

Additional scenarios

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

Samples are seen only once (1 epoch)

  • General Continual Learning (GCL): the model observes incremental data in an online fashion, without explicit task boundaries

35 of 77

Datasets and Benchmarks

36 of 77

Datasets: Classics

SplitMNIST

SplitCIFAR

Permuted MNIST (domain incremental)

37 of 77

Datasets Taxonomy

Verwimp, Eli, et al. "Clad: A realistic continual learning benchmark for autonomous driving." Neural Networks 161 (2023): 659-669.

Strict boundaries

38 of 77

Towards real-world streams: desiderata

CL popular benchmarks:

  • Sharp Virtual drift
  • New classes
  • Balanced data
  • No temporal consistency

Real World Streams:

  • Gradual and sharp drifts
  • New domains and classes appear over time
  • Imbalanced distributions
  • Temporal consistency (e.g. video frames)

39 of 77

Evaluation with Real vs Virtual Drifts: Streaming Protocols for evaluation

Virtual drift Issues

  • sampling bias
  • Evaluation on a static test set
  • a.k.a. most of the CL research
    • violating the main assumption in CL, namely no access to previous task data

Credits to “Continual Learning Beyond Catastrophic Forgetting in Class-Incremental Scenarios” CoLLAs 2023 tutorial, V. Lomonaco and A. Carta

Real drift

  • Concept drift
    • e.g.: politician roles and affiliations to political party
  • Evaluation on the next data (e.g. prequential evaluation, a.k.a. test-then-train)
  • Not a lot of research in CL right now
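A prequential loop is easy to sketch. Below is a minimal test-then-train example on a hypothetical drifting stream (the labeling rule flips halfway, i.e. a real concept drift): each incoming sample is first used for testing, then for training.

```python
import random

random.seed(0)

def stream(n=1000):
    for t in range(n):
        x = random.uniform(-1.0, 1.0)
        # real concept drift halfway through: the labeling rule flips
        y = int(x > 0) if t < n // 2 else int(x < 0)
        yield x, y

w, b, lr = 0.0, 0.0, 0.1
correct, n = 0, 0
for x, y in stream():
    pred = int(w * x + b > 0.5)   # test on the incoming sample ...
    correct += int(pred == y)
    err = (w * x + b) - y         # ... then train on it (LMS update)
    w -= lr * err * x
    b -= lr * err
    n += 1

print(n, round(correct / n, 2))   # prequential (online) accuracy
```

The online accuracy dips right after the drift and recovers as the model adapts, which is exactly what a static test set cannot show.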

40 of 77

Real drifts datasets: CLEAR and CLOC

  • Real-world images with smooth temporal evolution
  • Large unlabeled dataset (~7.8M images)
  • Prequential evaluation
  • Scenario: domain-incremental and semi-supervised

Z. Lin et al. “The CLEAR Benchmark: Continual LEArning on Real-World Imagery” NeurIPS 2021

Z. Cai et al. “Online Continual Learning with Natural Distribution Shifts: An Empirical Study with Visual Data.” ICCV ‘21

Credits to “Continual Learning Beyond Catastrophic Forgetting in Class-Incremental Scenarios” CoLLAs 2023 tutorial, V. Lomonaco and A. Carta

41 of 77

Temporal Coherence - CORe50 and SAILenv

  • Temporally coherent streams (videos)
    • dynamic environments
    • Many classes with several movement patterns
    • Many scenarios are possible: batch, online, with repetitions

M. Meloni,L. Pasqualini, M. Tiezzi et al. "Sailenv: Learning in virtual visual environments made simple." 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021.

Lomonaco, Vincenzo, and Davide Maltoni. "Core50: a new dataset and benchmark for continuous object recognition." Conference on robot learning. PMLR, 2017.

Meloni, Enrico, et al. "Evaluating continual learning algorithms by generating 3d virtual environments." International Workshop on Continual Semi-Supervised Learning. 2021

42 of 77

CL datasets for sequential models: a less explored direction

Cossu, Andrea, et al. "Continual learning for recurrent neural networks: an empirical evaluation." Neural Networks 143 (2021): 607-627.

Incrementally increase the sequence lengths or add new classes

Mainly tested on RNNs and LSTMs

43 of 77

Methods

44 of 77

Continual Learning methods: a Taxonomy

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

45 of 77

Regularization-based Approaches

Intuition:

  • Regularization on some selected (here is the hard part) parameters
    • Seminal methods:
      • Elastic Weight Consolidation (EWC)
      • Synaptic Intelligence (SI)
  • Distillation to mitigate parameter drift:
    • Learning without Forgetting (LwF)

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

46 of 77

Regularization-based Approaches: Elastic Weight Consolidation (EWC) (Kirkpatrick 2017)

TL;DR: Quadratic penalty to regularize the update of model parameters that were important to past tasks.

  • Parameter importance approximated by the diagonal Fisher Information Matrix F
    • e.g. given two tasks A and B seen in sequence, the loss function in EWC is:

L(θ) = L_B(θ) + Σ_j (λ/2) F_j (θ_j − θ*_{A,j})²

where L_B is the Task B loss; F_j is the j-th diagonal entry of the Fisher matrix, which measures the amount of information that an observable random variable carries about an unknown parameter θ of the distribution that models it; and θ*_{A,j} is the optimal value of the j-th parameter after learning task A.
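The penalty term is a one-liner; this is a minimal sketch with hypothetical parameter, Fisher, and λ values: parameters that were important for task A (large Fisher entries) are anchored to their old optima.

```python
theta      = [0.8, -0.3, 1.1]   # current parameters, while learning task B
theta_star = [1.0,  0.0, 1.0]   # optimal parameters after task A
fisher     = [2.0,  0.1, 0.5]   # diagonal Fisher: per-parameter importance
lam = 1.0                       # regularization strength

penalty = sum(lam / 2.0 * f * (t - ts) ** 2
              for f, t, ts in zip(fisher, theta, theta_star))
# the full EWC loss would be: loss_B(theta) + penalty
print(round(penalty, 4))  # 0.047
```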

47 of 77

Regularization-based Approaches: Learning without Forgetting (LwF) (Li 2017)

TL;DR: Knowledge distillation (using predictions from the output head of the old tasks) to preserve knowledge

  • New task (Xn, Yn)
  • Teacher model: model after learning the last task
  • Student model: model trained with the current task

L(θ) = L_CE(Y_n, Ŷ_n) + λ_o · L_KD(Y_o, Ŷ_o)

where:
  • L_CE(Y_n, Ŷ_n) is the standard cross-entropy on new task data
  • Ŷ_n is the value predicted for the new task on Xn
  • Ŷ_o is the value predicted for the old task on Xn
  • Y_o is the output for the old task on Xn, computed before learning on the current task
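A minimal sketch of the distillation term with hypothetical logits: the old-task head's outputs, recorded before training on the new task, act as soft targets (softened by a temperature T) for the current student model on new-task inputs Xn.

```python
import math

def softmax_T(logits, T):
    """Temperature-softened softmax (T > 1 smooths the distribution)."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    p = softmax_T(teacher_logits, T)  # recorded before the current task
    q = softmax_T(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

loss_kd = distill_loss([2.0, 0.5, -1.0], [1.8, 0.7, -0.9])
# the full LwF loss would be: CE(new-task labels) + lambda_o * loss_kd
```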

48 of 77

Replay-based Approaches

Intuition:

  • Use two mini-batches: the incoming one, and one from a memory buffer
  • Model is updated using both mini-batches
  • Three directions:
    • Experience Replay
    • Generative Replay
    • Feature Replay

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

49 of 77

Replay-based Approaches: Experience Replay

Reservoir sampling: every sample has the same probability, mem_sz/n, to be stored in the memory buffer

mem_sz is the size of the buffer

n is the #samples observed up to now

Jeffrey S Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985.
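The buffer update rule is a few lines; after n samples, every one of them had the same probability mem_sz / n of being kept:

```python
import random

random.seed(0)
mem_sz = 5
buffer = []
for n, sample in enumerate(range(100), start=1):
    if len(buffer) < mem_sz:
        buffer.append(sample)        # fill the buffer first
    else:
        i = random.randrange(n)      # uniform index in [0, n)
        if i < mem_sz:
            buffer[i] = sample       # replace with probability mem_sz / n

print(len(buffer))  # 5
```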

50 of 77

Replay-based Approaches: Maximally Interfered Retrieval (MIR)

MIR: retrieves the replay samples that are maximally interfered with (i.e., whose loss increases the most) by a virtual parameter update computed on the incoming mini-batch

R. Aljundi, E. Belilovsky, T. Tuytelaars, L. Charlin, M. Caccia, M. Lin, L. Page-Caccia, Online continual learning with maximal interfered retrieval, in: Advances in Neural Information Processing Systems 32, 2019,

Top-k samples achieving maximal loss increase
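The retrieval step can be sketched on a toy 1-D linear model y = w·x (all values hypothetical): take a "virtual" SGD step on the incoming mini-batch, then retrieve the memory samples whose loss would increase the most under that update.

```python
def loss(w, x, y):
    return (w * x - y) ** 2

def virtual_update(w, batch, lr=0.1):
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad

w = 1.0
incoming = [(1.0, 2.0), (2.0, 4.0)]            # pulls w towards 2
memory = [(1.0, 1.0), (3.0, 3.0), (0.5, 2.0)]  # buffer of old samples

w_virtual = virtual_update(w, incoming)
increase = {s: loss(w_virtual, *s) - loss(w, *s) for s in memory}
top_k = sorted(memory, key=lambda s: -increase[s])[:2]  # maximally interfered
print(top_k)  # the samples fit by w = 1 that the virtual update hurts most
```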

51 of 77

Replay-based Approaches: Dark Experience Replay (DER & DER++)

DER, TL;DR: logit replay. Preserve old training samples together with their logits, and perform logit matching throughout the optimization trajectory

Buzzega, Pietro, et al. "Dark experience for general continual learning: a strong, simple baseline." Advances in neural information processing systems 33 (2020): 15920-15930.

Minimizing the Euclidean distance between the logit from memory and the current logits on the memory sample.

Additional term on buffer datapoints and their labels

Also logits are stored in memory
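A minimal sketch of the two replay terms on hypothetical values: match the logits stored in the buffer with an MSE term (DER); DER++ adds a cross-entropy term on the buffer labels as well.

```python
import math

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_entropy(logits, label):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]

stored_logits  = [2.1, -0.3, 0.4]  # saved in memory together with the sample
current_logits = [1.8,  0.0, 0.2]  # model output on that buffer sample now
alpha, beta = 0.5, 0.5

der_term   = alpha * mse(current_logits, stored_logits)  # DER
derpp_term = beta * cross_entropy(current_logits, 0)     # DER++ extra term
# the full loss would be: CE(incoming batch) + der_term (+ derpp_term)
```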

52 of 77

Replay-based Approaches: Greedy Sampler and Dumb Learner (GDumb)

GDumb: greedily updates the memory buffer keeping a balanced class distribution. At inference, it trains a model from scratch using the memory buffer only.

Prabhu, Ameya, Philip HS Torr, and Puneet K. Dokania. "Gdumb: A simple approach that questions our progress in continual learning." Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer International Publishing, 2020.
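The greedy balanced sampler can be sketched as follows (a simplified illustration, not GDumb's exact code): keep the class distribution in the buffer as even as possible, evicting from the currently largest class when a smaller class needs room.

```python
from collections import Counter

def gdumb_update(buffer, sample, label, mem_sz):
    if len(buffer) < mem_sz:
        buffer.append((sample, label))
        return
    counts = Counter(y for _, y in buffer)
    biggest = max(counts, key=counts.get)
    if counts[label] < counts[biggest]:
        # evict one sample of the over-represented class
        idx = next(i for i, (_, y) in enumerate(buffer) if y == biggest)
        buffer[idx] = (sample, label)

buffer = []
for i in range(90):            # heavily imbalanced stream: class 0 ...
    gdumb_update(buffer, i, 0, mem_sz=10)
for i in range(10):            # ... then a few samples of class 1
    gdumb_update(buffer, i, 1, mem_sz=10)

print(Counter(y for _, y in buffer))  # balanced: 5 per class
```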

53 of 77

Architecture-based Approaches

Intuition:

  • constructing task-specific parameters instead of learning all incremental tasks with a shared set of parameters
    • Parameter allocation:
      • fixed architecture, select paths/identify important neurons for current task (e.g. Piggyback, HAT, SupSup methods)
        • sparsity
      • alternative: dynamically expand the network if the capacity is not sufficient

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

54 of 77

Architecture-based Approaches

Intuition:

  • Modular Network:
    • parallel sub-networks or sub-modules to learn incremental tasks in a differentiated manner
      • Progressive Networks (Rusu et al. 2016):
        • introduce an identical sub-network for each task + knowledge transfer from other sub-networks
      • Expert Gate (Aljundi 2017)/ Routing Net. (Collier 2020):
        • Mixture of Experts (MoE) to learn incremental tasks, expanding one expert as each task is introduced

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

55 of 77

Architecture-based Approaches: Continual Neural Units (Tiezzi et al. 2024)

Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. Continual Neural Computation. In ECML-PKDD 2024

56 of 77

Architecture-based Approaches: Continual Neural Units (Tiezzi et al. 2024)

A Continual Neural Unit generates its own weight vector as a function of the input region in which the neuron is expected to operate

  • Given a set of weight vectors (memory units), a few of them are blended to generate the "right" weight vector to use, as a function of x
  • The other weight vectors of the set are left untouched

Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. (2024, August). Continual Neural Computation. In ECML-PKDD 2024

57 of 77

Classic Neural Unit ➙ Continual Neural Unit

1. Function that generates the weight vector to use as a function of x

2. Matrix storing multiple weight vectors (memory units), one per row

3. Matrix storing multiple keys, used for indexing purposes (compared to x)

A Continual Neural Unit generates its own weight vector as a function of the input region in which the neuron is expected to operate

  • Given a set of weight vectors (memory units), a few of them are blended to generate the "right" weight vector to use, as a function of x
  • The other weight vectors of the set are left untouched

Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. (2024, August). Continual Neural Computation. In ECML-PKDD 2024

58 of 77

Continual Neural Unit (1/4)

K: Learnable Keys

M: Learnable Memory Units

  • Attention is exploited to generate the weight vector w which drives the neural computation

Neuron Output (previous slide)

Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. (2024, August). Continual Neural Computation. In ECML-PKDD 2024

59 of 77

Continual Neural Unit (2/4)

  • Attention is exploited to generate the weight vector w which drives the neural computation

Attention scores by sparse softmax-based attention, custom similarity function

Neuron Output (previous slide)

Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. (2024, August). Continual Neural Computation. In ECML-PKDD 2024

K: Learnable Keys

M: Learnable Memory Units

60 of 77

Continual Neural Unit (3/4)

Weight vector generated by blending memories M as a function of the attention scores

  • Attention is exploited to generate the weight vector w which drives the neural computation

Neuron Output (previous slide)

Attention scores by sparse softmax-based attention, custom similarity function

Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. (2024, August). Continual Neural Computation. In ECML-PKDD 2024

K: Learnable Keys

M: Learnable Memory Units

61 of 77

Continual Neural Unit (4/4)

  • Attention is exploited to generate the weight vector w which drives the neural computation

Attention scores by sparse softmax-based attention, custom similarity function

Benefits in Continual Learning?

Only the top-δ memory units participate in the generation of the weight vector

  • The other ones are not considered at all (parameter isolation), preventing catastrophic forgetting
  • The information stored in the δ units is merged, promoting information transfer

Tiezzi, M., Marullo, S., Becattini, F., & Melacci, S. (2024, August). Continual Neural Computation. In ECML-PKDD 2024
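A toy sketch of the forward computation (all keys, memory units, and δ below are hypothetical values, not the paper's parametrization): similarity between the input x and the learnable keys K selects the top-δ memory units in M, which are blended by a sparse softmax into the weight vector w; the neuron then computes w·x as usual.

```python
import math

K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one key per memory unit
M = [[2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]  # memory units = candidate weights
delta = 2                                  # only top-delta units participate

def cnu_forward(x, K, M, delta):
    # similarity: negative squared distance between x and each key
    sims = [-sum((k - xi) ** 2 for k, xi in zip(key, x)) for key in K]
    top = sorted(range(len(K)), key=lambda i: -sims[i])[:delta]
    exps = {i: math.exp(sims[i]) for i in top}  # sparse softmax over top-delta
    z = sum(exps.values())
    w = [sum(exps[i] / z * M[i][d] for i in top) for d in range(len(x))]
    return sum(wd * xd for wd, xd in zip(w, x)), top

y, used = cnu_forward([0.9, 0.1], K, M, delta)
print(used)  # the unit with key [-1, 0] is untouched (parameter isolation)
```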

62 of 77

Learning with Continual Neural Units

Max-isolation: Winner-Take-All (WTA) updates of memory units and keys

  • The most-responding (winning) key is identified, and only the corresponding memory unit is updated by gradient descent (WTA)
  • The winning key is updated by an online K-means-like criterion that pushes the key toward the current input of the data stream

Current input (possibly projected by ψ, if any)

Winning key

Customizable positive coefficient

Similarity function (attention)

+ criteria to replace unused keys, based on how recently and how often they have been the winning key
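The key update can be sketched in one line (β is the customizable positive coefficient; values are hypothetical):

```python
def wta_key_update(key, x, beta=0.1):
    # online K-means-like step: push the winning key toward the input x
    return [k + beta * (xi - k) for k, xi in zip(key, x)]

key = [0.0, 0.0]
for _ in range(100):                  # repeatedly wins for inputs near (1, 2)
    key = wta_key_update(key, [1.0, 2.0])
print([round(k, 2) for k in key])     # [1.0, 2.0]
```

The key drifts into the input region it is responsible for, which is exactly what anchors each memory unit to its region of the input space.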

63 of 77

2D Experiment

  • Class and domain incremental, online learning (one sample at a time, single pass); each class is a bi-modal Gaussian, whose two modes are considered sequentially

Each star is a key in K

In this "cloud" there are several data points, streamed one after the other; color = class ID

68 of 77

Optimization-based Approaches

[*] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE TPAMI (2024).

69 of 77

Representation-based Approaches

Intuition:

  • exploit the strengths of (pre-trained) representations for continual learning
    • self-supervised learning to help CL
    • pre-training for downstream CL
      • Foundational models, fixed backbones, employ Adapters or prompt-based approaches
    • continual pre-training
      • data required for pre-training is typically collected in an incremental manner
      • LLMs etc.

Zhou, Da-Wei, et al. "Continual learning with pre-trained models: A survey." arXiv preprint arXiv:2401.16386 (2024).

70 of 77

Representation-based Approaches: Learning to Prompt for CL (L2P, Wang et al. 2022)

TL;DR: prepend a set of learnable parameters (i.e., prompts) to the current input. Prompts are gathered from a prompt pool via a key-query matching strategy.

Top-N key selection

optimize the keys during learning

71 of 77

Representation-based Approaches: Memory Head for CL (MH, Tiezzi et al. 2024)

TL;DR: neurons route their computation across a learnable key-value mechanism, exhibiting a dynamic behaviour depending on their own input. Only the units (weights) that are relevant to process the observed sample are blended.

Tiezzi, M., Becattini, F., Marullo, S., & Melacci, S. (2024, August). Memory Head for Pre-Trained Backbones in Continual Learning. In CoLLAs 2024

Learnable Keys

Learnable Memory Units

Attention scores

obtained by blending memories M, indexed by keys K, in function of their input

Backbone features

  • No replay buffers!
  • No task-id required!

72 of 77

Concluding remarks

Issues and opportunities

73 of 77

Computational costs, Real-Time computation, Memory constraints

  • When compared in real-world settings, simple Experience Replay outperforms all CL methods:
    • current evaluations: no constraint on training time and computation
    • Evaluation in [1] w.r.t. computational costs:
      • the stream does not wait for the model to complete training before revealing the next data
        • skip data if the model is not fast enough

[1] Y. Ghunaim et al. “Real-Time Evaluation in Online Continual Learning: A New Hope.” CVPR ’23

A. Prabhu et al. “Computationally Budgeted Continual Learning: What Does Matter?” CVPR ‘23

74 of 77

Model selection issues in CL

  • Most researchers perform a full hyperparameter selection on the entire validation stream
    • i.e. also on validation data from previous tasks
    • violating the main assumption in CL - namely no access to previous task data
    • it’s suboptimal because optimal parameters may vary over time

  • Possible solution
    • use only the first part of the validation stream for hyperparameter selection [1]

[1] A. Chaudhry et al. “Efficient Lifelong Learning with A-GEM.” 2019

75 of 77

Main messages

The goal of Continual Learning is to understand how to design machine learning models that learn over time

  • on a constrained budget (memory/compute/real-time requirements)
  • with non-stationary data
  • much wider than «class-incremental learning» or «finetuning a pretrained model»

76 of 77

Main messages: Unsolved CL questions and Future directions

  • Push towards more realistic settings
    • Toy data is fine for research, toy settings not so much
    • CL metrics can be misleading and very easy to abuse
    • Continual hyperparameter selection (and robustness)
    • Compute-bounded continual learning

Verwimp, Eli, et al. "Continual learning: Applications and the road forward." arXiv preprint arXiv:2311.11908 (2023).

77 of 77

Credits

Thanks for your attention!

Acknowledgments and Credits

Credits to the wonderful talks by R. Pascanu (“Embracing Change: Continual Learning in Deep Neural Networks”, UCL Centre for AI. Recorded on the 5th May 2021), V. Lomonaco (“Continual Learning Beyond Catastrophic Forgetting in Class-Incremental Scenarios”, CoLLas 2023 tutorial, and ContinualAI CL course), M. Mundt (ContinualAI lectures) and D. Abati (Intro to CL @CVPR 2020)