1 of 35

On Tiny Episodic Memories in Continual Learning

Continual Learning - Winter 2021

Presented by:

Ali Rahimi-Kalahroudi and Darshan Patil

Université de Montréal

2 of 35

Outline

  • Introduction
  • Background on ER
  • Method
  • Experiments
  • Conclusion and Discussion

3 of 35

Introduction

  • Continual Learning Objective
    • Learn new skills from a sequence of tasks
    • Leveraging the knowledge accumulated in the past
  • Catastrophic Forgetting
    • Inability to recall how to perform old tasks

4 of 35

Proposed Solutions

  • Regularization-Based CL Approaches
    • Penalizes feature drift
    • Discourages changes to parameters important for past tasks
  • Parameter Isolation CL Approaches
    • Fixed network architecture
    • Dynamic architectures
  • Memory-Based CL Approaches
    • Constrain the optimization
    • Rehearsal

5 of 35

Background: Experience Replay in RL

  • Widely used in RL
  • DQN [Mnih et al., 2013, 2015]
  • Store sampled transitions from past trajectories in a replay buffer
  • Train the estimator on batches drawn from the buffer

6 of 35

Constraints and Assumptions

  • Model
    • Fixed network architecture
  • Dataset
    • Each task is fully supervised
    • Each example from a task can only be seen once
    • Access to a small memory storing examples of past tasks

7 of 35

Method

  • Protocol for Single-Pass Through the Data
  • Metrics
  • ER and Learning
  • Strategies for Updating the Memory

8 of 35

Protocol for Single-Pass Through the Data

  • Two Streams of Tasks
    • Cross-validation stream: used to tune hyperparameters
    • Evaluation stream: used for training and reporting
      • A held-out test set is drawn from this stream
  • Each task's data is seen in a single pass

9 of 35

Metrics

  • Average Accuracy
    • Let $a_{i,j}$ be the performance on the held-out test set of task $j$ after training on task $i$
    • The average accuracy after training on task $T$:
      $A_T = \frac{1}{T} \sum_{j=1}^{T} a_{T,j}$

10 of 35

Metrics (Cont.)

  • Forgetting
    • Forgetting on task $j$ after training on task $i$: the gap between the best accuracy ever achieved on task $j$ and the current one,
      $f_j^i = \max_{l \in \{1, \dots, i-1\}} a_{l,j} - a_{i,j}$
    • The average forgetting after training on task $T$:
      $F_T = \frac{1}{T-1} \sum_{j=1}^{T-1} f_j^T$
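
A small NumPy sketch that computes both metrics from a matrix of per-task accuracies (the helper name and matrix layout are illustrative, not from the paper):

```python
import numpy as np

def cl_metrics(R):
    """R[i, j] = test accuracy on task j after training on task i
    (a T x T matrix). Returns (average accuracy A_T, average forgetting F_T)."""
    T = R.shape[0]
    avg_acc = R[T - 1].mean()                            # A_T
    best_past = R[:T - 1, :T - 1].max(axis=0)            # max_l a_{l,j}
    avg_forget = (best_past - R[T - 1, :T - 1]).mean()   # F_T
    return avg_acc, avg_forget
```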

11 of 35

Learning Algorithm
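
The ER learning loop: each incoming batch is combined with a small batch sampled from episodic memory, one gradient step is taken on the joint batch, and the incoming batch is then written to memory. A minimal PyTorch-style sketch of that loop, assuming a `memory` object with `sample(k)` and `write(x, y)` methods (illustrative names):

```python
def train_with_er(model, optimizer, criterion, stream, memory, mem_batch=10):
    """Single pass over a task stream with experience replay (sketch).
    `memory` is assumed to expose len(), sample(k), and write(x, y)."""
    model.train()
    for x, y in stream:                          # each batch is seen once
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        if len(memory) > 0:
            mx, my = memory.sample(mem_batch)    # replay past examples
            loss = loss + criterion(model(mx), my)
        loss.backward()
        optimizer.step()
        memory.write(x, y)  # writing strategy: reservoir, ring buffer, ...
```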

12 of 35

Strategies for Updating the Memory

  • Reservoir Sampling
  • Ring Buffer
  • k-Means
  • Mean of Features (MoF)
  • Hybrids

13 of 35

Reservoir Sampling

  • Input
    • A stream of data of unknown length
  • Maintains a uniformly random subset of the stream seen so far
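
A minimal sketch of the classic reservoir update (Algorithm R): after seeing n examples, each of them sits in memory with probability capacity / n:

```python
import random

def reservoir_update(memory, capacity, example, n_seen):
    """n_seen = number of stream examples seen so far, including this one."""
    if len(memory) < capacity:
        memory.append(example)
    else:
        j = random.randrange(n_seen)  # uniform over all examples seen
        if j < capacity:
            memory[j] = example       # replace a random slot
```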

14 of 35

Ring Buffer

  • An equally sized FIFO buffer for each class; the oldest example of a class is evicted when its buffer is full
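
A minimal sketch using one FIFO deque per class (class and method names are illustrative):

```python
from collections import defaultdict, deque

class RingBufferMemory:
    """One equally sized FIFO buffer per class."""
    def __init__(self, mem_size, n_classes):
        per_class = max(1, mem_size // n_classes)
        self.buffers = defaultdict(lambda: deque(maxlen=per_class))

    def write(self, x, y):
        self.buffers[y].append(x)  # deque silently drops the oldest
```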

15 of 35

k-Means

  • For each class, run online k-Means in feature space
    • Features: the representation before the last classification layer
  • Store the samples that are closest to the centroids
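
A sketch of a single online k-Means step in feature space (names illustrative); the memory then keeps, for each centroid, the stored sample currently closest to it:

```python
import numpy as np

def online_kmeans_step(centroids, counts, feat):
    """Move the nearest centroid toward the new feature vector with a
    1/count step size; returns the index of the updated centroid."""
    k = int(np.argmin(((centroids - feat) ** 2).sum(axis=1)))
    counts[k] += 1
    centroids[k] += (feat - centroids[k]) / counts[k]
    return k
```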

16 of 35

Mean of Features (MoF)

  • Keep a running estimate of the average feature vector of each class
  • Store the samples whose features are closest to that average
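
A per-class sketch (names illustrative): maintain an incremental mean of feature vectors and rank candidate samples by their distance to it:

```python
import numpy as np

class ClassFeatureMean:
    """Running mean of the feature vectors of one class."""
    def __init__(self, dim):
        self.mean = np.zeros(dim)
        self.count = 0

    def update(self, feat):
        self.count += 1
        self.mean += (feat - self.mean) / self.count  # incremental mean

    def distance(self, feat):
        return float(np.linalg.norm(feat - self.mean))
```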

17 of 35

Hybrids

  • A mixture of the previous strategies
  • Or any other combination of writing strategies

18 of 35

Experiments

19 of 35

Datasets

  • Permuted/Rotated MNIST
    • 23 tasks, each defined by a random permutation of the input pixels (or a fixed rotation, for Rotated MNIST)
    • 60,000 training examples
  • Split CIFAR
    • CIFAR-100 split into 20 disjoint sets of 5 classes each
    • 600 images per class
  • Split miniImagenet
    • A subset of ImageNet with 100 classes and 600 images per class, split into 20 disjoint subsets
  • Split CUB
    • CUB (a bird dataset) split into 20 disjoint sets of 10 classes each
    • 6,033 images

20 of 35

Baselines

  • Finetune
    • Train sequentially on the stream with no mechanism against forgetting
  • EWC [Kirkpatrick et al., 2016]
    • Uses the Fisher information matrix to slow down learning on parameters important for previous tasks
  • A-GEM [Chaudhry et al., 2019]
    • Projects the gradient so it does not conflict with the gradient of previous examples, approximated with episodic memory (see the sketch after this list)
  • MER [Riemer et al., 2019]
    • Combines the Reptile meta-learning algorithm with an experience replay module
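
For reference, a minimal sketch of A-GEM's gradient correction on flattened gradients (the baseline's rule, not this paper's method):

```python
import torch

def agem_correct(g, g_ref):
    """If the proposed gradient g conflicts with the reference gradient
    g_ref (computed on episodic memory examples), project it onto the
    half-space where the memory loss does not increase."""
    dot = torch.dot(g, g_ref)
    if dot < 0:
        g = g - (dot / torch.dot(g_ref, g_ref)) * g_ref
    return g
```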

21 of 35

Results

22 of 35

Results

  • ER generally outperforms both the episodic-memory-based and the non-memory-based baselines

23 of 35

Results

  • Accuracy increases with memory size

24 of 35

Results

  • Reservoir sampling seems to work the best, except when memory per class is small

25 of 35

Results

  • When memory per class is small, writing strategies that keep a balanced number of samples per class work best

26 of 35

Hybrid Experience Replay Writing Methods

  • Use reservoir sampling while every class is well represented in memory
  • Fall back to ring-buffer-style protection when some class has only a few stored examples left (see the sketch below)
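
A self-contained sketch of one such hybrid; the `min_per_class` threshold is an assumption for illustration, not a value from the paper:

```python
import random
from collections import Counter

def hybrid_write(memory, capacity, example, label, n_seen, min_per_class=2):
    """Reservoir sampling by default, but skip an eviction that would
    leave the victim's class with too few stored examples."""
    if len(memory) < capacity:
        memory.append((example, label))
        return
    j = random.randrange(n_seen)
    if j < capacity:
        counts = Counter(lbl for _, lbl in memory)
        if counts[memory[j][1]] > min_per_class:
            memory[j] = (example, label)  # safe reservoir replacement
```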

27 of 35

Why does this work?

  • Experiment: in a two-task setup, after training on task 1, track accuracy on task 1 while training on task 2 data plus replay from the task 1 memory
  • Use Rotated MNIST to gauge the effect of task similarity

28 of 35

Why does this work?

  • The approach does memorize the replay examples (they are trained to zero loss)
  • The results imply that:
    • The task 1 memory somewhat improves performance compared to finetuning
    • Training on task 2 data acts as a regularizer, preventing overfitting to the tiny task 1 memory

29 of 35

QA

34 of 35

Conclusion

  • A strong yet very simple baseline for continual learning
  • Uses past data itself as the regularizer, instead of regularizing the network's behavior on past data
  • Basic empirical results and analysis

35 of 35

Future Directions?

  • Theoretical results
  • New sampling strategies
  • Data dependent memory sizes
  • Storing higher level representations