1 of 35

On Tiny Episodic Memories in Continual Learning

Continual Learning - Winter 2021

Presented by:

Ali Rahimi-Kalahroudi and Darshan Patil

Université de Montréal

2 of 35

Outline

  • Introduction
  • Background on ER
  • Method
  • Experiments
  • Conclusion and Discussion

3 of 35

Introduction

  • Continual Learning Objective
    • Learn new skills from a sequence of tasks
    • Leveraging the knowledge accumulated in the past
  • Catastrophic Forgetting
    • Inability to recall how to perform old tasks

4 of 35

Proposed Solutions

  • Regularization-Based CL Approaches
    • Penalizes feature drift
    • Discourages changes to parameters important for past tasks
  • Parameter Isolation CL Approaches
    • Fixed network architecture
    • Dynamic architectures
  • Memory-Based CL Approaches
    • Constrain the optimization
    • Rehearsal

5 of 35

Background: Experience Replay in RL

  • Widely used in RL
  • DQN [Mnih et al., 2013, 2015]
  • Store sampled transitions from past trajectories in a replay buffer
  • Train the estimator on batches drawn from the buffer

6 of 35

Constraints and Assumptions

  • Model
    • Fixed network architecture
  • Dataset
    • Each task is fully supervised
    • Each example from a task can only be seen once
    • Access to a small memory storing examples of past tasks

7 of 35

Method

  • Protocol for Single-Pass Through the Data
  • Metrics
  • ER and Learning
  • Strategies for Updating the Memory

8 of 35

Protocol for Single-Pass Through the Data

  • Two Streams of Tasks
    • Cross-validation stream: used to tune hyperparameters
    • Evaluation stream: used for training and reporting
      • A held-out test set is drawn from this stream
  • Each task's data is seen in a single pass

9 of 35

Metrics

  • Average Accuracy
    • Let $a_{i,j}$ be the performance on the held-out test set of task $j$ after training on task $i$
    • The average accuracy after training on task $T$:
      $A_T = \frac{1}{T} \sum_{j=1}^{T} a_{T,j}$

10 of 35

Metrics (Cont.)

  • Forgetting
    • Forgetting on task $j$ after training on task $i$: the gap between the best accuracy ever achieved on task $j$ and the current one,
      $f_j^i = \max_{l \in \{1, \dots, i-1\}} a_{l,j} - a_{i,j}$
    • The average forgetting after training on task $T$:
      $F_T = \frac{1}{T-1} \sum_{j=1}^{T-1} f_j^T$
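
A small NumPy sketch that computes both metrics from a matrix of per-task accuracies (the helper name and matrix layout are illustrative, not from the paper):

```python
import numpy as np

def cl_metrics(R):
    """R[i, j] = test accuracy on task j after training on task i
    (a T x T matrix). Returns (average accuracy A_T, average forgetting F_T)."""
    T = R.shape[0]
    avg_acc = R[T - 1].mean()                            # A_T
    best_past = R[:T - 1, :T - 1].max(axis=0)            # max_l a_{l,j}
    avg_forget = (best_past - R[T - 1, :T - 1]).mean()   # F_T
    return avg_acc, avg_forget
```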

11 of 35

Learning Algorithm
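
The ER learning loop: each incoming batch is combined with a small batch sampled from episodic memory, one gradient step is taken on the joint batch, and the incoming batch is then written to memory. A minimal PyTorch-style sketch of that loop, assuming a `memory` object with `sample(k)` and `write(x, y)` methods (illustrative names):

```python
def train_with_er(model, optimizer, criterion, stream, memory, mem_batch=10):
    """Single pass over a task stream with experience replay (sketch).
    `memory` is assumed to expose len(), sample(k), and write(x, y)."""
    model.train()
    for x, y in stream:                          # each batch is seen once
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        if len(memory) > 0:
            mx, my = memory.sample(mem_batch)    # replay past examples
            loss = loss + criterion(model(mx), my)
        loss.backward()
        optimizer.step()
        memory.write(x, y)  # writing strategy: reservoir, ring buffer, ...
```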

12 of 35

Strategies for Updating the Memory

  • Reservoir Sampling
  • Ring Buffer
  • k-Means
  • Mean of Features (MoF)
  • Hybrids

13 of 35

Reservoir Sampling

  • Input
    • A stream of data of unknown length
  • Maintains a uniformly random subset of the stream seen so far
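
A minimal sketch of the classic reservoir update (Algorithm R): after seeing n examples, each of them sits in memory with probability capacity / n:

```python
import random

def reservoir_update(memory, capacity, example, n_seen):
    """n_seen = number of stream examples seen so far, including this one."""
    if len(memory) < capacity:
        memory.append(example)
    else:
        j = random.randrange(n_seen)  # uniform over all examples seen
        if j < capacity:
            memory[j] = example       # replace a random slot
```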

14 of 35

Ring Buffer

  • An equally sized FIFO buffer for each class; the oldest example of a class is evicted when its buffer is full
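
A minimal sketch using one FIFO deque per class (class and method names are illustrative):

```python
from collections import defaultdict, deque

class RingBufferMemory:
    """One equally sized FIFO buffer per class."""
    def __init__(self, mem_size, n_classes):
        per_class = max(1, mem_size // n_classes)
        self.buffers = defaultdict(lambda: deque(maxlen=per_class))

    def write(self, x, y):
        self.buffers[y].append(x)  # deque silently drops the oldest
```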

15 of 35

k-Means

  • For each class, run online k-Means in feature space
    • Features: the representation before the last classification layer
  • Store the samples that are closest to the centroids
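
A sketch of a single online k-Means step in feature space (names illustrative); the memory then keeps, for each centroid, the stored sample currently closest to it:

```python
import numpy as np

def online_kmeans_step(centroids, counts, feat):
    """Move the nearest centroid toward the new feature vector with a
    1/count step size; returns the index of the updated centroid."""
    k = int(np.argmin(((centroids - feat) ** 2).sum(axis=1)))
    counts[k] += 1
    centroids[k] += (feat - centroids[k]) / counts[k]
    return k
```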

16 of 35

Mean of Features (MoF)

  • Keep a running estimate of the average feature vector of each class
  • Store the samples whose features are closest to that average
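
A per-class sketch (names illustrative): maintain an incremental mean of feature vectors and rank candidate samples by their distance to it:

```python
import numpy as np

class ClassFeatureMean:
    """Running mean of the feature vectors of one class."""
    def __init__(self, dim):
        self.mean = np.zeros(dim)
        self.count = 0

    def update(self, feat):
        self.count += 1
        self.mean += (feat - self.mean) / self.count  # incremental mean

    def distance(self, feat):
        return float(np.linalg.norm(feat - self.mean))
```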

17 of 35

Hybrids

  • A mixture of the previous strategies
  • Or any other combination of writing strategies

18 of 35

Experiments

19 of 35

Datasets

  • Permuted/Rotated MNIST
    • 23 tasks, each defined by a random permutation of the input pixels (or a fixed rotation, for Rotated MNIST)
    • 60,000 training examples
  • Split CIFAR
    • CIFAR-100 split into 20 disjoint sets of 5 classes each
    • 600 images per class
  • Split miniImagenet
    • A subset of ImageNet with 100 classes and 600 images per class, split into 20 disjoint subsets
  • Split CUB
    • CUB (a bird dataset) split into 20 disjoint sets of 10 classes each
    • 6,033 images

20 of 35

Baselines

  • Finetune
    • Train sequentially on the stream with no mechanism against forgetting
  • EWC [Kirkpatrick et al., 2016]
    • Uses the Fisher information matrix to slow down learning on parameters important for previous tasks
  • A-GEM [Chaudhry et al., 2019]
    • Projects the gradient so it does not conflict with the gradient of previous examples, approximated with episodic memory (see the sketch after this list)
  • MER [Riemer et al., 2019]
    • Combines the Reptile meta-learning algorithm with an experience replay module
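
For reference, a minimal sketch of A-GEM's gradient correction on flattened gradients (the baseline's rule, not this paper's method):

```python
import torch

def agem_correct(g, g_ref):
    """If the proposed gradient g conflicts with the reference gradient
    g_ref (computed on episodic memory examples), project it onto the
    half-space where the memory loss does not increase."""
    dot = torch.dot(g, g_ref)
    if dot < 0:
        g = g - (dot / torch.dot(g_ref, g_ref)) * g_ref
    return g
```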

21 of 35

Results

22 of 35

Results

  • ER generally outperforms both the episodic-memory-based and the non-memory-based baselines

23 of 35

Results

  • Accuracy increases with memory size

24 of 35

Results

  • Reservoir sampling seems to work the best, except when memory per class is small

25 of 35

Results

  • When memory per class is small, writing strategies that keep a balanced number of samples per class work best

26 of 35

Hybrid Experience Replay Writing Methods

  • Use reservoir sampling while every class is well represented in memory
  • Fall back to ring-buffer-style protection when some class has only a few stored examples left (see the sketch below)
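
A self-contained sketch of one such hybrid; the `min_per_class` threshold is an assumption for illustration, not a value from the paper:

```python
import random
from collections import Counter

def hybrid_write(memory, capacity, example, label, n_seen, min_per_class=2):
    """Reservoir sampling by default, but skip an eviction that would
    leave the victim's class with too few stored examples."""
    if len(memory) < capacity:
        memory.append((example, label))
        return
    j = random.randrange(n_seen)
    if j < capacity:
        counts = Counter(lbl for _, lbl in memory)
        if counts[memory[j][1]] > min_per_class:
            memory[j] = (example, label)  # safe reservoir replacement
```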

27 of 35

Why does this work?

  • Experiment: in a two-task setup, after training on task 1, track accuracy on task 1 while training on task 2 data plus replay from the task 1 memory
  • Use Rotated MNIST to gauge the effect of task similarity

28 of 35

Why does this work?

  • The approach does memorize the replay examples (they are trained to zero loss)
  • The results imply that:
    • The task 1 memory somewhat improves performance compared to finetuning
    • Training on task 2 data acts as a regularizer, preventing overfitting to the tiny task 1 memory

29 of 35

QA

34 of 35

Conclusion

  • A strong yet very simple baseline for continual learning
  • Uses past data itself as the regularizer, instead of regularizing the network's behavior on past data
  • Basic empirical results and analysis

35 of 35

Future Directions?

  • Theoretical results
  • New sampling strategies
  • Data dependent memory sizes
  • Storing higher level representations