1 of 35

Heterogeneous Continual Learning

Presented by: Lucas Wu

2 of 35

Def. Continual Learning

3 of 35

Def. Continual Learning

  • Forward transfer:
    • Learning previous tasks should help later ones

  • Backward transfer:
    • Learning new tasks should help the previous ones
  • Prevent catastrophic forgetting:
    • Learning new tasks should not forget previous ones
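These notions are commonly quantified with an accuracy matrix, as in GEM (Lopez-Paz & Ranzato, 2017). A sketch, where \(R_{i,j}\) denotes test accuracy on task \(j\) after training up to task \(i\), and \(\bar{b}_i\) is the accuracy of a randomly initialized model on task \(i\):

```latex
\mathrm{BWT} = \frac{1}{T-1}\sum_{i=1}^{T-1}\bigl(R_{T,i} - R_{i,i}\bigr),
\qquad
\mathrm{FWT} = \frac{1}{T-1}\sum_{i=2}^{T}\bigl(R_{i-1,i} - \bar{b}_i\bigr)
```

Negative BWT corresponds to catastrophic forgetting; positive FWT means earlier tasks helped later ones.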

4 of 35

Current Approach-1

Replay method
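A minimal sketch of what a replay method maintains: a small buffer of past examples (here filled by reservoir sampling, one common choice) that gets mixed into each new task's mini-batches. The class and method names are illustrative, not from the paper:

```python
import random

class ReplayBuffer:
    """Minimal rehearsal-buffer sketch (reservoir sampling).

    Replay methods store a small subset of past examples and
    interleave them with new-task data to reduce forgetting.
    """
    def __init__(self, capacity=200):
        self.capacity = capacity
        self.data = []   # stored (x, y) pairs
        self.seen = 0    # total examples observed so far

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            # keep each seen example with probability capacity/seen
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, batch_size):
        k = min(batch_size, len(self.data))
        return random.sample(self.data, k)
```

Reservoir sampling keeps the buffer an (approximately) uniform sample of the whole stream without storing it.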

5 of 35

Current Approach-2

  • Regularization-based methods:

Regularize the model change

(Reduce the weight variation)
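An EWC-style penalty is one concrete instance of this idea: weights that mattered for previous tasks (high Fisher information) are anchored to their old values. A hedged numpy sketch; `lam` and the diagonal-Fisher form are conventional choices, not specifics from the slides:

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher, lam=1.0):
    """Quadratic regularizer in the style of Elastic Weight Consolidation:

        total loss = task_loss + (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2

    fisher holds per-parameter importance estimates; moving an important
    weight away from its previous value is penalized more heavily.
    """
    return 0.5 * lam * np.sum(fisher * (theta - theta_old) ** 2)
```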

6 of 35

Current Approach-3

Parameter-isolated methods
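In the spirit of PackNet/HAT-style isolation: each task owns a binary mask over shared weights and only its own subnetwork is active at inference. A hypothetical minimal sketch (the setup and names are mine, not the paper's):

```python
import numpy as np

def task_forward(x, weights, masks, task_id):
    """Parameter-isolation sketch: each task uses only the subset of
    the shared weight matrix selected by its own binary mask, so
    updates for one task cannot overwrite another task's parameters.
    """
    w = weights * masks[task_id]  # zero out other tasks' parameters
    return x @ w
```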

7 of 35

Same architecture

Can be any CL method

8 of 35

Motivation from real world examples

  • Autonomous driving

  • Clinical applications

  • Recommendation systems

  1. Can’t store original data (privacy / not enough space)
  2. Need to upgrade to a new architecture

9 of 35

Motivation from real world examples

  1. Can’t store original data (privacy / not enough space)
  2. Need to upgrade to a new architecture

10 of 35

Motivation from real world examples

  1. Can’t store original data (privacy / not enough space)
  2. Different architecture

13 of 35

Current Approach-1

Replay method

  1. Can’t store original data (privacy / not enough space)
  2. Different architecture

14 of 35

Current Approach-2

  • Regularization-based methods:

Regularize the model change

(Reduce the weight variation)

  1. Can’t store original data (privacy / not enough space)
  2. Different architecture

15 of 35

Current Approach-3

  1. Can’t store original data (privacy / not enough space)
  2. Different architecture

Parameter-isolated methods

16 of 35

Questions?

17 of 35

Sketch of the solutions

  • Inspired by knowledge distillation: transfer knowledge from the weak (old) model to the strong (new) model

19 of 35

Sketch of the solutions

probability distribution

Soft CE: cross-entropy loss (difference between predicted and true labels)

KL Divergence: a method for comparing predicted probability distributions
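A minimal numpy sketch of the distillation loss these two terms feed into: KL divergence between temperature-softened teacher and student predictions, following Hinton et al. (2015). The function names and the default temperature are illustrative:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Higher T flattens both distributions, exposing the teacher's
    'dark knowledge' about relative class similarities; the T^2
    factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, T)  # teacher distribution
    q = softmax(student_logits, T)  # student distribution
    return (T ** 2) * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()
```

When student and teacher agree exactly, the loss is zero; any disagreement makes it positive.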

20 of 35

Sketch of the solutions

probability distribution

Soft CE: cross-entropy loss (difference between predicted and true labels)

KL Divergence: a method for comparing predicted probability distributions

Augmentation:

  • Label smoothing
  • Temperature Scaling

Keshigeyan Chandrasegaran, Ngoc-Trung Tran, Yunqing Zhao, and Ngai-Man Cheung. Revisiting label smoothing and knowledge distillation compatibility: What was missing? In Proceedings of the International Conference on Machine Learning (ICML), 2022. 2, 4
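The label-smoothing augmentation above can be sketched in a few lines. This is one standard formulation (the `eps/(K-1)` variant; another common variant spreads `eps/K` over all classes), not necessarily the exact one used in the paper:

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    """Label smoothing: replace one-hot targets with
    (1 - eps) on the true class and eps/(K-1) on each other class,
    so the target is a softened probability distribution.
    """
    onehot = np.eye(num_classes)[y]
    return onehot * (1.0 - eps) + (1.0 - onehot) * eps / (num_classes - 1)
```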

21 of 35

Sketch of the solutions

Hyperparameter for label smoothing

Objective:

22 of 35

Sketch of the solutions (w/ buffer)

Hyperparameter for label smoothing

Objective:

knowledge distillation Loss

Buffer size = 200, same as the replay setting
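My reading of the with-buffer objective in symbols (a sketch, not the paper's exact equation; the weight \(\lambda\) and loss names are assumptions):

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{CE}}\bigl(f_{\text{new}}(x_t),\, y_t\bigr)
\;+\; \lambda\, \mathcal{L}_{\text{KD}}\bigl(f_{\text{new}}(x_b),\, f_{\text{old}}(x_b)\bigr),
\qquad x_b \sim \text{buffer}
```

The new architecture is trained on current-task data with cross-entropy, while the distillation term on buffered samples transfers the old model's behavior across the architecture change.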

23 of 35

How to extract features (w/o buffer)

Objective:

24 of 35

How to extract features

Objective:

10% of samples, randomly selected from previous tasks

Encourages spatial continuity in the generated images, avoiding excessive noise and unnatural patterns.

0.5K iterations

25 of 35

How to extract features

Objective:

10% of samples, randomly selected from previous tasks

Used to encourage pixel-level similarity between the generated image and the original image.

0.5K iterations

26 of 35

How to extract features

Objective:

10% of samples, randomly selected from previous tasks

Encourages the generated image to be similar to the target image in feature space.

0.5K iterations
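The three loss terms from the preceding slides can be sketched as follows. This is a numpy illustration of the loss *values* only (the actual method optimizes the image by gradient descent on this objective for ~0.5K iterations); the weighting coefficients are hypothetical:

```python
import numpy as np

def tv_loss(img):
    """Total-variation prior: penalizes neighboring-pixel differences,
    encouraging spatial continuity in the synthesized image."""
    dh = np.abs(np.diff(img, axis=0)).sum()
    dw = np.abs(np.diff(img, axis=1)).sum()
    return dh + dw

def l2_loss(img, target):
    """Pixel-level similarity between generated and reference image."""
    return np.mean((img - target) ** 2)

def feature_loss(feat, feat_target):
    """Similarity in the feature space of the frozen old model."""
    return np.mean((feat - feat_target) ** 2)

def synthesis_objective(img, target, feat, feat_target,
                        a_tv=1e-4, a_l2=1.0, a_feat=1.0):
    # weighted sum of the three terms; the synthesized image is the
    # variable being optimized, everything else is fixed
    return (a_tv * tv_loss(img)
            + a_l2 * l2_loss(img, target)
            + a_feat * feature_loss(feat, feat_target))
```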

27 of 35

How to extract features

Objective:

Refers to DeepInversion

28 of 35

Deep Inversion (DI) vs. Quick Deep Inversion (QDI)

(Generated “Dog” samples from DI and QDI)

29 of 35

DeepInversion vs. Quick DeepInversion

30 of 35

Experiment setting

Evaluation metrics:

  • Average accuracy
  • Average forgetting
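These two metrics are standard in continual learning and can be computed from the accuracy matrix. A sketch, assuming `R[i, j]` is accuracy on task `j` after training on task `i` (the exact normalization used in the paper may differ):

```python
import numpy as np

def average_accuracy(R):
    """Mean accuracy over all tasks after training on the final task,
    i.e. the mean of the last row of the accuracy matrix."""
    return float(R[-1].mean())

def average_forgetting(R):
    """For each earlier task j, forgetting is the gap between the best
    accuracy ever reached on j and the accuracy on j after the final
    task; average over all earlier tasks."""
    T = R.shape[0]
    f = [R[:T - 1, j].max() - R[T - 1, j] for j in range(T - 1)]
    return float(np.mean(f))
```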

31 of 35

Task-incremental continual learning.

32 of 35

Class-incremental continual learning.

33 of 35

Conclusion

Best performance in:

  • Task-incremental continual learning
  • Class-incremental continual learning

Ablation study:

  • Experimental results show that combining knowledge distillation with label smoothing on augmented images significantly improves performance

34 of 35

Limitation

  • It cannot be applied to unsupervised CL with heterogeneous architectures

  • Did not adjust the training configuration or hyperparameters of each model for a fair comparison

35 of 35

Questions?