Expert Gate: Lifelong Learning with a Network of Experts
Paul Crouther, Remus Mocanu
Outline
Intro
Related work
Methods
Experiments
Questions
Introduction
Approaches to deal with catastrophic forgetting
Alternatives to previous approaches
What the Expert Gate paper suggests
Motivation for Expert Gate
Intro
In this paper we introduce a model of lifelong learning, based on a Network of Experts. New tasks / experts are learned and added to the model sequentially, building on what was learned before. To ensure scalability of this process, data from previous tasks cannot be stored and hence is not available when learning a new task. A critical issue in such a context, not addressed in the literature so far, is deciding which expert to deploy at test time. We introduce a set of gating autoencoders that learn a representation for the task at hand and, at test time, automatically forward the test sample to the relevant expert. This also brings memory efficiency, as only one expert network has to be loaded into memory at any given time. Further, the autoencoders inherently capture the relatedness of one task to another, based on which the most relevant prior model for training a new expert, with fine-tuning or learning-without-forgetting, can be selected. We evaluate our method on image classification and video prediction problems.
Related work
Multi-task learning
Multiple models for multiple tasks
Lifelong learning without catastrophic forgetting
Methods
Methods intro
Methods intro (continued)
The Autoencoder Gate
The network is composed of two parts: an encoder that maps the input to a lower-dimensional code, and a decoder that reconstructs the input from that code.
Undercomplete autoencoder
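A minimal sketch of an undercomplete autoencoder, assuming a single ReLU hidden layer; the bottleneck (4 units) is smaller than the input (10 dimensions), which forces a compressed representation. The sizes and one-layer shape are illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_code = 10, 4  # code dimension < input dimension: undercomplete

W_enc = rng.normal(scale=0.1, size=(d_code, d_in))
b_enc = np.zeros(d_code)
W_dec = rng.normal(scale=0.1, size=(d_in, d_code))
b_dec = np.zeros(d_in)

def encode(x):
    # Encoder: project the input down to the bottleneck code h(x)
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(h):
    # Decoder: map the code back to input space (linear reconstruction)
    return W_dec @ h + b_dec

x = rng.normal(size=d_in)
h = encode(x)
r = decode(h)
print(h.shape, r.shape)  # (4,) (10,)
```

Training would minimize the reconstruction error, e.g. the mean squared difference between `r` and `x`.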
Aside: Overcomplete autoencoder
Suggested reasoning:
Aside: Relationship to PCA with SVD
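To illustrate the PCA connection: a linear autoencoder with a k-dimensional code and squared-error loss attains its optimum on the subspace spanned by the top-k principal components (Eckart–Young). The check below, a sketch using numpy's SVD, verifies that projecting onto the top-k right singular vectors reconstructs the data at least as well as a random rank-k projection.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X -= X.mean(axis=0)                      # center the data, as PCA assumes

k = 3
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                           # top-k principal directions

X_pca = X @ V_k @ V_k.T                  # rank-k PCA reconstruction

Q, _ = np.linalg.qr(rng.normal(size=(10, k)))
X_rand = X @ Q @ Q.T                     # random rank-k orthogonal projection

err_pca = np.mean((X - X_pca) ** 2)
err_rand = np.mean((X - X_rand) ** 2)
print(err_pca <= err_rand)  # True: the PCA subspace is optimal
```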
Selecting the most relevant expert
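The selection step can be sketched as follows (not the authors' code): each task has its own gating autoencoder, and a test sample is routed to the expert whose autoencoder reconstructs it best, via a softmax over negative reconstruction errors. The temperature `t = 2` and the toy "autoencoders" below are assumptions for illustration.

```python
import numpy as np

def reconstruction_errors(x, autoencoders):
    """Mean squared reconstruction error of x under each gate autoencoder."""
    return np.array([np.mean((ae(x) - x) ** 2) for ae in autoencoders])

def select_expert(x, autoencoders, t=2.0):
    errs = reconstruction_errors(x, autoencoders)
    # Softmax over negative errors: lower error -> higher probability.
    logits = -errs / t
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(np.argmax(p)), p

# Toy demo: stand-in "autoencoders" of varying reconstruction fidelity.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
aes = [lambda v: v * 0.5, lambda v: v, lambda v: v * 0.9]
idx, probs = select_expert(x, aes)
print(idx)  # 1 -- the gate that reconstructs x exactly wins
```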
Measuring task relatedness
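One reading of the paper's relatedness measure, sketched below with assumed inputs: apply task a's autoencoder to held-out data of task a (error E_a) and to data of task b (error E_b), then set relatedness to 1 − (E_b − E_a) / E_a. Identical reconstruction quality gives relatedness 1; the worse task b reconstructs under task a's gate, the lower the value. The exact normalization is our interpretation, not a verbatim formula.

```python
def task_relatedness(err_a, err_b):
    # err_a: reconstruction error of task a's autoencoder on task a data
    # err_b: reconstruction error of the same autoencoder on task b data
    return 1.0 - (err_b - err_a) / err_a

print(task_relatedness(0.10, 0.10))  # 1.0  (same error: fully related)
print(task_relatedness(0.10, 0.15))  # ~0.5 (task b reconstructs worse)
```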
Exploiting task relatedness
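A hedged sketch of how relatedness can steer the training of a new expert: initialize from the most related prior expert, and prefer learning-without-forgetting (LwF) only when the tasks are sufficiently related, falling back to plain fine-tuning otherwise. The threshold value and function names here are assumptions for illustration, not values from the paper.

```python
def choose_training_strategy(relatedness_to_priors, threshold=0.85):
    # relatedness_to_priors: mapping from prior-task name to relatedness score
    best = max(relatedness_to_priors, key=relatedness_to_priors.get)
    # Highly related prior task -> LwF; otherwise fine-tune from it.
    method = "LwF" if relatedness_to_priors[best] > threshold else "fine-tune"
    return best, method

print(choose_training_strategy({"scenes": 0.9, "flowers": 0.6}))
# ('scenes', 'LwF')
```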
Algorithm
The architecture of our Expert Gate system
Experiments
Implementation Details
Baseline comparisons
Gate Analysis
Task Relatedness Analysis
Video Prediction
Conclusion
Questions
Here are some other possible differences:
NDP is critical of naively adding a new expert per task, raising two issues with that approach.
One thing they seem to do differently is share parameters between experts, using lateral connections to previous experts.
They then block the gradient flowing from the new expert into those earlier experts to prevent catastrophic forgetting.
Thank you
Any questions?