1 of 40

Expert Gate: Lifelong Learning with a Network of Experts

Paul Crouther, Remus Mocanu

2 of 40

Outline

Intro

Related work

Methods

Experiments

Questions

3 of 40

Introduction

4 of 40

Introduction

  • Learning a new task (e.g. scene classification based on a preexisting object recognition network trained on ImageNet) requires adapting the model to the new set of classes and fine-tuning it on the new data.
  • The newly trained network performs well on the new task, but its performance on the old ones degrades.
  • This is called catastrophic forgetting, and it is a major problem facing lifelong learning techniques, where new tasks and datasets are added sequentially.

5 of 40

Approaches to deal with catastrophic forgetting

  • Each time a new task arrives along with its own training data, new layers/neurons are added, if needed, and the model is re-trained on all the tasks. Such a solution has three main drawbacks.
    • First, there is a risk of negative inductive bias when the tasks are unrelated or simply adversarial.
    • Second, a shared model might fail to capture specialist information for particular tasks, as joint training encourages a hidden representation beneficial for all tasks.
    • Third, each time a new task is to be learned, the whole network needs to be re-trained.
  • However, the biggest constraint of joint training is that it requires keeping all the data from the previous tasks.

6 of 40

Alternatives to previous approaches

  • Without storing the data, one can consider strategies like using the previous model to generate virtual samples (i.e., using the soft outputs of the old model on new-task data as virtual labels) and using them in the retraining phase.
  • This works to some extent, but is unlikely to scale: repeating this scheme a number of times causes a bias towards the new tasks and an exponential buildup of errors on the older ones, as they show in their experiments.
  • Moreover, it also suffers from the same drawbacks as the joint training described above. Instead of having a network that is a jack of all trades and master of none, they stress the need for different specialist or expert models for different tasks.

7 of 40

What the Expert Gate paper suggests

  • They propose a Network of Experts: a new expert model is added whenever a new task arrives, and knowledge is transferred from previous models.
  • A set of gating autoencoders learns a representation for each task and, at test time, automatically forwards the test sample to the relevant expert.
  • The autoencoders also capture how related the tasks are, which is used to select the most relevant prior model (and transfer method) when training a new expert.

8 of 40

Motivation for Expert Gate

  • Instead of having a network that is a jack of all trades and master of none, they stress the need for different specialist or expert models for different tasks.
  • They build a “network of experts”, where a new expert model is added whenever a new task arrives and knowledge is transferred from previous models.
  • With an increasing number of task specializations, the number of expert models increases.

9 of 40

Intro

In this paper we introduce a model of lifelong learning, based on a Network of Experts. New tasks / experts are learned and added to the model sequentially, building on what was learned before. To ensure scalability of this process, data from previous tasks cannot be stored and hence is not available when learning a new task. A critical issue in such a context, not addressed in the literature so far, relates to deciding which expert to deploy at test time. We introduce a set of gating autoencoders that learn a representation for the task at hand and, at test time, automatically forward the test sample to the relevant expert. This also brings memory efficiency, as only one expert network has to be loaded into memory at any given time. Further, the autoencoders inherently capture the relatedness of one task to another, based on which the most relevant prior model for training a new expert, with fine-tuning or learning-without-forgetting, can be selected. We evaluate our method on image classification and video prediction problems.

10 of 40

Related work

11 of 40

Related work

Multi-task learning

  • R. Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.
  • T. M. Mitchell. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ. New Jersey, 1980.

Multiple models for multiple tasks

  • R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.

Lifelong learning without catastrophic forgetting

  • I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.

12 of 40

Methods

13 of 40

Methods intro

  • Consider the case of lifelong learning or sequential learning where tasks and their corresponding data come one after another.
  • For each task, they learn a specialized model (expert) by transferring knowledge from previous tasks – in particular, they build on the most related previous task.
  • Simultaneously they learn a gating function that captures the characteristics of each task.
  • This gate forwards the test data to the corresponding expert resulting in a high performance over all learned tasks.

14 of 40

Methods intro (continued)

  • How to learn such a gating function to differentiate between tasks, without having access to the training data of previous tasks?
  • They learn a low dimensional subspace for each task/domain. At test time they then select the representation (subspace) that best fits the test sample.
  • They do this using an undercomplete autoencoder per task.

15 of 40

The Autoencoder Gate

The network is composed of two parts:

  • an encoder h = f(x), which maps the input x to a code h
  • a decoder r = g(h), which maps the code to a reconstruction r of the input.
  • The loss function L(x, g(f(x))) is simply the reconstruction error.
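As a minimal sketch of these two parts and the loss, the following uses illustrative dimensions, random untrained weights, and a sigmoid nonlinearity (none of which are the paper's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: d-dimensional input, k-dimensional code (k < d,
# i.e. an undercomplete autoencoder).
d, k = 8, 3
W_enc = rng.standard_normal((k, d)) * 0.1  # encoder weights (untrained)
W_dec = rng.standard_normal((d, k)) * 0.1  # decoder weights (untrained)

def encode(x):
    # h = f(x): map the input to a lower-dimensional code.
    return 1.0 / (1.0 + np.exp(-(W_enc @ x)))  # sigmoid nonlinearity

def decode(h):
    # r = g(h): map the code back to a reconstruction of the input.
    return W_dec @ h

def reconstruction_error(x):
    # L(x, g(f(x))): squared Euclidean distance between input and reconstruction.
    r = decode(encode(x))
    return float(np.sum((x - r) ** 2))

x = rng.standard_normal(d)
err = reconstruction_error(x)
```

Training would minimize this reconstruction error over the task's data; the error itself is what the gate later uses.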

16 of 40

The Autoencoder Gate

  • The encoder learns, through a hidden layer, either a lower-dimensional representation of the input data (undercomplete autoencoder) or a higher-dimensional one (overcomplete autoencoder); in the latter case, regularization criteria prevent the autoencoder from simply copying its input.
  • A linear autoencoder with a Euclidean loss function learns the same subspace as PCA. However, autoencoders with non-linear activation functions yield better dimensionality reduction than PCA.

17 of 40

Undercomplete autoencoder

  • The encoder learns, through a hidden layer, a lower dimensional representation (undercomplete autoencoder)

Autoencoder lecture

18 of 40

Aside: Overcomplete autoencoder

  • The encoder can instead learn a higher-dimensional representation (overcomplete autoencoder) of the input data, guided by regularization criteria that prevent the autoencoder from copying its input.

Autoencoder lecture

20 of 40

Aside: Overcomplete autoencoder

  • The encoder can instead learn a higher-dimensional representation (overcomplete autoencoder) of the input data, guided by regularization criteria that prevent the autoencoder from copying its input.

Autoencoder lecture

Suggested reasoning:

  • G. Alain and Y. Bengio (“What regularized auto-encoders learn from the data-generating distribution”) state that in regularized (overcomplete) autoencoders, the opposing forces between the risk term and the regularization term result in score-like behavior for the reconstruction error.
  • As a result, a zero reconstruction loss means a zero derivative, which could correspond to either a local minimum or a local maximum of the data distribution.

21 of 40

Aside: Relationship to PCA with SVD

  • A linear autoencoder with a Euclidean loss function learns the same subspace as PCA. However, autoencoders with non-linear activation functions yield better dimensionality reduction than PCA.
  • See the Autoencoder lecture for a proof.

Autoencoder lecture

22 of 40

Selecting the most relevant expert

  • At test time, after learning the autoencoders for the different tasks, they add a softmax layer that takes as input the reconstruction errors eri from the different task autoencoders given a test sample x.
  • The reconstruction error eri of the i-th autoencoder is the output of the loss function given the input sample x.
  • The softmax layer gives a probability pi for each task autoencoder, indicating its confidence that the sample belongs to that task.
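This gating softmax can be sketched as follows. The paper uses a temperature-scaled softmax over negated reconstruction errors, so the lowest error gets the highest probability; the temperature value used here is illustrative:

```python
import numpy as np

def gate_probabilities(errors, t=2.0):
    # Softmax over negated reconstruction errors: a lower error er_i
    # yields a higher confidence p_i. The temperature t smooths the
    # distribution (its value here is an illustrative assumption).
    z = -np.asarray(errors, dtype=float) / t
    z -= z.max()              # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Three hypothetical task autoencoders; the second fits the sample best.
p = gate_probabilities([0.9, 0.2, 1.5])
```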

23 of 40

Measuring task relatedness

  • Given a new task Tk associated with its data Dk, they first learn an autoencoder Ak for this task.
  • Let Ta be a previous task with associated autoencoder Aa.
  • They measure the task relatedness between task Tk and task Ta.
  • Since they do not have access to the data of task Ta, they use the validation data from the current task Tk.
  • Compute the average reconstruction error Erk made by the current task autoencoder Ak on the current task data.
  • Compute the average reconstruction error Era made by the previous task autoencoder Aa on the current task data.
  • Relatedness is then given by: Rel(Tk, Ta) = 1 − (Era − Erk) / Erk
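A minimal sketch of this relatedness measure, assuming the formula Rel(Tk, Ta) = 1 − (Era − Erk) / Erk from the paper:

```python
def task_relatedness(er_k, er_a):
    # er_k: average reconstruction error of the new task's autoencoder A_k,
    # er_a: average reconstruction error of a previous autoencoder A_a,
    # both measured on the current task's validation data.
    return 1.0 - (er_a - er_k) / er_k

# Identical average errors give the maximal relatedness of 1; a previous
# autoencoder that reconstructs the new data poorly scores lower.
r_same = task_relatedness(0.5, 0.5)
r_far = task_relatedness(0.5, 1.0)
```

Note the asymmetry: relatedness is normalized by the current task's error Erk, so Rel(Tk, Ta) need not equal Rel(Ta, Tk).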

24 of 40

Exploiting task relatedness

  • First, they use relatedness to select the most related task as the prior model for learning the new task.
  • Second, they exploit the level of task relatedness to decide which transfer method to use: fine-tuning or learning-without-forgetting (LwF).

25 of 40

Algorithm

  • They utilize the asymmetry of relatedness to select between LwF and fine-tuning.
  • They found that LwF only outperforms fine-tuning when the two tasks are sufficiently related.
  • When this is not the case, forcing the new model to give similar outputs for the old task may actually hurt performance.
  • Fine-tuning, on the other hand, only uses the previous task's parameters as a starting point and is less sensitive to the level of task relatedness.
  • They apply a threshold on the task-relatedness value to decide when to use LwF and when to fine-tune.
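The selection rule above amounts to a simple threshold test; the threshold value used here is a hypothetical placeholder (the paper picks it empirically):

```python
def choose_transfer_method(relatedness, threshold=0.85):
    # Above the threshold the tasks are considered sufficiently related
    # for LwF; otherwise fall back to plain fine-tuning.
    # The value 0.85 is a hypothetical placeholder, not the paper's setting.
    return "LwF" if relatedness > threshold else "fine-tuning"
```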

26 of 40

The architecture of the Expert Gate system

  • The Expert Gate system is a task recognizer that uses an undercomplete one-layer autoencoder as a gating mechanism.
  • For each new task or domain, they learn a gating function that captures the shared characteristics among the training samples and can recognize similar samples at test time.
  • Each autoencoder is trained along with the corresponding expert model and maps the training data to its own lower-dimensional subspace.
  • At test time, each task autoencoder projects the sample onto its learned subspace and measures the reconstruction error due to the projection.
  • The autoencoder with the lowest reconstruction error acts like a switch, selecting the corresponding expert model.
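The test-time switch can be sketched as below, assuming each autoencoder is a callable that returns its reconstruction error for a sample (the stand-in lambdas and expert names are purely illustrative):

```python
def select_expert(x, autoencoders, experts):
    # Each task autoencoder projects the sample onto its subspace and
    # returns the reconstruction error; the lowest error selects the
    # corresponding expert model, so only that expert needs loading.
    errors = [ae(x) for ae in autoencoders]
    best = min(range(len(errors)), key=errors.__getitem__)
    return experts[best]

# Toy usage: three stand-in "autoencoders" with fixed errors.
expert = select_expert(
    x=None,
    autoencoders=[lambda _: 0.8, lambda _: 0.1, lambda _: 0.5],
    experts=["scenes", "birds", "flowers"],
)
```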

27 of 40

Experiments

28 of 40

Experiments

  • Comparison with Certain Baselines
  • Gate Analysis
  • Task Relatedness Analysis
  • Video Prediction

29 of 40

Implementation Details

  • As a preprocessing step, image representations are extracted with an AlexNet pre-trained on ImageNet.
  • They experimented to find the optimal number of neurons in the single hidden layer (10, 50, 100, 200, 500).

30 of 40

Baseline comparisons

  • Three datasets:
    • MIT Scenes
    • Caltech-UCSD Birds
    • Oxford Flowers

  • LwF Model

31 of 40

Gate Analysis

  • Three more datasets:
    • Stanford Cars
    • FGVC-Aircraft
    • VOC Actions

32 of 40

Gate Analysis

  • Discriminative Classifier:
    • All previous data is stored, 1 hidden layer of 100 neurons

33 of 40

Task Relatedness Analysis

  • Using following datasets:
    • Google Street View House Numbers (SVHN)
    • Chars 74k (English subset, only letters)
    • MNIST (Handwritten digits)
    • 2 most related to each other: Scenes and Actions
    • 2 least related to each other: Cars and Flowers

34 of 40

Video Prediction

  • Task of predicting future frames based on previous ones
  • Use Dynamic Filter Network (DFN)
  • 3 Tasks/Domains:
    • DFN Dataset (Highway)
    • KITTI Dataset subset (Residential)
    • CityScapes/Stuttgart Sequence (City)

35 of 40

Conclusion

  • Previous methods exploit knowledge from earlier tasks to perform well on newer tasks.
  • Expert Gate addresses which model to choose at test time, without storing old data.
  • Potential future work

36 of 40

Questions

37 of 40

Questions


38 of 40

Questions

39 of 40

Questions

Here are some other possible differences:

NDP is critical of naively adding a new expert, and raises two issues:

  • The number of parameters grows unnecessarily large as experts redundantly learn features.
  • There is no positive transfer of knowledge between experts.

One thing they seem to do differently is sharing parameters between experts via lateral connections to previous experts.

They then also block the gradient from the new expert to prevent catastrophic forgetting.

40 of 40

Thank you

Any questions?