1 of 40

Expert Gate: Lifelong Learning with a Network of Experts

Paul Crouther, Remus Mocanu

2 of 40

Outline

Intro

Related work

Methods

Experiments

Questions

3 of 40

Introduction

4 of 40

Introduction

  • Learning a new task (e.g. scene classification based on a preexisting object recognition network trained on ImageNet) requires adapting the model to the new set of classes and fine-tuning it on the new data.
  • The newly trained network performs well on the new task, but its performance on the old ones degrades.
  • This is called catastrophic forgetting, and it is a major problem facing lifelong learning techniques, where new tasks and datasets are added sequentially.

5 of 40

Approaches to deal with catastrophic forgetting

  • Each time a new task arrives along with its own training data, new layers/neurons are added, if needed, and the model is re-trained on all the tasks. Such a solution has three main drawbacks.
    • First, there is a risk of negative inductive bias when the tasks are unrelated or simply adversarial.
    • Second, a shared model might fail to capture specialist information for particular tasks, as joint training encourages a hidden representation beneficial for all tasks.
    • Third, each time a new task is to be learned, the whole network needs to be re-trained.
  • However, the biggest constraint of joint training is that it requires keeping all the data from the previous tasks.

6 of 40

Alternatives to previous approaches

  • Without storing the data, one can consider strategies like using the previous model to generate virtual samples (i.e., using the soft outputs of the old model on new-task data as virtual labels) and using them in the retraining phase.
  • This works to some extent, but is unlikely to scale: repeating this scheme a number of times causes a bias towards the new tasks and an exponential buildup of errors on the older ones, as they show in their experiments.
  • Moreover, it also suffers from the same drawbacks as the joint training described above. Instead of having a network that is a jack of all trades and master of none, they stress the need for different specialist or expert models for different tasks.

7 of 40

What the Expert Gate paper suggests

  • They propose a Network of Experts: a new expert model is added whenever a new task arrives, and knowledge is transferred from previous models.
  • A set of gating autoencoders learns a representation for each task and, at test time, automatically forwards the test sample to the relevant expert.
  • The autoencoders also capture how related the tasks are, which is used to select the most relevant prior model (and transfer method) when training a new expert.

8 of 40

Motivation for Expert Gate

  • Instead of having a network that is a jack of all trades and master of none, they stress the need for different specialist or expert models for different tasks.
  • They build a “network of experts”, where a new expert model is added whenever a new task arrives and knowledge is transferred from previous models.
  • With an increasing number of task specializations, the number of expert models increases.

9 of 40

Intro

In this paper we introduce a model of lifelong learning, based on a Network of Experts. New tasks / experts are learned and added to the model sequentially, building on what was learned before. To ensure scalability of this process, data from previous tasks cannot be stored and hence is not available when learning a new task. A critical issue in such a context, not addressed in the literature so far, relates to deciding which expert to deploy at test time. We introduce a set of gating autoencoders that learn a representation for the task at hand and, at test time, automatically forward the test sample to the relevant expert. This also brings memory efficiency, as only one expert network has to be loaded into memory at any given time. Further, the autoencoders inherently capture the relatedness of one task to another, based on which the most relevant prior model for training a new expert, with fine-tuning or learning-without-forgetting, can be selected. We evaluate our method on image classification and video prediction problems.

10 of 40

Related work

11 of 40

Related work

Multi-task learning

  • R. Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.
  • T. M. Mitchell. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ. New Jersey, 1980.

Multiple models for multiple tasks

  • R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.

Lifelong learning without catastrophic forgetting

  • I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.

12 of 40

Methods

13 of 40

Methods intro

  • Consider the case of lifelong learning or sequential learning where tasks and their corresponding data come one after another.
  • For each task, they learn a specialized model (expert) by transferring knowledge from previous tasks – in particular, they build on the most related previous task.
  • Simultaneously they learn a gating function that captures the characteristics of each task.
  • This gate forwards the test data to the corresponding expert resulting in a high performance over all learned tasks.

14 of 40

Methods intro (continued)

  • How to learn such a gating function to differentiate between tasks, without having access to the training data of previous tasks?
  • They learn a low dimensional subspace for each task/domain. At test time they then select the representation (subspace) that best fits the test sample.
  • They do this using an undercomplete autoencoder per task.

15 of 40

The Autoencoder Gate

The network is composed of two parts:

  • an encoder h = f(x), which maps the input x to a code h
  • a decoder r = g(h), which maps the code to a reconstruction r of the input.
  • The loss function L(x, g(f(x))) is simply the reconstruction error.
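As a minimal sketch of these two parts and the loss, the following uses illustrative dimensions, random untrained weights, and a sigmoid nonlinearity (none of which are the paper's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: d-dimensional input, k-dimensional code (k < d,
# i.e. an undercomplete autoencoder).
d, k = 8, 3
W_enc = rng.standard_normal((k, d)) * 0.1  # encoder weights (untrained)
W_dec = rng.standard_normal((d, k)) * 0.1  # decoder weights (untrained)

def encode(x):
    # h = f(x): map the input to a lower-dimensional code.
    return 1.0 / (1.0 + np.exp(-(W_enc @ x)))  # sigmoid nonlinearity

def decode(h):
    # r = g(h): map the code back to a reconstruction of the input.
    return W_dec @ h

def reconstruction_error(x):
    # L(x, g(f(x))): squared Euclidean distance between input and reconstruction.
    r = decode(encode(x))
    return float(np.sum((x - r) ** 2))

x = rng.standard_normal(d)
err = reconstruction_error(x)
```

Training would minimize this reconstruction error over the task's data; the error itself is what the gate later uses.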

16 of 40

The Autoencoder Gate

  • The encoder learns, through a hidden layer, either a lower-dimensional representation of the input data (undercomplete autoencoder) or a higher-dimensional one (overcomplete autoencoder); in the latter case, regularization criteria prevent the autoencoder from simply copying its input.
  • A linear autoencoder with a Euclidean loss function learns the same subspace as PCA. However, autoencoders with non-linear activation functions yield better dimensionality reduction than PCA.

17 of 40

Undercomplete autoencoder

  • The encoder learns, through a hidden layer, a lower dimensional representation (undercomplete autoencoder)

Autoencoder lecture

18 of 40

Aside: Overcomplete autoencoder

  • The encoder can instead learn a higher-dimensional representation (overcomplete autoencoder) of the input data, guided by regularization criteria that prevent the autoencoder from copying its input.

Autoencoder lecture

20 of 40

Aside: Overcomplete autoencoder

  • The encoder can instead learn a higher-dimensional representation (overcomplete autoencoder) of the input data, guided by regularization criteria that prevent the autoencoder from copying its input.

Autoencoder lecture

Suggested reasoning:

  • G. Alain and Y. Bengio (“What regularized auto-encoders learn from the data-generating distribution”) state that in regularized (overcomplete) autoencoders, the opposing forces between the risk term and the regularization term result in score-like behavior for the reconstruction error.
  • As a result, a zero reconstruction loss means a zero derivative, which could correspond to either a local minimum or a local maximum of the data distribution.

21 of 40

Aside: Relationship to PCA with SVD

  • A linear autoencoder with a Euclidean loss function learns the same subspace as PCA. However, autoencoders with non-linear activation functions yield better dimensionality reduction than PCA.
  • See the Autoencoder lecture for a proof.

Autoencoder lecture

22 of 40

Selecting the most relevant expert

  • At test time, after learning the autoencoders for the different tasks, they add a softmax layer that takes as input the reconstruction errors eri from the different task autoencoders given a test sample x.
  • The reconstruction error eri of the i-th autoencoder is the output of the loss function given the input sample x.
  • The softmax layer gives a probability pi for each task autoencoder, indicating its confidence that the sample belongs to that task.
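This gating softmax can be sketched as follows. The paper uses a temperature-scaled softmax over negated reconstruction errors, so the lowest error gets the highest probability; the temperature value used here is illustrative:

```python
import numpy as np

def gate_probabilities(errors, t=2.0):
    # Softmax over negated reconstruction errors: a lower error er_i
    # yields a higher confidence p_i. The temperature t smooths the
    # distribution (its value here is an illustrative assumption).
    z = -np.asarray(errors, dtype=float) / t
    z -= z.max()              # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Three hypothetical task autoencoders; the second fits the sample best.
p = gate_probabilities([0.9, 0.2, 1.5])
```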

23 of 40

Measuring task relatedness

  • Given a new task Tk associated with its data Dk, they first learn an autoencoder Ak for this task.
  • Let Ta be a previous task with associated autoencoder Aa.
  • They measure the task relatedness between task Tk and task Ta.
  • Since they do not have access to the data of task Ta, they use the validation data from the current task Tk.
  • Compute the average reconstruction error Erk made by the current task autoencoder Ak on the current task data.
  • Compute the average reconstruction error Era made by the previous task autoencoder Aa on the current task data.
  • Relatedness is then given by: Rel(Tk, Ta) = 1 − (Era − Erk) / Erk
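A minimal sketch of this relatedness measure, assuming the formula Rel(Tk, Ta) = 1 − (Era − Erk) / Erk from the paper:

```python
def task_relatedness(er_k, er_a):
    # er_k: average reconstruction error of the new task's autoencoder A_k,
    # er_a: average reconstruction error of a previous autoencoder A_a,
    # both measured on the current task's validation data.
    return 1.0 - (er_a - er_k) / er_k

# Identical average errors give the maximal relatedness of 1; a previous
# autoencoder that reconstructs the new data poorly scores lower.
r_same = task_relatedness(0.5, 0.5)
r_far = task_relatedness(0.5, 1.0)
```

Note the asymmetry: relatedness is normalized by the current task's error Erk, so Rel(Tk, Ta) need not equal Rel(Ta, Tk).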

24 of 40

Exploiting task relatedness

  • First, they use relatedness to select the most related task as the prior model for learning the new task.
  • Second, they exploit the level of task relatedness to decide which transfer method to use: fine-tuning or learning-without-forgetting (LwF).

25 of 40

Algorithm

  • They utilize the asymmetry of relatedness to select between LwF and fine-tuning.
  • They found that LwF only outperforms fine-tuning when the two tasks are sufficiently related.
  • When this is not the case, forcing the new model to give similar outputs for the old task may actually hurt performance.
  • Fine-tuning, on the other hand, only uses the previous task's parameters as a starting point and is less sensitive to the level of task relatedness.
  • They apply a threshold on the task-relatedness value to decide when to use LwF and when to fine-tune.
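The selection rule above amounts to a simple threshold test; the threshold value used here is a hypothetical placeholder (the paper picks it empirically):

```python
def choose_transfer_method(relatedness, threshold=0.85):
    # Above the threshold the tasks are considered sufficiently related
    # for LwF; otherwise fall back to plain fine-tuning.
    # The value 0.85 is a hypothetical placeholder, not the paper's setting.
    return "LwF" if relatedness > threshold else "fine-tuning"
```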

26 of 40

The architecture of the Expert Gate system

  • The Expert Gate system is a task recognizer that uses an undercomplete one-layer autoencoder as a gating mechanism.
  • For each new task or domain, they learn a gating function that captures the shared characteristics among the training samples and can recognize similar samples at test time.
  • Each autoencoder is trained along with the corresponding expert model and maps the training data to its own lower-dimensional subspace.
  • At test time, each task autoencoder projects the sample onto its learned subspace and measures the reconstruction error due to the projection.
  • The autoencoder with the lowest reconstruction error acts like a switch, selecting the corresponding expert model.
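The test-time switch can be sketched as below, assuming each autoencoder is a callable that returns its reconstruction error for a sample (the stand-in lambdas and expert names are purely illustrative):

```python
def select_expert(x, autoencoders, experts):
    # Each task autoencoder projects the sample onto its subspace and
    # returns the reconstruction error; the lowest error selects the
    # corresponding expert model, so only that expert needs loading.
    errors = [ae(x) for ae in autoencoders]
    best = min(range(len(errors)), key=errors.__getitem__)
    return experts[best]

# Toy usage: three stand-in "autoencoders" with fixed errors.
expert = select_expert(
    x=None,
    autoencoders=[lambda _: 0.8, lambda _: 0.1, lambda _: 0.5],
    experts=["scenes", "birds", "flowers"],
)
```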

27 of 40

Experiments

28 of 40

Experiments

  • Comparison with Certain Baselines
  • Gate Analysis
  • Task Relatedness Analysis
  • Video Prediction

29 of 40

Implementation Details

  • As a preprocessing step, image representations are extracted with an AlexNet pre-trained on ImageNet.
  • They experimented to find the optimal number of neurons in the single hidden layer (10, 50, 100, 200, 500).

30 of 40

Baseline comparisons

  • Three datasets:
    • MIT Scenes
    • Caltech-UCSD Birds
    • Oxford Flowers

  • LwF Model

31 of 40

Gate Analysis

  • Three more datasets:
    • Stanford Cars
    • FGVC-Aircraft
    • VOC Actions

32 of 40

Gate Analysis

  • Discriminative Classifier:
    • All previous data is stored, 1 hidden layer of 100 neurons

33 of 40

Task Relatedness Analysis

  • Using following datasets:
    • Google Street View House Numbers (SVHN)
    • Chars 74k (English subset, only letters)
    • MNIST (Handwritten digits)
    • 2 most related to each other: Scenes and Actions
    • 2 least related to each other: Cars and Flowers

34 of 40

Video Prediction

  • Task of predicting future frames based on previous ones
  • Use Dynamic Filter Network (DFN)
  • 3 Tasks/Domains:
    • DFN Dataset (Highway)
    • KITTI Dataset subset (Residential)
    • CityScapes/Stuttgart Sequence (City)

35 of 40

Conclusion

  • Previous methods exploit knowledge from earlier tasks to perform well on newer tasks.
  • Expert Gate addresses which model to choose at test time, without storing old data.
  • Potential future work

36 of 40

Questions

37 of 40

Questions


38 of 40

Questions

39 of 40

Questions

Here are some other possible differences:

NDP is critical of naively adding a new expert, and raises two issues:

  • The number of parameters grows unnecessarily large as experts redundantly learn features.
  • There is no positive transfer of knowledge between experts.

One thing they seem to do differently is sharing parameters between experts via lateral connections to previous experts.

They then also block the gradient from the new expert to prevent catastrophic forgetting.

40 of 40

Thank you

Any questions?