1 of 23

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

By Chelsea Finn, Pieter Abbeel, and Sergey Levine, ICML 2017

Presented by: Shihao Ma, Yichun Zhang, and Zilun Zhang

CSC 2541 Winter 2021

April 1st, 2021

2 of 23

Agenda

  • Background
    • Meta-learning
    • FSL (Few-Shot Learning)
  • MAML Algorithm
    • Generic Definition
    • Algorithm Explanation
    • Case Study 1: Classification
    • Case Study 2: Regression (*)
    • Case Study 3: Reinforcement Learning (*)
  • Related Works
    • Meta-learning
    • FSL
    • Differences with MAML
  • Problems of MAML
    • Computationally Expensive
    • Gradient Instability
    • Shared Batch Normalization
    • Shared Inner-Loop LR
  • Improved MAML
    • MAML++ (with Omniglot)
    • Reptile
  • Summary
    • Pros & Cons
    • Difference between Pre-train + Fine-tune and Meta-learning + Adaptation

3 of 23

Background: Meta Learning

  • A "learning to learn" framework

  • Training stage: learn some prior knowledge (a metric, a good initialization, a data distribution, etc.)

  • Testing stage: apply the prior knowledge to the meta-test set for fast adaptation

4 of 23

Background: Meta Learning

5 of 23

Background: Few-Shot Learning

  • Goal: train a model that performs well when only a few samples are given.

  • The minimum unit during training is an ‘episode’; each episode contains a Support Set and a Query Set.

    • The Support Set has N ∙ K samples and the Query Set has N ∙ Q samples; such an FSL task is called an “N-way K-shot” task.
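As an illustration, episode construction can be sketched in a few lines (the `sample_episode` helper and toy dataset are hypothetical, not from the paper; the dataset is a simple class-to-samples mapping):

```python
import random

def sample_episode(dataset, n_way, k_shot, q_query, rng=random):
    """Sample one N-way K-shot episode from a {class: [samples]} mapping."""
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picks = rng.sample(dataset[cls], k_shot + q_query)
        support += [(x, label) for x in picks[:k_shot]]
        query += [(x, label) for x in picks[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 30 samples each
data = {f"class_{c}": list(range(30)) for c in range(10)}
support, query = sample_episode(data, n_way=5, k_shot=1, q_query=15)
# 5-way 1-shot episode: |support| = 5*1 = 5, |query| = 5*15 = 75
```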

6 of 23

Background: Few-Shot Learning

  • The whole dataset is split into meta-train, meta-validation, and meta-test sets, and the classes in each set are disjoint from the classes in the other sets.
  • N: number of classes
  • K: number of support samples per class
  • Q: number of query samples per class (in most FSL models, Q = 1 or 15)

7 of 23

MAML: Generic Definition

Learning Task (generic notation across different learning problems):

T = { L(x_1, a_1, …, x_H, a_H), q(x_1), q(x_{t+1} | x_t, a_t), H }

  • L: task-specific loss
  • q(x_1): distribution over initial observations
  • q(x_{t+1} | x_t, a_t): transition distribution
  • H: episode length (in i.i.d. supervised learning problems, H = 1)
  • a_t: output of the model at time t

8 of 23

MAML: Algorithm Explanation

Inner Loop (per-task adaptation):

θ'_i = θ − α ∇_θ L_{T_i}(f_θ)

Outer Loop (meta-update):

θ ← θ − β ∇_θ Σ_{T_i ∼ p(T)} L_{T_i}(f_{θ'_i})

The MAML meta-gradient update involves a gradient through a gradient, which requires computing Hessian-vector products.
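To make the two loops concrete, here is a toy sketch (not the paper's code; the scalar setting is assumed for illustration): each task T_i has loss L_i(θ) = (θ − c_i)², the inner loop takes one gradient step, and the outer loop differentiates through that step, which exposes the second-order term (here the factor 1 − 2α):

```python
# Toy MAML: scalar parameter theta, per-task loss L_i(theta) = (theta - c_i)^2.

def inner_loss_grad(theta, c):
    return 2.0 * (theta - c)                 # dL_i/dtheta

def adapt(theta, c, alpha):
    # Inner loop: one gradient step on task i
    return theta - alpha * inner_loss_grad(theta, c)

def meta_grad(theta, tasks, alpha):
    """Gradient of sum_i L_i(theta'_i) w.r.t. the *initial* theta."""
    g = 0.0
    for c in tasks:
        theta_i = adapt(theta, c, alpha)
        # Chain rule: d(theta'_i)/d(theta) = 1 - alpha * L_i''(theta) = 1 - 2*alpha
        # -- this is the "gradient through a gradient" (second-order) term.
        g += 2.0 * (theta_i - c) * (1.0 - 2.0 * alpha)
    return g

theta, alpha, beta = 0.0, 0.1, 0.05
tasks = [-2.0, 1.0, 3.0]                     # each c_i defines one task
for _ in range(200):                         # outer loop
    theta -= beta * meta_grad(theta, tasks, alpha)
# theta converges to the initialization minimizing the post-adaptation loss
```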

9 of 23

Case Study 1: Classification

Data

  • Omniglot: 1623 classes, 50 images each, 28×28 pixels
  • MiniImageNet: 100 classes, 600 images each, 84×84 pixels

Objective

  • Classify images in meta-test set to the correct class

Backbone - Conv4

  • Structure: {Conv3x3 -> BN -> ReLU -> Maxpool2x2} x 4
  • Channel: 3 -> 64 -> 64 -> 64 -> 64

Loss

  • Cross Entropy
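A quick sanity check of the feature-map size this backbone produces (assuming padding-1 convolutions, as is standard for Conv4, so only the pooling changes the spatial size):

```python
def conv4_output_shape(h, w, blocks=4, channels=64):
    """Spatial size after 4 {Conv3x3(pad=1) -> BN -> ReLU -> MaxPool2x2} blocks."""
    for _ in range(blocks):
        h, w = h // 2, w // 2    # conv keeps the size; 2x2 max-pool halves it
    return channels, h, w

conv4_output_shape(84, 84)   # MiniImageNet -> (64, 5, 5)
conv4_output_shape(28, 28)   # Omniglot    -> (64, 1, 1)
```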

10 of 23

Case Study 1: Classification

The model learned with MAML uses fewer parameters overall than the state-of-the-art models of 2017, since the algorithm introduces no parameters beyond the weights of the classifier itself.

Split (train-val-test)

  • Omniglot
    • 1000-200-423

  • MiniImageNet
    • 64-16-20

  • CUB-200-2011
    • 100-50-50

11 of 23

Case Study 2: Sine Wave Regression

Data

  • Data points sampled uniformly from [−5.0, 5.0]
  • Amplitude varies within [0.1, 5.0]
  • Phase varies within [0, π]

Objective

  • Predict the outputs of a continuous-valued function from only a few data points

Model

  • NN (denote as f) with 2 hidden layers of size 40 with ReLU nonlinearities

Loss

  • MSE between the prediction f (x) and true value

Evaluation Protocol

  • Fine-tune a single meta-learned model (MAML) on varying numbers K of examples

  • Compare performance to two baselines

    • Pretraining on all of the tasks, fine-tuning with gradient descent on the K provided points at test-time.

    • Oracle which receives the true amplitude and phase as input
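The task distribution above is easy to reproduce; a minimal sketch (helper names are illustrative, standard library only):

```python
import math
import random

def sample_sine_task(rng=random):
    """One regression task: a sinusoid with random amplitude and phase."""
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, math.pi)
    return lambda x: amplitude * math.sin(x + phase)

def sample_points(f, k, rng=random):
    """K labeled points with inputs drawn uniformly from [-5, 5]."""
    return [(x, f(x)) for x in (rng.uniform(-5.0, 5.0) for _ in range(k))]

f = sample_sine_task()
support = sample_points(f, k=10)   # the K examples used for adaptation
```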

12 of 23

Case Study 2: Sine Wave Regression

  • MAML is able to estimate parts of the curve where it has no data points, and adapts quickly to the test data.
  • The pre-trained model is unable to recover a suitable representation from the small number of samples at test time.
  • The model learned with MAML continues to improve with additional gradient steps.

13 of 23

Case Study 3: Reinforcement Learning

  • Similar Framework
  • Similar Objective
    • Achieving a new goal, or succeeding at a previously trained goal in a new environment (e.g., an agent learns how to navigate mazes quickly, so that it can figure out a new maze from only a few samples)
  • Different Loss

  • Different Gradient Estimation Method

Policy Gradient for discrete data (non-differentiable reward)
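To illustrate why a policy gradient is needed, here is a minimal REINFORCE sketch on a two-armed bandit (a toy stand-in for the maze tasks, not the paper's setup): the reward enters only as a multiplicative weight on ∇ log π, so it never has to be differentiable:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

theta = 0.0              # policy: pi(arm 0) = sigmoid(theta)
rewards = [1.0, 0.0]     # arm 0 always pays off, arm 1 never does
rng = random.Random(0)
lr = 0.5
for _ in range(500):
    p0 = sigmoid(theta)
    a = 0 if rng.random() < p0 else 1          # sample an action
    glogp = (1.0 - p0) if a == 0 else -p0      # d log pi(a) / d theta
    theta += lr * glogp * rewards[a]           # REINFORCE: reward * grad log pi
# The policy shifts toward the rewarding arm without differentiating the reward.
```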

14 of 23

Related Works

Meta-Learning Application

  • Train a meta-learner that learns how to update the parameters of the learner’s model (Bengio et al., 1992; Schmidhuber, 1992; Bengio et al., 1990)

  • Learning to optimize deep networks (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2017)
  • Learning dynamically changing recurrent networks (Ha et al., 2017)
  • Learning the weight initialization (Ravi & Larochelle, 2017)
  • Train memory augmented models on many tasks, where the recurrent learner is trained to adapt to new tasks as it is rolled out (Santoro et al., 2016; Munkhdalai & Yu, 2017).

15 of 23

Related Works

Few-shot Learning

  • Learn to compare new examples in a learned metric space (Koch, 2015)
  • Recurrence with attention mechanisms (Vinyals et al., 2016; Shyam et al., 2017; Snell et al., 2017).
  • Generative Model (Edwards & Storkey, 2017; Rezende et al., 2016)

16 of 23

Related Works

Differences with MAML

  • MAML does not introduce additional parameters for meta-learning nor require a particular learner architecture
  • The methods above are difficult to extend directly to reinforcement learning
  • MAML learner’s weights are updated using the gradient, rather than a learned update.

17 of 23

Problems of MAML

  • Gradient Instability
  • Computationally Expensive
  • Shared (across-step) Batch Normalization Bias

  • Cross-Domain Data
  • Shared (across-step) Inner-Loop Learning Rate

18 of 23

Improved MAML - MAML++

  1. Gradient Instability -> Multi-Step Loss Optimization (MSL)

Previous: the outer loop is updated only after all inner-loop updates are completed.

New: the outer loop is updated after every step of the inner loop, with an annealed weighting of the per-step losses.
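A sketch of the weighted objective (the helper and the specific weighting scheme are illustrative; MAML++ anneals the weights over training so that, eventually, only the final step's loss matters):

```python
def multi_step_loss(per_step_losses, final_weight):
    """Weighted sum of query-set losses measured after each inner-loop step.

    Early steps share (1 - final_weight) equally; annealing final_weight
    toward 1.0 recovers the standard MAML objective (last step only).
    """
    n = len(per_step_losses)
    w_early = (1.0 - final_weight) / (n - 1)
    weights = [w_early] * (n - 1) + [final_weight]
    return sum(w * l for w, l in zip(weights, per_step_losses))

multi_step_loss([0.9, 0.7, 0.5, 0.4, 0.3], final_weight=0.6)
```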

19 of 23

Improved MAML - MAML++

  2. Computationally Expensive -> Derivative-Order Annealing (DA)

Use first-order approximated gradients for the first 50 epochs, then switch to second-order gradients for the remainder of training.
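The saving can be seen in a toy scalar setting (per-task loss L(θ) = (θ − c)², one inner step; all names illustrative): the first-order variant simply treats the adapted parameters as independent of θ, dropping the Hessian term:

```python
def full_meta_grad(theta, c, alpha):
    theta_i = theta - alpha * 2.0 * (theta - c)    # inner step on L = (theta - c)^2
    # Exact chain rule keeps d(theta_i)/d(theta) = 1 - 2*alpha (the Hessian term)
    return 2.0 * (theta_i - c) * (1.0 - 2.0 * alpha)

def first_order_meta_grad(theta, c, alpha):
    theta_i = theta - alpha * 2.0 * (theta - c)
    # First-order MAML pretends d(theta_i)/d(theta) = 1: no second derivatives
    return 2.0 * (theta_i - c)

# The two differ only by the factor (1 - 2*alpha), so for a small inner-loop
# learning rate the cheap first-order estimate points in nearly the same direction.
```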

20 of 23

Improved MAML - MAML++

  3. Shared Inner-Loop Learning Rate (across steps and parameters) -> Learning Per-Layer Per-Step Learning Rates and Gradient Directions (LSLR)

Learn a learning rate and direction for each layer of the network, as well as a different learning rate for each adaptation step the base network takes.
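A sketch of the bookkeeping involved (layer names and values are illustrative): α becomes a learned table indexed by layer and inner step, updated by the outer loop together with θ:

```python
num_steps = 5
layers = ["conv1", "conv2", "conv3", "conv4", "fc"]
# One learning rate per layer per inner step (initialized uniformly);
# in MAML++ these are meta-parameters trained by the outer loop.
alpha = {layer: [0.01] * num_steps for layer in layers}

def inner_update(params, grads, step):
    """One inner-loop step with a per-layer, per-step learning rate."""
    return {layer: params[layer] - alpha[layer][step] * grads[layer]
            for layer in params}

params = {layer: 1.0 for layer in layers}      # stand-in for weight tensors
grads = {layer: 1.0 for layer in layers}
params = inner_update(params, grads, step=0)   # conv1 weight: 1.0 -> 0.99
```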

21 of 23

Improved MAML - MAML++

  4. Shared (across-step) Batch Normalization Bias -> Per-Step Batch Normalization Weights and Biases (BNWB)

Learn a set of batch-normalization weights and biases per step within the inner-loop update process.

This increases convergence speed, stability, and generalization performance.

22 of 23

Pre-training + Fine-Tuning V.S. MAML

MAML

  • The learned meta-model adapts very quickly to a new task with little data, and achieves good performance.
  • The model's performance after adaptation is considered during the learning procedure.
  • Uses 2nd-order gradient information.

Pre-training + Fine-Tuning

  • The learned model has the best performance on existing tasks, but that does not mean it will adapt quickly to a new task.
  • During pre-training, fine-tuning performance is not considered.
  • Normally uses only 1st-order gradient information.

23 of 23

Summary

  • MAML can learn the parameters of any standard model via meta-learning to prepare that model for fast adaptation.

  • MAML can also be viewed as explicitly maximizing the sensitivity of new-task losses to the model parameters
  • A gradient-based meta-learning method
  • Parameter-free (the framework adds no parameters of its own)
  • Extends to many different settings (e.g., RL)