1 of 23

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

By Chelsea Finn, Pieter Abbeel, and Sergey Levine, ICML 2017

Presented by: Shihao Ma, Yichun Zhang, and Zilun Zhang

CSC 2541 Winter 2021

April 1st, 2021

2 of 23

Agenda

  • Background
    • Meta-learning
    • FSL (Few-Shot Learning)
  • MAML Algorithm
    • Generic Definition
    • Algorithm Explanation
    • Case Study 1: Classification
    • Case Study 2: Regression (*)
    • Case Study 3: Reinforcement Learning (*)
  • Related Works
    • Meta-learning
    • FSL
    • Differences with MAML
  • Problems of MAML
    • Computationally Expensive
    • Gradient Instability
    • Shared Batch Normalization
    • Shared Inner-Loop LR
  • Improved MAML
    • MAML++ (with Omniglot)
    • Reptile
  • Summary
    • Pros & Cons
    • Difference between Pre-train + Fine-tune and Meta-learning + Adaptation

3 of 23

Background: Meta Learning

  • A "learning to learn" framework

  • Training stage: learn some prior knowledge (a metric, a good initialization, a data distribution, etc.)

  • Testing stage: apply the prior knowledge to the meta-test set for fast adaptation

4 of 23

Background: Meta Learning

5 of 23

Background: Few-Shot Learning

  • Goal: train a model that performs well when only a few samples are given.

  • The minimum unit during training is an ‘episode’; each episode contains a Support Set and a Query Set.

    • The Support Set has N ∙ K samples and the Query Set has N ∙ Q samples; such an FSL task is called an “N-way K-shot” task.
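As an illustration, episode construction can be sketched in a few lines (the `sample_episode` helper and toy dataset are hypothetical, not from the paper; the dataset is a simple class-to-samples mapping):

```python
import random

def sample_episode(dataset, n_way, k_shot, q_query, rng=random):
    """Sample one N-way K-shot episode from a {class: [samples]} mapping."""
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picks = rng.sample(dataset[cls], k_shot + q_query)
        support += [(x, label) for x in picks[:k_shot]]
        query += [(x, label) for x in picks[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 30 samples each
data = {f"class_{c}": list(range(30)) for c in range(10)}
support, query = sample_episode(data, n_way=5, k_shot=1, q_query=15)
# 5-way 1-shot episode: |support| = 5*1 = 5, |query| = 5*15 = 75
```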

6 of 23

Background: Few-Shot Learning

  • The whole dataset is split into meta-train, meta-validation, and meta-test sets, and the classes in each set are disjoint from the classes in the other sets.
  • N: number of classes
  • K: number of support samples per class
  • Q: number of query samples per class (in most FSL models, Q = 1 or 15)

7 of 23

MAML: Generic Definition

Learning Task (generic notation across different learning problems):

T = { L(x_1, a_1, …, x_H, a_H), q(x_1), q(x_{t+1} | x_t, a_t), H }

  • L: task-specific loss
  • q(x_1): distribution over initial observations
  • q(x_{t+1} | x_t, a_t): transition distribution
  • H: episode length (in i.i.d. supervised learning problems, H = 1)
  • a_t: output of the model at time t

8 of 23

MAML: Algorithm Explanation

Inner Loop (per-task adaptation):

θ'_i = θ − α ∇_θ L_{T_i}(f_θ)

Outer Loop (meta-update):

θ ← θ − β ∇_θ Σ_{T_i ∼ p(T)} L_{T_i}(f_{θ'_i})

The MAML meta-gradient update involves a gradient through a gradient, which requires computing Hessian-vector products.
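To make the two loops concrete, here is a toy sketch (not the paper's code; the scalar setting is assumed for illustration): each task T_i has loss L_i(θ) = (θ − c_i)², the inner loop takes one gradient step, and the outer loop differentiates through that step, which exposes the second-order term (here the factor 1 − 2α):

```python
# Toy MAML: scalar parameter theta, per-task loss L_i(theta) = (theta - c_i)^2.

def inner_loss_grad(theta, c):
    return 2.0 * (theta - c)                 # dL_i/dtheta

def adapt(theta, c, alpha):
    # Inner loop: one gradient step on task i
    return theta - alpha * inner_loss_grad(theta, c)

def meta_grad(theta, tasks, alpha):
    """Gradient of sum_i L_i(theta'_i) w.r.t. the *initial* theta."""
    g = 0.0
    for c in tasks:
        theta_i = adapt(theta, c, alpha)
        # Chain rule: d(theta'_i)/d(theta) = 1 - alpha * L_i''(theta) = 1 - 2*alpha
        # -- this is the "gradient through a gradient" (second-order) term.
        g += 2.0 * (theta_i - c) * (1.0 - 2.0 * alpha)
    return g

theta, alpha, beta = 0.0, 0.1, 0.05
tasks = [-2.0, 1.0, 3.0]                     # each c_i defines one task
for _ in range(200):                         # outer loop
    theta -= beta * meta_grad(theta, tasks, alpha)
# theta converges to the initialization minimizing the post-adaptation loss
```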

9 of 23

Case Study 1: Classification

Data

  • Omniglot: 1623 classes, 50 images each, 28×28 pixels
  • MiniImageNet: 100 classes, 600 images each, 84×84 pixels

Objective

  • Classify images in meta-test set to the correct class

Backbone - Conv4

  • Structure: {Conv3x3 -> BN -> ReLU -> Maxpool2x2} x 4
  • Channel: 3 -> 64 -> 64 -> 64 -> 64

Loss

  • Cross Entropy
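A quick sanity check of the feature-map size this backbone produces (assuming padding-1 convolutions, as is standard for Conv4, so only the pooling changes the spatial size):

```python
def conv4_output_shape(h, w, blocks=4, channels=64):
    """Spatial size after 4 {Conv3x3(pad=1) -> BN -> ReLU -> MaxPool2x2} blocks."""
    for _ in range(blocks):
        h, w = h // 2, w // 2    # conv keeps the size; 2x2 max-pool halves it
    return channels, h, w

conv4_output_shape(84, 84)   # MiniImageNet -> (64, 5, 5)
conv4_output_shape(28, 28)   # Omniglot    -> (64, 1, 1)
```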

10 of 23

Case Study 1: Classification

The model learned with MAML uses fewer parameters overall than the state-of-the-art models of 2017, since the algorithm introduces no parameters beyond the weights of the classifier itself.

Split (train-val-test)

  • Omniglot
    • 1000-200-423

  • MiniImageNet
    • 64-16-20

  • CUB-200-2011
    • 100-50-50

11 of 23

Case Study 2: Sine Wave Regression

Data

  • Data points sampled uniformly from [−5.0, 5.0]
  • Amplitude varies within [0.1, 5.0]
  • Phase varies within [0, π]

Objective

  • Predict the outputs of a continuous-valued function from only a few data points

Model

  • NN (denote as f) with 2 hidden layers of size 40 with ReLU nonlinearities

Loss

  • MSE between the prediction f (x) and true value

Evaluation Protocol

  • Fine-tune a single meta-learned model (MAML) on varying numbers K of examples

  • Compare performance to two baselines

    • Pretraining on all of the tasks, fine-tuning with gradient descent on the K provided points at test-time.

    • Oracle which receives the true amplitude and phase as input
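The task distribution above is easy to reproduce; a minimal sketch (helper names are illustrative, standard library only):

```python
import math
import random

def sample_sine_task(rng=random):
    """One regression task: a sinusoid with random amplitude and phase."""
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, math.pi)
    return lambda x: amplitude * math.sin(x + phase)

def sample_points(f, k, rng=random):
    """K labeled points with inputs drawn uniformly from [-5, 5]."""
    return [(x, f(x)) for x in (rng.uniform(-5.0, 5.0) for _ in range(k))]

f = sample_sine_task()
support = sample_points(f, k=10)   # the K examples used for adaptation
```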

12 of 23

Case Study 2: Sine Wave Regression

  • MAML is able to estimate parts of the curve where it has no data points, and adapts quickly to the test data.
  • The pre-trained model is unable to recover a suitable representation from the small number of samples at test time.
  • The model learned with MAML continues to improve with additional gradient steps.

13 of 23

Case Study 3: Reinforcement Learning

  • Similar Framework
  • Similar Objective
    • Achieving a new goal, or succeeding at a previously trained goal in a new environment (e.g., an agent learns how to navigate mazes quickly, so that it can figure out a new maze from only a few samples)
  • Different Loss

  • Different Gradient Estimation Method

Policy Gradient for discrete data (non-differentiable reward)
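To illustrate why a policy gradient is needed, here is a minimal REINFORCE sketch on a two-armed bandit (a toy stand-in for the maze tasks, not the paper's setup): the reward enters only as a multiplicative weight on ∇ log π, so it never has to be differentiable:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

theta = 0.0              # policy: pi(arm 0) = sigmoid(theta)
rewards = [1.0, 0.0]     # arm 0 always pays off, arm 1 never does
rng = random.Random(0)
lr = 0.5
for _ in range(500):
    p0 = sigmoid(theta)
    a = 0 if rng.random() < p0 else 1          # sample an action
    glogp = (1.0 - p0) if a == 0 else -p0      # d log pi(a) / d theta
    theta += lr * glogp * rewards[a]           # REINFORCE: reward * grad log pi
# The policy shifts toward the rewarding arm without differentiating the reward.
```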

14 of 23

Related Works

Meta-Learning Application

  • Train a meta-learner that learns how to update the parameters of the learner’s model (Bengio et al., 1992; Schmidhuber, 1992; Bengio et al., 1990)

  • Learning to optimize deep networks (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2017)
  • Learning dynamically changing recurrent networks (Ha et al., 2017)
  • Learning the weight initialization (Ravi & Larochelle, 2017)
  • Train memory augmented models on many tasks, where the recurrent learner is trained to adapt to new tasks as it is rolled out (Santoro et al., 2016; Munkhdalai & Yu, 2017).

15 of 23

Related Works

Few-shot Learning

  • Learn to compare new examples in a learned metric space (Koch, 2015)
  • Recurrence with attention mechanisms (Vinyals et al., 2016; Shyam et al., 2017; Snell et al., 2017).
  • Generative Model (Edwards & Storkey, 2017; Rezende et al., 2016)

16 of 23

Related Works

Differences with MAML

  • MAML does not introduce additional parameters for meta-learning nor require a particular learner architecture
  • The methods above are difficult to extend directly to reinforcement learning
  • MAML learner’s weights are updated using the gradient, rather than a learned update.

17 of 23

Problems of MAML

  • Gradient Instability
  • Computationally Expensive
  • Shared (across-step) Batch Normalization Bias

  • Cross-Domain Data
  • Shared (across-step) Inner-Loop Learning Rate

18 of 23

Improved MAML - MAML++

  1. Gradient Instability -> Multi-Step Loss Optimization (MSL)

Previous: the outer loop is updated only after all inner-loop updates are completed.

New: the outer loop is updated after every step of the inner loop, with an annealed weighting of the per-step losses.
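A sketch of the weighted objective (the helper and the specific weighting scheme are illustrative; MAML++ anneals the weights over training so that, eventually, only the final step's loss matters):

```python
def multi_step_loss(per_step_losses, final_weight):
    """Weighted sum of query-set losses measured after each inner-loop step.

    Early steps share (1 - final_weight) equally; annealing final_weight
    toward 1.0 recovers the standard MAML objective (last step only).
    """
    n = len(per_step_losses)
    w_early = (1.0 - final_weight) / (n - 1)
    weights = [w_early] * (n - 1) + [final_weight]
    return sum(w * l for w, l in zip(weights, per_step_losses))

multi_step_loss([0.9, 0.7, 0.5, 0.4, 0.3], final_weight=0.6)
```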

19 of 23

Improved MAML - MAML++

  2. Computationally Expensive -> Derivative-Order Annealing (DA)

Use first-order approximated gradients for the first 50 epochs, then switch to second-order gradients for the remainder of training.
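The saving can be seen in a toy scalar setting (per-task loss L(θ) = (θ − c)², one inner step; all names illustrative): the first-order variant simply treats the adapted parameters as independent of θ, dropping the Hessian term:

```python
def full_meta_grad(theta, c, alpha):
    theta_i = theta - alpha * 2.0 * (theta - c)    # inner step on L = (theta - c)^2
    # Exact chain rule keeps d(theta_i)/d(theta) = 1 - 2*alpha (the Hessian term)
    return 2.0 * (theta_i - c) * (1.0 - 2.0 * alpha)

def first_order_meta_grad(theta, c, alpha):
    theta_i = theta - alpha * 2.0 * (theta - c)
    # First-order MAML pretends d(theta_i)/d(theta) = 1: no second derivatives
    return 2.0 * (theta_i - c)

# The two differ only by the factor (1 - 2*alpha), so for a small inner-loop
# learning rate the cheap first-order estimate points in nearly the same direction.
```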

20 of 23

Improved MAML - MAML++

  3. Shared Inner-Loop Learning Rate (across steps and parameters) -> Learning Per-Layer Per-Step Learning Rates and Gradient Directions (LSLR)

Learn a learning rate and direction for each layer of the network, as well as a different learning rate for each adaptation step the base network takes.
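A sketch of the bookkeeping involved (layer names and values are illustrative): α becomes a learned table indexed by layer and inner step, updated by the outer loop together with θ:

```python
num_steps = 5
layers = ["conv1", "conv2", "conv3", "conv4", "fc"]
# One learning rate per layer per inner step (initialized uniformly);
# in MAML++ these are meta-parameters trained by the outer loop.
alpha = {layer: [0.01] * num_steps for layer in layers}

def inner_update(params, grads, step):
    """One inner-loop step with a per-layer, per-step learning rate."""
    return {layer: params[layer] - alpha[layer][step] * grads[layer]
            for layer in params}

params = {layer: 1.0 for layer in layers}      # stand-in for weight tensors
grads = {layer: 1.0 for layer in layers}
params = inner_update(params, grads, step=0)   # conv1 weight: 1.0 -> 0.99
```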

21 of 23

Improved MAML - MAML++

  4. Shared (across-step) Batch Normalization Bias -> Per-Step Batch Normalization Weights and Biases (BNWB)

Learn a set of batch-normalization weights and biases per step within the inner-loop update process.

This increases convergence speed, stability, and generalization performance.

22 of 23

Pre-training + Fine-Tuning V.S. MAML

MAML

  • The learned meta-model adapts very quickly to a new task with little data, and achieves good performance.
  • The model's performance after adaptation is considered during the learning procedure.
  • Uses 2nd-order gradient information.

Pre-training + Fine-Tuning

  • The learned model has the best performance on existing tasks, but that does not mean it will adapt quickly to a new task.
  • During pre-training, fine-tuning performance is not considered.
  • Normally uses only 1st-order gradient information.

23 of 23

Summary

  • MAML can learn the parameters of any standard model via meta-learning to prepare that model for fast adaptation.

  • MAML can also be viewed as explicitly maximizing the sensitivity of new-task losses to the model parameters
  • A gradient-based meta-learning method
  • Parameter-free (the framework adds no parameters of its own)
  • Extends to many different settings (e.g., RL)