1 of 110

Transfer Learning in NLP

NLPL Winter School

Thomas Wolf - HuggingFace Inc.

1

2 of 110

Overview

  • Session 1: Transfer Learning - Pretraining and representations
  • Session 2: Transfer Learning - Adaptation and downstream tasks
  • Session 3: Transfer Learning - Limitations, open-questions, future directions

Many slides are adapted from a Tutorial on Transfer Learning in NLP I gave at NAACL 2019 with my amazing collaborators


2

Sebastian Ruder

Matthew Peters

Swabha Swayamdipta

3 of 110

Transfer Learning in NLP

NLPL Winter School

Session 2

3

4 of 110

Transfer Learning in Natural Language Processing

Transfer Learning in NLP

4

Follow along with the tutorial:

5 of 110

Agenda

5

6 of 110

4. Adaptation

6

Image credit: Ben Didier

7 of 110

4 – How to adapt the pretrained model

Several orthogonal directions we can make decisions on:

  • Architectural modifications? How much to change the pretrained model architecture for adaptation
  • Optimization schemes? Which weights to train during adaptation and following what schedule
  • More signal: weak supervision, multi-tasking & ensembling. How to get more supervision signal for the target task

8 of 110

4.1 – Architecture

Two general options:

  • Keep pretrained model internals unchanged: add classifiers on top, embeddings at the bottom, use outputs as features
  • Modify pretrained model internal architecture: initialize encoder-decoders, task-specific modifications, adapters

8

Image credit: Darmawansyah

9 of 110

4.1.A – Architecture: Keep model unchanged

General workflow:

  • Remove pretraining task head if not useful for target task
    • Example: remove softmax classifier from pretrained LM
    • Not always needed: some adaptation schemes re-use the pretraining objective/task, e.g. for multi-task learning

9

10 of 110

4.1.A – Architecture: Keep model unchanged

General workflow:

Task-specific, randomly initialized

General, pretrained

  • Add target task-specific layers on top/bottom of pretrained model
    • Simple: adding linear layer(s) on top of the pretrained model

10

11 of 110

4.1.A – Architecture: Keep model unchanged

General workflow:

  • Add target task-specific layers on top/bottom of pretrained model
    • Simple: adding linear layer(s) on top of the pretrained model
    • More complex: model output as input for a separate model
    • Often beneficial when the target task requires interactions that are not available in the pretrained embeddings

11

12 of 110

4.1.B – Architecture: Modifying model internals

Various reasons:

  • Adapting to a structurally different target task
    • Ex: Pretraining with a single input sequence (ex: language modeling) but adapting to a task with several input sequences (ex: translation, conditional generation...)
    • Use the pretrained model weights to initialize as much as possible of a structurally different target task model
    • Ex: Use monolingual LMs to initialize encoder and decoder parameters for MT (Ramachandran et al., EMNLP 2017; Lample & Conneau, 2019)

12

13 of 110

4.1.B – Architecture: Modifying model internals

Various reasons:

  • Task-specific modifications
    • Provide pretrained model with capabilities that are useful for the target task
    • Ex: Adding skip/residual connections, attention (Ramachandran et al., EMNLP 2017)

13

14 of 110

4.1.B – Architecture: Modifying model internals

  • Using fewer parameters for adaptation:
    • Fewer parameters to fine-tune
    • Can be very useful given the increasing number of model parameters
    • Ex: add bottleneck modules (“adapters”) between the layers of the pretrained model (Rebuffi et al., NIPS 2017; CVPR 2018)

Various reasons:

14

15 of 110

4.1.B – Architecture: Modifying model internals

Adapters

  • Commonly connected with a residual connection in parallel to an existing layer
  • Most effective when placed at every layer (smaller effect at bottom layers)
  • Different operations (convolutions, self-attention) possible
  • Particularly suitable for modular architectures like Transformers (Houlsby et al., ICML 2019; Stickland and Murray, ICML 2019)

15

Image credit: Caique Lima

16 of 110

4.1.B – Architecture: Modifying model internals

  • Multi-head attention (MH; shared across layers) is used in parallel with the self-attention (SA) layer of BERT
  • Both are added together and fed into a layer-norm (LN)

16

17 of 110

Hands-on #2:

Adapting our pretrained model

17

Image credit: Chanaky

18 of 110

Hands-on: Model adaptation

  • Plan
    • Start from our Transformer language model
    • Adapt the model to a target task:
      • keep the model core unchanged, load the pretrained weights
      • add a linear layer on top, newly initialized
      • use additional embeddings at the bottom, newly initialized
  • Reminder — material is here:

18

Let’s see how a simple fine-tuning scheme can be implemented with our pretrained model:

19 of 110

Adaptation task

  • We select a text classification task as the downstream task

  • TREC-6: The Text REtrieval Conference (TREC) Question Classification (Li et al., COLING 2002)
  • TREC consists of open-domain, fact-based questions divided into broad semantic categories. It contains 5,500 labeled training questions and 500 test questions with 6 labels: NUM, LOC, HUM, DESC, ENTY, ABBR

Hands-on: Model adaptation

19

Ex:

  • How did serfdom develop in and then leave Russia ? —> DESC
  • What films featured the character Popeye Doyle ? —> ENTY

Transfer learning models shine on this type of low-resource task

20 of 110

  • Modifications:
    • Keep model internals unchanged
    • Add a linear layer on top
    • Add an additional embedding (classification token) at the bottom
  • Computation flow:
    • Model input: the tokenized question with a classification token at the end
    • Extract the last hidden-state associated to the classification token
    • Pass the hidden-state in a linear layer and softmax to obtain class probabilities

Hands-on: Model adaptation

20

First adaptation scheme
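As a rough sketch of this first adaptation scheme (the pretrained `transformer` backbone, its hidden size and the classification-token mask are stand-ins for the objects defined in the Colab):

```python
import torch
import torch.nn as nn

class TransformerWithClfHead(nn.Module):
    """Pretrained Transformer backbone + newly initialized classification head."""
    def __init__(self, transformer, hidden_dim, num_classes=6):        # 6 TREC-6 classes
        super().__init__()
        self.transformer = transformer                                  # kept unchanged, pretrained
        self.classification_head = nn.Linear(hidden_dim, num_classes)   # newly initialized

    def forward(self, input_ids, clf_token_mask):
        hidden_states = self.transformer(input_ids)      # (batch, seq_len, hidden_dim)
        clf_hidden = hidden_states[clf_token_mask]       # hidden state at the classification token
        logits = self.classification_head(clf_hidden)    # softmax of these gives class probabilities
        return logits
```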

21 of 110

Let’s load and prepare our dataset:

Fine-tuning hyper-parameters:

– 6 classes in TREC-6

– Use the fine-tuning hyper-parameters from Radford et al., 2018:

  • learning rate from 6.5e-5 to 0.0
  • fine-tune for 3 epochs

Hands-on: Model adaptation

21

- trim to the transformer input size & add a classification token at the end of each sample,

- pad to the left,

- convert to tensors,

- extract a validation set.
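A minimal sketch of this preparation, assuming a hypothetical `encode` function and the token ids we added to the vocabulary (`clf_token_id`, `pad_token_id`):

```python
import torch
from torch.utils.data import TensorDataset, random_split

def build_dataset(questions, labels, encode, clf_token_id, pad_token_id, max_len=256):
    rows = []
    for q in questions:
        ids = encode(q)[:max_len - 1] + [clf_token_id]      # trim & add the classification token
        ids = [pad_token_id] * (max_len - len(ids)) + ids   # pad to the left
        rows.append(ids)
    dataset = TensorDataset(torch.tensor(rows, dtype=torch.long),
                            torch.tensor(labels, dtype=torch.long))
    n_valid = len(dataset) // 10                            # extract a validation set (10% here)
    return random_split(dataset, [len(dataset) - n_valid, n_valid])
```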

22 of 110

Adapt our model architecture

Replace the pre-training head (language modeling) with the classification head:

A linear layer, which takes as input the hidden-state of the [CLF] token (using a mask)

Keep our pretrained model unchanged as the backbone.

* Initialize all the weights of the model.

Hands-on: Model adaptation

22

* Reload common weights from the pretrained model.
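Reloading the common weights can be sketched as follows, assuming the fine-tuning model wraps the same backbone under the same parameter names:

```python
import torch

def reload_common_weights(fine_tuning_model: torch.nn.Module,
                          pretrained_model: torch.nn.Module):
    """Copy every parameter whose name matches; leave the new layers randomly initialized."""
    # strict=False skips the parameters of the new head/embeddings that have
    # no counterpart in the pretrained model.
    incompatible = fine_tuning_model.load_state_dict(pretrained_model.state_dict(),
                                                     strict=False)
    return incompatible.missing_keys   # the parameters that stay randomly initialized
```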

23 of 110

Our fine-tuning code:

We will evaluate on our validation and test sets:

* validation: after each epoch

* test: at the end

A simple training update function:

* prepare inputs: transpose and build padding & classification token masks

* we have options to clip and accumulate gradients

Schedule:

* linearly increasing to lr

* linearly decreasing to 0.0
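In spirit, the update function and schedule look like this sketch (the model, optimizer, loss and hyper-parameters are placeholders for the ones defined in the Colab):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_scheduler(optimizer, warmup_steps, total_steps):
    def lr_lambda(step):
        if step < warmup_steps:                           # linearly increasing to lr
            return step / max(1, warmup_steps)
        # then linearly decreasing to 0.0
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)

def update(model, optimizer, scheduler, batch, labels, loss_fn,
           step, accumulation_steps=1, max_norm=1.0):
    logits = model(batch)                                 # inputs prepared (transposed, masked) upstream
    loss = loss_fn(logits, labels) / accumulation_steps   # option: accumulate gradients
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)   # option: clip gradients
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    return loss.item()
```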

Hands-on: Model adaptation

23

24 of 110

We can now fine-tune our model on TREC:

We are at the state-of-the-art

(ULMFiT)

Remarks:

  • The error rate goes down quickly! After one epoch we already have >90% accuracy. ⇨ Fine-tuning is highly data efficient in Transfer Learning
  • We took our pre-training & fine-tuning hyper-parameters straight from the literature on related models. ⇨ Fine-tuning is often robust to the exact choice of hyper-parameters

Hands-on: Model adaptation – Results

24

25 of 110

Let’s conclude this hands-on with a few additional words on robustness & variance.

  • Large pretrained models (e.g. BERT large) are prone to degenerate performance when fine-tuned on tasks with small training sets.
  • Observed behavior is often “on-off”: it either works very well or doesn’t work at all.
  • Understanding the conditions and causes of this behavior (models, adaptation schemes) is an open research question.

Hands-on: Model adaptation – Results

25

26 of 110

4.2 – Optimization

Several directions when it comes to the optimization itself:

  • Choose which weights we should update: feature extraction, fine-tuning, adapters
  • Choose how and when to update the weights: from top to bottom, gradual unfreezing, discriminative fine-tuning
  • Consider practical trade-offs: space and time complexity, performance

26

Image credit: ProSymbols, purplestudio, Markus, Alfredo

27 of 110

4.2.A – Optimization: Which weights?

The main question: To tune or not to tune (the pretrained weights)?

  • Do not change the pretrained weights: feature extraction, adapters
  • Change the pretrained weights: fine-tuning

27

Image credit: purplestudio

28 of 110

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Weights are frozen

28

❄️

29 of 110

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Weights are frozen
  • A linear classifier is trained on top of the pretrained representations

29

❄️

30 of 110

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Weights are frozen
  • A linear classifier is trained on top of the pretrained representations
  • Don’t just use features of the top layer!
  • Learn a linear combination of layers (Peters et al., NAACL 2018, Ruder et al., AAAI 2019)

30

❄️

31 of 110

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Alternatively, pretrained representations are used as features in a downstream model

31

32 of 110

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Adapters

  • Task-specific modules that are added in between existing layers

32

33 of 110

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Adapters

  • Task-specific modules that are added in between existing layers
  • Only adapters are trained

33

34 of 110

4.2.A – Optimization: Which weights?

Yes, change the pretrained weights!

Fine-tuning:

  • Pretrained weights are used as initialization for parameters of the downstream model
  • The whole pretrained architecture is trained during the adaptation phase

34

35 of 110

Hands-on #3:

Using Adapters and freezing

35

Image credit: Chanaky

36 of 110

  • Modifications:
    • add Adapters inside the backbone model: Linear ⇨ ReLU ⇨ Linear, with a skip-connection
  • As previously:
    • add a linear layer on top
    • use an additional embedding (classification token) at the bottom

Hands-on: Model adaptation

36

Second adaptation scheme: Using Adapters

  • Houlsby et al., ICML 2019

We will only train the adapters, the added linear layer and the embeddings. The other parameters of the model will be frozen.

37 of 110

Let’s adapt our model architecture

Add the adapter modules:

Bottleneck layers with 2 linear layers and a non-linear activation function (ReLU)

Hidden dimension is small: e.g. 32, 64, 256

Inherit from our pretrained model to have all the modules.

The Adapters are inserted inside skip-connections after:

  • the attention module
  • the feed-forward module
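A minimal adapter module in this spirit (the hidden and bottleneck dimensions are examples):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection, ReLU, up-projection, inside a skip-connection."""
    def __init__(self, hidden_dim, bottleneck_dim=32):     # small hidden dimension, e.g. 32/64/256
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.activation = nn.ReLU()

    def forward(self, x):
        # Skip-connection: the adapter only learns a small residual correction.
        return x + self.up(self.activation(self.down(x)))
```

In the hands-on, such a module is applied inside the skip-connections that follow the attention and feed-forward blocks.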

Hands-on: Model adaptation

37

38 of 110

Now we need to freeze the portions of our model we don’t want to train.

We just indicate that no gradient is needed by setting param.requires_grad to False for the frozen parameters:

In our case we will train 25% of the parameters. The model is small and deep (many adapters) and we need to train the embeddings, so the ratio stays quite high. For a larger model this ratio would be much lower.
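A sketch of the freezing step; the substrings used to identify the trainable modules are assumptions about the parameter names:

```python
def freeze_except(model, trainable_patterns=("adapter", "classification_head", "embed")):
    """Set requires_grad=False for every parameter whose name matches none of the patterns."""
    for name, param in model.named_parameters():
        param.requires_grad = any(p in name for p in trainable_patterns)
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    n_total = sum(p.numel() for p in model.parameters())
    return n_trainable / n_total        # ratio of trained parameters (~25% in our case)
```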

Hands-on: Model adaptation

38

39 of 110

Results are similar to the full fine-tuning case, with the advantage of training only 25% of the full model parameters.

For a small 50M-parameter model this method is overkill ⇨ it is intended for 300M–1.5B parameter models.

We use a hidden dimension of 32 for the adapters and a learning rate ten times higher for the fine-tuning (we have added quite a lot of newly initialized parameters to train from scratch).

Hands-on: Model adaptation

39

40 of 110

4.2.B – Optimization: What schedule?

We have decided which weights to update, but in which order and how should we update them?

Motivation: We want to avoid overwriting useful pretrained information and maximize positive transfer.

Related concept: Catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999): when a model forgets the task it was originally trained on.

40

Image credit: Markus

41 of 110

4.2.B – Optimization: What schedule?

A guiding principle: update from top to bottom

  • Progressively in time: freezing
  • Progressively in intensity: Varying the learning rates
  • Progressively vs. the pretrained model: Regularization

41

42 of 110

4.2.B – Optimization: Freezing

Main intuition: Training all layers at the same time on data of a different distribution and task may lead to instability and poor solutions.

Solution: Train layers individually to give them time to adapt to the new task and data.

Goes back to layer-wise training of early deep neural networks (Hinton et al., 2006; Bengio et al., 2007).

42

43 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)

43

44 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer

44

45 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time

45

46 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time

46

47 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time

47

48 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time
    • Train all layers

48

49 of 110

4.2.B – Optimization: Freezing

49

50 of 110

4.2.B – Optimization: Freezing

50

51 of 110

4.2.B – Optimization: Freezing

51

52 of 110

4.2.B – Optimization: Freezing

52

53 of 110

4.2.B – Optimization: Freezing

53

54 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
  • Gradually unfreezing (Howard & Ruder, ACL 2018): unfreeze one layer after another
  • Sequential unfreezing (Chronopoulou et al., NAACL 2019): hyper-parameters that determine length of fine-tuning
    • Fine-tune the additional parameters for a number of epochs
    • Fine-tune the pretrained parameters without the embedding layer for a number of epochs

54

55 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
  • Gradually unfreezing (Howard & Ruder, ACL 2018): unfreeze one layer after another
  • Sequential unfreezing (Chronopoulou et al., NAACL 2019): hyper-parameters that determine length of fine-tuning
    • Fine-tune the additional parameters for a number of epochs
    • Fine-tune the pretrained parameters without the embedding layer for a number of epochs
    • Train all layers until convergence

55

56 of 110

4.2.B – Optimization: Freezing

Commonality: Train all parameters jointly in the end

56

57 of 110

Hands-on #4:

Using gradual unfreezing

57

Image credit: Chanaky

58 of 110

Gradual unfreezing is similar to our previous freezing process. We start by freezing the entire model except the newly added parameters:

We then gradually unfreeze one additional block as training progresses, so that the full model is trained at the end:

Find index of layer to unfreeze

Name pattern matching

Unfreezing interval
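A sketch of the unfreezing logic, with an assumed naming pattern for the Transformer blocks:

```python
def unfreeze_next_block(model, epoch, n_blocks, unfreezing_interval=1, pattern="layers.{}."):
    """Every `unfreezing_interval` epochs, unfreeze one more block, from top to bottom.

    `pattern` is an assumed naming scheme for the blocks (parameter names
    containing "layers.11.", "layers.10.", ... for a 12-block model).
    """
    n_unfrozen = min(n_blocks, epoch // unfreezing_interval + 1)
    # Indices of the blocks to unfreeze, starting from the top layer.
    to_unfreeze = {pattern.format(i) for i in range(n_blocks - n_unfrozen, n_blocks)}
    for name, param in model.named_parameters():
        if any(block in name for block in to_unfreeze):
            param.requires_grad = True
```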

Hands-on: Adaptation

58

59 of 110

Gradual unfreezing has not been investigated in detail for Transformer models ⇨ no specific hyper-parameters are advocated in the literature

Residual connections may have an impact on the method

⇨ should probably adapt LSTM hyper-parameters

Hands-on: Adaptation

59

We show simple experiments in the Colab. Better hyper-parameter settings can probably be found.

60 of 110

4.2.B – Optimization: Learning rates

Main idea: Use lower learning rates to avoid overwriting useful information.

Where and when?

  • Lower layers (capture general information)
  • Early in training (model still needs to adapt to target distribution)
  • Late in training (model is close to convergence)

60

61 of 110

4.2.B – Optimization: Learning rates

  • Discriminative fine-tuning (Howard & Ruder, ACL 2018)
    • Lower layers capture general information → use lower learning rates for lower layers

61

62 of 110

4.2.B – Optimization: Learning rates

  • Discriminative fine-tuning
  • Triangular learning rates (Howard & Ruder, ACL 2018)
    • Quickly move to a suitable region, then slowly converge over time

62

63 of 110

4.2.B – Optimization: Learning rates

  • Discriminative fine-tuning
  • Triangular learning rates (Howard & Ruder, ACL 2018)
    • Quickly move to a suitable region, then slowly converge over time
    • Also known as “learning rate warm-up”
    • Used e.g. in Transformer (Vaswani et al., NIPS 2017) and Transformer-based methods (BERT, GPT)
    • Facilitates optimization; easier to escape suboptimal local minima

63

64 of 110

4.2.B – Optimization: Regularization

Main idea: minimize catastrophic forgetting by encouraging the target model parameters to stay close to the pretrained model parameters using a regularization term.

64

65 of 110

4.2.B – Optimization: Regularization

  • Simple method: regularize the new parameters not to deviate too much from the pretrained ones (Wiese et al., CoNLL 2017):
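The exact formula on the slide is not reproduced here; one common form of this penalty (an assumption), writing $\theta$ for the target model parameters and $\theta_{\text{pretrained}}$ for the pretrained ones, is a simple L2 term:

$$\Omega = \lambda \,\lVert \theta - \theta_{\text{pretrained}} \rVert_2^2$$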

65

66 of 110

4.2.B – Optimization: Regularization

  • More advanced (elastic weight consolidation; EWC): focus on parameters that are important for the pretrained task based on the Fisher information matrix (Kirkpatrick et al., PNAS 2017):
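In its usual form (an assumption about the exact expression shown), the EWC penalty weights each parameter by its diagonal Fisher information $F_i$ estimated on the pretraining task:

$$\Omega = \sum_i \frac{\lambda}{2}\, F_i \,\big(\theta_i - \theta^{*}_{\text{pretrained},\,i}\big)^2$$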

66

67 of 110

4.2.B – Optimization: Regularization

EWC has downsides in continual learning:

  • May over-constrain parameters
  • Computational cost is linear in the number of tasks (Schwarz et al., ICML 2018)

67

68 of 110

4.2.B – Optimization: Regularization

  • If tasks are similar, we may also encourage source and target predictions to be close based on cross-entropy, similar to distillation:
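One way to write this regularizer (an assumption about the exact formula shown), using the source model's predicted distribution as a soft target for the target model:

$$\Omega = - \sum_{c} p_{\text{source}}(c \mid x)\, \log p_{\text{target}}(c \mid x)$$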

68

69 of 110

Hands-on #5:

Using discriminative learning

69

Image credit: Chanaky

70 of 110

Discriminative learning rates can be implemented in two steps in our example:

First we organize the parameters of the various layers into labelled parameter groups in the optimizer:

We can then compute the learning rate of each group depending on its label (at each training iteration):

Hyper-parameter
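A sketch of both steps (the group names and the per-layer decay factor are illustrative choices, not the Colab's exact values):

```python
import torch

def build_param_groups(model, base_lr=6.5e-5):
    """Step 1: one labelled parameter group per top-level module (e.g. 'embeddings', 'layer_0', ...)."""
    return [{"params": module.parameters(), "lr": base_lr, "name": name}
            for name, module in model.named_children()]

def set_discriminative_lrs(optimizer, base_lr, decay=2.6, n_layers=12):
    """Step 2: at each iteration, lower layers get a learning rate divided by `decay` per layer below the top."""
    for group in optimizer.param_groups:
        if group["name"].startswith("layer_"):
            depth_from_top = n_layers - 1 - int(group["name"].split("_")[-1])
            group["lr"] = base_lr / (decay ** depth_from_top)

# Usage: optimizer = torch.optim.Adam(build_param_groups(model))
```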

Hands-on: Model adaptation

70

71 of 110

4.2.C – Optimization: Trade-offs

Several trade-offs when choosing which weights to update:

  • Space complexity: task-specific modifications, additional parameters, parameter reuse
  • Time complexity: training time
  • Performance

71

Image credit: Alfredo

72 of 110

4.2.C – Optimization trade-offs: Space

Feature extraction, fine-tuning and adapters are placed on three spectra: task-specific modifications (many ↔ few), additional parameters (many ↔ few) and parameter reuse (all ↔ none).

72

73 of 110

4.2.C – Optimization trade-offs: Time

Training time: feature extraction, fine-tuning and adapters compared on a slow ↔ fast spectrum.

73

74 of 110

4.2.C – Optimization trade-offs: Performance

  • Rule of thumb: if source and target tasks are dissimilar*, use feature extraction (Peters et al., 2019)
  • Otherwise, feature extraction and fine-tuning often perform similarly
  • Fine-tuning BERT on textual similarity tasks works significantly better than feature extraction
  • Adapters achieve performance competitive with fine-tuning
  • Anecdotally, Transformers are easier to fine-tune (less sensitive to hyper-parameters) than LSTMs

*dissimilar: certain capabilities (e.g. modelling inter-sentence relations) are beneficial for the target task, but the pretrained model lacks them (more on this later)

74

75 of 110

4.3 – Getting more signal

The target task is often a low-resource task. We can often improve the performance of transfer learning by combining a diverse set of signals:

  • From fine-tuning a single model on a single adaptation task… The basics: fine-tuning the model with a simple classification objective
  • … to gathering signal from other datasets and related tasks: fine-tuning with weak supervision, multi-tasking and sequential adaptation
  • … to ensembling models: combining the predictions of several fine-tuned models

75

Image credit: Naveen

76 of 110

4.3.A – Getting more signal: Basic fine-tuning

Simple example of fine-tuning on a text classification task:

  • Extract a single fixed-length vector from the model: hidden state of the first/last token or mean/max of the hidden states
  • Project to the classification space with an additional classifier
  • Train with a classification objective

76

77 of 110

4.3.B – Getting more signal: Related datasets/tasks

  • Sequential adaptation: intermediate fine-tuning on related datasets and tasks
  • Multi-task fine-tuning with related tasks: such as NLI tasks in GLUE
  • Dataset slicing: when the model consistently underperforms on particular slices of the data
  • Semi-supervised learning: use unlabelled data to improve model consistency

77

78 of 110

4.3.B – Getting more signal: Sequential adaptation

Fine-tuning on related high-resource dataset

  • Fine-tune model on related task with more data

78

79 of 110

4.3.B – Getting more signal: Sequential adaptation

Fine-tuning on related high-resource dataset

  • Fine-tune model on related task with more data
  • Fine-tune model on target task
  • Helps particularly for tasks with limited data and similar tasks (Phang et al., 2018)
  • Improves sample complexity on target task (Yogatama et al., 2019)

79

80 of 110

4.3.B – Getting more signal: Multi-task fine-tuning

Fine-tune the model jointly on related tasks

  • For each optimization step, sample a task and a batch for training.
  • Train via multi-task learning for a couple of epochs.

80

81 of 110

4.3.B – Getting more signal: Multi-task fine-tuning

Fine-tune the model jointly on related tasks

  • For each optimization step, sample a task and a batch for training.
  • Train via multi-task learning for a couple of epochs.
  • Fine-tune on the target task only for a few epochs at the end.

81

82 of 110

4.3.B – Getting more signal: Multi-task fine-tuning

Fine-tune the model with an unsupervised auxiliary task

  • Language modelling is a related task!
  • Fine-tuning the LM helps adapt the pretrained parameters to the target dataset.
  • Helps even without pretraining (Rei et al., ACL 2017)
  • Can optionally anneal ratio (Chronopoulou et al., NAACL 2019)
  • Used as a separate step in ULMFiT

82

83 of 110

4.3.B – Getting more signal: Dataset slicing

Use auxiliary heads that are trained only on particular subsets of the data

  • Analyze errors of the model
  • Use heuristics to automatically identify challenging subsets of the training data
  • Train auxiliary heads jointly with main head

See also Massive Multi-task Learning with Snorkel MeTaL

83

84 of 110

4.3.B – Getting more signal: Semi-supervised learning

Can be used to make model predictions more consistent using unlabelled data

  • Main idea: Minimize distance between predictions on original input and perturbed input
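A minimal consistency-loss sketch (assuming `model` returns class logits and `perturb` is any perturbation function):

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlabeled, perturb):
    """Minimize the distance between predictions on the original and the perturbed input."""
    with torch.no_grad():                                   # original prediction used as the target
        p_original = F.softmax(model(x_unlabeled), dim=-1)
    log_p_perturbed = F.log_softmax(model(perturb(x_unlabeled)), dim=-1)
    # KL divergence is one possible distance between the two predictions.
    return F.kl_div(log_p_perturbed, p_original, reduction="batchmean")
```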

84

85 of 110

4.3.B – Getting more signal: Semi-supervised learning

Can be used to make model predictions more consistent using unlabelled data

  • Perturbation can be noise, masking (Clark et al., EMNLP 2018), data augmentation, e.g. back-translation (Xie et al., 2019)

85

86 of 110

4.3.C – Getting more signal: Ensembling

Reaching the state-of-the-art by ensembling independently fine-tuned models

  • Ensembling models: combining the predictions of models fine-tuned with various hyper-parameters
  • Knowledge distillation: distill an ensemble of fine-tuned models into a single smaller model

86

87 of 110

4.3.C – Getting more signal: Ensembling

Models fine-tuned...

  • on different tasks
  • on different dataset-splits
  • with different parameters (dropout, initializations…)
  • from variants of pre-trained models (e.g. cased/uncased)

87

Combining the predictions of models fine-tuned with various hyper-parameters.

88 of 110

4.3.C – Getting more signal: Distilling

  • Knowledge distillation: train a student model on soft targets produced by the teacher (the ensemble)
  • Relative probabilities of the teacher labels contain information about how the teacher generalizes

88

Distilling ensembles of large models back in a single model
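A standard form of the distillation objective (following Hinton et al., 2015; the temperature is a hyper-parameter):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Train the student on the teacher's soft targets."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Cross-entropy between teacher and student distributions,
    # scaled by T^2 to keep gradient magnitudes comparable (Hinton et al., 2015).
    return -(soft_targets * log_probs).sum(dim=-1).mean() * temperature ** 2
```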

89 of 110

Hands-on #6:

Using multi-task learning

89

Image credit: Chanaky

90 of 110

Multitasking with a classification loss + language modeling loss.

Create two heads:

– language modeling head

– classification head

Total loss is a weighted sum of

– language modeling loss and

– classification loss
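A sketch of the combined objective (the criteria are ordinary cross-entropy losses; the coefficients match the ones mentioned on the next slide):

```python
import torch

def multitask_loss(lm_logits, lm_labels, clf_logits, clf_labels,
                   lm_criterion, clf_criterion, lm_coef=0.5, clf_coef=1.0):
    lm_loss = lm_criterion(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1))
    clf_loss = clf_criterion(clf_logits, clf_labels)
    # Total loss is a weighted sum of the language modeling and classification losses.
    return clf_coef * clf_loss + lm_coef * lm_loss
```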

Hands-on: Multi-task learning

90

91 of 110

Multi-tasking helped us improve over single-task full-model fine-tuning!

We use a coefficient of 1.0 for the classification loss and 0.5 for the language modeling loss, and fine-tune a little longer (6 epochs instead of 3, as the validation loss was still decreasing).

Hands-on: Multi-task learning

91

92 of 110

Agenda

92

93 of 110

5. Downstream applications

Hands-on examples

93

Image credit: Fahmi

94 of 110

5. Downstream applications - Hands-on examples

In this section we will explore downstream applications and practical considerations along two orthogonal directions:

  • What are the various applications of transfer learning in NLP: document/sequence classification, token-level classification, structured prediction and language generation
  • How to leverage several frameworks & libraries for practical applications: TensorFlow, PyTorch, Keras and third-party libraries like fast.ai, HuggingFace...

94

95 of 110

Practical considerations

Frameworks & libraries: practical considerations

95

  • Pretraining large-scale models is costly: use open-source models and share your pretrained models

“Energy and Policy Considerations for Deep Learning in NLP” - Strubell, Ganesh, McCallum - ACL 2019

  • Sharing/accessing pretrained models
    • Hubs: Tensorflow Hub, PyTorch Hub
    • Author-released checkpoints: e.g. BERT, GPT...
    • Third-party libraries: AllenNLP, fast.ai, HuggingFace
  • Design considerations
    • Hubs/libraries:
      • Simple to use but can be difficult to modify model internal architecture
    • Author released checkpoints:
      • More difficult to use but you have full control over the model internals

96 of 110

5. Downstream applications - Hands-on examples

  • Sequence and document level classification. Hands-on: document level classification (fast.ai)
  • Token level classification. Hands-on: question answering (Google BERT & TensorFlow/TF Hub)
  • Language generation. Hands-on: dialog generation (OpenAI GPT & HuggingFace/PyTorch Hub)

96

Icons credits: David, Susannanova, Flatart, ProSymbols

97 of 110

5.A – Sequence & document level classification

Transfer learning for document classification using the fast.ai library.

  • Target task: IMDB, a binary sentiment classification dataset containing 25k highly polar movie reviews for training, 25k for testing and additional unlabeled data. http://ai.stanford.edu/~amaas/data/sentiment/
  • Fast.ai has in particular:
    • a pre-trained English model available for download
    • a standardized data block API
    • easy access to standard datasets like IMDB
  • Fast.ai is based on PyTorch

97

98 of 110

fast.ai provides many high-level APIs out of the box for vision, text, tabular data and collaborative filtering.

DataBunch for the language model and the classifier

Load IMDB dataset & inspect it.

Load an AWD-LSTM (Merity et al., 2017) pretrained on WikiText-103 & fine-tune it on IMDB using the language modeling loss.

fast.ai then provides all the high-level modules needed to quickly set up a transfer learning experiment.

5.A – Document level classification using fast.ai

98

The library is designed for speed of experimentation, e.g. by importing all necessary modules at once in interactive computing environments, like:

99 of 110

Now we fine-tune in two steps:

Once we have a fine-tuned language model (AWD-LSTM), we can create a text classifier by adding a classification head with:

– a layer to concatenate the final outputs of the RNN with the maximum and average of all the intermediate outputs (along the sequence length)

– two blocks of nn.BatchNorm1d ⇨ nn.Dropout ⇨ nn.Linear ⇨ nn.ReLU with a hidden dimension of 50.

5.A – Document level classification using fast.ai

99

1. train the classification head only while keeping the language model frozen, and

2. fine-tune the whole architecture.
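The whole recipe, sketched with the fastai v1 text API of the time (the dataset helper and hyper-parameters here are placeholders, not the notebook's exact values):

```python
from fastai.text import *   # fastai v1-style star import

path = untar_data(URLs.IMDB_SAMPLE)              # small IMDB sample for a quick run
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', vocab=data_lm.train_ds.vocab)

# Fine-tune the WikiText-103 pretrained AWD-LSTM on IMDB with the LM objective.
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save_encoder('ft_enc')

# 1. Train the classification head only (the encoder stays frozen by default).
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('ft_enc')
learn_clas.fit_one_cycle(1, 1e-2)

# 2. Fine-tune the whole architecture, with discriminative learning rates.
learn_clas.unfreeze()
learn_clas.fit_one_cycle(1, slice(1e-5, 1e-3))
```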

100 of 110

5.B – Token level classification: BERT & Tensorflow

Transfer learning for token level classification: Google’s BERT in TensorFlow.

  • Target task: SQuAD, a question answering dataset. https://rajpurkar.github.io/SQuAD-explorer/
  • In this example we will directly use a Tensorflow checkpoint
    • Example: https://github.com/google-research/bert
    • We use the usual Tensorflow workflow: create model graph comprising the core model and the added/modified elements
    • Take care of variable assignments when loading the checkpoint

100

101 of 110

Let’s adapt BERT to the target task.

Replace the pre-training head (language modeling) with a classification head:

a linear projection layer to estimate 2 probabilities for each token:

– being the start of an answer

– being the end of an answer.

Keep our core model unchanged.

5.B – SQuAD with BERT & Tensorflow

101

102 of 110

Load our pretrained checkpoint

To load our checkpoint, we just need to set up an assignment_map from the variables of the checkpoint to the model variables, keeping only the variables present in the model.

And we can use

tf.train.init_from_checkpoint
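A sketch of this checkpoint-loading step with the TF 1.x API (the checkpoint path is a placeholder, and the graph is assumed to already contain the BERT variables plus our new SQuAD head):

```python
import tensorflow as tf   # TF 1.x, as in the original BERT repository

init_checkpoint = "/path/to/bert_model.ckpt"      # placeholder path

tvars = tf.trainable_variables()
ckpt_var_names = {name for name, _ in tf.train.list_variables(init_checkpoint)}

# Map each checkpoint variable to the graph variable with the same name,
# keeping only the variables that actually exist in the model.
assignment_map = {name: name
                  for name in (v.name.split(":")[0] for v in tvars)
                  if name in ckpt_var_names}

tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
```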

5.B – SQuAD with BERT & Tensorflow

102

103 of 110

TensorFlow-Hub

TensorFlow Hub is a library for sharing machine learning models as self-contained pieces of TensorFlow graph with their weights and assets.

Working directly with TensorFlow requires you to have access to, and include in your code, the full code of the pretrained model.

Modules are automatically downloaded and cached when instantiated.

Each time a module m is called e.g. y = m(x), it adds operations to the current TensorFlow graph to compute y from x.
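A minimal TF Hub example in this TF 1.x style (the module handle below is just one example of a text-embedding module):

```python
import tensorflow as tf
import tensorflow_hub as hub

# The module is downloaded and cached when instantiated.
m = hub.Module("https://tfhub.dev/google/nnlm-en-dim128/1")

# Calling the module adds ops to the current graph to compute y from x.
x = tf.constant(["The quick brown fox", "jumped over the lazy dog"])
y = m(x)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(y).shape)    # (2, 128)
```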

5.B – SQuAD with BERT & Tensorflow

103

104 of 110

TensorFlow Hub hosts a nice selection of pretrained models for NLP.

TensorFlow Hub can also be used with Keras, in the same way we saw in the BERT example.

The main limitations of Hubs are:

  • No access to the source code of the model (black-box)
  • Not possible to modify the internals of the model (e.g. to add Adapters)

5.B – SQuAD with BERT & Tensorflow

104

105 of 110

5.C – Language Generation: OpenAI GPT & PyTorch

Transfer learning for language generation: OpenAI GPT and HuggingFace library.

  • Target task: ConvAI2 – The 2nd Conversational Intelligence Challenge for training and evaluating models for non-goal-oriented dialogue systems, i.e. chit-chat. http://convai.io
  • HuggingFace library of pretrained models
    • a repository of large scale pre-trained models with BERT, GPT, GPT-2, Transformer-XL
    • provides an easy way to download, instantiate and train pre-trained models in PyTorch
  • HuggingFace’s models are now also accessible using PyTorch Hub

105

106 of 110

A dialog generation task:

5.C – Chit-chat with OpenAI GPT & PyTorch

106

Language generation tasks are close to the language modeling pre-training objective, but:

  • Language modeling pre-training involves a single input: a sequence of words.
  • In a dialog setting, several types of context are provided to generate an output sequence:
    • knowledge base: persona sentences,
    • history of the dialog: at least the last utterance from the user,
    • tokens of the output sequence that have already been generated.

How should we adapt the model?

107 of 110

Golovanov, Kurbanov, Nikolenko, Truskovskyi, Tselousov and Wolf, ACL 2019

5.C – Chit-chat with OpenAI GPT & PyTorch

107

Several options:

  • Duplicate the model to initialize an encoder-decoder structure (e.g. Lample & Conneau, 2019)
  • Use a single model with concatenated inputs (see e.g. Wolf et al., 2019; Khandelwal et al., 2019)

Concatenate the various contexts separated by delimiters and add position and segment embeddings.

108 of 110

5.C – Chit-chat with OpenAI GPT & PyTorch

108

Let’s import pretrained versions of OpenAI GPT tokenizer and model.

Now most of the work is about preparing the inputs for the model.

Then train our model using the pretraining language modeling objective.

And add a few new tokens to the vocabulary

We organize the contexts in segments

Add delimiter at the extremities of the segments

And build our word, position and segment inputs for the model.
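A toy sketch of this input construction (the token strings, delimiters and segment assignment are illustrative; the Colab works with token ids):

```python
# Hypothetical contexts for one training example.
persona = ["i", "like", "playing", "football", "."]
history = ["hello", "how", "are", "you", "?"]
reply   = ["i", "am", "fine", "thanks", "."]

bos, eos, speaker1, speaker2 = "<bos>", "<eos>", "<speaker1>", "<speaker2>"

# Concatenate the contexts, separated by delimiter tokens.
words = [bos] + persona + [speaker1] + history + [speaker2] + reply + [eos]
# One segment token per input token, marking which context it belongs to
# (the exact segment scheme here is an illustrative choice).
segments = ([speaker2] * (len(persona) + 1)
            + [speaker1] * (len(history) + 1)
            + [speaker2] * (len(reply) + 2))
positions = list(range(len(words)))
assert len(words) == len(segments) == len(positions)
```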

109 of 110

5.C – Chit-chat with OpenAI GPT & PyTorch

109

PyTorch Hub

Last Friday, the PyTorch team soft-launched a beta version of PyTorch Hub. Let’s have a quick look.

  • PyTorch Hub is based on GitHub repositories
  • A model is shared by adding a hubconf.py script to the root of a GitHub repository
  • Both model definitions and pre-trained weights can be shared
  • More details: https://pytorch.org/hub and https://pytorch.org/docs/stable/hub.html

In our case, to use torch.hub instead of pytorch-pretrained-bert, we can simply call torch.hub.load with the path to the pytorch-pretrained-bert GitHub repository:
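For example (the entry-point names below are assumptions; torch.hub.list shows the ones actually exposed by the repository's hubconf.py):

```python
import torch

# List the entry points exposed by the repository.
print(torch.hub.list('huggingface/pytorch-pretrained-BERT'))

# Load a tokenizer and a model through those entry points
# (replace 'bertTokenizer' / 'bertModel' with the ones you need).
tokenizer = torch.hub.load('huggingface/pytorch-pretrained-BERT',
                           'bertTokenizer', 'bert-base-cased', do_lower_case=False)
model = torch.hub.load('huggingface/pytorch-pretrained-BERT',
                       'bertModel', 'bert-base-cased')
```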

PyTorch Hub will fetch the model from the master branch on GitHub. This means that you don’t need to package your model (pip) & users will always access the most recent version (master).

110 of 110

That’s all for this time

110

Image credit: Andrejs Kirma