Transfer Learning in NLP
NLPL Winter School
Thomas Wolf - HuggingFace Inc.
1
Overview
Many slides are adapted from a Tutorial on Transfer Learning in NLP I gave at NAACL 2019 with my amazing collaborators
2
Sebastian Ruder
Matthew Peters
Swabha Swayamdipta
Transfer Learning in NLP
NLPL Winter School
Session 2
3
Transfer Learning in Natural Language Processing
Transfer Learning in NLP
4
Follow along with the tutorial:
Agenda
5
[1] Introduction
[2] Pretraining
[4] Adaptation
[5] Downstream
[6]
4. Adaptation
6
Image credit: Ben Didier
4 – How to adapt the pretrained model
Several orthogonal directions we can make decisions on:
4.1 – Architecture
Two general options:
8
Image credit: Darmawansyah
4.1.A – Architecture: Keep model unchanged
General workflow:
9
4.1.A – Architecture: Keep model unchanged
General workflow:
Task-specific, randomly initialized
General, pretrained
10
4.1.A – Architecture: Keep model unchanged
General workflow:
11
4.1.B – Architecture: Modifying model internals
Various reasons:
12
4.1.B – Architecture: Modifying model internals
Various reasons:
13
4.1.B – Architecture: Modifying model internals
Various reasons:
14
4.1.B – Architecture: Modifying model internals
Adapters
15
Image credit: Caique Lima
4.1.B – Architecture: Modifying model internals
Adapters (Stickland & Murray, ICML 2019)
16
Hands-on #2:
Adapting our pretrained model
17
Image credit: Chanaky
Hands-on: Model adaptation
18
Let’s see how a simple fine-tuning scheme can be implemented with our pretrained model:
Adaptation task
Hands-on: Model adaptation
19
Ex:
Transfer learning models shine on this type of low-resource task
Hands-on: Model adaptation
20
First adaptation scheme
Let’s load and prepare our dataset:
Fine-tuning hyper-parameters:
– 6 classes in TREC-6
– Use fine-tuning hyper-parameters from Radford et al., 2018:
Hands-on: Model adaptation
21
- trim to the transformer input size & add a classification token at the end of each sample,
- pad to the left,
- convert to tensors,
- extract a validation set.
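A minimal sketch of these preparation steps (names like tokenizer, texts, labels and the special-token ids are hypothetical placeholders, not the exact hands-on code):

```python
import torch
from torch.utils.data import TensorDataset, random_split

max_len = 256                               # transformer input size (illustrative value)
clf_token_id = tokenizer.vocab['[CLF]']     # classification token id (assumption)
pad_token_id = tokenizer.vocab['[PAD]']     # padding token id (assumption)

def encode(text):
    ids = tokenizer.encode(text)[:max_len - 1]   # trim to the transformer input size
    return ids + [clf_token_id]                  # add the classification token at the end

encoded = [encode(t) for t in texts]
inputs = torch.full((len(encoded), max_len), pad_token_id, dtype=torch.long)
for i, ids in enumerate(encoded):
    inputs[i, -len(ids):] = torch.tensor(ids)    # pad on the left

dataset = TensorDataset(inputs, torch.tensor(labels))   # convert to tensors
valid_size = int(0.1 * len(dataset))                    # extract a validation set
train_set, valid_set = random_split(dataset, [len(dataset) - valid_size, valid_size])
```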
Adapt our model architecture
Replace the pre-training head (language modeling) with the classification head:
A linear layer, which takes as input the hidden-state of the [CLF] token (using a mask)
Keep our pretrained model unchanged as the backbone.
* Initialize all the weights of the model.
Hands-on: Model adaptation
22
* Reload common weights from the pretrained model.
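A sketch of what this architecture adaptation can look like in PyTorch (Transformer, config and the module names are hypothetical placeholders):

```python
import torch.nn as nn

class TransformerWithClfHead(nn.Module):
    """Pretrained backbone kept unchanged + a task-specific, randomly initialized head."""
    def __init__(self, config, num_classes=6):            # 6 classes in TREC-6
        super().__init__()
        self.transformer = Transformer(config)             # general, pretrained backbone
        self.classification_head = nn.Linear(config.hidden_dim, num_classes)
        self.apply(self.init_weights)                       # initialize all the weights

    def init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)

    def forward(self, x, clf_tokens_mask, padding_mask=None):
        hidden_states = self.transformer(x, padding_mask)   # (seq_len, batch, hidden_dim)
        # hidden state of the [CLF] token, selected with the mask
        clf_hidden = (hidden_states * clf_tokens_mask.unsqueeze(-1).float()).sum(dim=0)
        return self.classification_head(clf_hidden)
```

The common weights can then be reloaded from the pretrained checkpoint, e.g. with load_state_dict(pretrained_state_dict, strict=False), so that only the new head stays randomly initialized.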
Our fine-tuning code:
We will evaluate on our validation and test sets:
* validation: after each epoch
* test: at the end
A simple training update function:
* prepare inputs: transpose and build padding & classification token masks
* we have options to clip and accumulate gradients
Schedule:
* linearly increasing to lr
* linearly decreasing to 0.0
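A simplified sketch of such an update function and schedule (hyper-parameter values are illustrative, and model, clf_token_id and pad_token_id are assumed from the previous steps):

```python
import torch
import torch.nn.functional as F
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Illustrative hyper-parameters in the spirit of Radford et al., 2018
lr, clip_grad, accumulation_steps, warmup, total_steps = 6.5e-5, 0.25, 1, 500, 10_000

optimizer = Adam(model.parameters(), lr=lr)
# Schedule: linearly increasing to lr during warm-up, then linearly decreasing to 0.0
scheduler = LambdaLR(optimizer, lambda step: step / warmup if step < warmup
                     else max(0.0, (total_steps - step) / (total_steps - warmup)))

def update(batch, step):
    """One training update: prepare inputs and masks, forward, backward, optimizer step."""
    model.train()
    inputs, labels = (t.to(device) for t in batch)
    inputs = inputs.transpose(0, 1).contiguous()             # transpose to (seq_len, batch)
    clf_tokens_mask = (inputs == clf_token_id)                 # classification token mask
    padding_mask = (inputs == pad_token_id)                    # padding mask
    logits = model(inputs, clf_tokens_mask, padding_mask)
    loss = F.cross_entropy(logits, labels) / accumulation_steps
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)   # option to clip gradients
    if (step + 1) % accumulation_steps == 0:                        # option to accumulate gradients
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    return loss.item()
```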
Hands-on: Model adaptation
23
We can now fine-tune our model on TREC:
We are at the state-of-the-art
(ULMFiT)
Remarks:
Hands-on: Model adaptation – Results
24
Let’s conclude this hands-on with a few additional words on robustness & variance.
Hands-on: Model adaptation – Results
25
4.2 – Optimization
Several directions when it comes to the optimization itself:
26
Image credit: ProSymbols, purplestudio, Markus, Alfredo
4.2.A – Optimization: Which weights?
The main question: To tune or not to tune (the pretrained weights)?
27
Image credit: purplestudio
4.2.A – Optimization: Which weights?
Don’t touch the pretrained weights!
Feature extraction:
28
❄️
4.2.A – Optimization: Which weights?
Don’t touch the pretrained weights!
Feature extraction:
29
❄️
4.2.A – Optimization: Which weights?
Don’t touch the pretrained weights!
Feature extraction:
30
❄️
4.2.A – Optimization: Which weights?
Don’t touch the pretrained weights!
Feature extraction:
31
4.2.A – Optimization: Which weights?
Don’t touch the pretrained weights!
Adapters
32
4.2.A – Optimization: Which weights?
Don’t touch the pretrained weights!
Adapters
33
4.2.A – Optimization: Which weights?
Yes, change the pretrained weights!
Fine-tuning:
34
Hands-on #3:
Using Adapters and freezing
35
Image credit: Chanaky
Hands-on: Model adaptation
36
Second adaptation scheme: Using Adapters
We will only train the adapters, the added linear layer and the embeddings. The other parameters of the model will be frozen.
Let’s adapt our model architecture
Add the adapter modules:
Bottleneck layers with 2 linear layers and a non-linear activation function (ReLU)
Hidden dimension is small: e.g. 32, 64, 256
Inherit from our pretrained model to have all the modules.
The Adapters are inserted inside the skip-connections, typically after the attention and feed-forward sub-layers.
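A sketch of such a bottleneck adapter module in PyTorch (module and argument names are assumptions, not the exact hands-on code):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection, non-linearity, up-projection."""
    def __init__(self, hidden_dim, adapter_dim=32):   # small bottleneck dimension: 32, 64, 256...
        super().__init__()
        self.down = nn.Linear(hidden_dim, adapter_dim)
        self.up = nn.Linear(adapter_dim, hidden_dim)
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.up(self.activation(self.down(x)))

# Inside a transformer block (sketch), the adapter output is added in the skip-connection:
#   hidden = hidden + adapter(sub_layer_output)
```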
Hands-on: Model adaptation
37
Now we need to freeze the portions of our model we don’t want to train.
We just indicate that no gradient is needed for the frozen parameters by setting param.requires_grad to False:
In our case we will train 25% of the parameters. The model is small & deep (many adapters) and we need to train the embeddings, so the ratio stays quite high. For a larger model this ratio would be a lot lower.
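A sketch of this freezing step (the name patterns used to select the adapters, head and embeddings are assumptions about how the modules were named):

```python
# Freeze everything except the adapters, the classification head and the embeddings
trained, total = 0, 0
for name, param in model.named_parameters():
    total += param.numel()
    if any(pattern in name for pattern in ('adapter', 'classification_head', 'embed')):
        trained += param.numel()
    else:
        param.requires_grad = False        # no gradient needed for the frozen parameters
print(f"Training {100 * trained / total:.1f}% of the parameters")
```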
Hands-on: Model adaptation
38
Results are similar to the full fine-tuning case, with the advantage of training only 25% of the full model parameters.
For a small 50M-parameter model this method is overkill ⇨ it is intended for 300M–1.5B-parameter models.
We use a hidden dimension of 32 for the adapters and a learning rate ten times higher for the fine-tuning (we have added quite a lot of newly initialized parameters to train from scratch).
Hands-on: Model adaptation
39
4.2.B – Optimization: What schedule?
We have decided which weights to update, but in which order and how should we update them?
Motivation: We want to avoid overwriting useful pretrained information and maximize positive transfer.
Related concept: Catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999): when a model forgets the task it was originally trained on.
40
Image credit: Markus
4.2.B – Optimization: What schedule?
A guiding principle: update from top to bottom
41
4.2.B – Optimization: Freezing
Main intuition: Training all layers at the same time on data of a different distribution and task may lead to instability and poor solutions.
Solution: Train layers individually to give them time to adapt to new task and data.
Goes back to layer-wise training of early deep neural networks (Hinton et al., 2006; Bengio et al., 2007).
42
4.2.B – Optimization: Freezing
43
4.2.B – Optimization: Freezing
44
4.2.B – Optimization: Freezing
45
4.2.B – Optimization: Freezing
46
4.2.B – Optimization: Freezing
47
4.2.B – Optimization: Freezing
48
4.2.B – Optimization: Freezing
49
4.2.B – Optimization: Freezing
50
4.2.B – Optimization: Freezing
51
4.2.B – Optimization: Freezing
52
4.2.B – Optimization: Freezing
53
4.2.B – Optimization: Freezing
54
4.2.B – Optimization: Freezing
55
4.2.B – Optimization: Freezing
Commonality: Train all parameters jointly in the end
56
Hands-on #4:
Using gradual unfreezing
57
Image credit: Chanaky
Gradual unfreezing is similar to our previous freezing process. We start by freezing the whole model except the newly added parameters:
We then gradually unfreeze an additional block along the training so that we train the full model at the end:
Find index of layer to unfreeze
Name pattern matching
Unfreezing interval
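A sketch of gradual unfreezing along these lines (the blocks.{i}. naming pattern, the number of blocks and the unfreezing interval are illustrative assumptions):

```python
import re

# Unfreeze one additional block (top-to-bottom) every `unfreezing_interval` epochs
num_blocks, unfreezing_interval = 12, 1

def gradually_unfreeze(model, epoch):
    # find the index of the lowest block to unfreeze at this epoch
    lowest_unfrozen = max(0, num_blocks - 1 - epoch // unfreezing_interval)
    for name, param in model.named_parameters():
        match = re.search(r'blocks\.(\d+)\.', name)      # name pattern matching
        if match is not None:                            # newly added parameters stay trainable
            param.requires_grad = int(match.group(1)) >= lowest_unfrozen
```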
Hands-on: Adaptation
58
Gradual unfreezing has not been investigated in detail for Transformer models ⇨ no specific hyper-parameters are advocated in the literature
Residual connections may have an impact on the method
⇨ the hyper-parameters advocated for LSTMs should probably be adapted
Hands-on: Adaptation
59
We show simple experiments in the Colab. Better hyper-parameter settings can probably be found.
4.2.B – Optimization: Learning rates
Main idea: Use lower learning rates to avoid overwriting useful information.
Where and when?
60
4.2.B – Optimization: Learning rates
61
4.2.B – Optimization: Learning rates
62
4.2.B – Optimization: Learning rates
63
4.2.B – Optimization: Regularization
Main idea: minimize catastrophic forgetting by encouraging the target model parameters to stay close to the pretrained model parameters using a regularization term.
64
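As an illustration (not the exact schemes discussed on the following slides), a simple L2 penalty pulling the fine-tuned weights towards their pretrained values could look like this:

```python
import torch

# Frozen copy of the pretrained parameters
pretrained_params = {name: p.detach().clone() for name, p in model.named_parameters()}

def regularization_term(model, reg_lambda=0.01):
    """L2 penalty keeping the fine-tuned weights close to the pretrained ones."""
    penalty = sum(((p - pretrained_params[name]) ** 2).sum()
                  for name, p in model.named_parameters() if p.requires_grad)
    return reg_lambda * penalty

# total loss (sketch): loss = task_loss + regularization_term(model)
```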
4.2.B – Optimization: Regularization
65
4.2.B – Optimization: Regularization
66
4.2.B – Optimization: Regularization
EWC has downsides in continual learning:
67
4.2.B – Optimization: Regularization
68
Hands-on #5:
Using discriminative learning
69
Image credit: Chanaky
Discriminative learning rates can be implemented in two steps in our example:
First we organize the parameters of the various layers in labelled parameter groups in the optimizer:
We can then compute the learning rate of each group depending on its label (at each training iteration):
Hyper-parameter
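A sketch of these two steps (base_lr, the decay factor and the blocks.{i}. naming are illustrative assumptions):

```python
import torch

# Illustrative assumptions: `num_blocks` transformer blocks named 'blocks.{i}.',
# `decay` is the per-layer decay factor (the hyper-parameter on this slide)
num_blocks, base_lr, decay = 12, 6.5e-5, 2.6

# Step 1: organize the parameters of the various layers in labelled parameter groups
parameter_groups = [{'params': [p for n, p in model.named_parameters() if f'blocks.{i}.' in n],
                     'label': i} for i in range(num_blocks)]
optimizer = torch.optim.Adam(parameter_groups, lr=base_lr)

# Step 2: at each training iteration, set the learning rate of each group from its label
# (the lower the layer, the smaller the learning rate)
for group in optimizer.param_groups:
    group['lr'] = base_lr / (decay ** (num_blocks - 1 - group['label']))
```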
Hands-on: Model adaptation
70
4.2.C – Optimization: Trade-offs
Several trade-offs when choosing which weights to update:
71
Image credit: Alfredo
4.2.C – Optimization trade-offs: Space
Task-specific modifications
Additional parameters
Parameter reuse
72
Many
Few
Feature extraction
Fine-tuning
Adapters
Many
Few
Feature extraction
Fine-tuning
Adapters
All
None
Feature extraction
Fine-tuning
Adapters
4.2.C – Optimization trade-offs: Time
Training time
73
Feature extraction
Fine-tuning
Adapters
Slow
Fast
4.2.C – Optimization trade-offs: Performance
*dissimilar: certain capabilities (e.g. modelling inter-sentence relations) are beneficial for the target task, but the pretrained model lacks them (see more later)
74
4.3 – Getting more signal
The target task is often a low-resource task. We can often improve the performance of transfer learning by combining a diverse set of signals:
75
Image credit: Naveen
4.3.A – Getting more signal: Basic fine-tuning
Simple example of fine-tuning on a text classification task:
76
4.3.B – Getting more signal: Related datasets/tasks
77
4.3.B – Getting more signal: Sequential adaptation
Fine-tuning on related high-resource dataset
78
4.3.B – Getting more signal: Sequential adaptation
Fine-tuning on related high-resource dataset
79
4.3.B – Getting more signal: Multi-task fine-tuning
Fine-tune the model jointly on related tasks
80
4.3.B – Getting more signal: Multi-task fine-tuning
Fine-tune the model jointly on related tasks
81
4.3.B – Getting more signal: Multi-task fine-tuning
Fine-tune the model with an unsupervised auxiliary task
82
4.3.B – Getting more signal: Dataset slicing
Use auxiliary heads that are trained only on particular subsets of the data
83
4.3.B – Getting more signal: Semi-supervised learning
Can be used to make model predictions more consistent using unlabelled data
84
4.3.B – Getting more signal: Semi-supervised learning
Can be used to make model predictions more consistent using unlabelled data
85
4.3.C – Getting more signal: Ensembling
Reaching the state-of-the-art by ensembling independently fine-tuned models
86
4.3.C – Getting more signal: Ensembling
Model fine-tuned...
87
Combining the predictions of models fine-tuned with various hyper-parameters.
4.3.C – Getting more signal: Distilling
88
Distilling ensembles of large models back into a single model
Hands-on #6:
Using multi-task learning
89
Image credit: Chanaky
Multitasking with a classification loss + language modeling loss.
Create two heads:
– language modeling head
– classification head
Total loss is a weighted sum of
– language modeling loss and
– classification loss
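A minimal sketch of this multi-task objective (the head and attribute names are hypothetical; the hands-on code selects the classification hidden state with a [CLF] mask rather than the last position):

```python
import torch.nn.functional as F

def multitask_loss(model, inputs, lm_labels, clf_labels, clf_coef=1.0, lm_coef=0.5):
    hidden_states = model.transformer(inputs)                    # shared pretrained backbone
    lm_logits = model.lm_head(hidden_states)                     # language modeling head
    clf_logits = model.classification_head(hidden_states[-1])    # classification head (simplified)
    lm_loss = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1))
    clf_loss = F.cross_entropy(clf_logits, clf_labels)
    return clf_coef * clf_loss + lm_coef * lm_loss               # weighted sum of the two losses
```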
Hands-on: Multi-task learning
90
Multi-tasking helped us improve over single-task full-model fine-tuning!
We use a coefficient of 1.0 for the classification loss and 0.5 for the language modeling loss, and fine-tune a little longer (6 epochs instead of 3, as the validation loss was still decreasing).
Hands-on: Multi-task learning
91
Agenda
92
[1] Introduction
[2] Pretraining
[4] Adaptation
[5] Downstream
[6]
5. Downstream applications - Hands-on examples
93
Image credit: Fahmi
5. Downstream applications - Hands-on examples
In this section we will explore downstream applications and practical considerations along two orthogonal directions:
94
Practical considerations
Frameworks & libraries: practical considerations
95
“Energy and Policy Considerations for Deep Learning in NLP” - Strubell, Ganesh, McCallum - ACL 2019
5. Downstream applications - Hands-on examples
96
Icons credits: David, Susannanova, Flatart, ProSymbols
5.A – Sequence & document level classification
Transfer learning for document classification using the fast.ai library.
97
fast.ai gives access to many high-level APIs out of the box for vision, text, tabular data and collaborative filtering.
DataBunch for the language model and the classifier
Load IMDB dataset & inspect it.
Load an AWD-LSTM (Merity et al., 2017) pretrained on WikiText-103 & fine-tune it on IMDB using the language modeling loss.
fast.ai also provides all the high-level modules needed to quickly set up a transfer learning experiment.
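A sketch with fastai v1's text API (dataset handling and hyper-parameters are illustrative):

```python
from fastai.text import *   # fastai convention: import everything needed for the text application

path = untar_data(URLs.IMDB)

# DataBunch for the language model and for the classifier
data_lm = TextLMDataBunch.from_folder(path)
data_clas = TextClasDataBunch.from_folder(path, vocab=data_lm.vocab, bs=32)

# AWD-LSTM pretrained on WikiText-103, fine-tuned on IMDB with the language modeling loss
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save_encoder('fine_tuned_encoder')
```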
5.A – Document level classification using fast.ai
98
The library is designed for speed of experimentation, e.g. by importing all necessary modules at once in interactive computing environments, like:
Now we fine-tune in two steps:
Once we have a fine-tuned language model (AWD-LSTM), we can create a text classifier by adding a classification head with:
– A layer to concatenate the final outputs of the RNN with the maximum and average of all the intermediate outputs (along the sequence length)
– Two blocks of nn.BatchNorm1d ⇨ nn.Dropout ⇨ nn.Linear ⇨ nn.ReLU with a hidden dimension of 50.
5.A – Document level classification using fast.ai
99
1. train the classification head only while keeping the language model frozen, and
2. fine-tune the whole architecture.
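A sketch of these two steps with fastai v1 (learning rates and epoch counts are illustrative):

```python
# Classifier = fine-tuned AWD-LSTM encoder + concat-pooling classification head
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('fine_tuned_encoder')

# 1. train the classification head only, keeping the language model frozen
learn_clas.fit_one_cycle(1, 2e-2)

# 2. fine-tune the whole architecture (with discriminative learning rates across layers)
learn_clas.unfreeze()
learn_clas.fit_one_cycle(2, slice(1e-3, 1e-2))
```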
5.B – Token level classification: BERT & Tensorflow
Transfer learning for token level classification: Google’s BERT in TensorFlow.
100
Let’s adapt BERT to the target task.
Replace the pre-training head (language modeling) with a classification head:
a linear projection layer to estimate 2 probabilities for each token:
– being the start of an answer
– being the end of an answer.
Keep our core model unchanged.
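A sketch of this head in TensorFlow 1.x, in the spirit of the official BERT SQuAD code (final_hidden is assumed to be the sequence output of the BERT core model):

```python
import tensorflow as tf

# final_hidden: sequence output of BERT, shape (batch_size, seq_length, hidden_size)
hidden_size = final_hidden.shape[-1].value
batch_size = tf.shape(final_hidden)[0]
seq_length = tf.shape(final_hidden)[1]

# A single linear projection producing 2 logits per token:
# the start-of-answer score and the end-of-answer score
output_weights = tf.get_variable(
    "cls/squad/output_weights", [2, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
    "cls/squad/output_bias", [2], initializer=tf.zeros_initializer())

final_hidden_flat = tf.reshape(final_hidden, [-1, hidden_size])
logits = tf.matmul(final_hidden_flat, output_weights, transpose_b=True) + output_bias
logits = tf.reshape(logits, [batch_size, seq_length, 2])
start_logits, end_logits = tf.unstack(logits, axis=2)   # one score per token each
```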
5.B – SQuAD with BERT & Tensorflow
101
Load our pretrained checkpoint
To load our checkpoint, we just need to set up an assignment_map from the variables of the checkpoint to the model variables, keeping only the variables present in the model.
And we can use tf.train.init_from_checkpoint.
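A sketch of this checkpoint-loading step (the checkpoint path is a placeholder):

```python
import tensorflow as tf

init_checkpoint = "uncased_L-12_H-768_A-12/bert_model.ckpt"   # placeholder path

# Variables currently defined in our model, indexed by name (without the ':0' suffix)
model_variables = {v.name.split(':')[0]: v for v in tf.trainable_variables()}

# Keep only the checkpoint variables that also exist in the model
assignment_map = {name: name for name, _ in tf.train.list_variables(init_checkpoint)
                  if name in model_variables}

tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
```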
5.B – SQuAD with BERT & Tensorflow
102
TensorFlow-Hub
TensorFlow Hub is a library for sharing machine learning models as self-contained pieces of TensorFlow graph with their weights and assets.
Working directly with TensorFlow requires having access to (and including in your code) the full code of the pretrained model.
Modules are automatically downloaded and cached when instantiated.
Each time a module m is called e.g. y = m(x), it adds operations to the current TensorFlow graph to compute y from x.
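A minimal TF-Hub example with a text-embedding module (the module URL is just an illustration; BERT modules follow the same pattern):

```python
import tensorflow as tf
import tensorflow_hub as hub

# The module is downloaded and cached the first time it is instantiated
embed = hub.Module("https://tfhub.dev/google/nnlm-en-dim128/1")

# Each call m(x) adds operations to the current graph to compute y from x
sentences = tf.placeholder(tf.string, shape=[None])
embeddings = embed(sentences)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(embeddings, feed_dict={sentences: ["transfer learning in nlp"]}))
```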
5.B – SQuAD with BERT & Tensorflow
103
TensorFlow Hub hosts a nice selection of pretrained models for NLP.
TensorFlow Hub can also be used with Keras, just as we saw in the BERT example.
The main limitations of Hubs are:
5.B – SQuAD with BERT & Tensorflow
104
5.C – Language Generation: OpenAI GPT & PyTorch
Transfer learning for language generation: OpenAI GPT and HuggingFace library.
105
A dialog generation task:
5.C – Chit-chat with OpenAI GPT & PyTorch
106
Language generation tasks are close to the language modeling pre-training objective, but:
How should we adapt the model?
Golovanov, Kurbanov, Nikolenko, Truskovskyi, Tselousov and Wolf, ACL 2019
5.C – Chit-chat with OpenAI GPT & PyTorch
107
Several options:
Concatenate the various contexts separated by delimiters and add position and segment embeddings
5.C – Chit-chat with OpenAI GPT & PyTorch
108
Let’s import pretrained versions of OpenAI GPT tokenizer and model.
Now most of the work is about preparing the inputs for the model.
Then train our model using the pretraining language modeling objective.
And add a few new tokens to the vocabulary
We organize the contexts in segments
Add delimiters at the extremities of the segments
And build our word, position and segment inputs for the model.
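A sketch of these input-preparation steps with the pytorch-pretrained-bert library (the special tokens and example utterances are illustrative, and details differ from the hands-on code):

```python
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel

# Pretrained tokenizer and model
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')

# Add a few new tokens to the vocabulary (delimiters and segment/speaker tokens)
SPECIAL_TOKENS = ['<bos>', '<eos>', '<speaker1>', '<speaker2>', '<pad>']
tokenizer.set_special_tokens(SPECIAL_TOKENS)
model.set_num_special_tokens(len(SPECIAL_TOKENS))

# Organize the contexts in segments (persona, history, reply) - illustrative utterances
persona = tokenizer.tokenize("i like football . i have two dogs .")
history = tokenizer.tokenize("hello , how are you ?")
reply = tokenizer.tokenize("i am fine , watching a game .")
bos, eos, speaker1, speaker2 = SPECIAL_TOKENS[:4]

# Add delimiters at the extremities of the segments
sequence = [[bos] + persona, [speaker1] + history, [speaker2] + reply + [eos]]

# Build the word, position and segment inputs for the model
words = sum(sequence, [])
segments = sum([[speaker1 if i % 2 else speaker2] * len(s) for i, s in enumerate(sequence)], [])
positions = list(range(len(words)))

input_ids = tokenizer.convert_tokens_to_ids(words)
token_type_ids = tokenizer.convert_tokens_to_ids(segments)
```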
5.C – Chit-chat with OpenAI GPT & PyTorch
109
PyTorch Hub
Last Friday, the PyTorch team soft-launched a beta version of PyTorch Hub. Let’s have a quick look.
In our case, to use torch.hub instead of pytorch-pretrained-bert, we can simply call torch.hub.load with the path to the pytorch-pretrained-bert GitHub repository:
PyTorch Hub will fetch the model from the master branch on GitHub. This means that you don’t need to package your model (pip) & users will always access the most recent version (master).
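A sketch of such a call (the entry-point names are assumptions about that repository's hubconf.py):

```python
import torch

# Load a tokenizer and a model straight from the GitHub repository through torch.hub
tokenizer = torch.hub.load('huggingface/pytorch-pretrained-BERT', 'bertTokenizer',
                           'bert-base-cased', do_lower_case=False)
model = torch.hub.load('huggingface/pytorch-pretrained-BERT', 'bertModel',
                       'bert-base-cased')
```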
That’s all for this time
110
Image credit: Andrejs Kirma