Transfer Learning in
Natural Language Processing
June 2, 2019
NAACL-HLT 2019
1
Sebastian Ruder
Matthew Peters
Swabha Swayamdipta
Thomas Wolf
Transfer Learning in Natural Language Processing
Transfer Learning in NLP
2
Follow along with the tutorial:
Questions:
What is transfer learning?
3
Why transfer learning in NLP?
4
Why transfer learning in NLP? (Empirically)
5
Performance on Named Entity Recognition (NER) on CoNLL-2003 (English) over time
Types of transfer learning in NLP
6
We will focus on this
What this tutorial is about and what it’s not about
7
Agenda
8
[2] Pretraining
[4] Adaptation
[6]
[5] Downstream
[1] Introduction
9
Sequential transfer learning
Learn on one task / dataset, then transfer to another task / dataset
10
word2vec
GloVe
skip-thought
InferSent
ELMo
ULMFiT
GPT
BERT
classification
sequence labeling
Q&A
....
Pretraining
Adaptation
Pretraining tasks and datasets
11
Target tasks and datasets
Target tasks are typically supervised and span a range of common NLP tasks:
12
Concrete example—word vectors
Word embedding methods (e.g. word2vec) learn one vector per word:
13
cat = [0.1, -0.2, 0.4, …]
dog = [0.2, -0.1, 0.7, …]
Concrete example—word vectors
Word embedding methods (e.g. word2vec) learn one vector per word:
14
cat = [0.1, -0.2, 0.4, …]
dog = [0.2, -0.1, 0.7, …]
PRP VBP PRP NN CC NN .
I love my cat and dog .
Concrete example—word vectors
Word embedding methods (e.g. word2vec) learn one vector per word:
15
cat = [0.1, -0.2, 0.4, …]
dog = [0.2, -0.1, 0.7, …]
PRP VBP PRP NN CC NN .
I love my cat and dog .
I love my cat and dog . → "positive"
Major Themes
16
Major themes: From words to words-in-context
Word vectors
Sentence / doc vectors
Word-in-context vectors
17
cats = [0.2, -0.3, …]
dogs = [0.4, -0.5, …]
It’s raining cats and dogs.
We have two cats.
[0.8, 0.9, …]
[-1.2, 0.0, …]
}
}
We have two cats.
}
[1.2, -0.3, …]
It’s raining cats and dogs.
}
[-0.4, 0.9, …]
Major themes: LM pretraining
18
Devlin et al 2019: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
1 layer
Major themes: From shallow to deep
19
24 layers
Major themes: pretraining vs target task
20
Choice of pretraining and target tasks are coupled
In general:
Agenda
21
[2] Pretraining
[4] Adaptation
[6]
[5] Downstream
[1] Introduction
2. Pretraining
22
Image credit: Creative Stall
Overview
23
Word Type Representation
We [have a ??? and three] dogs
We have a ???
We have a MASK and three dogs
We have a ???
We like pets. }
LM pretraining
24
ELMo (Peters et al., 2018), ULMFiT (Howard & Ruder, 2018), GPT (Radford et al., 2018)
BERT (Devlin et al., 2019)
???
Word vectors
25
Why embed words?
26
Word Type Representation
Unsupervised pretraining: Pre-Neural
Latent Dirichlet Allocation (LDA)—Documents are mixtures of topics and topics are mixtures of words (Blei et al., 2003)
27
Word Type Representation
Supervised multitask word embeddings (Collobert and Weston, 2008)
Word vector pretraining
28
See also:
word2vec
Efficient algorithm + large scale training → high quality word vectors
29
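For reference, this is the skip-gram objective with negative sampling (Mikolov et al., 2013) that word2vec optimizes for a centre word $w_I$, an observed context word $w_O$ and $k$ negative samples drawn from a noise distribution $P_n(w)$:

```latex
\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]
```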
Sentence and document vectors
30
Doc2vec
Paragraph vector
Unsupervised paragraph embeddings (Le & Mikolov, 2014)
SOTA classification (IMDB, SST)
31
Doc2vec
Skip-Thought Vectors
Predict previous / next sentence with seq2seq model (Kiros et al., 2015)
Hidden state of encoder transfers to sentence tasks (classification, semantic similarity)
32
Dai & Le (2015): Pretrain a sequence autoencoder (SA) and generative LM
See also:
Autoencoder pretraining
SOTA classification (IMDB)
33
Autoencoder pretraining
Supervised sentence embeddings
Also possible to train sentence embeddings with supervised objective
34
Contextual word vectors
35
Contextual word vectors - Motivation
Word vectors compress all contexts into a single vector
Nearest neighbor GloVe vectors to “play”
36
(Figure: nearest neighbors of "play" collapse its different senses: verb forms (playing, played, plays, Play), game-related nouns (game, games, players, football) and adjectives (multiplayer).)
Contextual word vectors - Key Idea
Instead of learning one vector per word, learn a vector that depends on context
Many approaches based on language models
37
f(play | The kids play a game in the park.) ≠ f(play | The Broadway play premiered yesterday.)
Sentence completion
Lexical substitution
WSD
Use bidirectional LSTM and cloze prediction objective (a 1 layer masked LM)
Learn representations for both words and contexts (minus word)
context2vec
38
Pretrain two LMs (forward and backward) and add to sequence tagger.
SOTA NER and chunking results
TagLM
39
Pretrain encoder and decoder with LMs (everything shaded is pretrained).
Large boost for MT.
Unsupervised Pretraining for Seq2Seq
40
Pretrain bidirectional encoder with MT supervision, extract LSTM states
Adding CoVe with GloVe gives improvements for classification, NLI, Q&A
CoVe
41
Pretrain deep bidirectional LM, extract contextual word vectors as learned linear combination of hidden states
SOTA for 6 diverse tasks
ELMo
42
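The learned linear combination can be written down explicitly; following Peters et al. (2018), with $h^{LM}_{k,j}$ the hidden state of layer $j$ at position $k$, softmax-normalized layer weights $s^{task}_j$ and a scalar $\gamma^{task}$:

```latex
\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}
```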
Pretrain AWD-LSTM LM, fine-tune LM in two stages with different adaptation techniques
SOTA for six classification datasets
ULMFiT
43
Pretrain a large 12-layer left-to-right Transformer, fine-tune for sentence, sentence-pair and multiple-choice tasks.
SOTA results for 9 tasks.
GPT
44
BERT
45
BERT pretrains both sentence and contextual word representations,
using masked LM and next sentence prediction.
BERT-large has 340M parameters, 24 layers!
See also: Logeswaran and Lee, ICLR 2018
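A minimal sketch of the masked-LM input corruption BERT uses (15% of tokens selected; of those, 80% become [MASK], 10% a random token, 10% are left unchanged). The token-ID interface, `mask_token_id` and the -1 ignore label are illustrative choices, not BERT's exact implementation:

```python
import random

def mask_tokens(token_ids, vocab_size, mask_token_id, mlm_prob=0.15):
    """BERT-style masking: returns corrupted inputs and LM labels (-1 = no loss)."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mlm_prob:
            labels.append(tok)                            # predict the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token_id                 # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: replace with a random token
            # remaining 10%: keep the original token
        else:
            labels.append(-1)                             # unselected tokens contribute no loss
    return inputs, labels
```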
BERT
46
SOTA GLUE benchmark results (sentence pair classification).
BERT
47
SOTA SQuAD v1.1 (and v2.0) Q&A
Other pretraining objectives
48
Why does language modeling work so well?
49
Sample efficiency
50
Pretraining reduces need for annotated data
51
Pretraining reduces need for annotated data
52
Pretraining reduces need for annotated data
53
Scaling up pretraining
54
Scaling up pretraining
55
Pretrained Language Models: More Data
Scaling up pretraining
56
Scaling up pretraining
57
Cross-lingual pretraining
58
Cross-lingual pretraining
59
Cross-lingual Polyglot Pretraining
Key idea: Share vocabulary and representations across languages by training one model on many languages.
Advantages: Easy to implement, enables cross-lingual pretraining by itself
Disadvantages: Leads to under-representation of low-resource languages
60
Hands-on #1:
Pretraining a Transformer Language Model
61
Image credit: Chanaky
Hands-on: Overview
62
Current developments in transfer learning combine new training schemes (sequential training) with new models (Transformers) ⇨ can look intimidating and complex
Hands-on pre-training
63
Hands-on pre-training
64
Our core model will be a Transformer. Large-scale Transformer architectures (GPT-2, BERT, XLM…) are very similar to each other and consist of token (and position) embeddings followed by a stack of blocks combining self-attention and feed-forward sub-layers.
Main differences between GPT/GPT-2/BERT are the objective functions: causal (left-to-right) language modeling for GPT/GPT-2, masked language modeling (plus next sentence prediction) for BERT.
We’ll play with both
Let’s code the backbone of our model!
PyTorch 1.1 now has an nn.MultiheadAttention module: it lets us encapsulate the self-attention logic while still controlling the internals of the Transformer.
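A minimal sketch of such a backbone (illustrative hyper-parameters, not the exact tutorial code): token and position embeddings followed by a stack of blocks built around nn.MultiheadAttention; the causal `attn_mask` and the padding `key_padding_mask` are the two masks referred to on the next slide.

```python
import torch
import torch.nn as nn

class TransformerLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=410, num_heads=10,
                 hidden_dim=2100, num_layers=16, dropout=0.1, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_len, embed_dim)
        self.attns = nn.ModuleList([nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout)
                                    for _ in range(num_layers)])
        self.ffns = nn.ModuleList([nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
                                                 nn.Linear(hidden_dim, embed_dim))
                                   for _ in range(num_layers)])
        self.ln1 = nn.ModuleList([nn.LayerNorm(embed_dim) for _ in range(num_layers)])
        self.ln2 = nn.ModuleList([nn.LayerNorm(embed_dim) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, padding_mask=None):
        # x: (seq_len, batch) token ids; padding_mask: (batch, seq_len), True on padding
        seq_len = x.size(0)
        positions = torch.arange(seq_len, device=x.device).unsqueeze(1)
        h = self.dropout(self.tok_emb(x) + self.pos_emb(positions))
        # causal mask: each position may only attend to the left context
        causal_mask = torch.full((seq_len, seq_len), float('-inf'), device=x.device).triu(1)
        for attn, ffn, ln1, ln2 in zip(self.attns, self.ffns, self.ln1, self.ln2):
            a = ln1(h)
            a, _ = attn(a, a, a, attn_mask=causal_mask, key_padding_mask=padding_mask)
            h = h + self.dropout(a)             # residual around self-attention
            h = h + self.dropout(ffn(ln2(h)))   # residual around the feed-forward sub-layer
        return h                                # (seq_len, batch, embed_dim)
```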
Hands-on pre-training
65
Two attention masks?
Hands-on pre-training
66
1. A pretraining head on top of our core model: we choose a language modeling head with tied weights
Hands-on pre-training
67
To pretrain our model, we need to add a few elements: a head, a loss, and weight initialization.
We add these elements in a pretraining wrapper that encapsulates our core model.
2. Initialize the weights
3. Define a loss function: we choose a cross-entropy loss on current (or next) token predictions
Hyper-parameters taken from Dai et al., 2019 (Transformer-XL) ⇨ a causal model with ~50M parameters.
Use a large dataset for pre-training: WikiText-103 with 103M tokens (Merity et al., 2017).
Instantiate our model and optimizer (Adam)
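A sketch of such a pretraining wrapper around the backbone sketched earlier (the class name and config keys are illustrative): a linear LM head tied to the token embeddings, a simple weight initialization, and a cross-entropy loss on next-token predictions.

```python
import torch.nn as nn

class TransformerWithLMHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transformer = TransformerLM(**config)               # core model from the sketch above
        self.lm_head = nn.Linear(config['embed_dim'], config['vocab_size'], bias=False)
        self.apply(self.init_weights)                            # 2. initialize the weights
        self.lm_head.weight = self.transformer.tok_emb.weight    # 1. tie the head to the embeddings

    @staticmethod
    def init_weights(module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()

    def forward(self, x, labels=None, padding_mask=None):
        hidden = self.transformer(x, padding_mask=padding_mask)
        logits = self.lm_head(hidden)                            # (seq_len, batch, vocab)
        if labels is None:
            return logits
        # 3. cross-entropy loss: position t predicts token t+1
        loss = nn.CrossEntropyLoss()(logits[:-1].reshape(-1, logits.size(-1)),
                                     labels[1:].reshape(-1))
        return logits, loss
```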
Hands-on pre-training
68
Now let’s take care of our data and configuration
We'll use a predefined open-vocabulary tokenizer: BERT's cased WordPiece tokenizer.
Hands-on pre-training
69
A simple update loop.
We use gradient accumulation to have a large batch size (>64) even on 1 GPU.
Learning rate schedule:
- linear warmup to start
- then cosine or inverse square root decrease
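A minimal sketch of such an update loop (learning rate, warmup and total step counts are illustrative), with gradient accumulation and a LambdaLR schedule implementing linear warmup followed by cosine decay; `model` and `dataloader` are assumed to come from the previous steps.

```python
import math
import torch

def warmup_then_cosine(step, warmup=1000, total=100000):
    if step < warmup:
        return step / max(1, warmup)                               # linear warmup to the base lr
    progress = (step - warmup) / max(1, total - warmup)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))    # cosine decrease to 0

optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_then_cosine)
accumulation_steps = 4          # effective batch size = batch size x accumulation_steps

model.train()
for step, (inputs, labels) in enumerate(dataloader):
    _, loss = model(inputs, labels=labels)
    (loss / accumulation_steps).backward()                 # accumulate gradients
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```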
Go!
And we’re done: let’s train!
no warm-up
Hands-on pre-training — Concluding remarks
70
Hands-on pre-training — Concluding remarks
71
WikiText-103 Validation/Test PPL
Agenda
72
[2] Pretraining
[4] Adaptation
[6]
[5] Downstream
[1] Introduction
3. What is in a Representation?
73
Image credit: Caique Lima
Why care about what is in a representation?
74
What to analyze?
75
Analysis Method 1: Visualization
76
Hold the embeddings / network activations static or frozen
Visualizing Embedding Geometries
77
Image: Tensorflow
Visualizing Neuron Activations
78
Visualizing Layer-Importance Weights
Layer-wise analysis (static)
79
Also see Tenney et al., ACL 2019
Visualizing Attention Weights
Visualization: Attention Weights
80
Analysis Method 2: Behavioral Probes
81
Analysis Method 2: Behavioral Probes
82
Analysis Method 3: Classifier Probes
83
Hold the embeddings / network activations static and train a simple supervised model on top
Probe classification task (Linear / MLP)
Probing Surface-level Features
84
Probing Morphology, Syntax, Semantics
Sentence-level Syntax
Tree Depth
Tense of main clause verb
Top Constituents
Long-distance number agreement
# Objects
Subject-Verb Agreement
85
Adi et al., 2017; Conneau et al., 2018; Belinkov et al., 2017; Zhang et al., 2018; Blevins et al., 2018; Tenney et al. 2019; Liu et al., 2019
Probing classifier findings
86
Probing classifier findings
87
Probing: Layers of the network
Layer-wise analysis (dynamic)
88
Probing: Pretraining Objectives
89
What have we learnt so far?
90
Open questions about probes
91
Analysis Method 4: Model Alterations
92
So, what is in a representation?
93
Very current and ongoing!
94
First column for citations in and before 2015
What’s next?
Correlation of probes to downstream tasks
Interpretability + transferability to downstream tasks is key
Some Pointers
96
Break
97
Image credit: Andrejs Kirma
Transfer Learning in Natural Language Processing
Transfer Learning in NLP
98
Follow along with the tutorial:
Questions:
Agenda
99
[2] Pretraining
[4] Adaptation
[6]
[5] Downstream
[1] Introduction
4. Adaptation
100
Image credit: Ben Didier
4 – How to adapt the pretrained model
Several orthogonal directions we can make decisions on:
4.1 – Architecture
Two general options:
102
Image credit: Darmawansyah
4.1.A – Architecture: Keep model unchanged
General workflow:
103
4.1.A – Architecture: Keep model unchanged
General workflow:
Task-specific, randomly initialized
General, pretrained
104
4.1.A – Architecture: Keep model unchanged
General workflow:
105
4.1.B – Architecture: Modifying model internals
Various reasons:
106
4.1.B – Architecture: Modifying model internals
Various reasons:
107
4.1.B – Architecture: Modifying model internals
Various reasons:
108
4.1.B – Architecture: Modifying model internals
Adapters
109
Image credit: Caique Lima
4.1.B – Architecture: Modifying model internals
Adapters (Stickland & Murray, ICML 2019)
110
Hands-on #2:
Adapting our pretrained model
111
Image credit: Chanaky
Hands-on: Model adaptation
112
Let’s see how a simple fine-tuning scheme can be implemented with our pretrained model:
Adaptation task
Hands-on: Model adaptation
113
Example: question classification on TREC-6.
Transfer learning models shine on this type of low-resource task.
Hands-on: Model adaptation
114
First adaptation scheme
Let’s load and prepare our dataset:
Fine-tuning hyper-parameters:
– 6 classes in TREC-6
– Use the fine-tuning hyper-parameters from Radford et al., 2018.
Hands-on: Model adaptation
115
- trim to the transformer input size & add a classification token at the end of each sample,
- pad to the left,
- convert to tensors,
- extract a validation set.
Adapt our model architecture
Replace the pre-training head (language modeling) with the classification head:
A linear layer, which takes as input the hidden-state of the [CLF] token (using a mask)
Keep our pretrained model unchanged as the backbone.
* Initialize all the weights of the model.
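A sketch of this adaptation, reusing the TransformerLM backbone from the pretraining section (the `clf_index` argument, holding the position of the classification token in each sample, is an illustrative choice):

```python
import torch
import torch.nn as nn

class TransformerWithClfHead(nn.Module):
    def __init__(self, config, num_classes=6):                  # 6 classes for TREC-6
        super().__init__()
        self.transformer = TransformerLM(**config)               # pretrained backbone, unchanged
        self.classifier = nn.Linear(config['embed_dim'], num_classes)

    def forward(self, x, clf_index, labels=None, padding_mask=None):
        hidden = self.transformer(x, padding_mask=padding_mask)  # (seq_len, batch, dim)
        # pick the hidden state at the classification token of each sample
        batch_idx = torch.arange(hidden.size(1), device=hidden.device)
        clf_hidden = hidden[clf_index, batch_idx]                 # (batch, dim)
        logits = self.classifier(clf_hidden)
        if labels is None:
            return logits
        return logits, nn.CrossEntropyLoss()(logits, labels)
```

The pretrained weights are then reloaded into the backbone, while the classification head stays randomly initialized.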
Hands-on: Model adaptation
116
* Reload common weights from the pretrained model.
Our fine-tuning code:
We will evaluate on our validation and test sets:
* validation: after each epoch
* test: at the end
A simple training update function:
* prepare inputs: transpose and build padding & classification token masks
* we have options to clip and accumulate gradients
Schedule:
* linearly increasing to lr
* linearly decreasing to 0.0
Hands-on: Model adaptation
117
We can now fine-tune our model on TREC:
We are at the state of the art (ULMFiT).
Remarks:
Hands-on: Model adaptation – Results
118
Let’s conclude this hands-on with a few additional words on robustness & variance.
Hands-on: Model adaptation – Results
119
4.2 – Optimization
Several directions when it comes to the optimization itself:
120
Image credit: ProSymbols, purplestudio, Markus, Alfredo
4.2.A – Optimization: Which weights?
The main question: To tune or not to tune (the pretrained weights)?
121
Image credit: purplestudio
4.2.A – Optimization: Which weights?
Don't touch the pretrained weights!
Feature extraction:
122
❄️
4.2.A – Optimization: Which weights?
Don't touch the pretrained weights!
Feature extraction:
123
❄️
4.2.A – Optimization: Which weights?
Don't touch the pretrained weights!
Feature extraction:
124
❄️
4.2.A – Optimization: Which weights?
Don’t touch the pretrained weights!
Feature extraction:
125
4.2.A – Optimization: Which weights?
Don't touch the pretrained weights!
Adapters
126
4.2.A – Optimization: Which weights?
Don't touch the pretrained weights!
Adapters
127
4.2.A – Optimization: Which weights?
Yes, change the pretrained weights!
Fine-tuning:
128
Hands-on #3:
Using Adapters and freezing
129
Image credit: Chanaky
Hands-on: Model adaptation
130
Second adaptation scheme: Using Adapters
We will only train the adapters, the added linear layer and the embeddings. The other parameters of the model will be frozen.
Let’s adapt our model architecture
Add the adapter modules:
Bottleneck layers with 2 linear layers and a non-linear activation function (ReLU)
Hidden dimension is small: e.g. 32, 64, 256
Inherit from our pretrained model to have all the modules.
The adapters are inserted inside the skip-connections, after the self-attention and feed-forward sub-layers.
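A sketch of what such an adapter module can look like, in the spirit of Houlsby et al. (2019); the bottleneck dimension and exactly where the adapters are called are illustrative choices:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project down, non-linearity, project up, residual connection."""
    def __init__(self, embed_dim, bottleneck_dim=32):
        super().__init__()
        self.down = nn.Linear(embed_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, embed_dim)
        self.activation = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.activation(self.down(x)))

# Inside each Transformer block, the adapter is applied to a sub-layer output before
# it re-enters the skip-connection, e.g.:
#   h = h + dropout(adapter_attn(attention_output))
#   h = h + dropout(adapter_ffn(feed_forward_output))
```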
Hands-on: Model adaptation
131
Now we need to freeze the portions of our model we don’t want to train.
We simply indicate that no gradient is needed for the frozen parameters by setting param.requires_grad to False:
In our case we will train 25% of the parameters. The model is small and deep (many adapters) and we need to train the embeddings, so the ratio stays quite high. For a larger model this ratio would be much lower.
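A minimal way to do this; the name patterns match the modules of the sketches above and are illustrative:

```python
trainable_patterns = ('adapter', 'tok_emb', 'pos_emb', 'classifier')   # adapters, embeddings, head

for name, param in model.named_parameters():
    param.requires_grad = any(p in name for p in trainable_patterns)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Training {trainable / total:.0%} of the parameters")
```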
Hands-on: Model adaptation
132
Results are similar to the full fine-tuning case, with the advantage of training only 25% of the full model's parameters.
For a small 50M-parameter model this method is overkill ⇨ it is meant for 300M–1.5B-parameter models.
We use a hidden dimension of 32 for the adapters and a learning rate ten times higher for the fine-tuning (we have added quite a lot of newly initialized parameters that must be trained from scratch).
Hands-on: Model adaptation
133
4.2.B – Optimization: What schedule?
We have decided which weights to update, but in which order and how should we update them?
Motivation: We want to avoid overwriting useful pretrained information and maximize positive transfer.
Related concept: catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999): when a model forgets the task it was originally trained on.
134
Image credit: Markus
4.2.B – Optimization: What schedule?
A guiding principle: update from top to bottom.
135
4.2.B – Optimization: Freezing
Main intuition: Training all layers at the same time on data of a different distribution and task may lead to instability and poor solutions.
Solution: Train layers individually to give them time to adapt to the new task and data.
Goes back to layer-wise training of early deep neural networks (Hinton et al., 2006; Bengio et al., 2007).
136
4.2.B – Optimization: Freezing
137
4.2.B – Optimization: Freezing
138
4.2.B – Optimization: Freezing
139
4.2.B – Optimization: Freezing
140
4.2.B – Optimization: Freezing
141
4.2.B – Optimization: Freezing
142
4.2.B – Optimization: Freezing
143
4.2.B – Optimization: Freezing
144
4.2.B – Optimization: Freezing
145
4.2.B – Optimization: Freezing
146
4.2.B – Optimization: Freezing
147
4.2.B – Optimization: Freezing
148
4.2.B – Optimization: Freezing
149
4.2.B – Optimization: Freezing
Commonality: Train all parameters jointly in the end
150
Hands-on #4:
Using gradual unfreezing
151
Image credit: Chanaky
Gradual unfreezing is similar to our previous freezing process. We start by freezing the whole model except the newly added parameters:
We then gradually unfreeze one additional block during training, so that the full model is trained at the end:
Find index of layer to unfreeze
Name pattern matching
Unfreezing interval
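One way to sketch this with name-pattern matching; the module names follow the TransformerLM sketch above, and the unfreezing interval is an illustrative hyper-parameter:

```python
import re

def unfreeze_next_block(model, epoch, num_layers=16, unfreezing_interval=1):
    """Unfreeze one additional top-most frozen block every `unfreezing_interval` epochs."""
    layer_to_unfreeze = num_layers - 1 - epoch // unfreezing_interval   # index of layer to unfreeze
    if layer_to_unfreeze < 0:
        return                                   # the whole model is already unfrozen
    # name pattern matching on the parameters of that block
    pattern = re.compile(rf'\.(attns|ffns|ln1|ln2)\.{layer_to_unfreeze}\.')
    for name, param in model.named_parameters():
        if pattern.search(name):
            param.requires_grad = True
```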
Hands-on: Adaptation
152
Gradual unfreezing has not been investigated in detail for Transformer models ⇨ no specific hyper-parameters are advocated in the literature.
Residual connections may have an impact on the method
⇨ the hyper-parameters used for LSTMs should probably be adapted.
Hands-on: Adaptation
153
We show simple experiments in the Colab. Better hyper-parameter settings can probably be found.
4.2.B – Optimization: Learning rates
Main idea: Use lower learning rates to avoid overwriting useful information.
Where and when?
154
4.2.B – Optimization: Learning rates
155
4.2.B – Optimization: Learning rates
156
4.2.B – Optimization: Learning rates
157
4.2.B – Optimization: Regularization
Main idea: minimize catastrophic forgetting by encouraging target model parameters to stay close to the pretrained model parameters using a regularization term.
158
4.2.B – Optimization: Regularization
159
4.2.B – Optimization: Regularization
160
4.2.B – Optimization: Regularization
EWC has downsides in continual learning:
161
4.2.B – Optimization: Regularization
162
Hands-on #5:
Using discriminative learning rates
163
Image credit: Chanaky
Discriminative learning rates can be implemented in two steps in our example:
First we organize the parameters of the various layers into labelled parameter groups in the optimizer:
We can then compute the learning rate of each group depending on its label (at each training iteration):
Hyper-parameter
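A sketch of both steps; the base learning rate, the decay factor (the hyper-parameter mentioned above) and treating embeddings and head as the top group are illustrative choices:

```python
import itertools
import torch

# 1. organize the parameters of the various layers in labelled parameter groups
tr = model.transformer
layer_groups = []
for depth in range(len(tr.attns)):
    params = itertools.chain(tr.attns[depth].parameters(), tr.ffns[depth].parameters(),
                             tr.ln1[depth].parameters(), tr.ln2[depth].parameters())
    layer_groups.append({'params': list(params), 'depth': depth})
head_group = {'params': list(model.classifier.parameters())
                        + list(tr.tok_emb.parameters()) + list(tr.pos_emb.parameters()),
              'depth': len(tr.attns)}            # embeddings + head treated as the top group

base_lr = 6.5e-5
optimizer = torch.optim.Adam(layer_groups + [head_group], lr=base_lr)

# 2. compute the learning rate of each group depending on its label (at each iteration)
def set_discriminative_lr(optimizer, base_lr, num_layers, decay=2.6):
    for group in optimizer.param_groups:
        # lower layers get exponentially smaller learning rates
        group['lr'] = base_lr / (decay ** (num_layers - group['depth']))
```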
Hands-on: Model adaptation
164
4.2.C – Optimization: Trade-offs
Several trade-offs when choosing which weights to update:
165
Image credit: Alfredo
4.2.C – Optimization trade-offs: Space
166
(Figure: feature extraction, fine-tuning and adapters compared along three axes: task-specific modifications (many ↔ few), additional parameters (many ↔ few) and parameter reuse (all ↔ none).)
4.2.C – Optimization trade-offs: Time
Training time
167
(Figure: feature extraction, fine-tuning and adapters placed on a training-time scale from slow to fast.)
4.2.C – Optimization trade-offs: Performance
*dissimilar: certain capabilities (e.g. modelling inter-sentence relations) are beneficial for the target task, but the pretrained model lacks them (see more later)
168
4.3 – Getting more signal
The target task is often a low-resource task. We can often improve the performance of transfer learning by combining a diverse set of signals:
169
Image credit: Naveen
4.3.A – Getting more signal: Basic fine-tuning
Simple example of fine-tuning on a text classification task:
170
4.3.B – Getting more signal: Related datasets/tasks
171
4.3.B – Getting more signal: Sequential adaptation
Fine-tuning on related high-resource dataset
172
4.3.B – Getting more signal: Sequential adaptation
Fine-tuning on related high-resource dataset
173
4.3.B – Getting more signal: Multi-task fine-tuning
Fine-tune the model jointly on related tasks
174
4.3.B – Getting more signal: Multi-task fine-tuning
Fine-tune the model jointly on related tasks
175
4.3.B – Getting more signal: Multi-task fine-tuning
Fine-tune the model with an unsupervised auxiliary task
176
4.3.B – Getting more signal: Dataset slicing
Use auxiliary heads that are trained only on particular subsets of the data
177
4.3.B – Getting more signal: Semi-supervised learning
Can be used to make model predictions more consistent using unlabelled data
178
4.3.B – Getting more signal: Semi-supervised learning
Can be used to make model predictions more consistent using unlabelled data
179
4.3.C – Getting more signal: Ensembling
Reaching the state-of-the-art by ensembling independently fine-tuned models
180
4.3.C – Getting more signal: Ensembling
Model fine-tuned...
181
Combining the predictions of models fine-tuned with various hyper-parameters.
4.3.C – Getting more signal: Distilling
182
Distilling ensembles of large models back into a single model
Hands-on #6:
Using multi-task learning
183
Image credit: Chanaky
Multitasking with a classification loss + language modeling loss.
Create two heads:
– language modeling head
– classification head
Total loss is a weighted sum of
– language modeling loss and
– classification loss
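A sketch of the combined loss, assuming a model that exposes the shared backbone and both heads (as in the earlier sketches); the loss coefficients are the ones discussed below:

```python
import torch
import torch.nn as nn

def multitask_loss(model, inputs, lm_labels, clf_labels, clf_index,
                   clf_coef=1.0, lm_coef=0.5):
    """Weighted sum of a classification loss and a language modeling loss."""
    hidden = model.transformer(inputs)                           # shared Transformer backbone
    lm_logits = model.lm_head(hidden)                            # language modeling head
    batch_idx = torch.arange(hidden.size(1), device=hidden.device)
    clf_logits = model.classifier(hidden[clf_index, batch_idx])  # classification head
    lm_loss = nn.CrossEntropyLoss(ignore_index=-1)(
        lm_logits[:-1].reshape(-1, lm_logits.size(-1)), lm_labels[1:].reshape(-1))
    clf_loss = nn.CrossEntropyLoss()(clf_logits, clf_labels)
    return clf_coef * clf_loss + lm_coef * lm_loss               # weighted sum of the two losses
```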
Hands-on: Multi-task learning
184
Multi-tasking helped us improve over single-task full-model fine-tuning!
We use a coefficient of 1.0 for the classification loss and 0.5 for the language modeling loss, and fine-tune a little longer (6 epochs instead of 3; the validation loss was still decreasing).
Hands-on: Multi-task learning
185
Agenda
186
[2] Pretraining
[4] Adaptation
[6]
[5] Downstream
[1] Introduction
5. Downstream applications�Hands-on examples
187
Image credit: Fahmi
5. Downstream applications - Hands-on examples
In this section we will explore downstream applications and practical considerations along two orthogonal directions:
188
Practical considerations
Frameworks & libraries: practical considerations
189
“Energy and Policy Considerations for Deep Learning in NLP” - Strubell, Ganesh, McCallum - ACL 2019
5. Downstream applications - Hands-on examples
190
Icons credits: David, Susannanova, Flatart, ProSymbols
5.A – Sequence & document level classification
Transfer learning for document classification using the fast.ai library.
191
fast.ai gives access to many high-level APIs out of the box for vision, text, tabular data and collaborative filtering.
DataBunch for the language model and the classifier
Load IMDB dataset & inspect it.
Load an AWD-LSTM (Merity et al., 2017) pretrained on WikiText-103 & fine-tune it on IMDB using the language modeling loss.
fast.ai then provides all the high-level modules needed to quickly set up a transfer learning experiment.
5.A – Document level classification using fast.ai
192
The library is designed for speed of experimentation, e.g. by importing all necessary modules at once in interactive computing environments (from fastai.text import *).
Now we fine-tune in two steps (see the sketch after this list):
Once we have a fine-tuned language model (AWD-LSTM), we can create a text classifier by adding a classification head with:
– A layer to concatenate the final outputs of the RNN with the maximum and average of all the intermediate outputs (along the sequence length)
– Two blocks of nn.BatchNorm1d ⇨ nn.Dropout ⇨ nn.Linear ⇨ nn.ReLU with a hidden dimension of 50.
5.A – Document level classification using fast.ai
193
1. train the classification head only while keeping the language model frozen, and
2. fine-tune the whole architecture.
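A sketch of the full workflow with the fastai v1 API (the exact data factory methods and hyper-parameters depend on the dataset layout and library version; `path` is assumed to point at an IMDB-style folder):

```python
from fastai.text import *

# DataBunch objects for the language model and the classifier
data_lm = TextLMDataBunch.from_folder(path)
data_clas = TextClasDataBunch.from_folder(path, vocab=data_lm.vocab, bs=32)

# Load an AWD-LSTM pretrained on WikiText-103 and fine-tune it with the LM loss
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save_encoder('ft_enc')

# Build the classifier on top of the fine-tuned encoder
learn_clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clf.load_encoder('ft_enc')
learn_clf.fit_one_cycle(1, 2e-2)                       # 1. train the classification head
learn_clf.unfreeze()
learn_clf.fit_one_cycle(1, slice(1e-3 / 100, 1e-3))    # 2. fine-tune the whole architecture
```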
5.B – Token level classification: BERT & Tensorflow
Transfer learning for token level classification: Google’s BERT in TensorFlow.
194
Let’s adapt BERT to the target task.
Replace the pre-training head (language modeling) with a classification head:
a linear projection layer to estimate 2 probabilities for each token:
– being the start of an answer
– being the end of an answer.
Keep our core model unchanged.
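A sketch of what this head can look like in TF1-style code, mirroring the structure of Google's open-source run_squad.py (variable names and shapes are illustrative; `sequence_output` holds BERT's final hidden states):

```python
import tensorflow as tf

def add_squad_head(sequence_output):
    """sequence_output: [batch, seq_len, hidden] -> (start_logits, end_logits)."""
    hidden_size = sequence_output.shape[-1].value
    # one linear projection producing 2 logits per token: (start, end)
    output_weights = tf.get_variable(
        "cls/squad/output_weights", [2, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable(
        "cls/squad/output_bias", [2], initializer=tf.zeros_initializer())

    logits = tf.einsum("bsh,oh->bso", sequence_output, output_weights) + output_bias
    start_logits, end_logits = tf.unstack(logits, axis=-1)       # [batch, seq_len] each
    return start_logits, end_logits
```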
5.B – SQuAD with BERT & Tensorflow
195
Load our pretrained checkpoint
To load our checkpoint, we just need to set up an assignment_map from the variables of the checkpoint to the model variables, keeping only the variables that exist in the model.
We can then use
tf.train.init_from_checkpoint
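A minimal sketch of that logic (TF1-style; `init_checkpoint` is assumed to be the path to the pretrained BERT checkpoint):

```python
import tensorflow as tf

tvars = tf.trainable_variables()
ckpt_vars = {name for name, _ in tf.train.list_variables(init_checkpoint)}

# map checkpoint variable names to model variables, keeping only names present in both
assignment_map = {var.name.split(":")[0]: var
                  for var in tvars if var.name.split(":")[0] in ckpt_vars}

tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
```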
5.B – SQuAD with BERT & Tensorflow
196
TensorFlow-Hub
TensorFlow Hub is a library for sharing machine learning models as self-contained pieces of TensorFlow graph with their weights and assets.
Working directly with TensorFlow requires you to have access to, and include in your code, the full code of the pretrained model.
Modules are automatically downloaded and cached when instantiated.
Each time a module m is called e.g. y = m(x), it adds operations to the current TensorFlow graph to compute y from x.
5.B – SQuAD with BERT & Tensorflow
197
TensorFlow Hub hosts a nice selection of pretrained models for NLP.
TensorFlow Hub can also be used with Keras, exactly as we saw in the BERT example.
The main limitations of hubs are:
5.B – SQuAD with BERT & Tensorflow
198
5.C – Language Generation: OpenAI GPT & PyTorch
Transfer learning for language generation: OpenAI GPT and HuggingFace library.
199
A dialog generation task:
5.C – Chit-chat with OpenAI GPT & PyTorch
200
Language generation tasks are close to the language modeling pre-training objective, but:
How should we adapt the model?
Golovanov, Kurbanov, Nikolenko, Truskovskyi, Tselousov and Wolf, ACL 2019
5.C – Chit-chat with OpenAI GPT & PyTorch
201
Several options:
Concatenate the various contexts separated by delimiters, and add position and segment embeddings
5.C – Chit-chat with OpenAI GPT & PyTorch
202
Let's import pretrained versions of the OpenAI GPT tokenizer and model,
and add a few new tokens to the vocabulary.
Now most of the work is about preparing the inputs for the model:
we organize the contexts in segments,
add delimiters at the extremities of the segments,
and build our word, position and segment inputs for the model.
Then we train our model using the pretraining language modeling objective.
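A sketch of this input pipeline with pytorch-pretrained-bert (the special token strings and segment assignment are illustrative; `persona` and `reply` are token lists, `history` a list of tokenized utterances):

```python
from pytorch_pretrained_bert import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')

SPECIAL_TOKENS = ['<bos>', '<eos>', '<speaker1>', '<speaker2>', '<pad>']
tokenizer.set_special_tokens(SPECIAL_TOKENS)        # add a few new tokens to the vocabulary
model.set_num_special_tokens(len(SPECIAL_TOKENS))   # resize the embeddings accordingly

def build_inputs(persona, history, reply):
    bos, eos, speaker1, speaker2, _ = SPECIAL_TOKENS
    # organize the contexts in segments, with delimiters at the extremities
    sequence = [[bos] + persona] + history + [reply + [eos]]
    sequence = [sequence[0]] + [[speaker2 if i % 2 else speaker1] + s
                                for i, s in enumerate(sequence[1:])]
    words = [w for s in sequence for w in s]                      # word input
    segments = [speaker2 if i % 2 else speaker1                   # segment input
                for i, s in enumerate(sequence) for _ in s]
    positions = list(range(len(words)))                           # position input
    return words, segments, positions
```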
5.C – Chit-chat with OpenAI GPT & PyTorch
203
PyTorch Hub
Last Friday, the PyTorch team soft-launched a beta version of PyTorch Hub. Let’s have a quick look.
In our case, to use torch.hub instead of pytorch-pretrained-bert, we can simply call torch.hub.load with the path to the pytorch-pretrained-bert GitHub repository:
PyTorch Hub will fetch the model from the master branch on GitHub. This means that you don't need to package your model for pip, and users will always access the most recent version (the master branch).
Agenda
204
[2] Pretraining
[4] Adaptation
[6]
[5] Downstream
[1] Introduction
6. Open problems and future directions
205
Image credit: Yazmin Alanis
6. Open problems and future directions
206
Image credit: Yazmin Alanis
Shortcomings of pretrained language models
207
Shortcomings of pretrained language models
Large, pretrained language models can be difficult to optimize.
208
Shortcomings of pretrained language models
Current pretrained language models are very large.
209
Pretraining tasks
Shortcomings of the language modeling objective:
210
More diverse self-supervised objectives
Sampling a patch and a neighbour and predicting their spatial configuration (Doersch et al., ICCV 2015)
Image colorization (Zhang et al., ECCV 2016)
Pretraining tasks
211
Pretraining tasks
Specialized pretraining tasks that teach what our model is missing
212
Pretraining tasks
Need for grounded representations
213
Tasks and task similarity
Many tasks can be expressed as variants of language modeling
214
Tasks and task similarity
215
Continual and meta-learning
216
Continual and meta-learning
217
Bias
218
Conclusion
219
Questions?
If you found these slides helpful, consider citing the tutorial as:
@inproceedings{ruder2019transfer,
  title     = {Transfer Learning in Natural Language Processing},
  author    = {Ruder, Sebastian and Peters, Matthew E and Swayamdipta, Swabha and Wolf, Thomas},
  booktitle = {Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials},
  pages     = {15--18},
  year      = {2019}
}
220
Extra slides
221
Why transfer learning in NLP? (Empirically)
222
BERT + X
GLUE* performance over time
223
*General Language Understanding Evaluation (GLUE; Wang et al., 2019): includes nine diverse NLU tasks
Pretrained Language Models: More Parameters
224
More word vectors
225
GloVe: very large scale (840B tokens), co-occurrence based. Learns linear relationships (SOTA word analogy) (Pennington et al., 2014)
fastText
skipgram
SOTA sequence modeling results
Semi-supervised Sequence Modeling with Cross-View Training
226
Pretrain bidirectional character level model, extract embeddings from first/last character
SOTA CoNLL 2003 NER results
Contextual String Embeddings
227
Cloze-driven Pretraining of Self-attention Networks
228
Pretraining
Fine-tuning
SOTA NER and PTB constituency parsing, ~3.3% less than BERT-large for GLUE
Model is jointly pretrained on three variants of LM (bidirectional, left-to-right, seq-to-seq)
SOTA on three natural language generation tasks
UniLM - Dong et al., 2019
229
Pretrain encoder-decoder
Masked Sequence to Sequence Pretraining (MASS)
230
What matters: Pretraining Objective, Encoder
Probing tasks for sentential features:
231
An inspiration from Computer Vision
From lower to higher layers, information goes from general to task-specific.
232
Image credit: Distill
Other methods for analysis
Other analyses
233
Adversarial methods
Analysis: Inputs and Outputs
What to analyze?
What to look for?
234
Analysis: Methods
Model Alterations
Visualization
Model Probes
* Not hard and fast categories
Analysis / Evaluation : Adversarial Methods
Adversarial Approaches
236
Credits: Jia & Liang (2017) and Percy Liang. AI Frontiers. 2018
Probes are simple linear / neural layers
Liu et al., NAACL 2019
237
What is still left unanswered?
Transferability to downstream tasks
Interpretability is important, but not enough on its own.
Interpretability + transferability to downstream tasks is key - that’s next!