Transformer Architectures
Human Language Technologies
Dipartimento di Informatica
Giuseppe Attardi
Università di Pisa
From: pretrained Word Embeddings
Circa 2017:
Issues:
[Figure: only the word embeddings are pretrained; the downstream model is not pretrained]
Slide from Anna Goldie
To: pretrained Whole Model
In modern NLP:
This has been exceptionally effective at building strong representations of language and strong parameter initializations for NLP models.
[Figure: the whole model is pretrained jointly; it has learned how to represent entire sentences through pretraining]
Slide from Anna Goldie
Learning from context
I put ___ fork down on the table.
The woman walked across the street, checking for traffic over ___ shoulder.
I went to the ocean to see the fish, turtles, seals, and _____.
Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ___.
Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ______.
I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____
Slide from Anna Goldie
Pretrained Transformers
Two Step Development
Pretraining through language modeling [Dai and Le, 2015]
Recall the language modeling task:
Pretraining through language modeling:
Decoder
(Transformer, LSTM, ++ )
[Figure: the decoder (Transformer, LSTM, ++) reads “Iroh goes to make tasty tea” and predicts each next token: goes, to, make, tasty, tea, END]
Slide by John Hewitt
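The next-token objective can be made concrete with a small sketch. This is a minimal numpy illustration (all names and sizes are illustrative, and the single-token context is a simplification: a real decoder conditions on the whole prefix through self-attention):

```python
import numpy as np

# Toy vocabulary and a one-layer "decoder": an embedding lookup followed by
# a linear output layer. All names and shapes here are illustrative.
vocab = ["<s>", "Iroh", "goes", "to", "make", "tasty", "tea", "</s>"]
V, H = len(vocab), 16
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, H))   # token embeddings
W = rng.normal(scale=0.1, size=(H, V))   # output projection

def lm_loss(token_ids):
    """Average cross-entropy of predicting each next token from the current one."""
    total = 0.0
    for cur, nxt in zip(token_ids[:-1], token_ids[1:]):
        logits = E[cur] @ W                      # hidden state -> scores over vocab
        logits -= logits.max()                   # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum())
        total += -log_probs[nxt]                 # -log p(w_{t+1} | w_t)
    return total / (len(token_ids) - 1)

ids = [vocab.index(w) for w in ["<s>", "Iroh", "goes", "to", "make", "tasty", "tea", "</s>"]]
loss = lm_loss(ids)
```

With randomly initialized weights the loss sits near log |V| ≈ 2.08 nats; pretraining drives it down by making the true next token more probable.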
The Pretraining / Finetuning Paradigm
Pretraining can improve NLP applications by serving as parameter initialization.
[Figure: the decoder (Transformer, LSTM, ++) reads “Iroh goes to make tasty tea” and predicts each next token: goes, to, make, tasty, tea, END]
Step 1: Pretrain (on language modeling)
Lots of text; learn general things!
Step 2: Finetune (on your task)
Not many labels; adapt to the task!
[Figure: the same decoder (Transformer, LSTM, ++) reads “… the movie was …” and predicts a sentiment label ☺/☹]
Slide by John Hewitt
Model Pretraining
Stochastic gradient descent and pretrain/finetune
Slide by John Hewitt
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
Decoders
Encoders
Encoder- Decoders
Slide by John Hewitt
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
Decoders
Encoders
Encoder- Decoders
Slide by John Hewitt
Pretraining decoders
ℎ1, … , ℎ𝑇
When using language model pretrained decoders, we can ignore that they were trained to model 𝑝(𝑤𝑡|𝑤1:𝑡−1).
We can finetune them by training a classifier
on the last word’s hidden state.
ℎ1, … , ℎ𝑇 = Decoder(𝑤1, … , 𝑤𝑇)
𝑦 ∼ 𝐴ℎ𝑇 + 𝑏
Where 𝐴 and 𝑏 are randomly initialized and specified by the downstream task.
Gradients backpropagate through the whole network.
[Figure: a linear layer (𝐴, 𝑏) applied to the decoder’s last hidden state over 𝑤1, … , 𝑤𝑇 predicts ☺/☹. Note how the linear layer hasn’t been pretrained and must be learned from scratch.]
Slide by John Hewitt
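As a sketch of this finetuning setup, the snippet below (numpy, with illustrative shapes; the hidden states would really come from a pretrained decoder, not a random draw) applies a freshly initialized linear head to the last position’s hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)
H, C = 8, 2                              # hidden size, number of classes (illustrative)
h = rng.normal(size=(5, H))              # stand-in for h_1 .. h_T from a pretrained decoder
A = rng.normal(scale=0.01, size=(C, H))  # randomly initialized, task-specific head
b = np.zeros(C)

def classify(hidden_states):
    """Apply the new linear head to the last position's hidden state: y ~ A h_T + b."""
    logits = A @ hidden_states[-1] + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()               # class probabilities
probs = classify(h)
```

During finetuning, gradients of the classification loss flow through `A`, `b`, and the whole pretrained decoder beneath them.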
Pretraining decoders
It’s natural to pretrain decoders as language models and then
use them as generators, finetuning their 𝑝𝜃(𝑤𝑡 | 𝑤1:𝑡−1)!
This is helpful in tasks where the output is a sequence with a vocabulary like that at pretraining time!
ℎ1, … , ℎ𝑇 = Decoder(𝑤1, … , 𝑤𝑇)
𝑤𝑡 ∼ 𝐴ℎ𝑡−1 + 𝑏
Where 𝐴, 𝑏 were pretrained in the language model!
[Figure: the pretrained linear layer (𝐴, 𝑏) maps each hidden state ℎ1, … , ℎ𝑇 of 𝑤1 … 𝑤5 to the next token, predicting 𝑤2 … 𝑤6. Note how the linear layer has been pretrained.]
Slide by John Hewitt
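A greedy-decoding sketch of this use: the decoder is faked here with an embedding lookup (purely illustrative), but the loop structure — feed the prefix, take ℎ𝑡−1, apply the pretrained 𝐴, 𝑏, append the argmax token — is the real one:

```python
import numpy as np

vocab = ["<s>", "Iroh", "goes", "to", "make", "tea", "</s>"]
V, H = len(vocab), 8
rng = np.random.default_rng(1)
E = rng.normal(size=(V, H))
A = rng.normal(size=(V, H))   # pretrained output layer: w_t ~ A h_{t-1} + b
b = np.zeros(V)

def fake_decoder(prefix_ids):
    """Stand-in for the decoder: one hidden state per prefix position.
    In reality this is a full Transformer forward pass."""
    return E[prefix_ids]

def generate(start_id, max_len=5):
    ids = [start_id]
    for _ in range(max_len):
        h = fake_decoder(ids)[-1]          # h_{t-1}, the last hidden state
        nxt = int(np.argmax(A @ h + b))    # greedy choice of w_t
        ids.append(nxt)
        if vocab[nxt] == "</s>":
            break
    return ids

out = generate(vocab.index("<s>"))
```

Sampling instead of taking the argmax gives the stochastic generations shown for GPT-2 below.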
Generative Pretrained Transformer (GPT) [Radford et al., 2018]
2018’s GPT was a big success in pretraining a decoder!
“Generative PreTraining” or “Generative Pretrained Transformer”
Slide by John Hewitt
Generative Pretrained Transformer (GPT) [Radford et al., 2018]
How do we format inputs to our decoder for finetuning tasks?
Natural Language Inference: Label pairs of sentences as entailing/contradictory/neutral
Premise: The man is in the doorway
Hypothesis: The person is near the door
Radford et al., 2018 evaluate on natural language inference.
Here’s roughly how the input was formatted, as a sequence of tokens for the decoder.
[START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]
The linear classifier is applied to the representation of the [EXTRACT] token.
entailment
Slide by John Hewitt
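The formatting step can be sketched directly (token names follow the slide; the actual GPT implementation uses learned embeddings for its delimiter tokens):

```python
def format_nli(premise, hypothesis):
    """Build the token sequence GPT-style finetuning uses for sentence pairs:
    [START] premise [DELIM] hypothesis [EXTRACT]."""
    return ["[START]"] + premise.split() + ["[DELIM]"] + hypothesis.split() + ["[EXTRACT]"]

seq = format_nli("The man is in the doorway", "The person is near the door")
# The classifier reads the decoder's hidden state at the final [EXTRACT] position.
```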
GPT: input formats
Input formats for various finetuning tasks
Generative Pretrained Transformer (GPT) [Radford et al., 2018]
GPT results on various natural language inference datasets.
Slide by John Hewitt
Pretrained decoders can be used in their capacities as language models.
GPT-2, a larger version of GPT trained on more data, was shown to produce relatively convincing samples of natural language.
Increasingly convincing generations (GPT-2) [Radford et al., 2019]
Slide by John Hewitt
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
Decoders
Encoders
Encoder- Decoders
Slide by John Hewitt
Pretraining encoders: what pretraining objective to use?
So far, we’ve looked at language model pretraining. But encoders get bidirectional context, so we can’t do language modeling!
Idea: replace some fraction of words in the input with a special [MASK] token; predict these words.
ℎ1, … , ℎ𝑇 = Encoder(𝑤1, … , 𝑤𝑇)
𝑦𝑖 ∼ 𝐴ℎ𝑖 + 𝑏
[Figure: the encoder reads “I [M] to the [M]”; a linear layer (𝐴, 𝑏) over each ℎ𝑖 predicts the masked words “went” and “store”]
Slide by John Hewitt
BERT
Illustrated BERT:
Notebook:
BERT: Bidirectional Encoder Representations from Transformers
Problem with Previous Methods
Slide from Jacob Devlin
What makes BERT different?
BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain-text corpus.
Pre-trained representations:
Context-free: word2vec, GloVe
Contextual, unidirectional: GPT
Contextual, bidirectional: ELMo (shallowly), BERT (deeply)
Masked LM
[Figure: “the man went to the [MASK] to buy a [MASK] of milk” — the model predicts “store” and “gallon” at the masked positions]
Slide from Jacob Devlin
Masked LM
went to the store → went to the [MASK]
Slide from Jacob Devlin
[Figure: a Transformer encoder reads “I pizza to the [M]” — input tokens may be [Replaced], [Not replaced], or [Masked] — and must predict the originals “went” and “store” at the corrupted positions]
Next Sentence Prediction
Slide from Jacob Devlin
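A sketch of the corruption step, following BERT’s published recipe (about 15% of tokens are selected; of those, 80% become [MASK], 10% are replaced by a random token, 10% are left unchanged). The helper name and the higher selection rate in the demo call are just for illustration on a short sentence:

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """BERT-style masking: select ~select_prob of positions; of those, 80%
    become [MASK], 10% a random token, 10% stay unchanged. Returns the
    corrupted tokens and the (position, original) pairs to predict."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < select_prob:
            targets.append((i, tok))
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token (the model still predicts it)
    return corrupted, targets

toks = "the man went to the store to buy a gallon of milk".split()
# select_prob raised to 0.4 so this short example visibly corrupts something
corrupted, targets = mask_tokens(toks, vocab=toks, select_prob=0.4)
```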
Input Representation
Slide from Jacob Devlin
Hidden state corresponding to [CLS] will be used as the sentence representation
WordPiece
at, fairfax, 1910s
hypatia = h ##yp ##ati ##a
BERT Tokenizer
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')
wps_ids = tokenizer.encode("Hypatia was a mathematician")
wordpieces = tokenizer.convert_ids_to_tokens(wps_ids)
['[CLS]', 'h', '##yp', '##ati', '##a', 'was', 'a', 'mathematician', '[SEP]']
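Under the hood this segmentation comes from greedy longest-match-first lookup in the subword vocabulary. A toy sketch (the real tokenizer also handles casing, per-word unknowns, and a ~30k-piece vocabulary):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece, the scheme behind splits like
    'hypatia' -> h ##yp ##ati ##a. `vocab` is a toy subword inventory."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]      # no piece matched at this position
        start = end
    return pieces

toy_vocab = {"h", "##yp", "##ati", "##a", "was", "a", "mathematician"}
```

For example, `wordpiece("hypatia", toy_vocab)` reproduces the split shown above.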
Explore Embeddings
https://medialab.di.unipi.it:8000/hub/user-redirect/lab/tree/HLT/Lectures/TransformerExplore.ipynb
Unidirectional vs. Bidirectional Models
Unidirectional context
Build representation incrementally
Bidirectional context
Words can “see themselves”
[Figure: with unidirectional context, each layer’s representation of “open a bank” is built only from preceding tokens (“<s> open a”); with bidirectional context, every position attends to the whole sentence, so in an ordinary LM each word could trivially “see itself”]
Slide from Jacob Devlin
Pretraining encoder-decoders: what pretraining objective to use?
The encoder portion benefits from bidirectional context; the decoder portion is used to train the whole model through language modeling.
For encoder-decoders, we could do something like language modeling, but where a prefix of every input is provided to the encoder and is not predicted.
ℎ1, … , ℎ𝑇 = Encoder(𝑤1, … , 𝑤𝑇)
ℎ𝑇+1, … , ℎ2𝑇 = Decoder(𝑤𝑇+1, … , 𝑤2𝑇, ℎ1, … , ℎ𝑇)
𝑦𝑖 ∼ 𝐴ℎ𝑖 + 𝑏, 𝑖 > 𝑇
[Figure: the encoder consumes the prefix 𝑤1, … , 𝑤𝑇; the decoder reads 𝑤𝑇+1, … , 𝑤2𝑇 and predicts 𝑤𝑇+2, … ]
Pretraining encoder-decoders: what pretraining objective to use?
What Raffel et al., 2019 found to work best was span corruption. Their model: T5.
Replace different-length spans from the input with unique placeholders; decode out the spans that were removed!
This is implemented in text preprocessing: it’s still an objective that looks like language modeling at the decoder side.
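A sketch of that preprocessing (T5 samples the spans at random; here they are passed in explicitly, and the sentinel names follow T5’s `<extra_id_*>` convention):

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption sketch: replace each (start, end) span with a
    unique sentinel in the input, and emit the removed spans (each preceded
    by its sentinel) as the decoder's target."""
    inp, tgt, prev = [], [], 0
    for i, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp += tokens[prev:s] + [sentinel]   # keep text up to the span, drop the span
        tgt += [sentinel] + tokens[s:e]      # the decoder must regenerate it
        prev = e
    inp += tokens[prev:]
    return inp, tgt

toks = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(toks, [(2, 3), (6, 8)])
```

So the decoder side still looks like language modeling, exactly as the slide notes, just over the sentinel-delimited targets.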
Pretraining encoder-decoders: what pretraining objective to use?
Raffel et al., 2019 found encoder-decoders to work better than decoders for their tasks, and span corruption (denoising) to work better than language modeling.
Pretraining encoder-decoders: what pretraining objective to use?
A fascinating property of T5: it can be finetuned to answer a wide range of questions, retrieving knowledge from its parameters.
NQ: Natural Questions; WQ: WebQuestions; TQA: TriviaQA — all “open-domain” versions
T5 sizes: 220 million, 770 million, 3 billion, and 11 billion parameters
Two Step Development
Pre-training Tasks
Next Sentence Prediction
Masked LM
Masked LM
Next Sentence Prediction
Binary classification: did segment B actually follow segment A?
Randomly select a split over sentences:
Use one segment as sentence A
50% of the time: use the actual next segment as sentence B (label IsNext)
50% of the time: use a random segment from the corpus (label NotNext)
Masking(Truncate([segment A, segment B]))
Later work has argued this “next sentence prediction” is not necessary.
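The pair-construction logic can be sketched as follows (a minimal version; real BERT preprocessing works with multi-sentence segments and truncates the pair to a length budget):

```python
import random

def make_nsp_pair(sentences, rng):
    """Next-sentence-prediction data creation: pick sentence A, then with
    probability 0.5 pair it with its true successor (label IsNext),
    otherwise with a random sentence from the corpus (label NotNext)."""
    i = rng.randrange(len(sentences) - 1)   # ensure A has a successor
    a = sentences[i]
    if rng.random() < 0.5:
        return a, sentences[i + 1], "IsNext"
    return a, rng.choice(sentences), "NotNext"

corpus = ["the man went to the store", "he bought a gallon of milk",
          "penguins are flightless birds", "they live in the southern hemisphere"]
pair = make_nsp_pair(corpus, random.Random(0))
```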
Model Architecture
Model Details
Slide from Jacob Devlin
Fine Tuning Procedure
Example: Sentence Classification
Task Specific Models
Evaluation of BERT
General Language Understanding Evaluation (GLUE) benchmark: a standard split of the data into train, validation, and test, where the labels for the test set are held only on the evaluation server.
GLUE Results
MultiNLI (Natural Language Inference)
Premise: Hills and mountains are especially sanctified in Jainism.
Hypothesis: Jainism hates nature.
Label: Contradiction
CoLa (Corpus of Linguistic Acceptability)
Sentence: The wagon rumbled down the road. Label: Acceptable
Sentence: The car honked down the road.
Label: Unacceptable
Slide from Jacob Devlin
SQuAD
The Stanford Question Answering Dataset (SQuAD) is a collection of 100k question/answer pairs posed by crowdworkers on a set of Wikipedia articles
Input Question:
Where do water droplets collide with ice to make precipitation?
Input Paragraph:
Precipitation forms as smaller droplets coalesce via collision with other raindrops or ice crystals within a cloud
Answer:
within a cloud
Too easy: answer always present
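BERT-style extractive QA predicts a start logit and an end logit for every paragraph token; the answer is the highest-scoring valid span. A sketch of that decoding (the logit values here are made up):

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=15):
    """Pick the answer span maximizing start + end score, requiring
    end >= start and a bounded span length - the standard decoding
    for BERT-style extractive QA."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Illustrative logits over a 5-token paragraph
start = np.array([0.1, 0.2, 3.0, 0.1, 0.0])
end = np.array([0.0, 0.1, 0.2, 2.5, 0.1])
span = best_span(start, end)
```

For SQuAD 2.0, the score of this span is additionally compared against a no-answer score (typically taken from the [CLS] position).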
SQuAD 2.0
Slide from Jacob Devlin
What action did the US begin that started the second oil shock?
Ground Truth Answers: <No Answer>
Prediction: <No Answer>
Effect of pre-training tasks
Effects of Model Size
Slide from Jacob Devlin
BERT for Contextualized Word Embeddings
Which Layers
References
Post BERT
RoBERTa
Slide from Jacob Devlin
XLNet
Slide from J. Devlin
XLNet
Slide from J. Devlin
XLNet
Slide from J. Devlin
ALBERT
Factorized embedding parameterization: a 100k × 1024 embedding matrix vs a 100k × 128 embedding matrix plus a 128 × 1024 projection
Slide from J. Devlin
ALBERT
Slide from J. Devlin
T5
Slide from J. Devlin
T5
Slide from J. Devlin
Compute
Slide from J. Devlin
Computation and Energy Costs
Parameters, accelerator years of computation, energy consumption, and gross CO2e for GPT-3 and GLaM
GLaM is a mixture-of-experts model that activates experts selectively based on the input, so that no more than 95B parameters are active per input token.
In-context Learning
GPT-3, In-context learning, and very large models
So far, we’ve interacted with pretrained models in two ways: sampling from the distributions they define, or fine-tuning them on a task we care about.
Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts.
GPT-3 is the canonical example of this. The largest T5 model had 11 billion parameters.
GPT-3 has 175 billion parameters.
GPT-3, In-context learning, and very large models
thanks -> merci
hello -> bonjour
mint -> menthe
otter ->
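Assembling such a prompt is pure string formatting; here is a sketch matching the word-translation demo above (the helper name is illustrative):

```python
def few_shot_prompt(examples, query):
    """Assemble an in-context-learning prompt: demonstrations as
    'input -> output' lines, then the query with the output left
    for the model to complete."""
    lines = [f"{x} -> {y}" for x, y in examples]
    lines.append(f"{query} -> ")
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("thanks", "merci"), ("hello", "bonjour"), ("mint", "menthe")], "otter")
```

No gradient step is taken: the model’s continuation of the final line is the “prediction”.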
GPT-3, In-context learning, and very large models
Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts.
Distillation
Applying to production
Slide from J. Devlin
Model Size Growth
Distillation
Slide from J. Devlin
Distillation
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models (Turc et al, 2020)
Slide from J. Devlin
Distillation
Slide from J. Devlin
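The core of the distillation objective can be sketched in a few lines: a cross-entropy between the teacher’s and the student’s temperature-softened output distributions (after Hinton et al.’s recipe; in practice this term is mixed with the ordinary hard-label loss):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Soft-label distillation: cross-entropy of the student's softened
    distribution against the teacher's softened distribution."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(-(p * np.log(q + 1e-12)).sum())

# A student that matches the teacher scores lower than one that disagrees
loss_same = distill_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
loss_diff = distill_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```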
Conclusions
Slide from J. Devlin