1 of 77

Transformer Architectures

Human Language Technologies

Dipartimento di Informatica

Giuseppe Attardi

Università di Pisa

2 of 77

From: pretrained Word Embeddings

Circa 2017:

  • Start with pretrained word embeddings (no context!)
  • Learn how to incorporate context in an LSTM or Transformer while training on the task.

Issues:

  • The training data we have for our downstream task (like question answering) must be sufficient to teach all contextual aspects of language.
  • Most of the parameters in our network are randomly initialized!


Slide from Anna Goldie

3 of 77

To: pretrained Whole Model

In modern NLP:

  • All (or almost all) parameters in NLP networks are initialized via pretraining.
  • Pretraining methods hide parts of the input from the model, and then train the model to reconstruct those parts.

This has been exceptionally effective at building strong:

  • representations of language
  • parameter initializations for strong NLP models.
  • probability distributions over language that we can sample from

[Figure: the whole network is pretrained jointly. This model has learned how to represent entire sentences through pretraining.]

Slide from Anna Goldie

4 of 77

Learning from context

I put ___ fork down on the table.

The woman walked across the street, checking for traffic over ___ shoulder.

I went to the ocean to see the fish, turtles, seals, and _____.

Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ___.

Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ______.

I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____

Slide from Anna Goldie

5 of 77

Pretrained Transformers

6 of 77

Two Step Development

7 of 77

Pretraining through language modeling [Dai and Le, 2015]

Recall the language modelling task:

  • Model 𝑝𝜃(𝑤𝑡| 𝑤1:𝑡−1), the probability distribution over words given their past contexts.
  • There’s lots of data for this! (In English.)

Pretraining through language modelling:

  • Train a neural network to perform language modelling on a large amount of text.
  • Save the network parameters.
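A count-based sketch can make this objective concrete: below, a bigram estimate of 𝑝(𝑤𝑡 | 𝑤𝑡−1) from a toy corpus. The slide's decoder replaces these counts with a learned neural model; the corpus and function names here are illustrative, not from the slides.

```python
from collections import Counter

corpus = "iroh goes to make tasty tea . iroh goes to the market .".split()

# Count bigrams and the contexts they condition on.
bigrams = Counter(zip(corpus, corpus[1:]))
context = Counter(corpus[:-1])

def p(word, prev):
    """Maximum-likelihood estimate of p(word | prev)."""
    return bigrams[(prev, word)] / context[prev] if context[prev] else 0.0

print(p("goes", "iroh"))  # 1.0: "iroh" is always followed by "goes"
print(p("make", "to"))    # 0.5: "to" is followed by "make" once and "the" once
```

A neural LM generalizes beyond seen bigrams, but the training signal is the same: maximize the probability of each observed next word.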

[Figure: a decoder (Transformer, LSTM, ++) reads “Iroh goes to make tasty tea” and is trained to predict each next token: goes, to, make, tasty, tea, END.]

Slide by John Hewitt

8 of 77

The Pretraining / Finetuning Paradigm

Pretraining can improve NLP applications by serving as parameter initialization.

Step 1: Pretrain (on language modeling). Lots of text; learn general things!

[Figure: a decoder (Transformer, LSTM, ++) predicts each next token of “Iroh goes to make tasty tea”.]

Step 2: Finetune (on your task). Not many labels; adapt to the task!

[Figure: the same pretrained decoder is adapted to classify “… the movie was …”.]

Slide by John Hewitt

9 of 77

Model Pretraining

10 of 77

Stochastic gradient descent and pretrain/finetune


Slide by John Hewitt

11 of 77

Pretraining for three types of architectures

The neural architecture influences the type of pretraining, and natural use cases.

Decoders

  • Language models!
  • Nice to generate from; can’t condition on future words
  • Examples: GPT-2, GPT-3, LaMDA

Encoders

  • Gets bidirectional context – can condition on future!
  • How do we pretrain them?
  • Examples: BERT and its many variants, e.g. RoBERTa

Encoder-Decoders

  • Good parts of decoders and encoders?
  • What’s the best way to pretrain them?
  • Examples: Transformer, T5, Meena

Slide by John Hewitt

12 of 77

Pretraining for three types of architectures

The neural architecture influences the type of pretraining, and natural use cases.

Decoders

  • Language models! What we’ve seen so far.
  • Nice to generate from; can’t condition on future words

Encoders

  • Gets bidirectional context – can condition on future!
  • Wait, how do we pretrain them?

Encoder-Decoders

  • Good parts of decoders and encoders?
  • What’s the best way to pretrain them?

Slide by John Hewitt

13 of 77

Pretraining decoders

When using language-model-pretrained decoders, we can ignore that they were trained to model 𝑝(𝑤𝑡 | 𝑤1:𝑡−1).

We can finetune them by training a classifier on the last word’s hidden state:

ℎ1, … , ℎ𝑇 = Decoder(𝑤1, … , 𝑤𝑇)

𝑦 ∼ 𝐴ℎ𝑇 + 𝑏

where 𝐴 and 𝑏 are randomly initialized and specified by the downstream task.

Gradients backpropagate through the whole network.

[Figure: a linear layer (𝐴, 𝑏) on top of the decoder’s last hidden state. Note how the linear layer hasn’t been pretrained and must be learned from scratch.]

Slide by John Hewitt
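The classifier head 𝑦 ∼ 𝐴ℎ𝑇 + 𝑏 from the slide above can be sketched in plain Python. The hidden state ℎ𝑇 here is a made-up stand-in for the decoder's output, and the dimensions are illustrative:

```python
import math
import random

random.seed(0)
hidden_size, n_classes = 4, 3

# h_T: the last word's hidden state from the pretrained decoder (stand-in values).
h_T = [0.2, -1.0, 0.5, 0.3]

# A and b are randomly initialized; they are learned during finetuning.
A = [[random.gauss(0, 0.1) for _ in range(hidden_size)] for _ in range(n_classes)]
b = [0.0] * n_classes

# logits = A h_T + b, then softmax to get class probabilities.
logits = [sum(a * h for a, h in zip(row, h_T)) + bk for row, bk in zip(A, b)]
z = sum(math.exp(l) for l in logits)
probs = [math.exp(l) / z for l in logits]
```

During finetuning, gradients from the classification loss flow through 𝐴, 𝑏 and the whole pretrained decoder.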

14 of 77

Pretraining decoders

It’s natural to pretrain decoders as language models and then use them as generators, finetuning their 𝑝𝜃(𝑤𝑡 | 𝑤1:𝑡−1)!

This is helpful in tasks where the output is a sequence with a vocabulary like that at pretraining time!

  • Dialogue (context = dialogue history)
  • Summarization (context = document)

ℎ1, … , ℎ𝑇 = Decoder(𝑤1, … , 𝑤𝑇)

𝑤𝑡 ∼ 𝐴ℎ𝑡−1 + 𝑏

where 𝐴, 𝑏 were pretrained in the language model!

[Figure: the decoder predicts 𝑤2, … , 𝑤6 from 𝑤1, … , 𝑤5 through the linear layer (𝐴, 𝑏). Note how the linear layer has been pretrained.]

Slide by John Hewitt

15 of 77

Generative Pretrained Transformer (GPT) [Radford et al., 2018]

2018’s GPT was a big success in pretraining a decoder!

  • Transformer decoder with 12 layers.
  • 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
  • Byte-pair encoding with 40,000 merges
  • Trained on BooksCorpus: over 7000 unique books.
    • Contains long spans of contiguous text, for learning long-distance dependencies.
  • The acronym “GPT” never showed up in the original paper; it could stand for “Generative PreTraining” or “Generative Pretrained Transformer”.

Slide by John Hewitt

16 of 77

Generative Pretrained Transformer (GPT) [Radford et al., 2018]

How do we format inputs to our decoder for finetuning tasks?

Natural Language Inference: Label pairs of sentences as entailing/contradictory/neutral

Premise: The man is in the doorway

Hypothesis: The person is near the door

Radford et al., 2018 evaluate on natural language inference.

Here’s roughly how the input was formatted, as a sequence of tokens for the decoder.

[START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]

The linear classifier is applied to the representation of the [EXTRACT] token (here predicting “entailment”).

Slide by John Hewitt
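The formatting step itself is plain string concatenation; here is a sketch (the special-token names follow the slide; the real GPT used learned embeddings for these delimiters, and the function name is illustrative):

```python
def format_nli(premise, hypothesis):
    # Wrap the sentence pair in the special tokens from the slide.
    return f"[START] {premise} [DELIM] {hypothesis} [EXTRACT]"

seq = format_nli("The man is in the doorway", "The person is near the door")
print(seq)
# [START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]
```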

17 of 77

GPT: input formats

Input formats for various finetuning tasks

18 of 77

Generative Pretrained Transformer (GPT) [Radford et al., 2018]

GPT results on various natural language inference datasets.

Slide by John Hewitt

19 of 77

Increasingly convincing generations (GPT-2) [Radford et al., 2019]

Pretrained decoders can be used in their capacities as language models.

GPT-2, a larger version of GPT trained on more data, was shown to produce relatively convincing samples of natural language.

Slide by John Hewitt

20 of 77

Pretraining for three types of architectures

The neural architecture influences the type of pretraining, and natural use cases.

Decoders

  • Language models! What we’ve seen so far.
  • Nice to generate from; can’t condition on future words

Encoders

  • Gets bidirectional context – can condition on future!
  • Wait, how do we pretrain them?

Encoder-Decoders

  • Good parts of decoders and encoders?
  • What’s the best way to pretrain them?

Slide by John Hewitt

21 of 77

Pretraining encoders: what pretraining objective to use?

So far, we’ve looked at language model pretraining. But encoders get bidirectional context, so we can’t do language modeling!

Idea: replace some fraction of words in the input with a special [MASK] token; predict these words.

ℎ1, … , ℎ𝑇 = Encoder(𝑤1, … , 𝑤𝑇)

𝑦𝑖 ∼ 𝐴ℎ𝑖 + 𝑏

[Figure: the encoder reads “I [M] to the [M]” and the output layer (𝐴, 𝑏) predicts the masked words “went” and “store”.]

Slide by John Hewitt

22 of 77

BERT

23 of 77

Illustrated BERT:

  • http://jalammar.github.io/illustrated-bert/

Notebook:

  • https://colab.research.google.com/drive/1hMLd5-r82FrnFnBub-B-fVW78Px4KPX1

BERT: Bidirectional Encoder Representations from Transformers

24 of 77

Problem with Previous Methods

  • Problem: Language models only use left context or right context, but language understanding is bidirectional.
  • Why are LMs unidirectional?
  • Reason 1: Directionality is needed to generate a well-formed probability distribution.
    • We don’t care about this.
  • Reason 2: Words can “see themselves” in a bidirectional encoder.

Slide from Jacob Devlin

25 of 77

What makes BERT different?

BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.

Pre-trained representations can be:

  • Context-free: word2vec, GloVe
  • Contextual, unidirectional: GPT
  • Contextual, bidirectional: ELMo (shallow), BERT

26 of 77

Masked LM

  • Solution: Mask out k% of the input words, and then predict the masked words
    • Typically k = 15%

the man went to the [MASK] to buy a [MASK] of milk → predict “store” and “gallon”

  • Too little masking: Too expensive to train
  • Too much masking: Not enough context

Slide from Jacob Devlin

27 of 77

Masked LM

  • Problem: Mask token never seen at fine-tuning
  • Solution: still choose 15% of the words to predict, but don’t replace them with [MASK] 100% of the time. Instead:
    • 80% of the time, replace with [MASK]: went to the store → went to the [MASK]
    • 10% of the time, replace with a random word: went to the store → pizza to the store
    • 10% of the time, keep the word unchanged: went to the store → went to the store

Slide from Jacob Devlin

[Figure: the Transformer encoder reads “I pizza to the [M]” — where “pizza” was replaced, “to the” was not replaced, and the last token was masked — and predicts the originals: “went” and “store”.]
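The 80/10/10 rule above can be sketched as a standalone function. This is a simplification (real BERT first samples exactly 15% of positions, works on WordPiece ids, and caps the number of predictions per sequence); all names here are illustrative:

```python
import random

def mask_tokens(tokens, vocab, k=0.15, seed=1):
    """Select ~k of the tokens to predict; 80% -> [MASK], 10% -> random word, 10% -> unchanged."""
    rng = random.Random(seed)
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < k:
            labels[i] = tok                 # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"           # 80%: replace with the mask token
            elif r < 0.9:
                out[i] = rng.choice(vocab)  # 10%: replace with a random word
            # else: 10%: keep the token unchanged (but still predict it)
    return out, labels

tokens = "the man went to the store to buy milk".split()
masked, labels = mask_tokens(tokens, ["pizza", "cat", "gallon"])
print(masked)
print(labels)
```

Positions where `labels` is `None` contribute no loss; the rest are the "[Predict these!]" positions in the figure.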

28 of 77

Next Sentence Prediction

  • To learn relationships between sentences, predict whether Sentence B is the actual sentence that follows Sentence A, or a random sentence

Slide from Jacob Devlin

29 of 77

Input Representation

  • Use a 30,000-token WordPiece vocabulary on input.
  • Each token’s input representation is the sum of three embeddings (token, segment, and position).
  • A single sequence is much more efficient.

Slide from Jacob Devlin

Hidden state corresponding to [CLS] will be used as the sentence representation
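The "sum of three embeddings" is literally elementwise addition. A toy sketch with made-up table sizes (the real model uses the 30k WordPiece vocabulary, 2 segments, and 512 positions, with learned values):

```python
import random

random.seed(0)
dim = 8

def emb_table(rows):
    # Randomly initialized embedding table (learned in the real model).
    return [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(rows)]

tok_emb, seg_emb, pos_emb = emb_table(100), emb_table(2), emb_table(16)

def input_repr(token_ids, segment_ids):
    # Each input vector = token embedding + segment embedding + position embedding.
    return [[t + s + p for t, s, p in zip(tok_emb[tid], seg_emb[sid], pos_emb[i])]
            for i, (tid, sid) in enumerate(zip(token_ids, segment_ids))]

x = input_repr([1, 5, 7], [0, 0, 1])
```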

30 of 77

WordPiece

  • BERT uses a variant of the wordpiece model
  • Wordpieces give a good balance between the flexibility of single characters and the efficiency of full words for decoding, and also sidestep the need for special treatment of unknown words.
  • (Relatively) common words are in the vocabulary:

at, fairfax, 1910s

  • Other words are built from wordpieces:

hypatia = h ##yp ##ati ##a

  • Wordpiece Model:
    • Given a training corpus and a number of desired tokens D, select D wordpieces such that the resulting corpus is minimal in the number of wordpieces when segmented according to the chosen wordpiece model. 
  • If you’re using BERT in an otherwise word-based model, you have to deal with this.
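Segmenting with an already-chosen wordpiece vocabulary is greedy longest-match-first. A sketch over a toy vocabulary (real BERT uses its 30k learned vocabulary; the function name is illustrative):

```python
def wordpiece(word, vocab):
    """Greedily take the longest vocabulary piece at each position ('##' marks continuations)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:          # no piece matched: the word is unknown
            return ["[UNK]"]
        start = end
    return pieces

vocab = {"h", "##yp", "##ati", "##a", "at", "fairfax"}
print(wordpiece("hypatia", vocab))  # ['h', '##yp', '##ati', '##a']
```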

31 of 77

BERT Tokenizer

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

wps_ids = tokenizer.encode("Hypatia was a mathematician")
wordpieces = tokenizer.convert_ids_to_tokens(wps_ids)
# ['[CLS]', 'h', '##yp', '##ati', '##a', 'was', 'a', 'mathematician', '[SEP]']

32 of 77

Explore Embeddings

  • See notebook:

https://medialab.di.unipi.it:8000/hub/user-redirect/lab/tree/HLT/Lectures/TransformerExplore.ipynb

33 of 77

Unidirectional vs. Bidirectional Models

Unidirectional context: build the representation incrementally; each position only attends to words on its left (e.g. predicting “bank” from “<s> open a”).

Bidirectional context: every layer sees the whole input “open a bank”, so words can “see themselves”.

[Figure: two two-layer networks over the same tokens; the unidirectional model has only left-to-right connections, while the bidirectional model connects every position to every other.]

Slide from Jacob Devlin

34 of 77

Pretraining encoder-decoders: what pretraining objective to use?

The encoder portion benefits from bidirectional context; the decoder portion is used to train the whole model through language modeling.

For encoder-decoders, we could do something like language modeling, but where a prefix of every input is provided to the encoder and is not predicted:

ℎ1, … , ℎ𝑇 = Encoder(𝑤1, … , 𝑤𝑇)

ℎ𝑇+1, … , ℎ2𝑇 = Decoder(𝑤𝑇+1, … , 𝑤2𝑇, ℎ1, … , ℎ𝑇)

𝑦𝑖 ∼ 𝐴ℎ𝑖 + 𝑏, 𝑖 > 𝑇

35 of 77

Pretraining encoder-decoders: what pretraining objective to use?

What Raffel et al., 2019 found to work best was span corruption. Their model: T5.

Replace different-length spans from the input with unique placeholders; decode out the spans that were removed!

This is implemented in text preprocessing: it’s still an objective that looks like language modeling at the decoder side.
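That preprocessing can be sketched directly. Here the span positions are given by hand (T5 samples them), and the sentinels follow T5's <extra_id_N> naming; the function name is illustrative:

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a unique sentinel; the target decodes the removed spans."""
    inp, tgt, prev = [], [], 0
    for n, (s, e) in enumerate(spans):      # spans: sorted, non-overlapping
        sentinel = f"<extra_id_{n}>"
        inp += tokens[prev:s] + [sentinel]  # keep text up to the span, then the placeholder
        tgt += [sentinel] + tokens[s:e]     # the decoder emits sentinel + removed tokens
        prev = e
    inp += tokens[prev:]
    return inp, tgt

toks = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(toks, [(2, 4), (8, 9)])
print(inp)  # ['Thank', 'you', '<extra_id_0>', 'me', 'to', 'your', 'party', '<extra_id_1>', 'week']
print(tgt)  # ['<extra_id_0>', 'for', 'inviting', '<extra_id_1>', 'last']
```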

36 of 77

Pretraining encoder-decoders: what pretraining objective to use?

Raffel et al., 2019 found encoder-decoders to work better than decoders for their tasks, and span corruption (denoising) to work better than language modeling.

37 of 77

Pretraining encoder-decoders: what pretraining objective to use?

A fascinating property of T5: it can be finetuned to answer a wide range of questions, retrieving knowledge from its parameters.

NQ: Natural Questions; WQ: WebQuestions; TQA: TriviaQA — all “open-domain” versions.

[Figure: accuracy improves with T5 model size: 220 million, 770 million, 3 billion, and 11 billion parameters.]

38 of 77

Two Step Development

39 of 77

Pre-training Tasks

Masked LM

  • train a deep bidirectional representation, masking some percentage of the input tokens at random, and then predicting those masked tokens
  • the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM

Next Sentence Prediction

  • in order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task generated from any corpus
  • 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence

40 of 77

Masked LM

  • Randomly select 15% of tokens (up to 20 per seq)
  • For 80% of the time:
    • Replace the word with the [MASK] token
  • For 10% of the time:
    • Replace the word with a random word
  • For 10% of the time
    • Keep the word unchanged

41 of 77

Next Sentence Prediction

Binary classification

Randomly select a split over sentences:

Use one as sentence A

For 50% of the time:

  • Sample random sentence split from another document as sentence B.

For 50% of the time:

  • Use the actual sentences as sentence B.

Masking (Truncate([segment A, segment B]))

Later work has argued this “next sentence prediction” is not necessary.
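The pair-construction steps above can be sketched as follows (a simplification: real BERT samples multi-sentence segments up to a length budget and then applies masking and truncation; names here are illustrative):

```python
import random

def nsp_pair(docs, rng):
    """One (A, B, is_next) example: 50% actual next sentence, 50% sentence from another document."""
    doc = rng.choice([d for d in docs if len(d) > 1])
    i = rng.randrange(len(doc) - 1)
    a = doc[i]
    if rng.random() < 0.5:
        return a, doc[i + 1], True              # B really follows A
    other = rng.choice([d for d in docs if d is not doc])
    return a, rng.choice(other), False          # B sampled from a different document

docs = [["s1", "s2", "s3"], ["t1", "t2"]]
rng = random.Random(0)
pairs = [nsp_pair(docs, rng) for _ in range(100)]
```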

42 of 77

Model Architecture

  • BERT-BASE: a 12-layer model, comparable in size to the OpenAI Transformer in order to compare performance
  • BERT-LARGE: a huge 24-layer model which achieved state-of-the-art results

  • BERT is basically a trained Transformer encoder stack.

43 of 77

Model Details

  • Data: Wikipedia (2.5B words) + BookCorpus (800M words)
  • Batch Size: 131,072 words (1024 sequences * 128 length or 256 sequences * 512 length)
  • Training Time: 1M steps (~40 epochs)
  • Optimizer: AdamW, 1e-4 learning rate, linear decay
  • BERT-Base: 12-layer, 768-hidden, 12-head
  • BERT-Large: 24-layer, 1024-hidden, 16-head
  • Trained on 4x4 or 8x8 TPU slice for 4 days

Slide from Jacob Devlin

44 of 77

Fine Tuning Procedure

45 of 77

Example: Sentence Classification

46 of 77

Task Specific Models

47 of 77

Evaluation of BERT

General Language Understanding Evaluation (GLUE) benchmark: a standard split of data into train, validation, and test, where the test-set labels are held only on the server.

  • Sentence pair tasks
    • MNLI, Multi-Genre Natural Language Inference
    • QQP, Quora Question Pairs
    • QNLI, Question Natural Language Inference
    • STS-B The Semantic Textual Similarity Benchmark
    • MRPC Microsoft Research Paraphrase Corpus
    • RTE Recognizing Textual Entailment
    • WNLI Winograd NLI is a small natural language inference dataset
  • Single sentence classification
    • SST-2 The Stanford Sentiment Treebank
    • CoLA The Corpus of Linguistic Acceptability

48 of 77

GLUE Results

MultiNLI (Natural Language Inference)

Premise: Hills and mountains are especially sanctified in Jainism.
Hypothesis: Jainism hates nature.
Label: Contradiction

CoLa (Corpus of Linguistic Acceptability)

Sentence: The wagon rumbled down the road. Label: Acceptable

Sentence: The car honked down the road.

Label: Unacceptable

Slide from Jacob Devlin

49 of 77

SQUAD

The Stanford Question Answering Dataset (SQuAD) is a collection of 100k question/answer pairs posed by crowdworkers on a set of Wikipedia articles

Input Question:

Where do water droplets collide with ice to make precipitation?

Input Paragraph:

Precipitation forms as smaller droplets coalesce via collision with other raindrops or ice crystals within a cloud

Answer:

within a cloud

Too easy: answer always present

50 of 77

SQUAD 2.0

  • Use token 0 ([CLS]) to emit logit for “no answer”
  • “No answer” directly competes with answer span
  • Threshold is optimized on dev set

Slide from Jacob Devlin

What action did the US begin that started the second oil shock?

Ground Truth Answers: <No Answer>

Prediction: <No Answer>

51 of 77

Effect of pre-training tasks

  • Masked LM (compared to left-to-right LM) is very important on some tasks, Next Sentence Prediction is important on other tasks.
  • Left-to-right model does very poorly on word-level task (SQuAD), although this is mitigated by BiLSTM

52 of 77

Effects of Model Size

  • Big models help a lot
  • Going from 110M -> 340M params helps even on datasets with 3,600 labelled examples
  • Improvements have not asymptoted

Slide from Jacob Devlin

53 of 77

BERT for Contextualized Word Embeddings

54 of 77

Which Layers

55 of 77

References

56 of 77

Post BERT

57 of 77

RoBERTa

  • RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al, University of Washington and Facebook, 2019)
  • Trained BERT for more epochs and/or on more data
    • Showed that more epochs alone helps, even on same data
    • More data also helps
  • Improved masking and pre-training data slightly

Slide from Jacob Devlin

58 of 77

XLNet

  • XLNet: Generalized Autoregressive Pretraining for Language Understanding (Yang et al, CMU and Google, 2019)
  • Innovation #1: Relative position embeddings
    • Sentence: John ate a hot dog
    • Absolute attention: “How much should dog attend to hot (in any position), and how much should dog in position 4 attend to the word in position 3? (Or 508 attend to 507, ...)”
    • Relative attention: “How much should dog attend to hot (in any position) and how much should dog attend to the previous word?”

Slide from J. Devlin

59 of 77

XLNet

  • Innovation #2: Permutation Language Modeling
    • In a left-to-right language model, every word is predicted based on all of the words to its left
    • Instead: Randomly permute the order for every training sentence
    • Equivalent to masking, but many more predictions per sentence
    • Can be done efficiently with Transformers

Slide from J. Devlin

60 of 77

XLNet

  • Also used more data and bigger models, but showed that innovations improved on BERT even with same data and model size
  • XLNet results:

Slide from J. Devlin

61 of 77

ALBERT

  • ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (Lan et al, Google and TTI Chicago, 2019)
  • Innovation #1: Factorized embedding parameterization
    • Use small embedding size (e.g., 128) and then project it to Transformer hidden size (e.g., 1024) with parameter matrix

100k × 1024 (direct embedding matrix) vs. 100k × 128 + 128 × 1024 (factorized)

Slide from J. Devlin
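The arithmetic behind the factorization, using the slide's example sizes (V = 100k vocabulary, H = 1024 hidden size, E = 128 embedding size):

```python
V, H, E = 100_000, 1024, 128

direct = V * H               # one V x H embedding matrix
factorized = V * E + E * H   # a V x E embedding followed by an E x H projection

print(direct)      # 102400000
print(factorized)  # 12931072
```

Roughly an 8x reduction in embedding parameters, since the projection term E × H is tiny compared to V × E.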

62 of 77

ALBERT

  • Innovation #2: Cross-layer parameter sharing
    • Share all parameters between Transformer layers
  • Results:

  • ALBERT is light in terms of parameters, not speed

Slide from J. Devlin

63 of 77

T5

  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al, Google, 2019)
  • Ablated many aspects of pre-training:
    • Model size
    • Amount of training data
    • Domain/cleanness of training data
    • Pre-training objective details (e.g., span length of masked text)
    • Ensembling
    • Finetuning recipe (e.g., only allowing certain layers to finetune)
    • Multi-task training

Slide from J. Devlin

64 of 77

T5

  • Conclusions:
    • Scaling up model size and amount of training data helps a lot
    • Best model is 11B parameters (BERT-Large is 340M), trained on 120B words of cleaned common crawl text
    • Exact masking/corruptions strategy doesn’t matter that much
    • Mostly negative results for better finetuning and multi-task strategies
  • T5 results:

Slide from J. Devlin

65 of 77

Compute

  • SoTA requires lots of compute

Slide from J. Devlin

66 of 77

Computation and Energy Costs

Parameters, accelerator years of computation, energy consumption, and gross CO2e for GPT-3 and GLaM

GLaM is a mixture of experts model that only activates experts selectively based on the input so that no more than 95B parameters are active per input token

67 of 77

In-context Learning

68 of 77

GPT-3, In-context learning, and very large models

So far, we’ve interacted with pretrained models in two ways:

  • Sample from the distributions they define (maybe providing a prompt)
  • Fine-tune them on a task we care about, and take their predictions.

Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts.

GPT-3 is the canonical example of this. The largest T5 model had 11 billion parameters.

GPT-3 has 175 billion parameters.

69 of 77

GPT-3, In-context learning, and very large models

  • Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts.
  • The in-context examples seem to specify the task to be performed, and the conditional distribution mocks performing the task to a certain extent.
  • Input (prefix within a single Transformer decoder context):

    thanks -> merci
    hello -> bonjour
    mint -> menthe
    otter ->

  • Output (conditional generation): loutre
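Building such a prompt is just string formatting (the function name is illustrative); the "learning" happens entirely inside the frozen model's forward pass:

```python
def few_shot_prompt(examples, query):
    """In-context examples specify the task; the model is asked to continue the pattern."""
    lines = [f"{src} -> {tgt}" for src, tgt in examples]
    lines.append(f"{query} -> ")
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("thanks", "merci"), ("hello", "bonjour"), ("mint", "menthe")], "otter")
print(prompt)
```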

70 of 77

GPT-3, In-context learning, and very large models

Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts.

71 of 77

Distillation

72 of 77

Applying to production

  • BERT and other pre-trained language models are extremely large and expensive
  • How are companies applying them to low-latency production services?

Slide from J. Devlin

73 of 77

Model Size Growth

74 of 77

Distillation

  • Answer: Distillation (a.k.a., model compression)
  • Idea has been around for a long time:
    • Model Compression (Bucila et al, 2006)
    • Distilling the Knowledge in a Neural Network (Hinton et al, 2015)
  • Simple technique:
    • Train “Teacher”: Use SOTA pre-training + fine-tuning technique to train model with maximum accuracy
    • Label a large amount of unlabeled input examples with Teacher
    • Train “Student”: Much smaller model (e.g., 50x smaller) which is trained to mimic Teacher output
    • Student objective is typically Mean Square Error or Cross Entropy
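The cross-entropy variant of the student objective can be sketched directly. This is only a sketch: in practice teacher and student outputs come from full models, often softened with a temperature.

```python
import math

def soft_cross_entropy(teacher_probs, student_logits):
    """Cross entropy between the teacher's soft labels and the student's softmax."""
    z = sum(math.exp(l) for l in student_logits)
    log_probs = [l - math.log(z) for l in student_logits]
    return -sum(t * lp for t, lp in zip(teacher_probs, log_probs))

teacher = [0.7, 0.2, 0.1]                  # soft labels produced by the teacher
matching = [math.log(p) for p in teacher]  # student that reproduces the teacher exactly
uniform = [0.0, 0.0, 0.0]                  # uninformative student

# The loss is minimized (equal to the teacher's entropy) when the student matches.
assert soft_cross_entropy(teacher, matching) < soft_cross_entropy(teacher, uniform)
```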

Slide from J. Devlin

75 of 77

Distillation

  • Example distillation results
    • 50k labeled examples, 8M unlabeled examples

  • Distillation works much better than pre-training + fine-tuning with smaller model

Well-Read Students Learn Better: On the Importance of Pre-training Compact Models (Turc et al, 2020)

Slide from J. Devlin

76 of 77

Distillation

  • Why does distillation work so well? A hypothesis:
    • Language modeling is the “ultimate” NLP task in many ways
      • I.e., a perfect language model is also a perfect question answering/entailment/sentiment analysis model
    • Training a massive language model learns millions of latent features which are useful for these other NLP tasks
    • Finetuning mostly just picks up and tweaks these existing latent features
    • This requires an oversized model, because only a subset of the features are useful for any given task
    • Distillation allows the model to only focus on those features
    • Supporting evidence: Simple self-distillation (distilling a smaller BERT model) doesn’t work

Slide from J. Devlin

77 of 77

Conclusions

  • Pre-trained bidirectional language models work incredibly well
  • However, the models are extremely expensive
  • Improvements (unfortunately) seem to mostly come from even more expensive models and more data
  • The inference/serving problem is addressed through distillation
  • Emergent in-context learning is not yet well-understood!

Slide from J. Devlin