1 of 77

Transformer Architectures

Human Language Technologies

Dipartimento di Informatica

Giuseppe Attardi

Università di Pisa

2 of 77

From: pretrained Word Embeddings

Circa 2017:

  • Start with pretrained word embeddings (no context!)
  • Learn how to incorporate context in an LSTM or Transformer while training on the task.

Issues:

  • The training data we have for our downstream task (like question answering) must be sufficient to teach all contextual aspects of language.
  • Most of the parameters in our network are randomly initialized!


Slide from Anna Goldie

3 of 77

To: pretrained Whole Model

In modern NLP:

  • All (or almost all) parameters in NLP networks are initialized via pretraining.
  • Pretraining methods hide parts of the input from the model, and then train the model to reconstruct those parts.

This has been exceptionally effective at building strong:

  • representations of language
  • parameter initializations for strong NLP models.
  • probability distributions over language that we can sample from

[Figure: the whole network is pretrained jointly. This model has learned how to represent entire sentences through pretraining.]

Slide from Anna Goldie

4 of 77

Learning from context

I put ___ fork down on the table.

The woman walked across the street, checking for traffic over ___ shoulder.

I went to the ocean to see the fish, turtles, seals, and _____.

Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ___.

Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ______.

I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____

Slide from Anna Goldie

5 of 77

Pretrained Transformers

6 of 77

Two Step Development

7 of 77

Pretraining through language modeling [Dai and Le, 2015]

Recall the language modelling task:

  • Model 𝑝𝜃(𝑤𝑡| 𝑤1:𝑡−1), the probability distribution over words given their past contexts.
  • There’s lots of data for this! (In English.)

Pretraining through language modelling:

  • Train a neural network to perform language modelling on a large amount of text.
  • Save the network parameters.
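A count-based sketch can make this objective concrete: below, a bigram estimate of 𝑝(𝑤𝑡 | 𝑤𝑡−1) from a toy corpus. The slide's decoder replaces these counts with a learned neural model; the corpus and function names here are illustrative, not from the slides.

```python
from collections import Counter

corpus = "iroh goes to make tasty tea . iroh goes to the market .".split()

# Count bigrams and the contexts they condition on.
bigrams = Counter(zip(corpus, corpus[1:]))
context = Counter(corpus[:-1])

def p(word, prev):
    """Maximum-likelihood estimate of p(word | prev)."""
    return bigrams[(prev, word)] / context[prev] if context[prev] else 0.0

print(p("goes", "iroh"))  # 1.0: "iroh" is always followed by "goes"
print(p("make", "to"))    # 0.5: "to" is followed by "make" once and "the" once
```

A neural LM generalizes beyond seen bigrams, but the training signal is the same: maximize the probability of each observed next word.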

[Figure: a decoder (Transformer, LSTM, ++) reads “Iroh goes to make tasty tea” and is trained to predict each next token: goes, to, make, tasty, tea, END.]

Slide by John Hewitt

8 of 77

The Pretraining / Finetuning Paradigm

Pretraining can improve NLP applications by serving as parameter initialization.

Step 1: Pretrain (on language modeling). Lots of text; learn general things!

[Figure: a decoder (Transformer, LSTM, ++) predicts each next token of “Iroh goes to make tasty tea”.]

Step 2: Finetune (on your task). Not many labels; adapt to the task!

[Figure: the same pretrained decoder is adapted to classify “… the movie was …”.]

Slide by John Hewitt

9 of 77

Model Pretraining

10 of 77

Stochastic gradient descent and pretrain/finetune


Slide by John Hewitt

11 of 77

Pretraining for three types of architectures

The neural architecture influences the type of pretraining, and natural use cases.

Decoders

  • Language models!
  • Nice to generate from; can’t condition on future words
  • Examples: GPT-2, GPT-3, LaMDA

Encoders

  • Gets bidirectional context – can condition on future!
  • How do we pretrain them?
  • Examples: BERT and its many variants, e.g. RoBERTa

Encoder-Decoders

  • Good parts of decoders and encoders?
  • What’s the best way to pretrain them?
  • Examples: Transformer, T5, Meena

Slide by John Hewitt

12 of 77

Pretraining for three types of architectures

The neural architecture influences the type of pretraining, and natural use cases.

Decoders

  • Language models! What we’ve seen so far.
  • Nice to generate from; can’t condition on future words

Encoders

  • Gets bidirectional context – can condition on future!
  • Wait, how do we pretrain them?

Encoder-Decoders

  • Good parts of decoders and encoders?
  • What’s the best way to pretrain them?

Slide by John Hewitt

13 of 77

Pretraining decoders

When using language-model-pretrained decoders, we can ignore that they were trained to model 𝑝(𝑤𝑡 | 𝑤1:𝑡−1).

We can finetune them by training a classifier on the last word’s hidden state:

ℎ1, … , ℎ𝑇 = Decoder(𝑤1, … , 𝑤𝑇)

𝑦 ∼ 𝐴ℎ𝑇 + 𝑏

where 𝐴 and 𝑏 are randomly initialized and specified by the downstream task.

Gradients backpropagate through the whole network.

[Figure: a linear layer (𝐴, 𝑏) on top of the decoder’s last hidden state. Note how the linear layer hasn’t been pretrained and must be learned from scratch.]

Slide by John Hewitt
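The classifier head 𝑦 ∼ 𝐴ℎ𝑇 + 𝑏 from the slide above can be sketched in plain Python. The hidden state ℎ𝑇 here is a made-up stand-in for the decoder's output, and the dimensions are illustrative:

```python
import math
import random

random.seed(0)
hidden_size, n_classes = 4, 3

# h_T: the last word's hidden state from the pretrained decoder (stand-in values).
h_T = [0.2, -1.0, 0.5, 0.3]

# A and b are randomly initialized; they are learned during finetuning.
A = [[random.gauss(0, 0.1) for _ in range(hidden_size)] for _ in range(n_classes)]
b = [0.0] * n_classes

# logits = A h_T + b, then softmax to get class probabilities.
logits = [sum(a * h for a, h in zip(row, h_T)) + bk for row, bk in zip(A, b)]
z = sum(math.exp(l) for l in logits)
probs = [math.exp(l) / z for l in logits]
```

During finetuning, gradients from the classification loss flow through 𝐴, 𝑏 and the whole pretrained decoder.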

14 of 77

Pretraining decoders

It’s natural to pretrain decoders as language models and then use them as generators, finetuning their 𝑝𝜃(𝑤𝑡 | 𝑤1:𝑡−1)!

This is helpful in tasks where the output is a sequence with a vocabulary like that at pretraining time!

  • Dialogue (context = dialogue history)
  • Summarization (context = document)

ℎ1, … , ℎ𝑇 = Decoder(𝑤1, … , 𝑤𝑇)

𝑤𝑡 ∼ 𝐴ℎ𝑡−1 + 𝑏

where 𝐴, 𝑏 were pretrained in the language model!

[Figure: the decoder predicts 𝑤2, … , 𝑤6 from 𝑤1, … , 𝑤5 through the linear layer (𝐴, 𝑏). Note how the linear layer has been pretrained.]

Slide by John Hewitt

15 of 77

Generative Pretrained Transformer (GPT) [Radford et al., 2018]

2018’s GPT was a big success in pretraining a decoder!

  • Transformer decoder with 12 layers.
  • 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
  • Byte-pair encoding with 40,000 merges
  • Trained on BooksCorpus: over 7000 unique books.
    • Contains long spans of contiguous text, for learning long-distance dependencies.
  • The acronym “GPT” never showed up in the original paper; it could stand for “Generative PreTraining” or “Generative Pretrained Transformer”.

Slide by John Hewitt

16 of 77

Generative Pretrained Transformer (GPT) [Radford et al., 2018]

How do we format inputs to our decoder for finetuning tasks?

Natural Language Inference: Label pairs of sentences as entailing/contradictory/neutral

Premise: The man is in the doorway

Hypothesis: The person is near the door

Radford et al., 2018 evaluate on natural language inference.

Here’s roughly how the input was formatted, as a sequence of tokens for the decoder.

[START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]

The linear classifier is applied to the representation of the [EXTRACT] token (here predicting “entailment”).

Slide by John Hewitt
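The formatting step itself is plain string concatenation; here is a sketch (the special-token names follow the slide; the real GPT used learned embeddings for these delimiters, and the function name is illustrative):

```python
def format_nli(premise, hypothesis):
    # Wrap the sentence pair in the special tokens from the slide.
    return f"[START] {premise} [DELIM] {hypothesis} [EXTRACT]"

seq = format_nli("The man is in the doorway", "The person is near the door")
print(seq)
# [START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]
```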

17 of 77

GPT: input formats

Input formats for various finetuning tasks

18 of 77

Generative Pretrained Transformer (GPT) [Radford et al., 2018]

GPT results on various natural language inference datasets.

Slide by John Hewitt

19 of 77

Increasingly convincing generations (GPT-2) [Radford et al., 2019]

Pretrained decoders can be used in their capacities as language models.

GPT-2, a larger version of GPT trained on more data, was shown to produce relatively convincing samples of natural language.

Slide by John Hewitt

20 of 77

Pretraining for three types of architectures

The neural architecture influences the type of pretraining, and natural use cases.

Decoders

  • Language models! What we’ve seen so far.
  • Nice to generate from; can’t condition on future words

Encoders

  • Gets bidirectional context – can condition on future!
  • Wait, how do we pretrain them?

Encoder-Decoders

  • Good parts of decoders and encoders?
  • What’s the best way to pretrain them?

Slide by John Hewitt

21 of 77

Pretraining encoders: what pretraining objective to use?

So far, we’ve looked at language model pretraining. But encoders get bidirectional context, so we can’t do language modeling!

Idea: replace some fraction of words in the input with a special [MASK] token; predict these words.

ℎ1, … , ℎ𝑇 = Encoder(𝑤1, … , 𝑤𝑇)

𝑦𝑖 ∼ 𝐴ℎ𝑖 + 𝑏

[Figure: the encoder reads “I [M] to the [M]” and the output layer (𝐴, 𝑏) predicts the masked words “went” and “store”.]

Slide by John Hewitt

22 of 77

BERT

23 of 77

Illustrated BERT:

  • http://jalammar.github.io/illustrated-bert/

Notebook:

  • https://colab.research.google.com/drive/1hMLd5-r82FrnFnBub-B-fVW78Px4KPX1

BERT: Bidirectional Encoder Representations from Transformers

24 of 77

Problem with Previous Methods

  • Problem: Language models only use left context or right context, but language understanding is bidirectional.
  • Why are LMs unidirectional?
  • Reason 1: Directionality is needed to generate a well-formed probability distribution.
    • We don’t care about this.
  • Reason 2: Words can “see themselves” in a bidirectional encoder.

Slide from Jacob Devlin

25 of 77

What makes BERT different?

BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.

Pre-trained representations can be:

  • Context-free: word2vec, GloVe
  • Contextual, unidirectional: GPT
  • Contextual, bidirectional: ELMo (shallow), BERT

26 of 77

Masked LM

  • Solution: Mask out k% of the input words, and then predict the masked words
    • Typically k = 15%

the man went to the [MASK] to buy a [MASK] of milk → predict “store” and “gallon”

  • Too little masking: Too expensive to train
  • Too much masking: Not enough context

Slide from Jacob Devlin

27 of 77

Masked LM

  • Problem: Mask token never seen at fine-tuning
  • Solution: still choose 15% of the words to predict, but don’t replace them with [MASK] 100% of the time. Instead:
    • 80% of the time, replace with [MASK]: went to the store → went to the [MASK]
    • 10% of the time, replace with a random word: went to the store → pizza to the store
    • 10% of the time, keep the word unchanged: went to the store → went to the store

Slide from Jacob Devlin

[Figure: the Transformer encoder reads “I pizza to the [M]” — where “pizza” was replaced, “to the” was not replaced, and the last token was masked — and predicts the originals: “went” and “store”.]
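The 80/10/10 rule above can be sketched as a standalone function. This is a simplification (real BERT first samples exactly 15% of positions, works on WordPiece ids, and caps the number of predictions per sequence); all names here are illustrative:

```python
import random

def mask_tokens(tokens, vocab, k=0.15, seed=1):
    """Select ~k of the tokens to predict; 80% -> [MASK], 10% -> random word, 10% -> unchanged."""
    rng = random.Random(seed)
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < k:
            labels[i] = tok                 # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"           # 80%: replace with the mask token
            elif r < 0.9:
                out[i] = rng.choice(vocab)  # 10%: replace with a random word
            # else: 10%: keep the token unchanged (but still predict it)
    return out, labels

tokens = "the man went to the store to buy milk".split()
masked, labels = mask_tokens(tokens, ["pizza", "cat", "gallon"])
print(masked)
print(labels)
```

Positions where `labels` is `None` contribute no loss; the rest are the "[Predict these!]" positions in the figure.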

28 of 77

Next Sentence Prediction

  • To learn relationships between sentences, predict whether Sentence B is the actual sentence that follows Sentence A, or a random sentence

Slide from Jacob Devlin

29 of 77

Input Representation

  • Use a 30,000-token WordPiece vocabulary on input.
  • Each token’s input representation is the sum of three embeddings (token, segment, and position).
  • A single sequence is much more efficient.

Slide from Jacob Devlin

Hidden state corresponding to [CLS] will be used as the sentence representation
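The "sum of three embeddings" is literally elementwise addition. A toy sketch with made-up table sizes (the real model uses the 30k WordPiece vocabulary, 2 segments, and 512 positions, with learned values):

```python
import random

random.seed(0)
dim = 8

def emb_table(rows):
    # Randomly initialized embedding table (learned in the real model).
    return [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(rows)]

tok_emb, seg_emb, pos_emb = emb_table(100), emb_table(2), emb_table(16)

def input_repr(token_ids, segment_ids):
    # Each input vector = token embedding + segment embedding + position embedding.
    return [[t + s + p for t, s, p in zip(tok_emb[tid], seg_emb[sid], pos_emb[i])]
            for i, (tid, sid) in enumerate(zip(token_ids, segment_ids))]

x = input_repr([1, 5, 7], [0, 0, 1])
```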

30 of 77

WordPiece

  • BERT uses a variant of the wordpiece model
  • Wordpieces give a good balance between the flexibility of single characters and the efficiency of full words for decoding, and also sidestep the need for special treatment of unknown words.
  • (Relatively) common words are in the vocabulary:

at, fairfax, 1910s

  • Other words are built from wordpieces:

hypatia = h ##yp ##ati ##a

  • Wordpiece Model:
    • Given a training corpus and a number of desired tokens D, select D wordpieces such that the resulting corpus is minimal in the number of wordpieces when segmented according to the chosen wordpiece model. 
  • If you’re using BERT in an otherwise word-based model, you have to deal with this.
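Segmenting with an already-chosen wordpiece vocabulary is greedy longest-match-first. A sketch over a toy vocabulary (real BERT uses its 30k learned vocabulary; the function name is illustrative):

```python
def wordpiece(word, vocab):
    """Greedily take the longest vocabulary piece at each position ('##' marks continuations)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:          # no piece matched: the word is unknown
            return ["[UNK]"]
        start = end
    return pieces

vocab = {"h", "##yp", "##ati", "##a", "at", "fairfax"}
print(wordpiece("hypatia", vocab))  # ['h', '##yp', '##ati', '##a']
```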

31 of 77

BERT Tokenizer

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

wps_ids = tokenizer.encode("Hypatia was a mathematician")
wordpieces = tokenizer.convert_ids_to_tokens(wps_ids)
# ['[CLS]', 'h', '##yp', '##ati', '##a', 'was', 'a', 'mathematician', '[SEP]']

32 of 77

Explore Embeddings

  • See notebook:

https://medialab.di.unipi.it:8000/hub/user-redirect/lab/tree/HLT/Lectures/TransformerExplore.ipynb

33 of 77

Unidirectional vs. Bidirectional Models

Unidirectional context: build the representation incrementally; each position only attends to words on its left (e.g. predicting “bank” from “<s> open a”).

Bidirectional context: every layer sees the whole input “open a bank”, so words can “see themselves”.

[Figure: two two-layer networks over the same tokens; the unidirectional model has only left-to-right connections, while the bidirectional model connects every position to every other.]

Slide from Jacob Devlin

34 of 77

Pretraining encoder-decoders: what pretraining objective to use?

The encoder portion benefits from bidirectional context; the decoder portion is used to train the whole model through language modeling.

For encoder-decoders, we could do something like language modeling, but where a prefix of every input is provided to the encoder and is not predicted:

ℎ1, … , ℎ𝑇 = Encoder(𝑤1, … , 𝑤𝑇)

ℎ𝑇+1, … , ℎ2𝑇 = Decoder(𝑤𝑇+1, … , 𝑤2𝑇, ℎ1, … , ℎ𝑇)

𝑦𝑖 ∼ 𝐴ℎ𝑖 + 𝑏, 𝑖 > 𝑇

35 of 77

Pretraining encoder-decoders: what pretraining objective to use?

What Raffel et al., 2019 found to work best was span corruption. Their model: T5.

Replace different-length spans from the input with unique placeholders; decode out the spans that were removed!

This is implemented in text preprocessing: it’s still an objective that looks like language modeling at the decoder side.
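That preprocessing can be sketched directly. Here the span positions are given by hand (T5 samples them), and the sentinels follow T5's <extra_id_N> naming; the function name is illustrative:

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a unique sentinel; the target decodes the removed spans."""
    inp, tgt, prev = [], [], 0
    for n, (s, e) in enumerate(spans):      # spans: sorted, non-overlapping
        sentinel = f"<extra_id_{n}>"
        inp += tokens[prev:s] + [sentinel]  # keep text up to the span, then the placeholder
        tgt += [sentinel] + tokens[s:e]     # the decoder emits sentinel + removed tokens
        prev = e
    inp += tokens[prev:]
    return inp, tgt

toks = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(toks, [(2, 4), (8, 9)])
print(inp)  # ['Thank', 'you', '<extra_id_0>', 'me', 'to', 'your', 'party', '<extra_id_1>', 'week']
print(tgt)  # ['<extra_id_0>', 'for', 'inviting', '<extra_id_1>', 'last']
```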

36 of 77

Pretraining encoder-decoders: what pretraining objective to use?

Raffel et al., 2019 found encoder-decoders to work better than decoders for their tasks, and span corruption (denoising) to work better than language modeling.

37 of 77

Pretraining encoder-decoders: what pretraining objective to use?

A fascinating property of T5: it can be finetuned to answer a wide range of questions, retrieving knowledge from its parameters.

NQ: Natural Questions; WQ: WebQuestions; TQA: TriviaQA — all “open-domain” versions.

[Figure: accuracy improves with T5 model size: 220 million, 770 million, 3 billion, and 11 billion parameters.]

38 of 77

Two Step Development

39 of 77

Pre-training Tasks

Masked LM

  • train a deep bidirectional representation, masking some percentage of the input tokens at random, and then predicting those masked tokens
  • the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM

Next Sentence Prediction

  • in order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task generated from any corpus
  • 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence

40 of 77

Masked LM

  • Randomly select 15% of tokens (up to 20 per seq)
  • For 80% of the time:
    • Replace the word with the [MASK] token
  • For 10% of the time:
    • Replace the word with a random word
  • For 10% of the time
    • Keep the word unchanged

41 of 77

Next Sentence Prediction

Binary classification

Randomly select a split over sentences:

Use one as sentence A

For 50% of the time:

  • Sample random sentence split from another document as sentence B.

For 50% of the time:

  • Use the actual sentences as sentence B.

Masking (Truncate([segment A, segment B]))

Later work has argued this “next sentence prediction” is not necessary.
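The pair-construction steps above can be sketched as follows (a simplification: real BERT samples multi-sentence segments up to a length budget and then applies masking and truncation; names here are illustrative):

```python
import random

def nsp_pair(docs, rng):
    """One (A, B, is_next) example: 50% actual next sentence, 50% sentence from another document."""
    doc = rng.choice([d for d in docs if len(d) > 1])
    i = rng.randrange(len(doc) - 1)
    a = doc[i]
    if rng.random() < 0.5:
        return a, doc[i + 1], True              # B really follows A
    other = rng.choice([d for d in docs if d is not doc])
    return a, rng.choice(other), False          # B sampled from a different document

docs = [["s1", "s2", "s3"], ["t1", "t2"]]
rng = random.Random(0)
pairs = [nsp_pair(docs, rng) for _ in range(100)]
```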

42 of 77

Model Architecture

  • BERT-BASE: a 12-layer model, comparable in size to the OpenAI Transformer in order to compare performance
  • BERT-LARGE: a huge 24-layer model which achieved state-of-the-art results

  • BERT is basically a trained Transformer encoder stack.

43 of 77

Model Details

  • Data: Wikipedia (2.5B words) + BookCorpus (800M words)
  • Batch Size: 131,072 words (1024 sequences * 128 length or 256 sequences * 512 length)
  • Training Time: 1M steps (~40 epochs)
  • Optimizer: AdamW, 1e-4 learning rate, linear decay
  • BERT-Base: 12-layer, 768-hidden, 12-head
  • BERT-Large: 24-layer, 1024-hidden, 16-head
  • Trained on 4x4 or 8x8 TPU slice for 4 days

Slide from Jacob Devlin

44 of 77

Fine Tuning Procedure

45 of 77

Example: Sentence Classification

46 of 77

Task Specific Models

47 of 77

Evaluation of BERT

General Language Understanding Evaluation (GLUE) benchmark: a standard split of data into train, validation, and test, where the test-set labels are held only on the server.

  • Sentence pair tasks
    • MNLI, Multi-Genre Natural Language Inference
    • QQP, Quora Question Pairs
    • QNLI, Question Natural Language Inference
    • STS-B The Semantic Textual Similarity Benchmark
    • MRPC Microsoft Research Paraphrase Corpus
    • RTE Recognizing Textual Entailment
    • WNLI Winograd NLI is a small natural language inference dataset
  • Single sentence classification
    • SST-2 The Stanford Sentiment Treebank
    • CoLA The Corpus of Linguistic Acceptability

48 of 77

GLUE Results

MultiNLI (Natural Language Inference)

Premise: Hills and mountains are especially sanctified in Jainism.
Hypothesis: Jainism hates nature.
Label: Contradiction

CoLa (Corpus of Linguistic Acceptability)

Sentence: The wagon rumbled down the road. Label: Acceptable

Sentence: The car honked down the road.

Label: Unacceptable

Slide from Jacob Devlin

49 of 77

SQUAD

The Stanford Question Answering Dataset (SQuAD) is a collection of 100k question/answer pairs posed by crowdworkers on a set of Wikipedia articles

Input Question:

Where do water droplets collide with ice to make precipitation?

Input Paragraph:

Precipitation forms as smaller droplets coalesce via collision with other raindrops or ice crystals within a cloud

Answer:

within a cloud

Too easy: answer always present

50 of 77

SQUAD 2.0

  • Use token 0 ([CLS]) to emit logit for “no answer”
  • “No answer” directly competes with answer span
  • Threshold is optimized on dev set

Slide from Jacob Devlin

What action did the US begin that started the second oil shock?

Ground Truth Answers: <No Answer>

Prediction: <No Answer>

51 of 77

Effect of pre-training tasks

  • Masked LM (compared to left-to-right LM) is very important on some tasks, Next Sentence Prediction is important on other tasks.
  • Left-to-right model does very poorly on word-level task (SQuAD), although this is mitigated by BiLSTM

52 of 77

Effects of Model Size

  • Big models help a lot
  • Going from 110M -> 340M params helps even on datasets with 3,600 labelled examples
  • Improvements have not asymptoted

Slide from Jacob Devlin

53 of 77

BERT for Contextualized Word Embeddings

54 of 77

Which Layers

55 of 77

References

56 of 77

Post BERT

57 of 77

RoBERTa

  • RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al, University of Washington and Facebook, 2019)
  • Trained BERT for more epochs and/or on more data
    • Showed that more epochs alone helps, even on same data
    • More data also helps
  • Improved masking and pre-training data slightly

Slide from Jacob Devlin

58 of 77

XLNet

  • XLNet: Generalized Autoregressive Pretraining for Language Understanding (Yang et al, CMU and Google, 2019)
  • Innovation #1: Relative position embeddings
    • Sentence: John ate a hot dog
    • Absolute attention: “How much should dog attend to hot (in any position), and how much should dog in position 4 attend to the word in position 3? (Or 508 attend to 507, ...)”
    • Relative attention: “How much should dog attend to hot (in any position) and how much should dog attend to the previous word?”

Slide from J. Devlin

59 of 77

XLNet

  • Innovation #2: Permutation Language Modeling
    • In a left-to-right language model, every word is predicted based on all of the words to its left
    • Instead: Randomly permute the order for every training sentence
    • Equivalent to masking, but many more predictions per sentence
    • Can be done efficiently with Transformers

Slide from J. Devlin

60 of 77

XLNet

  • Also used more data and bigger models, but showed that innovations improved on BERT even with same data and model size
  • XLNet results:

Slide from J. Devlin

61 of 77

ALBERT

  • ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (Lan et al, Google and TTI Chicago, 2019)
  • Innovation #1: Factorized embedding parameterization
    • Use small embedding size (e.g., 128) and then project it to Transformer hidden size (e.g., 1024) with parameter matrix

100k × 1024 (direct embedding matrix) vs. 100k × 128 + 128 × 1024 (factorized)

Slide from J. Devlin
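The arithmetic behind the factorization, using the slide's example sizes (V = 100k vocabulary, H = 1024 hidden size, E = 128 embedding size):

```python
V, H, E = 100_000, 1024, 128

direct = V * H               # one V x H embedding matrix
factorized = V * E + E * H   # a V x E embedding followed by an E x H projection

print(direct)      # 102400000
print(factorized)  # 12931072
```

Roughly an 8x reduction in embedding parameters, since the projection term E × H is tiny compared to V × E.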

62 of 77

ALBERT

  • Innovation #2: Cross-layer parameter sharing
    • Share all parameters between Transformer layers
  • Results:

  • ALBERT is light in terms of parameters, not speed

Slide from J. Devlin

63 of 77

T5

  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al, Google, 2019)
  • Ablated many aspects of pre-training:
    • Model size
    • Amount of training data
    • Domain/cleanness of training data
    • Pre-training objective details (e.g., span length of masked text)
    • Ensembling
    • Finetuning recipe (e.g., only allowing certain layers to finetune)
    • Multi-task training

Slide from J. Devlin

64 of 77

T5

  • Conclusions:
    • Scaling up model size and amount of training data helps a lot
    • Best model is 11B parameters (BERT-Large is 340M), trained on 120B words of cleaned common crawl text
    • Exact masking/corruptions strategy doesn’t matter that much
    • Mostly negative results for better finetuning and multi-task strategies
  • T5 results:

Slide from J. Devlin

65 of 77

Compute

  • SoTA requires lots of compute

Slide from J. Devlin

66 of 77

Computation and Energy Costs

Parameters, accelerator years of computation, energy consumption, and gross CO2e for GPT-3 and GLaM

GLaM is a mixture of experts model that only activates experts selectively based on the input so that no more than 95B parameters are active per input token

67 of 77

In-context Learning

68 of 77

GPT-3, In-context learning, and very large models

So far, we’ve interacted with pretrained models in two ways:

  • Sample from the distributions they define (maybe providing a prompt)
  • Fine-tune them on a task we care about, and take their predictions.

Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts.

GPT-3 is the canonical example of this. The largest T5 model had 11 billion parameters.

GPT-3 has 175 billion parameters.

69 of 77

GPT-3, In-context learning, and very large models

  • Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts.
  • The in-context examples seem to specify the task to be performed, and the conditional distribution mocks performing the task to a certain extent.
  • Input (prefix within a single Transformer decoder context):

    thanks -> merci
    hello -> bonjour
    mint -> menthe
    otter ->

  • Output (conditional generation): loutre
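Building such a prompt is just string formatting (the function name is illustrative); the "learning" happens entirely inside the frozen model's forward pass:

```python
def few_shot_prompt(examples, query):
    """In-context examples specify the task; the model is asked to continue the pattern."""
    lines = [f"{src} -> {tgt}" for src, tgt in examples]
    lines.append(f"{query} -> ")
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("thanks", "merci"), ("hello", "bonjour"), ("mint", "menthe")], "otter")
print(prompt)
```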

70 of 77

GPT-3, In-context learning, and very large models

Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts.

71 of 77

Distillation

72 of 77

Applying to production

  • BERT and other pre-trained language models are extremely large and expensive
  • How are companies applying them to low-latency production services?

Slide from J. Devlin

73 of 77

Model Size Growth

74 of 77

Distillation

  • Answer: Distillation (a.k.a., model compression)
  • Idea has been around for a long time:
    • Model Compression (Bucila et al, 2006)
    • Distilling the Knowledge in a Neural Network (Hinton et al, 2015)
  • Simple technique:
    • Train “Teacher”: Use SOTA pre-training + fine-tuning technique to train model with maximum accuracy
    • Label a large amount of unlabeled input examples with Teacher
    • Train “Student”: Much smaller model (e.g., 50x smaller) which is trained to mimic Teacher output
    • Student objective is typically Mean Square Error or Cross Entropy
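The cross-entropy variant of the student objective can be sketched directly. This is only a sketch: in practice teacher and student outputs come from full models, often softened with a temperature.

```python
import math

def soft_cross_entropy(teacher_probs, student_logits):
    """Cross entropy between the teacher's soft labels and the student's softmax."""
    z = sum(math.exp(l) for l in student_logits)
    log_probs = [l - math.log(z) for l in student_logits]
    return -sum(t * lp for t, lp in zip(teacher_probs, log_probs))

teacher = [0.7, 0.2, 0.1]                  # soft labels produced by the teacher
matching = [math.log(p) for p in teacher]  # student that reproduces the teacher exactly
uniform = [0.0, 0.0, 0.0]                  # uninformative student

# The loss is minimized (equal to the teacher's entropy) when the student matches.
assert soft_cross_entropy(teacher, matching) < soft_cross_entropy(teacher, uniform)
```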

Slide from J. Devlin

75 of 77

Distillation

  • Example distillation results
    • 50k labeled examples, 8M unlabeled examples

  • Distillation works much better than pre-training + fine-tuning with smaller model

Well-Read Students Learn Better: On the Importance of Pre-training Compact Models (Turc et al, 2020)

Slide from J. Devlin

76 of 77

Distillation

  • Why does distillation work so well? A hypothesis:
    • Language modeling is the “ultimate” NLP task in many ways
      • I.e., a perfect language model is also a perfect question answering/entailment/sentiment analysis model
    • Training a massive language model learns millions of latent features which are useful for these other NLP tasks
    • Finetuning mostly just picks up and tweaks these existing latent features
    • This requires an oversized model, because only a subset of the features are useful for any given task
    • Distillation allows the model to only focus on those features
    • Supporting evidence: Simple self-distillation (distilling a smaller BERT model) doesn’t work

Slide from J. Devlin

77 of 77

Conclusions

  • Pre-trained bidirectional language models work incredibly well
  • However, the models are extremely expensive
  • Improvements (unfortunately) seem to mostly come from even more expensive models and more data
  • The inference/serving problem is addressed through distillation
  • Emergent in-context learning is not yet well-understood!

Slide from J. Devlin