1 of 40

Assignment 5, Encoder-Decoder and Decoder-Only LMs

CSE 447 / 517

FEB 27TH, 2025 (WEEK 8)

2 of 40

Logistics

  • Project Checkpoint 3 is due on Monday, 3/03
  • Assignment 5 (A5) is due on Wednesday, 3/05

3 of 40

Agenda

  • Assignment 5
  • Encoder-Decoder LM
    • T5
  • Decoder-Only LM

4 of 40

Assignment 5

5 of 40

Implement the core components of attention

  • Compute pairwise similarities between queries and keys (transpose the right dimensions)
  • Scale the attention scores
  • Apply softmax to the scaled scores
  • Compute the outputs as an attention-weighted sum of the values (see the sketch below)
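
A minimal sketch of these four steps in PyTorch (illustrative only; the function name and shapes below are assumptions, not the assignment's starter code):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Illustrative sketch of the four steps above.

    q, k, v: (batch, seq_len, d_k) tensors.
    mask:    optional boolean tensor, True where attention is allowed.
    """
    d_k = q.size(-1)
    # 1. Pairwise similarities: transpose the last two dimensions of K.
    scores = q @ k.transpose(-2, -1)              # (batch, seq_len, seq_len)
    # 2. Scale by sqrt(d_k) so the softmax doesn't saturate.
    scores = scores / (d_k ** 0.5)
    # 3. (Optionally mask, then) softmax over the key dimension.
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    # 4. Outputs: attention-weighted sum of the values.
    return weights @ v
```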

6 of 40

Experiment with Your Transformer

  • Self-attention (in the notebook); the steps below are sketched in code after this list
    • Split heads
    • Calculate raw attention scores, i.e., before softmax
    • Create and apply the causal mask to attention
    • Softmax the raw attention and use it to get outputs
    • Merge heads
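
A sketch of these notebook steps, assuming a (batch, seq_len, d_model) input and omitting the learned Q/K/V projections for brevity (all names here are illustrative, not the starter code's API):

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, num_heads):
    """x: (batch, seq_len, d_model), with d_model divisible by num_heads."""
    B, T, D = x.shape
    d_head = D // num_heads
    q = k = v = x  # the real model uses learned Q/K/V projections here
    # 1. Split heads: (B, T, D) -> (B, num_heads, T, d_head).
    q, k, v = (t.view(B, T, num_heads, d_head).transpose(1, 2) for t in (q, k, v))
    # 2. Raw attention scores, i.e., before softmax.
    scores = q @ k.transpose(-2, -1) / (d_head ** 0.5)     # (B, H, T, T)
    # 3. Create and apply the causal mask (no attending to future positions).
    causal = torch.ones(T, T, device=x.device).tril().bool()
    scores = scores.masked_fill(~causal, float("-inf"))
    # 4. Softmax the raw scores and use them to weight the values.
    out = F.softmax(scores, dim=-1) @ v                    # (B, H, T, d_head)
    # 5. Merge heads back: (B, H, T, d_head) -> (B, T, D).
    return out.transpose(1, 2).contiguous().view(B, T, D)
```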

7 of 40

Encoder-Decoder LM

Slides credit: Daniel Khashabi, Colin Raffel, Abhishek Panigrahi, Victoria Graf, and others

8 of 40

Transformers are the default building blocks for NLP

Encoders

Examples: BERT, RoBERTa, SciBERT. Captures bidirectional context

9 of 40

Transformers are the default building blocks for NLP

Encoders

Examples: BERT, RoBERTa, SciBERT. Captures bidirectional context

Decoders

Examples: GPT-2, GPT-3, LaMDA

Also known as: causal or auto-regressive language model

Natural if the goal is generation, but cannot condition on future words

10 of 40

Transformers are the default building blocks for NLP

Encoders

Examples: BERT, RoBERTa, SciBERT. Captures bidirectional context

Examples: BART, T5, Meena

Conditional generation based on an encoded input

Encoder-Decoders

Decoders

Examples: GPT-2, GPT-3, LaMDA

Also known as: causal or auto-regressive language model

Natural if the goal is generation, but cannot condition on future words

11 of 40

T5: Text-To-Text Transfer Transformer

[Raffel et al 2019]

This paper:

Represent a collection of NLP tasks in a common format that takes in text and produces text

An encoder-decoder architecture

A thorough exploration of model design choices

12 of 40

The claim: all text processing tasks can be cast in a text-to-text format

Translation

Linguistic acceptability

Semantic textual similarity

Summarization

13 of 40

The claim: all text processing tasks can be cast in a text-to-text format

Translation

Linguistic acceptability

Semantic textual similarity

Summarization

Textual entailment

Paraphrase recognition

Reading comprehension

For each task, design a template so that the inputs and outputs are text

(Some previous papers had also explored this idea)
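
Concretely, each template pairs a task-prefixed input string with a target string. The examples below are paraphrased from the T5 paper's Figure 1; the Python dictionary is just for illustration:

```python
# Illustrative T5-style text-to-text templates (paraphrased from the paper's Figure 1).
examples = {
    "translation": ("translate English to German: That is good.", "Das ist gut."),
    "linguistic acceptability": ("cola sentence: The course is jumping well.", "not acceptable"),
    "semantic similarity": ("stsb sentence1: The rhino grazed on the grass. "
                            "sentence2: A rhino is grazing in a field.", "3.8"),
    "summarization": ("summarize: state authorities dispatched emergency crews ...",
                      "six people hospitalized after a storm ..."),
}
for task, (model_input, model_output) in examples.items():
    print(f"{task}\n  input:  {model_input}\n  output: {model_output}\n")
```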

14 of 40

T5: Text-To-Text Transfer Transformer

[Raffel et al 2019]

This paper:

Represent a collection of NLP tasks in a common format that takes in text and produces text

An encoder-decoder architecture

A thorough exploration of model design choices

15 of 40

Experimental Setup

Decide a default model

  • Encoder-decoder architecture
  • Pretraining objective
  • ….

Evaluate one design axis at a time, fixing the rest of the parameters

16 of 40

Key findings

Model Architecture

  • Encoder-decoder models outperform "decoder-only" language models

Pre-training Objectives

  • Fill-in-the-blank-style denoising objectives are most effective; computational cost is also a crucial factor (a span-corruption example follows this list)

Training Strategies

  • Multitask learning is competitive with pre-train-then-fine-tune, but task frequency needs careful consideration
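
To make the fill-in-the-blank objective concrete, here is the span-corruption example from the T5 paper, where sentinel tokens <X>, <Y>, <Z> mark dropped spans (the snippet itself is purely illustrative):

```python
# T5-style span corruption (example text from the paper; the code itself is illustrative).
original        = "Thank you for inviting me to your party last week."
# Randomly chosen spans are replaced with sentinel tokens in the input ...
corrupted_input = "Thank you <X> me to your party <Y> week."
# ... and the target reconstructs only the dropped spans, each introduced by its sentinel.
target          = "<X> for inviting <Y> last <Z>"
print(corrupted_input, "->", target)
```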

17 of 40

Architectures: Different Choices

18 of 40

Architectures: Different Attention Masks

  • Fully-visible: allows the self-attention mechanism to attend to the full input.

  • Causal: doesn't allow output elements to look into the future.

  • Causal with prefix: allows fully-visible masking on a portion of the input, causal masking on the rest (see the mask sketch below).
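
A small sketch of how these three masks can be built as boolean matrices (the function and its `prefix_len` argument are illustrative assumptions, not from the T5 codebase):

```python
import torch

def attention_masks(seq_len, prefix_len):
    """Three mask types as (seq_len, seq_len) boolean matrices.

    True at [i, j] means position i may attend to position j.
    """
    # Fully-visible: every position attends to every other position (encoder-style).
    fully_visible = torch.ones(seq_len, seq_len, dtype=torch.bool)
    # Causal: no position may attend to a later ("future") position (decoder-style).
    causal = torch.ones(seq_len, seq_len).tril().bool()
    # Causal with prefix: the first `prefix_len` positions are visible to everyone,
    # the remainder is causal (prefix LM).
    prefix_lm = causal.clone()
    prefix_lm[:, :prefix_len] = True
    return fully_visible, causal, prefix_lm
```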

19 of 40

Architectures: Different Positional Encodings

Sinusoidal Positional Embeddings

The initial transformer proposed in Attention Is All You Need uses sine and cosine positional embeddings.

These sine and cosine positional embeddings are not learned by the model, and they are not well suited to very long inputs.

Because the embeddings are fixed, they generalize poorly to sequence lengths longer than those seen during training, which can lead to inadequate representation of long-range dependencies and difficulty capturing fine-grained positional information.
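
For reference, a minimal sketch of the fixed sinusoidal embeddings, following the formula in "Attention Is All You Need" (function name and shapes here are assumptions):

```python
import torch

def sinusoidal_positional_embeddings(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...).

    Assumes d_model is even; the result is added to the token embeddings, not learned.
    """
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (seq_len, 1)
    dims = torch.arange(0, d_model, 2, dtype=torch.float32).unsqueeze(0)   # (1, d_model/2)
    angles = positions / torch.pow(torch.tensor(10000.0), dims / d_model)  # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe
```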

20 of 40

Architectures: Different Positional Encodings

Rotary Positional Embeddings

Instead of adding a positional vector, it applies a rotation to the word vector.

Stability of Vectors: Adding tokens at the end of a sentence doesn’t affect the vectors for words at the beginning, facilitating efficient caching.

Preservation of Relative Positions: If two words, say “pig” and “dog,” keep the same relative distance in different contexts, the difference between their rotation angles is the same, so the angle between their vectors, and consequently their dot product, remains constant.
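
A minimal sketch of the rotation, applied here to a single matrix of vectors; in practice the same function is applied to the queries and keys inside attention (the function name and `base` default are assumptions):

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate consecutive dimension pairs (2i, 2i+1) of x by angle pos * theta_i.

    x: (seq_len, d) query or key vectors, with d even. Because only the *difference*
    in positions survives in the dot product, relative positions are preserved.
    """
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq_len, 1)
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)      # (d/2,)
    angles = pos * theta                                                   # (seq_len, d/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated
```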

21 of 40

Architectural Variants: Experiments

The “language model” variant in these experiments is decoder-only

Slide credit: Abhishek Panigrahi, Victoria Graf

22 of 40

Architectural Variants: Experiments

The LM attends to both the input and the target in a single sequence, while the encoder looks only at the input sequence and the decoder at the output sequence.

Slide credit: Abhishek Panigrahi, Victoria Graf

23 of 40

Architectural Variants: Experiments

  • Halving the number of layers in the encoder and decoder hurts performance.

  • The encoder-decoder with shared parameters performs better than the decoder-only LM and the prefix LM.

Slide credit: Abhishek Panigrahi, Victoria Graf

24 of 40

Decoder-only LM

Slide credit: Sabhya Chhabria & Michael Tang

25 of 40

Transformers are the default building blocks for NLP today

Encoders

Examples: BERT, RoBERTa, SciBERT. Captures bidirectional context

Examples: BART, T5, Meena

Conditional generation based on an encoded input

Encoder-Decoders

Decoders

Examples: GPT-2, GPT-3, LaMDA

Also known as: causal or auto-regressive language model

Natural if the goal is generation, but cannot condition on future words

26 of 40

Causal or Auto-regressive models

A non-auto-regressive model maps inputs x1 x2 x3 x4 to outputs v1 v2 v3 v4: inputs and outputs are different.

Use case: when we want to assign a label to each word (e.g., part-of-speech tagging)

A causal or auto-regressive model maps inputs x1 x2 x3 x4 to outputs x2 x3 x4 x5: each output is the next input in the sequence.

Use case: when we want to generate tokens (e.g., language modeling)
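
To make the contrast concrete, a toy illustration of how the two setups pair inputs with targets (the token IDs and POS labels below are made up):

```python
# Toy illustration (made-up token IDs and labels).
tokens = [101, 7, 42, 13, 5]                                # x1 ... x5

# Non-auto-regressive (e.g., POS tagging): one label per input token.
tagging_inputs = tokens                                     # x1 ... x5
tagging_labels = ["DET", "NOUN", "VERB", "DET", "NOUN"]     # hypothetical labels

# Causal / auto-regressive (language modeling): each target is the next token.
lm_inputs  = tokens[:-1]                                    # x1 ... x4
lm_targets = tokens[1:]                                     # x2 ... x5
```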

27 of 40

The GPT family

GPT (2018), 117 million parameters

GPT-2 (2019), 1.5 billion parameters

GPT-3 (2020), 175 billion parameters

NeurIPS 2020 best paper

28 of 40

The anatomy of a GPT model

  • An autoregressive model that predicts the next token given tokens so far (either predicted or given as part of input)

And so on
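
The "and so on" is just a loop: feed the tokens so far, take the most likely next token, append it, and repeat. A minimal greedy-decoding sketch, where `model` is a stand-in for any callable returning next-token logits (not a specific library API):

```python
import torch

def greedy_generate(model, input_ids, max_new_tokens):
    """input_ids: (batch, seq_len) token IDs; `model` returns (batch, seq_len, vocab) logits."""
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids)
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
            input_ids = torch.cat([input_ids, next_token], dim=-1)      # append and repeat
    return input_ids
```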

29 of 40

The anatomy of a GPT model

  • An autoregressive model that predicts the next token given tokens so far (either predicted or given as part of input)

30 of 40

The anatomy of a GPT model

This part does not exist

As it processes each subword, it masks the “future” words and conditions on (i.e. attends to) the previous words

Consists only of decoder transformer blocks (contrast with BERT which consists only of encoders)

31 of 40

The anatomy of a GPT model

As it processes each subword, it masks the “future” words and conditions on (i.e. attends to) the previous words

Consists only of decoder transformer blocks (contrast with BERT which consists only of encoders)

This part does not exist

32 of 40

The first GPT model (sometimes called GPT-1)

Pretrained on the BooksCorpus

33 of 40

The first GPT model (sometimes called GPT-1)

Pretrained on the BooksCorpus

Also shows results on fine-tuning for end tasks, where inputs and outputs are converted to text

34 of 40

GPT-2 is architecturally almost identical to GPT-1, but:

  • Layer norm moved to the input of each sub-block (pre-norm; see the sketch below)

  • Vocabulary extended to 50,257 tokens and context size increased from 512 to 1024

  • Trained on roughly 8 million documents scraped from the web (the WebText dataset), excluding Wikipedia
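
A sketch of the resulting "pre-norm" block, where layer norm is applied to the input of each sub-block rather than after it (module sizes and choices below are illustrative, not GPT-2's exact configuration):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Transformer block with layer norm at the *input* of each sub-block (GPT-2 style)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, attn_mask=None):
        # Pre-norm: normalize first, then apply the sub-block, then add the residual.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x
```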

35 of 40

GPT-2: Model Sizes

Four model sizes: 117M, 345M, 762M, and 1542M parameters

Play with it here: https://huggingface.co/gpt2

36 of 40

GPT-3: A Very Large Language Model (2020)

  • More layers & parameters
  • Bigger dataset
  • Longer training
  • Larger embedding/hidden dimension
  • Larger context window

37 of 40

Size Comparisons

  • BERT-Base model has 12 transformer blocks, 12 attention heads,
    • 110M parameters

  • BERT-Large model has 24 transformer blocks, 16 attention heads,
    • 340M parameters

  • GPT-2 is trained on 40GB of text data (8M webpages)!
    • 1.5B parameters

  • GPT-3 is an even bigger version of GPT-2, but isn’t open-source
    • 175B parameters

38 of 40

39 of 40

GPT

  • The GPT family: decoder-only models

  • General theme: Train the largest language model your resources allow on the largest dataset you can find

  • Impressive generation performance

Even more impressive: Zero-shot capabilities

40 of 40

Questions?

  • Thank you!