1 of 40

Assignment 5, Encoder-Decoder and Decoder-Only LMs

CSE 447 / 517

FEB 27TH, 2025 (WEEK 8)

2 of 40

Logistics

  • Project Checkpoint 3 is due on Monday, 3/03
  • Assignment 5 (A5) is due on Wednesday, 3/05

3 of 40

Agenda

  • Assignment 5
  • Encoder-Decoder LM
    • T5
  • Decoder-Only LM

4 of 40

Assignment 5

5 of 40

Implement the core components of attention

  • Compute pairwise similarities between queries and keys (transpose the right dimensions)
  • Scale the attention scores
  • Apply softmax to the scaled scores
  • Compute the outputs as an attention-weighted sum of the values (see the sketch below)
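
A minimal sketch of these four steps in PyTorch (illustrative only; the function name and shapes below are assumptions, not the assignment's starter code):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Illustrative sketch of the four steps above.

    q, k, v: (batch, seq_len, d_k) tensors.
    mask:    optional boolean tensor, True where attention is allowed.
    """
    d_k = q.size(-1)
    # 1. Pairwise similarities: transpose the last two dimensions of K.
    scores = q @ k.transpose(-2, -1)              # (batch, seq_len, seq_len)
    # 2. Scale by sqrt(d_k) so the softmax doesn't saturate.
    scores = scores / (d_k ** 0.5)
    # 3. (Optionally mask, then) softmax over the key dimension.
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    # 4. Outputs: attention-weighted sum of the values.
    return weights @ v
```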

6 of 40

Experiment with Your Transformer

  • Self-attention (in the notebook); the steps below are sketched in code after this list
    • Split heads
    • Calculate raw attention scores, i.e., before softmax
    • Create and apply the causal mask to attention
    • Softmax the raw attention and use it to get outputs
    • Merge heads
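
A sketch of these notebook steps, assuming a (batch, seq_len, d_model) input and omitting the learned Q/K/V projections for brevity (all names here are illustrative, not the starter code's API):

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, num_heads):
    """x: (batch, seq_len, d_model), with d_model divisible by num_heads."""
    B, T, D = x.shape
    d_head = D // num_heads
    q = k = v = x  # the real model uses learned Q/K/V projections here
    # 1. Split heads: (B, T, D) -> (B, num_heads, T, d_head).
    q, k, v = (t.view(B, T, num_heads, d_head).transpose(1, 2) for t in (q, k, v))
    # 2. Raw attention scores, i.e., before softmax.
    scores = q @ k.transpose(-2, -1) / (d_head ** 0.5)     # (B, H, T, T)
    # 3. Create and apply the causal mask (no attending to future positions).
    causal = torch.ones(T, T, device=x.device).tril().bool()
    scores = scores.masked_fill(~causal, float("-inf"))
    # 4. Softmax the raw scores and use them to weight the values.
    out = F.softmax(scores, dim=-1) @ v                    # (B, H, T, d_head)
    # 5. Merge heads back: (B, H, T, d_head) -> (B, T, D).
    return out.transpose(1, 2).contiguous().view(B, T, D)
```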

7 of 40

Encoder-Decoder LM

Slides credit: Daniel Khashabi, Colin Raffel, Abhishek Panigrahi, Victoria Graf, and others

8 of 40

Transformers are the default building blocks for NLP

Encoders

Examples: BERT, RoBERTa, SciBERT. Captures bidirectional context

9 of 40

Transformers are the default building blocks for NLP

Encoders

Examples: BERT, RoBERTa, SciBERT. Captures bidirectional context

Decoders

Examples: GPT-2, GPT-3, LaMDA

Also known as: causal or auto-regressive language model

Natural if the goal is generation, but cannot condition on future words

10 of 40

Transformers are the default building blocks for NLP

Encoders

Examples: BERT, RoBERTa, SciBERT. Captures bidirectional context

Examples: BART, T5, Meena

Conditional generation based on an encoded input

Encoder-Decoders

Decoders

Examples: GPT-2, GPT-3, LaMDA

Also known as: causal or auto-regressive language model

Natural if the goal is generation, but cannot condition on future words

11 of 40

T5: Text-To-Text Transfer Transformer

[Raffel et al 2019]

This paper:

Represent a collection of NLP tasks in a common format that takes in text and produces text

An encoder-decoder architecture

A thorough exploration of model design choices

12 of 40

The claim: all text processing tasks can be cast in a text-to-text format

Translation

Linguistic acceptability

Semantic textual similarity

Summarization

13 of 40

The claim: all text processing tasks can be cast in a text-to-text format

Translation

Linguistic acceptability

Semantic textual similarity

Summarization

Textual entailment

Paraphrase recognition

Reading comprehension

For each task, design a template so that the inputs and outputs are text

(Some previous papers had also explored this idea)
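
Concretely, each template pairs a task-prefixed input string with a target string. The examples below are paraphrased from the T5 paper's Figure 1; the Python dictionary is just for illustration:

```python
# Illustrative T5-style text-to-text templates (paraphrased from the paper's Figure 1).
examples = {
    "translation": ("translate English to German: That is good.", "Das ist gut."),
    "linguistic acceptability": ("cola sentence: The course is jumping well.", "not acceptable"),
    "semantic similarity": ("stsb sentence1: The rhino grazed on the grass. "
                            "sentence2: A rhino is grazing in a field.", "3.8"),
    "summarization": ("summarize: state authorities dispatched emergency crews ...",
                      "six people hospitalized after a storm ..."),
}
for task, (model_input, model_output) in examples.items():
    print(f"{task}\n  input:  {model_input}\n  output: {model_output}\n")
```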

14 of 40

T5: Text-To-Text Transfer Transformer

[Raffel et al 2019]

This paper:

Represent a collection of NLP tasks in a common format that takes in text and produces text

An encoder-decoder architecture

A thorough exploration of model design choices

15 of 40

Experimental Setup

Decide a default model

  • Encoder-decoder architecture
  • Pretraining objective
  • ….

Evaluate one design axis at a time, fixing the rest of the parameters

16 of 40

Key findings

Model Architecture

  • Encoder-decoder models outperform "decoder-only" language models

Pre-training Objectives

  • Fill-in-the-blank-style denoising objectives are most effective; computational cost is also a crucial factor (a span-corruption example follows this list)

Training Strategies

  • Multitask learning is competitive with pre-train-then-fine-tune, but task frequency needs careful consideration
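
To make the fill-in-the-blank objective concrete, here is the span-corruption example from the T5 paper, where sentinel tokens <X>, <Y>, <Z> mark dropped spans (the snippet itself is purely illustrative):

```python
# T5-style span corruption (example text from the paper; the code itself is illustrative).
original        = "Thank you for inviting me to your party last week."
# Randomly chosen spans are replaced with sentinel tokens in the input ...
corrupted_input = "Thank you <X> me to your party <Y> week."
# ... and the target reconstructs only the dropped spans, each introduced by its sentinel.
target          = "<X> for inviting <Y> last <Z>"
print(corrupted_input, "->", target)
```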

17 of 40

Architectures: Different Choices

18 of 40

Architectures: Different Attention Masks

  • Fully-visible: allows the self-attention mechanism to attend to the full input.

  • Causal: doesn't allow output elements to look into the future.

  • Causal with prefix: allows fully-visible masking on a portion of the input, causal masking on the rest (see the mask sketch below).
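
A small sketch of how these three masks can be built as boolean matrices (the function and its `prefix_len` argument are illustrative assumptions, not from the T5 codebase):

```python
import torch

def attention_masks(seq_len, prefix_len):
    """Three mask types as (seq_len, seq_len) boolean matrices.

    True at [i, j] means position i may attend to position j.
    """
    # Fully-visible: every position attends to every other position (encoder-style).
    fully_visible = torch.ones(seq_len, seq_len, dtype=torch.bool)
    # Causal: no position may attend to a later ("future") position (decoder-style).
    causal = torch.ones(seq_len, seq_len).tril().bool()
    # Causal with prefix: the first `prefix_len` positions are visible to everyone,
    # the remainder is causal (prefix LM).
    prefix_lm = causal.clone()
    prefix_lm[:, :prefix_len] = True
    return fully_visible, causal, prefix_lm
```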

19 of 40

Architectures: Different Positional Encodings

Sinusoidal Positional Embeddings

The initial transformer proposed in Attention Is All You Need uses sine and cosine positional embeddings.

These sine and cosine positional embeddings are not learned by the model, and they are not well suited to very long inputs.

Because the embeddings are fixed, they generalize poorly to sequence lengths longer than those seen during training, which can lead to inadequate representation of long-range dependencies and difficulty capturing fine-grained positional information.
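
For reference, a minimal sketch of the fixed sinusoidal embeddings, following the formula in "Attention Is All You Need" (function name and shapes here are assumptions):

```python
import torch

def sinusoidal_positional_embeddings(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...).

    Assumes d_model is even; the result is added to the token embeddings, not learned.
    """
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (seq_len, 1)
    dims = torch.arange(0, d_model, 2, dtype=torch.float32).unsqueeze(0)   # (1, d_model/2)
    angles = positions / torch.pow(torch.tensor(10000.0), dims / d_model)  # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe
```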

20 of 40

Architectures: Different Positional Encodings

Rotary Positional Embeddings

Instead of adding a positional vector, it applies a rotation to the word vector.

Stability of Vectors: Adding tokens at the end of a sentence doesn’t affect the vectors for words at the beginning, facilitating efficient caching.

Preservation of Relative Positions: If two words, say “pig” and “dog,” keep the same relative distance in different contexts, the difference between their rotation angles is the same, so the angle between their vectors, and consequently their dot product, remains constant.
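
A minimal sketch of the rotation, applied here to a single matrix of vectors; in practice the same function is applied to the queries and keys inside attention (the function name and `base` default are assumptions):

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate consecutive dimension pairs (2i, 2i+1) of x by angle pos * theta_i.

    x: (seq_len, d) query or key vectors, with d even. Because only the *difference*
    in positions survives in the dot product, relative positions are preserved.
    """
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq_len, 1)
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)      # (d/2,)
    angles = pos * theta                                                   # (seq_len, d/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated
```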

21 of 40

Architectural Variants: Experiments

The “language model” variant in these experiments is decoder-only

Slide credit: Abhishek Panigrahi, Victoria Graf

22 of 40

Architectural Variants: Experiments

The LM attends to both the input and the target in a single sequence, while the encoder looks only at the input sequence and the decoder at the output sequence.

Slide credit: Abhishek Panigrahi, Victoria Graf

23 of 40

Architectural Variants: Experiments

  • Halving the number of layers in the encoder and decoder hurts performance.

  • The encoder-decoder with shared parameters performs better than the decoder-only LM and the prefix LM.

Slide credit: Abhishek Panigrahi, Victoria Graf

24 of 40

Decoder-only LM

Slide credit: Sabhya Chhabria & Michael Tang

25 of 40

Transformers are the default building blocks for NLP today

Encoders

Examples: BERT, RoBERTa, SciBERT. Captures bidirectional context

Examples: BART, T5, Meena

Conditional generation based on an encoded input

Encoder-Decoders

Decoders

Examples: GPT-2, GPT-3, LaMDA

Also known as: causal or auto-regressive language model

Natural if the goal is generation, but cannot condition on future words

26 of 40

Causal or Auto-regressive models

A non-auto-regressive model maps inputs x1 x2 x3 x4 to outputs v1 v2 v3 v4: inputs and outputs are different.

Use case: when we want to assign a label to each word (e.g., part-of-speech tagging)

A causal or auto-regressive model maps inputs x1 x2 x3 x4 to outputs x2 x3 x4 x5: each output is the next input in the sequence.

Use case: when we want to generate tokens (e.g., language modeling)
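
To make the contrast concrete, a toy illustration of how the two setups pair inputs with targets (the token IDs and POS labels below are made up):

```python
# Toy illustration (made-up token IDs and labels).
tokens = [101, 7, 42, 13, 5]                                # x1 ... x5

# Non-auto-regressive (e.g., POS tagging): one label per input token.
tagging_inputs = tokens                                     # x1 ... x5
tagging_labels = ["DET", "NOUN", "VERB", "DET", "NOUN"]     # hypothetical labels

# Causal / auto-regressive (language modeling): each target is the next token.
lm_inputs  = tokens[:-1]                                    # x1 ... x4
lm_targets = tokens[1:]                                     # x2 ... x5
```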

27 of 40

The GPT family

GPT (2018), 117 million parameters

GPT-2 (2019), 1.5 billion parameters

GPT-3 (2020), 175 billion parameters

NeurIPS 2020 best paper

28 of 40

The anatomy of a GPT model

  • An autoregressive model that predicts the next token given tokens so far (either predicted or given as part of input)

And so on
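
The "and so on" is just a loop: feed the tokens so far, take the most likely next token, append it, and repeat. A minimal greedy-decoding sketch, where `model` is a stand-in for any callable returning next-token logits (not a specific library API):

```python
import torch

def greedy_generate(model, input_ids, max_new_tokens):
    """input_ids: (batch, seq_len) token IDs; `model` returns (batch, seq_len, vocab) logits."""
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids)
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
            input_ids = torch.cat([input_ids, next_token], dim=-1)      # append and repeat
    return input_ids
```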

29 of 40

The anatomy of a GPT model

  • An autoregressive model that predicts the next token given tokens so far (either predicted or given as part of input)

30 of 40

The anatomy of a GPT model

This part does not exist

As it processes each subword, it masks the “future” words and conditions on (i.e. attends to) the previous words

Consists only of decoder transformer blocks (contrast with BERT which consists only of encoders)

31 of 40

The anatomy of a GPT model

As it processes each subword, it masks the “future” words and conditions on (i.e. attends to) the previous words

Consists only of decoder transformer blocks (contrast with BERT which consists only of encoders)

This part does not exist

32 of 40

The first GPT model (sometimes called GPT-1)

Pretrained on the BooksCorpus

33 of 40

The first GPT model (sometimes called GPT-1)

Pretrained on the BooksCorpus

Also shows results on fine-tuning for end tasks, where inputs and outputs are converted to text

34 of 40

GPT-2 is architecturally almost identical to GPT-1, but:

  • Layer norm moved to the input of each sub-block (pre-norm; see the sketch below)

  • Vocabulary extended to 50,257 tokens and context size increased from 512 to 1024

  • Trained on roughly 8 million documents scraped from the web (the WebText dataset), excluding Wikipedia
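
A sketch of the resulting "pre-norm" block, where layer norm is applied to the input of each sub-block rather than after it (module sizes and choices below are illustrative, not GPT-2's exact configuration):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Transformer block with layer norm at the *input* of each sub-block (GPT-2 style)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, attn_mask=None):
        # Pre-norm: normalize first, then apply the sub-block, then add the residual.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x
```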

35 of 40

GPT-2: Model Sizes

Four model sizes: 117M, 345M, 762M, and 1542M parameters

Play with it here: https://huggingface.co/gpt2

36 of 40

GPT-3: A Very Large Language Model (2020)

  • More layers & parameters
  • Bigger dataset
  • Longer training
  • Larger embedding/hidden dimension
  • Larger context window

37 of 40

Size Comparisons

  • BERT-Base model has 12 transformer blocks, 12 attention heads,
    • 110M parameters

  • BERT-Large model has 24 transformer blocks, 16 attention heads,
    • 340M parameters

  • GPT-2 is trained on 40GB of text data (8M webpages)!
    • 1.5B parameters

  • GPT-3 is an even bigger version of GPT-2, but isn’t open-source
    • 175B parameters

38 of 40

39 of 40

GPT

  • The GPT family: decoder-only models

  • General theme: Train the largest language model your resources allow on the largest dataset you can find

  • Impressive generation performance

Even more impressive: Zero-shot capabilities

40 of 40

Questions?

  • Thank you!