Assignment 5, Encoder-Decoder and Decoder-Only LMs
CSE 447 / 517
FEB 27TH, 2025 (WEEK 8)
Logistics
Agenda
Assignment 5
Implement the core components of attention (see the sketch after this agenda)
Experiment with Your Transformer
Encoder-Decoder LM
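Since the first agenda item is implementing the core components of attention, here is a minimal reference sketch of scaled dot-product attention with an optional causal mask (PyTorch; the function name, shapes, and interface are illustrative assumptions, not the assignment's required API):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=False):
    """Minimal scaled dot-product attention.

    q, k, v: tensors of shape (batch, seq_len, d_k).
    If causal=True, position i may only attend to positions <= i.
    """
    d_k = q.size(-1)
    # Similarity between queries and keys, scaled to keep softmax gradients well-behaved.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (batch, seq_len, seq_len)
    if causal:
        seq_len = q.size(-2)
        future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        scores = scores.masked_fill(future, float("-inf"))   # hide "future" positions
    weights = F.softmax(scores, dim=-1)                       # attention distribution
    return weights @ v                                        # weighted sum of value vectors

# Example: causal self-attention over a batch of 2 sequences of length 5.
q = k = v = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(q, k, v, causal=True)      # shape (2, 5, 64)
```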
Slide credit: Daniel Khashabi, Colin Raffel, Abhishek Panigrahi, Victoria Graf, and others
Transformers are the default building blocks for NLP
Encoders
Examples: BERT, RoBERTa, SciBERT. Capture bidirectional context
Encoder-Decoders
Examples: BART, T5, Meena
Conditional generation based on an encoded input
Decoders
Examples: GPT-2, GPT-3, LaMDA
Also known as: causal or auto-regressive language models
Natural if the goal is generation, but cannot condition on future words
T5: Text-To-Text Transfer Transformer
[Raffel et al 2019]
This paper:
Represents a collection of NLP tasks in a common format that takes in text and produces text
An encoder-decoder architecture
A thorough exploration of model design choices
The claim: All text processing tasks → text-to-text format
Translation
Linguistic acceptability
Semantic textual similarity
Summarization
Textual entailment
Paraphrase recognition
Reading comprehension
For each task, design a template so that the inputs and outputs are text
(Some previous papers had also explored this idea)
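As a concrete illustration of the text-to-text format, here is a small sketch of task templates (the prefixes and field names are illustrative, in the spirit of T5, not the paper's exact templates):

```python
# Illustrative templates casting tasks into the text-to-text format.
def to_text_to_text(task, **fields):
    if task == "translation":
        return f"translate English to German: {fields['source']}", fields["target"]
    if task == "summarization":
        return f"summarize: {fields['document']}", fields["summary"]
    if task == "sts":   # semantic textual similarity: the score is emitted as a string
        return (f"stsb sentence1: {fields['s1']} sentence2: {fields['s2']}",
                str(fields["score"]))
    if task == "cola":  # linguistic acceptability: "acceptable" / "not acceptable"
        return f"cola sentence: {fields['sentence']}", fields["label"]
    raise ValueError(f"unknown task: {task}")

inp, out = to_text_to_text("translation", source="That is good.", target="Das ist gut.")
# inp == "translate English to German: That is good."    out == "Das ist gut."
```

Both the input and the output are plain strings, so a single encoder-decoder model with a single training objective can cover all of these tasks.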
Experimental Setup
Decide on a default model
Evaluate one design axis at a time, fixing the rest of the parameters
Key findings:
Model Architecture
Pre-training Objectives
Training Strategies
Architectures: Different Choices
Architectures: Different Attention Masks
Fully-visible: allows the self-attention mechanism to attend to the full input.
Causal: doesn’t allow output elements to look into the future.
Causal with prefix: allows fully-visible masking on a portion of the input.
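A minimal sketch of the three mask patterns (PyTorch; the function and argument names are illustrative):

```python
import torch

def attention_mask(seq_len, kind="fully_visible", prefix_len=0):
    """Boolean mask: entry (i, j) is True if position i may attend to position j."""
    if kind == "fully_visible":
        return torch.ones(seq_len, seq_len).bool()              # encoder-style: attend to everything
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()    # decoder-style: no looking ahead
    if kind == "causal":
        return causal
    if kind == "causal_with_prefix":
        mask = causal.clone()
        mask[:, :prefix_len] = True                              # the prefix portion is fully visible
        return mask
    raise ValueError(f"unknown mask kind: {kind}")

# e.g. attention_mask(6, "causal_with_prefix", prefix_len=3)
```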
Architectures: Different Positional Encodings
Sinusoidal Positional Embeddings
The original transformer, proposed in Attention Is All You Need, uses fixed sine and cosine positional embeddings.
These embeddings are not learned by the model, but they are also not well suited to very long inputs.
As the sequence length increases, the higher-frequency sinusoidal components cycle through many short periods, which can make it harder to represent long-range dependencies and to capture fine-grained positional information.
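A minimal sketch of the fixed sinusoidal embeddings from Attention Is All You Need (assumes an even model dimension; names are illustrative):

```python
import math
import torch

def sinusoidal_positional_embeddings(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    assert d_model % 2 == 0, "this sketch assumes an even model dimension"
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)      # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)               # even dimensions 2i
    inv_freq = torch.exp(-math.log(10000.0) * i / d_model)           # 1 / 10000^(2i/d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)
    pe[:, 1::2] = torch.cos(pos * inv_freq)
    return pe                                                         # added to the token embeddings
```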
Architectures: Different Positional Encodings
Rotary Positional Embeddings
Instead of adding a positional vector, rotary embeddings apply a position-dependent rotation to the word vector.
Stability of Vectors: adding tokens at the end of a sentence doesn’t affect the vectors for words at the beginning, facilitating efficient caching.
Preservation of Relative Positions: if two words, say “pig” and “dog,” keep the same relative distance in different contexts, the difference between their rotations is the same. This ensures that the angle between their vectors, and consequently their dot product, remains constant.
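A minimal sketch of the rotation idea, following the RoPE formulation (applied to queries and keys before the dot product; assumes an even dimension, and the names are illustrative):

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate pairs of dimensions of x by position-dependent angles.

    x: (seq_len, d) with d even. The pair (x[p, 2i], x[p, 2i+1]) at position p
    is rotated by the angle p * theta_i, where theta_i = base^(-2i/d).
    """
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)            # (seq_len, 1)
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float) / d)        # (d/2,)
    angles = pos * theta                                                   # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x_even * cos - x_odd * sin
    rotated[:, 1::2] = x_even * sin + x_odd * cos
    return rotated
```

Because only the difference in rotation angles affects the dot product, the attention score between two rotated vectors depends on their relative position rather than their absolute positions.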
Architectural Variants: Experiments
The language model variant is decoder-only.
The LM looks at both the input and the target (concatenated into one sequence), while in the encoder-decoder setup the encoder only looks at the input sequence and the decoder looks at the output sequence.
Slide credit: Abhishek Panigrahi, Victoria Graf
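To make the comparison concrete, here is a sketch of how the same example could be fed to the three variants (tokenization and special tokens are simplified; the strings are illustrative):

```python
# One translation example, three architectural variants.
source = "translate English to German: That is good."
target = "Das ist gut."

# Encoder-decoder: the encoder sees the source, the decoder generates the target.
encoder_input, decoder_target = source, target

# Language model (decoder-only, causal mask): one concatenated sequence;
# the loss is typically computed only on the target portion.
lm_input = source + " " + target

# Prefix LM: the same concatenated sequence, but the source portion uses a
# fully-visible mask and only the target portion is masked causally.
prefix_lm_input, prefix_len = lm_input, len(source)
```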
Decoder-only LM
Slide credit: Sbhya Chhabria & Michael Tang
Transformers are the default building blocks for NLP today
Encoders
Examples: BERT, RoBERTa, SciBERT. Capture bidirectional context
Encoder-Decoders
Examples: BART, T5, Meena
Conditional generation based on an encoded input
Decoders
Examples: GPT-2, GPT-3, LaMDA
Also known as: causal or auto-regressive language models
Natural if the goal is generation, but cannot condition on future words
Causal or Auto-regressive models
[Diagram: inputs x1 x2 x3 x4 → Model → outputs v1 v2 v3 v4]
A non-auto-regressive model: inputs and outputs are different
Use case: When we want to assign labels to each word (e.g. part-of-speech tagging)
[Diagram: inputs x1 x2 x3 x4 → Model → outputs x2 x3 x4 x5]
A causal or an auto-regressive model: each output is the next input in the sequence
Use case: When we want to generate tokens (e.g. language modeling)
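A tiny sketch of what “each output is the next input” means for the training targets (the example sentence is illustrative):

```python
# In a causal LM, the target at each position is simply the next token.
tokens = ["The", "cat", "sat", "on", "the", "mat"]

inputs  = tokens[:-1]   # ["The", "cat", "sat", "on", "the"]
targets = tokens[1:]    # ["cat", "sat", "on", "the", "mat"]

# The model predicts targets[i] from inputs[: i + 1]; it never sees future tokens.
```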
The GPT family
GPT (2018), 117 million parameters
GPT-2 (2019), 1.5 billion parameters
GPT-3 (2020), 175 billion parameters
NeurIPS 2020 best paper
The anatomy of a GPT model
Consists only of decoder transformer blocks (contrast with BERT, which consists only of encoders); the encoder stack of the original transformer does not exist here
As it processes each subword, it masks the “future” words and conditions on (i.e. attends to) the previous words
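Putting these pieces together, here is a minimal sketch of a GPT-style decoder block: masked self-attention followed by a feed-forward network, with no cross-attention since there is no encoder (PyTorch; layer sizes and names are illustrative, not GPT's exact implementation):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Masked self-attention + feed-forward, each with a residual connection and layer norm."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):                                    # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        # Causal mask: True marks positions that may NOT be attended to.
        future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=future)   # masked self-attention
        x = x + attn_out                                      # residual connection
        x = x + self.ff(self.ln2(x))                          # feed-forward with residual
        return x

# A GPT model stacks many such blocks, followed by a projection to vocabulary logits.
y = DecoderBlock()(torch.randn(1, 10, 768))                   # (1, 10, 768)
```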
The first GPT model (sometimes called GPT-1)
Pretrained on the BooksCorpus
Also shows results on fine-tuning for end tasks, where inputs and outputs are converted to text
GPT-2 is nearly identical to GPT-1, but much larger:
GPT-2: Model Sizes
117M, 345M, 762M, and 1542M parameters
Play with it here: https://huggingface.co/gpt2
Image from http://jalammar.github.io/illustrated-gpt2/
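For reference, a minimal way to try GPT-2 with the Hugging Face transformers library (a sketch; the prompt and generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Transformers are the default building blocks for", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```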
GPT-3: A Very Large Language Model (2020)
[Figure: size comparisons across the GPT family]
Even more impressive: zero-shot capabilities
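As an illustration of zero-shot versus few-shot use, the task is specified entirely in the prompt, with no gradient updates (the prompts below are illustrative, in the style of the GPT-3 paper's translation examples):

```python
# Zero-shot: a task description alone, no examples, no fine-tuning.
zero_shot_prompt = "Translate English to French:\ncheese =>"

# Few-shot: a handful of demonstrations in the prompt, still no gradient updates.
few_shot_prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "plush giraffe => girafe peluche\n"
    "cheese =>"
)
```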
Questions?