10-605 / 10-805
Machine Learning from Large Datasets
A few reminders
DEEP LEARNING
The Evolution of Transformers
Topics
[Figure (AISTATS 2010): histogram of gradients in a 5-layer network for an artificial image recognition task, layers shown from input to output]
Topics
Tasks: sequence classification, translation, named entity recognition, image captioning
seq2seq: Encoder → Decoder
Transformers
BERT – Encoder-Only Transformers
2019
Decoder-Only Transformers
GPT-1
Tokenization in BERT / GPT2
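As a quick illustration (assuming the Hugging Face transformers package is available and can download the pretrained tokenizers), GPT-2 uses byte-level BPE and BERT uses WordPiece; the exact splits shown in the comments are only indicative.

```python
# Minimal illustration of subword tokenization in GPT-2 (byte-level BPE)
# and BERT (WordPiece). Assumes the `transformers` package is installed.
from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization splits rare words into subword units."
print(gpt2_tok.tokenize(text))   # e.g. ['Token', 'ization', 'Ġsplits', ...]
print(bert_tok.tokenize(text))   # e.g. ['token', '##ization', 'splits', ...]
```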
Tokenization for Vision Transformers
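A minimal numpy sketch of ViT-style tokenization: split the image into fixed-size patches, flatten each patch, and project it to the model dimension. Patch size and dimensions below are illustrative.

```python
import numpy as np

# Split an image into fixed-size patches, flatten each patch, and project it
# into the model dimension; the patches become the transformer's "tokens".
def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # (H/p, p, W/p, p, C) -> (num_patches, p*p*C)
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return x

image = np.random.rand(224, 224, 3)
patches = patchify(image)                     # (196, 768)
W_embed = np.random.randn(16 * 16 * 3, 512) * 0.02
tokens = patches @ W_embed                    # (196, 512) patch "tokens"
pos = np.arange(tokens.shape[0])              # a 1D positional index per patch is typically enough
print(tokens.shape, pos.shape)
```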
2D positional encoding is not better than 1D positional encoding
LoRA (low rank adaptation) finetuning
Key idea: don’t learn a full d × d matrix of weight updates; instead, learn a low-rank approximation to it (see the sketch below).
[Diagram: full finetuning vs. LoRA finetuning]
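A minimal numpy sketch of the LoRA idea with illustrative sizes: the pretrained weight W stays frozen, and only the low-rank factors A and B (rank r << d) are trained; the alpha/r scaling follows the usual LoRA convention.

```python
import numpy as np

# LoRA sketch: freeze the pretrained weight W (d_out x d_in) and learn only a
# low-rank update B @ A, with rank r much smaller than d.
d_in, d_out, r = 1024, 1024, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in)) * 0.02   # frozen pretrained weights
A = rng.standard_normal((r, d_in)) * 0.01       # trainable, r x d_in
B = np.zeros((d_out, r))                        # trainable, zero-initialized
alpha = 16.0                                    # common LoRA scaling hyperparameter

def lora_forward(x):
    # x: (batch, d_in). Equals the pretrained layer at initialization (B is zero)
    # and deviates only through the low-rank update.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((4, d_in))
print(lora_forward(x).shape)                    # (4, 1024)

# Parameter count: full finetuning updates d_out*d_in ~ 1.05M weights per layer;
# LoRA updates r*(d_in + d_out) ~ 16K, roughly a 64x reduction here.
print(W.size, A.size + B.size)
```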
HYPERPARAMETER TUNING
Hyperband and early stopping
Hyperparameter Selection
Adaptive selection loop:
A simple case:
Goal: find the best of the k values quickly
Explore the space intelligently
Theory: Bandit Problems in ML
Gaussian Process Regression
Mapping a hyperparameter configuration x to learner performance f(x) is a regression task
Challenges:
Gaussian Process Regression is a good match for this
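A minimal sketch using scikit-learn (assumed installed) of fitting a GP to observed (hyperparameter, score) pairs and picking the next configuration with a simple UCB-style rule; the synthetic data and kernel choice are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X = rng.uniform(-5, -1, size=(8, 1))                         # log10 learning rates tried so far
y = -(X[:, 0] + 3.0) ** 2 + 0.1 * rng.standard_normal(8)     # noisy "validation accuracy"

# Fit a GP to the (config, score) pairs; alpha models observation noise.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-2, normalize_y=True)
gp.fit(X, y)

# Predict mean and uncertainty over a grid; an acquisition function trades off
# exploiting high predicted means vs exploring high-variance regions.
grid = np.linspace(-5, -1, 50).reshape(-1, 1)
mean, std = gp.predict(grid, return_std=True)
next_x = grid[np.argmax(mean + 1.0 * std)]                   # simple UCB-style choice
print(next_x)
```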
Hyperparameter Selection: Early Stopping
What can we prove about this?
What do we need to assume?
In general we don’t know where a learning curve will end up…
In practice there are often big jumps in discrete eval metrics like error rate, win ratios, …
The losses you train and test on are often fairly smooth … and sometimes there are models for them
Successive Halving Method
If the number of configurations n is large, we may get very noisy results early on and miss the best
If n is too small then we will waste resources on bad arms.
Hyperband
Hyperband runs several rounds ("brackets") of SuccessiveHalving with very different values of n; see the sketch below.
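A rough sketch of SuccessiveHalving and the Hyperband outer loop. Here evaluate() is a stand-in for training a configuration for a given budget (epochs, iterations) and returning a validation score, and the bracket sizes follow the usual n vs. per-config-budget trade-off.

```python
import math, random

def evaluate(config, budget):
    # Stand-in for "train config for `budget` units and return a val score".
    return config["quality"] * (1 - math.exp(-budget)) + random.gauss(0, 0.05)

def successive_halving(configs, min_budget=1, eta=3):
    budget = min_budget
    while len(configs) > 1:
        scores = [(evaluate(c, budget), c) for c in configs]
        scores.sort(key=lambda s: s[0], reverse=True)
        configs = [c for _, c in scores[: max(1, len(configs) // eta)]]  # keep the top 1/eta
        budget *= eta                                                    # give survivors more budget
    return configs[0]

def hyperband(sample_config, max_budget=81, eta=3):
    # Several SuccessiveHalving brackets: many configs with little budget each,
    # down to few configs with lots of budget each.
    s_max = int(math.log(max_budget, eta))
    best = None
    for s in range(s_max, -1, -1):
        n = int(math.ceil((s_max + 1) * eta**s / (s + 1)))
        configs = [sample_config() for _ in range(n)]
        winner = successive_halving(configs, min_budget=max_budget / eta**s, eta=eta)
        score = evaluate(winner, max_budget)
        if best is None or score > best[0]:
            best = (score, winner)
    return best[1]

print(hyperband(lambda: {"quality": random.random()}))
```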
EMBARRASSINGLY PARALLEL LEARNING METHODS
interleaved with model compression / distillation
Outline
BTM: Key ideas
ELM = expert LM
ELMForest = group of ELMs
Experts 𝜃1, 𝜃2, …, 𝜃k are branched from a small GPT 𝜃 and can be merged with parameter averaging
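A minimal sketch of merging expert parameters by (optionally weighted) averaging; each "model" here is just a dict of numpy arrays with identical shapes.

```python
import numpy as np

# Merge expert LMs (theta_1 ... theta_k) by parameter averaging; weights could
# be uniform or reflect domain relevance.
def merge_by_averaging(expert_params, weights=None):
    k = len(expert_params)
    weights = weights or [1.0 / k] * k          # uniform average by default
    merged = {}
    for name in expert_params[0]:
        merged[name] = sum(w * p[name] for w, p in zip(weights, expert_params))
    return merged

experts = [
    {"ffn.w": np.random.randn(4, 4), "attn.w": np.random.randn(4, 4)}
    for _ in range(3)
]
merged = merge_by_averaging(experts)
print(merged["ffn.w"].shape)
```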
Background: Expert Parallelism
This can be parallelized!
Often an extra loss term is needed during training to “balance” the experts
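A minimal sketch of one common choice for such a term, a Switch-Transformer-style load-balancing loss; the routing here is top-1 and the sizes are illustrative.

```python
import numpy as np

# Auxiliary load-balancing loss for routed mixture-of-experts layers:
# penalize routers that send most tokens to a few experts.
def load_balance_loss(router_logits):
    # router_logits: (num_tokens, num_experts)
    probs = np.exp(router_logits - router_logits.max(axis=1, keepdims=True))
    probs = probs / probs.sum(axis=1, keepdims=True)           # softmax per token
    num_tokens, num_experts = probs.shape
    assigned = probs.argmax(axis=1)                             # top-1 routing
    frac_tokens = np.bincount(assigned, minlength=num_experts) / num_tokens
    mean_prob = probs.mean(axis=0)
    # Minimized when both token counts and router probabilities are uniform.
    return num_experts * float(np.dot(frac_tokens, mean_prob))

logits = np.random.randn(128, 8)
print(load_balance_loss(logits))   # ~1.0 when routing is balanced
```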
2024
Branch and Train as in Branch-train-merge
Mix the expert LMs by combining their feedforward layers into mixture-of-experts layers with a learned router; the remaining parameters are averaged
BTX results
FlexOLMO
Multiple local datasets Di; one public dataset Dpub; one model Mpub trained on Dpub
For each Di
Wr is “router embeddings”
ri is initialized with off-the-shelf embeddings from Di and then trained pairwise with Mpub
Mixture output
Optionally: create local datasets D’i, each extracted from Mpub and similar to Di (these can be small)
Sample uniformly from the “proxy datasets” and Dpub to fine-tune Wr
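A generic sketch of router-embedding-based mixing, just to fix ideas; this is not the exact FlexOLMO formulation, and the expert functions and dimensions below are made up.

```python
import numpy as np

# Generic sketch: each expert i has a router embedding r_i (a row of W_r);
# a token representation h is scored against each r_i and the expert outputs
# are combined with the resulting softmax weights.
def mixture_output(h, router_embeddings, expert_fns):
    scores = router_embeddings @ h                      # (num_experts,)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return sum(w * f(h) for w, f in zip(weights, expert_fns))

d = 16
W_r = np.random.randn(3, d)                             # one router embedding per expert
experts = [lambda h, W=np.random.randn(d, d): W @ h for _ in range(3)]
h = np.random.randn(d)
print(mixture_output(h, W_r, experts).shape)            # (16,)
```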
FlexOLMO
Experiments with task analogies
λ1·𝜏_AmazonSentiment + (λ2·𝜏_YelpLM − λ3·𝜏_AmazonLM)
Experiments with task analogies
Collect 200 images of kings, queens, women, men.
Fine-tune CLIP models on each category, using 1000 ImageNet classes as negative examples
(class name ”something”)
Experiment: adding different task vectors
Note the training for the subtasks is embarrassingly parallel
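A minimal sketch of task-vector arithmetic: each task vector is a finetuned-minus-pretrained difference, and vectors are scaled and summed onto the pretrained weights. The toy weights below are illustrative.

```python
import numpy as np

# Task arithmetic: tau = theta_finetuned - theta_pretrained, and task vectors
# can be scaled, added, and subtracted (as in the analogy
# lambda1*tau_sentiment + (lambda2*tau_yelp - lambda3*tau_amazon)).
def task_vector(finetuned, pretrained):
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply_task_vectors(pretrained, taus, lambdas):
    out = {k: v.copy() for k, v in pretrained.items()}
    for lam, tau in zip(lambdas, taus):
        for k in out:
            out[k] += lam * tau[k]
    return out

pre = {"w": np.zeros((4, 4))}
ft_a = {"w": np.ones((4, 4))}
ft_b = {"w": 2 * np.ones((4, 4))}
tau_a, tau_b = task_vector(ft_a, pre), task_vector(ft_b, pre)
# Each tau comes from an independent finetuning run, so building the library
# of task vectors is embarrassingly parallel.
merged = apply_task_vectors(pre, [tau_a, tau_b], [0.5, -0.25])
print(merged["w"][0, 0])   # 0.5*1 - 0.25*2 = 0.0
```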
MAKING TRANSFORMER MODELS MORE EFFICIENT
Model compression
Distillation
Typically we distill a model on a large amount of data labeled only by the teacher (the “transfer set”)
The simple case: teacher predictions are “hard” labels (e.g., training a generative LLM).
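A minimal numpy sketch of both cases: “hard” distillation on teacher argmax labels (or sampled tokens, for a generative LLM), and the softer temperature-based variant for comparison. Logits and sizes are synthetic.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# Hard distillation: treat the teacher's argmax as the label on the transfer set.
def hard_distill_loss(student_logits, teacher_logits):
    targets = teacher_logits.argmax(axis=-1)
    logp = log_softmax(student_logits)
    return -logp[np.arange(len(targets)), targets].mean()

# Soft distillation: match the full teacher distribution, often with T > 1.
def soft_distill_loss(student_logits, teacher_logits, T=2.0):
    p_teacher = np.exp(log_softmax(teacher_logits / T))
    logp_student = log_softmax(student_logits / T)
    # Cross-entropy between teacher and student distributions (KL up to a constant).
    return -(p_teacher * logp_student).sum(axis=-1).mean() * T * T

s = np.random.randn(32, 100)   # student logits over 100 classes / vocab items
t = np.random.randn(32, 100)   # teacher logits on the same transfer-set examples
print(hard_distill_loss(s, t), soft_distill_loss(s, t))
```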
Recap: making a large model smaller
Quantization approaches range from simple quant rules to complex rules with learning:
“32 bits everywhere”
“16 bits except for bias”
int8() except for outliers, defined as ….
jointly optimize codebook/codes and where to quantize while doing gradient updates on weights
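A minimal sketch of one point on this spectrum: symmetric per-tensor int8 quantization with an optional outlier threshold kept in full precision. The threshold and shapes are illustrative, not any specific paper’s rule.

```python
import numpy as np

def quantize_int8(w, outlier_thresh=None):
    keep_fp = np.zeros_like(w, dtype=bool)
    if outlier_thresh is not None:
        keep_fp = np.abs(w) > outlier_thresh            # rare large values stay in float
    scale = np.abs(np.where(keep_fp, 0.0, w)).max() / 127.0 + 1e-12
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale, keep_fp

def dequantize(q, scale, keep_fp, w_original):
    w_hat = q.astype(np.float32) * scale
    w_hat[keep_fp] = w_original[keep_fp]                # restore outliers exactly
    return w_hat

w = np.random.randn(256, 256).astype(np.float32)
q, scale, keep = quantize_int8(w, outlier_thresh=3.0)
err = np.abs(dequantize(q, scale, keep, w) - w).mean()
print(q.dtype, scale, err)   # int8 codes, one per-tensor scale, small reconstruction error
```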
Recap: making a large model smaller
Pruning methods range from simple to complex optimization:
Wanda: greedy weight pruning
Shortened Llama: greedy layer pruning + LoRA
Magnitude pruning: greedy weight pruning alternating with gradient updates
Sheared Llama: jointly optimize weights and a continuous approximation to the mask.
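A minimal sketch contrasting plain magnitude scores with a Wanda-style score that multiplies |W| by input-activation norms from a small calibration set; the per-row top-k masking is one simple choice.

```python
import numpy as np

def prune_mask(W, X_calib=None, sparsity=0.5):
    if X_calib is None:
        score = np.abs(W)                                   # magnitude pruning
    else:
        act_norm = np.linalg.norm(X_calib, axis=0)          # (d_in,) norms from calibration data
        score = np.abs(W) * act_norm[None, :]               # Wanda-style score
    k = int(W.shape[1] * (1 - sparsity))                    # weights kept per output row
    keep = np.argsort(-score, axis=1)[:, :k]
    mask = np.zeros_like(W, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask

W = np.random.randn(8, 64)
X = np.random.randn(100, 64)                                # calibration activations
print(prune_mask(W, sparsity=0.5).mean(), prune_mask(W, X, sparsity=0.5).mean())  # ~0.5 kept
```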
Recap: making a large model smaller
KV-cache eviction policies range from simple to complex optimization:
Score by recency (FIFO), optionally also keeping the start token and separator tokens
H2O: score by running accumulated attention
SnapKV: adaptively pick specific KV entries based on an observation window during prefilling
FastGen: pick a strategy per head and layer based on prefilling.
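A minimal sketch of an H2O-style eviction rule: keep a recent window, the start token, and the tokens with the largest accumulated attention. The budgets and the synthetic attention history are illustrative.

```python
import numpy as np

def evict_kv(attn_history, recent_window=4, heavy_budget=4):
    # attn_history: (num_steps, seq_len) attention weights from past decode steps
    seq_len = attn_history.shape[1]
    accumulated = attn_history.sum(axis=0)                         # running attention mass per token
    keep = set(range(max(0, seq_len - recent_window), seq_len))    # recency (the FIFO part)
    keep.add(0)                                                    # start token / attention sink
    heavy = np.argsort(-accumulated)[:heavy_budget]                # "heavy hitter" tokens
    keep.update(int(i) for i in heavy)
    return sorted(keep)                                            # indices whose K,V we retain

attn = np.random.dirichlet(np.ones(32), size=10)                   # 10 decode steps, 32 cached tokens
print(evict_kv(attn))
```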
Lots of detail here, but I tried to pick out the “key ideas” of each paper
RAG AND CONTRASTIVE LEARNING
Outline
Recap: The Dense Passage Retriever (DPR)
Recap: Contriever vs DPR
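Both DPR and Contriever train the retriever contrastively over query/passage embeddings, with the other passages in the batch serving as negatives; a minimal numpy sketch of that in-batch-negative loss (dot-product scores, positives on the diagonal):

```python
import numpy as np

def in_batch_contrastive_loss(q_emb, p_emb):
    # q_emb, p_emb: (batch, dim); q_emb[i] should match p_emb[i]
    scores = q_emb @ p_emb.T                                 # (batch, batch) similarity matrix
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                      # softmax cross-entropy on the diagonal

q = np.random.randn(16, 128)
p = q + 0.1 * np.random.randn(16, 128)                       # toy "matching" passages
print(in_batch_contrastive_loss(q, p))
```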
Recap: The OG RAG paper
returns top K docs
Recap: Discussion of RAG
Encoder-only vs decoder-only Transformers for Retrieval
Hypothetical Document Embedding (HyDE)
ExpandR
loosely interpreted
ExpandR
Decoder-only models as encoders?
2023
2025
PromptEOL
Echo embeddings
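A rough sketch of the two ideas, with a stand-in hidden_states() in place of a real causal LM: PromptEOL-style last-token pooling over a prompt of roughly this form, and Echo-style pooling over a repeated copy of the input (so the second copy can attend to the whole sentence despite the causal mask).

```python
import numpy as np

def hidden_states(tokens, d=64):
    # Stand-in for running a causal LM; returns one vector per token.
    rng = np.random.default_rng(abs(hash(tuple(tokens))) % (2**32))
    return rng.standard_normal((len(tokens), d))

def prompteol_style_embedding(text):
    # Wrap the text in a prompt and take the final token's hidden state.
    prompt_tokens = f'This sentence : "{text}" means in one word:'.split()
    return hidden_states(prompt_tokens)[-1]

def echo_style_embedding(text):
    # Feed the text twice and pool over the second occurrence.
    words = text.split()
    tokens = words + words
    h = hidden_states(tokens)
    return h[len(words):].mean(axis=0)

print(prompteol_style_embedding("cats chase mice").shape)
print(echo_style_embedding("cats chase mice").shape)
```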
Recap: Fusion in Decoder (FiD)
Concatenating all N retrieved passages (each of length m) into one encoder input: self-attention costs O(N² m²) (ignoring k); decoder cross-attention costs O(N m).
FiD encodes each passage separately: encoder cost is O(N m²); decoder cross-attention is still O(N m).
Quadratic in N → linear in N, so we can afford to retrieve more docs, but we lose attention between tokens in different passages (they only interact in the decoder).
Recap: FiD Optimized (FiDO)
2023
Recap: Fusion in Decoder (FiD)
FLOP analysis: Encoder is 6x as expensive as decoder!
…but at inference time decoding is slowest. Why?
What’s the fix?
Predicted by counting FLOPs for all the matmuls and assuming n_t << n_s and n_t << d
Predicted by memory/FLOPs:
[Cost breakdown by component: multilayer perceptron, self-attention, cross-attention]
Recap: Fusion in Decoder Optimized (FiDO)
Multi-query and Grouped-Query Attention
Grouped-query attention: don’t require that every attention head have its own keys and values – instead, groups of query heads re-use (share) the same keys and values
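A minimal numpy sketch of grouped-query attention with illustrative sizes: query heads are grouped, and each group shares one K/V head, so the KV cache stores only n_kv heads. Setting n_kv = 1 recovers multi-query attention and n_kv = n_q recovers standard multi-head attention.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    # q: (seq, n_q_heads, dh); k, v: (seq, n_kv_heads, dh)
    seq, n_q_heads, dh = q.shape
    n_kv_heads = k.shape[1]
    group = n_q_heads // n_kv_heads
    out = np.zeros_like(q)
    for h in range(n_q_heads):
        kv = h // group                                    # the shared K/V head for this query head
        scores = (q[:, h] @ k[:, kv].T) / np.sqrt(dh)      # (seq, seq)
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w = w / w.sum(axis=1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out.reshape(seq, -1)

seq, n_q, n_kv, dh = 6, 8, 2, 16
q = np.random.randn(seq, n_q, dh)
k = np.random.randn(seq, n_kv, dh)
v = np.random.randn(seq, n_kv, dh)
print(grouped_query_attention(q, k, v).shape)   # (6, 128)
# The KV cache stores n_kv (not n_q) heads, shrinking decoder memory traffic.
```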
Recap: Fusion in Decoder Optimized (FiDO)
Recap: LUMEN: FiD with caching
The first N−K layers of the FiD encoder are run off-line: passages are encoded and the results stored for every document. Only the last K encoder layers and the FiD decoder run at query time.
Main ideas in FiD and extensions
FlashAttention (2023) and FlexAttention (2024) also improve decoder bottlenecks
Parallel Context Windows (PCW) - 2023
TurboRAG, Blockwise Sparse Attention - 2024
Dynamic Blockwise Sparse Attention - 2025
Analog for decoder-only LLMs
2023
Key idea: context tokens attend only within their own context window, while task tokens attend to all context windows.
Very similar to FiD
2025
Key idea: same as Block-Attention except
KV Retrieval vs KV Cache Eviction
PPI (Prediction-Powered Inference): Statistically Unbiased AI Judges
Arguments for:
Bayesian PPI
Difference estimate (old) vs. Bayesian version
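A minimal sketch of the basic (non-Bayesian) PPI difference estimate: score everything with the AI judge, then correct the judge’s bias using a small human-labeled subset. The data below is synthetic.

```python
import numpy as np

def ppi_estimate(judge_unlabeled, judge_labeled, human_labeled):
    # judge_unlabeled: AI-judge scores on many unlabeled examples
    # judge_labeled / human_labeled: AI-judge and human scores on the same small labeled set
    rectifier = np.mean(human_labeled) - np.mean(judge_labeled)   # estimated judge bias
    return np.mean(judge_unlabeled) + rectifier                   # unbiased estimate of the human mean

rng = np.random.default_rng(0)
true_quality = rng.binomial(1, 0.7, size=10_000)                  # hidden human judgments
judge = np.clip(true_quality + rng.normal(0.1, 0.2, size=10_000), 0, 1)  # biased AI judge
labeled = rng.choice(10_000, size=300, replace=False)             # small human-labeled subset

naive = judge.mean()                                              # biased plug-in estimate
ppi = ppi_estimate(judge, judge[labeled], true_quality[labeled])  # bias-corrected estimate
print(round(naive, 3), round(ppi, 3), round(true_quality.mean(), 3))
```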
Bayesian PPI: QA evals
CI width
human labels: 3k AR labels
Final messages