10-605 / 10-805
Machine Learning from Large Datasets
Outline
RETRIEVAL AUGMENTED LLMS: A RECAP
Questions
Is the source of information reliable?
What if documents contradict each other?
What if information needed is spread across many documents?
RECAP
Contrastive learning for retrieval
"Projection head" to compress the large representation
Loss usually needs positive and negative retrievals for q
Usually not a bit vector
Contrastive loss fine-tunes the large encoder (and the projection head)
Contrastive learning: losses
NT-Xent: normalized temperature-scaled cross-entropy
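A minimal NumPy sketch of NT-Xent with in-batch negatives (function name and temperature value are illustrative): each query's positive document sits on the diagonal of the similarity matrix, and every other document in the batch serves as a negative.

```python
import numpy as np

def nt_xent(q, d, tau=0.1):
    # q, d: (B, dim) query/document embeddings; row i of d is the positive for row i of q
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / tau                                # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # positives on the diagonal
```

When q and d are produced by the (fine-tuned) encoder plus projection head, minimizing this loss pulls each positive pair together while pushing apart all in-batch negatives.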
2020
The Dense Passage Retriever (DPR)
The Dense Passage Retriever (DPR)
Discussion of DPR
Encoder only
Decoder only
Encoder-decoder
Discussion of DPR
Cross-Encoders
query
i-th candidate
c1, c2, … cN
relevance of i-th candidate to query
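A toy sketch of cross-encoder ranking (`overlap_score` is a stand-in for a fine-tuned transformer that jointly encodes query and candidate): scoring N candidates takes N forward passes over the concatenated pair, which is why cross-encoders are usually rerankers over a small candidate set rather than first-stage retrievers.

```python
def rank_with_cross_encoder(score_fn, query, candidates):
    # One joint forward pass per (query, candidate) pair: N calls to score N docs,
    # versus a single query encoding for a bi-encoder with a precomputed index.
    scores = [score_fn(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order]

def overlap_score(query, candidate):
    # Toy stand-in for the learned relevance score
    return len(set(query.split()) & set(candidate.split()))
```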
Mega-batching and Dynamic Dictionaries
Momentum Encoders
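A MoCo-style momentum encoder update, sketched on plain Python floats (in practice this runs over the key encoder's parameter tensors): the key encoder is an exponential moving average of the query encoder, with m close to 1, so the cached dictionary keys stay consistent across batches.

```python
def momentum_update(key_params, query_params, m=0.999):
    # EMA update: key encoder drifts slowly toward the query encoder,
    # so keys computed in earlier batches remain comparable
    return [m * k + (1 - m) * q for k, q in zip(key_params, query_params)]
```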
Contriever vs DPR
Independent cropping
p: ..Zebras have four gaits: walk, trot, canter and gallop. They are generally slower than horses, but their great stamina helps them outrun predators. When chased, a zebra will zigzag from side to side...
a: four gaits: walk, trot, canter and gallop. They are generally
b: great stamina helps them outrun predators. When chased
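A sketch of independent cropping as a positive-pair generator (crop ratio and helper names are illustrative): two spans cropped at random from the same passage p form the positive pair (a, b), with no labels needed.

```python
import random

def independent_crops(tokens, ratio=0.5, rng=random):
    # Contriever-style positive pair: two contiguous spans cropped
    # independently from the same passage; they share the passage's
    # topic and may overlap by chance
    n = max(1, int(len(tokens) * ratio))
    def crop():
        start = rng.randrange(0, len(tokens) - n + 1)
        return tokens[start:start + n]
    return crop(), crop()
```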
RETRIEVAL AUGMENTED LLMS: SOME EXAMPLES
The OG RAG paper
2021
The OG RAG paper
returns top K docs
sort of
The OG RAG paper
returns top K docs
based on BART
(BERT-sized encoder-decoder)
The OG RAG paper
returns top K docs
based on DPR
based on BART
(BERT-sized encoder-decoder)
updated in training
frozen – not updated!
The OG RAG paper
updated in training
frozen – not updated!
Index needs to be rebuilt when BERT_d (the document encoder) changes
Re-indexing is slow and complicated
Experiments with RAG
(Results table over open-domain QA datasets: TREC, Natural Questions, TriviaQA (test1/2), WebQuestions)
RAG vs REALM
RAG: discussion
Fusion in Decoder (FiD)
2021
Fusion in Decoder (FiD)
O(N² m²) ignoring k
O(N m)
O(N m²)
O(N m)
quadratic in N → linear in N
can afford to retrieve more docs
we lose cross-attention between tokens in different passages
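The complexity gap above in two toy formulas (hidden dimension and the decoder ignored): encoding one concatenated sequence of all N passages is quadratic in N, while FiD's per-passage encoding is linear in N.

```python
def self_attention_cost(n_passages, m):
    # Concatenating all passages gives one sequence of N*m tokens,
    # so encoder self-attention touches (N*m)^2 token pairs
    return (n_passages * m) ** 2

def fid_encoder_cost(n_passages, m):
    # FiD encodes each m-token passage independently: N separate
    # m-by-m attention maps, so cost grows linearly in N
    return n_passages * m ** 2
```

The ratio between the two is exactly N, which is what lets FiD afford N = 100 retrieved passages.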
FiD Experiments
FiD Experiments
N = 100 passages
Fusion in Decoder (FiD): Discussion
RETRIEVAL WITH MODERN LLMS
Recap: Discussion of DPR
Decoder only
Discussion of DPR
Encoder only
Decoder only
Encoder-decoder
But … everybody is working on improving decoder-only LLMs, so it would be great if we could use them!
2023
Hypothetical Document Embedding (HyDE)
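A minimal sketch of the HyDE pipeline (`generate` and `embed` are stand-ins for a decoder-only LLM and a dense encoder; cosine similarity assumed): instead of embedding the query directly, embed an LLM-generated hypothetical answer document and retrieve real documents near it.

```python
import numpy as np

def hyde_retrieve(query, corpus, generate, embed):
    # 1. LLM writes a hypothetical document answering the query
    hypothetical = generate(query)
    # 2. Embed the hypothetical document, not the query
    qv = embed(hypothetical)
    # 3. Retrieve the nearest real document by cosine similarity
    dvs = np.stack([embed(d) for d in corpus])
    sims = dvs @ qv / (np.linalg.norm(dvs, axis=1) * np.linalg.norm(qv) + 1e-9)
    return corpus[int(np.argmax(sims))]
```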
Language Model as Retriever (LameR)
Shen et al., 2023 (https://arxiv.org/pdf/2304.14233)
ExpandR
EMNLP 2025
ExpandR
ExpandR
loosely interpreted
ExpandR
ExpandR
RL: doesn't need d_q but does need preferences d_q1 ≻ d_q2
ExpandR: DPO Background
where:
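For reference, the standard DPO objective (Rafailov et al., 2023), with y_w the preferred and y_l the dispreferred output:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```

Here σ is the sigmoid, β a scaling hyperparameter, and π_ref the frozen reference policy; in ExpandR's setting the pair (y_w, y_l) corresponds to the preferred/dispreferred expansions d_q1 ≻ d_q2.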
ExpandR: DPO Background
Note the gradient of the loss looks like this:
where:
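In standard form, writing \(\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\) for the implicit reward, the gradient is:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{DPO}}
  = -\,\beta\, \mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}
    \left[\sigma\!\left(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\right)
    \left(\nabla_\theta \log \pi_\theta(y_w \mid x)
        - \nabla_\theta \log \pi_\theta(y_l \mid x)\right)\right]
```

The sigmoid factor is large exactly when the implicit reward ranks the pair incorrectly, so those examples dominate the update.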
ExpandR
*and some other tricks
loosely interpreted
ExpandR: Results
Decoder-only models as encoders?
2023
2025
PromptEOL
Echo embeddings
PromptEOL: Key ideas
This sentence: “x” means in one word:
Echo embeddings: key ideas
Rewrite the sentence: x; rewritten sentence: x
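Both ideas reduce to where you pool a causal LM's hidden states; a NumPy sketch over a (seq_len, dim) hidden-state array (span indices are illustrative, and in practice the states come from a forward pass of the LM):

```python
import numpy as np

def prompteol_embedding(hidden_states):
    # PromptEOL: run the LM on 'This sentence: "x" means in one word:'
    # and take the hidden state of the *last* token; the prompt forces
    # that single state to summarize x
    return hidden_states[-1]

def echo_embedding(hidden_states, second_copy_span):
    # Echo: run the LM on 'Rewrite the sentence: x; rewritten sentence: x'
    # and mean-pool over the *second* copy of x, whose tokens can attend
    # (causally) to the entire first copy
    start, end = second_copy_span
    return hidden_states[start:end].mean(axis=0)
```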
Break Here?
“FID” WITH DECODER-ONLY LLMS
Main ideas in FiD and extensions
FlashAttention (2023) and FlexAttention (2024) also improve decoder bottlenecks
Parallel Context Windows (PCW) - 2023
TurboRAG, Blockwise Sparse Attention - 2024
Dynamic Blockwise Sparse Attention - 2025
Analog for decoder-only LLMs
2023
Key idea: cross-attend within a context window, and cross-attend between task tokens and all context windows.
Very similar to FiD
Good results for NQ (vs conventional RAG system) and ICL (especially classification with many classes)
Implemented with attention masks rather than parallel generation of keys and values
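A sketch of the PCW-style attention mask (window sizes and helper name are illustrative): boolean entry [i, j] means token i may attend to token j, so each context window is causal and self-contained, while the trailing task tokens see everything.

```python
import numpy as np

def pcw_mask(window_lens, n_task):
    # Context windows attend only within themselves (causally);
    # the trailing task tokens attend causally to all positions
    total = sum(window_lens) + n_task
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for w in window_lens:
        mask[start:start + w, start:start + w] = np.tril(np.ones((w, w), dtype=bool))
        start += w
    for i in range(start, total):   # task tokens
        mask[i, :i + 1] = True
    return mask
```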
2024
Key idea: same as PCW except
TTFT = Time To First Token
FLOPS also for first token
2025
Key idea: same as Block-Attention except
DBSA: Details
Baseline: many-shot learning
DBSA: Details
Dynamic example selection
DBSA: Results
Total latency including set-up time
DBSA: Results
TurboRAG EMNLP 2025
Similar plan as DBSA
TurboRAG