1 of 62

10-605 / 10-805

Machine Learning from Large Datasets

2 of 62

Outline

  • Recap of contrastive learning and retrieval
  • Some retrieval systems
    • DPR and Contriever (recap)
  • Combining retrieval and generation
    • The original RAG paper
    • REALM (earlier work)
    • Fusion in Decoder (later work)
  • Some recent retrieval tricks
    • HyDE and LameR
    • ExpandR
  • Guest lecture: Michiel de Jong, Cursor

3 of 62

RETRIEVAL AUGMENTED LLMS: A RECAP

4 of 62

Questions

  • How to retrieve? Can we learn to retrieve better?
  • How to generate? Are there problems here specific to RAG?
  • Do we learn to retrieve jointly with learning to generate, or separately?

Is the source of information reliable?

What if documents contradict each other?

What if information needed is spread across many documents?

RECAP

5 of 62

Contrastive learning for retrieval

“Projection head” to compress large representation

Loss usually needs positive and negative retrievals for q

Usually not a bit vector

Contrastive loss fine-tunes the large encoder (and the projection head)

RECAP

6 of 62

Contrastive learning: losses

NT-Xent: normalized temperature-scaled cross-entropy
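As a concrete sketch (not from the slides), the NT-Xent loss for a batch of query/positive pairs can be written in a few lines of NumPy; the temperature default and the cosine-similarity choice are the usual conventions, not something the slide specifies:

```python
import numpy as np

def nt_xent(Q, P, temperature=0.07):
    """NT-Xent over a batch: row i of Q should match row i of P,
    with the other rows of P serving as in-batch negatives."""
    # Cosine similarity: normalize rows, then take inner products.
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    S = Q @ P.T / temperature             # (B, B) similarity scores
    # Cross-entropy with the diagonal as the correct class.
    S = S - S.max(axis=1, keepdims=True)  # numerical stability
    log_probs = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()
```

When each query's positive is its own embedding, the loss is near zero; misaligned pairs drive it up.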

RECAP

7 of 62

2020

RECAP

8 of 62

The Dense Passage Retriever (DPR)

  • Split Wikipedia into “passages”
    • 21M passages, each 100 tokens long
      • Passages also include the page title
  • Encoder is BERT
    • “Projection head” is the [CLS] token
  • Process several QA datasets to get triples:
    • (question, gold passage id, gold answer span)

RECAP

9 of 62

The Dense Passage Retriever (DPR)

  • Process several QA datasets to get triples:
    • (question, gold passage id, gold answer span)
  • Triples give positive question/passage examples (qi, pi) encoded in d=768 dimensions by pre-trained BERT
  • Q is mini-batch of B questions, P is B passages, S = QPT is (B x B) matrix of similarity scores
    • score is positive for qi, pj iff i=j
      • “in-batch negatives” trick
    • also add additional hard negatives
      • 1-2 passages with high TFIDF score to q which do not contain the answer span
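A minimal NumPy sketch of the in-batch-negatives score matrix with extra hard negatives appended (the dimensions and the softmax cross-entropy objective follow the slide; the exact hard-negative handling here is one common way to do it, not necessarily DPR's code):

```python
import numpy as np

def dpr_batch_loss(Q, P, P_hard):
    """Q: (B, d) question embeddings; P: (B, d) gold passages;
    P_hard: (H, d) extra hard-negative passages shared by the batch.
    Score matrix is (B, B+H); the correct "class" for row i is column i."""
    S = Q @ np.concatenate([P, P_hard], axis=0).T    # (B, B+H) dot-product scores
    S = S - S.max(axis=1, keepdims=True)             # numerical stability
    log_probs = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    B = Q.shape[0]
    return -log_probs[np.arange(B), np.arange(B)].mean()
```

The in-batch trick is visible in the shape: one (B, B+H) matrix multiply scores every question against every passage in the batch at once.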

RECAP

10 of 62

Discussion of DPR

  • Encoder used in DPR:
    • BERT with [CLS] token
      • The obvious choice in 2020
      • Still a very good choice
    • Encoder-only and encoder-decoder models have a “bottleneck”

Encoder only

Decoder only

Encoder-decoder

RECAP

11 of 62

Discussion of DPR

  • Contrastively learned encoders are limited
    • They don’t know what the question is when they encode a document
    • So they can’t represent the relation between the document and the questions
  • One solution
    • Retrieve documents with a fast contrastive model
    • Rerank them with a more powerful model

RECAP

12 of 62

Cross-Encoders

query

i-th candidate

c1, c2, … cN

relevance of i-th candidate to query

RECAP

13 of 62

Mega-batching and Dynamic Dictionaries

  • Assume we have
    • A dataloader that streams through mini-batches
    • An SGD trainer that learns from the mini-batches
  • Idea: modify data stream
    • Keep a queue of the M most recently-seen examples
    • Add to each mini-batch the hardest negative example in the queue
  • Advantage:
    • Queue size is independent of batch size
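A sketch of the queue idea above (the class name and interface are hypothetical; the hardest negative is simply the queue entry most similar to the query):

```python
import numpy as np
from collections import deque

class NegativeQueue:
    """Keep the M most recent passage embeddings; for a new query,
    return the hardest negative (highest similarity) in the queue."""
    def __init__(self, max_size):
        # deque with maxlen drops the oldest entries automatically
        self.queue = deque(maxlen=max_size)

    def add_batch(self, passage_embs):
        self.queue.extend(passage_embs)

    def hardest_negative(self, query_emb):
        stack = np.stack(list(self.queue))   # (M, d)
        scores = stack @ query_emb           # dot-product similarity
        return stack[int(np.argmax(scores))]
```

Because the queue is just a buffer of embeddings, its size M can be far larger than the mini-batch, which is the advantage the slide points to.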

RECAP

14 of 62

Momentum Encoders

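The slide's figure shows the momentum-encoder update; as a hedged sketch, the key encoder's parameters trail the query encoder as an exponential moving average (parameters shown as a plain dict for illustration):

```python
def momentum_update(key_params, query_params, m=0.999):
    """MoCo-style update: each key-encoder parameter moves a small
    step (1 - m) toward the corresponding query-encoder parameter."""
    return {name: m * key_params[name] + (1.0 - m) * query_params[name]
            for name in key_params}
```

With m close to 1, the key encoder changes slowly, so embeddings sitting in the dictionary queue stay roughly consistent with each other.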

RECAP

15 of 62

Contriever vs DPR

  • Same NT-Xent contrastive loss
  • One BERT encoder instead of two
  • Data augmentation tricks explored
    • Independent cropping
    • Random token deletion
  • Negatives were from
    • in-batch negatives
    • momentum contrast (MoCo)
      • MoCo: a dynamic dictionary queue of encodings produced by a slowly-updated (momentum-averaged) copy of the encoder, used as negatives

RECAP

16 of 62

Independent cropping

  • Given a paragraph p create a pair (a, b) where
    • a and b are both independently chosen subsequences of the text
    • length of a and b are either fixed or sampled
  • Note a and b will be of similar size, not sentence vs. paragraph

p: ..Zebras have four gaits: walk, trot, canter and gallop. They are generally slower than horses, but their great stamina helps them outrun predators. When chased, a zebra will zigzag from side to side...

a: four gaits: walk, trot, canter and gallop. They are generally

b: great stamina helps them outrun predators. When chased
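Independent cropping as described above can be sketched in a few lines (function names are illustrative; "tokens" here are whitespace words rather than subword tokens):

```python
import random

def independent_crop(tokens, min_len=5, max_len=12, rng=random):
    """Sample one contiguous subsequence of tokens with a random length."""
    length = rng.randint(min_len, min(max_len, len(tokens)))
    start = rng.randint(0, len(tokens) - length)
    return tokens[start:start + length]

def crop_pair(paragraph):
    """Positive pair (a, b): two crops drawn independently from the same paragraph."""
    tokens = paragraph.split()
    return independent_crop(tokens), independent_crop(tokens)
```

Because both crops come from the same paragraph and have comparable lengths, they make a natural positive pair for the contrastive loss.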

RECAP

17 of 62

RETRIEVAL AUGMENTED LLMS: SOME EXAMPLES

18 of 62

The OG RAG paper

2021

19 of 62

The OG RAG paper

returns top K docs

sort of

20 of 62

The OG RAG paper

returns top K docs

based on BART

(BERT-sized encoder-decoder)

 

21 of 62

The OG RAG paper

returns top K docs

based on DPR

based on BART

(BERT-sized encoder-decoder)

updated in training

frozen – not updated!

22 of 62

The OG RAG paper

updated in training

frozen – not updated!

The index needs to be rebuilt whenever the document encoder BERTd changes

Re-indexing is slow and complicated

23 of 62

Experiments with RAG

Datasets: TREC, Natural Questions, TriviaQA (test1/test2), WebQuestions

24 of 62

Experiments with RAG

Datasets: TREC, Natural Questions, TriviaQA (test1/test2), WebQuestions

25 of 62

RAG vs REALM

  • REALM (Retrieval augmented LM, 2020):
    • full joint training of retriever and generator
    • requires periodic re-indexing
      • one process re-indexes:
        • looks for the latest checkpoint,
        • loads it, and re-indexes Wikipedia
      • a second process generates training data:
        • gets a query/answer pair
        • augments it with the top-K docs from the latest index
      • a third process trains both models
    • few later systems tried joint training
      • using REALM’s approach is complex
      • independently-trained retrieval systems got better

26 of 62

RAG: discussion

  • RAG is now a generic term for using retrieval as a preprocessing step for LLMs
    • Simplest case: a fixed retriever, with all retrieved docs appended to the question to form the LLM input
    • No custom learner / model / model-combination!
    • Simple case improves immediately whenever
      • LLM improves
      • retriever improves

    • Appending docs scales poorly when many documents are used
      • but there are some tricks to improve that

27 of 62

Fusion in Decoder (FiD)

2021

28 of 62

Fusion in Decoder (FiD)

  • In open-book QA, often
    • Questions q and answer a are short: k=O(10)
    • Passages are longer: m=O(100)
  • Retrieving and appending N passages:
    • Encoder = O((Nm + k)²)
    • Decoder = O(k · (Nm + k))
  • Trick:
    • cross-encode q with each passage separately: O(N(k+m)²)
    • decode, attending to all passages: O(k · N(m + k))

Ignoring k: appending gives encoder cost O(N²m²) and decoder cost O(Nm); FiD gives encoder cost O(Nm²) and decoder cost O(Nm).

Encoder cost drops from quadratic in N to linear in N, so we can afford to retrieve more docs, but we lose cross-attention between tokens in different passages.
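A toy NumPy sketch of the FiD data flow (the `encode` argument and the one-step decoder are stand-ins for the real T5 modules, purely to illustrate "encode separately, decode over everything"):

```python
import numpy as np

def fid_encode(q_tokens, passages, encode):
    """FiD encoder side: run the encoder on each (question + passage)
    pair SEPARATELY, then concatenate outputs along the sequence axis."""
    outs = [encode(np.concatenate([q_tokens, p], axis=0)) for p in passages]
    return np.concatenate(outs, axis=0)   # (N*(k+m), d)

def toy_decoder_step(decoder_state, encoder_out):
    """One cross-attention step: the decoder token attends over ALL
    concatenated passage encodings -- this is where the fusion happens."""
    attn = np.exp(encoder_out @ decoder_state)
    attn = attn / attn.sum()
    return attn @ encoder_out
```

Each passage is encoded without seeing the others (the lost cross-attention), but the decoder's attention spans all of them at once.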

29 of 62

FiD Experiments

30 of 62

FiD Experiments

N = 100 passages

31 of 62

Fusion in Decoder (FiD): Discussion

  • FiD is nearly as simple as RAG to implement
    • Small change from existing Encoder-Decoder like T5
    • Somewhat more expensive in training/inference
      • You can train mostly with small N and then train a little further with larger N
    • In practice a larger N helps more than joint training in many cases
    • FiD was the “strong baseline” for most open-book QA work done with encoder-decoder models

32 of 62

RETRIEVAL WITH MODERN LLMS

33 of 62

Recap: Discussion of DPR

  • Encoder used in DPR:
    • BERT with [CLS] token
      • The obvious choice in 2020
      • Still a very good choice
    • Decoder-only models: intuitively
      • Token representations don’t “know” about tokens appearing after them (causal attention)
      • Token representations from late in the document aren’t used much

Decoder only

34 of 62

Discussion of DPR

  • Encoder used in DPR:
    • BERT with [CLS] token
      • The obvious choice in 2020
      • Still a very good choice
    • Encoder-only and encoder-decoder models have a “bottleneck”

Encoder only

Decoder only

Encoder-decoder

But … everybody is working on improving decoder-only LLMs, so it would be great if we could use them!

35 of 62

2023

36 of 62

Hypothetical Document Embedding (HyDE)

  • Details
    • prompt a model (InstructGPT) to convert question q into a hypothetical answer document d

    • sample N documents d1, …, dN, by generation with temperature T
    • query vector for Contriever is average of embeddings for q and d1, …, dN
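The two bullets above can be sketched as follows (`generate` and `embed` are stand-ins for the InstructGPT sampler and the Contriever encoder; the uniform average follows the slide):

```python
import numpy as np

def hyde_query_vector(question, generate, embed, n_docs=4):
    """HyDE query construction: sample hypothetical answer documents,
    embed them, and average together with the question embedding."""
    docs = [generate(question) for _ in range(n_docs)]   # sampled at temperature T
    vecs = [embed(question)] + [embed(d) for d in docs]
    return np.mean(vecs, axis=0)
```

The resulting vector lives in the same space as the document embeddings, so nearest-neighbor search against the corpus proceeds exactly as in plain dense retrieval.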

37 of 62

Language Model as Retriever (LameR)

38 of 62

ExpandR

EMNLP 2025

39 of 62

ExpandR

40 of 62

ExpandR

  • Start out like HyDE
    • Prompted query expansion
    • Followed by dense retrieval
  • Then learn to improve both modules

    • The retriever: fix expander + contrastive learning!

loosely interpreted

41 of 62

ExpandR

  • Start out like HyDE
    • Prompted query expansion
    • Followed by dense retrieval
  • Then learn to improve both modules

    • The retriever: contrastive learning!

42 of 62

ExpandR

  • Start out like HyDE
    • Prompted query expansion
    • Followed by dense retrieval
  • Then learn to improve both modules

    • The expander: an RL method called direct preference optimization (DPO)

RL: doesn’t need dq but does need preferences dq1 > dq2

43 of 62

ExpandR: DPO Background
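The loss itself appears only as an image on the slide; in the standard DPO formulation (using the same symbols as the bullets below) it is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi^{*};\pi_{\mathrm{ref}})
 = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
   \log\sigma\!\Big(
     \beta\log\frac{\pi^{*}(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
     -\beta\log\frac{\pi^{*}(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
   \Big)\right]
```

The implied reward is $r(x,y)=\beta\log\frac{\pi^{*}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$ (up to a partition term), which is why nothing ever computes $r(x,y)$ explicitly.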

where:

  • π* is current model, πref is pre-trained or SFT-trained model
  • nothing actually computes “reward” r(x,y)

44 of 62

ExpandR: DPO Background

Note the gradient of the loss looks like this:
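(The slide's equation is an image; this is the standard form from the DPO paper, writing $\hat r(x,y)=\beta\log\frac{\pi^{*}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$:)

```latex
\nabla_\theta \mathcal{L}_{\mathrm{DPO}}
 = -\,\beta\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
   \sigma\big(\hat r(x,y_l)-\hat r(x,y_w)\big)\,
   \big(\nabla_\theta\log\pi^{*}(y_w\mid x)
        -\nabla_\theta\log\pi^{*}(y_l\mid x)\big)\right]
```

The sigmoid factor up-weights pairs the current model gets most wrong, while the gradient term pushes probability toward $y_w$ and away from $y_l$.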

where:

  • π* is current model, πref is pre-trained or SFT-trained model
  • nothing actually computes “reward” r(x,y)

45 of 62

ExpandR

  • Start out like HyDE
    • Prompted query expansion
    • Followed by dense retrieval
  • Then learn to improve both modules

    • The expander is trained with* rewards computed from the ranks of documents retrieved using candidate query expansions

*and some other tricks

loosely interpreted
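An illustrative way to turn retrieval ranks into the DPO preference pairs the previous slide mentioned (NOT the paper's exact reward; reciprocal rank of the gold document is one natural choice):

```python
def preference_pairs(expansions, gold_rank):
    """Score each candidate expansion by the reciprocal rank of the gold
    document it retrieves; emit (preferred, dispreferred) pairs."""
    scored = [(exp, 1.0 / gold_rank(exp)) for exp in expansions]
    pairs = []
    for a, ra in scored:
        for b, rb in scored:
            if ra > rb:
                pairs.append((a, b))   # a is preferred over b
    return pairs
```

This matches the RL observation above: no gold expansion is ever needed, only a way to say one candidate retrieved better than another.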

46 of 62

ExpandR: Results

47 of 62

Decoder-only models as encoders?

2023

2025

PromptEOL

Echo embeddings

48 of 62

PromptEOL: Key ideas

  • To summarize a sentence x
    • Prompt the model with

    • Take the Transformer’s last hidden state as the representation
    • Then
      • Use 300 sentence/word pairs as ICL demonstrations
      • Train the representations contrastively with

This sentence: “x” means in one word: “

49 of 62

Echo embeddings: key ideas

  • To summarize a sentence x
    • Prompt the model with

    • Mean-pool the tokens of the second occurrence of x as the representation

Rewrite the sentence: x; rewritten sentence: x
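The pooling step can be sketched as below (function name and the slice interface are illustrative; the point is that the second occurrence's tokens have already "seen" the whole sentence despite causal attention):

```python
import numpy as np

def echo_embedding(hidden_states, second_occurrence_slice):
    """Mean-pool decoder hidden states over the SECOND occurrence of x
    in the echo prompt 'Rewrite the sentence: x; rewritten sentence: x'."""
    start, stop = second_occurrence_slice
    return hidden_states[start:stop].mean(axis=0)
```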

50 of 62

Break Here?

51 of 62

“FiD” WITH DECODER-ONLY LLMS

52 of 62

Main ideas in FiD and extensions

  • In encoder
    • cross-encode q+p1, q+p2, … separately
      • restricting cross-attention
  • In decoder
    • attend to everything as you generate
  • Extensions
    • FiDO: optimize performance to avoid decoder bottlenecks
    • LUMEN: pre-compute encoder outputs
    • GLIMMER: add reranking

FlashAttention (2023) and FlexAttention (2024) also improve decoder bottlenecks

Parallel Context Windows (PCW) - 2023

TurboRAG, Blockwise Sparse Attention - 2024

Dynamic Blockwise Sparse Attention - 2025

Analog for decoder-only LLMs

53 of 62

2023

  • Context tokens: retrieved document(s) for RAG, in-context examples, …
  • Task tokens: question for RAG, test input for ICL, …
    • I believe output tokens are also task tokens

Key idea: cross-attend within a context window, and cross-attend between task tokens and all context windows.

Very similar to FiD

  • if question/answer are task tokens
  • no post-training required
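The key idea above is just an attention mask; a minimal NumPy sketch (boolean mask, True = may attend; the function name is illustrative):

```python
import numpy as np

def pcw_mask(window_lengths, n_task):
    """PCW-style attention mask: context tokens attend causally only
    within their own window; task tokens attend causally to everything."""
    n_ctx = sum(window_lengths)
    n = n_ctx + n_task
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for w in window_lengths:
        # causal attention restricted to one context window
        mask[start:start + w, start:start + w] = np.tril(np.ones((w, w), dtype=bool))
        start += w
    # task tokens: ordinary causal attention over the whole sequence
    mask[n_ctx:] = np.tril(np.ones((n, n), dtype=bool))[n_ctx:]
    return mask
```

This matches the slide's note that PCW is implemented with attention masks rather than by generating keys and values in parallel.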

54 of 62

2023

  • Context tokens: retrieved document(s) for RAG, in-context examples, …
  • Task tokens: question for RAG, test input for ICL, …
    • I believe output tokens are also task tokens

Key idea: cross-attend within a context window, and cross-attend between task tokens and all context windows.

Very similar to FiD

  • if question/answer are task tokens

Good results for NQ (vs conventional RAG system) and ICL (especially classification with many classes)

Implemented with attention masks rather than parallel generation of keys and values

55 of 62

2024

  • Context tokens: retrieved document(s) for RAG, in-context examples, …
  • Task tokens: question for RAG, test input for ICL, …

Key idea: same as PCW except

  • cache the independently-produced KV pairs of the documents
  • load KV pairs for relevant docs directly
  • fine-tune for a short time (100-1000 steps)

TTFT = Time To First Token

FLOPS also for first token

56 of 62

2025

Key idea: same as Block-Attention except

  • experiments are on ICL instead of RAG
  • cache the independently-produced KV pairs of the documents
  • evaluate addition of KV pairs computed for retrieved documents
    • retrieve ICL examples with BM25
  • avoid fine-tuning
    • clever use of the StreamingLLM “attention sink” trick
    • don’t mess with position encodings

57 of 62

DBSA: Details

Baseline: many-shot learning

  • Given n ICL examples
    • encode n/k blocks of ICL examples containing
      • k demonstrations
      • a shared “anchor block”
    • load all blocks into a KV cache
    • add the usual positional encoding features
  • Implemented with FlexAttention
    • Doesn’t actually encode the masked blocks

  • 50 examples/block, 30k and 90k training examples for classification tasks

58 of 62

DBSA: Details

Dynamic example selection

  • Given a query q*
    • retrieve m<n encoded blocks (with BM25)
    • load retrieved blocks and anchor block into a KV cache and add positional encoding

  • retrieve 30% of the available training examples
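The dynamic selection step can be sketched as follows (term overlap stands in for BM25 scoring, and the block structure is hypothetical; the selected blocks' precomputed KV pairs would then be loaded into the cache):

```python
def select_blocks(query_terms, blocks, m):
    """DBSA-style dynamic selection: score each cached block of ICL
    examples against the query and keep the top-m blocks."""
    q = set(query_terms)
    scored = sorted(blocks, key=lambda b: len(q & set(b["terms"])), reverse=True)
    return scored[:m]
```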

59 of 62

DBSA: Results

Total latency including set-up time

60 of 62

DBSA: Results

61 of 62

TurboRAG EMNLP 2025

Similar plan to DBSA

  • retrieved KVs for chunks of relevant text
  • no “attention sink” tricks; instead, fine-tune more for post-retrieval generation
  • dense retrieval, not BM25

62 of 62

TurboRAG