1 of 62

10-605 / 10-805

Machine Learning from Large Datasets

2 of 62

Outline

  • Recap of contrastive learning and retrieval
  • Some retrieval systems
    • DPR and Contriever (recap)
  • Combining retrieval and generation
    • The original RAG paper
    • REALM (earlier work)
    • Fusion in Decoder (later work)
  • Some recent retrieval tricks
    • HyDE and LameR
    • ExpandR
  • Guest lecture: Michiel de Jong, Cursor

3 of 62

RETRIEVAL AUGMENTED LLMS: A RECAP

4 of 62

Questions

  • How to retrieve? Can we learn to retrieve better?
  • How to generate? Are there problems here specific to RAG?
  • Do we learn to retrieve jointly with learning to generate, or separately?

Is the source of information reliable?

What if documents contradict each other?

What if information needed is spread across many documents?

RECAP

5 of 62

Contrastive learning for retrieval

“Projection head” to compress large representation

Loss usually needs positive and negative retrievals for q

Usually not a bit vector

Contrastive loss fine-tunes the large encoder (and the projection head)

RECAP

6 of 62

Contrastive learning: losses

NT-Xent: normalized temperature-scaled cross-entropy
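As a concrete sketch (not from the slides), the NT-Xent loss for a batch of query/positive pairs can be written in a few lines of NumPy; the temperature default and the cosine-similarity choice are the usual conventions, not something the slide specifies:

```python
import numpy as np

def nt_xent(Q, P, temperature=0.07):
    """NT-Xent over a batch: row i of Q should match row i of P,
    with the other rows of P serving as in-batch negatives."""
    # Cosine similarity: normalize rows, then take inner products.
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    S = Q @ P.T / temperature             # (B, B) similarity scores
    # Cross-entropy with the diagonal as the correct class.
    S = S - S.max(axis=1, keepdims=True)  # numerical stability
    log_probs = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()
```

When each query's positive is its own embedding, the loss is near zero; misaligned pairs drive it up.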

RECAP

7 of 62

2020

RECAP

8 of 62

The Dense Passage Retriever (DPR)

  • Split Wikipedia into “passages”
    • 21M passages, each 100 tokens long
      • Passages also include the page title
  • Encoder is BERT
    • “Projection head” is the [CLS] token
  • Process several QA datasets to get triples:
    • (question, gold passage id, gold answer span)

RECAP

9 of 62

The Dense Passage Retriever (DPR)

  • Process several QA datasets to get triples:
    • (question, gold passage id, gold answer span)
  • Triples give positive question/passage examples (qi, pi) encoded in d=768 dimensions by pre-trained BERT
  • Q is mini-batch of B questions, P is B passages, S = QPT is (B x B) matrix of similarity scores
    • score is positive for qi, pj iff i=j
      • “in-batch negatives” trick
    • also add additional hard negatives
      • 1-2 passages with high TFIDF score to q which do not contain the answer span
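A minimal NumPy sketch of the in-batch-negatives score matrix with extra hard negatives appended (the dimensions and the softmax cross-entropy objective follow the slide; the exact hard-negative handling here is one common way to do it, not necessarily DPR's code):

```python
import numpy as np

def dpr_batch_loss(Q, P, P_hard):
    """Q: (B, d) question embeddings; P: (B, d) gold passages;
    P_hard: (H, d) extra hard-negative passages shared by the batch.
    Score matrix is (B, B+H); the correct "class" for row i is column i."""
    S = Q @ np.concatenate([P, P_hard], axis=0).T    # (B, B+H) dot-product scores
    S = S - S.max(axis=1, keepdims=True)             # numerical stability
    log_probs = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    B = Q.shape[0]
    return -log_probs[np.arange(B), np.arange(B)].mean()
```

The in-batch trick is visible in the shape: one (B, B+H) matrix multiply scores every question against every passage in the batch at once.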

RECAP

10 of 62

Discussion of DPR

  • Encoder used in DPR:
    • BERT with [CLS] token
      • The obvious choice in 2020
      • Still a very good choice
    • Encoder-only and encoder-decoder models have a “bottleneck”

Encoder only

Decoder only

Encoder-decoder

RECAP

11 of 62

Discussion of DPR

  • Contrastively learned encoders are limited
    • They don’t know what the question is when they encode a document
    • So they can’t represent the relation between the document and the questions
  • One solution
    • Retrieve documents with a fast contrastive model
    • Rerank them with a more powerful model

RECAP

12 of 62

Cross-Encoders

query

i-th candidate

c1, c2, … cN

relevance of i-th candidate to query

RECAP

13 of 62

Mega-batching and Dynamic Dictionaries

  • Assume we have
    • A dataloader that streams through mini-batches
    • An SGD trainer that learns from the mini-batches
  • Idea: modify data stream
    • Keep a queue of the M most recently-seen examples
    • Add to each mini-batch the hardest negative example in the queue
  • Advantage:
    • Queue size is independent of batch size
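A sketch of the queue idea above (the class name and interface are hypothetical; the hardest negative is simply the queue entry most similar to the query):

```python
import numpy as np
from collections import deque

class NegativeQueue:
    """Keep the M most recent passage embeddings; for a new query,
    return the hardest negative (highest similarity) in the queue."""
    def __init__(self, max_size):
        # deque with maxlen drops the oldest entries automatically
        self.queue = deque(maxlen=max_size)

    def add_batch(self, passage_embs):
        self.queue.extend(passage_embs)

    def hardest_negative(self, query_emb):
        stack = np.stack(list(self.queue))   # (M, d)
        scores = stack @ query_emb           # dot-product similarity
        return stack[int(np.argmax(scores))]
```

Because the queue is just a buffer of embeddings, its size M can be far larger than the mini-batch, which is the advantage the slide points to.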

RECAP

14 of 62

Momentum Encoders

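The slide's figure shows the momentum-encoder update; as a hedged sketch, the key encoder's parameters trail the query encoder as an exponential moving average (parameters shown as a plain dict for illustration):

```python
def momentum_update(key_params, query_params, m=0.999):
    """MoCo-style update: each key-encoder parameter moves a small
    step (1 - m) toward the corresponding query-encoder parameter."""
    return {name: m * key_params[name] + (1.0 - m) * query_params[name]
            for name in key_params}
```

With m close to 1, the key encoder changes slowly, so embeddings sitting in the dictionary queue stay roughly consistent with each other.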

RECAP

15 of 62

Contriever vs DPR

  • Same NT-Xent contrastive loss
  • One BERT encoder instead of two
  • Data augmentation tricks explored
    • Independent cropping
    • Random token deletion
  • Negatives were from
    • in-batch negatives
    • momentum contrast (MoCo)
      • MoCo: a dynamic dictionary queue of encodings produced by a slowly-updated (momentum-averaged) copy of the encoder, used as negatives

RECAP

16 of 62

Independent cropping

  • Given a paragraph p create a pair (a, b) where
    • a and b are both independently chosen subsequences of the text
    • length of a and b are either fixed or sampled
  • Note a and b will be of similar size, not sentence vs. paragraph

p: ..Zebras have four gaits: walk, trot, canter and gallop. They are generally slower than horses, but their great stamina helps them outrun predators. When chased, a zebra will zigzag from side to side...

a: four gaits: walk, trot, canter and gallop. They are generally

b: great stamina helps them outrun predators. When chased
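Independent cropping as described above can be sketched in a few lines (function names are illustrative; "tokens" here are whitespace words rather than subword tokens):

```python
import random

def independent_crop(tokens, min_len=5, max_len=12, rng=random):
    """Sample one contiguous subsequence of tokens with a random length."""
    length = rng.randint(min_len, min(max_len, len(tokens)))
    start = rng.randint(0, len(tokens) - length)
    return tokens[start:start + length]

def crop_pair(paragraph):
    """Positive pair (a, b): two crops drawn independently from the same paragraph."""
    tokens = paragraph.split()
    return independent_crop(tokens), independent_crop(tokens)
```

Because both crops come from the same paragraph and have comparable lengths, they make a natural positive pair for the contrastive loss.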

RECAP

17 of 62

RETRIEVAL AUGMENTED LLMS: SOME EXAMPLES

18 of 62

The OG RAG paper

2021

19 of 62

The OG RAG paper

returns top K docs

sort of

20 of 62

The OG RAG paper

returns top K docs

based on BART

(BERT-sized encoder-decoder)

 

21 of 62

The OG RAG paper

returns top K docs

based on DPR

based on BART

(BERT-sized encoder-decoder)

updated in training

frozen – not updated!

22 of 62

The OG RAG paper

updated in training

frozen – not updated!

The index needs to be rebuilt whenever the document encoder BERTd changes

Re-indexing is slow and complicated

23 of 62

Experiments with RAG

Datasets: TREC, Natural Questions, TriviaQA (test1/test2), WebQuestions

24 of 62

Experiments with RAG

Datasets: TREC, Natural Questions, TriviaQA (test1/test2), WebQuestions

25 of 62

RAG vs REALM

  • REALM (Retrieval augmented LM, 2020):
    • full joint training of retriever and generator
    • requires periodic re-indexing
      • one process re-indexes:
        • looks for the latest checkpoint,
        • loads it, and re-indexes Wikipedia
      • a second process generates training data:
        • gets a query/answer pair
        • augments it with the top-K docs from the latest index
      • a third process trains both models
    • few later systems tried joint training
      • using REALM’s approach is complex
      • independently-trained retrieval systems got better

26 of 62

RAG: discussion

  • RAG is now a generic term for using retrieval as a preprocessing step for LLMs
    • Simplest case: a fixed retriever, with all retrieved docs appended to the question to form the LLM input
    • No custom learner / model / model-combination!
    • Simple case improves immediately whenever
      • LLM improves
      • retriever improves

    • Appending docs scales poorly when many documents are used
      • but there are some tricks to improve that

27 of 62

Fusion in Decoder (FiD)

2021

28 of 62

Fusion in Decoder (FiD)

  • In open-book QA, often
    • Questions q and answer a are short: k=O(10)
    • Passages are longer: m=O(100)
  • Retrieving and appending N passages:
    • Encoder = O((Nm + k)²)
    • Decoder = O(k · (Nm + k))
  • Trick:
    • cross-encode q with each passage separately: O(N(k+m)²)
    • decode, attending to all passages: O(k · N(m + k))

Ignoring k: appending gives encoder cost O(N²m²) and decoder cost O(Nm); FiD gives encoder cost O(Nm²) and decoder cost O(Nm).

Encoder cost drops from quadratic in N to linear in N, so we can afford to retrieve more docs, but we lose cross-attention between tokens in different passages.
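A toy NumPy sketch of the FiD data flow (the `encode` argument and the one-step decoder are stand-ins for the real T5 modules, purely to illustrate "encode separately, decode over everything"):

```python
import numpy as np

def fid_encode(q_tokens, passages, encode):
    """FiD encoder side: run the encoder on each (question + passage)
    pair SEPARATELY, then concatenate outputs along the sequence axis."""
    outs = [encode(np.concatenate([q_tokens, p], axis=0)) for p in passages]
    return np.concatenate(outs, axis=0)   # (N*(k+m), d)

def toy_decoder_step(decoder_state, encoder_out):
    """One cross-attention step: the decoder token attends over ALL
    concatenated passage encodings -- this is where the fusion happens."""
    attn = np.exp(encoder_out @ decoder_state)
    attn = attn / attn.sum()
    return attn @ encoder_out
```

Each passage is encoded without seeing the others (the lost cross-attention), but the decoder's attention spans all of them at once.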

29 of 62

FiD Experiments

30 of 62

FiD Experiments

N = 100 passages

31 of 62

Fusion in Decoder (FiD): Discussion

  • FiD is nearly as simple as RAG to implement
    • Small change from existing Encoder-Decoder like T5
    • Somewhat more expensive in training/inference
      • You can train mostly with small N and then train a little further with larger N
    • In practice a larger N helps more than joint training in many cases
    • FiD was the “strong baseline” for most open-book QA work done with encoder-decoder models

32 of 62

RETRIEVAL WITH MODERN LLMS

33 of 62

Recap: Discussion of DPR

  • Encoder used in DPR:
    • BERT with [CLS] token
      • The obvious choice in 2020
      • Still a very good choice
    • Decoder-only models: intuitively
      • Token representations don’t “know” about tokens appearing after them (causal attention)
      • Token representations from late in the document aren’t used much

Decoder only

34 of 62

Discussion of DPR

  • Encoder used in DPR:
    • BERT with [CLS] token
      • The obvious choice in 2020
      • Still a very good choice
    • Encoder-only and encoder-decoder models have a “bottleneck”

Encoder only

Decoder only

Encoder-decoder

But … everybody is working on improving decoder-only LLMs, so it would be great if we could use them!

35 of 62

2023

36 of 62

Hypothetical Document Embedding (HyDE)

  • Details
    • prompt a model (InstructGPT) to convert question q into a hypothetical answer document d

    • sample N documents d1, …, dN, by generation with temperature T
    • query vector for Contriever is average of embeddings for q and d1, …, dN
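The two bullets above can be sketched as follows (`generate` and `embed` are stand-ins for the InstructGPT sampler and the Contriever encoder; the uniform average follows the slide):

```python
import numpy as np

def hyde_query_vector(question, generate, embed, n_docs=4):
    """HyDE query construction: sample hypothetical answer documents,
    embed them, and average together with the question embedding."""
    docs = [generate(question) for _ in range(n_docs)]   # sampled at temperature T
    vecs = [embed(question)] + [embed(d) for d in docs]
    return np.mean(vecs, axis=0)
```

The resulting vector lives in the same space as the document embeddings, so nearest-neighbor search against the corpus proceeds exactly as in plain dense retrieval.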

37 of 62

Language Model as Retriever (LameR)

38 of 62

ExpandR

EMNLP 2025

39 of 62

ExpandR

40 of 62

ExpandR

  • Start out like HyDE
    • Prompted query expansion
    • Followed by dense retrieval
  • Then learn to improve both modules

    • The retriever: fix expander + contrastive learning!

loosely interpreted

41 of 62

ExpandR

  • Start out like HyDE
    • Prompted query expansion
    • Followed by dense retrieval
  • Then learn to improve both modules

    • The retriever: contrastive learning!

42 of 62

ExpandR

  • Start out like HyDE
    • Prompted query expansion
    • Followed by dense retrieval
  • Then learn to improve both modules

    • The expander: an RL method called direct preference optimization (DPO)

RL: doesn’t need dq but does need preferences dq1 > dq2

43 of 62

ExpandR: DPO Background
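The loss itself appears only as an image on the slide; in the standard DPO formulation (using the same symbols as the bullets below) it is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi^{*};\pi_{\mathrm{ref}})
 = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
   \log\sigma\!\Big(
     \beta\log\frac{\pi^{*}(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
     -\beta\log\frac{\pi^{*}(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
   \Big)\right]
```

The implied reward is $r(x,y)=\beta\log\frac{\pi^{*}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$ (up to a partition term), which is why nothing ever computes $r(x,y)$ explicitly.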

where:

  • π* is current model, πref is pre-trained or SFT-trained model
  • nothing actually computes “reward” r(x,y)

44 of 62

ExpandR: DPO Background

Note the gradient of the loss looks like this:
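(The slide's equation is an image; this is the standard form from the DPO paper, writing $\hat r(x,y)=\beta\log\frac{\pi^{*}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$:)

```latex
\nabla_\theta \mathcal{L}_{\mathrm{DPO}}
 = -\,\beta\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
   \sigma\big(\hat r(x,y_l)-\hat r(x,y_w)\big)\,
   \big(\nabla_\theta\log\pi^{*}(y_w\mid x)
        -\nabla_\theta\log\pi^{*}(y_l\mid x)\big)\right]
```

The sigmoid factor up-weights pairs the current model gets most wrong, while the gradient term pushes probability toward $y_w$ and away from $y_l$.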

where:

  • π* is current model, πref is pre-trained or SFT-trained model
  • nothing actually computes “reward” r(x,y)

45 of 62

ExpandR

  • Start out like HyDE
    • Prompted query expansion
    • Followed by dense retrieval
  • Then learn to improve both modules

    • The expander is trained with* rewards computed from the ranks of documents retrieved using candidate query expansions

*and some other tricks

loosely interpreted
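An illustrative way to turn retrieval ranks into the DPO preference pairs the previous slide mentioned (NOT the paper's exact reward; reciprocal rank of the gold document is one natural choice):

```python
def preference_pairs(expansions, gold_rank):
    """Score each candidate expansion by the reciprocal rank of the gold
    document it retrieves; emit (preferred, dispreferred) pairs."""
    scored = [(exp, 1.0 / gold_rank(exp)) for exp in expansions]
    pairs = []
    for a, ra in scored:
        for b, rb in scored:
            if ra > rb:
                pairs.append((a, b))   # a is preferred over b
    return pairs
```

This matches the RL observation above: no gold expansion is ever needed, only a way to say one candidate retrieved better than another.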

46 of 62

ExpandR: Results

47 of 62

Decoder-only models as encoders?

2023

2025

PromptEOL

Echo embeddings

48 of 62

PromptEOL: Key ideas

  • To summarize a sentence x
    • Prompt the model with

    • Take the Transformer’s last hidden state as the representation
    • Then
      • Use 300 sentence/word pairs as ICL demonstrations
      • Train the representations contrastively with

This sentence: “x” means in one word: “

49 of 62

Echo embeddings: key ideas

  • To summarize a sentence x
    • Prompt the model with

    • Mean-pool the tokens of the second occurrence of x as the representation

Rewrite the sentence: x; rewritten sentence: x
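The pooling step can be sketched as below (function name and the slice interface are illustrative; the point is that the second occurrence's tokens have already "seen" the whole sentence despite causal attention):

```python
import numpy as np

def echo_embedding(hidden_states, second_occurrence_slice):
    """Mean-pool decoder hidden states over the SECOND occurrence of x
    in the echo prompt 'Rewrite the sentence: x; rewritten sentence: x'."""
    start, stop = second_occurrence_slice
    return hidden_states[start:stop].mean(axis=0)
```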

50 of 62

Break Here?

51 of 62

“FiD” WITH DECODER-ONLY LLMS

52 of 62

Main ideas in FiD and extensions

  • In encoder
    • cross-encode q+p1, q+p2, … separately
      • restricting cross-attention
  • In decoder
    • attend to everything as you generate
  • Extensions
    • FiDO: optimize performance to avoid decoder bottlenecks
    • LUMEN: pre-compute encoder outputs
    • GLIMMER: add reranking

FlashAttention (2023) and FlexAttention (2024) also improve decoder bottlenecks

Parallel Context Windows (PCW) - 2023

TurboRAG, Blockwise Sparse Attention - 2024

Dynamic Blockwise Sparse Attention - 2025

Analog for decoder-only LLMs

53 of 62

2023

  • Context tokens: retrieved document(s) for RAG, in-context examples, …
  • Task tokens: question for RAG, test input for ICL, …
    • I believe output tokens are also task tokens

Key idea: cross-attend within a context window, and cross-attend between task tokens and all context windows.

Very similar to FiD

  • if question/answer are task tokens
  • no post-training required
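The key idea above is just an attention mask; a minimal NumPy sketch (boolean mask, True = may attend; the function name is illustrative):

```python
import numpy as np

def pcw_mask(window_lengths, n_task):
    """PCW-style attention mask: context tokens attend causally only
    within their own window; task tokens attend causally to everything."""
    n_ctx = sum(window_lengths)
    n = n_ctx + n_task
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for w in window_lengths:
        # causal attention restricted to one context window
        mask[start:start + w, start:start + w] = np.tril(np.ones((w, w), dtype=bool))
        start += w
    # task tokens: ordinary causal attention over the whole sequence
    mask[n_ctx:] = np.tril(np.ones((n, n), dtype=bool))[n_ctx:]
    return mask
```

This matches the slide's note that PCW is implemented with attention masks rather than by generating keys and values in parallel.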

54 of 62

2023

  • Context tokens: retrieved document(s) for RAG, in-context examples, …
  • Task tokens: question for RAG, test input for ICL, …
    • I believe output tokens are also task tokens

Key idea: cross-attend within a context window, and cross-attend between task tokens and all context windows.

Very similar to FiD

  • if question/answer are task tokens

Good results for NQ (vs conventional RAG system) and ICL (especially classification with many classes)

Implemented with attention masks rather than parallel generation of keys and values

55 of 62

2024

  • Context tokens: retrieved document(s) for RAG, in-context examples, …
  • Task tokens: question for RAG, test input for ICL, …

Key idea: same as PCW except

  • cache the independently-produced KV pairs of the documents
  • load KV pairs for relevant docs directly
  • fine-tune for a short time (100-1000 steps)

TTFT = Time To First Token

FLOPS also for first token

56 of 62

2025

Key idea: same as Block-Attention except

  • experiments are on ICL instead of RAG
  • cache the independently-produced KV pairs of the documents
  • evaluate addition of KV pairs computed for retrieved documents
    • retrieve ICL examples with BM25
  • avoid fine-tuning
    • clever use of the StreamingLLM “attention sink” trick
    • don’t mess with position encodings

57 of 62

DBSA: Details

Baseline: many-shot learning

  • Given n ICL examples
    • encode n/k blocks of ICL examples containing
      • k demonstrations
      • a shared “anchor block”
    • load all blocks into a KV cache
    • add the usual positional encoding features
  • Implemented with FlexAttention
    • Doesn’t actually encode the masked blocks

  • 50 examples/block, 30k and 90k training examples for classification tasks

58 of 62

DBSA: Details

Dynamic example selection

  • Given a query q*
    • retrieve m<n encoded blocks (with BM25)
    • load retrieved blocks and anchor block into a KV cache and add positional encoding

  • retrieve 30% of the available training examples
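The dynamic selection step can be sketched as follows (term overlap stands in for BM25 scoring, and the block structure is hypothetical; the selected blocks' precomputed KV pairs would then be loaded into the cache):

```python
def select_blocks(query_terms, blocks, m):
    """DBSA-style dynamic selection: score each cached block of ICL
    examples against the query and keep the top-m blocks."""
    q = set(query_terms)
    scored = sorted(blocks, key=lambda b: len(q & set(b["terms"])), reverse=True)
    return scored[:m]
```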

59 of 62

DBSA: Results

Total latency including set-up time

60 of 62

DBSA: Results

61 of 62

TurboRAG EMNLP 2025

Similar plan to DBSA

  • retrieved KVs for chunks of relevant text
  • no “attention sink” tricks; instead, fine-tune more for post-retrieval generation
  • dense retrieval, not BM25

62 of 62

TurboRAG