1 of 45

Evaluating and Optimizing your RAG App

Jerry Liu, LlamaIndex co-founder/CEO

2 of 45

RAG

3 of 45

Context

  • LLMs are a phenomenal piece of technology for knowledge generation and reasoning. They are pre-trained on large amounts of publicly available data.

Use Cases

Question-Answering

Text Generation

Summarization

Planning

LLMs

4 of 45

Context

  • How do we best augment LLMs with our own private data?

Use Cases

Question-Answering

Text Generation

Summarization

Planning

LLMs

APIs

Raw Files

SQL DBs

Vector Stores

?

5 of 45

Paradigms for inserting knowledge

Fine-tuning - baking knowledge into the weights of the network

LLM

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep...

RLHF, Adam, SGD, etc.

6 of 45

Paradigms for inserting knowledge

Retrieval Augmentation - Fix the model, put context into the prompt

LLM

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep...

Input Prompt

Here is the context:

Before college the two main things…

Given the context, answer the following question: {query_str}

7 of 45

LlamaIndex: A data framework for LLM applications

  • Data Management and Query Engine for your LLM application
  • Offers components across the data lifecycle: ingest, index, and query over data

Data Ingestion (LlamaHub 🦙)

Data Structures

Queries

  • Connect your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.)
  • Store and index your data for different use cases. Integrate with different DBs (vector DB, graph DB, KV DB)
  • Retrieve and query over data
  • Includes: QA, Summarization, Agents, and more

8 of 45

9 of 45

RAG Stack

10 of 45

Current RAG Stack for building a QA System

[Diagram: Doc → chunks → Vector Database → LLM, spanning a Data Ingestion / Parsing stage and a Data Querying stage]

5 Lines of Code in LlamaIndex!
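For reference, the "5 lines" look roughly like this in LlamaIndex (a sketch using the legacy `llama_index` import path; the "data" directory and the question are placeholders):

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()   # load raw files
index = VectorStoreIndex.from_documents(documents)      # chunk + embed + store
query_engine = index.as_query_engine()                  # retrieval + synthesis
print(query_engine.query("What did the author do growing up?"))
```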

11 of 45

Current RAG Stack (Data Ingestion/Parsing)

[Diagram: Doc → chunks → Vector Database]

Process:

  • Split up document(s) into evenly sized chunks.
  • Each chunk is a piece of raw text.
  • Generate an embedding for each chunk (e.g. OpenAI embeddings, sentence_transformers)
  • Store each chunk in a vector database
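A minimal sketch of these ingestion steps (chunk sizes and the "data" directory are illustrative; imports follow the legacy `llama_index` layout):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser

# Split document(s) into evenly sized raw-text chunks ("nodes")
documents = SimpleDirectoryReader("data").load_data()
parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=20)
nodes = parser.get_nodes_from_documents(documents)

# Generate an embedding per chunk (OpenAI embeddings by default) and
# store the chunks in the vector store backing the index
index = VectorStoreIndex(nodes)
```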

12 of 45

Current RAG Stack (Querying)

[Diagram: Vector Database → retrieved chunks → LLM]

Process:

  • Find top-k most similar chunks from vector database collection
  • Plug into LLM response synthesis module

13 of 45

Current RAG Stack (Querying)

[Diagram: Vector Database → retrieved chunks → LLM]

Process:

  • Find top-k most similar chunks from vector database collection
  • Plug into LLM response synthesis module

Retrieval

Synthesis
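The same two stages can be spelled out explicitly instead of hidden behind `index.as_query_engine()` (a sketch building on the `index` from the ingestion example; the top-k value and response mode are illustrative):

```python
from llama_index import get_response_synthesizer
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Retrieval: top-k most similar chunks from the vector store
retriever = VectorIndexRetriever(index=index, similarity_top_k=3)

# Synthesis: plug retrieved chunks into the LLM response synthesis module
response_synthesizer = get_response_synthesizer(response_mode="compact")
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)
response = query_engine.query("What did the author do growing up?")
```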

14 of 45

Challenges with “Naive” RAG

15 of 45

Challenges with Naive RAG

  • Failure Modes
    • Quality-Related (Hallucination, Accuracy)
    • Non-Quality-Related (Latency, Cost, Syncing)

16 of 45

Challenges with Naive RAG (Response Quality)

  • Bad Retrieval
    • Low Precision: Not all chunks in retrieved set are relevant
      • Hallucination + Lost in the Middle Problems
    • Low Recall: Not all relevant chunks are retrieved.
      • Lacks enough context for LLM to synthesize an answer
    • Outdated information: The data is redundant or out of date.

17 of 45

Challenges with Naive RAG (Response Quality)

  • Bad Retrieval
    • Low Precision: Not all chunks in retrieved set are relevant
      • Hallucination + Lost in the Middle Problems
    • Low Recall: Not all relevant chunks are retrieved.
      • Lacks enough context for LLM to synthesize an answer
    • Outdated information: The data is redundant or out of date.
  • Bad Response Generation
    • Hallucination: Model makes up an answer that isn’t in the context.
    • Irrelevance: Model makes up an answer that doesn’t answer the question.
    • Toxicity/Bias: Model makes up an answer that’s harmful/offensive.

18 of 45

What do we do?

  • Data: Can we store additional information beyond raw text chunks?
  • Embeddings: Can we optimize our embedding representations?
  • Retrieval: Can we do better than top-k embedding lookup?
  • Synthesis: Can we use LLMs for more than generation?

But before all this…

We need a way to measure performance

19 of 45

Evaluation

20 of 45

Evaluation

  • How do we properly evaluate a RAG system?
    • Evaluate in isolation (retrieval, synthesis)
    • Evaluate e2e
  • Open question: which one should we do first?

[Diagram: Vector Database → retrieved chunks (Retrieval) → LLM (Synthesis)]

21 of 45

Evaluation in Isolation (Retrieval)

  • Details: Evaluate quality of retrieved chunks given user query
  • Collect dataset
    • Input: query
    • Output: the “ground-truth” documents relevant to the query
  • Run retriever over dataset
  • Measure ranking metrics
    • Success rate / hit-rate
    • MRR (mean reciprocal rank)
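Both metrics are straightforward to compute once you have, per query, the ranked list of retrieved chunk IDs and the ground-truth chunk ID. A self-contained sketch (LlamaIndex also shipped a `RetrieverEvaluator` that computes these over a dataset):

```python
def hit_rate_and_mrr(results, k=10):
    """results: list of (retrieved_ids, expected_id) pairs, one per query,
    where retrieved_ids is ranked best-first."""
    hits, reciprocal_ranks = 0, 0.0
    for retrieved_ids, expected_id in results:
        top_k = list(retrieved_ids)[:k]
        if expected_id in top_k:
            hits += 1
            reciprocal_ranks += 1.0 / (top_k.index(expected_id) + 1)
    n = len(results)
    return hits / n, reciprocal_ranks / n  # (hit rate, MRR)

# Toy example: the ground-truth chunk is ranked 2nd for the first query
# and missing for the second -> hit rate 0.5, MRR 0.25
print(hit_rate_and_mrr([(["c3", "c7"], "c7"), (["c1", "c2"], "c9")], k=2))
```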

22 of 45

Synthetic Dataset Generation for Retrieval Evals

  1. Parse / chunk up text corpus
  2. Prompt GPT-4 to generate questions from each chunk (or subset of chunks)
  3. Each (question, chunk) is now your evaluation pair!
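A sketch of this with the helper LlamaIndex exposed around this time (`generate_question_context_pairs`; exact import paths vary by version, and `nodes` are the chunks from the ingestion step):

```python
from llama_index.evaluation import generate_question_context_pairs
from llama_index.llms import OpenAI

qa_dataset = generate_question_context_pairs(
    nodes,
    llm=OpenAI(model="gpt-4"),
    num_questions_per_chunk=2,
)
# qa_dataset maps each generated question back to the chunk it came from,
# i.e. your (question, chunk) evaluation pairs.
```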

23 of 45

Evaluation E2E

  • Details: Evaluation of final generated response given input
  • Collect dataset
    • Input: query
    • [Optional] Output: the “ground-truth” answer
  • Run through full RAG pipeline
  • Collect evaluation metrics:
    • If no labels: label-free evals
    • If labels: with-label evals

24 of 45

Synthetic Dataset Generation for E2E Evals

  • Parse / chunk up text corpus
  • Prompt GPT-4 to generate questions from each chunk
  • Run (question, context) through GPT-4 → Get a “ground-truth” response
  • Each (question, response) is now your evaluation pair!
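A sketch of the same idea for end-to-end pairs; the helper function and prompt templates here are hypothetical, not a fixed LlamaIndex API:

```python
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-4")

QUESTION_TMPL = "Given this context, write one question it can answer:\n{context}"
ANSWER_TMPL = (
    "Here is the context:\n{context}\n"
    "Given the context, answer the following question:\n{question}"
)

def build_e2e_eval_pairs(nodes):
    """Hypothetical helper: (question, ground-truth response) pairs per chunk."""
    pairs = []
    for node in nodes:
        context = node.get_content()
        question = llm.complete(QUESTION_TMPL.format(context=context)).text.strip()
        answer = llm.complete(
            ANSWER_TMPL.format(context=context, question=question)
        ).text.strip()
        pairs.append((question, answer))
    return pairs
```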

25 of 45

LLM-based Evaluation Modules

  • GPT-4 is a good proxy for a human grader
  • Label-free Modules
    • Faithfulness: whether the response is grounded in the retrieved context
    • Relevancy: whether the response actually answers the query
    • Guidelines: whether the response adheres to specified guidelines
  • With-Labels
    • Correctness: whether the response matches a “golden” answer.
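These map onto the evaluator modules LlamaIndex shipped in this period (names and import paths shifted across versions, so treat this as a sketch; it assumes the `query_engine` built earlier, and the query/reference strings are placeholders):

```python
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
)

# GPT-4 as the grader
gpt4_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))

query = "What did the author do growing up?"
response = query_engine.query(query)

# Label-free: is the response grounded in the retrieved context?
print(FaithfulnessEvaluator(service_context=gpt4_context)
      .evaluate_response(response=response).passing)

# Label-free: does the response (and retrieved context) answer the query?
print(RelevancyEvaluator(service_context=gpt4_context)
      .evaluate_response(query=query, response=response).passing)

# With labels: does the response match a "golden" answer?
print(CorrectnessEvaluator(service_context=gpt4_context).evaluate(
    query=query,
    response=str(response),
    reference="He wrote short stories and programmed.",  # placeholder golden answer
).score)
```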

26 of 45

Techniques for Better Performing RAG

27 of 45

Decouple Embeddings from Raw Text Chunks

Raw text chunks can bias your embedding representation with filler content (Max Rumpf, sid.ai)

28 of 45

Small-to-Big Retrieval

Solutions:

  • Embed text at the sentence level, then expand to the surrounding window of sentences during LLM synthesis

29 of 45

Small-to-Big Retrieval

Solutions:

  • Embed text at the sentence level, then expand to the surrounding window of sentences during LLM synthesis

Sentence Window Retrieval (k=2)

Naive Retrieval (k=5)

Only one out of the 5 chunks is relevant - “lost in the middle” problem
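LlamaIndex packages this pattern as sentence-window retrieval; a sketch (window size, top-k, and metadata keys mirror the defaults used in the guides of this period, and import paths varied slightly between versions):

```python
from llama_index import VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import MetadataReplacementPostProcessor

# Parse into single sentences, stashing the surrounding sentences in metadata
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# Retrieve by sentence embedding, then swap in the wider window
# before the LLM synthesizes an answer
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
```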

30 of 45

Embed References to Text Chunks

Solutions:

  • Embed “references” to text chunks instead of the text chunks directly.
  • Examples
    • Smaller Chunks
    • Metadata
    • Summaries
  • Retrieve those references first, then the text chunks.
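One concrete version of this is small chunks pointing at their parent chunks via `IndexNode`, retrieved with a `RecursiveRetriever` (a sketch; chunk sizes are illustrative and `documents` comes from the ingestion step):

```python
from llama_index import VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import IndexNode
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Big chunks: what the LLM eventually sees
big_nodes = SimpleNodeParser.from_defaults(chunk_size=1024).get_nodes_from_documents(documents)

# Small chunks: what gets embedded; each references its parent chunk
small_parser = SimpleNodeParser.from_defaults(chunk_size=128)
small_nodes = []
for big_node in big_nodes:
    for sub in small_parser.get_nodes_from_documents([big_node]):
        small_nodes.append(IndexNode(text=sub.get_content(), index_id=big_node.node_id))

# Retrieve the references first, then resolve them to the underlying big chunks
vector_index = VectorStoreIndex(small_nodes)
retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_index.as_retriever(similarity_top_k=2)},
    node_dict={n.node_id: n for n in big_nodes},
)
query_engine = RetrieverQueryEngine.from_args(retriever)
```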

31 of 45

Organize your data for more structured retrieval

(Recursive Retrieval)

Summaries → documents

  • Embed larger documents via summaries. First retrieve documents by summaries, then retrieve chunks within those documents

32 of 45

Organize your data for more structured retrieval

(Metadata)

  • Metadata: context you can inject into each text chunk
  • Examples
    • Page number
    • Document title
    • Summary of adjacent chunks
    • Questions that chunk can answer (reverse HyDE)
  • Benefits
    • Can Help Retrieval
    • Can Augment Response Quality
    • Integrates with Vector DB Metadata Filters

Example of Metadata

Text Chunk: “We report the development of GPT-4, a large-scale, multimodal…”

Metadata: {"page_num": 1, "org": "OpenAI"}
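In LlamaIndex this is just the metadata dict attached to each node or document; a minimal sketch mirroring the example above:

```python
from llama_index.schema import TextNode

node = TextNode(
    text="We report the development of GPT-4, a large-scale, multimodal…",
    metadata={"page_num": 1, "org": "OpenAI"},
)
# By default the metadata is visible to both the embedding model and the LLM,
# so it can help retrieval and augment response quality.
```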

33 of 45

Organize your data for more structured retrieval

(Metadata Filters)

Question: “Can you tell me about Google’s R&D initiatives from 2020 to 2023?”

  • Dumping chunks to a single collection doesn’t work.

[Diagram: a single collection of all 10Q document chunks; a top-4 embedding lookup for the query returns chunks scattered across the 2020, 2021, and 2023 10Qs]

No guarantee you’ll return the relevant document chunks!

34 of 45

Organize your data for more structured retrieval

(Metadata Filters)

Question: “Can you tell me about Google’s R&D initiatives from 2020 to 2023?”

  • Here, we separate and tag the documents with metadata filters.
  • During query time, we can infer these metadata filters in addition to the semantic query.

[Diagram: documents separated into per-year 10Q collections (2020, 2021, 2022, 2023), each tagged with metadata; the query embedding plus the inferred metadata tags retrieve the relevant chunk from each year]
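A sketch of both pieces in LlamaIndex: manually specified metadata filters, and an auto-retriever that infers filters plus a semantic query from the question (metadata field names and descriptions are illustrative, and auto-retrieval assumes a vector store that supports metadata filters):

```python
from llama_index.vector_stores.types import (
    MetadataFilters,
    ExactMatchFilter,
    VectorStoreInfo,
    MetadataInfo,
)
from llama_index.retrievers import VectorIndexAutoRetriever

# Manual: restrict retrieval to the 2020 10Q before the top-k lookup
filters = MetadataFilters(filters=[ExactMatchFilter(key="year", value=2020)])
query_engine = index.as_query_engine(similarity_top_k=4, filters=filters)

# Automatic: let the LLM infer the year filter(s) plus the semantic query
vector_store_info = VectorStoreInfo(
    content_info="Google quarterly (10Q) filings",
    metadata_info=[
        MetadataInfo(name="year", type="int", description="Fiscal year of the filing"),
    ],
)
auto_retriever = VectorIndexAutoRetriever(index, vector_store_info=vector_store_info)
nodes = auto_retriever.retrieve(
    "Can you tell me about Google's R&D initiatives from 2020 to 2023?"
)
```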

35 of 45

Organize your data for more structured retrieval

(Recursive Retrieval)

  • Organize your data hierarchically
    • Summaries → documents
    • Documents → embedded objects (Tables/Graphs)

36 of 45

Organize your data for more structured retrieval

(Recursive Retrieval)

Summaries → documents

  • Embed larger documents via summaries. First retrieve documents by summaries, then retrieve chunks within those documents
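A sketch of the summaries → documents pattern using the document-summary index from this period (`DocumentSummaryIndex`; the summarizer LLM and response mode are illustrative choices):

```python
from llama_index import DocumentSummaryIndex, ServiceContext, get_response_synthesizer
from llama_index.llms import OpenAI

# Summarize each document at build time; at query time, match the query
# against the summaries first, then pull that document's chunks into the LLM
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo"))
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")
doc_summary_index = DocumentSummaryIndex.from_documents(
    documents,
    service_context=service_context,
    response_synthesizer=response_synthesizer,
)
query_engine = doc_summary_index.as_query_engine()
```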

37 of 45

Organize your data for more structured retrieval

(Recursive Retrieval)

Documents → Embedded Objects

  • If your PDF documents contain embedded objects (tables, graphs), first retrieve a reference to the object, then query the underlying object itself (see the sketch below).
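A sketch of the documents → embedded objects pattern: an `IndexNode` stands in for a table inside the vector index and points at a query engine over the underlying table (the table contents, the `table_1` ID, the `text_nodes` variable, and the choice of a pandas query engine are all illustrative):

```python
import pandas as pd
from llama_index import VectorStoreIndex
from llama_index.schema import IndexNode
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine, PandasQueryEngine

# Query engine over the raw table object (toy, made-up numbers)
df = pd.DataFrame({"year": [2020, 2021], "r_and_d_spend_billions": [10.0, 12.0]})
table_query_engine = PandasQueryEngine(df=df)

# Reference node describing the table, embedded alongside the regular text chunks
table_ref = IndexNode(text="Table of R&D spend by year (in $B).", index_id="table_1")
vector_index = VectorStoreIndex(text_nodes + [table_ref])  # text_nodes: regular chunks

retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_index.as_retriever(similarity_top_k=2)},
    query_engine_dict={"table_1": table_query_engine},
)
query_engine = RetrieverQueryEngine.from_args(retriever)
```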

38 of 45

Production RAG Guide

39 of 45

Fine-Tuning

40 of 45

Fine-tuning

You can choose to fine-tune the embeddings or the LLM

41 of 45

Fine-tuning (Embeddings)

Generate a synthetic query dataset from raw text chunks using LLMs

NOTE: Similar process to generating an evaluation dataset!

Credits: Jo Bergum, vespa.ai

42 of 45

Fine-tuning (Embeddings)

Use this synthetic dataset to finetune an embedding model.

  • Directly fine-tune a sentence_transformers model
  • Fine-tune a black-box adapter (a linear transform or any neural network) on top of the embeddings
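A sketch using the fine-tuning helpers LlamaIndex shipped at the time (`generate_qa_embedding_pairs` plus `SentenceTransformersFinetuneEngine`; the base model ID, output path, and the `train_nodes`/`val_nodes` split are illustrative). For the adapter variant there was an analogous `EmbeddingAdapterFinetuneEngine`.

```python
from llama_index.finetuning import (
    generate_qa_embedding_pairs,
    SentenceTransformersFinetuneEngine,
)
from llama_index.llms import OpenAI

# Synthetic (query, chunk) pairs from your corpus, as on the previous slide
train_dataset = generate_qa_embedding_pairs(train_nodes, llm=OpenAI(model="gpt-3.5-turbo"))
val_dataset = generate_qa_embedding_pairs(val_nodes, llm=OpenAI(model="gpt-3.5-turbo"))

# Directly fine-tune a sentence_transformers model on those pairs
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",
    model_output_path="finetuned_embeddings",
    val_dataset=val_dataset,
)
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()
```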

43 of 45

Fine-tuning a Black-box Adapter

44 of 45

Fine-tuning (LLMs)

Use OpenAI fine-tuning to distill GPT-4 outputs into gpt-3.5-turbo

  • Final response generation
  • Agent intermediate reasoning
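A sketch of the distillation flow with the OpenAI fine-tuning integrations from this period (`OpenAIFineTuningHandler` logs the GPT-4 prompt/completion pairs, `OpenAIFinetuneEngine` launches the job; the file name and temperature are illustrative):

```python
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.callbacks import OpenAIFineTuningHandler, CallbackManager
from llama_index.finetuning import OpenAIFinetuneEngine

# 1. Answer training questions with GPT-4 while logging prompts + completions
finetuning_handler = OpenAIFineTuningHandler()
gpt4_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-4", temperature=0.3),
    callback_manager=CallbackManager([finetuning_handler]),
)
# ... run your RAG query engine built with gpt4_context over a batch of questions ...
finetuning_handler.save_finetuning_events("finetuning_events.jsonl")

# 2. Distill the logged GPT-4 outputs into gpt-3.5-turbo
finetune_engine = OpenAIFinetuneEngine("gpt-3.5-turbo", "finetuning_events.jsonl")
finetune_engine.finetune()
ft_llm = finetune_engine.get_finetuned_model(temperature=0.3)
```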

45 of 45

Finetuning Abstractions in LlamaIndex