1 of 45

Evaluating and Optimizing your RAG App

Jerry Liu, LlamaIndex co-founder/CEO

2 of 45

RAG

3 of 45

Context

  • LLMs are a phenomenal piece of technology for knowledge generation and reasoning. They are pre-trained on large amounts of publicly available data.

Use Cases

Question-Answering

Text Generation

Summarization

Planning

LLMs

4 of 45

Context

  • How do we best augment LLMs with our own private data?

Use Cases

Question-Answering

Text Generation

Summarization

Planning

LLMs

APIs

Raw Files

SQL DBs

Vector Stores

?

5 of 45

Paradigms for inserting knowledge

Fine-tuning - baking knowledge into the weights of the network

LLM

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep...

RLHF, Adam, SGD, etc.

6 of 45

Paradigms for inserting knowledge

Retrieval Augmentation - Fix the model, put context into the prompt

LLM

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep...

Input Prompt

Here is the context:

Before college the two main things…

Given the context, answer the following question: {query_str}

7 of 45

LlamaIndex: A data framework for LLM applications

  • Data Management and Query Engine for your LLM application
  • Offers components across the data lifecycle: ingest, index, and query over data

Data Ingestion (LlamaHub 🦙)

Data Structures

Queries

  • Connect your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.)
  • Store and index your data for different use cases. Integrate with different DBs (vector DB, graph DB, KV DB)
  • Retrieve and query over data
  • Includes: QA, Summarization, Agents, and more

8 of 45

9 of 45

RAG Stack

10 of 45

Current RAG Stack for building a QA System

[Diagram: Doc → chunks → Vector Database → LLM, spanning a Data Ingestion / Parsing stage and a Data Querying stage]

5 Lines of Code in LlamaIndex!
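For reference, the "5 lines" look roughly like this in LlamaIndex (a sketch using the legacy `llama_index` import path; the "data" directory and the question are placeholders):

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()   # load raw files
index = VectorStoreIndex.from_documents(documents)      # chunk + embed + store
query_engine = index.as_query_engine()                  # retrieval + synthesis
print(query_engine.query("What did the author do growing up?"))
```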

11 of 45

Current RAG Stack (Data Ingestion/Parsing)

[Diagram: Doc → chunks → Vector Database]

Process:

  • Split up document(s) into evenly sized chunks.
  • Each chunk is a piece of raw text.
  • Generate an embedding for each chunk (e.g. OpenAI embeddings, sentence_transformers)
  • Store each chunk in a vector database
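A minimal sketch of these ingestion steps (chunk sizes and the "data" directory are illustrative; imports follow the legacy `llama_index` layout):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser

# Split document(s) into evenly sized raw-text chunks ("nodes")
documents = SimpleDirectoryReader("data").load_data()
parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=20)
nodes = parser.get_nodes_from_documents(documents)

# Generate an embedding per chunk (OpenAI embeddings by default) and
# store the chunks in the vector store backing the index
index = VectorStoreIndex(nodes)
```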

12 of 45

Current RAG Stack (Querying)

[Diagram: Vector Database → retrieved chunks → LLM]

Process:

  • Find top-k most similar chunks from vector database collection
  • Plug into LLM response synthesis module

13 of 45

Current RAG Stack (Querying)

[Diagram: Vector Database → retrieved chunks → LLM]

Process:

  • Find top-k most similar chunks from vector database collection
  • Plug into LLM response synthesis module

Retrieval

Synthesis
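The same two stages can be spelled out explicitly instead of hidden behind `index.as_query_engine()` (a sketch building on the `index` from the ingestion example; the top-k value and response mode are illustrative):

```python
from llama_index import get_response_synthesizer
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Retrieval: top-k most similar chunks from the vector store
retriever = VectorIndexRetriever(index=index, similarity_top_k=3)

# Synthesis: plug retrieved chunks into the LLM response synthesis module
response_synthesizer = get_response_synthesizer(response_mode="compact")
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)
response = query_engine.query("What did the author do growing up?")
```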

14 of 45

Challenges with “Naive” RAG

15 of 45

Challenges with Naive RAG

  • Failure Modes
    • Quality-Related (Hallucination, Accuracy)
    • Non-Quality-Related (Latency, Cost, Syncing)

16 of 45

Challenges with Naive RAG (Response Quality)

  • Bad Retrieval
    • Low Precision: Not all chunks in retrieved set are relevant
      • Hallucination + Lost in the Middle Problems
    • Low Recall: Not all relevant chunks are retrieved.
      • Lacks enough context for LLM to synthesize an answer
    • Outdated information: The data is redundant or out of date.

17 of 45

Challenges with Naive RAG (Response Quality)

  • Bad Retrieval
    • Low Precision: Not all chunks in retrieved set are relevant
      • Hallucination + Lost in the Middle Problems
    • Low Recall: Not all relevant chunks are retrieved.
      • Lacks enough context for LLM to synthesize an answer
    • Outdated information: The data is redundant or out of date.
  • Bad Response Generation
    • Hallucination: Model makes up an answer that isn’t in the context.
    • Irrelevance: Model makes up an answer that doesn’t answer the question.
    • Toxicity/Bias: Model makes up an answer that’s harmful/offensive.

18 of 45

What do we do?

  • Data: Can we store additional information beyond raw text chunks?
  • Embeddings: Can we optimize our embedding representations?
  • Retrieval: Can we do better than top-k embedding lookup?
  • Synthesis: Can we use LLMs for more than generation?

But before all this…

We need a way to measure performance

19 of 45

Evaluation

20 of 45

Evaluation

  • How do we properly evaluate a RAG system?
    • Evaluate in isolation (retrieval, synthesis)
    • Evaluate e2e
  • Open question: which one should we do first?

[Diagram: Vector Database → retrieved chunks (Retrieval) → LLM (Synthesis)]

21 of 45

Evaluation in Isolation (Retrieval)

  • Details: Evaluate quality of retrieved chunks given user query
  • Collect dataset
    • Input: query
    • Output: the “ground-truth” documents relevant to the query
  • Run retriever over dataset
  • Measure ranking metrics
    • Success rate / hit-rate
    • MRR (mean reciprocal rank)
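Both metrics are straightforward to compute once you have, per query, the ranked list of retrieved chunk IDs and the ground-truth chunk ID. A self-contained sketch (LlamaIndex also shipped a `RetrieverEvaluator` that computes these over a dataset):

```python
def hit_rate_and_mrr(results, k=10):
    """results: list of (retrieved_ids, expected_id) pairs, one per query,
    where retrieved_ids is ranked best-first."""
    hits, reciprocal_ranks = 0, 0.0
    for retrieved_ids, expected_id in results:
        top_k = list(retrieved_ids)[:k]
        if expected_id in top_k:
            hits += 1
            reciprocal_ranks += 1.0 / (top_k.index(expected_id) + 1)
    n = len(results)
    return hits / n, reciprocal_ranks / n  # (hit rate, MRR)

# Toy example: the ground-truth chunk is ranked 2nd for the first query
# and missing for the second -> hit rate 0.5, MRR 0.25
print(hit_rate_and_mrr([(["c3", "c7"], "c7"), (["c1", "c2"], "c9")], k=2))
```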

22 of 45

Synthetic Dataset Generation for Retrieval Evals

  1. Parse / chunk up text corpus
  2. Prompt GPT-4 to generate questions from each chunk (or subset of chunks)
  3. Each (question, chunk) is now your evaluation pair!
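A sketch of this with the helper LlamaIndex exposed around this time (`generate_question_context_pairs`; exact import paths vary by version, and `nodes` are the chunks from the ingestion step):

```python
from llama_index.evaluation import generate_question_context_pairs
from llama_index.llms import OpenAI

qa_dataset = generate_question_context_pairs(
    nodes,
    llm=OpenAI(model="gpt-4"),
    num_questions_per_chunk=2,
)
# qa_dataset maps each generated question back to the chunk it came from,
# i.e. your (question, chunk) evaluation pairs.
```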

23 of 45

Evaluation E2E

  • Details: Evaluation of final generated response given input
  • Collect dataset
    • Input: query
    • [Optional] Output: the “ground-truth” answer
  • Run through full RAG pipeline
  • Collect evaluation metrics:
    • If no labels: label-free evals
    • If labels: with-label evals

24 of 45

Synthetic Dataset Generation for E2E Evals

  • Parse / chunk up text corpus
  • Prompt GPT-4 to generate questions from each chunk
  • Run (question, context) through GPT-4 → Get a “ground-truth” response
  • Each (question, response) is now your evaluation pair!
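A sketch of the same idea for end-to-end pairs; the helper function and prompt templates here are hypothetical, not a fixed LlamaIndex API:

```python
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-4")

QUESTION_TMPL = "Given this context, write one question it can answer:\n{context}"
ANSWER_TMPL = (
    "Here is the context:\n{context}\n"
    "Given the context, answer the following question:\n{question}"
)

def build_e2e_eval_pairs(nodes):
    """Hypothetical helper: (question, ground-truth response) pairs per chunk."""
    pairs = []
    for node in nodes:
        context = node.get_content()
        question = llm.complete(QUESTION_TMPL.format(context=context)).text.strip()
        answer = llm.complete(
            ANSWER_TMPL.format(context=context, question=question)
        ).text.strip()
        pairs.append((question, answer))
    return pairs
```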

25 of 45

LLM-based Evaluation Modules

  • GPT-4 is a good proxy for a human grader
  • Label-free Modules
    • Faithfulness: whether the response is grounded in the retrieved context
    • Relevancy: whether the response actually answers the query
    • Guidelines: whether the response adheres to specified guidelines
  • With-Labels
    • Correctness: whether the response matches a “golden” answer.
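These map onto the evaluator modules LlamaIndex shipped in this period (names and import paths shifted across versions, so treat this as a sketch; it assumes the `query_engine` built earlier, and the query/reference strings are placeholders):

```python
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
)

# GPT-4 as the grader
gpt4_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))

query = "What did the author do growing up?"
response = query_engine.query(query)

# Label-free: is the response grounded in the retrieved context?
print(FaithfulnessEvaluator(service_context=gpt4_context)
      .evaluate_response(response=response).passing)

# Label-free: does the response (and retrieved context) answer the query?
print(RelevancyEvaluator(service_context=gpt4_context)
      .evaluate_response(query=query, response=response).passing)

# With labels: does the response match a "golden" answer?
print(CorrectnessEvaluator(service_context=gpt4_context).evaluate(
    query=query,
    response=str(response),
    reference="He wrote short stories and programmed.",  # placeholder golden answer
).score)
```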

26 of 45

Techniques for Better Performing RAG

27 of 45

Decouple Embeddings from Raw Text Chunks

Raw text chunks can bias your embedding representation with filler content (Max Rumpf, sid.ai)

28 of 45

Small-to-Big Retrieval

Solutions:

  • Embed text at the sentence level, then expand to the surrounding window of sentences during LLM synthesis

29 of 45

Small-to-Big Retrieval

Solutions:

  • Embed text at the sentence level, then expand to the surrounding window of sentences during LLM synthesis

Sentence Window Retrieval (k=2)

Naive Retrieval (k=5)

Only one out of the 5 chunks is relevant - “lost in the middle” problem
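LlamaIndex packages this pattern as sentence-window retrieval; a sketch (window size, top-k, and metadata keys mirror the defaults used in the guides of this period, and import paths varied slightly between versions):

```python
from llama_index import VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import MetadataReplacementPostProcessor

# Parse into single sentences, stashing the surrounding sentences in metadata
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# Retrieve by sentence embedding, then swap in the wider window
# before the LLM synthesizes an answer
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
```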

30 of 45

Embed References to Text Chunks

Solutions:

  • Embed “references” to text chunks instead of the text chunks directly.
  • Examples
    • Smaller Chunks
    • Metadata
    • Summaries
  • Retrieve those references first, then the text chunks.
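One concrete version of this is small chunks pointing at their parent chunks via `IndexNode`, retrieved with a `RecursiveRetriever` (a sketch; chunk sizes are illustrative and `documents` comes from the ingestion step):

```python
from llama_index import VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import IndexNode
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Big chunks: what the LLM eventually sees
big_nodes = SimpleNodeParser.from_defaults(chunk_size=1024).get_nodes_from_documents(documents)

# Small chunks: what gets embedded; each references its parent chunk
small_parser = SimpleNodeParser.from_defaults(chunk_size=128)
small_nodes = []
for big_node in big_nodes:
    for sub in small_parser.get_nodes_from_documents([big_node]):
        small_nodes.append(IndexNode(text=sub.get_content(), index_id=big_node.node_id))

# Retrieve the references first, then resolve them to the underlying big chunks
vector_index = VectorStoreIndex(small_nodes)
retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_index.as_retriever(similarity_top_k=2)},
    node_dict={n.node_id: n for n in big_nodes},
)
query_engine = RetrieverQueryEngine.from_args(retriever)
```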

31 of 45

Organize your data for more structured retrieval

(Recursive Retrieval)

Summaries → documents

  • Embed larger documents via summaries. First retrieve documents by summaries, then retrieve chunks within those documents

32 of 45

Organize your data for more structured retrieval

(Metadata)

  • Metadata: context you can inject into each text chunk
  • Examples
    • Page number
    • Document title
    • Summary of adjacent chunks
    • Questions that chunk can answer (reverse HyDE)
  • Benefits
    • Can Help Retrieval
    • Can Augment Response Quality
    • Integrates with Vector DB Metadata Filters

Example of Metadata

Text Chunk: “We report the development of GPT-4, a large-scale, multimodal…”

Metadata: {"page_num": 1, "org": "OpenAI"}
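In LlamaIndex this is just the metadata dict attached to each node or document; a minimal sketch mirroring the example above:

```python
from llama_index.schema import TextNode

node = TextNode(
    text="We report the development of GPT-4, a large-scale, multimodal…",
    metadata={"page_num": 1, "org": "OpenAI"},
)
# By default the metadata is visible to both the embedding model and the LLM,
# so it can help retrieval and augment response quality.
```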

33 of 45

Organize your data for more structured retrieval

(Metadata Filters)

Question: “Can you tell me about Google’s R&D initiatives from 2020 to 2023?”

  • Dumping chunks to a single collection doesn’t work.

[Diagram: a single collection of all 10Q document chunks; a top-4 embedding lookup for the query returns chunks scattered across the 2020, 2021, and 2023 10Qs]

No guarantee you’ll return the relevant document chunks!

34 of 45

Organize your data for more structured retrieval

(Metadata Filters)

Question: “Can you tell me about Google’s R&D initiatives from 2020 to 2023?”

  • Here, we separate and tag the documents with metadata filters.
  • During query time, we can infer these metadata filters in addition to the semantic query.

[Diagram: documents separated into per-year 10Q collections (2020, 2021, 2022, 2023), each tagged with metadata; the query embedding plus the inferred metadata tags retrieve the relevant chunk from each year]
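A sketch of both pieces in LlamaIndex: manually specified metadata filters, and an auto-retriever that infers filters plus a semantic query from the question (metadata field names and descriptions are illustrative, and auto-retrieval assumes a vector store that supports metadata filters):

```python
from llama_index.vector_stores.types import (
    MetadataFilters,
    ExactMatchFilter,
    VectorStoreInfo,
    MetadataInfo,
)
from llama_index.retrievers import VectorIndexAutoRetriever

# Manual: restrict retrieval to the 2020 10Q before the top-k lookup
filters = MetadataFilters(filters=[ExactMatchFilter(key="year", value=2020)])
query_engine = index.as_query_engine(similarity_top_k=4, filters=filters)

# Automatic: let the LLM infer the year filter(s) plus the semantic query
vector_store_info = VectorStoreInfo(
    content_info="Google quarterly (10Q) filings",
    metadata_info=[
        MetadataInfo(name="year", type="int", description="Fiscal year of the filing"),
    ],
)
auto_retriever = VectorIndexAutoRetriever(index, vector_store_info=vector_store_info)
nodes = auto_retriever.retrieve(
    "Can you tell me about Google's R&D initiatives from 2020 to 2023?"
)
```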

35 of 45

Organize your data for more structured retrieval

(Recursive Retrieval)

  • Organize your data hierarchically
    • Summaries → documents
    • Documents → embedded objects (Tables/Graphs)

36 of 45

Organize your data for more structured retrieval

(Recursive Retrieval)

Summaries → documents

  • Embed larger documents via summaries. First retrieve documents by summaries, then retrieve chunks within those documents
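A sketch of the summaries → documents pattern using the document-summary index from this period (`DocumentSummaryIndex`; the summarizer LLM and response mode are illustrative choices):

```python
from llama_index import DocumentSummaryIndex, ServiceContext, get_response_synthesizer
from llama_index.llms import OpenAI

# Summarize each document at build time; at query time, match the query
# against the summaries first, then pull that document's chunks into the LLM
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo"))
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")
doc_summary_index = DocumentSummaryIndex.from_documents(
    documents,
    service_context=service_context,
    response_synthesizer=response_synthesizer,
)
query_engine = doc_summary_index.as_query_engine()
```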

37 of 45

Organize your data for more structured retrieval

(Recursive Retrieval)

Documents → Embedded Objects

  • If your PDF documents contain embedded objects (tables, graphs), first retrieve a reference to the object, then query the underlying object itself (see the sketch below).
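A sketch of the documents → embedded objects pattern: an `IndexNode` stands in for a table inside the vector index and points at a query engine over the underlying table (the table contents, the `table_1` ID, the `text_nodes` variable, and the choice of a pandas query engine are all illustrative):

```python
import pandas as pd
from llama_index import VectorStoreIndex
from llama_index.schema import IndexNode
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine, PandasQueryEngine

# Query engine over the raw table object (toy, made-up numbers)
df = pd.DataFrame({"year": [2020, 2021], "r_and_d_spend_billions": [10.0, 12.0]})
table_query_engine = PandasQueryEngine(df=df)

# Reference node describing the table, embedded alongside the regular text chunks
table_ref = IndexNode(text="Table of R&D spend by year (in $B).", index_id="table_1")
vector_index = VectorStoreIndex(text_nodes + [table_ref])  # text_nodes: regular chunks

retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_index.as_retriever(similarity_top_k=2)},
    query_engine_dict={"table_1": table_query_engine},
)
query_engine = RetrieverQueryEngine.from_args(retriever)
```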

38 of 45

Production RAG Guide

39 of 45

Fine-Tuning

40 of 45

Fine-tuning

You can choose to fine-tune the embeddings or the LLM

41 of 45

Fine-tuning (Embeddings)

Generate a synthetic query dataset from raw text chunks using LLMs

NOTE: Similar process to generating an evaluation dataset!

Credits: Jo Bergum, vespa.ai

42 of 45

Fine-tuning (Embeddings)

Use this synthetic dataset to finetune an embedding model.

  • Directly fine-tune a sentence_transformers model
  • Fine-tune a black-box adapter (a linear transform or any neural network) on top of the embeddings
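A sketch using the fine-tuning helpers LlamaIndex shipped at the time (`generate_qa_embedding_pairs` plus `SentenceTransformersFinetuneEngine`; the base model ID, output path, and the `train_nodes`/`val_nodes` split are illustrative). For the adapter variant there was an analogous `EmbeddingAdapterFinetuneEngine`.

```python
from llama_index.finetuning import (
    generate_qa_embedding_pairs,
    SentenceTransformersFinetuneEngine,
)
from llama_index.llms import OpenAI

# Synthetic (query, chunk) pairs from your corpus, as on the previous slide
train_dataset = generate_qa_embedding_pairs(train_nodes, llm=OpenAI(model="gpt-3.5-turbo"))
val_dataset = generate_qa_embedding_pairs(val_nodes, llm=OpenAI(model="gpt-3.5-turbo"))

# Directly fine-tune a sentence_transformers model on those pairs
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",
    model_output_path="finetuned_embeddings",
    val_dataset=val_dataset,
)
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()
```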

43 of 45

Fine-tuning a Black-box Adapter

44 of 45

Fine-tuning (LLMs)

Use OpenAI fine-tuning to distill GPT-4 outputs into gpt-3.5-turbo

  • Final response generation
  • Agent intermediate reasoning
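A sketch of the distillation flow with the OpenAI fine-tuning integrations from this period (`OpenAIFineTuningHandler` logs the GPT-4 prompt/completion pairs, `OpenAIFinetuneEngine` launches the job; the file name and temperature are illustrative):

```python
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.callbacks import OpenAIFineTuningHandler, CallbackManager
from llama_index.finetuning import OpenAIFinetuneEngine

# 1. Answer training questions with GPT-4 while logging prompts + completions
finetuning_handler = OpenAIFineTuningHandler()
gpt4_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-4", temperature=0.3),
    callback_manager=CallbackManager([finetuning_handler]),
)
# ... run your RAG query engine built with gpt4_context over a batch of questions ...
finetuning_handler.save_finetuning_events("finetuning_events.jsonl")

# 2. Distill the logged GPT-4 outputs into gpt-3.5-turbo
finetune_engine = OpenAIFinetuneEngine("gpt-3.5-turbo", "finetuning_events.jsonl")
finetune_engine.finetune()
ft_llm = finetune_engine.get_finetuned_model(temperature=0.3)
```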

45 of 45

Finetuning Abstractions in LlamaIndex