1 of 30

Building Production-Ready RAG Applications

Jerry Liu, LlamaIndex co-founder/CEO

2 of 30

GenAI - Enterprise Use-cases

Document Processing

Tagging & Extraction

Knowledge Search & QA

Conversational Agent

Workflow Automation

[Diagram: one example per use case. Conversational Agent: an agent/human chat transcript. Document Processing (Tagging & Extraction): a document with Topic, Summary, and Author fields. Knowledge Search & QA: a knowledge base returning an answer with sources. Workflow Automation: a workflow that reads the latest messages from user A's inbox and writes an email suggesting next steps.]

3 of 30

Paradigms for inserting knowledge

Retrieval Augmentation - Fix the model, put context into the prompt

LLM

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep...

Input Prompt

Here is the context:

Before college the two main things…

Given the context, answer the following question: {query_str}
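
In code, retrieval augmentation is just prompt construction at query time: the model stays fixed and the retrieved text is pasted into the prompt. A minimal sketch, assuming some retriever has already produced the context chunks (the template mirrors the prompt on this slide; `build_rag_prompt` is an illustrative helper, not a LlamaIndex API):

```python
# Minimal sketch: fix the model, inject retrieved context into the prompt.
PROMPT_TEMPLATE = (
    "Here is the context:\n"
    "{context_str}\n\n"
    "Given the context, answer the following question: {query_str}"
)

def build_rag_prompt(context_chunks: list[str], query_str: str) -> str:
    # Join the retrieved chunks and fill in the template.
    context_str = "\n\n".join(context_chunks)
    return PROMPT_TEMPLATE.format(context_str=context_str, query_str=query_str)
```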

4 of 30

Paradigms for inserting knowledge

Fine-tuning - baking knowledge into the weights of the network

LLM

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep...

RLHF, Adam, SGD, etc.
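
By contrast, fine-tuning pushes the text into the weights through gradient updates. A heavily simplified sketch using Hugging Face transformers with AdamW (the model name, learning rate, and loop length are illustrative, not from the talk):

```python
# Sketch: bake knowledge into the weights via a causal language-modeling loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative small model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

text = "Before college the two main things I worked on were writing and programming..."
batch = tokenizer(text, return_tensors="pt", truncation=True)

model.train()
for _ in range(3):  # in practice: many steps over the whole corpus
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```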

5 of 30

RAG Stack

6 of 30

Current RAG Stack for building a QA System

[Diagram: Data Ingestion: Doc → Chunks → Vector Database. Data Querying (Retrieval + Synthesis): Vector Database → retrieved Chunks → LLM]

5 Lines of Code in LlamaIndex!
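
A sketch of that quickstart (import paths differ across LlamaIndex versions; this assumes a local `data/` folder of documents and an LLM API key in the environment):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Data ingestion: load, chunk, embed, and store documents in a vector index.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Data querying: retrieval + synthesis behind a single query engine call.
query_engine = index.as_query_engine()
print(query_engine.query("What did the author do growing up?"))
```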

7 of 30

Challenges with “Naive” RAG

8 of 30

Challenges with Naive RAG (Response Quality)

  • Bad Retrieval
    • Low Precision: Not all chunks in retrieved set are relevant
      • Hallucination + Lost in the Middle Problems
    • Low Recall: Not all relevant chunks are retrieved.
      • Lacks enough context for LLM to synthesize an answer
    • Outdated information: The data is redundant or out of date.

9 of 30

Challenges with Naive RAG (Response Quality)

  • Bad Retrieval
    • Low Precision: Not all chunks in retrieved set are relevant
      • Hallucination + Lost in the Middle Problems
    • Low Recall: Not all relevant chunks are retrieved.
      • Lacks enough context for LLM to synthesize an answer
    • Outdated information: The data is redundant or out of date.
  • Bad Response Generation
    • Hallucination: Model makes up an answer that isn’t in the context.
    • Irrelevance: Model makes up an answer that doesn’t answer the question.
    • Toxicity/Bias: Model makes up an answer that’s harmful/offensive.

10 of 30

What do we do?

  • Data: Can we store additional information beyond raw text chunks?
  • Embeddings: Can we optimize our embedding representations?
  • Retrieval: Can we do better than top-k embedding lookup?
  • Synthesis: Can we use LLMs for more than generation?

[Diagram: the RAG pipeline (Doc → Chunks → Vector Database → LLM), annotated with the four levers: Data, Embeddings, Retrieval, Synthesis]

11 of 30

What do we do?

  • Data: Can we store additional information beyond raw text chunks?
  • Embeddings: Can we optimize our embedding representations?
  • Retrieval: Can we do better than top-k embedding lookup?
  • Synthesis: Can we use LLMs for more than generation?

But before all this…

We need a way to measure performance

12 of 30

Evaluation

13 of 30

Evaluation

  • How do we properly evaluate a RAG system?
    • Evaluate in isolation (retrieval, synthesis)
    • Evaluate e2e

[Diagram: Vector Database → retrieved Chunks (Retrieval) → LLM (Synthesis)]

14 of 30

Evaluation in Isolation (Retrieval)

  • Evaluate quality of retrieved chunks given user query
  • Create dataset
    • Input: query
    • Output: the “ground-truth” documents relevant to the query
  • Run retriever over dataset
  • Measure ranking metrics
    • Success rate / hit-rate
    • MRR
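
With (query, ground-truth document id) pairs in hand, hit-rate and MRR are a few lines of plain Python; a minimal sketch (the example dict keys are assumptions):

```python
# Sketch: hit-rate and MRR over a retrieval eval dataset.
# Each example holds the retriever's ranked ids and the ground-truth relevant ids.
def hit(retrieved: list[str], relevant: set[str]) -> float:
    return 1.0 if any(doc_id in relevant for doc_id in retrieved) else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate_retrieval(examples: list[dict]) -> dict:
    hits = [hit(ex["retrieved_ids"], set(ex["relevant_ids"])) for ex in examples]
    rrs = [reciprocal_rank(ex["retrieved_ids"], set(ex["relevant_ids"])) for ex in examples]
    return {"hit_rate": sum(hits) / len(hits), "mrr": sum(rrs) / len(rrs)}
```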

15 of 30

Evaluation E2E

  • Evaluation of final generated response given input
  • Create Dataset
    • Input: query
    • [Optional] Output: the “ground-truth” answer
  • Run through full RAG pipeline
  • Collect evaluation metrics:
    • If no labels: label-free evals
    • If labels: with-label evals
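
For the label-free case, one common approach is LLM-as-judge: ask a strong LLM whether the response is grounded in the retrieved context and actually answers the query. A rough sketch (the judge prompt and the `complete` callable are assumptions, not a specific LlamaIndex evaluator):

```python
# Sketch: label-free E2E evaluation with an LLM judge.
JUDGE_PROMPT = (
    "Query: {query}\n"
    "Retrieved context: {context}\n"
    "Response: {response}\n\n"
    "Answer YES if the response is supported by the context and answers the query, "
    "otherwise answer NO."
)

def judge(complete, query: str, context: str, response: str) -> bool:
    # `complete` is any callable that sends a prompt to a strong LLM and returns text.
    verdict = complete(JUDGE_PROMPT.format(query=query, context=context, response=response))
    return verdict.strip().upper().startswith("YES")
```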

16 of 30

Optimizing RAG Systems

17 of 30

From Simple to Advanced RAG

Simple → Advanced: less expressive, easier to implement, lower latency/cost → more expressive, harder to implement, higher latency/cost

  • Table Stakes 🛠️: Better Parsers, Chunk Sizes, Hybrid Search, Metadata Filters
  • Advanced Retrieval 🔎: Reranking, Recursive Retrieval, Embedded Tables, Small-to-big Retrieval
  • Agentic Behavior 🤖: Routing, Query Planning, Multi-document Agents
  • Fine-tuning ⚙️: Embedding fine-tuning, LLM fine-tuning

18 of 30

Table Stakes: Chunk Sizes

Tuning your chunk size can have outsized impacts on performance

Not obvious that more retrieved tokens == higher performance!

Note: Reranking (reordering the retrieved context) isn’t always beneficial.
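
One way to act on this: sweep chunk sizes and re-run the eval dataset for each setting. A sketch using LlamaIndex's ServiceContext (API names are from the 0.8/0.9-era library and vary by version):

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()

# Re-index at several chunk sizes and compare eval metrics for each.
for chunk_size in (128, 256, 512, 1024):
    service_context = ServiceContext.from_defaults(chunk_size=chunk_size)
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    query_engine = index.as_query_engine(similarity_top_k=2)
    # ...run the retrieval / E2E evals from earlier and record metrics per chunk_size...
```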

19 of 30

Table Stakes: Metadata Filtering

  • Metadata: context you can inject into each text chunk
  • Examples
    • Page number
    • Document title
    • Summary of adjacent chunks
    • Questions that chunk can answer (reverse HyDE)
  • Benefits
    • Can Help Retrieval
    • Can Augment Response Quality
    • Integrates with Vector DB Metadata Filters

Example of Metadata

Text Chunk: “We report the development of GPT-4, a large-scale, multimodal…”

Metadata: {“page_num”: 1, “org”: “OpenAI”}
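
In LlamaIndex, this kind of metadata can be attached directly to a document or node and is stored alongside the text and its embedding; a short sketch (import path varies by version):

```python
from llama_index import Document

# Attach structured metadata to a chunk of text; it travels with the node
# into the vector store and can be used for filtering and in responses.
doc = Document(
    text="We report the development of GPT-4, a large-scale, multimodal…",
    metadata={"page_num": 1, "org": "OpenAI"},
)
```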

20 of 30

Table Stakes: Metadata Filtering

Question: “Can you tell me the risk factors in 2021?”

Raw semantic search has low precision.

[Diagram: query_str is embedded and matched against a single collection of all 10Q document chunks; the top-4 results mix 2020, 2019, and 2021 chunks. No guarantee you’ll return the relevant document chunks!]

21 of 30

Table Stakes: Metadata Filtering

Question: “Can you tell me the risk factors in 2021?”

If we can infer the metadata filters (year=2021), we remove irrelevant candidates, increasing precision!

[Diagram: the query embedding plus the inferred metadata tag {“year”: 2021} restricts search to the 2021 10Q (out of the 2020-2023 10Qs), returning the relevant 2021 10Q chunk]
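
At query time the inferred filter can be passed straight to the vector store. A sketch with LlamaIndex's metadata filter types, reusing the `index` from the earlier quickstart sketch (class names and import paths vary by version; inferring year=2021 from the question, e.g. with an auto-retriever, is omitted):

```python
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

# Restrict candidates to chunks tagged {"year": 2021} before top-k semantic search.
filters = MetadataFilters(filters=[ExactMatchFilter(key="year", value=2021)])
query_engine = index.as_query_engine(similarity_top_k=4, filters=filters)
response = query_engine.query("Can you tell me the risk factors in 2021?")
```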

22 of 30

Advanced Retrieval: Small-to-Big

Intuition: Embedding a big text chunk feels suboptimal.

Solution: Embed text at the sentence level, then expand to the surrounding window of sentences during LLM synthesis
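
LlamaIndex ships a sentence-window flavor of this: parse into single sentences, keep the surrounding window in each node's metadata, and swap the window back in at synthesis time. A sketch (class names from ~0.8/0.9-era LlamaIndex; check your version's import paths):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import MetadataReplacementPostProcessor

# Embed individual sentences, storing a window of surrounding sentences as metadata.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3, window_metadata_key="window"
)
documents = SimpleDirectoryReader("data").load_data()
nodes = node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# At query time, replace each retrieved sentence with its window before synthesis.
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
```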

23 of 30

Advanced Retrieval: Small-to-Big

Leads to more precise retrieval.

Avoids “lost in the middle” problems.

[Comparison: Sentence Window Retrieval (k=2) vs. Naive Retrieval (k=5). With naive retrieval, only one of the 5 retrieved chunks is relevant, a “lost in the middle” problem.]

24 of 30

Advanced Retrieval: Small-to-Big

Intuition: Embedding a big text chunk feels suboptimal.

Solution: Embed a smaller reference to the parent chunk. Use the parent chunk for synthesis

Examples: Smaller chunks, summaries, metadata
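
The same idea generalizes: embed something small (a child chunk, a summary, a generated question) that points back to its parent chunk, then hand the parent to the LLM. A library-agnostic sketch (the `split_into_small_chunks`, `embed`, and `vector_search` helpers are assumptions standing in for your splitter, embedding model, and vector DB):

```python
# Sketch: index small child chunks that reference a larger parent chunk.
parent_chunks = {"p1": "…a long parent chunk of text…"}
child_to_parent = {}
child_vectors = {}

for parent_id, parent_text in parent_chunks.items():
    for i, child in enumerate(split_into_small_chunks(parent_text)):  # assumed helper
        child_id = f"{parent_id}-c{i}"
        child_to_parent[child_id] = parent_id
        child_vectors[child_id] = embed(child)  # assumed helper

def retrieve_parents(query: str, top_k: int = 2) -> list[str]:
    # Search over the small child embeddings, but return the parent chunks
    # so the LLM synthesizes over richer context.
    child_ids = vector_search(embed(query), child_vectors, top_k)  # assumed helper
    parent_ids = {child_to_parent[cid] for cid in child_ids}
    return [parent_chunks[pid] for pid in parent_ids]
```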

25 of 30

Agentic Behavior: Multi-Document Agents

Intuition: There are certain questions that “top-k” RAG can’t answer.

Solution: Multi-Document Agents

  • Fact-based QA and summarization over any subset of documents
  • Chain-of-thought and query planning.
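
One way to set this up in LlamaIndex is to give an agent one query-engine tool per document, so it can plan which documents to hit and in what order. A sketch (tool/agent class names from ~0.8/0.9-era LlamaIndex; the document folder names are illustrative):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.agent import OpenAIAgent
from llama_index.tools import QueryEngineTool, ToolMetadata

# One vector query engine (wrapped as a tool) per document.
tools = []
for name in ("filing_q1", "filing_q2"):  # illustrative document names
    docs = SimpleDirectoryReader(f"data/{name}").load_data()
    engine = VectorStoreIndex.from_documents(docs).as_query_engine()
    tools.append(
        QueryEngineTool(
            query_engine=engine,
            metadata=ToolMetadata(name=name, description=f"Answers questions about {name}"),
        )
    )

# The agent does chain-of-thought + query planning over the tools.
agent = OpenAIAgent.from_tools(tools, verbose=True)
print(agent.chat("Compare the risk factors across the two filings."))
```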

26 of 30

Fine-Tuning: Embeddings

Intuition: Embedding Representations are not optimized over your dataset

Solution: Generate a synthetic query dataset from raw text chunks using LLMs

Use this synthetic dataset to finetune an embedding model.

Credits: Jo Bergum, vespa.ai
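
A generic version of this recipe uses sentence-transformers: pair each chunk with an LLM-generated question about it, then train with a contrastive loss. A sketch under those assumptions (not the exact LlamaIndex finetuning abstraction; the synthetic (question, chunk) pairs are assumed to already exist):

```python
# Sketch: fine-tune an embedding model on synthetic (query, chunk) pairs.
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# pairs = [(llm_generated_question, source_chunk), ...] built offline with an LLM.
pairs = [("What are the 2021 risk factors?", "…2021 10Q risk factor text…")]
examples = [InputExample(texts=[q, chunk]) for q, chunk in pairs]

model = SentenceTransformer("BAAI/bge-small-en")  # illustrative base model
loader = DataLoader(examples, batch_size=16, shuffle=True)
loss = losses.MultipleNegativesRankingLoss(model)

# In-batch chunks act as negatives; each query is pulled toward its own chunk.
model.fit(train_objectives=[(loader, loss)], epochs=2)
model.save("finetuned-embeddings")
```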

27 of 30

Fine-Tuning: LLMs

Intuition: Weaker LLMs can fall short at response synthesis, reasoning, structured outputs, etc.

Solution: Generate a synthetic dataset from raw chunks (e.g. using GPT-4) and fine-tune a weaker LLM on it to help fix all of the above.
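
A common concrete version: have GPT-4 answer questions over your chunks, log the (context, question, answer) triples, and fine-tune the weaker model on them, e.g. in the OpenAI chat fine-tuning JSONL format. A sketch of the dataset-writing step (the triples are assumed to be collected already):

```python
# Sketch: turn GPT-4-generated RAG traces into a fine-tuning dataset (OpenAI JSONL format).
import json

# triples = [(retrieved_context, question, gpt4_answer), ...] collected offline.
triples = [("…context…", "What were the 2021 risk factors?", "…GPT-4 answer…")]

with open("finetune_dataset.jsonl", "w") as f:
    for context, question, answer in triples:
        record = {
            "messages": [
                {"role": "system", "content": "Answer using only the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
                {"role": "assistant", "content": answer},
            ]
        }
        f.write(json.dumps(record) + "\n")
```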

28 of 30

Resources

29 of 30

Thanks!

We’ll make talk + workshop slides publicly available.

30 of 30

Finetuning Abstractions in LlamaIndex