1 of 30

Building Production-Ready RAG Applications

Jerry Liu, LlamaIndex co-founder/CEO

2 of 30

GenAI - Enterprise Use-cases

Document Processing

Tagging & Extraction

Knowledge Search & QA

Conversational Agent

Workflow Automation

[Diagram: one example per use case. Conversational Agent: an agent/human chat transcript. Document Processing (Tagging & Extraction): a document with Topic, Summary, and Author fields. Knowledge Search & QA: a knowledge base returning an answer with sources. Workflow Automation: a workflow that reads the latest messages from user A's inbox and writes an email suggesting next steps.]

3 of 30

Paradigms for inserting knowledge

Retrieval Augmentation - Fix the model, put context into the prompt

LLM

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep...

Input Prompt

Here is the context:

Before college the two main things…

Given the context, answer the following question: {query_str}
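
In code, retrieval augmentation is just prompt construction at query time: the model stays fixed and the retrieved text is pasted into the prompt. A minimal sketch, assuming some retriever has already produced the context chunks (the template mirrors the prompt on this slide; `build_rag_prompt` is an illustrative helper, not a LlamaIndex API):

```python
# Minimal sketch: fix the model, inject retrieved context into the prompt.
PROMPT_TEMPLATE = (
    "Here is the context:\n"
    "{context_str}\n\n"
    "Given the context, answer the following question: {query_str}"
)

def build_rag_prompt(context_chunks: list[str], query_str: str) -> str:
    # Join the retrieved chunks and fill in the template.
    context_str = "\n\n".join(context_chunks)
    return PROMPT_TEMPLATE.format(context_str=context_str, query_str=query_str)
```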

4 of 30

Paradigms for inserting knowledge

Fine-tuning - baking knowledge into the weights of the network

LLM

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep...

RLHF, Adam, SGD, etc.
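
By contrast, fine-tuning pushes the text into the weights through gradient updates. A heavily simplified sketch using Hugging Face transformers with AdamW (the model name, learning rate, and loop length are illustrative, not from the talk):

```python
# Sketch: bake knowledge into the weights via a causal language-modeling loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative small model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

text = "Before college the two main things I worked on were writing and programming..."
batch = tokenizer(text, return_tensors="pt", truncation=True)

model.train()
for _ in range(3):  # in practice: many steps over the whole corpus
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```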

5 of 30

RAG Stack

6 of 30

Current RAG Stack for building a QA System

[Diagram: Data Ingestion: Doc → Chunks → Vector Database. Data Querying (Retrieval + Synthesis): Vector Database → retrieved Chunks → LLM]

5 Lines of Code in LlamaIndex!
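
A sketch of that quickstart (import paths differ across LlamaIndex versions; this assumes a local `data/` folder of documents and an LLM API key in the environment):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Data ingestion: load, chunk, embed, and store documents in a vector index.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Data querying: retrieval + synthesis behind a single query engine call.
query_engine = index.as_query_engine()
print(query_engine.query("What did the author do growing up?"))
```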

7 of 30

Challenges with “Naive” RAG

8 of 30

Challenges with Naive RAG (Response Quality)

  • Bad Retrieval
    • Low Precision: Not all chunks in retrieved set are relevant
      • Hallucination + Lost in the Middle Problems
    • Low Recall: Not all relevant chunks are retrieved.
      • Lacks enough context for LLM to synthesize an answer
    • Outdated information: The data is redundant or out of date.

9 of 30

Challenges with Naive RAG (Response Quality)

  • Bad Retrieval
    • Low Precision: Not all chunks in retrieved set are relevant
      • Hallucination + Lost in the Middle Problems
    • Low Recall: Not all relevant chunks are retrieved.
      • Lacks enough context for LLM to synthesize an answer
    • Outdated information: The data is redundant or out of date.
  • Bad Response Generation
    • Hallucination: Model makes up an answer that isn’t in the context.
    • Irrelevance: Model makes up an answer that doesn’t answer the question.
    • Toxicity/Bias: Model makes up an answer that’s harmful/offensive.

10 of 30

What do we do?

  • Data: Can we store additional information beyond raw text chunks?
  • Embeddings: Can we optimize our embedding representations?
  • Retrieval: Can we do better than top-k embedding lookup?
  • Synthesis: Can we use LLMs for more than generation?

[Diagram: the RAG pipeline (Doc → Chunks → Vector Database → LLM), annotated with the four levers: Data, Embeddings, Retrieval, Synthesis]

11 of 30

What do we do?

  • Data: Can we store additional information beyond raw text chunks?
  • Embeddings: Can we optimize our embedding representations?
  • Retrieval: Can we do better than top-k embedding lookup?
  • Synthesis: Can we use LLMs for more than generation?

But before all this…

We need a way to measure performance

12 of 30

Evaluation

13 of 30

Evaluation

  • How do we properly evaluate a RAG system?
    • Evaluate in isolation (retrieval, synthesis)
    • Evaluate e2e

[Diagram: Vector Database → retrieved Chunks (Retrieval) → LLM (Synthesis)]

14 of 30

Evaluation in Isolation (Retrieval)

  • Evaluate quality of retrieved chunks given user query
  • Create dataset
    • Input: query
    • Output: the “ground-truth” documents relevant to the query
  • Run retriever over dataset
  • Measure ranking metrics
    • Success rate / hit-rate
    • MRR
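
With (query, ground-truth document id) pairs in hand, hit-rate and MRR are a few lines of plain Python; a minimal sketch (the example dict keys are assumptions):

```python
# Sketch: hit-rate and MRR over a retrieval eval dataset.
# Each example holds the retriever's ranked ids and the ground-truth relevant ids.
def hit(retrieved: list[str], relevant: set[str]) -> float:
    return 1.0 if any(doc_id in relevant for doc_id in retrieved) else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate_retrieval(examples: list[dict]) -> dict:
    hits = [hit(ex["retrieved_ids"], set(ex["relevant_ids"])) for ex in examples]
    rrs = [reciprocal_rank(ex["retrieved_ids"], set(ex["relevant_ids"])) for ex in examples]
    return {"hit_rate": sum(hits) / len(hits), "mrr": sum(rrs) / len(rrs)}
```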

15 of 30

Evaluation E2E

  • Evaluation of final generated response given input
  • Create Dataset
    • Input: query
    • [Optional] Output: the “ground-truth” answer
  • Run through full RAG pipeline
  • Collect evaluation metrics:
    • If no labels: label-free evals
    • If labels: with-label evals
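
For the label-free case, one common approach is LLM-as-judge: ask a strong LLM whether the response is grounded in the retrieved context and actually answers the query. A rough sketch (the judge prompt and the `complete` callable are assumptions, not a specific LlamaIndex evaluator):

```python
# Sketch: label-free E2E evaluation with an LLM judge.
JUDGE_PROMPT = (
    "Query: {query}\n"
    "Retrieved context: {context}\n"
    "Response: {response}\n\n"
    "Answer YES if the response is supported by the context and answers the query, "
    "otherwise answer NO."
)

def judge(complete, query: str, context: str, response: str) -> bool:
    # `complete` is any callable that sends a prompt to a strong LLM and returns text.
    verdict = complete(JUDGE_PROMPT.format(query=query, context=context, response=response))
    return verdict.strip().upper().startswith("YES")
```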

16 of 30

Optimizing RAG Systems

17 of 30

From Simple to Advanced RAG

Simple → Advanced: less expressive, easier to implement, lower latency/cost → more expressive, harder to implement, higher latency/cost

  • Table Stakes 🛠️: Better Parsers, Chunk Sizes, Hybrid Search, Metadata Filters
  • Advanced Retrieval 🔎: Reranking, Recursive Retrieval, Embedded Tables, Small-to-big Retrieval
  • Agentic Behavior 🤖: Routing, Query Planning, Multi-document Agents
  • Fine-tuning ⚙️: Embedding fine-tuning, LLM fine-tuning

18 of 30

Table Stakes: Chunk Sizes

Tuning your chunk size can have outsized impacts on performance

Not obvious that more retrieved tokens == higher performance!

Note: Reranking (reordering the retrieved context) isn’t always beneficial.
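
One way to act on this: sweep chunk sizes and re-run the eval dataset for each setting. A sketch using LlamaIndex's ServiceContext (API names are from the 0.8/0.9-era library and vary by version):

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()

# Re-index at several chunk sizes and compare eval metrics for each.
for chunk_size in (128, 256, 512, 1024):
    service_context = ServiceContext.from_defaults(chunk_size=chunk_size)
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    query_engine = index.as_query_engine(similarity_top_k=2)
    # ...run the retrieval / E2E evals from earlier and record metrics per chunk_size...
```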

19 of 30

Table Stakes: Metadata Filtering

  • Metadata: context you can inject into each text chunk
  • Examples
    • Page number
    • Document title
    • Summary of adjacent chunks
    • Questions that chunk can answer (reverse HyDE)
  • Benefits
    • Can Help Retrieval
    • Can Augment Response Quality
    • Integrates with Vector DB Metadata Filters

Example of Metadata

Text Chunk: “We report the development of GPT-4, a large-scale, multimodal…”

Metadata: {“page_num”: 1, “org”: “OpenAI”}
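
In LlamaIndex, this kind of metadata can be attached directly to a document or node and is stored alongside the text and its embedding; a short sketch (import path varies by version):

```python
from llama_index import Document

# Attach structured metadata to a chunk of text; it travels with the node
# into the vector store and can be used for filtering and in responses.
doc = Document(
    text="We report the development of GPT-4, a large-scale, multimodal…",
    metadata={"page_num": 1, "org": "OpenAI"},
)
```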

20 of 30

Table Stakes: Metadata Filtering

Question: “Can you tell me the risk factors in 2021?”

Raw semantic search has low precision.

[Diagram: query_str is embedded and matched against a single collection of all 10Q document chunks; the top-4 results mix 2020, 2019, and 2021 chunks. No guarantee you’ll return the relevant document chunks!]

21 of 30

Table Stakes: Metadata Filtering

Question: “Can you tell me the risk factors in 2021?”

If we can infer the metadata filters (year=2021), we remove irrelevant candidates, increasing precision!

[Diagram: the query embedding plus the inferred metadata tag {“year”: 2021} restricts search to the 2021 10Q (out of the 2020-2023 10Qs), returning the relevant 2021 10Q chunk]
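
At query time the inferred filter can be passed straight to the vector store. A sketch with LlamaIndex's metadata filter types, reusing the `index` from the earlier quickstart sketch (class names and import paths vary by version; inferring year=2021 from the question, e.g. with an auto-retriever, is omitted):

```python
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

# Restrict candidates to chunks tagged {"year": 2021} before top-k semantic search.
filters = MetadataFilters(filters=[ExactMatchFilter(key="year", value=2021)])
query_engine = index.as_query_engine(similarity_top_k=4, filters=filters)
response = query_engine.query("Can you tell me the risk factors in 2021?")
```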

22 of 30

Advanced Retrieval: Small-to-Big

Intuition: Embedding a big text chunk feels suboptimal.

Solution: Embed text at the sentence level, then expand to the surrounding window of sentences during LLM synthesis
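
LlamaIndex ships a sentence-window flavor of this: parse into single sentences, keep the surrounding window in each node's metadata, and swap the window back in at synthesis time. A sketch (class names from ~0.8/0.9-era LlamaIndex; check your version's import paths):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import MetadataReplacementPostProcessor

# Embed individual sentences, storing a window of surrounding sentences as metadata.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3, window_metadata_key="window"
)
documents = SimpleDirectoryReader("data").load_data()
nodes = node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# At query time, replace each retrieved sentence with its window before synthesis.
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
```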

23 of 30

Advanced Retrieval: Small-to-Big

Leads to more precise retrieval.

Avoids “lost in the middle” problems.

[Comparison: Sentence Window Retrieval (k=2) vs. Naive Retrieval (k=5). With naive retrieval, only one of the 5 retrieved chunks is relevant, a “lost in the middle” problem.]

24 of 30

Advanced Retrieval: Small-to-Big

Intuition: Embedding a big text chunk feels suboptimal.

Solution: Embed a smaller reference to the parent chunk. Use the parent chunk for synthesis

Examples: Smaller chunks, summaries, metadata
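
The same idea generalizes: embed something small (a child chunk, a summary, a generated question) that points back to its parent chunk, then hand the parent to the LLM. A library-agnostic sketch (the `split_into_small_chunks`, `embed`, and `vector_search` helpers are assumptions standing in for your splitter, embedding model, and vector DB):

```python
# Sketch: index small child chunks that reference a larger parent chunk.
parent_chunks = {"p1": "…a long parent chunk of text…"}
child_to_parent = {}
child_vectors = {}

for parent_id, parent_text in parent_chunks.items():
    for i, child in enumerate(split_into_small_chunks(parent_text)):  # assumed helper
        child_id = f"{parent_id}-c{i}"
        child_to_parent[child_id] = parent_id
        child_vectors[child_id] = embed(child)  # assumed helper

def retrieve_parents(query: str, top_k: int = 2) -> list[str]:
    # Search over the small child embeddings, but return the parent chunks
    # so the LLM synthesizes over richer context.
    child_ids = vector_search(embed(query), child_vectors, top_k)  # assumed helper
    parent_ids = {child_to_parent[cid] for cid in child_ids}
    return [parent_chunks[pid] for pid in parent_ids]
```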

25 of 30

Agentic Behavior: Multi-Document Agents

Intuition: There are certain questions that “top-k” RAG can’t answer.

Solution: Multi-Document Agents

  • Fact-based QA and summarization over any subset of documents
  • Chain-of-thought and query planning.
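
One way to set this up in LlamaIndex is to give an agent one query-engine tool per document, so it can plan which documents to hit and in what order. A sketch (tool/agent class names from ~0.8/0.9-era LlamaIndex; the document folder names are illustrative):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.agent import OpenAIAgent
from llama_index.tools import QueryEngineTool, ToolMetadata

# One vector query engine (wrapped as a tool) per document.
tools = []
for name in ("filing_q1", "filing_q2"):  # illustrative document names
    docs = SimpleDirectoryReader(f"data/{name}").load_data()
    engine = VectorStoreIndex.from_documents(docs).as_query_engine()
    tools.append(
        QueryEngineTool(
            query_engine=engine,
            metadata=ToolMetadata(name=name, description=f"Answers questions about {name}"),
        )
    )

# The agent does chain-of-thought + query planning over the tools.
agent = OpenAIAgent.from_tools(tools, verbose=True)
print(agent.chat("Compare the risk factors across the two filings."))
```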

26 of 30

Fine-Tuning: Embeddings

Intuition: Embedding Representations are not optimized over your dataset

Solution: Generate a synthetic query dataset from raw text chunks using LLMs

Use this synthetic dataset to finetune an embedding model.

Credits: Jo Bergum, vespa.ai
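
A generic version of this recipe uses sentence-transformers: pair each chunk with an LLM-generated question about it, then train with a contrastive loss. A sketch under those assumptions (not the exact LlamaIndex finetuning abstraction; the synthetic (question, chunk) pairs are assumed to already exist):

```python
# Sketch: fine-tune an embedding model on synthetic (query, chunk) pairs.
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# pairs = [(llm_generated_question, source_chunk), ...] built offline with an LLM.
pairs = [("What are the 2021 risk factors?", "…2021 10Q risk factor text…")]
examples = [InputExample(texts=[q, chunk]) for q, chunk in pairs]

model = SentenceTransformer("BAAI/bge-small-en")  # illustrative base model
loader = DataLoader(examples, batch_size=16, shuffle=True)
loss = losses.MultipleNegativesRankingLoss(model)

# In-batch chunks act as negatives; each query is pulled toward its own chunk.
model.fit(train_objectives=[(loader, loss)], epochs=2)
model.save("finetuned-embeddings")
```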

27 of 30

Fine-Tuning: LLMs

Intuition: Weaker LLMs can fall short at response synthesis, reasoning, structured outputs, etc.

Solution: Generate a synthetic dataset from raw chunks (e.g. using GPT-4) and fine-tune a weaker LLM on it to help fix all of the above.
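
A common concrete version: have GPT-4 answer questions over your chunks, log the (context, question, answer) triples, and fine-tune the weaker model on them, e.g. in the OpenAI chat fine-tuning JSONL format. A sketch of the dataset-writing step (the triples are assumed to be collected already):

```python
# Sketch: turn GPT-4-generated RAG traces into a fine-tuning dataset (OpenAI JSONL format).
import json

# triples = [(retrieved_context, question, gpt4_answer), ...] collected offline.
triples = [("…context…", "What were the 2021 risk factors?", "…GPT-4 answer…")]

with open("finetune_dataset.jsonl", "w") as f:
    for context, question, answer in triples:
        record = {
            "messages": [
                {"role": "system", "content": "Answer using only the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
                {"role": "assistant", "content": answer},
            ]
        }
        f.write(json.dumps(record) + "\n")
```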

28 of 30

Resources

29 of 30

Thanks!

We’ll make talk + workshop slides publicly available.

30 of 30

Finetuning Abstractions in LlamaIndex