Building Production-Ready RAG Applications
Jerry Liu, LlamaIndex co-founder/CEO
GenAI - Enterprise Use-cases
Document Processing
Tagging & Extraction
Knowledge Search & QA
Conversational Agent
Workflow Automation
[Diagram] Example per use-case:
Conversational Agent: Agent ↔ Human dialogue (Agent: … / Human: … / Agent: …)
Tagging & Extraction: Document → Topic / Summary / Author
Knowledge Search & QA: Knowledge Base → Answer + Sources
Workflow Automation: Inbox with read/write actions
Paradigms for inserting knowledge
Retrieval Augmentation - Fix the model, put context into the prompt
LLM
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep...
Input Prompt
Here is the context:
Before college the two main things…
Given the context, answer the following question: {query_str}
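The context-stuffing paradigm above amounts to a prompt template: retrieved text goes into the prompt, the model stays fixed. A minimal sketch (names like `build_prompt` are illustrative, not a specific library's API):

```python
# Toy sketch of retrieval augmentation: fix the model, put retrieved
# context into the input prompt. All names here are illustrative.
PROMPT_TEMPLATE = (
    "Here is the context:\n"
    "{context_str}\n\n"
    "Given the context, answer the following question: {query_str}"
)

def build_prompt(context_str: str, query_str: str) -> str:
    """Stuff retrieved context into the input prompt."""
    return PROMPT_TEMPLATE.format(context_str=context_str, query_str=query_str)

prompt = build_prompt(
    context_str="Before college I worked on writing and programming.",
    query_str="What did the author work on before college?",
)
```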
Paradigms for inserting knowledge
Fine-tuning - baking knowledge into the weights of the network
LLM
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep...
RLHF, Adam, SGD, etc.
RAG Stack
Current RAG Stack for building a QA System
[Diagram] Data Ingestion: Doc → Chunks → Vector Database
Data Querying (Retrieval + Synthesis): Vector Database → LLM
5 Lines of Code in LlamaIndex!
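The sketch below mirrors that ingestion + querying flow in plain Python with a toy bag-of-words "embedding", so each stage of the stack is visible. It is purely illustrative, not the LlamaIndex API:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts (a stand-in for a real model).
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Data ingestion: split the doc into chunks, index their embeddings.
doc = ("Before college I worked on writing and programming. "
       "I wrote short stories. "
       "I programmed on an IBM 1401.")
chunks = doc.split(". ")
index = [(chunk, embed(chunk)) for chunk in chunks]

# Data querying: retrieve top-k chunks by similarity, then hand them to
# an LLM for synthesis (the LLM call itself is elided here).
def retrieve(query: str, k: int = 2) -> list:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

context = "\n".join(retrieve("What did the author work on before college?"))
```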
Challenges with “Naive” RAG
Challenges with Naive RAG (Response Quality)
What do we do?
[Diagram] Each stage of the stack can be optimized:
Data → Embeddings → Retrieval → Synthesis
(Doc → Chunks → Vector Database → LLM)
What do we do?
But before all this…
We need a way to measure performance
Evaluation
[Diagram] Vector Database → Chunks (Retrieval) → LLM (Synthesis)
Evaluation in Isolation (Retrieval)
Evaluation E2E
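Retrieval can be evaluated in isolation with rank metrics such as hit rate and MRR over a labeled set of (query, gold chunk) pairs. A minimal sketch (metric names are standard; the data is toy):

```python
def hit_rate(results, expected_ids):
    """Fraction of queries whose gold chunk appears anywhere in the
    retrieved list."""
    hits = sum(1 for retrieved, expected in zip(results, expected_ids)
               if expected in retrieved)
    return hits / len(expected_ids)

def mrr(results, expected_ids):
    """Mean reciprocal rank: 1/rank of the gold chunk, averaged over
    queries (0 when the gold chunk is not retrieved)."""
    total = 0.0
    for retrieved, expected in zip(results, expected_ids):
        if expected in retrieved:
            total += 1.0 / (retrieved.index(expected) + 1)
    return total / len(expected_ids)

# Two eval queries: ranked chunk ids retrieved vs. the gold chunk id.
retrieved = [["c3", "c1", "c7"], ["c2", "c9", "c4"]]
gold = ["c1", "c5"]
print(hit_rate(retrieved, gold))  # 0.5: only the first query hits
print(mrr(retrieved, gold))       # 0.25: gold at rank 2 -> 1/2, averaged
```

End-to-end evaluation instead scores the final LLM answer (e.g. against a reference answer), which folds retrieval and synthesis quality together.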
Optimizing RAG Systems
From Simple to Advanced RAG
Simple: less expressive, easier to implement, lower latency/cost
Advanced: more expressive, harder to implement, higher latency/cost
Table Stakes
Better Parsers
Chunk Sizes
Hybrid Search
Metadata Filters
🛠️
Advanced Retrieval
Reranking
Recursive Retrieval
Embedded Tables
Small-to-big Retrieval
🔎
Agentic Behavior
Routing
Query Planning
Multi-document Agents
🤖
Fine-tuning
Embedding fine-tuning
LLM fine-tuning
⚙️
Table Stakes: Chunk Sizes
Tuning your chunk size can have outsized impacts on performance
Not obvious that more retrieved tokens == higher performance!
Note: Reranking (reordering retrieved context) isn’t always beneficial.
Source:
Arize Phoenix + LlamaIndex Workshop: https://colab.research.google.com/drive/1Siufl13rLI-kII1liaNfvf-NniBdwUpS?usp=sharing#scrollTo=_as7h-u1IlwR
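The tuning knob itself is easy to picture: a splitter with a size and an overlap. The sketch below splits on words for simplicity (real splitters typically count tokens; `chunk_size` and `overlap` are illustrative parameters to sweep during evaluation):

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list:
    """Split text into word-based chunks of `chunk_size` words, with
    `overlap` words shared between consecutive chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]

text = " ".join(str(i) for i in range(10))
small = chunk_text(text, chunk_size=3)             # more, smaller chunks
big = chunk_text(text, chunk_size=8, overlap=2)    # fewer, bigger chunks
```

Sweeping `chunk_size` and re-running the evaluation set is how the "outsized impact" above is actually measured.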
Table Stakes: Metadata Filtering
We report the development of GPT-4, a large-scale, multimodal…
{"page_num": 1, "org": "OpenAI"}
Metadata
Text Chunk
Example of Metadata
Table Stakes: Metadata Filtering
Question: “Can you tell me the risk factors in 2021?”
Raw Semantic Search is low precision.
Single Collection of all 10Q Document Chunks
Top-4 retrieved: 2020 10Q chunk 6, 2021 10Q chunk 4, 2019 10Q chunk 7, 2021 10Q chunk 7
No guarantee you’ll return the relevant document chunks!
query_str: <query_embedding>
Table Stakes: Metadata Filtering
Question: “Can you tell me the risk factors in 2021?”
If we can infer the metadata filters (year=2021), we remove irrelevant candidates, increasing precision!
[Diagram] Candidate documents: 2020 10Q, 2021 10Q, 2022 10Q, 2023 10Q
Filtered retrieval returns: 2021 10Q chunk 7
query_str: <query_embedding>
Metadata tags:
{"year": 2021}
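A minimal sketch of the filtered-retrieval idea, assuming each chunk carries a `year` metadata field (toy data; embedding-based ranking is elided, since the point here is the candidate-set pruning):

```python
# Toy chunk store: text plus metadata, as in the GPT-4 example above.
chunks = [
    {"text": "2020 risk factors: pandemic disruption", "year": 2020},
    {"text": "2021 risk factors: supply chain, regulation", "year": 2021},
    {"text": "2022 risk factors: rate environment", "year": 2022},
]

def retrieve(query: str, filters: dict, top_k: int = 4) -> list:
    # 1. Metadata filtering: drop candidates that don't match the filters,
    #    so semantic search only ranks relevant-year chunks.
    candidates = [c for c in chunks
                  if all(c.get(k) == v for k, v in filters.items())]
    # 2. A real system would now rank `candidates` by embedding similarity
    #    to `query`; here we just return them.
    return candidates[:top_k]

results = retrieve("Can you tell me the risk factors in 2021?",
                   filters={"year": 2021})
```

Inferring `{"year": 2021}` from the question itself is typically done with an LLM call (auto-retrieval); the filter application is the part sketched here.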
Advanced Retrieval: Small-to-Big
Intuition: Embedding a big text chunk feels suboptimal.
Solution: Embed text at the sentence level, then expand to the surrounding window during LLM synthesis
Advanced Retrieval: Small-to-Big
Leads to more precise retrieval.
Avoids “lost in the middle” problems.
Sentence Window Retrieval (k=2)
Naive Retrieval (k=5)
Only one of the 5 chunks is relevant, illustrating the “lost in the middle” problem
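Sentence-window retrieval can be sketched as: embed and match individual sentences (small), then hand the LLM the surrounding window of sentences (big). A toy sketch, where `window` is an illustrative parameter:

```python
# Corpus stored at sentence granularity; each sentence is what gets
# embedded and matched during retrieval.
sentences = [
    "The company was founded in 2001.",
    "Risk factors include supply chain disruption.",
    "Revenue grew 10% year over year.",
]

def retrieve_with_window(hit_index: int, window: int = 1) -> str:
    """Given the index of the best-matching sentence, expand to the
    surrounding window for LLM synthesis."""
    lo = max(0, hit_index - window)
    hi = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[lo:hi])

# Suppose sentence 1 is the best embedding match for "What are the risks?"
context = retrieve_with_window(hit_index=1, window=1)
```

The match is precise (one sentence) while the synthesized context is still coherent (the window), which is the small-to-big trade being made.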
Advanced Retrieval: Small-to-Big
Intuition: Embedding a big text chunk feels suboptimal.
Solution: Embed a smaller reference to the parent chunk; use the parent chunk for synthesis
Examples: Smaller chunks, summaries, metadata
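The reference-to-parent variant can be sketched as a child-to-parent lookup: small children (summaries, sub-chunks, metadata) are embedded and retrieved, but the parent chunk they point at is what the LLM sees. Toy data; `parent_id` is an illustrative field name:

```python
# Big parent chunks used for synthesis.
parents = {
    "p1": "Full section on 2021 risk factors: supply chain, regulation.",
    "p2": "Full section on 2021 revenue: growth drivers, segments.",
}

# Small references that get embedded and retrieved; each points at a parent.
children = [
    {"text": "Summary: risk factors", "parent_id": "p1"},
    {"text": "Summary: revenue", "parent_id": "p2"},
]

def retrieve_parent(best_child_index: int) -> str:
    # Retrieval matches a small child; synthesis uses the big parent.
    return parents[children[best_child_index]["parent_id"]]

context = retrieve_parent(0)
```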
Agentic Behavior: Multi-Document Agents
Intuition: There are certain questions that “top-k” RAG can’t answer.
Solution: Multi-Document Agents
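A toy sketch of the routing idea underneath multi-document agents: one query engine per document, with a selector choosing which engine(s) to call. Here keyword matching stands in for the LLM selector a real agent would use:

```python
# One query engine per document (stubs standing in for real RAG pipelines).
def engine_2021(query: str) -> str:
    return "answer from the 2021 10Q"

def engine_2022(query: str) -> str:
    return "answer from the 2022 10Q"

tools = {
    "2021": engine_2021,
    "2022": engine_2022,
}

def route(query: str) -> str:
    # Toy routing: pick the engine whose key appears in the query.
    # A real agent would ask an LLM to choose (and could call several
    # engines, then compare their answers).
    for key, engine in tools.items():
        if key in query:
            return engine(query)
    raise ValueError("no matching document")

answer = route("What were the risk factors in 2022?")
```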
Fine-Tuning: Embeddings
Intuition: Embedding representations are not optimized over your dataset
Solution: Generate a synthetic query dataset from raw text chunks using LLMs
Use this synthetic dataset to fine-tune an embedding model.
Credits: Jo Bergum, vespa.ai
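The synthetic-dataset step can be sketched as pairing each raw chunk with an LLM-generated question it answers; the resulting (question, chunk) pairs become positive examples for fine-tuning the embedding model. The LLM call is faked below; swap in a real model:

```python
def make_synthetic_pairs(chunks, generate_question):
    """For each raw text chunk, ask an LLM to write a question that the
    chunk answers. The (question, chunk) pairs are positive examples for
    embedding fine-tuning (e.g. with a contrastive loss)."""
    return [(generate_question(chunk), chunk) for chunk in chunks]

# Stand-in for an LLM call; a real pipeline would prompt a model here.
def fake_llm_question(chunk: str) -> str:
    return f"What does the following describe: {chunk.split('.')[0]}?"

chunks = ["Revenue grew 10% in 2021.", "Risk factors include regulation."]
pairs = make_synthetic_pairs(chunks, fake_llm_question)
```

No hand labeling is needed, which is the appeal: the raw corpus plus an LLM yields the training set.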
Fine-Tuning: LLMs
Intuition: Weaker LLMs are worse at response synthesis, reasoning, structured outputs, etc.
Solution: Generate a synthetic dataset from raw chunks (e.g. using GPT-4) and fine-tune the weaker LLM on it, helping fix all of the above!
Resources
Thanks!
We’ll make talk + workshop slides publicly available.
Finetuning Abstractions in LlamaIndex