1 of 38

Beyond RAG: Building Advanced Context-Augmented LLM

Applications

Jerry Liu, LlamaIndex co-founder/CEO

2 of 38

LlamaIndex: Context Augmentation for your LLM app

3 of 38

RAG

4 of 38

RAG

[Diagram: the two stages of RAG. Data Parsing & Ingestion: Data → Data Parsing + Ingestion → Index. Data Querying: Index → Retrieval → LLM + Prompts → Response.]

5 of 38

Naive RAG

[Diagram: the same pipeline (Data → Data Parsing + Ingestion → Index → Retrieval → LLM + Prompts → Response) with naive defaults: PyPDF parsing, sentence splitting at chunk size 256, dense retrieval with top-k = 5, and a simple QA prompt.]

6 of 38

Naive RAG is Limited

7 of 38

RAG Prototypes are Limited

Naive RAG approaches tend to work well for simple questions over a small set of documents.

  • “What are the main risk factors for Tesla?” (over Tesla 2021 10K)
  • “What did the author do during his time at YC?” (Paul Graham essay)

8–11 of 38

Pain Points

There are certain questions we want to ask where naive RAG will fail.

Examples:

  • Summarization Questions: “Give me a summary of the entire <company> 10K annual report”
  • Comparison Questions: “Compare the open-source contributions of candidate A and candidate B”
  • Structured Analytics + Semantic Search: “Tell me about the risk factors of the highest-performing rideshare company in the US”
  • General Multi-part Questions: “Tell me about the pro-X arguments in article A, and tell me about the pro-Y arguments in article B, make a table based on our internal style guide, then generate your own conclusion based on these facts.”

12 of 38

Can we do more?

In the naive setting, RAG is boring.

🚫 It’s just a glorified search system

🚫 There’s many questions/tasks that naive RAG can’t give an answer to.

💡 Can we go beyond simple search/QA to building a general context-augmented research assistant?

13 of 38

Beyond RAG: Adding Layers of Agentic Reasoning

14–15 of 38

From RAG to Agents

[Diagram: Query → RAG → Response.]

⚠️ Single-shot

⚠️ No query understanding/planning

⚠️ No tool use

⚠️ No reflection, error correction

⚠️ No memory (stateless)

16 of 38

From RAG to Agents

✅ Multi-turn

✅ Query / task planning layer

✅ Tool interface for external environment

✅ Reflection

✅ Memory for personalization

[Diagram: Query → Agent → Response. The agent wraps the RAG pipeline and a set of tools (Tool, Tool, Tool) that it can call.]

17 of 38

From Simple to Advanced Agents

[Diagram: a spectrum from simple (lower cost, lower latency) to advanced (higher cost, higher latency).

Agent ingredients: Routing, One-Shot Query Planning, Tool Use, Conversation Memory.

Full agents: ReAct, Dynamic Planning + Execution.]
18 of 38

Routing

Simplest form of agentic reasoning.

Given a user query and a set of choices, output the subset of choices to route the query to.

19 of 38

Routing

Use Case: Joint QA and Summarization

Guide
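
A minimal sketch of this use case with LlamaIndex's RouterQueryEngine: a vector index handles specific questions, a summary index handles summarization, and an LLM selector routes each query based on the tool descriptions. Import paths follow the llama_index.core package layout and may differ across versions; the data directory is hypothetical.

```python
from llama_index.core import SimpleDirectoryReader, SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

docs = SimpleDirectoryReader("./data").load_data()  # hypothetical data dir

# Two pipelines over the same documents: one for specific QA,
# one for whole-document summarization.
vector_tool = QueryEngineTool.from_defaults(
    query_engine=VectorStoreIndex.from_documents(docs).as_query_engine(),
    description="Useful for answering specific questions over the documents.",
)
summary_tool = QueryEngineTool.from_defaults(
    query_engine=SummaryIndex.from_documents(docs).as_query_engine(
        response_mode="tree_summarize"
    ),
    description="Useful for summarizing an entire document.",
)

# The selector LLM reads the descriptions and picks one tool per query.
router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool, summary_tool],
)
print(router.query("Give me a summary of the entire report"))
```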

20 of 38

Conversation Memory

In addition to the current query, take the conversation history into account as input to your RAG pipeline.

21 of 38

Conversation Memory

How to account for conversation history in a RAG pipeline?

  • Condense question
  • Condense question + context
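
Both strategies ship as chat modes on LlamaIndex chat engines; a minimal sketch (the chat_mode strings match recent versions, so treat exact names as version-dependent):

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./data").load_data()  # hypothetical data dir
)

# "condense_question": rewrite (current query + chat history) into a
# single standalone question, then run the normal RAG pipeline on it.
chat_engine = index.as_chat_engine(chat_mode="condense_question")

# "condense_plus_context": condense the question AND retrieve context
# on every turn, inserting it into the system prompt.
# chat_engine = index.as_chat_engine(chat_mode="condense_plus_context")

print(chat_engine.chat("What did the author do at YC?"))
print(chat_engine.chat("What did he do after that?"))  # resolved via history
```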

22–23 of 38

Query Planning

Break down a query into parallelizable sub-queries. Each sub-query can be executed against any set of RAG pipelines.

Example: Compare revenue growth of Uber and Lyft in 2021

[Diagram: the query "Compare revenue growth of Uber and Lyft in 2021" is split into two sub-queries, "Describe revenue growth of Uber in 2021" and "Describe revenue growth of Lyft in 2021". Each sub-query retrieves top-2 chunks from the corresponding index (Uber 10-K chunks 4 and 8; Lyft 10-K chunks 4 and 8), and the results are combined into a final comparison.]

Query Planning Guide
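
One way to implement this in LlamaIndex is SubQuestionQueryEngine, which decomposes the query, fans the sub-queries out to the right engines, and synthesizes a final answer. A minimal sketch; file names are hypothetical and parameter names may vary by version:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

# One RAG pipeline per filing (hypothetical local files).
uber_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader(input_files=["uber_10k.pdf"]).load_data()
)
lyft_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader(input_files=["lyft_10k.pdf"]).load_data()
)

tools = [
    QueryEngineTool.from_defaults(
        query_engine=uber_index.as_query_engine(similarity_top_k=2),
        name="uber_10k",
        description="Provides information from Uber's 2021 10-K filing.",
    ),
    QueryEngineTool.from_defaults(
        query_engine=lyft_index.as_query_engine(similarity_top_k=2),
        name="lyft_10k",
        description="Provides information from Lyft's 2021 10-K filing.",
    ),
]

# Decomposes the question into per-company sub-questions, runs them
# against the matching tools, then synthesizes the comparison.
engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
print(engine.query("Compare revenue growth of Uber and Lyft in 2021"))
```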

24 of 38

Tool Use

Use an LLM to call an API

Infer the parameters of that API

25 of 38

Tool Use

In normal RAG, you just pass the query through.

But what if you used the LLM to infer all the parameters for the API interface?

This is a key capability in many QA use cases (auto-retrieval, text-to-SQL, and more).
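
A small sketch: wrap a plain Python function as a tool and let the LLM infer the arguments from the natural-language query. predict_and_call is the method recent function-calling LlamaIndex LLMs expose; treat the exact method name as an assumption on older versions.

```python
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

def get_revenue_growth(company: str, year: int) -> str:
    """Return revenue growth for a company and year (stubbed here)."""
    return f"{company} revenue growth in {year}: <placeholder>"

tool = FunctionTool.from_defaults(fn=get_revenue_growth)

llm = OpenAI(model="gpt-4o")
# The LLM reads the function signature + docstring and infers
# company="Uber", year=2021 from the free-form question.
print(llm.predict_and_call([tool], "How fast did Uber grow in 2021?"))
```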

26 of 38

Let’s put them together

  • All of these are agent ingredients
  • Let’s put them together for a full agent system
    • Query planning
    • Memory
    • Tool Use
  • Let’s add additional components
    • Reflection
    • Controllability
    • Observability

27 of 38

Core Components of a Full Agent

Minimum necessary ingredients:

  • Query planning
  • Memory
  • Tool Use

28 of 38

ReAct: Reasoning + Acting with LLMs

29 of 38

ReAct: Reasoning + Acting with LLMs

Query Planning: Generate next step given previous steps (chain-of-thought prompt)

Tool Use: Sequential tool calling.

Memory: Maintain simple buffer.
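
In LlamaIndex this combination is a few lines; a sketch that reuses the uber_10k / lyft_10k query-engine tools defined in the query-planning example:

```python
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI

# `tools` = the QueryEngineTools from the query-planning sketch above.
agent = ReActAgent.from_tools(
    tools,
    llm=OpenAI(model="gpt-4o"),
    verbose=True,  # print each Thought / Action / Observation step
)

# The agent interleaves chain-of-thought reasoning with sequential tool
# calls, keeping a simple chat-history buffer as memory across turns.
print(agent.chat("Compare revenue growth of Uber and Lyft in 2021"))
print(agent.chat("Which one grew faster?"))  # answered from memory
```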

30 of 38

ReAct: Reasoning + Acting with LLMs

31 of 38

Can we make this even better?

  • Stop being so short-sighted: plan ahead at each step
  • Parallelize execution where we can

32 of 38

LLMCompiler (Kim et al. 2023)

An agent compiler for parallel multi-function planning + execution.

33 of 38

LLMCompiler

Query Planning: Generate a DAG of steps; replan if the steps don't reach the desired state.

Tool Use: Parallel function calling.

Memory: Maintain simple buffer.

LLMCompiler Agent
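
A toy sketch of just the execution half of the idea, in plain asyncio rather than the paper's implementation: once the planner has emitted a DAG, independent leaf tasks run concurrently instead of one ReAct step at a time. The search functions below are hypothetical stand-ins for tool calls.

```python
import asyncio

async def search_uber_10k() -> str:
    return "Uber 2021 revenue grew 57%"  # placeholder tool result

async def search_lyft_10k() -> str:
    return "Lyft 2021 revenue grew 36%"  # placeholder tool result

async def main() -> None:
    # Both leaf tasks of the plan are independent, so run them in parallel.
    uber, lyft = await asyncio.gather(search_uber_10k(), search_lyft_10k())
    # A final "join" node would hand both results to the LLM to compare.
    print(uber, "|", lyft)

asyncio.run(main())
```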

34 of 38

Tree-based Planning

Tree of Thoughts (Yao et al. 2023)

Reasoning via Planning (Hao et al. 2023)

Language Agent Tree Search (Zhou et al. 2023)

35 of 38

Tree-based Planning

Query Planning in the face of uncertainty: Instead of planning out a fixed sequence of steps, sample a few different states.

Run Monte-Carlo Tree Search (MCTS) to balance exploration vs. exploitation.
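
For concreteness, this is the standard UCT score MCTS uses to decide which candidate state to expand next; it is the generic formula, not tied to any particular agent framework.

```python
import math

def uct_score(total_reward: float, visits: int,
              parent_visits: int, c: float = math.sqrt(2)) -> float:
    """Upper Confidence bound for Trees: exploitation + exploration."""
    if visits == 0:
        return float("inf")  # always try unvisited states first
    exploitation = total_reward / visits  # average reward so far
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration
```

States with high average reward get exploited; rarely-visited states keep a large exploration bonus, which is exactly the balance the slide describes.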

36 of 38

Self-Reflection

Use feedback to improve agent execution and reduce errors

🗣️ Human feedback

🤖 LLM feedback

Use few-shot examples instead of retraining the model!

Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023)
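
A minimal sketch of a verbal reflection loop in the spirit of Reflexion, using a LlamaIndex LLM; the prompts and the DONE convention are illustrative, not from the paper.

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")  # any LlamaIndex LLM works here

def reflect_and_retry(task: str, max_rounds: int = 3) -> str:
    feedback = ""
    answer = ""
    for _ in range(max_rounds):
        answer = str(llm.complete(f"{task}\n{feedback}"))
        critique = str(llm.complete(
            f"Task: {task}\nAnswer: {answer}\n"
            "If the answer is correct and complete, reply DONE. "
            "Otherwise, explain what is wrong."
        ))
        if "DONE" in critique:
            break
        # Feed the verbal critique back in as context; no retraining.
        feedback = f"Previous attempt: {answer}\nCritique: {critique}\nTry again."
    return answer

print(reflect_and_retry("List the three largest US rideshare companies."))
```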

37 of 38

Additional Requirements

  • Observability: see the full trace of the agent
    • Observability Guide
  • Control: Be able to guide the intermediate steps of an agent, step by step (see the sketch after this list)
    • Lower-Level Agent API
  • Customizability: Define your own agentic logic around any set of tools.
  • Multi-agents: Define multi-agent interactions!
    • Synchronously: Define an explicit flow between agents
    • Asynchronously: Treat each agent as a microservice; agents communicate with one another.
      • Upcoming in LlamaIndex!
    • Current Frameworks: Autogen, CrewAI
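
For control specifically, LlamaIndex agents expose a lower-level task API next to chat(); a sketch (method names follow the docs of this era, so double-check your version):

```python
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI

# `tools` = any list of tools, e.g. the QueryEngineTools from earlier.
agent = ReActAgent.from_tools(tools, llm=OpenAI(model="gpt-4o"))

# Create a task, then drive it one reasoning step at a time, inspecting
# (or stopping) the run between steps instead of getting only the answer.
task = agent.create_task("Compare revenue growth of Uber and Lyft in 2021")
while True:
    step_output = agent.run_step(task.task_id)
    if step_output.is_last:
        break
response = agent.finalize_response(task.task_id)
print(response)
```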

38 of 38

LlamaIndex + W&B

Tracing and Observability are essential developer tools for RAG/agent development.

We have first-class integrations with Weights & Biases.

Guide
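
The integration is a one-line global handler switch, assuming the W&B callback package (llama-index-callbacks-wandb at the time of writing) is installed:

```python
from llama_index.core import set_global_handler

# Route all LlamaIndex traces (index construction, queries, agent steps)
# to a Weights & Biases project; the project name here is arbitrary.
set_global_handler("wandb", run_args={"project": "llamaindex-demo"})
```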