Anish Shah
LLM Agent Fine-tuning: Enhancing Task Automation
RAG: The Current State of Chatbots
Pipeline: User Question → Embedding model → Similarity search over Docs in the ChromaDB Vector Store → K nearest neighbors → Prompt template → GPT-4 → Answer → Feedback
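A minimal sketch of this pipeline, assuming a ChromaDB collection that already contains the embedded docs (with its default embedding function) and the OpenAI Python client; the collection name, prompt wording, and k are illustrative.
```python
import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.PersistentClient(path="./chroma")
docs = chroma.get_or_create_collection("wandb_docs")  # assumed pre-populated collection

def answer(question: str, k: int = 5) -> str:
    # Similarity search: retrieve the k nearest document chunks
    results = docs.query(query_texts=[question], n_results=k)
    context = "\n\n".join(results["documents"][0])

    # Fill the prompt template with retrieved context and the user question
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

    # Generate the answer with GPT-4
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```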
User Feedback
Feedback | Share of responses
👍 | 21%
👎 | 8%
None | 70%
Create evaluation set
There are thousands of questions.
We cluster them into topics to efficiently evaluate on the breadth of questions asked.
Select a question from each cluster to shrink the eval set.
This reduces the eval set from thousands of questions to 132 (a clustering sketch follows the example questions below).
How to log a json file to W&B Tables?
How do I move the first column to the end of a W&B Table?
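A sketch of how such a reduction might be done, assuming you already have one embedding per question from any embedding model; the cluster count and the choice of the question nearest each centroid are illustrative.
```python
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def build_eval_set(questions, embeddings, n_clusters=132):
    """questions: list of strings; embeddings: array of shape (n_questions, dim)."""
    # Cluster the questions into topics
    kmeans = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit(embeddings)
    # Keep the question closest to each cluster centroid as its representative
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeddings)
    return [questions[i] for i in closest]
```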
Evaluate the model: LLM Eval
Evaluate the model: Expert Eval
Is the link real?
Is the question related to wandb?
Is it accurate?
Evaluate the model: Expert Eval
You can set your configuration with an argparse object. You can add the args or parse the args and pass them to wandb as follows:
```python
import wandb
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--param1", type=int, default=42)
args = parser.parse_args()
wandb.config.update(args)
…
```
Source: https://docs.wandb.ai/tutorial/track/config
Hallucination
Can’t find the answer
Highlighted a gap in our docs
I apologize, but I could not find any documentation related to a `cast` method for the `Table` class in wandb. The documented methods for the `Table` class in wandb are `__init__`, `add_data`, `add_column`, `add_computed_column`, `get_column`, `iterrows`, `to_data_frame`, `apply`, `apply_column`, `get_row`, `set_row`, `get_num_rows`, and `get_schema`. Would you like me to provide information on any of these methods?
Traditional Evaluation vs LLM-system Evaluation
Training Distribution: Train Set (acc = 0.90), Valid Set (acc = 0.89), Test Set (acc = 0.84)
Production Distribution: Prod Data (acc = 0.79)
The gaps between these accuracies reveal Overfitting/Underfitting, Domain Shift, and Drift.
Inspired by Josh Tobin’s talk at Databricks
Doesn’t work for LLM-system Evaluation
Training Distribution: no access to the training (pretraining) distribution.
Production Distribution: has to be different; your customers aren’t represented by public benchmarks.
Inspired by Josh Tobin’s talk at Databricks
What are some ways of evaluating?
Evaluation Data: more like prod dist ↔ less like prod dist
Evaluation Metric: less correlated ↔ more correlated with your use case
💸 Public Benchmarks
💸 LLM eval LLM
💸💸 User Testing
💸💸💸 Human Evaluation
Inspired by Josh Tobin’s talk at Databricks
Public Benchmarks (⏰): pick an LLM that performs well on public benchmarks; choose a benchmark that correlates with your use case to some extent.
Human Evaluation (⏰⏰⏰⏰⏰): hire annotators to build a gold-standard eval set and evaluate your pipeline. If your pipeline has multiple components, create an eval set to evaluate each component.
User Testing (⏰⏰⏰): work closely with your customers, have a dedicated user base, and go back and forth on your changes.
LLM eval LLM (⏰): as crazy as it sounds, it works. It is a quick way of evaluating incremental changes, but think it through properly.
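A minimal “LLM eval LLM” sketch, assuming the OpenAI Python client; the rubric and scoring scale are illustrative.
```python
from openai import OpenAI

client = OpenAI()

def llm_judge(question: str, answer: str, context: str) -> str:
    # Ask a strong LLM to grade the pipeline's answer against the retrieved context
    rubric = (
        "Score the ANSWER from 1-5 for faithfulness to the CONTEXT and "
        "relevance to the QUESTION. Reply with the score and one sentence of reasoning."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"{rubric}\n\nQUESTION: {question}\n\nCONTEXT: {context}\n\nANSWER: {answer}",
        }],
    )
    return response.choices[0].message.content
```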
LLM-based System Evaluation - In Practice
Eyeballing
👀
Supervision Based
🕵️♀️
LLM Eval LLM
🤝
👀 Eyeballing meets W&B Tracers
Supervised Evaluation - A Simple Example
"""
The following is the mathematical expression provided by the user.
{question}
Find the answer using the BODMAS rule in the {format_instructions}:
"""
Prompt Parameter Optimization
maths_sweeps.yaml
```yaml
program: maths_sweeps.py
method: random
name: random_maths_sweeps
parameters:
  prompt_template_file:
    values: [
      "data/maths/maths_prompt_template_1.txt",
      "data/maths/maths_prompt_template_2.txt",
      "data/maths/maths_prompt_template_3.txt",
    ]
  temperature:
    values: [0, 0.1, 0.3, 0.5, 0.7, 0.9, 1]
  llm_model_name:
    values: ["gpt-4", "gpt-3.5-turbo", "text-davinci-003"]
```
Prompt Parameter Optimization
💸 = ⬆️ Acc
Can we improve wandbot by improving our LLM, as opposed to improving our processes around using a pre-trained LLM?
Training and Fine-tuning LLMs
wandb.me/llm-course
Tokens = $$$
RLHF made LLMs more chat-friendly; we can use the results of this for fine-tuning.
Parameter-Efficient Fine-Tuning (PEFT) allows for cheap, fast, and effective fine-tuning of LLMs.
LoRA is currently the most common PEFT method.
Fine-tuning with LoRA works almost as well as full-parameter fine-tuning.
[Chart: evaluation results comparing Baseline, LoRA, and full-parameter fine-tuning]
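A minimal LoRA setup using the Hugging Face peft library, as one common way to apply the method; the base model and the LoRA hyperparameters below are illustrative, not a prescription.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"  # illustrative base model
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# LoRA: freeze the base weights and train small low-rank adapter matrices instead
lora_config = LoraConfig(
    r=8,                                   # rank of the adapter matrices
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters are trainable
```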
The dataset may be formatted as follows:
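One common layout (an assumption here, since formats vary): JSONL records with an instruction, an optional input, and a target output, written one JSON object per line.
```python
import json

# Illustrative instruction-tuning records
examples = [
    {
        "instruction": "How do I log a metric with W&B?",
        "input": "",
        "output": "Call wandb.log({\"metric_name\": value}) inside your training loop.",
    },
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```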
A system of record for all ML workflows
```python
!pip install wandb   # Install W&B
import wandb

wandb.init()         # Start experiment
wandb.log(metrics)   # Log metrics + more!
```
Get started in 60 seconds…
```python
!pip install wandb   # Install W&B
from wandb.integration.openai.fine_tuning import WandbLogger

WandbLogger.sync(...)  # Autolog metrics
```
Or less!
Weights & Biases The AI Developer Platform
Agents: Self-Reasoning LLMs with Tools
Adapted slides from: How to Build, Evaluate, and Iterate on LLM Agents by LlamaIndex and TruEra
From RAG to Agents
RAG: Query → RAG pipeline → Response
Agents?
A range of AI agents is possible
Agents that can take action in the real world
Specialized Data Agents
General Data Agents
Data Agents - LLM-powered knowledge workers
Example flow: the Data Agent reads the latest emails, retrieves context from a Knowledge Base, hands a file to an Analysis Agent for analysis, and sends an update to Slack.
Data Agents - Core Components
How to retrieve & analyze data from knowledge base?
Use our query engines as “data tools” for your agent:
“Simple” interface: all the agent has to infer is a query string!
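A sketch of wrapping a query engine as a data tool for an agent with the LlamaIndex API; module paths shown are for recent llama-index versions and may differ, and the directory, tool name, and description are illustrative.
```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.tools import QueryEngineTool
from llama_index.agent.openai import OpenAIAgent

# Build a query engine over a local knowledge base
documents = SimpleDirectoryReader("./knowledge_base").load_data()
query_engine = VectorStoreIndex.from_documents(documents).as_query_engine()

# Expose it as a "data tool" -- the agent only has to infer a query string
docs_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="knowledge_base",
    description="Answers questions about the internal knowledge base.",
)

agent = OpenAIAgent.from_tools([docs_tool], verbose=True)
print(agent.chat("Summarize the latest report in the knowledge base."))
```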
Data agents for real-time retrieval
Flow: the Reasoning Agent takes the User Input, fetches and writes Conversation History, builds the Tool Input for an API Tool, and incorporates the Tool Output into its response.
Agent Failure Modes (failure points in the loop above):
Wrong tool selection/input
Rogue Paths
Infinite Loops
Failed API Calls
Hallucination
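A minimal sketch of guardrails for some of these failure modes: a step cap against infinite loops, tool-name validation against wrong tool selection, and error handling around tool/API calls. The llm_decide_action callable and the tools dict are hypothetical stand-ins.
```python
MAX_STEPS = 10  # guard against infinite loops

def run_agent(user_input, tools, llm_decide_action):
    """tools: dict of name -> callable; llm_decide_action: hypothetical LLM call
    that returns (tool_name, tool_input) or ("final_answer", text)."""
    history = [("user", user_input)]
    for _ in range(MAX_STEPS):
        tool_name, tool_input = llm_decide_action(history)
        if tool_name == "final_answer":
            return tool_input
        if tool_name not in tools:  # wrong tool selection
            history.append(("error", f"Unknown tool: {tool_name}"))
            continue
        try:
            observation = tools[tool_name](tool_input)
        except Exception as exc:  # failed API call -- surface it back to the agent
            observation = f"Tool error: {exc}"
        history.append((tool_name, observation))
    return "Stopped: step limit reached without a final answer."
```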
Testing Agents for Hallucinations
The Agent Quad: evaluate the Query → Agent → Context → Response loop along four metrics: Tool Selection, Context Relevance, Groundedness, and Answer Relevance.
Best practices for Agents
Blog Post: https://blog.llamaindex.ai/building-better-tools-for-llm-agents-f8c5a6714f11
Example: WandAgent -> use the LLM to decide the best action
As LLMs and use cases evolve, so do the usage patterns around agents
Source: Wang et al., A Survey on Large Language Model based Autonomous Agents
Recap: RL Agent vs. LLM Agent
A crucial aspect is to design rational agent architectures to assist LLMs in maximizing their capabilities
ReAct
ReAct as a standard format for LLM Behavior Finetuning
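An illustrative training example in the Thought/Action/Observation format described by ReAct, as one way such fine-tuning data could look; the question, tool, and trace below are made up for illustration.
```python
# Illustrative prompt/completion pair for ReAct-format fine-tuning
react_example = {
    "prompt": "Question: What was Uber's revenue in Q2 2022?\n",
    "completion": (
        "Thought: I should look this up in the Q2 filing.\n"
        "Action: search[Uber Q2 2022 revenue]\n"
        "Observation: The 10-Q reports revenue of about $8.1 billion.\n"
        "Thought: I now have the answer.\n"
        "Final Answer: Uber reported roughly $8.1 billion in revenue for Q2 2022.\n"
    ),
}
```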
Chain of Thought: Verbose thought processes
Example: Financial Analysis with Agents + RAG
Question: “Compare and contrast Uber’s revenue between quarters”
Demo
Evaluating Agents
Agents: Extending Past Chain of Thought
Reflexion
Tree of Thought: Planning
Tree of Thought: Example
Tree of Thought: How to Generate Thoughts
Tree of Thought: How to Evaluate Thoughts
Tree of Thought: Example of Evaluating Thoughts
Tree of Thought
The paradigm of how we utilize data to improve models is shifting
Source: Wang et al., A Survey on Large Language Model based Autonomous Agents
New prompting paradigms are forming that affect your LLM data distributions
Case Study: FireAct
Demo
Stay in touch!
wandb.me/discord
wandb.me/courses
LinkedIn: Anish Shah