1 of 87

Anish Shah

LLM Agent Fine-tuning: Enhancing Task Automation

2 of 87

RAG: The Current State of Chatbots

3 of 87

[Pipeline diagram: the user question and docs are embedded with an embedding model and stored in a ChromaDB vector store; a similarity search retrieves the K nearest neighbors, which fill a prompt template; the resulting prompt goes to GPT-4, which returns an answer, and user feedback is collected on that answer.]
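A minimal sketch of this pipeline, assuming the chromadb and openai Python clients; the collection name, documents, and prompt wording are illustrative:

```python
import chromadb
from openai import OpenAI

# Index docs in a Chroma vector store (Chroma embeds them with its default embedding model)
chroma = chromadb.Client()
docs = chroma.create_collection("wandb-docs")
docs.add(
    ids=["doc-1", "doc-2"],
    documents=["W&B Tables let you log tabular data.", "wandb.log records metrics during a run."],
)

# Similarity search: retrieve the K nearest chunks for the user question
question = "How to log a json file to W&B Tables?"
neighbors = docs.query(query_texts=[question], n_results=2)
context = "\n\n".join(neighbors["documents"][0])

# Fill the prompt template and ask GPT-4 for the answer
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```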

4 of 87

User Feedback

| Feedback | Share |
|----------|-------|
| 👍 | 21% |
| 👎 | 8% |
| None | 70% |

5 of 87

Create evaluation set

There are thousands of questions.

We cluster them into topics to efficiently evaluate across the breadth of questions asked.

6 of 87

Select a question from each cluster to shrink the eval set.

This reduces the eval set from thousands of questions to 132. For example:

How to log a json file to W&B Tables?

How do I move the first column to the end of a W&B Table?
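A minimal sketch of this cluster-then-sample step, assuming OpenAI embeddings and scikit-learn; the embedding model and everything besides the target size of 132 are illustrative:

```python
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans

questions = [
    "How to log a json file to W&B Tables?",
    "How do I move the first column to the end of a W&B Table?",
    # ...in practice, thousands of production questions
]

# Embed every question
client = OpenAI()
resp = client.embeddings.create(model="text-embedding-3-small", input=questions)
X = np.array([item.embedding for item in resp.data])

# Cluster into topics, then keep the question closest to each centroid
n_clusters = min(132, len(questions))
kmeans = KMeans(n_clusters=n_clusters).fit(X)

eval_set = []
for c in range(n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    dists = np.linalg.norm(X[members] - kmeans.cluster_centers_[c], axis=1)
    eval_set.append(questions[members[dists.argmin()]])
```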

7 of 87

Evaluate the model: LLM Eval

  • Faithfulness Evaluation: Does the answer accurately reflect the information in the source documents without introducing unverified or incorrect details?

  • Relevancy Evaluation: Does the answer address the user’s query with information related to the question and context provided? (A minimal judge sketch follows below.)
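A minimal LLM-as-judge sketch for these two checks, assuming the openai client; the judge prompt and the 1-5 scale are illustrative, not wandbot's actual evaluator:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Source documents: {context}
Answer: {answer}

Score each criterion from 1 (poor) to 5 (excellent):
- faithfulness: every claim in the answer is supported by the source documents
- relevancy: the answer addresses the question using the provided context
Reply as JSON: {{"faithfulness": <score>, "relevancy": <score>}}"""

def judge(question: str, context: str, answer: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        temperature=0,
    )
    return resp.choices[0].message.content  # e.g. '{"faithfulness": 4, "relevancy": 5}'
```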

8 of 87

Evaluate the model: Expert Eval

  • Is the link real?
  • Is the question related to wandb?
  • Is it accurate?

9 of 87

Evaluate the model: Expert Eval

You can set your configuration with an argparse object. You can add the args or parse the args and pass them to wandb as follows:

```python
import wandb
import argparse

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument("--param1", type=int, default=42)
args = parser.parse_args()

# Pass the parsed arguments to the W&B run config
wandb.config.update(args)
```

Source: https://docs.wandb.ai/tutorial/track/config

→ Hallucination

→ Can’t find the answer; highlighted a gap in our docs:

I apologize, but I could not find any documentation related to a `cast` method for the `Table` class in wandb. The documented methods for the `Table` class in wandb are `__init__`, `add_data`, `add_column`, `add_computed_column`, `get_column`, `iterrows`, `to_data_frame`, `apply`, `apply_column`, `get_row`, `set_row`, `get_num_rows`, and `get_schema`. Would you like me to provide information on any of these methods?

10 of 87

Traditional Evaluation vs LLM-system Evaluation

[Diagram: the training distribution yields train, valid, and test sets; the production distribution yields prod data. Accuracy drops from 0.90 (train) to 0.89 (valid) to 0.84 (test) to 0.79 (prod), illustrating overfitting/underfitting, domain shift, and drift.]

Inspired by Josh Tobin’s talk at Databricks

11 of 87

Doesn’t work for LLM-system Evaluation

No access to the training (pretraining) distribution.

The production distribution has to be different: your customers aren’t represented by public benchmarks.

Inspired by Josh Tobin’s talk at Databricks

12 of 87

What are some ways of evaluating?

[Chart: evaluation approaches plotted by evaluation data (less like the prod distribution → more like the prod distribution) against evaluation metric (less correlated → more correlated with your use case), roughly ordered by cost:]

  • 💸 Public Benchmarks
  • 💸 LLM eval LLM
  • 💸 💸 User Testing
  • 💸 💸 💸 Human Evaluation

Inspired by Josh Tobin’s talk at Databricks

13 of 87

What are some ways of evaluating?

[Same chart, highlighting 💸 Public Benchmarks]

Pick an LLM that performs well on public benchmarks. Choose a benchmark that correlates with your use case to some extent.

14 of 87

15 of 87

What are some ways of evaluating?

[Same chart, highlighting 💸 💸 💸 Human Evaluation]

Hire annotators to build a gold-standard eval set and evaluate your pipeline. If your pipeline has multiple components, create an eval set to evaluate each component.

⏰⏰⏰⏰⏰

16 of 87

What are some ways of evaluating?

[Same chart, highlighting 💸 💸 User Testing]

Work closely with your customers. Have a dedicated user base. Go back and forth on your changes.

⏰⏰⏰

17 of 87

What are some ways of evaluating?

[Same chart, highlighting 💸 LLM eval LLM]

As crazy as it sounds, it works! It is a quick way of evaluating incremental changes, but think it through properly.

18 of 87

LLM-based System Evaluation - In Practice

Eyeballing

👀

Supervision Based

🕵️‍♀️

LLM Eval LLM

🤝

19 of 87

LLM-based System Evaluation - In Practice

  • Start with a few running examples (3-4).

  • Iterate on the application until it works on those few samples.

  • Try a few harder examples to validate/evaluate.

Eyeballing

👀

Supervision Based

🕵️‍♀️

LLM Eval LLM

🤝

20 of 87

👀 Eyeballing meets W&B Tracers

21 of 87

LLM-based System Evaluation - In Practice

Eyeballing

Supervision Based

LLM Eval LLM

  • Recommended way to evaluate for the best coverage.
  • It is both an expensive and a very slow way of evaluating.
  • It should not be the first way you evaluate.

22 of 87

Supervised Evaluation - A Simple Example

23 of 87

Supervised Evaluation - A Simple Example

"""

The following is the mathematical expression provided by the user.

{question}

Find the answer using the BODMAS rule in the {format_instructions}:

"""

24 of 87

LLM-based System Evaluation - In Practice

Eyeballing

Supervision Based

LLM Eval LLM

  • Fastest way to evaluate your pipeline sensibly.
  • Evaluate every stage of the pipeline.
  • This is a proxy way of evaluating incremental changes to your pipeline.

25 of 87

Prompt Parameter Optimization

maths_sweeps.yaml:

program: maths_sweeps.py
method: random
name: random_maths_sweeps
parameters:
  prompt_template_file:
    values: [
      "data/maths/maths_prompt_template_1.txt",
      "data/maths/maths_prompt_template_2.txt",
      "data/maths/maths_prompt_template_3.txt",
    ]
  temperature:
    values: [0, 0.1, 0.3, 0.5, 0.7, 0.9, 1]
  llm_model_name:
    values: ["gpt-4", "gpt-3.5-turbo", "text-davinci-003"]

26 of 87

Prompt Parameter Optimization

💸 = ⬆️ Acc

27 of 87

Can we improve wandbot by improving our LLM, as opposed to improving our processes around using a pre-trained LLM?


28 of 87

Training and Fine-tuning LLMs

wandb.me/llm-course

29 of 87

Tokens = $$$

30 of 87

RLHF made LLMs more chat-friendly; we can use the results of this for fine-tuning

31 of 87

However, Parameter-Efficient Fine-Tuning (PEFT) allows for cheap, fast, and effective fine-tuning of LLMs

32 of 87

LoRA is currently the most common PEFT method
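A minimal sketch of attaching LoRA adapters with the Hugging Face peft library; the base model and hyperparameters are illustrative:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Load a base model, then train only small low-rank adapter matrices on top of it
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections that get adapters
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of all parameters
```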

33 of 87

34 of 87

As compared to other PEFT methods

35 of 87

Fine-tuning with LoRA works almost as well as full-parameter fine-tuning

[Charts comparing Baseline, LoRA, and Full Parameter fine-tuning]

36 of 87

The dataset may be formatted as follows
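The slide's example isn't reproduced in this text, but instruction-tuning datasets commonly use an instruction/input/output (or prompt/completion) layout; a hypothetical record might look like:

```python
# One hypothetical training record in an instruction-tuning layout
example = {
    "instruction": "Answer the user's question about the W&B API.",
    "input": "How do I log a confusion matrix?",
    "output": "Call wandb.log with wandb.plot.confusion_matrix(...) inside an active run.",
}
```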

37 of 87

38 of 87


A system of record for all ML workflows

39 of 87


A system of record for all ML workflows

Get started in 60 seconds…

!pip install wandb   # Install W&B

import wandb

wandb.init()         # Start experiment

wandb.log(metrics)   # Log metrics + more!

Or less!

!pip install wandb   # Install W&B

from wandb.integration.openai.fine_tuning import WandbLogger

WandbLogger.sync(...)   # Autolog metrics

40 of 87

Weights & Biases The AI Developer Platform

41 of 87

Agents: Self-Reasoning LLMs with Tools

Adapted slides from:

How to Build, Evaluate, and Iterate on LLM Agents, by LlamaIndex and TruEra

42 of 87

From RAG to Agents

[Diagram: Query → RAG → Response]

43 of 87

From RAG to Agents

Agents?

[Same diagram: Query → RAG → Response]

44 of 87

A range of AI agents is possible

Agents that can take action in the real world

  • Book plane tickets
  • Schedule appointments
  • Order DoorDash

Specialized Data Agents

  • Similar to retrieval from vector store
  • But with access to real-time information

General Data Agents

  • Access to more than one tool
  • Can accomplish a wider range of tasks

45 of 87

A range of AI agents is possible

Agents that can take action in the real world

  • Book plane tickets
  • Create calendar invites

Specialized Data Agents

  • Similar to retrieval
  • Access to real-time information

General Data Agents

  • Access to more than one tool
  • Can accomplish a wider range of tasks

46 of 87

Data Agents - LLM-powered knowledge workers

[Diagram: a Data Agent reads the latest emails (Email), retrieves context from a Knowledge Base, analyzes files via an Analysis Agent, and sends updates to Slack.]
47 of 87

Data Agents - Core Components

Agent Reasoning Loop

  • ReAct Agent (any LLM)
  • OpenAI Agent (only OAI)

Tools

Query Engine Tools (RAG pipeline)

  • Code interpreter
  • Slack
  • Notion
  • Zapier
  • … (15+ tools)

48 of 87

How to retrieve & analyze data from a knowledge base?

Use our query engines as “data tools” for your agent:

  • Semantic search
  • Summarization
  • Text-to-SQL
  • Document comparisons
  • Combining Structured Data w/ Unstructured

“Simple” interface: all the agent has to infer is a query string!
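A rough sketch of that query-string-only interface, using hypothetical names rather than a specific agent framework's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QueryEngineTool:
    name: str
    description: str                     # the agent reads this when picking a tool
    query_engine: Callable[[str], str]   # e.g. a RAG pipeline: query string -> answer

    def __call__(self, query: str) -> str:
        # The only thing the agent has to infer is this query string
        return self.query_engine(query)

# Hypothetical tools wrapping different query engines
docs_search = QueryEngineTool(
    name="docs_search",
    description="Semantic search over the W&B documentation.",
    query_engine=lambda q: f"(answer retrieved from docs for: {q})",  # stand-in for a real RAG call
)
sql_tool = QueryEngineTool(
    name="usage_sql",
    description="Text-to-SQL over the usage database.",
    query_engine=lambda q: f"(SQL result for: {q})",
)
```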

49 of 87

Data agents for real-time retrieval

[Diagram: User Input → Reasoning Agent → Tool Input → API Tool → Tool Output → back to the Reasoning Agent; the agent also fetches and writes Conversation History.]

50 of 87

Agent Failure Modes

[Same agent diagram, annotated with failure modes]

  • Wrong tool selection/input
  • Rogue paths

51 of 87

Agent Failure Modes

[Same agent diagram, annotated with failure modes]

  • Wrong tool selection/input
  • Rogue paths
  • Infinite loops

52 of 87

Agent Failure Modes

[Same agent diagram, annotated with failure modes]

  • Wrong tool selection/input
  • Rogue paths
  • Infinite loops
  • Failed API calls

53 of 87

Agent Failure Modes

[Same agent diagram, annotated with failure modes]

  • Wrong tool selection/input
  • Rogue paths
  • Infinite loops
  • Failed API calls
  • Hallucination

54 of 87

Testing Agents for Hallucinations

The Agent Quad

[Diagram: the Agent Quad relates Query, Agent, Context, and Response, evaluated on Tool Selection, Context Relevance, Groundedness, and Answer Relevance.]

55 of 87

Best practices for Agents

Blog Post: https://blog.llamaindex.ai/building-better-tools-for-llm-agents-f8c5a6714f11

  • Writing useful tool prompts
  • Make tools tolerant of partial/faulty inputs (see the sketch below)
  • Prompt engineering error messages
  • Returning the right prompts for “POST” requests
  • Don’t overload the agent with tools
  • Try hierarchical agent modeling
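A sketch of the "tolerant tools" and "prompt-engineered error messages" ideas; the decorator and tool below are hypothetical:

```python
def tolerant_tool(fn):
    """Wrap a tool so faulty inputs come back as instructions the agent can act on."""
    def wrapped(**kwargs):
        try:
            return fn(**kwargs)
        except TypeError as err:
            # Missing or unexpected arguments: tell the agent exactly what the tool expects
            expected = fn.__code__.co_varnames[: fn.__code__.co_argcount]
            return f"Tool call failed ({err}). Expected arguments: {list(expected)}. Retry with valid arguments."
        except Exception as err:
            return f"Tool error: {err}. Rephrase the input and try again."
    return wrapped

@tolerant_tool
def get_quarterly_revenue(company: str, quarter: str) -> str:
    # Hypothetical tool body; a real tool would hit an API or database here
    return f"{company} revenue for {quarter}: <figure from data source>"
```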

56 of 87

Example: WandAgent -> Use the LLM to decide the best action

57 of 87

As LLMs and use cases evolve, so do the usage patterns around agents

Source: Wang et al., A Survey on Large Language Model based Autonomous Agents

58 of 87

Recap RL Agent vs LLM Agent

59 of 87

A crucial aspect is to design rational agent architectures to assist LLMs in maximizing their capabilities

60 of 87

ReAct

61 of 87

ReAct

62 of 87

ReAct

63 of 87

ReAct

64 of 87

ReAct

65 of 87

ReAct as a standard format for LLM behavior fine-tuning
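A hypothetical ReAct-style trace, the kind of record this fine-tuning format implies; the tool name, question, and observation text are made up:

```python
# One ReAct-format example: interleaved Thought / Action / Observation steps
react_trace = """Question: What was the company's revenue last quarter?
Thought: I should look this up with the filings search tool.
Action: search_filings
Action Input: {"query": "revenue last quarter"}
Observation: The filing reports revenue of $X billion for the quarter.
Thought: I now have the answer.
Final Answer: Revenue last quarter was $X billion, per the filing."""
```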

66 of 87

ReAct

67 of 87

Chain of Thought: Verbose thought processes

68 of 87

Example: Financial Analysis with Agents + RAG

Question: “Compare and contrast Uber’s revenue between quarters”

  • Agent: breaks down the question into sub-questions over tools
  • Per-Document RAG: answers each sub-question over a given document via top-k retrieval (a rough sketch follows below)
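A rough sketch of this decomposition pattern (not the LlamaIndex API; the helper functions and sub-question wording are hypothetical):

```python
from openai import OpenAI

client = OpenAI()

def ask_llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return resp.choices[0].message.content

def per_document_rag(question: str, document_text: str) -> str:
    # Stand-in for top-k retrieval + answering over a single filing
    return ask_llm(f"Using this filing excerpt:\n{document_text}\n\nAnswer: {question}")

def compare_quarters(question: str, filings: dict) -> str:
    # 1) The agent breaks the question into one sub-question per document
    sub_questions = {name: f"What was Uber's revenue in {name}?" for name in filings}
    # 2) Each sub-question is answered by per-document RAG
    sub_answers = {name: per_document_rag(q, filings[name]) for name, q in sub_questions.items()}
    # 3) A final call synthesizes the comparison from the sub-answers
    return ask_llm(f"Question: {question}\nSub-answers: {sub_answers}\nCompare and contrast the quarters.")
```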

69 of 87

Demo


70 of 87

Evaluating Agents

71 of 87

Agents: Extending Past Chain of Thought

72 of 87

Reflexion

73 of 87

Reflexion

74 of 87

Reflexion

75 of 87

Reflexion

76 of 87

Tree of Thought: Planning

77 of 87

Tree of Thought: Example

78 of 87

Tree of Thought: How to Generate Thoughts

79 of 87

Tree of Thought: How to Evaluate Thoughts

80 of 87

Tree of Thought: Example of Evaluating Thoughts

81 of 87

Tree of Thought

82 of 87

The paradigm of how we utilize data to improve models is shifting

Source: Wang et al., A Survey on Large Language Model based Autonomous Agents

83 of 87

New prompting paradigms that affect your LLM data distributions are emerging

84 of 87

Case Study: FireAct

85 of 87

Case Study: FireAct

86 of 87

Demo


87 of 87

Stay in touch!

wandb.me/discord

wandb.me/courses

LinkedIn: Anish Shah
