1 of 87

Anish Shah

LLM Agent Fine-tuning: Enhancing Task Automation

2 of 87

RAG: The Current State of Chatbots

3 of 87

[Pipeline diagram: the user question and docs are embedded with an embedding model and stored in a ChromaDB vector store; a similarity search retrieves the K nearest neighbors, which fill a prompt template; the resulting prompt goes to GPT-4, which returns an answer, and user feedback is collected on that answer.]
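A minimal sketch of this pipeline, assuming the chromadb and openai Python clients; the collection name, documents, and prompt wording are illustrative:

```python
import chromadb
from openai import OpenAI

# Index docs in a Chroma vector store (Chroma embeds them with its default embedding model)
chroma = chromadb.Client()
docs = chroma.create_collection("wandb-docs")
docs.add(
    ids=["doc-1", "doc-2"],
    documents=["W&B Tables let you log tabular data.", "wandb.log records metrics during a run."],
)

# Similarity search: retrieve the K nearest chunks for the user question
question = "How to log a json file to W&B Tables?"
neighbors = docs.query(query_texts=[question], n_results=2)
context = "\n\n".join(neighbors["documents"][0])

# Fill the prompt template and ask GPT-4 for the answer
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```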

4 of 87

User Feedback

| Feedback | Share |
|----------|-------|
| 👍 | 21% |
| 👎 | 8% |
| None | 70% |

5 of 87

Create evaluation set

There are thousands of questions.

We cluster them into topics to efficiently evaluate across the breadth of questions asked.

6 of 87

Select a question from each cluster to shrink the eval set.

This reduces the eval set from thousands of questions to 132. For example:

How to log a json file to W&B Tables?

How do I move the first column to the end of a W&B Table?
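A minimal sketch of this cluster-then-sample step, assuming OpenAI embeddings and scikit-learn; the embedding model and everything besides the target size of 132 are illustrative:

```python
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans

questions = [
    "How to log a json file to W&B Tables?",
    "How do I move the first column to the end of a W&B Table?",
    # ...in practice, thousands of production questions
]

# Embed every question
client = OpenAI()
resp = client.embeddings.create(model="text-embedding-3-small", input=questions)
X = np.array([item.embedding for item in resp.data])

# Cluster into topics, then keep the question closest to each centroid
n_clusters = min(132, len(questions))
kmeans = KMeans(n_clusters=n_clusters).fit(X)

eval_set = []
for c in range(n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    dists = np.linalg.norm(X[members] - kmeans.cluster_centers_[c], axis=1)
    eval_set.append(questions[members[dists.argmin()]])
```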

7 of 87

Evaluate the model: LLM Eval

  • Faithfulness Evaluation: Does the answer accurately reflect the information in the source documents without introducing unverified or incorrect details?

  • Relevancy Evaluation: Does the answer address the user’s query with information related to the question and context provided? (A minimal judge sketch follows below.)
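A minimal LLM-as-judge sketch for these two checks, assuming the openai client; the judge prompt and the 1-5 scale are illustrative, not wandbot's actual evaluator:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Source documents: {context}
Answer: {answer}

Score each criterion from 1 (poor) to 5 (excellent):
- faithfulness: every claim in the answer is supported by the source documents
- relevancy: the answer addresses the question using the provided context
Reply as JSON: {{"faithfulness": <score>, "relevancy": <score>}}"""

def judge(question: str, context: str, answer: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        temperature=0,
    )
    return resp.choices[0].message.content  # e.g. '{"faithfulness": 4, "relevancy": 5}'
```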

8 of 87

Evaluate the model: Expert Eval

  • Is the link real?
  • Is the question related to wandb?
  • Is it accurate?

9 of 87

Evaluate the model: Expert Eval

You can set your configuration with an argparse object. You can add the args or parse the args and pass them to wandb as follows:

```python
import wandb
import argparse

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument("--param1", type=int, default=42)
args = parser.parse_args()

# Pass the parsed arguments to the W&B run config
wandb.config.update(args)
```

Source: https://docs.wandb.ai/tutorial/track/config

→ Hallucination

→ Can’t find the answer; highlighted a gap in our docs:

I apologize, but I could not find any documentation related to a `cast` method for the `Table` class in wandb. The documented methods for the `Table` class in wandb are `__init__`, `add_data`, `add_column`, `add_computed_column`, `get_column`, `iterrows`, `to_data_frame`, `apply`, `apply_column`, `get_row`, `set_row`, `get_num_rows`, and `get_schema`. Would you like me to provide information on any of these methods?

10 of 87

Traditional Evaluation vs LLM-system Evaluation

[Diagram: the training distribution yields train, valid, and test sets; the production distribution yields prod data. Accuracy drops from 0.90 (train) to 0.89 (valid) to 0.84 (test) to 0.79 (prod), illustrating overfitting/underfitting, domain shift, and drift.]

Inspired by Josh Tobin’s talk at Databricks

11 of 87

Doesn’t work for LLM-system Evaluation

No access to the training (pretraining) distribution.

The production distribution has to be different: your customers aren’t represented by public benchmarks.

Inspired by Josh Tobin’s talk at Databricks

12 of 87

What are some ways of evaluating?

[Chart: evaluation approaches plotted by evaluation data (less like the prod distribution → more like the prod distribution) against evaluation metric (less correlated → more correlated with your use case), roughly ordered by cost:]

  • 💸 Public Benchmarks
  • 💸 LLM eval LLM
  • 💸 💸 User Testing
  • 💸 💸 💸 Human Evaluation

Inspired by Josh Tobin’s talk at Databricks

13 of 87

What are some ways of evaluating?

[Same chart, highlighting 💸 Public Benchmarks]

Pick an LLM that performs well on public benchmarks. Choose a benchmark that correlates with your use case to some extent.

14 of 87

15 of 87

What are some ways of evaluating?

[Same chart, highlighting 💸 💸 💸 Human Evaluation]

Hire annotators to build a gold-standard eval set and evaluate your pipeline. If your pipeline has multiple components, create an eval set to evaluate each component.

⏰⏰⏰⏰⏰

16 of 87

What are some ways of evaluating?

[Same chart, highlighting 💸 💸 User Testing]

Work closely with your customers. Have a dedicated user base. Go back and forth on your changes.

⏰⏰⏰

17 of 87

What are some ways of evaluating?

[Same chart, highlighting 💸 LLM eval LLM]

As crazy as it sounds, it works! It is a quick way of evaluating incremental changes, but think it through properly.

18 of 87

LLM-based System Evaluation - In Practice

Eyeballing

👀

Supervision Based

🕵️‍♀️

LLM Eval LLM

🤝

19 of 87

LLM-based System Evaluation - In Practice

  • Start with a few running examples (3-4).

  • Iterate on the application until it works on those few samples.

  • Try a few harder examples to validate/evaluate.

Eyeballing

👀

Supervision Based

🕵️‍♀️

LLM Eval LLM

🤝

20 of 87

👀 Eyeballing meets W&B Tracers

21 of 87

LLM-based System Evaluation - In Practice

Eyeballing

Supervision Based

LLM Eval LLM

  • Recommended way to evaluate for the best coverage.
  • It is both an expensive and a very slow way of evaluating.
  • It should not be the first way you evaluate.

22 of 87

Supervised Evaluation - A Simple Example

23 of 87

Supervised Evaluation - A Simple Example

"""

The following is the mathematical expression provided by the user.

{question}

Find the answer using the BODMAS rule in the {format_instructions}:

"""

24 of 87

LLM-based System Evaluation - In Practice

Eyeballing

Supervision Based

LLM Eval LLM

  • Fastest way to evaluate your pipeline sensibly.
  • Evaluate every stage of the pipeline.
  • This is a proxy way of evaluating incremental changes to your pipeline.

25 of 87

Prompt Parameter Optimization

maths_sweeps.yaml:

program: maths_sweeps.py
method: random
name: random_maths_sweeps
parameters:
  prompt_template_file:
    values: [
      "data/maths/maths_prompt_template_1.txt",
      "data/maths/maths_prompt_template_2.txt",
      "data/maths/maths_prompt_template_3.txt",
    ]
  temperature:
    values: [0, 0.1, 0.3, 0.5, 0.7, 0.9, 1]
  llm_model_name:
    values: ["gpt-4", "gpt-3.5-turbo", "text-davinci-003"]

26 of 87

Prompt Parameter Optimization

💸 = ⬆️ Acc

27 of 87

Can we improve wandbot by improving our LLM, as opposed to improving our processes around using a pre-trained LLM?


28 of 87

Training and Fine-tuning LLMs

wandb.me/llm-course

29 of 87

Tokens = $$$

30 of 87

RLHF made LLMs more chat-friendly; we can use the results of this for fine-tuning

31 of 87

However, Parameter-Efficient Fine-Tuning (PEFT) allows for cheap, fast, and effective fine-tuning of LLMs

32 of 87

LoRA is currently the most common PEFT method
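A minimal sketch of attaching LoRA adapters with the Hugging Face peft library; the base model and hyperparameters are illustrative:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Load a base model, then train only small low-rank adapter matrices on top of it
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections that get adapters
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of all parameters
```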

33 of 87

34 of 87

As compared to other PEFT methods

35 of 87

Fine-tuning with LoRA works almost as well as full-parameter fine-tuning

[Charts comparing Baseline, LoRA, and Full Parameter fine-tuning]

36 of 87

The dataset may be formatted as follows
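The slide's example isn't reproduced in this text, but instruction-tuning datasets commonly use an instruction/input/output (or prompt/completion) layout; a hypothetical record might look like:

```python
# One hypothetical training record in an instruction-tuning layout
example = {
    "instruction": "Answer the user's question about the W&B API.",
    "input": "How do I log a confusion matrix?",
    "output": "Call wandb.log with wandb.plot.confusion_matrix(...) inside an active run.",
}
```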

37 of 87

38 of 87


A system of record for all ML workflows

39 of 87


A system of record for all ML workflows

Get started in 60 seconds…

!pip install wandb   # Install W&B

import wandb

wandb.init()         # Start experiment

wandb.log(metrics)   # Log metrics + more!

Or less!

!pip install wandb   # Install W&B

from wandb.integration.openai.fine_tuning import WandbLogger

WandbLogger.sync(...)   # Autolog metrics

40 of 87

Weights & Biases The AI Developer Platform

41 of 87

Agents: Self-Reasoning LLMs with Tools

Adapted slides from:

How to Build, Evaluate, and Iterate on LLM Agents, by LlamaIndex and TruEra

42 of 87

From RAG to Agents

[Diagram: Query → RAG → Response]

43 of 87

From RAG to Agents

Agents?

[Same diagram: Query → RAG → Response]

44 of 87

A range of AI agents is possible

Agents that can take action in the real world

  • Book plane tickets
  • Schedule appointments
  • Order DoorDash

Specialized Data Agents

  • Similar to retrieval from vector store
  • But with access to real-time information

General Data Agents

  • Access to more than one tool
  • Can accomplish a wider range of tasks

45 of 87

A range of AI agents is possible

Agents that can take action in the real world

  • Book plane tickets
  • Create calendar invites

Specialized Data Agents

  • Similar to retrieval
  • Access to real-time information

General Data Agents

  • Access to more than one tool
  • Can accomplish a wider range of tasks

46 of 87

Data Agents - LLM-powered knowledge workers

[Diagram: a Data Agent reads the latest emails (Email), retrieves context from a Knowledge Base, analyzes files via an Analysis Agent, and sends updates to Slack.]
47 of 87

Data Agents - Core Components

Agent Reasoning Loop

  • ReAct Agent (any LLM)
  • OpenAI Agent (only OAI)

Tools

Query Engine Tools (RAG pipeline)

  • Code interpreter
  • Slack
  • Notion
  • Zapier
  • … (15+ tools)

48 of 87

How to retrieve & analyze data from a knowledge base?

Use our query engines as “data tools” for your agent:

  • Semantic search
  • Summarization
  • Text-to-SQL
  • Document comparisons
  • Combining Structured Data w/ Unstructured

“Simple” interface: all the agent has to infer is a query string!
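A rough sketch of that query-string-only interface, using hypothetical names rather than a specific agent framework's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QueryEngineTool:
    name: str
    description: str                     # the agent reads this when picking a tool
    query_engine: Callable[[str], str]   # e.g. a RAG pipeline: query string -> answer

    def __call__(self, query: str) -> str:
        # The only thing the agent has to infer is this query string
        return self.query_engine(query)

# Hypothetical tools wrapping different query engines
docs_search = QueryEngineTool(
    name="docs_search",
    description="Semantic search over the W&B documentation.",
    query_engine=lambda q: f"(answer retrieved from docs for: {q})",  # stand-in for a real RAG call
)
sql_tool = QueryEngineTool(
    name="usage_sql",
    description="Text-to-SQL over the usage database.",
    query_engine=lambda q: f"(SQL result for: {q})",
)
```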

49 of 87

Data agents for real-time retrieval

[Diagram: User Input → Reasoning Agent → Tool Input → API Tool → Tool Output → back to the Reasoning Agent; the agent also fetches and writes Conversation History.]

50 of 87

Agent Failure Modes

[Same agent diagram, annotated with failure modes]

  • Wrong tool selection/input
  • Rogue paths

51 of 87

Agent Failure Modes

[Same agent diagram, annotated with failure modes]

  • Wrong tool selection/input
  • Rogue paths
  • Infinite loops

52 of 87

Agent Failure Modes

[Same agent diagram, annotated with failure modes]

  • Wrong tool selection/input
  • Rogue paths
  • Infinite loops
  • Failed API calls

53 of 87

Agent Failure Modes

[Same agent diagram, annotated with failure modes]

  • Wrong tool selection/input
  • Rogue paths
  • Infinite loops
  • Failed API calls
  • Hallucination

54 of 87

Testing Agents for Hallucinations

The Agent Quad

[Diagram: the Agent Quad relates Query, Agent, Context, and Response, evaluated on Tool Selection, Context Relevance, Groundedness, and Answer Relevance.]

55 of 87

Best practices for Agents

Blog Post: https://blog.llamaindex.ai/building-better-tools-for-llm-agents-f8c5a6714f11

  • Writing useful tool prompts
  • Make tools tolerant of partial/faulty inputs (see the sketch below)
  • Prompt engineering error messages
  • Returning the right prompts for “POST” requests
  • Don’t overload the agent with tools
  • Try hierarchical agent modeling
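A sketch of the "tolerant tools" and "prompt-engineered error messages" ideas; the decorator and tool below are hypothetical:

```python
def tolerant_tool(fn):
    """Wrap a tool so faulty inputs come back as instructions the agent can act on."""
    def wrapped(**kwargs):
        try:
            return fn(**kwargs)
        except TypeError as err:
            # Missing or unexpected arguments: tell the agent exactly what the tool expects
            expected = fn.__code__.co_varnames[: fn.__code__.co_argcount]
            return f"Tool call failed ({err}). Expected arguments: {list(expected)}. Retry with valid arguments."
        except Exception as err:
            return f"Tool error: {err}. Rephrase the input and try again."
    return wrapped

@tolerant_tool
def get_quarterly_revenue(company: str, quarter: str) -> str:
    # Hypothetical tool body; a real tool would hit an API or database here
    return f"{company} revenue for {quarter}: <figure from data source>"
```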

56 of 87

Example: WandAgent -> Use the LLM to decide the best action

57 of 87

As LLMs and use cases evolve, so do the usage patterns around agents

Source: Wang et al., A Survey on Large Language Model based Autonomous Agents

58 of 87

Recap RL Agent vs LLM Agent

59 of 87

A crucial aspect is to design rational agent architectures to assist LLMs in maximizing their capabilities

60 of 87

ReAct

61 of 87

ReAct

62 of 87

ReAct

63 of 87

ReAct

64 of 87

ReAct

65 of 87

ReAct as a standard format for LLM behavior fine-tuning
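A hypothetical ReAct-style trace, the kind of record this fine-tuning format implies; the tool name, question, and observation text are made up:

```python
# One ReAct-format example: interleaved Thought / Action / Observation steps
react_trace = """Question: What was the company's revenue last quarter?
Thought: I should look this up with the filings search tool.
Action: search_filings
Action Input: {"query": "revenue last quarter"}
Observation: The filing reports revenue of $X billion for the quarter.
Thought: I now have the answer.
Final Answer: Revenue last quarter was $X billion, per the filing."""
```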

66 of 87

ReAct

67 of 87

Chain of Thought: Verbose thought processes

68 of 87

Example: Financial Analysis with Agents + RAG

Question: “Compare and contrast Uber’s revenue between quarters”

  • Agent: breaks down the question into sub-questions over tools
  • Per-Document RAG: answers each sub-question over a given document via top-k retrieval (a rough sketch follows below)
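A rough sketch of this decomposition pattern (not the LlamaIndex API; the helper functions and sub-question wording are hypothetical):

```python
from openai import OpenAI

client = OpenAI()

def ask_llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return resp.choices[0].message.content

def per_document_rag(question: str, document_text: str) -> str:
    # Stand-in for top-k retrieval + answering over a single filing
    return ask_llm(f"Using this filing excerpt:\n{document_text}\n\nAnswer: {question}")

def compare_quarters(question: str, filings: dict) -> str:
    # 1) The agent breaks the question into one sub-question per document
    sub_questions = {name: f"What was Uber's revenue in {name}?" for name in filings}
    # 2) Each sub-question is answered by per-document RAG
    sub_answers = {name: per_document_rag(q, filings[name]) for name, q in sub_questions.items()}
    # 3) A final call synthesizes the comparison from the sub-answers
    return ask_llm(f"Question: {question}\nSub-answers: {sub_answers}\nCompare and contrast the quarters.")
```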

69 of 87

Demo


70 of 87

Evaluating Agents

71 of 87

Agents: Extending Past Chain of Thought

72 of 87

Reflexion

73 of 87

Reflexion

74 of 87

Reflexion

75 of 87

Reflexion

76 of 87

Tree of Thought: Planning

77 of 87

Tree of Thought: Example

78 of 87

Tree of Thought: How to Generate Thoughts

79 of 87

Tree of Thought: How to Evaluate Thoughts

80 of 87

Tree of Thought: Example of Evaluating Thoughts

81 of 87

Tree of Thought

82 of 87

The paradigm of how we utilize data to improve models is shifting

Source: Wang et al., A Survey on Large Language Model based Autonomous Agents

83 of 87

New prompting paradigms that affect your LLM data distributions are emerging

84 of 87

Case Study: FireAct

85 of 87

Case Study: FireAct

86 of 87

Demo


87 of 87

Stay in touch!

wandb.me/discord

wandb.me/courses

LinkedIn: Anish Shah
