1 of 23

From Workflows to Tools:

Building Adaptive Agents on the Fly

Zora Wang (王芷若)

2 of 23

About Myself

Zora Wang (王芷若)

  • PhD student at Carnegie Mellon University
  • Advised by Daniel Fried and Graham Neubig
  • Research: programmatic approaches to solving real-world tasks

3 of 23

In Today’s Talk

Agent Workflow Memory

TroVE: Tool Making

4 of 23

Agent Workflow Memory

“create a repository named Awesome_DIY_ideas that includes a README file with the links to the most active 6 DIY ideas on DIY subreddit?”

Hard to learn this effectively from training

  • Long-horizon tasks are hard to collect
  • Hard to train on all possibilities

What’s essential: to learn reusable workflows

  • Create a repository
  • Find active ideas in subreddit
  • Create a file with content

Rather than collecting and training on a “comprehensive” set of long-horizon tasks:

Learn a set of reusable workflows that generalize to arbitrarily complex tasks

Wang, Zora Zhiruo, Jiayuan Mao, Daniel Fried, and Graham Neubig. "Agent workflow memory." arXiv preprint arXiv:2409.07429 (2024).

Slight variations of the task: 5 DIY ideas? least active? DIY_ideas?

5 of 23

Agent Workflow Memory

The agent consists of an LM backbone plus a workflow memory; it interacts with the environment by observing the state s and emitting actions. Example query: “Who ordered order #0130?”

Step 1. Obtain Actions (annotate/generate/…)

# I need to click the "Orders" link to see all orders.
click('126')  # id of the button
# I need to find order 0130 in the current page.
scroll(0, 200)
… … … …
# The current page shows order 0130.
send_msg_to_user("Emma Lopez")
stop()

Step 2. Trajectory Evaluation

Did the agent solve the query correctly? Only trajectories judged correct (YES) are passed on; failed ones (NO) are discarded.

Step 3. Induce Workflows

Each induced workflow contains:

  • Task Objective d

This workflow aims to find a customer order with a specified ID.

  • Workflow Trajectory

[env desc] The current page shows.. [reason] I need to click “Orders” to..

[action] click('order-link-id')

… … … …

[env desc] Order {id} is shown.

[reason] Order {id} is found, I will now terminate the task.

[action] stop()

Induced workflows p1 … pn are integrated into the agent’s memory.
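
To make the memory concrete, below is a minimal sketch of how an induced workflow could be represented and injected into the agent’s prompt. The Workflow class and build_prompt helper are illustrative assumptions for this write-up, not the paper’s released implementation.

# Illustrative sketch (assumed representation, not the paper's exact code).
from dataclasses import dataclass
from typing import List

@dataclass
class Workflow:
    objective: str    # task objective d, e.g., finding an order with a specified ID
    steps: List[str]  # [env desc] / [reason] / [action] steps with {placeholders}

    def render(self) -> str:
        return f"## {self.objective}\n" + "\n".join(self.steps)

def build_prompt(memory: List[Workflow], query: str, observation: str) -> str:
    # Assumed format: workflows are prepended as auxiliary context before the query.
    workflows = "\n\n".join(w.render() for w in memory)
    return f"{workflows}\n\nTask: {query}\nObservation: {observation}\nNext action:"

memory = [Workflow(
    objective="This workflow aims to find a customer order with a specified ID.",
    steps=["[action] click('order-link-id')", "[action] stop()"],
)]
print(build_prompt(memory, "Who ordered order #0130?", "The current page shows ..."))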

6 of 23

Two Application Scenarios

Offline: when additional examples are available

  • e.g., training examples annotated by humans, or synthesized by neural models
  • “Training” with the extra examples: induce workflows and add them into memory
  • Then infer on test examples with the learned workflows

Online: when only test queries are available

  • Learn from self-generated, presumably correct past experiences on the fly
  • Test examples are passed in a stream; workflows are induced from successful trajectories and continuously added into agent memory
  • The memory grows over time and is applied for test inference (a loop sketch follows below)
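
A rough sketch of that online loop. The helpers agent_act, evaluate_trajectory, and induce_workflows are stand-ins (stubbed here) for the agent rollout, the correctness check, and the LM-based induction step; the names are assumptions, not the released code.

# Illustrative online loop with stubbed helpers (assumed names, not the released code).
def agent_act(query, memory):
    # Roll out the agent with the current workflow memory; placeholder trajectory.
    return ["click('126')", "send_msg_to_user('Emma Lopez')", "stop()"]

def evaluate_trajectory(query, trajectory):
    # Judge whether the trajectory presumably solved the query; placeholder check.
    return trajectory[-1] == "stop()"

def induce_workflows(trajectory):
    # LM-based induction of reusable workflows; placeholder keeps the raw trajectory.
    return [trajectory]

memory = []                                 # workflow memory, grows over time
test_stream = ["Who ordered order #0130?"]  # test examples arrive one by one
for query in test_stream:
    trajectory = agent_act(query, memory)            # Step 1: obtain actions
    if evaluate_trajectory(query, trajectory):       # Step 2: keep presumably correct trajectories
        memory.extend(induce_workflows(trajectory))  # Step 3: induce workflows, add to memory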

7 of 23

How to Induce Workflows?

[experiences] → MODEL → [workflows]

Two requirements

  • Extract reusable sub-trajectories
  • Remove example-specific contexts (a prompt sketch follows the example below)

Find flights from Seattle to New York on June 5th and only show those that can be purchased with miles.

[link] From Departure Airport or City Your Origin -> CLICK

[textbox] Origin City or Airport -> TYPE: Seattle

[link] SEA Seattle, WA -> CLICK

[link] To Destination Airport or City Your Destination -> CLICK

[textbox] Destination City or Airport -> TYPE: New York

[link] NYC New York City Area Airports, NY -> CLICK

[combobox] Trip Type:, changes will reload the page -> CLICK

[option] One Way -> CLICK

[button] Depart and Return Calendar... -> CLICK

[link] Next -> CLICK

[link] 5 June 2023, Monday -> CLICK

[button] done -> CLICK

[label] Shop with Miles -> CLICK

[button] SUBMIT -> CLICK

Example: travel, airlines domain

# select_oneway_date

This workflow selects a one-way flight and its date.

1. CLICK the [Trip Type: changes will reload the page] box, then CLICK the [One Way] option to select one-way flights.

2. CLICK the [Depart and Return Calendar] button to open the date selector. Navigate using the [Next] link until the desired month is displayed, then CLICK the link corresponding to the {date}.

3. CLICK the [done] button to finish date selection.

# enter_flight_locations

This workflow enters the departure and destination cities/airports for a flight.

1. CLICK the [From Departure Airport or City Your Origin] link.

2. TYPE the {departure-location} in the [Origin City or Airport] textbox and CLICK the {popup-departure} option that matches the input.

3. CLICK the [To Destination Airport or City Your Destination] link.

4. TYPE the {destination-location} in the [Destination City or Airport] textbox and CLICK the {popup-destination} option that matches the input.

# use_miles

Filters flight search results to show only those that can be purchased with miles.

1. Complete steps from `enter_flight_locations` and `select_oneway_date` workflows.

2. CLICK the [Shop with Miles] label to filter results.

3. CLICK the [SUBMIT] button to apply the filter and view results.

… …
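
A hedged sketch of what the induction call might look like: a single prompt asking an LM to extract reusable sub-trajectories and abstract away example-specific values. The prompt wording and the call_lm parameter are assumptions for illustration, not the paper’s actual prompt or API.

# Illustrative induction prompt; call_lm is any prompt -> text completion function.
INDUCE_PROMPT = (
    "Given the solved task and its action trajectory below, extract reusable workflows.\n"
    "Each workflow should cover a common sub-trajectory and replace example-specific\n"
    "values (cities, dates, IDs) with {{placeholders}}.\n\n"
    "Task: {task}\n\nTrajectory:\n{trajectory}\n\nWorkflows:"
)

def induce_workflows(task: str, trajectory: str, call_lm) -> str:
    return call_lm(INDUCE_PROMPT.format(task=task, trajectory=trajectory))

# Usage: induce_workflows(task_text, trajectory_text, call_lm=my_lm_client)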

8 of 23

Experiments: Web Navigation

WebArena

  • Five websites: shopping, CMS, online forum, software engineering, travel
  • Rigorous evaluation: functional correctness

  • Only 812 test queries
    • No ground-truth solution annotation
    • No compatible training data

Mind2Web

  • Stresses generalization: 3 test splits
    • Cross-task
    • Cross-website
    • Cross-domain
  • Lexical-match evaluation (sketched below)
    • Step-wise: element accuracy, action F1, step success
    • Task-wise: success only if all steps succeed
  • Over 2,000 tasks, 137 websites, 31 domains
  • Training data covers part of the tested websites/domains
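
As a rough illustration of the step-wise lexical matching, here is a simplified sketch that assumes element accuracy is an exact match on the chosen element and action F1 is token-level F1 over the action string; see the Mind2Web paper for the precise definitions.

# Simplified step-wise metrics (assumed definitions, for illustration only).
def element_accuracy(pred_elem: str, gold_elem: str) -> float:
    return float(pred_elem == gold_elem)  # exact match on the selected element

def action_f1(pred_action: str, gold_action: str) -> float:
    pred, gold = pred_action.split(), gold_action.split()
    common = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)

def step_success(pred_elem, gold_elem, pred_action, gold_action) -> bool:
    # A step succeeds only if both the element and the action are correct.
    return element_accuracy(pred_elem, gold_elem) == 1.0 and pred_action == gold_action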

9 of 23

WebArena: AWM achieves SOTA

  • AWM achieves a state-of-the-art success rate of 35.5%
  • 51.1% relative increase over the BrowserGym baseline
  • Even a 7.6% increase over SteP, which uses 14 human-expert-written workflows
  • Efficient: uses fewer steps

10 of 23

WebArena: Cross-Task Generalization

  • [template→example] creation: “Show me {location}” → “Show me Pittsburgh”

Evaluate on a subset where each example is derived from a different task template

AWM still achieves the highest success rate!

11 of 23

WebArena: Cross-Task Generalization

Learn to build increasingly complex workflows upon easier workflows induced earlier

12 of 23

Mind2Web: Cross Tasks, WebSites, and Domains

Both AWM methods outperform the MindAct baseline

From task → website → domain, the train-test gap widens:

AWM-online increasingly outperforms AWM-offline

AWM-online naturally generalizes

13 of 23

Mind2Web: Robust to Example Ordering

Random or complexity-based example ordering barely affects the success rate (SR)

Why?

AWM induces sub-task-level workflows, regardless of task complexity

14 of 23

Agents Leverage Tools Beyond Basic Actions

Sometimes it is easier to use a tool than to chain basic actions:

click(12) → type(310, "Pittsburgh weather") → press("Enter") → click(1200) → send_msg_to_user("sunny")

Wang, Zora Zhiruo, et al., "What Are Tools Anyway? A Survey from the Language Model Perspective." CoLM 2024
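
As an illustration, such a chain of primitive actions could be wrapped into one reusable tool. The search_weather function and the stubbed primitive actions below are hypothetical, not taken from the paper or any specific agent framework.

# Hypothetical tool wrapping the primitive action chain above (stubs for illustration).
def click(elem_id): print(f"click({elem_id})")
def type_text(elem_id, text): print(f"type({elem_id}, {text!r})")
def press(key): print(f"press({key!r})")
def send_msg_to_user(msg): print(f"send_msg_to_user({msg!r})")

def search_weather(city: str) -> None:
    # One tool call instead of five primitive actions.
    click(12)                          # focus the search box
    type_text(310, f"{city} weather")  # type the query
    press("Enter")                     # submit the search
    click(1200)                        # open the top result
    send_msg_to_user("sunny")          # report the answer (hard-coded in this stub)

search_weather("Pittsburgh")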

15 of 23

What If No Useful Tools Exist Yet?

Can only solve problems using primitive actions, e.g., Python built-in functions

Or let the agents make new tools!

Prone to errors :(

16 of 23

TroVE: Make Tools with Zero Supervision

17 of 23

How Does TroVE Make Tools?

Using and growing the toolbox

  • Agreement-based selection
  • Periodic toolbox trimming

(both mechanisms sketched below)
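
A simplified sketch of agreement-based selection and periodic trimming, assuming each task produces several candidate solutions and we track how often each tool in the toolbox is reused. The helper names and the trimming threshold are illustrative, not TroVE’s exact procedure.

# Illustrative sketch, not TroVE's released implementation.
from collections import Counter

def select_by_agreement(candidates):
    # candidates: list of (solution_code, answer) pairs sampled for one task.
    # Keep the answer produced most often, then the shortest solution yielding it.
    answer_counts = Counter(ans for _, ans in candidates)
    best_answer, _ = answer_counts.most_common(1)[0]
    agreeing = [c for c in candidates if c[1] == best_answer]
    return min(agreeing, key=lambda c: len(c[0]))

def trim_toolbox(toolbox, usage, min_uses=2):
    # Periodically drop tools that are rarely reused.
    return {name: code for name, code in toolbox.items() if usage[name] >= min_uses}

# Example usage (illustrative values)
candidates = [
    ("calc_rate(df, 'Vacation days')", 3.0),
    ("df.iloc[1] - df.iloc[0]", 3.0),
    ("df.sum()", 42.0),
]
best_solution, best_answer = select_by_agreement(candidates)
toolbox = {"calc_rate": "def calc_rate(df, col): ..."}
usage = Counter({"calc_rate": 5})
toolbox = trim_toolbox(toolbox, usage)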

18 of 23

CodeLLaMa Performance: Math, TableQA, Visual

Evaluation metrics:

  • Answer correctness (acc ↑)
  • Solution complexity (#ops ↓)
  • Toolbox size (#lib ↓)

19 of 23

GPT-4 Performance w/ TroVE

Overall:

  • Higher accuracy
  • Much smaller toolbox

Interestingly:

GPT-4 performs comparably to CodeLLaMa-7B on GQA

20 of 23

Accurate, Efficient Human Verification

An intuitive example: primitive solution vs. TroVE solution

Human verification is 10% more accurate and 31.4-43.0% faster with TroVE solutions.

Primitive solution:

# get the row for each time stamp
row_2015 = df[df["Year"] == 2015]
row_2016 = df[df["Year"] == 2016]
# get the value for each time
value_2015 = row_2015["Vacation days"].values[0]
value_2016 = row_2016["Vacation days"].values[0]
# calculate the rate of change
rate = (value_2016 - value_2015) / 1

Advanced (TroVE) solution:

calc_rate_of_change(df, "Vacation days", "Year", 2015, 2016)

(a hypothetical definition of calc_rate_of_change is sketched below)
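
For concreteness, here is a hypothetical definition of what a tool like calc_rate_of_change might look like. The signature follows the call above, but this body is an illustration, not the tool TroVE actually induced.

# Hypothetical tool definition (illustrative only).
import pandas as pd

def calc_rate_of_change(df: pd.DataFrame, value_col: str, time_col: str,
                        start: int, end: int) -> float:
    # Rate of change of value_col between two points on the time_col axis.
    start_val = df.loc[df[time_col] == start, value_col].values[0]
    end_val = df.loc[df[time_col] == end, value_col].values[0]
    return (end_val - start_val) / (end - start)

# Example usage on a tiny table
df = pd.DataFrame({"Year": [2015, 2016], "Vacation days": [10, 12]})
print(calc_rate_of_change(df, "Vacation days", "Year", 2015, 2016))  # 2.0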

21 of 23

Diverse Tools Across Tasks

22 of 23

To Summarize

Agent Workflow Memory

  • Builds increasingly complex workflows over time
  • Achieves the highest success rate with a more efficient pipeline
  • Generalizes across tasks, websites, and domains

TroVE: Tool Making

  • Higher accuracy on math, table-based QA, and visual tasks
  • More accurate, faster human verification process
  • Makes diverse tools that reflect varied data distributions

23 of 23

Thank You!

Any Questions?