1 of 23

From Workflows to Tools:

Building Adaptive Agents on the Fly

Zora Wang (王芷若)

2 of 23

About Myself

Zora Wang (王芷若)

  • PhD student at Carnegie Mellon University
  • Advised by Daniel Fried and Graham Neubig
  • Research: programmatic approaches to solving real-world tasks

3 of 23

In Today’s Talk

Agent Workflow Memory

TroVE: Tool Making

4 of 23

Agent Workflow Memory

“create a repository named Awesome_DIY_ideas that includes a README file with the links to the most active 6 DIY ideas on DIY subreddit?”

Hard to learn this effectively from training

  • Long-horizon tasks are hard to collect
  • Hard to train on all possibilities

What’s essential: to learn reusable workflows

  • Create a repository
  • Find active ideas in subreddit
  • Create a file with content

Rather than collecting and training on a “comprehensive” set of long-horizon tasks:

Learn a set of reusable workflows that generalize to arbitrarily complex tasks

Wang, Zora Zhiruo, Jiayuan Mao, Daniel Fried, and Graham Neubig. "Agent workflow memory." arXiv preprint arXiv:2409.07429 (2024).

Slight variations of the task: 5 DIY ideas? least active? DIY_ideas?

5 of 23

Agent Workflow Memory

The agent consists of an LM backbone plus a workflow memory; it interacts with the environment by observing the state s and emitting actions. Example query: “Who ordered order #0130?”

Step 1. Obtain Actions (annotate/generate/…)

# I need to click the "Orders" link to see all orders.
click('126')  # id of the button
# I need to find order 0130 in the current page.
scroll(0, 200)
… … … …
# The current page shows order 0130.
send_msg_to_user("Emma Lopez")
stop()

Step 2. Trajectory Evaluation

Did the agent solve the query correctly? Only trajectories judged correct (YES) are passed on; failed ones (NO) are discarded.

Step 3. Induce Workflows

Each induced workflow contains:

  • Task Objective d

This workflow aims to find a customer order with a specified ID.

  • Workflow Trajectory

[env desc] The current page shows.. [reason] I need to click “Orders” to..

[action] click('order-link-id')

… … … …

[env desc] Order {id} is shown.

[reason] Order {id} is found, I will now terminate the task.

[action] stop()

Induced workflows p1 … pn are integrated into the agent’s memory.
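
To make the memory concrete, below is a minimal sketch of how an induced workflow could be represented and injected into the agent’s prompt. The Workflow class and build_prompt helper are illustrative assumptions for this write-up, not the paper’s released implementation.

# Illustrative sketch (assumed representation, not the paper's exact code).
from dataclasses import dataclass
from typing import List

@dataclass
class Workflow:
    objective: str    # task objective d, e.g., finding an order with a specified ID
    steps: List[str]  # [env desc] / [reason] / [action] steps with {placeholders}

    def render(self) -> str:
        return f"## {self.objective}\n" + "\n".join(self.steps)

def build_prompt(memory: List[Workflow], query: str, observation: str) -> str:
    # Assumed format: workflows are prepended as auxiliary context before the query.
    workflows = "\n\n".join(w.render() for w in memory)
    return f"{workflows}\n\nTask: {query}\nObservation: {observation}\nNext action:"

memory = [Workflow(
    objective="This workflow aims to find a customer order with a specified ID.",
    steps=["[action] click('order-link-id')", "[action] stop()"],
)]
print(build_prompt(memory, "Who ordered order #0130?", "The current page shows ..."))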

6 of 23

Two Application Scenarios

Offline: when additional examples are available

  • e.g., training examples annotated by humans, or synthesized by neural models
  • “Training” with the extra examples: induce workflows and add them into memory
  • Then infer on test examples with the learned workflows

Online: when only test queries are available

  • Learn from self-generated, presumably correct past experiences on the fly
  • Test examples are passed in a stream; workflows are induced from successful trajectories and continuously added into agent memory
  • The memory grows over time and is applied for test inference (a loop sketch follows below)
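
A rough sketch of that online loop. The helpers agent_act, evaluate_trajectory, and induce_workflows are stand-ins (stubbed here) for the agent rollout, the correctness check, and the LM-based induction step; the names are assumptions, not the released code.

# Illustrative online loop with stubbed helpers (assumed names, not the released code).
def agent_act(query, memory):
    # Roll out the agent with the current workflow memory; placeholder trajectory.
    return ["click('126')", "send_msg_to_user('Emma Lopez')", "stop()"]

def evaluate_trajectory(query, trajectory):
    # Judge whether the trajectory presumably solved the query; placeholder check.
    return trajectory[-1] == "stop()"

def induce_workflows(trajectory):
    # LM-based induction of reusable workflows; placeholder keeps the raw trajectory.
    return [trajectory]

memory = []                                 # workflow memory, grows over time
test_stream = ["Who ordered order #0130?"]  # test examples arrive one by one
for query in test_stream:
    trajectory = agent_act(query, memory)            # Step 1: obtain actions
    if evaluate_trajectory(query, trajectory):       # Step 2: keep presumably correct trajectories
        memory.extend(induce_workflows(trajectory))  # Step 3: induce workflows, add to memory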

7 of 23

How to Induce Workflows?

[experiences] → MODEL → [workflows]

Two requirements

  • Extract reusable sub-trajectories
  • Remove example-specific contexts (a prompt sketch follows the example below)

Find flights from Seattle to New York on June 5th and only show those that can be purchased with miles.

[link] From Departure Airport or City Your Origin -> CLICK

[textbox] Origin City or Airport -> TYPE: Seattle

[link] SEA Seattle, WA -> CLICK

[link] To Destination Airport or City Your Destination -> CLICK

[textbox] Destination City or Airport -> TYPE: New York

[link] NYC New York City Area Airports, NY -> CLICK

[combobox] Trip Type:, changes will reload the page -> CLICK

[option] One Way -> CLICK

[button] Depart and Return Calendar... -> CLICK

[link] Next -> CLICK

[link] 5 June 2023, Monday -> CLICK

[button] done -> CLICK

[label] Shop with Miles -> CLICK

[button] SUBMIT -> CLICK

Example: travel, airlines domain

# select_oneway_date

This workflow selects a one-way flight and its date.

1. CLICK the [Trip Type: changes will reload the page] box, then CLICK the [One Way] option to select one-way flights.

2. CLICK the [Depart and Return Calendar] button to open the date selector. Navigate using the [Next] link until the desired month is displayed, then CLICK the link corresponding to the {date}.

3. CLICK the [done] button to finish date selection.

# enter_flight_locations

This workflow enters the departure and destination cities/airports for a flight.

1. CLICK the [From Departure Airport or City Your Origin] link.

2. TYPE the {departure-location} in the [Origin City or Airport] textbox and CLICK the {popup-departure} option that matches the input.

3. CLICK the [To Destination Airport or City Your Destination] link.

4. TYPE the {destination-location} in the [Destination City or Airport] textbox and CLICK the {popup-destination} option that matches the input.

# use_miles

Filters flight search results to show only those that can be purchased with miles.

1. Complete steps from `enter_flight_locations` and `select_oneway_date` workflows.

2. CLICK the [Shop with Miles] label to filter results.

3. CLICK the [SUBMIT] button to apply the filter and view results.

… …
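
A hedged sketch of what the induction call might look like: a single prompt asking an LM to extract reusable sub-trajectories and abstract away example-specific values. The prompt wording and the call_lm parameter are assumptions for illustration, not the paper’s actual prompt or API.

# Illustrative induction prompt; call_lm is any prompt -> text completion function.
INDUCE_PROMPT = (
    "Given the solved task and its action trajectory below, extract reusable workflows.\n"
    "Each workflow should cover a common sub-trajectory and replace example-specific\n"
    "values (cities, dates, IDs) with {{placeholders}}.\n\n"
    "Task: {task}\n\nTrajectory:\n{trajectory}\n\nWorkflows:"
)

def induce_workflows(task: str, trajectory: str, call_lm) -> str:
    return call_lm(INDUCE_PROMPT.format(task=task, trajectory=trajectory))

# Usage: induce_workflows(task_text, trajectory_text, call_lm=my_lm_client)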

8 of 23

Experiments: Web Navigation

WebArena

  • Five websites: shopping, CMS, online forum, software engineering, travel
  • Rigorous evaluation: functional correctness

  • Only 812 test queries
    • No ground-truth solution annotation
    • No compatible training data

Mind2Web

  • Stresses generalization: 3 test splits
    • Cross-task
    • Cross-website
    • Cross-domain
  • Lexical-match evaluation (sketched below)
    • Step-wise: element accuracy, action F1, step success
    • Task-wise: success only if all steps succeed
  • Over 2,000 tasks, 137 websites, 31 domains
  • Training data covers part of the tested websites/domains
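
As a rough illustration of the step-wise lexical matching, here is a simplified sketch that assumes element accuracy is an exact match on the chosen element and action F1 is token-level F1 over the action string; see the Mind2Web paper for the precise definitions.

# Simplified step-wise metrics (assumed definitions, for illustration only).
def element_accuracy(pred_elem: str, gold_elem: str) -> float:
    return float(pred_elem == gold_elem)  # exact match on the selected element

def action_f1(pred_action: str, gold_action: str) -> float:
    pred, gold = pred_action.split(), gold_action.split()
    common = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)

def step_success(pred_elem, gold_elem, pred_action, gold_action) -> bool:
    # A step succeeds only if both the element and the action are correct.
    return element_accuracy(pred_elem, gold_elem) == 1.0 and pred_action == gold_action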

9 of 23

WebArena: AWM achieves SOTA

  • AWM achieves a state-of-the-art success rate of 35.5%
  • 51.1% relative increase over the BrowserGym baseline
  • Even a 7.6% increase over SteP, which uses 14 human-expert-written workflows
  • Efficient: uses fewer steps

10 of 23

WebArena: Cross-Task Generalization

  • [template→example] creation: “Show me {location}” → “Show me Pittsburgh”

Evaluate on a subset where each example is derived from a different task template

AWM still achieves the highest success rate!

11 of 23

WebArena: Cross-Task Generalization

Learn to build increasingly complex workflows upon easier workflows induced earlier

12 of 23

Mind2Web: Cross Tasks, WebSites, and Domains

Both AWM methods outperform the MindAct baseline

From task → website → domain, the train-test gap widens:

AWM-online increasingly outperforms AWM-offline

AWM-online naturally generalizes

13 of 23

Mind2Web: Robust to Example Ordering

Random or complexity-based example ordering barely affects the success rate (SR)

Why?

AWM induces sub-task-level workflows, regardless of task complexity

14 of 23

Agents Leverage Tools Beyond Basic Actions

Sometimes it is easier to use a tool than to chain basic actions:

click(12) → type(310, "Pittsburgh weather") → press("Enter") → click(1200) → send_msg_to_user("sunny")

Wang, Zora Zhiruo, et al., "What Are Tools Anyway? A Survey from the Language Model Perspective." CoLM 2024
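
As an illustration, such a chain of primitive actions could be wrapped into one reusable tool. The search_weather function and the stubbed primitive actions below are hypothetical, not taken from the paper or any specific agent framework.

# Hypothetical tool wrapping the primitive action chain above (stubs for illustration).
def click(elem_id): print(f"click({elem_id})")
def type_text(elem_id, text): print(f"type({elem_id}, {text!r})")
def press(key): print(f"press({key!r})")
def send_msg_to_user(msg): print(f"send_msg_to_user({msg!r})")

def search_weather(city: str) -> None:
    # One tool call instead of five primitive actions.
    click(12)                          # focus the search box
    type_text(310, f"{city} weather")  # type the query
    press("Enter")                     # submit the search
    click(1200)                        # open the top result
    send_msg_to_user("sunny")          # report the answer (hard-coded in this stub)

search_weather("Pittsburgh")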

15 of 23

What If No Useful Tools Exist Yet?

Can only solve problems using primitive actions, e.g., Python built-in functions

Or let the agents make new tools!

Prone to errors :(

16 of 23

TroVE: Make Tools with Zero Supervision

17 of 23

How Does TroVE Make Tools?

Using and growing the toolbox

  • Agreement-based selection
  • Periodic toolbox trimming

(both mechanisms sketched below)
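
A simplified sketch of agreement-based selection and periodic trimming, assuming each task produces several candidate solutions and we track how often each tool in the toolbox is reused. The helper names and the trimming threshold are illustrative, not TroVE’s exact procedure.

# Illustrative sketch, not TroVE's released implementation.
from collections import Counter

def select_by_agreement(candidates):
    # candidates: list of (solution_code, answer) pairs sampled for one task.
    # Keep the answer produced most often, then the shortest solution yielding it.
    answer_counts = Counter(ans for _, ans in candidates)
    best_answer, _ = answer_counts.most_common(1)[0]
    agreeing = [c for c in candidates if c[1] == best_answer]
    return min(agreeing, key=lambda c: len(c[0]))

def trim_toolbox(toolbox, usage, min_uses=2):
    # Periodically drop tools that are rarely reused.
    return {name: code for name, code in toolbox.items() if usage[name] >= min_uses}

# Example usage (illustrative values)
candidates = [
    ("calc_rate(df, 'Vacation days')", 3.0),
    ("df.iloc[1] - df.iloc[0]", 3.0),
    ("df.sum()", 42.0),
]
best_solution, best_answer = select_by_agreement(candidates)
toolbox = {"calc_rate": "def calc_rate(df, col): ..."}
usage = Counter({"calc_rate": 5})
toolbox = trim_toolbox(toolbox, usage)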

18 of 23

CodeLLaMa Performance: Math, TableQA, Visual

Evaluation metrics:

  • Answer correctness (acc ↑)
  • Solution complexity (#ops ↓)
  • Toolbox size (#lib ↓)

19 of 23

GPT-4 Performance w/ TroVE

Overall:

  • Higher accuracy
  • Much smaller toolbox

Interestingly:

GPT-4 performs comparably to CodeLLaMa-7B on GQA

20 of 23

Accurate, Efficient Human Verification

An intuitive example: primitive solution vs. TroVE solution

Human verification is 10% more accurate and 31.4-43.0% faster with TroVE solutions.

Primitive solution:

# get the row for each time stamp
row_2015 = df[df["Year"] == 2015]
row_2016 = df[df["Year"] == 2016]
# get the value for each time
value_2015 = row_2015["Vacation days"].values[0]
value_2016 = row_2016["Vacation days"].values[0]
# calculate the rate of change
rate = (value_2016 - value_2015) / 1

Advanced (TroVE) solution:

calc_rate_of_change(df, "Vacation days", "Year", 2015, 2016)

(a hypothetical definition of calc_rate_of_change is sketched below)
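
For concreteness, here is a hypothetical definition of what a tool like calc_rate_of_change might look like. The signature follows the call above, but this body is an illustration, not the tool TroVE actually induced.

# Hypothetical tool definition (illustrative only).
import pandas as pd

def calc_rate_of_change(df: pd.DataFrame, value_col: str, time_col: str,
                        start: int, end: int) -> float:
    # Rate of change of value_col between two points on the time_col axis.
    start_val = df.loc[df[time_col] == start, value_col].values[0]
    end_val = df.loc[df[time_col] == end, value_col].values[0]
    return (end_val - start_val) / (end - start)

# Example usage on a tiny table
df = pd.DataFrame({"Year": [2015, 2016], "Vacation days": [10, 12]})
print(calc_rate_of_change(df, "Vacation days", "Year", 2015, 2016))  # 2.0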

21 of 23

Diverse Tools Across Tasks

22 of 23

To Summarize

Agent Workflow Memory

  • Builds increasingly complex workflows over time
  • Achieves the highest success rate with a more efficient pipeline
  • Generalizes across tasks, websites, and domains

TroVE: Tool Making

  • Higher accuracy on math, table-based QA, and visual tasks
  • More accurate, faster human verification process
  • Makes diverse tools that reflect varied data distributions

23 of 23

Thank You!

Any Questions?