From Workflows to Tools:
Building Adaptive Agents on the Fly
Zora Wang (王芷若)
About Myself
Zora Wang (王芷若)
In Today’s Talk
Agent Workflow Memory
TroVE: Tool Making
Agent Workflow Memory
“create a repository named Awesome_DIY_ideas that includes a README file with the links to the most active 6 DIY ideas on DIY subreddit?”
Hard to learn this effectively from training
What’s essential: to learn reusable workflows
Collect and train on a “comprehensive” set of long-horizon tasks
Learn a set of reusable workflows that generalize to arbitrarily complex tasks
Wang, Zora Zhiruo, Jiayuan Mao, Daniel Fried, and Graham Neubig. "Agent workflow memory." arXiv preprint arXiv:2409.07429 (2024).
5 DIY ideas?
least active?
DIY_ideas?
Agent Workflow Memory
Environment
state s
LM Backbone
Memory
Agent
action
observation
Who ordered order #0130?
Step 2.
Trajectory Evaluation
Solve query correctly?
YES
NO
pass
…
Step 3. Induce Workflows
This workflow aims to find a customer order with a specified ID.
[env desc] The current page shows.. [reason] I need to click “Orders” to..
[action] click('order-link-id')
… … … …
[env desc] Order {id} is shown.
[reason] Order {id} is found, I will now terminate the task.
[action] stop()
p1
pn
integrate into memory
# I need to click the “Orders” link to see all orders.
click('126') # id of the button
# I need to find order 0130 in the current page.
scroll(0, 200)
… … … …
# The current page shows order 0130.
send_msg_to_user("Emma Lopez")
stop()
Step 1. Obtain Actions (annotate/generate/…)
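The three steps above can be sketched as one loop iteration. This is a minimal sketch, not the AWM implementation: `Step`, `WorkflowMemory`, `awm_step`, and the `demo_*` stubs are hypothetical names, and in a real agent the acting, evaluating, and inducing functions would be backed by an LM and a browser environment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    reason: str  # the agent's stated rationale for the action
    action: str  # the action string, e.g. "click('126')"


@dataclass
class WorkflowMemory:
    workflows: List[str] = field(default_factory=list)

    def integrate(self, new_workflows: List[str]) -> None:
        # Skip exact duplicates so memory does not fill with repeats.
        for wf in new_workflows:
            if wf not in self.workflows:
                self.workflows.append(wf)


def awm_step(query, act, evaluate, induce, memory) -> bool:
    trajectory = act(query, memory.workflows)    # Step 1: obtain actions
    if not evaluate(query, trajectory):          # Step 2: trajectory evaluation
        return False                             # NO: pass, nothing is stored
    memory.integrate(induce(query, trajectory))  # Step 3: induce workflows
    return True


# Hypothetical stubs standing in for the LM-backed components.
def demo_act(query, workflows):
    return [Step("I need to click the Orders link.", "click('126')"),
            Step("The current page shows order 0130.", "stop()")]


def demo_evaluate(query, trajectory):
    return trajectory[-1].action == "stop()"


def demo_induce(query, trajectory):
    return ["find_order_by_id: " + " -> ".join(s.action for s in trajectory)]


memory = WorkflowMemory()
solved = awm_step("Who ordered order #0130?",
                  demo_act, demo_evaluate, demo_induce, memory)
```

Only trajectories that pass evaluation reach the induction step, so a failed attempt leaves the memory unchanged.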
Two Application Scenarios
When additional examples are available
e.g., training examples annotated by humans, or synthesized by neural models
When only test queries are available
Learning on the fly from self-generated past experiences that are judged to be correct
Offline
“Training” w/ extra examples: induce workflows → add workflows into memory
Infer test examples w/ workflows: apply workflows for test inference
Online
Test examples passed in a stream
Continuously adding workflows into agent memory
Alternate between inducing and applying workflows; memory grows over time
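The two scenarios differ mainly in where the experiences come from. A minimal sketch, assuming hypothetical helpers (`induce_offline`, `run_online`, and the `toy_*` functions are illustrative names, not part of AWM): offline induction reads annotated examples once, while the online loop solves each streamed query with the current memory and stores workflows only from attempts judged correct.

```python
from typing import Callable, Iterable, List, Tuple

Trajectory = List[str]


def induce_offline(train_examples: Iterable[Tuple[str, Trajectory]],
                   induce: Callable[[str, Trajectory], List[str]]) -> List[str]:
    """Build workflow memory once from extra (annotated) examples."""
    memory: List[str] = []
    for query, trajectory in train_examples:
        for wf in induce(query, trajectory):
            if wf not in memory:
                memory.append(wf)
    return memory


def run_online(test_queries, solve, judge, induce, memory=None):
    """Stream test queries; memory grows only from attempts judged correct."""
    memory = list(memory or [])
    sizes = []
    for query in test_queries:
        trajectory = solve(query, memory)      # apply current workflows
        if judge(query, trajectory):           # no ground truth is assumed here
            for wf in induce(query, trajectory):
                if wf not in memory:
                    memory.append(wf)
        sizes.append(len(memory))
    return memory, sizes


# Toy stand-ins for the agent, the evaluator, and the inducer.
def toy_solve(query, memory):
    return [f"step for {query}"]


def toy_judge(query, trajectory):
    return "unsolvable" not in query


def toy_induce(query, trajectory):
    return [" ".join(query.split()[:2])]


stream = ["find order 0130", "find order 0131", "unsolvable task", "search reviews"]
online_memory, sizes = run_online(stream, toy_solve, toy_judge, toy_induce)
offline_memory = induce_offline([("find order 0130", ["click", "stop"])], toy_induce)
```

In this toy run the memory size stays flat on a repeated task type and on the failed query, and grows when a new task type is solved.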
How to Induce Workflows?
[experiences] → MODEL → [workflows]
Two requirements
Find flights from Seattle to New York on June 5th and only show those that can be purchased with miles.
[link] From Departure Airport or City Your Origin -> CLICK
[textbox] Origin City or Airport -> TYPE: Seattle
[link] SEA Seattle, WA -> CLICK
[link] To Destination Airport or City Your Destination -> CLICK
[textbox] Destination City or Airport -> TYPE: New York
[link] NYC New York City Area Airports, NY -> CLICK
[combobox] Trip Type:, changes will reload the page -> CLICK
[option] One Way -> CLICK
[button] Depart and Return Calendar... -> CLICK
[link] Next -> CLICK
[link] 5 June 2023, Monday -> CLICK
[button] done -> CLICK
[label] Shop with Miles -> CLICK
[button] SUBMIT -> CLICK
Example: travel, airlines domain
# select_oneway_date
This workflow selects a one-way flight and its date.
1. CLICK the [Trip Type: changes will reload the page] box, then CLICK the [One Way] option to select one-way flights.
2. CLICK the [Depart and Return Calendar] button to open the date selector. Navigate using the [Next] link until the desired month is displayed, then CLICK the link corresponding to the {date}.
3. CLICK the [done] button to finish date selection.
# enter_flight_locations
This workflow enters the departure and destination cities/airports for a flight.
1. CLICK the [From Departure Airport or City Your Origin] link.
2. TYPE the {departure-location} in the [Origin City or Airport] textbox and CLICK the {popup-departure} option that matches the input.
3. CLICK the [To Destination Airport or City Your Destination] link.
4. TYPE the {destination-location} in the [Destination City or Airport] textbox and CLICK the {popup-destination} option that matches the input.
# use_miles
Filters flight search results to show only those that can be purchased with miles.
1. Complete steps from `enter_flight_locations` and `select_oneway_date` workflows.
2. CLICK the [Shop with Miles] label to filter results.
3. CLICK the [SUBMIT] button to apply the filter and view results.
… …
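The induced workflows above have two properties: example-specific values are abstracted into {placeholders}, and later workflows reuse earlier ones as sub-routines. A minimal sketch of that structure, assuming a hypothetical `USE <name>` convention for referencing another workflow and a hypothetical `expand` helper; the [Next] navigation step is omitted for brevity.

```python
import re
from typing import Dict, List

WORKFLOWS: Dict[str, List[str]] = {
    "enter_flight_locations": [
        "CLICK [From Departure Airport or City Your Origin]",
        "TYPE {departure-location} in [Origin City or Airport]",
        "CLICK {popup-departure}",
        "CLICK [To Destination Airport or City Your Destination]",
        "TYPE {destination-location} in [Destination City or Airport]",
        "CLICK {popup-destination}",
    ],
    "select_oneway_date": [
        "CLICK [Trip Type: changes will reload the page]",
        "CLICK [One Way]",
        "CLICK [Depart and Return Calendar]",
        "CLICK {date}",
        "CLICK [done]",
    ],
    "use_miles": [
        "USE enter_flight_locations",  # reuse of an earlier workflow
        "USE select_oneway_date",
        "CLICK [Shop with Miles]",
        "CLICK [SUBMIT]",
    ],
}


def expand(name: str, values: Dict[str, str],
           workflows: Dict[str, List[str]] = WORKFLOWS) -> List[str]:
    """Flatten a workflow into concrete steps, filling {placeholders}."""
    steps: List[str] = []
    for step in workflows[name]:
        if step.startswith("USE "):
            steps.extend(expand(step[4:], values, workflows))
        else:
            # A missing placeholder value raises KeyError rather than guessing.
            steps.append(re.sub(r"\{([^{}]+)\}",
                                lambda m: values[m.group(1)], step))
    return steps


values = {
    "departure-location": "Seattle",
    "popup-departure": "[SEA Seattle, WA]",
    "destination-location": "New York",
    "popup-destination": "[NYC New York City Area Airports, NY]",
    "date": "[5 June 2023, Monday]",
}
plan = expand("use_miles", values)
```

Filling the placeholders with the values from the Seattle to New York query recovers a concrete action sequence close to the original trajectory.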
Experiments: Web Navigation
WebArena
Mind2Web
WebArena: AWM achieves SOTA
WebArena: Cross-Task Generalization
Evaluate on a subset where each example is derived from a different task template
AWM still achieves the highest success rate!
WebArena: Cross-Task Generalization
Learn to build increasingly complex workflows upon easier workflows induced earlier
Mind2Web: Cross Tasks, WebSites, and Domains
Both AWM methods outperform baseline MindAct
From task→website→domain:
(train-test gaps widen)
AWM-online outperforms AWM-offline
AWM-online naturally generalizes
Mind2Web: Robust to Example Ordering
Random or complexity-based example ordering barely affects success rate (SR)
Why?
AWM induces sub-task level workflows, regardless of task complexity
Agents Leverage Tools Beyond Basic Actions
Easier to use tools sometimes
click(12)
→ type(310, "Pittsburgh weather")
→ press("Enter")
→ click(1200)
→ send_msg_to_user("sunny")
Wang, Zora Zhiruo, et al. "What Are Tools Anyway? A Survey from the Language Model Perspective." COLM 2024.
What If No Useful Tools Exist Yet?
Can only solve problems using primitive actions, e.g., Python built-in functions
Or let the agents make new tools!
Prone to errors :(
TroVE: Make Tools with Zero Supervision
How Does TroVE Make Tools?
Using and growing the toolbox
Agreement-based selection
Periodic toolbox trimming
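Two of the components above can be illustrated without supervision signals. This is a sketch under stated assumptions, not TroVE's actual procedure: agreement is taken to mean a majority vote over execution results of sampled programs, ties among agreeing programs are broken by program length, and trimming drops tools used fewer than a threshold number of times. `Candidate`, `select_by_agreement`, and `trim_toolbox` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Candidate:
    program: str           # a sampled solution program
    result: Optional[str]  # its execution result, or None if it failed to run


def select_by_agreement(candidates: List[Candidate]) -> Optional[Candidate]:
    """Pick a program whose result agrees with the most other samples."""
    executed = [c for c in candidates if c.result is not None]
    if not executed:
        return None
    votes = Counter(c.result for c in executed)
    top_result, _ = votes.most_common(1)[0]
    agreeing = [c for c in executed if c.result == top_result]
    # Assumption: among agreeing programs, prefer the shortest one.
    return min(agreeing, key=lambda c: len(c.program))


def trim_toolbox(usage: Dict[str, int], min_uses: int) -> Dict[str, int]:
    """Keep only tools that were used at least `min_uses` times."""
    return {name: n for name, n in usage.items() if n >= min_uses}


samples = [
    Candidate("a_long_program_using_primitives()", "4"),
    Candidate("short()", "4"),
    Candidate("other()", "5"),
    Candidate("broken(", None),
]
best = select_by_agreement(samples)
kept = trim_toolbox({"calc_rate_of_change": 5, "one_off_helper": 1}, min_uses=2)
```

No ground-truth answer is consulted anywhere in the selection, which is what makes the procedure usable with zero supervision.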
CodeLLaMa Performance: Math, TableQA, Visual
Answer correctness (acc ↑)
Solution complexity (#ops ↓)
Toolbox size (#lib ↓)
GPT-4 Performance w/ TroVE
Overall:
Interestingly:
GPT-4 performs comparably to CodeLLaMa-7B on GQA
Accurate, Efficient Human Verification
An intuitive example: primitive solution vs. TroVE solution
10% more accurate
31.4 – 43.0% faster
# get the row for each time stamp
row_2015 = df[df["Year"] == 2015]
row_2016 = df[df["Year"] == 2016]
# get the value for each time
value_2015 = row_2015["Vacation days"].values[0]
value_2016 = row_2016["Vacation days"].values[0]
# calculate the rate of change
rate = (value_2016 - value_2015) / (2016 - 2015)
primitive solution
calc_rate_of_change(df, "Vacation days", "Year", 2015, 2016)
advanced solution
VS
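For illustration, a tool like `calc_rate_of_change` might look as follows. The slide shows only the call, so the body here is an assumption; to keep the sketch dependency-free, the table is a list of row dictionaries standing in for the pandas DataFrame `df`, and the vacation-day values are made up.

```python
from typing import Dict, List, Union

Number = Union[int, float]
Row = Dict[str, Number]


def calc_rate_of_change(rows: List[Row], value_col: str, time_col: str,
                        start: Number, end: Number) -> float:
    """Rate of change of `value_col` between two values of `time_col`."""
    def value_at(t: Number) -> Number:
        # Return the value in the first row whose time column equals t.
        for row in rows:
            if row[time_col] == t:
                return row[value_col]
        raise KeyError(f"no row with {time_col} == {t}")

    return (value_at(end) - value_at(start)) / (end - start)


# Illustrative data only; not taken from the talk.
table = [
    {"Year": 2015, "Vacation days": 20},
    {"Year": 2016, "Vacation days": 23},
]
rate = calc_rate_of_change(table, "Vacation days", "Year", 2015, 2016)
```

A single named call leaves less to check line by line than the primitive solution, where a slip such as reading the wrong row is easy to overlook.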
Diverse Tools Across Tasks
To Summarize
Agent Workflow Memory
TroVE: Tool Making
Thank You!
Any Questions?