| Date | Open? | Model Size (billion) | Model | Success Rate (%) | Result Source | Work | Traj | Note |
|---|---|---|---|---|---|---|---|---|
| 02/2025 | | - | IBM CUGA | 61.7 | IBM CUGA | IBM CUGA | HTML + JSON | |
| 01/2025 | | - | OpenAI Operator | 58.1 | OpenAI CUA | OpenAI CUA | Link | System card |
| 08/2024 | | - | Jace.AI | 57.1 | Reported by zetalabs.ai | https://www.jace.ai/ | Action descriptions + screenshots | Note from the developer of the work; see the cell comment |
| 12/2024 | | - | ScribeAgent + GPT-4o | 53 | ScribeAgent | ScribeAgent | Link | ScribeAgent is fine-tuned on proprietary data |
| 01/2025 | | - | AgentSymbiotic | 52.1 | AgentSymbiotic | AgentSymbiotic | Link | Code |
| 01/2025 | | - | Learn-by-Interact | 48 | Learn-by-Interact | Learn-by-Interact | Link | |
| 10/2024 | | - | AgentOccam-Judge | 45.7 | AgentOccam-Judge | AgentOccam-Judge | Link | |
| 08/2024 | | - | WebPilot | 37.2 | WebPilot | WebPilot | | No open-source code or trajectories released for this work |
| 10/2024 | | - | GUI-API Hybrid Agent | 35.8 | Beyond Browsing | Beyond Browsing | Link | Uses both API and GUI |
| 09/2024 | | - | Agent Workflow Memory | 35.5 | AWM | AWM | | |
| 04/2024 | | - | SteP | 33.5 | SteP | SteP | Link | High-level plans are derived by humans |
| 06/2025 | | 12 | TTI | 26.1 | TTI | TTI | Link | |
| 04/2024 | | - | BrowserGym + GPT-4 | 23.5 | WorkArena | BrowserGym | | Different observation representation |
| 01/2025 | | 32 | AgentTrek-1.0-32B | 22.4 | AgentTrek | AgentTrek | Link | |
| 04/2024 | | - | GPT-4 + Auto Eval | 20.2 | Auto Eval & Refine | Auto Eval & Refine | | |
| 06/2024 | | - | GPT-4o + Tree Search | 19.2 | Tree Search for LM Agents | Tree Search for LM Agents | | |
| 04/2024 | | 7 | AutoWebGLM | 18.2 | AutoWebGLM | AutoWebGLM | | |
| 01/2025 | | 8 | NNetNav | 16.3 | NNetscape | NNetscape | Link | Llama 3.1-8B-Instruct fine-tuned on NNetNav-6k, a newer version of the dataset in which the work keeps the best of 3 trajectories per instruction, using Llama 3.1-70B as the reward model. Model: https://huggingface.co/stanfordnlp/llama8b-nnetnav-wa |
| 06/2023 | | - | gpt-4-0613 | 14.9 | WebArena | GPT | Link | When the "not achievable" hint is not provided |
| 05/2024 | | - | gpt-4o-2024-05-13 | 13.1 | WebArena Team | GPT | Link | When the "not achievable" hint is provided |
| 06/2023 | | - | gpt-4-0613 | 11.7 | WebArena | GPT | | When the "not achievable" hint is provided |
| 05/2024 | | 72 | Patel et al., 2024 | 9.36 | Patel et al., 2024 | Patel et al., 2024 | | |
| 03/2023 | | - | gpt-3.5-turbo-16k-0613 | 8.87 | WebArena | GPT | Link | |
| 09/2023 | | 72 | Qwen-1.5-chat-72b | 7.14 | Patel et al., 2024 | Qwen | | |
| 12/2023 | | - | Gemini Pro | 7.12 | WebArena | Gemini Pro | | |
| 04/2024 | | 70 | Llama3-chat-70b | 7.02 | WebArena Team | Llama3 | | |
| 10/2024 | | 7 | Synatra-CodeLlama-7b | 6.28 | Synatra | Synatra | Link | |
| 10/2023 | | 70 | Lemur-chat-70b | 5.3 | Lemur | Lemur | | |
| 03/2024 | | 7 | Agent Flan | 4.68 | Agent Flan | Agent Flan | | |
| 08/2023 | | 34 | CodeLlama-instruct-34b | 4.06 | Lemur | Llama2 | | |
| 10/2023 | | 70 | AgentLM-70b | 3.81 | Agent Tuning | Agent Tuning | | |
| 04/2024 | | 8 | Llama3-chat-8b | 3.32 | WebArena Team | Llama3 | | |
| 02/2024 | | 7 | CodeAct Agent | 2.3 | WebArena Team | CodeAct | | |
| 10/2023 | | 13 | AgentLM-13b | 1.6 | Agent Tuning | Agent Tuning | | |
| 01/2024 | | 8x7 | Mixtral | 1.39 | Gemini In-depth look | Mixtral | | |
| 10/2023 | | 7 | AgentLM-7b | 0.74 | Agent Tuning | Agent Tuning | | |
| 10/2023 | | 7 | FireAct | 0.25 | Agent Flan | FireAct | | |
| 08/2023 | | 7 | CodeLlama-instruct-7b | 0 | WebArena Team | CodeLlama | | |
WebArena Subset

| Date | Open? | Model Size (billion) | Model | Success Rate (%) | Result Source | Work | Traj | Note |
|---|---|---|---|---|---|---|---|---|
| 03/2024 | | - | AutoGuide | 43.7 | AutoGuide | AutoGuide | | Reddit subset |
| 09/2024 | | | AutoManual | 65.1 | AutoManual | AutoManual | Link | Reddit subset |
Human Performance

| Date | Open? | Model Size (billion) | Model | Success Rate (%) | Result Source | Work | Traj | Note |
|---|---|---|---|---|---|---|---|---|
| | | | Human | 78.24 | WebArena | | | Selected tasks by templates |
Comment here or email shuyanzhxxx@gmail.com to submit your work! Please include how your data entry should look.
Starting in September 2024, we require submissions to include raw trajectories (per-step observations and predicted actions) to enhance transparency on the leaderboard.
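The leaderboard does not prescribe an exact file format for these raw trajectories, so the sketch below is only a hypothetical illustration of what a per-step record (observation plus predicted action) could look like; the field names (`task_id`, `step`, `observation`, `predicted_action`) and the JSONL layout are assumptions, not a required schema.

```python
import json

# Hypothetical per-step trajectory records for one WebArena task.
# Field names and contents are illustrative assumptions only; confirm
# the expected format with the leaderboard maintainers before submitting.
steps = [
    {
        "task_id": 42,                                       # example task index
        "step": 0,                                           # 0-based step counter
        "observation": "[1] RootWebArea 'Dashboard' ...",    # e.g. accessibility-tree text
        "predicted_action": "click [1234]",                  # raw action emitted by the agent
    },
    {
        "task_id": 42,
        "step": 1,
        "observation": "[1] RootWebArea 'Orders' ...",
        "predicted_action": "stop ['$31.40']",
    },
]

# One JSON object per line (JSONL) is a common choice for trajectory dumps.
with open("task_42_trajectory.jsonl", "w") as f:
    for record in steps:
        f.write(json.dumps(record) + "\n")
```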