| Date | Open? | Model Size (billion) | Model | Success Rate (%) | Result Source | Work | Traj | Note |
|---|---|---|---|---|---|---|---|---|
| 09/2025 | | - | DeepSky Agent | 66.9 | Self-reported | DeepSky Agent | Link | |
| 10/2025 | | | Narada AI | 64.2 | Self-reported | Narada AI | Link | |
| 02/2025 | | - | IBM CUGA | 61.7 | IBM CUGA | IBM CUGA | html + json | |
| 01/2025 | | - | OpenAI Operator | 58.1 | OpenAI CUA | OpenAI CUA | Link | System card |
| 08/2024 | | - | Jace.AI | 57.1 | Reported by zetalabs.ai | https://www.jace.ai/ | Action description + Screenshots | Note from the developer of the work, see the comment of the cell |
| 12/2024 | | - | ScribeAgent + GPT-4o | 53 | ScribeAgent | ScribeAgent | Link | ScribeAgent is finetuned with proprietary data |
| 01/2025 | | - | AgentSymbiotic | 52.1 | AgentSymbiotic | AgentSymbiotic | Link | Code |
| 01/2025 | | - | Learn-by-Interact | 48 | Learn-by-interact | Learn-by-interact | Link | |
| 10/2024 | | - | AgentOccam-Judge | 45.7 | AgentOccam-Judge | AgentOccam-Judge | Link | |
| 08/2024 | | - | WebPilot | 37.2 | WebPilot | WebPilot | | No open source code or trajectory released from the work |
| 10/2024 | | - | GUI-API Hybrid Agent | 35.8 | Beyond Browsing | Beyond Browsing | Link | Using both API and GUI |
| 09/2024 | | - | Agent Workflow Memory | 35.5 | AWM | AWM | | |
| 04/2024 | | - | SteP | 33.5 | SteP | SteP | Link | High-level plans are derived by human |
| 06/2025 | | 12 | TTI | 26.1 | TTI | TTI | Link | |
| 04/2024 | | - | BrowserGym + GPT-4 | 23.5 | WorkArena | BrowserGym | | different observation representation |
| 01/2025 | | 32 | AgentTrek-1.0-32B | 22.4 | AgentTrek | AgentTrek | Link | |
| 04/2024 | | - | GPT-4 + Auto Eval | 20.2 | Auto Eval & Refine | Auto Eval & Refine | | |
| 06/2024 | | - | GPT-4o + Tree Search | 19.2 | Tree Search for LM Agents | Tree Search for LM Agents | | |
| 04/2024 | | 7 | AutoWebGLM | 18.2 | AutoWebGLM | AutoWebGLM | | |
| 01/2025 | | 8 | NNetNav | 16.3 | NNetscape | NNetscape | Link | LLama 3.1-8B-instruct fine-tuned on NNetNav6k (a newer version of the dataset where the work keeps the best of 3 trajectories for each instruction, where we use a llama 3.1 70b as the reward model). The model is available here: https://huggingface.co/stanfordnlp/llama8b-nnetnav-wa |
| 06/2023 | | - | gpt-4-0613 | 14.9 | WebArena | GPT | Link | when "not achievable" hint is not provided |
| 05/2024 | | - | gpt-4o-2024-05-13 | 13.1 | WebArena Team | GPT | Link | when "not achievable" hint is provided |
| 06/2023 | | - | gpt-4-0613 | 11.7 | WebArena | GPT | | when "not achievable" hint is provided |
| 05/2024 | | 72 | Patel et al + 2024 | 9.36 | Patel et al + 2024 | Patel et al + 2024 | | |
| 03/2023 | | - | gpt-3.5-turbo-16k-0613 | 8.87 | WebArena | GPT | Link | |
| 09/2023 | | 72 | Qwen-1.5-chat-72b | 7.14 | Patel et al + 2024 | Qwen | | |
| 12/2023 | | - | Gemini Pro | 7.12 | WebArena | Gemini Pro | | |
| 04/2024 | | 70 | Llama3-chat-70b | 7.02 | WebArena Team | Llama3 | | |
| 10/2024 | | 7 | Synatra-CodeLLama7b | 6.28 | Synatra | Synatra | Link | |
| 10/2023 | | 70 | Lemur-chat-70b | 5.3 | Lemur | Lemur | | |
| 03/2024 | | 7 | Agent Flan | 4.68 | Agent Flan | Agent Flan | | |
| 08/2023 | | 34 | CodeLlama-instruct-34b | 4.06 | Lemur | Llama2 | | |
| 10/2023 | | 70 | AgentLM-70b | 3.81 | Agent Tuning | Agent Tuning | | |
| 04/2024 | | 8 | Llama3-chat-8b | 3.32 | WebArena Team | Llama3 | | |
| 02/2024 | | 7 | CodeAct Agent | 2.3 | WebArena Team | CodeAct | | |
| 10/2023 | | 13 | AgentLM-13b | 1.6 | Agent Tuning | Agent Tuning | | |
| 01/2024 | | 8x7 | Mixtral | 1.39 | Gemini In-depth look | Mixtral | | |
| 10/2023 | | 7 | AgentLM-7b | 0.74 | Agent Tuning | Agent Tuning | | |
| 10/2023 | | 7 | FireAct | 0.25 | Agent Flan | FireAct | | |
| 08/2023 | | 7 | CodeLlama-instruct-7b | 0 | WebArena Team | CodeLLama | | |
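
WebArena Subset

| Date | Open? | Model Size (billion) | Model | Success Rate (%) | Result Source | Work | Traj | Note |
|---|---|---|---|---|---|---|---|---|
| 03/2024 | | - | AutoGuide | 43.7 | AutoGuide | AutoGuide | | Reddit subset |
| 09/2024 | | | AutoManual | 65.1 | AutoManual | AutoManual | Link | Reddit subset |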
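
Human Performance

| Date | Open? | Model Size (billion) | Model | Success Rate (%) | Result Source | Work | Traj | Note |
|---|---|---|---|---|---|---|---|---|
| | | | Human | 78.24 | WebArena | | | Selected tasks by templates |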
Comment here or email shuyanzhxxx@gmail.com to submit your work! Please include how the data entry should look.
Starting in September 2024, we require submissions to include raw trajectories (per-step observation and predicted action), to enhance transparency on the leaderboard.
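
As an illustration of what "per-step observation and predicted action" could look like on disk, here is a minimal Python sketch. The leaderboard does not specify a file format, so the class names (`Step`, `Trajectory`), the field names, the JSON layout, and the example observation and action strings are all assumptions made for illustration, not the required submission schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Step:
    """One step: what the agent observed and the action it predicted."""
    observation: str       # hypothetical: e.g. page text or accessibility tree
    predicted_action: str  # hypothetical: e.g. "click [1]"


@dataclass
class Trajectory:
    """Hypothetical container for the raw trajectory of a single task."""
    task_id: int
    steps: list[Step] = field(default_factory=list)

    def add_step(self, observation: str, predicted_action: str) -> None:
        # Append one (observation, predicted action) pair in execution order.
        self.steps.append(Step(observation, predicted_action))

    def to_json(self) -> str:
        # asdict() recursively converts nested dataclasses to plain dicts.
        return json.dumps(asdict(self), indent=2)


# Usage: record two steps for one task and serialize them.
traj = Trajectory(task_id=0)
traj.add_step("[1] link 'Forums'", "click [1]")
traj.add_step("[7] textbox 'Search'", "type [7] [webarena]")
serialized = traj.to_json()
print(serialized)
```

Storing one record per step keeps each predicted action next to the observation it was based on, which is what makes a trajectory checkable by a reader.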