| Date | Open? | Model Size (billion) | Model | Success Rate (%) | Result Source | Work | Traj | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 09/2025 | | - | DeepSky Agent | 66.9 | Self-reported | DeepSky Agent | Link | |
| 02/2025 | | - | IBM CUGA | 61.7 | IBM CUGA | IBM CUGA | html + json | |
| 01/2025 | | - | OpenAI Operator | 58.1 | OpenAI CUA | OpenAI CUA | Link | System card |
| 08/2024 | | - | Jace.AI | 57.1 | Reported by zetalabs.ai | https://www.jace.ai/ | Action description + Screenshots | Note from the developer of the work, see the comment of the cell |
| 12/2024 | | - | ScribeAgent + GPT-4o | 53 | ScribeAgent | ScribeAgent | Link | ScribeAgent is finetuned with proprietary data |
| 01/2025 | | - | AgentSymbiotic | 52.1 | AgentSymbiotic | AgentSymbiotic | Link | Code |
| 01/2025 | | - | Learn-by-Interact | 48 | Learn-by-interact | Learn-by-interact | Link | |
| 10/2024 | | - | AgentOccam-Judge | 45.7 | AgentOccam-Judge | AgentOccam-Judge | Link | |
| 08/2024 | | - | WebPilot | 37.2 | WebPilot | WebPilot | | No open source code or trajectory released from the work |
| 10/2024 | | - | GUI-API Hybrid Agent | 35.8 | Beyond Browsing | Beyond Browsing | Link | Using both API and GUI |
| 09/2024 | | - | Agent Workflow Memory | 35.5 | AWM | AWM | | |
| 04/2024 | | - | SteP | 33.5 | SteP | SteP | Link | High-level plans are derived by human |
| 06/2025 | | 12 | TTI | 26.1 | TTI | TTI | Link | |
| 04/2024 | | - | BrowserGym + GPT-4 | 23.5 | WorkArena | BrowserGym | | different observation representation |
| 01/2025 | | 32 | AgentTrek-1.0-32B | 22.4 | AgentTrek | AgentTrek | Link | |
| 04/2024 | | - | GPT-4 + Auto Eval | 20.2 | Auto Eval & Refine | Auto Eval & Refine | | |
| 06/2024 | | - | GPT-4o + Tree Search | 19.2 | Tree Search for LM Agents | Tree Search for LM Agents | | |
| 04/2024 | | 7 | AutoWebGLM | 18.2 | AutoWebGLM | AutoWebGLM | | |
| 01/2025 | | 8 | NNetNav | 16.3 | NNetscape | NNetscape | Link | LLama 3.1-8B-instruct fine-tuned on NNetNav6k (a newer version of the dataset where the work keeps the best of 3 trajectories for each instruction, where we use a llama 3.1 70b as the reward model). The model is available here: https://huggingface.co/stanfordnlp/llama8b-nnetnav-wa |
| 06/2023 | | - | gpt-4-0613 | 14.9 | WebArena | GPT | Link | when "not achievable" hint is not provided |
| 05/2024 | | - | gpt-4o-2024-05-13 | 13.1 | WebArena Team | GPT | Link | when "not achievable" hint is provided |
| 06/2023 | | - | gpt-4-0613 | 11.7 | WebArena | GPT | | when "not achievable" hint is provided |
| 05/2024 | | 72 | Patel et al + 2024 | 9.36 | Patel et al + 2024 | Patel et al + 2024 | | |
| 03/2023 | | - | gpt-3.5-turbo-16k-0613 | 8.87 | WebArena | GPT | Link | |
| 09/2023 | | 72 | Qwen-1.5-chat-72b | 7.14 | Patel et al + 2024 | Qwen | | |
| 12/2023 | | - | Gemini Pro | 7.12 | WebArena | Gemini Pro | | |
| 04/2024 | | 70 | Llama3-chat-70b | 7.02 | WebArena Team | Llama3 | | |
| 10/2024 | | 7 | Synatra-CodeLLama7b | 6.28 | Synatra | Synatra | Link | |
| 10/2023 | | 70 | Lemur-chat-70b | 5.3 | Lemur | Lemur | | |
| 03/2024 | | 7 | Agent Flan | 4.68 | Agent Flan | Agent Flan | | |
| 08/2023 | | 34 | CodeLlama-instruct-34b | 4.06 | Lemur | Llama2 | | |
| 10/2023 | | 70 | AgentLM-70b | 3.81 | Agent Tuning | Agent Tuning | | |
| 04/2024 | | 8 | Llama3-chat-8b | 3.32 | WebArena Team | Llama3 | | |
| 02/2024 | | 7 | CodeAct Agent | 2.3 | WebArena Team | CodeAct | | |
| 10/2023 | | 13 | AgentLM-13b | 1.6 | Agent Tuning | Agent Tuning | | |
| 01/2024 | | 8x7 | Mixtral | 1.39 | Gemini In-depth look | Mixtral | | |
| 10/2023 | | 7 | AgentLM-7b | 0.74 | Agent Tuning | Agent Tuning | | |
| 10/2023 | | 7 | FireAct | 0.25 | Agent Flan | FireAct | | |
| 08/2023 | | 7 | CodeLlama-instruct-7b | 0 | WebArena Team | CodeLLama | | |

WebArena Subset

| Date | Open? | Model Size (billion) | Model | Success Rate (%) | Result Source | Work | Traj | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 03/2024 | | - | AutoGuide | 43.7 | AutoGuide | AutoGuide | | Reddit subset |
| 09/2024 | | | AutoManual | 65.1 | AutoManual | AutoManual | Link | Reddit subset |

Human Performance

| Date | Open? | Model Size (billion) | Model | Success Rate (%) | Result Source | Work | Traj | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | | Human | 78.24 | WebArena | | | Selected tasks by templates |

Comment here or email shuyanzhxxx@gmail.com to submit your work! Please attach an example of how the data entry should look.
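
As a rough illustration of what such an entry could contain, the sketch below assembles one row from the sheet's column names. The `format_entry` helper and the pipe-separated layout are assumptions made for illustration, not a required submission format, and "Date" is an assumed label for the first column. The values are copied from the existing SteP row.

```python
# Column names follow the leaderboard header. "Date" is an assumed label
# for the first column, which holds MM/YYYY values.
COLUMNS = [
    "Date", "Open?", "Model Size (billion)", "Model", "Success Rate (%)",
    "Result Source", "Work", "Traj", "Note",
]


def format_entry(entry: dict) -> str:
    """Render one leaderboard entry as a single pipe-separated line.

    Hypothetical helper: fields missing from `entry` are left empty,
    and keys that are not leaderboard columns are rejected.
    """
    unknown = set(entry) - set(COLUMNS)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    cells = [str(entry.get(col, "")) for col in COLUMNS]
    return "| " + " | ".join(cells) + " |"


# Values taken from the existing SteP row of the table.
step_entry = {
    "Date": "04/2024",
    "Model Size (billion)": "-",
    "Model": "SteP",
    "Success Rate (%)": 33.5,
    "Result Source": "SteP",
    "Work": "SteP",
    "Traj": "Link",
    "Note": "High-level plans are derived by human",
}

print(format_entry(step_entry))
```

Rejecting unknown keys is a deliberate choice here: a misspelled column name raises an error instead of silently producing an empty cell.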
Starting in September 2024, we require submissions to include raw trajectories (per-step observation and predicted action), to enhance transparency on the leaderboard.
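
One way to package such trajectories is as JSON Lines, with one record per step. The sketch below is only an illustration under that assumption: the `Step` fields, the `save_trajectory` and `load_trajectory` helpers, and the example observation and action strings are all hypothetical, and the leaderboard does not prescribe this layout.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Step:
    """One step of an agent run (hypothetical schema)."""
    task_id: int       # which benchmark task the step belongs to
    step: int          # 0-based position within the trajectory
    observation: str   # what the agent saw, e.g. page text or a screenshot path
    action: str        # the action the agent predicted for this observation


def save_trajectory(steps: list[Step], path: Path) -> None:
    """Write each step as one JSON object per line (JSON Lines)."""
    with path.open("w", encoding="utf-8") as f:
        for s in steps:
            f.write(json.dumps(asdict(s)) + "\n")


def load_trajectory(path: Path) -> list[Step]:
    """Read a JSON Lines file back into Step objects, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [Step(**json.loads(line)) for line in f if line.strip()]


# Made-up example values, for illustration only.
trajectory = [
    Step(0, 0, "[1] textbox 'Search'", "type [1] [running shoes]"),
    Step(0, 1, "[7] link 'Running Shoes'", "click [7]"),
]
out_path = Path(tempfile.gettempdir()) / "task_0.jsonl"
save_trajectory(trajectory, out_path)
```

Keeping one step per line means a reviewer can inspect or diff a single step without parsing the whole run.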