| Date | Open? | Model Size (billion) | Model | Success Rate (%) | Result Source | Work | Traj | Note |
|---|---|---|---|---|---|---|---|---|
| 10/2025 | | - | Claude Code + GBOX MCP | 68 | GBOX AI | GBOX AI | Link | |
| 09/2025 | | - | DeepSky Agent | 66.9 | Self-reported | DeepSky Agent | https://github.com/babelcloud/webarena/tree/main/gbox/trajectories | |
| 10/2025 | | | Narada AI | 64.2 | Self-reported | Narada AI | Link | |
| 02/2025 | | - | IBM CUGA | 61.7 | IBM CUGA | IBM CUGA | html+ json | |
| 01/2025 | | - | OpenAI Operator | 58.1 | OpenAI CUA | OpenAI CUA | Link | System card |
| 08/2024 | | - | Jace.AI | 57.1 | Reported by zetalabs.ai | https://www.jace.ai/ | Action description + Screenshots | Note from the developer of the work, see the comment of the cell |
| 12/2024 | | - | ScribeAgent + GPT-4o | 53 | ScribeAgent | ScribeAgent | Link | ScribeAgent is finetuned with proprietary data |
| 01/2025 | | - | AgentSymbiotic | 52.1 | AgentSymbiotic | AgentSymbiotic | Link | Code |
| 01/2025 | | - | Learn-by-Interact | 48 | Learn-by-interact | Learn-by-interact | Link | |
| 10/2024 | | - | AgentOccam-Judge | 45.7 | AgentOccam-Judge | AgentOccam-Judge | Link | |
| 08/2024 | | - | WebPilot | 37.2 | WebPilot | WebPilot | | No open source code or trajectory released from the work |
| 10/2024 | | - | GUI-API Hybrid Agent | 35.8 | Beyond Browsing | Beyond Browsing | Link | Using both API and GUI |
| 09/2024 | | - | Agent Workflow Memory | 35.5 | AWM | AWM | | |
| 04/2024 | | - | SteP | 33.5 | SteP | SteP | Link | High-level plans are derived by human |
| 06/2025 | | 12 | TTI | 26.1 | TTI | TTI | Link | |
| 04/2024 | | - | BrowserGym + GPT-4 | 23.5 | WorkArena | BrowserGym | | different observation representation |
| 01/2025 | | 32 | AgentTrek-1.0-32B | 22.4 | AgentTrek | AgentTrek | Link | |
| 04/2024 | | - | GPT-4 + Auto Eval | 20.2 | Auto Eval & Refine | Auto Eval & Refine | | |
| 06/2024 | | - | GPT-4o + Tree Search | 19.2 | Tree Search for LM Agents | Tree Search for LM Agents | | |
| 04/2024 | | 7 | AutoWebGLM | 18.2 | AutoWebGLM | AutoWebGLM | | |
| 01/2025 | | 8 | NNetNav | 16.3 | NNetscape | NNetscape | Link | LLama 3.1-8B-instruct fine-tuned on NNetNav6k (a newer version of the dataset where the work keeps the best of 3 trajectories for each instruction, where we use a llama 3.1 70b as the reward model). The model is available here: https://huggingface.co/stanfordnlp/llama8b-nnetnav-wa |
| 06/2023 | | - | gpt-4-0613 | 14.9 | WebArena | GPT | Link | when "not achievable" hint is not provided |
| 05/2024 | | - | gpt-4o-2024-05-13 | 13.1 | WebArena Team | GPT | Link | when "not achievable" hint is provided |
| 06/2023 | | - | gpt-4-0613 | 11.7 | WebArena | GPT | | when "not achievable" hint is provided |
| 05/2024 | | 72 | Patel et al + 2024 | 9.36 | Patel et al + 2024 | Patel et al + 2024 | | |
| 03/2023 | | - | gpt-3.5-turbo-16k-0613 | 8.87 | WebArena | GPT | Link | |
| 09/2023 | | 72 | Qwen-1.5-chat-72b | 7.14 | Patel et al + 2024 | Qwen | | |
| 12/2023 | | - | Gemini Pro | 7.12 | WebArena | Gemini Pro | | |
| 04/2024 | | 70 | Llama3-chat-70b | 7.02 | WebArena Team | Llama3 | | |
| 10/2024 | | 7 | Synatra-CodeLLama7b | 6.28 | Synatra | Synatra | Link | |
| 10/2023 | | 70 | Lemur-chat-70b | 5.3 | Lemur | Lemur | | |
| 03/2024 | | 7 | Agent Flan | 4.68 | Agent Flan | Agent Flan | | |
| 08/2023 | | 34 | CodeLlama-instruct-34b | 4.06 | Lemur | Llama2 | | |
| 10/2023 | | 70 | AgentLM-70b | 3.81 | Agent Tuning | Agent Tuning | | |
| 04/2024 | | 8 | Llama3-chat-8b | 3.32 | WebArena Team | Llama3 | | |
| 02/2024 | | 7 | CodeAct Agent | 2.3 | WebArena Team | CodeAct | | |
| 10/2023 | | 13 | AgentLM-13b | 1.6 | Agent Tuning | Agent Tuning | | |
| 01/2024 | | 8x7 | Mixtral | 1.39 | Gemini In-depth look | Mixtral | | |
| 10/2023 | | 7 | AgentLM-7b | 0.74 | Agent Tuning | Agent Tuning | | |
| 10/2023 | | 7 | FireAct | 0.25 | Agent Flan | FireAct | | |
| 08/2023 | | 7 | CodeLlama-instruct-7b | 0 | WebArena Team | CodeLLama | | |
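
WebArena Subset

| Date | Open? | Model Size (billion) | Model | Success Rate (%) | Result Source | Work | Traj | Note |
|---|---|---|---|---|---|---|---|---|
| 03/2024 | | - | AutoGuide | 43.7 | AutoGuide | AutoGuide | | Reddit subset |
| 09/2024 | | | AutoManual | 65.1 | AutoManual | AutoManual | Link | Reddit subset |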
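
Human Performance

| Date | Open? | Model Size (billion) | Model | Success Rate (%) | Result Source | Work | Traj | Note |
|---|---|---|---|---|---|---|---|---|
| | | | Human | 78.24 | WebArena | | | Selected tasks by templates |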

Comment here or email shuyanzhxxx@gmail.com to submit your work! Please describe how the data entry should look.

Starting in September 2024, we require submissions to include raw trajectories (per-step observation and predicted action) to enhance transparency on the leaderboard.
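
As a rough illustration of what "per-step observation and predicted action" could look like on disk, here is a minimal sketch that stores one step per line as JSON Lines. The leaderboard does not specify a trajectory format, so the `Step` class, its field names (`step`, `observation`, `action`), the `task_id` key, and both helper functions are assumptions made up for this example, not a required schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Step:
    """One step of a trajectory (hypothetical field names, not an official schema)."""
    step: int          # 0-based index of the step within the task
    observation: str   # what the agent observed, e.g. page text or a screenshot path
    action: str        # the action the agent predicted at this step


def dump_trajectory(task_id: int, steps: list[Step]) -> str:
    """Serialize one task's trajectory as JSON Lines, one step per line."""
    return "\n".join(
        json.dumps({"task_id": task_id, **asdict(s)}) for s in steps
    )


def load_trajectory(text: str) -> list[dict]:
    """Parse JSON Lines text back into a list of per-step records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

One line per step keeps each record easy to inspect individually and lets a reviewer stream long trajectories without loading a whole file at once.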