This data was compiled and organized by LDJ (@LDJConfirmed on X/Twitter).
If you have any questions about where I'm deriving any data points from, feel free to reach out!
(For those without X/Twitter, you can also reach out to me via LDJmachineLearning@gmail.com)
These benchmarks are arranged from the largest to the smallest gap between the minimally assisted SOTA model and the human baseline (a quick sketch that computes these gaps follows the benchmark data below). Current total: 10 benchmarks.
WorkArena++ (white-collar computer tasks)
- Average human: 94%
- GPT-4o: 15.2%
- Llama3-70b: 5.96%
- Mixtral-8x22b: 4.13%
- GPT-3.5: 2.03%
GAIA (real-world multi-step assistant tasks)
- Average human baseline (mostly college undergrads): 92%
- omne v0.1 + o1-preview: 40.53%
- Magentic-One + o1-preview: 38%
- Hugging Face Agents + GPT-4o: 33.33%
- Magentic-One + GPT-4o: 32.33%
- GPT-4 + manually selected plugins: 14.6%
- GPT-4o alone: 9.3%
- GPT-4 Turbo alone: 6.67%
- GPT-4 alone: 4%
- GPT-3.5 alone: 2.67%
OSWorld (CUA / operating system interactions)
- Human Baseline: 72.36%
- Screenshots + Claude 3.5 Sonnet (new), when allowed extra steps: 22%
- Screenshots + Claude 3.5 Sonnet (new): 14.9%
- A11y tree + GPT-4: 12.24%
- A11y tree + GPT-4o: 11.36%
- A11y tree + Gemini-Pro-1.5: 4.81%
- A11y tree + Mixtral-8x7B: 2.98%
- A11y tree + GPT-3.5: 2.69%
- A11y tree + Llama-3-70B: 1.61%
ARC-AGI (abstract pattern understanding and reasoning)
(the following model scores are from the semi-private leaderboard)
- Average human: ~75%
- o1-preview: 18%
- Claude-3.5-sonnet: 14%
- o1-mini: 9.5%
- GPT-4o: 5%
- Gemini-1.5-pro: 4.5%
WebArena (diverse web interface navigation)
- Average human: 78.2%
- GPT-4o: 24.0%
- Mixtral-8x22b: 12.6%
- Llama3-70b: 11.0%
- GPT-3.5: 6.7%
SimpleBench (basic spatial, temporal, social, and linguistic reasoning)
- Average human baseline (Amazon Mechanical Turk workers): 83.7%
- o1-preview: 41.7%
- Claude 3.5 Sonnet 10-22: 41.4%
- Claude 3.5 Sonnet 06-20: 27.5%
- Gemini 1.5 Pro 002: 27.1%
- Llama 3.1 405b instruct: 23%
- GPT-4o 08-06: 17.8%
- Claude 3.5 Haiku: 15.6%
LAB-Bench (biology research assistance)
- Average biology PhDs: 78%
- Claude 3.5 Sonnet (Old): 36%
- Claude 3 Haiku: 35%
- GPT-4o: 33%
- Meta-Llama-3-70B-Instruct: 31%
NoCha (very long-context understanding and comprehension)
- Average human reader: 97.4%
- o1-preview: 67.36%
- GPT-4o: 55.60%
- Claude 3.5 Sonnet: 48.88%
- Gemini 1.5 Pro: 48.29%
- LLaMA 3.1 405B: 47.21%
- LLaMA 3.1 70B: 27.89%
- LLaMA 3.1 8B: 16.53%
MiniWoB (interface navigation)
- Average human: 93.5%
- GPT-4o: 72.5%
- Llama3-70b: 68.2%
- Mixtral-8x22b: 62.4%
- GPT-3.5: 43.4%
PutnamBench (advanced mathematics)
- Human baseline (exceptional math undergrads): ~10%
- PhD human at Meta AI: 1.09%
- InternLM2.5-StepProver: 0.93%
- GPT-4o: 0.16%
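To make the gap-based ordering concrete, here's a rough Python sketch (just an illustration using the figures listed above; which entry counts as the "minimally assisted" SOTA is a judgment call, so it simply takes the top model score listed for each benchmark):

# Human baseline vs. top listed model score per benchmark (percentages from the lists above).
benchmarks = [
    # (benchmark, human baseline %, top model score listed %)
    ("WorkArena++", 94.0, 15.2),
    ("GAIA", 92.0, 40.53),
    ("OSWorld", 72.36, 22.0),
    ("ARC-AGI", 75.0, 18.0),
    ("WebArena", 78.2, 24.0),
    ("SimpleBench", 83.7, 41.7),
    ("LAB-Bench", 78.0, 36.0),
    ("NoCha", 97.4, 67.36),
    ("MiniWoB", 93.5, 72.5),
    ("PutnamBench", 10.0, 0.93),
]

# Print the human-minus-model gap for each benchmark, in the order listed above.
for name, human, model in benchmarks:
    print(f"{name:12} human {human:5.1f}%  best model {model:5.2f}%  gap {human - model:5.1f} pts")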
I’m thinking perhaps we should call these… Moravec-style benchmarks? In reference to Moravec's Paradox: the observation that it’s hard for AI to do certain things that humans often find relatively easy.
The natural next step, of course, is benchmarks that test the original Moravec's-paradox territory of robotic abilities versus humans, but I don’t think we have enough data on that yet to draw useful insights.
I would describe it as three levels of Moravec difficulty:
1. Linguistic (things like interpreting nuance in tests like Winogrande, and long-context understanding like NoCha, but we’ve mostly moved past these)
2. Agentic (things like ARC-AGI, GAIA, and especially WorkArena++)
3. Robotic (robust control over your own body: things like navigating stairs, walking a tightrope, hiking a mountain)