| A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z | AA | AB | AC | AD | AE | AF | AG | AH | AI | AJ | AK | AL | AM | AN | AO | AP | AQ | AR | AS | AT | AU | AV | AW | AX | AY | AZ | BA | BB | BC | BD | BE | BF | BG | BH | BI | BJ | BK | BL | BM | BN | BO | BP | BQ | BR | BS | BT | BU | BV | BW | BX | BY | BZ | CA | CB | CC | CD | CE | CF | CG | CH | CI | CJ | CK | CL | CM | CN | CO | CP | CQ | CR | CS | CT | CU | CV | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | MODEL | o4-mini (high) | gpt-5 (high) | Gemini 2.5 Pro Preview 06-05 | Gemini 2.5 Pro Experimental | o3 pro | o3 (high) | Gemini 2.5 Pro Preview 05-06 | gpt-5 (medium) | o3 (medium) | Claude 4 Opus Thinking | Claude 4 Sonnet Thinking | gpt-5 (low) | Grok 3 Mini Reasoning (high) | Gemini 2.5 Flash (Thinking) Preview | o3-mini (high) | Claude 4 Opus | o4-mini (medium) | o1 | GPT-4.5 Preview | Claude 3.7 Sonnet Thinking | DeepSeek R1 | Gemini 2.5 Flash Lite (Reasoning) Preview 06-17 | Claude 4 Sonnet | GLM-4.6 | Qwen 3 30B A3B (Reasoning) | GPT-4o 2025-03-27 | o3-mini (medium) | Gemini 2.5 Flash Preview | DeepSeek V3 0324 | GPT-4.1 | Gemini 2.5 Flash | o1-preview | Qwen 2.5 Max | Gemini 2.0 Pro Experimental | Claude 3.7 Sonnet | GPT-4.1 mini | Gemini 2.0 Flash Thinking Experimental | Qwen 3 32B | Mistral Medium 3 | o1-mini | Grok 3 | DeepSeek V3 | Claude 3.5 Sonnet New 2024-10 | Reka Flash 3 Preview | Llama 4 Maverick | Qwen 3 30B A3B | GPT-4o 2024-11-20 | Gemini 2.5 Flash Lite Preview 06-17 | DeepSeek R1 Distill Llama 70B | Gemini 2.0 Flash | Gemini 1.5 Pro 002 | Sonar Pro | Qwen 2.5 Max | Phi-4 Reasoning Plus | GPT-4o 2024-08-06 | Claude 3.5 Sonnet 2024-06 | Gemini 2.0 Flash-Lite | Command A | DeepSeek R1 Distill Qwen 32B | Qwen 2.5 (72B) | Gemma 3 27B | Llama 4 Scout | GPT-4 Turbo | Sonar | Llama 3.1 405B | Grok-2 1212 | Llama 3.3 70B | Pixtral Large | Grok Beta | Gemma 3 12B | Phi-4 | Mistral Large 2 2024-07-24 | DeepSeek-V2.5 | Claude 3 Opus | GPT-4.1 nano | Nova Pro | GPT-4o mini | Hunyuan-Large 2025-02 | Gemini 1.5 Pro 001 | Mistral Small 3 2025-03 | Gemini 1.5 Flash 002 | Claude 3.5 Haiku | GPT-4 | Llama 3.1 70B | Llama 3.2 90B | Mistral Small 3 2025-01 | Hunyuan-Standard 2025-02 | DeepSeek-V2 | Qwen 2 (72B) | Nova Lite | Jamba 1.5 Large | DeepSeek-Coder-V2 | Gemma 2 27B | Jamba 1.6 Large | Gemini 1.5 Flash 001 | Llama 3 70B | Reka Core | Mistral Small 2024-09 | Yi-Large | |
2 | Version (date, alias, etc.) | 4/16/2025 | 8/7/2025 | 6/5/2025 | 3/25/2025 | 6/10/2025 | 4/16/2025 | 5/6/2025 | 8/7/2025 | 4/16/2025 | 5/22/2025 | 5/22/2025 | 8/7/2025 | 4/9/2025 | 4/17/2025 | 2/13/2025 | 5/22/2025 | 4/16/2025 | 12/17/2024 | 2/27/2025 | 2/24/2025 | 1/20/2025 | 6/17/2025 | 5/22/2025 | 9/30/2025 | 4/28/2025 | 2025-03-27 | 1/31/2025 | 4/17/2025 | 3/24/2025 | 4/14/2025 | 6/17/2025 | preview-2024-09-12 | 3/5/2025 | 2/5/2025 | 2/24/2025 | 4/14/2025 | 12/19/2024 | 4/28/2025 | 5/7/2025 | preview-2024-09-12 | 3/17/2025 | 12/6/2024 | 10/22/2024 | 3/10/2025 | 4/5/2025 | 4/28/2025 | 11/20/2024 | 6/17/2025 | 1/1/2025 | 2/5/2025 | 1.5-pro-002 | 3/7/2025 | 1/29/2025 | 5/1/2025 | 8/6/2024 | 6/20/2024 | 2/5/2025 | 3/13/2025 | 1/29/2025 | 2025-03-12 | 4/5/2025 | 0125-preview | 1/25/2025 | 7/23/2024 | 12/12/2024 | 12/6/2024 | 11/18/2024 | 8/13/2024 | 2025-03-12 | 7/24/2024 | 4/14/2025 | 12/3/2024 | 2/10/2025 | 3/17/2025 | 1.5-flash-002 | 10/22/2024 | 1/1/2025 | 2/10/2025 | 12/3/2024 | 3/13/2025 | 7/22/2024 | 9/1/2024 | ||||||||||||||||||
3 | Provider (for Throughput and Cost metrics) | OpenAI | OpenAI | Google Vertex | OpenAI | OpenAI | Google Vertex | OpenAI | OpenAI | Google Vertex | Google Vertex | OpenAI | xAI | OpenAI | Google Vertex | OpenAI | OpenAI | OpenAI | Anthropic | DeepSeek | Google AI Studio | Google Vertex | z.AI | DeepInfra | OpenAI | OpenAI | Hyperbolic | OpenAI | Google AI Studio | OpenAI | Fireworks | Anthropic | OpenAI | Parasail | Mistral | OpenAI | xAI | Fireworks AI | Anthropic | Reka | Fireworks | DeepInfra | OpenAI | Google AI Studio | Together | Perplexity | Alibaba | DeepInfra | OpenAI | Anthropic | Cohere | DeepInfra | Alibaba | Parasail | Parasail | OpenAI | Perplexity | Together | xAI | Friendli | Mistral | xAI | DeepInfra | DeepInfra | Mistral | Anthropic | OpenAI | AWS | OpenAI | Mistral | Anthropic | OpenAI | Fireworks AI | Fireworks AI | Mistral | Together.ai | AWS | Amazon Bedrock | Together.ai | Amazon Bedrock | Fireworks AI | Reka | Mistral | Fireworks AI | |||||||||||||||||
4 | Year Released | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2024 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2025 | 2024 | 2025 | 2025 | 2025 | 2025 | 2024 | 2025 | 2025 | 2024 | 2025 | 2024 | 2024 | 2024 | 2025 | 2025 | 2024 | 2025 | 2025 | 2025 | 2024 | 2025 | 2025 | 2025 | 2024 | 2024 | 2025 | 2025 | 2025 | 2024 | 2025 | 2025 | 2023 | 2025 | 2024 | 2024 | 2024 | 2024 | 2024 | 2025 | 2024 | 2024 | 2024 | 2024 | 2025 | 2024 | 2024 | 2025 | 2024 | 2025 | 2024 | 2024 | 2023 | 2024 | 2024 | 2025 | 2025 | 2024 | 2024 | 2024 | 2024 | 2024 | 2024 | 2025 | 2024 | 2024 | 2024 | 2024 | 2024 | |
5 | COMPOSITE CAPABILITY 100 represents the most capable model at this moment in time. See Composite Capability sheet for complete calculation. Calculation method updated 2025-03. | 97 | 100 | 96 | 95 | 95 | 95 | 94 | 94 | 94 | 93 | 90 | 90 | 88 | 87 | 87 | 87 | 86 | 85 | 85 | 84 | 83 | 83 | 82 | 89 | 81 | 81 | 81 | 80 | 79 | 79 | 77 | 77 | 77 | 76 | 75 | 75 | 74 | 73 | 72 | 72 | 71 | 69 | 68 | 69 | 68 | 68 | 67 | 67 | 67 | 67 | 67 | 66 | 66 | 66 | 66 | 66 | 65 | 65 | 64 | 64 | 63 | 63 | 61 | 59 | 58 | 58 | 57 | 56 | 56 | 56 | 55 | 54 | 54 | 53 | 52 | 52 | 52 | 52 | 51 | 51 | 51 | 51 | 50 | 50 | 50 | 49 | 49 | 49 | 47 | 45 | 44 | 44 | 43 | 43 | 43 | 42 | 40 | 40 | 40 | |
6 | Composite Capability Consistency 100 represents consistent performance across benchmarks relative to other models. Lower score indicates higher variance. | 56 | 47 | 17 | 67 | 28 | 92 | 44 | 57 | 81 | 98 | 86 | 83 | 22 | n/a | 40 | 69 | 93 | 69 | 28 | 86 | 75 | 74 | 70 | n/a | 27 | 50 | 44 | 71 | 56 | n/a | 70 | 55 | 63 | 71 | 67 | 68 | 43 | 61 | 69 | 74 | 75 | 68 | n/a | 71 | 55 | 55 | n/a | 72 | 73 | 77 | 69 | 72 | n/a | 60 | 60 | 66 | 45 | 1 | 70 | 34 | n/a | 76 | 64 | 83 | 57 | 72 | 81 | 95 | 22 | 62 | 71 | 85 | 83 | 57 | 59 | 75 | n/a | n/a | 99 | 56 | 94 | n/a | 80 | 89 | 54 | n/a | n/a | 17 | 56 | 89 | 46 | 93 | n/a | n/a | 87 | 74 | 94 | 88 | ||
7 | Artificial Analysis Intelligence Index V2 Weighted average of benchmarks for General Reasoning & Knowledge, Math, and Code Gen. | 70 | 69 | 70 | 68 | 71 | 66 | 68 | 68 | 63 | 64 | 61 | 63 | 67 | 60 | 66 | 58 | 62 | 57 | 60 | 55 | 53 | 56 | 56 | 50 | 63 | 48 | 52 | 53 | 53 | 58 | 49 | 48 | 53 | 52 | 44 | 49 | 54 | 51 | 46 | 44 | 47 | 50 | 43 | 41 | 46 | 45 | 44 | 45 | 43 | 45 | 41 | 41 | 40 | 52 | 38 | 43 | 43 | 41 | 41 | 37 | 38 | 34 | 40 | 37 | 35 | 41 | 37 | 36 | 35 | 28 | 35 | 35 | 33 | 35 | 33 | 29 | 29 | 27 | 28 | |||||||||||||||||||||
8 | Artificial Analysis Index V1 Average result across MMLU, GPQA, Math, and HumanEval. V1 Index deprecated 2025-02 | 90 | 89 | 89 | 86 | 85 | 84 | 79 | 80 | 85 | 83 | 80 | 79 | 78 | 76 | 79 | 77 | 75 | 74 | 74 | 74 | 72 | 76 | 74 | 72 | 70 | 75 | 73 | 72 | 68 | 68 | 68 | 72 | 66 | 72 | 70 | 64 | 67 | 61 | 62 | 57 | 60 | 58 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9 | LM Arena ELO (style control enabled) Crowdsourced comparative evaluation of "vibes". Can be misleading: "answers people like" is not the same as "right answers", i.e. a confidently wrong answer often beats a nuanced, considered one. | 1481 | 1470 | 1437 | 1446 | 1434 | 1346 | 1414 | 1382 | 1380 | 1417 | 1371 | 1376 | 1387 | 1390 | 1429 | 1422 | 1329 | 1395 | 1382 | 1391 | 1363 | 1346 | 1381 | 1358 | 1355 | 1311 | 1328 | 1355 | 1316 | 1332 | 1337 | 1345 | 1309 | 1293 | 1339 | 1286 | 1328 | 1284 | 1341 | 1320 | 1331 | 1324 | 1316 | 1342 | 1302 | 1254 | 1249 | 1235 | 1323 | 1237 | 1232 | 1240 | 1239 | 1240 | 1209 | 1234 | 1242 | 1241 | 1230 | 1232 | 1235 | 1214 | 1186 | 1229 | 1181 | 1181 | 1205 | 1182 | 1207 | 1204 | 1195 | 1207 | ||||||||||||||||||||||||||||
10 | LiveBench Global Average Questions refresh monthly, potential for stale values in spreadsheet. Categories: Reasoning, Coding, Math, Data Analysis, Language, and Instruction Following. | 78.72 | 78.59 | 69.39 | 76.69 | 74.42 | 80.71 | 71.99 | 76.45 | 79.25 | 79.53 | 79.09 | 75.34 | 68.33 | 75.88 | 71.52 | 74.40 | 75.67 | 65.93 | 74.50 | 72.49 | 69.65 | 64.75 | 70.01 | 69.93 | 66.86 | 62.99 | 60.03 | 65.13 | 65.56 | 59.05 | 66.92 | 71.03 | 56.59 | 57.76 | 56.95 | 60.45 | 57.94 | 54.38 | 65.32 | 57.79 | 54.50 | 61.47 | 54.33 | 60.27 | 62.29 | 56.64 | 55.33 | 53.24 | 45.55 | 51.44 | 49.99 | 50.40 | 46.88 | 52.36 | 54.30 | 50.16 | 49.18 | 41.25 | 41.61 | 45.98 | 49.12 | 39.72 | 43.55 | 41.26 | 43.96 | 48.59 | 43.45 | 44.89 | 42.55 | 36.35 | 38.19 | 33.39 | ||||||||||||||||||||||||||||
11 | Primary Dimensions: 1. Capability: output quality 2. Throughput: seconds to first 1k tokens (combines latency and tokens/sec) 3. Cost: $ per token Unique to this sheet: - Composite Capability: normalized average of key Capability indexes and ELOs, providing a single score for "which model is most capable". - Composite Capability Consistency: standard deviation of normalized capability scores. High value indicates more consistent eval performance. - Efficiency Index: weighted, normalized score that reflects overall performance across Capability, Throughput, and Cost, answering "which model has the optimal balance of primary dimensions". 2025 emerging trends: - Reasoning models are breaking through the Capability plateau and have reset the price war with a differentiated tier. Caveats: - Reasoning models generate significantly more tokens, greatly increasing both total cost and time to full response beyond what $/token and tokens/sec alone indicate. - Continued improvement in frontier non-reasoning LLMs has closed the gap with first-wave reasoning model capability. - Open models have narrowed the capability gap with Closed models, exerting more downward pressure on cost. - 2025 "smart" models are competitive in speed and cost with 2024's "fast & cheap" models. - Benchmarks lose validity and utility over time (saturation, contamination, test-specific optimization, etc.). Purpose-specific and controlled evaluation is required. 2024 at a glance: - Cost decreased significantly (15x price drop for "GPT-4" level capability in 1 year). - Capability improved incrementally. - Small models, aka SLMs (low parameter count), are increasingly capable, appropriate for tightly focused use cases and potentially on-device. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
12 | EFFICIENCY INDEX (weighted, normalized) 100 represents the optimal balance of primary dimensions: - Capability: 2x weight - Throughput: sqrt - Cost: sqrt | 90 | n/a | 91 | 88 | 64 | 86 | 90 | 86 | 85 | 81 | 85 | n/a | 88 | 90 | 86 | 78 | 85 | 73 | 64 | 83 | 79 | 90 | 82 | n/a | 85 | 85 | 83 | 87 | 65 | 81 | 84 | 76 | 82 | n/a | 78 | 82 | n/a | 78 | 79 | 79 | 74 | 75 | 75 | 78 | 79 | 78 | 76 | 82 | 77 | 80 | 77 | 72 | 72 | n/a | 74 | 74 | 80 | 78 | 75 | 71 | 71 | 76 | 66 | 73 | 71 | 71 | 74 | 69 | 69 | 70 | 69 | 68 | 56 | 59 | 74 | 70 | 70 | n/a | 68 | 72 | 73 | 68 | 56 | 70 | 68 | 71 | n/a | 61 | 67 | 68 | 62 | 58 | 65 | 64 | 69 | 67 | 52 | 65 | 62 | |
13 | THROUGHPUT | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
14 | Throughput (median tokens/sec) Highly variable based on sample date. | 83.5 | 93.0 | 124.4 | 13.3 | 47.3 | 93.3 | 45.0 | 47.3 | 48.8 | 55.4 | 130.8 | 341.5 | 118.3 | 48.8 | 76.1 | 29.0 | 16.1 | 62.4 | 36.2 | 292.6 | 65.7 | 170.0 | 193.0 | 93.5 | 166.1 | 7.1 | 52.1 | 115.4 | 87.9 | 71.0 | 118.4 | 62.4 | 79.0 | 213.3 | 70.4 | 56.0 | 131.1 | 34.7 | 33.2 | 57.7 | 56.4 | 90.2 | 170.0 | 68.9 | 292.6 | 94.1 | 107.4 | 80.2 | 47.6 | 35.5 | 33.9 | 56.6 | 64.5 | 184.4 | 184.1 | 47.6 | 27.3 | 26.8 | 82.5 | 31.0 | 83.3 | 64.5 | 59.5 | 101.2 | 52.2 | 59.2 | 38.7 | 36.4 | 53.6 | 8.9 | 29.5 | 184.2 | 73.9 | 60.9 | 75.9 | 128.8 | 181.4 | 68.4 | 28.3 | 102.4 | 66.6 | 128.0 | 16.8 | 64.4 | 76.2 | 44.0 | 16.9 | 62.2 | 68.1 | 153.3 | 129.9 | 14.7 | 79.5 | 63.2 | ||||||
15 | Latency seconds (First Chunk) | 8.87 | 2.08 | 15.3 | 89.72 | 9.62 | 2.26 | 6.63 | 9.62 | 1.98 | 1.91 | 9.09 | 3.5 | 9.64 | 1.98 | 5.06 | 15.85 | 1.05 | 1.17 | 19.65 | 0.13 | 1.56 | 8.42 | 0.37 | 5.57 | 0.64 | 0.94 | 0.63 | 0.44 | 1.1 | 1.35 | 12.32 | 1.17 | 0.63 | 4.33 | 13.2 | 0.77 | 0.99 | 0.74 | 1.08 | 1.37 | 0.94 | 0.4 | 8.42 | 0.74 | 0.13 | 2.57 | 0.54 | 1.02 | 3.06 | 0.86 | 3.58 | 0.56 | 1.16 | 0.47 | 0.36 | 0.49 | 1.65 | 0.88 | 0.84 | 0.73 | 2.35 | 1.09 | 0.27 | 0.29 | 0.74 | 0.25 | 0.66 | 0.16 | 0.51 | 1.81 | 2.24 | 0.43 | 0.37 | 0.45 | 1.19 | 0.38 | 1.24 | 0.51 | 0.48 | 0.26 | 0.13 | 0.24 | 1.04 | 0.75 | 0.32 | 0.52 | ||||||||||||||
16 | Seconds per first 1K tokens output Is it fast? | 20.85 | 12.83 | 23.34 | 164.91 | 30.78 | 12.97 | 28.86 | 30.78 | 22.46 | 19.96 | 16.74 | 6.43 | 18.09 | 22.46 | 18.19 | 50.38 | 63.32 | 17.20 | 47.24 | 3.55 | 16.77 | 14.30 | 5.55 | 16.26 | 6.66 | 142.18 | 19.82 | 9.11 | 12.48 | 15.43 | 20.77 | 17.20 | 13.29 | 9.02 | 27.41 | 18.62 | 8.62 | 29.57 | 31.16 | 18.70 | 18.67 | 11.49 | 14.30 | 15.25 | 3.55 | 13.20 | 9.85 | 13.49 | 24.09 | 29.04 | 33.08 | 18.24 | 16.65 | 5.89 | 5.79 | 21.51 | 38.25 | 38.14 | 12.97 | 32.98 | 14.35 | 16.60 | 17.09 | 10.17 | 19.90 | 17.15 | 26.51 | 27.61 | 19.18 | 113.92 | 36.16 | 5.86 | 13.91 | 16.88 | 14.36 | 8.14 | 5.51 | 15.86 | 35.34 | 10.28 | 15.50 | 8.07 | 59.52 | 15.53 | 13.26 | 22.97 | 59.17 | 17.13 | 15.45 | 6.84 | 8.22 | 68.03 | 12.58 | 15.82 | ||||||
17 | Seconds per first 4k tokens output | 56.77 | 45.07 | 47.45 | 390.47 | 94.26 | 45.11 | 95.54 | 94.26 | 83.91 | 74.11 | 39.67 | 15.21 | 43.45 | 83.91 | 57.59 | 153.97 | 250.12 | 65.29 | 130.03 | 13.80 | 62.42 | 31.95 | 21.10 | 48.34 | 24.72 | 565.91 | 77.39 | 35.10 | 46.62 | 57.69 | 46.10 | 65.29 | 51.27 | 23.08 | 70.05 | 72.17 | 31.50 | 116.05 | 121.42 | 70.71 | 71.86 | 44.77 | 31.95 | 58.80 | 13.80 | 45.07 | 37.78 | 50.91 | 87.18 | 113.60 | 121.57 | 71.29 | 63.14 | 22.16 | 22.09 | 84.58 | 148.06 | 149.91 | 49.35 | 129.72 | 50.37 | 63.13 | 67.53 | 39.82 | 77.37 | 67.85 | 104.07 | 109.96 | 75.19 | 450.24 | 137.93 | 22.15 | 54.53 | 66.15 | 53.87 | 31.44 | 22.05 | 59.71 | 141.34 | 39.57 | 60.58 | 31.51 | 238.10 | 62.11 | 52.65 | 91.15 | 236.69 | 65.39 | 59.53 | 26.41 | 31.31 | 272.11 | 50.31 | 63.29 | ||||||
18 | COST | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
19 | Cost Variants (Cached 1M input, queries…) | $7.500 | $37.500 | $0.550 | $7.500 | $1.500 | $3.75write / $0.3read | $1.250 | $0.025 | $0.313 | $1.250 | $3.75write / $0.3read | $18.75write/$1.50read | $0.075 | $0.019 | $0.30write/$0.03read | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
20 | Cost Uncached Input (1M tokens) | $1.100 | $1.250 | $1.250 | $1.250 | $20.000 | $2.000 | $1.250 | $1.250 | $2.000 | $15.000 | $3.000 | $1.250 | $0.300 | $0.150 | $1.100 | $15.000 | $1.100 | $15.000 | $75.000 | $3.000 | $0.550 | $0.100 | $3.000 | $0.300 | $5.000 | $1.100 | $0.150 | $4.000 | $2.000 | $0.300 | $15.000 | $0.900 | $3.000 | $0.400 | $0.100 | $0.400 | $3.000 | $3.000 | $0.900 | $3.000 | $0.200 | $0.220 | $0.300 | $2.500 | $0.100 | $2.000 | $0.100 | $1.250 | $3.000 | $1.600 | $2.500 | $3.000 | $0.075 | $2.500 | $0.120 | $0.380 | $0.300 | $0.140 | $10.000 | $1.000 | $3.500 | $2.000 | $0.600 | $2.000 | $5.000 | $0.050 | $0.170 | $2.000 | $1.070 | $15.000 | $0.100 | $0.800 | $0.150 | $3.500 | $0.100 | $0.075 | $0.800 | $30.000 | $0.900 | $1.200 | $0.100 | $0.140 | $0.630 | $0.060 | $2.000 | $0.140 | $0.800 | $2.000 | $0.075 | $0.900 | $3.000 | $0.200 | $3.000 | |||||||
21 | Cost Output (1M tokens) | $4.400 | $10.000 | $10.000 | $10.000 | $80.000 | $8.000 | $10.000 | $10.000 | $8.000 | $75.000 | $15.000 | $10.000 | $0.500 | $3.500 | $4.400 | $75.000 | $4.400 | $60.000 | $150.000 | $15.000 | $2.190 | $0.400 | $15.000 | $0.100 | $15.000 | $4.400 | $0.600 | $4.000 | $8.000 | $2.500 | $60.000 | $0.900 | $15.000 | $1.600 | $0.500 | $2.000 | $12.000 | $15.000 | $0.900 | $15.000 | $0.800 | $0.880 | $0.100 | $10.000 | $0.400 | $2.000 | $0.400 | $2.500 | $15.000 | $6.400 | $10.000 | $15.000 | $0.300 | $10.000 | $0.180 | $0.400 | $0.500 | $0.580 | $30.000 | $1.000 | $3.500 | $10.000 | $0.600 | $6.000 | $15.000 | $0.100 | $0.680 | $6.000 | $1.140 | $75.000 | $0.400 | $3.200 | $0.600 | $10.500 | $0.300 | $0.300 | $4.000 | $60.000 | $0.900 | $1.200 | $0.300 | $0.280 | $0.650 | $0.240 | $8.000 | $0.280 | $0.800 | $8.000 | $0.300 | $0.900 | $15.000 | $0.600 | $3.000 | |||||||
22 | Cost 1M tokens (3:1 input:output) Is it cheap? | $1.925 | $3.438 | $3.438 | $3.438 | $35.000 | $3.500 | $3.438 | $3.438 | $3.500 | $30.000 | $6.000 | $3.438 | $0.350 | $0.988 | $1.925 | $30.000 | $1.925 | $26.250 | $93.750 | $6.000 | $0.960 | $0.175 | $6.000 | $0.250 | $7.500 | $1.925 | $0.263 | $4.000 | $3.500 | $0.850 | $26.250 | $0.900 | $6.000 | $0.700 | $0.200 | $0.800 | $5.250 | $6.000 | $0.900 | $6.000 | $0.350 | $0.385 | $0.250 | $4.375 | $0.175 | $2.000 | $0.175 | $1.563 | $6.000 | $2.800 | $4.375 | $6.000 | $0.131 | $4.375 | $0.135 | $0.385 | $0.350 | $0.250 | $15.000 | $1.000 | $3.500 | $4.000 | $0.600 | $3.000 | $7.500 | $0.063 | $0.298 | $3.000 | $1.088 | $30.000 | $0.175 | $1.400 | $0.263 | $5.250 | $0.150 | $0.131 | $1.600 | $37.500 | $0.900 | $1.200 | $0.150 | $0.175 | $0.635 | $0.105 | $3.500 | $0.175 | $0.800 | $3.500 | $0.131 | $0.900 | $6.000 | $0.300 | $3.000 | |||||||
23 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
24 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
25 | COST VS (CAPABILITY, THROUGHPUT) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
26 | Cost 1M (3:1 IO) tokens per Composite Capability point Is capability in line with cost? | $0.020 | $0.034 | $0.036 | $0.036 | $0.369 | $0.037 | $0.036 | $0.037 | $0.037 | $0.324 | $0.066 | $0.038 | $0.004 | $0.011 | $0.022 | $0.346 | $0.022 | $0.308 | $1.103 | $0.071 | $0.012 | $0.002 | $0.073 | $0.003 | $0.093 | $0.024 | $0.003 | $0.050 | $0.044 | $0.011 | $0.341 | $0.012 | $0.080 | $0.009 | $0.003 | $0.011 | $0.073 | $0.084 | $0.013 | $0.088 | $0.005 | $0.006 | $0.004 | $0.065 | $0.003 | $0.030 | $0.003 | $0.023 | $0.090 | $0.042 | $0.067 | $0.091 | $0.002 | $0.067 | $0.002 | $0.006 | $0.006 | $0.004 | $0.244 | $0.017 | $0.060 | $0.069 | $0.011 | $0.053 | $0.133 | $0.001 | $0.005 | $0.056 | $0.020 | $0.563 | $0.003 | $0.027 | $0.005 | $0.102 | $0.003 | $0.003 | $0.032 | $0.751 | $0.018 | $0.024 | $0.003 | $0.004 | $0.014 | $0.002 | $0.080 | $0.004 | $0.018 | $0.082 | $0.003 | $0.021 | $0.149 | $0.008 | $0.075 | |||||||
27 | Cost 1M (3:1 IO) tokens per AA Intelligence Index v2 point | $0.028 | $0.050 | $0.049 | $0.051 | $0.493 | $0.053 | $0.051 | $0.051 | $0.056 | $0.469 | $0.098 | $0.055 | $0.005 | $0.016 | $0.029 | $0.517 | $0.423 | $0.105 | $0.016 | $0.003 | $0.113 | $0.004 | $0.150 | $0.031 | $0.005 | $0.077 | $0.066 | $0.016 | $0.016 | $0.125 | $0.013 | $0.005 | $0.016 | $0.097 | $0.118 | $0.020 | $0.136 | $0.007 | $0.008 | $0.006 | $0.107 | $0.004 | $0.044 | $0.004 | $0.035 | $0.140 | $0.062 | $0.107 | $0.003 | $0.109 | $0.003 | $0.009 | $0.006 | $0.023 | $0.085 | $0.015 | $0.081 | $0.197 | $0.002 | $0.007 | $0.081 | $0.857 | $0.004 | $0.038 | $0.007 | $0.004 | $0.005 | $0.046 | $0.026 | $0.036 | $0.004 | $0.003 | $0.121 | $0.121 | $0.011 | $0.107 | ||||||||||||||||||||||||
28 | Cost 1M (3:1 IO) tokens per AA Index v1 point | $0.292 | $0.011 | $0.022 | $0.305 | $0.063 | $0.011 | $0.075 | $0.024 | $0.002 | $0.020 | $0.035 | $0.056 | $0.079 | $0.002 | $0.005 | $0.200 | $0.047 | $0.008 | $0.041 | $0.104 | $0.004 | $0.041 | $0.015 | $0.429 | $0.019 | $0.004 | $0.002 | $0.024 | $0.013 | $0.018 | $0.002 | $0.003 | $0.009 | $0.002 | $0.055 | $0.003 | $0.013 | $0.015 | $0.105 | $0.005 | $0.052 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
29 | Cost 1M (3:1 IO) tokens per Chatbot Arena ELO point | $0.002 | $0.002 | $0.002 | $0.002 | $0.002 | $0.001 | $0.021 | $0.001 | $0.019 | $0.066 | $0.004 | $0.001 | $0.000 | $0.004 | $0.005 | $0.001 | $0.000 | $0.003 | $0.003 | $0.019 | $0.001 | $0.004 | $0.001 | $0.000 | $0.001 | $0.004 | $0.005 | $0.001 | $0.004 | $0.000 | $0.000 | $0.003 | $0.000 | $0.001 | $0.002 | $0.003 | $0.005 | $0.000 | $0.003 | $0.000 | $0.000 | $0.012 | $0.003 | $0.003 | $0.000 | $0.000 | $0.000 | $0.002 | $0.001 | $0.024 | $0.000 | $0.001 | $0.000 | $0.004 | $0.000 | $0.001 | $0.030 | $0.001 | $0.000 | $0.001 | $0.000 | $0.003 | $0.000 | $0.001 | $0.000 | $0.001 | $0.005 | |||||||||||||||||||||||||||||||||
30 | Cost 1M (3:1 IO) tokens per LiveBench point | $0.024 | $0.044 | $0.050 | $0.045 | $0.470 | $0.043 | $0.048 | $0.045 | $0.044 | $0.377 | $0.076 | $0.046 | $0.005 | $0.025 | $0.419 | $0.026 | $0.347 | $1.422 | $0.081 | $0.013 | $0.086 | $0.116 | $0.027 | $0.004 | $0.060 | $0.056 | $0.015 | $0.092 | $0.012 | $0.003 | $0.014 | $0.091 | $0.105 | $0.015 | $0.104 | $0.007 | $0.004 | $0.076 | $0.037 | $0.003 | $0.029 | $0.100 | $0.045 | $0.079 | $0.002 | $0.003 | $0.007 | $0.007 | $0.298 | $0.021 | $0.067 | $0.074 | $0.012 | $0.153 | $0.002 | $0.007 | $0.024 | $0.611 | $0.004 | $0.032 | $0.006 | $0.003 | $0.003 | $0.037 | $0.020 | $0.004 | $0.003 | $0.021 | $0.009 | |||||||||||||||||||||||||||||||
31 | Cost 1M (3:1 IO) tokens per Throughput (first 1k tokens output) Is throughput in line with cost? | $0.092 | $0.268 | $0.147 | $0.212 | $0.114 | $0.265 | $0.119 | $0.114 | $1.336 | $0.301 | $0.021 | $0.154 | $0.106 | $1.336 | $0.106 | $0.521 | $1.481 | $0.349 | $0.020 | $0.049 | $0.358 | $0.017 | $1.351 | $0.118 | $0.039 | $0.028 | $0.177 | $0.093 | $2.104 | $0.058 | $0.349 | $0.053 | $0.007 | $0.043 | $0.609 | $0.203 | $0.029 | $0.321 | $0.019 | $0.034 | $0.017 | $0.287 | $0.049 | $0.152 | $0.018 | $0.116 | $0.249 | $0.096 | $0.240 | $0.360 | $0.022 | $0.755 | $0.006 | $0.010 | $0.009 | $0.019 | $0.455 | $0.070 | $0.211 | $0.234 | $0.059 | $0.151 | $0.437 | $0.002 | $0.011 | $0.156 | $0.010 | $0.830 | $0.030 | $0.101 | $0.016 | $0.366 | $0.018 | $0.024 | $0.101 | $1.061 | $0.088 | $0.077 | $0.019 | $0.003 | $0.041 | $0.008 | $0.152 | $0.003 | $0.047 | $0.227 | $0.019 | $0.110 | $0.088 | $0.024 | $0.190 | |||||||||
32 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
33 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
34 | BENCHMARKS | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
35 | Scores are opportunistically sourced from HuggingFace leaderboards, model cards, docs and press releases, and other third parties. Expect many reproducibility, precision, and accuracy issues due to selection bias and varied methodologies (n-shot, turns, CoT, benchmark variant…); scores are also susceptible to prompt sensitivity, construct-validity problems, and contamination. Models used by HuggingFace Leaderboard 2 (2024 June 26) are highlighted in GREEN. Models used by HuggingFace Leaderboard 1 are highlighted in ORANGE. These benchmarks are increasingly outdated for various reasons, e.g. benchmark inaccuracies, or because they are effectively "solved" by frontier models (due to training to beat the benchmark, leaked criteria, or an outdated level of difficulty). | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
36 | General | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
37 | MMLU PRO HuggingFace Leaderboard 2 Scores per category (Biology, Business, Engineering, …) | 84.0 | 84.0 | 84.0 | 79.0 | 77.0 | 76.0 | 81.0 | 80.0 | 76.2 | 74.0 | 76.0 | 78.0 | 67.0 | 80.0 | 77.6 | 75.8 | 76.0 | 72.6 | 76.1 | 71.6 | 71.0 | 86.0 | 67.0 | 63.7 | 73.3 | 68.9 | 70.0 | 85.0 | 45.3 | 70.4 | 68.0 | 68.5 | 69.0 | 63.1 | 69.0 | 66.8 | 67.3 | 65.0 | 64.8 | 66.4 | 67.0 | 66.3 | 63.6 | 52.6 | 53.5 | 53.6 | 59.1 | 56.2 | 53.0 | 58.1 | ||||||||||||||||||||||||||||||||||||||||||||||||||
38 | MMLU HuggingFace Leaderboard 1 Most widely reported, but many known issues. | 86.9 | 92.0 | 91.0 | 85.9 | 90.8 | 85.2 | 87.0 | 89.0 | 88.0 | 86.0 | 87.0 | 88.7 | 88.7 | 85.0 | 78.6 | 86.4 | 88.6 | 86.0 | 74.5 | 84.8 | 84.0 | 86.8 | 85.9 | 82.0 | 81.9 | 80.6 | 75.0 | 81.0 | 86.4 | 86.0 | 86.0 | 82.0 | 78.5 | 83.8 | 80.5 | 81.2 | 75.2 | 78.9 | 82.0 | 83.2 | 83.8 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
39 | Humanity's Last Exam Technical knowledge and reasoning via structured academic problems | 18.8 | 12.1 | 8.35 | 10 | 8.57 | 11.1 | 5 | 4 | 6.67 | 5 | 7.2 | 4.05 | 4.85 | 5 | 4.89 | 5.15 | 2.62 | 4.43 | 5 | 5 | 3.9 | 3.97 | 5 | 0.84 | 5.53 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
40 | HELM | 90.8 | 93.8 | 88.5 | 90.8 | 85.8 | 80.3 | 72.2 | 88.5 | 70.1 | 79.3 | 91.5 | 83.8 | 82.7 | 70.8 | 74.2 | 73.3 | 53.0 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
41 | Kagi LLM Benchmarking Omitted - scores highly variable over time | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
42 | SimpleBench | 38.7 | 51.6 | 53.1 | 22.8 | 40.1 | 34.5 | 46.4 | 30.9 | 41.7 | 31.3 | 44.9 | 30.7 | 18.1 | 18.9 | 41.4 | 27.7 | 18.9 | 27.1 | 17.8 | 27.5 | 25.1 | 19.9 | 22.7 | 19.9 | 22.5 | 23.5 | 10.7 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
43 | TriviaQA | 85.5 | 78.2 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
44 | Arena Hard | 79.2 | 79.3 | 78.0 | 84.0 | 75.4 | 73.2 | 60.4 | 89.7 | 72.0 | 37.9 | 87.3 | 62.3 | 46.9 | 57.5 | 49.6 | 46.6 | 63.7 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
45 | SimpleQA | 52.9 | 47.0 | 63.0 | 44.3 | 29.9 | 38.0 | 21.7 | 15.0 | 10.4 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
46 | Instruction Following | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
47 | IFEval HuggingFace Leaderboard 2 | 83.9 | 90.8 | 85.6 | 88.0 | 88.6 | 92.1 | 63.0 | 92.1 | 85.0 | 84.3 | 87.5 | 82.9 | 79.9 | 89.7 | 80.7 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
48 | Math | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
49 | MATH Multiple difficulty levels, inconsistently reported. HuggingFace Leaderboard 2 uses Level 5 (most difficult) | 97.9 | 95.0 | 95.0 | 97.0 | 97.3 | 94.0 | 94.8 | 96.0 | 91.8 | 82.2 | 90.0 | 90.0 | 85.0 | 78.3 | 89.0 | 92.0 | 90.9 | 86.5 | 83.0 | 75.9 | 71.1 | 86.8 | 82.0 | 85.0 | 88.0 | 73.4 | 73.8 | 77.0 | 73.0 | 43.3 | 80.4 | 60.1 | 76.6 | 70.2 | 67.7 | 71.0 | 77.9 | 69.2 | 64.5 | 68.0 | 68.0 | 70.6 | 35.1 | 73.3 | 53.5 | 54.9 | 50.4 | 62.4 | ||||||||||||||||||||||||||||||||||||||||||||||||||||
50 | GSM8K HuggingFace Leaderboard 1 | 96.1 | 96.4 | 82.6 | 96.8 | 71.0 | 93.0 | 95.0 | 94.8 | 90.8 | 94.2 | 95.1 | 85.3 | 94.5 | 87.0 | 85.4 | 92.6 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
51 | HiddenMath | 87.3 | 83.3 | 79.6 | 56.7 | 65.2 | 63.6 | 63.5 | 52.0 | 55.3 | 28.0 | 47.2 | 20.3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
52 | AIME 2025 | 86.7 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
53 | AIME 2024 | 36.7 | 49.0 | 52.0 | 79.5 | 23.3 | 92.0 | 16.0 | 51.0 | 9.3 | 9.6 | 10.0 | 25.0 | 9.0 | 5.3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
54 | MathVista | 73.2 | 68.1 | 63.8 | 69.4 | 56.7 | 63.9 | 68.9 | 65.8 | 57.3 | 58.4 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
55 | Reasoning | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
56 | GPQA HuggingFace Leaderboard 2 | 84.0 | 79.7 | 77.0 | 71.4 | 77.0 | 70.0 | 75.0 | 62.0 | 78.0 | 59.0 | 64.7 | 68.0 | 60.0 | 56.0 | 65.0 | 53.0 | 61.0 | 60.1 | 59.1 | 53.0 | 53.1 | 59.4 | 60.1 | 53.0 | 49.0 | 43.0 | 49.1 | 51.1 | 50.5 | 51.0 | 43.0 | 25.4 | 56.1 | 47.0 | 50.4 | 46.9 | 40.2 | 46.0 | 44.4 | 51.0 | 41.6 | 41.4 | 46.7 | 46.7 | 45.3 | 16.3 | 42.0 | 36.9 | 34.4 | 41.4 | 39.5 | 43.5 | ||||||||||||||||||||||||||||||||||||||||||||||||
57 | BIG-BENCH-HARD HuggingFace Leaderboard 2 | 93.1 | 77.7 | 72.6 | 86.8 | 86.9 | 84.0 | 83.1 | 57.5 | 82.4 | 85.5 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
58 | MuSR HuggingFace Leaderboard 2 | 17.3 | 19.7 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
59 | ARC Challenge HuggingFace Leaderboard 1 | 96.7 | 96.7 | 70.6 | 96.9 | 68.9 | 94.8 | 96.4 | 94.8 | 68.8 | 92.4 | 93.0 | 71.4 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
60 | HellaSwag HuggingFace Leaderboard 1 | 85.6 | 84.2 | 95.4 | 95.3 | 87.3 | 85.7 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
61 | TruthfulQA (MC2) HuggingFace Leaderboard 1 | 54.7 | 61.8 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
62 | WinoGrande HuggingFace Leaderboard 1 | 78.8 | 74.3 | 84.5 | 82.9 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
63 | ARC Easy | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
64 | DROP | 79.8 | 87.1 | 77.2 | 85.4 | 72.2 | 83.1 | 79.7 | 80.9 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
65 | Social IQA | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
66 | MedQA | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
67 | Code | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
68 | Aider polyglot | 72.0 | 72.9 | 79.6 | 49.3 | 60.4 | 61.7 | 44.9 | 64.9 | 56.9 | 45.3 | 53.8 | 47.1 | 55.1 | 52.4 | 20.9 | 38.2 | 60.4 | 32.4 | 18.2 | 32.9 | 48.4 | 51.6 | 15.6 | 18.2 | 22.1 | 21.8 | 27.1 | 12.0 | 17.8 | 8.9 | 3.6 | 28.0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
69 | HumanEval | 97.0 | 98.0 | 98.0 | 97.0 | 92.0 | 92.4 | 98.0 | 92.0 | 92.4 | 92.0 | 93.7 | 95.0 | 97.0 | 90.0 | 56.5 | 93.0 | 90.2 | 92.0 | 90.0 | 82.0 | 88.0 | 89.0 | 87.6 | 89.0 | 88.4 | 87.0 | 45.7 | 82.6 | 92.0 | 84.9 | 88.0 | 87.2 | 71.9 | 88.4 | 88.1 | 86.6 | 80.5 | 84.8 | 64.6 | 84.0 | 73.2 | 81.7 | 76.8 | 82.3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
70 | HumanEval Plus | 82.8 | 87.0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
71 | LiveCodeBench v5 | 70.4 | 36.0 | 34.5 | 28.9 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
72 | MBPP Base | 65.6 | 60.4 | 80.0 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
73 | MBPP EvalPlus | 90.5 | 88.6 | 87.7 | 69.0 | 83.6 | 86.0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
74 | Natural2Code | 92.9 | 85.4 | 82.6 | 79.8 | 77.2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
75 | Tool use | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
76 | BFCL v3 | 59.3 | 57.5 | 56.5 | 62.2 | 57.4 | 61.9 | 56.8 | 52.1 | 60.5 | 55.1 | 57.9 | 53.7 | 53.7 | 55.9 | 51.5 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
77 | BFCL v2 | 77.2 | 81.1 | 77.3 | 77.5 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
78 | BFCL v1 (Berkeley Function Calling) | 66.4 | 80.5 | 80.2 | 88.5 | 86.3 | 88.3 | 84.8 | 85.5 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
79 | Nexus | 56.1 | 45.7 | 58.7 | 50.3 | 56.7 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
80 | Agentic | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
81 | SWE-bench Verified | 49.0 | 33.4 | 33.4 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
82 | Conversational | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
83 | MT Bench | 93.2 | 74.0 | 86.3 | 83.3 | 83.5 | 78.6 | 92.6 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
84 | Multimodal (Image) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
85 | MMMU | 81.7 | 74.4 | 78.2 | 72.7 | 71.8 | 75.4 | 70.4 | 72.7 | 65.9 | 68.3 | 69.1 | 68.0 | 56.1 | 64.0 | 50.3 | 59.4 | 59.4 | 62.2 | 62.8 | 62.3 | 56.1 | 56.3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
86 | Multilingual | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
87 | MGSM | 90.5 | 91.6 | 74.3 | 88.6 | 91.6 | 91.1 | 64.3 | 80.6 | 90.7 | 87.0 | 87.5 | 86.5 | 86.9 | 86.9 | 75.5 | 82.6 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
88 | Long Context | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
89 | ZeroSCROLLS/QuALITY | 90.5 | 90.5 | 95.2 | 95.2 | 90.5 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
90 | InfiniteBench/En.MC | 82.5 | 83.4 | 72.1 | 78.2 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
91 | NIH/Multi-needle | 100.0 | 90.8 | 98.1 | 100.0 | 97.5 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
92 | MRCR (1M) | 74.7 | 70.5 | 58 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
93 | MISC | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
94 | Reasoning Tokens | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
95 | License | proprietary | | proprietary | proprietary | proprietary | proprietary | proprietary | | proprietary | proprietary | proprietary | | proprietary | proprietary | proprietary | proprietary | proprietary | proprietary | proprietary | proprietary | open | proprietary | proprietary | | open | proprietary | proprietary | proprietary | open | proprietary | proprietary | proprietary | open | proprietary | proprietary | proprietary | proprietary | open | proprietary | proprietary | proprietary | open | proprietary | open | open | open | proprietary | proprietary | open | proprietary | proprietary | proprietary | proprietary | open | proprietary | proprietary | proprietary | proprietary | open | open | open | open | proprietary | proprietary | open | proprietary | open | open | proprietary | open | open | open | open | proprietary | proprietary | proprietary | proprietary | proprietary | proprietary | open | proprietary | proprietary | proprietary | open | open | open | proprietary | open | open | proprietary | open | open | open | open | proprietary | open | proprietary | open | proprietary | |
96 | Organization | OpenAI | OpenAI | OpenAI | OpenAI | Anthropic | Anthropic | xAI | OpenAI | Anthropic | OpenAI | OpenAI | OpenAI | Anthropic | DeepSeek | Anthropic | Alibaba | OpenAI | OpenAI | DeepSeek | OpenAI | OpenAI | Alibaba | Anthropic | OpenAI | Alibaba | Mistral | OpenAI | xAI | DeepSeek | Anthropic | Reka | Meta | Alibaba | OpenAI | DeepSeek | Perplexity | Alibaba | Microsoft | OpenAI | Anthropic | Cohere | DeepSeek | Alibaba | Meta | OpenAI | Perplexity | Meta | xAI | Meta | Mistral | xAI | Microsoft | Mistral | DeepSeek | Anthropic | OpenAI | Amazon | OpenAI | Tencent | Mistral | Anthropic | OpenAI | Meta | Meta | Mistral | Tencent | DeepSeek | Alibaba | Amazon | AI21 | DeepSeek | AI21 | Meta | Reka | Mistral | 01.AI | ||||||||||||||||||||||||
97 | Search | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||