1
MODEL: o4-mini (high) | gpt-5 (high) | Gemini 2.5 Pro Preview 06-05 | Gemini 2.5 Pro Experimental | o3 pro | o3 (high) | Gemini 2.5 Pro Preview 05-06 | gpt-5 (medium) | o3 (medium) | Claude 4 Opus Thinking | Claude 4 Sonnet Thinking | gpt-5 (low) | Grok 3 Mini Reasoning (high) | Gemini 2.5 Flash (Thinking) Preview | o3-mini (high) | Claude 4 Opus | o4-mini (medium) | o1 | GPT-4.5 Preview | Claude 3.7 Sonnet Thinking | DeepSeek R1 | Gemini 2.5 Flash Lite (Reasoning) Preview 06-17 | Claude 4 Sonnet | GLM-4.6 | Qwen 3 30B A3B (Reasoning) | GPT-4o 2025-03-27 | o3-mini (medium) | Gemini 2.5 Flash Preview | DeepSeek V3 0324 | GPT-4.1 | Gemini 2.5 Flash | o1-preview | Qwen 2.5 Max | Gemini 2.0 Pro Experimental | Claude 3.7 Sonnet | GPT-4.1 mini | Gemini 2.0 Flash Thinking Experimental | Qwen 3 32B | Mistral Medium 3 | o1-mini | Grok 3 | DeepSeek V3 | Claude 3.5 Sonnet New 2024-10 | Reka Flash 3 Preview | Llama 4 Maverick | Qwen 3 30B A3B | GPT-4o 2024-11-20 | Gemini 2.5 Flash Lite Preview 06-17 | DeepSeek R1 Distill Llama 70B | Gemini 2.0 Flash | Gemini 1.5 Pro 002 | Sonar Pro | Qwen 2.5 Max | Phi-4 Reasoning Plus | GPT-4o 2024-08-06 | Claude 3.5 Sonnet 2024-06 | Gemini 2.0 Flash-Lite | Command A | DeepSeek R1 Distill Qwen 32B | Qwen 2.5 (72B) | Gemma 3 27B | Llama 4 Scout | GPT-4 Turbo | Sonar | Llama 3.1 405B | Grok-2 1212 | Llama 3.3 70B | Pixtral Large | Grok Beta | Gemma 3 12B | Phi-4 | Mistral Large 2 2024-07-24 | DeepSeek-V2.5 | Claude 3 Opus | GPT-4.1 nano | Nova Pro | GPT-4o mini | Hunyuan-Large 2025-02 | Gemini 1.5 Pro 001 | Mistral Small 3 2025-03 | Gemini 1.5 Flash 002 | Claude 3.5 Haiku | GPT-4 | Llama 3.1 70B | Llama 3.2 90B | Mistral Small 3 2025-01 | Hunyuan-Standard 2025-02 | DeepSeek-V2 | Qwen 2 (72B) | Nova Lite | Jamba 1.5 Large | DeepSeek-Coder-V2 | Gemma 2 27B | Jamba 1.6 Large | Gemini 1.5 Flash 001 | Llama 3 70B | Reka Core | Mistral Small 2024-09 | Yi-Large
2
Version (date, alias, etc)4/16/20258/7/20256/5/20253/25/20256/10/20254/16/20255/6/20258/7/20254/16/20255/22/20255/22/20258/7/20254/9/20254/17/20252/13/20255/22/20254/16/202512/17/20242/27/20252/24/20251/20/20256/17/20255/22/20259/30/20254/28/20252025-03-0271/31/20254/17/20253/24/20254/14/20256/17/2025
preview-2024-09-12
3/5/20252/5/20252/24/20254/14/202512/19/20244/28/20255/7/2025
preview-2024-09-12
3/17/202512/6/202410/22/20243/10/20254/5/20254/28/202511/20/20246/17/20251/1/20252/5/20251.5-pro-0023/7/20251/29/20255/1/20258/6/20246/20/20242/5/20253/13/20251/29/20252025-03–124/5/20250125-preview1/25/20257/23/202412/12/202412/6/2021411/18/20248/13/20242025-03–127/24/20244/14/202512/3/20242/10/20253/17/20251.5-flash-00210/22/20241/1/20252/10/202512/3/20243/13/20257/22/20249/1/2024
3
Provider (for Throughput and Cost metrics)OpenAIOpenAIGoogle VertexGoogleOpenAIOpenAIGoogle VertexOpenAIOpenAIGoogle VertexGoogle VertexOpenAIxAIGoogleOpenAIGoogle VertexOpenAIOpenAIOpenAIAnthropicDeepSeekGoogle AI StudioGoogle Vertexz.AIDeepInfraOpenAIOpenAIGoogleHyperbolicOpenAIGoogle AI StudioOpenAIFireworksGoogleAnthropicOpenAIGoogleParasailMistralOpenAIxAIFireworks AIAnthropicRekaFireworksDeepInfraOpenAIGoogle AI StudioTogetherGoogleGooglePerplexityAlibabaDeepInfraOpenAIAnthropicGoogleCohereDeepInfraAlibabaParasailParasailOpenAIPerplexityTogetherxAIFriendliMistralxAIDeepInfraDeepInfraMistralAnthropicOpenAIAWSOpenAIGoogleMistralGoogleAnthropicOpenAIFireworks AIFireworks AIMistralTogether.aiAWSAmazon BedrockTogether.aiAmazon BedrockGoogleFireworks AIRekaMistralFireworks AI
4
Year Released202520252025202520252025202520252025202520252025202520252025202520252024202520252025202520252025202520252025202520252025202520242025202520252025202420252025202420252024202420242025202520242025202520252024202520252025202420242025202520252024202520252023202520242024202420242024202520242024202420242025202420242025202420252024202420232024202420252025202420242024202420242024202520242024202420242024
5
COMPOSITE CAPABILITY
100 represents the most capable model at this moment in time.

See Composite Capability sheet for complete calculation.

Calculation method updated 2025-03
9710096959595949494939090888787878685858483838289818181807979777777767575747372727169686968686767676767666666666665656464636361595858575656565554545352525252515151515050504949494745444443434342404040
6
Composite Capability Consistency
100 represents consistent performance across benchmarks relative to other models. Lower score indicates higher variance.
56471767289244578198868322n/a406993692886757470n/a2750447156n/a705563716768436169747568n/a715555n/a7273776972n/a6060664517034n/a766483577281952262718583575975n/an/a995694n/a808954n/an/a1756894693n/an/a87749488
7
Artificial Analysis Intelligence Index V2
Weighted average of benchmarks for General Reasoning & Knowledge, Math, and Code Gen.
70697068716668686364616367606658625760555356565063485253535849485352444954514644475043414645444543454141405238434341413738344037354137363528353533353329292728
8
Artificial Analysis Index V1
Average result across MMLU, GPQA, Math, and HumanEval.

V1 Index deprecated 2025-02
908989868584798085838079787679777574747472767472707573726868687266727064676162576058
9
LM Arena ELO (style control enabled)
Crowdsourced comparative evaluation of "vibes".
Can be misleading: "answers people like" is not the same as "right answers", i.e. a confidently wrong answer often beats a nuanced, considered one.
148114701437144614341346141413821380141713711376138713901429142213291395138213911363134613811358135513111328135513161332133713451309129313391286132812841341132013311324131613421302125412491235132312371232124012391240120912341242124112301232123512141186122911811181120511821207120411951207
10
LiveBench Global Average
Questions refresh monthly, potential for stale values in spreadsheet.

Categories: Reasoning, Coding, Math, Data Analysis, Language, and Instruction Following.
78.7278.5969.3976.6974.4280.7171.9976.4579.2579.5379.0975.3468.3375.8871.5274.4075.6765.9374.5072.4969.6564.7570.0169.9366.8662.9960.0365.1365.5659.0566.9271.0356.5957.7656.9560.4557.9454.3865.3257.7954.5061.4754.3360.2762.2956.6455.3353.2445.5551.4449.9950.4046.8852.3654.3050.1649.1841.2541.6145.9849.1239.7243.5541.2643.9648.5943.4544.8942.5536.3538.1933.39
11
Primary Dimensions:

1. Capability: output quality
2. Throughput: sec to first 1k tokens (latency + tokens/sec)
3. Cost: $ per token
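
The Throughput dimension above can be sketched as first-chunk latency plus the time needed to generate the tokens at the median rate. This is a minimal sketch, not the sheet's own formula: the function name and parameters are hypothetical, and the inputs are the latency and tokens/sec values listed for the first two models.

```python
def seconds_to_first_tokens(latency_s: float, tokens_per_s: float, n_tokens: int = 1000) -> float:
    """Seconds until the first n_tokens have arrived: latency plus generation time."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return latency_s + n_tokens / tokens_per_s


# Latency (s) and median tokens/sec as listed for the first two models.
o4_mini_high = seconds_to_first_tokens(8.87, 83.5)
gpt5_high = seconds_to_first_tokens(2.08, 93.0)

print(round(o4_mini_high, 2))                                # first 1k tokens
print(round(gpt5_high, 2))                                   # first 1k tokens
print(round(seconds_to_first_tokens(8.87, 83.5, 4000), 2))   # first 4k tokens
```

With these inputs the results round to 20.85, 12.83, and 56.77, which agree with the first entries of the "Seconds per first 1K tokens output" and "Seconds per first 4k tokens output" rows. Other entries may not reproduce exactly, since throughput samples vary by date.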

Unique to this sheet:

- Composite Capability: normalized average of key Capability indexes and ELOs, providing a single score for "which model is most capable".
- Composite Capability Consistency: derived from the standard deviation of normalized capability scores. A higher value indicates more consistent eval performance (lower variance).
- Efficiency Index: weighted, normalized score that reflects overall performance across Capability, Throughput, and Cost, answering "which model has the optimal balance of primary dimensions".
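
One possible reading of "normalized average" is sketched below. The exact method lives in the Composite Capability sheet and is not reproduced here, so everything in this block is an assumption: scaling each benchmark to its best score, averaging whatever scores a model has, and rescaling so the top model is 100. The function names and the toy scores are made up for illustration. The sheet's mapping from standard deviation to a 0-100 consistency score is not specified, so only the raw spread is computed.

```python
from statistics import mean, pstdev


def normalize_columns(scores):
    """Scale each benchmark so its best reported score is 100 (None = not reported)."""
    best = {}
    for row in scores.values():
        for bench, value in row.items():
            if value is not None:
                best[bench] = max(best.get(bench, value), value)
    return {
        model: {b: 100 * v / best[b] for b, v in row.items() if v is not None}
        for model, row in scores.items()
    }


def composite_capability(scores):
    """Average the normalized scores per model, then rescale so the top model is 100."""
    normalized = normalize_columns(scores)
    averages = {m: mean(vals.values()) for m, vals in normalized.items() if vals}
    top = max(averages.values())
    return {m: round(100 * avg / top) for m, avg in averages.items()}


def normalized_spread(scores):
    """Standard deviation of each model's normalized scores (lower = more consistent)."""
    normalized = normalize_columns(scores)
    return {m: pstdev(vals.values()) for m, vals in normalized.items() if len(vals) > 1}


# Toy, made-up scores; not taken from the sheet.
toy = {
    "model_a": {"index": 60, "elo": 1400, "livebench": 75.0},
    "model_b": {"index": 50, "elo": 1300, "livebench": None},
}
print(composite_capability(toy))
print(normalized_spread(toy))
```

Averaging only the reported scores keeps models with missing benchmarks in the ranking, at the cost of comparing models on different subsets, which is one reason a consistency measure is useful alongside the composite.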

2025 emerging trends:

- Reasoning models broke through the Capability plateau and reset the price war with a differentiated tier. Caveats:
  - Reasoning models generate significantly more tokens, greatly increasing both total cost and time to full response beyond what $/token and tokens/sec alone indicate.
  - Continued improvement in frontier non-reasoning LLMs has closed the gap with first-wave reasoning model capability.
- Open models narrowed the capability gap with Closed models, exerting more downward pressure on cost.
- 2025 "smart" models are competitive in speed and cost with 2024's "fast & cheap" models.
- Benchmarks lose validity and utility over time (saturation, contamination, test-specific optimization, etc.). Purpose-specific and controlled evaluation is required.

2024 at a glance:

- Cost decreased significantly (15x price drop for "GPT-4" level capability in 1 yr).
- Capability incrementally improved.
- Small models, aka SLMs (low parameter count), are increasingly capable and appropriate for tightly focused use cases, potentially on-device.
12
EFFICIENCY INDEX (weighted, normalized)
100 represents the optimal balance of primary dimensions:

- Capability: 2x weight
- Throughput: sqrt
- Cost: sqrt
90n/a918864869086858185n/a8890867885736483799082n/a858583876581847682n/a7882n/a78797974757578797876827780777272n/a7474807875717176667371717469697069685659747070n/a6872736856706871n/a616768625865646967526562
13
THROUGHPUT
14
Throughput (median tokens/sec)
Highly variable based on sample date.
83.593.0124.413.347.393.345.047.348.855.4130.8341.5118.348.876.129.016.162.436.2292.665.7170.0193.093.5166.17.152.1115.487.971.0118.462.479.0213.370.456.0131.134.733.257.756.490.2170.068.9292.694.1107.480.247.635.533.956.664.5184.4184.147.627.326.882.531.083.364.559.5101.252.259.238.736.453.68.929.5184.273.960.975.9128.8181.468.428.3102.466.6128.016.864.476.244.016.962.268.1153.3129.914.779.563.2
15
Latency seconds (First Chunk)8.872.0815.389.729.622.266.639.621.981.919.093.59.641.985.0615.851.051.1719.650.131.568.420.375.570.640.940.630.441.11.3512.321.170.634.3313.20.770.990.741.081.370.940.48.420.740.132.570.541.023.060.863.580.561.160.470.360.491.650.880.840.732.351.090.270.290.740.250.660.160.511.812.240.430.370.451.190.381.240.510.480.260.130.241.040.750.320.52
16
Seconds per first 1K tokens output
Is it fast?
20.8512.8323.34164.9130.7812.9728.8630.7822.4619.9616.746.4318.0922.4618.1950.3863.3217.2047.243.5516.7714.305.5516.266.66142.1819.829.1112.4815.4320.7717.2013.299.0227.4118.628.6229.5731.1618.7018.6711.4914.3015.253.5513.209.8513.4924.0929.0433.0818.2416.655.895.7921.5138.2538.1412.9732.9814.3516.6017.0910.1719.9017.1526.5127.6119.18113.9236.165.8613.9116.8814.368.145.5115.8635.3410.2815.508.0759.5215.5313.2622.9759.1717.1315.456.848.2268.0312.5815.82
17
Seconds per first 4k tokens output56.7745.0747.45390.4794.2645.1195.5494.2683.9174.1139.6715.2143.4583.9157.59153.97250.1265.29130.0313.8062.4231.9521.1048.3424.72565.9177.3935.1046.6257.6946.1065.2951.2723.0870.0572.1731.50116.05121.4270.7171.8644.7731.9558.8013.8045.0737.7850.9187.18113.60121.5771.2963.1422.1622.0984.58148.06149.9149.35129.7250.3763.1367.5339.8277.3767.85104.07109.9675.19450.24137.9322.1554.5366.1553.8731.4422.0559.71141.3439.5760.5831.51238.1062.1152.6591.15236.6965.3959.5326.4131.31272.1150.3163.29
18
COST
19
Cost Variants (Cached 1M input, queries…)$7.500$37.500$0.550$7.500$1.500$3.75write / $0.3read$1.250$0.025$0.313$1.250$3.75write / $0.3read
$18.75write/$1.50read
$0.075$0.019$0.30write/$0.03read
20
Cost Uncached Input (1M tokens)$1.100$1.250$1.250$1.250$20.000$2.000$1.250$1.250$2.000$15.000$3.000$1.250$0.300$0.150$1.100$15.000$1.100$15.000$75.000$3.000$0.550$0.100$3.000$0.300$5.000$1.100$0.150$4.000$2.000$0.300$15.000$0.900$3.000$0.400$0.100$0.400$3.000$3.000$0.900$3.000$0.200$0.220$0.300$2.500$0.100$2.000$0.100$1.250$3.000$1.600$2.500$3.000$0.075$2.500$0.120$0.380$0.300$0.140$10.000$1.000$3.500$2.000$0.600$2.000$5.000$0.050$0.170$2.000$1.070$15.000$0.100$0.800$0.150$3.500$0.100$0.075$0.800$30.000$0.900$1.200$0.100$0.140$0.630$0.060$2.000$0.140$0.800$2.000$0.075$0.900$3.000$0.200$3.000
21
Cost Output (1M tokens)$4.400$10.000$10.000$10.000$80.000$8.000$10.000$10.000$8.000$75.000$15.000$10.000$0.500$3.500$4.400$75.000$4.400$60.000$150.000$15.000$2.190$0.400$15.000$0.100$15.000$4.400$0.600$4.000$8.000$2.500$60.000$0.900$15.000$1.600$0.500$2.000$12.000$15.000$0.900$15.000$0.800$0.880$0.100$10.000$0.400$2.000$0.400$2.500$15.000$6.400$10.000$15.000$0.300$10.000$0.180$0.400$0.500$0.580$30.000$1.000$3.500$10.000$0.600$6.000$15.000$0.100$0.680$6.000$1.140$75.000$0.400$3.200$0.600$10.500$0.300$0.300$4.000$60.000$0.900$1.200$0.300$0.280$0.650$0.240$8.000$0.280$0.800$8.000$0.300$0.900$15.000$0.600$3.000
22
Cost 1M tokens (3:1 input:output)
Is it cheap?
$1.925$3.438$3.438$3.438$35.000$3.500$3.438$3.438$3.500$30.000$6.000$3.438$0.350$0.988$1.925$30.000$1.925$26.250$93.750$6.000$0.960$0.175$6.000$0.250$7.500$1.925$0.263$4.000$3.500$0.850$26.250$0.900$6.000$0.700$0.200$0.800$5.250$6.000$0.900$6.000$0.350$0.385$0.250$4.375$0.175$2.000$0.175$1.563$6.000$2.800$4.375$6.000$0.131$4.375$0.135$0.385$0.350$0.250$15.000$1.000$3.500$4.000$0.600$3.000$7.500$0.063$0.298$3.000$1.088$30.000$0.175$1.400$0.263$5.250$0.150$0.131$1.600$37.500$0.900$1.200$0.150$0.175$0.635$0.105$3.500$0.175$0.800$3.500$0.131$0.900$6.000$0.300$3.000
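
The 3:1 input:output blend can be sketched as a weighted average of the two per-1M-token prices. This is a minimal sketch with a hypothetical function name; the prices are the uncached input and output costs listed for the first, second, and fifth models.

```python
def blended_cost_per_million(input_cost: float, output_cost: float,
                             input_parts: int = 3, output_parts: int = 1) -> float:
    """Weighted average price per 1M tokens for a given input:output token mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_cost + output_parts * output_cost) / total_parts


print(round(blended_cost_per_million(1.10, 4.40), 3))    # o4-mini (high)
print(round(blended_cost_per_million(1.25, 10.00), 3))   # gpt-5 (high)
print(round(blended_cost_per_million(20.00, 80.00), 3))  # o3 pro
```

These evaluate to 1.925, 3.438, and 35.0, matching the corresponding entries in the row above. The blend ignores cached-input discounts and the extra output tokens that reasoning models generate.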
23
24
25
COST VS (CAPABILITY, THROUGHPUT)
26
Cost 1M (3:1 IO) tokens per Composite Capability point
Is capability in line with cost?
$0.020$0.034$0.036$0.036$0.369$0.037$0.036$0.037$0.037$0.324$0.066$0.038$0.004$0.011$0.022$0.346$0.022$0.308$1.103$0.071$0.012$0.002$0.073$0.003$0.093$0.024$0.003$0.050$0.044$0.011$0.341$0.012$0.080$0.009$0.003$0.011$0.073$0.084$0.013$0.088$0.005$0.006$0.004$0.065$0.003$0.030$0.003$0.023$0.090$0.042$0.067$0.091$0.002$0.067$0.002$0.006$0.006$0.004$0.244$0.017$0.060$0.069$0.011$0.053$0.133$0.001$0.005$0.056$0.020$0.563$0.003$0.027$0.005$0.102$0.003$0.003$0.032$0.751$0.018$0.024$0.003$0.004$0.014$0.002$0.080$0.004$0.018$0.082$0.003$0.021$0.149$0.008$0.075
27
Cost 1M (3:1 IO) tokens per AA Intelligence Index v2 point$0.028$0.050$0.049$0.051$0.493$0.053$0.051$0.051$0.056$0.469$0.098$0.055$0.005$0.016$0.029$0.517$0.423$0.105$0.016$0.003$0.113$0.004$0.150$0.031$0.005$0.077$0.066$0.016$0.016$0.125$0.013$0.005$0.016$0.097$0.118$0.020$0.136$0.007$0.008$0.006$0.107$0.004$0.044$0.004$0.035$0.140$0.062$0.107$0.003$0.109$0.003$0.009$0.006$0.023$0.085$0.015$0.081$0.197$0.002$0.007$0.081$0.857$0.004$0.038$0.007$0.004$0.005$0.046$0.026$0.036$0.004$0.003$0.121$0.121$0.011$0.107
28
Cost 1M (3:1 IO) tokens per AA Index v1 point$0.292$0.011$0.022$0.305$0.063$0.011$0.075$0.024$0.002$0.020$0.035$0.056$0.079$0.002$0.005$0.200$0.047$0.008$0.041$0.104$0.004$0.041$0.015$0.429$0.019$0.004$0.002$0.024$0.013$0.018$0.002$0.003$0.009$0.002$0.055$0.003$0.013$0.015$0.105$0.005$0.052
29
Cost 1M (3:1 IO) tokens per Chatbot Arena ELO point$0.002$0.002$0.002$0.002$0.002$0.001$0.021$0.001$0.019$0.066$0.004$0.001$0.000$0.004$0.005$0.001$0.000$0.003$0.003$0.019$0.001$0.004$0.001$0.000$0.001$0.004$0.005$0.001$0.004$0.000$0.000$0.003$0.000$0.001$0.002$0.003$0.005$0.000$0.003$0.000$0.000$0.012$0.003$0.003$0.000$0.000$0.000$0.002$0.001$0.024$0.000$0.001$0.000$0.004$0.000$0.001$0.030$0.001$0.000$0.001$0.000$0.003$0.000$0.001$0.000$0.001$0.005
30
Cost 1M (3:1 IO) tokens per LiveBench point$0.024$0.044$0.050$0.045$0.470$0.043$0.048$0.045$0.044$0.377$0.076$0.046$0.005$0.025$0.419$0.026$0.347$1.422$0.081$0.013$0.086$0.116$0.027$0.004$0.060$0.056$0.015$0.092$0.012$0.003$0.014$0.091$0.105$0.015$0.104$0.007$0.004$0.076$0.037$0.003$0.029$0.100$0.045$0.079$0.002$0.003$0.007$0.007$0.298$0.021$0.067$0.074$0.012$0.153$0.002$0.007$0.024$0.611$0.004$0.032$0.006$0.003$0.003$0.037$0.020$0.004$0.003$0.021$0.009
31
Cost 1M (3:1 IO) tokens per Throughput (first 1k tokens output)
Is throughput in line with cost?
$0.092$0.268$0.147$0.212$0.114$0.265$0.119$0.114$1.336$0.301$0.021$0.154$0.106$1.336$0.106$0.521$1.481$0.349$0.020$0.049$0.358$0.017$1.351$0.118$0.039$0.028$0.177$0.093$2.104$0.058$0.349$0.053$0.007$0.043$0.609$0.203$0.029$0.321$0.019$0.034$0.017$0.287$0.049$0.152$0.018$0.116$0.249$0.096$0.240$0.360$0.022$0.755$0.006$0.010$0.009$0.019$0.455$0.070$0.211$0.234$0.059$0.151$0.437$0.002$0.011$0.156$0.010$0.830$0.030$0.101$0.016$0.366$0.018$0.024$0.101$1.061$0.088$0.077$0.019$0.003$0.041$0.008$0.152$0.003$0.047$0.227$0.019$0.110$0.088$0.024$0.190
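
The ratios in this section appear to be the blended 3:1 cost divided by a score or by the seconds to the first 1k tokens. The sketch below assumes that reading; the helper name is hypothetical, and returning `None` for a missing denominator is an assumption that mirrors blank or n/a cells.

```python
from typing import Optional


def cost_per_unit(blended_cost: float, denominator: Optional[float]) -> Optional[float]:
    """Blended cost divided by a score or a time; None when the denominator is missing."""
    if denominator is None or denominator <= 0:
        return None
    return blended_cost / denominator


# o4-mini (high): blended cost 1.925, Composite Capability 97, 20.85 s to first 1k tokens.
print(round(cost_per_unit(1.925, 97), 3))      # per Composite Capability point
print(round(cost_per_unit(1.925, 20.85), 3))   # per second to first 1k tokens
print(cost_per_unit(1.925, None))              # missing score
```

The first two values round to 0.02 and 0.092, in line with the first entries of the "per Composite Capability point" and "per Throughput" rows.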
32
33
34
BENCHMARKS
35
Scores are opportunistically sourced from HuggingFace leaderboards, model cards, docs & press releases, other 3rd parties.

Many reproducibility, precision, and accuracy issues due to selection bias & varied methodologies (n-shot, turns, CoT, benchmark variant…), susceptible to prompt sensitivity, construct validity, and contamination.

Models used by HuggingFace Leaderboard 2 (2024 June 26) are highlighted in GREEN.

Models used by HuggingFace Leaderboard 1 are highlighted in ORANGE. These benchmarks are increasingly outdated for various reasons, e.g. benchmark inaccuracies, or being effectively "solved" by frontier models (through training to beat the benchmark, leaked criteria, or an outdated level of difficulty).
36
General
37
MMLU PRO
HuggingFace Leaderboard 2
Scores per category (Biology, Business, Engineering, …)
84.084.084.079.077.076.081.080.076.274.076.078.067.080.077.675.876.072.676.171.671.086.067.063.773.368.970.085.045.370.468.068.569.063.169.066.867.365.064.866.467.066.363.652.653.553.659.156.253.058.1
38
MMLU
HuggingFace Leaderboard 1
Most widely reported, but many known issues.
86.992.091.085.990.885.287.089.088.086.087.088.788.785.078.686.488.686.074.584.884.086.885.982.081.980.675.081.086.486.086.082.078.583.880.581.275.278.982.083.283.8
39
Humanity's Last Exam
Technical knowledge and reasoning via structured academic problems
18.812.18.35108.5711.1546.6757.24.054.8554.895.152.624.43553.93.9750.845.53
40
HELM90.893.888.590.885.880.372.288.570.179.391.583.882.770.874.273.353.0
41
Kagi LLM Benchmarking
Omitted - scores highly variable over time
42
SimpleBench38.751.653.122.840.134.546.430.941.731.344.930.718.118.941.427.718.927.117.827.525.119.922.719.922.523.510.7
43
TriviaQA85.578.2
44
Arena Hard79.279.378.084.075.473.260.489.772.037.987.362.346.957.549.646.663.7
45
SimpleQA52.947.063.044.329.938.021.715.010.4
46
Instruction Following
47
IFEval
HuggingFace Leaderboard 2
83.990.885.688.088.692.163.092.185.084.387.582.979.989.780.7
48
Math
49
MATH
Multiple difficulty levels, inconsistently reported.
HuggingFace Leaderboard 2 uses Level 5 (most difficult)
97.995.095.097.097.394.094.896.091.882.290.090.085.078.389.092.090.986.583.075.971.186.882.085.088.073.473.877.073.043.380.460.176.670.267.771.077.969.264.568.068.070.635.173.353.554.950.462.4
50
GSM8K
HuggingFace Leaderboard 1
96.196.482.696.871.093.095.094.890.894.295.185.394.587.085.492.6
51
HiddenMath87.383.379.656.765.263.663.552.055.328.047.220.3
52
AIME 202586.7
53
AIME 202436.749.052.079.523.392.016.051.09.39.610.025.09.05.3
54
MathVista73.268.163.869.456.763.968.965.857.358.4
55
Reasoning
56
GPQA
HuggingFace Leaderboard 2
84.079.777.071.477.070.075.062.078.059.064.768.060.056.065.053.061.060.159.153.053.159.460.153.049.043.049.151.150.551.043.025.456.147.050.446.940.246.044.451.041.641.446.746.745.316.342.036.934.441.439.543.5
57
BIG-BENCH-HARD
HuggingFace Leaderboard 2
93.177.772.686.886.984.083.157.582.485.5
58
MuSR
HuggingFace Leaderboard 2
17.319.7
59
ARC Challenge
HuggingFace Leaderboard 1
96.796.770.696.968.994.896.494.868.892.493.071.4
60
HellaSwag
HuggingFace Leaderboard 1
85.684.295.495.387.385.7
61
TruthfulQA (MC2)
HuggingFace Leaderboard 1
54.761.8
62
WinoGrande
HuggingFace Leaderboard 1
78.874.384.582.9
63
ARC Easy
64
DROP79.887.177.285.472.283.179.780.9
65
Social IQA
66
MedQA
67
Code
68
Aider polyglot72.072.979.649.360.461.744.964.956.945.353.847.155.152.420.938.260.432.418.232.948.451.615.618.222.121.827.112.017.88.93.628.0
69
HumanEval97.098.098.097.092.092.498.092.092.492.093.795.097.090.056.593.090.292.090.082.088.089.087.689.088.487.045.782.692.084.988.087.271.988.488.186.680.584.864.684.073.281.776.882.3
70
HumanEval Plus82.887.0
71
LiveCodeBench v570.436.034.528.9
72
MBPP Base65.660.480.0
73
MBPP EvalPlus90.588.687.769.083.686.0
74
Natural2Code92.985.482.679.877.2
75
Tool use
76
BFCL v359.357.556.562.257.461.956.852.160.555.157.953.753.755.951.5
77
BFCL v277.281.177.377.5
78
BFCL v1 (Berkeley Function Calling)66.480.580.288.586.388.384.885.5
79
Nexus56.145.758.750.356.7
80
Agentic
81
SWE-bench Verified49.033.433.4
82
Conversational
83
MT Bench93.274.086.383.383.578.692.6
84
Multimodal (Image)
85
MMMU81.774.478.272.771.875.470.472.765.968.369.168.056.164.050.359.459.462.262.862.356.156.3
86
Multilingual
87
MGSM90.591.674.388.691.691.164.380.690.787.087.586.586.986.975.582.6
88
Long Context
89
ZeroSCROLLS/QuALITY90.590.595.295.290.5
90
InfiniteBench/En.MC82.583.472.178.2
91
NIH/Multi-needle100.090.898.1100.097.5
92
MRCR (1M)74.770.558
93
MISC
94
Reasoning Tokens
95
Licenseproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryopenproprietaryproprietaryopenproprietaryproprietaryproprietaryopenproprietaryproprietaryproprietaryopenproprietaryproprietaryproprietaryproprietaryopenproprietaryproprietaryproprietaryopenproprietaryopenopenopenproprietaryproprietaryopenproprietaryproprietaryproprietaryproprietaryopenproprietaryproprietaryproprietaryproprietaryopenopenopenopenproprietaryproprietaryopenproprietaryopenopenproprietaryopenopenopenopenproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryopenproprietaryproprietaryproprietaryopenopenopenproprietaryopenopenproprietaryopenopenopenopenproprietaryopenproprietaryopenproprietary
96
OrganizationOpenAIGoogleGoogleOpenAIOpenAIGoogleOpenAIAnthropicAnthropicxAiGoogleOpenAIAnthropicOpenAIOpenAIOpenAIAnthropicDeepSeekGoogleAnthropicAlibabaOpenAIOpenAIGoogleDeepSeekOpenAIGoogleOpenAIAlibabaGoogleAnthropicOpenAIGoogleAlibabaMistralOpenAIxAIDeepSeekAnthropicRekaMetaAlibabaOpenAIGoogleDeepSeekGoogleGooglePerplexityAlibabaMicrosoftOpenAIAnthropicGoogleCohereDeepSeekAlibabaGoogleMetaOpenAIPerplexityMetaxAIMetaMistralxAIGoogleMicrosoftMistralDeepSeekAnthropicOpenAIAmazonOpenAITencentGoogleMistralGoogleAnthropicOpenAIMetaMetaMistralTencentDeepSeekAlibabaAmazonAI21DeepSeekGoogleAI21GoogleMetaRekaMistral01.AI
97
Search