A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | (670) Permalink: | https://lifearchitect.ai/models-table | Timeline view: | https://lifearchitect.ai/timeline | The Memo: | https://lifearchitect.ai/memo | ||||||||||||||
2 | Model | Lab | Playground | Parameters (B) | Tokens trained (B) | Ratio Tokens:Params (Chinchilla scaling ≥ 20:1) | ALScore (quick-and-dirty power rating: √(Parameters × Tokens) ÷ 300; any ALScore ≥ 1.0 is a powerful model as of mid-2023; worked example below the table) | MMLU | MMLU-Pro | GPQA | HLE | Training dataset | Announced ▼ | Public? | Paper / Repo | Arch | Tags | Notes | Count (rough) |
3 | AuroraGPT (ScienceGPT) | Argonne National Laboratory | https://lifearchitect.ai/auroragpt/ | 2000 | 30000 | 15:1 | 25.8 | TBA | 🔴 | | Three models targeted in Jul/2024: AuroraGPT-7B-P (Ponte Vecchio GPU testing), AuroraGPT-7B-A (Aurora), AuroraGPT-7B-A-S (Aurora + Science). | l |
4 | DeepSeek-R2 | DeepSeek-AI | https://www.reuters.com/technology/artificial-intelligence/deepseek-rushes-launch-new-ai-model-china-goes-all-2025-02-25/ | 1200 | 130000 | 109:1 | 41.6 | TBA | 🟢 | https://docs.google.com/document/d/e/2PACX-1vTmx-A5sBe_3RsURGM7VvLWsAgUXbcIb2pFaW7f1FTPgK7mGvYENXGQPoF2u4onFndJ_5tzZ02su-vg/pub | MoE | Reasoning, SOTA | Due April 2025. Hybrid MoE, 1.2TA78B. 5.2PB corpus=1.3Qa tokens. 1.3 quadrillion tokens = 1,300T tokens = 1,300,000B tokens "Constructed a 5.2 PB high-quality corpus covering vertical domains such as finance, law, and patents." http://jiuyangongshe.com/a/1h4gq724su0 and translated at: https://docs.google.com/document/d/e/2PACX-1vTmx-A5sBe_3RsURGM7VvLWsAgUXbcIb2pFaW7f1FTPgK7mGvYENXGQPoF2u4onFndJ_5tzZ02su-vg/pub | k | ||||||
5 | ERNIE 5 | Baidu | https://lifearchitect.ai/ernie/ | TBA | | j | ||||||||||||||
6 | Gemini Ultra | Google DeepMind | https://www.reddit.com/r/singularity/comments/1kbpdvp/a_string_referencing_gemini_ultra_has_been_added/ | 2000 | 30000 | 15:1 | 25.8 | TBA | | Due May/2025. | i | |||||||||
7 | GPT-6 | OpenAI | https://lifearchitect.ai/gpt-6/ | TBA | SOTA | Due 2025. | f | |||||||||||||
8 | Llama 4 Reasoning | Meta AI | https://ai.meta.com/blog/llama-4-multimodal-intelligence/ | TBA | 🟢 | https://ai.meta.com/blog/llama-4-multimodal-intelligence/ | MoE | Reasoning, SOTA | Announced, coming soon. | d | ||||||||||
9 | o4 | OpenAI | https://lifearchitect.ai/o4/ | TBA | Reasoning, SOTA | Due 2025. | b | |||||||||||||
10 | o5 | OpenAI | https://lifearchitect.ai/o5/ | TBA | Reasoning, SOTA | Due 2025. Proto-ASI. | b | |||||||||||||
11 | GLM-4.6 | Z.AI | https://huggingface.co/zai-org/GLM-4.6 | 355 | 22000 | 62:1 | 9.3 | 82.9 | 30.4 | synthetic, web-scale | Sep/2025 | 🟢 | https://z.ai/blog/glm-4.6 | MoE | Reasoning | 355B-A32B. "context window has been expanded from 128K to 200K tokens" | 670 | |||
12 | Ring-1T-preview | InclusionAI | https://huggingface.co/inclusionAI/Ring-1T-preview | 1000 | 20000 | 20:1 | 14.9 | synthetic, web-scale | Sep/2025 | 🟢 | https://huggingface.co/inclusionAI/Ring-1T-preview | MoE | Reasoning | 1T-A48.5B. | 669 | |||||
13 | Claude Sonnet 4.5 | Anthropic | https://claude.ai/ | 400 | 80000 | 200:1 | 18.9 | 83.4 | synthetic, web-scale | Sep/2025 | 🟢 | https://www.anthropic.com/news/claude-sonnet-4-5 | MoE | Reasoning, SOTA | The Claude Sonnet 4.5 "system card" is an absolute farce, won't be linked here. | 668 | ||||
14 | Gemini Robotics 1.5 | Google DeepMind | 200 | 20000 | 100:1 | 6.7 | 59.6 | synthetic, web-scale | Sep/2025 | 🟢 | https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf | MoE | Reasoning | 2. "vision-language-action (VLA) model turns visual information and instructions into motor commands for a robot to perform a task." Available to select partners. | 667 | |||||
15 | Gemini Robotics-ER 1.5 | Google DeepMind | 30 | 30000 | 1,000:1 | 3.2 | 83.3 | synthetic, web-scale | Sep/2025 | 🟢 | https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf | MoE | Reasoning | 1. "vision-language model (VLM) reasons about the physical world, natively calls digital tools and creates detailed, multi-step plans to complete a mission." Available to all devs. | 666 | |||||
16 | TimesFM-ICF | Google Research | 0.2 | 100 | 500:1 | 0.0 | special | Sep/2025 | 🔴 | https://research.google/blog/time-series-foundation-models-can-be-few-shot-learners/ | Dense | | TimesFM-ICF is 6.8% more accurate than TimesFM (Base). Time-series forecasting only. 'a large pretraining corpus of 100B real world time-points' may be more than 100B tokens. | 665 |
17 | Qwen3-Max | Alibaba | https://chat.qwen.ai/ | 1000 | 36000 | 36:1 | 20.0 | 85.4 | synthetic, web-scale | Sep/2025 | 🟢 | https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2777d&from=research.latest-advancements-list | MoE | Reasoning | "Qwen3-Max-Thinking — still under active training — is already demonstrating remarkable potential. When augmented with tool usage and scaled test-time compute, the Thinking variant has achieved 100% on challenging reasoning benchmarks such as AIME 25 and HMMT. " | 664 | ||||
18 | Qwen3-Omni | Alibaba | https://github.com/QwenLM/Qwen3-Omni?tab=readme-ov-file | 30 | 17000 | 567:1 | 2.4 | 88.8 | 73.1 | synthetic, web-scale | Sep/2025 | 🟢 | https://github.com/QwenLM/Qwen3-Omni/blob/main/assets/Qwen3_Omni.pdf | MoE | Reasoning | "Qwen3-Omni is a unified end-to-end model capable of processing multiple modalities, such as text, audio, image and video, and generating real-time text or speech response."... "pretraining utilizes a large-scale dataset containing approximately 2 trillion tokens, with the following distribution across modalities: text (0.57 trillion), audio (0.77 trillion), image (0.82 trillion), video (0.05 trillion), and video-audio (0.05 trillion)." | 663 | |||
19 | DeepSeek-V3.1-Terminus | DeepSeek-AI | https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus | 685 | 15640 | 22:1 | 10.6 | 85 | 80.7 | 21.7 | synthetic, web-scale | Sep/2025 | 🟢 | https://api-docs.deepseek.com/news/news250922 | MoE | SOTA, Reasoning | Hybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2 | 662 | ||
20 | Isaac 0.1 | Perceptron | https://huggingface.co/PerceptronAI/Isaac-0.1 | 2 | 2000 | 1,000:1 | 0.2 | synthetic, web-scale | Sep/2025 | 🟢 | https://www.perceptron.inc/blog/introducing-isaac-0-1 | Dense | | "perceptive-language model...delivering capabilities that meet or exceed those of models over 50 times its size. Founded by the team behind Meta's Chameleon multimodal models, Perceptron is tackling a fundamental challenge: bringing the power of physical AI to the dynamic, multimodal, and real-time environments we live and work in." | 661 | |||||
21 | Grok 4 Fast | xAI | https://grok.com/ | 200 | 20000 | 100:1 | 6.7 | 85.7 | 20 | synthetic, web-scale | Sep/2025 | 🟢 | https://x.ai/news/grok-4-fast | MoE | Reasoning, SOTA | "2M token context window, and a unified architecture that blends reasoning and non-reasoning modes in one model." | 660 | |||
22 | VaultGemma | Google DeepMind | https://huggingface.co/google/vaultgemma-1b | 1 | 13000 | 13,000:1 | 0.4 | web-scale | Sep/2025 | 🟢 | https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf | Dense | | "Differential Privacy (DP) has emerged as the gold standard, providing a rigorous, mathematical framework to limit the influence of any single example in the training data on the resulting model. A model trained with DP provably bounds the reconstruction or leakage of information tied to individual data points." Announce: https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/ | 659 | |||||
23 | Qwen3-Next-80B-A3B | Alibaba | https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d | 80 | 15000 | 188:1 | 3.7 | 84.72 | 66.05 | 43.43 | synthetic, web-scale | Sep/2025 | 🟢 | https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list | MoE | Reasoning | "Qwen3-Next introduces several key improvements: a hybrid attention mechanism, a highly sparse Mixture-of-Experts (MoE) structure, training-stability-friendly optimizations, and a multi-token prediction mechanism for faster inference." | 658 | ||
24 | K2-Think | MBZUAI | https://www.k2think.ai/ | 32 | 18000 | 563:1 | 2.5 | 71.08 | 9.95 | synthetic, web-scale | Sep/2025 | 🟢 | https://arxiv.org/abs/2509.07604 | Dense | Reasoning | "Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets." | 657 | |||
25 | mmBERT | JHU | https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4 | 0.307 | 3000 | 9,772:1 | 0.1 | synthetic, web-scale | Sep/2025 | 🟢 | https://arxiv.org/abs/2509.06888 | Dense | | "a modern multilingual encoder trained on 3T tokens and 1833 languages. We introduce several novel elements in training: an inverse masking schedule and a cascading annealed language learning schedule for multilingual data" Announce: https://huggingface.co/blog/mmbert | 656 | |||||
26 | ERNIE X1.1 | Baidu | https://ernie.baidu.com/ | synthetic, web-scale | Sep/2025 | 🟢 | https://www.prnewswire.com/news-releases/baidu-unveils-reasoning-model-ernie-x1-1-with-upgrades-in-key-capabilities-302551170.html | MoE | Reasoning | 655 | ||||||||||
27 | ERNIE-4.5-21B-A3B-Thinking | Baidu | https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking | 21 | 15000 | 715:1 | 1.9 | synthetic, web-scale | Sep/2025 | 🟢 | https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking | MoE | Reasoning | 654 | ||||||
28 | Klear-46B-A2.5B | Kuaishou | https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct | 46 | 22000 | 479:1 | 3.4 | 80.5 | 57.6 | 35.3 | synthetic, web-scale | Sep/2025 | 🟢 | https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct | MoE | | 46B-A2.5B. | 653 | ||
29 | TildeOpen-30b | Tilde AI | https://huggingface.co/TildeAI/TildeOpen-30b | 30 | 2000 | 67:1 | 0.8 | synthetic, web-scale | Sep/2025 | 🟢 | https://tilde.ai/lv/tildeopen-llm/ | Dense | | "language data from across Europe" | 652 | |||||
30 | Qwen3-Max-Preview | Alibaba | https://chat.qwen.ai/ | 1000 | 36000 | 36:1 | 20.0 | 64.6 | synthetic, web-scale | Sep/2025 | 🟢 | https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?type=model&url=2840914_2&modelId=qwen3-max-preview | MoE | | GPQA score is SuperGPQA. "our biggest model yet, with over 1 trillion parameters" | 651 | ||||
31 | Kimi K2-Instruct-0905 | Moonshot AI | https://huggingface.co/moonshotai/Kimi-K2-Instruct | 1000 | 15500 | 16:1 | 13.1 | synthetic, web-scale | Sep/2025 | 🟢 | https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905 | MoE | Reasoning, SOTA | 1TA32B. 1T parameters and 384 experts. Open source SOTA. | 650 | |||||
32 | Apertus | ETH Zürich | https://huggingface.co/swiss-ai/Apertus-70B-Instruct-2509 | 70 | 15000 | 215:1 | 3.4 | 65.2 | 30.6 | synthetic, web-scale | Sep/2025 | 🟢 | https://github.com/swiss-ai/apertus-tech-report/blob/main/Apertus_Tech_Report.pdf | Dense | | "Apertus – Latin for “open”" 1,811 languages. Announce: https://ethz.ch/en/news-and-events/eth-news/news/2025/09/press-release-apertus-a-fully-open-transparent-multilingual-language-model.html | 649 | |||
33 | MAI-1-preview | Microsoft | https://microsoft.ai/news/two-new-in-house-models/ | 500 | 10000 | 20:1 | 7.5 | synthetic, web-scale | Aug/2025 | 🟢 | https://microsoft.ai/news/two-new-in-house-models/ | MoE | | MAI=Microsoft artificial intelligence. "MAI’s first foundation model trained end-to-end... MAI-1-preview is an in-house mixture-of-experts model, pre-trained and post-trained on ~15,000 NVIDIA H100 GPUs. This model is designed to provide powerful capabilities to consumers seeking to benefit from models that specialize in following instructions and providing helpful responses to everyday queries. We will be rolling MAI-1-preview out for certain text use cases within Copilot" | 648 | |||||
34 | grok-code-fast-1 | xAI | https://github.com/features/copilot | 100 | 10000 | 100:1 | 3.3 | synthetic, web-scale | Aug/2025 | 🟢 | https://data.x.ai/2025-08-26-grok-code-fast-1-model-card.pdf | MoE | | "We built grok-code-fast-1 from scratch, starting with a brand-new model architecture. To lay a robust foundation, we carefully assembled a pre-training corpus rich with programming-related content. For post-training, we curated high-quality datasets that reflect real-world pull requests and coding tasks." Announce: https://x.ai/news/grok-code-fast-1 | 647 | |||||
35 | Hermes 4 | Nous Research | https://huggingface.co/NousResearch/Hermes-4-405B-FP8 | 405 | 15656 | 39:1 | 8.4 | 87.2 | 80.5 | 70.5 | synthetic, web-scale | Aug/2025 | 🟢 | https://arxiv.org/abs/2508.18255 | Dense | Reasoning | Based on Llama 3. Announce: https://hermes4.nousresearch.com/ | 646 | ||
36 | Jet-Nemotron-4B | NVIDIA | https://github.com/NVlabs/Jet-Nemotron | 4 | 400 | 100:1 | 0.1 | 65.2 | 44.2 | synthetic, web-scale | Aug/2025 | 🟢 | https://arxiv.org/abs/2508.15884v1 | Dense | Reasoning | "pre-training corpus and train Jet-Nemotron models for 50B tokens. This is also the setting in Section 2 where we perform PostNAS. At the second stage, we include more high-quality data from math [65] and coding [66, 67] domains into our data mixture. The models are then trained on 350B tokens." | 645 | |||
37 | DeepSeek-V3.1-Base | DeepSeek-AI | https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base | 685 | 15640 | 22:1 | 10.6 | 93.7 | 84.8 | 80.1 | 29.8 | synthetic, web-scale | Aug/2025 | 🟢 | https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base | MoE | SOTA, Reasoning | Hybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2 | 644 | |
38 | Nemotron Nano 2 | NVIDIA | https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base | 12.31 | 20000 | 1,625:1 | 1.7 | 78.24 | 63.98 | 64.48 | synthetic, web-scale | Aug/2025 | 🟢 | https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf | Dense | Reasoning | Announce: https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2/ | 643 | ||
39 | Gemma 3 270M | Google DeepMind | https://huggingface.co/google/gemma-3-270m-it | 0.27 | 6000 | 22,223:1 | 0.1 | web-scale | Aug/2025 | 🟢 | https://developers.googleblog.com/en/introducing-gemma-3-270m/ | Dense | | This is a record tokens-to-params ratio (for text models) of 22,223:1. | 642 | |||||
40 | GPT-5 | OpenAI | https://poe.com/GPT-5 | 300 | 114000 | 380:1 | 19.5 | 91 | 89.4 | 42 | synthetic, web-scale | Aug/2025 | 🟢 | https://openai.com/index/gpt-5-system-card/ | MoE | SOTA, Reasoning | Announce: https://openai.com/index/introducing-gpt-5/. MMLU score is based on Spanish (ES) and Portuguese (PT) sets translated from English (EN). | 641 |
41 | gpt-oss-120b | OpenAI | https://huggingface.co/openai/gpt-oss-120b | 120 | 30000 | 250:1 | 6.3 | 90 | 80.1 | 19 | synthetic, web-scale | Aug/2025 | 🟢 | https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf | MoE | Reasoning, SOTA | 116.8B total parameters and 5.1B “active” parameters per token per forward pass. https://openai.com/index/introducing-gpt-oss/ | 640 | ||
42 | gpt-oss-20b | OpenAI | https://huggingface.co/openai/gpt-oss-20b | 20 | 13000 | 650:1 | 1.7 | 85.3 | 71.5 | 17.3 | synthetic, web-scale | Aug/2025 | 🟢 | https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf | MoE | Reasoning, SOTA | 20.9B total and 3.6B active parameters. https://openai.com/index/introducing-gpt-oss/ | 639 | ||
43 | Claude Opus 4.1 | Anthropic | https://claude.ai/ | 2000 | 100000 | 50:1 | 47.1 | 80.9 | synthetic, web-scale | Aug/2025 | 🟢 | https://www.anthropic.com/news/claude-opus-4-1 | MoE | Reasoning, SOTA | 638 | |||||
44 | GLM-4.5 | Z.AI | https://huggingface.co/zai-org/GLM-4.5 | 355 | 22000 | 62:1 | 9.3 | 84.6 | 79.1 | 14.4 | synthetic, web-scale | Jul/2025 | 🟢 | https://z.ai/blog/glm-4.5 | MoE | Reasoning | 355B-A32B. | 637 | ||
45 | T1 | China Telecom Artificial Intelligence Research Institute | https://github.com/Tele-AI/T1 | 115 | 10000 | 87:1 | 3.6 | web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2507.18013 | Dense | Reasoning | 636 | ||||||
46 | Intern-S1 | Shanghai AI Laboratory/SenseTime | https://huggingface.co/internlm/Intern-S1 | 235 | 41000 | 175:1 | 10.3 | 83.5 | 77.3 | synthetic, web-scale | Jul/2025 | 🟢 | https://huggingface.co/internlm/Intern-S1 | MoE | Reasoning, SOTA | 41T tokens assumes base model of Qwen3. "Built upon a 235B MoE language model and a 6B Vision encoder, Intern-S1 has been further pretrained on 5 trillion tokens of multimodal data" | 635 | |||
47 | Step 3 | StepFun | https://www.stepfun.com/ | 321 | 18000 | 57:1 | 8.0 | 72.9 | web-scale | Jul/2025 | 🟢 | https://github.com/stepfun-ai/Step3/blob/main/Step3-Sys-Tech-Report.pdf | MoE | | 321B-A38B. https://x.com/CyouSakura/status/1948767450751009227 | 634 | ||||
48 | Qwen3-235B-A22B-Thinking-2507 | Alibaba | https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507 | 235 | 36000 | 154:1 | 9.7 | 93.8 | 84.4 | 81.1 | synthetic, web-scale | Jul/2025 | 🟢 | https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507 | MoE | Reasoning | 235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux. | 633 | ||
49 | KAT-V1-200B | Kuaishou | 200 | 10000 | 50:1 | 4.7 | 82.3 | 78.2 | synthetic, web-scale | Jul/2025 | 🔴 | https://arxiv.org/abs/2507.08297 | MoE | Reasoning | 200BA40B. In training as of Jul/2025. "to address the overthinking problem in reasoning-intensive tasks" | 632 | ||||
50 | KAT-V1-40B | Kuaishou | https://huggingface.co/Kwaipilot/KAT-V1-40B | 40 | 10000 | 250:1 | 2.1 | 77.8 | 75.1 | synthetic, web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2507.08297 | Dense | Reasoning | "to address the overthinking problem in reasoning-intensive tasks" | 631 | |||
51 | Qwen3-Coder-480B-A35B-Instruct | Alibaba | https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct | 480 | 36000 | 75:1 | 13.9 | synthetic, web-scale | Jul/2025 | 🟢 | https://qwenlm.github.io/blog/qwen3-coder/ | MoE | | 480B-A35B. | 630 | |||||
52 | Qwen3-235B-A22B-Instruct-2507 | Alibaba | https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507 | 235 | 36000 | 154:1 | 9.7 | 93.1 | 83 | 77.5 | synthetic, web-scale | Jul/2025 | 🟢 | https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507 | MoE | SOTA | 235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux. | 629 | ||
53 | FlexOlmo | Allen AI | https://huggingface.co/allenai/FlexOlmo-7x7B-1T | 37 | 4150 | 113:1 | 1.3 | 60.4 | 30.9 | synthetic, web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2507.07024v1 | MoE | | 37B-A20B. "We adopt the OLMo-2 7B setup, starting from a checkpoint pre-trained on 4T tokens and annealed for 50B tokens to produce a public expert. We then train two additional experts on math and code, each for 50B tokens, and combine them with the public expert to form a three-expert version of FLEXOLMO." | 628 |
54 | EXAONE 4.0 | LG | https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B | 32 | 14000 | 438:1 | 2.2 | 92.3 | 81.8 | 75.4 | web-scale | Jul/2025 | 🟢 | https://www.lgresearch.ai/data/cdn/upload/EXAONE_4_0.pdf | Dense | Reasoning | “EXAONE”=“EXpert AI for EveryONE”. Training tokens/ratio: EXAONE-3 7.8B=8T tokens (Aug/2024) -> EXAONE-3.5 7.8B=9T -> EXAONE-3.5 32B=6.5T tokens -> EXAONE 4.0 32B=14T tokens. MMLU score is MMLU-Redux. Interesting: "To focus [RL] training on more informative data samples, we perform accuracy-based filtering by generating eight responses from the SFT model and excluding samples where all eight responses are correct, a pre-filtering step that removes problems that are easy for the model to avoid inefficient training." | 627 | ||
55 | Kimi K2 | Moonshot AI | https://huggingface.co/moonshotai/Kimi-K2-Instruct | 1000 | 15500 | 16:1 | 13.1 | 89.5 | 81.1 | 75.1 | 4.7 | synthetic, web-scale | Jul/2025 | 🟢 | https://moonshotai.github.io/Kimi-K2/ | MoE | Reasoning, SOTA | 1TA32B. 1T parameters and 384 experts. Open source SOTA. | 626 | |
56 | Reka Flash 3.1 | Reka AI | https://huggingface.co/RekaAI/reka-flash-3.1 | 21 | 5000 | 239:1 | 1.1 | web-scale | Jul/2025 | 🟢 | https://www.reka.ai/news/introducing-reka-flash | Dense | Reasoning | 625 | ||||||
57 | Devstral Medium | Mistral | https://chat.mistral.ai/chat | 50 | 12000 | 240:1 | 2.6 | synthetic, web-scale | Jul/2025 | 🟢 | https://mistral.ai/news/devstral-2507 | Dense | | Non-reasoning. | 624 | |||||
58 | Grok 4 | xAI | https://grok.com/ | 600 | 80000 | 134:1 | 23.1 | 88.9 | 44.4 | synthetic, web-scale | Jul/2025 | 🟢 | https://lifearchitect.ai/grok/ | MoE | Reasoning, SOTA | "The smartest AI in the world, 100% on SAT, etc, questions that it's never seen before." | 623 | |||
59 | Phi-4-mini-flash-reasoning | Microsoft | https://huggingface.co/microsoft/Phi-4-mini-flash-reasoning | 3.8 | 5150 | 1,356:1 | 0.5 | synthetic, web-scale | Jul/2025 | 🟢 | https://azure.microsoft.com/en-us/blog/reasoning-reimagined-introducing-phi-4-mini-flash-reasoning/ | Dense | | "Pre-training: 5T tokens; Reasoning training: 150B tokens" "At the core of Phi-4-mini-flash-reasoning is the newly introduced decoder-hybrid-decoder architecture, SambaY, whose central innovation is the Gated Memory Unit (GMU), a simple yet effective mechanism for sharing representations between layers. The architecture includes a self-decoder that combines Mamba (a State Space Model) and Sliding Window Attention (SWA), along with a single layer of full attention. The architecture also involves a cross-decoder that interleaves expensive cross-attention layers with the new, efficient GMUs. This new architecture with GMU modules drastically improves decoding efficiency, boosts long-context retrieval performance and enables the architecture to deliver exceptional performance across a wide range of tasks. " | 622 | |||||
60 | T5Gemma | Google DeepMind | https://huggingface.co/google/t5gemma-9b-9b-ul2-it | 9 | 10000 | 1,112:1 | 1.0 | 76.7 | 55.7 | 40.4 | web-scale | Jul/2025 | 🟢 | https://developers.googleblog.com/en/t5gemma/ | Dense | | Related paper: https://arxiv.org/abs/2504.06225. Dataset was Gemma 2 9B on 8T tokens + 2T tokens adapted. | 621 | ||
61 | MedGemma | Google DeepMind | https://huggingface.co/google/medgemma-27b-it | 27 | 14000 | 519:1 | 2.0 | 87 | web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2507.05201 | Dense | | Multimodal model. Text MMLU score for med only=87.0. | 620 | ||||
62 | R1T2 Chimera | TNG | https://huggingface.co/tngtech/DeepSeek-TNG-R1T2-Chimera | 685 | 14800 | 22:1 | 10.6 | synthetic, web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2506.14794 | MoE | | Assembly-of-Experts method combining V3-0324, R1, and R1-0528. Announce: https://x.com/tngtech/status/1940531045432283412?s=46 | 619 |
63 | Spectra 1.1 | Consortium | 3.6 | 1200 | 334:1 | 0.2 | 36.12 | synthetic, web-scale | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.23025 | Dense | | "Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights" | 618 | |||||
64 | DiffuCoder | Apple | https://github.com/apple/ml-diffucoder | 7 | 5630 | 805:1 | 0.7 | code, The Stack | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.20639 | Dense | Diffusion | "We adapt our model from Qwen2.5-Coder (Hui et al., 2024) as the base model to perform continual pre-training using the adaptation approach from Gong et al. (2025). During this pre-training, we use a 400B-token code pre-training corpus from RefineCode (Huang et al., 2024) and Stackv2 (Lozhkov et al., 2024)." | 617 | |||||
65 | Hunyuan-A13B | Tencent | https://huggingface.co/tencent/Hunyuan-A13B-Instruct | 80 | 7000 | 88:1 | 2.5 | 88.17 | 67.23 | 71.2 | synthetic, web-scale | Jun/2025 | 🟢 | https://huggingface.co/tencent/Hunyuan-A13B-Instruct | MoE | | 80B-A13B. 'We have open-sourced Hunyuan-A13B-Pretrain , Hunyuan-A13B-Instruct , Hunyuan-A13B-Instruct-FP8 , Hunyuan-A13B-Instruct-GPTQ-Int4 on Hugging Face.' | 616 | ||
66 | Mercury | Inception | https://chat.inceptionlabs.ai/ | 90 | 8000 | 89:1 | 2.8 | 69 | 51 | 3.4 | synthetic, web-scale | Jun/2025 | 🟢 | https://www.inceptionlabs.ai/introducing-mercury-our-general-chat-model | Dense | Diffusion | Diffusion large language model (dLLM). | 615 | ||
67 | Mu | Microsoft | https://blogs.windows.com/windows-insider/2025/06/13/announcing-windows-11-insider-preview-build-26200-5651-dev-channel/ | 0.5 | 500 | 1,000:1 | 0.1 | synthetic, web-scale | Jun/2025 | 🟢 | https://blogs.windows.com/windowsexperience/2025/06/23/introducing-mu-language-model-and-how-it-enabled-the-agent-in-windows-settings/ | Dense | | "distillation from Microsoft’s Phi models...Mu is an efficient 330M encoder–decoder language model optimized for small-scale deployment, particularly on the NPUs on Copilot+ PCs. It follows a transformer encoder–decoder architecture" | 614 | |||||
68 | Gemini Robotics On-Device | Google DeepMind | https://docs.google.com/forms/u/0/d/1sM5GqcVMWv-KmKY3TOMpVtQ-lDFeAftQ-d9xQn92jCE/viewform?ts=67cef986&edit_requested=true | 20 | 10000 | 500:1 | 1.5 | synthetic, web-scale | Jun/2025 | 🟢 | https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/ | MoE | | See Mar/2025 Gemini Robotics-ER model for comparison. Announce: https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/ | 613 | |||||
69 | ICONN-1 | ICONNAI | https://huggingface.co/ICONNAI/ICONN-1 | 88 | 10000 | 114:1 | 3.1 | synthetic, web-scale | Jun/2025 | 🟢 | MoE | | "ICONN-1 (this version) is optimized for natural, emotionally resonant, and conversational interactions. ICONN-e1 is a specialized variant of the model fine-tuned for advanced reasoning, critical analysis, and complex problem-solving." | 612 | ||||||
70 | MiniMax-M1 | MiniMax | https://huggingface.co/MiniMaxAI/MiniMax-M1-80k | 456 | 7200 | 16:1 | 6.0 | 81.1 | 70 | 8.4 | web-scale | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.13585 | MoE | Reasoning | 456B-A45.9B. Announce: https://www.minimax.io/news/minimaxm1 | 611 | ||
71 | Magistral Medium | Mistral | https://chat.mistral.ai/chat | 50 | 12000 | 240:1 | 2.6 | 70.8 | synthetic, web-scale | Jun/2025 | 🟢 | https://mistral.ai/static/research/magistral.pdf | Dense | Reasoning | Magistral Small=24B. Announce: https://mistral.ai/news/magistral | 610 | ||||
72 | Comma v0.1-2T | EleutherAI | https://huggingface.co/common-pile/comma-v0.1-2t | 7 | 2000 | 286:1 | 0.4 | 49.8 | web-scale | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.05209 | Dense | | "Comma v0.1-2T is a decoder-only transformer that uses the same architecture as Llama 3. Training was done in two stages: first on 1.93 trillion tokens with a cosine learning rate schedule, and second a "cool-down" training phase on 75.5 billion tokens from high-quality sources. The final model is the average of 10 checkpoints during this cool-down phase. Both training phases use a batch size of 8.3 million tokens per step. Training was performed using lingua on 512 AMD MI300A GPUs." | 609 | ||||
73 | dots.llm1 | Xiaohongshu/RedNote | https://huggingface.co/rednote-hilab/dots.llm1.base | 142 | 11200 | 79:1 | 4.2 | 83.2 | 61.9 | 52.6 | web-scale | Jun/2025 | 🟢 | https://github.com/rednote-hilab/dots.llm1/blob/main/dots1_tech_report.pdf | MoE | | 142B-A14B. "dots.llm1, a large-scale MoE model that activates 14 billion parameters out of a total of 142 billion parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models." | 608 | ||
74 | Gemini 2.5 Pro 06-05 | Google DeepMind | https://deepmind.google/models/gemini-diffusion/ | 400 | 80000 | 200:1 | 18.9 | 86.4 | 21.6 | synthetic, web-scale | Jun/2025 | 🟢 | https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf | Dense | Reasoning, SOTA | "an upgraded preview of Gemini 2.5 Pro, our most intelligent model yet. Building on the version we released in May and showed at I/O, this model will be the generally available, stable version starting in a couple of weeks, ready for enterprise-scale applications." | 607 | |||
75 | MiMo-7B-RL-0530 | Xiaomi | https://huggingface.co/XiaomiMiMo/MiMo-7B-RL-0530 | 7 | 25000 | 3,572:1 | 1.4 | 58.6 | 60.6 | synthetic, web-scale | May/2025 | 🟢 | https://arxiv.org/abs/2505.07608 | Dense | Reasoning | "[2025.05.30] During the RL training, by continuously expanding the training window size (from 32K to 48K), the performance of MiMo-7B-RL-0530 on AIME24 can be continuously improved and eventually surpass that of DeepSeek R1... MiMo-7B-Base is pre-trained on approximately 25 trillion tokens." | 606 | |||
76 | DeepTransformers | Google DeepMind | 1.3 | 100 | 77:1 | 0.0 | synthetic, web-scale | May/2025 | 🔴 | https://arxiv.org/abs/2505.23735 | Dense | | "Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture." | 605 | ||||||
77 | Atlas | Google DeepMind | 1.3 | 100 | 77:1 | 0.0 | synthetic, web-scale | May/2025 | 🔴 | https://arxiv.org/abs/2505.23735 | Dense | | "Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture." | 604 | ||||||
78 | DeepSeek-R1-0528 | DeepSeek-AI | https://chat.deepseek.com/ | 685 | 14800 | 22:1 | 10.6 | 93.4 | 85 | 81 | 17.7 | synthetic, web-scale | May/2025 | 🟢 | https://huggingface.co/deepseek-ai/DeepSeek-R1-0528 | MoE | Reasoning, SOTA | Censorship increased significantly. "overall performance is now approaching that of leading models, such as o3 and Gemini 2.5 Pro." MMLU shows MMLU-Redux score with lower error rate. | 603 | |
79 | Fathom-R1-14B | Fractal Analytics | https://huggingface.co/FractalAIResearch/Fathom-R1-14B | 14 | 18000 | 1,286:1 | 1.7 | 66.16 | synthetic, web-scale | May/2025 | 🟢 | https://huggingface.co/FractalAIResearch/Fathom-R1-14B | Dense | Reasoning | Base R1-distilled-14B model, based on Qwen 14B. Media release. | 602 | ||||
80 | QwenLong-L1-32B | Alibaba | https://huggingface.co/Tongyi-Zhiwen/QwenLong-L1-32B | 32 | 18000 | 563:1 | 2.5 | synthetic, web-scale | May/2025 | 🟢 | https://arxiv.org/abs/2505.17667 | Dense | Reasoning | "the first long-context LRM trained with reinforcement learning for long-context reasoning." | 601 |
81 | Claude Opus 4 | Anthropic | https://claude.ai/ | 6000 | 100000 | 17:1 | 81.6 | 83.3 | synthetic, web-scale | May/2025 | 🟢 | https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf | Dense | Reasoning, SOTA | "Claude Opus 4 is our most intelligent model to date, pushing the frontier in coding, agentic search, and creative writing. With advanced reasoning and powerful collaboration capabilities…Both models can also alternate between reasoning and tool use—like web search—to improve responses…Claude Opus 4 can work continuously for hours on complex, long-running tasks" | 600 | ||||
82 | Falcon-H1 | TII | https://huggingface.co/tiiuae/Falcon-H1-34B-Instruct-GGUF | 34 | 18000 | 530:1 | 2.6 | 84.05 | 58.73 | 49.66 | synthetic, web-scale | May/2025 | 🟢 | https://huggingface.co/papers/2507.22448 | Dense | | "hybrid architecture that combines the strengths of the classical Transformer-based attention mechanism with the State Space Model (SSM), known for its superior long-context memory and computational efficiency." | 599 | ||
83 | Gemini Diffusion | Google DeepMind | https://deepmind.google/models/gemini-diffusion/ | 40 | 16000 | 400:1 | 2.7 | 40.4 | synthetic, web-scale | May/2025 | 🟢 | https://deepmind.google/models/gemini-diffusion/ | Dense | Diffusion | "Gemini Diffusion’s external benchmark performance is comparable to much larger models [like Gemini-2.0-Flash-Lite], whilst also being faster." | 598 | ||||
84 | Gemma 3n | Google DeepMind | https://ai.google.dev/gemma/docs/gemma-3n | 4 | 8000 | 2,000:1 | 0.6 | 62.1 | synthetic, web-scale | May/2025 | 🟢 | https://developers.googleblog.com/en/introducing-gemma-3n/ | MatFormer | | Matryoshka Transformer or MatFormer model architecture. 850M (696M / 620M / 582M). | 597 | ||||
85 | ParScale | Alibaba | https://huggingface.co/ParScale/ParScale-4.7B-P8-Python | 4.7 | 1000 | 213:1 | 0.2 | 35.1 | synthetic, web-scale | May/2025 | 🟢 | https://arxiv.org/abs/2505.10475 | Dense | | "We introduce the third scaling paradigm for scaling LLMs: leverages parallel computation during both training and inference time (Parallel Scaling, or ParScale)... ParScale can use up to 22× less memory increase and 6× less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget." MMLU score shown is for the 1.8B model, not the 4.7B model. | 596 |
86 | codex-1 | OpenAI | https://chatgpt.com/codex | 600 | 100000 | 167:1 | 25.8 | synthetic, web-scale | May/2025 | 🟢 | https://openai.com/index/introducing-codex/ | MoE | Reasoning, SOTA | o3 base. "codex-1, a version of OpenAI o3 optimized for software engineering. It was trained using reinforcement learning on real-world coding tasks in a variety of environments to generate code that closely mirrors human style and PR preferences, adheres precisely to instructions, and can iteratively run tests until it receives a passing result." | 595 | |||||
87 | Falcon-Edge | TII | https://huggingface.co/tiiuae/Falcon-E-3B-Instruct | 3 | 1500 | 500:1 | 0.2 | 55.7 | 27.16 | 23.59 | synthetic, web-scale | May/2025 | 🟢 | https://huggingface.co/blog/tiiuae/falcon-edge | Dense | | "Falcon-Edge series - a collection of powerful, universal, and fine-tunable language models available in ternary format, based on the BitNet architecture." | 594 | ||
88 | SWE-1 | Windsurf | https://windsurf.com/blog/windsurf-wave-9-swe-1 | 50 | 8000 | 160:1 | 2.1 | synthetic, web-scale | May/2025 | 🟢 | https://windsurf.com/blog/windsurf-wave-9-swe-1 | Dense | | "SWE-1, optimized for the entire software engineering process, not just the task of coding." | 593 | |||||
89 | INTELLECT-2 | Prime Intellect | https://chat.primeintellect.ai/ | 32 | 18000 | 563:1 | 2.5 | 66.8 | web-scale | May/2025 | 🟢 | https://storage.googleapis.com/public-technical-paper/INTELLECT_2_Technical_Report.pdf | Dense | Reasoning | QwQ-32B base. Announce: https://www.primeintellect.ai/blog/intellect-2-release Finished training 30/Apr/2025: https://app.primeintellect.ai/intelligence/intellect-2 | 592 | ||||
90 | Pangu Ultra MoE | Huawei | https://github.com/pangu-tech/pangu-ultra | 718 | 13000 | 19:1 | 10.2 | 91.5 | 83.5 | 75.3 | synthetic, web-scale | May/2025 | 🔴 | https://arxiv.org/abs/2505.04519 | MoE | Reasoning | 718B-A39B. Trained on 6,000 Ascend NPUs (Kunpeng 920 processors in Huawei Atlas 800T A2 servers). | 591 | ||
91 | Mistral Medium 3 | Mistral | https://chat.mistral.ai/chat | 50 | 12000 | 240:1 | 2.6 | 77.2 | 57.1 | synthetic, web-scale | May/2025 | 🟢 | https://mistral.ai/news/mistral-medium-3 | Dense | | Multimodal. 50B param estimate based on "Mistral Medium 3 can also be deployed on any cloud, including self-hosted environments of four GPUs and above.". Note: "With the launches of Mistral Small in March and Mistral Medium today, it’s no secret that we’re working on something ‘large’ over the next few weeks. With even our medium-sized model being resoundingly better than flagship open source models such as Llama 4 Maverick, we’re excited to ‘open’ up what’s to come :) " | 590 | |||
92 | Granite-4.0-Tiny-Preview | IBM | https://huggingface.co/ibm-granite/granite-4.0-tiny-preview | 7 | 2500 | 358:1 | 0.4 | 60.4 | synthetic, web-scale | May/2025 | 🟢 | https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek | MoE | Reasoning | "the model is only partially trained—it has only seen 2.5T of a planned 15T or more training tokens...Granite 4.0 Tiny-Preview, specifically, is a fine-grained hybrid mixture of experts (MoE) model, with 7B total parameters and only 1B active parameters at inference time... Like its predecessors in Granite 3.2 and Granite 3.3, Granite 4.0 Tiny Preview offers toggleable thinking on and thinking off functionality (though its reasoning-focused post-training is very much incomplete)." | 589 | ||||
93 | Phi-4-reasoning-plus | Microsoft | https://huggingface.co/microsoft/Phi-4-reasoning-plus | 14 | 10016 | 716:1 | 1.2 | 76 | 69.3 | synthetic, web-scale | Apr/2025 | 🟢 | https://arxiv.org/abs/2504.21318 | Dense | | "Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model finetuned from Phi-4 using supervised fine-tuning on a dataset of chain-of-thought traces and reinforcement learning." | 588 | |||
94 | Bamba-9B-v2 | IBM | https://huggingface.co/ibm-ai-platform/Bamba-9B-v2 | 9 | 3000 | 334:1 | 0.5 | 67.92 | 25.41 | 5.93 | synthetic, web-scale | Apr/2025 | 🟢 | https://huggingface.co/blog/ibm-ai-platform/bamba-9b-v2 | Dense | | "During Christmas of 2024, IBM, Princeton, CMU, and UIUC released, Bamba v1, a performant Mamba2 based pretrained model with full data lineage trained to 2T tokens. Since then, we have been busy cooking an update with new datasets. Today, we are excited to release Bamba v2, trained for an additional 1T tokens that significantly improves on Bamba v1. The L1 and L2 leaderboard scores outperform Llama 3.1 8B, which was trained with nearly 5x the amount of data. All of this with the inference speedup that we get from Mamba2 based architecture, which with the latest vLLM is 2-2.5x faster than similar sized transformer models." | 587 | ||
95 | Qwen3-235B-A22B | Alibaba | https://huggingface.co/Qwen/Qwen3-235B-A22B | 235 | 36000 | 154:1 | 9.7 | 87.81 | 68.18 | 47.47 | synthetic, web-scale | Apr/2025 | 🟢 | https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf | MoE | Reasoning | Qwen3-235B-A22B. Qwen3-30B-A3B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" | 586 | ||
96 | ERNIE X1 Turbo | Baidu | https://huggingface.co/spaces/PaddlePaddle/ernie_x1_turbo_demo | 69 | synthetic, web-scale | Apr/2025 | 🟢 | https://www.prnewswire.com/news-releases/baidu-launches-ernie-4-5-turbo-ernie-x1-turbo-and-new-suite-of-ai-tools-to-empower-developers-and-supercharge-ai-innovation-302438584.html | MoE | Reasoning | Announce: https://x.com/Baidu_Inc/status/1915603080336597310 | 585 | ||||||||
97 | ERNIE 4.5 Turbo | Baidu | https://huggingface.co/spaces/PaddlePaddle/ernie_4.5_turbo_demo | 90 | synthetic, web-scale | Apr/2025 | 🟢 | https://www.prnewswire.com/news-releases/baidu-launches-ernie-4-5-turbo-ernie-x1-turbo-and-new-suite-of-ai-tools-to-empower-developers-and-supercharge-ai-innovation-302438584.html | MoE | | Announce: https://x.com/Baidu_Inc/status/1915603080336597310 | 584 | ||||||||
98 | MAI-DS-R1 | Microsoft | https://huggingface.co/microsoft/MAI-DS-R1 | 685 | 14800 | 22:1 | 10.6 | 86.8 | synthetic, web-scale | Apr/2025 | 🟢 | https://techcommunity.microsoft.com/blog/machinelearningblog/introducing-mai-ds-r1/4405076 | MoE | Reasoning | DeepSeek-R1 base. "MAI-DS-R1, a new open weights DeepSeek R1 model variant... post-trained by the Microsoft AI team to improve its responsiveness on blocked topics and its risk profile, while maintaining its reasoning capabilities and competitive performance." | 583 | ||||
99 | Gemini 2.5 Flash Preview | Google DeepMind | https://aistudio.google.com/prompts/new_chat?model=gemini-2.5-flash-preview-04-17 | 80 | 20000 | 250:1 | 4.2 | 78.3 | 12.1 | synthetic, web-scale | Apr/2025 | 🟢 | https://deepmind.google/technologies/gemini/flash/ | MoE | Reasoning | Context in=1M, out=64k. Knowledge cutoff Jan/2025. Codename 'nebula'. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/ | 582 | |||
100 | o4-mini | OpenAI | https://chatgpt.com/?model=o4-mini-high | 200 | 40000 | 200:1 | 9.4 | 88 | 81.4 | 14.28 | synthetic, web-scale | Apr/2025 | 🟢 | https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf | MoE | Reasoning, SOTA | https://openai.com/index/introducing-o3-and-o4-mini/ MMLU score shown is for a translated LOTE (language other than English). | 581
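Worked example for the derived columns above. This is a minimal sketch (not part of the source sheet) showing how the "Ratio Tokens:Params" and "ALScore" columns are computed from the formula stated in the header, assuming both inputs are in billions; rounding in the sheet may differ by ±1 on some ratio rows. The GPT-5 row (300B parameters, 114,000B tokens) is used as the check value.

```python
import math

def tokens_to_params_ratio(params_b: float, tokens_b: float) -> str:
    """Ratio of training tokens to parameters; Chinchilla-style scaling is roughly >= 20:1."""
    return f"{round(tokens_b / params_b):,}:1"

def alscore(params_b: float, tokens_b: float) -> float:
    """ALScore = sqrt(Parameters x Tokens) / 300, with both values in billions."""
    return round(math.sqrt(params_b * tokens_b) / 300, 1)

# Check against the GPT-5 row: 300B parameters, 114,000B (114T) training tokens.
print(tokens_to_params_ratio(300, 114_000))  # -> 380:1
print(alscore(300, 114_000))                 # -> 19.5
```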