1
(683 models) Permalink: https://lifearchitect.ai/models-table
Timeline view: https://lifearchitect.ai/timeline
The Memo: https://lifearchitect.ai/memo
2
Columns: Model | Lab | Playground | Parameters (B) | Tokens trained (B) | Ratio Tokens:Params (Chinchilla scaling ≥20:1) | ALScore | MMLU | MMLU-Pro | GPQA | HLE | Training dataset | Announced | Public? | Paper / Repo | Arch | Tags | Notes | Count (rough)
"ALScore" is a quick and dirty rating of the model's power. The formula is: ALScore = square root of (Parameters (B) × Tokens trained (B)) ÷ 300. Any ALScore ≥ 1.0 is a powerful model in mid-2023. A worked sketch of the ALScore and ratio calculations follows this column list.
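A minimal sketch (Python) of how the two derived columns are computed, using parameter and token figures taken from rows in this table; the displayed table values are rounded, so small differences (about ±0.1 on ALScore, ±1 on the ratio) are expected.

```python
from math import sqrt

def alscore(params_b: float, tokens_b: float) -> float:
    """ALScore = sqrt(parameters (B) x tokens trained (B)) / 300."""
    return sqrt(params_b * tokens_b) / 300

def tokens_to_params_ratio(params_b: float, tokens_b: float) -> float:
    """Tokens:params ratio; Chinchilla-optimal training is roughly >= 20:1."""
    return tokens_b / params_b

# Figures from rows in this table (both columns in billions).
rows = [
    ("GPT-5",            300,   114_000),  # table: ALScore 19.5, ratio 380:1
    ("Kimi K2 Thinking", 1_000, 15_500),   # table: ALScore 13.1, ratio 16:1
    ("Gemma 3 270M",     0.27,  6_000),    # table: ALScore 0.1, ratio 22,223:1
]
for name, params_b, tokens_b in rows:
    print(f"{name}: ALScore={alscore(params_b, tokens_b):.1f}, "
          f"ratio={tokens_to_params_ratio(params_b, tokens_b):,.0f}:1")
```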
3
AuroraGPT (ScienceGPT)
Argonne National Laboratory
https://lifearchitect.ai/auroragpt/20003000015:1TBA🔴
Three models targeted in Jul/2024: AuroraGPT-7B-P (Ponte Vecchio GPU testing), AuroraGPT-7B-A (Aurora), and AuroraGPT-7B-A-S (Aurora + Science).
4
DeepSeek-R2DeepSeek-AIhttps://www.reuters.com/technology/artificial-intelligence/deepseek-rushes-launch-new-ai-model-china-goes-all-2025-02-25/1200130000109:141.6TBA🟢https://docs.google.com/document/d/e/2PACX-1vTmx-A5sBe_3RsURGM7VvLWsAgUXbcIb2pFaW7f1FTPgK7mGvYENXGQPoF2u4onFndJ_5tzZ02su-vg/pubMoEReasoning, SOTA
Due April 2025. Hybrid MoE, 1.2T-A78B. 5.2 PB corpus ≈ 1.3 quadrillion (Qa) tokens = 1,300T tokens = 1,300,000B tokens (see the conversion sketch below). "Constructed a 5.2 PB high-quality corpus covering vertical domains such as finance, law, and patents." Source: http://jiuyangongshe.com/a/1h4gq724su0, translated at: https://docs.google.com/document/d/e/2PACX-1vTmx-A5sBe_3RsURGM7VvLWsAgUXbcIb2pFaW7f1FTPgK7mGvYENXGQPoF2u4onFndJ_5tzZ02su-vg/pub
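One way to reconcile the 5.2 PB corpus with the 1.3 quadrillion-token figure in the note above is an implied average of roughly 4 bytes of raw text per token; a minimal sketch (the 4 bytes/token and decimal-petabyte assumptions are inferred from the two figures, not stated in the source):

```python
# Reconciling DeepSeek-R2's reported 5.2 PB corpus with its 1.3 quadrillion-token figure.
# Assumptions (inferred, not from the source): decimal petabytes (1 PB = 1e15 bytes)
# and roughly 4 bytes of raw text per token (5.2e15 / 1.3e15 = 4).
corpus_bytes = 5.2e15
bytes_per_token = 4
tokens = corpus_bytes / bytes_per_token

print(f"{tokens:.2e} tokens")              # ~1.30e+15 (1.3 quadrillion)
print(f"= {tokens / 1e12:,.0f}T tokens")   # 1,300T
print(f"= {tokens / 1e9:,.0f}B tokens")    # 1,300,000B
```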
5
ERNIE 5Baiduhttps://lifearchitect.ai/ernie/TBA
6
Gemini Ultra
Google DeepMind
https://www.reddit.com/r/singularity/comments/1kbpdvp/a_string_referencing_gemini_ultra_has_been_added/20003000015:125.8TBADue May/2025.
7
GPT-6OpenAIhttps://lifearchitect.ai/gpt-6/TBASOTADue 2025.
8
Llama 4 ReasoningMeta AIhttps://ai.meta.com/blog/llama-4-multimodal-intelligence/TBA🟢https://ai.meta.com/blog/llama-4-multimodal-intelligence/MoEReasoning, SOTA
Announced, coming soon. Llama 5 cancelled: https://x.com/stablequan/status/1977235217879552103
9
o4OpenAIhttps://lifearchitect.ai/o4/TBAReasoning, SOTADue 2025.
10
Hope
Google DeepMind
1.310077:10.0synthetic, web-scaleNov/2025🟡
https://abehrouz.github.io/files/NL.pdf
Dense
"Combining our self-modifying sequence model with the continuum memory system, we present a learning module, called HOPE, showing promising results in language modeling, continual learning, and long-context reasoning tasks." Announce: https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/ May be released after paper is public.
683
11
Kimi K2 ThinkingMoonshot AI
https://kimi.com/
10001550016:113.194.484.684.544synthetic, web-scaleNov/2025🟢https://moonshotai.github.io/Kimi-K2/thinking.htmlMoEReasoning, SOTA
1T-A32B: 1T total parameters, 32B active, 384 experts. Open-source SOTA. HLE=51.0 on the text-only subset; compare Grok 4's HLE=50.7 on text-only vs 44.4 on the full set, so Kimi K2 Thinking's full-set HLE≈44 (estimated; see the sketch below this row).
682
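The full-set HLE≈44 above is this table's rough inference, not a published score; one way to formalize the comparison is to scale Kimi K2 Thinking's text-only result by Grok 4's full-set to text-only ratio, sketched below.

```python
# Rough estimate of Kimi K2 Thinking's HLE score on the full set, scaling its
# text-only score by Grok 4's full-set / text-only ratio (scores from this row's note).
k2_text_only    = 51.0   # Kimi K2 Thinking, HLE text-only subset
grok4_text_only = 50.7   # Grok 4, HLE text-only subset
grok4_full      = 44.4   # Grok 4, HLE full set

k2_full_estimate = k2_text_only * (grok4_full / grok4_text_only)
print(f"Estimated Kimi K2 Thinking HLE (full set): {k2_full_estimate:.1f}")  # ~44.7
```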
12
GEN-0Generalist
https://generalistai.com/blog/nov-04-2025-GEN-0
10100001,000:11.1web-scaleNov/2025🟡https://generalistai.com/blog/nov-04-2025-GEN-0
Dense
SOTA
"GEN-0, a new class of embodied foundation models built for multimodal training directly on high-fidelity raw physical interaction. Its architecture builds on the strengths of vision and language models while also going beyond them—natively designed to capture human-level reflexes and physical commonsense. One core feature is Harmonic Reasoning, in which the models are trained to simultaneously think and act seamlessly... GEN-0 is pretrained on our in-house robotics dataset, which includes over 270,000 hours of real-world diverse manipulation data, growing at a rate of 10,000 hours a week and accelerating."
681
13
CALMWechat
https://github.com/shaochenze/calm
1.82230127:10.1web-scaleOct/2025🟢https://arxiv.org/abs/2510.27688
Dense
"Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy... We train our models on the Pile uncopyrighted dataset (Gao et al., 2020). The raw text is processed with the Llama 3 tokenizer (Grattafiori et al., 2024), resulting in a training set of ∼230B tokens."
680
14
Kimi-LinearMoonshot AI
https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct
485700119:11.751synthetic, web-scaleOct/2025🟢https://github.com/MoonshotAI/Kimi-Linear?tab=readme-ov-fileMoE
48B-A3B. "Kimi Linear is a hybrid linear attention architecture that outperforms traditional full attention methods across various contexts, including short, long, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA)—a refined version of Gated DeltaNet that introduces a more efficient gating mechanism to optimize the use of finite-state RNN memory."
679
15
MiniMax-M2MiniMax
https://huggingface.co/MiniMaxAI/MiniMax-M2
230720032:14.3827831.8web-scaleOct/2025🟢https://platform.minimax.io/docs/guides/text-generationMoEReasoning230B-A10B.678
16
DeepSeek-OCRDeepSeek-AIhttps://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf360002,000:10.4specialOct/2025🟢https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdfMoERepresenting 1D text with 2D vision tokens achieves huge compression. Encoder/Decoder: DeepEncoder 380M (80M SAM-base + 300M CLIP-large), DeepSeek-3B-MoE (A570M). 677
17
UserLM-8bMicrosoft
https://huggingface.co/microsoft/UserLM-8b
81000125:10.3WildChatOct/2025🟢https://huggingface.co/microsoft/UserLM-8b
Dense
"we trained UserLM-8b to simulate the “user” role in conversation (by training it to predict user turns in a large corpus of conversations called WildChat)."
676
18
CoDASalesforce
https://huggingface.co/Salesforce/CoDA-v0-Instruct
1.7180106:10.1synthetic, web-scaleOct/2025🟢https://github.com/SalesforceAIResearch/CoDA/blob/main/technical_report.pdf
Dense
Diffusion"diffusion coder trained on TPU [Google TPU v4-1024 VM]"675
19
TRMSamsunghttps://github.com/SamsungSAILMontreal/TinyRecursiveModels0.0070.115:10.0Mazes (ARC-AGI)Oct/2025🟢https://arxiv.org/abs/2510.04871v1
Dense
"Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers"
674
20
Granite-4.0 SmallIBM
https://huggingface.co/ibm-granite/granite-4.0-h-small
3215000469:12.378.3355.4740.63synthetic, web-scaleOct/2025🟢
https://www.ibm.com/granite/docs/models/granite
MoEReasoning
32B-A9B. Announce: https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models
673
21
GLM-4.6Z.AI
https://huggingface.co/zai-org/GLM-4.6
3552200062:19.382.930.4synthetic, web-scaleSep/2025🟢
https://z.ai/blog/glm-4.6
MoEReasoning355B-A32B. "context window has been expanded from 128K to 200K tokens"672
22
Ring-1T-previewInclusionAI
https://huggingface.co/inclusionAI/Ring-1T-preview
10002000020:114.9synthetic, web-scaleSep/2025🟢
https://huggingface.co/inclusionAI/Ring-1T-preview
MoEReasoning1T-A48.5B.671
23
Claude Sonnet 4.5Anthropic
https://claude.ai/
40080000200:118.983.4synthetic, web-scaleSep/2025🟢
https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf
MoEReasoning, SOTAThe Claude Sonnet 4.5 "system card" is an absolute farce. Announce: https://www.anthropic.com/news/claude-sonnet-4-5670
24
Gemini Robotics 1.5
Google DeepMind
20020000100:16.759.6synthetic, web-scaleSep/2025🟢
https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf
MoEReasoning
2. "vision-language-action (VLA) model turns visual information and instructions into motor commands for a robot to perform a task." Available to select partners.
669
25
Gemini Robotics-ER 1.5
Google DeepMind
https://aistudio.google.com/?model=gemini-robotics-er-1.5-preview
30300001,000:13.283.3synthetic, web-scaleSep/2025🟢
https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf
MoEReasoning
1. "vision-language model (VLM) reasons about the physical world, natively calls digital tools and creates detailed, multi-step plans to complete a mission." Available to all devs.
668
26
TimesFM-ICFGoogle
https://huggingface.co/collections/google/timesfm-release-66e4be5fdb56e960c1e482a6
0.2100500:10.0specialSep/2025🔴
https://research.google/blog/time-series-foundation-models-can-be-few-shot-learners/
Dense
TimesFM-ICF is 6.8% more accurate than TimesFM (Base). Time-series forecasting only. 'a large pretraining corpus of 100B real world time-points' may be more than 100B tokens.
667
27
Qwen3-MaxAlibaba
https://chat.qwen.ai/
10003600036:120.085.4synthetic, web-scaleSep/2025🟢https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2777d&from=research.latest-advancements-listMoEReasoning
"Qwen3-Max-Thinking — still under active training — is already demonstrating remarkable potential. When augmented with tool usage and scaled test-time compute, the Thinking variant has achieved 100% on challenging reasoning benchmarks such as AIME 25 and HMMT. "
666
28
Qwen3-OmniAlibaba
https://github.com/QwenLM/Qwen3-Omni?tab=readme-ov-file
3017000567:12.488.873.1synthetic, web-scaleSep/2025🟢https://github.com/QwenLM/Qwen3-Omni/blob/main/assets/Qwen3_Omni.pdfMoEReasoning
"Qwen3-Omni is a unified end-to-end model capable of processing multiple modalities, such as text, audio, image and video, and generating real-time text or speech response."... "pretraining utilizes a large-scale dataset containing approximately 2 trillion tokens, with the following distribution across modalities: text (0.57 trillion), audio (0.77 trillion), image (0.82 trillion), video (0.05 trillion), and video-audio (0.05 trillion)."
665
29
DeepSeek-V3.1-Terminus
DeepSeek-AIhttps://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus6851564022:110.68580.721.7synthetic, web-scaleSep/2025🟢https://api-docs.deepseek.com/news/news250922MoESOTA, ReasoningHybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2664
30
Isaac 0.1Perceptronhttps://huggingface.co/PerceptronAI/Isaac-0.1220001,000:10.2synthetic, web-scaleSep/2025🟢https://www.perceptron.inc/blog/introducing-isaac-0-1
Dense
"perceptive-language model...delivering capabilities that meet or exceed those of models over 50 times its size. Founded by the team behind Meta's Chameleon multimodal models, Perceptron is tackling a fundamental challenge: bringing the power of physical AI to the dynamic, multimodal, and real-time environments we live and work in."
663
31
Grok 4 FastxAI
https://grok.com/
20020000100:16.785.720synthetic, web-scaleSep/2025🟢https://x.ai/news/grok-4-fastMoEReasoning, SOTA
"2M token context window, and a unified architecture that blends reasoning and non-reasoning modes in one model."
662
32
VaultGemma
Google DeepMind
https://huggingface.co/google/vaultgemma-1b
11300013,000:10.4web-scaleSep/2025🟢https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf
Dense
"Differential Privacy (DP) has emerged as the gold standard, providing a rigorous, mathematical framework to limit the influence of any single example in the training data on the resulting model. A model trained with DP provably bounds the reconstruction or leakage of information tied to individual data points." Announce: https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/661
33
Qwen3-Next-80B-A3B
Alibaba
https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
8015000188:13.784.7266.0543.43synthetic, web-scaleSep/2025🟢https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-listMoEReasoning
"Qwen3-Next introduces several key improvements: a hybrid attention mechanism, a highly sparse Mixture-of-Experts (MoE) structure, training-stability-friendly optimizations, and a multi-token prediction mechanism for faster inference."
660
34
K2-ThinkMBZUAIhttps://www.k2think.ai/3218000563:12.571.089.95synthetic, web-scaleSep/2025🟢https://arxiv.org/abs/2509.07604
Dense
Reasoning
"Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets."
659
35
mmBERTJHUhttps://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed40.30730009,772:10.1synthetic, web-scaleSep/2025🟢https://arxiv.org/abs/2509.06888
Dense
"a modern multilingual encoder trained on 3T tokens and 1833 languages. We introduce several novel elements in training: an inverse masking schedule and a cascading annealed language learning schedule for multilingual data" Announce: https://huggingface.co/blog/mmbert
658
36
ERNIE X1.1Baiduhttps://ernie.baidu.com/synthetic, web-scaleSep/2025🟢https://www.prnewswire.com/news-releases/baidu-unveils-reasoning-model-ernie-x1-1-with-upgrades-in-key-capabilities-302551170.htmlMoEReasoning657
37
ERNIE-4.5-21B-A3B-Thinking
Baidu
https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking
2115000715:11.9synthetic, web-scaleSep/2025🟢https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-ThinkingMoEReasoning656
38
Klear-46B-A2.5BKuaishou
https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct
4622000479:13.480.557.635.3synthetic, web-scaleSep/2025🟢https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-InstructMoE46B-A2.5B.655
39
TildeOpen-30bTilde AI
https://huggingface.co/TildeAI/TildeOpen-30b
30200067:10.8synthetic, web-scaleSep/2025🟢https://tilde.ai/lv/tildeopen-llm/
Dense
"language data from across Europe"654
40
Qwen3-Max-Preview
Alibaba
https://chat.qwen.ai/
10003600036:120.064.6synthetic, web-scaleSep/2025🟢https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?type=model&url=2840914_2&modelId=qwen3-max-previewMoE
GPQA score is SuperGPQA. "our biggest model yet, with over 1 trillion parameters"
653
41
Kimi K2-Instruct-0905
Moonshot AI
https://huggingface.co/moonshotai/Kimi-K2-Instruct
10001550016:113.1synthetic, web-scaleSep/2025🟢https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905MoEReasoning, SOTA1TA32B. 1T parameters and 384 experts. Open source SOTA.652
42
ApertusETH Zürich
https://huggingface.co/swiss-ai/Apertus-70B-Instruct-2509
7015000215:13.465.230.6synthetic, web-scaleSep/2025🟢https://github.com/swiss-ai/apertus-tech-report/blob/main/Apertus_Tech_Report.pdf
Dense
"Apertus – Latin for “open”" 1,811 languages. Announce: https://ethz.ch/en/news-and-events/eth-news/news/2025/09/press-release-apertus-a-fully-open-transparent-multilingual-language-model.html
651
43
LongCat-FlashMeituan
https://longcat.ai/
5602000036:111.289.7182.6873.23synthetic, web-scaleSep/2025🟢https://github.com/meituan-longcat/LongCat-Flash-Chat/blob/main/tech_report.pdfMoEReasoning, SOTA
560B-A18.6B–31.3B (27B on average). Announce: https://lmsys.org/blog/2025-09-01-sglang-longcat-flash/
650
44
MAI-1-previewMicrosoft
https://microsoft.ai/news/two-new-in-house-models/
5001000020:17.5synthetic, web-scaleAug/2025🟢https://microsoft.ai/news/two-new-in-house-models/MoE
MAI=Microsoft artificial intelligence. "MAI’s first foundation model trained end-to-end... MAI-1-preview is an in-house mixture-of-experts model, pre-trained and post-trained on ~15,000 NVIDIA H100 GPUs. This model is designed to provide powerful capabilities to consumers seeking to benefit from models that specialize in following instructions and providing helpful responses to everyday queries. We will be rolling MAI-1-preview out for certain text use cases within Copilot"
649
45
grok-code-fast-1xAI
https://github.com/features/copilot
10010000100:13.3synthetic, web-scaleAug/2025🟢
https://data.x.ai/2025-08-26-grok-code-fast-1-model-card.pdf
MoE
"We built grok-code-fast-1 from scratch, starting with a brand-new model architecture. To lay a robust foundation, we carefully assembled a pre-training corpus rich with programming-related content. For post-training, we curated high-quality datasets that reflect real-world pull requests and coding tasks." Announce: https://x.ai/news/grok-code-fast-1
648
46
Hermes 4Nous Research
https://huggingface.co/NousResearch/Hermes-4-405B-FP8
4051565639:18.487.280.570.5synthetic, web-scaleAug/2025🟢
https://arxiv.org/abs/2508.18255
Dense
ReasoningBased on Llama 3. Announce: https://hermes4.nousresearch.com/647
47
Jet-Nemotron-4BNVIDIA
https://github.com/NVlabs/Jet-Nemotron
4400100:10.165.244.2synthetic, web-scaleAug/2025🟢https://arxiv.org/abs/2508.15884v1
Dense
Reasoning"pre-training corpus and train Jet-Nemotron models for 50B tokens. This is also the setting in Section 2 where we perform PostNAS. At the second stage, we include more high-quality data from math [65] and coding [66, 67] domains into our data mixture. The models are then trained on 350B tokens."646
48
DeepSeek-V3.1-Base
DeepSeek-AI
https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base
6851564022:110.693.784.880.129.8synthetic, web-scaleAug/2025🟢https://huggingface.co/deepseek-ai/DeepSeek-V3.1-BaseMoESOTA, ReasoningHybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2645
49
Nemotron Nano 2NVIDIA
https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base
12.31200001,625:11.778.2463.9864.48synthetic, web-scaleAug/2025🟢https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf
Dense
ReasoningAnnounce: https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2/644
50
Gemma 3 270M
Google DeepMind
https://huggingface.co/google/gemma-3-270m-it
0.27600022,223:10.1web-scaleAug/2025🟢https://developers.googleblog.com/en/introducing-gemma-3-270m/
Dense
This is a record tokens-to-params ratio (for text models) of 22,223:1.643
51
GPT-5OpenAI
https://poe.com/GPT-5
300114000380:119.59189.442synthetic, web-scaleAug/2025🟢
https://openai.com/index/gpt-5-system-card/
MoESOTA, Reasoning
Announce: https://openai.com/index/introducing-gpt-5/. The MMLU score is based on Spanish (ES) and Portuguese (PT) versions of MMLU translated from English (EN).
642
52
gpt-oss-120bOpenAI
https://huggingface.co/openai/gpt-oss-120b
12030000250:16.39080.119synthetic, web-scaleAug/2025🟢
https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
MoEReasoning, SOTA116.8B total parameters and 5.1B “active” parameters per token per forward pass. https://openai.com/index/introducing-gpt-oss/641
53
gpt-oss-20bOpenAI
https://huggingface.co/openai/gpt-oss-20b
2013000650:11.785.371.517.3synthetic, web-scaleAug/2025🟢
https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
MoEReasoning, SOTA20.9B total and 3.6B active parameters. https://openai.com/index/introducing-gpt-oss/640
54
Claude Opus 4.1Anthropic
https://claude.ai/
200010000050:147.180.9synthetic, web-scaleAug/2025🟢
https://www.anthropic.com/news/claude-opus-4-1
MoEReasoning, SOTA639
55
GLM-4.5Z.AI
https://huggingface.co/zai-org/GLM-4.5
3552200062:19.384.679.114.4synthetic, web-scaleJul/2025🟢
https://z.ai/blog/glm-4.5
MoEReasoning355B-A32B.638
56
T1
China Telecom Artificial Intelligence Research Institute
https://github.com/Tele-AI/T1
1151000087:13.6web-scaleJul/2025🟢https://arxiv.org/abs/2507.18013
Dense
Reasoning637
57
Intern-S1
Shanghai AI Laboratory/SenseTime
https://huggingface.co/internlm/Intern-S1
23541000175:110.383.577.3synthetic, web-scaleJul/2025🟢
https://huggingface.co/internlm/Intern-S1
MoEReasoning, SOTA
The 41T-token figure assumes the Qwen3 base model (36T tokens) plus the 5T tokens of further pretraining. "Built upon a 235B MoE language model and a 6B Vision encoder, Intern-S1 has been further pretrained on 5 trillion tokens of multimodal data"
636
58
Step 3StepFun
https://www.stepfun.com/
3211800057:18.072.9web-scaleJul/2025🟢
https://github.com/stepfun-ai/Step3/blob/main/Step3-Sys-Tech-Report.pdf
MoE321B-A38B. https://x.com/CyouSakura/status/1948767450751009227635
59
Qwen3-235B-A22B-Thinking-2507
Alibaba
https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
23536000154:19.793.884.481.1synthetic, web-scaleJul/2025🟢
https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
MoEReasoning
235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux.
634
60
KAT-V1-200BKuaishou2001000050:14.782.378.2synthetic, web-scaleJul/2025🔴https://arxiv.org/abs/2507.08297MoEReasoning
200B-A40B. In training as of Jul/2025. "to address the overthinking problem in reasoning-intensive tasks"
633
61
KAT-V1-40BKuaishou
https://huggingface.co/Kwaipilot/KAT-V1-40B
4010000250:12.177.875.1synthetic, web-scaleJul/2025🟢https://arxiv.org/abs/2507.08297
Dense
Reasoning"to address the overthinking problem in reasoning-intensive tasks"632
62
Qwen3-Coder-480B-A35B-Instruct
Alibaba
https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
4803600075:113.9synthetic, web-scaleJul/2025🟢https://qwenlm.github.io/blog/qwen3-coder/MoE480B-A35B.631
63
Qwen3-235B-A22B-Instruct-2507
Alibaba
https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
23536000154:19.793.18377.5synthetic, web-scaleJul/2025🟢https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507MoESOTA
235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux.
630
64
FlexOlmoAllen AI
https://huggingface.co/allenai/FlexOlmo-7x7B-1T
374150113:11.360.430.9synthetic, web-scaleJul/2025🟢
https://arxiv.org/abs/2507.07024v1
MoE
37B-A20B. "We adopt the OLMo-2 7B setup, starting from a checkpoint pre-trained on 4T tokens and annealed for 50B tokens to produce a public expert. We then train two additional experts on math and code, each for 50B tokens, and combine them with the public expert to form a three-expert version of FLEXOLMO."
629
65
EXAONE 4.0LG
https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B
3214000438:12.292.381.875.4web-scaleJul/2025🟢
https://www.lgresearch.ai/data/cdn/upload/EXAONE_4_0.pdf
Dense
Reasoning
“EXAONE”=“EXpert AI for EveryONE”. Training tokens/ratio: EXAONE-3 7.8B=8T tokens (Aug/2024) -> EXAONE-3.5 7.8B=9T -> EXAONE-3.5 32B=6.5T tokens -> EXAONE 4.0 32B=14T tokens. MMLU score is MMLU-Redux. Interesting: "To focus [RL] training on more informative data samples, we perform accuracy-based filtering by generating eight responses from the SFT model and excluding samples where all eight responses are correct, a pre-filtering step that removes problems that are easy for the model to avoid inefficient training."
628
66
Kimi K2Moonshot AI
https://huggingface.co/moonshotai/Kimi-K2-Instruct
10001550016:113.189.581.175.14.7synthetic, web-scaleJul/2025🟢https://moonshotai.github.io/Kimi-K2/MoEReasoning, SOTA1TA32B. 1T parameters and 384 experts. Open source SOTA.627
67
Reka Flash 3.1Reka AI
https://huggingface.co/RekaAI/reka-flash-3.1
215000239:11.1web-scaleJul/2025🟢
https://www.reka.ai/news/introducing-reka-flash
Dense
Reasoning626
68
Devstral MediumMistral
https://chat.mistral.ai/chat
5012000240:12.6synthetic, web-scaleJul/2025🟢
https://mistral.ai/news/devstral-2507
Dense
Non-reasoning.625
69
Grok 4xAI
https://grok.com/
60080000134:123.188.944.4synthetic, web-scaleJul/2025🟢https://lifearchitect.ai/grok/MoEReasoning, SOTA
"The smartest AI in the world, 100% on SAT, etc, questions that it's never seen before."
624
70
Phi-4-mini-flash-reasoning
Microsoft
https://huggingface.co/microsoft/Phi-4-mini-flash-reasoning
3.851501,356:10.5synthetic, web-scaleJul/2025🟢https://azure.microsoft.com/en-us/blog/reasoning-reimagined-introducing-phi-4-mini-flash-reasoning/
Dense
"Pre-training: 5T tokens; Reasoning training: 150B tokens" "At the core of Phi-4-mini-flash-reasoning is the newly introduced decoder-hybrid-decoder architecture, SambaY, whose central innovation is the Gated Memory Unit (GMU), a simple yet effective mechanism for sharing representations between layers. The architecture includes a self-decoder that combines Mamba (a State Space Model) and Sliding Window Attention (SWA), along with a single layer of full attention. The architecture also involves a cross-decoder that interleaves expensive cross-attention layers with the new, efficient GMUs. This new architecture with GMU modules drastically improves decoding efficiency, boosts long-context retrieval performance and enables the architecture to deliver exceptional performance across a wide range of tasks. "
623
71
T5Gemma
Google DeepMind
https://huggingface.co/google/t5gemma-9b-9b-ul2-it
9100001,112:11.076.755.740.4web-scaleJul/2025🟢https://developers.googleblog.com/en/t5gemma/
Dense
Related paper: https://arxiv.org/abs/2504.06225. Dataset: initialized from Gemma 2 9B (8T tokens), plus 2T tokens of adaptation training.622
72
MedGemma
Google DeepMind
https://huggingface.co/google/medgemma-27b-it
2714000519:12.087web-scaleJul/2025🟢
https://arxiv.org/abs/2507.05201
Dense
Multimodal model. The text MMLU score shown (87.0) covers medical topics only. 621
73
R1T2 ChimeraTNG
https://huggingface.co/tngtech/DeepSeek-TNG-R1T2-Chimera
6851480022:110.6synthetic, web-scaleJul/2025🟢https://arxiv.org/abs/2506.14794MoE
Assembly-of-Experts merge of V3-0324, R1, and R1-0528. Announce: https://x.com/tngtech/status/1940531045432283412?s=46
620
74
Spectra 1.1Consortium3.61200334:10.236.12synthetic, web-scaleJun/2025🟢https://arxiv.org/abs/2506.23025
Dense
"Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights"
619
75
DiffuCoderApple
https://github.com/apple/ml-diffucoder
75630805:10.7code, The StackJun/2025🟢https://arxiv.org/abs/2506.20639
Dense
Diffusion
"We adapt our model from Qwen2.5-Coder (Hui et al., 2024) as the base model to perform continual pre-training using the adaptation approach from Gong et al. (2025). During this pre-training, we use a 400B-token code pre-training corpus from RefineCode (Huang et al., 2024) and Stackv2 (Lozhkov et al., 2024)."
618
76
Hunyuan-A13BTencent
https://huggingface.co/tencent/Hunyuan-A13B-Instruct
80700088:12.588.1767.2371.2synthetic, web-scaleJun/2025🟢
https://huggingface.co/tencent/Hunyuan-A13B-Instruct
MoE
80B-A13B. 'We have open-sourced Hunyuan-A13B-Pretrain , Hunyuan-A13B-Instruct , Hunyuan-A13B-Instruct-FP8 , Hunyuan-A13B-Instruct-GPTQ-Int4 on Hugging Face.'
617
77
MercuryInception
https://chat.inceptionlabs.ai/
90800089:12.869513.4synthetic, web-scaleJun/2025🟢https://www.inceptionlabs.ai/introducing-mercury-our-general-chat-modelDenseDiffusionDiffusion large language model (dLLM).616
78
MuMicrosoft
https://blogs.windows.com/windows-insider/2025/06/13/announcing-windows-11-insider-preview-build-26200-5651-dev-channel/
0.55001,000:10.1synthetic, web-scaleJun/2025🟢https://blogs.windows.com/windowsexperience/2025/06/23/introducing-mu-language-model-and-how-it-enabled-the-agent-in-windows-settings/
Dense
"distillation from Microsoft’s Phi models...Mu is an efficient 330M encoder–decoder language model optimized for small-scale deployment, particularly on the NPUs on Copilot+ PCs. It follows a transformer encoder–decoder architecture"
615
79
Gemini Robotics On-Device
Google DeepMind
https://docs.google.com/forms/u/0/d/1sM5GqcVMWv-KmKY3TOMpVtQ-lDFeAftQ-d9xQn92jCE/viewform?ts=67cef986&edit_requested=true
2010000500:11.5synthetic, web-scaleJun/2025🟢https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/MoE
See Mar/2025 Gemini Robotics-ER model for comparison. Announce: https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/
614
80
ICONN-1ICONNAI
https://huggingface.co/ICONNAI/ICONN-1
8810000114:13.1synthetic, web-scaleJun/2025🟢MoE
"ICONN-1 (this version) is optimized for natural, emotionally resonant, and conversational interactions. ICONN-e1 is a specialized variant of the model fine-tuned for advanced reasoning, critical analysis, and complex problem-solving."
613
81
MiniMax-M1MiniMax
https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
456720016:16.081.1708.4web-scaleJun/2025🟢https://arxiv.org/abs/2506.13585MoEReasoning456B-A45.9B. Announce: https://www.minimax.io/news/minimaxm1612
82
Magistral MediumMistral
https://chat.mistral.ai/chat
5012000240:12.670.8synthetic, web-scaleJun/2025🟢
https://mistral.ai/static/research/magistral.pdf
Dense
ReasoningMagistral Small=24B. Announce: https://mistral.ai/news/magistral611
83
Comma v0.1-2TEleutherAI
https://huggingface.co/common-pile/comma-v0.1-2t
72000286:10.449.8web-scaleJun/2025🟢https://arxiv.org/abs/2506.05209
Dense
"Comma v0.1-2T is a decoder-only transformer that uses the same architecture as Llama 3. Training was done in two stages: first on 1.93 trillion tokens with a cosine learning rate schedule, and second a "cool-down" training phase on 75.5 billion tokens from high-quality sources. The final model is the average of 10 checkpoints during this cool-down phase. Both training phases use a batch size of 8.3 million tokens per step. Training was performed using lingua on 512 AMD MI300A GPUs."
610
84
dots.llm1
Xiaohongshu/RedNote
https://huggingface.co/rednote-hilab/dots.llm1.base
1421120079:14.283.261.952.6web-scaleJun/2025🟢
https://github.com/rednote-hilab/dots.llm1/blob/main/dots1_tech_report.pdf
MoE
142B-A14B. "dots.llm1, a large-scale MoE model that activates 14 billion parameters out of a total of 142 billion parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models."
609
85
Gemini 2.5 Pro 06-05
Google DeepMind
https://gemini.google.com/
40080000200:118.986.421.6synthetic, web-scaleJun/2025🟢
https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf
Dense
Reasoning, SOTA
"an upgraded preview of Gemini 2.5 Pro, our most intelligent model yet. Building on the version we released in May and showed at I/O, this model will be the generally available, stable version starting in a couple of weeks, ready for enterprise-scale applications."
608
86
MiMo-7B-RL-0530Xiaomi
https://huggingface.co/XiaomiMiMo/MiMo-7B-RL-0530
7250003,572:11.458.660.6synthetic, web-scaleMay/2025🟢https://arxiv.org/abs/2505.07608
Dense
Reasoning
"[2025.05.30] During the RL training, by continuously expanding the training window size (from 32K to 48K), the performance of MiMo-7B-RL-0530 on AIME24 can be continuously improved and eventually surpass that of DeepSeek R1... MiMo-7B-Base is pre-trained on approximately 25 trillion tokens."
607
87
DeepTransformers
Google DeepMind
1.310077:10.0synthetic, web-scaleMay/2025🔴
https://arxiv.org/abs/2505.23735
Dense
"Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture."
606
88
Atlas
Google DeepMind
1.310077:10.0synthetic, web-scaleMay/2025🔴
https://arxiv.org/abs/2505.23735
Dense
"Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture."
605
89
DeepSeek-R1-0528
DeepSeek-AI
https://chat.deepseek.com/
6851480022:110.693.4858117.7synthetic, web-scaleMay/2025🟢
https://huggingface.co/deepseek-ai/DeepSeek-R1-0528
MoEReasoning, SOTA
Censorship increased significantly. "overall performance is now approaching that of leading models, such as o3 and Gemini 2.5 Pro." MMLU shows MMLU-Redux score with lower error rate.
604
90
Fathom-R1-14BFractal Analytics
https://huggingface.co/FractalAIResearch/Fathom-R1-14B
14180001,286:11.766.16synthetic, web-scaleMay/2025🟢https://huggingface.co/FractalAIResearch/Fathom-R1-14B
Dense
ReasoningBase model is the R1-distilled 14B (built on Qwen 14B). Media release.603
91
QwenLong-L1-32BAlibaba
https://huggingface.co/Tongyi-Zhiwen/QwenLong-L1-32B
3218000563:12.5synthetic, web-scaleMay/2025🟢https://arxiv.org/abs/2505.17667
Dense
Reasoning
"the first long-context LRM trained with reinforcement learniing for long-context reasoning."
602
92
Claude Opus 4Anthropic
https://claude.ai/
600010000017:181.683.3synthetic, web-scaleMay/2025🟢https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf
Dense
Reasoning, SOTA
"Claude Opus 4 is our most intelligent model to date, pushing the frontier in coding, agentic search, and creative writing. With advanced reasoning and powerful collaboration capabilities…Both models can also alternate between reasoning and tool use—like web search—to improve responses…Claude Opus 4 can work continuously for hours on complex, long-running tasks"
601
93
Falcon-H1TII
https://huggingface.co/tiiuae/Falcon-H1-34B-Instruct-GGUF
3418000530:12.684.0558.7349.66synthetic, web-scaleMay/2025🟢
https://huggingface.co/papers/2507.22448
Dense
"hybrid architecture that combines the strengths of the classical Transformer-based attention mechanism with the State Space Model (SSM), known for its superior long-context memory and computational efficiency."600
94
Gemini Diffusion
Google DeepMind
https://deepmind.google/models/gemini-diffusion/
4016000400:12.740.4synthetic, web-scaleMay/2025🟢
https://deepmind.google/models/gemini-diffusion/
Dense
Diffusion
"Gemini Diffusion’s external benchmark performance is comparable to much larger models [like Gemini-2.0-Flash-Lite], whilst also being faster."
599
95
Gemma 3n
Google DeepMind
https://ai.google.dev/gemma/docs/gemma-3n
480002,000:10.662.1synthetic, web-scaleMay/2025🟢
https://developers.googleblog.com/en/introducing-gemma-3n/
MatFormer
Matryoshka Transformer or MatFormer model architecture. 850M (696M / 620M / 582M).
598
96
ParScaleAlibaba
https://huggingface.co/ParScale/ParScale-4.7B-P8-Python
4.71000213:10.235.1synthetic, web-scaleMay/2025🟢https://arxiv.org/abs/2505.10475
Dense
"We introduce the third scaling paradigm for scaling LLMs: leverages parallel computation during both training and inference time (Parallel Scaling, or ParScale)... ParScale can use up to 22× less memory increase and 6× less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget." MMLU shows for 1.8B models, not the 4.7B models.
597
97
codex-1OpenAIhttps://chatgpt.com/codex600100000167:125.8synthetic, web-scaleMay/2025🟢https://openai.com/index/introducing-codex/MoEReasoning, SOTA
o3 base. "codex-1, a version of OpenAI o3 optimized for software engineering. It was trained using reinforcement learning on real-world coding tasks in a variety of environments to generate code that closely mirrors human style and PR preferences, adheres precisely to instructions, and can iteratively run tests until it receives a passing result."
596
98
Falcon-EdgeTII
https://huggingface.co/tiiuae/Falcon-E-3B-Instruct
31500500:10.255.727.1623.59synthetic, web-scaleMay/2025🟢
https://huggingface.co/blog/tiiuae/falcon-edge
Dense
"Falcon-Edge series - a collection of powerful, universal, and fine-tunable language models available in ternary format, based on the BitNet architecture."595
99
SWE-1Windsurf
https://windsurf.com/blog/windsurf-wave-9-swe-1
508000160:12.1synthetic, web-scaleMay/2025🟢
https://windsurf.com/blog/windsurf-wave-9-swe-1
Dense"SWE-1, optimized for the entire software engineering process, not just the task of coding."594
100
INTELLECT-2Prime Intellect
https://chat.primeintellect.ai/
3218000563:12.566.8web-scaleMay/2025🟢
https://storage.googleapis.com/public-technical-paper/INTELLECT_2_Technical_Report.pdf
Dense
ReasoningQwQ-32B base. Announce: https://www.primeintellect.ai/blog/intellect-2-release Finished training 30/Apr/2025: https://app.primeintellect.ai/intelligence/intellect-2593