(703) Permalink:
https://lifearchitect.ai/models-table
Timeline view:
https://lifearchitect.ai/timeline
The Memo:
https://lifearchitect.ai/memo
Model
Lab
Playground
Parameters (B)
Tokens trained (B)
Ratio Tokens:Params (Chinchilla scaling ≥20:1)
ALScore
"ALScore" is a quick and dirty rating of the model's power. The formula is:
ALScore = √(Parameters (B) × Tokens trained (B)) ÷ 300.
Any ALScore ≥ 1.0 is a powerful model in mid-2023. (A worked example follows this column list.)
MMLU
MMLU-Pro
GPQA
HLE
Training dataset
Announced
Public?
Paper / Repo
Arch
Tags
Notes
Count (rough)
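For reference, the ALScore and Ratio Tokens:Params columns can be recomputed from the Parameters (B) and Tokens trained (B) columns alone. A minimal Python sketch, using the figures from the Kimi K2 row (1,000B parameters, 15,500B tokens):

import math

def alscore(params_b: float, tokens_b: float) -> float:
    # ALScore = sqrt(parameters in billions x training tokens in billions) / 300
    return math.sqrt(params_b * tokens_b) / 300

def tokens_to_params_ratio(params_b: float, tokens_b: float) -> float:
    # Chinchilla-style ratio; roughly >= 20:1 is treated as compute-optimal here
    return tokens_b / params_b

print(round(alscore(1000, 15500), 1))                  # 13.1, matching the Kimi K2 row
print(f"{tokens_to_params_ratio(1000, 15500):.0f}:1")  # 16:1, matching the Kimi K2 row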
3
AuroraGPT (ScienceGPT)
Argonne National Laboratory
https://lifearchitect.ai/auroragpt/20003000015:125.8TBA🔴
Three models targeted in Jul/2024: AuroraGPT-7B-P (Ponte Vecchio GPU testing), AuroraGPT-7B-A (Aurora), and AuroraGPT-7B-A-S (Aurora + Science).
4
DeepSeek-R2DeepSeek-AIhttps://www.reuters.com/technology/artificial-intelligence/deepseek-rushes-launch-new-ai-model-china-goes-all-2025-02-25/1200130000109:141.6TBA🟢https://docs.google.com/document/d/e/2PACX-1vTmx-A5sBe_3RsURGM7VvLWsAgUXbcIb2pFaW7f1FTPgK7mGvYENXGQPoF2u4onFndJ_5tzZ02su-vg/pubMoEReasoning, SOTA
Hybrid MoE, 1.2T total parameters with 78B active (1.2TA78B). 5.2PB corpus ≈ 1.3 quadrillion (Qa) tokens; 1.3 quadrillion tokens = 1,300T tokens = 1,300,000B tokens. "Constructed a 5.2 PB high-quality corpus covering vertical domains such as finance, law, and patents." Source: http://jiuyangongshe.com/a/1h4gq724su0, translated at: https://docs.google.com/document/d/e/2PACX-1vTmx-A5sBe_3RsURGM7VvLWsAgUXbcIb2pFaW7f1FTPgK7mGvYENXGQPoF2u4onFndJ_5tzZ02su-vg/pub
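A quick back-of-the-envelope check on the corpus-to-token conversion above; the ~4 bytes of raw text per token is an assumed figure, not stated in the source:

corpus_bytes = 5.2e15                    # 5.2 PB corpus (figure from the source)
bytes_per_token = 4                      # assumption: ~4 bytes of raw text per token
tokens = corpus_bytes / bytes_per_token  # 1.3e15 tokens = 1.3 quadrillion
print(f"{tokens / 1e9:,.0f}B tokens")    # -> 1,300,000B tokens, i.e. 1,300T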
5
GPT-6OpenAIhttps://lifearchitect.ai/gpt-6/TBASOTADue 2026.
6
Grok-5xAIhttps://lifearchitect.ai/whats-in-grok/600010000017:181.6TBAMoEDue 2026. Quote 3T/6T: https://youtu.be/q_mMV5OpRd4?t=1387
7
Trinity-LargeArcee AI4202000048:19.7TBA🟢
https://www.arcee.ai/blog/the-trinity-manifesto
MoEReasoning
420BA13B. "we worked closely with Prime Intellect. They not only served the H100 clusters Datology used to generate synthetic data, they have been deeply involved in helping scale our training setup to the GPU footprint required for a fully frontier sized model, including the current 2048 B300 GPU configuration for Trinity Large."
8
HY 2.0Tencent
https://hunyuan.tencent.com/
4064000099:113.418.8synthetic, web-scaleDec/2025🟢
https://x.com/TencentHunyuan/status/1996948083377332614
MoE406BA32B.703
9
Trinity-MiniArcee AI
https://huggingface.co/arcee-ai/Trinity-Mini
2620000770:12.484.9558.55synthetic, web-scaleDec/2025🟢
https://www.arcee.ai/blog/the-trinity-manifesto
MoEReasoning
26BA3B. "we worked closely with Prime Intellect. They not only served the H100 clusters Datology used to generate synthetic data, they have been deeply involved in helping scale our training setup to the GPU footprint required for a fully frontier sized model, including the current 2048 B300 GPU configuration for Trinity Large."
702
10
Nova 2 ProAmazon
https://nova.amazon.com/chat
20020000100:16.781.681.4synthetic, web-scaleDec/2025🟢
https://www.aboutamazon.com/news/aws/aws-agentic-ai-amazon-bedrock-nova-models
Dense
Reasoning
"Nova 2 Pro is Amazon's most intelligent reasoning model that can process text, images, video, and speech to generate text."
701
11
Mistral Large 3Mistral
https://huggingface.co/collections/mistralai/mistral-large-3
6752000030:112.243.9synthetic, web-scaleDec/2025🟢
https://mistral.ai/news/mistral-3
MoEReasoning675BA41B. "Mistral Large 3 joins the ranks of frontier instruction-fine-tuned open-source models." EU tech doc: https://legal.cms.mistral.ai/assets/1e37fffd-7ea5-469b-822f-05dcfbb43623
700
12
DeepSeek-V3.2-Speciale
DeepSeek-AIhttps://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale6851564022:110.685.730.6synthetic, web-scaleDec/2025🟢https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdfMoESOTA, ReasoningThe word 'Speciale' may be a reference to Ferrari. "It shows gold-medal performance in the IOI 2025, ICPC World Final 2025, IMO 2025, and CMO 2025." API: https://api-docs.deepseek.com/news/news251201
699
13
DeepSeek-Math-V2
DeepSeek-AIhttps://huggingface.co/deepseek-ai/DeepSeek-Math-V26851564022:110.6synthetic, web-scaleNov/2025🟢https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdfMoESOTA, Reasoning"DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled testtime compute. "698
14
Orchestrator-8BNVIDIA
https://huggingface.co/nvidia/Orchestrator-8B
8360004,500:11.837.1synthetic, web-scaleNov/2025🟢https://arxiv.org/abs/2511.21689
Dense
ReasoningBase Model: Qwen3-8B697
15
INTELLECT-3Prime Intellect
https://chat.primeintellect.ai/
10622000208:15.181.974.414.6synthetic, web-scaleNov/2025🟢https://storage.googleapis.com/intellect-3-paper/INTELLECT_3_Technical_Report.pdf
Dense
Reasoning
GLM-4.5-Air-Base model. Announce: https://www.primeintellect.ai/blog/intellect-3
696
16
Fara-7BMicrosoft
https://huggingface.co/microsoft/Fara-7B
7180002,572:11.2synthetic, web-scaleNov/2025🟢https://www.microsoft.com/en-us/research/wp-content/uploads/2025/11/Fara-7B-An-Efficient-Agentic-Model-for-Computer-Use.pdf
Dense
"Fara-7B is Microsoft's first agentic small language model (SLM) designed specifically for computer use. With only 7 billion parameters, Fara-7B is an ultra-compact Computer Use Agent (CUA)...Current production baselines leverage Qwen 2.5-VL (7B)."
695
17
Claude Opus 4.5Anthropic
https://claude.ai/
500010000020:174.586.9543.2synthetic, web-scaleNov/2025🟢
https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf
MoEReasoning, SOTA"the best model in the world for coding, agents, and computer use." Announce: https://www.anthropic.com/news/claude-opus-4-5694
18
Nemotron ElasticNVIDIA
https://huggingface.co/nvidia/Nemotron-Elastic-12B
1211010:10.176.263.25synthetic, web-scaleNov/2025🟢https://arxiv.org/abs/2511.16664v1
Dense
Reasoning"Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning...We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens"693
19
GeoVistaTencent
https://github.com/ekonwang/GeoVista
7180002,572:11.2synthetic, web-scaleNov/2025🟢
https://arxiv.org/abs/2511.15705
Dense
Base model: Qwen2.5-VL-7B-Instruct. "GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. " Project page: https://ekonwang.github.io/geo-vista/
692
20
OLMo 3Allen AI
https://huggingface.co/collections/allenai/olmo-3
326000188:11.585.458.1synthetic, web-scaleNov/2025🟢
https://www.datocms-assets.com/64837/1763662397-1763646865-olmo_3_technical_report-1.pdf
Dense
ReasoningAnnounce: https://allenai.org/blog/olmo3
691
21
Gemini 3
Google DeepMind
https://gemini.google.com/300010000034:157.790.193.845.8synthetic, web-scaleNov/2025🟢https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdfMoEReasoning, SOTA"The knowledge cutoff date for Gemini 3 Pro was January 2025."690
22
Grok 4.1xAI
https://grok.com/
30008000027:151.6synthetic, web-scaleNov/2025🟢https://x.ai/news/grok-4-1MoEReasoning689
23
BaguettotronPleIAs
https://huggingface.co/PleIAs/Baguettotron
0.321200624:10.040synthetic, web-scaleNov/2025🟢
https://huggingface.co/PleIAs/Baguettotron
Dense
Reasoning
"The name is both a nod to French origins and to the unusual shape of the model: with 80 layers, Baguettotron is currently the deepest SLM in its size range."
688
24
ERNIE-5.0-Preview-1022
Baiduhttps://ernie.baidu.com/240010000042:151.6synthetic, web-scaleNov/2025🟢https://ernie.baidu.com/blog/posts/ernie-5.0-preview-1022-release-on-lmarena/MoEReasoning
Very low performance on ALPrompt. 2.4T params confirmed: https://global.chinadaily.com.cn/a/202511/13/WS691571bda310d6866eb29500.html
687
25
GPT-5.1OpenAI
https://chatgpt.com/
300114000380:119.588.1synthetic, web-scaleNov/2025🟢
https://openai.com/index/gpt-5-1/
MoEReasoning, SOTA
Personality change via fine-tuning. GPQA (no tools) increased from GPT-5=85.7 to GPT-5.1=88.1.
686
26
JustRL-Nemotron-1.5B
Tsinghua
https://huggingface.co/hbx/JustRL-Nemotron-1.5B
1.590006,000:10.4synthetic, web-scaleNov/2025🟢
https://relieved-cafe-fe1.notion.site/JustRL-Scaling-a-1-5B-LLM-with-a-Simple-RL-Recipe-24f6198b0b6b80e48e74f519bfdaf0a8
Dense
Reasoning
"JustRL, a simple recipe with fixed hyperparameters, achieves state-of-the-art performance on two different 1.5B base models (54.5% and 64.3% across 9 math benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training remains stable over thousands of steps without intervention. This suggests the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline."
685
27
ERNIE-4.5-VL-28B-A3B-Thinking
Baiduhttps://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-Thinking2815000536:12.278.966synthetic, web-scaleNov/2025🟢https://github.com/PaddlePaddle/ERNIEMoEReasoning28B-A3B. Open-sourced 12/Nov/2025 from Jun/2025 release.684
28
HOPE
Google DeepMind
1.310077:10.0synthetic, web-scaleNov/2025🟡
https://abehrouz.github.io/files/NL.pdf
Dense
"Combining our self-modifying sequence model with the continuum memory system, we present a learning module, called HOPE, showing promising results in language modeling, continual learning, and long-context reasoning tasks." Announce: https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/ May be released after paper is public.
683
29
Kimi K2 ThinkingMoonshot AI
https://kimi.com/
10001550016:113.194.484.684.544synthetic, web-scaleNov/2025🟢https://moonshotai.github.io/Kimi-K2/thinking.htmlMoEReasoning, SOTA
1TA32B: 1T total parameters with 32B active, 384 experts. Open-source SOTA. HLE=51.0 on the text-only subset, compared to Grok-4 HLE=50.7 also text-only; Grok-4 scores HLE=44.4 on the full set, ∴ Kimi K2 Thinking HLE≈44 full (estimated; see the sketch after this entry).
682
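One way to reproduce the ≈44 full-HLE estimate in the Kimi K2 Thinking entry above is to scale Kimi's text-only score by Grok-4's full-to-text-only ratio. The proportional-scaling step is an assumption, not something the source states:

kimi_text_only = 51.0   # Kimi K2 Thinking, HLE text-only subset (from the entry)
grok4_text_only = 50.7  # Grok-4, HLE text-only subset
grok4_full = 44.4       # Grok-4, full HLE
# Assumption: the full/text-only ratio carries over between the two models.
kimi_full_estimate = kimi_text_only * (grok4_full / grok4_text_only)
print(round(kimi_full_estimate, 1))  # -> 44.7, consistent with the ≈44 estimate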
30
GEN-0Generalist
https://generalistai.com/blog/nov-04-2025-GEN-0
10100001,000:11.1web-scaleNov/2025🟡https://generalistai.com/blog/nov-04-2025-GEN-0
Dense
SOTA
"GEN-0, a new class of embodied foundation models built for multimodal training directly on high-fidelity raw physical interaction. Its architecture builds on the strengths of vision and language models while also going beyond them—natively designed to capture human-level reflexes and physical commonsense. One core feature is Harmonic Reasoning, in which the models are trained to simultaneously think and act seamlessly... GEN-0 is pretrained on our in-house robotics dataset, which includes over 270,000 hours of real-world diverse manipulation data, growing at a rate of 10,000 hours a week and accelerating."
681
31
CALMWechat
https://github.com/shaochenze/calm
1.82230127:10.1web-scaleOct/2025🟢https://arxiv.org/abs/2510.27688
Dense
"Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy... We train our models on the Pile uncopyrighted dataset (Gao et al., 2020). The raw text is processed with the Llama 3 tokenizer (Grattafiori et al., 2024), resulting in a training set of ∼230B tokens."
680
32
Kimi-LinearMoonshot AI
https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct
485700119:11.751synthetic, web-scaleOct/2025🟢https://github.com/MoonshotAI/Kimi-Linear?tab=readme-ov-fileMoE
48B-A3B. "Kimi Linear is a hybrid linear attention architecture that outperforms traditional full attention methods across various contexts, including short, long, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA)—a refined version of Gated DeltaNet that introduces a more efficient gating mechanism to optimize the use of finite-state RNN memory."
679
33
MiniMax-M2MiniMax
https://huggingface.co/MiniMaxAI/MiniMax-M2
230720032:14.3827831.8web-scaleOct/2025🟢https://platform.minimax.io/docs/guides/text-generationMoEReasoning230B-A10B.678
34
DeepSeek-OCRDeepSeek-AIhttps://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf360002,000:10.4specialOct/2025🟢https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdfMoE2D vision tokens for 1D text achieves huge compression. Encoder/Decoder: DeepEncoder 380M (80M SAM-base + 300M CLIP-large), DeepSeek-3B-MoE (A570M). 677
35
UserLM-8bMicrosoft
https://huggingface.co/microsoft/UserLM-8b
81000125:10.3WildChatOct/2025🟢https://huggingface.co/microsoft/UserLM-8b
Dense
"we trained UserLM-8b to simulate the “user” role in conversation (by training it to predict user turns in a large corpus of conversations called WildChat)."
676
36
CoDASalesforce
https://huggingface.co/Salesforce/CoDA-v0-Instruct
1.7180106:10.1synthetic, web-scaleOct/2025🟢https://github.com/SalesforceAIResearch/CoDA/blob/main/technical_report.pdf
Dense
Diffusion"diffusion coder trained on TPU [Google TPU v4-1024 VM]"675
37
TRMSamsunghttps://github.com/SamsungSAILMontreal/TinyRecursiveModels0.0070.115:10.0Mazes (ARC-AGI)Oct/2025🟢https://arxiv.org/abs/2510.04871v1
Dense
"Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers"
674
38
Granite-4.0 SmallIBM
https://huggingface.co/ibm-granite/granite-4.0-h-small
3215000469:12.378.3355.4740.63synthetic, web-scaleOct/2025🟢
https://www.ibm.com/granite/docs/models/granite
MoEReasoning
32B-A9B. Announce: https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models
673
39
GLM-4.6Z.AI
https://huggingface.co/zai-org/GLM-4.6
3552200062:19.382.930.4synthetic, web-scaleSep/2025🟢
https://z.ai/blog/glm-4.6
MoEReasoning355B-A32B. "context window has been expanded from 128K to 200K tokens"672
40
Ring-1T-previewInclusionAI
https://huggingface.co/inclusionAI/Ring-1T-preview
10002000020:114.9synthetic, web-scaleSep/2025🟢
https://huggingface.co/inclusionAI/Ring-1T-preview
MoEReasoning1T-A48.5B.671
41
Claude Sonnet 4.5Anthropic
https://claude.ai/
40080000200:118.983.4synthetic, web-scaleSep/2025🟢
https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf
MoEReasoning, SOTAThe Claude Sonnet 4.5 "system card" is an absolute farce. Announce: https://www.anthropic.com/news/claude-sonnet-4-5
670
42
Gemini Robotics 1.5
Google DeepMind
20020000100:16.759.6synthetic, web-scaleSep/2025🟢
https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf
MoEReasoning
2. "vision-language-action (VLA) model turns visual information and instructions into motor commands for a robot to perform a task." Available to select partners.
669
43
Gemini Robotics-ER 1.5
Google DeepMind
https://aistudio.google.com/?model=gemini-robotics-er-1.5-preview
30300001,000:13.283.3synthetic, web-scaleSep/2025🟢
https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf
MoEReasoning
1. "vision-language model (VLM) reasons about the physical world, natively calls digital tools and creates detailed, multi-step plans to complete a mission." Available to all devs.
668
44
TimesFM-ICFGoogle
https://huggingface.co/collections/google/timesfm-release-66e4be5fdb56e960c1e482a6
0.2100500:10.0specialSep/2025🔴
https://research.google/blog/time-series-foundation-models-can-be-few-shot-learners/
Dense
TimesFM-ICF is 6.8% more accurate than TimesFM (Base). Time-series forecasting only. 'a large pretraining corpus of 100B real world time-points' may be more than 100B tokens.
667
45
Qwen3-MaxAlibaba
https://chat.qwen.ai/
10003600036:120.085.4synthetic, web-scaleSep/2025🟢https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2777d&from=research.latest-advancements-listMoEReasoning
"Qwen3-Max-Thinking — still under active training — is already demonstrating remarkable potential. When augmented with tool usage and scaled test-time compute, the Thinking variant has achieved 100% on challenging reasoning benchmarks such as AIME 25 and HMMT. "
666
46
Qwen3-OmniAlibaba
https://github.com/QwenLM/Qwen3-Omni?tab=readme-ov-file
3017000567:12.488.873.1synthetic, web-scaleSep/2025🟢https://github.com/QwenLM/Qwen3-Omni/blob/main/assets/Qwen3_Omni.pdfMoEReasoning
"Qwen3-Omni is a unified end-to-end model capable of processing multiple modalities, such as text, audio, image and video, and generating real-time text or speech response."... "pretraining utilizes a large-scale dataset containing approximately 2 trillion tokens, with the following distribution across modalities: text (0.57 trillion), audio (0.77 trillion), image (0.82 trillion), video (0.05 trillion), and video-audio (0.05 trillion)."
665
47
DeepSeek-V3.1-Terminus
DeepSeek-AIhttps://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus6851564022:110.68580.721.7synthetic, web-scaleSep/2025🟢https://api-docs.deepseek.com/news/news250922MoESOTA, ReasoningHybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2
664
48
Isaac 0.1Perceptronhttps://huggingface.co/PerceptronAI/Isaac-0.1220001,000:10.2synthetic, web-scaleSep/2025🟢https://www.perceptron.inc/blog/introducing-isaac-0-1
Dense
"perceptive-language model...delivering capabilities that meet or exceed those of models over 50 times its size. Founded by the team behind Meta's Chameleon multimodal models, Perceptron is tackling a fundamental challenge: bringing the power of physical AI to the dynamic, multimodal, and real-time environments we live and work in."
663
49
Grok 4 FastxAI
https://grok.com/
3000200007:125.885.720synthetic, web-scaleSep/2025🟢https://x.ai/news/grok-4-fastMoEReasoning, SOTA
"2M token context window, and a unified architecture that blends reasoning and non-reasoning modes in one model."
662
50
VaultGemma
Google DeepMind
https://huggingface.co/google/vaultgemma-1b
11300013,000:10.4web-scaleSep/2025🟢https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf
Dense
"Differential Privacy (DP) has emerged as the gold standard, providing a rigorous, mathematical framework to limit the influence of any single example in the training data on the resulting model. A model trained with DP provably bounds the reconstruction or leakage of information tied to individual data points." Announce: https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/661
51
Qwen3-Next-80B-A3B
Alibaba
https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
8015000188:13.784.7266.0543.43synthetic, web-scaleSep/2025🟢https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-listMoEReasoning
"Qwen3-Next introduces several key improvements: a hybrid attention mechanism, a highly sparse Mixture-of-Experts (MoE) structure, training-stability-friendly optimizations, and a multi-token prediction mechanism for faster inference."
660
52
K2-ThinkMBZUAIhttps://www.k2think.ai/3218000563:12.571.089.95synthetic, web-scaleSep/2025🟢https://arxiv.org/abs/2509.07604
Dense
Reasoning
"Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets."
659
53
mmBERTJHUhttps://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed40.30730009,772:10.1synthetic, web-scaleSep/2025🟢https://arxiv.org/abs/2509.06888
Dense
"a modern multilingual encoder trained on 3T tokens and 1833 languages. We introduce several novel elements in training: an inverse masking schedule and a cascading annealed language learning schedule for multilingual data" Announce: https://huggingface.co/blog/mmbert
658
54
ERNIE X1.1Baiduhttps://ernie.baidu.com/synthetic, web-scaleSep/2025🟢https://www.prnewswire.com/news-releases/baidu-unveils-reasoning-model-ernie-x1-1-with-upgrades-in-key-capabilities-302551170.htmlMoEReasoning657
55
ERNIE-4.5-21B-A3B-Thinking
Baidu
https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking
2115000715:11.9synthetic, web-scaleSep/2025🟢https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-ThinkingMoEReasoning656
56
Klear-46B-A2.5BKuaishou
https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct
4622000479:13.480.557.635.3synthetic, web-scaleSep/2025🟢https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-InstructMoE46B-A2.5B.655
57
TildeOpen-30bTilde AI
https://huggingface.co/TildeAI/TildeOpen-30b
30200067:10.8synthetic, web-scaleSep/2025🟢https://tilde.ai/lv/tildeopen-llm/
Dense
"language data from across Europe"654
58
Qwen3-Max-Preview
Alibaba
https://chat.qwen.ai/
10003600036:120.064.6synthetic, web-scaleSep/2025🟢https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?type=model&url=2840914_2&modelId=qwen3-max-previewMoE
GPQA score is SuperGPQA. "our biggest model yet, with over 1 trillion parameters"
653
59
Kimi K2-Instruct-0905
Moonshot AI
https://huggingface.co/moonshotai/Kimi-K2-Instruct
10001550016:113.1synthetic, web-scaleSep/2025🟢https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905MoEReasoning, SOTA1TA32B. 1T parameters and 384 experts. Open source SOTA.652
60
ApertusETH Zürich
https://huggingface.co/swiss-ai/Apertus-70B-Instruct-2509
7015000215:13.465.230.6synthetic, web-scaleSep/2025🟢https://github.com/swiss-ai/apertus-tech-report/blob/main/Apertus_Tech_Report.pdf
Dense
"Apertus – Latin for “open”" 1,811 languages. Announce: https://ethz.ch/en/news-and-events/eth-news/news/2025/09/press-release-apertus-a-fully-open-transparent-multilingual-language-model.html
651
61
LongCat-FlashMeituan
https://longcat.ai/
5602000036:111.289.7182.6873.23synthetic, web-scaleSep/2025🟢https://github.com/meituan-longcat/LongCat-Flash-Chat/blob/main/tech_report.pdfMoEReasoning, SOTA
560B-A18.6B–31.3B (27B on average). Announce: https://lmsys.org/blog/2025-09-01-sglang-longcat-flash/
650
62
MAI-1-previewMicrosoft
https://microsoft.ai/news/two-new-in-house-models/
5001000020:17.5synthetic, web-scaleAug/2025🟢https://microsoft.ai/news/two-new-in-house-models/MoE
MAI=Microsoft artificial intelligence. "MAI’s first foundation model trained end-to-end... MAI-1-preview is an in-house mixture-of-experts model, pre-trained and post-trained on ~15,000 NVIDIA H100 GPUs. This model is designed to provide powerful capabilities to consumers seeking to benefit from models that specialize in following instructions and providing helpful responses to everyday queries. We will be rolling MAI-1-preview out for certain text use cases within Copilot"
649
63
grok-code-fast-1xAI
https://github.com/features/copilot
8001000013:19.4synthetic, web-scaleAug/2025🟢
https://data.x.ai/2025-08-26-grok-code-fast-1-model-card.pdf
MoE
"We built grok-code-fast-1 from scratch, starting with a brand-new model architecture. To lay a robust foundation, we carefully assembled a pre-training corpus rich with programming-related content. For post-training, we curated high-quality datasets that reflect real-world pull requests and coding tasks." Announce: https://x.ai/news/grok-code-fast-1
648
64
Hermes 4Nous Research
https://huggingface.co/NousResearch/Hermes-4-405B-FP8
4051565639:18.487.280.570.5synthetic, web-scaleAug/2025🟢
https://arxiv.org/abs/2508.18255
Dense
ReasoningBased on Llama 3. Announce: https://hermes4.nousresearch.com/
647
65
Jet-Nemotron-4BNVIDIA
https://github.com/NVlabs/Jet-Nemotron
4400100:10.165.244.2synthetic, web-scaleAug/2025🟢https://arxiv.org/abs/2508.15884v1
Dense
Reasoning"pre-training corpus and train Jet-Nemotron models for 50B tokens. This is also the setting in Section 2 where we perform PostNAS. At the second stage, we include more high-quality data from math [65] and coding [66, 67] domains into our data mixture. The models are then trained on 350B tokens."646
66
DeepSeek-V3.1-Base
DeepSeek-AI
https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base
6851564022:110.693.784.880.129.8synthetic, web-scaleAug/2025🟢https://huggingface.co/deepseek-ai/DeepSeek-V3.1-BaseMoESOTA, ReasoningHybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2
645
67
Nemotron Nano 2NVIDIA
https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base
12.31200001,625:11.778.2463.9864.48synthetic, web-scaleAug/2025🟢https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf
Dense
ReasoningAnnounce: https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2/
644
68
Gemma 3 270M
Google DeepMind
https://huggingface.co/google/gemma-3-270m-it
0.27600022,223:10.1web-scaleAug/2025🟢https://developers.googleblog.com/en/introducing-gemma-3-270m/
Dense
This is a record tokens-to-params ratio (for text models) of 22,223:1.643
69
GPT-5OpenAI
https://poe.com/GPT-5
300114000380:119.59189.442synthetic, web-scaleAug/2025🟢
https://openai.com/index/gpt-5-system-card/
MoESOTA, Reasoning
Announce: https://openai.com/index/introducing-gpt-5/. MMLU is based on ES and PT translated from EN.
642
70
gpt-oss-120bOpenAI
https://huggingface.co/openai/gpt-oss-120b
12030000250:16.39080.119synthetic, web-scaleAug/2025🟢
https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
MoEReasoning, SOTA116.8B total parameters and 5.1B “active” parameters per token per forward pass. https://openai.com/index/introducing-gpt-oss/
641
71
gpt-oss-20bOpenAI
https://huggingface.co/openai/gpt-oss-20b
2013000650:11.785.371.517.3synthetic, web-scaleAug/2025🟢
https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
MoEReasoning, SOTA20.9B total and 3.6B active parameters. https://openai.com/index/introducing-gpt-oss/
640
72
Claude Opus 4.1Anthropic
https://claude.ai/
200010000050:147.180.9synthetic, web-scaleAug/2025🟢
https://www.anthropic.com/news/claude-opus-4-1
MoEReasoning, SOTA639
73
GLM-4.5Z.AI
https://huggingface.co/zai-org/GLM-4.5
3552200062:19.384.679.114.4synthetic, web-scaleJul/2025🟢
https://z.ai/blog/glm-4.5
MoEReasoning355B-A32B.638
74
T1
China Telecom Artificial Intelligence Research Institute
https://github.com/Tele-AI/T1
1151000087:13.6web-scaleJul/2025🟢https://arxiv.org/abs/2507.18013
Dense
Reasoning637
75
Intern-S1
Shanghai AI Laboratory/SenseTime
https://huggingface.co/internlm/Intern-S1
23541000175:110.383.577.3synthetic, web-scaleJul/2025🟢
https://huggingface.co/internlm/Intern-S1
MoEReasoning, SOTA
The 41T-token figure assumes the Qwen3 base model (36T tokens) plus further pretraining: "Built upon a 235B MoE language model and a 6B Vision encoder, Intern-S1 has been further pretrained on 5 trillion tokens of multimodal data"
636
76
Step 3StepFun
https://www.stepfun.com/
3211800057:18.072.9web-scaleJul/2025🟢
https://github.com/stepfun-ai/Step3/blob/main/Step3-Sys-Tech-Report.pdf
MoE321B-A38B. https://x.com/CyouSakura/status/1948767450751009227
635
77
Qwen3-235B-A22B-Thinking-2507
Alibaba
https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
23536000154:19.793.884.481.1synthetic, web-scaleJul/2025🟢
https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
MoEReasoning
235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux.
634
78
KAT-V1-200BKuaishou2001000050:14.782.378.2synthetic, web-scaleJul/2025🔴https://arxiv.org/abs/2507.08297MoEReasoning
200BA40B. In training as of Jul/2025. "to address the overthinking problem in reasoning-intensive tasks"
633
79
KAT-V1-40BKuaishou
https://huggingface.co/Kwaipilot/KAT-V1-40B
4010000250:12.177.875.1synthetic, web-scaleJul/2025🟢https://arxiv.org/abs/2507.08297
Dense
Reasoning"to address the overthinking problem in reasoning-intensive tasks"632
80
Qwen3-Coder-480B-A35B-Instruct
Alibaba
https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
4803600075:113.9synthetic, web-scaleJul/2025🟢https://qwenlm.github.io/blog/qwen3-coder/MoE480B-A35B.631
81
Qwen3-235B-A22B-Instruct-2507
Alibaba
https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
23536000154:19.793.18377.5synthetic, web-scaleJul/2025🟢https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507MoESOTA
235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux.
630
82
FlexOlmoAllen AI
https://huggingface.co/allenai/FlexOlmo-7x7B-1T
374150113:11.360.430.9synthetic, web-scaleJul/2025🟢
https://arxiv.org/abs/2507.07024v1
MoE
37B-A20B. "We adopt the OLMo-2 7B setup, starting from a a checkpoint pre-trained on 4T tokens and annealed for 50B tokens to produce a public expert. We then train two additional experts on math and code, each for 50B tokens, and combine them with the public expert to form a three-expert version of FLEXOLMO."
629
83
EXAONE 4.0LG
https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B
3214000438:12.292.381.875.4web-scaleJul/2025🟢
https://www.lgresearch.ai/data/cdn/upload/EXAONE_4_0.pdf
Dense
Reasoning
“EXAONE”=“EXpert AI for EveryONE”. Training tokens/ratio: EXAONE-3 7.8B=8T tokens (Aug/2024) -> EXAONE-3.5 7.8B=9T -> EXAONE-3.5 32B=6.5T tokens -> EXAONE 4.0 32B=14T tokens. MMLU score is MMLU-Redux. Interesting: "To focus [RL] training on more informative data samples, we perform accuracy-based filtering by generating eight responses from the SFT model and excluding samples where all eight responses are correct, a pre-filtering step that removes problems that are easy for the model to avoid inefficient training."
628
84
Kimi K2Moonshot AI
https://huggingface.co/moonshotai/Kimi-K2-Instruct
10001550016:113.189.581.175.14.7synthetic, web-scaleJul/2025🟢https://moonshotai.github.io/Kimi-K2/MoEReasoning, SOTA1TA32B. 1T parameters and 384 experts. Open source SOTA.627
85
Reka Flash 3.1Reka AI
https://huggingface.co/RekaAI/reka-flash-3.1
215000239:11.1web-scaleJul/2025🟢
https://www.reka.ai/news/introducing-reka-flash
Dense
Reasoning626
86
Devstral MediumMistral
https://chat.mistral.ai/chat
5012000240:12.6synthetic, web-scaleJul/2025🟢
https://mistral.ai/news/devstral-2507
Dense
Non-reasoning.625
87
Grok 4xAI
https://grok.com/
30008000027:151.688.944.4synthetic, web-scaleJul/2025🟢https://lifearchitect.ai/grok/MoEReasoning, SOTA
2.4T? https://x.com/kalomaze/status/1942996555088134592 "The smartest AI in the world, 100% on SAT, etc, questions that it's never seen before."
624
88
Phi-4-mini-flash-reasoning
Microsoft
https://huggingface.co/microsoft/Phi-4-mini-flash-reasoning
3.851501,356:10.5synthetic, web-scaleJul/2025🟢https://azure.microsoft.com/en-us/blog/reasoning-reimagined-introducing-phi-4-mini-flash-reasoning/
Dense
"Pre-training: 5T tokens; Reasoning training: 150B tokens" "At the core of Phi-4-mini-flash-reasoning is the newly introduced decoder-hybrid-decoder architecture, SambaY, whose central innovation is the Gated Memory Unit (GMU), a simple yet effective mechanism for sharing representations between layers. The architecture includes a self-decoder that combines Mamba (a State Space Model) and Sliding Window Attention (SWA), along with a single layer of full attention. The architecture also involves a cross-decoder that interleaves expensive cross-attention layers with the new, efficient GMUs. This new architecture with GMU modules drastically improves decoding efficiency, boosts long-context retrieval performance and enables the architecture to deliver exceptional performance across a wide range of tasks. "
623
89
T5Gemma
Google DeepMind
https://huggingface.co/google/t5gemma-9b-9b-ul2-it
9100001,112:11.076.755.740.4web-scaleJul/2025🟢https://developers.googleblog.com/en/t5gemma/
Dense
Related paper: https://arxiv.org/abs/2504.06225. Dataset was Gemma 2 9B on 8T tokens + 2T tokens adapted.622
90
MedGemma
Google DeepMind
https://huggingface.co/google/medgemma-27b-it
2714000519:12.087web-scaleJul/2025🟢
https://arxiv.org/abs/2507.05201
Dense
Multimodal model. Text MMLU score for med only=87.0. 621
91
R1T2 ChimeraTNG
https://huggingface.co/tngtech/DeepSeek-TNG-R1T2-Chimera
6851480022:110.6synthetic, web-scaleJul/2025🟢https://arxiv.org/abs/2506.14794MoE
Assembly-of-Experts method combining V3-0324, R1, and R1-0528. Announce: https://x.com/tngtech/status/1940531045432283412?s=46
620
92
Spectra 1.1Consortium3.61200334:10.236.12synthetic, web-scaleJun/2025🟢https://arxiv.org/abs/2506.23025
Dense
"Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights"
619
93
DiffuCoderApple
https://github.com/apple/ml-diffucoder
75630805:10.7code, The StackJun/2025🟢https://arxiv.org/abs/2506.20639
Dense
Diffusion
"We adapt our model from Qwen2.5-Coder (Hui et al., 2024) as the base model to perform continual pre-training using the adaptation approach from Gong et al. (2025). During this pre-training, we use a 400B-token code pre-training corpus from RefineCode (Huang et al., 2024) and Stackv2 (Lozhkov et al., 2024)."
618
94
Hunyuan-A13BTencent
https://huggingface.co/tencent/Hunyuan-A13B-Instruct
80700088:12.588.1767.2371.2synthetic, web-scaleJun/2025🟢
https://huggingface.co/tencent/Hunyuan-A13B-Instruct
MoE
80B-A13B. 'We have open-sourced Hunyuan-A13B-Pretrain , Hunyuan-A13B-Instruct , Hunyuan-A13B-Instruct-FP8 , Hunyuan-A13B-Instruct-GPTQ-Int4 on Hugging Face.'
617
95
MercuryInception
https://chat.inceptionlabs.ai/
90800089:12.869513.4synthetic, web-scaleJun/2025🟢https://www.inceptionlabs.ai/introducing-mercury-our-general-chat-modelDenseDiffusionDiffusion large language model (dLLM).616
96
MuMicrosoft
https://blogs.windows.com/windows-insider/2025/06/13/announcing-windows-11-insider-preview-build-26200-5651-dev-channel/
0.55001,000:10.1synthetic, web-scaleJun/2025🟢https://blogs.windows.com/windowsexperience/2025/06/23/introducing-mu-language-model-and-how-it-enabled-the-agent-in-windows-settings/
Dense
"distillation from Microsoft’s Phi models...Mu is an efficient 330M encoder–decoder language model optimized for small-scale deployment, particularly on the NPUs on Copilot+ PCs. It follows a transformer encoder–decoder architecture"
615
97
Gemini Robotics On-Device
Google DeepMind
https://docs.google.com/forms/u/0/d/1sM5GqcVMWv-KmKY3TOMpVtQ-lDFeAftQ-d9xQn92jCE/viewform?ts=67cef986&edit_requested=true
2010000500:11.5synthetic, web-scaleJun/2025🟢https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/MoE
See Mar/2025 Gemini Robotics-ER model for comparison. Announce: https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/
614
98
ICONN-1ICONNAI
https://huggingface.co/ICONNAI/ICONN-1
8810000114:13.1synthetic, web-scaleJun/2025🟢MoE
"ICONN-1 (this version) is optimized for natural, emotionally resonant, and conversational interactions. ICONN-e1 is a specialized variant of the model fine-tuned for advanced reasoning, critical analysis, and complex problem-solving."
613
99
MiniMax-M1MiniMax
https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
456720016:16.081.1708.4web-scaleJun/2025🟢https://arxiv.org/abs/2506.13585MoEReasoning456B-A45.9B. Announce: https://www.minimax.io/news/minimaxm1
612
100
Magistral MediumMistral
https://chat.mistral.ai/chat
5012000240:12.670.8synthetic, web-scaleJun/2025🟢
https://mistral.ai/static/research/magistral.pdf
Dense
ReasoningMagistral Small=24B. Announce: https://mistral.ai/news/magistral
611