1
(670 models) Permalink:
https://lifearchitect.ai/models-table
Timeline view:
https://lifearchitect.ai/timeline
The Memo:
https://lifearchitect.ai/memo
2
Model | Lab | Playground | Parameters (B) | Tokens trained (B) | Ratio Tokens:Params (Chinchilla scaling ≥20:1) | ALScore | MMLU | MMLU-Pro | GPQA | HLE | Training dataset | Announced | Public? | Paper / Repo | Arch | Tags | Notes | Count (rough)
"ALScore" is a quick and dirty rating of the model's power. The formula is:
ALScore = √(Parameters × Tokens) ÷ 300, with Parameters and Tokens taken from the table in billions.
Any ALScore ≥ 1.0 is a powerful model in mid-2023. (A worked sketch of the ALScore and ratio columns follows below.)
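A minimal sketch of how the two derived columns can be computed, assuming only that Parameters and Tokens are read in billions, as in the columns above:

```python
# Minimal sketch of the table's derived columns. Assumes parameters and tokens
# are given in billions, matching "Parameters (B)" and "Tokens trained (B)".
import math

def alscore(params_b: float, tokens_b: float) -> float:
    """ALScore = sqrt(Parameters x Tokens) / 300."""
    return math.sqrt(params_b * tokens_b) / 300

def tokens_to_params_ratio(params_b: float, tokens_b: float) -> float:
    """Ratio column: tokens per parameter (Chinchilla-style target is >= 20:1)."""
    return tokens_b / params_b

# Example row: GLM-4.6 (355B parameters, 22,000B tokens)
print(round(alscore(355, 22_000), 1))                  # 9.3, as in the table
print(f"{tokens_to_params_ratio(355, 22_000):.0f}:1")  # 62:1, as in the table
```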
3
AuroraGPT (ScienceGPT)
Argonne National Laboratory
https://lifearchitect.ai/auroragpt/ | 2,000 | 30,000 | 15:1 | TBA | 🔴
Three models targeted in Jul/2024: AuroraGPT-7B-P (Ponte Vecchio GPU testing), AuroraGPT-7B-A (Aurora), AuroraGPT-7B-A-S (Aurora + Science).
4
DeepSeek-R2 | DeepSeek-AI | https://www.reuters.com/technology/artificial-intelligence/deepseek-rushes-launch-new-ai-model-china-goes-all-2025-02-25/ | 1,200 | 130,000 | 109:1 | 41.6 | TBA | 🟢 | https://docs.google.com/document/d/e/2PACX-1vTmx-A5sBe_3RsURGM7VvLWsAgUXbcIb2pFaW7f1FTPgK7mGvYENXGQPoF2u4onFndJ_5tzZ02su-vg/pub | MoE | Reasoning, SOTA
Due April 2025. Hybrid MoE, 1.2T-A78B. 5.2PB corpus = 1.3Qa tokens (1.3 quadrillion tokens = 1,300T tokens = 1,300,000B tokens). "Constructed a 5.2 PB high-quality corpus covering vertical domains such as finance, law, and patents." Source: http://jiuyangongshe.com/a/1h4gq724su0, translated at: https://docs.google.com/document/d/e/2PACX-1vTmx-A5sBe_3RsURGM7VvLWsAgUXbcIb2pFaW7f1FTPgK7mGvYENXGQPoF2u4onFndJ_5tzZ02su-vg/pub
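As a rough sanity check on those corpus figures (the ~4 bytes-per-token average below is my assumption, not from the source):

```python
# Rough sanity check: 5.2 PB of text at an assumed ~4 bytes per token
corpus_bytes = 5.2e15       # 5.2 PB
bytes_per_token = 4         # assumption, not from the source
tokens = corpus_bytes / bytes_per_token
print(f"{tokens:.1e}")      # 1.3e+15 tokens, i.e. ~1.3 quadrillion (1,300,000B)
```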
5
ERNIE 5 | Baidu | https://lifearchitect.ai/ernie/ | TBA
6
Gemini Ultra
Google DeepMind
https://www.reddit.com/r/singularity/comments/1kbpdvp/a_string_referencing_gemini_ultra_has_been_added/ | 2,000 | 30,000 | 15:1 | 25.8 | TBA | Due May/2025.
7
GPT-6 | OpenAI | https://lifearchitect.ai/gpt-6/ | TBA | SOTA | Due 2025.
8
Llama 4 Reasoning | Meta AI | https://ai.meta.com/blog/llama-4-multimodal-intelligence/ | TBA | 🟢 | https://ai.meta.com/blog/llama-4-multimodal-intelligence/ | MoE | Reasoning, SOTA | Announced, coming soon.
9
o4 | OpenAI | https://lifearchitect.ai/o4/ | TBA | Reasoning, SOTA | Due 2025.
10
o5 | OpenAI | https://lifearchitect.ai/o5/ | TBA | Reasoning, SOTA | Due 2025. Proto-ASI.
11
GLM-4.6 | Z.AI
https://huggingface.co/zai-org/GLM-4.6
355 | 22,000 | 62:1 | 9.3 | 82.9 | 30.4 | synthetic, web-scale | Sep/2025 | 🟢
https://z.ai/blog/glm-4.6
MoE | Reasoning | 355B-A32B. "context window has been expanded from 128K to 200K tokens" | 670
12
Ring-1T-preview | InclusionAI
https://huggingface.co/inclusionAI/Ring-1T-preview
1,000 | 20,000 | 20:1 | 14.9 | synthetic, web-scale | Sep/2025 | 🟢
https://huggingface.co/inclusionAI/Ring-1T-preview
MoE | Reasoning | 1T-A48.5B. | 669
13
Claude Sonnet 4.5 | Anthropic
https://claude.ai/
400 | 80,000 | 200:1 | 18.9 | 83.4 | synthetic, web-scale | Sep/2025 | 🟢
https://www.anthropic.com/news/claude-sonnet-4-5
MoE | Reasoning, SOTA | The Claude Sonnet 4.5 "system card" is an absolute farce; it won't be linked here. | 668
14
Gemini Robotics 1.5
Google DeepMind
200 | 20,000 | 100:1 | 6.7 | 59.6 | synthetic, web-scale | Sep/2025 | 🟢
https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf
MoE | Reasoning
2. "vision-language-action (VLA) model turns visual information and instructions into motor commands for a robot to perform a task." Available to select partners.
667
15
Gemini Robotics-ER 1.5
Google DeepMind
30 | 30,000 | 1,000:1 | 3.2 | 83.3 | synthetic, web-scale | Sep/2025 | 🟢
https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf
MoE | Reasoning
1. "vision-language model (VLM) reasons about the physical world, natively calls digital tools and creates detailed, multi-step plans to complete a mission." Available to all devs.
666
16
TimesFM-ICF | Google | 0.2 | 100 | 500:1 | 0.0 | special | Sep/2025 | 🔴
https://research.google/blog/time-series-foundation-models-can-be-few-shot-learners/
Dense
TimesFM-ICF is 6.8% more accurate than TimesFM (Base). Time-series forecasting only. 'a large pretraining corpus of 100B real world time-points' may be more than 100B tokens.
665
17
Qwen3-Max | Alibaba
https://chat.qwen.ai/
1,000 | 36,000 | 36:1 | 20.0 | 85.4 | synthetic, web-scale | Sep/2025 | 🟢 | https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2777d&from=research.latest-advancements-list | MoE | Reasoning
"Qwen3-Max-Thinking — still under active training — is already demonstrating remarkable potential. When augmented with tool usage and scaled test-time compute, the Thinking variant has achieved 100% on challenging reasoning benchmarks such as AIME 25 and HMMT. "
664
18
Qwen3-Omni | Alibaba
https://github.com/QwenLM/Qwen3-Omni?tab=readme-ov-file
30 | 17,000 | 567:1 | 2.4 | 88.8 | 73.1 | synthetic, web-scale | Sep/2025 | 🟢 | https://github.com/QwenLM/Qwen3-Omni/blob/main/assets/Qwen3_Omni.pdf | MoE | Reasoning
"Qwen3-Omni is a unified end-to-end model capable of processing multiple modalities, such as text, audio, image and video, and generating real-time text or speech response."... "pretraining utilizes a large-scale dataset containing approximately 2 trillion tokens, with the following distribution across modalities: text (0.57 trillion), audio (0.77 trillion), image (0.82 trillion), video (0.05 trillion), and video-audio (0.05 trillion)."
663
19
DeepSeek-V3.1-Terminus
DeepSeek-AI | https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus | 685 | 15,640 | 22:1 | 10.6 | 85 | 80.7 | 21.7 | synthetic, web-scale | Sep/2025 | 🟢 | https://api-docs.deepseek.com/news/news250922 | MoE | SOTA, Reasoning | Hybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2 | 662
20
Isaac 0.1 | Perceptron | https://huggingface.co/PerceptronAI/Isaac-0.1 | 2 | 2,000 | 1,000:1 | 0.2 | synthetic, web-scale | Sep/2025 | 🟢 | https://www.perceptron.inc/blog/introducing-isaac-0-1
Dense
"perceptive-language model...delivering capabilities that meet or exceed those of models over 50 times its size. Founded by the team behind Meta's Chameleon multimodal models, Perceptron is tackling a fundamental challenge: bringing the power of physical AI to the dynamic, multimodal, and real-time environments we live and work in."
661
21
Grok 4 Fast | xAI
https://grok.com/
200 | 20,000 | 100:1 | 6.7 | 85.7 | 20 | synthetic, web-scale | Sep/2025 | 🟢 | https://x.ai/news/grok-4-fast | MoE | Reasoning, SOTA
"2M token context window, and a unified architecture that blends reasoning and non-reasoning modes in one model."
660
22
VaultGemma
Google DeepMind
https://huggingface.co/google/vaultgemma-1b
1 | 13,000 | 13,000:1 | 0.4 | web-scale | Sep/2025 | 🟢 | https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf
Dense
"Differential Privacy (DP) has emerged as the gold standard, providing a rigorous, mathematical framework to limit the influence of any single example in the training data on the resulting model. A model trained with DP provably bounds the reconstruction or leakage of information tied to individual data points." Announce: https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/659
23
Qwen3-Next-80B-A3B
Alibaba
https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
80 | 15,000 | 188:1 | 3.7 | 84.72 | 66.05 | 43.43 | synthetic, web-scale | Sep/2025 | 🟢 | https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list | MoE | Reasoning
"Qwen3-Next introduces several key improvements: a hybrid attention mechanism, a highly sparse Mixture-of-Experts (MoE) structure, training-stability-friendly optimizations, and a multi-token prediction mechanism for faster inference."
658
24
K2-Think | MBZUAI | https://www.k2think.ai/ | 32 | 18,000 | 563:1 | 2.5 | 71.0 | 89.95 | synthetic, web-scale | Sep/2025 | 🟢 | https://arxiv.org/abs/2509.07604
Dense
Reasoning
"Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets."
657
25
mmBERT | JHU | https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4 | 0.307 | 3,000 | 9,772:1 | 0.1 | synthetic, web-scale | Sep/2025 | 🟢 | https://arxiv.org/abs/2509.06888
Dense
"a modern multilingual encoder trained on 3T tokens and 1833 languages. We introduce several novel elements in training: an inverse masking schedule and a cascading annealed language learning schedule for multilingual data" Announce: https://huggingface.co/blog/mmbert
656
26
ERNIE X1.1 | Baidu | https://ernie.baidu.com/ | synthetic, web-scale | Sep/2025 | 🟢 | https://www.prnewswire.com/news-releases/baidu-unveils-reasoning-model-ernie-x1-1-with-upgrades-in-key-capabilities-302551170.html | MoE | Reasoning | 655
27
ERNIE-4.5-21B-A3B-Thinking
Baidu
https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking
21 | 15,000 | 715:1 | 1.9 | synthetic, web-scale | Sep/2025 | 🟢 | https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking | MoE | Reasoning | 654
28
Klear-46B-A2.5B | Kuaishou
https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct
46 | 22,000 | 479:1 | 3.4 | 80.5 | 57.6 | 35.3 | synthetic, web-scale | Sep/2025 | 🟢 | https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct | MoE | 46B-A2.5B. | 653
29
TildeOpen-30b | Tilde AI
https://huggingface.co/TildeAI/TildeOpen-30b
30 | 2,000 | 67:1 | 0.8 | synthetic, web-scale | Sep/2025 | 🟢 | https://tilde.ai/lv/tildeopen-llm/
Dense
"language data from across Europe" | 652
30
Qwen3-Max-Preview
Alibaba
https://chat.qwen.ai/
1,000 | 36,000 | 36:1 | 20.0 | 64.6 | synthetic, web-scale | Sep/2025 | 🟢 | https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?type=model&url=2840914_2&modelId=qwen3-max-preview | MoE
GPQA score is SuperGPQA. "our biggest model yet, with over 1 trillion parameters"
651
31
Kimi K2-Instruct-0905
Moonshot AI
https://huggingface.co/moonshotai/Kimi-K2-Instruct
1,000 | 15,500 | 16:1 | 13.1 | synthetic, web-scale | Sep/2025 | 🟢 | https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905 | MoE | Reasoning, SOTA | 1T-A32B. 1T parameters and 384 experts. Open source SOTA. | 650
32
Apertus | ETH Zürich
https://huggingface.co/swiss-ai/Apertus-70B-Instruct-2509
70 | 15,000 | 215:1 | 3.4 | 65.2 | 30.6 | synthetic, web-scale | Sep/2025 | 🟢 | https://github.com/swiss-ai/apertus-tech-report/blob/main/Apertus_Tech_Report.pdf
Dense
"Apertus – Latin for “open”" 1,811 languages. Announce: https://ethz.ch/en/news-and-events/eth-news/news/2025/09/press-release-apertus-a-fully-open-transparent-multilingual-language-model.html
649
33
MAI-1-preview | Microsoft
https://microsoft.ai/news/two-new-in-house-models/
500 | 10,000 | 20:1 | 7.5 | synthetic, web-scale | Aug/2025 | 🟢 | https://microsoft.ai/news/two-new-in-house-models/ | MoE
MAI=Microsoft artificial intelligence. "MAI’s first foundation model trained end-to-end... MAI-1-preview is an in-house mixture-of-experts model, pre-trained and post-trained on ~15,000 NVIDIA H100 GPUs. This model is designed to provide powerful capabilities to consumers seeking to benefit from models that specialize in following instructions and providing helpful responses to everyday queries. We will be rolling MAI-1-preview out for certain text use cases within Copilot"
648
34
grok-code-fast-1 | xAI
https://github.com/features/copilot
100 | 10,000 | 100:1 | 3.3 | synthetic, web-scale | Aug/2025 | 🟢
https://data.x.ai/2025-08-26-grok-code-fast-1-model-card.pdf
MoE
"We built grok-code-fast-1 from scratch, starting with a brand-new model architecture. To lay a robust foundation, we carefully assembled a pre-training corpus rich with programming-related content. For post-training, we curated high-quality datasets that reflect real-world pull requests and coding tasks." Announce: https://x.ai/news/grok-code-fast-1
647
35
Hermes 4 | Nous Research
https://huggingface.co/NousResearch/Hermes-4-405B-FP8
405 | 15,656 | 39:1 | 8.4 | 87.2 | 80.5 | 70.5 | synthetic, web-scale | Aug/2025 | 🟢
https://arxiv.org/abs/2508.18255
Dense
Reasoning | Based on Llama 3. Announce: https://hermes4.nousresearch.com/ | 646
36
Jet-Nemotron-4B | NVIDIA
https://github.com/NVlabs/Jet-Nemotron
4 | 400 | 100:1 | 0.1 | 65.2 | 44.2 | synthetic, web-scale | Aug/2025 | 🟢 | https://arxiv.org/abs/2508.15884v1
Dense
Reasoning"pre-training corpus and train Jet-Nemotron models for 50B tokens. This is also the setting in Section 2 where we perform PostNAS. At the second stage, we include more high-quality data from math [65] and coding [66, 67] domains into our data mixture. The models are then trained on 350B tokens."645
37
DeepSeek-V3.1-Base
DeepSeek-AI
https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base
685 | 15,640 | 22:1 | 10.6 | 93.7 | 84.8 | 80.1 | 29.8 | synthetic, web-scale | Aug/2025 | 🟢 | https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base | MoE | SOTA, Reasoning | Hybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2 | 644
38
Nemotron Nano 2 | NVIDIA
https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base
12.3 | 20,000 | 1,625:1 | 1.7 | 78.24 | 63.98 | 64.48 | synthetic, web-scale | Aug/2025 | 🟢 | https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf
Dense
Reasoning | Announce: https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2/ | 643
39
Gemma 3 270M
Google DeepMind
https://huggingface.co/google/gemma-3-270m-it
0.27 | 6,000 | 22,223:1 | 0.1 | web-scale | Aug/2025 | 🟢 | https://developers.googleblog.com/en/introducing-gemma-3-270m/
Dense
This is a record tokens-to-params ratio (for text models) of 22,223:1. | 642
40
GPT-5 | OpenAI
https://poe.com/GPT-5
300 | 114,000 | 380:1 | 19.5 | 91 | 89.4 | 42 | synthetic, web-scale | Aug/2025 | 🟢
https://openai.com/index/gpt-5-system-card/
MoE | SOTA, Reasoning
Announce: https://openai.com/index/introducing-gpt-5/. MMLU is based on ES and PT translated from EN.
641
41
gpt-oss-120b | OpenAI
https://huggingface.co/openai/gpt-oss-120b
120 | 30,000 | 250:1 | 6.3 | 90 | 80.1 | 19 | synthetic, web-scale | Aug/2025 | 🟢
https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
MoE | Reasoning, SOTA | 116.8B total parameters and 5.1B "active" parameters per token per forward pass. https://openai.com/index/introducing-gpt-oss/ | 640
42
gpt-oss-20b | OpenAI
https://huggingface.co/openai/gpt-oss-20b
20 | 13,000 | 650:1 | 1.7 | 85.3 | 71.5 | 17.3 | synthetic, web-scale | Aug/2025 | 🟢
https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
MoE | Reasoning, SOTA | 20.9B total and 3.6B active parameters. https://openai.com/index/introducing-gpt-oss/ | 639
43
Claude Opus 4.1 | Anthropic
https://claude.ai/
2,000 | 100,000 | 50:1 | 47.1 | 80.9 | synthetic, web-scale | Aug/2025 | 🟢
https://www.anthropic.com/news/claude-opus-4-1
MoE | Reasoning, SOTA | 638
44
GLM-4.5 | Z.AI
https://huggingface.co/zai-org/GLM-4.5
355 | 22,000 | 62:1 | 9.3 | 84.6 | 79.1 | 14.4 | synthetic, web-scale | Jul/2025 | 🟢
https://z.ai/blog/glm-4.5
MoE | Reasoning | 355B-A32B. | 637
45
T1
China Telecom Artificial Intelligence Research Institute
https://github.com/Tele-AI/T1
115 | 10,000 | 87:1 | 3.6 | web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2507.18013
Dense
Reasoning | 636
46
Intern-S1
Shanghai AI Laboratory/SenseTime
https://huggingface.co/internlm/Intern-S1
235 | 41,000 | 175:1 | 10.3 | 83.5 | 77.3 | synthetic, web-scale | Jul/2025 | 🟢
https://huggingface.co/internlm/Intern-S1
MoE | Reasoning, SOTA
41T tokens assumes base model of Qwen3. "Built upon a 235B MoE language model and a 6B Vision encoder, Intern-S1 has been further pretrained on 5 trillion tokens of multimodal data"
635
47
Step 3 | StepFun
https://www.stepfun.com/
321 | 18,000 | 57:1 | 8.0 | 72.9 | web-scale | Jul/2025 | 🟢
https://github.com/stepfun-ai/Step3/blob/main/Step3-Sys-Tech-Report.pdf
MoE | 321B-A38B. https://x.com/CyouSakura/status/1948767450751009227 | 634
48
Qwen3-235B-A22B-Thinking-2507
Alibaba
https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
235 | 36,000 | 154:1 | 9.7 | 93.8 | 84.4 | 81.1 | synthetic, web-scale | Jul/2025 | 🟢
https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
MoE | Reasoning
235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux.
633
49
KAT-V1-200B | Kuaishou | 200 | 10,000 | 50:1 | 4.7 | 82.3 | 78.2 | synthetic, web-scale | Jul/2025 | 🔴 | https://arxiv.org/abs/2507.08297 | MoE | Reasoning
200B-A40B. In training as of Jul/2025. "to address the overthinking problem in reasoning-intensive tasks"
632
50
KAT-V1-40B | Kuaishou
https://huggingface.co/Kwaipilot/KAT-V1-40B
40 | 10,000 | 250:1 | 2.1 | 77.8 | 75.1 | synthetic, web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2507.08297
Dense
Reasoning"to address the overthinking problem in reasoning-intensive tasks"631
51
Qwen3-Coder-480B-A35B-Instruct
Alibaba
https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
480 | 36,000 | 75:1 | 13.9 | synthetic, web-scale | Jul/2025 | 🟢 | https://qwenlm.github.io/blog/qwen3-coder/ | MoE | 480B-A35B. | 630
52
Qwen3-235B-A22B-Instruct-2507
Alibaba
https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
235 | 36,000 | 154:1 | 9.7 | 93.1 | 83 | 77.5 | synthetic, web-scale | Jul/2025 | 🟢 | https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507 | MoE | SOTA
235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux.
629
53
FlexOlmo | Allen AI
https://huggingface.co/allenai/FlexOlmo-7x7B-1T
37 | 4,150 | 113:1 | 1.3 | 60.4 | 30.9 | synthetic, web-scale | Jul/2025 | 🟢
https://arxiv.org/abs/2507.07024v1
MoE
37B-A20B. "We adopt the OLMo-2 7B setup, starting from a a checkpoint pre-trained on 4T tokens and annealed for 50B tokens to produce a public expert. We then train two additional experts on math and code, each for 50B tokens, and combine them with the public expert to form a three-expert version of FLEXOLMO."
628
54
EXAONE 4.0 | LG
https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B
32 | 14,000 | 438:1 | 2.2 | 92.3 | 81.8 | 75.4 | web-scale | Jul/2025 | 🟢
https://www.lgresearch.ai/data/cdn/upload/EXAONE_4_0.pdf
Dense
Reasoning
“EXAONE”=“EXpert AI for EveryONE”. Training tokens/ratio: EXAONE-3 7.8B=8T tokens (Aug/2024) -> EXAONE-3.5 7.8B=9T -> EXAONE-3.5 32B=6.5T tokens -> EXAONE 4.0 32B=14T tokens. MMLU score is MMLU-Redux. Interesting: "To focus [RL] training on more informative data samples, we perform accuracy-based filtering by generating eight responses from the SFT model and excluding samples where all eight responses are correct, a pre-filtering step that removes problems that are easy for the model to avoid inefficient training."
627
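The accuracy-based filtering described in the EXAONE 4.0 note above is straightforward to sketch. In the snippet below, generate_response and is_correct are hypothetical stand-ins for the SFT model and the answer checker; this illustrates the idea, not LG's actual pipeline:

```python
# Minimal sketch of accuracy-based pre-filtering: sample n responses per problem
# and drop problems the SFT model already answers correctly every time.
from typing import Callable, List

def filter_rl_samples(
    problems: List[dict],
    generate_response: Callable[[dict], str],   # hypothetical SFT-model sampler
    is_correct: Callable[[dict, str], bool],    # hypothetical answer checker
    n_samples: int = 8,
) -> List[dict]:
    kept = []
    for problem in problems:
        responses = [generate_response(problem) for _ in range(n_samples)]
        if not all(is_correct(problem, r) for r in responses):
            kept.append(problem)                # keep only non-trivial problems
    return kept
```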
55
Kimi K2 | Moonshot AI
https://huggingface.co/moonshotai/Kimi-K2-Instruct
1,000 | 15,500 | 16:1 | 13.1 | 89.5 | 81.1 | 75.1 | 4.7 | synthetic, web-scale | Jul/2025 | 🟢 | https://moonshotai.github.io/Kimi-K2/ | MoE | Reasoning, SOTA | 1T-A32B. 1T parameters and 384 experts. Open source SOTA. | 626
56
Reka Flash 3.1 | Reka AI
https://huggingface.co/RekaAI/reka-flash-3.1
21 | 5,000 | 239:1 | 1.1 | web-scale | Jul/2025 | 🟢
https://www.reka.ai/news/introducing-reka-flash
Dense
Reasoning | 625
57
Devstral Medium | Mistral
https://chat.mistral.ai/chat
50 | 12,000 | 240:1 | 2.6 | synthetic, web-scale | Jul/2025 | 🟢
https://mistral.ai/news/devstral-2507
Dense
Non-reasoning. | 624
58
Grok 4 | xAI
https://grok.com/
600 | 80,000 | 134:1 | 23.1 | 88.9 | 44.4 | synthetic, web-scale | Jul/2025 | 🟢 | https://lifearchitect.ai/grok/ | MoE | Reasoning, SOTA
"The smartest AI in the world, 100% on SAT, etc, questions that it's never seen before."
623
59
Phi-4-mini-flash-reasoning
Microsoft
https://huggingface.co/microsoft/Phi-4-mini-flash-reasoning
3.8 | 5,150 | 1,356:1 | 0.5 | synthetic, web-scale | Jul/2025 | 🟢 | https://azure.microsoft.com/en-us/blog/reasoning-reimagined-introducing-phi-4-mini-flash-reasoning/
Dense
"Pre-training: 5T tokens; Reasoning training: 150B tokens" "At the core of Phi-4-mini-flash-reasoning is the newly introduced decoder-hybrid-decoder architecture, SambaY, whose central innovation is the Gated Memory Unit (GMU), a simple yet effective mechanism for sharing representations between layers. The architecture includes a self-decoder that combines Mamba (a State Space Model) and Sliding Window Attention (SWA), along with a single layer of full attention. The architecture also involves a cross-decoder that interleaves expensive cross-attention layers with the new, efficient GMUs. This new architecture with GMU modules drastically improves decoding efficiency, boosts long-context retrieval performance and enables the architecture to deliver exceptional performance across a wide range of tasks. "
622
60
T5Gemma
Google DeepMind
https://huggingface.co/google/t5gemma-9b-9b-ul2-it
9 | 10,000 | 1,112:1 | 1.0 | 76.7 | 55.7 | 40.4 | web-scale | Jul/2025 | 🟢 | https://developers.googleblog.com/en/t5gemma/
Dense
Related paper: https://arxiv.org/abs/2504.06225. Dataset was Gemma 2 9B on 8T tokens + 2T tokens adapted. | 621
61
MedGemma
Google DeepMind
https://huggingface.co/google/medgemma-27b-it
27 | 14,000 | 519:1 | 2.0 | 87 | web-scale | Jul/2025 | 🟢
https://arxiv.org/abs/2507.05201
Dense
Multimodal model. Text MMLU score for med only=87.0. | 620
62
R1T2 Chimera | TNG
https://huggingface.co/tngtech/DeepSeek-TNG-R1T2-Chimera
685 | 14,800 | 22:1 | 10.6 | synthetic, web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2506.14794 | MoE
Assembly-of-Experts method combining V3-0324, R1, and R1-0528. Announce: https://x.com/tngtech/status/1940531045432283412?s=46
619
63
Spectra 1.1 | Consortium | 3.6 | 1,200 | 334:1 | 0.2 | 36.12 | synthetic, web-scale | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.23025
Dense
"Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights"
618
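The quoted 1.6-bit figure for Spectra 1.1 is consistent with packing five ternary weights into one byte (3^5 = 243 ≤ 256, i.e. 8/5 = 1.6 bits per weight). The sketch below shows that base-3 packing as one plausible scheme; the paper's actual packing format may differ.

```python
# Illustrative base-3 packing of ternary weights {-1, 0, +1}: five weights per
# byte gives exactly 8/5 = 1.6 bits per weight. A plausible scheme for the
# quoted "1.6-bit packing", not necessarily the one used in the paper.
from typing import List

def pack_ternary(weights: List[int]) -> bytes:
    """Pack ternary weights (values in {-1, 0, 1}) five-per-byte in base 3."""
    out = bytearray()
    for i in range(0, len(weights), 5):
        value = 0
        for w in reversed(weights[i:i + 5]):   # little-endian base-3 digits
            value = value * 3 + (w + 1)        # map {-1,0,1} -> {0,1,2}
        out.append(value)
    return bytes(out)

def unpack_ternary(packed: bytes, n: int) -> List[int]:
    """Inverse of pack_ternary; n is the original number of weights."""
    weights = []
    for byte in packed:
        for _ in range(5):
            weights.append(byte % 3 - 1)
            byte //= 3
    return weights[:n]

w = [1, -1, 0, 0, 1, -1, 1]
assert unpack_ternary(pack_ternary(w), len(w)) == w
```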
64
DiffuCoder | Apple
https://github.com/apple/ml-diffucoder
7 | 5,630 | 805:1 | 0.7 | code, The Stack | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.20639
Dense
Diffusion
"We adapt our model from Qwen2.5-Coder (Hui et al., 2024) as the base model to perform continual pre-training using the adaptation approach from Gong et al. (2025). During this pre-training, we use a 400B-token code pre-training corpus from RefineCode (Huang et al., 2024) and Stackv2 (Lozhkov et al., 2024)."
617
65
Hunyuan-A13B | Tencent
https://huggingface.co/tencent/Hunyuan-A13B-Instruct
80 | 7,000 | 88:1 | 2.5 | 88.17 | 67.23 | 71.2 | synthetic, web-scale | Jun/2025 | 🟢
https://huggingface.co/tencent/Hunyuan-A13B-Instruct
MoE
80B-A13B. 'We have open-sourced Hunyuan-A13B-Pretrain , Hunyuan-A13B-Instruct , Hunyuan-A13B-Instruct-FP8 , Hunyuan-A13B-Instruct-GPTQ-Int4 on Hugging Face.'
616
66
Mercury | Inception
https://chat.inceptionlabs.ai/
90 | 8,000 | 89:1 | 2.8 | 69513.4 | synthetic, web-scale | Jun/2025 | 🟢 | https://www.inceptionlabs.ai/introducing-mercury-our-general-chat-model | Dense | Diffusion | Diffusion large language model (dLLM). | 615
67
Mu | Microsoft
https://blogs.windows.com/windows-insider/2025/06/13/announcing-windows-11-insider-preview-build-26200-5651-dev-channel/
0.5 | 500 | 1,000:1 | 0.1 | synthetic, web-scale | Jun/2025 | 🟢 | https://blogs.windows.com/windowsexperience/2025/06/23/introducing-mu-language-model-and-how-it-enabled-the-agent-in-windows-settings/
Dense
"distillation from Microsoft’s Phi models...Mu is an efficient 330M encoder–decoder language model optimized for small-scale deployment, particularly on the NPUs on Copilot+ PCs. It follows a transformer encoder–decoder architecture"
614
68
Gemini Robotics On-Device
Google DeepMind
https://docs.google.com/forms/u/0/d/1sM5GqcVMWv-KmKY3TOMpVtQ-lDFeAftQ-d9xQn92jCE/viewform?ts=67cef986&edit_requested=true
20 | 10,000 | 500:1 | 1.5 | synthetic, web-scale | Jun/2025 | 🟢 | https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/ | MoE
See Mar/2025 Gemini Robotics-ER model for comparison. Announce: https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/
613
69
ICONN-1 | ICONNAI
https://huggingface.co/ICONNAI/ICONN-1
88 | 10,000 | 114:1 | 3.1 | synthetic, web-scale | Jun/2025 | 🟢 | MoE
"ICONN-1 (this version) is optimized for natural, emotionally resonant, and conversational interactions. ICONN-e1 is a specialized variant of the model fine-tuned for advanced reasoning, critical analysis, and complex problem-solving."
612
70
MiniMax-M1 | MiniMax
https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
456 | 7,200 | 16:1 | 6.0 | 81.1 | 70 | 8.4 | web-scale | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.13585 | MoE | Reasoning | 456B-A45.9B. Announce: https://www.minimax.io/news/minimaxm1 | 611
71
Magistral Medium | Mistral
https://chat.mistral.ai/chat
50 | 12,000 | 240:1 | 2.6 | 70.8 | synthetic, web-scale | Jun/2025 | 🟢
https://mistral.ai/static/research/magistral.pdf
Dense
Reasoning | Magistral Small=24B. Announce: https://mistral.ai/news/magistral | 610
72
Comma v0.1-2T | EleutherAI
https://huggingface.co/common-pile/comma-v0.1-2t
7 | 2,000 | 286:1 | 0.4 | 49.8 | web-scale | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.05209
Dense
"Comma v0.1-2T is a decoder-only transformer that uses the same architecture as Llama 3. Training was done in two stages: first on 1.93 trillion tokens with a cosine learning rate schedule, and second a "cool-down" training phase on 75.5 billion tokens from high-quality sources. The final model is the average of 10 checkpoints during this cool-down phase. Both training phases use a batch size of 8.3 million tokens per step. Training was performed using lingua on 512 AMD MI300A GPUs."
609
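Checkpoint averaging, as used for the final Comma v0.1-2T model, can be sketched in a few lines. The file paths and checkpoint format below are assumptions for illustration; Comma's actual training used lingua, not this script.

```python
# Minimal sketch of checkpoint averaging: the final model is the element-wise
# mean of several saved checkpoints of the same architecture.
import torch

def average_checkpoints(paths):
    """Average the parameter tensors of several checkpoints of the same model."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# Hypothetical usage with 10 cool-down checkpoints:
# merged = average_checkpoints([f"checkpoint_{i}.pt" for i in range(10)])
# torch.save(merged, "averaged_model.pt")
```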
73
dots.llm1
Xiaohongshu/RedNote
https://huggingface.co/rednote-hilab/dots.llm1.base
142 | 11,200 | 79:1 | 4.2 | 83.2 | 61.9 | 52.6 | web-scale | Jun/2025 | 🟢
https://github.com/rednote-hilab/dots.llm1/blob/main/dots1_tech_report.pdf
MoE
142B-A14B. "dots.llm1, a large-scale MoE model that activates 14 billion parameters out of a total of 142 billion parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models."
608
74
Gemini 2.5 Pro 06-05
Google DeepMind
https://deepmind.google/models/gemini-diffusion/
400 | 80,000 | 200:1 | 18.9 | 86.4 | 21.6 | synthetic, web-scale | Jun/2025 | 🟢
https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf
Dense
Reasoning, SOTA
"an upgraded preview of Gemini 2.5 Pro, our most intelligent model yet. Building on the version we released in May and showed at I/O, this model will be the generally available, stable version starting in a couple of weeks, ready for enterprise-scale applications."
607
75
MiMo-7B-RL-0530 | Xiaomi
https://huggingface.co/XiaomiMiMo/MiMo-7B-RL-0530
7 | 25,000 | 3,572:1 | 1.4 | 58.6 | 60.6 | synthetic, web-scale | May/2025 | 🟢 | https://arxiv.org/abs/2505.07608
Dense
Reasoning
"[2025.05.30] During the RL training, by continuously expanding the training window size (from 32K to 48K), the performance of MiMo-7B-RL-0530 on AIME24 can be continuously improved and eventually surpass that of DeepSeek R1... MiMo-7B-Base is pre-trained on approximately 25 trillion tokens."
606
76
DeepTransformers
Google DeepMind
1.3 | 100 | 77:1 | 0.0 | synthetic, web-scale | May/2025 | 🔴
https://arxiv.org/abs/2505.23735
Dense
"Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture."
605
77
Atlas
Google DeepMind
1.3 | 100 | 77:1 | 0.0 | synthetic, web-scale | May/2025 | 🔴
https://arxiv.org/abs/2505.23735
Dense
"Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture."
604
78
DeepSeek-R1-0528
DeepSeek-AI
https://chat.deepseek.com/
685 | 14,800 | 22:1 | 10.6 | 93.4 | 85 | 81 | 17.7 | synthetic, web-scale | May/2025 | 🟢
https://huggingface.co/deepseek-ai/DeepSeek-R1-0528
MoEReasoning, SOTA
Censorship increased significantly. "overall performance is now approaching that of leading models, such as o3 and Gemini 2.5 Pro." MMLU shows MMLU-Redux score with lower error rate.
603
79
Fathom-R1-14B | Fractal Analytics
https://huggingface.co/FractalAIResearch/Fathom-R1-14B
14 | 18,000 | 1,286:1 | 1.7 | 66.16 | synthetic, web-scale | May/2025 | 🟢 | https://huggingface.co/FractalAIResearch/Fathom-R1-14B
Dense
Reasoning | Base R1-distilled-14B model, based on Qwen 14B. Media release. | 602
80
QwenLong-L1-32B | Alibaba
https://huggingface.co/Tongyi-Zhiwen/QwenLong-L1-32B
32 | 18,000 | 563:1 | 2.5 | synthetic, web-scale | May/2025 | 🟢 | https://arxiv.org/abs/2505.17667
Dense
Reasoning
"the first long-context LRM trained with reinforcement learniing for long-context reasoning."
601
81
Claude Opus 4 | Anthropic
https://claude.ai/
6,000 | 100,000 | 17:1 | 81.6 | 83.3 | synthetic, web-scale | May/2025 | 🟢 | https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf
Dense
Reasoning, SOTA
"Claude Opus 4 is our most intelligent model to date, pushing the frontier in coding, agentic search, and creative writing. With advanced reasoning and powerful collaboration capabilities…Both models can also alternate between reasoning and tool use—like web search—to improve responses…Claude Opus 4 can work continuously for hours on complex, long-running tasks"
600
82
Falcon-H1 | TII
https://huggingface.co/tiiuae/Falcon-H1-34B-Instruct-GGUF
34 | 18,000 | 530:1 | 2.6 | 84.05 | 58.73 | 49.66 | synthetic, web-scale | May/2025 | 🟢
https://huggingface.co/papers/2507.22448
Dense
"hybrid architecture that combines the strengths of the classical Transformer-based attention mechanism with the State Space Model (SSM), known for its superior long-context memory and computational efficiency."599
83
Gemini Diffusion
Google DeepMind
https://deepmind.google/models/gemini-diffusion/
40 | 16,000 | 400:1 | 2.7 | 40.4 | synthetic, web-scale | May/2025 | 🟢
https://deepmind.google/models/gemini-diffusion/
Dense
Diffusion
"Gemini Diffusion’s external benchmark performance is comparable to much larger models [like Gemini-2.0-Flash-Lite], whilst also being faster."
598
84
Gemma 3n
Google DeepMind
https://ai.google.dev/gemma/docs/gemma-3n
4 | 8,000 | 2,000:1 | 0.6 | 62.1 | synthetic, web-scale | May/2025 | 🟢
https://developers.googleblog.com/en/introducing-gemma-3n/
MatFormer
Matryoshka Transformer or MatFormer model architecture. 850M (696M / 620M / 582M).
597
85
ParScale | Alibaba
https://huggingface.co/ParScale/ParScale-4.7B-P8-Python
4.7 | 1,000 | 213:1 | 0.2 | 35.1 | synthetic, web-scale | May/2025 | 🟢 | https://arxiv.org/abs/2505.10475
Dense
"We introduce the third scaling paradigm for scaling LLMs: leverages parallel computation during both training and inference time (Parallel Scaling, or ParScale)... ParScale can use up to 22× less memory increase and 6× less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget." MMLU shows for 1.8B models, not the 4.7B models.
596
86
codex-1 | OpenAI | https://chatgpt.com/codex | 600 | 100,000 | 167:1 | 25.8 | synthetic, web-scale | May/2025 | 🟢 | https://openai.com/index/introducing-codex/ | MoE | Reasoning, SOTA
o3 base. "codex-1, a version of OpenAI o3 optimized for software engineering. It was trained using reinforcement learning on real-world coding tasks in a variety of environments to generate code that closely mirrors human style and PR preferences, adheres precisely to instructions, and can iteratively run tests until it receives a passing result."
595
87
Falcon-Edge | TII
https://huggingface.co/tiiuae/Falcon-E-3B-Instruct
3 | 1,500 | 500:1 | 0.2 | 55.7 | 27.16 | 23.59 | synthetic, web-scale | May/2025 | 🟢
https://huggingface.co/blog/tiiuae/falcon-edge
Dense
"Falcon-Edge series - a collection of powerful, universal, and fine-tunable language models available in ternary format, based on the BitNet architecture."594
88
SWE-1 | Windsurf
https://windsurf.com/blog/windsurf-wave-9-swe-1
50 | 8,000 | 160:1 | 2.1 | synthetic, web-scale | May/2025 | 🟢
https://windsurf.com/blog/windsurf-wave-9-swe-1
Dense"SWE-1, optimized for the entire software engineering process, not just the task of coding."593
89
INTELLECT-2 | Prime Intellect
https://chat.primeintellect.ai/
32 | 18,000 | 563:1 | 2.5 | 66.8 | web-scale | May/2025 | 🟢
https://storage.googleapis.com/public-technical-paper/INTELLECT_2_Technical_Report.pdf
Dense
Reasoning | QwQ-32B base. Announce: https://www.primeintellect.ai/blog/intellect-2-release Finished training 30/Apr/2025: https://app.primeintellect.ai/intelligence/intellect-2 | 592
90
Pangu Ultra MoE | Huawei
https://github.com/pangu-tech/pangu-ultra
718 | 13,000 | 19:1 | 10.2 | 91.5 | 83.5 | 75.3 | synthetic, web-scale | May/2025 | 🔴
https://arxiv.org/abs/2505.04519
MoEReasoning
718B-A39B. Trained on 6,000 Ascend NPUs (Kunpeng 920 processors in Huawei Atlas 800T A2 servers).
591
91
Mistral Medium 3 | Mistral
https://chat.mistral.ai/chat
50 | 12,000 | 240:1 | 2.6 | 77.2 | 57.1 | synthetic, web-scale | May/2025 | 🟢
https://mistral.ai/news/mistral-medium-3
Dense
Multimodal. 50B param estimate based on "Mistral Medium 3 can also be deployed on any cloud, including self-hosted environments of four GPUs and above." Note: "With the launches of Mistral Small in March and Mistral Medium today, it's no secret that we're working on something 'large' over the next few weeks. With even our medium-sized model being resoundingly better than flagship open source models such as Llama 4 Maverick, we're excited to 'open' up what's to come :) " | 590
92
Granite-4.0-Tiny-Preview
IBM
https://huggingface.co/ibm-granite/granite-4.0-tiny-preview
7 | 2,500 | 358:1 | 0.4 | 60.4 | synthetic, web-scale | May/2025 | 🟢
https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek
MoEReasoning
"the model is only partially trained—it has only seen 2.5T of a planned 15T or more training tokens...Granite 4.0 Tiny-Preview, specifically, is a fine-grained hybrid mixture of experts (MoE) model, with 7B total parameters and only 1B active parameters at inference time... Like its predecessors in Granite 3.2 and Granite 3.3, Granite 4.0 Tiny Preview offers toggleable thinking on and thinking off functionality (though its reasoning-focused post-training is very much incomplete)."
589
93
Phi-4-reasoning-plus
Microsoft
https://huggingface.co/microsoft/Phi-4-reasoning-plus
14 | 10,016 | 716:1 | 1.2 | 76 | 69.3 | synthetic, web-scale | Apr/2025 | 🟢 | https://arxiv.org/abs/2504.21318
Dense
"Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model finetuned from Phi-4 using supervised fine-tuning on a dataset of chain-of-thought traces and reinforcement learning."
588
94
Bamba-9B-v2 | IBM
https://huggingface.co/ibm-ai-platform/Bamba-9B-v2
9 | 3,000 | 334:1 | 0.5 | 67.92 | 25.41 | 5.93 | synthetic, web-scale | Apr/2025 | 🟢
https://huggingface.co/blog/ibm-ai-platform/bamba-9b-v2
Dense
"During Christmas of 2024, IBM, Princeton, CMU, and UIUC released, Bamba v1, a performant Mamba2 based pretrained model with full data lineage trained to 2T tokens. Since then, we have been busy cooking an update with new datasets. Today, we are excited to release Bamba v2, trained for an additional 1T tokens that significantly improves on Bamba v1. The L1 and L2 leaderboard scores outperform Llama 3.1 8B, which was trained with nearly 5x the amount of data. All of this with the inference speedup that we get from Mamba2 based architecture, which with the latest vLLM is 2-2.5x faster than similar sized transformer models."
587
95
Qwen3-235B-A22B | Alibaba
https://huggingface.co/Qwen/Qwen3-235B-A22B
235 | 36,000 | 154:1 | 9.7 | 87.81 | 68.18 | 47.47 | synthetic, web-scale | Apr/2025 | 🟢 | https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf | MoE | Reasoning
Qwen3-235B-A22B. Qwen3-30B-A3B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages"
586
96
ERNIE X1 Turbo | Baidu | https://huggingface.co/spaces/PaddlePaddle/ernie_x1_turbo_demo | 69 | synthetic, web-scale | Apr/2025 | 🟢 | https://www.prnewswire.com/news-releases/baidu-launches-ernie-4-5-turbo-ernie-x1-turbo-and-new-suite-of-ai-tools-to-empower-developers-and-supercharge-ai-innovation-302438584.html | MoE | Reasoning | Announce: https://x.com/Baidu_Inc/status/1915603080336597310 | 585
97
ERNIE 4.5 Turbo | Baidu | https://huggingface.co/spaces/PaddlePaddle/ernie_4.5_turbo_demo | 90 | synthetic, web-scale | Apr/2025 | 🟢 | https://www.prnewswire.com/news-releases/baidu-launches-ernie-4-5-turbo-ernie-x1-turbo-and-new-suite-of-ai-tools-to-empower-developers-and-supercharge-ai-innovation-302438584.html | MoE | Announce: https://x.com/Baidu_Inc/status/1915603080336597310 | 584
98
MAI-DS-R1 | Microsoft
https://huggingface.co/microsoft/MAI-DS-R1
685 | 14,800 | 22:1 | 10.6 | 86.8 | synthetic, web-scale | Apr/2025 | 🟢
https://techcommunity.microsoft.com/blog/machinelearningblog/introducing-mai-ds-r1/4405076
MoE | Reasoning
DeepSeek-R1 base. "MAI-DS-R1, a new open weights DeepSeek R1 model variant... post-trained by the Microsoft AI team to improve its responsiveness on blocked topics and its risk profile, while maintaining its reasoning capabilities and competitive performance."
583
99
Gemini 2.5 Flash Preview
Google DeepMind
https://aistudio.google.com/prompts/new_chat?model=gemini-2.5-flash-preview-04-17
80 | 20,000 | 250:1 | 4.2 | 78.3 | 12.1 | synthetic, web-scale | Apr/2025 | 🟢
https://deepmind.google/technologies/gemini/flash/
MoE | Reasoning
Context in=1M, out=64k. Knowledge cutoff Jan/2025. Codename 'nebula'. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/
582
100
o4-mini | OpenAI | https://chatgpt.com/?model=o4-mini-high | 200 | 40,000 | 200:1 | 9.4 | 88 | 81.4 | 14.28 | synthetic, web-scale | Apr/2025 | 🟢 | https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf | MoE | Reasoning, SOTA
https://openai.com/index/introducing-o3-and-o4-mini/ MMLU shown is a translated LOTE (language other than English) score.
581