1
(670 models) Permalink:
https://lifearchitect.ai/models-table
Timeline view:
https://lifearchitect.ai/timeline
The Memo:
https://lifearchitect.ai/memo
2
Model | Lab | Playground | Parameters (B) | Tokens trained (B) | Ratio Tokens:Params (Chinchilla scaling ≥20:1) | ALScore | MMLU | MMLU-Pro | GPQA | HLE | Training dataset | Announced | Public? | Paper / Repo | Arch | Tags | Notes | Count (rough)
"ALScore" is a quick and dirty rating of the model's power. The formula is:
ALScore = √(Parameters × Tokens) ÷ 300, with Parameters and Tokens taken from the table in billions.
Any ALScore ≥ 1.0 is a powerful model in mid-2023. (A worked sketch of the ALScore and ratio columns follows below.)
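A minimal sketch of how the two derived columns can be computed, assuming only that Parameters and Tokens are read in billions, as in the columns above:

```python
# Minimal sketch of the table's derived columns. Assumes parameters and tokens
# are given in billions, matching "Parameters (B)" and "Tokens trained (B)".
import math

def alscore(params_b: float, tokens_b: float) -> float:
    """ALScore = sqrt(Parameters x Tokens) / 300."""
    return math.sqrt(params_b * tokens_b) / 300

def tokens_to_params_ratio(params_b: float, tokens_b: float) -> float:
    """Ratio column: tokens per parameter (Chinchilla-style target is >= 20:1)."""
    return tokens_b / params_b

# Example row: GLM-4.6 (355B parameters, 22,000B tokens)
print(round(alscore(355, 22_000), 1))                  # 9.3, as in the table
print(f"{tokens_to_params_ratio(355, 22_000):.0f}:1")  # 62:1, as in the table
```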
3
AuroraGPT (ScienceGPT)
Argonne National Laboratory
https://lifearchitect.ai/auroragpt/ | 2,000 | 30,000 | 15:1 | TBA | 🔴
Three models targeted in Jul/2024: AuroraGPT-7B-P (Ponte Vecchio GPU testing), AuroraGPT-7B-A (Aurora), AuroraGPT-7B-A-S (Aurora + Science).
4
DeepSeek-R2 | DeepSeek-AI | https://www.reuters.com/technology/artificial-intelligence/deepseek-rushes-launch-new-ai-model-china-goes-all-2025-02-25/ | 1,200 | 130,000 | 109:1 | 41.6 | TBA | 🟢 | https://docs.google.com/document/d/e/2PACX-1vTmx-A5sBe_3RsURGM7VvLWsAgUXbcIb2pFaW7f1FTPgK7mGvYENXGQPoF2u4onFndJ_5tzZ02su-vg/pub | MoE | Reasoning, SOTA
Due April 2025. Hybrid MoE, 1.2T-A78B. 5.2PB corpus = 1.3Qa tokens (1.3 quadrillion tokens = 1,300T tokens = 1,300,000B tokens). "Constructed a 5.2 PB high-quality corpus covering vertical domains such as finance, law, and patents." Source: http://jiuyangongshe.com/a/1h4gq724su0, translated at: https://docs.google.com/document/d/e/2PACX-1vTmx-A5sBe_3RsURGM7VvLWsAgUXbcIb2pFaW7f1FTPgK7mGvYENXGQPoF2u4onFndJ_5tzZ02su-vg/pub
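As a rough sanity check on those corpus figures (the ~4 bytes-per-token average below is my assumption, not from the source):

```python
# Rough sanity check: 5.2 PB of text at an assumed ~4 bytes per token
corpus_bytes = 5.2e15       # 5.2 PB
bytes_per_token = 4         # assumption, not from the source
tokens = corpus_bytes / bytes_per_token
print(f"{tokens:.1e}")      # 1.3e+15 tokens, i.e. ~1.3 quadrillion (1,300,000B)
```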
5
ERNIE 5 | Baidu | https://lifearchitect.ai/ernie/ | TBA
6
Gemini Ultra
Google DeepMind
https://www.reddit.com/r/singularity/comments/1kbpdvp/a_string_referencing_gemini_ultra_has_been_added/ | 2,000 | 30,000 | 15:1 | 25.8 | TBA | Due May/2025.
7
GPT-6 | OpenAI | https://lifearchitect.ai/gpt-6/ | TBA | SOTA | Due 2025.
8
Llama 4 Reasoning | Meta AI | https://ai.meta.com/blog/llama-4-multimodal-intelligence/ | TBA | 🟢 | https://ai.meta.com/blog/llama-4-multimodal-intelligence/ | MoE | Reasoning, SOTA | Announced, coming soon.
9
o4 | OpenAI | https://lifearchitect.ai/o4/ | TBA | Reasoning, SOTA | Due 2025.
10
o5 | OpenAI | https://lifearchitect.ai/o5/ | TBA | Reasoning, SOTA | Due 2025. Proto-ASI.
11
GLM-4.6 | Z.AI
https://huggingface.co/zai-org/GLM-4.6
355 | 22,000 | 62:1 | 9.3 | 82.9 | 30.4 | synthetic, web-scale | Sep/2025 | 🟢
https://z.ai/blog/glm-4.6
MoE | Reasoning | 355B-A32B. "context window has been expanded from 128K to 200K tokens" | 670
12
Ring-1T-preview | InclusionAI
https://huggingface.co/inclusionAI/Ring-1T-preview
1,000 | 20,000 | 20:1 | 14.9 | synthetic, web-scale | Sep/2025 | 🟢
https://huggingface.co/inclusionAI/Ring-1T-preview
MoE | Reasoning | 1T-A48.5B. | 669
13
Claude Sonnet 4.5 | Anthropic
https://claude.ai/
400 | 80,000 | 200:1 | 18.9 | 83.4 | synthetic, web-scale | Sep/2025 | 🟢
https://www.anthropic.com/news/claude-sonnet-4-5
MoE | Reasoning, SOTA | The Claude Sonnet 4.5 "system card" is an absolute farce; it won't be linked here. | 668
14
Gemini Robotics 1.5
Google DeepMind
200 | 20,000 | 100:1 | 6.7 | 59.6 | synthetic, web-scale | Sep/2025 | 🟢
https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf
MoE | Reasoning
2. "vision-language-action (VLA) model turns visual information and instructions into motor commands for a robot to perform a task." Available to select partners.
667
15
Gemini Robotics-ER 1.5
Google DeepMind
30 | 30,000 | 1,000:1 | 3.2 | 83.3 | synthetic, web-scale | Sep/2025 | 🟢
https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf
MoE | Reasoning
1. "vision-language model (VLM) reasons about the physical world, natively calls digital tools and creates detailed, multi-step plans to complete a mission." Available to all devs.
666
16
TimesFM-ICF | Google | 0.2 | 100 | 500:1 | 0.0 | special | Sep/2025 | 🔴
https://research.google/blog/time-series-foundation-models-can-be-few-shot-learners/
Dense
TimesFM-ICF is 6.8% more accurate than TimesFM (Base). Time-series forecasting only. 'a large pretraining corpus of 100B real world time-points' may be more than 100B tokens.
665
17
Qwen3-Max | Alibaba
https://chat.qwen.ai/
1,000 | 36,000 | 36:1 | 20.0 | 85.4 | synthetic, web-scale | Sep/2025 | 🟢 | https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2777d&from=research.latest-advancements-list | MoE | Reasoning
"Qwen3-Max-Thinking — still under active training — is already demonstrating remarkable potential. When augmented with tool usage and scaled test-time compute, the Thinking variant has achieved 100% on challenging reasoning benchmarks such as AIME 25 and HMMT. "
664
18
Qwen3-Omni | Alibaba
https://github.com/QwenLM/Qwen3-Omni?tab=readme-ov-file
30 | 17,000 | 567:1 | 2.4 | 88.8 | 73.1 | synthetic, web-scale | Sep/2025 | 🟢 | https://github.com/QwenLM/Qwen3-Omni/blob/main/assets/Qwen3_Omni.pdf | MoE | Reasoning
"Qwen3-Omni is a unified end-to-end model capable of processing multiple modalities, such as text, audio, image and video, and generating real-time text or speech response."... "pretraining utilizes a large-scale dataset containing approximately 2 trillion tokens, with the following distribution across modalities: text (0.57 trillion), audio (0.77 trillion), image (0.82 trillion), video (0.05 trillion), and video-audio (0.05 trillion)."
663
19
DeepSeek-V3.1-Terminus
DeepSeek-AI | https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus | 685 | 15,640 | 22:1 | 10.6 | 85 | 80.7 | 21.7 | synthetic, web-scale | Sep/2025 | 🟢 | https://api-docs.deepseek.com/news/news250922 | MoE | SOTA, Reasoning | Hybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2 | 662
20
Isaac 0.1 | Perceptron | https://huggingface.co/PerceptronAI/Isaac-0.1 | 2 | 2,000 | 1,000:1 | 0.2 | synthetic, web-scale | Sep/2025 | 🟢 | https://www.perceptron.inc/blog/introducing-isaac-0-1
Dense
"perceptive-language model...delivering capabilities that meet or exceed those of models over 50 times its size. Founded by the team behind Meta's Chameleon multimodal models, Perceptron is tackling a fundamental challenge: bringing the power of physical AI to the dynamic, multimodal, and real-time environments we live and work in."
661
21
Grok 4 Fast | xAI
https://grok.com/
200 | 20,000 | 100:1 | 6.7 | 85.7 | 20 | synthetic, web-scale | Sep/2025 | 🟢 | https://x.ai/news/grok-4-fast | MoE | Reasoning, SOTA
"2M token context window, and a unified architecture that blends reasoning and non-reasoning modes in one model."
660
22
VaultGemma
Google DeepMind
https://huggingface.co/google/vaultgemma-1b
1 | 13,000 | 13,000:1 | 0.4 | web-scale | Sep/2025 | 🟢 | https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf
Dense
"Differential Privacy (DP) has emerged as the gold standard, providing a rigorous, mathematical framework to limit the influence of any single example in the training data on the resulting model. A model trained with DP provably bounds the reconstruction or leakage of information tied to individual data points." Announce: https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/659
23
Qwen3-Next-80B-A3B
Alibaba
https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
80 | 15,000 | 188:1 | 3.7 | 84.72 | 66.05 | 43.43 | synthetic, web-scale | Sep/2025 | 🟢 | https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list | MoE | Reasoning
"Qwen3-Next introduces several key improvements: a hybrid attention mechanism, a highly sparse Mixture-of-Experts (MoE) structure, training-stability-friendly optimizations, and a multi-token prediction mechanism for faster inference."
658
24
K2-Think | MBZUAI | https://www.k2think.ai/ | 32 | 18,000 | 563:1 | 2.5 | 71.0 | 89.95 | synthetic, web-scale | Sep/2025 | 🟢 | https://arxiv.org/abs/2509.07604
Dense
Reasoning
"Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets."
657
25
mmBERT | JHU | https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4 | 0.307 | 3,000 | 9,772:1 | 0.1 | synthetic, web-scale | Sep/2025 | 🟢 | https://arxiv.org/abs/2509.06888
Dense
"a modern multilingual encoder trained on 3T tokens and 1833 languages. We introduce several novel elements in training: an inverse masking schedule and a cascading annealed language learning schedule for multilingual data" Announce: https://huggingface.co/blog/mmbert
656
26
ERNIE X1.1 | Baidu | https://ernie.baidu.com/ | synthetic, web-scale | Sep/2025 | 🟢 | https://www.prnewswire.com/news-releases/baidu-unveils-reasoning-model-ernie-x1-1-with-upgrades-in-key-capabilities-302551170.html | MoE | Reasoning | 655
27
ERNIE-4.5-21B-A3B-Thinking
Baidu
https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking
21 | 15,000 | 715:1 | 1.9 | synthetic, web-scale | Sep/2025 | 🟢 | https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking | MoE | Reasoning | 654
28
Klear-46B-A2.5B | Kuaishou
https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct
46 | 22,000 | 479:1 | 3.4 | 80.5 | 57.6 | 35.3 | synthetic, web-scale | Sep/2025 | 🟢 | https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct | MoE | 46B-A2.5B. | 653
29
TildeOpen-30b | Tilde AI
https://huggingface.co/TildeAI/TildeOpen-30b
30 | 2,000 | 67:1 | 0.8 | synthetic, web-scale | Sep/2025 | 🟢 | https://tilde.ai/lv/tildeopen-llm/
Dense
"language data from across Europe" | 652
30
Qwen3-Max-Preview
Alibaba
https://chat.qwen.ai/
1,000 | 36,000 | 36:1 | 20.0 | 64.6 | synthetic, web-scale | Sep/2025 | 🟢 | https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?type=model&url=2840914_2&modelId=qwen3-max-preview | MoE
GPQA score is SuperGPQA. "our biggest model yet, with over 1 trillion parameters"
651
31
Kimi K2-Instruct-0905
Moonshot AI
https://huggingface.co/moonshotai/Kimi-K2-Instruct
1,000 | 15,500 | 16:1 | 13.1 | synthetic, web-scale | Sep/2025 | 🟢 | https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905 | MoE | Reasoning, SOTA | 1T-A32B. 1T parameters and 384 experts. Open source SOTA. | 650
32
Apertus | ETH Zürich
https://huggingface.co/swiss-ai/Apertus-70B-Instruct-2509
70 | 15,000 | 215:1 | 3.4 | 65.2 | 30.6 | synthetic, web-scale | Sep/2025 | 🟢 | https://github.com/swiss-ai/apertus-tech-report/blob/main/Apertus_Tech_Report.pdf
Dense
"Apertus – Latin for “open”" 1,811 languages. Announce: https://ethz.ch/en/news-and-events/eth-news/news/2025/09/press-release-apertus-a-fully-open-transparent-multilingual-language-model.html
649
33
MAI-1-preview | Microsoft
https://microsoft.ai/news/two-new-in-house-models/
500 | 10,000 | 20:1 | 7.5 | synthetic, web-scale | Aug/2025 | 🟢 | https://microsoft.ai/news/two-new-in-house-models/ | MoE
MAI=Microsoft artificial intelligence. "MAI’s first foundation model trained end-to-end... MAI-1-preview is an in-house mixture-of-experts model, pre-trained and post-trained on ~15,000 NVIDIA H100 GPUs. This model is designed to provide powerful capabilities to consumers seeking to benefit from models that specialize in following instructions and providing helpful responses to everyday queries. We will be rolling MAI-1-preview out for certain text use cases within Copilot"
648
34
grok-code-fast-1 | xAI
https://github.com/features/copilot
100 | 10,000 | 100:1 | 3.3 | synthetic, web-scale | Aug/2025 | 🟢
https://data.x.ai/2025-08-26-grok-code-fast-1-model-card.pdf
MoE
"We built grok-code-fast-1 from scratch, starting with a brand-new model architecture. To lay a robust foundation, we carefully assembled a pre-training corpus rich with programming-related content. For post-training, we curated high-quality datasets that reflect real-world pull requests and coding tasks." Announce: https://x.ai/news/grok-code-fast-1
647
35
Hermes 4 | Nous Research
https://huggingface.co/NousResearch/Hermes-4-405B-FP8
405 | 15,656 | 39:1 | 8.4 | 87.2 | 80.5 | 70.5 | synthetic, web-scale | Aug/2025 | 🟢
https://arxiv.org/abs/2508.18255
Dense
Reasoning | Based on Llama 3. Announce: https://hermes4.nousresearch.com/ | 646
36
Jet-Nemotron-4B | NVIDIA
https://github.com/NVlabs/Jet-Nemotron
4 | 400 | 100:1 | 0.1 | 65.2 | 44.2 | synthetic, web-scale | Aug/2025 | 🟢 | https://arxiv.org/abs/2508.15884v1
Dense
Reasoning"pre-training corpus and train Jet-Nemotron models for 50B tokens. This is also the setting in Section 2 where we perform PostNAS. At the second stage, we include more high-quality data from math [65] and coding [66, 67] domains into our data mixture. The models are then trained on 350B tokens."645
37
DeepSeek-V3.1-Base
DeepSeek-AI
https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base
685 | 15,640 | 22:1 | 10.6 | 93.7 | 84.8 | 80.1 | 29.8 | synthetic, web-scale | Aug/2025 | 🟢 | https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base | MoE | SOTA, Reasoning | Hybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2 | 644
38
Nemotron Nano 2 | NVIDIA
https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base
12.3 | 20,000 | 1,625:1 | 1.7 | 78.24 | 63.98 | 64.48 | synthetic, web-scale | Aug/2025 | 🟢 | https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf
Dense
Reasoning | Announce: https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2/ | 643
39
Gemma 3 270M
Google DeepMind
https://huggingface.co/google/gemma-3-270m-it
0.27 | 6,000 | 22,223:1 | 0.1 | web-scale | Aug/2025 | 🟢 | https://developers.googleblog.com/en/introducing-gemma-3-270m/
Dense
This is a record tokens-to-params ratio (for text models) of 22,223:1. | 642
40
GPT-5 | OpenAI
https://poe.com/GPT-5
300 | 114,000 | 380:1 | 19.5 | 91 | 89.4 | 42 | synthetic, web-scale | Aug/2025 | 🟢
https://openai.com/index/gpt-5-system-card/
MoE | SOTA, Reasoning
Announce: https://openai.com/index/introducing-gpt-5/. MMLU is based on ES and PT translated from EN.
641
41
gpt-oss-120b | OpenAI
https://huggingface.co/openai/gpt-oss-120b
120 | 30,000 | 250:1 | 6.3 | 90 | 80.1 | 19 | synthetic, web-scale | Aug/2025 | 🟢
https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
MoE | Reasoning, SOTA | 116.8B total parameters and 5.1B "active" parameters per token per forward pass. https://openai.com/index/introducing-gpt-oss/ | 640
42
gpt-oss-20b | OpenAI
https://huggingface.co/openai/gpt-oss-20b
20 | 13,000 | 650:1 | 1.7 | 85.3 | 71.5 | 17.3 | synthetic, web-scale | Aug/2025 | 🟢
https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
MoE | Reasoning, SOTA | 20.9B total and 3.6B active parameters. https://openai.com/index/introducing-gpt-oss/ | 639
43
Claude Opus 4.1 | Anthropic
https://claude.ai/
2,000 | 100,000 | 50:1 | 47.1 | 80.9 | synthetic, web-scale | Aug/2025 | 🟢
https://www.anthropic.com/news/claude-opus-4-1
MoE | Reasoning, SOTA | 638
44
GLM-4.5 | Z.AI
https://huggingface.co/zai-org/GLM-4.5
355 | 22,000 | 62:1 | 9.3 | 84.6 | 79.1 | 14.4 | synthetic, web-scale | Jul/2025 | 🟢
https://z.ai/blog/glm-4.5
MoE | Reasoning | 355B-A32B. | 637
45
T1
China Telecom Artificial Intelligence Research Institute
https://github.com/Tele-AI/T1
115 | 10,000 | 87:1 | 3.6 | web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2507.18013
Dense
Reasoning | 636
46
Intern-S1
Shanghai AI Laboratory/SenseTime
https://huggingface.co/internlm/Intern-S1
235 | 41,000 | 175:1 | 10.3 | 83.5 | 77.3 | synthetic, web-scale | Jul/2025 | 🟢
https://huggingface.co/internlm/Intern-S1
MoE | Reasoning, SOTA
41T tokens assumes base model of Qwen3. "Built upon a 235B MoE language model and a 6B Vision encoder, Intern-S1 has been further pretrained on 5 trillion tokens of multimodal data"
635
47
Step 3 | StepFun
https://www.stepfun.com/
321 | 18,000 | 57:1 | 8.0 | 72.9 | web-scale | Jul/2025 | 🟢
https://github.com/stepfun-ai/Step3/blob/main/Step3-Sys-Tech-Report.pdf
MoE | 321B-A38B. https://x.com/CyouSakura/status/1948767450751009227 | 634
48
Qwen3-235B-A22B-Thinking-2507
Alibaba
https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
235 | 36,000 | 154:1 | 9.7 | 93.8 | 84.4 | 81.1 | synthetic, web-scale | Jul/2025 | 🟢
https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
MoE | Reasoning
235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux.
633
49
KAT-V1-200B | Kuaishou | 200 | 10,000 | 50:1 | 4.7 | 82.3 | 78.2 | synthetic, web-scale | Jul/2025 | 🔴 | https://arxiv.org/abs/2507.08297 | MoE | Reasoning
200B-A40B. In training as of Jul/2025. "to address the overthinking problem in reasoning-intensive tasks"
632
50
KAT-V1-40B | Kuaishou
https://huggingface.co/Kwaipilot/KAT-V1-40B
40 | 10,000 | 250:1 | 2.1 | 77.8 | 75.1 | synthetic, web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2507.08297
Dense
Reasoning"to address the overthinking problem in reasoning-intensive tasks"631
51
Qwen3-Coder-480B-A35B-Instruct
Alibaba
https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
480 | 36,000 | 75:1 | 13.9 | synthetic, web-scale | Jul/2025 | 🟢 | https://qwenlm.github.io/blog/qwen3-coder/ | MoE | 480B-A35B. | 630
52
Qwen3-235B-A22B-Instruct-2507
Alibaba
https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
235 | 36,000 | 154:1 | 9.7 | 93.1 | 83 | 77.5 | synthetic, web-scale | Jul/2025 | 🟢 | https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507 | MoE | SOTA
235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux.
629
53
FlexOlmo | Allen AI
https://huggingface.co/allenai/FlexOlmo-7x7B-1T
37 | 4,150 | 113:1 | 1.3 | 60.4 | 30.9 | synthetic, web-scale | Jul/2025 | 🟢
https://arxiv.org/abs/2507.07024v1
MoE
37B-A20B. "We adopt the OLMo-2 7B setup, starting from a a checkpoint pre-trained on 4T tokens and annealed for 50B tokens to produce a public expert. We then train two additional experts on math and code, each for 50B tokens, and combine them with the public expert to form a three-expert version of FLEXOLMO."
628
54
EXAONE 4.0 | LG
https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B
32 | 14,000 | 438:1 | 2.2 | 92.3 | 81.8 | 75.4 | web-scale | Jul/2025 | 🟢
https://www.lgresearch.ai/data/cdn/upload/EXAONE_4_0.pdf
Dense
Reasoning
“EXAONE”=“EXpert AI for EveryONE”. Training tokens/ratio: EXAONE-3 7.8B=8T tokens (Aug/2024) -> EXAONE-3.5 7.8B=9T -> EXAONE-3.5 32B=6.5T tokens -> EXAONE 4.0 32B=14T tokens. MMLU score is MMLU-Redux. Interesting: "To focus [RL] training on more informative data samples, we perform accuracy-based filtering by generating eight responses from the SFT model and excluding samples where all eight responses are correct, a pre-filtering step that removes problems that are easy for the model to avoid inefficient training."
627
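The accuracy-based filtering described in the EXAONE 4.0 note above is straightforward to sketch. In the snippet below, generate_response and is_correct are hypothetical stand-ins for the SFT model and the answer checker; this illustrates the idea, not LG's actual pipeline:

```python
# Minimal sketch of accuracy-based pre-filtering: sample n responses per problem
# and drop problems the SFT model already answers correctly every time.
from typing import Callable, List

def filter_rl_samples(
    problems: List[dict],
    generate_response: Callable[[dict], str],   # hypothetical SFT-model sampler
    is_correct: Callable[[dict, str], bool],    # hypothetical answer checker
    n_samples: int = 8,
) -> List[dict]:
    kept = []
    for problem in problems:
        responses = [generate_response(problem) for _ in range(n_samples)]
        if not all(is_correct(problem, r) for r in responses):
            kept.append(problem)                # keep only non-trivial problems
    return kept
```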
55
Kimi K2 | Moonshot AI
https://huggingface.co/moonshotai/Kimi-K2-Instruct
1,000 | 15,500 | 16:1 | 13.1 | 89.5 | 81.1 | 75.1 | 4.7 | synthetic, web-scale | Jul/2025 | 🟢 | https://moonshotai.github.io/Kimi-K2/ | MoE | Reasoning, SOTA | 1T-A32B. 1T parameters and 384 experts. Open source SOTA. | 626
56
Reka Flash 3.1 | Reka AI
https://huggingface.co/RekaAI/reka-flash-3.1
21 | 5,000 | 239:1 | 1.1 | web-scale | Jul/2025 | 🟢
https://www.reka.ai/news/introducing-reka-flash
Dense
Reasoning | 625
57
Devstral Medium | Mistral
https://chat.mistral.ai/chat
50 | 12,000 | 240:1 | 2.6 | synthetic, web-scale | Jul/2025 | 🟢
https://mistral.ai/news/devstral-2507
Dense
Non-reasoning. | 624
58
Grok 4 | xAI
https://grok.com/
600 | 80,000 | 134:1 | 23.1 | 88.9 | 44.4 | synthetic, web-scale | Jul/2025 | 🟢 | https://lifearchitect.ai/grok/ | MoE | Reasoning, SOTA
"The smartest AI in the world, 100% on SAT, etc, questions that it's never seen before."
623
59
Phi-4-mini-flash-reasoning
Microsoft
https://huggingface.co/microsoft/Phi-4-mini-flash-reasoning
3.8 | 5,150 | 1,356:1 | 0.5 | synthetic, web-scale | Jul/2025 | 🟢 | https://azure.microsoft.com/en-us/blog/reasoning-reimagined-introducing-phi-4-mini-flash-reasoning/
Dense
"Pre-training: 5T tokens; Reasoning training: 150B tokens" "At the core of Phi-4-mini-flash-reasoning is the newly introduced decoder-hybrid-decoder architecture, SambaY, whose central innovation is the Gated Memory Unit (GMU), a simple yet effective mechanism for sharing representations between layers. The architecture includes a self-decoder that combines Mamba (a State Space Model) and Sliding Window Attention (SWA), along with a single layer of full attention. The architecture also involves a cross-decoder that interleaves expensive cross-attention layers with the new, efficient GMUs. This new architecture with GMU modules drastically improves decoding efficiency, boosts long-context retrieval performance and enables the architecture to deliver exceptional performance across a wide range of tasks. "
622
60
T5Gemma
Google DeepMind
https://huggingface.co/google/t5gemma-9b-9b-ul2-it
9 | 10,000 | 1,112:1 | 1.0 | 76.7 | 55.7 | 40.4 | web-scale | Jul/2025 | 🟢 | https://developers.googleblog.com/en/t5gemma/
Dense
Related paper: https://arxiv.org/abs/2504.06225. Dataset was Gemma 2 9B on 8T tokens + 2T tokens adapted. | 621
61
MedGemma
Google DeepMind
https://huggingface.co/google/medgemma-27b-it
27 | 14,000 | 519:1 | 2.0 | 87 | web-scale | Jul/2025 | 🟢
https://arxiv.org/abs/2507.05201
Dense
Multimodal model. Text MMLU score for med only=87.0. | 620
62
R1T2 Chimera | TNG
https://huggingface.co/tngtech/DeepSeek-TNG-R1T2-Chimera
685 | 14,800 | 22:1 | 10.6 | synthetic, web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2506.14794 | MoE
Assembly-of-Experts method combining V3-0324, R1, and R1-0528. Announce: https://x.com/tngtech/status/1940531045432283412?s=46
619
63
Spectra 1.1 | Consortium | 3.6 | 1,200 | 334:1 | 0.2 | 36.12 | synthetic, web-scale | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.23025
Dense
"Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights"
618
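The quoted 1.6-bit figure for Spectra 1.1 is consistent with packing five ternary weights into one byte (3^5 = 243 ≤ 256, i.e. 8/5 = 1.6 bits per weight). The sketch below shows that base-3 packing as one plausible scheme; the paper's actual packing format may differ.

```python
# Illustrative base-3 packing of ternary weights {-1, 0, +1}: five weights per
# byte gives exactly 8/5 = 1.6 bits per weight. A plausible scheme for the
# quoted "1.6-bit packing", not necessarily the one used in the paper.
from typing import List

def pack_ternary(weights: List[int]) -> bytes:
    """Pack ternary weights (values in {-1, 0, 1}) five-per-byte in base 3."""
    out = bytearray()
    for i in range(0, len(weights), 5):
        value = 0
        for w in reversed(weights[i:i + 5]):   # little-endian base-3 digits
            value = value * 3 + (w + 1)        # map {-1,0,1} -> {0,1,2}
        out.append(value)
    return bytes(out)

def unpack_ternary(packed: bytes, n: int) -> List[int]:
    """Inverse of pack_ternary; n is the original number of weights."""
    weights = []
    for byte in packed:
        for _ in range(5):
            weights.append(byte % 3 - 1)
            byte //= 3
    return weights[:n]

w = [1, -1, 0, 0, 1, -1, 1]
assert unpack_ternary(pack_ternary(w), len(w)) == w
```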
64
DiffuCoder | Apple
https://github.com/apple/ml-diffucoder
7 | 5,630 | 805:1 | 0.7 | code, The Stack | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.20639
Dense
Diffusion
"We adapt our model from Qwen2.5-Coder (Hui et al., 2024) as the base model to perform continual pre-training using the adaptation approach from Gong et al. (2025). During this pre-training, we use a 400B-token code pre-training corpus from RefineCode (Huang et al., 2024) and Stackv2 (Lozhkov et al., 2024)."
617
65
Hunyuan-A13B | Tencent
https://huggingface.co/tencent/Hunyuan-A13B-Instruct
80 | 7,000 | 88:1 | 2.5 | 88.17 | 67.23 | 71.2 | synthetic, web-scale | Jun/2025 | 🟢
https://huggingface.co/tencent/Hunyuan-A13B-Instruct
MoE
80B-A13B. 'We have open-sourced Hunyuan-A13B-Pretrain , Hunyuan-A13B-Instruct , Hunyuan-A13B-Instruct-FP8 , Hunyuan-A13B-Instruct-GPTQ-Int4 on Hugging Face.'
616
66
Mercury | Inception
https://chat.inceptionlabs.ai/
90 | 8,000 | 89:1 | 2.8 | 69513.4 | synthetic, web-scale | Jun/2025 | 🟢 | https://www.inceptionlabs.ai/introducing-mercury-our-general-chat-model | Dense | Diffusion | Diffusion large language model (dLLM). | 615
67
Mu | Microsoft
https://blogs.windows.com/windows-insider/2025/06/13/announcing-windows-11-insider-preview-build-26200-5651-dev-channel/
0.5 | 500 | 1,000:1 | 0.1 | synthetic, web-scale | Jun/2025 | 🟢 | https://blogs.windows.com/windowsexperience/2025/06/23/introducing-mu-language-model-and-how-it-enabled-the-agent-in-windows-settings/
Dense
"distillation from Microsoft’s Phi models...Mu is an efficient 330M encoder–decoder language model optimized for small-scale deployment, particularly on the NPUs on Copilot+ PCs. It follows a transformer encoder–decoder architecture"
614
68
Gemini Robotics On-Device
Google DeepMind
https://docs.google.com/forms/u/0/d/1sM5GqcVMWv-KmKY3TOMpVtQ-lDFeAftQ-d9xQn92jCE/viewform?ts=67cef986&edit_requested=true
20 | 10,000 | 500:1 | 1.5 | synthetic, web-scale | Jun/2025 | 🟢 | https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/ | MoE
See Mar/2025 Gemini Robotics-ER model for comparison. Announce: https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/
613
69
ICONN-1 | ICONNAI
https://huggingface.co/ICONNAI/ICONN-1
88 | 10,000 | 114:1 | 3.1 | synthetic, web-scale | Jun/2025 | 🟢 | MoE
"ICONN-1 (this version) is optimized for natural, emotionally resonant, and conversational interactions. ICONN-e1 is a specialized variant of the model fine-tuned for advanced reasoning, critical analysis, and complex problem-solving."
612
70
MiniMax-M1 | MiniMax
https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
456 | 7,200 | 16:1 | 6.0 | 81.1 | 70 | 8.4 | web-scale | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.13585 | MoE | Reasoning | 456B-A45.9B. Announce: https://www.minimax.io/news/minimaxm1 | 611
71
Magistral Medium | Mistral
https://chat.mistral.ai/chat
50 | 12,000 | 240:1 | 2.6 | 70.8 | synthetic, web-scale | Jun/2025 | 🟢
https://mistral.ai/static/research/magistral.pdf
Dense
Reasoning | Magistral Small=24B. Announce: https://mistral.ai/news/magistral | 610
72
Comma v0.1-2T | EleutherAI
https://huggingface.co/common-pile/comma-v0.1-2t
7 | 2,000 | 286:1 | 0.4 | 49.8 | web-scale | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.05209
Dense
"Comma v0.1-2T is a decoder-only transformer that uses the same architecture as Llama 3. Training was done in two stages: first on 1.93 trillion tokens with a cosine learning rate schedule, and second a "cool-down" training phase on 75.5 billion tokens from high-quality sources. The final model is the average of 10 checkpoints during this cool-down phase. Both training phases use a batch size of 8.3 million tokens per step. Training was performed using lingua on 512 AMD MI300A GPUs."
609
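Checkpoint averaging, as used for the final Comma v0.1-2T model, can be sketched in a few lines. The file paths and checkpoint format below are assumptions for illustration; Comma's actual training used lingua, not this script.

```python
# Minimal sketch of checkpoint averaging: the final model is the element-wise
# mean of several saved checkpoints of the same architecture.
import torch

def average_checkpoints(paths):
    """Average the parameter tensors of several checkpoints of the same model."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# Hypothetical usage with 10 cool-down checkpoints:
# merged = average_checkpoints([f"checkpoint_{i}.pt" for i in range(10)])
# torch.save(merged, "averaged_model.pt")
```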
73
dots.llm1
Xiaohongshu/RedNote
https://huggingface.co/rednote-hilab/dots.llm1.base
142 | 11,200 | 79:1 | 4.2 | 83.2 | 61.9 | 52.6 | web-scale | Jun/2025 | 🟢
https://github.com/rednote-hilab/dots.llm1/blob/main/dots1_tech_report.pdf
MoE
142B-A14B. "dots.llm1, a large-scale MoE model that activates 14 billion parameters out of a total of 142 billion parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models."
608
74
Gemini 2.5 Pro 06-05
Google DeepMind
https://deepmind.google/models/gemini-diffusion/
400 | 80,000 | 200:1 | 18.9 | 86.4 | 21.6 | synthetic, web-scale | Jun/2025 | 🟢
https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf
Dense
Reasoning, SOTA
"an upgraded preview of Gemini 2.5 Pro, our most intelligent model yet. Building on the version we released in May and showed at I/O, this model will be the generally available, stable version starting in a couple of weeks, ready for enterprise-scale applications."
607
75
MiMo-7B-RL-0530 | Xiaomi
https://huggingface.co/XiaomiMiMo/MiMo-7B-RL-0530
7 | 25,000 | 3,572:1 | 1.4 | 58.6 | 60.6 | synthetic, web-scale | May/2025 | 🟢 | https://arxiv.org/abs/2505.07608
Dense
Reasoning
"[2025.05.30] During the RL training, by continuously expanding the training window size (from 32K to 48K), the performance of MiMo-7B-RL-0530 on AIME24 can be continuously improved and eventually surpass that of DeepSeek R1... MiMo-7B-Base is pre-trained on approximately 25 trillion tokens."
606
76
DeepTransformers
Google DeepMind
1.3 | 100 | 77:1 | 0.0 | synthetic, web-scale | May/2025 | 🔴
https://arxiv.org/abs/2505.23735
Dense
"Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture."
605
77
Atlas
Google DeepMind
1.3 | 100 | 77:1 | 0.0 | synthetic, web-scale | May/2025 | 🔴
https://arxiv.org/abs/2505.23735
Dense
"Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture."
604
78
DeepSeek-R1-0528
DeepSeek-AI
https://chat.deepseek.com/
685 | 14,800 | 22:1 | 10.6 | 93.4 | 85 | 81 | 17.7 | synthetic, web-scale | May/2025 | 🟢
https://huggingface.co/deepseek-ai/DeepSeek-R1-0528
MoEReasoning, SOTA
Censorship increased significantly. "overall performance is now approaching that of leading models, such as o3 and Gemini 2.5 Pro." MMLU shows MMLU-Redux score with lower error rate.
603
79
Fathom-R1-14B | Fractal Analytics
https://huggingface.co/FractalAIResearch/Fathom-R1-14B
14 | 18,000 | 1,286:1 | 1.7 | 66.16 | synthetic, web-scale | May/2025 | 🟢 | https://huggingface.co/FractalAIResearch/Fathom-R1-14B
Dense
Reasoning | Base R1-distilled-14B model, based on Qwen 14B. Media release. | 602
80
QwenLong-L1-32B | Alibaba
https://huggingface.co/Tongyi-Zhiwen/QwenLong-L1-32B
32 | 18,000 | 563:1 | 2.5 | synthetic, web-scale | May/2025 | 🟢 | https://arxiv.org/abs/2505.17667
Dense
Reasoning
"the first long-context LRM trained with reinforcement learniing for long-context reasoning."
601
81
Claude Opus 4 | Anthropic
https://claude.ai/
6,000 | 100,000 | 17:1 | 81.6 | 83.3 | synthetic, web-scale | May/2025 | 🟢 | https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf
Dense
Reasoning, SOTA
"Claude Opus 4 is our most intelligent model to date, pushing the frontier in coding, agentic search, and creative writing. With advanced reasoning and powerful collaboration capabilities…Both models can also alternate between reasoning and tool use—like web search—to improve responses…Claude Opus 4 can work continuously for hours on complex, long-running tasks"
600
82
Falcon-H1 | TII
https://huggingface.co/tiiuae/Falcon-H1-34B-Instruct-GGUF
34 | 18,000 | 530:1 | 2.6 | 84.05 | 58.73 | 49.66 | synthetic, web-scale | May/2025 | 🟢
https://huggingface.co/papers/2507.22448
Dense
"hybrid architecture that combines the strengths of the classical Transformer-based attention mechanism with the State Space Model (SSM), known for its superior long-context memory and computational efficiency."599
83
Gemini Diffusion
Google DeepMind
https://deepmind.google/models/gemini-diffusion/
40 | 16,000 | 400:1 | 2.7 | 40.4 | synthetic, web-scale | May/2025 | 🟢
https://deepmind.google/models/gemini-diffusion/
Dense
Diffusion
"Gemini Diffusion’s external benchmark performance is comparable to much larger models [like Gemini-2.0-Flash-Lite], whilst also being faster."
598
84
Gemma 3n
Google DeepMind
https://ai.google.dev/gemma/docs/gemma-3n
4 | 8,000 | 2,000:1 | 0.6 | 62.1 | synthetic, web-scale | May/2025 | 🟢
https://developers.googleblog.com/en/introducing-gemma-3n/
MatFormer
Matryoshka Transformer or MatFormer model architecture. 850M (696M / 620M / 582M).
597
85
ParScale | Alibaba
https://huggingface.co/ParScale/ParScale-4.7B-P8-Python
4.7 | 1,000 | 213:1 | 0.2 | 35.1 | synthetic, web-scale | May/2025 | 🟢 | https://arxiv.org/abs/2505.10475
Dense
"We introduce the third scaling paradigm for scaling LLMs: leverages parallel computation during both training and inference time (Parallel Scaling, or ParScale)... ParScale can use up to 22× less memory increase and 6× less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget." MMLU shows for 1.8B models, not the 4.7B models.
596
86
codex-1 | OpenAI | https://chatgpt.com/codex | 600 | 100,000 | 167:1 | 25.8 | synthetic, web-scale | May/2025 | 🟢 | https://openai.com/index/introducing-codex/ | MoE | Reasoning, SOTA
o3 base. "codex-1, a version of OpenAI o3 optimized for software engineering. It was trained using reinforcement learning on real-world coding tasks in a variety of environments to generate code that closely mirrors human style and PR preferences, adheres precisely to instructions, and can iteratively run tests until it receives a passing result."
595
87
Falcon-Edge | TII
https://huggingface.co/tiiuae/Falcon-E-3B-Instruct
3 | 1,500 | 500:1 | 0.2 | 55.7 | 27.16 | 23.59 | synthetic, web-scale | May/2025 | 🟢
https://huggingface.co/blog/tiiuae/falcon-edge
Dense
"Falcon-Edge series - a collection of powerful, universal, and fine-tunable language models available in ternary format, based on the BitNet architecture."594
88
SWE-1 | Windsurf
https://windsurf.com/blog/windsurf-wave-9-swe-1
50 | 8,000 | 160:1 | 2.1 | synthetic, web-scale | May/2025 | 🟢
https://windsurf.com/blog/windsurf-wave-9-swe-1
Dense"SWE-1, optimized for the entire software engineering process, not just the task of coding."593
89
INTELLECT-2 | Prime Intellect
https://chat.primeintellect.ai/
32 | 18,000 | 563:1 | 2.5 | 66.8 | web-scale | May/2025 | 🟢
https://storage.googleapis.com/public-technical-paper/INTELLECT_2_Technical_Report.pdf
Dense
Reasoning | QwQ-32B base. Announce: https://www.primeintellect.ai/blog/intellect-2-release Finished training 30/Apr/2025: https://app.primeintellect.ai/intelligence/intellect-2 | 592
90
Pangu Ultra MoE | Huawei
https://github.com/pangu-tech/pangu-ultra
718 | 13,000 | 19:1 | 10.2 | 91.5 | 83.5 | 75.3 | synthetic, web-scale | May/2025 | 🔴
https://arxiv.org/abs/2505.04519
MoEReasoning
718B-A39B. Trained on 6,000 Ascend NPUs (Kunpeng 920 processors in Huawei Atlas 800T A2 servers).
591
91
Mistral Medium 3 | Mistral
https://chat.mistral.ai/chat
50 | 12,000 | 240:1 | 2.6 | 77.2 | 57.1 | synthetic, web-scale | May/2025 | 🟢
https://mistral.ai/news/mistral-medium-3
Dense
Multimodal. 50B param estimate based on "Mistral Medium 3 can also be deployed on any cloud, including self-hosted environments of four GPUs and above." Note: "With the launches of Mistral Small in March and Mistral Medium today, it's no secret that we're working on something 'large' over the next few weeks. With even our medium-sized model being resoundingly better than flagship open source models such as Llama 4 Maverick, we're excited to 'open' up what's to come :) " | 590
92
Granite-4.0-Tiny-Preview
IBM
https://huggingface.co/ibm-granite/granite-4.0-tiny-preview
7 | 2,500 | 358:1 | 0.4 | 60.4 | synthetic, web-scale | May/2025 | 🟢
https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek
MoEReasoning
"the model is only partially trained—it has only seen 2.5T of a planned 15T or more training tokens...Granite 4.0 Tiny-Preview, specifically, is a fine-grained hybrid mixture of experts (MoE) model, with 7B total parameters and only 1B active parameters at inference time... Like its predecessors in Granite 3.2 and Granite 3.3, Granite 4.0 Tiny Preview offers toggleable thinking on and thinking off functionality (though its reasoning-focused post-training is very much incomplete)."
589
93
Phi-4-reasoning-plus
Microsoft
https://huggingface.co/microsoft/Phi-4-reasoning-plus
14 | 10,016 | 716:1 | 1.2 | 76 | 69.3 | synthetic, web-scale | Apr/2025 | 🟢 | https://arxiv.org/abs/2504.21318
Dense
"Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model finetuned from Phi-4 using supervised fine-tuning on a dataset of chain-of-thought traces and reinforcement learning."
588
94
Bamba-9B-v2 | IBM
https://huggingface.co/ibm-ai-platform/Bamba-9B-v2
9 | 3,000 | 334:1 | 0.5 | 67.92 | 25.41 | 5.93 | synthetic, web-scale | Apr/2025 | 🟢
https://huggingface.co/blog/ibm-ai-platform/bamba-9b-v2
Dense
"During Christmas of 2024, IBM, Princeton, CMU, and UIUC released, Bamba v1, a performant Mamba2 based pretrained model with full data lineage trained to 2T tokens. Since then, we have been busy cooking an update with new datasets. Today, we are excited to release Bamba v2, trained for an additional 1T tokens that significantly improves on Bamba v1. The L1 and L2 leaderboard scores outperform Llama 3.1 8B, which was trained with nearly 5x the amount of data. All of this with the inference speedup that we get from Mamba2 based architecture, which with the latest vLLM is 2-2.5x faster than similar sized transformer models."
587
95
Qwen3-235B-A22B | Alibaba
https://huggingface.co/Qwen/Qwen3-235B-A22B
235 | 36,000 | 154:1 | 9.7 | 87.81 | 68.18 | 47.47 | synthetic, web-scale | Apr/2025 | 🟢 | https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf | MoE | Reasoning
Qwen3-235B-A22B. Qwen3-30B-A3B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages"
586
96
ERNIE X1 Turbo | Baidu | https://huggingface.co/spaces/PaddlePaddle/ernie_x1_turbo_demo | 69 | synthetic, web-scale | Apr/2025 | 🟢 | https://www.prnewswire.com/news-releases/baidu-launches-ernie-4-5-turbo-ernie-x1-turbo-and-new-suite-of-ai-tools-to-empower-developers-and-supercharge-ai-innovation-302438584.html | MoE | Reasoning | Announce: https://x.com/Baidu_Inc/status/1915603080336597310 | 585
97
ERNIE 4.5 Turbo | Baidu | https://huggingface.co/spaces/PaddlePaddle/ernie_4.5_turbo_demo | 90 | synthetic, web-scale | Apr/2025 | 🟢 | https://www.prnewswire.com/news-releases/baidu-launches-ernie-4-5-turbo-ernie-x1-turbo-and-new-suite-of-ai-tools-to-empower-developers-and-supercharge-ai-innovation-302438584.html | MoE | Announce: https://x.com/Baidu_Inc/status/1915603080336597310 | 584
98
MAI-DS-R1 | Microsoft
https://huggingface.co/microsoft/MAI-DS-R1
685 | 14,800 | 22:1 | 10.6 | 86.8 | synthetic, web-scale | Apr/2025 | 🟢
https://techcommunity.microsoft.com/blog/machinelearningblog/introducing-mai-ds-r1/4405076
MoE | Reasoning
DeepSeek-R1 base. "MAI-DS-R1, a new open weights DeepSeek R1 model variant... post-trained by the Microsoft AI team to improve its responsiveness on blocked topics and its risk profile, while maintaining its reasoning capabilities and competitive performance."
583
99
Gemini 2.5 Flash Preview
Google DeepMind
https://aistudio.google.com/prompts/new_chat?model=gemini-2.5-flash-preview-04-17
80 | 20,000 | 250:1 | 4.2 | 78.3 | 12.1 | synthetic, web-scale | Apr/2025 | 🟢
https://deepmind.google/technologies/gemini/flash/
MoE | Reasoning
Context in=1M, out=64k. Knowledge cutoff Jan/2025. Codename 'nebula'. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/
582
100
o4-mini | OpenAI | https://chatgpt.com/?model=o4-mini-high | 200 | 40,000 | 200:1 | 9.4 | 88 | 81.4 | 14.28 | synthetic, web-scale | Apr/2025 | 🟢 | https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf | MoE | Reasoning, SOTA
https://openai.com/index/introducing-o3-and-o4-mini/ MMLU shown is a translated LOTE (language other than English) score.
581