(703) Permalink: https://lifearchitect.ai/models-table | Timeline view: https://lifearchitect.ai/timeline | The Memo: https://lifearchitect.ai/memo

ALScore is a quick-and-dirty rating of a model's power: the square root of (Parameters × Tokens trained), both in billions, divided by 300. Any ALScore ≥ 1.0 is a powerful model in mid-2023.
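A minimal sketch (not part of the source table) of how the two derived columns, ALScore and the Tokens:Params ratio, can be reproduced from the Parameters (B) and Tokens trained (B) columns; rounding the ratio up to the next whole number is an assumption that matches most rows.

```python
import math

def alscore(params_b: float, tokens_b: float) -> float:
    """ALScore = sqrt(parameters x tokens trained), both in billions, divided by 300."""
    return math.sqrt(params_b * tokens_b) / 300

def tokens_to_params_ratio(params_b: float, tokens_b: float) -> str:
    """Tokens:params ratio as shown in the table; rounding up is assumed (Chinchilla-optimal is >=20:1)."""
    return f"{math.ceil(tokens_b / params_b)}:1"

# Worked example using the GPT-5 row (300B parameters, 114,000B tokens trained):
print(round(alscore(300, 114_000), 1))       # 19.5
print(tokens_to_params_ratio(300, 114_000))  # 380:1
```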
# | Model | Lab | Playground | Parameters (B) | Tokens trained (B) | Ratio Tokens:Params (Chinchilla scaling ≥20:1) | ALScore | MMLU | MMLU-Pro | GPQA | HLE | Training dataset | Announced ▼ | Public? | Paper / Repo | Arch | Tags | Notes | Count (rough)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
3 | AuroraGPT (ScienceGPT) | Argonne National Laboratory | https://lifearchitect.ai/auroragpt/ | 2000 | 30000 | 15:1 | 25.8 | TBA | 🔴 | | Three models targeted in Jul/2024: AuroraGPT-7B-P (Ponte Vecchio GPU testing), AuroraGPT-7B-A (Aurora), and AuroraGPT-7B-A-S (Aurora + Science). | l |
4 | DeepSeek-R2 | DeepSeek-AI | https://www.reuters.com/technology/artificial-intelligence/deepseek-rushes-launch-new-ai-model-china-goes-all-2025-02-25/ | 1200 | 130000 | 109:1 | 41.6 | TBA | 🟢 | https://docs.google.com/document/d/e/2PACX-1vTmx-A5sBe_3RsURGM7VvLWsAgUXbcIb2pFaW7f1FTPgK7mGvYENXGQPoF2u4onFndJ_5tzZ02su-vg/pub | MoE | Reasoning, SOTA | Hybrid MoE, 1.2TA78B. 5.2PB corpus = 1.3Qa tokens (1.3 quadrillion tokens = 1,300T tokens = 1,300,000B tokens). "Constructed a 5.2 PB high-quality corpus covering vertical domains such as finance, law, and patents." Source: http://jiuyangongshe.com/a/1h4gq724su0, translated at: https://docs.google.com/document/d/e/2PACX-1vTmx-A5sBe_3RsURGM7VvLWsAgUXbcIb2pFaW7f1FTPgK7mGvYENXGQPoF2u4onFndJ_5tzZ02su-vg/pub | k |
5 | GPT-6 | OpenAI | https://lifearchitect.ai/gpt-6/ | TBA | SOTA | Due 2026. | f | |||||||||||||
6 | Grok-5 | xAI | https://lifearchitect.ai/whats-in-grok/ | 6000 | 100000 | 17:1 | 81.6 | TBA | MoE | | Due 2026. Quote 3T/6T: https://youtu.be/q_mMV5OpRd4?t=1387 | n | ||||||||
7 | Trinity-Large | Arcee AI | 420 | 20000 | 48:1 | 9.7 | TBA | 🟢 | https://www.arcee.ai/blog/the-trinity-manifesto | MoE | Reasoning | 420BA13B. "we worked closely with Prime Intellect. They not only served the H100 clusters Datology used to generate synthetic data, they have been deeply involved in helping scale our training setup to the GPU footprint required for a fully frontier sized model, including the current 2048 B300 GPU configuration for Trinity Large." | p | |||||||
8 | HY 2.0 | Tencent | https://hunyuan.tencent.com/ | 406 | 40000 | 99:1 | 13.4 | 18.8 | synthetic, web-scale | Dec/2025 | 🟢 | https://x.com/TencentHunyuan/status/1996948083377332614 | MoE | | 406BA32B. | 703 | ||||
9 | Trinity-Mini | Arcee AI | https://huggingface.co/arcee-ai/Trinity-Mini | 26 | 20000 | 770:1 | 2.4 | 84.95 | 58.55 | synthetic, web-scale | Dec/2025 | 🟢 | https://www.arcee.ai/blog/the-trinity-manifesto | MoE | Reasoning | 26BA3B. "we worked closely with Prime Intellect. They not only served the H100 clusters Datology used to generate synthetic data, they have been deeply involved in helping scale our training setup to the GPU footprint required for a fully frontier sized model, including the current 2048 B300 GPU configuration for Trinity Large." | 702 | |||
10 | Nova 2 Pro | Amazon | https://nova.amazon.com/chat | 200 | 20000 | 100:1 | 6.7 | 81.6 | 81.4 | synthetic, web-scale | Dec/2025 | 🟢 | https://www.aboutamazon.com/news/aws/aws-agentic-ai-amazon-bedrock-nova-models | Dense | Reasoning | "Nova 2 Pro is Amazon's most intelligent reasoning model that can process text, images, video, and speech to generate text." | 701 | |||
11 | Mistral Large 3 | Mistral | https://huggingface.co/collections/mistralai/mistral-large-3 | 675 | 20000 | 30:1 | 12.2 | 43.9 | synthetic, web-scale | Dec/2025 | 🟢 | https://mistral.ai/news/mistral-3 | MoE | Reasoning | 675BA41B. "Mistral Large 3 joins the ranks of frontier instruction-fine-tuned open-source models." EU tech doc: https://legal.cms.mistral.ai/assets/1e37fffd-7ea5-469b-822f-05dcfbb43623 | 700 | ||||
12 | DeepSeek-V3.2-Speciale | DeepSeek-AI | https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale | 685 | 15640 | 22:1 | 10.6 | 85.7 | 30.6 | synthetic, web-scale | Dec/2025 | 🟢 | https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf | MoE | SOTA, Reasoning | The word 'Speciale' may be a reference to Ferrari. "It shows gold-medal performance in the IOI 2025, ICPC World Final 2025, IMO 2025, and CMO 2025." API: https://api-docs.deepseek.com/news/news251201 | 699 | |||
13 | DeepSeek-Math-V2 | DeepSeek-AI | https://huggingface.co/deepseek-ai/DeepSeek-Math-V2 | 685 | 15640 | 22:1 | 10.6 | synthetic, web-scale | Nov/2025 | 🟢 | https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf | MoE | SOTA, Reasoning | "DeepSeekMath-V2 demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute." | 698 |
14 | Orchestrator-8B | NVIDIA | https://huggingface.co/nvidia/Orchestrator-8B | 8 | 36000 | 4,500:1 | 1.8 | 37.1 | synthetic, web-scale | Nov/2025 | 🟢 | https://arxiv.org/abs/2511.21689 | Dense | Reasoning | Base Model: Qwen3-8B | 697 | ||||
15 | INTELLECT-3 | Prime Intellect | https://chat.primeintellect.ai/ | 106 | 22000 | 208:1 | 5.1 | 81.9 | 74.4 | 14.6 | synthetic, web-scale | Nov/2025 | 🟢 | https://storage.googleapis.com/intellect-3-paper/INTELLECT_3_Technical_Report.pdf | Dense | Reasoning | GLM-4.5-Air-Base model. Announce: https://www.primeintellect.ai/blog/intellect-3 | 696 | ||
16 | Fara-7B | Microsoft | https://huggingface.co/microsoft/Fara-7B | 7 | 18000 | 2,572:1 | 1.2 | synthetic, web-scale | Nov/2025 | 🟢 | https://www.microsoft.com/en-us/research/wp-content/uploads/2025/11/Fara-7B-An-Efficient-Agentic-Model-for-Computer-Use.pdf | Dense | | "Fara-7B is Microsoft's first agentic small language model (SLM) designed specifically for computer use. With only 7 billion parameters, Fara-7B is an ultra-compact Computer Use Agent (CUA)...Current production baselines leverage Qwen 2.5-VL (7B)." | 695 | |||||
17 | Claude Opus 4.5 | Anthropic | https://claude.ai/ | 5000 | 100000 | 20:1 | 74.5 | 86.95 | 43.2 | synthetic, web-scale | Nov/2025 | 🟢 | https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf | MoE | Reasoning, SOTA | "the best model in the world for coding, agents, and computer use." Announce: https://www.anthropic.com/news/claude-opus-4-5 | 694 | |||
18 | Nemotron Elastic | NVIDIA | https://huggingface.co/nvidia/Nemotron-Elastic-12B | 12 | 110 | 10:1 | 0.1 | 76.2 | 63.25 | synthetic, web-scale | Nov/2025 | 🟢 | https://arxiv.org/abs/2511.16664v1 | Dense | Reasoning | "Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning...We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens" | 693 | |||
19 | GeoVista | Tencent | https://github.com/ekonwang/GeoVista | 7 | 18000 | 2,572:1 | 1.2 | synthetic, web-scale | Nov/2025 | 🟢 | https://arxiv.org/abs/2511.15705 | Dense | | Base model: Qwen2.5-VL-7B-Instruct. "GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. " Project page: https://ekonwang.github.io/geo-vista/ | 692 | |||||
20 | OLMo 3 | Allen AI | https://huggingface.co/collections/allenai/olmo-3 | 32 | 6000 | 188:1 | 1.5 | 85.4 | 58.1 | synthetic, web-scale | Nov/2025 | 🟢 | https://www.datocms-assets.com/64837/1763662397-1763646865-olmo_3_technical_report-1.pdf | Dense | Reasoning | Announce: https://allenai.org/blog/olmo3 | 691 | |||
21 | Gemini 3 | Google DeepMind | https://gemini.google.com/ | 3000 | 100000 | 34:1 | 57.7 | 90.1 | 93.8 | 45.8 | synthetic, web-scale | Nov/2025 | 🟢 | https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf | MoE | Reasoning, SOTA | "The knowledge cutoff date for Gemini 3 Pro was January 2025." | 690 | ||
22 | Grok 4.1 | xAI | https://grok.com/ | 3000 | 80000 | 27:1 | 51.6 | synthetic, web-scale | Nov/2025 | 🟢 | https://x.ai/news/grok-4-1 | MoE | Reasoning | 689 | ||||||
23 | Baguettotron | PleIAs | https://huggingface.co/PleIAs/Baguettotron | 0.321 | 200 | 624:1 | 0.0 | 40 | synthetic, web-scale | Nov/2025 | 🟢 | https://huggingface.co/PleIAs/Baguettotron | Dense | Reasoning | "The name is both a nod to French origins and to the unusual shape of the model: with 80 layers, Baguettotron is currently the deepest SLM in its size range." | 688 | ||||
24 | ERNIE-5.0-Preview-1022 | Baidu | https://ernie.baidu.com/ | 2400 | 100000 | 42:1 | 51.6 | synthetic, web-scale | Nov/2025 | 🟢 | https://ernie.baidu.com/blog/posts/ernie-5.0-preview-1022-release-on-lmarena/ | MoE | Reasoning | Very low performance on ALPrompt. 2.4T params confirmed: https://global.chinadaily.com.cn/a/202511/13/WS691571bda310d6866eb29500.html | 687 | |||||
25 | GPT-5.1 | OpenAI | https://chatgpt.com/ | 300 | 114000 | 380:1 | 19.5 | 88.1 | synthetic, web-scale | Nov/2025 | 🟢 | https://openai.com/index/gpt-5-1/ | MoE | Reasoning, SOTA | Personality change via fine-tuning. GPQA (no tools) increased from GPT-5=85.7 to GPT-5.1=88.1. | 686 | ||||
26 | JustRL-Nemotron-1.5B | Tsinghua | https://huggingface.co/hbx/JustRL-Nemotron-1.5B | 1.5 | 9000 | 6,000:1 | 0.4 | synthetic, web-scale | Nov/2025 | 🟢 | https://relieved-cafe-fe1.notion.site/JustRL-Scaling-a-1-5B-LLM-with-a-Simple-RL-Recipe-24f6198b0b6b80e48e74f519bfdaf0a8 | Dense | Reasoning | "JustRL, a simple recipe with fixed hyperparameters, achieves state-of-the-art performance on two different 1.5B base models (54.5% and 64.3% across 9 math benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training remains stable over thousands of steps without intervention. This suggests the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline." | 685 | |||||
27 | ERNIE-4.5-VL-28B-A3B-Thinking | Baidu | https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-Thinking | 28 | 15000 | 536:1 | 2.2 | 78.9 | 66 | synthetic, web-scale | Nov/2025 | 🟢 | https://github.com/PaddlePaddle/ERNIE | MoE | Reasoning | 28B-A3B. Open-sourced 12/Nov/2025 from Jun/2025 release. | 684 | |||
28 | HOPE | Google DeepMind | 1.3 | 100 | 77:1 | 0.0 | synthetic, web-scale | Nov/2025 | 🟡 | https://abehrouz.github.io/files/NL.pdf | Dense | | "Combining our self-modifying sequence model with the continuum memory system, we present a learning module, called HOPE, showing promising results in language modeling, continual learning, and long-context reasoning tasks." Announce: https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/ May be released after paper is public. | 683 | ||||||
29 | Kimi K2 Thinking | Moonshot AI | https://kimi.com/ | 1000 | 15500 | 16:1 | 13.1 | 94.4 | 84.6 | 84.5 | 44 | synthetic, web-scale | Nov/2025 | 🟢 | https://moonshotai.github.io/Kimi-K2/thinking.html | MoE | Reasoning, SOTA | 1TA32B. 1T parameters and 384 experts. Open source SOTA. HLE=51.0 on text-only subset, compare to Grok-4 HLE=50.7 also on text-only, but Grok-4 HLE=44.4 on HLE full, ∴ Kimi K2 Thinking HLE≈44 full (estimated). | 682 | |
30 | GEN-0 | Generalist | https://generalistai.com/blog/nov-04-2025-GEN-0 | 10 | 10000 | 1,000:1 | 1.1 | web-scale | Nov/2025 | 🟡 | https://generalistai.com/blog/nov-04-2025-GEN-0 | Dense | SOTA | "GEN-0, a new class of embodied foundation models built for multimodal training directly on high-fidelity raw physical interaction. Its architecture builds on the strengths of vision and language models while also going beyond them—natively designed to capture human-level reflexes and physical commonsense. One core feature is Harmonic Reasoning, in which the models are trained to simultaneously think and act seamlessly... GEN-0 is pretrained on our in-house robotics dataset, which includes over 270,000 hours of real-world diverse manipulation data, growing at a rate of 10,000 hours a week and accelerating." | 681 | |||||
31 | CALM | https://github.com/shaochenze/calm | 1.82 | 230 | 127:1 | 0.1 | web-scale | Oct/2025 | 🟢 | https://arxiv.org/abs/2510.27688 | Dense | | "Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy... We train our models on the Pile uncopyrighted dataset (Gao et al., 2020). The raw text is processed with the Llama 3 tokenizer (Grattafiori et al., 2024), resulting in a training set of ∼230B tokens." | 680 | ||||||
32 | Kimi-Linear | Moonshot AI | https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct | 48 | 5700 | 119:1 | 1.7 | 51 | synthetic, web-scale | Oct/2025 | 🟢 | https://github.com/MoonshotAI/Kimi-Linear?tab=readme-ov-file | MoE | | 48B-A3B. "Kimi Linear is a hybrid linear attention architecture that outperforms traditional full attention methods across various contexts, including short, long, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA)—a refined version of Gated DeltaNet that introduces a more efficient gating mechanism to optimize the use of finite-state RNN memory." | 679 | ||||
33 | MiniMax-M2 | MiniMax | https://huggingface.co/MiniMaxAI/MiniMax-M2 | 230 | 7200 | 32:1 | 4.3 | 82 | 78 | 31.8 | web-scale | Oct/2025 | 🟢 | https://platform.minimax.io/docs/guides/text-generation | MoE | Reasoning | 230B-A10B. | 678 | ||
34 | DeepSeek-OCR | DeepSeek-AI | https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf | 3 | 6000 | 2,000:1 | 0.4 | special | Oct/2025 | 🟢 | https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf | MoE | | 2D vision tokens for 1D text achieves huge compression. Encoder/Decoder: DeepEncoder 380M (80M SAM-base + 300M CLIP-large), DeepSeek-3B-MoE (A570M). | 677 | |||||
35 | UserLM-8b | Microsoft | https://huggingface.co/microsoft/UserLM-8b | 8 | 1000 | 125:1 | 0.3 | WildChat | Oct/2025 | 🟢 | https://huggingface.co/microsoft/UserLM-8b | Dense | | "we trained UserLM-8b to simulate the “user” role in conversation (by training it to predict user turns in a large corpus of conversations called WildChat)." | 676 | |||||
36 | CoDA | Salesforce | https://huggingface.co/Salesforce/CoDA-v0-Instruct | 1.7 | 180 | 106:1 | 0.1 | synthetic, web-scale | Oct/2025 | 🟢 | https://github.com/SalesforceAIResearch/CoDA/blob/main/technical_report.pdf | Dense | Diffusion | "diffusion coder trained on TPU [Google TPU v4-1024 VM]" | 675 | |||||
37 | TRM | Samsung | https://github.com/SamsungSAILMontreal/TinyRecursiveModels | 0.007 | 0.1 | 15:1 | 0.0 | Mazes (ARC-AGI) | Oct/2025 | 🟢 | https://arxiv.org/abs/2510.04871v1 | Dense | | "Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers" | 674 | |||||
38 | Granite-4.0 Small | IBM | https://huggingface.co/ibm-granite/granite-4.0-h-small | 32 | 15000 | 469:1 | 2.3 | 78.33 | 55.47 | 40.63 | synthetic, web-scale | Oct/2025 | 🟢 | https://www.ibm.com/granite/docs/models/granite | MoE | Reasoning | 32B-A9B. Announce: https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models | 673 | ||
39 | GLM-4.6 | Z.AI | https://huggingface.co/zai-org/GLM-4.6 | 355 | 22000 | 62:1 | 9.3 | 82.9 | 30.4 | synthetic, web-scale | Sep/2025 | 🟢 | https://z.ai/blog/glm-4.6 | MoE | Reasoning | 355B-A32B. "context window has been expanded from 128K to 200K tokens" | 672 | |||
40 | Ring-1T-preview | InclusionAI | https://huggingface.co/inclusionAI/Ring-1T-preview | 1000 | 20000 | 20:1 | 14.9 | synthetic, web-scale | Sep/2025 | 🟢 | https://huggingface.co/inclusionAI/Ring-1T-preview | MoE | Reasoning | 1T-A48.5B. | 671 | |||||
41 | Claude Sonnet 4.5 | Anthropic | https://claude.ai/ | 400 | 80000 | 200:1 | 18.9 | 83.4 | synthetic, web-scale | Sep/2025 | 🟢 | https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf | MoE | Reasoning, SOTA | The Claude Sonnet 4.5 "system card" is an absolute farce. Announce: https://www.anthropic.com/news/claude-sonnet-4-5 | 670 | ||||
42 | Gemini Robotics 1.5 | Google DeepMind | 200 | 20000 | 100:1 | 6.7 | 59.6 | synthetic, web-scale | Sep/2025 | 🟢 | https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf | MoE | Reasoning | 2. "vision-language-action (VLA) model turns visual information and instructions into motor commands for a robot to perform a task." Available to select partners. | 669 | |||||
43 | Gemini Robotics-ER 1.5 | Google DeepMind | https://aistudio.google.com/?model=gemini-robotics-er-1.5-preview | 30 | 30000 | 1,000:1 | 3.2 | 83.3 | synthetic, web-scale | Sep/2025 | 🟢 | https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf | MoE | Reasoning | 1. "vision-language model (VLM) reasons about the physical world, natively calls digital tools and creates detailed, multi-step plans to complete a mission." Available to all devs. | 668 | ||||
44 | TimesFM-ICF | Google | https://huggingface.co/collections/google/timesfm-release-66e4be5fdb56e960c1e482a6 | 0.2 | 100 | 500:1 | 0.0 | special | Sep/2025 | 🔴 | https://research.google/blog/time-series-foundation-models-can-be-few-shot-learners/ | Dense | | TimesFM-ICF is 6.8% more accurate than TimesFM (Base). Time-series forecasting only. 'a large pretraining corpus of 100B real world time-points' may be more than 100B tokens. | 667 |
45 | Qwen3-Max | Alibaba | https://chat.qwen.ai/ | 1000 | 36000 | 36:1 | 20.0 | 85.4 | synthetic, web-scale | Sep/2025 | 🟢 | https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2777d&from=research.latest-advancements-list | MoE | Reasoning | "Qwen3-Max-Thinking — still under active training — is already demonstrating remarkable potential. When augmented with tool usage and scaled test-time compute, the Thinking variant has achieved 100% on challenging reasoning benchmarks such as AIME 25 and HMMT. " | 666 | ||||
46 | Qwen3-Omni | Alibaba | https://github.com/QwenLM/Qwen3-Omni?tab=readme-ov-file | 30 | 17000 | 567:1 | 2.4 | 88.8 | 73.1 | synthetic, web-scale | Sep/2025 | 🟢 | https://github.com/QwenLM/Qwen3-Omni/blob/main/assets/Qwen3_Omni.pdf | MoE | Reasoning | "Qwen3-Omni is a unified end-to-end model capable of processing multiple modalities, such as text, audio, image and video, and generating real-time text or speech response."... "pretraining utilizes a large-scale dataset containing approximately 2 trillion tokens, with the following distribution across modalities: text (0.57 trillion), audio (0.77 trillion), image (0.82 trillion), video (0.05 trillion), and video-audio (0.05 trillion)." | 665 | |||
47 | DeepSeek-V3.1-Terminus | DeepSeek-AI | https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus | 685 | 15640 | 22:1 | 10.6 | 85 | 80.7 | 21.7 | synthetic, web-scale | Sep/2025 | 🟢 | https://api-docs.deepseek.com/news/news250922 | MoE | SOTA, Reasoning | Hybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2 | 664 | ||
48 | Isaac 0.1 | Perceptron | https://huggingface.co/PerceptronAI/Isaac-0.1 | 2 | 2000 | 1,000:1 | 0.2 | synthetic, web-scale | Sep/2025 | 🟢 | https://www.perceptron.inc/blog/introducing-isaac-0-1 | Dense | | "perceptive-language model...delivering capabilities that meet or exceed those of models over 50 times its size. Founded by the team behind Meta's Chameleon multimodal models, Perceptron is tackling a fundamental challenge: bringing the power of physical AI to the dynamic, multimodal, and real-time environments we live and work in." | 663 | |||||
49 | Grok 4 Fast | xAI | https://grok.com/ | 3000 | 20000 | 7:1 | 25.8 | 85.7 | 20 | synthetic, web-scale | Sep/2025 | 🟢 | https://x.ai/news/grok-4-fast | MoE | Reasoning, SOTA | "2M token context window, and a unified architecture that blends reasoning and non-reasoning modes in one model." | 662 | |||
50 | VaultGemma | Google DeepMind | https://huggingface.co/google/vaultgemma-1b | 1 | 13000 | 13,000:1 | 0.4 | web-scale | Sep/2025 | 🟢 | https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf | Dense | | "Differential Privacy (DP) has emerged as the gold standard, providing a rigorous, mathematical framework to limit the influence of any single example in the training data on the resulting model. A model trained with DP provably bounds the reconstruction or leakage of information tied to individual data points." Announce: https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/ | 661 | |||||
51 | Qwen3-Next-80B-A3B | Alibaba | https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d | 80 | 15000 | 188:1 | 3.7 | 84.72 | 66.05 | 43.43 | synthetic, web-scale | Sep/2025 | 🟢 | https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list | MoE | Reasoning | "Qwen3-Next introduces several key improvements: a hybrid attention mechanism, a highly sparse Mixture-of-Experts (MoE) structure, training-stability-friendly optimizations, and a multi-token prediction mechanism for faster inference." | 660 | ||
52 | K2-Think | MBZUAI | https://www.k2think.ai/ | 32 | 18000 | 563:1 | 2.5 | 71.08 | 9.95 | synthetic, web-scale | Sep/2025 | 🟢 | https://arxiv.org/abs/2509.07604 | Dense | Reasoning | "Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets." | 659 | |||
53 | mmBERT | JHU | https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4 | 0.307 | 3000 | 9,772:1 | 0.1 | synthetic, web-scale | Sep/2025 | 🟢 | https://arxiv.org/abs/2509.06888 | Dense | | "a modern multilingual encoder trained on 3T tokens and 1833 languages. We introduce several novel elements in training: an inverse masking schedule and a cascading annealed language learning schedule for multilingual data" Announce: https://huggingface.co/blog/mmbert | 658 | |||||
54 | ERNIE X1.1 | Baidu | https://ernie.baidu.com/ | synthetic, web-scale | Sep/2025 | 🟢 | https://www.prnewswire.com/news-releases/baidu-unveils-reasoning-model-ernie-x1-1-with-upgrades-in-key-capabilities-302551170.html | MoE | Reasoning | 657 | ||||||||||
55 | ERNIE-4.5-21B-A3B-Thinking | Baidu | https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking | 21 | 15000 | 715:1 | 1.9 | synthetic, web-scale | Sep/2025 | 🟢 | https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking | MoE | Reasoning | 656 | ||||||
56 | Klear-46B-A2.5B | Kuaishou | https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct | 46 | 22000 | 479:1 | 3.4 | 80.5 | 57.6 | 35.3 | synthetic, web-scale | Sep/2025 | 🟢 | https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct | MoE | | 46B-A2.5B. | 655 | ||
57 | TildeOpen-30b | Tilde AI | https://huggingface.co/TildeAI/TildeOpen-30b | 30 | 2000 | 67:1 | 0.8 | synthetic, web-scale | Sep/2025 | 🟢 | https://tilde.ai/lv/tildeopen-llm/ | Dense | | "language data from across Europe" | 654 | |||||
58 | Qwen3-Max-Preview | Alibaba | https://chat.qwen.ai/ | 1000 | 36000 | 36:1 | 20.0 | 64.6 | synthetic, web-scale | Sep/2025 | 🟢 | https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?type=model&url=2840914_2&modelId=qwen3-max-preview | MoE | | GPQA score is SuperGPQA. "our biggest model yet, with over 1 trillion parameters" | 653 | ||||
59 | Kimi K2-Instruct-0905 | Moonshot AI | https://huggingface.co/moonshotai/Kimi-K2-Instruct | 1000 | 15500 | 16:1 | 13.1 | synthetic, web-scale | Sep/2025 | 🟢 | https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905 | MoE | Reasoning, SOTA | 1TA32B. 1T parameters and 384 experts. Open source SOTA. | 652 | |||||
60 | Apertus | ETH Zürich | https://huggingface.co/swiss-ai/Apertus-70B-Instruct-2509 | 70 | 15000 | 215:1 | 3.4 | 65.2 | 30.6 | synthetic, web-scale | Sep/2025 | 🟢 | https://github.com/swiss-ai/apertus-tech-report/blob/main/Apertus_Tech_Report.pdf | Dense | | "Apertus – Latin for “open”" 1,811 languages. Announce: https://ethz.ch/en/news-and-events/eth-news/news/2025/09/press-release-apertus-a-fully-open-transparent-multilingual-language-model.html | 651 | |||
61 | LongCat-Flash | Meituan | https://longcat.ai/ | 560 | 20000 | 36:1 | 11.2 | 89.71 | 82.68 | 73.23 | synthetic, web-scale | Sep/2025 | 🟢 | https://github.com/meituan-longcat/LongCat-Flash-Chat/blob/main/tech_report.pdf | MoE | Reasoning, SOTA | 560B-A18.6B–31.3B (27B on average). Announce: https://lmsys.org/blog/2025-09-01-sglang-longcat-flash/ | 650 | ||
62 | MAI-1-preview | Microsoft | https://microsoft.ai/news/two-new-in-house-models/ | 500 | 10000 | 20:1 | 7.5 | synthetic, web-scale | Aug/2025 | 🟢 | https://microsoft.ai/news/two-new-in-house-models/ | MoE | | MAI=Microsoft artificial intelligence. "MAI’s first foundation model trained end-to-end... MAI-1-preview is an in-house mixture-of-experts model, pre-trained and post-trained on ~15,000 NVIDIA H100 GPUs. This model is designed to provide powerful capabilities to consumers seeking to benefit from models that specialize in following instructions and providing helpful responses to everyday queries. We will be rolling MAI-1-preview out for certain text use cases within Copilot" | 649 | |||||
63 | grok-code-fast-1 | xAI | https://github.com/features/copilot | 800 | 10000 | 13:1 | 9.4 | synthetic, web-scale | Aug/2025 | 🟢 | https://data.x.ai/2025-08-26-grok-code-fast-1-model-card.pdf | MoE | | "We built grok-code-fast-1 from scratch, starting with a brand-new model architecture. To lay a robust foundation, we carefully assembled a pre-training corpus rich with programming-related content. For post-training, we curated high-quality datasets that reflect real-world pull requests and coding tasks." Announce: https://x.ai/news/grok-code-fast-1 | 648 | |||||
64 | Hermes 4 | Nous Research | https://huggingface.co/NousResearch/Hermes-4-405B-FP8 | 405 | 15656 | 39:1 | 8.4 | 87.2 | 80.5 | 70.5 | synthetic, web-scale | Aug/2025 | 🟢 | https://arxiv.org/abs/2508.18255 | Dense | Reasoning | Based on Llama 3. Announce: https://hermes4.nousresearch.com/ | 647 | ||
65 | Jet-Nemotron-4B | NVIDIA | https://github.com/NVlabs/Jet-Nemotron | 4 | 400 | 100:1 | 0.1 | 65.2 | 44.2 | synthetic, web-scale | Aug/2025 | 🟢 | https://arxiv.org/abs/2508.15884v1 | Dense | Reasoning | "pre-training corpus and train Jet-Nemotron models for 50B tokens. This is also the setting in Section 2 where we perform PostNAS. At the second stage, we include more high-quality data from math [65] and coding [66, 67] domains into our data mixture. The models are then trained on 350B tokens." | 646 | |||
66 | DeepSeek-V3.1-Base | DeepSeek-AI | https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base | 685 | 15640 | 22:1 | 10.6 | 93.7 | 84.8 | 80.1 | 29.8 | synthetic, web-scale | Aug/2025 | 🟢 | https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base | MoE | SOTA, Reasoning | Hybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2 | 645 | |
67 | Nemotron Nano 2 | NVIDIA | https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base | 12.31 | 20000 | 1,625:1 | 1.7 | 78.24 | 63.98 | 64.48 | synthetic, web-scale | Aug/2025 | 🟢 | https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf | Dense | Reasoning | Announce: https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2/ | 644 | ||
68 | Gemma 3 270M | Google DeepMind | https://huggingface.co/google/gemma-3-270m-it | 0.27 | 6000 | 22,223:1 | 0.1 | web-scale | Aug/2025 | 🟢 | https://developers.googleblog.com/en/introducing-gemma-3-270m/ | Dense | | This is a record tokens-to-params ratio (for text models) of 22,223:1. | 643 | |||||
69 | GPT-5 | OpenAI | https://poe.com/GPT-5 | 300 | 114000 | 380:1 | 19.5 | 91 | 89.4 | 42 | synthetic, web-scale | Aug/2025 | 🟢 | https://openai.com/index/gpt-5-system-card/ | MoE | SOTA, Reasoning | Announce: https://openai.com/index/introducing-gpt-5/. MMLU is based on ES and PT translated from EN. | 642 | ||
70 | gpt-oss-120b | OpenAI | https://huggingface.co/openai/gpt-oss-120b | 120 | 30000 | 250:1 | 6.3 | 90 | 80.1 | 19 | synthetic, web-scale | Aug/2025 | 🟢 | https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf | MoE | Reasoning, SOTA | 116.8B total parameters and 5.1B “active” parameters per token per forward pass. https://openai.com/index/introducing-gpt-oss/ | 641 | ||
71 | gpt-oss-20b | OpenAI | https://huggingface.co/openai/gpt-oss-20b | 20 | 13000 | 650:1 | 1.7 | 85.3 | 71.5 | 17.3 | synthetic, web-scale | Aug/2025 | 🟢 | https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf | MoE | Reasoning, SOTA | 20.9B total and 3.6B active parameters. https://openai.com/index/introducing-gpt-oss/ | 640 | ||
72 | Claude Opus 4.1 | Anthropic | https://claude.ai/ | 2000 | 100000 | 50:1 | 47.1 | 80.9 | synthetic, web-scale | Aug/2025 | 🟢 | https://www.anthropic.com/news/claude-opus-4-1 | MoE | Reasoning, SOTA | 639 | |||||
73 | GLM-4.5 | Z.AI | https://huggingface.co/zai-org/GLM-4.5 | 355 | 22000 | 62:1 | 9.3 | 84.6 | 79.1 | 14.4 | synthetic, web-scale | Jul/2025 | 🟢 | https://z.ai/blog/glm-4.5 | MoE | Reasoning | 355B-A32B. | 638 | ||
74 | T1 | China Telecom Artificial Intelligence Research Institute | https://github.com/Tele-AI/T1 | 115 | 10000 | 87:1 | 3.6 | web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2507.18013 | Dense | Reasoning | 637 | ||||||
75 | Intern-S1 | Shanghai AI Laboratory/SenseTime | https://huggingface.co/internlm/Intern-S1 | 235 | 41000 | 175:1 | 10.3 | 83.5 | 77.3 | synthetic, web-scale | Jul/2025 | 🟢 | https://huggingface.co/internlm/Intern-S1 | MoE | Reasoning, SOTA | 41T tokens assumes the Qwen3 base model (36T tokens) plus the additional 5T of multimodal pretraining. "Built upon a 235B MoE language model and a 6B Vision encoder, Intern-S1 has been further pretrained on 5 trillion tokens of multimodal data" | 636 |
76 | Step 3 | StepFun | https://www.stepfun.com/ | 321 | 18000 | 57:1 | 8.0 | 72.9 | web-scale | Jul/2025 | 🟢 | https://github.com/stepfun-ai/Step3/blob/main/Step3-Sys-Tech-Report.pdf | MoE | | 321B-A38B. https://x.com/CyouSakura/status/1948767450751009227 | 635 | ||||
77 | Qwen3-235B-A22B-Thinking-2507 | Alibaba | https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507 | 235 | 36000 | 154:1 | 9.7 | 93.8 | 84.4 | 81.1 | synthetic, web-scale | Jul/2025 | 🟢 | https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507 | MoE | Reasoning | 235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux. | 634 | ||
78 | KAT-V1-200B | Kuaishou | 200 | 10000 | 50:1 | 4.7 | 82.3 | 78.2 | synthetic, web-scale | Jul/2025 | 🔴 | https://arxiv.org/abs/2507.08297 | MoE | Reasoning | 200BA40B. In training as of Jul/2025. "to address the overthinking problem in reasoning-intensive tasks" | 633 | ||||
79 | KAT-V1-40B | Kuaishou | https://huggingface.co/Kwaipilot/KAT-V1-40B | 40 | 10000 | 250:1 | 2.1 | 77.8 | 75.1 | synthetic, web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2507.08297 | Dense | Reasoning | "to address the overthinking problem in reasoning-intensive tasks" | 632 | |||
80 | Qwen3-Coder-480B-A35B-Instruct | Alibaba | https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct | 480 | 36000 | 75:1 | 13.9 | synthetic, web-scale | Jul/2025 | 🟢 | https://qwenlm.github.io/blog/qwen3-coder/ | MoE | | 480B-A35B. | 631 | |||||
81 | Qwen3-235B-A22B-Instruct-2507 | Alibaba | https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507 | 235 | 36000 | 154:1 | 9.7 | 93.1 | 83 | 77.5 | synthetic, web-scale | Jul/2025 | 🟢 | https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507 | MoE | SOTA | 235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux. | 630 | ||
82 | FlexOlmo | Allen AI | https://huggingface.co/allenai/FlexOlmo-7x7B-1T | 37 | 4150 | 113:1 | 1.3 | 60.4 | 30.9 | synthetic, web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2507.07024v1 | MoE | | 37B-A20B. "We adopt the OLMo-2 7B setup, starting from a checkpoint pre-trained on 4T tokens and annealed for 50B tokens to produce a public expert. We then train two additional experts on math and code, each for 50B tokens, and combine them with the public expert to form a three-expert version of FLEXOLMO." | 629 |
83 | EXAONE 4.0 | LG | https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B | 32 | 14000 | 438:1 | 2.2 | 92.3 | 81.8 | 75.4 | web-scale | Jul/2025 | 🟢 | https://www.lgresearch.ai/data/cdn/upload/EXAONE_4_0.pdf | Dense | Reasoning | “EXAONE”=“EXpert AI for EveryONE”. Training tokens/ratio: EXAONE-3 7.8B=8T tokens (Aug/2024) -> EXAONE-3.5 7.8B=9T -> EXAONE-3.5 32B=6.5T tokens -> EXAONE 4.0 32B=14T tokens. MMLU score is MMLU-Redux. Interesting: "To focus [RL] training on more informative data samples, we perform accuracy-based filtering by generating eight responses from the SFT model and excluding samples where all eight responses are correct, a pre-filtering step that removes problems that are easy for the model to avoid inefficient training." | 628 | ||
84 | Kimi K2 | Moonshot AI | https://huggingface.co/moonshotai/Kimi-K2-Instruct | 1000 | 15500 | 16:1 | 13.1 | 89.5 | 81.1 | 75.1 | 4.7 | synthetic, web-scale | Jul/2025 | 🟢 | https://moonshotai.github.io/Kimi-K2/ | MoE | Reasoning, SOTA | 1TA32B. 1T parameters and 384 experts. Open source SOTA. | 627 | |
85 | Reka Flash 3.1 | Reka AI | https://huggingface.co/RekaAI/reka-flash-3.1 | 21 | 5000 | 239:1 | 1.1 | web-scale | Jul/2025 | 🟢 | https://www.reka.ai/news/introducing-reka-flash | Dense | Reasoning | 626 | ||||||
86 | Devstral Medium | Mistral | https://chat.mistral.ai/chat | 50 | 12000 | 240:1 | 2.6 | synthetic, web-scale | Jul/2025 | 🟢 | https://mistral.ai/news/devstral-2507 | Dense | | Non-reasoning. | 625 | |||||
87 | Grok 4 | xAI | https://grok.com/ | 3000 | 80000 | 27:1 | 51.6 | 88.9 | 44.4 | synthetic, web-scale | Jul/2025 | 🟢 | https://lifearchitect.ai/grok/ | MoE | Reasoning, SOTA | 2.4T? https://x.com/kalomaze/status/1942996555088134592 "The smartest AI in the world, 100% on SAT, etc, questions that it's never seen before." | 624 | |||
88 | Phi-4-mini-flash-reasoning | Microsoft | https://huggingface.co/microsoft/Phi-4-mini-flash-reasoning | 3.8 | 5150 | 1,356:1 | 0.5 | synthetic, web-scale | Jul/2025 | 🟢 | https://azure.microsoft.com/en-us/blog/reasoning-reimagined-introducing-phi-4-mini-flash-reasoning/ | Dense | | "Pre-training: 5T tokens; Reasoning training: 150B tokens" "At the core of Phi-4-mini-flash-reasoning is the newly introduced decoder-hybrid-decoder architecture, SambaY, whose central innovation is the Gated Memory Unit (GMU), a simple yet effective mechanism for sharing representations between layers. The architecture includes a self-decoder that combines Mamba (a State Space Model) and Sliding Window Attention (SWA), along with a single layer of full attention. The architecture also involves a cross-decoder that interleaves expensive cross-attention layers with the new, efficient GMUs. This new architecture with GMU modules drastically improves decoding efficiency, boosts long-context retrieval performance and enables the architecture to deliver exceptional performance across a wide range of tasks. " | 623 | |||||
89 | T5Gemma | Google DeepMind | https://huggingface.co/google/t5gemma-9b-9b-ul2-it | 9 | 10000 | 1,112:1 | 1.0 | 76.7 | 55.7 | 40.4 | web-scale | Jul/2025 | 🟢 | https://developers.googleblog.com/en/t5gemma/ | Dense | | Related paper: https://arxiv.org/abs/2504.06225. Dataset was Gemma 2 9B on 8T tokens + 2T tokens adapted. | 622 | ||
90 | MedGemma | Google DeepMind | https://huggingface.co/google/medgemma-27b-it | 27 | 14000 | 519:1 | 2.0 | 87 | web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2507.05201 | Dense | | Multimodal model. Text MMLU score for med only=87.0. | 621 | ||||
91 | R1T2 Chimera | TNG | https://huggingface.co/tngtech/DeepSeek-TNG-R1T2-Chimera | 685 | 14800 | 22:1 | 10.6 | synthetic, web-scale | Jul/2025 | 🟢 | https://arxiv.org/abs/2506.14794 | MoE | | Built with the Assembly-of-Experts method from V3-0324, R1, and R1-0528. Announce: https://x.com/tngtech/status/1940531045432283412?s=46 | 620 |
92 | Spectra 1.1 | Consortium | 3.6 | 1200 | 334:1 | 0.2 | 36.12 | synthetic, web-scale | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.23025 | Dense | | "Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights" | 619 | |||||
93 | DiffuCoder | Apple | https://github.com/apple/ml-diffucoder | 7 | 5630 | 805:1 | 0.7 | code, The Stack | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.20639 | Dense | Diffusion | "We adapt our model from Qwen2.5-Coder (Hui et al., 2024) as the base model to perform continual pre-training using the adaptation approach from Gong et al. (2025). During this pre-training, we use a 400B-token code pre-training corpus from RefineCode (Huang et al., 2024) and Stackv2 (Lozhkov et al., 2024)." | 618 | |||||
94 | Hunyuan-A13B | Tencent | https://huggingface.co/tencent/Hunyuan-A13B-Instruct | 80 | 7000 | 88:1 | 2.5 | 88.17 | 67.23 | 71.2 | synthetic, web-scale | Jun/2025 | 🟢 | https://huggingface.co/tencent/Hunyuan-A13B-Instruct | MoE | | 80B-A13B. 'We have open-sourced Hunyuan-A13B-Pretrain , Hunyuan-A13B-Instruct , Hunyuan-A13B-Instruct-FP8 , Hunyuan-A13B-Instruct-GPTQ-Int4 on Hugging Face.' | 617 | ||
95 | Mercury | Inception | https://chat.inceptionlabs.ai/ | 90 | 8000 | 89:1 | 2.8 | 69 | 51 | 3.4 | synthetic, web-scale | Jun/2025 | 🟢 | https://www.inceptionlabs.ai/introducing-mercury-our-general-chat-model | Dense | Diffusion | Diffusion large language model (dLLM). | 616 | ||
96 | Mu | Microsoft | https://blogs.windows.com/windows-insider/2025/06/13/announcing-windows-11-insider-preview-build-26200-5651-dev-channel/ | 0.5 | 500 | 1,000:1 | 0.1 | synthetic, web-scale | Jun/2025 | 🟢 | https://blogs.windows.com/windowsexperience/2025/06/23/introducing-mu-language-model-and-how-it-enabled-the-agent-in-windows-settings/ | Dense | | "distillation from Microsoft’s Phi models...Mu is an efficient 330M encoder–decoder language model optimized for small-scale deployment, particularly on the NPUs on Copilot+ PCs. It follows a transformer encoder–decoder architecture" | 615 | |||||
97 | Gemini Robotics On-Device | Google DeepMind | https://docs.google.com/forms/u/0/d/1sM5GqcVMWv-KmKY3TOMpVtQ-lDFeAftQ-d9xQn92jCE/viewform?ts=67cef986&edit_requested=true | 20 | 10000 | 500:1 | 1.5 | synthetic, web-scale | Jun/2025 | 🟢 | https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/ | MoE | | See Mar/2025 Gemini Robotics-ER model for comparison. Announce: https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/ | 614 | |||||
98 | ICONN-1 | ICONNAI | https://huggingface.co/ICONNAI/ICONN-1 | 88 | 10000 | 114:1 | 3.1 | synthetic, web-scale | Jun/2025 | 🟢 | MoE | | "ICONN-1 (this version) is optimized for natural, emotionally resonant, and conversational interactions. ICONN-e1 is a specialized variant of the model fine-tuned for advanced reasoning, critical analysis, and complex problem-solving." | 613 | ||||||
99 | MiniMax-M1 | MiniMax | https://huggingface.co/MiniMaxAI/MiniMax-M1-80k | 456 | 7200 | 16:1 | 6.0 | 81.1 | 70 | 8.4 | web-scale | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.13585 | MoE | Reasoning | 456B-A45.9B. Announce: https://www.minimax.io/news/minimaxm1 | 612 | ||
100 | Magistral Medium | Mistral | https://chat.mistral.ai/chat | 50 | 12000 | 240:1 | 2.6 | 70.8 | synthetic, web-scale | Jun/2025 | 🟢 | https://mistral.ai/static/research/magistral.pdf | Dense | Reasoning | Magistral Small=24B. Announce: https://mistral.ai/news/magistral | 611 |