MOVED: https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/edit?gid=1158069878#gid=1158069878

Google corrupted this one...

ALScore is a quick and dirty rating of a model's power. The formula is: square root of (Parameters × Tokens, both in billions) ÷ 300. Any ALScore ≥ 1.0 is a powerful model in mid-2023. A minimal worked example follows.
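Below is a minimal sketch (my own, not an official script) of how the ALScore and Ratio Tokens:Params columns in the table are derived from the Parameters (B) and Tokens trained (B) columns; rounding the ratio up to the nearest integer is an assumption that matches most rows.

```python
# Sketch of the two derived columns in the table (all values in billions).
from math import ceil, sqrt

def alscore(params_b: float, tokens_b: float) -> float:
    """ALScore = sqrt(parameters x tokens) / 300; >= 1.0 is a powerful model in mid-2023."""
    return sqrt(params_b * tokens_b) / 300

def token_param_ratio(params_b: float, tokens_b: float) -> str:
    """Chinchilla-style tokens:parameters ratio (>= 20:1 is roughly compute-optimal)."""
    return f"{ceil(tokens_b / params_b)}:1"

# Spot-checks against rows in the table below:
print(round(alscore(8, 6000), 1), token_param_ratio(8, 6000))        # GPT-4o mini  -> 0.7, 750:1
print(round(alscore(500, 10000), 1), token_param_ratio(500, 10000))  # MAI-1        -> 7.5, 20:1
print(round(alscore(405, 15000), 1), token_param_ratio(405, 15000))  # Llama 3 405B -> 8.2, 38:1
```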
# | Model | Lab | Playground | Parameters (B) | Tokens trained (B) | Ratio Tokens:Params (Chinchilla scaling ≥20:1) | ALScore | MMLU | MMLU-Pro | GPQA | Training dataset | Announced ▼ | Public? | Paper / Repo | Arch | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | Olympus | Amazon | https://lifearchitect.ai/olympus/ | 2000 | 40000 | TBA | New related Titan details: '$65m training run. 200B dense model on 4T tokens of data across 13,760 NVIDIA A100 chips. 48 days to train. Training runs soon to cross $1B' https://importai.substack.com/p/import-ai-365-wmd-benchmark-amazon | ||||||||||
5 | GPT-5 | OpenAI | https://lifearchitect.ai/gpt-5/ | 52500 | TBA | Due 2024. | |||||||||||
6 | GPT-6 | OpenAI | https://lifearchitect.ai/gpt-6/ | TBA | Due 2025. | ||||||||||||
7 | AuroraGPT (ScienceGPT) | Argonne National Laboratory | https://www.hpcwire.com/2023/11/13/training-of-1-trillion-parameter-scientific-ai-begins/ | 1000 | TBA | 🔴 | https://tpc.dev/2023/11/10/tpc-announced-with-founding-partners/ powered by Intel Ponte Vecchio GPUs. | ||||||||||
8 | Grok-2 | xAI | https://twitter.com/elonmusk/status/1773655245769330757 | TBA | Due 2025. | ||||||||||||
9 | MAI-1 | Microsoft | https://arstechnica.com/information-technology/2024/05/microsoft-developing-mai-1-language-model-that-may-compete-with-openai-report/ | 500 | 10000 | 20:1 | 7.5 | TBA | https://www.reuters.com/technology/microsoft-readies-new-ai-model-compete-with-google-openai-information-reports-2024-05-06/ | Dense | Due 2024. MAI=Microsoft artificial intelligence. MSFT CTO statement: https://archive.md/XRSgS | ||||||
10 | GPT-4o mini | OpenAI | https://chatgpt.com/ | 8 | 6000 | 750:1 | 0.7 | 82 | 40.2 | 🆆 📚⬆ 🕸 🌋 | Jul/2024 | 🟢 | https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ | MoE | Omnimodel. "OpenAI would not disclose exactly how large GPT-4o mini is, but said it’s roughly in the same tier as other small AI models, such as Llama 3 8b, Claude Haiku and Gemini 1.5 Flash." https://techcrunch.com/2024/07/18/openai-unveils-gpt-4o-mini-a-small-ai-model-powering-chatgpt/ "tested GPT-4o to identify potential risks, which we have addressed and plan to share the details of in the forthcoming GPT-4o system card and Preparedness scorecard." And related paper about instruction hierarchy: https://arxiv.org/abs/2404.13208 | ||
11 | NeMo | Mistral | https://huggingface.co/mistralai/Mistral-Nemo-Base-2407 | 12 | 2000 | 167:1 | 0.5 | 68 | 🆆 📚⬆ 🕸 🌋 | Jul/2024 | 🟢 | https://mistral.ai/news/mistral-nemo/ | Dense | With NVIDIA. "Drop-in replacement of Mistral 7B". "trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs" https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/ | |||
12 | Codestral Mamba | Mistral | https://huggingface.co/mistralai/mamba-codestral-7B-v0.1 | 7 | 2000 | 286:1 | 0.4 | 🆆 📚⬆ 🕸 🌋 | Jul/2024 | 🟢 | https://mistral.ai/news/codestral-mamba/ | Dense | "Unlike Transformer models, Mamba models offer the advantage of linear time inference and the theoretical ability to model sequences of infinite length." | ||||
13 | Mathstral | Mistral | https://huggingface.co/mistralai/mathstral-7B-v0.1 | 7 | 2000 | 286:1 | 0.4 | 63.47 | 🆆 📚⬆ 🕸 🌋 | Jul/2024 | 🟢 | https://mistral.ai/news/mathstral/ | Dense | "We’re contributing Mathstral to the science community to bolster efforts in advanced mathematical problems requiring complex, multi-step logical reasoning." | |||
14 | SpreadsheetLLM | Microsoft | 1760 | 13000 | 8:1 | 15.9 | 🆆 📚⬆ 🕸 🌋 | Jul/2024 | 🔴 | https://arxiv.org/abs/2407.09025v1 | Dense | Notable finetune of GPT-4-0125-preview: "outperforming the vanilla approach by 25.6% in GPT4’s in-context learning setting" |||||
15 | next-gen | DeepL | https://www.deepl.com/en/translator | 🌋 | Jul/2024 | 🟢 | https://www.deepl.com/en/blog/next-gen-language-model | Dense | "Built using our own groundbreaking, specialized LLM technology and proprietary training data, designed specifically for translation" | ||||||||
16 | SmolLM | Hugging Face | https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966 | 1.7 | 1000 | 589:1 | 0.1 | 39.97 | 🆆 📚⬆ 🕸 🌋 ⚛️ | Jul/2024 | 🟢 | https://huggingface.co/blog/smollm | Dense | Dataset includes new Cosmopedia v2 synthetic data. 135M and 360M models, each trained on 600B tokens from Smollm-Corpus. 1.7B model trained on 1T tokens from Smollm-Corpus. |||
17 | Mockingbird | Vectara | https://vectara.com/platform/ | 9 | 1000 | 112:1 | 0.3 | 🆆 📚⬆ 🕸 🌋 ⚛️ | Jul/2024 | 🟢 | https://vectara.com/blog/mockingbird-a-rag-and-structured-output-focused-llm/ | Dense | "At <10B parameters it's an LLM trained to provide optimal results for RAG and structured outputs." | ||||
18 | FLAMe | Google DeepMind | 24 | 1000 | 42:1 | 0.5 | 👥 | Jul/2024 | 🔴 | https://arxiv.org/abs/2407.10817v1 | Dense | LLM-as-a-Judge autorater. Foundational Large Autorater Models (FLAMe). Uses an instruction-tuned PaLM-2-24B model. Unrelated to Microsoft FLAME Jan/2023. | |||||
19 | H2O-Danube3-4B | H2O.ai | https://h2o.ai/platform/danube/personal-gpt/ | 4 | 6000 | 1,500:1 | 0.5 | 55.18 | 🆆 📚⬆ 🕸 🌋 ⚛️ | Jul/2024 | 🟢 | https://arxiv.org/abs/2407.09276 | Dense | Runs natively and fully offline on mobile phone. "H2O-Danube3 is a family of decoder only LLM models that use the general Llama model architecture adopting core principles from Llama 2 and Mistral with custom parameters determining the shape of each layer and total parameter count. We use the Mistral tokenizer..." MMLU for chat=54.74, base=55.18 via https://huggingface.co/h2oai/h2o-danube3-4b-base | |||
20 | Causal Axioms | Microsoft | 0.067 | 1.2 | 18:1 | 0.0 | ⚛️ | Jul/2024 | 🔴 | https://arxiv.org/abs/2407.07612v1 | Dense | "the training dataset follows a specific structure, we develop a custom tokenizer. Alphanumeric node names are tokenized at a character level, while special terms such as ‘causes’, ‘Does’, ‘cause’, ‘Yes’, and ‘No’ are tokenized at the word level... Our training setup consists of around 175k instances of sequential chains with size of chains ranging from 3 to 6 nodes... All models are trained for 100 epochs. [LifeArchitect.ai estimate is 12 tokens per node x 6 nodes x 175,000 instances x 100 epochs = 1.26B tokens]" Based on GPT-2 arch. (Token estimate checked in the sketch after the table.) |||||
21 | SenseNova 5.5 | SenseTime | https://platform.sensenova.cn/home#/home | 600 | 10000 | 17:1 | 8.2 | ⚛️ | Jul/2024 | 🟢 | https://www.sensetime.com/en/news-detail/51168278?categoryId=1072 | MoE | "The model training was based on over 10TB tokens [sic, taken as 10T tokens instead of 10TB=2T tokens] of high-quality training data, including a large amount of synthetically-generated reasoning chain data, which help to enhance its reasoning capabilities." & "The updates include SenseNova 5o, the first real-time multimodal model in China, which provides a new AI interaction model on par with GPT-4o’s streaming interaction capabilities" | ||||
22 | Helium 7B | Kyutai | https://moshi.chat/ | 7 | 1000 | 143:1 | 0.3 | ⚛️ | Jul/2024 | 🟢 | https://youtu.be/hm2IJSKcYvo | Dense | "1. The model is fine-tuned on 100K transcripts generated by Helium itself. 2. These transcripts are highly detailed, heavily annotated with emotion and style, and conversational. 3. Text to Speech Engine is further fine-tuned on 20 hours of audio recorded by Alice and licensed." | ||||
23 | InternLM2.5 | Shanghai AI Laboratory/SenseTime | https://huggingface.co/collections/internlm/internlm25-66853f32717072d17581bc13 | 7 | 2600 | 372:1 | 0.4 | 72.8 | 38.4 | 🆆 📚⬆ 🕸 🌋 | Jul/2024 | 🟢 | https://github.com/InternLM/InternLM/blob/main/model_cards/internlm2.5_7b.md | Dense | "The release of InternLM2.5 series contains 7B model size for now and we are going to release the 1.8B and 20B versions soon" | ||
24 | Llama 3 405B | Meta AI | https://wabetainfo.com/whatsapp-beta-for-android-2-24-14-7-whats-new/ | 405 | 15000 | 38:1 | 8.2 | 84.8 | 48 | 🆆 📚⬆ 🕸 🌋 | Jun/2024 | 🟡 | Dense | Waiting on release outside of WhatsApp Android as of 1/Jul/2024. | |||
25 | ERNIE 4.0 Turbo | Baidu | https://yiyan.baidu.com/ | 🆆 📚⬆ 🕸 🌋 | Jun/2024 | 🟢 | https://www.reuters.com/technology/artificial-intelligence/baidu-launches-upgraded-ai-model-says-user-base-hits-300-mln-2024-06-28/ | Dense | "Ernie Bot has reached 300 million users since its launch [on 16/Mar/2023, public Aug/2023]" Jun/2024 | ||||||||
26 | Gemma 2 | Google DeepMind | https://huggingface.co/google/gemma-2-27b-it | 27 | 13000 | 482:1 | 2.0 | 75.2 | 🆆 📚⬆ 🕸 🌋 | Jun/2024 | 🟢 | https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf | Dense | Announce: https://blog.google/technology/developers/google-gemma-2/ | |||
27 | CriticGPT | OpenAI | 👥 | Jun/2024 | 🔴 | https://cdn.openai.com/llm-critics-help-catch-llm-bugs-paper.pdf | Dense | "LLM Critics Help Catch LLM Bugs" Announce: https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/ | |||||||||
28 | 4M-21 | Apple | https://github.com/apple/ml-4m/ | 3 | 🌋 | Jun/2024 | 🟢 | https://arxiv.org/abs/2406.09406 | Dense | Vision model based on T5-XXL. Modalities: RGB, Caption, Bounding boxes, Semantic segmentation, Depth, Human poses, Surface normals, CLIP, DINOv2, ImageBind, Metadata, Canny edges, SAM edges, SAM instances, Color palette. Project page: https://4m.epfl.ch/ | |||||||
29 | ESM3 | EvolutionaryScale | https://github.com/evolutionaryscale/esm | 98 | 771 | 8:1 | 0.9 | 🌋 | Jun/2024 | 🟡 | https://www.evolutionaryscale.ai/blog/esm3-release | Dense | Biology large language model: "sequence, structure, and function are all masked and predicted during training, ESM3 can generate in all three modalities." 1.4B only released. | ||||
30 | PanGu 5.0 Super | Huawei | https://www.huaweicloud.com/intl/en-us/product/modelarts.html | 1000 | 20000 | 20:1 | 14.9 | 🌋 | Jun/2024 | 🟡 | https://www.huaweicentral.com/huawei-cloud-unveils-pangu-large-model-5-0/ | MoE | https://x.com/faridofanani96/status/1804079517193113850/photo/1 | ||||
31 | Claude 3.5 Sonnet | Anthropic | https://poe.com/Claude-3.5-Sonnet | 90.4 | 72.83 | 67.2 | 🆆 📚⬆ 🕸 🌋 | Jun/2024 | 🟢 | https://www.anthropic.com/news/claude-3-5-sonnet | Dense | Model card: https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf | |||||
32 | DeepSeek-Coder-V2 | DeepSeek-AI | https://chat.deepseek.com/coder | 236 | 10200 | 35:1 | 4.6 | 79.2 | 63.63 | 🆆 📚⬆ 🕸 🌋 | Jun/2024 | 🟢 | https://github.com/deepseek-ai/DeepSeek-Coder-V2/blob/main/paper.pdf | MoE | DeepSeek-V2 with additional 6 trillion tokens. | ||
33 | DCLM-Baseline 7B 2.6T | International | https://huggingface.co/apple/DCLM-Baseline-7B | 7 | 2600 | 372:1 | 0.4 | 63.7 | 🕸 🌋 | Jun/2024 | 🟡 | https://arxiv.org/abs/2406.11794 | Dense | New dataset of 240T tokens, 8× larger than the previous SOTA dataset. DCLM-Pool is 240T, DCLM-Baseline is 3.8T: "we combine our 3.8T DCLM-BASELINE with the StarCoder and ProofPile2 data to arrive at a 4.1T token dataset. We train a 7B model for 2.5T tokens" and "We release the DCLM benchmark, framework, models, and datasets at https://datacomp.ai/dclm." |||
34 | Nemotron-4-340B | NVIDIA | https://build.nvidia.com/nvidia/nemotron-4-340b-instruct | 340 | 9000 | 27:1 | 5.8 | 81.1 | 🆆 📚⬆ 🕸 🌋 | Jun/2024 | 🟢 | https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T.pdf | Dense | Open-source equiv of Mar/2023 GPT-4 (1760MoE≈340B, 13T), same param count but 2x the tokens of May/2023 PaLM 2 (340B, 3.6T), competitor to Nov/2023 Grok-1 (314B, 6T). Trained on 6,144 H100s. ~1.3TB for inference. 50+ natural and 40+ coding languages. Trained between December 2023 and May 2024. MMLU 0-shot for instruct=78.7, 5-shot for base=81.1. Permalink for paper: https://research.nvidia.com/publication/2024-06_nemotron-4-340b | |||
35 | Apple On-Device model Jun/2024 | Apple | https://github.com/apple/corenet/tree/main/projects/openelm | 3.04 | 1500 | 494:1 | 0.2 | 26.76 | 🆆 📚⬆ 🕸 🌋 | Jun/2024 | 🟢 | https://arxiv.org/abs/2404.14619 | Dense | https://lifearchitect.ai/apple/ Likely to be the Apple OpenELM model (Apr/2024). "two of these models — a ~3 billion parameter on-device language model, and a larger server-based language model available with Private Cloud Compute". https://machinelearning.apple.com/research/introducing-apple-foundation-models The server-based model is possibly Ferret, although it is more properly called a multimodal model (not just language). It could also be Apple GPT based on their Ajax framework: https://archive.md/f3C0r | |||
36 | MatMul-Free LM | UCSC | https://github.com/ridgerchu/matmulfreellm | 2.7 | 100 | 38:1 | 0.1 | 🆆 📚⬆ 🕸 🌋 | Jun/2024 | 🟢 | https://arxiv.org/abs/2406.02528 | Dense | "we explore alternative methods for mixing tokens without relying on matrix multiplications." Compared with Transformer++ based on Llama-2, not to be confused with the pre-GPT-3 American Express Transformer++ paper from 2/Mar/2020. Instead, Transformer++ is defined in the Mamba paper: 'Transformer++: A Transformer with an improved architecture, namely rotary positional encodings (Su et al. 2021) and SwiGLU MLP (Shazeer 2020)' | ||||
37 | Luna | Galileo | https://www.rungalileo.io/blog/introducing-galileo-luna-a-family-of-evaluation-foundation-models | 0.44 | 162 | 369:1 | 0.0 | 🆆 📚⬆ 🕸 🌋 | Jun/2024 | 🟢 | https://arxiv.org/abs/2406.00975 | Dense | Based on DeBERTa-large (440M). Trained on the RoBERTa dataset (162B tokens). |||||
38 | Qwen2 | Alibaba | https://huggingface.co/spaces/Qwen/Qwen2-72B-Instruct | 72 | 7000 | 98:1 | 2.4 | 84.2 | 55.6 | 37.9 | 🆆 📚⬆ 🕸 🌋 | Jun/2024 | 🟢 | https://arxiv.org/abs/2407.10671 | Dense | Instruct MMLU=82. Instruct GPQA=41.9. https://qwenlm.github.io/blog/qwen2/ | |
39 | Qwen2-57B-A14B | Alibaba | https://github.com/QwenLM/Qwen2?tab=readme-ov-file | 57 | 4500 | 79:1 | 1.7 | 76.5 | 43 | 34.3 | 🆆 📚⬆ 🕸 🌋 | Jun/2024 | 🟢 | https://arxiv.org/abs/2407.10671 | MoE | https://qwenlm.github.io/blog/qwen2/ | |
40 | Skywork MoE 16x13B | Kunlun Tech | https://huggingface.co/Skywork/Skywork-MoE-Base | 146 | 77.4 | 🆆 📚⬆ 🕸 🌋 | Jun/2024 | 🟢 | https://github.com/SkyworkAI/Skywork-MoE/blob/main/skywork-moe-tech-report.pdf | MoE | CN + EN. "(MoE) model with 146 billion parameters, 16 experts, and 22 billion activated parameters. This model is initialized from the pre-existing dense checkpoints of our Skywork-13B model." | ||||||
41 | Mamba-2 | CMU | https://github.com/state-spaces/mamba | 2.7 | 300 | 112:1 | 0.1 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://arxiv.org/abs/2405.21060 | Dense | Analysis: https://tridao.me/blog/2024/mamba2-part1-model/ | ||||
42 | MAP-Neo | International | https://map-neo.github.io/ | 7 | 4500 | 643:1 | 0.6 | 58.14 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://arxiv.org/abs/2405.19327 | Dense | "first fully open-sourced bilingual LLM with comparable performance to existing state-of-the-art LLMs... we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided." | |||
43 | K2 | LLM360 | https://huggingface.co/LLM360/K2 | 65 | 1400 | 22:1 | 1.0 | 64.8 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://www.llm360.ai/blog/several-new-releases-to-further-our-mission.html | Dense | "K2-65B is a fully reproducible LLM outperforming Llama 2 70B using 35% less compute." | |||
44 | Codestral | Mistral | https://huggingface.co/mistralai/Codestral-22B-v0.1 | 22 | 2000 | 91:1 | 0.7 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://mistral.ai/news/codestral/ | Dense | Fluent in 80+ programming languages | ||||
45 | Aya-23-35B | Cohere | https://huggingface.co/spaces/CohereForAI/aya-23 | 35 | 4800 | 138:1 | 1.4 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://drive.google.com/file/d/1YKBPo61pnl97C1c_1C2ZVOnPhqf7MLSc/view | Dense | |||||
46 | Yi-XLarge | 01-ai | https://platform.01.ai/ | 2000 | 20000 | 10:1 | 21.1 | 85.1 | 48.2 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://www.aixinzhijie.com/article/6845768 | MoE | Still training as of May/2024: https://appserversrc.8btc.cn/FnDYlEC4STBhphu6M3NL4CKH43FW dead link, use: https://finance.china.com.cn/roll/20240513/6116857.shtml | ||
47 | Yi-Large | 01-ai | https://platform.01.ai/ | 1000 | 15000 | 15:1 | 12.9 | 83.8 | 58.1 | 43.5 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://www.aixinzhijie.com/article/6845768 | Dense | ||
48 | Chameleon | Meta AI | https://ai.meta.com/resources/models-and-libraries/chameleon-downloads/?gk_enable=chameleon_web_flow_is_live | 34 | 9200 | 271:1 | 1.9 | 65.8 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://arxiv.org/abs/2405.09818 | Dense | Multimodal | |||
49 | Sparse Llama 7B | Cerebras | https://huggingface.co/spaces/neuralmagic/llama-2-sparse-transfer-chat-deepsparse | 7 | 145 | 21:1 | 0.1 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://arxiv.org/abs/2405.03594 | Hybrid | https://www.cerebras.net/blog/introducing-sparse-llama-70-smaller-3x-faster-full-accuracy "For the 50% sparse model, we utilized 45 billion tokens of pretraining data, while an additional 100 billion tokens were used for the 70% model. This represents approximately 2% to 8% of the original 2 trillion tokens used to train the base Llama-2 model." | ||||
50 | Gemini 1.5 Flash | Google DeepMind | https://aistudio.google.com/app/prompts/new_chat | 78.9 | 59.1 | 39.5 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://goo.gle/GeminiV1-5 | MoE | 1M context length. | |||||
51 | GPT-4o | OpenAI | https://chatgpt.com/ | 88.7 | 72.6 | 53.6 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://openai.com/index/hello-gpt-4o/ | MoE | Omnimodel. ‘[GPT-4o is] likely an early checkpoint of GPT-5’. https://twitter.com/drjimfan/status/1790089671365767313 ELO: https://twitter.com/LiamFedus/status/1790064963966370209 Demo: https://youtu.be/DQacCB9tDaw | |||||
52 | Falcon 2 11B | TII | https://huggingface.co/tiiuae/falcon-11B | 11 | 5500 | 500:1 | 0.8 | 58.37 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://www.tii.ae/news/falcon-2-uaes-technology-innovation-institute-releases-new-ai-model-series-outperforming-metas | Dense | Announce: https://www.tii.ae/news/falcon-2-uaes-technology-innovation-institute-releases-new-ai-model-series-outperforming-metas | |||
53 | Fugaku-LLM | Fujitsu | https://huggingface.co/Fugaku-LLM/Fugaku-LLM-13B-instruct | 13 | 380 | 30:1 | 0.2 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://www.fujitsu.com/global/about/resources/news/press-releases/2024/0510-01.html | Dense | Japanese. CPU trained: 158,976+ A64FX CPUs (7M+ cores), zero GPUs. https://en.wikipedia.org/wiki/Fugaku_(supercomputer) | ||||
54 | Yi 1.5 34B | 01-ai | https://huggingface.co/01-ai/Yi-1.5-34B-Chat | 34.4 | 3600 | 105:1 | 1.2 | 76.8 | 52.3 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://github.com/01-ai/Yi-1.5 | Dense | Uses 600B more training tokens than Yi 1.0 (Nov/2023). | ||
55 | YOCO | Microsoft | https://github.com/microsoft/unilm/tree/master/YOCO | 3 | 1600 | 534:1 | 0.2 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://arxiv.org/abs/2405.05254 | Dense | With Tsinghua. You Only Cache Once (YOCO). Long context: "1M context length with near-perfect needle retrieval accuracy" ||||
56 | DeepSeek-V2 | DeepSeek-AI | https://chat.deepseek.com/ | 236 | 8100 | 35:1 | 4.6 | 78.5 | 54.8 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://arxiv.org/abs/2405.04434 | MoE | Huge dataset, 12% Chinese "Therefore, we acknowledge that DeepSeek-V2 still has a slight gap in basic English capabilities with LLaMA3 70B". | ||
57 | ChuXin | Independent | https://huggingface.co/chuxin-llm/Chuxin-1.6B-Base | 1.6 | 2300 | 1,438:1 | 0.2 | 41.07 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://arxiv.org/abs/2405.04828 | Dense | "results on the 'Needle In A Haystack' (NIAH) tests indicate that ChuXin-1M performs well across all context window lengths up to 1M." |||
58 | RWKV-v6 Finch | RWKV | https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-2 | 7.63 | 2500 | 328:1 | 0.5 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://huggingface.co/BlinkDL/rwkv-6-world | Dense | https://twitter.com/BlinkDL_AI/status/1787834625211158562 | ||||
59 | xLSTM | ELLIS | 2.7 | 15 | 6:1 | 0.0 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🔴 | https://arxiv.org/abs/2405.04517 | Dense | New method extending LSTM to xLSTM; see also RNNs. Code/weights do not appear to have been released. https://github.com/AI-Guru/xlstm-resources |||||
60 | Granite Code | IBM | https://github.com/ibm-granite/granite-code-models | 34 | 3500 | 103:1 | 1.1 | 🌋 | May/2024 | 🟢 | https://github.com/ibm-granite/granite-code-models/blob/main/paper.pdf | Dense | Dataset: publicly available datasets (e.g., GitHub Code Clean, Starcoder data), public code repositories, and issues from GitHub. | ||||
61 | Qwen-Max | Alibaba | https://chat.lmsys.org/ | 300 | 6000 | 20:1 | 4.5 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🟢 | https://help.aliyun.com/zh/dashscope/developer-reference/model-introduction | Dense | https://twitter.com/JustinLin610/status/1787584325367529509 | ||||
62 | Med-Gemini-L 1.0 | Google DeepMind | https://twitter.com/alan_karthi/status/1785117450528264216 | 1500 | 30000 | 20:1 | 22.4 | 🆆 📚⬆ 🕸 🌋 | May/2024 | 🔴 | https://arxiv.org/abs/2404.18416 | Dense | Med-Gemini-M 1.0 and Med-Gemini-L 1.0 (Pro and Ultra finetunes) "For language tasks that require less complex reasoning, such as summarizing medical notes and creating referral letters, we introduce Med-Gemini-M 1.0 by fine-tuning the Gemini 1.0 Pro model. For other tasks that require more advanced reasoning, we introduce Med-Gemini-L 1.0 by fine-tuning the Gemini 1.0 Ultra model using a self-training method to enable the models to efficiently use web search." | ||||
63 | Tele-FLM | BAAI | https://huggingface.co/CofeAI/Tele-FLM | 52 | 2000 | 39:1 | 1.1 | 64 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://arxiv.org/abs/2404.16645 | Dense | Also known as FLM-2. "We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research." Discussion paper Jul/2024: https://arxiv.org/abs/2407.02783 | |||
64 | Qwen-1.5 110B | Alibaba | https://huggingface.co/spaces/Qwen/Qwen1.5-110B-Chat-demo | 111 | 3000 | 28:1 | 1.9 | 80.4 | 49.9 | 35.9 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://qwenlm.github.io/blog/qwen1.5-110b/ | Dense | Worse performance on GPQA (72B=36.3, 110B=35.9). | |
65 | Arctic | Snowflake AI Research | https://arctic.streamlit.app/ | 480 | 3500 | 8:1 | 4.3 | 67.3 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/ | Hybrid | "Arctic uses a unique Dense-MoE Hybrid transformer architecture. It combines a 10B dense transformer model with a residual 128×3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating." (Parameter arithmetic checked in the sketch after the table.) |||
66 | SenseNova 5.0 | SenseTime | 600 | 10000 | 17:1 | 8.2 | 84.78 | 42.93 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://news.futunn.com/en/post/41290101/a-large-shangtang-multi-modal-model-with-600-billion-parameters | MoE | GPT-4 scale; low media coverage; no demo in Western world. https://www.techinasia.com/sensetime-pauses-trading-stock-rises-30-model-launch | |||
67 | OpenELM | Apple | https://huggingface.co/apple/OpenELM-3B-Instruct | 3.04 | 1500 | 494:1 | 0.2 | 26.76 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://arxiv.org/abs/2404.14619 | Dense | On-device model (laptop, phone). Open-source Efficient Language Models (OpenELM). https://venturebeat.com/ai/apple-releases-openelm-small-open-source-ai-models-designed-to-run-on-device/ | |||
68 | phi-3-medium | Microsoft | https://huggingface.co/microsoft/Phi-3-medium-128k-instruct | 14 | 4800 | 343:1 | 0.9 | 78.2 | 55.7 | ⚛️ | Apr/2024 | 🟢 | https://arxiv.org/abs/2404.14219 | Dense | Preview only, benchmarks being investigated as of May/2024. | ||
69 | phi-3-mini | Microsoft | https://huggingface.co/microsoft/Phi-3-mini-128k-instruct | 3.8 | 3300 | 869:1 | 0.4 | 68.8 | 45.7 | ⚛️ | Apr/2024 | 🟢 | https://arxiv.org/abs/2404.14219 | Dense | "phi3-mini can be quantized to 4-bits so that it only occupies ≈ 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on iPhone 14 with A16 Bionic chip running natively on-device and fully offline achieving more than 12 tokens per second." | ||
70 | Llama 3 70B | Meta AI | https://meta.ai/ | 70 | 15000 | 215:1 | 3.4 | 82 | 52.8 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://ai.meta.com/blog/meta-llama-3/ | Dense | Instruct MMLU-Pro=56.2 | ||
71 | HLAT | Amazon | 7 | 1800 | 258:1 | 0.4 | 41.318 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🔴 | https://arxiv.org/abs/2404.10630 | Dense | HLAT=High-quality LLM pre-trained on AWS Trainium. Same arch as Llama 7B. Pre-training was performed on up to 64 Amazon EC2 trn1.32xlarge instances, totalling up to 1,024 AWS Trainium accelerators. Read more about Trainium: https://www.aboutamazon.com/news/aws/what-you-need-to-know-about-the-aws-ai-chips-powering-amazons-partnership-with-anthropic |||||
72 | Idefics2 | Hugging Face | https://huggingface.co/HuggingFaceM4/idefics2-8b | 8.4 | 🆆 🕸 | Apr/2024 | 🟢 | https://huggingface.co/blog/idefics2 | Dense | Clone of Flamingo now using Mistral 7B. Named after Asterix and Obelix's dog Idefix (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS) | |||||||
73 | Reka Core | Reka AI | https://poe.com/RekaCore | 300 | 10000 | 34:1 | 5.8 | 83.2 | 38.2 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://publications.reka.ai/reka-core-tech-report.pdf | Dense | https://www.reka.ai/news/reka-core-our-frontier-class-multimodal-language-model | ||
74 | WizardLM-2-8x22B | Microsoft | https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF | 141 | 2000 | 15:1 | 1.8 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://wizardlm.github.io/WizardLM2/ | MoE | Base model = mistral-8x22b. | ||||
75 | Pile-T5 | EleutherAI | https://huggingface.co/EleutherAI/pile-t5-xxl | 11 | 2000 | 182:1 | 0.5 | 53.84 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://blog.eleuther.ai/pile-t5/ | Dense | ||||
76 | Zephyr 141B-A35B | Hugging Face H4 | https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1 | 35 | 2000 | 58:1 | 0.9 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://arxiv.org/abs/2403.07691 | MoE | mixtral-8x22b finetune using Odds Ratio Preference Optimization (ORPO). | ||||
77 | Rerank 3 | Cohere | https://docs.cohere.com/reference/rerank-1 | 104 | 4000 | 39:1 | 2.1 | 📚 🕸 | Apr/2024 | 🟢 | https://txt.cohere.com/rerank-3/ | Dense | RAG + semantic search, possibly backed by Command-R+. | ||||
78 | gpt-4-turbo-2024-04-09 | OpenAI | https://chat.openai.com/ | 86.5 | 63.7 | 49.1 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://cdn.openai.com/papers/gpt-4.pdf | MoE | This is such a significantly better model that I've added it here. This GPQA=46.5%, old GPT-4 GPQA=36%. https://twitter.com/EpochAIResearch/status/1778463039932584205 MMLU scores are unclear, but may have improved by 1%: https://twitter.com/OpenAI/status/1778602770784002136. Final benchmarks are here: https://archive.md/6Cc0Z | |||||
79 | MiniCPM-2.4B | Tsinghua | https://github.com/OpenBMB/MiniCPM/ | 2.4 | 1100 | 459:1 | 0.2 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://arxiv.org/abs/2404.06395 | Dense | MoE option=https://huggingface.co/openbmb/MiniCPM-MoE-8x2B | ||||
80 | Ferret-UI | Apple | https://github.com/apple/ml-ferret | 13 | 2000 | 154:1 | 0.5 | 🆆 📚⬆ 🕸 👥 | Apr/2024 | 🟢 | https://arxiv.org/abs/2404.05719 | Dense | Vicuna base, multimodal. Extension of Ferret from Oct/2023. | ||||
81 | mixtral-8x22b | Mistral | https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1 | 141 | 2000 | 15:1 | 1.8 | 77.75 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://mistral.ai/news/mixtral-8x22b/ | MoE | MoE=22Bx8, seq=65536. | |||
82 | Sailor | Sail | https://huggingface.co/sail | 7 | 200 | 29:1 | 0.1 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://arxiv.org/abs/2404.03608v1 | Dense | SEA languages. Based on Qwen-1.5. https://github.com/sail-sg/sailor-llm "Generally Sailor models consume around 200B tokens, completing a full pass through the SailCraft corpus once. However, the Sailor-0.5B model undergoes training with 400B tokens, equivalent to 2 epochs." | ||||
83 | JetMoE-8B | MIT | https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat | 8 | 1250 | 157:1 | 0.3 | 49.2 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://huggingface.co/jetmoe/jetmoe-8b | MoE | ||||
84 | Eurus | Tsinghua | https://huggingface.co/collections/openbmb/eurus-660bc40bec5376b3adc9d1c5 | 70 | 2000 | 29:1 | 1.2 | 🆆 📚⬆ 🕸 🌋 | Apr/2024 | 🟢 | https://huggingface.co/collections/openbmb/eurus-660bc40bec5376b3adc9d1c5 | Dense | Fine-tune of Mistral-7B and CodeLlama-70B. | ||||
85 | Command-R+ | Cohere | https://huggingface.co/spaces/CohereForAI/c4ai-command-r-plus | 104 | 4000 | 39:1 | 2.1 | 75.7 | 📚 🕸 | Apr/2024 | 🟢 | https://huggingface.co/CohereForAI/c4ai-command-r-plus | Dense | purpose-built to excel at real-world enterprise use cases. Announce with no arch details: https://txt.cohere.com/command-r-plus-microsoft-azure/ | |||
86 | Viking | Silo AI | 33 | 2000 | 61:1 | 0.9 | 🌋 | Apr/2024 | 🟢 | https://www.silo.ai/blog/viking-7b-13b-33b-sailing-the-nordic-seas-of-multilinguality | Dense | 'Viking uses an architecture similar to Llama 2, with flash attention, rotary embeddings, grouped query attention and supports a 4k sequence length' |||||
87 | OLMo-Bitnet-1B | Nous Research | https://huggingface.co/NousResearch/OLMo-Bitnet-1B | 1 | 60 | 60:1 | 0.0 | 🌋 | Apr/2024 | 🟢 | https://arxiv.org/abs/2402.17764 | Dense | 1.58-bit quantized (ternary weights) means we can run a 70B model in ~14GB VRAM. See also BitNet b1.58 | ||||
88 | Aurora-M | International | https://huggingface.co/collections/aurora-m/aurora-m-models-65fdfdff62471e09812f5407 | 15.5 | 2035 | 132:1 | 0.6 | 🌋 | Mar/2024 | 🟢 | https://arxiv.org/abs/2404.00399 | Dense | |||||
89 | ReALM-3B | Apple | 3 | 134 | 45:1 | 0.1 | 🌋 | Mar/2024 | 🔴 | https://arxiv.org/abs/2403.20329 | Dense | FLAN-T5 (Oct/2022) finetune. | |||||
90 | Qwen1.5-MoE-A2.7B | Alibaba | https://qwenlm.github.io/blog/qwen-moe/ | 14.3 | 1500 | 105:1 | 0.5 | 62.5 | 🆆 📚⬆ 🕸 🌋 | Mar/2024 | 🟢 | https://qwenlm.github.io/blog/qwen-moe/ | MoE | MoE. "Of particular significance is the fact that, through upcycling, the necessity for training an equivalent volume of tokens as in the original model has been eliminated." I assumed half of the original 3T tokens. |||
91 | Grok-1.5 | xAI | https://grok.x.ai/ | 314 | 6000 | 20:1 | 4.6 | 81.3 | 🆆 📚⬆ 🕸 🌋 | Mar/2024 | 🟢 | https://x.ai/blog/grok-1.5 | Dense | Context=128k. | |||
92 | Jamba | AI21 | https://huggingface.co/ai21labs/Jamba-v0.1 | 52 | 5000 | 97:1 | 1.7 | 67.4 | 🆆 📚⬆ 🕸 🌋 | Mar/2024 | 🟢 | https://arxiv.org/abs/2403.19887 | MoE | MoE. Open weights, licensed under Apache 2.0. Announce: https://arxiv.org/abs/2403.19887 | |||
93 | DBRX | MosaicML | https://huggingface.co/spaces/databricks/dbrx-instruct | 132 | 12000 | 91:1 | 4.2 | 73.7 | 🆆 📚⬆ 🕸 🌋 | Mar/2024 | 🟢 | https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm | MoE | MoE. Trained for $10M on 3,072 NVIDIA H100s connected by 3.2Tbps Infiniband. | |||
94 | Stable Code Instruct 3B | Stability AI | https://huggingface.co/stabilityai/stable-code-instruct-3b | 2.7 | 560 | 208:1 | 0.1 | 🌋 | Mar/2024 | 🟢 | https://stability.ai/news/introducing-stable-code-instruct-3b | Dense | Context window=16,384. Trained on The Stack dataset. | ||||
95 | EvoLLM-JP | Sakana AI | https://huggingface.co/SakanaAI/EvoLLM-JP-v1-10B | 10 | 800 | 80:1 | 0.3 | 🆆 📚⬆ 🕸 🌋 | Mar/2024 | 🟢 | https://arxiv.org/abs/2403.13187 | Dense | Japanese. Model merge 'our EvoLLM-JP-A is a merge of shisa-gamma-7b-v1, Arithmo2-Mistral-7B, and Abel7B-002' https://sakana.ai/evolutionary-model-merge/ | ||||
96 | RakutenAI-7B | Rakuten Group | https://huggingface.co/Rakuten/RakutenAI-7B | 7 | 3000 | 429:1 | 0.5 | 61.31 | 🆆 📚⬆ 🕸 🌋 | Mar/2024 | 🟢 | https://arxiv.org/abs/2403.15484 | Dense | Japanese. Mistral 7B derivative. | |||
97 | Parakeet | Independent | https://colab.research.google.com/drive/1gI8CM9Bz9ov0-E6aL2jF808rE56UtZyF?usp=sharing | 0.378 | 3 | 8:1 | 0.0 | 🆆 📚⬆ 🕸 🌋 | Mar/2024 | 🟢 | https://news.ycombinator.com/item?id=39745700#39745702 | Dense | Tiny model (378M) for testing | ||||
98 | RWKV-v5 EagleX | RWKV | https://huggingface.co/recursal/EagleX_1-7T | 7.52 | 1700 | 227:1 | 0.4 | 40.14 | 🆆 📚⬆ 🕸 🌋 | Mar/2024 | 🟢 | https://substack.recursal.ai/p/eaglex-17t-soaring-past-llama-7b | Dense | Built on the RWKV-v5 architecture (a linear transformer with 10-100x+ lower inference cost) | |||
99 | MM1 | Apple | 30 | 2010 | 67:1 | 0.8 | 🌋 | Mar/2024 | 🔴 | https://arxiv.org/abs/2403.09611 | Dense | VLM, outperforms Flamingo 80B (Apr/2022) across benchmarks. 2T text tokens + ~10B+ other text (estimate). Unreleased. | |||||
100 | RFM-1 | Covariant | https://vimeo.com/921866765 | 8 | 160 | 20:1 | 0.1 | 🆆 📚⬆ 🕸 🌋 | Mar/2024 | 🟡 | https://covariant.ai/insights/introducing-rfm-1-giving-robots-human-like-reasoning-capabilities/ | Dense | Commercial, multimodal for robotics |
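The two back-of-the-envelope figures flagged above (Arctic, row 65, and Causal Axioms, row 20) can be reproduced in a few lines. This is my own check under the assumptions stated in those Notes cells, not code from either paper.

```python
# Sketch: verify two arithmetic claims quoted in the Notes column.

# Snowflake Arctic (row 65): 10B dense trunk + residual 128 x 3.66B MoE MLP,
# top-2 gating -> ~480B total, ~17B active parameters.
dense_b, experts, expert_b, top_k = 10, 128, 3.66, 2
total_b = dense_b + experts * expert_b   # ~478.5B, reported as ~480B
active_b = dense_b + top_k * expert_b    # ~17.3B, reported as ~17B
print(f"Arctic: total ~{total_b:.1f}B, active ~{active_b:.1f}B")

# Microsoft Causal Axioms (row 20): LifeArchitect.ai token estimate.
tokens_per_node, nodes, instances, epochs = 12, 6, 175_000, 100
est_tokens = tokens_per_node * nodes * instances * epochs
print(f"Causal Axioms: ~{est_tokens / 1e9:.2f}B tokens")  # ~1.26B
```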