A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | (377) Permalink: | https://lifearchitect.ai/models-table/ | Timeline view: | https://lifearchitect.ai/timeline | The Memo: | https://lifearchitect.ai/memo | |||||||||||
2 | Model | Lab | Playground | Parameters (B) | Tokens trained (B) | Ratio Tokens:Params (Chinchilla scaling ≥20:1) | ALScore ("ALScore" is a quick-and-dirty rating of the model's power. Formula: square root of (Parameters × Tokens) ÷ 300. Any ALScore ≥ 1.0 is a powerful model in mid-2023.) | MMLU | MMLU-Pro | GPQA | Training dataset | Announced ▼ | Public? | Paper / Repo | Arch | Notes | 
3 | Olympus | Amazon | https://lifearchitect.ai/olympus/ | 2000 | 40000 | TBA | New related Titan details: '$65m training run. 200B dense model on 4T tokens of data across 13,760 NVIDIA A100 chips. 48 days to train. Training runs soon to cross $1B' https://importai.substack.com/p/import-ai-365-wmd-benchmark-amazon | ||||||||||
4 | GPT-5 | OpenAI | https://lifearchitect.ai/gpt-5/ | 52500 | TBA | Due 2024. | |||||||||||
5 | GPT-6 | OpenAI | https://lifearchitect.ai/gpt-6/ | TBA | Due 2025. | ||||||||||||
6 | AuroraGPT (ScienceGPT) | Argonne National Laboratory | https://www.hpcwire.com/2023/11/13/training-of-1-trillion-parameter-scientific-ai-begins/ | 1000 | TBA | π΄ | https://tpc.dev/2023/11/10/tpc-announced-with-founding-partners/ powered by Intel Ponte Vecchio GPUs. | ||||||||||
7 | Grok-2 | xAI | https://twitter.com/elonmusk/status/1773655245769330757 | TBA | Due 2025. | ||||||||||||
8 | MAI-1 | Microsoft | https://arstechnica.com/information-technology/2024/05/microsoft-developing-mai-1-language-model-that-may-compete-with-openai-report/ | 500 | 10000 | 20:1 | 7.5 | TBA | https://www.reuters.com/technology/microsoft-readies-new-ai-model-compete-with-google-openai-information-reports-2024-05-06/ | Dense | Due 2024. MAI=Microsoft artificial intelligence. MSFT CTO statement: https://archive.md/XRSgS | ||||||
9 | Causal Axioms | Microsoft | 0.067 | 1.2 | 18:1 | 0.0 | βοΈ | Jul/2024 | π΄ | https://arxiv.org/abs/2407.07612v1 | Dense | "the training dataset follows a specific structure, we develop a custom tokenizer. Alphanumeric node names are tokenized at a character level, while special terms such as 'causes', 'Does', 'cause', 'Yes', and 'No' are tokenized at the word level... Our training setup consists of around 175k instances of sequential chains with size of chains ranging from 3 to 6 nodes... All models are trained for 100 epochs. [LifeArchitect.ai estimate is 12 tokens per node x 6 nodes x 175,000 instances x 100 epochs = 1.26B tokens]" Based on GPT-2 arch.
10 | SenseNova 5.5 | SenseTime | https://platform.sensenova.cn/home#/home | 600 | 10000 | 17:1 | 8.2 | βοΈ | Jul/2024 | π’ | https://www.sensetime.com/en/news-detail/51168278?categoryId=1072 | MoE | "The model training was based on over 10TB tokens [sic, taken as 10T tokens instead of 10TB=2T tokens] of high-quality training data, including a large amount of synthetically-generated reasoning chain data, which help to enhance its reasoning capabilities." & "The updates include SenseNova 5o, the first real-time multimodal model in China, which provides a new AI interaction model on par with GPT-4o's streaming interaction capabilities"
11 | InternLM2.5 | Shanghai AI Laboratory/SenseTime | https://huggingface.co/collections/internlm/internlm25-66853f32717072d17581bc13 | 7 | 2600 | 372:1 | 0.4 | 72.8 | 38.4 | π πβ¬ πΈ π | Jul/2024 | π’ | https://github.com/InternLM/InternLM/blob/main/model_cards/internlm2.5_7b.md | Dense | "The release of InternLM2.5 series contains 7B model size for now and we are going to release the 1.8B and 20B versions soon" | ||
12 | Llama 3 405B | Meta AI | https://wabetainfo.com/whatsapp-beta-for-android-2-24-14-7-whats-new/ | 405 | 15000 | 38:1 | 8.2 | 84.8 | 48 | π πβ¬ πΈ π | Jun/2024 | π‘ | Dense | Waiting on release outside of WhatsApp Android as of 1/Jul/2024. | |||
13 | ERNIE 4.0 Turbo | Baidu | https://yiyan.baidu.com/ | π πβ¬ πΈ π | Jun/2024 | π’ | https://www.reuters.com/technology/artificial-intelligence/baidu-launches-upgraded-ai-model-says-user-base-hits-300-mln-2024-06-28/ | Dense | "Ernie Bot has reached 300 million users since its launch [on 16/Mar/2023, public Aug/2023]" Jun/2024 | ||||||||
14 | Gemma 2 | Google DeepMind | https://huggingface.co/google/gemma-2-27b-it | 27 | 13000 | 482:1 | 2.0 | 75.2 | π πβ¬ πΈ π | Jun/2024 | π’ | https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf | Dense | Announce: https://blog.google/technology/developers/google-gemma-2/ | |||
15 | CriticGPT | OpenAI | π₯ | Jun/2024 | π΄ | https://cdn.openai.com/llm-critics-help-catch-llm-bugs-paper.pdf | Dense | "LLM Critics Help Catch LLM Bugs" Announce: https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/ | |||||||||
16 | 4M-21 | Apple | https://github.com/apple/ml-4m/ | 3 | π | Jun/2024 | π’ | https://arxiv.org/abs/2406.09406 | Dense | Vision model based on T5-XXL. Modalities: RGB, Caption, Bounding boxes, Semantic segmentation, Depth, Human poses, Surface normals, CLIP, DINOv2, ImageBind, Metadata, Canny edges, SAM edges, SAM instances, Color palette. Project page: https://4m.epfl.ch/ | |||||||
17 | ESM3 | EvolutionaryScale | https://github.com/evolutionaryscale/esm | 98 | 771 | 8:1 | 0.9 | π | Jun/2024 | π‘ | https://www.evolutionaryscale.ai/blog/esm3-release | Dense | Biology large language model: "sequence, structure, and function are all masked and predicted during training, ESM3 can generate in all three modalities." Only the 1.4B version has been released.
18 | PanGu 5.0 Super | Huawei | https://www.huaweicloud.com/intl/en-us/product/modelarts.html | 1000 | 20000 | 20:1 | 14.9 | π | Jun/2024 | π‘ | https://www.huaweicentral.com/huawei-cloud-unveils-pangu-large-model-5-0/ | MoE | https://x.com/faridofanani96/status/1804079517193113850/photo/1 | ||||
19 | Claude 3.5 Sonnet | Anthropic | https://poe.com/Claude-3.5-Sonnet | 90.4 | 72.83 | 67.2 | π πβ¬ πΈ π | Jun/2024 | π’ | https://www.anthropic.com/news/claude-3-5-sonnet | Dense | Model card: https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf | |||||
20 | DeepSeek-Coder-V2 | DeepSeek-AI | https://chat.deepseek.com/coder | 236 | 10200 | 44:1 | 5.2 | 79.2 | 63.63 | π πβ¬ πΈ π | Jun/2024 | π’ | https://github.com/deepseek-ai/DeepSeek-Coder-V2/blob/main/paper.pdf | MoE | DeepSeek-V2 with an additional 6 trillion tokens.
21 | DCLM-Baseline 7B 2.6T | International | https://www.datacomp.ai/dclm/ | 7 | 2600 | 372:1 | 0.4 | 63.7 | πΈ π | Jun/2024 | π‘ | https://arxiv.org/abs/2406.11794 | Dense | New dataset: DCLM-Pool, 240T tokens (8× larger than the previous SOTA dataset); DCLM-Baseline is 3.8T: "we combine our 3.8T DCLM-BASELINE with the StarCoder and ProofPile2 data to arrive at a 4.1T token dataset. We train a 7B model for 2.5T tokens" and "We release the DCLM benchmark, framework, models, and datasets at https://datacomp.ai/dclm." Model not yet available.
22 | Nemotron-4-340B | NVIDIA | https://build.nvidia.com/nvidia/nemotron-4-340b-instruct | 340 | 9000 | 27:1 | 5.8 | 81.1 | π πβ¬ πΈ π | Jun/2024 | π’ | https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T.pdf | Dense | Open-source equivalent of Mar/2023 GPT-4 (1760B MoE → 340B, 13T); same param count but 2× the tokens of May/2023 PaLM 2 (340B, 3.6T); competitor to Nov/2023 Grok-1 (314B, 6T). Trained on 6,144 H100s. ~1.3TB for inference. 50+ natural and 40+ coding languages. Trained between December 2023 and May 2024. MMLU 0-shot for instruct=78.7, 5-shot for base=81.1. Permalink for paper: https://research.nvidia.com/publication/2024-06_nemotron-4-340b
23 | Apple On-Device model Jun/2024 | Apple | https://github.com/apple/corenet/tree/main/projects/openelm | 3.04 | 1500 | 494:1 | 0.2 | 26.76 | π πβ¬ πΈ π | Jun/2024 | π’ | https://arxiv.org/abs/2404.14619 | Dense | https://lifearchitect.ai/apple/ Likely to be the Apple OpenELM model (Apr/2024). "two of these models β a ~3 billion parameter on-device language model, and a larger server-based language model available with Private Cloud Compute". https://machinelearning.apple.com/research/introducing-apple-foundation-models The server-based model is possibly Ferret, although it is more properly called a multimodal model (not just language). It could also be Apple GPT based on their Ajax framework: https://archive.md/f3C0r | |||
24 | MatMul-Free LM | UCSC | https://github.com/ridgerchu/matmulfreellm | 2.7 | 100 | 38:1 | 0.1 | π πβ¬ πΈ π | Jun/2024 | π’ | https://arxiv.org/abs/2406.02528 | Dense | "we explore alternative methods for mixing tokens without relying on matrix multiplications." Compared with Transformer++ based on Llama-2, not to be confused with the pre-GPT-3 American Express Transformer++ paper from 2/Mar/2020. Instead, Transformer++ is defined in the Mamba paper: 'Transformer++: A Transformer with an improved architecture, namely rotary positional encodings (Su et al. 2021) and SwiGLU MLP (Shazeer 2020)' | ||||
25 | Luna | Galileo | https://www.rungalileo.io/blog/introducing-galileo-luna-a-family-of-evaluation-foundation-models | 0.44 | 162 | 369:1 | 0.0 | π πβ¬ πΈ π | Jun/2024 | π’ | https://arxiv.org/abs/2406.00975 | Dense | Based on DeBERTA-large (440M). RoBERTa=162B token dataset. | ||||
26 | Qwen2 | Alibaba | https://huggingface.co/spaces/Qwen/Qwen2-72B-Instruct | 72 | 3000 | 42:1 | 1.5 | 84.2 | 55.6 | 37.9 | π πβ¬ πΈ π | Jun/2024 | π’ | https://qwenlm.github.io/blog/qwen2/ | Dense | Instruct MMLU=82. Instruct GPQA=41.9. Qwen1.5 trained on ~3T tokens, using this number for Qwen2 while waiting for official number. | |
27 | Qwen2-57B-A14B | Alibaba | https://github.com/QwenLM/Qwen2?tab=readme-ov-file | 57 | 3000 | 53:1 | 1.4 | 76.5 | 43 | 34.3 | π πβ¬ πΈ π | Jun/2024 | π’ | https://qwenlm.github.io/blog/qwen2/ | MoE | ||
28 | Skywork MoE 16x13B | Kunlun Tech | https://huggingface.co/Skywork/Skywork-MoE-Base | 146 | 77.4 | π πβ¬ πΈ π | Jun/2024 | π’ | https://github.com/SkyworkAI/Skywork-MoE/blob/main/skywork-moe-tech-report.pdf | MoE | CN + EN. "(MoE) model with 146 billion parameters, 16 experts, and 22 billion activated parameters. This model is initialized from the pre-existing dense checkpoints of our Skywork-13B model." | ||||||
29 | Mamba-2 | CMU | https://github.com/state-spaces/mamba | 2.7 | 300 | 112:1 | 0.1 | π πβ¬ πΈ π | May/2024 | π’ | https://arxiv.org/abs/2405.21060 | Dense | Analysis: https://tridao.me/blog/2024/mamba2-part1-model/ | ||||
30 | MAP-Neo | International | https://map-neo.github.io/ | 7 | 4500 | 643:1 | 0.6 | 58.14 | π πβ¬ πΈ π | May/2024 | π’ | https://arxiv.org/abs/2405.19327 | Dense | "first fully open-sourced bilingual LLM with comparable performance to existing state-of-the-art LLMs... we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided." | |||
31 | K2 | LLM360 | https://huggingface.co/LLM360/K2 | 65 | 1400 | 22:1 | 1.0 | 64.8 | π πβ¬ πΈ π | May/2024 | π’ | https://www.llm360.ai/blog/several-new-releases-to-further-our-mission.html | Dense | "K2-65B is a fully reproducible LLM outperforming Llama 2 70B using 35% less compute." | |||
32 | Codestral | Mistral | https://huggingface.co/mistralai/Codestral-22B-v0.1 | 22 | 2000 | 91:1 | 0.7 | π πβ¬ πΈ π | May/2024 | π’ | https://mistral.ai/news/codestral/ | Dense | Fluent in 80+ programming languages | ||||
33 | Aya-23-35B | Cohere | https://huggingface.co/spaces/CohereForAI/aya-23 | 35 | 4800 | 138:1 | 1.4 | π πβ¬ πΈ π | May/2024 | π’ | https://drive.google.com/file/d/1YKBPo61pnl97C1c_1C2ZVOnPhqf7MLSc/view | Dense | |||||
34 | Yi-XLarge | 01-ai | https://platform.01.ai/ | 2000 | 20000 | 10:1 | 21.1 | 85.1 | 48.2 | π πβ¬ πΈ π | May/2024 | π’ | https://www.aixinzhijie.com/article/6845768 | MoE | Still training as of May/2024: https://appserversrc.8btc.cn/FnDYlEC4STBhphu6M3NL4CKH43FW dead link, use: https://finance.china.com.cn/roll/20240513/6116857.shtml | ||
35 | Yi-Large | 01-ai | https://platform.01.ai/ | 1000 | 15000 | 15:1 | 12.9 | 83.8 | 58.1 | 43.5 | π πβ¬ πΈ π | May/2024 | π’ | https://www.aixinzhijie.com/article/6845768 | Dense | ||
36 | Chameleon | Meta AI | https://ai.meta.com/resources/models-and-libraries/chameleon-downloads/?gk_enable=chameleon_web_flow_is_live | 34 | 9200 | 271:1 | 1.9 | 65.8 | π πβ¬ πΈ π | May/2024 | π’ | https://arxiv.org/abs/2405.09818 | Dense | Multimodal | |||
37 | Sparse Llama 7B | Cerebras | https://huggingface.co/spaces/neuralmagic/llama-2-sparse-transfer-chat-deepsparse | 7 | 145 | 21:1 | 0.1 | π πβ¬ πΈ π | May/2024 | π’ | https://arxiv.org/abs/2405.03594 | Hybrid | https://www.cerebras.net/blog/introducing-sparse-llama-70-smaller-3x-faster-full-accuracy "For the 50% sparse model, we utilized 45 billion tokens of pretraining data, while an additional 100 billion tokens were used for the 70% model. This represents approximately 2% to 8% of the original 2 trillion tokens used to train the base Llama-2 model." | ||||
38 | Gemini 1.5 Flash | Google DeepMind | https://aistudio.google.com/app/prompts/new_chat | 78.9 | 59.1 | 39.5 | π πβ¬ πΈ π | May/2024 | π’ | https://goo.gle/GeminiV1-5 | MoE | 1M context length. | |||||
39 | GPT-4o | OpenAI | https://chatgpt.com/ | 88.7 | 72.6 | 53.6 | π πβ¬ πΈ π | May/2024 | π’ | https://openai.com/index/hello-gpt-4o/ | MoE | Omnimodel. "[GPT-4o is] likely an early checkpoint of GPT-5". https://twitter.com/drjimfan/status/1790089671365767313 ELO: https://twitter.com/LiamFedus/status/1790064963966370209 Demo: https://youtu.be/DQacCB9tDaw
40 | Falcon 2 11B | TII | https://huggingface.co/tiiuae/falcon-11B | 11 | 5500 | 500:1 | 0.8 | 58.37 | π πβ¬ πΈ π | May/2024 | π’ | https://www.tii.ae/news/falcon-2-uaes-technology-innovation-institute-releases-new-ai-model-series-outperforming-metas | Dense | Announce: https://www.tii.ae/news/falcon-2-uaes-technology-innovation-institute-releases-new-ai-model-series-outperforming-metas | |||
41 | Fugaku-LLM | Fujitsu | https://huggingface.co/Fugaku-LLM/Fugaku-LLM-13B-instruct | 13 | 380 | 30:1 | 0.2 | π πβ¬ πΈ π | May/2024 | π’ | https://www.fujitsu.com/global/about/resources/news/press-releases/2024/0510-01.html | Dense | Japanese. CPU trained: 158,976+ A64FX CPUs (7M+ cores), zero GPUs. https://en.wikipedia.org/wiki/Fugaku_(supercomputer) | ||||
42 | Yi 1.5 34B | 01-ai | https://huggingface.co/01-ai/Yi-1.5-34B-Chat | 34.4 | 3600 | 105:1 | 1.2 | 76.8 | 52.3 | π πβ¬ πΈ π | May/2024 | π’ | https://github.com/01-ai/Yi-1.5 | Dense | Uses 600B more training tokens than Yi 1.0 (Nov/2023). | ||
43 | YOCO | Microsoft | https://github.com/microsoft/unilm/tree/master/YOCO | 3 | 1600 | 534:1 | 0.2 | π πβ¬ πΈ π | May/2024 | π’ | https://arxiv.org/abs/2405.05254 | Dense | With Tsinghua. You Only Cache Once (YOCO). Long context: "1M context length with near-perfect needle retrieval accuracy"
44 | DeepSeek-V2 | DeepSeek-AI | https://chat.deepseek.com/ | 236 | 8100 | 35:1 | 4.6 | 78.5 | 54.8 | π πβ¬ πΈ π | May/2024 | π’ | https://arxiv.org/abs/2405.04434 | MoE | Huge dataset, 12% Chinese "Therefore, we acknowledge that DeepSeek-V2 still has a slight gap in basic English capabilities with LLaMA3 70B". | ||
45 | ChuXin | Independent | https://huggingface.co/chuxin-llm/Chuxin-1.6B-Base | 1.6 | 2300 | 1,438:1 | 0.2 | 41.07 | π πβ¬ πΈ π | May/2024 | π’ | https://arxiv.org/abs/2405.04828 | Dense | "results on the 'Needle In A Haystack' (NIAH) tests indicate that ChuXin-1M performs well across all context window lengths up to 1M."
46 | RWKV-v6 Finch | RWKV | https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-2 | 7.63 | 2500 | 328:1 | 0.5 | π πβ¬ πΈ π | May/2024 | π’ | https://huggingface.co/BlinkDL/rwkv-6-world | Dense | https://twitter.com/BlinkDL_AI/status/1787834625211158562 | ||||
47 | xLSTM | ELLIS | 2.7 | 15 | 6:1 | 0.0 | π πβ¬ πΈ π | May/2024 | π΄ | https://arxiv.org/abs/2405.04517 | Dense | New method extending LSTM to xLSTM; see also RNNs. Code/weights don't appear to have been released. https://github.com/AI-Guru/xlstm-resources
48 | Granite Code | IBM | https://github.com/ibm-granite/granite-code-models | 34 | 3500 | 103:1 | 1.1 | π | May/2024 | π’ | https://github.com/ibm-granite/granite-code-models/blob/main/paper.pdf | Dense | Dataset: publicly available datasets (e.g., GitHub Code Clean, Starcoder data), public code repositories, and issues from GitHub. | ||||
49 | Qwen-Max | Alibaba | https://chat.lmsys.org/ | 300 | 6000 | 20:1 | 4.5 | π πβ¬ πΈ π | May/2024 | π’ | https://help.aliyun.com/zh/dashscope/developer-reference/model-introduction | Dense | https://twitter.com/JustinLin610/status/1787584325367529509 | ||||
50 | Med-Gemini-L 1.0 | Google DeepMind | https://twitter.com/alan_karthi/status/1785117450528264216 | 1500 | 30000 | 20:1 | 22.4 | π πβ¬ πΈ π | May/2024 | π΄ | https://arxiv.org/abs/2404.18416 | Dense | Med-Gemini-M 1.0 and Med-Gemini-L 1.0 (Pro and Ultra finetunes) "For language tasks that require less complex reasoning, such as summarizing medical notes and creating referral letters, we introduce Med-Gemini-M 1.0 by fine-tuning the Gemini 1.0 Pro model. For other tasks that require more advanced reasoning, we introduce Med-Gemini-L 1.0 by fine-tuning the Gemini 1.0 Ultra model using a self-training method to enable the models to efficiently use web search." | ||||
51 | Tele-FLM | BAAI | https://huggingface.co/CofeAI/Tele-FLM | 52 | 2000 | 39:1 | 1.1 | 64 | π πβ¬ πΈ π | Apr/2024 | π’ | https://arxiv.org/abs/2404.16645 | Dense | Also known as FLM-2. "We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research." Discussion paper Jul/2024: https://arxiv.org/abs/2407.02783 | |||
52 | Qwen-1.5 110B | Alibaba | https://huggingface.co/spaces/Qwen/Qwen1.5-110B-Chat-demo | 111 | 3000 | 28:1 | 1.9 | 80.4 | 49.9 | 35.9 | π πβ¬ πΈ π | Apr/2024 | π’ | https://qwenlm.github.io/blog/qwen1.5-110b/ | Dense | Worse performance on GPQA (72B=36.3, 110B=35.9). | |
53 | Arctic | Snowflake AI Research | https://arctic.streamlit.app/ | 480 | 3500 | 8:1 | 4.3 | 67.3 | π πβ¬ πΈ π | Apr/2024 | π’ | https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/ | Hybrid | "Arctic uses a unique Dense-MoE Hybrid transformer architecture. It combines a 10B dense transformer model with a residual 128×3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating."
54 | SenseNova 5.0 | SenseTime | 600 | 10000 | 17:1 | 8.2 | 84.78 | 42.93 | π πβ¬ πΈ π | Apr/2024 | π’ | https://news.futunn.com/en/post/41290101/a-large-shangtang-multi-modal-model-with-600-billion-parameters | MoE | GPT-4 scale; low media coverage; no demo in Western world. https://www.techinasia.com/sensetime-pauses-trading-stock-rises-30-model-launch | |||
55 | OpenELM | Apple | https://huggingface.co/apple/OpenELM-3B-Instruct | 3.04 | 1500 | 494:1 | 0.2 | 26.76 | π πβ¬ πΈ π | Apr/2024 | π’ | https://arxiv.org/abs/2404.14619 | Dense | On-device model (laptop, phone). Open-source Efficient Language Models (OpenELM). https://venturebeat.com/ai/apple-releases-openelm-small-open-source-ai-models-designed-to-run-on-device/ | |||
56 | phi-3-medium | Microsoft | https://huggingface.co/microsoft/Phi-3-medium-128k-instruct | 14 | 4800 | 343:1 | 0.9 | 78.2 | 55.7 | βοΈ | Apr/2024 | π’ | https://arxiv.org/abs/2404.14219 | Dense | Preview only, benchmarks being investigated as of May/2024. | ||
57 | phi-3-mini | Microsoft | https://huggingface.co/microsoft/Phi-3-mini-128k-instruct | 3.8 | 3300 | 869:1 | 0.4 | 68.8 | 45.7 | βοΈ | Apr/2024 | π’ | https://arxiv.org/abs/2404.14219 | Dense | "phi3-mini can be quantized to 4-bits so that it only occupies ≈ 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on iPhone 14 with A16 Bionic chip running natively on-device and fully offline achieving more than 12 tokens per second."
58 | Llama 3 70B | Meta AI | https://meta.ai/ | 70 | 15000 | 215:1 | 3.4 | 82 | 52.8 | π πβ¬ πΈ π | Apr/2024 | π’ | https://ai.meta.com/blog/meta-llama-3/ | Dense | Instruct MMLU-Pro=56.2 | ||
59 | HLAT | Amazon | 7 | 1800 | 258:1 | 0.4 | 41.318 | π πβ¬ πΈ π | Apr/2024 | π΄ | https://arxiv.org/abs/2404.10630 | Dense | HLAT=High-quality LLM pre-trained on AWS Trainium. Same arch as Llama 7B. Pre-training was performed on up to 64 Amazon EC2 trn1.32xlarge instances, totalling up to 1,024 AWS Trainium accelerators. Read more about Trainium: https://www.aboutamazon.com/news/aws/what-you-need-to-know-about-the-aws-ai-chips-powering-amazons-partnership-with-anthropic
60 | Idefics2 | Hugging Face | https://huggingface.co/HuggingFaceM4/idefics2-8b | 8.4 | π πΈ | Apr/2024 | π’ | https://huggingface.co/blog/idefics2 | Dense | Clone of Flamingo now using Mistral 7B. Named after Asterix and Obelix's dog Idefix (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS)
61 | Reka Core | Reka AI | https://poe.com/RekaCore | 300 | 10000 | 34:1 | 5.8 | 83.2 | 38.2 | π πβ¬ πΈ π | Apr/2024 | π’ | https://publications.reka.ai/reka-core-tech-report.pdf | Dense | https://www.reka.ai/news/reka-core-our-frontier-class-multimodal-language-model | ||
62 | WizardLM-2-8x22B | Microsoft | https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF | 141 | 2000 | 15:1 | 1.8 | π πβ¬ πΈ π | Apr/2024 | π’ | https://wizardlm.github.io/WizardLM2/ | MoE | Base model = mixtral-8x22b.
63 | Pile-T5 | EleutherAI | https://huggingface.co/EleutherAI/pile-t5-xxl | 11 | 2000 | 182:1 | 0.5 | 53.84 | π πβ¬ πΈ π | Apr/2024 | π’ | https://blog.eleuther.ai/pile-t5/ | Dense | ||||
64 | Zephyr 141B-A35B | Hugging Face H4 | https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1 | 35 | 2000 | 58:1 | 0.9 | π πβ¬ πΈ π | Apr/2024 | π’ | https://arxiv.org/abs/2403.07691 | MoE | mixtral-8x22b finetune using Odds Ratio Preference Optimization (ORPO). | ||||
65 | Rerank 3 | Cohere | https://docs.cohere.com/reference/rerank-1 | 104 | 4000 | 39:1 | 2.1 | π πΈ | Apr/2024 | π’ | https://txt.cohere.com/rerank-3/ | Dense | RAG + semantic search, possibly backed by Command-R+. | ||||
66 | gpt-4-turbo-2024-04-09 | OpenAI | https://chat.openai.com/ | 86.5 | 63.7 | 49.1 | π πβ¬ πΈ π | Apr/2024 | π’ | https://cdn.openai.com/papers/gpt-4.pdf | MoE | This is such a significantly better model that I've added it here. This GPQA=46.5%, old GPT-4 GPQA=36%. https://twitter.com/EpochAIResearch/status/1778463039932584205 MMLU scores are unclear, but may have improved by 1%: https://twitter.com/OpenAI/status/1778602770784002136. Final benchmarks are here: https://archive.md/6Cc0Z | |||||
67 | MiniCPM-2.4B | Tsinghua | https://github.com/OpenBMB/MiniCPM/ | 2.4 | 1100 | 459:1 | 0.2 | π πβ¬ πΈ π | Apr/2024 | π’ | https://arxiv.org/abs/2404.06395 | Dense | MoE option=https://huggingface.co/openbmb/MiniCPM-MoE-8x2B | ||||
68 | Ferret-UI | Apple | https://github.com/apple/ml-ferret | 13 | 2000 | 154:1 | 0.5 | π πβ¬ πΈ π₯ | Apr/2024 | π’ | https://arxiv.org/abs/2404.05719 | Dense | Vicuna base, multimodal. Extension of Ferret from Oct/2023. | ||||
69 | mixtral-8x22b | Mistral | https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1 | 141 | 2000 | 15:1 | 1.8 | 77.75 | π πβ¬ πΈ π | Apr/2024 | π’ | https://mistral.ai/news/mixtral-8x22b/ | MoE | MoE=22Bx8, seq=65536. | |||
70 | Sailor | Sail | https://huggingface.co/sail | 7 | 200 | 29:1 | 0.1 | π πβ¬ πΈ π | Apr/2024 | π’ | https://arxiv.org/abs/2404.03608v1 | Dense | SEA languages. Based on Qwen-1.5. https://github.com/sail-sg/sailor-llm "Generally Sailor models consume around 200B tokens, completing a full pass through the SailCraft corpus once. However, the Sailor-0.5B model undergoes training with 400B tokens, equivalent to 2 epochs." | ||||
71 | JetMoE-8B | MIT | https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat | 8 | 1250 | 157:1 | 0.3 | 49.2 | π πβ¬ πΈ π | Apr/2024 | π’ | https://huggingface.co/jetmoe/jetmoe-8b | MoE | ||||
72 | Eurus | Tsinghua | https://huggingface.co/collections/openbmb/eurus-660bc40bec5376b3adc9d1c5 | 70 | 2000 | 29:1 | 1.2 | π πβ¬ πΈ π | Apr/2024 | π’ | https://huggingface.co/collections/openbmb/eurus-660bc40bec5376b3adc9d1c5 | Dense | Fine-tune of Mistral-7B and CodeLlama-70B. | ||||
73 | Command-R+ | Cohere | https://huggingface.co/spaces/CohereForAI/c4ai-command-r-plus | 104 | 4000 | 39:1 | 2.1 | 75.7 | π πΈ | Apr/2024 | π’ | https://huggingface.co/CohereForAI/c4ai-command-r-plus | Dense | purpose-built to excel at real-world enterprise use cases. Announce with no arch details: https://txt.cohere.com/command-r-plus-microsoft-azure/ | |||
74 | Viking | Silo AI | 33 | 2000 | 61:1 | 0.9 | π | Apr/2024 | π’ | https://www.silo.ai/blog/viking-7b-13b-33b-sailing-the-nordic-seas-of-multilinguality | Dense | 'Viking uses an architecture similar to Llama 2, with flash attention, rotary embeddings, grouped query attention and supports a 4k sequence length'
75 | OLMo-Bitnet-1B | Nous Research | https://huggingface.co/NousResearch/OLMo-Bitnet-1B | 1 | 60 | 60:1 | 0.0 | π | Apr/2024 | π’ | https://arxiv.org/abs/2402.17764 | Dense | 1.58-bit quantized (ternary weights) means we can run a 70B model in ~14GB VRAM. See also BitNet b1.58 | ||||
76 | Aurora-M | International | https://huggingface.co/collections/aurora-m/aurora-m-models-65fdfdff62471e09812f5407 | 15.5 | 2035 | 132:1 | 0.6 | π | Mar/2024 | π’ | https://arxiv.org/abs/2404.00399 | Dense | |||||
77 | ReALM-3B | Apple | 3 | 134 | 45:1 | 0.1 | π | Mar/2024 | π΄ | https://arxiv.org/abs/2403.20329 | Dense | FLAN-T5 (Oct/2022) finetune. | |||||
78 | Qwen1.5-MoE-A2.7B | Alibaba | https://qwenlm.github.io/blog/qwen-moe/ | 14.3 | 1500 | 105:1 | 0.5 | 62.5 | π πβ¬ πΈ π | Mar/2024 | π’ | https://qwenlm.github.io/blog/qwen-moe/ | MoE | MoE. "Of particular significance is the fact that, through upcycling, the necessity for training an equivalent volume of tokens as in the original model has been eliminated." I assumed half of the original 3T tokens | |||
79 | Grok-1.5 | xAI | https://grok.x.ai/ | 314 | 6000 | 20:1 | 4.6 | 81.3 | π πβ¬ πΈ π | Mar/2024 | π’ | https://x.ai/blog/grok-1.5 | Dense | Context=128k. | |||
80 | Jamba | AI21 | https://huggingface.co/ai21labs/Jamba-v0.1 | 52 | 5000 | 97:1 | 1.7 | 67.4 | π πβ¬ πΈ π | Mar/2024 | π’ | https://arxiv.org/abs/2403.19887 | MoE | MoE. Open weights, licensed under Apache 2.0. Announce: https://arxiv.org/abs/2403.19887 | |||
81 | DBRX | MosaicML | https://huggingface.co/spaces/databricks/dbrx-instruct | 132 | 12000 | 91:1 | 4.2 | 73.7 | π πβ¬ πΈ π | Mar/2024 | π’ | https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm | MoE | MoE. Trained for $10M on 3,072 NVIDIA H100s connected by 3.2Tbps Infiniband. | |||
82 | Stable Code Instruct 3B | Stability AI | https://huggingface.co/stabilityai/stable-code-instruct-3b | 2.7 | 560 | 208:1 | 0.1 | π | Mar/2024 | π’ | https://stability.ai/news/introducing-stable-code-instruct-3b | Dense | Context window=16,384. Trained on The Stack dataset. | ||||
83 | EvoLLM-JP | Sakana AI | https://huggingface.co/SakanaAI/EvoLLM-JP-v1-10B | 10 | 800 | 80:1 | 0.3 | π πβ¬ πΈ π | Mar/2024 | π’ | https://arxiv.org/abs/2403.13187 | Dense | Japanese. Model merge 'our EvoLLM-JP-A is a merge of shisa-gamma-7b-v1, Arithmo2-Mistral-7B, and Abel7B-002' https://sakana.ai/evolutionary-model-merge/ | ||||
84 | RakutenAI-7B | Rakuten Group | https://huggingface.co/Rakuten/RakutenAI-7B | 7 | 3000 | 429:1 | 0.5 | 61.31 | π πβ¬ πΈ π | Mar/2024 | π’ | https://arxiv.org/abs/2403.15484 | Dense | Japanese. Mistral 7B derivative. | |||
85 | Parakeet | Independent | https://colab.research.google.com/drive/1gI8CM9Bz9ov0-E6aL2jF808rE56UtZyF?usp=sharing | 0.378 | 3 | 8:1 | 0.0 | π πβ¬ πΈ π | Mar/2024 | π’ | https://news.ycombinator.com/item?id=39745700#39745702 | Dense | Tiny model (378M) for testing | ||||
86 | RWKV-v5 EagleX | RWKV | https://huggingface.co/recursal/EagleX_1-7T | 7.52 | 1700 | 227:1 | 0.4 | 40.14 | π πβ¬ πΈ π | Mar/2024 | π’ | https://substack.recursal.ai/p/eaglex-17t-soaring-past-llama-7b | Dense | Built on the RWKV-v5 architecture (a linear transformer with 10-100x+ lower inference cost) | |||
87 | MM1 | Apple | 30 | 2010 | 67:1 | 0.8 | π | Mar/2024 | π΄ | https://arxiv.org/abs/2403.09611 | Dense | VLM, outperforms Flamingo 80B (Apr/2022) across benchmarks. 2T text tokens + ~10B+ other text (estimate). Unreleased. | |||||
88 | RFM-1 | Covariant | https://vimeo.com/921866765 | 8 | 160 | 20:1 | 0.1 | π πβ¬ πΈ π | Mar/2024 | π‘ | https://covariant.ai/insights/introducing-rfm-1-giving-robots-human-like-reasoning-capabilities/ | Dense | Commercial, multimodal for robotics | ||||
89 | Command-R | Cohere | Cohere | 35 | 700 | 20:1 | 0.5 | 37.9 | π πΈ | Mar/2024 | π’ | https://txt.cohere.com/command-r/ | Dense | RAG and tool use | |||
90 | DeepSeek-VL | DeepSeek-AI | https://github.com/deepseek-ai/DeepSeek-VL?tab=readme-ov-file | 7 | 2000 | 286:1 | 0.4 | π πβ¬ πΈ π | Mar/2024 | π’ | https://arxiv.org/abs/2403.05525 | Dense | Vision, based on DeepSeek-LLM-7B | ||||
91 | AnyGPT | Fudan University | https://junzhan2000.github.io/AnyGPT.github.io/ | 7 | 2000 | 286:1 | 0.4 | π πβ¬ πΈ π | Mar/2024 | π’ | https://arxiv.org/abs/2402.12226 | Dense | Llama 2 7B backbone with new matrices ('reshaping the embedding matrix and prediction layer') | ||||
92 | Stable Beluga 2.5 | Stability AI | 70 | 2000 | 29:1 | 1.2 | π πβ¬ πΈ π | Mar/2024 | π’ | https://stability.ai/news/putting-the-ai-supercomputer-to-work | Dense | Mentioned in Stability AI's 11/Mar/2024 release about Intel chips; availability unknown
93 | Inflection-2.5 | Inflection AI | https://inflection.ai/inflection-2 | 1200 | 20000 | 17:1 | 16.3 | 85.5 | 38.4 | π π β¬ πΈ | Mar/2024 | π’ | https://inflection.ai/inflection-2-5 | Dense | |||
94 | Apollo | SRIBD/CUHK | https://apollo.llmzoo.com/ | 7 | 2500 | 358:1 | 0.4 | π ππΈ π | Mar/2024 | π’ | https://arxiv.org/abs/2403.03640 | Dense | Qwen 1.8B as base. Medical focus. | ||||
95 | Claude 3 Opus | Anthropic | https://claude.ai/ | 2000 | 40000 | 20:1 | 29.8 | 88.2 | 68.5 | 59.5 | π πβ¬ πΈ π | Mar/2024 | π’ | https://www.anthropic.com/claude-3-model-card | Dense | Original MMLU=86.8 (GPT-4=86.4). Original GPQA=50.4. 200k context, 1M for researchers. | |
96 | Nemotron-4 15B | NVIDIA | 15 | 8000 | 534:1 | 1.2 | 64.2 | π πβ¬ πΈ π | Feb/2024 | π’ | https://arxiv.org/abs/2402.16819 | Dense | |||||
97 | TowerLLM | Unbabel | https://unbabel.com/meet-towerllm/ | 7 | 1020 | 146:1 | 0.3 | π πβ¬ πΈ π | Feb/2024 | π’ | https://arxiv.org/abs/2402.17733 | Dense | Commercial product, Llama-2 as base.
98 | Hawk | Google DeepMind | 7 | 300 | 43:1 | 0.2 | 35 | π ππΈ π | Feb/2024 | π’ | https://arxiv.org/abs/2402.19427 | Dense | MMLU=35. RNN. | ||||
99 | Griffin | Google DeepMind | 14 | 300 | 22:1 | 0.2 | 49.5 | π ππΈ π | Feb/2024 | π’ | https://arxiv.org/abs/2402.19427 | Dense | MMLU=49.5. RNN. | ||||
100 | BitNet b1.58 | Microsoft | https://huggingface.co/1bitLLM/bitnet_b1_58-xl | 70 | 2000 | 29:1 | 1.2 | π ππΈ π | Feb/2024 | π’ | https://arxiv.org/abs/2402.17764 | Dense |
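
The Ratio and ALScore columns above are pure arithmetic over the Parameters (B) and Tokens trained (B) columns, per the formula in the header: square root of (Parameters × Tokens) ÷ 300, with both values in billions. A minimal Python sketch for spot-checking rows; the row values are taken from the table, and the ceiling-rounding of the ratio is my assumption based on how the listed ratios behave (e.g. 15000/405 ≈ 37.04 appears as 38:1):

```python
import math

def alscore(params_b: float, tokens_b: float) -> float:
    """ALScore per the table header: sqrt(Parameters x Tokens) / 300, both in billions."""
    return math.sqrt(params_b * tokens_b) / 300

def token_param_ratio(params_b: float, tokens_b: float) -> str:
    """Tokens:params ratio; assumed rounded up, matching the table's listed values."""
    return f"{math.ceil(tokens_b / params_b)}:1"

# Spot-checks against rows above (Parameters (B), Tokens trained (B)):
rows = {
    "MAI-1":         (500, 10000),   # table: 20:1, ALScore 7.5
    "Claude 3 Opus": (2000, 40000),  # table: 20:1, ALScore 29.8
    "Llama 3 405B":  (405, 15000),   # table: 38:1, ALScore 8.2
}
for name, (p, t) in rows.items():
    print(name, token_param_ratio(p, t), round(alscore(p, t), 1))
```

Per the header's note, any ALScore ≥ 1.0 marks a powerful model as of mid-2023; the Chinchilla-optimal ratio threshold is ≥20:1.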