MOVED: https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/edit?gid=1158069878#gid=1158069878
Google corrupted this one...
Model | Lab | Playground | Parameters (B) | Tokens trained (B) | Ratio Tokens:Params (Chinchilla scaling ≥ 20:1) | ALScore | MMLU | MMLU-Pro | GPQA | Training dataset | Announced | Public? | Paper / Repo | Arch | Notes
"ALScore" is a quick and dirty rating of the model's power. The formula is: ALScore = √(Parameters × Tokens) ÷ 300, with Parameters and Tokens in billions. Any ALScore ≥ 1.0 is a powerful model in mid-2023. (A minimal calculation sketch follows this header.)
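As an illustration of the two derived columns, here is a minimal Python sketch (not part of the original sheet; the function names are illustrative only) that reproduces the ALScore and Tokens:Params ratio from the Parameters (B) and Tokens trained (B) cells, using GPT-4o mini's listed figures as the example input:

import math

def alscore(params_b: float, tokens_b: float) -> float:
    # "Quick and dirty" power rating: sqrt(Parameters x Tokens) / 300,
    # with both values expressed in billions, as defined in the header.
    return math.sqrt(params_b * tokens_b) / 300

def tokens_to_params_ratio(params_b: float, tokens_b: float) -> float:
    # Chinchilla-style data ratio; >= 20:1 is the rule of thumb used in this sheet.
    return tokens_b / params_b

# Example: GPT-4o mini as listed below (8B parameters, 6,000B tokens).
print(round(alscore(8, 6000), 1))                  # 0.7
print(f"{tokens_to_params_ratio(8, 6000):.0f}:1")  # 750:1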
Olympus | Amazon | https://lifearchitect.ai/olympus/ | 2,000B params | 40,000B tokens | TBA | New related Titan details: '$65m training run. 200B dense model on 4T tokens of data across 13,760 NVIDIA A100 chips. 48 days to train. Training runs soon to cross $1B' https://importai.substack.com/p/import-ai-365-wmd-benchmark-amazon
GPT-5 | OpenAI | https://lifearchitect.ai/gpt-5/ | 52,500B params | TBA | Due 2024.
GPT-6 | OpenAI | https://lifearchitect.ai/gpt-6/ | TBA | Due 2025.
AuroraGPT (ScienceGPT) | Argonne National Laboratory | https://www.hpcwire.com/2023/11/13/training-of-1-trillion-parameter-scientific-ai-begins/ | 1,000B params | TBA | 🔴 | Paper/repo: https://tpc.dev/2023/11/10/tpc-announced-with-founding-partners/ | Powered by Intel Ponte Vecchio GPUs.
Grok-2 | xAI | https://twitter.com/elonmusk/status/1773655245769330757 | TBA | Due 2025.
MAI-1 | Microsoft | https://arstechnica.com/information-technology/2024/05/microsoft-developing-mai-1-language-model-that-may-compete-with-openai-report/ | 500B params | 10,000B tokens | 20:1 | ALScore 7.5 | TBA | Paper/repo: https://www.reuters.com/technology/microsoft-readies-new-ai-model-compete-with-google-openai-information-reports-2024-05-06/ | Dense | Due 2024. MAI = Microsoft artificial intelligence. MSFT CTO statement: https://archive.md/XRSgS
GPT-4o mini | OpenAI | https://chatgpt.com/ | 8B params | 6,000B tokens | 750:1 | ALScore 0.7 | MMLU 82 | GPQA 40.2 | 🆆 📚 🕸 🌋 | Jul/2024 | 🟢 | Paper/repo: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ | MoE | Omnimodel. "OpenAI would not disclose exactly how large GPT-4o mini is, but said it’s roughly in the same tier as other small AI models, such as Llama 3 8b, Claude Haiku and Gemini 1.5 Flash." https://techcrunch.com/2024/07/18/openai-unveils-gpt-4o-mini-a-small-ai-model-powering-chatgpt/ "tested GPT-4o to identify potential risks, which we have addressed and plan to share the details of in the forthcoming GPT-4o system card and Preparedness scorecard." And related paper about instruction hierarchy: https://arxiv.org/abs/2404.13208
NeMo | Mistral | https://huggingface.co/mistralai/Mistral-Nemo-Base-2407 | 12B params | 2,000B tokens | 167:1 | ALScore 0.5 | MMLU 68 | 🆆 📚 🕸 🌋 | Jul/2024 | 🟢 | Paper/repo: https://mistral.ai/news/mistral-nemo/ | Dense | With NVIDIA. "Drop-in replacement of Mistral 7B". "trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs" https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/
Codestral Mamba | Mistral | https://huggingface.co/mistralai/mamba-codestral-7B-v0.1 | 7B params | 2,000B tokens | 286:1 | ALScore 0.4 | 🆆 📚 🕸 🌋 | Jul/2024 | 🟢 | Paper/repo: https://mistral.ai/news/codestral-mamba/ | Dense | "Unlike Transformer models, Mamba models offer the advantage of linear time inference and the theoretical ability to model sequences of infinite length."
Mathstral | Mistral | https://huggingface.co/mistralai/mathstral-7B-v0.1 | 7B params | 2,000B tokens | 286:1 | ALScore 0.4 | MMLU 63.47 | 🆆 📚 🕸 🌋 | Jul/2024 | 🟢 | Paper/repo: https://mistral.ai/news/mathstral/ | Dense | "We’re contributing Mathstral to the science community to bolster efforts in advanced mathematical problems requiring complex, multi-step logical reasoning."
SpreadsheetLLM | Microsoft | 1,760B params | 13,000B tokens | 8:1 | ALScore 15.9 | 🆆 📚 🕸 🌋 | Jul/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2407.09025v1 | Dense | Notable finetune of GPT4-0125-preview, "outperforming the vanilla approach by 25.6% in GPT4’s in-context learning setting"
next-gen | DeepL | https://www.deepl.com/en/translator | 🌋 | Jul/2024 | 🟢 | Paper/repo: https://www.deepl.com/en/blog/next-gen-language-model | Dense | "Built using our own groundbreaking, specialized LLM technology and proprietary training data, designed specifically for translation"
SmolLM | Hugging Face | https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966 | 1.7B params | 1,000B tokens | 589:1 | ALScore 0.1 | MMLU 39.97 | 🆆 📚 🕸 🌋 ⚛️ | Jul/2024 | 🟢 | Paper/repo: https://huggingface.co/blog/smollm | Dense | Dataset includes the new Cosmopedia v2 synthetic data. 135M and 360M models, each trained on 600B tokens from Smollm-Corpus; the 1.7B model trained on 1T tokens from Smollm-Corpus.
Mockingbird | Vectara | https://vectara.com/platform/ | 9B params | 1,000B tokens | 112:1 | ALScore 0.3 | 🆆 📚 🕸 🌋 ⚛️ | Jul/2024 | 🟢 | Paper/repo: https://vectara.com/blog/mockingbird-a-rag-and-structured-output-focused-llm/ | Dense | "At <10B parameters it's an LLM trained to provide optimal results for RAG and structured outputs."
FLAMe | Google DeepMind | 24B params | 1,000B tokens | 42:1 | ALScore 0.5 | 👥 | Jul/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2407.10817v1 | Dense | LLM-as-a-Judge autorater. Foundational Large Autorater Models (FLAMe). Uses an instruction-tuned PaLM-2-24B model. Unrelated to Microsoft FLAME from Jan/2023.
H2O-Danube3-4B | H2O.ai | https://h2o.ai/platform/danube/personal-gpt/ | 4B params | 6,000B tokens | 1,500:1 | ALScore 0.5 | MMLU 55.18 | 🆆 📚 🕸 🌋 ⚛️ | Jul/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2407.09276 | Dense | Runs natively and fully offline on a mobile phone. "H2O-Danube3 is a family of decoder only LLM models that use the general Llama model architecture adopting core principles from Llama 2 and Mistral with custom parameters determining the shape of each layer and total parameter count. We use the Mistral tokenizer..." MMLU for chat=54.74, base=55.18 via https://huggingface.co/h2oai/h2o-danube3-4b-base
Causal Axioms | Microsoft | 0.067B params | 1.2B tokens | 18:1 | ALScore 0.0 | ⚛️ | Jul/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2407.07612v1 | Dense | "the training dataset follows a specific structure, we develop a custom tokenizer. Alphanumeric node names are tokenized at a character level, while special terms such as ‘causes’, ‘Does’, ‘cause’, ‘Yes’, and ‘No’ are tokenized at the word level... Our training setup consists of around 175k instances of sequential chains with size of chains ranging from 3 to 6 nodes... All models are trained for 100 epochs. [LifeArchitect.ai estimate is 12 tokens per node x 6 nodes x 175,000 instances x 100 epochs = 1.26B tokens]" Based on GPT-2 arch.
SenseNova 5.5 | SenseTime | https://platform.sensenova.cn/home#/home | 600B params | 10,000B tokens | 17:1 | ALScore 8.2 | ⚛️ | Jul/2024 | 🟢 | Paper/repo: https://www.sensetime.com/en/news-detail/51168278?categoryId=1072 | MoE | "The model training was based on over 10TB tokens [sic, taken as 10T tokens instead of 10TB=2T tokens] of high-quality training data, including a large amount of synthetically-generated reasoning chain data, which help to enhance its reasoning capabilities." & "The updates include SenseNova 5o, the first real-time multimodal model in China, which provides a new AI interaction model on par with GPT-4o’s streaming interaction capabilities"
Helium 7B | Kyutai | https://moshi.chat/ | 7B params | 1,000B tokens | 143:1 | ALScore 0.3 | ⚛️ | Jul/2024 | 🟢 | Paper/repo: https://youtu.be/hm2IJSKcYvo | Dense | "1. The model is fine-tuned on 100K transcripts generated by Helium itself. 2. These transcripts are highly detailed, heavily annotated with emotion and style, and conversational. 3. Text to Speech Engine is further fine-tuned on 20 hours of audio recorded by Alice and licensed."
InternLM2.5 | Shanghai AI Laboratory/SenseTime | https://huggingface.co/collections/internlm/internlm25-66853f32717072d17581bc13 | 7B params | 2,600B tokens | 372:1 | ALScore 0.4 | MMLU 72.8 | GPQA 38.4 | 🆆 📚 🕸 🌋 | Jul/2024 | 🟢 | Paper/repo: https://github.com/InternLM/InternLM/blob/main/model_cards/internlm2.5_7b.md | Dense | "The release of InternLM2.5 series contains 7B model size for now and we are going to release the 1.8B and 20B versions soon"
Llama 3 405B | Meta AI | https://wabetainfo.com/whatsapp-beta-for-android-2-24-14-7-whats-new/ | 405B params | 15,000B tokens | 38:1 | ALScore 8.2 | MMLU 84.8 | GPQA 48 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟡 | Dense | Waiting on release outside of WhatsApp Android as of 1/Jul/2024.
ERNIE 4.0 Turbo | Baidu | https://yiyan.baidu.com/ | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://www.reuters.com/technology/artificial-intelligence/baidu-launches-upgraded-ai-model-says-user-base-hits-300-mln-2024-06-28/ | Dense | "Ernie Bot has reached 300 million users since its launch [on 16/Mar/2023, public Aug/2023]" (Jun/2024)
Gemma 2 | Google DeepMind | https://huggingface.co/google/gemma-2-27b-it | 27B params | 13,000B tokens | 482:1 | ALScore 2.0 | MMLU 75.2 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf | Dense | Announce: https://blog.google/technology/developers/google-gemma-2/
CriticGPT | OpenAI | 👥 | Jun/2024 | 🔴 | Paper/repo: https://cdn.openai.com/llm-critics-help-catch-llm-bugs-paper.pdf | Dense | "LLM Critics Help Catch LLM Bugs". Announce: https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/
4M-21 | Apple | https://github.com/apple/ml-4m/ | 3B params | 🌋 | Jun/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2406.09406 | Dense | Vision model based on T5-XXL. Modalities: RGB, Caption, Bounding boxes, Semantic segmentation, Depth, Human poses, Surface normals, CLIP, DINOv2, ImageBind, Metadata, Canny edges, SAM edges, SAM instances, Color palette. Project page: https://4m.epfl.ch/
ESM3 | EvolutionaryScale | https://github.com/evolutionaryscale/esm | 98B params | 771B tokens | 8:1 | ALScore 0.9 | 🌋 | Jun/2024 | 🟡 | Paper/repo: https://www.evolutionaryscale.ai/blog/esm3-release | Dense | Biology large language model: "sequence, structure, and function are all masked and predicted during training, ESM3 can generate in all three modalities." 1.4B only released.
PanGu 5.0 Super | Huawei | https://www.huaweicloud.com/intl/en-us/product/modelarts.html | 1,000B params | 20,000B tokens | 20:1 | ALScore 14.9 | 🌋 | Jun/2024 | 🟡 | Paper/repo: https://www.huaweicentral.com/huawei-cloud-unveils-pangu-large-model-5-0/ | MoE | https://x.com/faridofanani96/status/1804079517193113850/photo/1
Claude 3.5 Sonnet | Anthropic | https://poe.com/Claude-3.5-Sonnet | MMLU 90.4 | MMLU-Pro 72.83 | GPQA 67.2 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://www.anthropic.com/news/claude-3-5-sonnet | Dense | Model card: https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf
DeepSeek-Coder-V2 | DeepSeek-AI | https://chat.deepseek.com/coder | 236B params | 10,200B tokens | 35:1 | ALScore 4.6 | MMLU 79.2 | MMLU-Pro 63.63 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://github.com/deepseek-ai/DeepSeek-Coder-V2/blob/main/paper.pdf | MoE | DeepSeek-V2 with an additional 6 trillion tokens.
DCLM-Baseline 7B 2.6T | International | https://huggingface.co/apple/DCLM-Baseline-7B | 7B params | 2,600B tokens | 372:1 | ALScore 0.4 | MMLU 63.7 | 🕸 🌋 | Jun/2024 | 🟡 | Paper/repo: https://arxiv.org/abs/2406.11794 | Dense | New dataset: 240T tokens, 8× larger than the previous SOTA dataset. DCLM-Pool is 240T, DCLM-Baseline is 3.8T: "we combine our 3.8T DCLM-BASELINE with the StarCoder and ProofPile2 data to arrive at a 4.1T token dataset. We train a 7B model for 2.5T tokens" and "We release the DCLM benchmark, framework, models, and datasets at https://datacomp.ai/dclm."
Nemotron-4-340B | NVIDIA | https://build.nvidia.com/nvidia/nemotron-4-340b-instruct | 340B params | 9,000B tokens | 27:1 | ALScore 5.8 | MMLU 81.1 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T.pdf | Dense | Open-source equivalent of Mar/2023 GPT-4 (1760B MoE ≈ 340B, 13T); same param count but 2x the tokens of May/2023 PaLM 2 (340B, 3.6T); competitor to Nov/2023 Grok-1 (314B, 6T). Trained on 6,144 H100s. ~1.3TB for inference. 50+ natural and 40+ coding languages. Trained between December 2023 and May 2024. MMLU 0-shot for instruct=78.7, 5-shot for base=81.1. Permalink for paper: https://research.nvidia.com/publication/2024-06_nemotron-4-340b
Apple On-Device model (Jun/2024) | Apple | https://github.com/apple/corenet/tree/main/projects/openelm | 3.04B params | 1,500B tokens | 494:1 | ALScore 0.2 | MMLU 26.76 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.14619 | Dense | https://lifearchitect.ai/apple/ Likely to be the Apple OpenELM model (Apr/2024). "two of these models — a ~3 billion parameter on-device language model, and a larger server-based language model available with Private Cloud Compute". https://machinelearning.apple.com/research/introducing-apple-foundation-models The server-based model is possibly Ferret, although it is more properly called a multimodal model (not just language). It could also be Apple GPT based on their Ajax framework: https://archive.md/f3C0r
MatMul-Free LM | UCSC | https://github.com/ridgerchu/matmulfreellm | 2.7B params | 100B tokens | 38:1 | ALScore 0.1 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2406.02528 | Dense | "we explore alternative methods for mixing tokens without relying on matrix multiplications." Compared with Transformer++ based on Llama-2, not to be confused with the pre-GPT-3 American Express Transformer++ paper from 2/Mar/2020. Instead, Transformer++ is defined in the Mamba paper: 'Transformer++: A Transformer with an improved architecture, namely rotary positional encodings (Su et al. 2021) and SwiGLU MLP (Shazeer 2020)'
Luna | Galileo | https://www.rungalileo.io/blog/introducing-galileo-luna-a-family-of-evaluation-foundation-models | 0.44B params | 162B tokens | 369:1 | ALScore 0.0 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2406.00975 | Dense | Based on DeBERTa-large (440M). RoBERTa = 162B-token dataset.
Qwen2 | Alibaba | https://huggingface.co/spaces/Qwen/Qwen2-72B-Instruct | 72B params | 7,000B tokens | 98:1 | ALScore 2.4 | MMLU 84.2 | MMLU-Pro 55.6 | GPQA 37.9 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2407.10671 | Dense | Instruct MMLU=82. Instruct GPQA=41.9. https://qwenlm.github.io/blog/qwen2/
Qwen2-57B-A14B | Alibaba | https://github.com/QwenLM/Qwen2?tab=readme-ov-file | 57B params | 4,500B tokens | 79:1 | ALScore 1.7 | MMLU 76.5 | MMLU-Pro 43 | GPQA 34.3 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2407.10671 | MoE | https://qwenlm.github.io/blog/qwen2/
Skywork MoE 16x13B | Kunlun Tech | https://huggingface.co/Skywork/Skywork-MoE-Base | 146B params | MMLU 77.4 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://github.com/SkyworkAI/Skywork-MoE/blob/main/skywork-moe-tech-report.pdf | MoE | CN + EN. "(MoE) model with 146 billion parameters, 16 experts, and 22 billion activated parameters. This model is initialized from the pre-existing dense checkpoints of our Skywork-13B model."
Mamba-2 | CMU | https://github.com/state-spaces/mamba | 2.7B params | 300B tokens | 112:1 | ALScore 0.1 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2405.21060 | Dense | Analysis: https://tridao.me/blog/2024/mamba2-part1-model/
MAP-Neo | International | https://map-neo.github.io/ | 7B params | 4,500B tokens | 643:1 | ALScore 0.6 | MMLU 58.14 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2405.19327 | Dense | "first fully open-sourced bilingual LLM with comparable performance to existing state-of-the-art LLMs... we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided."
K2 | LLM360 | https://huggingface.co/LLM360/K2 | 65B params | 1,400B tokens | 22:1 | ALScore 1.0 | MMLU 64.8 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://www.llm360.ai/blog/several-new-releases-to-further-our-mission.html | Dense | "K2-65B is a fully reproducible LLM outperforming Llama 2 70B using 35% less compute."
Codestral | Mistral | https://huggingface.co/mistralai/Codestral-22B-v0.1 | 22B params | 2,000B tokens | 91:1 | ALScore 0.7 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://mistral.ai/news/codestral/ | Dense | Fluent in 80+ programming languages.
Aya-23-35B | Cohere | https://huggingface.co/spaces/CohereForAI/aya-23 | 35B params | 4,800B tokens | 138:1 | ALScore 1.4 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://drive.google.com/file/d/1YKBPo61pnl97C1c_1C2ZVOnPhqf7MLSc/view | Dense
Yi-XLarge | 01-ai | https://platform.01.ai/ | 2,000B params | 20,000B tokens | 10:1 | ALScore 21.1 | MMLU 85.1 | GPQA 48.2 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://www.aixinzhijie.com/article/6845768 | MoE | Still training as of May/2024: https://appserversrc.8btc.cn/FnDYlEC4STBhphu6M3NL4CKH43FW (dead link; use https://finance.china.com.cn/roll/20240513/6116857.shtml)
Yi-Large | 01-ai | https://platform.01.ai/ | 1,000B params | 15,000B tokens | 15:1 | ALScore 12.9 | MMLU 83.8 | MMLU-Pro 58.1 | GPQA 43.5 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://www.aixinzhijie.com/article/6845768 | Dense
Chameleon | Meta AI | https://ai.meta.com/resources/models-and-libraries/chameleon-downloads/?gk_enable=chameleon_web_flow_is_live | 34B params | 9,200B tokens | 271:1 | ALScore 1.9 | MMLU 65.8 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2405.09818 | Dense | Multimodal.
Sparse Llama 7B | Cerebras | https://huggingface.co/spaces/neuralmagic/llama-2-sparse-transfer-chat-deepsparse | 7B params | 145B tokens | 21:1 | ALScore 0.1 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2405.03594 | Hybrid | https://www.cerebras.net/blog/introducing-sparse-llama-70-smaller-3x-faster-full-accuracy "For the 50% sparse model, we utilized 45 billion tokens of pretraining data, while an additional 100 billion tokens were used for the 70% model. This represents approximately 2% to 8% of the original 2 trillion tokens used to train the base Llama-2 model."
Gemini 1.5 Flash | Google DeepMind | https://aistudio.google.com/app/prompts/new_chat | MMLU 78.9 | MMLU-Pro 59.1 | GPQA 39.5 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://goo.gle/GeminiV1-5 | MoE | 1M context length.
GPT-4o | OpenAI | https://chatgpt.com/ | MMLU 88.7 | MMLU-Pro 72.6 | GPQA 53.6 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://openai.com/index/hello-gpt-4o/ | MoE | Omnimodel. ‘[GPT-4o is] likely an early checkpoint of GPT-5’: https://twitter.com/drjimfan/status/1790089671365767313 ELO: https://twitter.com/LiamFedus/status/1790064963966370209 Demo: https://youtu.be/DQacCB9tDaw
Falcon 2 11B | TII | https://huggingface.co/tiiuae/falcon-11B | 11B params | 5,500B tokens | 500:1 | ALScore 0.8 | MMLU 58.37 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://www.tii.ae/news/falcon-2-uaes-technology-innovation-institute-releases-new-ai-model-series-outperforming-metas | Dense
Fugaku-LLM | Fujitsu | https://huggingface.co/Fugaku-LLM/Fugaku-LLM-13B-instruct | 13B params | 380B tokens | 30:1 | ALScore 0.2 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://www.fujitsu.com/global/about/resources/news/press-releases/2024/0510-01.html | Dense | Japanese. CPU-trained: 158,976+ A64FX CPUs (7M+ cores), zero GPUs. https://en.wikipedia.org/wiki/Fugaku_(supercomputer)
Yi 1.5 34B | 01-ai | https://huggingface.co/01-ai/Yi-1.5-34B-Chat | 34.4B params | 3,600B tokens | 105:1 | ALScore 1.2 | MMLU 76.8 | MMLU-Pro 52.3 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://github.com/01-ai/Yi-1.5 | Dense | Uses 600B more training tokens than Yi 1.0 (Nov/2023).
YOCO | Microsoft | https://github.com/microsoft/unilm/tree/master/YOCO | 3B params | 1,600B tokens | 534:1 | ALScore 0.2 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2405.05254 | Dense | With Tsinghua. You Only Cache Once (YOCO). Long context: "1M context length with near-perfect needle retrieval accuracy"
DeepSeek-V2 | DeepSeek-AI | https://chat.deepseek.com/ | 236B params | 8,100B tokens | 35:1 | ALScore 4.6 | MMLU 78.5 | MMLU-Pro 54.8 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2405.04434 | MoE | Huge dataset, 12% Chinese: "Therefore, we acknowledge that DeepSeek-V2 still has a slight gap in basic English capabilities with LLaMA3 70B".
ChuXin | Independent | https://huggingface.co/chuxin-llm/Chuxin-1.6B-Base | 1.6B params | 2,300B tokens | 1,438:1 | ALScore 0.2 | MMLU 41.07 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2405.04828 | Dense | "results on the 'Needle In A Haystack' (NIAH) tests indicate that ChuXin-1M performs well across all context window lengths up to 1M."
RWKV-v6 Finch | RWKV | https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-2 | 7.63B params | 2,500B tokens | 328:1 | ALScore 0.5 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://huggingface.co/BlinkDL/rwkv-6-world | Dense | https://twitter.com/BlinkDL_AI/status/1787834625211158562
xLSTM | ELLIS | 2.7B params | 15B tokens | 6:1 | ALScore 0.0 | 🆆 📚 🕸 🌋 | May/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2405.04517 | Dense | New method extending LSTM to xLSTM; see also RNNs. Code/weights don't seem to be released. https://github.com/AI-Guru/xlstm-resources
Granite Code | IBM | https://github.com/ibm-granite/granite-code-models | 34B params | 3,500B tokens | 103:1 | ALScore 1.1 | 🌋 | May/2024 | 🟢 | Paper/repo: https://github.com/ibm-granite/granite-code-models/blob/main/paper.pdf | Dense | Dataset: publicly available datasets (e.g., GitHub Code Clean, StarCoder data), public code repositories, and issues from GitHub.
Qwen-Max | Alibaba | https://chat.lmsys.org/ | 300B params | 6,000B tokens | 20:1 | ALScore 4.5 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://help.aliyun.com/zh/dashscope/developer-reference/model-introduction | Dense | https://twitter.com/JustinLin610/status/1787584325367529509
Med-Gemini-L 1.0 | Google DeepMind | https://twitter.com/alan_karthi/status/1785117450528264216 | 1,500B params | 30,000B tokens | 20:1 | ALScore 22.4 | 🆆 📚 🕸 🌋 | May/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2404.18416 | Dense | Med-Gemini-M 1.0 and Med-Gemini-L 1.0 (Pro and Ultra finetunes): "For language tasks that require less complex reasoning, such as summarizing medical notes and creating referral letters, we introduce Med-Gemini-M 1.0 by fine-tuning the Gemini 1.0 Pro model. For other tasks that require more advanced reasoning, we introduce Med-Gemini-L 1.0 by fine-tuning the Gemini 1.0 Ultra model using a self-training method to enable the models to efficiently use web search."
Tele-FLM | BAAI | https://huggingface.co/CofeAI/Tele-FLM | 52B params | 2,000B tokens | 39:1 | ALScore 1.1 | MMLU 64 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.16645 | Dense | Also known as FLM-2. "We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research." Discussion paper Jul/2024: https://arxiv.org/abs/2407.02783
Qwen-1.5 110B | Alibaba | https://huggingface.co/spaces/Qwen/Qwen1.5-110B-Chat-demo | 111B params | 3,000B tokens | 28:1 | ALScore 1.9 | MMLU 80.4 | MMLU-Pro 49.9 | GPQA 35.9 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://qwenlm.github.io/blog/qwen1.5-110b/ | Dense | Worse performance on GPQA (72B=36.3, 110B=35.9).
Arctic | Snowflake AI Research | https://arctic.streamlit.app/ | 480B params | 3,500B tokens | 8:1 | ALScore 4.3 | MMLU 67.3 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/ | Hybrid | "Arctic uses a unique Dense-MoE Hybrid transformer architecture. It combines a 10B dense transformer model with a residual 128×3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating."
SenseNova 5.0 | SenseTime | 600B params | 10,000B tokens | 17:1 | ALScore 8.2 | MMLU 84.78 | GPQA 42.93 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://news.futunn.com/en/post/41290101/a-large-shangtang-multi-modal-model-with-600-billion-parameters | MoE | GPT-4 scale; low media coverage; no demo in the Western world. https://www.techinasia.com/sensetime-pauses-trading-stock-rises-30-model-launch
OpenELM | Apple | https://huggingface.co/apple/OpenELM-3B-Instruct | 3.04B params | 1,500B tokens | 494:1 | ALScore 0.2 | MMLU 26.76 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.14619 | Dense | On-device model (laptop, phone). Open-source Efficient Language Models (OpenELM). https://venturebeat.com/ai/apple-releases-openelm-small-open-source-ai-models-designed-to-run-on-device/
phi-3-medium | Microsoft | https://huggingface.co/microsoft/Phi-3-medium-128k-instruct | 14B params | 4,800B tokens | 343:1 | ALScore 0.9 | MMLU 78.2 | MMLU-Pro 55.7 | ⚛️ | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.14219 | Dense | Preview only, benchmarks being investigated as of May/2024.
phi-3-mini | Microsoft | https://huggingface.co/microsoft/Phi-3-mini-128k-instruct | 3.8B params | 3,300B tokens | 869:1 | ALScore 0.4 | MMLU 68.8 | MMLU-Pro 45.7 | ⚛️ | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.14219 | Dense | "phi3-mini can be quantized to 4-bits so that it only occupies ≈ 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on iPhone 14 with A16 Bionic chip running natively on-device and fully offline achieving more than 12 tokens per second."
Llama 3 70B | Meta AI | https://meta.ai/ | 70B params | 15,000B tokens | 215:1 | ALScore 3.4 | MMLU 82 | MMLU-Pro 52.8 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://ai.meta.com/blog/meta-llama-3/ | Dense | Instruct MMLU-Pro=56.2.
HLAT | Amazon | 7B params | 1,800B tokens | 258:1 | ALScore 0.4 | MMLU 41.3 | MMLU-Pro 18 | 🆆 📚 🕸 🌋 | Apr/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2404.10630 | Dense | HLAT = High-quality LLM pre-trained on AWS Trainium. Same architecture as Llama 7B. Pre-training is performed on up to 64 Amazon EC2 trn1.32xlarge instances, totalling up to 1,024 AWS Trainium accelerators. Read more about Trainium: https://www.aboutamazon.com/news/aws/what-you-need-to-know-about-the-aws-ai-chips-powering-amazons-partnership-with-anthropic
Idefics2 | Hugging Face | https://huggingface.co/HuggingFaceM4/idefics2-8b | 8.4B params | 🆆 🕸 | Apr/2024 | 🟢 | Paper/repo: https://huggingface.co/blog/idefics2 | Dense | Clone of Flamingo, now using Mistral 7B. Named after Asterix and Obelix's dog Idefix (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS).
Reka Core | Reka AI | https://poe.com/RekaCore | 300B params | 10,000B tokens | 34:1 | ALScore 5.8 | MMLU 83.2 | GPQA 38.2 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://publications.reka.ai/reka-core-tech-report.pdf | Dense | https://www.reka.ai/news/reka-core-our-frontier-class-multimodal-language-model
WizardLM-2-8x22B | Microsoft | https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF | 141B params | 2,000B tokens | 15:1 | ALScore 1.8 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://wizardlm.github.io/WizardLM2/ | MoE | Base model = mistral-8x22b.
Pile-T5 | EleutherAI | https://huggingface.co/EleutherAI/pile-t5-xxl | 11B params | 2,000B tokens | 182:1 | ALScore 0.5 | MMLU 53.84 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://blog.eleuther.ai/pile-t5/ | Dense
Zephyr 141B-A35B | Hugging Face H4 | https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1 | 35B params | 2,000B tokens | 58:1 | ALScore 0.9 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2403.07691 | MoE | mixtral-8x22b finetune using Odds Ratio Preference Optimization (ORPO).
Rerank 3 | Cohere | https://docs.cohere.com/reference/rerank-1 | 104B params | 4,000B tokens | 39:1 | ALScore 2.1 | 📚 🕸 | Apr/2024 | 🟢 | Paper/repo: https://txt.cohere.com/rerank-3/ | Dense | RAG + semantic search, possibly backed by Command-R+.
gpt-4-turbo-2024-04-09 | OpenAI | https://chat.openai.com/ | MMLU 86.5 | MMLU-Pro 63.7 | GPQA 49.1 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://cdn.openai.com/papers/gpt-4.pdf | MoE | This is such a significantly better model that I've added it here. This GPQA=46.5%, old GPT-4 GPQA=36%: https://twitter.com/EpochAIResearch/status/1778463039932584205 MMLU scores are unclear, but may have improved by 1%: https://twitter.com/OpenAI/status/1778602770784002136. Final benchmarks are here: https://archive.md/6Cc0Z
MiniCPM-2.4B | Tsinghua | https://github.com/OpenBMB/MiniCPM/ | 2.4B params | 1,100B tokens | 459:1 | ALScore 0.2 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.06395 | Dense | MoE option: https://huggingface.co/openbmb/MiniCPM-MoE-8x2B
Ferret-UI | Apple | https://github.com/apple/ml-ferret | 13B params | 2,000B tokens | 154:1 | ALScore 0.5 | 🆆 📚 🕸 👥 | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.05719 | Dense | Vicuna base, multimodal. Extension of Ferret from Oct/2023.
mixtral-8x22b | Mistral | https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1 | 141B params | 2,000B tokens | 15:1 | ALScore 1.8 | MMLU 77.75 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://mistral.ai/news/mixtral-8x22b/ | MoE | MoE=22Bx8, seq=65536.
Sailor | Sail | https://huggingface.co/sail | 7B params | 200B tokens | 29:1 | ALScore 0.1 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.03608v1 | Dense | SEA languages. Based on Qwen-1.5. https://github.com/sail-sg/sailor-llm "Generally Sailor models consume around 200B tokens, completing a full pass through the SailCraft corpus once. However, the Sailor-0.5B model undergoes training with 400B tokens, equivalent to 2 epochs."
JetMoE-8B | MIT | https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat | 8B params | 1,250B tokens | 157:1 | ALScore 0.3 | MMLU 49.2 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://huggingface.co/jetmoe/jetmoe-8b | MoE
Eurus | Tsinghua | https://huggingface.co/collections/openbmb/eurus-660bc40bec5376b3adc9d1c5 | 70B params | 2,000B tokens | 29:1 | ALScore 1.2 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://huggingface.co/collections/openbmb/eurus-660bc40bec5376b3adc9d1c5 | Dense | Fine-tune of Mistral-7B and CodeLlama-70B.
Command-R+ | Cohere | https://huggingface.co/spaces/CohereForAI/c4ai-command-r-plus | 104B params | 4,000B tokens | 39:1 | ALScore 2.1 | MMLU 75.7 | 📚 🕸 | Apr/2024 | 🟢 | Paper/repo: https://huggingface.co/CohereForAI/c4ai-command-r-plus | Dense | Purpose-built to excel at real-world enterprise use cases. Announce with no arch details: https://txt.cohere.com/command-r-plus-microsoft-azure/
Viking | Silo AI | 33B params | 2,000B tokens | 61:1 | ALScore 0.9 | 🌋 | Apr/2024 | 🟢 | Paper/repo: https://www.silo.ai/blog/viking-7b-13b-33b-sailing-the-nordic-seas-of-multilinguality | Dense | 'Viking uses an architecture similar to Llama 2, with flash attention, rotary embeddings, grouped query attention and supports a 4k sequence length'
OLMo-Bitnet-1B | Nous Research | https://huggingface.co/NousResearch/OLMo-Bitnet-1B | 1B params | 60B tokens | 60:1 | ALScore 0.0 | 🌋 | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2402.17764 | Dense | 1.58-bit quantized (ternary weights) means we can run a 70B model in ~14GB VRAM. See also BitNet b1.58.
Aurora-M | International | https://huggingface.co/collections/aurora-m/aurora-m-models-65fdfdff62471e09812f5407 | 15.5B params | 2,035B tokens | 132:1 | ALScore 0.6 | 🌋 | Mar/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.00399 | Dense
ReALM-3B | Apple | 3B params | 134B tokens | 45:1 | ALScore 0.1 | 🌋 | Mar/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2403.20329 | Dense | FLAN-T5 (Oct/2022) finetune.
Qwen1.5-MoE-A2.7B | Alibaba | https://qwenlm.github.io/blog/qwen-moe/ | 14.3B params | 1,500B tokens | 105:1 | ALScore 0.5 | MMLU 62.5 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://qwenlm.github.io/blog/qwen-moe/ | MoE | "Of particular significance is the fact that, through upcycling, the necessity for training an equivalent volume of tokens as in the original model has been eliminated." I assumed half of the original 3T tokens.
Grok-1.5 | xAI | https://grok.x.ai/ | 314B params | 6,000B tokens | 20:1 | ALScore 4.6 | MMLU 81.3 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://x.ai/blog/grok-1.5 | Dense | Context=128k.
Jamba | AI21 | https://huggingface.co/ai21labs/Jamba-v0.1 | 52B params | 5,000B tokens | 97:1 | ALScore 1.7 | MMLU 67.4 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2403.19887 | MoE | Open weights, licensed under Apache 2.0.
DBRX | MosaicML | https://huggingface.co/spaces/databricks/dbrx-instruct | 132B params | 12,000B tokens | 91:1 | ALScore 4.2 | MMLU 73.7 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm | MoE | Trained for $10M on 3,072 NVIDIA H100s connected by 3.2Tbps Infiniband.
Stable Code Instruct 3B | Stability AI | https://huggingface.co/stabilityai/stable-code-instruct-3b | 2.7B params | 560B tokens | 208:1 | ALScore 0.1 | 🌋 | Mar/2024 | 🟢 | Paper/repo: https://stability.ai/news/introducing-stable-code-instruct-3b | Dense | Context window=16,384. Trained on The Stack dataset.
EvoLLM-JP | Sakana AI | https://huggingface.co/SakanaAI/EvoLLM-JP-v1-10B | 10B params | 800B tokens | 80:1 | ALScore 0.3 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2403.13187 | Dense | Japanese. Model merge: 'our EvoLLM-JP-A is a merge of shisa-gamma-7b-v1, Arithmo2-Mistral-7B, and Abel7B-002' https://sakana.ai/evolutionary-model-merge/
RakutenAI-7B | Rakuten Group | https://huggingface.co/Rakuten/RakutenAI-7B | 7B params | 3,000B tokens | 429:1 | ALScore 0.5 | MMLU 61.31 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2403.15484 | Dense | Japanese. Mistral 7B derivative.
Parakeet | Independent | https://colab.research.google.com/drive/1gI8CM9Bz9ov0-E6aL2jF808rE56UtZyF?usp=sharing | 0.378B params | 3B tokens | 8:1 | ALScore 0.0 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://news.ycombinator.com/item?id=39745700#39745702 | Dense | Tiny model (378M) for testing.
RWKV-v5 EagleX | RWKV | https://huggingface.co/recursal/EagleX_1-7T | 7.52B params | 1,700B tokens | 227:1 | ALScore 0.4 | MMLU 40.14 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://substack.recursal.ai/p/eaglex-17t-soaring-past-llama-7b | Dense | Built on the RWKV-v5 architecture (a linear transformer with 10-100x+ lower inference cost).
MM1 | Apple | 30B params | 2,010B tokens | 67:1 | ALScore 0.8 | 🌋 | Mar/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2403.09611 | Dense | VLM, outperforms Flamingo 80B (Apr/2022) across benchmarks. 2T text tokens + ~10B+ other text (estimate). Unreleased.
RFM-1 | Covariant | https://vimeo.com/921866765 | 8B params | 160B tokens | 20:1 | ALScore 0.1 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟡 | Paper/repo: https://covariant.ai/insights/introducing-rfm-1-giving-robots-human-like-reasoning-capabilities/ | Dense | Commercial, multimodal for robotics.