1
(377) Permalink:
https://lifearchitect.ai/models-table/
Timeline view:
https://lifearchitect.ai/timeline
The Memo:
https://lifearchitect.ai/memo
2
Columns: Model | Lab | Playground | Parameters (B) | Tokens trained (B) | Ratio Tokens:Params (Chinchilla scaling ≥20:1) | ALScore | MMLU | MMLU-Pro | GPQA | Training dataset | Announced ▼ | Public? | Paper / Repo | Arch | Notes
"ALScore" is a quick and dirty rating of the model's power. The formula is:
square root of (Parameters × Tokens) ÷ 300, with both values in billions.
Any ALScore ≥ 1.0 is a powerful model in mid-2023.
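Worked example: a minimal Python sketch of the ALScore and Tokens:Params calculations above, checked against the Claude 3 Opus row below (2,000B parameters, 40,000B tokens); the function names are illustrative only.

```python
import math

def alscore(params_b: float, tokens_b: float) -> float:
    """ALScore = square root of (Parameters x Tokens) / 300, both in billions."""
    return math.sqrt(params_b * tokens_b) / 300

def tokens_to_params(params_b: float, tokens_b: float) -> float:
    """Tokens:Params ratio; >= 20 roughly meets Chinchilla scaling."""
    return tokens_b / params_b

# Claude 3 Opus row (estimates): 2,000B parameters, 40,000B tokens
print(round(alscore(2000, 40000), 1))            # 29.8
print(f"{tokens_to_params(2000, 40000):.0f}:1")  # 20:1
```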
3
OlympusAmazon
https://lifearchitect.ai/olympus/
200040000TBA
New related Titan details: '$65m training run. 200B dense model on 4T tokens of data across 13,760 NVIDIA A100 chips. 48 days to train. Training runs soon to cross $1B' https://importai.substack.com/p/import-ai-365-wmd-benchmark-amazon
4
GPT-5OpenAI
https://lifearchitect.ai/gpt-5/
52500TBADue 2024.
5
GPT-6OpenAIhttps://lifearchitect.ai/gpt-6/TBADue 2025.
6
AuroraGPT (ScienceGPT)
Argonne National Laboratory
https://www.hpcwire.com/2023/11/13/training-of-1-trillion-parameter-scientific-ai-begins/1000TBA🔴
https://tpc.dev/2023/11/10/tpc-announced-with-founding-partners/ powered by Intel Ponte Vecchio GPUs.
7
Grok-2xAIhttps://twitter.com/elonmusk/status/1773655245769330757TBADue 2025.
8
MAI-1Microsoft
https://arstechnica.com/information-technology/2024/05/microsoft-developing-mai-1-language-model-that-may-compete-with-openai-report/
5001000020:17.5TBAhttps://www.reuters.com/technology/microsoft-readies-new-ai-model-compete-with-google-openai-information-reports-2024-05-06/Dense
Due 2024. MAI=Microsoft artificial intelligence. MSFT CTO statement: https://archive.md/XRSgS
9
Causal AxiomsMicrosoft0.0671.218:10.0⚛️Jul/2024🔴https://arxiv.org/abs/2407.07612v1
Dense
"the training dataset follows a specific structure, we develop a custom tokenizer. Alphanumeric node names are tokenized at a character level, while special terms such as β€˜causes’, β€˜Does’, β€˜cause’, β€˜Yes’, and β€˜No’ are tokenized at the word level... Our training setup consists of around 175k instances of sequential chains with size of chains ranging from 3 to 6 nodes... All models are trained for 100 epochs. [LifeArchitect.ai estimate is 12 tokens per node x 6 nodes x 175,000 instances x 100 epochs = 1.26B tokens]" Based on GPT-2 arch.
10
SenseNova 5.5SenseTime
https://platform.sensenova.cn/home#/home
6001000017:18.2⚛️Jul/2024🟢https://www.sensetime.com/en/news-detail/51168278?categoryId=1072MoE
"The model training was based on over 10TB tokens [sic, taken as 10T tokens instead of 10TB=2T tokens] of high-quality training data, including a large amount of synthetically-generated reasoning chain data, which help to enhance its reasoning capabilities." & "The updates include SenseNova 5o, the first real-time multimodal model in China, which provides a new AI interaction model on par with GPT-4o's streaming interaction capabilities"
11
InternLM2.5
Shanghai AI Laboratory/SenseTime
https://huggingface.co/collections/internlm/internlm25-66853f32717072d17581bc13
72600372:10.472.838.4🆆 📚⬆ 🕸 🌋Jul/2024🟢
https://github.com/InternLM/InternLM/blob/main/model_cards/internlm2.5_7b.md
Dense
"The release of InternLM2.5 series contains 7B model size for now and we are going to release the 1.8B and 20B versions soon"
12
Llama 3 405BMeta AI
https://wabetainfo.com/whatsapp-beta-for-android-2-24-14-7-whats-new/
4051500038:18.284.848🆆 📚⬆ 🕸 🌋 Jun/2024🟡
Dense
Waiting on release outside of WhatsApp Android as of 1/Jul/2024.
13
ERNIE 4.0 TurboBaidu
https://yiyan.baidu.com/
🆆 📚⬆ 🕸 🌋Jun/2024🟢https://www.reuters.com/technology/artificial-intelligence/baidu-launches-upgraded-ai-model-says-user-base-hits-300-mln-2024-06-28/
Dense
"Ernie Bot has reached 300 million users since its launch [on 16/Mar/2023, public Aug/2023]" Jun/2024
14
Gemma 2
Google DeepMind
https://huggingface.co/google/gemma-2-27b-it
2713000482:12.075.2🆆 📚⬆ 🕸 🌋Jun/2024🟢
https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf
Dense
Announce: https://blog.google/technology/developers/google-gemma-2/
15
CriticGPTOpenAI👥Jun/2024🔴https://cdn.openai.com/llm-critics-help-catch-llm-bugs-paper.pdf
Dense
"LLM Critics Help Catch LLM Bugs" Announce: https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/
16
4M-21Apple
https://github.com/apple/ml-4m/
3🌋Jun/2024🟢https://arxiv.org/abs/2406.09406
Dense
Vision model based on T5-XXL. Modalities: RGB, Caption, Bounding boxes, Semantic segmentation, Depth, Human poses, Surface normals, CLIP, DINOv2, ImageBind, Metadata, Canny edges, SAM edges, SAM instances, Color palette. Project page: https://4m.epfl.ch/
17
ESM3
EvolutionaryScale
https://github.com/evolutionaryscale/esm
987718:10.9🌋Jun/2024🟡https://www.evolutionaryscale.ai/blog/esm3-release
Dense
Biology large language model: "sequence, structure, and function are all masked and predicted during training, ESM3 can generate in all three modalities." 1.4B only released.
18
PanGu 5.0 SuperHuawei
https://www.huaweicloud.com/intl/en-us/product/modelarts.html
10002000020:114.9🌋Jun/2024🟡https://www.huaweicentral.com/huawei-cloud-unveils-pangu-large-model-5-0/MoEhttps://x.com/faridofanani96/status/1804079517193113850/photo/1
19
Claude 3.5 Sonnet
Anthropic
https://poe.com/Claude-3.5-Sonnet
90.472.8367.2🆆 📚⬆ 🕸 🌋Jun/2024🟢https://www.anthropic.com/news/claude-3-5-sonnet
Dense
Model card: https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf
20
DeepSeek-Coder-V2
DeepSeek-AI
https://chat.deepseek.com/coder
2361020035:14.679.263.63🆆 📚⬆ 🕸 🌋Jun/2024🟢https://github.com/deepseek-ai/DeepSeek-Coder-V2/blob/main/paper.pdfMoEDeepSeek-V2 with an additional 6 trillion tokens.
21
DCLM-Baseline 7B 2.6T
International
https://www.datacomp.ai/dclm/
72600372:10.463.7🕸 🌋Jun/2024🟡https://arxiv.org/abs/2406.11794
Dense
New dataset: 240T tokens: 8× larger than previous SOTA dataset. DCLM-Pool is 240T, DCLM-Baseline is 3.8T: "we combine our 3.8T DCLM-BASELINE with the StarCoder and ProofPile2 data to arrive at a 4.1T token dataset. We train a 7B model for 2.5T tokens" and "We release the DCLM benchmark, framework, models, and datasets at https://datacomp.ai/dclm." Model not yet available.
22
Nemotron-4-340B
NVIDIA
https://build.nvidia.com/nvidia/nemotron-4-340b-instruct
340900027:15.881.1🆆 📚⬆ 🕸 🌋Jun/2024🟢https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T.pdf
Dense
Open-source equiv of Mar/2023 GPT-4 (1760MoE≈340B, 13T), same param count but 2x the tokens of May/2023 PaLM 2 (340B, 3.6T), competitor to Nov/2023 Grok-1 (314B, 6T). Trained on 6,144 H100s. ~1.3TB for inference. 50+ natural and 40+ coding languages. Trained between December 2023 and May 2024. MMLU 0-shot for instruct=78.7, 5-shot for base=81.1. Permalink for paper: https://research.nvidia.com/publication/2024-06_nemotron-4-340b
23
Apple On-Device model Jun/2024
Apple
https://github.com/apple/corenet/tree/main/projects/openelm
3.041500494:10.226.76🆆 📚⬆ 🕸 🌋Jun/2024🟢https://arxiv.org/abs/2404.14619
Dense
https://lifearchitect.ai/apple/ Likely to be the Apple OpenELM model (Apr/2024). "two of these models — a ~3 billion parameter on-device language model, and a larger server-based language model available with Private Cloud Compute". https://machinelearning.apple.com/research/introducing-apple-foundation-models The server-based model is possibly Ferret, although it is more properly called a multimodal model (not just language). It could also be Apple GPT based on their Ajax framework: https://archive.md/f3C0r
24
MatMul-Free LMUCSC
https://github.com/ridgerchu/matmulfreellm
2.710038:10.1🆆 📚⬆ 🕸 🌋Jun/2024🟢https://arxiv.org/abs/2406.02528
Dense
"we explore alternative methods for mixing tokens without relying on matrix multiplications." Compared with Transformer++ based on Llama-2, not to be confused with the pre-GPT-3 American Express Transformer++ paper from 2/Mar/2020. Instead, Transformer++ is defined in the Mamba paper: 'Transformer++: A Transformer with an improved architecture, namely rotary positional encodings (Su et al. 2021) and SwiGLU MLP (Shazeer 2020)'
25
LunaGalileo
https://www.rungalileo.io/blog/introducing-galileo-luna-a-family-of-evaluation-foundation-models
0.44162369:10.0🆆 📚⬆ 🕸 🌋Jun/2024🟢https://arxiv.org/abs/2406.00975
Dense
Based on DeBERTA-large (440M). RoBERTa=162B token dataset.
26
Qwen2Alibaba
https://huggingface.co/spaces/Qwen/Qwen2-72B-Instruct
72300042:11.584.255.637.9🆆 📚⬆ 🕸 🌋Jun/2024🟢https://qwenlm.github.io/blog/qwen2/
Dense
Instruct MMLU=82. Instruct GPQA=41.9. Qwen1.5 trained on ~3T tokens, using this number for Qwen2 while waiting for official number.
27
Qwen2-57B-A14BAlibaba
https://github.com/QwenLM/Qwen2?tab=readme-ov-file
57300053:11.476.54334.3🆆 📚⬆ 🕸 🌋Jun/2024🟢https://qwenlm.github.io/blog/qwen2/MoE
28
Skywork MoE 16x13B
Kunlun Tech
https://huggingface.co/Skywork/Skywork-MoE-Base
14677.4🆆 📚⬆ 🕸 🌋Jun/2024🟢https://github.com/SkyworkAI/Skywork-MoE/blob/main/skywork-moe-tech-report.pdfMoE
CN + EN. "(MoE) model with 146 billion parameters, 16 experts, and 22 billion activated parameters. This model is initialized from the pre-existing dense checkpoints of our Skywork-13B model."
29
Mamba-2CMU
https://github.com/state-spaces/mamba
2.7300112:10.1🆆 📚⬆ 🕸 🌋 May/2024🟢
https://arxiv.org/abs/2405.21060
Dense
Analysis: https://tridao.me/blog/2024/mamba2-part1-model/
30
MAP-NeoInternational
https://map-neo.github.io/
74500643:10.658.14🆆 📚⬆ 🕸 🌋 May/2024🟢
https://arxiv.org/abs/2405.19327
Dense
"first fully open-sourced bilingual LLM with comparable performance to existing state-of-the-art LLMs... we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided."
31
K2LLM360
https://huggingface.co/LLM360/K2
65140022:11.064.8🆆 📚⬆ 🕸 🌋 May/2024🟢
https://www.llm360.ai/blog/several-new-releases-to-further-our-mission.html
Dense
"K2-65B is a fully reproducible LLM outperforming Llama 2 70B using 35% less compute."
32
CodestralMistral
https://huggingface.co/mistralai/Codestral-22B-v0.1
22200091:10.7🆆 📚⬆ 🕸 🌋 May/2024🟢
https://mistral.ai/news/codestral/
Dense
Fluent in 80+ programming languages
33
Aya-23-35BCohere
https://huggingface.co/spaces/CohereForAI/aya-23
354800138:11.4🆆 📚⬆ 🕸 🌋May/2024🟢https://drive.google.com/file/d/1YKBPo61pnl97C1c_1C2ZVOnPhqf7MLSc/view
Dense
34
Yi-XLarge01-ai
https://platform.01.ai/
20002000010:121.185.148.2🆆 📚⬆ 🕸 🌋May/2024🟢https://www.aixinzhijie.com/article/6845768MoE
Still training as of May/2024: https://appserversrc.8btc.cn/FnDYlEC4STBhphu6M3NL4CKH43FW dead link, use: https://finance.china.com.cn/roll/20240513/6116857.shtml
35
Yi-Large01-ai
https://platform.01.ai/
10001500015:112.983.858.143.5🆆 📚⬆ 🕸 🌋May/2024🟢https://www.aixinzhijie.com/article/6845768
Dense
36
ChameleonMeta AI
https://ai.meta.com/resources/models-and-libraries/chameleon-downloads/?gk_enable=chameleon_web_flow_is_live
349200271:11.965.8🆆 📚⬆ 🕸 🌋 May/2024🟢
https://arxiv.org/abs/2405.09818
Dense
Multimodal
37
Sparse Llama 7BCerebras
https://huggingface.co/spaces/neuralmagic/llama-2-sparse-transfer-chat-deepsparse
714521:10.1🆆 📚⬆ 🕸 🌋 May/2024🟢
https://arxiv.org/abs/2405.03594
Hybrid
https://www.cerebras.net/blog/introducing-sparse-llama-70-smaller-3x-faster-full-accuracy "For the 50% sparse model, we utilized 45 billion tokens of pretraining data, while an additional 100 billion tokens were used for the 70% model. This represents approximately 2% to 8% of the original 2 trillion tokens used to train the base Llama-2 model."
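A sketch of the percentages implied by that quote, reading "an additional 100 billion tokens" as on top of the 45B (an assumption):

```python
base_b = 2_000            # ~2T tokens used for the base Llama-2
sparse_50_b = 45          # tokens for the 50% sparse model
sparse_70_b = 45 + 100    # assumed cumulative total for the 70% sparse model

print(f"{sparse_50_b / base_b:.1%}")  # ~2%
print(f"{sparse_70_b / base_b:.1%}")  # ~7%, i.e. the quoted "approximately 2% to 8%"
```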
38
Gemini 1.5 Flash
Google DeepMind
https://aistudio.google.com/app/prompts/new_chat
78.959.139.5🆆 📚⬆ 🕸 🌋 May/2024🟢
https://goo.gle/GeminiV1-5
MoE1M context length.
39
GPT-4oOpenAIhttps://chatgpt.com/88.772.653.6🆆 📚⬆ 🕸 🌋May/2024🟢https://openai.com/index/hello-gpt-4o/MoE
Omnimodel. '[GPT-4o is] likely an early checkpoint of GPT-5'. https://twitter.com/drjimfan/status/1790089671365767313 ELO: https://twitter.com/LiamFedus/status/1790064963966370209 Demo: https://youtu.be/DQacCB9tDaw
40
Falcon 2 11BTII
https://huggingface.co/tiiuae/falcon-11B
115500500:10.858.37🆆 📚⬆ 🕸 🌋 May/2024🟢
https://www.tii.ae/news/falcon-2-uaes-technology-innovation-institute-releases-new-ai-model-series-outperforming-metas
Dense
Announce: https://www.tii.ae/news/falcon-2-uaes-technology-innovation-institute-releases-new-ai-model-series-outperforming-metas
41
Fugaku-LLMFujitsu
https://huggingface.co/Fugaku-LLM/Fugaku-LLM-13B-instruct
1338030:10.2🆆 📚⬆ 🕸 🌋May/2024🟢https://www.fujitsu.com/global/about/resources/news/press-releases/2024/0510-01.html
Dense
Japanese. CPU trained: 158,976+ A64FX CPUs (7M+ cores), zero GPUs. https://en.wikipedia.org/wiki/Fugaku_(supercomputer)
42
Yi 1.5 34B01-ai
https://huggingface.co/01-ai/Yi-1.5-34B-Chat
34.43600105:11.276.852.3🆆 📚⬆ 🕸 🌋May/2024🟢https://github.com/01-ai/Yi-1.5
Dense
Uses 600B more training tokens than Yi 1.0 (Nov/2023).
43
YOCOMicrosoft
https://github.com/microsoft/unilm/tree/master/YOCO
31600534:10.2🆆 📚⬆ 🕸 🌋 May/2024🟢https://arxiv.org/abs/2405.05254
Dense
With Tsinghua. You Only Cache Once (YOCO). Long context: "1M context length with near-perfect needle retrieval accuracy"
44
DeepSeek-V2DeepSeek-AI
https://chat.deepseek.com/
236810035:14.678.554.8🆆 📚⬆ 🕸 🌋 May/2024🟢
https://arxiv.org/abs/2405.04434
MoEHuge dataset, 12% Chinese "Therefore, we acknowledge that DeepSeek-V2 still has a slight gap in basic English capabilities with LLaMA3 70B".
45
ChuXinIndependent
https://huggingface.co/chuxin-llm/Chuxin-1.6B-Base
1.623001,438:10.241.07🆆 📚⬆ 🕸 🌋 May/2024🟢
https://arxiv.org/abs/2405.04828
Dense
"results on the ”Needle In A Haystack”(NIAH) tests indicate that ChuXin-1M performs well across all context window lengths up to 1M."
46
RWKV-v6 FinchRWKV
https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-2
7.632500328:10.5🆆 📚⬆ 🕸 🌋 May/2024🟢
https://huggingface.co/BlinkDL/rwkv-6-world
Dense
https://twitter.com/BlinkDL_AI/status/1787834625211158562
47
xLSTMELLIS2.7156:10.0🆆 📚⬆ 🕸 🌋 May/2024🔴
https://arxiv.org/abs/2405.04517
Dense
New method extending LSTM to xLSTM, see also RNNs. Code/weights don't seem to be released. https://github.com/AI-Guru/xlstm-resources
48
Granite CodeIBM
https://github.com/ibm-granite/granite-code-models
343500103:11.1🌋May/2024🟢
https://github.com/ibm-granite/granite-code-models/blob/main/paper.pdf
Dense
Dataset: publicly available datasets (e.g., GitHub Code Clean, Starcoder data), public code repositories, and issues from GitHub.
49
Qwen-MaxAlibaba
https://chat.lmsys.org/
300600020:14.5🆆 📚⬆ 🕸 🌋 May/2024🟢
https://help.aliyun.com/zh/dashscope/developer-reference/model-introduction
Dense
https://twitter.com/JustinLin610/status/1787584325367529509
50
Med-Gemini-L 1.0
Google DeepMind
https://twitter.com/alan_karthi/status/1785117450528264216
15003000020:122.4🆆 📚⬆ 🕸 🌋 May/2024🔴
https://arxiv.org/abs/2404.18416
Dense
Med-Gemini-M 1.0 and Med-Gemini-L 1.0 (Pro and Ultra finetunes) "For language tasks that require less complex reasoning, such as summarizing medical notes and creating referral letters, we introduce Med-Gemini-M 1.0 by fine-tuning the Gemini 1.0 Pro model. For other tasks that require more advanced reasoning, we introduce Med-Gemini-L 1.0 by fine-tuning the Gemini 1.0 Ultra model using a self-training method to enable the models to efficiently use web search."
51
Tele-FLMBAAI
https://huggingface.co/CofeAI/Tele-FLM
52200039:11.164🆆 📚⬆ 🕸 🌋 Apr/2024🟢
https://arxiv.org/abs/2404.16645
Dense
Also known as FLM-2. "We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research." Discussion paper Jul/2024: https://arxiv.org/abs/2407.02783
52
Qwen-1.5 110BAlibaba
https://huggingface.co/spaces/Qwen/Qwen1.5-110B-Chat-demo
111300028:11.980.449.935.9🆆 📚⬆ 🕸 🌋 Apr/2024🟢
https://qwenlm.github.io/blog/qwen1.5-110b/
Dense
Worse performance on GPQA (72B=36.3, 110B=35.9).
53
Arctic
Snowflake AI Research
https://arctic.streamlit.app/
48035008:14.367.3🆆 📚⬆ 🕸 🌋 Apr/2024🟢https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/
Hybrid
"Arctic uses a unique Dense-MoE Hybrid transformer architecture. It combines a 10B dense transformer model with a residual 128Γ—3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating."
54
SenseNova 5.0SenseTime6001000017:18.284.7842.93🆆 📚⬆ 🕸 🌋 Apr/2024🟢https://news.futunn.com/en/post/41290101/a-large-shangtang-multi-modal-model-with-600-billion-parametersMoE
GPT-4 scale; low media coverage; no demo in Western world. https://www.techinasia.com/sensetime-pauses-trading-stock-rises-30-model-launch
55
OpenELMApple
https://huggingface.co/apple/OpenELM-3B-Instruct
3.041500494:10.226.76🆆 📚⬆ 🕸 🌋 Apr/2024🟢https://arxiv.org/abs/2404.14619
Dense
On-device model (laptop, phone). Open-source Efficient Language Models (OpenELM). https://venturebeat.com/ai/apple-releases-openelm-small-open-source-ai-models-designed-to-run-on-device/
56
phi-3-mediumMicrosoft
https://huggingface.co/microsoft/Phi-3-medium-128k-instruct
144800343:10.978.255.7⚛️Apr/2024🟢https://arxiv.org/abs/2404.14219
Dense
Preview only, benchmarks being investigated as of May/2024.
57
phi-3-miniMicrosoft
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct
3.83300869:10.468.845.7⚛️Apr/2024🟢https://arxiv.org/abs/2404.14219
Dense
"phi3-mini can be quantized to 4-bits so that it only occupies β‰ˆ 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on iPhone 14 with A16 Bionic chip running natively on-device and fully offline achieving more than 12 tokens per second."
58
Llama 3 70BMeta AIhttps://meta.ai/7015000215:13.48252.8🆆 📚⬆ 🕸 🌋 Apr/2024🟢https://ai.meta.com/blog/meta-llama-3/
Dense
Instruct MMLU-Pro=56.2
59
HLATAmazon71800258:10.441.318🆆 📚⬆ 🕸 🌋 Apr/2024🔴https://arxiv.org/abs/2404.10630
Dense
HLAT=High-quality LLM pre-trained on AWS Trainium. Same arch as Llama 7B. The pre-training is performed on up to 64 Amazon EC2 trn1.32xlarge instances, totalling up to 1,024 AWS Trainium accelerators. Read more about Trainium: https://www.aboutamazon.com/news/aws/what-you-need-to-know-about-the-aws-ai-chips-powering-amazons-partnership-with-anthropic
60
Idefics2Hugging Face
https://huggingface.co/HuggingFaceM4/idefics2-8b
8.4🆆 🕸Apr/2024🟢https://huggingface.co/blog/idefics2
Dense
Clone of Flamingo now using Mistral 7B. Named after Asterix and Obelix's dog Idefix (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS)
61
Reka CoreReka AI
https://poe.com/RekaCore
3001000034:15.883.238.2🆆 📚⬆ 🕸 🌋 Apr/2024🟢
https://publications.reka.ai/reka-core-tech-report.pdf
Dense
https://www.reka.ai/news/reka-core-our-frontier-class-multimodal-language-model
62
WizardLM-2-8x22B
Microsoft
https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF
141200015:11.8🆆 📚⬆ 🕸 🌋 Apr/2024🟢
https://wizardlm.github.io/WizardLM2/
MoEBase model = mistral-8x22b.
63
Pile-T5EleutherAI
https://huggingface.co/EleutherAI/pile-t5-xxl
112000182:10.553.84🆆 📚⬆ 🕸 🌋Apr/2024🟢https://blog.eleuther.ai/pile-t5/
Dense
64
Zephyr 141B-A35B
Hugging Face H4
https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
35200058:10.9🆆 📚⬆ 🕸 🌋 Apr/2024🟢
https://arxiv.org/abs/2403.07691
MoEmixtral-8x22b finetune using Odds Ratio Preference Optimization (ORPO).
65
Rerank 3Cohere
https://docs.cohere.com/reference/rerank-1
104400039:12.1📚 🕸 Apr/2024🟢https://txt.cohere.com/rerank-3/
Dense
RAG + semantic search, possibly backed by Command-R+.
66
gpt-4-turbo-2024-04-09
OpenAI
https://chat.openai.com/
86.563.749.1🆆 📚⬆ 🕸 🌋Apr/2024🟢https://cdn.openai.com/papers/gpt-4.pdfMoE
This is such a significantly better model that I've added it here. This GPQA=46.5%, old GPT-4 GPQA=36%. https://twitter.com/EpochAIResearch/status/1778463039932584205 MMLU scores are unclear, but may have improved by 1%: https://twitter.com/OpenAI/status/1778602770784002136. Final benchmarks are here: https://archive.md/6Cc0Z
67
MiniCPM-2.4BTsinghua
https://github.com/OpenBMB/MiniCPM/
2.41100459:10.2🆆 📚⬆ 🕸 🌋 Apr/2024🟢
https://arxiv.org/abs/2404.06395
Dense
MoE option=https://huggingface.co/openbmb/MiniCPM-MoE-8x2B
68
Ferret-UIApple
https://github.com/apple/ml-ferret
132000154:10.5🆆 📚⬆ 🕸 👥Apr/2024🟢https://arxiv.org/abs/2404.05719
Dense
Vicuna base, multimodal. Extension of Ferret from Oct/2023.
69
mixtral-8x22bMistral
https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1
141200015:11.877.75🆆 📚⬆ 🕸 🌋 Apr/2024🟢
https://mistral.ai/news/mixtral-8x22b/
MoEMoE=22Bx8, seq=65536.
70
SailorSail
https://huggingface.co/sail
720029:10.1🆆 📚⬆ 🕸 🌋 Apr/2024🟢
https://arxiv.org/abs/2404.03608v1
Dense
SEA languages. Based on Qwen-1.5. https://github.com/sail-sg/sailor-llm "Generally Sailor models consume around 200B tokens, completing a full pass through the SailCraft corpus once. However, the Sailor-0.5B model undergoes training with 400B tokens, equivalent to 2 epochs."
71
JetMoE-8BMIT
https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat
81250157:10.349.2🆆 📚⬆ 🕸 🌋 Apr/2024🟢https://huggingface.co/jetmoe/jetmoe-8bMoE
72
EurusTsinghuahttps://huggingface.co/collections/openbmb/eurus-660bc40bec5376b3adc9d1c570200029:11.2🆆 📚⬆ 🕸 🌋 Apr/2024🟢https://huggingface.co/collections/openbmb/eurus-660bc40bec5376b3adc9d1c5
Dense
Fine-tune of Mistral-7B and CodeLlama-70B.
73
Command-R+Cohere
https://huggingface.co/spaces/CohereForAI/c4ai-command-r-plus
104400039:12.175.7📚 🕸 Apr/2024🟢https://huggingface.co/CohereForAI/c4ai-command-r-plus
Dense
purpose-built to excel at real-world enterprise use cases. Announce with no arch details: https://txt.cohere.com/command-r-plus-microsoft-azure/
74
VikingSilo AI33200061:10.9🌋Apr/2024🟢
https://www.silo.ai/blog/viking-7b-13b-33b-sailing-the-nordic-seas-of-multilinguality
Dense
'Viking uses an architecture similar to Llama 2, with flash attention, rotary embeddings, grouped query attention and supports a 4k sequence length'
75
OLMo-Bitnet-1BNous Research
https://huggingface.co/NousResearch/OLMo-Bitnet-1B
16060:10.0🌋Apr/2024🟢
https://arxiv.org/abs/2402.17764
Dense
1.58-bit quantized (ternary weights) means we can run a 70B model in ~14GB VRAM. See also BitNet b1.58
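The ~14GB claim for a 70B ternary model is consistent with 1.58 bits per weight (weights only, ignoring activations and KV cache; a sketch):

```python
params = 70e9
bits_per_weight = 1.58   # ternary weights: log2(3) ~ 1.58 bits

size_gb = params * bits_per_weight / 8 / 1e9
print(f"{size_gb:.1f} GB")  # ~13.8 GB, in line with the quoted ~14GB VRAM
```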
76
Aurora-MInternational
https://huggingface.co/collections/aurora-m/aurora-m-models-65fdfdff62471e09812f5407
15.52035132:10.6🌋Mar/2024🟢
https://arxiv.org/abs/2404.00399
Dense
77
ReALM-3BApple313445:10.1🌋Mar/2024🔴
https://arxiv.org/abs/2403.20329
Dense
FLAN-T5 (Oct/2022) finetune.
78
Qwen1.5-MoE-A2.7B
Alibaba
https://qwenlm.github.io/blog/qwen-moe/
14.31500105:10.562.5🆆 📚⬆ 🕸 🌋 Mar/2024🟢
https://qwenlm.github.io/blog/qwen-moe/
MoE
MoE. "Of particular significance is the fact that, through upcycling, the necessity for training an equivalent volume of tokens as in the original model has been eliminated." I assumed half of the original 3T tokens
79
Grok-1.5xAIhttps://grok.x.ai/314600020:14.681.3🆆 📚⬆ 🕸 🌋 Mar/2024🟢
https://x.ai/blog/grok-1.5
DenseContext=128k.
80
JambaAI21
https://huggingface.co/ai21labs/Jamba-v0.1
52500097:11.767.4🆆 📚⬆ 🕸 🌋 Mar/2024🟢
https://arxiv.org/abs/2403.19887
MoEMoE. Open weights, licensed under Apache 2.0. Announce: https://arxiv.org/abs/2403.19887
81
DBRXMosaicML
https://huggingface.co/spaces/databricks/dbrx-instruct
1321200091:14.273.7🆆 📚⬆ 🕸 🌋 Mar/2024🟢
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
MoEMoE. Trained for $10M on 3,072 NVIDIA H100s connected by 3.2Tbps Infiniband.
82
Stable Code Instruct 3B
Stability AI
https://huggingface.co/stabilityai/stable-code-instruct-3b
2.7560208:10.1🌋Mar/2024🟢https://stability.ai/news/introducing-stable-code-instruct-3b
Dense
Context window=16,384. Trained on The Stack dataset.
83
EvoLLM-JPSakana AI
https://huggingface.co/SakanaAI/EvoLLM-JP-v1-10B
1080080:10.3🆆 📚⬆ 🕸 🌋 Mar/2024🟢
https://arxiv.org/abs/2403.13187
Dense
Japanese. Model merge 'our EvoLLM-JP-A is a merge of shisa-gamma-7b-v1, Arithmo2-Mistral-7B, and Abel7B-002' https://sakana.ai/evolutionary-model-merge/
84
RakutenAI-7BRakuten Group
https://huggingface.co/Rakuten/RakutenAI-7B
73000429:10.561.31🆆 📚⬆ 🕸 🌋 Mar/2024🟢
https://arxiv.org/abs/2403.15484
Dense
Japanese. Mistral 7B derivative.
85
ParakeetIndependent
https://colab.research.google.com/drive/1gI8CM9Bz9ov0-E6aL2jF808rE56UtZyF?usp=sharing
0.37838:10.0🆆 📚⬆ 🕸 🌋 Mar/2024🟢
https://news.ycombinator.com/item?id=39745700#39745702
Dense
Tiny model (378M) for testing
86
RWKV-v5 EagleXRWKV
https://huggingface.co/recursal/EagleX_1-7T
7.521700227:10.440.14🆆 📚⬆ 🕸 🌋 Mar/2024🟢
https://substack.recursal.ai/p/eaglex-17t-soaring-past-llama-7b
Dense
Built on the RWKV-v5 architecture (a linear transformer with 10-100x+ lower inference cost)
87
MM1Apple30201067:10.8🌋Mar/2024🔴
https://arxiv.org/abs/2403.09611
Dense
VLM, outperforms Flamingo 80B (Apr/2022) across benchmarks. 2T text tokens + ~10B+ other text (estimate). Unreleased.
88
RFM-1Covariant
https://vimeo.com/921866765
816020:10.1🆆 📚⬆ 🕸 🌋 Mar/2024🟡
https://covariant.ai/insights/introducing-rfm-1-giving-robots-human-like-reasoning-capabilities/
Dense
Commercial, multimodal for robotics
89
Command-RCohereCohere3570020:10.537.9📚 🕸 Mar/2024🟢https://txt.cohere.com/command-r/
Dense
RAG and tool use
90
DeepSeek-VLDeepSeek-AI
https://github.com/deepseek-ai/DeepSeek-VL?tab=readme-ov-file
72000286:10.4🆆 📚⬆ 🕸 🌋 Mar/2024🟢
https://arxiv.org/abs/2403.05525
Dense
Vision, based on DeepSeek-LLM-7B
91
AnyGPTFudan University
https://junzhan2000.github.io/AnyGPT.github.io/
72000286:10.4🆆 📚⬆ 🕸 🌋 Mar/2024🟢
https://arxiv.org/abs/2402.12226
Dense
Llama 2 7B backbone with new matrices ('reshaping the embedding matrix and prediction layer')
92
Stable Beluga 2.5Stability AI70200029:11.2🆆 📚⬆ 🕸 🌋 Mar/2024🟢
https://stability.ai/news/putting-the-ai-supercomputer-to-work
Dense
Mentioned in Stability release about Intel chips 11/Mar/2024; availability unknown.
93
Inflection-2.5Inflection AI
https://inflection.ai/inflection-2
12002000017:116.385.538.4🆆 📚 ⬆ 🕸Mar/2024🟢
https://inflection.ai/inflection-2-5
Dense
94
ApolloSRIBD/CUHK
https://apollo.llmzoo.com/
72500358:10.4🆆 📚🕸 🌋 Mar/2024🟢
https://arxiv.org/abs/2403.03640
Dense
Qwen 1.8B as base. Medical focus.
95
Claude 3 OpusAnthropichttps://claude.ai/20004000020:129.888.268.559.5🆆 📚⬆ 🕸 🌋Mar/2024🟢
https://www.anthropic.com/claude-3-model-card
Dense
Original MMLU=86.8 (GPT-4=86.4). Original GPQA=50.4. 200k context, 1M for researchers.
96
Nemotron-4 15BNVIDIA158000534:11.264.2🆆 📚⬆ 🕸 🌋Feb/2024🟢https://arxiv.org/abs/2402.16819
Dense
97
TowerLLMUnbabel
https://unbabel.com/meet-towerllm/
7102029:11.2🆆 📚⬆ 🕸 🌋Feb/2024🟢
https://arxiv.org/abs/2402.17733
Dense
Commercial product, Llama-2 as base.
98
Hawk
Google DeepMind
730043:10.235🆆 📚🕸 🌋 Feb/2024🟢
https://arxiv.org/abs/2402.19427
Dense
MMLU=35. RNN.
99
Griffin
Google DeepMind
1430022:10.249.5🆆 📚🕸 🌋 Feb/2024🟢
https://arxiv.org/abs/2402.19427
Dense
MMLU=49.5. RNN.
100
BitNet b1.58Microsoft
https://huggingface.co/1bitLLM/bitnet_b1_58-xl
70200029:11.2🆆 📚🕸 🌋 Feb/2024🟢
https://arxiv.org/abs/2402.17764
Dense