MOVED: https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/edit?gid=1158069878#gid=1158069878
Google corrupted this one...
Model | Lab | Playground | Parameters (B) | Tokens trained (B) | Ratio Tokens:Params (Chinchilla scaling ≥ 20:1) | ALScore | MMLU | MMLU-Pro | GPQA | Training dataset | Announced | Public? | Paper / Repo | Arch | Notes
"ALScore" is a quick and dirty rating of the model's power. The formula is: ALScore = √(Parameters × Tokens) ÷ 300, with Parameters and Tokens in billions. Any ALScore ≥ 1.0 is a powerful model in mid-2023. (A minimal calculation sketch follows this header.)
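As an illustration of the two derived columns, here is a minimal Python sketch (not part of the original sheet; the function names are illustrative only) that reproduces the ALScore and Tokens:Params ratio from the Parameters (B) and Tokens trained (B) cells, using GPT-4o mini's listed figures as the example input:

import math

def alscore(params_b: float, tokens_b: float) -> float:
    # "Quick and dirty" power rating: sqrt(Parameters x Tokens) / 300,
    # with both values expressed in billions, as defined in the header.
    return math.sqrt(params_b * tokens_b) / 300

def tokens_to_params_ratio(params_b: float, tokens_b: float) -> float:
    # Chinchilla-style data ratio; >= 20:1 is the rule of thumb used in this sheet.
    return tokens_b / params_b

# Example: GPT-4o mini as listed below (8B parameters, 6,000B tokens).
print(round(alscore(8, 6000), 1))                  # 0.7
print(f"{tokens_to_params_ratio(8, 6000):.0f}:1")  # 750:1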
Olympus | Amazon | https://lifearchitect.ai/olympus/ | 2,000B params | 40,000B tokens | TBA | New related Titan details: '$65m training run. 200B dense model on 4T tokens of data across 13,760 NVIDIA A100 chips. 48 days to train. Training runs soon to cross $1B' https://importai.substack.com/p/import-ai-365-wmd-benchmark-amazon
GPT-5 | OpenAI | https://lifearchitect.ai/gpt-5/ | 52,500B params | TBA | Due 2024.
GPT-6 | OpenAI | https://lifearchitect.ai/gpt-6/ | TBA | Due 2025.
AuroraGPT (ScienceGPT) | Argonne National Laboratory | https://www.hpcwire.com/2023/11/13/training-of-1-trillion-parameter-scientific-ai-begins/ | 1,000B params | TBA | 🔴 | Paper/repo: https://tpc.dev/2023/11/10/tpc-announced-with-founding-partners/ | Powered by Intel Ponte Vecchio GPUs.
Grok-2 | xAI | https://twitter.com/elonmusk/status/1773655245769330757 | TBA | Due 2025.
MAI-1 | Microsoft | https://arstechnica.com/information-technology/2024/05/microsoft-developing-mai-1-language-model-that-may-compete-with-openai-report/ | 500B params | 10,000B tokens | 20:1 | ALScore 7.5 | TBA | Paper/repo: https://www.reuters.com/technology/microsoft-readies-new-ai-model-compete-with-google-openai-information-reports-2024-05-06/ | Dense | Due 2024. MAI = Microsoft artificial intelligence. MSFT CTO statement: https://archive.md/XRSgS
GPT-4o mini | OpenAI | https://chatgpt.com/ | 8B params | 6,000B tokens | 750:1 | ALScore 0.7 | MMLU 82 | GPQA 40.2 | 🆆 📚 🕸 🌋 | Jul/2024 | 🟢 | Paper/repo: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ | MoE | Omnimodel. "OpenAI would not disclose exactly how large GPT-4o mini is, but said it’s roughly in the same tier as other small AI models, such as Llama 3 8b, Claude Haiku and Gemini 1.5 Flash." https://techcrunch.com/2024/07/18/openai-unveils-gpt-4o-mini-a-small-ai-model-powering-chatgpt/ "tested GPT-4o to identify potential risks, which we have addressed and plan to share the details of in the forthcoming GPT-4o system card and Preparedness scorecard." And related paper about instruction hierarchy: https://arxiv.org/abs/2404.13208
NeMo | Mistral | https://huggingface.co/mistralai/Mistral-Nemo-Base-2407 | 12B params | 2,000B tokens | 167:1 | ALScore 0.5 | MMLU 68 | 🆆 📚 🕸 🌋 | Jul/2024 | 🟢 | Paper/repo: https://mistral.ai/news/mistral-nemo/ | Dense | With NVIDIA. "Drop-in replacement of Mistral 7B". "trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs" https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/
Codestral Mamba | Mistral | https://huggingface.co/mistralai/mamba-codestral-7B-v0.1 | 7B params | 2,000B tokens | 286:1 | ALScore 0.4 | 🆆 📚 🕸 🌋 | Jul/2024 | 🟢 | Paper/repo: https://mistral.ai/news/codestral-mamba/ | Dense | "Unlike Transformer models, Mamba models offer the advantage of linear time inference and the theoretical ability to model sequences of infinite length."
Mathstral | Mistral | https://huggingface.co/mistralai/mathstral-7B-v0.1 | 7B params | 2,000B tokens | 286:1 | ALScore 0.4 | MMLU 63.47 | 🆆 📚 🕸 🌋 | Jul/2024 | 🟢 | Paper/repo: https://mistral.ai/news/mathstral/ | Dense | "We’re contributing Mathstral to the science community to bolster efforts in advanced mathematical problems requiring complex, multi-step logical reasoning."
SpreadsheetLLM | Microsoft | 1,760B params | 13,000B tokens | 8:1 | ALScore 15.9 | 🆆 📚 🕸 🌋 | Jul/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2407.09025v1 | Dense | Notable finetune of GPT4-0125-preview, "outperforming the vanilla approach by 25.6% in GPT4’s in-context learning setting"
next-gen | DeepL | https://www.deepl.com/en/translator | 🌋 | Jul/2024 | 🟢 | Paper/repo: https://www.deepl.com/en/blog/next-gen-language-model | Dense | "Built using our own groundbreaking, specialized LLM technology and proprietary training data, designed specifically for translation"
SmolLM | Hugging Face | https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966 | 1.7B params | 1,000B tokens | 589:1 | ALScore 0.1 | MMLU 39.97 | 🆆 📚 🕸 🌋 ⚛️ | Jul/2024 | 🟢 | Paper/repo: https://huggingface.co/blog/smollm | Dense | Dataset includes the new Cosmopedia v2 synthetic data. 135M and 360M models, each trained on 600B tokens from Smollm-Corpus; the 1.7B model trained on 1T tokens from Smollm-Corpus.
Mockingbird | Vectara | https://vectara.com/platform/ | 9B params | 1,000B tokens | 112:1 | ALScore 0.3 | 🆆 📚 🕸 🌋 ⚛️ | Jul/2024 | 🟢 | Paper/repo: https://vectara.com/blog/mockingbird-a-rag-and-structured-output-focused-llm/ | Dense | "At <10B parameters it's an LLM trained to provide optimal results for RAG and structured outputs."
FLAMe | Google DeepMind | 24B params | 1,000B tokens | 42:1 | ALScore 0.5 | 👥 | Jul/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2407.10817v1 | Dense | LLM-as-a-Judge autorater. Foundational Large Autorater Models (FLAMe). Uses an instruction-tuned PaLM-2-24B model. Unrelated to Microsoft FLAME from Jan/2023.
H2O-Danube3-4B | H2O.ai | https://h2o.ai/platform/danube/personal-gpt/ | 4B params | 6,000B tokens | 1,500:1 | ALScore 0.5 | MMLU 55.18 | 🆆 📚 🕸 🌋 ⚛️ | Jul/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2407.09276 | Dense | Runs natively and fully offline on a mobile phone. "H2O-Danube3 is a family of decoder only LLM models that use the general Llama model architecture adopting core principles from Llama 2 and Mistral with custom parameters determining the shape of each layer and total parameter count. We use the Mistral tokenizer..." MMLU for chat=54.74, base=55.18 via https://huggingface.co/h2oai/h2o-danube3-4b-base
Causal Axioms | Microsoft | 0.067B params | 1.2B tokens | 18:1 | ALScore 0.0 | ⚛️ | Jul/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2407.07612v1 | Dense | "the training dataset follows a specific structure, we develop a custom tokenizer. Alphanumeric node names are tokenized at a character level, while special terms such as ‘causes’, ‘Does’, ‘cause’, ‘Yes’, and ‘No’ are tokenized at the word level... Our training setup consists of around 175k instances of sequential chains with size of chains ranging from 3 to 6 nodes... All models are trained for 100 epochs. [LifeArchitect.ai estimate is 12 tokens per node x 6 nodes x 175,000 instances x 100 epochs = 1.26B tokens]" Based on GPT-2 arch.
SenseNova 5.5 | SenseTime | https://platform.sensenova.cn/home#/home | 600B params | 10,000B tokens | 17:1 | ALScore 8.2 | ⚛️ | Jul/2024 | 🟢 | Paper/repo: https://www.sensetime.com/en/news-detail/51168278?categoryId=1072 | MoE | "The model training was based on over 10TB tokens [sic, taken as 10T tokens instead of 10TB=2T tokens] of high-quality training data, including a large amount of synthetically-generated reasoning chain data, which help to enhance its reasoning capabilities." & "The updates include SenseNova 5o, the first real-time multimodal model in China, which provides a new AI interaction model on par with GPT-4o’s streaming interaction capabilities"
Helium 7B | Kyutai | https://moshi.chat/ | 7B params | 1,000B tokens | 143:1 | ALScore 0.3 | ⚛️ | Jul/2024 | 🟢 | Paper/repo: https://youtu.be/hm2IJSKcYvo | Dense | "1. The model is fine-tuned on 100K transcripts generated by Helium itself. 2. These transcripts are highly detailed, heavily annotated with emotion and style, and conversational. 3. Text to Speech Engine is further fine-tuned on 20 hours of audio recorded by Alice and licensed."
InternLM2.5 | Shanghai AI Laboratory/SenseTime | https://huggingface.co/collections/internlm/internlm25-66853f32717072d17581bc13 | 7B params | 2,600B tokens | 372:1 | ALScore 0.4 | MMLU 72.8 | GPQA 38.4 | 🆆 📚 🕸 🌋 | Jul/2024 | 🟢 | Paper/repo: https://github.com/InternLM/InternLM/blob/main/model_cards/internlm2.5_7b.md | Dense | "The release of InternLM2.5 series contains 7B model size for now and we are going to release the 1.8B and 20B versions soon"
Llama 3 405B | Meta AI | https://wabetainfo.com/whatsapp-beta-for-android-2-24-14-7-whats-new/ | 405B params | 15,000B tokens | 38:1 | ALScore 8.2 | MMLU 84.8 | GPQA 48 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟡 | Dense | Waiting on release outside of WhatsApp Android as of 1/Jul/2024.
ERNIE 4.0 Turbo | Baidu | https://yiyan.baidu.com/ | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://www.reuters.com/technology/artificial-intelligence/baidu-launches-upgraded-ai-model-says-user-base-hits-300-mln-2024-06-28/ | Dense | "Ernie Bot has reached 300 million users since its launch [on 16/Mar/2023, public Aug/2023]" (Jun/2024)
Gemma 2 | Google DeepMind | https://huggingface.co/google/gemma-2-27b-it | 27B params | 13,000B tokens | 482:1 | ALScore 2.0 | MMLU 75.2 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf | Dense | Announce: https://blog.google/technology/developers/google-gemma-2/
CriticGPT | OpenAI | 👥 | Jun/2024 | 🔴 | Paper/repo: https://cdn.openai.com/llm-critics-help-catch-llm-bugs-paper.pdf | Dense | "LLM Critics Help Catch LLM Bugs". Announce: https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/
4M-21 | Apple | https://github.com/apple/ml-4m/ | 3B params | 🌋 | Jun/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2406.09406 | Dense | Vision model based on T5-XXL. Modalities: RGB, Caption, Bounding boxes, Semantic segmentation, Depth, Human poses, Surface normals, CLIP, DINOv2, ImageBind, Metadata, Canny edges, SAM edges, SAM instances, Color palette. Project page: https://4m.epfl.ch/
ESM3 | EvolutionaryScale | https://github.com/evolutionaryscale/esm | 98B params | 771B tokens | 8:1 | ALScore 0.9 | 🌋 | Jun/2024 | 🟡 | Paper/repo: https://www.evolutionaryscale.ai/blog/esm3-release | Dense | Biology large language model: "sequence, structure, and function are all masked and predicted during training, ESM3 can generate in all three modalities." 1.4B only released.
PanGu 5.0 Super | Huawei | https://www.huaweicloud.com/intl/en-us/product/modelarts.html | 1,000B params | 20,000B tokens | 20:1 | ALScore 14.9 | 🌋 | Jun/2024 | 🟡 | Paper/repo: https://www.huaweicentral.com/huawei-cloud-unveils-pangu-large-model-5-0/ | MoE | https://x.com/faridofanani96/status/1804079517193113850/photo/1
Claude 3.5 Sonnet | Anthropic | https://poe.com/Claude-3.5-Sonnet | MMLU 90.4 | MMLU-Pro 72.83 | GPQA 67.2 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://www.anthropic.com/news/claude-3-5-sonnet | Dense | Model card: https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf
DeepSeek-Coder-V2 | DeepSeek-AI | https://chat.deepseek.com/coder | 236B params | 10,200B tokens | 35:1 | ALScore 4.6 | MMLU 79.2 | MMLU-Pro 63.63 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://github.com/deepseek-ai/DeepSeek-Coder-V2/blob/main/paper.pdf | MoE | DeepSeek-V2 with an additional 6 trillion tokens.
DCLM-Baseline 7B 2.6T | International | https://huggingface.co/apple/DCLM-Baseline-7B | 7B params | 2,600B tokens | 372:1 | ALScore 0.4 | MMLU 63.7 | 🕸 🌋 | Jun/2024 | 🟡 | Paper/repo: https://arxiv.org/abs/2406.11794 | Dense | New dataset: 240T tokens, 8× larger than the previous SOTA dataset. DCLM-Pool is 240T, DCLM-Baseline is 3.8T: "we combine our 3.8T DCLM-BASELINE with the StarCoder and ProofPile2 data to arrive at a 4.1T token dataset. We train a 7B model for 2.5T tokens" and "We release the DCLM benchmark, framework, models, and datasets at https://datacomp.ai/dclm."
Nemotron-4-340B | NVIDIA | https://build.nvidia.com/nvidia/nemotron-4-340b-instruct | 340B params | 9,000B tokens | 27:1 | ALScore 5.8 | MMLU 81.1 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T.pdf | Dense | Open-source equivalent of Mar/2023 GPT-4 (1760B MoE ≈ 340B, 13T); same param count but 2x the tokens of May/2023 PaLM 2 (340B, 3.6T); competitor to Nov/2023 Grok-1 (314B, 6T). Trained on 6,144 H100s. ~1.3TB for inference. 50+ natural and 40+ coding languages. Trained between December 2023 and May 2024. MMLU 0-shot for instruct=78.7, 5-shot for base=81.1. Permalink for paper: https://research.nvidia.com/publication/2024-06_nemotron-4-340b
Apple On-Device model (Jun/2024) | Apple | https://github.com/apple/corenet/tree/main/projects/openelm | 3.04B params | 1,500B tokens | 494:1 | ALScore 0.2 | MMLU 26.76 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.14619 | Dense | https://lifearchitect.ai/apple/ Likely to be the Apple OpenELM model (Apr/2024). "two of these models — a ~3 billion parameter on-device language model, and a larger server-based language model available with Private Cloud Compute". https://machinelearning.apple.com/research/introducing-apple-foundation-models The server-based model is possibly Ferret, although it is more properly called a multimodal model (not just language). It could also be Apple GPT based on their Ajax framework: https://archive.md/f3C0r
MatMul-Free LM | UCSC | https://github.com/ridgerchu/matmulfreellm | 2.7B params | 100B tokens | 38:1 | ALScore 0.1 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2406.02528 | Dense | "we explore alternative methods for mixing tokens without relying on matrix multiplications." Compared with Transformer++ based on Llama-2, not to be confused with the pre-GPT-3 American Express Transformer++ paper from 2/Mar/2020. Instead, Transformer++ is defined in the Mamba paper: 'Transformer++: A Transformer with an improved architecture, namely rotary positional encodings (Su et al. 2021) and SwiGLU MLP (Shazeer 2020)'
Luna | Galileo | https://www.rungalileo.io/blog/introducing-galileo-luna-a-family-of-evaluation-foundation-models | 0.44B params | 162B tokens | 369:1 | ALScore 0.0 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2406.00975 | Dense | Based on DeBERTa-large (440M). RoBERTa = 162B-token dataset.
Qwen2 | Alibaba | https://huggingface.co/spaces/Qwen/Qwen2-72B-Instruct | 72B params | 7,000B tokens | 98:1 | ALScore 2.4 | MMLU 84.2 | MMLU-Pro 55.6 | GPQA 37.9 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2407.10671 | Dense | Instruct MMLU=82. Instruct GPQA=41.9. https://qwenlm.github.io/blog/qwen2/
Qwen2-57B-A14B | Alibaba | https://github.com/QwenLM/Qwen2?tab=readme-ov-file | 57B params | 4,500B tokens | 79:1 | ALScore 1.7 | MMLU 76.5 | MMLU-Pro 43 | GPQA 34.3 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2407.10671 | MoE | https://qwenlm.github.io/blog/qwen2/
Skywork MoE 16x13B | Kunlun Tech | https://huggingface.co/Skywork/Skywork-MoE-Base | 146B params | MMLU 77.4 | 🆆 📚 🕸 🌋 | Jun/2024 | 🟢 | Paper/repo: https://github.com/SkyworkAI/Skywork-MoE/blob/main/skywork-moe-tech-report.pdf | MoE | CN + EN. "(MoE) model with 146 billion parameters, 16 experts, and 22 billion activated parameters. This model is initialized from the pre-existing dense checkpoints of our Skywork-13B model."
Mamba-2 | CMU | https://github.com/state-spaces/mamba | 2.7B params | 300B tokens | 112:1 | ALScore 0.1 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2405.21060 | Dense | Analysis: https://tridao.me/blog/2024/mamba2-part1-model/
MAP-Neo | International | https://map-neo.github.io/ | 7B params | 4,500B tokens | 643:1 | ALScore 0.6 | MMLU 58.14 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2405.19327 | Dense | "first fully open-sourced bilingual LLM with comparable performance to existing state-of-the-art LLMs... we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided."
K2 | LLM360 | https://huggingface.co/LLM360/K2 | 65B params | 1,400B tokens | 22:1 | ALScore 1.0 | MMLU 64.8 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://www.llm360.ai/blog/several-new-releases-to-further-our-mission.html | Dense | "K2-65B is a fully reproducible LLM outperforming Llama 2 70B using 35% less compute."
Codestral | Mistral | https://huggingface.co/mistralai/Codestral-22B-v0.1 | 22B params | 2,000B tokens | 91:1 | ALScore 0.7 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://mistral.ai/news/codestral/ | Dense | Fluent in 80+ programming languages.
Aya-23-35B | Cohere | https://huggingface.co/spaces/CohereForAI/aya-23 | 35B params | 4,800B tokens | 138:1 | ALScore 1.4 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://drive.google.com/file/d/1YKBPo61pnl97C1c_1C2ZVOnPhqf7MLSc/view | Dense
Yi-XLarge | 01-ai | https://platform.01.ai/ | 2,000B params | 20,000B tokens | 10:1 | ALScore 21.1 | MMLU 85.1 | GPQA 48.2 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://www.aixinzhijie.com/article/6845768 | MoE | Still training as of May/2024: https://appserversrc.8btc.cn/FnDYlEC4STBhphu6M3NL4CKH43FW (dead link; use https://finance.china.com.cn/roll/20240513/6116857.shtml)
Yi-Large | 01-ai | https://platform.01.ai/ | 1,000B params | 15,000B tokens | 15:1 | ALScore 12.9 | MMLU 83.8 | MMLU-Pro 58.1 | GPQA 43.5 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://www.aixinzhijie.com/article/6845768 | Dense
Chameleon | Meta AI | https://ai.meta.com/resources/models-and-libraries/chameleon-downloads/?gk_enable=chameleon_web_flow_is_live | 34B params | 9,200B tokens | 271:1 | ALScore 1.9 | MMLU 65.8 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2405.09818 | Dense | Multimodal.
Sparse Llama 7B | Cerebras | https://huggingface.co/spaces/neuralmagic/llama-2-sparse-transfer-chat-deepsparse | 7B params | 145B tokens | 21:1 | ALScore 0.1 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2405.03594 | Hybrid | https://www.cerebras.net/blog/introducing-sparse-llama-70-smaller-3x-faster-full-accuracy "For the 50% sparse model, we utilized 45 billion tokens of pretraining data, while an additional 100 billion tokens were used for the 70% model. This represents approximately 2% to 8% of the original 2 trillion tokens used to train the base Llama-2 model."
Gemini 1.5 Flash | Google DeepMind | https://aistudio.google.com/app/prompts/new_chat | MMLU 78.9 | MMLU-Pro 59.1 | GPQA 39.5 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://goo.gle/GeminiV1-5 | MoE | 1M context length.
GPT-4o | OpenAI | https://chatgpt.com/ | MMLU 88.7 | MMLU-Pro 72.6 | GPQA 53.6 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://openai.com/index/hello-gpt-4o/ | MoE | Omnimodel. ‘[GPT-4o is] likely an early checkpoint of GPT-5’: https://twitter.com/drjimfan/status/1790089671365767313 ELO: https://twitter.com/LiamFedus/status/1790064963966370209 Demo: https://youtu.be/DQacCB9tDaw
Falcon 2 11B | TII | https://huggingface.co/tiiuae/falcon-11B | 11B params | 5,500B tokens | 500:1 | ALScore 0.8 | MMLU 58.37 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://www.tii.ae/news/falcon-2-uaes-technology-innovation-institute-releases-new-ai-model-series-outperforming-metas | Dense
Fugaku-LLM | Fujitsu | https://huggingface.co/Fugaku-LLM/Fugaku-LLM-13B-instruct | 13B params | 380B tokens | 30:1 | ALScore 0.2 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://www.fujitsu.com/global/about/resources/news/press-releases/2024/0510-01.html | Dense | Japanese. CPU-trained: 158,976+ A64FX CPUs (7M+ cores), zero GPUs. https://en.wikipedia.org/wiki/Fugaku_(supercomputer)
Yi 1.5 34B | 01-ai | https://huggingface.co/01-ai/Yi-1.5-34B-Chat | 34.4B params | 3,600B tokens | 105:1 | ALScore 1.2 | MMLU 76.8 | MMLU-Pro 52.3 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://github.com/01-ai/Yi-1.5 | Dense | Uses 600B more training tokens than Yi 1.0 (Nov/2023).
YOCO | Microsoft | https://github.com/microsoft/unilm/tree/master/YOCO | 3B params | 1,600B tokens | 534:1 | ALScore 0.2 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2405.05254 | Dense | With Tsinghua. You Only Cache Once (YOCO). Long context: "1M context length with near-perfect needle retrieval accuracy"
DeepSeek-V2 | DeepSeek-AI | https://chat.deepseek.com/ | 236B params | 8,100B tokens | 35:1 | ALScore 4.6 | MMLU 78.5 | MMLU-Pro 54.8 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2405.04434 | MoE | Huge dataset, 12% Chinese: "Therefore, we acknowledge that DeepSeek-V2 still has a slight gap in basic English capabilities with LLaMA3 70B".
ChuXin | Independent | https://huggingface.co/chuxin-llm/Chuxin-1.6B-Base | 1.6B params | 2,300B tokens | 1,438:1 | ALScore 0.2 | MMLU 41.07 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2405.04828 | Dense | "results on the 'Needle In A Haystack' (NIAH) tests indicate that ChuXin-1M performs well across all context window lengths up to 1M."
RWKV-v6 Finch | RWKV | https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-2 | 7.63B params | 2,500B tokens | 328:1 | ALScore 0.5 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://huggingface.co/BlinkDL/rwkv-6-world | Dense | https://twitter.com/BlinkDL_AI/status/1787834625211158562
xLSTM | ELLIS | 2.7B params | 15B tokens | 6:1 | ALScore 0.0 | 🆆 📚 🕸 🌋 | May/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2405.04517 | Dense | New method extending LSTM to xLSTM; see also RNNs. Code/weights don't seem to be released. https://github.com/AI-Guru/xlstm-resources
Granite Code | IBM | https://github.com/ibm-granite/granite-code-models | 34B params | 3,500B tokens | 103:1 | ALScore 1.1 | 🌋 | May/2024 | 🟢 | Paper/repo: https://github.com/ibm-granite/granite-code-models/blob/main/paper.pdf | Dense | Dataset: publicly available datasets (e.g., GitHub Code Clean, StarCoder data), public code repositories, and issues from GitHub.
Qwen-Max | Alibaba | https://chat.lmsys.org/ | 300B params | 6,000B tokens | 20:1 | ALScore 4.5 | 🆆 📚 🕸 🌋 | May/2024 | 🟢 | Paper/repo: https://help.aliyun.com/zh/dashscope/developer-reference/model-introduction | Dense | https://twitter.com/JustinLin610/status/1787584325367529509
Med-Gemini-L 1.0 | Google DeepMind | https://twitter.com/alan_karthi/status/1785117450528264216 | 1,500B params | 30,000B tokens | 20:1 | ALScore 22.4 | 🆆 📚 🕸 🌋 | May/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2404.18416 | Dense | Med-Gemini-M 1.0 and Med-Gemini-L 1.0 (Pro and Ultra finetunes): "For language tasks that require less complex reasoning, such as summarizing medical notes and creating referral letters, we introduce Med-Gemini-M 1.0 by fine-tuning the Gemini 1.0 Pro model. For other tasks that require more advanced reasoning, we introduce Med-Gemini-L 1.0 by fine-tuning the Gemini 1.0 Ultra model using a self-training method to enable the models to efficiently use web search."
Tele-FLM | BAAI | https://huggingface.co/CofeAI/Tele-FLM | 52B params | 2,000B tokens | 39:1 | ALScore 1.1 | MMLU 64 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.16645 | Dense | Also known as FLM-2. "We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research." Discussion paper Jul/2024: https://arxiv.org/abs/2407.02783
Qwen-1.5 110B | Alibaba | https://huggingface.co/spaces/Qwen/Qwen1.5-110B-Chat-demo | 111B params | 3,000B tokens | 28:1 | ALScore 1.9 | MMLU 80.4 | MMLU-Pro 49.9 | GPQA 35.9 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://qwenlm.github.io/blog/qwen1.5-110b/ | Dense | Worse performance on GPQA (72B=36.3, 110B=35.9).
Arctic | Snowflake AI Research | https://arctic.streamlit.app/ | 480B params | 3,500B tokens | 8:1 | ALScore 4.3 | MMLU 67.3 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/ | Hybrid | "Arctic uses a unique Dense-MoE Hybrid transformer architecture. It combines a 10B dense transformer model with a residual 128×3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating."
SenseNova 5.0 | SenseTime | 600B params | 10,000B tokens | 17:1 | ALScore 8.2 | MMLU 84.78 | GPQA 42.93 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://news.futunn.com/en/post/41290101/a-large-shangtang-multi-modal-model-with-600-billion-parameters | MoE | GPT-4 scale; low media coverage; no demo in the Western world. https://www.techinasia.com/sensetime-pauses-trading-stock-rises-30-model-launch
OpenELM | Apple | https://huggingface.co/apple/OpenELM-3B-Instruct | 3.04B params | 1,500B tokens | 494:1 | ALScore 0.2 | MMLU 26.76 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.14619 | Dense | On-device model (laptop, phone). Open-source Efficient Language Models (OpenELM). https://venturebeat.com/ai/apple-releases-openelm-small-open-source-ai-models-designed-to-run-on-device/
phi-3-medium | Microsoft | https://huggingface.co/microsoft/Phi-3-medium-128k-instruct | 14B params | 4,800B tokens | 343:1 | ALScore 0.9 | MMLU 78.2 | MMLU-Pro 55.7 | ⚛️ | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.14219 | Dense | Preview only, benchmarks being investigated as of May/2024.
phi-3-mini | Microsoft | https://huggingface.co/microsoft/Phi-3-mini-128k-instruct | 3.8B params | 3,300B tokens | 869:1 | ALScore 0.4 | MMLU 68.8 | MMLU-Pro 45.7 | ⚛️ | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.14219 | Dense | "phi3-mini can be quantized to 4-bits so that it only occupies ≈ 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on iPhone 14 with A16 Bionic chip running natively on-device and fully offline achieving more than 12 tokens per second."
Llama 3 70B | Meta AI | https://meta.ai/ | 70B params | 15,000B tokens | 215:1 | ALScore 3.4 | MMLU 82 | MMLU-Pro 52.8 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://ai.meta.com/blog/meta-llama-3/ | Dense | Instruct MMLU-Pro=56.2.
HLAT | Amazon | 7B params | 1,800B tokens | 258:1 | ALScore 0.4 | MMLU 41.3 | MMLU-Pro 18 | 🆆 📚 🕸 🌋 | Apr/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2404.10630 | Dense | HLAT = High-quality LLM pre-trained on AWS Trainium. Same architecture as Llama 7B. Pre-training is performed on up to 64 Amazon EC2 trn1.32xlarge instances, totalling up to 1,024 AWS Trainium accelerators. Read more about Trainium: https://www.aboutamazon.com/news/aws/what-you-need-to-know-about-the-aws-ai-chips-powering-amazons-partnership-with-anthropic
Idefics2 | Hugging Face | https://huggingface.co/HuggingFaceM4/idefics2-8b | 8.4B params | 🆆 🕸 | Apr/2024 | 🟢 | Paper/repo: https://huggingface.co/blog/idefics2 | Dense | Clone of Flamingo, now using Mistral 7B. Named after Asterix and Obelix's dog Idefix (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS).
Reka Core | Reka AI | https://poe.com/RekaCore | 300B params | 10,000B tokens | 34:1 | ALScore 5.8 | MMLU 83.2 | GPQA 38.2 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://publications.reka.ai/reka-core-tech-report.pdf | Dense | https://www.reka.ai/news/reka-core-our-frontier-class-multimodal-language-model
WizardLM-2-8x22B | Microsoft | https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF | 141B params | 2,000B tokens | 15:1 | ALScore 1.8 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://wizardlm.github.io/WizardLM2/ | MoE | Base model = mistral-8x22b.
Pile-T5 | EleutherAI | https://huggingface.co/EleutherAI/pile-t5-xxl | 11B params | 2,000B tokens | 182:1 | ALScore 0.5 | MMLU 53.84 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://blog.eleuther.ai/pile-t5/ | Dense
Zephyr 141B-A35B | Hugging Face H4 | https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1 | 35B params | 2,000B tokens | 58:1 | ALScore 0.9 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2403.07691 | MoE | mixtral-8x22b finetune using Odds Ratio Preference Optimization (ORPO).
Rerank 3 | Cohere | https://docs.cohere.com/reference/rerank-1 | 104B params | 4,000B tokens | 39:1 | ALScore 2.1 | 📚 🕸 | Apr/2024 | 🟢 | Paper/repo: https://txt.cohere.com/rerank-3/ | Dense | RAG + semantic search, possibly backed by Command-R+.
gpt-4-turbo-2024-04-09 | OpenAI | https://chat.openai.com/ | MMLU 86.5 | MMLU-Pro 63.7 | GPQA 49.1 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://cdn.openai.com/papers/gpt-4.pdf | MoE | This is such a significantly better model that I've added it here. This GPQA=46.5%, old GPT-4 GPQA=36%: https://twitter.com/EpochAIResearch/status/1778463039932584205 MMLU scores are unclear, but may have improved by 1%: https://twitter.com/OpenAI/status/1778602770784002136. Final benchmarks are here: https://archive.md/6Cc0Z
MiniCPM-2.4B | Tsinghua | https://github.com/OpenBMB/MiniCPM/ | 2.4B params | 1,100B tokens | 459:1 | ALScore 0.2 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.06395 | Dense | MoE option: https://huggingface.co/openbmb/MiniCPM-MoE-8x2B
Ferret-UI | Apple | https://github.com/apple/ml-ferret | 13B params | 2,000B tokens | 154:1 | ALScore 0.5 | 🆆 📚 🕸 👥 | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.05719 | Dense | Vicuna base, multimodal. Extension of Ferret from Oct/2023.
mixtral-8x22b | Mistral | https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1 | 141B params | 2,000B tokens | 15:1 | ALScore 1.8 | MMLU 77.75 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://mistral.ai/news/mixtral-8x22b/ | MoE | MoE=22Bx8, seq=65536.
Sailor | Sail | https://huggingface.co/sail | 7B params | 200B tokens | 29:1 | ALScore 0.1 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.03608v1 | Dense | SEA languages. Based on Qwen-1.5. https://github.com/sail-sg/sailor-llm "Generally Sailor models consume around 200B tokens, completing a full pass through the SailCraft corpus once. However, the Sailor-0.5B model undergoes training with 400B tokens, equivalent to 2 epochs."
JetMoE-8B | MIT | https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat | 8B params | 1,250B tokens | 157:1 | ALScore 0.3 | MMLU 49.2 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://huggingface.co/jetmoe/jetmoe-8b | MoE
Eurus | Tsinghua | https://huggingface.co/collections/openbmb/eurus-660bc40bec5376b3adc9d1c5 | 70B params | 2,000B tokens | 29:1 | ALScore 1.2 | 🆆 📚 🕸 🌋 | Apr/2024 | 🟢 | Paper/repo: https://huggingface.co/collections/openbmb/eurus-660bc40bec5376b3adc9d1c5 | Dense | Fine-tune of Mistral-7B and CodeLlama-70B.
Command-R+ | Cohere | https://huggingface.co/spaces/CohereForAI/c4ai-command-r-plus | 104B params | 4,000B tokens | 39:1 | ALScore 2.1 | MMLU 75.7 | 📚 🕸 | Apr/2024 | 🟢 | Paper/repo: https://huggingface.co/CohereForAI/c4ai-command-r-plus | Dense | Purpose-built to excel at real-world enterprise use cases. Announce with no arch details: https://txt.cohere.com/command-r-plus-microsoft-azure/
Viking | Silo AI | 33B params | 2,000B tokens | 61:1 | ALScore 0.9 | 🌋 | Apr/2024 | 🟢 | Paper/repo: https://www.silo.ai/blog/viking-7b-13b-33b-sailing-the-nordic-seas-of-multilinguality | Dense | 'Viking uses an architecture similar to Llama 2, with flash attention, rotary embeddings, grouped query attention and supports a 4k sequence length'
OLMo-Bitnet-1B | Nous Research | https://huggingface.co/NousResearch/OLMo-Bitnet-1B | 1B params | 60B tokens | 60:1 | ALScore 0.0 | 🌋 | Apr/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2402.17764 | Dense | 1.58-bit quantized (ternary weights) means we can run a 70B model in ~14GB VRAM. See also BitNet b1.58.
Aurora-M | International | https://huggingface.co/collections/aurora-m/aurora-m-models-65fdfdff62471e09812f5407 | 15.5B params | 2,035B tokens | 132:1 | ALScore 0.6 | 🌋 | Mar/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2404.00399 | Dense
ReALM-3B | Apple | 3B params | 134B tokens | 45:1 | ALScore 0.1 | 🌋 | Mar/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2403.20329 | Dense | FLAN-T5 (Oct/2022) finetune.
Qwen1.5-MoE-A2.7B | Alibaba | https://qwenlm.github.io/blog/qwen-moe/ | 14.3B params | 1,500B tokens | 105:1 | ALScore 0.5 | MMLU 62.5 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://qwenlm.github.io/blog/qwen-moe/ | MoE | "Of particular significance is the fact that, through upcycling, the necessity for training an equivalent volume of tokens as in the original model has been eliminated." I assumed half of the original 3T tokens.
Grok-1.5 | xAI | https://grok.x.ai/ | 314B params | 6,000B tokens | 20:1 | ALScore 4.6 | MMLU 81.3 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://x.ai/blog/grok-1.5 | Dense | Context=128k.
Jamba | AI21 | https://huggingface.co/ai21labs/Jamba-v0.1 | 52B params | 5,000B tokens | 97:1 | ALScore 1.7 | MMLU 67.4 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2403.19887 | MoE | Open weights, licensed under Apache 2.0.
DBRX | MosaicML | https://huggingface.co/spaces/databricks/dbrx-instruct | 132B params | 12,000B tokens | 91:1 | ALScore 4.2 | MMLU 73.7 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm | MoE | Trained for $10M on 3,072 NVIDIA H100s connected by 3.2Tbps Infiniband.
Stable Code Instruct 3B | Stability AI | https://huggingface.co/stabilityai/stable-code-instruct-3b | 2.7B params | 560B tokens | 208:1 | ALScore 0.1 | 🌋 | Mar/2024 | 🟢 | Paper/repo: https://stability.ai/news/introducing-stable-code-instruct-3b | Dense | Context window=16,384. Trained on The Stack dataset.
EvoLLM-JP | Sakana AI | https://huggingface.co/SakanaAI/EvoLLM-JP-v1-10B | 10B params | 800B tokens | 80:1 | ALScore 0.3 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2403.13187 | Dense | Japanese. Model merge: 'our EvoLLM-JP-A is a merge of shisa-gamma-7b-v1, Arithmo2-Mistral-7B, and Abel7B-002' https://sakana.ai/evolutionary-model-merge/
RakutenAI-7B | Rakuten Group | https://huggingface.co/Rakuten/RakutenAI-7B | 7B params | 3,000B tokens | 429:1 | ALScore 0.5 | MMLU 61.31 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://arxiv.org/abs/2403.15484 | Dense | Japanese. Mistral 7B derivative.
Parakeet | Independent | https://colab.research.google.com/drive/1gI8CM9Bz9ov0-E6aL2jF808rE56UtZyF?usp=sharing | 0.378B params | 3B tokens | 8:1 | ALScore 0.0 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://news.ycombinator.com/item?id=39745700#39745702 | Dense | Tiny model (378M) for testing.
RWKV-v5 EagleX | RWKV | https://huggingface.co/recursal/EagleX_1-7T | 7.52B params | 1,700B tokens | 227:1 | ALScore 0.4 | MMLU 40.14 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟢 | Paper/repo: https://substack.recursal.ai/p/eaglex-17t-soaring-past-llama-7b | Dense | Built on the RWKV-v5 architecture (a linear transformer with 10-100x+ lower inference cost).
MM1 | Apple | 30B params | 2,010B tokens | 67:1 | ALScore 0.8 | 🌋 | Mar/2024 | 🔴 | Paper/repo: https://arxiv.org/abs/2403.09611 | Dense | VLM, outperforms Flamingo 80B (Apr/2022) across benchmarks. 2T text tokens + ~10B+ other text (estimate). Unreleased.
RFM-1 | Covariant | https://vimeo.com/921866765 | 8B params | 160B tokens | 20:1 | ALScore 0.1 | 🆆 📚 🕸 🌋 | Mar/2024 | 🟡 | Paper/repo: https://covariant.ai/insights/introducing-rfm-1-giving-robots-human-like-reasoning-capabilities/ | Dense | Commercial, multimodal for robotics.