The comparison below lists, for each base model and its derivatives: release date, profile (text completion vs. dialog), availability on Hugging Face, number of parameters, languages, German language support, size of the training data, training dataset, data availability, technology, training hardware and time, availability of finetuning code, minimum hardware for finetuning, LangChain support, and minimum hardware for inference. Fields for which the source gives no information are marked with "?" or omitted.

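Since LangChain support is one of the comparison criteria, here is a minimal, hedged sketch of how a locally hosted Hugging Face checkpoint is typically exposed to LangChain through its HuggingFacePipeline wrapper. The model ID (databricks/dolly-v2-3b) and the generation settings are illustrative assumptions, not values taken from the table.

```python
# Minimal sketch: exposing a local Hugging Face model to LangChain.
# The model ID and generation settings are placeholder assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline

model_id = "databricks/dolly-v2-3b"  # assumption: any causal LM from the table could be used
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # needs accelerate

generate = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=128,
)

llm = HuggingFacePipeline(pipeline=generate)
print(llm("What is LangChain used for?"))
```

Any of the models below that publish standard Transformers checkpoints can be wired up the same way; quantized llama.cpp-style builds typically go through a different wrapper.
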
LLaMA (base model), 7B
Release date: 24.02.2023
Profile: Text completion
Languages: mainly EN, but also bg, ca, cs, da, de, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk
German language support: very little
Training data size: 1 trillion tokens
Training dataset: CCNet (67%), C4 (15%), GitHub (4.5%), Wikipedia (4.5%), Books (4.5%), ArXiv (2.5%), Stack Exchange (2%)
Technology: Transformer architecture, pre-normalization with RMSNorm, SwiGLU activation function, rotary embeddings (see the sketch after this entry)
Trained on: ? x A100 80GB GPUs for 82,432 GPU-hours
Min. hardware for finetuning: RTX 4090
Min. hardware for inference: 6GB VRAM, 16GB RAM (e.g. RTX 1660, 2060, AMD 5700 XT, RTX 3050, 3060)

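The architecture notes for LLaMA (pre-normalization with RMSNorm, the SwiGLU activation, rotary embeddings) can be made concrete with a short PyTorch sketch. This illustrates only the normalization and feed-forward blocks, not LLaMA's actual code; the tensor sizes and the hidden dimension are assumptions.

```python
# Simplified sketches of two LLaMA building blocks: RMSNorm and a SwiGLU feed-forward.
# Illustrative only; dimensions and details are assumptions, not the LLaMA source.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization layer: rescales by the root mean square of the activations."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Feed-forward block with the SwiGLU activation: w2(silu(w1(x)) * w3(x))."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 16, 512)              # (batch, sequence, model dim): assumed sizes
y = SwiGLU(512, 1376)(RMSNorm(512)(x))   # pre-norm, then the gated feed-forward
print(y.shape)                           # torch.Size([2, 16, 512])
```

Rotary embeddings are applied to the attention queries and keys and are omitted here to keep the sketch short.
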
LLaMA (base model), 13B
Release date: 24.02.2023
Profile: Text completion
Training data size: 1 trillion tokens
Trained on: ? x A100 80GB GPUs for 135,168 GPU-hours
Min. hardware for finetuning: 1x NVIDIA Titan RTX 24GB
Min. hardware for inference: 10GB VRAM, 32GB RAM (e.g. AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, A2000)

LLaMA (base model), 33B
Release date: 24.02.2023
Profile: Text completion
Training data size: 1.4 trillion tokens
Trained on: ? x A100 80GB GPUs for 530,432 GPU-hours
Min. hardware for finetuning: 1x A100 80GB
Min. hardware for inference: 20GB VRAM, 64GB RAM (e.g. RTX 3080 20GB, A4500, A5000, 3090, 4090, 6000, Tesla V100)

LLaMA (base model), 65B
Release date: 24.02.2023
Profile: Text completion
Training data size: 1.4 trillion tokens
Trained on: 2048 A100 80GB GPUs for 21 days (1,022,362 GPU-hours)
Min. hardware for finetuning: ?
Min. hardware for inference: 40GB VRAM, 128GB RAM (e.g. A100 40GB, 2x 3090, 2x 4090, A40, RTX A6000, RTX 8000, Titan Ada)

Alpaca
Release date: 13.03.2023
Profile: Dialog
Parameters: 7B initially by Stanford (13B, 33B, 65B trained by other parties)
Languages: + EN
German language support: very little
Training data size: 52k examples
Training dataset: instruction-following data generated in the style of self-instruct using text-davinci-003
Technology: Hugging Face's training framework, Fully Sharded Data Parallel, mixed-precision training (a configuration sketch follows this entry)
Trained on: 8x A100 80GB for 3 hours
Min. hardware for finetuning: 1x A100 GPU
Min. hardware for inference: CPU with 5GB of RAM (alpaca.cpp, 4-bit)

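The training setup quoted for Alpaca (Hugging Face's training framework with Fully Sharded Data Parallel and mixed-precision training) roughly corresponds to a Trainer configuration like the one below. This is a hedged sketch of the general pattern, not Stanford's actual launch command; all names and values are placeholder assumptions.

```python
# Hedged sketch of the kind of TrainingArguments implied by "Fully Sharded Data Parallel"
# plus "mixed precision training". All names and values here are placeholder assumptions.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="alpaca-style-finetune",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,                      # mixed-precision training (assumes A100-class GPUs)
    fsdp="full_shard auto_wrap",    # shard parameters/gradients/optimizer state across GPUs
    logging_steps=10,
)
# A Trainer would then be constructed with a LLaMA checkpoint, the 52k instruction
# dataset, and this `args` object; that part is omitted from the sketch.
```
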
Alpaca-LoRA
Release date: 28.03.2023
Profile: Dialog
Parameters: 7B initial model (13B, 33B, 65B trained by other parties)
Languages: EN, plus unofficial weights for 7B: BR, CN, JP, FR, TH, DE, PL, IT, RU, UA; 13B: JP, CR, CN, ES; 30B: JP
German language support: very little
Technology: low-rank adaptation (LoRA), Hugging Face's PEFT, Tim Dettmers' bitsandbytes (see the LoRA sketch after this entry)
Trained on: 1x RTX 4090 for 5 hours
Min. hardware for finetuning: 1x A100 GPU / RTX 4090 / NVIDIA T4
Min. hardware for inference: 7B/13B: T4 16GB; 30B: A100 40GB (without quantization)

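The LoRA/PEFT/bitsandbytes combination listed for Alpaca-LoRA usually looks like the sketch below: load the base model in 8-bit, then attach low-rank adapters to the attention projections so that only a few million parameters are trained. The checkpoint name, target modules, and ranks are assumptions, not the repository's exact settings.

```python
# Hedged sketch of LoRA finetuning with Hugging Face PEFT + bitsandbytes 8-bit loading,
# in the spirit of alpaca-lora. Checkpoint name, target modules and ranks are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

base_model = "decapoda-research/llama-7b-hf"   # assumption: a converted 7B LLaMA checkpoint
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,     # bitsandbytes int8 weights keep the 7B model around 8-10 GB VRAM
    device_map="auto",
)
model = prepare_model_for_int8_training(model)  # newer PEFT versions: prepare_model_for_kbit_training

lora_config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice for LLaMA
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the adapter weights are trainable
```
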
Lit-LLaMA
Release date: 04.04.2023
Profile: Text completion
Parameters: 7B
Languages: inherited from LLaMA
German language support: very little
Training data size: ?
Training dataset: ?
Technology: ?
Trained on: ?
Min. hardware for finetuning: GPU with ~24GB memory (e.g. RTX 3090)
Min. hardware for inference: 8-10GB VRAM

Vicuna
Release date: ~01.04.2023
Profile: Dialog
Parameters: 7B
Languages: + EN
German language support: very little
Training data size: 70k conversations
Training dataset: conversations from ShareGPT
Technology: gradient checkpointing, flash attention, SkyPilot managed spot instances
Trained on: 8x A100 80GB GPUs
Min. hardware for finetuning: 8x A100 80GB GPUs (or fewer with adjustments)
Min. hardware for inference: 14GB VRAM (GPU) or 30GB CPU memory

Vicuna, 13B
Min. hardware for inference: 28GB VRAM (GPU) or 60GB of CPU memory

Cabrita
Release date: 17.03.2023
Profile: Dialog
Parameters: 7B
Languages: + Portuguese
German language support: very little
Training data size: 52k examples
Training dataset: Alpaca dataset translated to Portuguese
Technology: low-rank adaptation (LoRA), Hugging Face's PEFT
Trained on: 1x A100 on Colab for 4 hours
Min. hardware for finetuning: 1x A100 GPU
Min. hardware for inference: 14GB VRAM (T4)

ColossalAI
Release date: 14.02.2023
Profile: Dialog
Parameters: 7B
Languages: + EN + CN
German language support: very little
Training data size: 104k examples
Training dataset: bilingual datasets of Chinese and English
Technology: RLHF (Reinforcement Learning from Human Feedback), ZeRO (Zero Redundancy Optimizer), LoRA
Trained on: up to 8-GPU servers
Min. hardware for finetuning: 4x 32GB GPUs
Min. hardware for inference: 4GB VRAM (4-bit quantized, 7B)

Koala
Release date: 03.04.2023
Profile: Dialog
Parameters: 13B (but also 7B, 30B, 65B)
Languages: mainly EN, but also code
German language support: very little
Training data size: ~500k examples
Training dataset: ShareGPT, HC3, OIG, Stanford Alpaca, Anthropic HH, OpenAI WebGPT, OpenAI Summarization
Technology: implemented with JAX/Flax in EasyLM
Trained on: 8x A100 GPUs for 6 hours
Min. hardware for finetuning: ?
Min. hardware for inference: 14GB VRAM, T4 (7B model)

Baize
Release date: 03.04.2023
Profile: Dialog
Parameters: 7B
Languages: + EN
German language support: very little
Training data size: 111.5k + 52k examples
Training dataset: dialogs generated by letting ChatGPT chat with itself on topics from Quora and Stack Overflow, plus the Alpaca dataset
Technology: Adapter, BitFit, Diff pruning, Prefix Tuning, LoRA
Trained on: A100 80GB GPU
Min. hardware for finetuning: 26GB VRAM (with int8)
Min. hardware for inference: 16GB VRAM (without int8)

Baize, 13B
Min. hardware for finetuning: 25GB VRAM (with int8)
Min. hardware for inference: 28GB VRAM (without int8)

Baize, 30B
Min. hardware for finetuning: 67GB VRAM (with int8)
Min. hardware for inference: 67GB VRAM (without int8)

Baize, 7B (Medical)
Languages: + EN (medical)
Training data size: (111.5k + 47k) + 52k examples
Training dataset: as above, plus generated dialogs on MedQuAD questions
Min. hardware for finetuning: 42GB VRAM (with int8)
Min. hardware for inference: 16GB VRAM (without int8)

GPT4All
Release date: 29.03.2023
Profile: Dialog
Parameters: 7B
Languages: + EN + code
German language support: very little
Training data size: ~440k examples
Training dataset: GPT-3.5-Turbo generations, based on LLaMA
Technology: LoRA
Trained on: 8x A100 80GB GPUs for 8 hours
Min. hardware for finetuning: 16GB VRAM
Min. hardware for inference: full model on GPU (16GB of RAM required), or quantized model on CPU with 8GB RAM (see the CPU-inference sketch after this entry)

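Several entries in this overview quote CPU-only figures for 4-bit quantized (GGML-style) builds, as alpaca.cpp and the quantized GPT4All weights do. The snippet below is a minimal sketch of that pattern using the llama-cpp-python bindings; the model path is a placeholder, and the quantized weights have to be obtained separately.

```python
# Minimal sketch of CPU-only inference on 4-bit quantized weights via llama-cpp-python.
# The model path is a placeholder; quantized GGML weights must be downloaded separately.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/ggml-model-q4_0.bin",  # placeholder path to 4-bit quantized weights
    n_ctx=512,                                  # context window
    n_threads=8,                                # CPU threads to use
)

output = llm(
    "### Instruction:\nName three open-source LLMs.\n\n### Response:\n",
    max_tokens=128,
    stop=["###"],
)
print(output["choices"][0]["text"])
```
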
BLOOM (also distributed via Petals)
Release date: July 2022
Profile: Text completion
Parameters: (560M-)176B
Languages: 46 languages and 13 programming languages
German language support: no
Training data size: 1.6TB of pre-processed text, converted into 350B unique tokens
Training dataset: ROOTS
Technology: decoder-only, ALiBi embeddings
Trained on: 416x A100 80GB GPUs for 4 months
Min. hardware for finetuning: 3x A100 (8-bit), or essentially none if using Petals
Min. hardware for inference: possible on CPU (16GB RAM, 1TB SSD) at roughly 3 minutes per token

BLOOMz (also distributed via Petals)
Release date: 03.11.2022
Profile: Dialog
Parameters: (560M-)176B
Languages: 13 training tasks across 46 languages with English prompts, plus prompts in 19 additional languages that were machine-translated from the English prompts
German language support: very little
Training data size: 13 tasks
Training dataset: BigScience's xP3 datasets
Trained on: 288x A100 80GB GPUs
Min. hardware for finetuning: 144x A100, or essentially none if using Petals
Min. hardware for inference: 4x A100 (8-bit), or essentially none if using Petals (an 8-bit loading sketch follows this entry)

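The BLOOM-family hardware figures above assume 8-bit weights loaded with bitsandbytes and accelerate. As a hedged sketch, the snippet below loads one of the smaller BLOOMZ checkpoints that way; the bloomz-7b1 variant is chosen purely for illustration, since the 176B model needs multiple A100s or the Petals swarm.

```python
# Hedged sketch: 8-bit loading of a BLOOMZ checkpoint with accelerate's device_map.
# bloomz-7b1 is used for illustration; the 176B model does not fit on a single GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloomz-7b1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",     # let accelerate place layers on the available GPUs/CPU
    load_in_8bit=True,     # bitsandbytes int8 roughly halves VRAM versus fp16
)

inputs = tokenizer("Translate to German: The weather is nice today.",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
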
BLOOM LoRA
Release date: 27.03.2023
Profile: Dialog
Parameters: 7B
Languages: same as BLOOM, plus more EN
German language support: no
Training data size: 50k + 5k examples
Training dataset: Alpaca (cleaned), ChatDoctor
Technology: LoRA, PEFT
Trained on: RTX 4090 for 5 hours
Min. hardware for finetuning: ?
Min. hardware for inference: ?

BLOOM-CLP German
Release date: 25.01.2023
Profile: Dialog
Parameters: 6.4B
Languages: DE
German language support: yes
Training data size: 50.4B tokens
Training dataset: German OSCAR dataset, German court decisions from Open Legal Data
Technology: cross-lingual and progressive transfer learning
Trained on: 32x A100 40GB GPUs for 12.5 days
Min. hardware for finetuning: ?
Min. hardware for inference: ?

IGEL (on BLOOM-CLP German)
Release date: 04.04.2023
Profile: Dialog
Parameters: 6.4B
Languages: DE
German language support: yes
Training data size: ?
Training dataset: instructions in English translated into German using an automated translation tool
Technology: LoRA
Trained on: ?
Min. hardware for finetuning: ?
Min. hardware for inference: ?

GPT-NeoXT-Chat-Base-20B (GPT-NeoX-20B)
Release date: 10.03.2023
Profile: Dialog
Parameters: 20B
Languages: EN
German language support: no
Training data size: 43M instructions
Training dataset: instructions from OIG-43M
Technology: finetuning focused on question answering, classification, extraction, and summarization
Trained on: 2x 8x A100 GPUs
Min. hardware for finetuning: 8x A100 GPUs (6x A100 80GB GPUs for int8)
Min. hardware for inference: 48GB VRAM (24GB VRAM for int8); CPU inference also possible

Pythia-Chat-Base-7B (Pythia)
Release date: 10.03.2023
Profile: Dialog
Parameters: 7B
Trained on: 8x A100 GPUs
Min. hardware for finetuning: 8x GPUs with 32GB
Min. hardware for inference: 24GB VRAM (12GB VRAM for int8); CPU inference also possible

Open Assistant (Pythia 12B)
Release date: 15.04.2023
Profile: Dialog
Parameters: 12B
Languages: EN
German language support: in the dataset, but not in the model
Training data size: 625k tasks, or >10k conversation trees
Training dataset: the OpenAssistant Conversations dataset, a collection of conversational data obtained through a crowdsourcing effort involving more than 13,000 volunteers; the process was divided into five separate steps: prompting, labelling prompts, adding reply messages as prompter or assistant, labelling replies, and ranking assistant replies (a loading sketch follows this entry)
Technology: Supervised Fine-Tuning (SFT)
Trained on: ?
Min. hardware for finetuning: ? (does not seem to be meant for finetuning)
Min. hardware for inference: 40GB VRAM (8-bit)

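The OpenAssistant Conversations data described above has been released publicly on Hugging Face; the sketch below shows one way to inspect it with the datasets library. The dataset ID "OpenAssistant/oasst1" and the field names are assumptions based on that public release.

```python
# Hedged sketch: inspecting the crowdsourced OpenAssistant Conversations data.
# The dataset ID and field names are assumptions based on the public HF release.
from datasets import load_dataset

oasst = load_dataset("OpenAssistant/oasst1", split="train")
print("messages:", len(oasst))
first = oasst[0]
print(first["role"], ":", first["text"][:80])   # each row is one prompter/assistant message
```
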
Dolly 2.0
Release date: 12.04.2023
Profile: Dialog
Parameters: 13B
Languages: EN
German language support: no
Training data size: 15k examples
Training dataset: a new, high-quality, human-generated instruction-following dataset, crowdsourced among Databricks employees
Technology: Rotary Position Embedding (RoPE)
Trained on: ?
Min. hardware for finetuning: 8x A100 GPUs, or V100 instances with 32GB of GPU memory
Min. hardware for inference: 13B: A100, or A10 24GB (8-bit); smaller hardware for the smaller models

Dolly (GPT-J)
Release date: 24.03.2023
Profile: Dialog
Parameters: 6B
Languages: EN
German language support: no
Training data size: 52k examples
Training dataset: Alpaca
Technology: Rotary Position Embedding (RoPE)
Trained on: 8x A100 40GB GPUs for 30 minutes
Min. hardware for finetuning: 4x A10 24GB, or 8x V100 with 32GB of GPU memory
Min. hardware for inference: ?

GPT4All-J (GPT-J)
Release date: 13.04.2023
Profile: Dialog
Parameters: 6B
Languages: EN
German language support: no
Training data size: 800k data points
Training dataset: subsets of LAION OIG, coding questions with a random sub-sample of Stack Overflow questions, a sub-sample of BigScience/P3, and custom-generated creative questions
Technology: ?
Trained on: 8x A100 80GB GPUs for ~12 hours
Min. hardware for finetuning: ?
Min. hardware for inference: runs on CPU only with 16GB RAM

FLAN-UL2 (UL2)
Release date: 03.03.2023
Profile: Dialog
Parameters: 20B
Languages: EN, FR, DE + code
German language support: yes
Training data size: 1 trillion tokens
Training dataset: taskmaster2, djaym7/wiki_dialog, deepmind/code_contests, lambada, gsm8k, aqua_rat, esnli, qasc, and qed
Technology: T5 architecture, Mixture-of-Denoisers (MoD)
Trained on: TPU v3 or TPU v4 pods, using the t5x codebase together with JAX
Min. hardware for finetuning: ?
Min. hardware for inference: A10G with 24GB VRAM via Amazon SageMaker, or an A100 GPU (a loading sketch follows this entry)

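Unlike most models in this overview, FLAN-UL2 is an encoder-decoder (T5-style) model, so it is loaded through the seq2seq classes rather than the causal-LM classes. The sketch below pairs that with 8-bit loading, which is one way to stay within the roughly 24GB of VRAM quoted above; the prompt is an arbitrary example.

```python
# Hedged sketch: FLAN-UL2 inference with 8-bit weights (seq2seq, not causal LM).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-ul2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)

prompt = "Answer the following question: what is the capital of Bavaria?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
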
RWKV-4-Raven (RWKV)
Release date: 30.03.2023
Profile: Dialog
Parameters: 3B/7B/14B
Languages: EN
German language support: no
Training data size: ?
Training dataset: Alpaca, CodeAlpaca, Guanaco, GPT4All, ShareGPT and more
Technology: RNN
Trained on: n x A100 40GB GPUs provided by Stability AI and EleutherAI
Min. hardware for finetuning: 80GB VRAM (for the 14B model)
Min. hardware for inference: 12-16GB RAM or 9-15GB VRAM (can also be combined)

Cerebras-GPT
Release date: 06.04.2023
Profile: Text completion
Parameters: (111M-)13B
Languages: EN
German language support: no
Training data size: 400B tokens
Training dataset: The Pile, which consists of 22 smaller datasets, including Common Crawl, PubMed Central, Books3, OpenWebText2, GitHub, and arXiv
Technology: Chinchilla scaling laws, Cerebras' weight streaming technology
Trained on: the Andromeda AI supercomputer, comprising 16 CS-2 wafer-scale systems
Min. hardware for finetuning: ?
Min. hardware for inference: ?

Visualization in good resolution