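| Model Name | Dataset | Tokenizer | Training Library | Pos Embed | Normalization | Norm H Params | Parallel Layers | Biases | Act Func | d_attn / d_ff | Optimizer | Optimizer H Params | LR Warm-Up | LR Decay | Precision | Clipping | Dropout | Weight Decay | Misc. | Date | Source |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |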
| Recommended | Your favorite | GPT-NeoX-20B | GPT-NeoX | Rotary w/ ctx extension | LayerNorm | n/a | Yes | No | GeLU | 4 | AdamW | 0.9, 0.95 | Linear | Cosine to 10% | fp32 / bf16 | 1.0 | 0.0 | 0.10 |  | 1/1/2024 | Stella Biderman |
| GPT-1 | Unreleased | GPT-1 | Unreleased | Learned | LayerNorm | n/a | No | Yes | GeLU | 4 | Adam | Not disclosed |  | Cosine to 0 | Not disclosed | Not disclosed | 0.1 | Not disclosed |  | June 11, 2018 | Paper |
| GPT-2 | Unreleased | GPT-2 | Unreleased | Learned | LayerNorm | n/a | No | Yes | GeLU | 4 | Adam | Not disclosed |  | Cosine to 0 | Not disclosed | Not disclosed | 0.1 | Not disclosed |  | February 14, 2019 |  |
| GPT-3 | Unreleased | GPT-2 | Unreleased | Learned | LayerNorm | n/a | No | Yes | GeLU | 4 | Adam | 0.9, 0.95 |  | Cosine to 10% | fp32 / fp16 | 1.0 | 0.0 | 0.10 | Alternating sparse and dense layers | May 28, 2020 | Paper |
| GPT-Neo | the Pile | GPT-2 | GPT-Neo | Learned | LayerNorm | n/a | No | Yes | GeLU | 4 | Adam | 0.9, 0.95 |  | Cosine to ??? | fp32 / bf16 | 1.0 | 0.0 | 0.10 | Sliding window attention | March 22, 2021 | Config file |
| GPT-J | the Pile | GPT-2 | mesh-transformer-jax | Rotary | LayerNorm | n/a | Yes | Yes | GeLU | 4 | Adam | 0.9, 0.999, 1e-08 |  | Cosine to 10% | fp32 / bf16 | 1.0 | 0.0 | 0.10 |  | June 8, 2021 | Config file |
| FairSeq Dense | Unreleased | GPT-2 | FairSeq | Sinusoidal | LayerNorm | n/a | No | Yes | GeLU | 4 | Adam | 0.9, 0.98 |  | Linear to 0 | Pure fp16 | 1.0 | 0.1 | 0.01 |  | December 20, 2021 | Paper |
| Gopher | Unreleased | Unreleased | Unreleased | Transformer-XL-style | RMSNorm |  | No | Yes | GeLU | 4 | Adam | Not disclosed |  | Cosine to 10% | Pure bf16 | 1.0 | Not disclosed | Not disclosed |  | December 8, 2021 |  |
| GPT-NeoX | the Pile | GPT-NeoX-20B | GPT-NeoX | Rotary | LayerNorm | n/a | Yes | Yes | GeLU | 4 | Adam | 0.9, 0.95 |  | Cosine to 10% | fp32 / fp16 | 1.0 | 0.0 | 0.01 |  | February 2, 2022 | Paper; Default config behavior |
| PaLM | Unreleased | Unreleased | Unreleased | Rotary | LayerNorm | n/a | Yes | No | SwiGLU | 4 | Adafactor w/o factorization | 0.9, 1 - step_num^-0.8 |  | 1/sqrt(step_num) | fp32 / bf16 | 1.0 | 0.0 | lr^2.0 |  | April 4, 2022 | Paper |
| OPT | the Pile + Unreleased | GPT-2 | Unreleased | Learned | LayerNorm | n/a | No | Yes | ReLU | 4 | AdamW | 0.9, 0.95 |  | Custom to 10% | fp32 / fp16 | 1.0 | 0.1 | 0.10 |  | May 2, 2022 | Paper; HF implementation |
| BLOOM | ROOTS | BLOOM | Megatron-DeepSpeed | Alibi | LayerNorm | n/a | No | Yes | GeLU | 4 | Adam | 0.9, 0.95 |  | Cosine to ??? | fp32 / bf16 | 1.0 | 0.0 | 0.10 |  | May 26, 2022 | Paper; Config file |
| Pythia | the Pile | GPT-NeoX-20B | GPT-NeoX | Rotary | LayerNorm | n/a | Yes | Yes | GeLU | 4 | Adam | 0.9, 0.95 |  | Cosine to 10% | fp32 / fp16 | 1.0 | 0.0 | 0.10 |  | December 10, 2022 | Paper; Default config behavior |
| LLaMA | Unreleased | LLaMA | Unreleased | Rotary | RMSNorm |  | No | Yes | SwiGLU | 8/3 | AdamW | 0.9, 0.95 |  | Cosine to 10% | fp32 / fp16 | 1.0 | 0.0 | 0.10 |  | February 24, 2023 | Paper; Looking at the actual weights |
RedPajama-INCITERedPajamasGPT-NeoX-20BGPT-NeoXRotaryLayerNormn/aNoYesGeLU4Not disclosedNot disclosedNot disclosedfp32/fp16Not disclosedMay 5, 2023
| MPT | C4, RP, Stack, S2ORC | GPT-NeoX-20B | Composer | Alibi | QKNorm |  | No | No | GeLU | 4 | LION | Not disclosed |  | Not disclosed | fp32 / bf16 | Not disclosed | 0.0 | Not disclosed |  | May 5, 2023 |  |
| CerebrasGPT | the Pile | GPT-2 | Unreleased | Learned | LayerNorm | n/a | No | Yes | GeLU | 4 | Adam | 0.9, 0.95, 1e-9 | Linear over 375M tokens | Cosine to 10% | fp32 / bf16 | 1.0 | 0.0 | 0.10 | Uses muP (but only up to 3B!) |  |  |
| LLaMA 2 | Unreleased | LLaMA | Unreleased | Rotary w/ ctx extension | RMSNorm |  | No | No | SwiGLU | 8/3 | AdamW | 0.9, 0.95 |  | Cosine to 10% | fp32 / bf16 | 1.0 | 0.0 | 0.10 |  | July 18, 2023 | Paper; Looking at the actual weights |
WeLabUnreleasedWeLabGPT-NeoXRotaryLayerNormn/aYesYesGeLU4AdamNot disclosedNot disclosedNot disclosedNot disclosedNot disclosed
| Stable LM v2 | the Pile + Unreleased | GPT-NeoX-20B | GPT-NeoX | Rotary w/ ctx extension | LayerNorm | n/a | Yes | Norms only | SwiGLU | 8/3 | Adam | 0.9, 0.95 |  | Cosine to 10% | fp32 / fp16 | 1.0 | 0.0 | 0.0001 |  | August 5, 2023 | Config file |
| Falcon-180B | Unreleased | Falcon | Unreleased | Rotary w/ ctx extension | LayerNorm | n/a | Yes | Not disclosed | GeLU | 4 | AdamW | Not disclosed |  | Cosine to 10% | fp32 / bf16 | 1.0 | 0.0 | 0.10 |  | September 6, 2023 | Model card |
MistralUnreleasedMistralUnreleasedRotary w/ ctx extensionRMSNormNoNoSwiGLU8/3Not disclosedNot disclosedfp32 / bf16Not disclosedSliding window attentionSeptember 27, 2023
| Qwen | Unreleased | Qwen | Unreleased | Rotary w/ ctx extension | RMSNorm |  | No | QKV-only | SwiGLU | 8/3 | AdamW | 0.9, 0.95, 1e-8 | Not disclosed | Cosine to 10% | fp32 / bf16 | 1.0 | 0.1 | 0.10 |  | September 24, 2023 |  |
YiUnreleasedLLaMAUnreleasedRotaryRMSNormNoNoSwiGLU8/3Not disclosedNot disclosedfp32 / bf16Not disclosed
| Amber | RP + RW + StarCoder | Amber | Amber-Train | Rotary | RMSNorm | 1e-6 | No | Yes | SwiGLU | 8/3 | AdamW | 0.9, 0.95 | Linear over 9.1B tokens | Cosine to 10% | fp32 / bf16 | 1.0 | 0.0 | 0.10 |  |  | Paper |
| TeleChat | Unreleased | TeleChat | Megatron-DeepSpeed | Rotary w/ ctx extension | RMSNorm | 1e-5 | No | Yes | SwiGLU | 8/3 < x < 4 | Adam | 0.9, 0.95, 1e-5 | Linear over 1B tokens | Cosine to 10% | fp32 / bf16 | 1.0 | 0.1 | 0.0001 | Batch size ramp-up | 1/8/2024 | Paper |
| InternLM | Unreleased | InternLM | InternLM | Rotary w/ ctx extension | RMSNorm | 1e-6 | Not disclosed | Not disclosed | SwiGLU | Not disclosed | Not disclosed | Not disclosed | Not disclosed | Not disclosed | Not disclosed | Not disclosed | Not disclosed | Not disclosed |  |  |  |
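
Many rows above pair a linear LR Warm-Up with an LR Decay of "Cosine to 10%". As a rough illustration of what that combination means, here is a minimal sketch, not taken from any of the listed codebases. The function name `lr_at_step` and the parameters `peak_lr`, `warmup_steps`, `total_steps`, and `final_fraction` are hypothetical names chosen for this example; the table does not give warm-up lengths or peak learning rates for most models.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps, total_steps, final_fraction=0.1):
    """Linear warm-up to peak_lr, then cosine decay to final_fraction * peak_lr.

    All arguments are illustrative assumptions; none of the values come
    from the table above.
    """
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps

    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)

    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = final_fraction * peak_lr
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    # Hypothetical run: 1,000 warm-up steps out of 10,000 total.
    for s in (0, 999, 1000, 5500, 10000):
        print(s, lr_at_step(s, peak_lr=3e-4, warmup_steps=1000, total_steps=10000))
```

Setting `final_fraction=0.0` gives the "Cosine to 0" behavior listed for GPT-1 and GPT-2; the other schedules in the table (for example PaLM's 1/sqrt(step_num)) would need a different decay branch.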