Scaling laws for neural language models
Kaplan, McCandlish, et al., 2020
Abhishek Moturu
Addison Weatherhead
Umangi Jain
5 February 2024
Why study scaling laws in deep learning?
Fixed Budget
Essential ingredients for training a deep learning model
What is an optimal allocation with fixed budget?
More data
Essential ingredients for training a deep learning model
What is an optimal allocation with fixed budget?
Higher capacity models
Essential ingredients for training a deep learning model
What is an optimal allocation with fixed budget?
More compute
Essential ingredients for training a deep learning model
What is an optimal allocation with fixed budget?
Diminishing returns
Would it help to increase data, keeping the same model?
How about increasing model parameters for the same data?
Assuming sufficient compute, every time I scale my model by x times, how much more data do I need to get the best returns?
Predicting capabilities
Given that we have established a power law in the number of parameters, can we get an estimate of the performance of a model with 10^10 parameters?
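As a rough illustration of this kind of extrapolation, the sketch below fits a power law in log-log space to losses of smaller models and extrapolates to 10^10 parameters. The parameter counts and losses here are made-up, purely illustrative values, not numbers from the paper.

import numpy as np

# Hypothetical losses of smaller models (illustrative values only, not from the paper)
params = np.array([1e6, 1e7, 1e8, 1e9])   # non-embedding parameter counts N
losses = np.array([5.0, 4.2, 3.5, 2.9])   # test loss in nats/token (made up)

# Fit L(N) = c * N**slope by linear regression in log-log space (slope comes out negative)
slope, log_c = np.polyfit(np.log(params), np.log(losses), 1)

# Extrapolate the fitted power law to a 10^10-parameter model
predicted = np.exp(log_c) * (1e10 ** slope)
print(f"Extrapolated loss at N = 1e10: {predicted:.2f}")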
Success is often a line, not a point
Would we endorse the new idea?
Image credit: Jared Kaplan
Which scaling laws to investigate in deep learning?
Too many decisions for any experiment setting!
What kind of “data” do scaling laws refer to for a particular task?
5x larger but noisier data
When should we care?
Other factors contributing to the complexity?
Good news: Performance does not depend strongly on all choices
Transformers asymptotically outperform LSTMs due to improved use of long contexts
As long as the learning rate is not too small and does not decay too quickly, performance does not depend strongly on it
Scaling laws before advent of large neural nets
Density estimation
Random forests
Biau et al., Analysis of a random forests model, 2012
Lobacheva et al., On Power Laws in Deep Ensembles, 2020
Evolution of MSE with respect to increasing data for random forests
Negative log-likelihood with respect to ensemble size
Are these scaling laws consistent across studies?
Inconsistent findings
[Hestness et al. 2017] “We also show that model size scales sublinearly with data size”
[Kaplan et al. 2020] “We should increase the dataset size sublinearly [...] with model size”
Hestness et al., Deep Learning Scaling is Predictable, Empirically, 2017
For language modeling, the two analyses reveal sublinear scaling of model size with dataset size, and of dataset size with model size, respectively
For the first analysis, dataset size: 2^21 - 2^29 tokens and model parameters: 1M - 177M
For the second analysis, dataset size: 2^23 - 2^33 tokens and model parameters: 0.1M - 1000M
Across modalities
Perplexity on speech tokens of a 2.7B model
Aghajanyan et al., Scaling Laws for Generative Mixed-Modal Language Models, 2023
Certain modalities flatten out during training
Emergent Abilities: an unpredictable phenomenon
Emergent abilities cannot be predicted simply by extrapolating the performance of smaller models
Wei et al., Emergent Abilities of Large Language Models, 2022
Why study scaling laws for language?
Orders of magnitude for language modeling
Dataset trends
GPT-4, T5, Bard, Gemini, Grok, Claude, Megatron, Jurassic are all pushing the frontier further
Compute
1 PF-day = 10^15 FLOP/s × 24 × 3600 s = 8.64 × 10^19 FLOPs
It is speculated that GPT-4 might have around FLOPs
1 A100 GPU performs ~ 2e19 FLOP/day
GPT-3
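The unit conversion above can be checked with a few lines of arithmetic; the A100 throughput used below (~312 TFLOP/s dense BF16 peak) is an assumed spec, included only to sanity-check the ~2e19 FLOP/day figure.

# Sanity check of the compute units on this slide
SECONDS_PER_DAY = 24 * 3600
pf_day = 1e15 * SECONDS_PER_DAY          # FLOPs in one PF-day
print(f"1 PF-day = {pf_day:.2e} FLOPs")  # 8.64e+19

# Assumed A100 throughput: ~312 TFLOP/s (dense BF16 peak)
a100_flop_per_day = 312e12 * SECONDS_PER_DAY
print(f"1 A100-day ~ {a100_flop_per_day:.1e} FLOPs")  # ~2.7e+19 peak, consistent with ~2e19 sustained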
Model Size
Modern-day LLMs have crossed a trillion parameters now: Switch-C, GLaM, MoE-1.1T
Questions
Are they universal?
Will they break down?
When will we reach limits?
How capable will the next generation of models be?
What problems can scaling not solve?
Key Findings
Models: Decoder-only Transformer models
Train set: WebText2 dataset consists of 20.3M documents (96 GB of text, 1.62 × 10^10 words)
Test sets: Books Corpus, Common Crawl, English Wikipedia, and some publicly-available Internet Books
Performance depends on scale
Performance is closely linked with scale, including the size of the model (N), the size of the dataset (D), and computational power (C).
Scaling up these factors in tandem leads to significant improvements in model performance.
Performance depends very weakly on other architectural hyperparameters such as depth or width.
Performance depends on scale
Performance depends very mildly on model shape when the total number of non-embedding parameters N is held fixed. The loss varies only a few percent over a wide range of shapes.
Smooth power laws
A power-law relationship exists between model performance and each of the scale factors (number of parameters, dataset size, compute) when not bottlenecked by the other two.
This relationship is consistent across various magnitudes, indicating predictable performance gains from increased scale.
Smooth power laws
Language modeling performance improves smoothly as we increase the model size N, dataset size D, and compute C used for training.
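A minimal sketch of these three fits in code, using the approximate constants reported in the paper (α_N ≈ 0.076, N_c ≈ 8.8 × 10^13; α_D ≈ 0.095, D_c ≈ 5.4 × 10^13; α_C ≈ 0.050, C_c ≈ 3.1 × 10^8 PF-days). Treat the exact numbers as values copied from the paper's fits, valid only within the ranges it studied.

# Power-law fits from Kaplan et al. (2020); constants are the paper's approximate fitted values
def loss_vs_params(N, N_c=8.8e13, alpha_N=0.076):
    """Test loss vs. non-embedding parameters N, with data and compute not bottlenecked."""
    return (N_c / N) ** alpha_N

def loss_vs_data(D, D_c=5.4e13, alpha_D=0.095):
    """Test loss vs. dataset size D (tokens), for large models trained with early stopping."""
    return (D_c / D) ** alpha_D

def loss_vs_compute(C_min, C_c=3.1e8, alpha_C=0.050):
    """Test loss vs. compute C_min (PF-days), for compute-optimally trained models."""
    return (C_c / C_min) ** alpha_C

print(loss_vs_params(1e9), loss_vs_data(1e10), loss_vs_compute(1e3))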
Universality of overfitting
Increasing model size and dataset size together leads to consistent performance improvements.
Overfitting becomes a concern when scaling is unbalanced, emphasizing the need for balanced scaling.
Every time we increase the model size 8x, we only need to increase the data by roughly 5x to avoid a penalty.
Universality of overfitting
For large D, performance is a straight power law in N.
For a smaller fixed D, performance stops improving as N increases and the model begins to overfit.
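These observations follow from the combined scaling law L(N, D) proposed in the paper; a minimal sketch is below, using the approximate fitted constants reported for that equation (α_N ≈ 0.076, α_D ≈ 0.103, N_c ≈ 6.4 × 10^13, D_c ≈ 1.8 × 10^13), which should be read as approximate values rather than exact ones.

# Combined model-size / dataset-size law from Kaplan et al. (2020)
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13

def loss(N, D):
    """Test loss as a function of model size N and dataset size D (with early stopping)."""
    return ((N_C / N) ** (ALPHA_N / ALPHA_D) + D_C / D) ** ALPHA_D

# The 8x-model / ~5x-data rule: keeping the overfitting penalty fixed requires
# D to grow like N**(ALPHA_N / ALPHA_D) ~ N**0.74, and 8**0.74 is roughly 5
print(8 ** (ALPHA_N / ALPHA_D))   # ~4.6

# Scaling N by 8x and D by ~4.6x keeps the data term D_C/D in roughly the same
# proportion to the model-size term, so loss drops with no extra overfitting penalty
print(loss(1e9, 2e10), loss(8e9, 4.6 * 2e10))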
Universality of training
Training efficiency and curves exhibit universal patterns across models of different sizes.
Early training behavior can be used to forecast long-term performance and help with efficient resource allocation.
Universality of training
For large N, performance is a straight power law in D.
For a smaller fixed N, performance stops improving as D increases.
Transfer improves with test performance
Models trained on one text distribution then evaluated on another maintain a strong correlation with their training validation performance.
That is to say, transfer to a different distribution incurs a constant penalty but otherwise improves roughly in line with performance on the training set.
Transfer improves with test performance
Generalization performance to other data distributions improves smoothly with model size. Generalization performance depends only on training distribution performance, and not on the phase of training (points: converged models, dashed line: single large model).
Sample efficiency
Larger models require fewer data points and optimization steps to achieve comparable performance levels to smaller models.
This highlights the benefits of large-scale models in terms of learning speed and resource utilization.
This suggests prioritizing scale in model development.
Sample efficiency
Convergence is inefficient
Achieving optimal performance within a fixed compute budget involves training very large models for fewer epochs, stopping significantly before full convergence.
This emphasizes the importance of sample efficiency and smart use of computational resources.
Convergence is inefficient
For optimally compute-efficient training, most of the budget increase should go toward larger models, a smaller share toward more data (avoiding reuse, via larger batch sizes), and only a very small share toward longer serial training time.
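A small sketch of this allocation rule, using the approximate exponents the paper reports for compute-efficient training (model size N ∝ C^0.73, batch size B ∝ C^0.24, serial steps S ∝ C^0.03); the exponents are taken from the paper's fits and should be treated as approximate.

# How a larger compute budget is split under the paper's compute-efficient scaling fits
def allocate(compute_multiplier, p_model=0.73, p_batch=0.24, p_steps=0.03):
    """Return how much to scale model size, batch size, and serial steps for a given compute increase."""
    return {
        "model_size_x": compute_multiplier ** p_model,
        "batch_size_x": compute_multiplier ** p_batch,
        "serial_steps_x": compute_multiplier ** p_steps,
    }

print(allocate(10))  # ~5.4x larger model, ~1.7x larger batch, ~1.07x more serial steps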
Optimal batch size
The optimal batch size is linked to loss and gradient noise scale, with empirical data suggesting a range for the largest models.
Gradient noise scale “quantifies the signal-to-noise ratio of the network gradients, i.e. the noise scale measures the variation in the data as seen by the model - when the noise scale is small, looking at a lot of data in parallel quickly becomes redundant, whereas when it is large, we can learn a lot from huge batches of data.” *
Adjusting batch size according to these parameters can optimize training efficiency and model performance.
* Source: https://openai.com/research/how-ai-training-scales
Optimal batch size
The critical batch size follows a power law in the loss as performance increases, and does not depend on the model size. The critical batch size approximately doubles for every 13% decrease in loss.
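A short sketch of this relation, using the fit B_crit(L) ≈ B_* / L^(1/α_B) with the approximate values B_* ≈ 2 × 10^8 tokens and α_B ≈ 0.21 reported in the paper; it also checks the "doubles for every 13% decrease in loss" statement numerically.

# Critical batch size fit from Kaplan et al. (2020): B_crit(L) = B_star / L**(1/alpha_B)
B_STAR, ALPHA_B = 2e8, 0.21   # approximate fitted values; B_star is in tokens

def critical_batch_size(loss):
    return B_STAR / loss ** (1 / ALPHA_B)

# A 13% lower loss should roughly double the critical batch size
L = 3.0
print(critical_batch_size(0.87 * L) / critical_batch_size(L))  # ~1.9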
Future Work & Implications
Loss can’t always decrease
-Obviously loss must level off at some point
-Natural language has non-zero entropy
-The size of available datasets and compute limits experiments
Loss can’t always decrease
-Obviously loss must level off at some point
-Contradiction in predicted loss: extrapolating the compute-efficient frontier L(C_min) eventually predicts a lower loss than the data scaling law L(D) allows for the amount of data that compute can process, so the power laws must break down before that point
-C_min is the minimum amount of compute needed to obtain a certain loss
Vision Transformers
-Conducted experiments with different sizes of Vision Transformers and dataset sizes, and evaluated on a transfer learning task
-Pretrained ViTs on ImageNet-21k and a set of weakly labelled images
-Then either fine-tuned on a new dataset or trained a few-shot linear classifier on frozen weights
Vision Transformers
-Conclusion: Larger models are more sample efficient, just as in language
Multi-modal models
Aghajanyan et al., Scaling Laws for Generative Mixed-Modal Language Models, 2023
Implications of Scaling Laws in LLMs
'An Actually-Good Argument Against Naive AI Scaling'
-The naive AI Scaling Argument: Just scaling up GPT models will eventually lead to super general intelligence
-Could a future GPT model play at 5000 ELO?
-Limits of data from the internet
-Train narrow AI and use that to train the future GPT
‘An Actually-Good Argument Against Naive AI Scaling’, Jacob Buckman
Questions/Discussion