Scaling laws for neural language models
Kaplan, McCandlish, et al., 2020
Abhishek Moturu
Addison Weatherhead
Umangi Jain
5 February 2024
Why study scaling laws in deep learning?
Fixed Budget
Essential ingredients for training a deep learning model
What is an optimal allocation with fixed budget?
More data
Essential ingredients for training a deep learning model
What is an optimal allocation with fixed budget?
Higher capacity models
Essential ingredients for training a deep learning model
What is an optimal allocation with fixed budget?
More compute
Essential ingredients for training a deep learning model
What is an optimal allocation with fixed budget?
Diminishing returns
Would it help to increase data, keeping the same model?
How about increasing model parameters for the same data?
Assuming sufficient compute, every time I scale my model by x times, how much more data do I need to get the best returns?
Predicting capabilities
Given that we have established a power law in the number of parameters, can we get an estimate of the performance of a model with 10^10 parameters?
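As a rough illustration of this kind of extrapolation, the sketch below fits a power law in log-log space to losses of smaller models and extrapolates to 10^10 parameters. The parameter counts and losses here are made-up, purely illustrative values, not numbers from the paper.

import numpy as np

# Hypothetical losses of smaller models (illustrative values only, not from the paper)
params = np.array([1e6, 1e7, 1e8, 1e9])   # non-embedding parameter counts N
losses = np.array([5.0, 4.2, 3.5, 2.9])   # test loss in nats/token (made up)

# Fit L(N) = c * N**slope by linear regression in log-log space (slope comes out negative)
slope, log_c = np.polyfit(np.log(params), np.log(losses), 1)

# Extrapolate the fitted power law to a 10^10-parameter model
predicted = np.exp(log_c) * (1e10 ** slope)
print(f"Extrapolated loss at N = 1e10: {predicted:.2f}")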
Success is often a line, not a point
Would we endorse the new idea?
Image credit: Jared Kaplan
Which scaling laws to investigate in deep learning?
Too many decisions for any experiment setting!
What kind of “data” do scaling laws refer to for a particular task?
5x larger but noisier data
When should we care?
Other factors contributing to the complexity?
Good news: Performance does not depend strongly on all choices
Transformers asymptotically outperform LSTMs due to improved use of long contexts
As long as the learning rate is not too small and does not decay too quickly, performance does not depend strongly on it
Scaling laws before advent of large neural nets
Density estimation
Random forests
Biau et al., Analysis of a random forests model, 2012
Lobacheva et al., On Power Laws in Deep Ensembles, 2020
Evolution of MSE with respect to increasing data for random forests
Negative log-likelihood with respect to ensemble size
Are these scaling laws consistent across studies?
Inconsistent findings
[Hestness et al. 2017] “We also show that model size scales sublinearly with data size”
[Kaplan et al. 2020] “We should increase the dataset size sublinearly [...] with model size”
Hestness et al., Deep Learning Scaling is Predictable, Empirically, 2017
For language modeling, the two analyses reveal sublinear scaling of model size with dataset size, and of dataset size with model size, respectively
For the first analysis, dataset size: 2^21 - 2^29 tokens and model parameters: 1M - 177M
For the second analysis, dataset size: 2^23 - 2^33 tokens and model parameters: 0.1M - 1000M
Across modalities
Perplexity on speech tokens of a 2.7B model
Aghajanyan et al., Scaling Laws for Generative Mixed-Modal Language Models, 2023
Certain modalities flatten out during training
Emergent Abilities: an unpredictable phenomenon
Emergent abilities cannot be predicted simply by extrapolating the performance of smaller models
Wei et al., Emergent Abilities of Large Language Models, 2022
Why study scaling laws for language?
Orders of magnitude for language modeling
Dataset trends
GPT-4, T5, Bard, Gemini, Grok, Claude, Megatron, Jurassic are all pushing the frontier further
Compute
1 PF-day = 10^15 FLOP/s × 24 × 3600 s = 8.64 × 10^19 FLOPs
It is speculated that GPT-4 might have around FLOPs
1 A100 GPU performs ~ 2e19 FLOP/day
GPT-3
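The unit conversion above can be checked with a few lines of arithmetic; the A100 throughput used below (~312 TFLOP/s dense BF16 peak) is an assumed spec, included only to sanity-check the ~2e19 FLOP/day figure.

# Sanity check of the compute units on this slide
SECONDS_PER_DAY = 24 * 3600
pf_day = 1e15 * SECONDS_PER_DAY          # FLOPs in one PF-day
print(f"1 PF-day = {pf_day:.2e} FLOPs")  # 8.64e+19

# Assumed A100 throughput: ~312 TFLOP/s (dense BF16 peak)
a100_flop_per_day = 312e12 * SECONDS_PER_DAY
print(f"1 A100-day ~ {a100_flop_per_day:.1e} FLOPs")  # ~2.7e+19 peak, consistent with ~2e19 sustained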
Model Size
Modern-day LLMs have crossed a trillion parameters now: Switch-C, GLaM, MoE-1.1T
Questions
Are they universal?
Will they break down?
When will we reach limits?
How capable will the next generation of models be?
What problems can scaling not solve?
Key Findings
Models: Decoder-only Transformer models
Train set: WebText2 dataset consists of 20.3M documents (96 GB of text, 1.62 × 10^10 words)
Test sets: Books Corpus, Common Crawl, English Wikipedia, and some publicly-available Internet Books
Performance depends on scale
Performance is closely linked with scale, including the size of the model (N), the size of the dataset (D), and computational power (C).
Scaling up these factors in tandem leads to significant improvements in model performance.
Performance depends very weakly on other architectural hyperparameters such as depth or width.
Performance depends on scale
Performance depends very mildly on model shape when the total number of non-embedding parameters N is held fixed. The loss varies only a few percent over a wide range of shapes.
Smooth power laws
A power-law relationship exists between model performance and each of the scale factors (number of parameters, dataset size, compute) when not bottlenecked by the other two.
This relationship is consistent across various magnitudes, indicating predictable performance gains from increased scale.
Smooth power laws
Language modeling performance improves smoothly as we increase the model size N, dataset size D, and compute C used for training.
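A minimal sketch of these three fits in code, using the approximate constants reported in the paper (α_N ≈ 0.076, N_c ≈ 8.8 × 10^13; α_D ≈ 0.095, D_c ≈ 5.4 × 10^13; α_C ≈ 0.050, C_c ≈ 3.1 × 10^8 PF-days). Treat the exact numbers as values copied from the paper's fits, valid only within the ranges it studied.

# Power-law fits from Kaplan et al. (2020); constants are the paper's approximate fitted values
def loss_vs_params(N, N_c=8.8e13, alpha_N=0.076):
    """Test loss vs. non-embedding parameters N, with data and compute not bottlenecked."""
    return (N_c / N) ** alpha_N

def loss_vs_data(D, D_c=5.4e13, alpha_D=0.095):
    """Test loss vs. dataset size D (tokens), for large models trained with early stopping."""
    return (D_c / D) ** alpha_D

def loss_vs_compute(C_min, C_c=3.1e8, alpha_C=0.050):
    """Test loss vs. compute C_min (PF-days), for compute-optimally trained models."""
    return (C_c / C_min) ** alpha_C

print(loss_vs_params(1e9), loss_vs_data(1e10), loss_vs_compute(1e3))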
Universality of overfitting
Increasing model size and dataset size together leads to consistent performance improvements.
Overfitting becomes a concern when scaling is unbalanced, emphasizing the need for balanced scaling.
Every time we increase the model size 8x, we only need to increase the data by roughly 5x to avoid a penalty.
Universality of overfitting
For large D, performance is a straight power law in N.
For a smaller fixed D, performance stops improving as N increases and the model begins to overfit.
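These observations follow from the combined scaling law L(N, D) proposed in the paper; a minimal sketch is below, using the approximate fitted constants reported for that equation (α_N ≈ 0.076, α_D ≈ 0.103, N_c ≈ 6.4 × 10^13, D_c ≈ 1.8 × 10^13), which should be read as approximate values rather than exact ones.

# Combined model-size / dataset-size law from Kaplan et al. (2020)
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13

def loss(N, D):
    """Test loss as a function of model size N and dataset size D (with early stopping)."""
    return ((N_C / N) ** (ALPHA_N / ALPHA_D) + D_C / D) ** ALPHA_D

# The 8x-model / ~5x-data rule: keeping the overfitting penalty fixed requires
# D to grow like N**(ALPHA_N / ALPHA_D) ~ N**0.74, and 8**0.74 is roughly 5
print(8 ** (ALPHA_N / ALPHA_D))   # ~4.6

# Scaling N by 8x and D by ~4.6x keeps the data term D_C/D in roughly the same
# proportion to the model-size term, so loss drops with no extra overfitting penalty
print(loss(1e9, 2e10), loss(8e9, 4.6 * 2e10))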
Universality of training
Training efficiency and curves exhibit universal patterns across models of different sizes.
Early training behavior can be used to forecast long-term performance and help with efficient resource allocation.
Universality of training
For large N, performance is a straight power law in D.
For a smaller fixed N, performance stops improving as D increases.
Transfer improves with test performance
Models trained on one text distribution then evaluated on another maintain a strong correlation with their training validation performance.
That is to say, transfer to a different distribution incurs a constant penalty but otherwise improves roughly in line with performance on the training set.
Transfer improves with test performance
Generalization performance to other data distributions improves smoothly with model size. Generalization performance depends only on training distribution performance, and not on the phase of training (points: converged models, dashed line: single large model).
Sample efficiency
Larger models require fewer data points and optimization steps to achieve comparable performance levels to smaller models.
This highlights the benefits of large-scale models in terms of learning speed and resource utilization.
This suggests prioritizing scale in model development.
Sample efficiency
Convergence is inefficient
Achieving optimal performance within a fixed compute budget involves training very large models for fewer epochs, stopping significantly before full convergence.
This emphasizes the importance of sample efficiency and smart use of computational resources.
Convergence is inefficient
For optimally compute-efficient training, most of the budget increase should go toward larger models, a smaller share toward more data (avoiding reuse, via larger batch sizes), and only a very small share toward longer serial training time.
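A small sketch of this allocation rule, using the approximate exponents the paper reports for compute-efficient training (model size N ∝ C^0.73, batch size B ∝ C^0.24, serial steps S ∝ C^0.03); the exponents are taken from the paper's fits and should be treated as approximate.

# How a larger compute budget is split under the paper's compute-efficient scaling fits
def allocate(compute_multiplier, p_model=0.73, p_batch=0.24, p_steps=0.03):
    """Return how much to scale model size, batch size, and serial steps for a given compute increase."""
    return {
        "model_size_x": compute_multiplier ** p_model,
        "batch_size_x": compute_multiplier ** p_batch,
        "serial_steps_x": compute_multiplier ** p_steps,
    }

print(allocate(10))  # ~5.4x larger model, ~1.7x larger batch, ~1.07x more serial steps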
Optimal batch size
The optimal batch size is linked to loss and gradient noise scale, with empirical data suggesting a range for the largest models.
Gradient noise scale “quantifies the signal-to-noise ratio of the network gradients, i.e. the noise scale measures the variation in the data as seen by the model - when the noise scale is small, looking at a lot of data in parallel quickly becomes redundant, whereas when it is large, we can learn a lot from huge batches of data.” *
Adjusting batch size according to these parameters can optimize training efficiency and model performance.
* Source: https://openai.com/research/how-ai-training-scales
Optimal batch size
The critical batch size follows a power law in the loss as performance increases, and does not depend on the model size. The critical batch size approximately doubles for every 13% decrease in loss.
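A short sketch of this relation, using the fit B_crit(L) ≈ B_* / L^(1/α_B) with the approximate values B_* ≈ 2 × 10^8 tokens and α_B ≈ 0.21 reported in the paper; it also checks the "doubles for every 13% decrease in loss" statement numerically.

# Critical batch size fit from Kaplan et al. (2020): B_crit(L) = B_star / L**(1/alpha_B)
B_STAR, ALPHA_B = 2e8, 0.21   # approximate fitted values; B_star is in tokens

def critical_batch_size(loss):
    return B_STAR / loss ** (1 / ALPHA_B)

# A 13% lower loss should roughly double the critical batch size
L = 3.0
print(critical_batch_size(0.87 * L) / critical_batch_size(L))  # ~1.9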
Future Work & Implications
Loss can’t always decrease
-Obviously loss must level off at some point
-Natural language has non-zero entropy
-The size of available datasets and compute limits experiments
Loss can’t always decrease
-Obviously loss must level off at some point
-Contradiction in predicted loss: extrapolating the compute-efficient frontier L(C_min) eventually predicts a lower loss than the data scaling law L(D) allows for the amount of data that compute can process, so the power laws must break down before that point
-C_min is the minimum amount of compute needed to obtain a certain loss
Vision Transformers
-Conducted experiments with different sizes of Vision Transformers and dataset sizes, and evaluated on a transfer learning task
-Pretrained ViTs on ImageNet-21k and a set of weakly labelled images
-Then either fine-tuned on a new dataset or trained a few-shot linear classifier on frozen weights
Vision Transformers
-Conclusion: Larger models are more sample efficient, just as in language
Multi-modal models
Aghajanyan et al., Scaling Laws for Generative Mixed-Modal Language Models, 2023
Implications of Scaling Laws in LLMs
'An Actually-Good Argument Against Naive AI Scaling'
-The naive AI Scaling Argument: Just scaling up GPT models will eventually lead to super general intelligence
-Could a future GPT model play at 5000 ELO?
-Limits of data from the internet
-Train narrow AI and use that to train the future GPT
‘An Actually-Good Argument Against Naive AI Scaling’, Jacob Buckman
Questions/Discussion