1 of 61

Neural Scaling Laws

Ethan Caballero Brady Neal

2 of 61

Overview

  1. Neural Scaling Laws
  2. GPT-3
  3. Future/Other

3 of 61

Preliminary Notation for Neural Scaling Laws

Parameters = N

Dataset_size = D

Compute_budget = C

Test_loss = L

Neural Scaling Laws are about the relationship between L and C, N, D.
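For reference, a minimal sketch of the power-law forms these relationships take in the original scaling laws paper; the constants and exponents below are the approximate values reported there and are included for illustration only:

```python
# Sketch of the power-law forms from the original scaling laws paper
# (Kaplan et al., 2020). Constants are approximate reported values,
# included here only for illustration.

def loss_vs_params(N, N_c=8.8e13, alpha_N=0.076):
    """Test loss when non-embedding params N are the bottleneck."""
    return (N_c / N) ** alpha_N

def loss_vs_data(D, D_c=5.4e13, alpha_D=0.095):
    """Test loss when dataset size D (tokens) is the bottleneck."""
    return (D_c / D) ** alpha_D

def loss_vs_compute(C, C_c=3.1e8, alpha_C=0.050):
    """Test loss when compute C (PF-days), spent compute-efficiently, is the bottleneck."""
    return (C_c / C) ** alpha_C
```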

4 of 61

The Primary Finding

Compute-Efficient Scaling Law

Data-Efficient Scaling Law

Parameter-Efficient Scaling Law

5 of 61

Larger models are more sample-efficient

6 of 61

Money is/was the bottleneck.

- Compute & (unlabelled) data are both limited resources, but compute costs a very large amount of money compared to the monetary cost of scraping unlabelled data.

- For GPT-2 & GPT-3 (very large language models), most of the monetary cost came from compute. As a result, for this setup, optimizing for compute-efficiency (minimizing the FLOPs needed to get the test loss low) becomes more important than optimizing for data-efficiency.

- GPT-3 cost at least $5M to train.

- This paper focuses on Compute-Efficient Scaling Laws, because the paper's original purpose was to inform scaling to GPT-3.

7 of 61

Compute-efficient scaling relationship is:

C^1 X compute increase : C^.73 X params increase : C^.27 X data increase
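A minimal sketch of what this allocation rule means in practice, using only the 0.73 / 0.27 exponents from this slide (the baseline values are placeholders):

```python
# Sketch: how a compute-efficient run splits extra compute between
# parameters and data, using the exponents from this slide (0.73 / 0.27).

def compute_efficient_scale_up(compute_multiplier, params0, tokens0):
    """Given a baseline (params0, tokens0), return the compute-efficient
    parameter count and dataset size after multiplying compute by
    `compute_multiplier`."""
    params = params0 * compute_multiplier ** 0.73
    tokens = tokens0 * compute_multiplier ** 0.27
    return params, tokens

# Example: 1000x more compute -> ~155x more params, ~6.5x more data.
print(compute_efficient_scale_up(1e3, params0=1.0, tokens0=1.0))
```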

8 of 61

Contradictions and a Conjecture

GPT-3 = 3.64e+03 PF-days (🔴 in the figure)


L(D(Cmin)) by definition must be a lower bound of L(Cmin)

9 of 61

GPT-3

10 of 61

GPT-3

  • 175 Billion parameter Autoregressive Language Model with a context length of 2048, trained via cross-entropy minimization to predict the next token for approximately one epoch on 200 billion words of very diverse, mostly English text.
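A minimal sketch of the training objective described above (autoregressive next-token prediction with cross-entropy); the toy sizes and random inputs below are placeholders, not GPT-3's actual configuration:

```python
import numpy as np

def next_token_cross_entropy(logits, tokens):
    """logits: (seq_len, vocab_size) predictions for positions 0..seq_len-1,
    tokens: (seq_len + 1,) token ids. Returns mean loss in nats/token."""
    # log-softmax over the vocabulary
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    targets = tokens[1:]                       # position t predicts token t+1
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll.mean()

# Toy sizes; GPT-3's actual context length is 2048 (per this slide) and its
# vocabulary is far larger.
vocab_size, seq_len = 100, 8
rng = np.random.default_rng(0)
tokens = rng.integers(0, vocab_size, size=seq_len + 1)
logits = rng.standard_normal((seq_len, vocab_size))   # stand-in for transformer outputs
print(next_token_cross_entropy(logits, tokens))
```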

11 of 61

Vanilla (aka Dense) Transformer

  • Each attention head is able to attend to everything.

Sparse Transformer

  • Each color (in the figure) is what an individual attention head is able to attend to.

  • GPT-3 uses the same model and architecture as GPT-2, with the exception that it alternates dense and sparse attention patterns in the layers of the transformer.
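A minimal sketch, not GPT-3's exact implementation, contrasting a dense causal attention mask with a strided sparse mask in the spirit of the Sparse Transformer; the `stride` parameter and mask layout here are illustrative assumptions:

```python
import numpy as np

def dense_causal_mask(seq_len):
    """Dense causal attention: each position may attend to every earlier position."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def strided_sparse_mask(seq_len, stride=4):
    """Strided sparse attention (illustrative): each position may attend to the
    last `stride` positions plus every `stride`-th "summary" position."""
    causal = dense_causal_mask(seq_len)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    local = (i - j) < stride                      # recent positions
    strided = (j % stride) == (stride - 1)        # periodic summary columns
    return causal & (local | strided)

print(dense_causal_mask(8).astype(int))
print(strided_sparse_mask(8, stride=4).astype(int))
```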

12 of 61

The Scale of GPT-3

Training GPT-3 uses:

3.64e+23 FLOPs = 1.250e-1 X FLOPs human brain uses in lifetime

2.00e+11 words = 2.000e+1 X Data (words) human reads in lifetime

1.75e+11 params = 4.375e-3 X Params that human brain uses for language tasks.
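For context, the human-side baselines implied by these ratios, back-solved from the slide's own numbers (they are not independent estimates):

```python
# Back-solving the human baselines implied by the ratios on this slide.
# These follow from the slide's own numbers; they are not independent estimates.

gpt3_flops  = 3.64e23
gpt3_words  = 2.00e11
gpt3_params = 1.75e11

brain_lifetime_flops   = gpt3_flops  / 1.250e-1   # ~2.9e24 FLOPs
words_read_in_lifetime = gpt3_words  / 2.000e1    # ~1e10 words
brain_language_params  = gpt3_params / 4.375e-3   # ~4e13 params

print(f"{brain_lifetime_flops:.2e}, {words_read_in_lifetime:.2e}, {brain_language_params:.2e}")
```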

13 of 61

What is few-shot learning?

Finetuning (how most papers evaluate downstream)

Zero Shot

Few Shot

14 of 61

Models and Data

15 of 61

Results

16 of 61

Did the Neural Scaling Law continue?

17 of 61

Results: SotA Language Modeling

Penn Treebank Perplexity (lower is better); previous SOTA is GPT-2

18 of 61

Results and Extrapolations:

  • Language Modeling, Cloze, and Completion Tasks
  • Closed Book Question Answering
  • Translation
  • Winograd-Style Tasks
  • Reading Comprehension
  • SuperGLUE
  • NLI
  • Synthetic and Qualitative Tasks

19 of 61

Results: High Level Observations

  • Few-shot performance keeps increasing by a significant amount on all tasks as params increase; zero-shot performance keeps increasing for most tasks (ANLI is the most notable exception)
  • For ~half of the evaluation tasks, scaling the exact same setup further is a ~tractable strategy for getting to human-level performance
  • For some evaluation tasks, the model seems to undergo “phase transitions” in which it has horrible performance for lower numbers of params and then ~skyrockets to high performance as the number of params crosses a threshold

20 of 61

Results: Everything

Would need to scale params >5 OOM to get to 100% accuracy on all benchmarks

21 of 61

Brief detour into how many resources might be allocated to the largest ML training run(s) this century

22 of 61

Money (aka compute) constraint going forward

$1e12 * (1e24 FLOP/$) = 1e36 FLOPs = max FLOPs for training a model this century. (Yes, this is highly speculative, but it's better than nothing.)

23 of 61

Total Text Data on internet is only growing linearly

  • The chart shows new data (non-overlapping with previous data) scraped each month by Common Crawl

  • Humans aren’t getting faster at typing

  • A bound on the amount of webtext-quality text that can be obtained this century is between 1e15 and 1e16 bytes of uncompressed text.

3 BPE tokens = 2 words = 10 characters = 12 bytes of (uncompressed) text

New Common Crawl Data each Month

24 of 61

Extrapolation

  • 1e36 FLOPs (1e12 X GPT-3) is ~the max compute to be available to train an ML model this century.
  • To maintain compute-efficient (aka money-efficient) training, data would need to increase to 1.2e12 * ((1e12)^0.27) = 2.085e+15 bytes of uncompressed text (which is ~in the range of the max obtainable text); see the arithmetic sketch after this list.
  • However, at or before 8.64e+23 FLOPs, L(Cmin) will start curving upward because it is lower bounded by L(D(Cmin)). As a result, the exponent in the amount of data needed for compute-efficient training keeps increasing for values >4e12 bytes of uncompressed text, meaning data requirements are (much?) >2.085e+15 bytes of uncompressed text.
  • This suggests data will (probably) eventually be the main bottleneck.
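The arithmetic behind the second bullet above, spelled out using the 0.27 exponent from the compute-efficient scaling relationship and the 2 words ≈ 12 bytes conversion from the previous slide:

```python
# Arithmetic behind the compute-efficient data extrapolation above.
# GPT-3's dataset is ~2e11 words; at 2 words ~= 12 bytes that is ~1.2e12 bytes
# of uncompressed text.

gpt3_data_bytes    = 2.00e11 * (12 / 2)   # ~1.2e12 bytes
compute_multiplier = 1e12                 # the slide's "1e12 X GPT-3" figure

data_needed_bytes = gpt3_data_bytes * compute_multiplier ** 0.27
print(f"{data_needed_bytes:.3e} bytes")   # ~2.085e+15 bytes of uncompressed text
```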

25 of 61

Contradictions and a Conjecture

GPT-3 = 3.64e+03 PF-days (🔴 in the figure)


L(D(Cmin)) by definition must be a lower bound of L(Cmin)

26 of 61

Extrapolation TL;DR

Compute-efficient GPT-∞ with current algorithms/setup would use/need:

  • 1e36 FLOPs (1e12 X GPT-3)
  • >2e15 bytes of uncompressed text (>2e3 X GPT-3)
  • <1e20 parameters (<6e9 X GPT-3)

Its test loss would be greater than 0.84120903154 nats/token, which seems to be in the range of what Claude Shannon estimated to be the entropy of English.
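A quick check of where the <1e20 parameter figure comes from, applying the C^0.73 rule from the compute-efficient scaling relationship (a sanity check, not an exact derivation):

```python
# Sanity check of the parameter count quoted above, using the C^0.73 rule.

gpt3_params        = 1.75e11
compute_multiplier = 1e12          # "1e12 X GPT-3" from this slide

params_needed = gpt3_params * compute_multiplier ** 0.73
print(f"{params_needed:.2e}")      # ~1e20 parameters
```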

27 of 61

How many OOM increases would be needed for human level performance on each individual task?

28 of 61

Tractable

  • LAMBADA: Predict last word of paragraph
  • TriviaQA: Closed Book Question Answering
  • CoQA: Conversational Question Answering Challenge about paragraph

29 of 61

Maybe Tractable: Winogrande, SAT

  • Winogrande: Adversarially-Mined Winograd Schema Challenge
  • SAT Analogies: Analogy Section of SAT

30 of 61

Maybe Less Tractable: ANLI, SuperGLUE

  • ANLI: Adversarially-Mined Natural Language Inference
  • SuperGLUE: 8 tasks for which humans do significantly better than BERT

31 of 61

Less Tractable

32 of 61

Results: Caveat “Phase Transition”

- Possibility of Phase Transitions makes extrapolating these trends highly uncertain

33 of 61

Novel (OOD?) Symbolic Manipulations

34 of 61

Fake News

35 of 61

Is GPT-3 memorizing?

36 of 61

GPT-3 is not overfitting

37 of 61

Contamination

38 of 61

Public Reception

39 of 61

Other Important Scaling Laws Findings (from original scaling laws paper)

40 of 61

Performance depends only weakly on model shape

Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width.

41 of 61

We can predict training curve

(batch size adjusted training step)

42 of 61

We can predict effect of more data

43 of 61

Transfer improves with test performance

44 of 61

Other Important Scaling Laws Findings (from follow-up papers)

45 of 61

Scaling Laws Exist on All Data Modalities

46 of 61

Optimal model size is consistent across modalities. Big Mystery.

  • The exponent for optimal model size in compute-efficient scaling laws seems to be ~the same for every modality 🤯!

  • When following compute-efficient scaling law, C^1 X compute increase corresponds to C^.73 X params increase for any modality.

47 of 61

What about other neural architectures?

ConvNets

LSTMs (from this paper)

48 of 61

Effective data transferred

49 of 61

Larger models transfer more data

50 of 61

Contradiction resolution

51 of 61

Takeaways

52 of 61

  1. Evaluate how methods scale
  • Currently, most papers at prestigious conferences compare methods at 1 or a few different parameter counts.
  • We should instead focus on estimating the scaling laws as a proposed method is provided more compute, parameters, and data (see the fitting sketch after this list).
  • The endgame of Machine Learning could involve orders of magnitude (e.g. 1e12? X) more compute than is available (to any entity) today.
  • Methods with better scaling law exponents (steeper slope) are more likely to stand the test of time.
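A minimal sketch of what estimating a scaling law for a method could look like: a simple log-log fit of test loss against compute. The data points below are made up for illustration:

```python
import numpy as np

# Estimate a scaling-law exponent for a method by fitting log(loss) as a
# linear function of log(compute). Data points below are hypothetical.

compute = np.array([1e17, 1e18, 1e19, 1e20, 1e21])   # FLOPs (hypothetical)
loss    = np.array([4.10, 3.65, 3.25, 2.90, 2.58])   # test loss (hypothetical)

slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
print(f"scaling exponent ~ {-slope:.3f}")  # L(C) ~ C^slope; steeper slope is better
```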

53 of 61

2. If scaling is the answer, how do we scale better?

54 of 61

Better Scaling Laws via Multimodal Data?

55 of 61

Text-to-image might have better scaling law than image-to-text

56 of 61

Image-to-text scales better with bag of words?!

57 of 61

Questions

58 of 61

VideoGPT is frontier

biological_evolution (starting at cambrian explosion) >1e36 frames

CCP >1e18 frames

NSA >1e17 frames

youtube >1e14 frames

twitch >1e14 frames

dota2 openai_five >1e13 frames

tesla >1e13 frames

facebook >1e13 frames

agent57 >1e12 frames

single_human_life >1e11 frames

comma.ai >1e11 frames

youtube-8m >1e10 frames

netflix >1e9 frames

ig-3.5B >1e9 frames

kinetics-600M >1e8 frames

dall·e/clip/jft-300m >1e8 frames

lsun >1e7 frames

imagenet >1e6 frames

59 of 61

More Advantageous Scaling Laws: OOD_G Inspiration

“We have yet to understand what kind of assumptions we are able to make in practical problems of interest that mathematically describe the amount of data required by empirical risk minimization in nonlinear realizable problems, and what the (perhaps from exponential to polynomial) reduction of sample complexity is when we exploit invariance. We leave all these fascinating research problems for future work.” - Martin Arjovsky’s PhD thesis

60 of 61

More Advantageous Scaling Laws: Compute-Performance Tradeoff of Architecture

  • Of all the papers trying to create more efficient (in big-O terms) transformers, some may get to lower test loss using fewer FLOPs (and less memory), but no one seems to have tested this yet because the community is not explicitly focusing on scaling laws 🤦🏽‍♀️.

Better Big O Complexity is not the same as Better Neural Scaling Laws

61 of 61

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers (https://arxiv.org/abs/2002.11794)