1 of 61

Neural Scaling Laws

Ethan Caballero Brady Neal

2 of 61

Overview

  1. Neural Scaling Laws
  2. GPT-3
  3. Future/Other

3 of 61

Preliminary Notation for Neural Scaling Laws

Parameters = N

Dataset_size = D

Compute_budget = C

Test_loss = L

Neural Scaling Laws are about the relationship between L and C, N, D.
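For reference, a minimal sketch of the power-law forms these relationships take in the original scaling laws paper; the constants and exponents below are the approximate values reported there and are included for illustration only:

```python
# Sketch of the power-law forms from the original scaling laws paper
# (Kaplan et al., 2020). Constants are approximate reported values,
# included here only for illustration.

def loss_vs_params(N, N_c=8.8e13, alpha_N=0.076):
    """Test loss when non-embedding params N are the bottleneck."""
    return (N_c / N) ** alpha_N

def loss_vs_data(D, D_c=5.4e13, alpha_D=0.095):
    """Test loss when dataset size D (tokens) is the bottleneck."""
    return (D_c / D) ** alpha_D

def loss_vs_compute(C, C_c=3.1e8, alpha_C=0.050):
    """Test loss when compute C (PF-days), spent compute-efficiently, is the bottleneck."""
    return (C_c / C) ** alpha_C
```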

4 of 61

The Primary Finding

Compute-Efficient Scaling Law

Data-Efficient Scaling Law

Parameter-Efficient Scaling Law

5 of 61

Larger models are more sample-efficient

6 of 61

Money is/was the bottleneck.

- Compute & (unlabelled) data are both limited resources, but compute costs a very large amount of money compared to the monetary cost of scraping unlabelled data.

- For GPT-2 & GPT-3 (very large language models), most of the monetary cost came from compute. As a result, for this setup, optimizing for compute-efficiency (minimizing the FLOPs needed to get the test loss low) becomes more important than optimizing for data-efficiency.

- GPT-3 cost at least $5M to train.

- This paper focuses on Compute-Efficient Scaling Laws, because the paper's original purpose was to inform scaling to GPT-3.

7 of 61

Compute-efficient scaling relationship is:

C^1 X compute increase : C^.73 X params increase : C^.27 X data increase
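A minimal sketch of what this allocation rule means in practice, using only the 0.73 / 0.27 exponents from this slide (the baseline values are placeholders):

```python
# Sketch: how a compute-efficient run splits extra compute between
# parameters and data, using the exponents from this slide (0.73 / 0.27).

def compute_efficient_scale_up(compute_multiplier, params0, tokens0):
    """Given a baseline (params0, tokens0), return the compute-efficient
    parameter count and dataset size after multiplying compute by
    `compute_multiplier`."""
    params = params0 * compute_multiplier ** 0.73
    tokens = tokens0 * compute_multiplier ** 0.27
    return params, tokens

# Example: 1000x more compute -> ~155x more params, ~6.5x more data.
print(compute_efficient_scale_up(1e3, params0=1.0, tokens0=1.0))
```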

8 of 61

Contradictions and a Conjecture

GPT-3 = 3.64e+03 PF-days (🔴 in the figure)


L(D(Cmin)) by definition must be a lower bound of L(Cmin)

9 of 61

GPT-3

10 of 61

GPT-3

  • 175 Billion parameter Autoregressive Language Model with a context length of 2048, trained via cross-entropy minimization to predict the next token for approximately one epoch on 200 billion words of very diverse, mostly English text.
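A minimal sketch of the training objective described above (autoregressive next-token prediction with cross-entropy); the toy sizes and random inputs below are placeholders, not GPT-3's actual configuration:

```python
import numpy as np

def next_token_cross_entropy(logits, tokens):
    """logits: (seq_len, vocab_size) predictions for positions 0..seq_len-1,
    tokens: (seq_len + 1,) token ids. Returns mean loss in nats/token."""
    # log-softmax over the vocabulary
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    targets = tokens[1:]                       # position t predicts token t+1
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll.mean()

# Toy sizes; GPT-3's actual context length is 2048 (per this slide) and its
# vocabulary is far larger.
vocab_size, seq_len = 100, 8
rng = np.random.default_rng(0)
tokens = rng.integers(0, vocab_size, size=seq_len + 1)
logits = rng.standard_normal((seq_len, vocab_size))   # stand-in for transformer outputs
print(next_token_cross_entropy(logits, tokens))
```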

11 of 61

Vanilla (aka Dense) Transformer

  • Each attention head is able to attend to everything.

Sparse Transformer

  • Each color (in the figure) is what an individual attention head is able to attend to.

  • GPT-3 uses the same model and architecture as GPT-2, with the exception that it alternates dense and sparse attention patterns in the layers of the transformer.
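A minimal sketch, not GPT-3's exact implementation, contrasting a dense causal attention mask with a strided sparse mask in the spirit of the Sparse Transformer; the `stride` parameter and mask layout here are illustrative assumptions:

```python
import numpy as np

def dense_causal_mask(seq_len):
    """Dense causal attention: each position may attend to every earlier position."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def strided_sparse_mask(seq_len, stride=4):
    """Strided sparse attention (illustrative): each position may attend to the
    last `stride` positions plus every `stride`-th "summary" position."""
    causal = dense_causal_mask(seq_len)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    local = (i - j) < stride                      # recent positions
    strided = (j % stride) == (stride - 1)        # periodic summary columns
    return causal & (local | strided)

print(dense_causal_mask(8).astype(int))
print(strided_sparse_mask(8, stride=4).astype(int))
```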

12 of 61

The Scale of GPT-3

Training GPT-3 uses:

3.64e+23 FLOPs = 1.250e-1 X FLOPs human brain uses in lifetime

2.00e+11 words = 2.000e+1 X Data (words) human reads in lifetime

1.75e+11 params = 4.375e-3 X Params that human brain uses for language tasks.
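For context, the human-side baselines implied by these ratios, back-solved from the slide's own numbers (they are not independent estimates):

```python
# Back-solving the human baselines implied by the ratios on this slide.
# These follow from the slide's own numbers; they are not independent estimates.

gpt3_flops  = 3.64e23
gpt3_words  = 2.00e11
gpt3_params = 1.75e11

brain_lifetime_flops   = gpt3_flops  / 1.250e-1   # ~2.9e24 FLOPs
words_read_in_lifetime = gpt3_words  / 2.000e1    # ~1e10 words
brain_language_params  = gpt3_params / 4.375e-3   # ~4e13 params

print(f"{brain_lifetime_flops:.2e}, {words_read_in_lifetime:.2e}, {brain_language_params:.2e}")
```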

13 of 61

What is few-shot learning?

Finetuning (how most papers evaluate downstream)

Zero Shot

Few Shot

14 of 61

Models and Data

15 of 61

Results

16 of 61

Did the Neural Scaling Law continue?

17 of 61

Results: SotA Language Modeling

Penn Treebank Perplexity (lower is better); previous SOTA is GPT-2

18 of 61

Results and Extrapolations:

  • Language Modeling, Cloze, and Completion Tasks
  • Closed Book Question Answering
  • Translation
  • Winograd-Style Tasks
  • Reading Comprehension
  • SuperGLUE
  • NLI
  • Synthetic and Qualitative Tasks

19 of 61

Results: High Level Observations

  • Few-shot performance keeps increasing by a significant amount on all tasks as params increase; zero-shot performance keeps increasing for most tasks (ANLI is the most notable exception)
  • For ~half of the evaluation tasks, scaling the exact same setup further is a ~tractable strategy for getting to human-level performance
  • For some evaluation tasks, the model seems to undergo “phase transitions” in which it has horrible performance for lower numbers of params and then ~skyrockets to high performance as the number of params crosses a threshold

20 of 61

Results: Everything

Would need to scale params >5 OOM to get to 100% accuracy on all benchmarks

21 of 61

Brief detour into how many resources might be allocated to the largest ML training run(s) this century

22 of 61

Money (aka compute) constraint going forward

$1e12 * (1e24 FLOP/$) = 1e36 FLOPs = max FLOPs for training a model this century. (Yes, this is highly speculative, but it's better than nothing.)

23 of 61

Total Text Data on internet is only growing linearly

  • The chart shows new data (non-overlapping with previous data) scraped each month by Common Crawl

  • Humans aren’t getting faster at typing

  • A bound on the amount of webtext-quality text that can be obtained this century is between 1e15 and 1e16 bytes of uncompressed text.

3 BPE tokens = 2 words = 10 characters = 12 bytes of (uncompressed) text

New Common Crawl Data each Month

24 of 61

Extrapolation

  • 1e36 FLOPs (1e12 X GPT-3) is ~the max compute to be available to train an ML model this century.
  • To maintain compute-efficient (aka money-efficient) training, data would need to increase to 1.2e12 * ((1e12)^0.27) = 2.085e+15 bytes of uncompressed text (which is ~in the range of the max obtainable text); see the arithmetic sketch after this list.
  • However, at or before 8.64e+23 FLOPs, L(Cmin) will start curving upward because it is lower bounded by L(D(Cmin)). As a result, the exponent in the amount of data needed for compute-efficient training keeps increasing for values >4e12 bytes of uncompressed text, meaning data requirements are (much?) >2.085e+15 bytes of uncompressed text.
  • This suggests data will (probably) eventually be the main bottleneck.
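The arithmetic behind the second bullet above, spelled out using the 0.27 exponent from the compute-efficient scaling relationship and the 2 words ≈ 12 bytes conversion from the previous slide:

```python
# Arithmetic behind the compute-efficient data extrapolation above.
# GPT-3's dataset is ~2e11 words; at 2 words ~= 12 bytes that is ~1.2e12 bytes
# of uncompressed text.

gpt3_data_bytes    = 2.00e11 * (12 / 2)   # ~1.2e12 bytes
compute_multiplier = 1e12                 # the slide's "1e12 X GPT-3" figure

data_needed_bytes = gpt3_data_bytes * compute_multiplier ** 0.27
print(f"{data_needed_bytes:.3e} bytes")   # ~2.085e+15 bytes of uncompressed text
```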

25 of 61

Contradictions and a Conjecture

GPT-3 = 3.64e+03 PF-days (🔴 in the figure)


L(D(Cmin)) by definition must be a lower bound of L(Cmin)

26 of 61

Extrapolation TL;DR

Compute-efficient GPT-∞ with current algorithms/setup would use/need:

  • 1e36 FLOPs (1e12 X GPT-3)
  • >2e15 bytes of uncompressed text (>2e3 X GPT-3)
  • <1e20 parameters (<6e9 X GPT-3)

Its test loss would be greater than 0.84120903154 nats/token, which seems to be in the range of what Claude Shannon estimated to be the entropy of English.
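A quick check of where the <1e20 parameter figure comes from, applying the C^0.73 rule from the compute-efficient scaling relationship (a sanity check, not an exact derivation):

```python
# Sanity check of the parameter count quoted above, using the C^0.73 rule.

gpt3_params        = 1.75e11
compute_multiplier = 1e12          # "1e12 X GPT-3" from this slide

params_needed = gpt3_params * compute_multiplier ** 0.73
print(f"{params_needed:.2e}")      # ~1e20 parameters
```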

27 of 61

How many OOM increases would be needed for human level performance on each individual task?

28 of 61

Tractable

  • LAMBADA: Predict last word of paragraph
  • TriviaQA: Closed Book Question Answering
  • CoQA: Conversational Question Answering Challenge about paragraph

29 of 61

Maybe Tractable: Winogrande, SAT

  • Winogrande: Adversarially-Mined Winograd Schema Challenge
  • SAT Analogies: Analogy Section of SAT

30 of 61

Maybe Less Tractable: ANLI, SuperGLUE

  • ANLI: Adversarially-Mined Natural Language Inference
  • SuperGLUE: 8 tasks for which humans do significantly better than BERT

31 of 61

Less Tractable

32 of 61

Results: Caveat “Phase Transition”

- Possibility of Phase Transitions makes extrapolating these trends highly uncertain

33 of 61

Novel (OOD?) Symbolic Manipulations

34 of 61

Fake News

35 of 61

Is GPT-3 memorizing?

36 of 61

GPT-3 is not overfitting

37 of 61

Contamination

38 of 61

Public Reception

39 of 61

Other Important Scaling Laws Findings (from original scaling laws paper)

40 of 61

Performance depends only weakly on model shape

Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width.

41 of 61

We can predict training curve

(batch size adjusted training step)

42 of 61

We can predict effect of more data

43 of 61

Transfer improves with test performance

44 of 61

Other Important Scaling Laws Findings (from follow-up papers)

45 of 61

Scaling Laws Exist on All Data Modalities

46 of 61

Optimal model size is consistent across modalities. Big Mystery.

  • The exponent for optimal model size in compute-efficient scaling laws seems to be ~the same for every modality 🤯!

  • When following compute-efficient scaling law, C^1 X compute increase corresponds to C^.73 X params increase for any modality.

47 of 61

What about other neural architectures?

ConvNets

LSTMs (from this paper)

48 of 61

Effective data transferred

49 of 61

Larger models transfer more data

50 of 61

Contradiction resolution

51 of 61

Takeaways

52 of 61

  1. Evaluate how methods scale
  • Currently, most papers at prestigious conferences compare methods at 1 or a few different parameter counts.
  • We should instead focus on estimating the scaling laws as a proposed method is provided more compute, parameters, and data (see the fitting sketch after this list).
  • The endgame of Machine Learning could involve orders of magnitude (e.g. 1e12? X) more compute than is available (to any entity) today.
  • Methods with better scaling law exponents (steeper slope) are more likely to stand the test of time.
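A minimal sketch of what estimating a scaling law for a method could look like: a simple log-log fit of test loss against compute. The data points below are made up for illustration:

```python
import numpy as np

# Estimate a scaling-law exponent for a method by fitting log(loss) as a
# linear function of log(compute). Data points below are hypothetical.

compute = np.array([1e17, 1e18, 1e19, 1e20, 1e21])   # FLOPs (hypothetical)
loss    = np.array([4.10, 3.65, 3.25, 2.90, 2.58])   # test loss (hypothetical)

slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
print(f"scaling exponent ~ {-slope:.3f}")  # L(C) ~ C^slope; steeper slope is better
```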

53 of 61

2. If scaling is the answer, how do we scale better?

54 of 61

Better Scaling Laws via Multimodal Data?

55 of 61

Text-to-image might have better scaling law than image-to-text

56 of 61

Image-to-text scales better with bag of words?!

57 of 61

Questions

58 of 61

VideoGPT is frontier

biological_evolution (starting at cambrian explosion) >1e36 frames

CCP >1e18 frames

NSA >1e17 frames

youtube >1e14 frames

twitch >1e14 frames

dota2 openai_five >1e13 frames

tesla >1e13 frames

facebook >1e13 frames

agent57 >1e12 frames

single_human_life >1e11 frames

comma.ai >1e11 frames

youtube-8m >1e10 frames

netflix >1e9 frames

ig-3.5B >1e9 frames

kinetics-600M >1e8 frames

dall·e/clip/jft-300m >1e8 frames

lsun >1e7 frames

imagenet >1e6 frames

59 of 61

More Advantageous Scaling Laws: OOD_G Inspiration

“We have yet to understand what kind of assumptions we are able to make in practical problems of interest that mathematically describe the amount of data required by empirical risk minimization in nonlinear realizable problems, and what the (perhaps from exponential to polynomial) reduction of sample complexity is when we exploit invariance. We leave all these fascinating research problems for future work.” - Martin Arjovsky’s PhD thesis

60 of 61

More Advantageous Scaling Laws: Compute-Performance Tradeoff of Architecture

  • Of all the papers trying to create more efficient (in big-O terms) transformers, some may get to lower test loss using fewer FLOPs (and less memory), but no one seems to have tested this yet because the community is not explicitly focusing on scaling laws 🤦🏽‍♀️.

Better Big O Complexity is not the same as Better Neural Scaling Laws

61 of 61

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers (https://arxiv.org/abs/2002.11794)