Neural Scaling Laws
Ethan Caballero, Brady Neal
Overview
Preliminary Notation for Neural Scaling Laws
Parameters = N
Dataset_size = D
Compute_budget = C
Test_loss = L
Neural Scaling Laws are about the relationship between L and C, N, D.
The Primary Finding
Compute-Efficient Scaling Law
Data-Efficient Scaling Law
Parameter-Efficient Scaling Law
Larger models are more sample-efficient
Money is/was the bottleneck.
- Compute & (unlabelled) data are both limited resources, but compute costs a very large amount of money compared to the monetary cost of scraping unlabelled data.
- For GPT-2 & GPT-3 (very large language models), most of the monetary cost came from compute. As a result, for this setup, optimizing for compute-efficiency (minimizing the FLOPs needed to drive test loss low) becomes more important than optimizing for data-efficiency.
- GPT-3 cost at least $5M to train.
- This paper focuses on compute-efficient scaling laws, because the paper's original purpose was to inform scaling up to GPT-3.
Compute-efficient scaling relationship is:
C^1.00 X compute increase : C^0.73 X params increase : C^0.27 X data increase
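As a quick illustration of how to read this relationship, here is a minimal sketch (assuming only the exponents quoted above) that converts a compute multiplier into the implied compute-efficient multipliers for parameters and data; the function name and example numbers are illustrative only.

```python
# Sketch: split a compute increase into the compute-efficient parameter and
# data increases, using only the exponents quoted on this slide.

def compute_efficient_split(compute_multiplier: float):
    """Given a C-fold increase in training compute, return the implied
    compute-efficient increases in parameters and data."""
    params_multiplier = compute_multiplier ** 0.73
    data_multiplier = compute_multiplier ** 0.27
    return params_multiplier, data_multiplier

# Example: a 1000x larger compute budget
params_x, data_x = compute_efficient_split(1000)
print(f"1000x compute -> ~{params_x:.0f}x params, ~{data_x:.0f}x data")
# -> roughly 155x params, 6x data (and 155 * 6 ≈ 1000, since the exponents sum to 1)
```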
Contradictions and a Conjecture
GPT-3 = 3.64e+03 PF-days (marked with a red dot in the figure)
L(D(C)) by definition must be a lower bound on L(C_min)
[Figure legend: GPT-3; Sparse Transformer; Vanilla (aka Dense) Transformer]
The Scale of GPT-3
Training GPT-3 uses:
3.64e+23 FLOPs = 1.250e-1 X the FLOPs a human brain uses in a lifetime
2.00e+11 words = 2.000e+1 X the data (words) a human reads in a lifetime
1.75e+11 params = 4.375e-3 X the params a human brain uses for language tasks
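For context, the sketch below just rearranges the ratios on this slide to back out the implied human baselines; it introduces no new estimates, and the variable names are ad hoc.

```python
# Sketch: rearrange the ratios on this slide to back out the implied human
# baselines. No new estimates; just the slide's numbers inverted.

gpt3_flops  = 3.64e23   # training FLOPs
gpt3_words  = 2.00e11   # training data (words)
gpt3_params = 1.75e11   # parameters

human_lifetime_flops  = gpt3_flops  / 1.250e-1   # ≈ 2.9e24 FLOPs in a lifetime
human_lifetime_words  = gpt3_words  / 2.000e1    # ≈ 1e10 words read in a lifetime
human_language_params = gpt3_params / 4.375e-3   # ≈ 4e13 "params" used for language

print(f"{human_lifetime_flops:.1e} FLOPs, "
      f"{human_lifetime_words:.1e} words, "
      f"{human_language_params:.1e} params")
```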
What is few-shot learning?
Finetuning (how most papers evaluate downstream performance)
Zero Shot
Few Shot
Models and Data
Results
Did the Neural Scaling Law continue?
Results: SotA Language Modeling
Penn Treebank perplexity (lower is better); previous SOTA was GPT-2
Results and Extrapolations:
Results: High Level Observations
Results: Everything
Would need to scale params >5 OOM to get to 100% accuracy on all benchmarks
Brief Detour into how many resources might be allocated to the largest ML training run(s) this century
Money (aka compute) constraint going forward
$1e12 * (1e24 FLOP/$) = 1e36 FLOPs = max FLOPs for training a model this century. (Yes, this is highly speculative, but it's better than nothing.)
Total text data on the internet is only growing linearly
3 BPE tokens = 2 words = 10 characters = 12 bytes of (uncompressed) text
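A minimal sketch of these conversions, assuming the per-word ratios above (1.5 BPE tokens and ~6 bytes per word); the helper names and the GPT-3 example are illustrative only.

```python
# Sketch: rough text-unit conversions from this slide
# (3 BPE tokens ≈ 2 words ≈ 10 characters ≈ 12 bytes of uncompressed text).

TOKENS_PER_WORD = 3 / 2   # 1.5
BYTES_PER_WORD  = 12 / 2  # 6

def words_to_tokens(words: float) -> float:
    return words * TOKENS_PER_WORD

def words_to_bytes(words: float) -> float:
    return words * BYTES_PER_WORD

# Example: GPT-3's ~2e11 training words
print(f"{words_to_tokens(2e11):.1e} tokens, {words_to_bytes(2e11):.1e} bytes")
# -> ~3e11 tokens, ~1.2e12 bytes
```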
New Common Crawl Data each Month
Extrapolation
Contradictions and a Conjecture
GPT-3 = 3.64e+03 PF-days (marked with a red dot in the figure)
L(D(C)) by definition must be a lower bound on L(C_min)
Extrapolation TL;DR
Compute-efficient GPT-∞ with current algorithms/setup would use/need:
- 1e36 FLOPs (1e12 X GPT-3)
- >2e15 bytes of uncompressed text (>2e3 X GPT-3)
- <1e20 parameters (<6e9 X GPT-3)
Its test loss would be greater than 0.84120903154 nats/token, which seems to be in the range of what Claude Shannon estimated to be the entropy of English.
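A back-of-the-envelope sketch of where these TL;DR numbers come from, assuming the compute-efficient exponents quoted earlier (params ∝ C^0.73, data ∝ C^0.27) and the GPT-3 figures above; the original paper's fitted constants differ slightly, so treat the outputs as order-of-magnitude only.

```python
# Back-of-the-envelope sketch of the GPT-∞ extrapolation, assuming the
# compute-efficient exponents from earlier (params ∝ C^0.73, data ∝ C^0.27).
# Order-of-magnitude only; the paper's fitted constants differ slightly.

gpt3_flops  = 3.64e23
gpt3_params = 1.75e11
gpt3_bytes  = 1.2e12      # ~2e11 words at ~6 bytes/word (see conversion slide)

budget_flops = 1e36       # $1e12 * 1e24 FLOP/$ (the speculative century budget)

compute_x = budget_flops / gpt3_flops   # ~3e12x GPT-3's compute
params_x  = compute_x ** 0.73           # ~1e9x more parameters
data_x    = compute_x ** 0.27           # ~2e3x more data

print(f"compute: {compute_x:.0e}x GPT-3, "
      f"params: {gpt3_params * params_x:.0e}, "
      f"data: {gpt3_bytes * data_x:.0e} bytes")
```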
How many OOM increases would be needed for human level performance on each individual task?
Tractable
Maybe Tractable: Winogrande, SAT
Maybe Less Tractable: ANLI, SuperGLUE
Less Tractable
Results: Caveat “Phase Transition”
- Possibility of Phase Transitions makes extrapolating these trends highly uncertain
Novel (OOD?) Symbolic Manipulations
Fake News
Is GPT-3 memorizing?
GPT-3 is not overfitting
Contamination
Public Reception
Other Important Scaling Laws Findings (from original scaling laws paper)
Performance depends only weakly on model shape
“Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width.”
We can predict training curve
(batch size adjusted training step)
We can predict effect of more data
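Both predictions rest on the same basic tool: fit a power law on log-log axes to the early part of a curve and extrapolate. The sketch below shows the idea on made-up loss-vs-data points; none of the numbers are from the paper.

```python
import numpy as np

# Sketch: both predictions boil down to fitting a power law L(x) = (x_c / x)^alpha
# (a straight line on log-log axes) to early points and extrapolating.
# The "observed" points below are made up purely for illustration.

x = np.array([1e8, 1e9, 1e10, 1e11])   # e.g. dataset size in tokens (hypothetical)
L = np.array([4.2, 3.5, 2.9, 2.4])     # e.g. test loss in nats/token (hypothetical)

# log L = -alpha * log x + alpha * log x_c  ->  straight-line fit in log-log space
slope, intercept = np.polyfit(np.log(x), np.log(L), 1)
alpha = -slope
x_c = np.exp(intercept / alpha)

# Extrapolate one more order of magnitude
x_new = 1e12
L_pred = (x_c / x_new) ** alpha
print(f"alpha = {alpha:.3f}, predicted loss at {x_new:.0e}: {L_pred:.2f}")
```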
Transfer improves with test performance
Other Important Scaling Laws Findings (from follow-up papers)
Scaling Laws Exist on All Data Modalities
Optimal model size is consistent across modalities. Big Mystery.
What about other neural architectures?
ConvNets
LSTMs (from this paper)
Effective data transferred
Larger models transfer more data
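A sketch of what "effective data transferred" means here: the data a from-scratch model would have needed to reach the fine-tuned model's loss. The from-scratch scaling law and all constants below are hypothetical, purely to show the bookkeeping.

```python
# Sketch of the bookkeeping behind "effective data transferred": the data a
# from-scratch model would have needed to match the fine-tuned model's loss.
# The from-scratch data scaling law and all constants are hypothetical.

def loss_from_scratch(D: float, D_c: float = 5e13, alpha: float = 0.095) -> float:
    """Hypothetical from-scratch data scaling law L(D) = (D_c / D)^alpha."""
    return (D_c / D) ** alpha

def data_for_loss(loss: float, D_c: float = 5e13, alpha: float = 0.095) -> float:
    """Invert L(D): how much data the from-scratch curve needs to hit this loss."""
    return D_c / loss ** (1 / alpha)

D_finetune     = 1e8    # tokens actually used for fine-tuning (hypothetical)
loss_finetuned = 2.5    # loss the pretrained + fine-tuned model reaches (hypothetical)

D_effective   = data_for_loss(loss_finetuned)   # total "effective" data
D_transferred = D_effective - D_finetune        # data effectively transferred from pretraining
assert abs(loss_from_scratch(D_effective) - loss_finetuned) < 1e-6  # consistency check
print(f"effective data: {D_effective:.2e} tokens, transferred: {D_transferred:.2e} tokens")
```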
Contradiction resolution
Takeaways
2. If scaling is the answer, how do we scale better?
Better Scaling Laws via Multimodal Data?
Text-to-image might have better scaling law than image-to-text
Image-to-text scales better with bag of words?!
Questions
VideoGPT is frontier
biological_evolution (starting at cambrian explosion) >1e36 frames
CCP >1e18 frames
NSA >1e17 frames
youtube >1e14 frames
twitch >1e14 frames
dota2 openai_five >1e13 frames
tesla >1e13 frames
facebook >1e13 frames
agent57 >1e12 frames
single_human_life >1e11 frames
comma.ai >1e11 frames
youtube-8m >1e10 frames
netflix >1e9 frames
ig-3.5B >1e9 frames
kinetics-600M >1e8 frames
dall·e/clip/jft-300m >1e8 frames
lsun >1e7 frames
imagenet >1e6 frames
Advantageouser Scaling Laws: OOD_G Inspiration
“We have yet to understand what kind of assumptions we are able to make in practical problems of interest that mathematically describe the amount of data required by empirical risk minimization in nonlinear realizable problems, and what the (perhaps from exponential to polynomial) reduction of sample complexity is when we exploit invariance. We leave all these fascinating research problems for future work.” - Martin Arjovsky’s PhD thesis
Advantageouser Scaling Laws: Compute-Performance Tradeoff of Architecture
Better Big O Complexity is not the same as Better Neural Scaling Laws
Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers (https://arxiv.org/abs/2002.11794)