1 of 80

Vision Summer School

On Scaling & Compression

2 of 80

Part I

Scaling

3 of 80

AI is advancing so fast that it is hard to keep up

People spend a lot of time and energy catching up with the latest developments

But not enough attention goes to the old things

It is more important to study the change itself

4 of 80

What does it mean to study the change itself?

1. Identify the dominant driving forces behind the change

2. Understand the dominant driving forces

3. Predict the future trajectory

5 of 80

Predicting the future trajectory is difficult because there are many driving forces and their interactions are complex

[Chart: prediction difficulty vs. number of dominant driving forces, with “Dropping a pen” at the easy end and “AI Research?” at the hard end]

6 of 80

[Figure from Rich Sutton’s WAIC keynote: brain-scale compute power; roughly, 10x more compute every 5 years]

7 of 80

The job of AI researchers is to teach machines how to “think”

One (unfortunately common) approach

Teach the machines how we think we think

This approach imposes structure on the problem, which can become the limiting factor when scaled up

8 of 80

The job of AI researchers is to teach machines how to “think”

The bitter lesson: progress in AI over the past 70 years boils down to

  • Develop progressively more general methods with weaker modeling assumptions
  • Add more data and computation (i.e. scale up)

9 of 80

“We think the most benefits will go to whoever has the biggest computer.”

Greg Brockman, OpenAI’s CTO

10 of 80

The more structure, the less scalable the method is

[Plot: performance vs. compute for a method with less structure and a method with more structure]

11 of 80

However, we can’t just go for the most general method

[Plot: performance vs. compute, now with a third method that has even less structure]

12 of 80

However, we can’t just go for the most general method

[Same plot: performance vs. compute for the three methods]

The difference between the two methods is the additional “inductive biases”, or structure, imposed

If we are here (at the current level of compute), we should choose “less structure”. But remember to undo that extra structure later

13 of 80

Remember CNNs!

Inductive bias: we “hard-code” the fact that neighboring pixels are related

14 of 80

Remember CNNs!

Inductive bias: we “hard-code” the fact that neighboring pixels are related

But transformers (with less inductive bias) are just as good (or better)!

15 of 80

Peak humor circa ~8 years ago

16 of 80

Now

…but unironically!

17 of 80

More data / compute = better performance

Can we predict in advance what performance we get given X amount of data / compute?

(without actually training the model)

18 of 80

19 of 80

20 of 80

What is a scaling law?

  • A simple formula that maps a size-related quantity to error

  • Multiple types of scaling laws
    • Data scaling laws:
      • Given a fixed model, fit f: dataset size -> error
    • Model scaling laws:
      • Given a fixed dataset, fit f: model size -> error
    • Joint data-model scaling laws:
      • Fit f: dataset size X model size -> error

  • Error can be validation loss, accuracy on some benchmark, etc.

21 of 80

Example

22 of 80

Example

Observation: in a log-log plot, a scaling law is just linear regression
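
A minimal sketch of that observation (the data points below are made up purely for illustration): fit a power law E = a * N^b with ordinary linear regression in log-log space, then extrapolate.

# Sketch: a power law E = a * N**b is a straight line in log-log space, so it can be
# fit with plain linear regression. The data points below are hypothetical.
import numpy as np

N = np.array([1e6, 1e7, 1e8, 1e9])      # dataset sizes (hypothetical)
E = np.array([0.80, 0.45, 0.26, 0.15])  # validation errors (hypothetical)

slope, intercept = np.polyfit(np.log(N), np.log(E), deg=1)
a, b = np.exp(intercept), slope
print(f"fitted law: E ~= {a:.2f} * N^({b:.3f})")

# Extrapolate to a scale we have not trained at yet.
print(f"predicted error at N = 1e10: {a * 1e10 ** b:.3f}")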

23 of 80

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph et al. (2022)

Bigger is better!

Bigger is different!

24 of 80

Scaling law studies

  • Scaling Laws for Neural Language Models
  • Scaling Laws for Transfer
  • Training Compute-Optimal Large Language Models
  • Reproducible scaling laws for contrastive language-image learning
  • Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design

It is important to understand which problems can be “brute-forced”

25 of 80

Not all problems obey scaling laws

  • Inverse Scaling: When Bigger Isn’t Better, McKenzie et al., 2023

  • Broken Neural Scaling Laws, Caballero et al., 2023

26 of 80

27 of 80

How is the scaling actually done?

28 of 80

Most models so far use the Transformer architecture

29 of 80

Internals don’t matter that much!

More useful to take a “functional” viewpoint on the Transformer

Sequence-to-sequence mapping with a bunch of matmuls

Input: [batch, d_model, length]

Output: [batch, d_model, length]
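
A toy sketch of this functional viewpoint (toy dimensions, random weights, no attention; not a real Transformer): a stack of layers built only from matmuls and a nonlinearity that maps a [batch, d_model, length] tensor to a tensor of the same shape.

# Sketch: a "sequence-to-sequence mapping with a bunch of matmuls".
# Each layer preserves the [batch, d_model, length] shape.
import numpy as np

batch, d_model, length = 2, 16, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(batch, d_model, length))

def layer(x, W1, W2):
    # Per-position matmuls over the d_model axis; shape is preserved.
    h = np.einsum("hd,bdl->bhl", W1, x)        # [batch, 4*d_model, length]
    h = np.maximum(h, 0.0)                     # nonlinearity
    return x + np.einsum("dh,bhl->bdl", W2, h) # residual, back to [batch, d_model, length]

weights = [(rng.normal(size=(4 * d_model, d_model)) * 0.05,
            rng.normal(size=(d_model, 4 * d_model)) * 0.05) for _ in range(4)]

y = x
for W1, W2 in weights:
    y = layer(y, W1, W2)

print(x.shape, "->", y.shape)  # (2, 16, 8) -> (2, 16, 8)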

30 of 80

Matmuls can be easily parallelized on GPUs

We can make them huge!

10,000 x 10,000 and bigger (per layer, with tens of layers)

31 of 80

32 of 80

Model weights split across GPUs

Data batches split across GPUs

33 of 80

Model weights split across GPUs

Data batches split across GPUs

34 of 80

Model weights split across GPUs

Data batches split across GPUs

35 of 80

Industry takes it to another level

6144 TPU chips

36 of 80

37 of 80

Internals of Transformers do not matter as much

Many Transformer variants have been proposed, but almost all of the fancy variations don’t scale well

It is more useful to abstract the Transformer away as a sequence-to-sequence function and think about the input and output shapes and types

Do Transformer Modifications Transfer Across Implementations and Applications?

Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, et al. (2021)

38 of 80

39 of 80

[Diagram, built up over the next few slides: which part of each system is learnable?]

Rule-based systems: input -> hand-designed program -> output

40 of 80

[Diagram, continued] Adds classical machine learning: input -> hand-designed features -> mapping from features (learned) -> output, trained with a hand-designed loss function

41 of 80

[Diagram, continued] Adds deep learning, (self-)supervised: input -> learned features -> mapping from features (learned) -> output, trained with a hand-designed loss function

42 of 80

[Diagram, continued] Adds examples: logistic regression (classical machine learning), feedforward neural net (deep learning)

43 of 80

[Diagram, continued] Adds deep learning, other RL formulations: input -> learned features -> mapping from features (learned) -> output, trained with a learned loss function

44 of 80

[Diagram, complete: progressively more of the system is learned]

Rule-based systems (e.g. IBM DeepBlue) -> classical machine learning (e.g. SVM) -> (self-)supervised deep learning (e.g. GPT-3) -> learned loss function (RLHF??) -> The Dream?

45 of 80

Part II

Compression

46 of 80

Why do scaling laws hold?

  • Do we have any theories?
  • How can we improve the scaling curve?

47 of 80

Flip a coin 45 times. Which outcome is more likely?

X1 = 001001001001001001001001001001001001001001001

X2 = 0100101010110100010110100101110001010110001111

48 of 80

Flip a coin 45 times. Which outcome is more likely?

X1 = 001001001001001001001001001001001001001001001

X2 = 0100101010110100010110100101110001010110001111

Both are equally likely! But X1 “feels” less random.

49 of 80

Kolmogorov Complexity

  • K(X) = length of the shortest program to compute X (in symbols below)
    • X = a binary string (i.e., anything, once encoded as bits)

  • Measure of compressibility / randomness of a string

  • Ideal compressor!
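
In symbols (the standard definition; U is a fixed universal machine and |p| is the length, in bits, of program p):

K(X) = \min \{\, |p| : U(p) = X \,\}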

50 of 80

Algorithmically Compressing Data

X1 = 001001001001001001001001001001001001001001001

X2 = 0100101010110100010110100101110001010110001111

K(X1) ~= len('print("001" * 15)')

K(X2) ~= len('print("0100101010110100010110100101110001010110001111")')

51 of 80

How can we find K(X)?

  • Finding K(X) exactly is undecidable
    • Enumerate all programs in order of length
    • Run each program and check whether its output is X
    • But some programs never halt (the halting problem is undecidable!)

  • We can make approximations to the real K(X) and get closer and closer to the optimal compressor

  • For instance, we can approximate K(X) with an off-the-shelf compressor, e.g. gzip (see the sketch below)

K(X) = length of the shortest program to compute X
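
A rough sketch of that approximation (exact byte counts depend on the compressor, but the repetitive X1 should come out smaller than the random-looking X2):

# Approximate K(X) with an off-the-shelf compressor (zlib here). The repetitive X1
# should compress to fewer bytes than the random-looking X2, and the gap grows with
# string length.
import random
import zlib

X1 = "001" * 15
X2 = "0100101010110100010110100101110001010110001111"

for name, s in [("X1", X1), ("X2", X2)]:
    print(name, len(s), "chars ->", len(zlib.compress(s.encode(), 9)), "bytes")

# The gap is much clearer for longer strings.
long_repetitive = "001" * 10_000
long_random = "".join(random.choice("01") for _ in range(30_000))
print(len(zlib.compress(long_repetitive.encode())))  # a few dozen bytes
print(len(zlib.compress(long_random.encode())))      # thousands of bytes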

52 of 80

Example - compression is learning

“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors, Jiang et al., 2023
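
A minimal sketch in the spirit of that paper’s method (gzip lengths + normalized compression distance + nearest neighbor); the tiny training set below is invented purely for illustration and is far too small to be meaningful.

# Compression-based text classification in the spirit of Jiang et al. (2023):
# gzip lengths + normalized compression distance (NCD) + 1-nearest-neighbor.
import gzip

def clen(s: str) -> int:
    return len(gzip.compress(s.encode()))

def ncd(a: str, b: str) -> float:
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

train = [
    ("the team won the match after extra time", "sports"),
    ("the striker scored two goals in the final", "sports"),
    ("the new gpu accelerates matrix multiplication", "tech"),
    ("the chip shortage delayed laptop shipments", "tech"),
]

def classify(text: str) -> str:
    # Label of the training example with the smallest compression distance.
    return min(train, key=lambda example: ncd(text, example[0]))[1]

print(classify("the goalkeeper saved a penalty late in the match"))  # ideally "sports"
print(classify("the processor speeds up neural network training"))   # ideally "tech"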

53 of 80

Neural Networks are Programs

  • A neural network is a program with some constraints
    • Neural networks are circuits. Circuits are programs. (it’s just sequential matrix multiplications)
    • Finite number of sequential steps / memory

54 of 80

Neural Networks are (compression) Programs

  • A neural network is a program with some constraints
    • Neural networks are circuits. Circuits are programs. (it’s just sequential matrix multiplications)
    • Finite number of sequential steps / memory

  • A deep neural network is a compressor (see the sketch below)
    • e.g. Language Modeling Is Compression (https://arxiv.org/pdf/2309.10668)
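
The core identity behind that claim: with a near-optimal entropy coder (e.g. arithmetic coding), a model that assigns probability p to the next symbol spends about -log2(p) bits on it, so a better language model is a better compressor. The toy character-bigram “model” below is not the paper’s setup; it just totals that ideal code length.

# Sketch of "language modeling is compression": the ideal code length of a text under
# a model is sum over symbols of -log2 p(x_t | x_<t). Toy character-bigram model here;
# the paper does this with large LMs.
import math
from collections import Counter, defaultdict

text = "the cat sat on the mat. the dog sat on the log. " * 50

# Fit a character-bigram model with add-one smoothing.
counts = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    counts[prev][nxt] += 1
vocab = sorted(set(text))

def prob(prev: str, nxt: str) -> float:
    c = counts[prev]
    return (c[nxt] + 1) / (sum(c.values()) + len(vocab))

bits = sum(-math.log2(prob(p, n)) for p, n in zip(text, text[1:]))
print(f"raw size: {len(text)} bytes, ideal code length under the model: {bits / 8:.0f} bytes")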

55 of 80

  • Training a neural network is actually a search over the program space
    • The “program space” is defined by the current architecture and its hyperparameters

56 of 80

State-of-the-art Neural Network

=

State-of-the-art compressor (i.e., close to K(X))

57 of 80

Intelligence = Compression

58 of 80

An interesting result from the theory

Let X, Y be two binary strings (sequences in {0, 1}*). Then

K(X, Y) <= K(X) + K(Y) + c

Compressing two datasets together never costs more than compressing each separately (up to a constant c), and it can cost much less when they share structure!

59 of 80

K(X, Y) <= K(X) + K(Y) + c

  • A model trained on the concatenation of two datasets is better than individual models trained on each separate dataset
  • The model uses common statistics between datasets to compress better
  • The difference in performance (c) is due to the irreducible differences between X and Y

  • More data (even multimodal / even unpaired) is provably always better

60 of 80

More data = better performance

The more “compressible” the dataset, the easier it is to learn!

61 of 80

  • The more “compressible” the dataset, the easier it is to learn!

  • E.g. an LLM trained on code learns better / faster than one trained on plain text only
    • Code has more structure -> it is more compressible

  • E.g. a model trained on clean data learns better than one trained on noisy data
    • Clean data is more compressible

62 of 80

  • The more “compressible” the dataset, the easier it is to learn!

  • Corollary: An incompressible dataset (i.e., random) is impossible to learn.

63 of 80

“No Free Lunch” Theorem

Any two optimization algorithms are equivalent when their performance is averaged across all possible problems.

64 of 80

“No Free Lunch” Theorem

Translation: Each individual problem requires specially tailored inductive biases.

65 of 80

“No Free Lunch” Theorem

Any two optimization algorithms are equivalent when their performance is averaged across all possible problems.

???

66 of 80

“No Free Lunch” Theorem

Any two optimization algorithms are equivalent when their performance is averaged across all possible problems.

In mathematics, a problem is defined as a tuple (p, S)

  • p = binary string (i.e. the problem description)
  • S = set of binary strings (i.e., solutions to the problem)

An Introduction to Kolmogorov Complexity and Its Applications, Ming Li, Paul Vitányi, 2019

???

67 of 80

“No Free Lunch” Theorem

Any two optimization algorithms are equivalent when their performance is averaged across all possible problems.

In mathematics, a problem is defined as a tuple (p, S)

  • p = binary string (i.e. the problem description)
  • S = set of binary strings (i.e., solutions to the problem)

An Introduction to Kolmogorov Complexity and Its Applications, Ming Li, Paul Vitányi, 2019

???

Infinite (exponential) combinations

No relation to the real world!

68 of 80

“... All but exponentially few sequences of a given length have near maximal Kolmogorov complexity and are thus incompressible”

69 of 80

https://libraryofbabel.info/

70 of 80

[Venn diagram: within the set of all possible binary strings, the compressible binary strings are a tiny subset, and the binary strings that we (as humans) care about lie inside that subset]

71 of 80

“No Free Lunch” Theorem

72 of 80

Ideas for VSS Competition

  • Train a bigger model
    • Adjust the learning rate accordingly (see the sketch after this list)!
      • 2x bigger model = 2x smaller learning rate

  • Train on more data
    • More data could also mean a more diverse set of augmentations

  • Train on more “compressible” data
    • Noisy data is harder to compress; a model learns better on clean data

  • Train for longer
    • Only if the model is big enough and the dataset is large enough; otherwise it could lead to overfitting!
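
A tiny sketch of the rule of thumb in the first bullet (this is the slide’s heuristic, not a universal law; the baseline numbers are hypothetical):

# Rule of thumb from the slide: if the model gets k times bigger, scale the learning
# rate down by roughly k. Baseline values below are hypothetical.
def scaled_lr(base_lr: float, base_params: int, new_params: int) -> float:
    return base_lr * base_params / new_params

base_lr, base_params = 3e-4, 25_000_000
print(scaled_lr(base_lr, base_params, 2 * base_params))  # 2x bigger model -> 1.5e-4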

73 of 80

A Perspective on the Field

74 of 80

75 of 80

76 of 80

Tasks in Computer Vision

  • Scene text reading (OCR)
  • Object recognition (classification)
  • Object delineation (detection, segmentation)
  • Chart / infographic parsing
  • Document parsing
  • Instrument reading
  • Place recognition
  • Action recognition
  • Face recognition
  • World knowledge
  • Visual question answering

77 of 80

Tasks in Computer Vision

  • Scene text reading (OCR)
  • Object recognition (classification)
  • Object delineation (detection, segmentation)
  • Chart / infographic parsing
  • Document parsing
  • Instrument reading
  • Place recognition
  • Action recognition
  • Face recognition
  • World knowledge
  • Visual question answering

Fake Tasks

Real Tasks

78 of 80

Conclusions

  • Identified the main driving force behind AI progress
    • General methods with few assumptions
    • Scale driven by exponentially cheaper compute
    • Transformer architecture easily scalable

  • Scaling laws
    • Predictable model performance given scale
    • Useful for predicting model capability for a given $$$ budget

  • Models learn because real world data is compressible
    • Kolmogorov complexity
    • The more compressible a dataset is, the easier it is to learn

79 of 80

Burgers? / Break

80 of 80

Work on the competition!