Vision Summer School
On Scaling & Compression
Part I
Scaling
AI is advancing so fast that it is hard to keep up
People spend a lot of time and energy catching up with the latest developments
But not enough attention goes to the old things
It is more important to study the change itself
What does it mean to study the change itself?
1. Identify the dominant driving forces behind the change
2. Understand the dominant driving forces
3. Predict the future trajectory
Predicting the future trajectory is difficult because of the many driving forces involved and the complexity of their interactions
[Plot: prediction difficulty vs. number of dominant driving forces; "dropping a pen" sits at the easy end, "AI research?" at the hard end. Figure from Rich Sutton's WAIC keynote]
Brain-scale compute power
Roughly, 10x more compute every 5 years
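As a quick arithmetic aside (my calculation, not from the slide), 10x every 5 years is roughly a 1.6x increase per year, i.e. compute doubles about every year and a half:

$$
10^{1/5} \approx 1.585 \ \text{per year}, \qquad t_{\text{double}} = \frac{\ln 2}{\ln 1.585} \approx 1.5 \ \text{years}
$$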
The job of AI researchers is to teach machines how to “think”
One (unfortunately common) approach
Teach the machines how we think we think
This approach imposes structure on the problem, which can become a limitation when scaled up
The job of AI researchers is to teach machines how to “think”
Bitter lesson: progress of AI in the past 70 years boils down to
”We think the most benefits will go to whoever has the biggest computer.”
Greg Brockman, OpenAI’s CTO
The more structure, the less scalable the method is
[Plot: performance vs. compute; the "less structure" method starts below the "more structure" method but overtakes it as compute grows]
However, we can’t just go for the most general method
[Plot: performance vs. compute, now with an "even less structure" curve that starts lower still but eventually surpasses both]
The difference between the two methods is the additional "inductive biases", or structure, imposed
If we are at this point on the compute axis, we should choose "less structure". But remember to undo that choice (remove the added structure) later, once more compute is available
Remember CNNs!
Inductive bias: we “hard-code” the fact that neighboring pixels are related
But transformers (with less inductive bias) are just as good (or better)!
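A minimal sketch of that contrast (PyTorch, my own illustration, not from the slides): a convolution hard-codes locality through its small kernel, while self-attention has no such prior and lets every position interact with every other.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)   # [batch, channels, height, width]

# CNN: the 3x3 kernel hard-codes "neighboring pixels are related"
conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1)
y_conv = conv(x)                 # each output pixel only sees a 3x3 neighborhood

# Self-attention: no locality prior; every position can attend to every other
tokens = x.flatten(2).transpose(1, 2)     # [batch, 32*32 positions, 64 dims]
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
y_attn, _ = attn(tokens, tokens, tokens)  # global interactions, learned from data

print(y_conv.shape, y_attn.shape)
```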
Peak humor circa ~8 years ago
Now
…but unironically!
More data / compute = better performance
Can we predict in advance what performance we get given X amount of data / compute?
(without actually training the model)
What is a scaling law?
[Example plot: model performance vs. compute on log-log axes; a "?" marks the performance to be predicted at a larger compute budget X]
Observation: It’s just linear regression in the log-log plot
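A sketch of what "linear regression in the log-log plot" means in practice (the numbers below are hypothetical, not from the talk): fit a line to a few small runs in log-log space, then extrapolate to a compute budget you have not trained at.

```python
import numpy as np

# Hypothetical (compute, loss) pairs from small training runs
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss    = np.array([3.10, 2.60, 2.18, 1.83])

# A power law L = k * C^a is a line in log-log space: log L = a*log C + log k
a, log_k = np.polyfit(np.log10(compute), np.log10(loss), deg=1)

# Predict the loss at a compute budget we have not trained at
target = 1e23
pred = 10 ** (a * np.log10(target) + log_k)
print(f"exponent a = {a:.3f}, predicted loss at 1e23 FLOPs = {pred:.2f}")
```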
Emergent Abilities of Large Language Models
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph et al. (2022)
Bigger is better!
Bigger is different!
Scaling law studies
Important to understand what problems can be “brute forced”
Not all problems obey scaling laws
How is the scaling actually done?
Most models so far use Transformer architecture
Internals don’t matter that much!
More useful to take a “functional” viewpoint on the Transformer
Sequence-to-sequence mapping with a bunch of matmuls
Input: [batch, d_model, length]
Output: [batch, d_model, length]
Matmuls can be easily parallelized on GPUs
We can make them huge!
10,000 x 10,000 and bigger (per layer, with tens of layers)
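A toy sketch of this functional viewpoint (my simplification; I use the common [batch, length, d_model] axis order, whereas the slide lists the same shape as [batch, d_model, length]): a Transformer layer is a sequence-to-sequence map built almost entirely from matmuls, so the input and output shapes match.

```python
import numpy as np

def transformer_layer(x, w_qkv, w_out, w_ff1, w_ff2):
    """One highly simplified layer (attention + MLP), all matmuls.
    x: [batch, length, d_model] in, same shape out."""
    q, k, v = (x @ w for w in w_qkv)                        # projections
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
    x = x + weights @ v @ w_out                             # attention block
    x = x + np.maximum(x @ w_ff1, 0) @ w_ff2                # MLP block (ReLU)
    return x

batch, length, d_model = 2, 16, 64
x = np.random.randn(batch, length, d_model)
params = ([np.random.randn(d_model, d_model) * 0.02 for _ in range(3)],
          np.random.randn(d_model, d_model) * 0.02,
          np.random.randn(d_model, 4 * d_model) * 0.02,
          np.random.randn(4 * d_model, d_model) * 0.02)
y = transformer_layer(x, *params)
print(x.shape, "->", y.shape)   # same shape in and out
```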
Model weights split across GPUs
Data batches split across GPUs
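A toy sketch of both ideas (pure NumPy, with "devices" simulated as list entries; my illustration, not a production setup): split the weight matrix column-wise across devices, or split the batch across devices, and either way the shards reassemble into the same full matmul.

```python
import numpy as np

batch, d_in, d_out, n_devices = 8, 1024, 1024, 4
x = np.random.randn(batch, d_in)
w = np.random.randn(d_in, d_out)

# Model parallelism: each "device" holds a column slice of the weights
w_shards = np.split(w, n_devices, axis=1)
y_model_parallel = np.concatenate([x @ w_i for w_i in w_shards], axis=1)

# Data parallelism: each "device" holds a slice of the batch and a full weight copy
x_shards = np.split(x, n_devices, axis=0)
y_data_parallel = np.concatenate([x_i @ w for x_i in x_shards], axis=0)

# Both recover the same full matmul
print(np.allclose(y_model_parallel, x @ w), np.allclose(y_data_parallel, x @ w))
```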
Industry takes it to another level
6144 TPU chips
Internals of Transformers do not matter as much
Many Transformer variants have been proposed, but almost none of the fancy variations scale well
It is more useful to abstract the Transformer as a sequence of functions and think about the input and output shapes and types
Do Transformer Modifications Transfer Across Implementations and Applications?
Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, et al. (2021)
[Diagram: which parts of each system are hand-designed and which are the learnable part, across four designs]
Rule-based systems (e.g., IBM DeepBlue): input → hand-designed program → output
Classical machine learning (e.g., SVM, logistic regression): input → hand-designed features → learned mapping from features → output, with a hand-designed loss function
Deep learning, (self-)supervised learning (e.g., feedforward neural nets, GPT-3): input → learned features → learned mapping from features → output, with a hand-designed loss function
Deep learning, other formulations such as RL (e.g., RLHF??): input → learned features → learned mapping from features → output, with a learned loss function
The Dream?
Part II
Compression
Why do scaling laws hold?
Flip a coin 45 times. Which outcome is more likely?
X1 = 001001001001001001001001001001001001001001001
X2 = 0100101010110100010110100101110001010110001111
Both are equally likely! But X1 “feels” less random.
Kolmogorov Complexity
Algorithmically Compressing Data
X1 = 001001001001001001001001001001001001001001001
X2 = 0100101010110100010110100101110001010110001111
K(X1) ~= len(‘print(“001” * 15)’)
K(X2) ~= len(‘print(“0100101010110100010110100101110001010110001111”)’)
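As a quick, runnable version of this comparison (my addition): build X1 from its pattern and compare the lengths of the two candidate programs.

```python
x1 = "001" * 15
x2 = "0100101010110100010110100101110001010110001111"

prog1 = 'print("001" * 15)'     # short program: it exploits the repeating pattern
prog2 = f'print("{x2}")'        # no pattern to exploit: the program contains the data

print(len(x1), len(prog1))      # 45 symbols of data, a 17-character program
print(len(x2), len(prog2))      # here the program is longer than the data it prints
```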
How can we find K(X)?
K(X) = the length of the shortest program that outputs X; it is uncomputable in general, so in practice we approximate it from above with compressors
Example - compression is learning
“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors, Jiang et al., 2023
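A compact sketch of the idea in that paper as I understand it (gzip as the compressor, normalized compression distance, 1-nearest-neighbor; the example texts and labels are made up):

```python
import gzip

def c(s: str) -> int:
    """Approximate K(s) by the length of gzip-compressed s."""
    return len(gzip.compress(s.encode()))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    return (c(a + " " + b) - min(c(a), c(b))) / max(c(a), c(b))

def classify(query: str, labeled: list) -> str:
    """1-nearest-neighbor under NCD: no parameters, no training."""
    return min(labeled, key=lambda pair: ncd(query, pair[0]))[1]

train = [
    ("the team won the championship game last night", "sports"),
    ("the striker scored twice in the second half", "sports"),
    ("the stock market rallied after the earnings report", "finance"),
    ("the central bank raised interest rates again", "finance"),
]
# The training example that compresses best together with the query decides the label
print(classify("shares jumped after the bank cut interest rates", train))
```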
Neural networks are (compression) programs
State-of-the-art neural network = state-of-the-art compressor (i.e., close to K(X))
Intelligence = Compression
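One way to make "neural network = compressor" concrete (a sketch with a made-up toy model, not the talk's numbers): the cross-entropy loss of a model is, via arithmetic coding, the number of bits needed to encode the data, so a lower loss means a better compressor.

```python
import math

text = "001" * 5 + "0"   # 16 symbols of the repeating pattern from earlier

def code_length_bits(data, prob_of_next):
    """Bits an arithmetic coder needs when driven by the model:
    sum of -log2 p(next symbol | context), i.e. the cross-entropy in bits."""
    return sum(-math.log2(prob_of_next(data[:i], ch)) for i, ch in enumerate(data))

# Model A ignores context and predicts 0/1 uniformly: 1 bit per symbol
uniform = lambda context, ch: 0.5
# Model B has "learned" the 001 pattern and is 90% confident (toy numbers)
pattern = lambda context, ch: 0.9 if ch == "001"[len(context) % 3] else 0.1

print(code_length_bits(text, uniform))   # 16.0 bits
print(code_length_bits(text, pattern))   # ~2.4 bits: the better model is the better compressor
```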
An interesting result of theory
Let X, Y be two sequences in {0, 1}*. Then
K(X, Y) <= K(X) + K(Y) + c
Concatenating two datasets is never harder to compress than compressing each one separately, and it is easier whenever they share structure
More data = better performance
The more “compressible” the dataset, the easier it is to learn!
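A rough empirical analogue with a real compressor (zlib standing in for K; the example strings are mine): compressing two related datasets together costs no more than compressing them separately, and when they share structure it costs noticeably less.

```python
import zlib

x = ("the cat sat on the mat. " * 200).encode()
y = ("the dog sat on the log. " * 200).encode()

separate = len(zlib.compress(x)) + len(zlib.compress(y))   # stands in for K(X) + K(Y)
together = len(zlib.compress(x + y))                       # stands in for K(X, Y)

print(together, "vs", separate)   # the joint encoding exploits the shared structure
```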
“No Free Lunch” Theorem
Any two optimization algorithms are equivalent when their performance is averaged across all possible problems.
Translation: each individual problem requires specially tailored inductive biases.
??? How does this square with “less structure scales better”?
In mathematics, a problem is defined as a tuple (p, S)
An Introduction to Kolmogorov Complexity and Its Applications, Ming Li, Paul Vitányi, 2019
There are infinitely (exponentially) many such problem combinations, and almost all of them have no relation to the real world!
“... All but exponentially few sequences of a given length have near maximal Kolmogorov complexity and are thus incompressible”
https://libraryofbabel.info/
[Venn diagram: within the set of all possible binary strings, the compressible strings are a small subset, and the binary strings that we (as humans) care about lie within the compressible ones]
Ideas for VSS Competition
A Perspective on the Field
Tasks in Computer Vision
Fake Tasks
Real Tasks
Conclusions
Burgers? / Break
Work on the competition!