Better Intelligence if and only if Better Compression
ML in Physics workshop
12 September 2019
What is intelligence?
The power to predict the future!
What is intelligence?
Given
0010101101101010110
Predict the next bit
Intelligence => Compression
Idea: Suppose the alphabet is {X, Y}, with
P(X) = 2/3
P(Y) = 1/3
If we are only concerned with encoding length-2 messages, then we can map all possible messages to intervals in the range [0, 1):
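As a sketch of this construction in Python (using the probabilities above; laying the messages out in lexicographic order is our own choice, any fixed order works):

```python
from itertools import product

probs = {"X": 2/3, "Y": 1/3}

# Lay the four length-2 messages end to end in [0, 1); each message
# gets a subinterval whose width equals the message's probability.
lo = 0.0
for msg in product("XY", repeat=2):
    p = probs[msg[0]] * probs[msg[1]]
    print("".join(msg), f"[{lo:.4f}, {lo + p:.4f})")
    lo += p
```

This prints XX [0.0000, 0.4444), XY [0.4444, 0.6667), YX [0.6667, 0.8889), YY [0.8889, 1.0000): likelier messages get wider intervals.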
Intelligence => Compression
To encode a message, just send enough bits of a binary fraction to uniquely specify its interval
Intelligence => Compression
Number of bits is determined by the size of the interval
Symbol X takes log(1 / P(X)) bits and symbol Y takes log(1 / P(Y)) bits
Intelligence => Compression
On average, each symbol takes P(X) * log(1 / P(X)) + P(Y) * log(1 / P(Y)) bits
We reach the Shannon entropy limit for optimal compression
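Concretely, for the alphabet above: H = (2/3) * log(3/2) + (1/3) * log(3) ≈ 0.918 bits per symbol, so a length-2 message needs about 1.84 bits on average instead of 2.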
This assumes we can do perfect prediction by knowing P(X) and P(Y)
What if we don't?
Prediction idea 1 - Finite state machine
Assuming we know the underlying alphabet, we can count the frequencies of each symbol in the alphabet
Input: 000100100100
Interpretation:
00 01 00 10 01 00
sleep run sleep ice-cream run sleep
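A minimal counting sketch in Python, assuming the 2-bit interpretation above (the symbol labels are taken from the slide):

```python
from collections import Counter

bits = "000100100100"
labels = {"00": "sleep", "01": "run", "10": "ice-cream"}  # from the slide

# Split into 2-bit symbols and estimate each symbol's probability
# as its frequency divided by the number of symbols seen.
symbols = [bits[i:i + 2] for i in range(0, len(bits), 2)]
for sym, n in Counter(symbols).items():
    print(labels.get(sym, sym), n / len(symbols))
```

This yields sleep 1/2, run 1/3, ice-cream 1/6.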
Question: What if we don't know the alphabet?
Prediction idea 2 - Lempel-Ziv
This is the idea behind gzip-style compressors
Split abracadabra into a|b|r|ac|ad|ab|ra
Count the occurrences at each node
The counts are taken as probabilities
For example, P(c|a) = 5/17
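A sketch of the incremental LZ78-style parse that produces the split above (an illustration of the idea, not gzip's actual implementation):

```python
def lz78_parse(s):
    """Each phrase is the shortest prefix of the remaining
    input that has not been seen as a phrase before."""
    phrases, seen, cur = [], set(), ""
    for ch in s:
        cur += ch
        if cur not in seen:
            seen.add(cur)
            phrases.append(cur)
            cur = ""
    if cur:  # leftover suffix that matched an existing phrase
        phrases.append(cur)
    return phrases

print(lz78_parse("abracadabra"))  # ['a', 'b', 'r', 'ac', 'ad', 'ab', 'ra']
```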
Prediction idea 2 - Lempel-Ziv
Drawbacks
These drawbacks show up for data sources whose symbols are generated from the preceding context (prefixes) rather than from independent blocks
Prediction idea 3 - Context Tree Weighting
LZ drawback 1: some subsequences are not parsed
Improvement: Parse each and every prefix, symbol by symbol
At each node, count the occurrences
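A sketch of this symbol-by-symbol prefix counting, assuming binary input and a chosen maximum context depth:

```python
from collections import defaultdict

def context_counts(bits, max_depth):
    """For every context (the last k bits, k <= max_depth), count how
    often a 0 or a 1 follows it -- every prefix is parsed, symbol by symbol."""
    counts = defaultdict(lambda: [0, 0])
    for t, x in enumerate(bits):
        for k in range(max_depth + 1):
            if t >= k:
                counts[bits[t - k:t]][int(x)] += 1
    return counts

for ctx, (n0, n1) in sorted(context_counts("0010110", 2).items()):
    print(repr(ctx), "0s:", n0, "1s:", n1)
```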
Prediction idea 3 - Context Tree Weighting
LZ drawback 2: some satisfying prefixes are not used
Solving this problem means we have to weight these satisfying prefixes
For example, to predict the next symbol, which prefix should we use?
P(x=0|10) = 2/2 = 1
P(x=0|0) = 2/4 = 0.5
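Such estimates come from simple counting; a sketch (the sequence here is a made-up stand-in, since the slide's own data is not reproduced in the text):

```python
def p_next0(bits, ctx):
    """Empirical P(next bit = 0 | the last len(ctx) bits equal ctx)."""
    nxt = [bits[i + len(ctx)] for i in range(len(bits) - len(ctx))
           if bits[i:i + len(ctx)] == ctx]
    return nxt.count("0") / len(nxt) if nxt else None

seq = "0100010"  # hypothetical sequence, for illustration only
print(p_next0(seq, "10"), p_next0(seq, "0"))
```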
Prediction idea 3 - Context Tree Weighting
Epicurus’ principle: If more than one theory is consistent with the observations, keep all theories
Model 1: simply count 0's and 1's
Models 2-5: alternative context trees that condition on progressively more past symbols (tree diagrams on slide)
Prediction idea 3 - Context Tree Weighting
We need to weight these different theories
P_w(x | x_{t-2}=1, x_{t-1}=0) = w_1 * P_e(x|M_1) + w_2 * P_e(x|M_2)
P(x | x_{t-3}=1, x_{t-2}=1, x_{t-1}=0) = θ_{110}
P(x | x_{t-3}=0, x_{t-2}=1, x_{t-1}=0) = θ_{010}
P(x | x_{t-2}=1, x_{t-1}=0) = θ_{10}
P_w(x | x_{t-2}=0, x_{t-1}=0) = w_3 * P_e(x|M_3) + w_4 * P_e(x|M_4)
Prediction idea 3 - Context Tree Weighting
P_w(x | x_{t-1}=0) = w_1' * P_e(x|M_1') + w_2' * P_e(x|M_2') + w_3' * P_e(x|M_3') + w_4' * P_e(x|M_4') + w_5' * P_e(x|M_5')
How to compute P_w(x | x_{t-1}=0) from
P_w(x | x_{t-2}=1, x_{t-1}=0) and P_w(x | x_{t-2}=0, x_{t-1}=0)?
Prediction idea 3 - Context Tree Weighting
It looks like P_w(x | x_{t-1}=0) = w_1' * P_e(x|M_1') + (1 - w_1') * P_w(x | x_{t-2}=1, x_{t-1}=0) * P_w(x | x_{t-2}=0, x_{t-1}=0)
This is true if we consider the probability of the entire sequence, P_e(s), instead of the next symbol, P_e(x)
Prediction idea 3 - Context Tree Weighting
P_w(s | x_{t-2}=1, x_{t-1}=0) * P_w(s | x_{t-2}=0, x_{t-1}=0) is a cross product; for example,
P_e(s|M_3') = P_e(s|M_2) * P_e(s|M_3)
With leaf parameters θ_{011}, θ_{010}, θ_{00} (tree diagram on slide):
P(x_1) P(x_4) P(x_2) P(x_5) = (1-θ_{011}) (1-θ_{010}) θ_{00} θ_{00} = [(1-θ_{011})(1-θ_{010})] * [θ_{00} θ_{00}]
Prediction idea 3 - Context Tree Weighting
Also note that by defining the weighted probability recursively,
P_w^s = w * P_e(s) + (1 - w) * P_w^{0s} * P_w^{1s},
the weights, w, sum to 1
the second equality comes from induction
Prediction idea 3 - Context Tree Weighting
Effectively, the weight of a tree is the product of its per-node weights
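Putting the pieces together, a compact CTW sketch for binary data, with the Krichevsky-Trofimov estimator at each node and weight ½ at every mixing step (the depth, the all-zero initial past, and the input bits are illustrative assumptions):

```python
from math import log2

class Node:
    def __init__(self):
        self.a = 0      # number of 0s seen in this context
        self.b = 0      # number of 1s seen in this context
        self.pe = 1.0   # KT estimate of the subsequence seen here
        self.pw = 1.0   # weighted (mixture) probability
        self.kids = {}  # deeper contexts, keyed by the next past bit

def update(node, context, bit, depth):
    # Krichevsky-Trofimov estimator: P(next=1) = (b + 1/2) / (a + b + 1)
    num = (node.b if bit else node.a) + 0.5
    node.pe *= num / (node.a + node.b + 1)
    node.b += bit
    node.a += 1 - bit
    if depth == 0:
        node.pw = node.pe  # deepest node: nothing deeper to mix in
    else:
        child = node.kids.setdefault(context[0], Node())
        update(child, context[1:], bit, depth - 1)
        p0 = node.kids[0].pw if 0 in node.kids else 1.0
        p1 = node.kids[1].pw if 1 in node.kids else 1.0
        node.pw = 0.5 * node.pe + 0.5 * p0 * p1  # weight 1/2 at every node

D = 3           # assumed maximum context depth
root = Node()
past = [0] * D  # assume an all-zero past before the data
for x in [0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1]:  # illustrative input bits
    update(root, past, x, D)
    past = [x] + past[:-1]
print("CTW code length:", -log2(root.pw), "bits")
```

The root's weighted probability mixes every pruning of the context tree at once, so the code length -log2(P_w) is within a small overhead of the best tree in hindsight.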
Prediction idea 3 - Context Tree Weighting
Let the weights w be a function of depth d; one way to set w is to give all trees equal weight
This is saying all of the double-exponentially many trees, roughly 2^{2^d} of them, have equal weight
The code length for this equal-weight setting is then about log(2^{2^d}) = 2^d bits, however simple the actual tree
A better way is to set all weights to ½, which amounts to describing trees in their most natural form via recursion:
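A sketch of that recursive description for binary context trees: one bit per node ('1' = split, then describe both children; '0' = leaf), so the description length equals #nodes:

```python
def encode_tree(tree):
    # '0' = leaf; '1' = internal node, followed by the codes of both children
    if tree is None:
        return "0"
    left, right = tree
    return "1" + encode_tree(left) + encode_tree(right)

# a context tree that splits on the most recent bit, then splits once more
t = (None, (None, None))
code = encode_tree(t)
print(code, len(code))  # '10100', 5 -> five nodes, five bits
```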
Prediction idea 3 - Context Tree Weighting
If we set all weights to ½ via the recursive coding scheme, the code length is #nodes instead of 2^d
In other words, we gain efficiency by weighting trees inversely proportional to their complexity
Occam's razor: The simpler the explanation, the more plausible it is
Prediction idea 3 - Context Tree Weighting
Let L be the length of the compressed string, S be the entropy of the source data
As n -> ∞, L/n -> S
Lempel-Ziv is optimal in the sense that the per-symbol redundancy approaches 0. However, it is not truly optimal, because its rate of convergence is not the fastest.
Prediction idea 3 - Context Tree Weighting
If the maximum likelihood estimates of the source distribution P(x; θ_1, θ_2, ..., θ_k) satisfy the Central Limit Theorem, then
length = (data entropy) + (cost of learning) + (cost of inferior estimator)
data entropy: minimized by arithmetic coding
cost of learning: log(n) per degree of freedom
cost of inferior estimator: we should strive to make this 0; CTW achieves this
Prediction idea 3 - Context Tree Weighting
Gettysburg address:
Four score and seven years ago ... government of the people, by the people, for the people, shall not perish from the earth
Compressed size in bytes:
Compression => Intelligence
Intelligence: find patterns in the data
Given a compressor, how to use it to measure the similarity of two objects?
Compression => Intelligence
Let K(x) be the compressed length of string x, and K(xy) the compressed length of the concatenation of x and y
The normalized compression distance is then d(x, y) = (K(xy) - min(K(x), K(y))) / max(K(x), K(y))
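A minimal sketch with gzip standing in for the compressor (the talk's experiments use CTW; the strings here are toy stand-ins, not real sequences):

```python
import gzip

def K(s: bytes) -> int:
    """Compressed length of s, with gzip standing in for the compressor."""
    return len(gzip.compress(s))

def ncd(x: bytes, y: bytes) -> float:
    # normalized compression distance: ~0 for near-duplicates, ~1 for unrelated
    return (K(x + y) - min(K(x), K(y))) / max(K(x), K(y))

x = b"gtttatgtagcttacc" * 50  # toy repetitive stand-in "sequences"
y = b"gttaatgtagcttaaa" * 50
print(ncd(x, x + b"a"), ncd(x, y))
```

The intuition: if y shares patterns with x, compressing the concatenation xy costs little more than compressing x alone, so the distance is small.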
Mammalian evolution
Conflict among individual Mitochondrial Proteins in Resolving the Phylogeny of Eutherian Orders
Although the overall evidence strongly suggests ((primates, ferungulates), rodents), the ND1 data clearly support another tree, ((primates, rodents), ferungulates)...
Mammalian evolution
Which tree is correct?
Given mitochondrial DNA:
Chimpanzee: gtttatgtagcttacccccttaa...
Pig: gttaatgtagcttaaattatcaaa...
Mammalian evolution
Common approaches:
This approach: Use a compressor on the raw sequence
Resulting phylogenetic trees computed with gzip and with CTW (figures on slide)
SARS virus phylogeny
Severe acute respiratory syndrome (SARS) is a disease caused by the SARS coronavirus
The outbreak in 2003 caused 774 deaths with a fatality rate of 9.6%
There is no vaccine
Understanding SARS virus evolution history is key to its prevention and treatment
SARS virus phylogeny
Input: amino acid sequences of the spike protein, the key penetration apparatus of coronaviruses
MFIFLLFLTLTSGSDLDRCTTFDDVQA...
Common approach:
SARS virus phylogeny
This approach: compression distance on the raw sequences, with CTW
Findings: (phylogenetic tree with coronavirus Group 1 and Group 2 labels; figure on slide)
Conclusion
Details of experiments can be found here
Quiz
The BARF compressor will… eventually compress any file to 0 bytes
What is wrong here???