1 of 40

Better Intelligence if and only if Better Compression

ML in Physics workshop

12 September 2019

2 of 40

What is intelligence?

  • Skill to solve hard problems
  • Ability to communicate with humans - Turing
  • Intelligence is what is measured by intelligence tests - Boring (1923)

3 of 40

What is intelligence?

  • Skill to solve hard problems
  • Ability to communicate with humans - Turing
  • Intelligence is what is measured by intelligence tests - Boring (1923)

The power to predict the future!

4 of 40

What is intelligence?

Given

0010101101101010110

Predict the next bit

5 of 40

Intelligence => Compression

Idea: Suppose alphabet was X, Y and

prob(X) = 2/3

prob(Y) = 1/3

If we are only concerned with encoding length-3 messages, then we can map all possible messages to intervals in the range [0, 1]:
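A quick sketch of this mapping (assuming the messages are ordered with X before Y; the interval widths, not the ordering, are what matter):

# Map every length-3 message over {X, Y} to a sub-interval of [0, 1)
# whose width equals the message's probability.
from itertools import product
from math import ceil, log2

p = {"X": 2/3, "Y": 1/3}
low = 0.0
for msg in product("XY", repeat=3):
    width = 1.0
    for sym in msg:
        width *= p[sym]
    bits = ceil(log2(1 / width))   # bits needed to name the interval (next slides)
    print("".join(msg), f"[{low:.4f}, {low+width:.4f})  width={width:.4f}  ~{bits} bits")
    low += width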

6 of 40

Intelligence => Compression

To encode a message, just send enough bits of a binary fraction to uniquely specify its interval

7 of 40

Intelligence => Compression

Number of bits is determined by the size of the interval

  • the first interval (XXX) has width 8/27 and needs 2 bits -> 2/3 bit per symbol X
  • the last interval (YYY) has width 1/27 and needs 5 bits

On average, symbol X takes log(1 / P(X)) bits, symbol Y takes log(1 / P(Y)) bits

8 of 40

Intelligence => Compression

On average, each symbol takes P(X) * log(1 / P(X)) + P(Y) * log(1 / P(Y))

We reach the Shannon entropy limit for optimal compression
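A quick numeric check of this limit for P(X) = 2/3, P(Y) = 1/3:

# Shannon entropy of the two-symbol source, in bits per symbol.
from math import log2

px, py = 2/3, 1/3
print(f"X costs {log2(1/px):.3f} bits, Y costs {log2(1/py):.3f} bits")
print(f"entropy = {px*log2(1/px) + py*log2(1/py):.3f} bits/symbol")  # ~0.918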

This assumes we can do perfect prediction by knowing P(X) and P(Y)

What if we don't?

9 of 40

Prediction idea 1 - Finite state machine

Assuming we know the underlying alphabet, we can count the frequencies of each symbol in the alphabet

Input: 000100100100

Interpretation:

00 01 00 10 01 00

sleep run sleep ice-cream run sleep
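A minimal sketch of this counting, assuming the alphabet of 2-bit codes is known (the symbol names are just the slide's illustration):

# Split the input into known 2-bit symbols and count their frequencies.
from collections import Counter

names = {"00": "sleep", "01": "run", "10": "ice-cream"}
data = "000100100100"
symbols = [data[i:i+2] for i in range(0, len(data), 2)]   # 00 01 00 10 01 00
counts = Counter(symbols)
for code, n in counts.items():
    print(names[code], f"P = {n}/{len(symbols)}")          # empirical probabilities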

Question: What if we don't know the alphabet?

10 of 40

Prediction idea 2 - Lempel-Ziv

This is what is done in gzip

Split abracadabra into a|b|r|ac|ad|ab|ra

Count the occurrences at each node

The counts are taken as probabilities

For example, P(c|a) = 5/17
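A minimal LZ78-style parsing sketch that reproduces the split above (gzip itself uses the related LZ77 scheme plus Huffman coding; the tree of phrases and counts is not shown here):

# LZ78 incremental parsing: each new phrase is the shortest string
# not yet seen as a phrase.
def lz78_phrases(s):
    phrases, seen, cur = [], set(), ""
    for ch in s:
        cur += ch
        if cur not in seen:        # shortest new string -> close the phrase
            seen.add(cur)
            phrases.append(cur)
            cur = ""
    if cur:                        # trailing phrase that was already seen
        phrases.append(cur)
    return phrases

print("|".join(lz78_phrases("abracadabra")))   # a|b|r|ac|ad|ab|ra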

11 of 40

Prediction idea 2 - Lempel-Ziv

This is what is done in gzip

Split abracadabra into a|b|r|ac|ad|ab|ra

Drawbacks

  • Some subsequences are not parsed. For example, br is never parsed as a phrase, so the context needed for P(a|br) is unavailable
  • Some satisfying prefixes are not used. For example, P(b|raa) is treated as P(b) since raa is not in the tree. However, the better approximation P(b|a) could still be used, as a is in the tree

12 of 40

Prediction idea 2 - Lempel-Ziv

Drawbacks

  • Some subsequences are not parsed
  • Some satisfying prefixes are not used

These drawbacks show up for data sources whose symbols are generated not block by block but conditioned on their preceding context

13 of 40

Prediction idea 3 - Context Tree Weighting

LZ drawback 1: some subsequences are not parsed

Improvement: Parse each and every prefix, symbol by symbol

At each node, count the occurrences

14 of 40

Prediction idea 3 - Context Tree Weighting

LZ drawback 2: some satisfying prefixes are not used

Solving this problem means we have to weight these satisfying prefixes

For example, to predict the next symbol, which prefix should we use?

P(x=0 | 10) = 2/2 = 1

P(x=0 | 0) = 2/4 = 0.5

15 of 40

Prediction idea 3 - Context Tree Weighting

Epicurus’ principle: If more than one theory is consistent with the observations, keep all theories

Model 1: simply count 0's and 1's

(Figure: Models 2–5, alternative context-tree models)

16 of 40

Prediction idea 3 - Context Tree Weighting

We need to weight these different theories

Pw(x | x_{t-2}=1, x_{t-1}=0) = w1 * Pe(x|M1) + w2 * Pe(x|M2)

P(x | x_{t-3}=1, x_{t-2}=1, x_{t-1}=0) = θ_110

P(x | x_{t-3}=0, x_{t-2}=1, x_{t-1}=0) = θ_010

P(x | x_{t-2}=1, x_{t-1}=0) = θ_10

Pw(x | x_{t-2}=0, x_{t-1}=0) = w3 * Pe(x|M3) + w4 * Pe(x|M4)

17 of 40

Prediction idea 3 - Context Tree Weighting

Pw(x | x_{t-1}=0) = w1' * Pe(x|M1') + w2' * Pe(x|M2') + w3' * Pe(x|M3') + w4' * Pe(x|M4') + w5' * Pe(x|M5')

How to compute Pw(x | x_{t-1}=0) from Pw(x | x_{t-2}=1, x_{t-1}=0) and Pw(x | x_{t-2}=0, x_{t-1}=0)?

18 of 40

Prediction idea 3 - Context Tree Weighting

It looks like Pw(x | x_{t-1}=0) = w1' * Pe(x|M1') + (1 - w1') * Pw(x | x_{t-2}=1, x_{t-1}=0) * Pw(x | x_{t-2}=0, x_{t-1}=0)

This is true if we consider the probability of the entire sequence Pe(s), instead of the next symbol Pe(x)
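A minimal sketch of this recursion on a binary sequence, with the node weight fixed to ½ as on the later slides and assuming the Krichevsky–Trofimov estimator for Pe (the standard choice in the CTW literature); the depth D and the variable names are illustrative only:

# Context tree weighting sketch: Pw = 1/2 * Pe + 1/2 * Pw(child 0) * Pw(child 1)
# at internal nodes, Pw = Pe at depth-D leaves. Plain floats for brevity;
# real implementations work in log space.
from math import log2

class Node:
    def __init__(self):
        self.counts = [0, 0]   # 0s and 1s observed in this context
        self.pe = 1.0          # KT estimate of the subsequence seen here
        self.pw = 1.0          # weighted probability
        self.children = {}     # keyed by the next older context bit

def update(node, context, bit):
    if context:                # descend to the deepest context node first
        child = node.children.setdefault(context[0], Node())
        update(child, context[1:], bit)
    a, b = node.counts
    node.pe *= (node.counts[bit] + 0.5) / (a + b + 1)     # KT sequential update
    node.counts[bit] += 1
    if context:
        p0 = node.children[0].pw if 0 in node.children else 1.0
        p1 = node.children[1].pw if 1 in node.children else 1.0
        node.pw = 0.5 * node.pe + 0.5 * p0 * p1
    else:
        node.pw = node.pe

D = 3
bits = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
root = Node()
for t in range(D, len(bits)):
    update(root, bits[t-D:t][::-1], bits[t])              # most recent bit first
print("ideal code length ~", round(-log2(root.pw), 2), "bits")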

19 of 40

Prediction idea 3 - Context Tree Weighting

Pw(s | x_{t-2}=1, x_{t-1}=0) * Pw(s | x_{t-2}=0, x_{t-1}=0) is a cross product, for example

Pe(s|M3') = Pe(s|M2) * Pe(s|M3)

(Figure: a context tree with node parameters θ_011, θ_010, θ_00; the sequence probability factors into per-node terms, e.g. P(x1) P(x4) P(x2) P(x5) = (1-θ_011)(1-θ_010) * θ_00 * θ_00)

20 of 40

Prediction idea 3 - Context Tree Weighting

Also note that, with the weights defined this way, they sum to 1

The second equality comes from induction

21 of 40

Prediction idea 3 - Context Tree Weighting

Effectively, the weight of a tree is the product of the weights at its nodes

22 of 40

Prediction idea 3 - Context Tree Weighting

Let the weights w be a function of the depth d. One way is to give all trees equal weight; since there are double exponentially many trees, this assigns every tree the same code length, log2(#trees)

A better way is to set all weights to ½, which amounts to describing trees in their most natural form via recursion:
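One plausible such recursive code (an assumed concrete scheme, not necessarily the talk's): spend one bit per node to say whether it is a leaf or splits further, so the code length equals the number of nodes:

# Natural recursive tree code: "0" = leaf, "1" + left code + right code = internal.
def encode_tree(node):
    if node is None:                     # a leaf costs exactly 1 bit
        return "0"
    left, right = node
    return "1" + encode_tree(left) + encode_tree(right)

leaf_only = None
small = ((None, None), None)             # 5 nodes in total
print(encode_tree(leaf_only), len(encode_tree(leaf_only)))   # 0 -> 1 bit
print(encode_tree(small), len(encode_tree(small)))           # 11000 -> 5 bits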

23 of 40

Prediction idea 3 - Context Tree Weighting

If we set all weights to ½ via the recursive coding scheme, the code length is #nodes instead of log2(#trees)

In other words, we gain efficiency by weighting trees in inverse proportion to their complexity

Occam's razor: The simpler the explanation, the more plausible it is

24 of 40

Prediction idea 3 - Context Tree Weighting

Let L be the length of the compressed string, S be the entropy of the source data

As n ->∞

  • Lempel-Ziv is as good as using the true finite state machine
  • , Serap Savari 1997

Lempel-Ziv is optimal in the sense that per symbol redundancy approaches 0. However, it is not truly optimal as the rate that it converges is not the fastest.

25 of 40

Prediction idea 3 - Context Tree Weighting

Rissanen lower bound:

If the maximum likelihood estimates of the source distribution P(x; θ1, θ2, ..., θk) satisfy the Central Limit Theorem, then

length = (data entropy) + (cost of learning) + (cost of inferior estimator)

  • data entropy: minimized by arithmetic coding
  • cost of learning: on the order of log(n) bits per degree of freedom (the Rissanen lower bound)
  • cost of inferior estimator: we should strive to make this 0; CTW achieves this

26 of 40

Prediction idea 3 - Context Tree Weighting

https://github.com/fumin/ctw

Gettysburg address:

Four score and seven years ago … government of the people, by the people, for the people, shall not perish from the earth

Sizes in bytes:

  • Original: 1463
  • tar.gz: 993
  • 7z: 908
  • zip: 874
  • CTW: 772

27 of 40

Compression => Intelligence

Intelligence: find patterns in the data

  • Discover phases in matter
  • Group light curves into their respective generating celestial objects

Given a compressor, how to use it to measure the similarity of two objects?

28 of 40

Compression => Intelligence

Let K(x) be the compressed length of string x, and K(xy) the compressed length of the concatenation of x and y

  • K(xy) = K(yx)
  • K(y|x) = K(xy) - K(x)
  • K(x|x) = 0
  • d(x, y), built from these conditional lengths, is the difference per bit from x to y (a concrete sketch follows below)
  • d(x, y) is symmetric
  • d(x, y) <= d(x, z) + d(z, y)
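As a concrete sketch, here is the normalized compression distance, one common way to turn these quantities into a similarity measure (using zlib to stand in for K; the talk's exact d may be defined differently):

# Normalized compression distance with zlib standing in for the compressor K.
import zlib

def K(data: bytes) -> int:
    return len(zlib.compress(data, 9))            # compressed length in bytes

def ncd(x: bytes, y: bytes) -> float:
    kx, ky, kxy = K(x), K(y), K(x + y)
    return (kxy - min(kx, ky)) / max(kx, ky)      # ~0 for similar, ~1 for unrelated

a = b"gtttatgtagcttacccccttaa" * 20
b = b"gttaatgtagcttaaattatcaaa" * 20
print(ncd(a, a[:-10]), ncd(a, b))                 # near-duplicate vs different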

29 of 40

Mammalian evolution

Conflict among individual Mitochondrial Proteins in Resolving the Phylogeny of Eutherian Orders

Although the overall evidence strongly suggests ((primates, ferungulates), rodents), the ND1 data clearly support another tree, ((primates, rodents), ferungulates)...

30 of 40

Mammalian evolution

Which tree is correct?

Given mitochondrial DNA:

Chimpanzee: gtttatgtagcttacccccttaa...

Pig: gttaatgtagcttaaattatcaaa...

31 of 40

Mammalian evolution

Common approaches:

  • Align the sequences and remove gaps (by eye) so that the alignment is valid
  • Consider only the first and second positions of the codons
  • Choose a rate model and estimate its parameters

This approach: Use a compressor on the raw sequence

32 of 40

(Figure: phylogenetic trees built from the compression distance, one using gzip and one using CTW)

33 of 40

(Figure: phylogenetic tree built with CTW)

34 of 40

SARS virus phylogeny

Severe acute respiratory syndrome (SARS) is a disease caused by the SARS coronavirus

The outbreak in 2003 caused 774 deaths with a fatality rate of 9.6%

There is no vaccine

Understanding SARS virus evolution history is key to its prevention and treatment

35 of 40

SARS virus phylogeny

  • Human coronavirus 229E (HCoV-229E), common cold
  • Porcine epidemic diarrhea virus (PEDV), pig disease
  • Transmissible gastroenteritis virus (TGEV), pig disease
  • Bovine coronavirus (BCoV), cow disease
  • Murine hepatitis virus (MHV), mouse disease
  • Avian infectious bronchitis virus (IBV), chicken respiratory disease
  • Canine coronavirus (CCoV), dog disease
  • Feline coronavirus (FCoV), cat disease
  • Porcine respiratory coronavirus (PRCoV), pig disease
  • Human coronavirus OC43 (HCoV-OC43), common cold
  • Porcine hemagglutinating encephalomyelitis virus (HEV), pig brain disease
  • Rat coronavirus (RtCoV)

36 of 40

SARS virus phylogeny

Input: amino acid sequences of the spike protein, the coronavirus's key apparatus for penetrating host cells

MFIFLLFLTLTSGSDLDRCTTFDDVQA...

Common approach:

  • Assume the states are single amino acids and count their statistics from a database to form a score matrix
  • The distance is the sum of the scores after using dynamic programming to insert gaps (sketched below)
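A toy sketch of that dynamic-programming step (Needleman–Wunsch-style), with a made-up match/mismatch score standing in for a real database-derived score matrix:

# Global alignment score with gaps; the scores below are illustrative only.
def align_score(a, b, gap=-2, match=1, mismatch=-1):
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]    # dp[i][j]: best score of a[:i] vs b[:j]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i-1] == b[j-1] else mismatch
            dp[i][j] = max(dp[i-1][j-1] + s,      # align a[i-1] with b[j-1]
                           dp[i-1][j] + gap,      # gap in b
                           dp[i][j-1] + gap)      # gap in a
    return dp[n][m]

print(align_score("MFIFLLFLTL", "MFILLFLTL"))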

37 of 40

SARS virus phylogeny

This approach: compute the compression distance d(x, y) with CTW

Findings:

  • Coronaviruses form several groups
    • Group 1: TGEV, PEDV,…
    • Group 2: BCoV, MHV,...
  • SARS virus is distinct from other groups

38 of 40

(Figure: coronavirus phylogeny tree, with Group 1 and Group 2 forming separate clusters and SARS apart from both)

39 of 40

Conclusion

  • Higher intelligence leads to tighter compression
    • In theory, measured in terms of closeness to the Rissanen bound
    • In practice, closeness to the human lower bound of 1 bit per character
  • A good compressor such as CTW evinces high intelligence in unsupervised clustering tasks
  • Occam's razor, one of physicists' best tools, is just as useful in the field of AI

Details of experiments can be found here

40 of 40

Quiz

The BARF compressor will… eventually compress any file to 0 bytes

What is wrong here???