1 of 63

CS61B

Lecture 38: Compression

  • Prefix Free Codes
  • Huffman Coding
  • Theory of Compression
  • LZW (Extra)
  • Lossy Compression (Extra)

2 of 63

Zip Files, How Do They Work?

(File sizes in bytes appear in the fifth column of the ls -l output below.)

$ zip mobydick.zip mobydick.txt

adding: mobydick.txt (deflated 59%)

$ ls -l

-rw-rw-r-- 1 jug jug 643207 Apr 24 10:55 mobydick.txt

-rw-rw-r-- 1 jug jug 261375 Apr 24 10:55 mobydick.zip

File is unchanged by zipping / unzipping.

$ unzip mobydick.zip

replace mobydick.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: r

new name: unzipped.txt

inflating: unzipped.txt

$ diff mobydick.txt unzipped.txt

$

3 of 63

Compression Model #1: Algorithms Operating on Bits

In a lossless algorithm we require that no information is lost.

  • Text files often compressible by 70% or more.

Bitstream B: 01010101000001010101110...
  → Compression Algorithm C →
Compressed bits C(B): 1001010101...

Compressed bits C(B): 1001010101...
  → Decompression Algorithm C⁻¹ →
Bitstream B: 01010101000001010101110...

4 of 63

Prefix Free Codes

5 of 63

Increasing Optimality of Coding

By default, English text is usually represented by sequences of characters, each 8 bits long, e.g. ‘d’ is 01100100.

Easy way to compress: Use fewer than 8 bits for each letter.

  • Have to decide which bit sequences should go with which letters.
  • More generally, we’d say which codewords go with which symbols.

word   binary                        hexadecimal
dog    01100100 01101111 01100111   64 6f 67

6 of 63

More Code: Mapping Alphanumeric Symbols to Codewords

Example: Morse code.

  • Goal: Compact representation.
  • What is – – • – – •?

7 of 63

More Code: Mapping Alphanumeric Symbols to Codewords

Example: Morse code.

  • Goal: Compact representation.
  • What is – – • – – •? It’s ambiguous!
    • MEME
    • GG

Note:

  • Can think of dot as 0, dash as 1.
  • Operators pause between codewords to avoid ambiguity.
    • Pause acts as a 3rd symbol.

Alternate strategy: Avoid ambiguity by making code prefix free.

8 of 63

Morse Code (as a Tree)

[Figure: Morse code drawn as a binary tree, with dot (0) and dash (1) as the two branch directions.]

9 of 63

Prefix-Free Codes [Example 1]

A prefix-free code is one in which no codeword is a prefix of any other. Example for English:

Symbol   Codeword
space    1
E        01
T        001
A        0001
O        00001
I        000001
...

I ATE: 0000011000100101 (000001 · 1 · 0001 · 001 · 01)

[Figure: the code drawn as a trie rooted at “start”. At each level, the 1 edge leads to a leaf (space, E, T, A, O, I, ...) and the 0 edge descends one level deeper.]

10 of 63

Prefix-Free Codes [Example 2]

A prefix-free code is one in which no codeword is a prefix of any other. Example for English:

Symbol   Codeword
space    111
E        010
T        1101
A        1011
O        1001
I        1000
...

I ATE: 100011110111101010 (1000 · 111 · 1011 · 1101 · 010)

[Figure: the same code drawn as a bushier, more balanced trie rooted at “start”, with each codeword tracing a path of 0s and 1s to a leaf.]
11 of 63

Prefix Free Code Design

Observation: Some prefix-free codes are better for some texts than others.

Observation: It’d be useful to have a procedure that calculates the “best” code for a given text.

Code 1: space=1, E=01, T=001, A=0001, O=00001, I=000001, ...

  • Better for EEEEAT (8+3+4 = 15 bits).
  • Much worse for JOSH (25+5+8+10 = 48 bits).

Code 2: space=111, E=010, T=1101, A=1011, O=1001, I=1000, ...

  • Worse for EEEEAT (12+4+4 = 20 bits).
  • Better for JOSH (7+4+6+6 = 23 bits).

12 of 63

Code Calculation Approach #1 (Shannon-Fano Coding)

  • Count relative frequencies of all characters in a text.
  • Split into ‘left’ and ‘right halves’ of roughly equal frequency.
    • Left half gets a leading zero. Right half gets a leading one.
    • Repeat.

Symbol   Frequency
我       0.35
爸       0.17
是       0.17
李       0.16
刚       0.15

Left half: 我, 爸 (total 0.52). Right half: 是, 李, 刚 (total 0.48).
(35% of all characters are 我.)
13 of 63

Code Calculation Approach #1 (Shannon-Fano Coding)

  • Count relative frequencies of all characters in a text.
  • Split into ‘left’ and ‘right halves’ of roughly equal frequency.
    • Left half gets a leading zero. Right half gets a leading one.
    • Repeat.

Symbol   Frequency   Code
我       0.35        0...
爸       0.17        0...
是       0.17        1...
李       0.16        1...
刚       0.15        1...

[Figure: the root of the code tree, with a 0 edge toward the left half and a 1 edge toward the right half.]
14 of 63

Code Calculation Approach #1 (Shannon-Fano Coding)

  • Count relative frequencies of all characters in a text.
  • Split into ‘left’ and ‘right halves’ of roughly equal frequency.
    • Left half gets a leading zero. Right half gets a leading one.
    • Repeat.

Symbol   Frequency   Code
我       0.35        00
爸       0.17        01
是       0.17        1...
李       0.16        1...
刚       0.15        1...

[Figure: the left half has now been split: 我 = 00, 爸 = 01.]
16 of 63

Code Calculation Approach #1 (Shannon-Fano Coding)

  • Count relative frequencies of all characters in a text.
  • Split into ‘left’ and ‘right halves’ of roughly equal frequency.
    • Left half gets a leading zero. Right half gets a leading one.
    • Repeat.

Symbol   Frequency   Code
我       0.35        00
爸       0.17        01
是       0.17        10
李       0.16        11...
刚       0.15        11...

[Figure: the right half has been split once: 是 = 10, while 李 and 刚 share the prefix 11.]
17 of 63

Code Calculation Approach #1 (Shannon-Fano Coding)

  • Count relative frequencies of all characters in a text.
  • Split into ‘left’ and ‘right halves’ of roughly equal frequency.
    • Left half gets a leading zero. Right half gets a leading one.
    • Repeat.

Symbol   Frequency   Code
我       0.35        00
爸       0.17        01
是       0.17        10
李       0.16        110
刚       0.15        111

[Figure: the completed code tree: one split at the root, one more split on each side, with 李 and 刚 at depth 3.]
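
To make the procedure concrete, here is a minimal Java sketch of Shannon-Fano construction (not the lecture’s code; all names are made up for illustration). It assumes the symbols arrive sorted by descending frequency and greedily picks the split point that best balances the two halves:

import java.util.*;

/** A minimal sketch of Shannon-Fano code construction (hypothetical names).
 *  Assumes symbols are sorted by descending frequency. */
public class ShannonFano {
    static void assign(List<Character> syms, Map<Character, Double> freq,
                       String prefix, Map<Character, String> code) {
        if (syms.size() == 1) {
            code.put(syms.get(0), prefix);
            return;
        }
        // Choose the split that makes the two halves closest in total weight.
        double total = 0;
        for (char c : syms) { total += freq.get(c); }
        int split = 1;
        double acc = 0;
        double bestDiff = Double.POSITIVE_INFINITY;
        for (int i = 0; i < syms.size() - 1; i += 1) {
            acc += freq.get(syms.get(i));
            double diff = Math.abs((total - acc) - acc);
            if (diff < bestDiff) { bestDiff = diff; split = i + 1; }
        }
        assign(syms.subList(0, split), freq, prefix + "0", code);           // left half
        assign(syms.subList(split, syms.size()), freq, prefix + "1", code); // right half
    }

    public static void main(String[] args) {
        Map<Character, Double> freq = new LinkedHashMap<>();
        freq.put('我', 0.35); freq.put('爸', 0.17); freq.put('是', 0.17);
        freq.put('李', 0.16); freq.put('刚', 0.15);
        Map<Character, String> code = new LinkedHashMap<>();
        assign(new ArrayList<>(freq.keySet()), freq, "", code);
        System.out.println(code);  // {我=00, 爸=01, 是=10, 李=110, 刚=111}
    }
}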

18 of 63

Efficiency Assessment: http://yellkey.com/??

How many bits per symbol do we need to compress a file with the character frequencies listed below using the Shannon-Fano code that we created?

Symbol   Frequency   S-F Code
我       0.35        00
爸       0.17        01
是       0.17        10
李       0.16        110
刚       0.15        111

A. (2*3 + 3*2) / 5 = 2.4 bits per symbol

B. (0.35 + 0.17 + 0.17) * 2 + (0.16 + 0.15) * 3 = 2.31 bits per symbol

C. Not enough information, we need to know the exact characters in the file being compressed.

19 of 63

Efficiency Assessment of Shannon-Fano Coding

How many bits per symbol do we need to compress a file with the character frequencies listed below using the Shannon-Fano code that we created?

B. (0.35 + 0.17 + 0.17) * 2 + (0.16 + 0.15) * 3 = 2.31 bits per symbol.

Symbol   Frequency   S-F Code
我       0.35        00
爸       0.17        01
是       0.17        10
李       0.16        110
刚       0.15        111

Example assuming we have 100 symbols:

  • 35 * 2 = 70 bits
  • 17 * 2 = 34 bits
  • 17 * 2 = 34 bits
  • 16 * 3 = 48 bits
  • 15 * 3 = 45 bits

Total: 231 bits

231 / 100 = 2.31 bits/symbol

20 of 63

Efficiency Assessment of Shannon-Fano Coding

If we had a file with 350 我 characters, 170 爸 characters, 170 是 characters, 160 李 characters, and 150 刚 characters, how many total bits would we need to encode this file using 32-bit Unicode? Using our Shannon-Fano code?

You don’t need a calculator.

Symbol   Frequency   S-F Code
我       0.35        00
爸       0.17        01
是       0.17        10
李       0.16        110
刚       0.15        111

2.31 bits per symbol for texts with this distribution.

21 of 63

Efficiency Assessment of Shannon-Fano Coding

If we had a file with 350 我 characters, 170 爸 characters, 170 是 characters, 160 李 characters, and 150 刚 characters, how many total bits would we need to encode this file using 32-bit Unicode? Using our Shannon-Fano code?

1000 total characters.

Space used:

  • 32 bit Unicode: 32,000 bits.
  • S-F code: 2,310 bits.

Our code is about 14 times as efficient (32,000 / 2,310 ≈ 13.9)!

  • Can only encode strings with these 5 symbols.

Symbol   Frequency   S-F Code
我       0.35        00
爸       0.17        01
是       0.17        10
李       0.16        110
刚       0.15        111

2.31 bits per symbol for texts with this distribution.

22 of 63

Code Calculation Approach #1 (Shannon-Fano Coding)

Shannon-Fano coding is NOT optimal. Does a good job, but possible to find ‘better’ codes (see CS170).

Symbol   Frequency   Code
我       0.35        00
爸       0.17        01
是       0.17        10
李       0.16        110
刚       0.15        111

[Figure: the corresponding Shannon-Fano code tree.]
23 of 63

Code Calculation Approach #2: Huffman Coding

As before, calculate relative frequencies.

  • Assign each symbol to a node with weight = relative frequency.
  • Take the two smallest nodes and merge them into a super node with weight equal to sum of weights.
  • Repeat until everything is part of a tree.

[Figure: Huffman construction on the frequencies 0.35, 0.17, 0.17, 0.16, 0.15.
 Step 1: merge the two smallest, 0.16 and 0.15, into a supernode of weight 0.31 (edges labeled 0 and 1). Remaining weights: 0.35, 0.17, 0.17, 0.31.
 Step 2: merge 0.17 and 0.17 into 0.34. Remaining weights: 0.35, 0.31, 0.34.]
24 of 63

Code Calculation Approach #2: Huffman Coding

As before, calculate relative frequencies.

  • Assign each symbol to a node with weight = relative frequency.
  • Take the two smallest nodes and merge them into a super node with weight equal to sum of weights.
  • Repeat until everything is part of a tree.

[Figure, continued:
 Step 3: merge the two smallest, 0.31 and 0.34, into 0.65. Remaining weights: 0.35, 0.65.
 Step 4: merge 0.35 and 0.65 into the root. Reading the 0/1 edge labels from root to leaf gives each symbol’s codeword.]
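
Here is a minimal Java sketch of this construction (hypothetical names; the optional textbook’s Huffman.java differs in details). A priority queue hands us the two smallest nodes at each step:

import java.util.PriorityQueue;

/** A minimal sketch of Huffman tree construction (hypothetical names). */
public class HuffmanBuilder {
    static class Node implements Comparable<Node> {
        double weight;       // relative frequency (or raw count)
        char symbol;         // meaningful only for leaves
        Node left, right;    // null for leaves

        Node(char symbol, double weight) { this.symbol = symbol; this.weight = weight; }
        Node(Node left, Node right) {     // supernode: weight is the sum of its children
            this.left = left; this.right = right;
            this.weight = left.weight + right.weight;
        }
        public int compareTo(Node o) { return Double.compare(weight, o.weight); }
    }

    /** Repeatedly merges the two smallest nodes until one tree remains. */
    static Node buildTrie(char[] symbols, double[] weights) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (int i = 0; i < symbols.length; i += 1) {
            pq.add(new Node(symbols[i], weights[i]));
        }
        while (pq.size() > 1) {
            Node a = pq.poll();   // smallest
            Node b = pq.poll();   // second smallest
            pq.add(new Node(a, b));
        }
        return pq.poll();         // root of the Huffman tree
    }

    public static void main(String[] args) {
        Node root = buildTrie(new char[] {'我', '爸', '是', '李', '刚'},
                              new double[] {0.35, 0.17, 0.17, 0.16, 0.15});
        System.out.println(root.weight);  // 1.0
    }
}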

25 of 63

Huffman vs. Shannon-Fano

Shannon-Fano code below results in an average of 2.31 bits per symbol, whereas Huffman is only 2.3 bits per symbol.

  • Huffman coded file is 0.35*1 + 0.65*3 = 2.3 bits per symbol.

Symbol   Frequency   S-F Code   Huffman Code
我       0.35        00         0
爸       0.17        01         100
是       0.17        10         101
李       0.16        110        110
刚       0.15        111        111

Strictly better than Shannon-Fano coding. There is NO downside to Huffman coding instead.

26 of 63

Prefix-Free Codes

Question: For encoding (bitstream to compressed bitstream), what is a natural data structure to use? Assume characters are of type Character, and bit sequences are of type BitSequence.

Code 1: space=1, E=01, T=001, A=0001, O=00001, I=000001, ...

  I ATE: 0000011000100101

Code 2: space=111, E=010, T=1101, A=1011, O=1001, I=1000, ...

  I ATE: 100011110111101010

27 of 63

Prefix-Free Codes

Question: For encoding (bitstream to compressed bitstream), what is a natural data structure to use? chars are just integers, e.g. ‘A’ = 65.

  • An array of type BitSequence[]; to retrieve a codeword, use the character itself as the index (a short sketch appears below).
  • How is this different from a HashMap<Character, BitSequence>? Lookup in a hashmap consists of:
    • Compute hashCode.
    • Mod by number of buckets.
    • Look in a linked list.

Compared to HashMaps, arrays are faster (just get the item from the array) and use more memory only if some characters in the alphabet are unused.

I ATE: 0000011000100101

I ATE: 100011110111101010
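
As a concrete illustration of the array-based lookup, here is a small sketch. BitSequence is replaced by plain Strings of '0'/'1' to keep it self-contained, and the codewords are Example 2’s:

/** A minimal sketch of table-driven encoding: the codeword table is an
 *  array indexed directly by character value (chars are just integers). */
public class Encoder {
    public static void main(String[] args) {
        String[] codewords = new String[128];   // one slot per ASCII character
        codewords[' '] = "111";
        codewords['E'] = "010";
        codewords['T'] = "1101";
        codewords['A'] = "1011";
        codewords['O'] = "1001";
        codewords['I'] = "1000";

        StringBuilder out = new StringBuilder();
        for (char c : "I ATE".toCharArray()) {
            out.append(codewords[c]);           // e.g. 'A' = 65 indexes slot 65
        }
        System.out.println(out);                // 100011110111101010
    }
}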

28 of 63

Prefix-Free Codes

Question: For decoding (compressed bitstream back to bitstream), what is a natural data structure to use?

(The two example codes and their encodings of I ATE are repeated from the previous slides.)

29 of 63

Prefix-Free Codes

Question: For decoding (compressed bitstream back to bitstream), what is a natural data structure to use?

  • We need to lookup longest matching prefix, an operation that Tries excel at.

(Same two example codes and I ATE encodings as before.)
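
A minimal sketch of trie-based decoding (hypothetical Node class; the codewords are Example 1’s). Each bit walks one edge; reaching a leaf emits a symbol and restarts at the root, which is exactly a repeated longest-prefix match:

/** A minimal sketch of prefix-free decoding by walking a binary trie. */
public class Decoder {
    static class Node {
        char symbol;             // meaningful only at leaves
        Node zero, one;          // children for bits 0 and 1
        boolean isLeaf() { return zero == null && one == null; }
    }

    /** Inserts one codeword path into the trie. */
    static void put(Node root, String code, char symbol) {
        Node cur = root;
        for (char bit : code.toCharArray()) {
            if (bit == '0') {
                if (cur.zero == null) { cur.zero = new Node(); }
                cur = cur.zero;
            } else {
                if (cur.one == null) { cur.one = new Node(); }
                cur = cur.one;
            }
        }
        cur.symbol = symbol;
    }

    /** Walks from the root one bit at a time, emitting a symbol at each leaf. */
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.zero : cur.one;
            if (cur.isLeaf()) {
                out.append(cur.symbol);
                cur = root;      // restart at the root for the next codeword
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = new Node();
        put(root, "1", ' '); put(root, "01", 'E'); put(root, "001", 'T');
        put(root, "0001", 'A'); put(root, "00001", 'O'); put(root, "000001", 'I');
        System.out.println(decode(root, "0000011000100101"));  // I ATE
    }
}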

30 of 63

Huffman Coding in Practice

31 of 63

Huffman Compression

Two possible philosophies for using Huffman Compression:

  1. For each input type (English text, Chinese text, images, Java source code, etc.), assemble huge numbers of sample inputs for that category. Use each corpus to create a standard code for English, Chinese, etc.
  2. For every possible input file, create a unique code just for that file. Send the code along with the compressed file.

What are some advantages/disadvantages of each idea? Which is better?

$ java HuffmanEncodePh1 ENGLISH mobydick.txt

$ java HuffmanEncodePh1 BITMAP horses.bmp

$ java HuffmanEncodePh2 mobydick.txt

$ java HuffmanEncodePh2 horses.bmp

32 of 63

Huffman Compression (Your Answers)

Two possible philosophies for using Huffman Compression:

  • Build one corpus per input type.
  • For every possible input file, create a unique code just for that file. Send the code along with the compressed file.

What are some advantages/disadvantages of each idea? Which is better?

  • #2 has the advantage that the code is tailor-made and thus more efficient, e.g. for unusual books like Gadsby, which contains no letter ‘e’.
  • #2 has the advantage that it feels kind of encrypted?
  • #2 requires additional work to build the codeword table.
  • #2 has the disadvantage that the code itself takes up space, which is what we care about.
  • #1 lets you have one translator, while #2 takes many translators (though sending the code along means you really just need one decoder, so not a big deal).

33 of 63

Huffman Compression

Two possible philosophies for using Huffman Compression:

  • For each input type (English text, Chinese text, images, Java source code, etc.), assemble huge numbers of sample inputs for that category. Use each corpus to create a standard code for English, Chinese, etc.
  • For every possible input file, create a unique code just for that file. Send the code along with the compressed file.

In practice, Philosophy 2 is used in the real world.

34 of 63

Huffman Compression Example [Demo Link]

Given input text: 我我刚刚刚是我是我刚李刚我李是爸李爸李是李我我李刚是我是刚爸是刚我爸我李是是李是我我刚爸是李我我我是爸我是我爸是我爸是我是刚我是爸刚爸我刚我我刚爸我我爸我刚爸爸李李李李我我爸李我我刚爸李我我李我爸我我

Step 1: Count frequencies.

Step 2: Build encoding array and decoding trie.

Step 3: Write decoding trie to output.huf.

Step 4: Write codeword for each symbol to output.huf.

Output bits: 010101010101001…00111111111101…

[Figure: the decoding trie built from the symbol frequencies 0.35, 0.17, 0.17, 0.16, 0.15, and the corresponding codeword table.]

See the optional textbook for how to write the decoding trie to a file.

35 of 63

Huffman Decompression Example [Demo Link]

Given input bitstream: 010101010101001…00111111111101

Step 1: Read in decoding trie.

Step 2: Use codeword bits to walk down the trie, outputting symbols every time you reach a leaf.

  • Note: Symbols are really just bits!
    • 我 is 111001101000100010010001 in UTF-8 (three bytes: e6 88 91).
    • “Outputting 我” actually means outputting these 24 bits.

Output symbols: 我刚刚刚是

  • Output bits: 111001101000100010010001...

[Figure: the same decoding trie and codeword table as on the previous slide.]

36 of 63

Huffman Coding Summary

Given a file X.txt that we’d like to compress into X.huf:

  • Consider each b-bit symbol (e.g. 8-bit chunks, Unicode characters, etc.) of X.txt, counting occurrences of each of the 2^b possibilities, where b is the size of each symbol in bits.
  • Use Huffman code construction algorithm to create a decoding trie and encoding map. Store this trie at the beginning of X.huf.
  • Use encoding map to write codeword for each symbol of input into X.huf.

To decompress X.huf:

  • Read in the decoding trie.
  • Repeatedly use the decoding trie’s longestPrefixOf operation until all bits in X.huf have been converted back to their uncompressed form.

See Huffman.java for an example implementation on 8-bit symbols.
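
One common way to store the decoding trie at the front of the file is a preorder traversal: a 1 bit marks a leaf (followed by the symbol’s bits), a 0 bit marks an internal node. The sketch below uses hypothetical names and models bits as characters for simplicity; the optional textbook’s Huffman.java uses a similar scheme with real bit I/O:

import java.util.Iterator;

/** A minimal sketch of serializing/deserializing a decoding trie by preorder walk. */
public class TrieIO {
    static class Node {
        char symbol; Node zero, one;
        boolean isLeaf() { return zero == null && one == null; }
    }

    static void writeTrie(Node x, StringBuilder out) {
        if (x.isLeaf()) {
            out.append('1');                            // leaf marker
            String s = Integer.toBinaryString(x.symbol);
            out.append("0".repeat(16 - s.length())).append(s);  // symbol as 16 fixed-width bits
        } else {
            out.append('0');                            // internal-node marker
            writeTrie(x.zero, out);
            writeTrie(x.one, out);
        }
    }

    static Node readTrie(Iterator<Character> bits) {
        Node x = new Node();
        if (bits.next() == '1') {                       // leaf: read the 16 symbol bits
            int sym = 0;
            for (int i = 0; i < 16; i += 1) { sym = (sym << 1) | (bits.next() - '0'); }
            x.symbol = (char) sym;
        } else {                                        // internal node: read both subtries
            x.zero = readTrie(bits);
            x.one = readTrie(bits);
        }
        return x;
    }
}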

37 of 63

Compression Algorithms (General)

The big idea in Huffman Coding is representing common symbols with small numbers of bits.

Many other approaches, e.g.

  • Run-length encoding: Replace each run of a repeated character by the character followed by its run length (a tiny sketch appears after this list).
    • Rough idea: XXXXXXXXXYYYYXXXXX -> X9Y4X5
  • LZW: Search for common repeated patterns in the input. See extra slides.

General idea: Exploit redundancy and existing order inside the sequence.

  • Sequences with no existing redundancy or order may actually get enlarged.
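
A tiny sketch of the run-length idea on character strings (illustrative only; a real bit-level RLE would store run lengths as fixed-width binary counts):

/** A minimal sketch of run-length encoding over characters. */
public class RunLength {
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int j = i;
            while (j < s.length() && s.charAt(j) == s.charAt(i)) { j += 1; }
            out.append(s.charAt(i)).append(j - i);   // symbol, then run length
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("XXXXXXXXXYYYYXXXXX"));  // X9Y4X5
    }
}

Note that a one-character run becomes two characters, so inputs without runs get larger, matching the point above.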

38 of 63

Compression Theory

39 of 63

Comparing Compression Algorithms

Different compression algorithms achieve different compression ratios on different files.

We’d like to try to compare them in some nice way.

  • To do this, we’ll need to refine our model from slide 3 to be a bit more sophisticated.

Let’s start with a straightforward puzzle.

40 of 63

SuperZip

Suppose an algorithm designer says their algorithm SuperZip can compress any bitstream by 50%. Why is this impossible?

41 of 63

Universal Compression: An Impossible Idea

Argument 1: If true, they’d be able to compress any bitstream down to a single bit. Interpreter would have to be able to do the following (impossible) task for ANY output sequence.

01010101000001010101110 → 101010100001 → 111001 → 101 → 00 → 1

(Each arrow is one application of the claimed compression algorithm. To decompress, the interpreter would have to recover any original sequence from a single bit.)
42 of 63

Universal Compression: An Impossible Idea

Argument 2: There are far fewer short bitstreams than long ones. Guaranteeing compression even once by 50% is impossible. Proof:

  • There are 2^1000 1000-bit sequences.
  • There are only 1 + 2 + 4 + ... + 2^500 = 2^501 - 1 bit streams of length ≤ 500.
  • In other words, you have 2^1000 things and only 2^501 - 1 places to put them.
  • Of our 1000-bit inputs, only roughly 1 in 2^499 can be compressed by 50%!

43 of 63

A Sneaky Situation

Universal compression is impossible, but the following example implies that comparing compression algorithms could still be quite difficult.

Suppose we write a special purpose compression algorithm that simply hardcodes small bit sequences into large ones.

  • Example: represent GameOfThronesSeason6-Razor1911-Rip-Episode1.mp4 as 010.

Compressed bits: 010 (3 bits) → Decompression Algorithm C⁻¹ → 00000000001111111111... = GameOfThronesSeason6-Razor1911-Rip-Episode1.mp4 (8,927,472,560 bits)
44 of 63

A Sneaky Situation

Suppose we write a special purpose compression algorithm that simply hardcodes small bit sequences into large ones.

  • Example: represent GameOfThronesSeason6-Razor1911-Rip-Episode1.mp4 as 010.

To avoid this sort of trickery, we should include the bits needed to encode the decompression algorithm itself.

Compressed bits: 010 (3 bits) → Decompression Algorithm C⁻¹ → GameOfThronesSeason6-Razor1911-Rip-Episode1.mp4 (8,927,472,560 bits)

Counting the decompression algorithm itself (which hardcodes the movie), the total is 8,927,472,707 bits.

45 of 63

Compression Model 2: Self-Extracting Bits

As a model for the decompression process, let’s treat the algorithm and the compressed bitstream as a single sequence of bits.

  • If you want a concrete idea to hold on to, imagine storing the compressed bitstream as a byte[] variable in a .java file. We’ll show an example on the coming slides involving compressing an image.
  • Can think of the algorithm + compressed bitstream as an input to an interpreter. The interpreter somehow executes those bits (see 61A).
    • At the very “bottom” of these abstractions is some kind of physical machine (see 61C).

SelfExtractingGoT.java (8,927,472,707 bits): 0111000001110101... → Interpreter → GameOfThronesSeason6-Razor1911-Rip-Episode1.mp4 (8,927,472,560 bits): 0100001001001101...

46 of 63

HugPlant

Huffman Coding can be used to compress any data, not just text. In bitmap format, the plant below is simply the stream of bits shown on the right.

42 4d 7a 00 10 00 00 00 00 00 7a 00 00 00 6c 00 00 00 00 02 00 00 00 02 00 00 01 00 20 00 03 00 00 00 00 00 10 00 12 0b 00 00 12 0b 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 00 ff 00 00 ff 00 00 00 00 00 00 ff 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 01 00 ...

Total: 8,389,584 bits (the original uncompressed bitstream B).

47 of 63

HugPlant Compressed

00 20 00 f4 c3 b7 6d c2 31 24 92 dc 24 a7 c9 25 ae 24 b5 c4 85 88 40 be c4 92 46 25 79 2f c4 af 25 f8 92 49 24 92 64 c9 92 49 30 b1 24 92 49 24 2c 49 24 92 49 0b 12 49 24 92 42 c4 92 49 24 92 49 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ...

Decoding trie: 2,560 bits. Image data: 1,991,464 bits. Total: 1,994,024 bits.

Another excerpt of the compressed bits:

74 68 65 20 70 61 73 73 63 6F 64 65 20 69 73 20 68 75 67 39 31 38 32 37 78 79 7A 2E 65 75 7a c0
09 eb cd d4 2a 55 9f d8 98 d1 4e e7 97 56 58 68
0c 7a 43 dd 80 00 7b 11 58 f4 75 73 77 bc 26 01
e0 92 28 ef 47 24 66 9b de 8b 25 04 1f 0e 87 bd
87 9e 03 c9 f1 cf ad fa 82 dc 9f a1 31 b5 79 13
9b 95 d5 63 26 8b 90 5e d5 b0 17 fb e9 c0 e6 53
c7 cb dd 5f 77 d3 bd 80 f9 b6 5e 94 aa 74 34 3a
a9 c1 ca e6 b8 9c 60 ab 36 3b a5 8a b4 3a 5c 5a
62 e9 2f 16 4c 34 60 6e 51 28 36 2c e7 4e 50 be
c0 15 1b 01 d9 c0 bd b4 20 87 42 be d4 e2 23 a2
b6 84 22 4c cf 74 cd 4f 23 06 54 e6 c2 0f 2d bd
e5 81 f4 c6 de 15 59 f1 68 a4 a5 88 16 b0 7f bf
8a 1d 98 bd 33 b4 d5 71 22 93 81 af e0 cc ce 12
57 23 62 3a e4 3d 8c f1 12 8d a5 40 3b 70 d6 9b
12 49 62 8d 6f d4 52 f6 7f d5 11 7c ca 07 dd e3
dc 1c 7f c4 a4 69 77 6e 5e 60 db 5a 69 01 95 c8
d7 2e 57 62 b7 8e 5c 51 f9 70 55 1b 7c ba 68 bc

Huffman.java’s compress() produced this 1,994,024-bit file from the original 8,389,584-bit HugPlant.bmp shown on the previous slide.

48 of 63

Opening a .huf File

Of course, major operating systems have no idea what to do with a .huf file.

  • Have to send over the 43,400 bits of Huffman.java code as well.
  • Total size (including .java file): 2,037,424 bits.

49 of 63

Or an Even Simpler View

To keep things conceptually simpler, let’s package the compressed data plus decoder into a single self-extracting .java file.

  • Bitstream on the left generates bitstream on the right.

70 75 62 6c 69 63 20 63 6c 61 73 73 20 53 65 6c 66 45 78 74 72 61 63 74 69 6e 67 48 75 67 50 6c 61 6e 74 20 7b 0d 0a ... 74 68 65 20 70 61 73 73 63 6F 64 65 20 69 73 20 68 75 67 39 31 38 32 37 78 79 7A 2E 65 75 7a c0 09 eb cd d4 2a 55 9f d8 98 d1 4e e7 97 56 58 68 0c 7a 43 dd 80 00 7b 11 58 f4 75 73 77 bc 26 01 e0 92 28 ef 47 24 66 9b de 8b 25 04 1f 0e 87 bd 87 9e 03 c9 f1 cf ad fa 82 dc 9f a1 31 b5 79 13 9b 95 d5 63 26 8b 90 5e d5 b0 17 fb e9 c0 e6 53 c7 cb dd 5f 77 d3 bd 80 f9 b6 5e 94 aa 74 34 3a ... 00 20 00 f4 c3 b7 6d c2 31 24 92 dc 24 a7 c9 25 ae 24 b5 c4 85 88 40 be c4 92 46 25 79 2f c4 af 25 f8 92 49 24 92 64 c9 92 49 c4 92 49 24 92 49 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ...

The bits above are SelfExtractingHugPlant.java (2,037,424 bits); interpreting them generates the bits below:

42 4d 7a 00 10 00 00 00 00 00 7a 00 00 00 6c 00 00 00 00 02 00 00 00 02 00 00 01 00 20 00 03 00 00 00 00 00 10 00 12 0b 00 00 12 0b 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 00 ff 00 00 ff 00 00 00 00 00 00 ff 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff...

The bits above are HugPlant.bmp (8,389,584 bits).

50 of 63

Compression Model #2: Self-Extracting Bits

As a model for the decompression process, let’s treat the algorithm and the compressed bitstream as a single sequence of bits.

  • We’ve now seen an example: SelfExtractingHugPlant.

Will discuss the implications of this model next time.

SelfExtractingHugPlant.java (2,037,424 bits): 0111000001110101... → Interpreter → HugPlant.bmp (8,389,584 bits): 0100001001001101...

51 of 63

LZW Style Compression (Extra)

52 of 63

Thought Experiment

How might we compress the following bitstreams (underlines for emphasis only)?

  • B=”aababcabcdabcdeabcdefabcdefgabcdefgh”?
  • B=”abababababababababababababababab”?
  • B=”aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa”?

53 of 63

The LZW Approach

Key idea: Each codeword represents multiple symbols.

  • Start with ‘trivial’ codeword table where each codeword corresponds to one ASCII symbol.
  • Every time a codeword X is used, record a new codeword Y corresponding to X concatenated with the next symbol.

Demo Example: http://goo.gl/68Dncw
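
Below is a minimal Java sketch of the simplified LZW from lecture (hypothetical code, not the demo’s implementation): 7-bit ASCII inputs, codewords capped at 8 bits, and no policy for a full table. The lecture suggests a trie for the longest-prefix lookup; a HashMap over substrings is used here for brevity:

import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;

/** A minimal sketch of simplified LZW compression. */
public class LZWCompress {
    static String compress(String input) {
        Map<String, Integer> table = new HashMap<>();
        for (int c = 0; c < 128; c += 1) {      // trivial table: one codeword per ASCII symbol
            table.put("" + (char) c, c);
        }
        int nextCode = 128;
        StringJoiner out = new StringJoiner(" ");
        int i = 0;
        while (i < input.length()) {
            // Find the longest prefix of the remaining input that is in the table.
            int j = i + 1;
            while (j < input.length() && table.containsKey(input.substring(i, j + 1))) { j += 1; }
            out.add(Integer.toString(table.get(input.substring(i, j))));
            // Record a new codeword: the matched string plus the next symbol.
            if (j < input.length() && nextCode < 256) {
                table.put(input.substring(i, j + 1), nextCode);
                nextCode += 1;
            }
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(compress("abababab"));
        // 97 98 128 130 98   (i.e. a, b, ab, aba, b)
    }
}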

54 of 63

LZW

Named for inventors Lempel, Ziv, Welch.

  • Related algorithm used as a component in many compression tools, including .gif files, .zip files, and more.
  • Once a hated algorithm because of attempts to enforce licensing fees. Patent expired in 2003.

Our version in lecture is simplified, for example:

  • Assumed inputs were ≤ 0x7f (7 bit input) and also provided 8 bit outputs (real LZW can have variable length outputs).
  • Didn’t say what happens when table is full (many variants exist).

55 of 63

LZW

Neat fact: You don’t have to send the codeword table along with the compressed bitstream.

  • Possible to reconstruct codeword table from C(B) alone.

LZW decompression example:

http://goo.gl/fdYU9C
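
A matching decompression sketch (same simplified scheme as the compressor above), rebuilding the codeword table from the codes alone. The one subtle case is a code that refers to the table entry currently being built, handled in the else branch:

import java.util.ArrayList;
import java.util.List;

/** A minimal sketch of simplified LZW decompression. */
public class LZWDecompress {
    static String decompress(int[] codes) {
        List<String> table = new ArrayList<>();
        for (int c = 0; c < 128; c += 1) { table.add("" + (char) c); }

        StringBuilder out = new StringBuilder();
        String prev = table.get(codes[0]);
        out.append(prev);
        for (int i = 1; i < codes.length; i += 1) {
            String entry;
            if (codes[i] < table.size()) {
                entry = table.get(codes[i]);
            } else {
                // The code refers to the entry we are about to add:
                // it must be prev plus prev's first character.
                entry = prev + prev.charAt(0);
            }
            out.append(entry);
            table.add(prev + entry.charAt(0));   // same rule the compressor used
            prev = entry;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decompress(new int[] {97, 98, 128, 130, 98}));  // abababab
    }
}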

56 of 63

Side Trip: Lossy Compression

57 of 63

Lossy Compression

Most media formats lose information during compression process:

  • .JPEG images.
  • .MP3 audio.
  • .MP4 video.

Why?

  • MP4 video: 1920 x 1080 pixels, 60 times per second, 24 bits per pixel: 0.37 gigabytes per second, 1,343 gigabytes per hour.
  • Downloading a two-hour movie at that rate: roughly 30 days at 1 MB/second.

58 of 63

Lossy Compression

Basic idea: Throw away information that human sensory system doesn’t care about.

Examples:

  • Audio: High frequencies.
  • Video: Subtle gradations of color (low frequencies).

See EE20 (or perhaps 16A/16B?) for more.

59 of 63

Summary

Compression: Make common bitstreams more compact.

Huffman coding:

  • Represents common symbols as codewords with fewer bits.
  • Uses something like Map<Character, BitSeq> for compression.
  • Uses something like TrieMap<Character> for decompression.

LZW:

  • Represents multiple symbols with a single codeword.
  • Uses something like TrieMap<Integer> for compression.
  • Uses something like Map<Integer, SymbolSeq> for decompression (codeword back to symbol sequence).

60 of 63

Slides Moved to Next Lecture

61 of 63

TANSTAAFL: There ain’t no such thing as a free lunch.

Compression can be thought of as a ‘remapping’ of bitstreams.

  • For every bitstream that has a “short” representation, some other bitstream must have a “long” representation under C.

SelfExtractingHugPlant.java (2,037,424 bits): 0111000001110101... → Interpreter → HugPlant.bmp (8,389,584 bits): 0100001001001101...

62 of 63

Even Better Compression

A compression ratio of 25% is certainly very impressive, but we can do much better. MysteryX achieves a 0.35% compression ratio!

  • Of the 2^8389584 possible bit streams of length 8,389,584, only one in 2^8360151 can be generated by our computer using an input of length 29,432 bits.

MysteryX (29,432 bits): 0010000001110000... → Interpreter → HugPlant.bmp (8,389,584 bits): 0100001001001101...

For comparison: SelfExtractingHugPlant.java (2,037,424 bits): 0111000001110101... → Interpreter → HugPlant.bmp (8,389,584 bits).

Anybody want to guess what this is?

63 of 63

MysteryX: HugPlant.java

MysteryX is just HugPlant.java, the piece of code that I used to generate the .bmp file originally.

Interesting question:

  • Given a target bitstream B, what is the shortest bitstream CB that outputs B?
  • More next time...�

CB: HugPlant.java (29,432 bits): 0010000001110000... → Interpreter → B: HugPlant.bmp (8,389,584 bits): 0100001001001101...

The beginning of HugPlant.java in hex (it decodes to “import java.awt.Color; ... public class HugPlant { //send email to prize@joshh.ug to receive your prize ...”):

69 6d 70 6f 72 74 20 6a 61 76 61 2e 61 77 74 2e
43 6f 6c 6f 72 3b 0a 0a 0a 70 75 62 6c 69 63 20
63 6c 61 73 73 20 48 75 67 50 6c 61 6e 74 20 7b
0a 09 2f 2f 73 65 6e 64 20 65 6d 61 69 6c 20 74
6f 20 70 72 69 7a 65 40 6a 6f 73 68 68 2e 75 67
20 74 6f 20 72 65 63 65 69 76 65 20 79 6f 75 72
20 70 72 69 7a 65 0a 0a 09 70 72 69 76 61 74 65
20 73 74 61 74 69 63 20 64 6f 75 62 6c 65 20 73
63 61 6c 65 46 61 63 74 6f 72 3d 32 30 2e 30 3b
0a 0a 09 70 72 69 76 61 74 65 20 73 74 61 74 69
63 20 69 6e 74 20 67 65 6e 43 6f 6c 6f 72 56 61
6c 75 65 28 69 6e 74 20 6f 6c 64 56 61 6c 2c