1 of 70

CSCI 3280

Introduction to Multimedia Systems

(2026 Term 2)

Computer Science & Engineering

The Chinese University of Hong Kong

2 of 70

Announcement

  • Final project:

1) presentation: April 16 (by group)

2) project report: due on April 26

3 of 70

Data Compression (1)

4 of 70

Media We Have Learned

5 of 70

Storage Size for Media Types

6 of 70

Compression

  • We can view compression as an encoding method that transforms a sequence of bits into another representation that takes less space than the input.

  • The output data can be transformed back to the input representation without any loss (lossless compression) or with acceptable loss (lossy compression).

7 of 70

Redundancy

  • How can we represent the same information with fewer bits?

  • Compression is made possible by exploiting redundancy in source data:

For example, redundancy found in:

- Character distribution

- Character repetition

- High-usage pattern

- Positional redundancy (these categories overlap to some extent)

8 of 70

Redundancy (2)

  • Character Distribution

- Some characters are used more frequently than others.

- In English text, ‘e’ and space occur most often.

- Example: “abacabacabacabacabacabacabac”

- Technique: use variable length encoding.

  • For your interest:

- In 1838, Samuel Morse and Alfred Vail designed the “Morse Code” for telegraph.

- Each letter of the alphabet is represented by dots (short pulses of electric current) and dashes (pulses three times the duration of a dot).

- e.g., E => “.” Z => “--..”

- Morse’s principle: assign short code strings to frequently occurring letters and longer code to letters that occur less often.

9 of 70

Redundancy (3)

  • Character Repetition

- When a string of repetitions of a single character occurs, e.g., “aaaaaaaaaaaaaabc”

- Example: long strings of spaces/tabs in forms/tables.

- Example: in graphics, it is common to have a line of pixels using the same color, e.g. fax

- Technique: Run-length encoding.

  • High Usage Patterns

- Certain patterns (e.g., sequences of characters) occur very often.

- Example: “qu”, “th”, ...

- Technique: Use a single special character to represent a common sequence of characters.

10 of 70

Redundancy (4)

  • Positional Redundancy

- Certain things appear persistently at predictable places.

- Example: the left-uppermost pixel of my slides is always white.

- Technique: make a prediction; if it is wrong, transmit only the “differences”.

11 of 70

Categories of Compression Techniques

  • One way to classify a compression method is to see whether it takes the source semantics into consideration.

  • Entropy encoding

- Regards the data stream as a simple digital sequence; its semantics are ignored.

- Lossless compression.

  • Source encoding

- Takes into account the semantics of the data and the characteristics of the specific medium to achieve a higher compression ratio.

- Lossless or lossy compression.

12 of 70

Categories of Compression Techniques (2)

13 of 70

Run-Length Encoding

  • Image, video and audio data streams often contain sequences of the same byte, e.g. background areas in images, and silence in sound files.
  • Substantial compression is achieved by replacing repeated byte sequences with the number of occurrences.
  • If a byte occurs at least 4 consecutive times, the sequence is encoded as the byte followed by a special flag and the number of its occurrences. e.g.,

- Uncompressed: ABCCCCCCCCDEFGGGHI

- Compressed: ABC!4DEFGGGHI

  • If 4 Cs (CCCC) are encoded as C!0, 5 Cs as C!1, and so on, then repetitions of 4 to 259 bytes compress into only 3 bytes.
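The scheme above can be sketched in a few lines. This is a sketch only: “!” is assumed to be the flag byte, and escaping a literal “!” in the input (the escape problem discussed on the next slide) is left out.

```python
def rle_encode(data: str) -> str:
    """Runs of n >= 4 identical bytes become: byte, "!", chr(n - 4)."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 259:
            j += 1
        run = j - i
        if run >= 4:
            out.append(data[i] + "!" + chr(run - 4))  # 3-byte encoding
        else:
            out.append(data[i] * run)                 # short runs stay literal
        i = j
    return "".join(out)

def rle_decode(data: str) -> str:
    out = []
    i = 0
    while i < len(data):
        if i + 2 < len(data) and data[i + 1] == "!":  # flagged run
            out.append(data[i] * (ord(data[i + 2]) + 4))
            i += 3
        else:                                         # literal byte
            out.append(data[i])
            i += 1
    return "".join(out)
```

With the slide's example, "ABCCCCCCCCDEFGGGHI" (18 bytes) becomes 13 bytes: the 8 Cs collapse to C-!-count, while the 3 Gs stay literal because a run of 3 would gain nothing.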

14 of 70

Run-Length Encoding (2)

  • Problems:

- Ineffective with text

Q: Is RLE useful with “ccc”? (No: a run of 3 gains nothing when the encoding itself takes 3 bytes.)

- Need an escape byte

OK with text, but in binary data any byte value (including the flag) may occur.

Solution?

  • Used in “Kermit” and CCITT Group 3 fax compression.

15 of 70

Variable-Length Code

  • Key: Use fewer bits for frequently used symbols and more bits for rarely used symbols
  • This approach obviously reduces the total data size.
  • Problems:

- How to assign codes?

- Can we uniquely decipher the code?

16 of 70

Prefix Code

  • The idea is to translate the symbols of the input data into variable-length codes.
  • We wish to encode each symbol into a sequence of 0s and 1s so that no symbol’s code is a prefix of the code of any other symbol -- the prefix condition.
  • Hence, no special delimiter is needed.
  • With the prefix property, we can uniquely decode a sequence of 0s and 1s into a sequence of characters.
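A tiny illustration of why the prefix condition suffices (the three codes below are assumed for illustration, not taken from the slides): a greedy left-to-right scan can commit to a symbol the moment a full codeword is read, with no delimiters and no backtracking.

```python
# A prefix-free code: no codeword is a prefix of another.
codes = {"a": "0", "b": "10", "c": "11"}
decode_table = {v: k for k, v in codes.items()}

def decode(bits: str) -> str:
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in decode_table:      # a complete codeword matched: emit it
            out.append(decode_table[buf])
            buf = ""
    return "".join(out)
```

For example, "abca" encodes to 0·10·11·0 = "010110", and the greedy scan recovers "abca" unambiguously.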

17 of 70

Huffman Code

  • A simple method that generates a kind of prefix code: the Huffman code.
  • Used in many compression programs, e.g., pkzip, arj.
  • Algorithm (bottom-up approach):

- Assume the probabilities (frequencies) of symbols used are known

- Label each node with its corresponding probability and put all nodes into the candidate list.

- Pick two nodes with smallest probabilities, create a parent to connect them and label this parent with the sum of probabilities of these two children.

- Put the parent node into the candidate list.

- Repeat the last two steps until only one node (the root) remains in the candidate list.

- Assign 0 and 1 to the left and right branches of the tree.
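The bottom-up steps above can be sketched with a binary heap as the candidate list. A sketch only: the probabilities in the usage note are assumed, and tie-breaking is arbitrary, so the exact codes (though not the optimal code lengths) may differ from the slides.

```python
import heapq
from itertools import count

def huffman_codes(probs: dict) -> dict:
    tick = count()  # tie-breaker so the heap never compares tree nodes
    # Candidate list: (probability, tie-breaker, node); a leaf is its symbol.
    heap = [(p, next(tick), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Pick the two nodes with smallest probabilities ...
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        # ... and create a parent labelled with the sum of their probabilities.
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))
    root = heap[0][2]
    codes = {}
    def assign(node, prefix):
        if isinstance(node, tuple):        # internal node
            assign(node[0], prefix + "0")  # 0 on the left branch
            assign(node[1], prefix + "1")  # 1 on the right branch
        else:
            codes[node] = prefix or "0"    # degenerate one-symbol source
    assign(root, "")
    return codes
```

For the assumed distribution {a: 0.4, b: 0.3, c: 0.2, d: 0.1} the resulting code lengths are 1, 2, 3, 3, giving an average of 1.9 bits per symbol.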

18 of 70

Huffman Code (2)

  • Example 1: 5 symbols are used in the source.
  • Code assignment may not be unique.

19 of 70

Huffman Code (3)

  • More example 2:

20 of 70

Huffman Code (Advantages)

  • Decoding is trivial as long as the coding table is sent before the data. Overhead is relatively negligible if the data file is big.
  • Uniquely decipherable:

- No code is a prefix to any other code. Forward scanning is never ambiguous. Good for decoder.

  • Average bits per symbol (expected number of code symbols per source symbol) for the previous examples:

- Ex 1:

- Ex 2:

  • Optimal for instantaneous coding (coding without buffering)

21 of 70

Huffman Code (Disadvantages)

  • A table is needed that lists each input symbol and its corresponding code.

  • Need to know the character frequency distribution in advance => need two passes over the data.

  • More seriously, it does not exploit the coherence between symbols: you cannot group a set of symbols and output one single code for them, e.g. the pattern “the” is very common in English.

22 of 70

Shannon-Fano Algorithm

  • Similar to Huffman code
  • Algorithm (top-down approach):

- 1. Sort symbols according to their probabilities.

- 2. Recursively divide the symbols into two halves, each with a roughly equal sum of probabilities.

  • Result:
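The two steps can be sketched as follows. A sketch only: the frequencies in the usage note are assumed for illustration, and “roughly equal” is read here as choosing the cut that minimizes the imbalance between the two halves.

```python
def shannon_fano(probs: dict) -> dict:
    symbols = sorted(probs, key=probs.get, reverse=True)  # 1. sort by probability
    codes = {s: "" for s in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(probs[s] for s in group)
        # 2. find the cut that makes the two halves' sums as equal as possible
        acc, best_cut, best_diff = 0.0, 1, float("inf")
        for i, s in enumerate(group[:-1], start=1):
            acc += probs[s]
            diff = abs(acc - (total - acc))
            if diff < best_diff:
                best_cut, best_diff = i, diff
        for s in group[:best_cut]:
            codes[s] += "0"
        for s in group[best_cut:]:
            codes[s] += "1"
        split(group[:best_cut])       # recurse on each half
        split(group[best_cut:])

    split(symbols)
    return codes
```

Raw counts work as well as probabilities, since only the relative sums matter: the assumed counts {A: 15, B: 7, C: 6, D: 6, E: 5} yield codes 00, 01, 10, 110, 111.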

23 of 70

Shannon-Fano Algorithm (2)

  • Average bits per symbol

  • How well can we compress?
  • Let’s measure it against the lower bound on the average bits per symbol.

24 of 70

How Small Can We Compress?

  • Entropy is a measure of uncertainty or randomness.
  • According to Shannon (the father of information theory), the entropy of an information source S with probability distribution {pj} is defined as:

H(S) = − Σj pj log2(pj)

- where pj is the probability that symbol Sj in S will occur.

  • So what?
  • The average codeword length l of a binary code is always at least as great as the source entropy, i.e. l ≥ H(S)

25 of 70

Entropy

  • The entropy is the lower bound for lossless compression: when the probability of each symbol is fixed, each symbol must be represented with at least H bits on average.
  • Proof: let lj be the length of the codeword for symbol Sj

26 of 70

Entropy (2)

  • Q: How about an image in which half of the pixels are white, another half are black?
  • Ans: p(white) = 0.5, p(black) = 0.5, so the entropy is 1 bit per pixel.

  • Let’s check Example 2 in the slides of Huffman code: {pj} = {0.15, 0.4, 0.15, 0.1, 0.1, 0.05, 0.04, 0.01}

  • Again, the average codeword length is at least the entropy H.
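As a numerical check (a sketch, not part of the slides), computing H(S) = −Σ pj log2(pj) for the Example 2 distribution quoted above gives about 2.48 bits per symbol, which is the floor no lossless code for this source can beat on average.

```python
from math import log2

def entropy(probs) -> float:
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero terms."""
    return -sum(p * log2(p) for p in probs if p > 0)

# The Example 2 distribution from the Huffman slides.
p = [0.15, 0.4, 0.15, 0.1, 0.1, 0.05, 0.04, 0.01]
H = entropy(p)   # about 2.48 bits per symbol
```

The half-white, half-black image from the same slide gives entropy([0.5, 0.5]) = 1 bit per pixel, matching the answer above.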

27 of 70

Universal Lossless Compression

  • Isn’t Huffman code optimal? Why bother with others?
  • By observation, there are still gaps between entropy and the average length of Huffman codes. Why?
  • Problem 1: The statistics have to be known in advance
  • Problem 2: Group of symbols cannot be coded as one codeword to further reduce average code length.
  • Huffman code is optimal as an instantaneous code (no buffering is allowed).
  • A coding scheme is universal if it succeeds in compressing the data down to the entropy rate of the source despite having no a priori knowledge of the source statistics.

28 of 70

Lempel-Ziv Algorithms

  • Proposed by Abraham Lempel and Jacob Ziv
  • Variants:

- LZ77, proposed in 1977, used in pkzip, gzip

- LZ78, proposed in 1978

- LZW, improved by Terry A. Welch in 1984, used in compress, GIF. It is patented by Unisys Corporation.

- LZY, developed by Yokoo in 1992

  • Probably the most widely used lossless compression methods.
  • All approach H (the entropy bound) when the input data size is large.
  • No data statistics are needed in advance.
  • Universal

29 of 70

Lempel-Ziv Algorithms (2)

  • Motivation:

- Huffman code cannot encode multiple input symbols with a single codeword, yet many patterns (e.g. “the”, “of”, “an” in English) occur frequently. Can we do something about them?

  • Maybe we can build a dictionary of all frequently seen patterns and encode each with its dictionary index.
  • But how can we build the dictionary if we have no prior knowledge of the data?
  • LZ methods build the dictionary (or table or tree) on-the-fly as the input symbols are encoded.
  • Let’s illustrate the idea by introducing LZ78 and then LZW.

30 of 70

LZ78

  • Let’s consider an input source with only two symbols “a” and “b”. S = {a, b}
  • The basic idea is to construct a LZ78 tree. (“dictionary”)
  • Since the alphabet has only two symbols, the LZ78 tree is binary in this case.
  • The tree starts as a single root node “R”. It grows as more input data enter.
  • Let’s further assume the input data stream is: a a a b a b b a a a a a b a b a a b a a b a a a b a a b
  • LZ78 will break the stream into phrases, each of which is then encoded by a codeword (“dictionary index”): a, a a, b, a b, b a, a a a, a b a, b a a, b a a b, a a a b, a a b
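The phrase parsing above can be sketched as follows. A sketch only: emitting each phrase as a pair (index of the matched parent phrase, extra symbol), with index 0 for the root, is one standard way to represent LZ78 codewords; the slides leave the exact codeword format to the figures.

```python
def lz78_encode(stream: str):
    dictionary = {}          # phrase -> index; this is the "LZ78 tree"
    output, phrase = [], ""
    for sym in stream:
        if phrase + sym in dictionary:
            phrase += sym    # keep extending the maximal match
        else:
            # New phrase: register it, emit (parent index, fresh symbol).
            dictionary[phrase + sym] = len(dictionary) + 1
            output.append((dictionary.get(phrase, 0), sym))
            phrase = ""
    if phrase:               # input ended inside a known phrase
        output.append((dictionary[phrase], ""))
    return output, dictionary
```

Running it on the example stream reproduces exactly the eleven phrases listed above, in the same order.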

31 of 70

LZ78 (2)

[Figures: the LZ78 tree grows step by step as each phrase of the example stream is parsed and encoded.]

44 of 70

LZ78 (3)

  • Each phrase is constructed by looking up the LZ78 tree to find the maximal match, starting from the root node.
  • Output the codeword of the node where the match terminates.
  • Output the last (unmatched) symbol of the phrase in uncompressed form.
  • What do you find?
  • The matched strings become longer and longer as more symbols are entered.
  • It does not work well for short input strings like the example.
  • Asymptotically, the compression ratio approaches H (the entropy bound).
  • The previous slides show how encoding works; how about decoding?
  • Have we sent the LZ78 tree to the decoder?
  • Yes, we implicitly sent the LZ78 tree.
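Decoding can be sketched the same way, assuming the codewords arrive as (parent index, extra symbol) pairs with 0 for the root (an assumed representation; the slides show the process only in the figures). Each pair tells the decoder exactly which node to extend, so it regrows the identical tree, which is why the tree never needs to be transmitted explicitly.

```python
def lz78_decode(pairs) -> str:
    phrases = [""]                     # index 0 is the empty root phrase
    out = []
    for index, sym in pairs:
        phrase = phrases[index] + sym  # parent phrase + fresh symbol
        phrases.append(phrase)         # grow the tree exactly as the encoder did
        out.append(phrase)
    return "".join(out)
```

Feeding it the pairs corresponding to the example's eleven phrases reconstructs the original stream.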

45 of 70

LZ78 (4)

[Figures: the decoder rebuilds the same LZ78 tree step by step while decoding the received codewords.]

52 of 70

LZW

  • In fact, LZ78 is already very good.
  • But we send the extra symbol in uncompressed form.
  • Can we do something even better?
  • This motivates the development of LZW.
  • The initial LZW tree contains the root, an a-descendant and a b-descendant, i.e. 3 nodes instead of 1 in LZ78.
  • The encoder updates the LZW tree by adding the node corresponding to a one-symbol extension of the current phrase (look-ahead).
  • The extra symbol is not part of the current phrase.
  • The decoder will deduce this mystery extra symbol.
  • Let’s reuse the previous example.
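The look-ahead scheme can be sketched as follows over the two-symbol alphabet (a sketch; the table-based form below is equivalent to the tree on the slides). The “mystery extra symbol” shows up in the decoder as the one tricky case: a received codeword that refers to the table entry still being built, which must then be the last string extended by its own first symbol.

```python
def lzw_encode(stream: str):
    table = {"a": 0, "b": 1}          # initial tree: root plus two children
    out, phrase = [], ""
    for sym in stream:
        if phrase + sym in table:
            phrase += sym
        else:
            out.append(table[phrase])
            table[phrase + sym] = len(table)  # look-ahead: one-symbol extension
            phrase = sym                      # extra symbol starts the next phrase
    out.append(table[phrase])
    return out                                # codewords only, nothing uncompressed

def lzw_decode(codes):
    table = {0: "a", 1: "b"}
    last = table[codes[0]]            # the "Last string" variable
    out = [last]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:                         # tricky case: codeword not yet in the table
            entry = last + last[0]
        table[len(table)] = last + entry[0]   # deduce the encoder's extension
        out.append(entry)
        last = entry
    return "".join(out)
```

The shortest input that triggers the tricky case is "aaa": it encodes to [0, 2], where codeword 2 is exactly the entry the decoder is still building.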

53 of 70

LZW (2)

[Figures: step-by-step LZW encoding of the example stream, with look-ahead nodes added to the tree.]

60 of 70

LZW (3)

OK, try to encode the rest yourself.

  • Note the output contains only codewords; no symbol is sent in uncompressed form.
  • Let’s see how we can decode. This will be a little bit tricky.
  • Again we start with an LZW tree containing 3 nodes.
  • Let’s use a string variable “Last string” to memorize the last decoded string.

61 of 70

LZW (4)

[Figures: step-by-step LZW decoding using the “Last string” variable to rebuild the tree.]

69 of 70

LZW (5)

  • It seems the decoder is very tricky.
  • But the algorithm is very simple again.

70 of 70

LZW: Pros and Cons

  • LZW (advantages)

- Effective at exploiting:

character repetition redundancy

high-usage pattern redundancy

** but not positional redundancy

- An adaptive algorithm

- One-pass procedure

  • LZW (disadvantages)

- Message must be sufficiently long for effective compression