CSCI 3280
Introduction to Multimedia Systems
(2026 Term 2)
Computer Science & Engineering
The Chinese University of Hong Kong
Announcement
1) presentation: April 16 (by group)
2) project report: due on April 26
Data Compression (1)
Media We have Learned
Storage Size for Media Types
Compression
Redundancy
For example, redundancy found in:
- Character distribution
- Character repetition
- High-usage pattern
- Positional redundancy (these categories overlap to some extent)
Redundancy (2)
- Some characters are used more frequently than others.
- In English text, ‘e’ and space occur most often.
- Example: “abacabacabacabacabacabacabac”
- Technique: use variable length encoding.
- In 1838, Samuel Morse and Alfred Vail designed the “Morse Code” for telegraph.
- Each letter of the alphabet is represented by dots (electric current of short duration) and dashes (three times a dot's duration).
- e.g., E => “.” Z => “--..”
- Morse’s principle: assign short code strings to frequently occurring letters and longer code to letters that occur less often.
Redundancy (3)
- When a string of repetitions of a single character occurs, e.g., “aaaaaaaaaaaaaabc”
- Example: long strings of spaces/tabs in forms/tables.
- Example: in graphics, it is common to have a line of pixels using the same color, e.g. fax
- Technique: Run-length encoding.
- Certain patterns (e.g., sequences of characters) occur very often.
- Example: “qu”, “th”, ...
- Technique: Use a single special character to represent a common sequence of characters.
Redundancy (4)
- A certain thing appears persistently at a predictable place(s).
- Example: the left-uppermost pixel of my slides is always white.
- Technique: let the decoder make a prediction; if the prediction is wrong, send only the “differences”.
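The “predict, then send only the differences” idea can be sketched as simple delta coding, where each value is predicted to equal the previous one (a minimal illustration, not from the slides):

```python
def delta_encode(samples):
    """Predict each value as the previous one; store only the differences."""
    out = [samples[0]]  # first value is sent as-is (no prediction possible)
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - prev)
    return out

def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out
```

When neighbouring values are similar (as in smooth images or audio), the differences cluster near zero and compress well with a variable-length code.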
Categories of Compression Techniques
- Regard the data stream as a simple digital sequence; its semantics are ignored.
- Lossless compression.
- Take into account the semantics of data and the characteristics of specific medium to achieve a higher compression ratio.
- Lossless or lossy compression.
Categories of Compression Techniques (2)
Run-Length Encoding
- Uncompressed: ABCCCCCCCCDEFGGGHI
- Compressed: ABC!8DEFGGGHI
Run-Length Encoding (2)
- Ineffective when runs are short.
- Q: Is RLE useful with “ccc”? No: “c!3” is no shorter than “ccc”.
- Need an escape byte to mark runs:
- OK with text (pick a character that never occurs)
- Problematic with binary data, where every byte value may appear
- Solution? Escape the escape byte itself (e.g., double it) when it occurs in the data.
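The escape-byte scheme above can be sketched as follows, assuming the slide's convention that a run of n copies of c is written as c + “!” + n, and that short runs are left alone (a minimal illustration):

```python
ESC = "!"  # escape marker; a run of n copies of c becomes c + ESC + n

def rle_encode(text, min_run=4):
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        run = j - i
        # Only escape runs long enough to pay for the extra output characters.
        out.append(f"{text[i]}{ESC}{run}" if run >= min_run else text[i] * run)
        i = j
    return "".join(out)

def rle_decode(code):
    out, i = [], 0
    while i < len(code):
        c = code[i]
        if i + 1 < len(code) and code[i + 1] == ESC:
            k = i + 2
            while k < len(code) and code[k].isdigit():  # read the run length
                k += 1
            out.append(c * int(code[i + 2:k]))
            i = k
        else:
            out.append(c)
            i += 1
    return "".join(out)
```

This sketch assumes `ESC` never appears in the input; a production coder would escape it as discussed above.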
Variable-Length Code
- How to assign codes?
- Can we uniquely decipher the code?
Prefix Code
Huffman Code
- Assume the probabilities (frequencies) of symbols used are known
- Label each node with its corresponding probability and put all nodes into the candidate list.
- Pick two nodes with smallest probabilities, create a parent to connect them and label this parent with the sum of probabilities of these two children.
- Put the parent node into the candidate list.
- Repeat the last two steps until one node (root) left in candidate list.
- Assign 0 and 1 to left and right branches of the tree in the candidate list.
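The build steps above can be sketched with a binary heap as the candidate list (a minimal illustration, not the lecture's reference code):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Greedily pair the two least-probable nodes until one root remains."""
    freq = Counter(text)
    # Heap entries: (weight, tiebreak, tree); a tree is a symbol or (left, right).
    heap = [(w, i, s) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)  # two smallest probabilities
        w2, _, t2 = heapq.heappop(heap)
        n += 1
        heapq.heappush(heap, (w1 + w2, n, (t1, t2)))  # parent = sum of children
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")  # 0 on the left branch
            walk(tree[1], prefix + "1")  # 1 on the right branch
        else:
            codes[tree] = prefix or "0"  # single-symbol edge case
    _, _, root = heap[0]
    walk(root, "")
    return codes
```

For input `"aaaabbc"` the frequent symbol `a` gets a 1-bit code while `b` and `c` get 2-bit codes, and no code is a prefix of another.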
Huffman Code (2)
Huffman Code (3)
Huffman Code (Advantages)
- No code is a prefix to any other code. Forward scanning is never ambiguous. Good for decoder.
Huffman Code (Disadvantages)
Shannon-Fano Algorithm
- 1. Sort symbols according to their probabilities.
- 2. Recursively divide symbols into 2 half-sets, each with roughly equal sum of probabilities.
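The two steps above can be sketched as follows (a minimal illustration; the split point is chosen so the two halves' probability sums are as equal as possible):

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, probability/frequency) pairs."""
    symbols = sorted(symbols, key=lambda sp: sp[1], reverse=True)  # step 1
    codes = {}
    def split(group, prefix):
        if len(group) == 1:
            codes[group[0][0]] = prefix or "0"
            return
        total = sum(p for _, p in group)
        acc, cut, best = 0.0, 1, None
        # Find the cut that makes the two halves' sums most nearly equal.
        for i in range(1, len(group)):
            acc += group[i - 1][1]
            diff = abs(2 * acc - total)
            if best is None or diff < best:
                best, cut = diff, i
        split(group[:cut], prefix + "0")   # step 2, recursively
        split(group[cut:], prefix + "1")
    split(symbols, "")
    return codes
```

On the classic frequency set {A:15, B:7, C:6, D:6, E:5} this yields A=00, B=01, C=10, D=110, E=111.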
Shannon-Fano Algorithm (2)
How small can we compress?
- Entropy of the source S gives the lower bound on average code length:
- H(S) = − Σj pj log2 pj   (bits per symbol)
- where pj is the probability that symbol Sj in S will occur.
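The entropy bound can be computed directly from the symbol probabilities (a minimal illustration):

```python
from math import log2

def entropy(probs):
    """H(S) = -sum(p_j * log2(p_j)): the lower bound, in bits per symbol,
    on the average code length of any lossless code for this source."""
    return -sum(p * log2(p) for p in probs if p > 0)
```

A fair coin source needs 1 bit per symbol; four equally likely symbols need 2 bits; a certain symbol carries 0 bits.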
Entropy
Entropy (2)
Universal Lossless Compression
Lempel-Ziv Algorithms
- LZ77, proposed in 1977, used in pkzip, gzip
- LZ78, proposed in 1978
- LZW, improved by Terry A. Welch in 1984, used in compress and GIF. It was patented by Unisys Corporation (the patent has since expired).
- LZY, developed by Yakoo in 1992
Lempel-Ziv Algorithms (2)
- Huffman coding cannot encode multiple input symbols with a single codeword. Yet many multi-symbol patterns (e.g., “the”, “of”, “an” in English) occur frequently. Can we exploit them?
LZ78
LZ78 (2)
LZ78 (3)
LZ78 (4)
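The phrase-dictionary idea behind LZ78 can be sketched as follows (a minimal illustration, emitting (dictionary index, next character) pairs, with index 0 for the empty string):

```python
def lz78_encode(text):
    """Grow the longest phrase already in the dictionary, then emit
    (index of that phrase, the character that extended it)."""
    dictionary = {"": 0}
    out, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch               # keep extending the match
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)  # learn the new phrase
            phrase = ""
    if phrase:  # flush a trailing phrase that is already in the dictionary
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

def lz78_decode(pairs):
    """Rebuild the same dictionary on the fly: entry = phrases[index] + char."""
    phrases, out = [""], []
    for idx, ch in pairs:
        phrase = phrases[idx] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

Both sides build identical dictionaries from the output itself, so no dictionary needs to be transmitted.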
LZW
LZW (2)
LZW (3)
OK, try to encode the rest yourself.
LZW (4)
LZW (5)
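A minimal sketch of the LZW coder, assuming an initial dictionary of the 256 single-byte characters; the decoder handles the corner case where a code is used before the decoder has finished adding it:

```python
def lzw_encode(text):
    """Output only dictionary codes; grow the dictionary with
    (matched string + next character)."""
    dictionary = {chr(i): i for i in range(256)}
    out, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                      # keep extending the match
        else:
            out.append(dictionary[phrase])    # emit code for longest match
            dictionary[phrase + ch] = len(dictionary)
            phrase = ch
    if phrase:
        out.append(dictionary[phrase])
    return out

def lzw_decode(codes):
    dictionary = {i: chr(i) for i in range(256)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # Corner case: the encoder can use a code one step before the
        # decoder has defined it (the cScSc pattern).
        entry = dictionary[code] if code in dictionary else prev + prev[0]
        out.append(entry)
        dictionary[len(dictionary)] = prev + entry[0]
        prev = entry
    return "".join(out)
```

Unlike LZ78, the output carries codes only (no literal characters), because every single character is already in the initial dictionary.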
LZW: Pros and Cons
- Effective at exploiting:
- character repetition redundancy
- high-usage pattern redundancy
- but not positional redundancy
- An adaptive algorithm
- One-pass procedure
- Message must be sufficiently long for effective compression