1 of 34

Project description 2026

2 of 34

Project specification TOR III

  • Mandatory – 2 to 3 students form a group (2 students in a group for projects A-M and 3 students in a group for projects N-Z).

  • Use Google for material on LZW compression scheme and for other techniques as well, though developing own code might be easier.

  • Project should be completed and handed in by the end of the semester, more precisely May 31 is the deadline

  • Power Point is recommended, at most 10 pages, demonstrate the algorithm and results (compression ratio)

  • Take suitable screenshots of your program.

3 of 34

Huffman coding – GROUP A

(Students Names: 2 or 3

  • Restrict your alphabet to 6 letters {a, b, c, d, e, f} and construct the sequence of 300 symbols e.g.

abacefddcabbdedacffaeebdfcabeafbdcab … using the following probability distribution :

  • Pb(a)=0.05, Pb(b)=0.1, Pb(c)=0.15, Pb(d)=0.18, Pb(e)=0.22, Pb(f)=0.3.

  • Show Huffman coding and decoding of the stream and compute the average length of codewords, what is a compression ratio ?

  • Discuss what happens (compression ratio) if „suddenly“ the probability is changed so that Pb(a)=0.3 and P(f)=0.05. You use the same Huffman encoding for the above fixed distribution which is not optimal in this case

4 of 34

Project LZW – GROUP-B

((Students Names: 2 or 3))

  • Restrict your alphabet to 4 letters {a, b, c, d} and construct the sequence of 1000 symbols e.g.

abaccadbdbadcdabcddbadbcab … using the following prob. distributions :

  • Case 1 : Pb(a)=0.5, Pb(b)=0.2, Pb(c)=0.15, Pb(d)=0.15.
  • Case 2: Pb(a)=0.3, Pb(b)=0.3, Pb(c)=0.25, Pb(d)=0.15.

  • The table size is 256 entries ! Discuss the differences (if any) in compression ratio for the two cases. Notice that sending 1000 symbols above without any coding requires 2000 bits. Figure out how to measure compression ratio properly.

5 of 34

Project Arithmetic coding – GROUP C

((Students Names: 2 or 3))

  • Restrict your alphabet to 4 letters {a, b, c, d} and construct the sequence of 100 symbols e.g.

abadccdabdbadcabdcbabdcab …

using the following pro. distr. and assume that the symbols are independent (when computing the intervals):

  • Case 1 : Pb(a)=0.4, Pb(b)=0.3, Pb(c)=0.2, Pb(d)=0.1.

  • „Implement“ arithmetic coding (Google it up) and decoding and compare the length of binary sequence to Huffman coding in this case: Assume that

the symbols are independent p(a_1,a_2)=p(a_1)p(a_2)

6 of 34

Project Arithmetic coding II – GROUP D

((Students Names: 2 or 3))

  • Restrict your alphabet to 4 letters {a, b, c, d} and construct the sequence of 100 symbols e.g.

abadccdabdbadcabdcbabdcab …

using the following pro. distr. and assume that the symbols are NOT independent (when computing the intervals):

  • Case 1 : Pb(a)=0.5, Pb(b)=0.3, Pb(c)=0.1, Pb(d)=0.1.

  • „Implement“ arithmetic coding (Google it up) and decoding. Use the assumption that the symbols are not independent and use Laplace estimate for the conditional probabilities. Compare the length of binary sequence to Hufman coding

7 of 34

AEP coding – GROUP E

((Students Names: 2 or 3))

  • Restrict your alphabet to 2 letters {a, b} with probability 0.7 and 0.3, respectively, and construct the sequence of 1024 symbols

  • Decide what are typical sequences of length 8 if this theoretical value

„epsilon“ as introduced at lecture is 0.1. For instance, is the block

„aaaaaaaa“ 0.1-typical or not ?

  • Calculate the binary length needed to encode this sequence by considering blocks of length 8. Each time you see a typical sequence you use sufficient number of binary bits to encode it (of same length) and for non-typical 10 bits is enough to distinguish these
  • Increase the block length to 16, what happens now ?

8 of 34

Huffman block coding – GROUP F

((Students Names: 2 or 3))

  • Restrict your alphabet to 6 letters {a, b, c, d, e, f} and construct the sequence of 1000 symbols e.g.

abacefddcabbdedacffaeebdfcabeafbdcab …

using the following probability distribution :

  • Pb(a)=0.05, Pb(b)=0.1, Pb(c)=0.15, Pb(d)=0.18, Pb(e)=0.22, Pb(f)=0.3.

  • Consider Huffman coding of block length 2 (alphabet becomes 6x6=36) and decoding of the stream and compute the average length of codewords, what is the compression ratio ?

  • Discuss what happens (compression ratio) if „suddenly“ the probability is changed so that Pb(a)=0.3 and P(f)=0.05. You use the same Huffman encoding for the above fixed distribution which is not optimal in this case

9 of 34

Tunstall coding – GROUP G

((Students Names: 2 or 3))

  • Restrict your alphabet to 4 letters {a, b, c, d} and construct the sequence of 1000 symbols e.g.

abacbcaddcabbdcdacbaacdbdcabeabdcab … using the following probability distribution :

  • Pb(a)=0.05, Pb(b)=0.1, Pb(c)=0.15, Pb(d)=0.7.

  • Construct Tunstall code, thus a codebook of size 16 codewords using the algorithm given in the lecture notes.

  • Encode these codewords as binary strings of length 4 bits and compute the average length of codewords, what is a compression ratio ? Compare to Huffman coding

10 of 34

Tunstall coding II – GROUP H

(Students Names: 2 or 3)

  • Restrict your alphabet to 4 letters {a, b, c, d} and construct the sequence of 1000 symbols e.g.

abacbcaddcabbdcdacbaacdbdcabeabdcab … using the following probability distribution :

  • Tunstall code that contains a codebook
  • Pb(a)=0.05, Pb(b)=0.1, Pb(c)=0.15, Pb(d)=0.7.

  • Constructof size 8 codewords using the algorithm given in the lecture notes.
  • Apply Huffman coding to these 8 codewords and estimate the compression ratio to uncoded case

11 of 34

Huffman code with Fibonacci numbers- GROUP I

((Students Names: 2 or 3))

  • Construct a Huffman code for the following set of frequencies, based on the first 8 Fibonacci numbers (appearance in 54 symbols):

a:1 b:1 c:2 d:3 e:5 f:8 g:13 h:21

  • Generate 108 symbols and calculate the compression ratio compared to uncoded case

  • Consider the increased alphabet length so that you instead encode blocks of 2 symbols. Calculate again the compression ratio and compare to the above case of encoding each symbol separately.

12 of 34

Adaptive Huffman coding GROUP-J

(Students Names: 2 or 3)

  • Implement fully working adaptive Huffman coding algorithm as mentioned at the lectures (with constant time updates to the tree).
  • Restrict your alphabet to 6 letters {a, b, c, d, e, f} and construct the sequence of 2000 symbols e.g.

abacefddcabbdedacffaeebdfcabeafbdcab …

using the following probability distribution :

Pb(a)=0.05, Pb(b)=0.1, Pb(c)=0.15, Pb(d)=0.18, Pb(e)=0.22, Pb(f)=0.3.

  • Compute the compression ratio when the above distribution is fixed. Then, assume that the distribution changes after each 200 symbols so that Pb(a) 🡪 Pb(c) and vice versa whereas Pb(d) -> Pb(f) and v. versa.
  • Your adaptive coding scheme is supposed to adapt every 100 symbols by counting the frequencies of the letters in the sequence up to that point to approximate the probabilities of the letters for the next 100 symbols.

13 of 34

1st order test algorithms – GROUP K

((Students Names: 2 or 3))

  • Implement the ideas discussed during the lecture for selecting optimal number of tests (in average), where the alphabet of binary random variables is at most 15.
  • Generate a code which generates randomly up to 15 tests with approximately equal number of zeros and ones (be careful that you cannot have same columns.

  • Apply your code to 10 possible outcomes using 10 tests generated as above.

  • Compare the average number of tests with the entropy.

14 of 34

Kullback-Leibler distance – GROUP L

(Students Names: 2 or 3)

  • Generate a sequence of 1000 symbols that follows the distribution: (*) Pb(a)=1/2, Pb(b)=1/4, Pb(c)=1/8, Pb(d)=1/16, Pb(e)=1/16,
  • Compute the sequence length and compression ratio when (*) is used with Huffman coding.
  • Compute D(p||q) where p refers to the true distribution given by (*) and q refers to the case when Pb(a) and Pb(e) swaps so that Pb(a)=1/16 and Pb(e)=1/2.
  • Compute the sequence length and compression ratio when the distribution q is used and compare to (*) in terms of D(p||q) or D(q||p)

15 of 34

Shannon-Fano – GROUP M

(Students Names: 2 or 3)

  • Restrict your alphabet to 6 letters {a, b, c, d, e, f} and construct the sequence of 1000 symbols e.g.

abacefddcabbdedacffaeebdfcabeafbdcab … using the following probability distribution :

  • Pb(a)=0.05, Pb(b)=0.1, Pb(c)=0.15, Pb(d)=0.18, Pb(e)=0.22, Pb(f)=0.3.

  • Consider Shannon-Fano coding of block length 2 (alphabet becomes 6x6=36) and decoding of the stream and compute the average length of codewords, what is a compression ratio ?

  • Discuss what happens (compression ratio) if „suddenly“ the probability is changed so that Pb(a)=0.3 and P(f)=0.05. You use the same Shannon-Fano encoding for the above fixed distribution which is not optimal in this case

16 of 34

Huffman block coding (song) – GROUP N

(Students Names: 2 or 3)

  • Find a Waveform Audio File Format (WAVE, or WAV filename extension) of your favorite song and read the contents of the file into a byte array.
  • Change bytes to bits to obtain a representation of the wav file as a string of bits.
  • Split the obtained sequence of bits into blocks of 8, and find the frequency of every possible block, and use the obtained frequencies to estimate the probability of all possible blocks of 8 bits.
  • Use the estimated probability distribution to obtain a Huffman code for the blocks of 8 bits.
  • Use the Huffman code to encode the bit representation of the song.
  • Compare the size of the compressed file with the size of an mp3 file of the same song.

17 of 34

Lempel-Ziv (song) – GROUP O

Sara Delić, Ivan Sharikov, Antón Expósito Campo, Meryem Nobatova

  • Find a Waveform Audio File Format (WAVE, or WAV filename extension) of your favorite song and read the contents of the file into a byte array.
  • Change bytes to bits to obtain a representation of the wav file as a string of bits.
  • Using a starting dictionary of 128 bits use Lempel-Ziv algorithm to compress the obtained string of bits.
  • Compare the size of the compressed file with the size of an mp3 file of the same song.

18 of 34

Shannon-Fano block coding (video) – GROUP P

(Students Names: 2 or 3)

  • Find a mkv file (5 seconds) of your favorite movie scene and read the contents of the file into a byte array.
  • Change bytes to bits to obtain a representation of the mkv file as a string of bits.
  • Split the obtained sequence of bits into blocks of 16, and find the frequency of every possible block, and use the obtained frequencies to estimate the probability of all possible blocks of 8 bits.
  • Use the estimated probability distribution to obtain a Shannon-Fano for the blocks of 16 bits.
  • Use the Shannon-Fano code to encode the bit representation of the clip.
  • Compare the size of the compressed file with the size of an mp4 file of the same clip.

19 of 34

Huffman block coding (video) – GROUP Q

(Students Names: 2 or 3)

  • Find a mkv file (5 seconds) of your favorite movie scene and read the contents of the file into a byte array.
  • Change bytes to bits to obtain a representation of the mkv file as a string of bits.
  • Split the obtained sequence of bits into blocks of 16, and find the frequency of every possible block, and use the obtained frequencies to estimate the probability of all possible blocks of 8 bits.
  • Use the estimated probability distribution to obtain a Huffman for the blocks of 16 bits.
  • Use the Huffman code to encode the bit representation of the clip.
  • Compare the size of the compressed file with the size of an mp4 file of the same clip.

20 of 34

Shannon-Fano block coding (song) – GROUP R

(Students Names: 2 or 3)

  • Find a raw audio file of your favorite song and read the contents of the file into a byte array.
  • Change bytes to bits to obtain a representation of the wav file as a string of bits.
  • Split the obtained sequence of bits into blocks of 8, and find the frequency of every possible block, and use the obtained frequencies to estimate the probability of all possible blocks of 8 bits.
  • Use the estimated probability distribution to obtain a Shannon-Fano for the blocks of 8 bits.
  • Use the Shannon-Fano code to encode the bit representation of the song.
  • Compare the size of the compressed file with the size of an mp3 file of the same song.

21 of 34

Lempel-Ziv (song) – GROUP S

(Students Names: 2 or 3)

  • Find a mkv file (5 seconds) of your favorite movie scene and read the contents of the file into a byte array.
  • Change bytes to bits to obtain a representation of the mkv file as a string of bits.
  • Using a starting dictionary of 256 bits use Lempel-Ziv algorithm to compress the obtained string of bits.
  • Compare the size of the compressed file with the size of an mp4 file of the same clip.

22 of 34

Huffman block coding (image) – GROUP T

(Students Names: 2 or 3)

  • Find a JPEG image of David Huffman and read the contents of the file into a byte array.
  • Change bytes to bits to obtain a representation of the JPEG image as a string of bits.
  • Split the obtained sequence of bits into blocks of 8, and find the frequency of every possible block, and use the obtained frequencies to estimate the probability of all possible blocks of 8 bits.
  • Use the estimated probability distribution to obtain a Huffman code for the blocks of 8 bits.
  • Use the Huffman code to encode the bit representation of the image.
  • Compare the size of the compressed file with the sizes of the same image in some other image formats.

23 of 34

Lempel-Ziv (image) – GROUP U

((Students Names: 2 or 3))

  • Find a JPEG image of Abraham Lempel and read the contents of the file into a byte array.
  • Change bytes to bits to obtain a representation of the JPEG image as a string of bits.
  • Using a starting dictionary of 256 bits use Lempel-Ziv algorithm to compress the obtained string of bits.
  • Compare the size of the compressed file with the sizes of the same image in some other image formats.

24 of 34

Shannon-Fano block coding (image) – GROUP V

(Students Names: 2 or 3)

  • Find a JPEG image of Claude Shannon and read the contents of the file into a byte array.
  • Change bytes to bits to obtain a representation of the JPEG image as a string of bits.
  • Split the obtained sequence of bits into blocks of 8, and find the frequency of every possible block, and use the obtained frequencies to estimate the probability of all possible blocks of 8 bits.
  • Use the estimated probability distribution to obtain a Shannon-Fano code for the blocks of 8 bits.
  • Use the Shannon-Fano code to encode the bit representation of the image.
  • Compare the size of the compressed file with the sizes of the same image in some other image formats.

25 of 34

Huffman block coding (book) – GROUP W

(Students Names: 2 or 3)

  • Find a pdf file of your favorite book and read the contents of the file into a byte array.
  • Change bytes to bits to obtain a representation of the pdf file as a string of bits.
  • Split the obtained sequence of bits into blocks of 16, and find the frequency of every possible block, and use the obtained frequencies to estimate the probability of all possible blocks of 16 bits.
  • Use the estimated probability distribution to obtain a Huffman code for the blocks of 16 bits.
  • Use the Huffman code to encode the bit representation of the pdf file of the book.
  • Compare the size of the compressed file with the sizes of the same book in some other document formats (DjVu etc.).

26 of 34

Shannon-Fano coding (book) – GROUP X

(Students Names: 2 or 3)

  • Find a pdf file of your favorite book and read the contents of the file into a byte array.
  • Change bytes to bits to obtain a representation of the pdf file as a string of bits.
  • Split the obtained sequence of bits into blocks of 16, and find the frequency of every possible block, and use the obtained frequencies to estimate the probability of all possible blocks of 16 bits.
  • Use the estimated probability distribution to obtain a Shannon-Fano code for the blocks of 16 bits.
  • Use the Shannon-Fano code to encode the bit representation of the pdf file of the book.
  • Compare the size of the compressed file with the sizes of the same book in some other document formats (DjVu etc.).

27 of 34

Lempel-Ziv (book) – GROUP Y

(Students Names: 2 or 3)

  • Find a pdf file of your favorite book and read the contents of the file into a byte array.
  • Change bytes to bits to obtain a representation of the pdf file of the book as a string of bits.
  • Using a starting dictionary of 256 bits use Lempel-Ziv algorithm to compress the obtained string of bits.
  • Compare the size of the compressed file with the sizes of the same book in some other document formats (DjVu etc.).

28 of 34

Tunstall coding (book) – GROUP Z

(Students Names: 2 or 3)

  • Find a pdf file of your favorite book and read the contents of the file into a byte array.
  • Change bytes to bits to obtain a representation of the pdf file as a string of bits.
  • Split the obtained sequence of bits into blocks of 8, and find the frequency of every possible block, and use the obtained frequencies to estimate the probability of all possible blocks of 8 bits.
  • Use the estimated probability distribution on the 8 bits blocks and Tunstall coding to obtain a Tunstall code with the length of 16 bits.
  • Use the obtained code to encode the bit representation of the pdf file of the book.
  • Compare the size of the compressed file with the sizes of the same book in some other document formats (DjVu etc.).

29 of 34

Helpful links and material

  • Lecture notes are available at e-učilnica.

  • The implementation in java for LZW is appended, you can find other options available

  • Not so much work in general for 2 students – but you need to explain the details of both encoding and decoding process

  • Maybe not all projects are equally balanced w.r.t. workload so maybe there will be additional grading regarding this !

30 of 34

Pseudo codes for LZW

  • For encoding a pseudo code is :

31 of 34

Pseudo codes II

  • For decoding pseudo code is :

32 of 34

Java implementation

  • import java.io.*;
  • public class LZWCompression {
  • private static final int BITS = 12;
  • private static final int HASHING_SHIFT = 4;
  • private static final int MAX_VALUE = (1 << BITS ) - 1;
  • private static final int MAX_CODE = MAX_VALUE - 1;
  • private static final int TABLE_SIZE = 5021;
  • private static final int EOF = -1;
  • private BufferedInputStream input = null; 1
  • private BufferedOutputStream output = null;
  • private int output_bit_count = 0;
  • private int output_bit_buffer = 0;
  • private short[] code_value = new short[ TABLE_SIZE ];
  • private short[] prefix_code = new short[ TABLE_SIZE ];
  • private short[] append_character = new short[ TABLE_SIZE ];
  • LZW Compression ( FileInputStream input, FileOutputStream output )
  • { this.input = new BufferedInputStream( input ); this.output = new BufferedOutputStream( output

);

  • } …..
  • Type LZW java implementation !!

33 of 34

Example LZW

34 of 34

Example LZW cont.