1 of 16

Text Compression

CSinSF Summer Institute

2 of 16

When you send text messages to a friend, do you spell every word correctly?

3 of 16

Why do you use these abbreviations? What is the benefit?

4 of 16

Compression: Same Data, Fewer Bits

  • When you abbreviate or use coded language to shorten the original text, you are “compressing text.” Computers do this too, in order to save time and space.
  • The art and science of compression is about figuring out how to represent the SAME DATA with FEWER BITS.
  • Why is this important?
    • One reason is that storage space is limited and you’d always prefer to use fewer bits if you could. A much more compelling reason is that there is an upper limit to how fast bits can be transmitted over the Internet.
  • What if we need to send a large amount of text faster over the Internet, but we’ve reached the physical limit of how fast we can send bits?

5 of 16

Activity: Decode this Mystery Text

6 of 16

Activity: Decode this Mystery Text

Pitter_patter_patter_listen_to_the_rain_pitter_patter_

pitter_patter_on_the_window_pane

7 of 16

8 of 16

Activity: Decode this Mystery Text

  • The compressed poem is not just this part:
    • If you were to send this to someone over the Internet they would not be able to decode it.
  • The full compressed text includes BOTH the compressed text and the key to solve it.
  • Thus, you must account for the total number of characters in the message plus the total number of characters in the key to see how much you’ve compressed it over the original.

9 of 16

10 of 16

Activity: Use the Text Compression Widget

  • Time to partner up!
  • Go to www.yellkey.com/degree
    • Login with Google Account
  • Challenge: compress your assigned poem as much as possible.
  • Compare with other groups to see if you can do better.
  • Try to develop a general strategy that will lead to a good compression

11 of 16

Activity: Use the Text Compression Widget

  • What makes doing this compression hard?
  • Do we think that these compression amounts that we’ve found are the the best? Is there a way to know what the best compression is?

12 of 16

Activity: Develop a heuristic for doing compression

In computer science there is a word for strategies to use when you’re not sure what the exact or best solution to a problem is.

Heuristic

A problem solving approach (typically an algorithm) to find a satisfactory solution where finding an optimal or exact solution is impractical or impossible.

13 of 16

Activity: Develop a heuristic for doing compression

  • Continue working on compressing your poem using the Text Compression Widget. As you do so, develop a set of rules, or a “heuristic” that generally seems to provide good results.
  • Record your heuristic as a list of steps that someone else unfamiliar with the problem could follow and still end up with decent compression.

14 of 16

Share-out

Share your heuristic findings...

15 of 16

Wrap-up

What did all groups’ processes for compression have in common?

16 of 16

Vocabulary

Heuristic - a problem solving approach (algorithm) to find a satisfactory solution where finding an optimal or exact solution is impractical or impossible.

Lossless Compression - a data compression algorithm that allows the original data to be perfectly reconstructed from the compressed data.

Lossy Compression (or irreversible compression) - a data compression method that uses inexact approximations, discarding some data to represent the content. Most commonly seen in image formats like .jpg.