395T / INF385T / LIN386M: Data-Intensive Computing for Text Analysis

Assignment 3: Compression, Inverted Indexing, and Popular Passages

Due: Sunday 10/16 on BlackBoard 

Submit: a single PDF, including both answers with any requested code. Code should be formatted using a fixed-width font like Courier.

As usual we encourage you to start early...

A. Hadoop Compression. Consider the following simple Hadoop WordCount program:

public class SimpleWordCount {

public static void main(String[] args) {

JobClient client = new JobClient();

JobConf conf = new JobConf(SimpleWordCount.class);

FileInputFormat.addInputPath(conf, new Path(args[0]));

FileOutputFormat.setOutputPath(conf, new Path(args[1]));




conf.setCombinerClass(LongSumReducer.class); conf.setReducerClass(LongSumReducer.class);


try {


} catch (Exception e) {





A.I Show how the above code could be modified to

  1. read in gzip-compressed input
  2. use gzip to compress intermediate (mapper) output
  3. gzip-compress persistent (reducer) output.

Use bold text and comments to clearly indicate modified lines and explain which modifications pertain to which functionality.


  1. What factors affect the run-time and why/how when using different compression codecs with MapReduce/Hadoop?  (max two sentences)
  2.  baseline: Run the original SimpleWordCount on “all_oanc.txt” or some larger corpus (the bigger the better for stress testing, see earlier assignments for file details). Report which file(s) you used and running time of this baseline configuration?  You might try running a couple of times and averaging.  This is your baseline.
  3. other configurations: For each of the following configurations, how long does the program take, reported as a % of baseline time (not absolute time)?  Besides tracking time, you are also encouraged to monitor performance of your job using the Web GUI and report any other metrics of note for which you observe a difference between baseline performance vs. other configurations.
  1. no combiner: Comment out the combiner class  
  2. no combiner, mapper-compression: use gzip to compress intermediate output
  3. combiner, mapper-compression: uncomment the combiner class to include it once more with compression left enabled
  4. Try g-zipping the input file, then running WordCount reading in the gzip-compressed input.  You can try this with any combination of baseline or a-c runs and compare.
  1. What if any trends do you observe across these runs?  In what if any circumstances does compression seem sufficiently helpful to bother with?  (max four sentences)

A.III (optional, not required). Install and use LZO compression with appropriate pre-processing and compare its performance to gzip in one or more of the settings above.  What do you observe? (max four sentences)

B. Spelling Correction and Inverted Indexing. In class we have discussed how to perform efficient automatic spelling correction and suggestion, using character tri-grams.  In this problem you will implement these concepts.

B.I. Implement boolean-search character-trigram inverted indexing in Hadoop and provide a code listing.  This initial “vanilla” implementation should be straightforward.

B.II. What limitations/complications have we discussed with this “vanilla” approach and/or what pattern(s) might be utilized to address this?  What if any tradeoffs are involved?  (max four sentences)

B.III. Modify your original code to implement a revised program reflecting your comments to the question above. Provide a listing for the revised program, using bold text and comments to clearly indicate what modifications you have made to your program and the expected behavior and/or any new pre-/post-invariants (you can be informal).  In one sentence, how do you expect your revised program to behave differently?

B.IV. Run both forms of your program on “all_oanc.txt” or some larger corpus (the bigger the better for stress testing, see earlier assignments for file details). Use the Web GUI or other metric tracking to compare behavior of the two versions of your program. Report which file(s) you used and what if any differences in metrics you observed.

B.V. Using your inverted indexing program, how many possible corrections are there for acress from an inverted index built from “all_oanc.txt”?

B.VI. Suggest and implement a simple scheme for ranking candidates using this same boolean index for “all_oanc.txt.” What are the top 5 candidates and their scoress for “acress”?

B.VII. Modify your boolean-search indexing program to do TF.IDF inverted indexing of character trigrams. As in B.III, your program should reflect the issues and approaches you discussed in B.II. While you do not need to compress the postings, organize your program such that it would be easy to do so later were you provided an external library for Golomb compression.

B.VIII. Perform simple TF.IDF ranking of correction candidates using this TF.IDF index for “all_oanc.txt.” What are the top 5 candidates and their scores for “acress”?

B.IX (optional, not required). In class we only discussed “vanilla” TF.IDF weighting schemes. Salton and Buckley’s Term-weighting approaches in automatic text retrieval (1988) presents a wider range of possible schemes (Tables 1 and 2) along with their evaluation on several classic text collections of abstracts.  You might also take a look at Amit Singhal’s Modern Information Retrieval: A Brief Overview (2001), particularly Table 1.

Implement one or more of these schemes and compare the spelling correction rankings vs. what you got with vanilla TF.IDF. Report each scheme tried and the top-5 candidates and scores under the revised ranking. How do they compare to the baseline vanilla scheme?

C. Popular Passages. In class we discussed how to write a MapReduce program to find where all (word) tri-grams occur in all documents (documentID and position). In a follow-up in-class exercise, we looked at how to order these consecutively for each document such that one might iterate over each position in a document and see for each position, where else the ngram at that position in the current document occurs in other documents.  Finally, some psuedo-code was presented for how these tri-grams might be extended to find longer sequences of consecutive words aligned across multiple documents.  This approach was dubbed “popular passages” for Google scanned books project; see Kolak and Schilit (HyperText 2008) for more details.

C.I. The first line in “enwiki_250k.txt” (see previous assignments for file details) represents Wikipedia’s page on “Anarchism”.  Report all n-grams of length 3 or longer (the striings) and their alignments (other documentIDs and document-specific word index) between the “Anarchism” document (listed in order of their occurrence in this document) and one or more other documents in “enwiki_250k.txt”.  

[this one paragraph added Sunday Oct 16 at 6:20 CST, also email sent to course listserv]

Please sort the 3-gram (or longer) strings lexicographically for submission (you can just use the unix sort command), and report the total number of alignments found between “Anarchism” and all other documents (if all alignments in a text file, one per line, unix “wc -l” command will give you the number of alignments).

You might represent documentIDs by their line number (as an integer or string). Note that you do not actually use or store the n-gram strings themselves beyond using the tri-grams in the first MapReduce grouping (while our in-class exercise output the strings, this was not actually necessary).

Submit a code-listing of your MapReduce job(s) which finds these popular passages.

C.II (bonus, not required).

  1. Suggest a simple way to prevent your program from producing duplicates (i.e. finding the same alignments between documents A & B when processing A and when processing B)?
  2. How might you use chaining or other mechanisms to organize your MapReduce computations?  Pig/Cascalog etc.?


Wrapup 1: Do you have any feedback on this homework? (It’s okay if you don’t.)

Wrapup 2: How long did this entire homework take you?