CS61A         Summer 2011         Homework 11

Topic: Streams & Map Reduce

Reading: Abelson & Sussman, Section  3.5.1-3.5.3, 3.5.5

MapReduce paper in course reader.

Please go to lab this week: you’ll need the instructions in order to test your homework questions regarding mapreduce.

1.  Exercises 3.50, 3.51, 3.52, 3.53, 3.54, 3.55, 3.56, 3.64, 3.66, 3.68

2.  Write and test two functions to manipulate nonnegative proper fractions.

The first function, fract-stream, will take as its argument a list of

two nonnegative integers, the numerator and the denominator, in which

the numerator is less than the denominator.  It will return

an infinite stream of decimal digits representing the decimal expansion of

the fraction.  The second function, approximation, will take two

arguments: a fraction stream and a nonnegative integer (num).

It will return a list (not a stream) containing the first num

digits of the decimal expansion.

(fract-stream '(1 7))} should return the stream representing the

decimal expansion of 1/7, which is 0.142857142857142857...

(stream-car (fract-stream '(1 7)))} should return 1.

(stream-car (stream-cdr (stream-cdr (fract-stream '(1 7))))) should return 2.

(approximation (fract-stream '(1 7)) 4)} should return (1 4 2 8).

(approximation (fract-stream '(1 2)) 4)} should return (5 0 0 0).

Extra for Experts:

1.  Do exercises 3.59--3.62.

2.  Consider this procedure:

(define (hanoi-stream n)

  (if (= n 0)

          the-empty-stream

          (stream-append (hanoi-stream (- n 1))

                     (cons-stream n (hanoi-stream (- n 1))))))

It generates finite streams; here are the first few values:

(hanoi-stream 1)        (1)

(hanoi-stream 2)        (1 2 1)

(hanoi-stream 3)        (1 2 1 3 1 2 1)

(hanoi-stream 4)        (1 2 1 3 1 2 1 4 1 2 1 3 1 2 1)

Notice that each of these starts with the same values as the one above it,

followed by some more values.  There is no reason why this pattern can't be

continued to generate an infinite stream whose first 2^n - 1 elements are

(hanoi-stream~n)}.  Generate this stream.

MapReduce exercises:

Keep in mind: (mapreduce mapper reducer basecase input)

                        mapper - must return a LIST of kv-pairs!

                        [Sorting is done automatically based on kv-keys!]

                        reducer - procedure of two arguments, works like accumulate (by keeping track of    

                                   the current value & returning it at the end)

1(a). Google uses MapReduce for a variety of search-engine-related tasks

involving processing large data sets. An example of such a task is building an

inverted word index. For each word in a given set of documents, we want to

generate a key-value pair, where the key is the word and the value is all

documents in which the word appears.  Write a procedure that takes a directory

name as argument and returns the inverted index stream.

1(b). The list above will include words like ``a,'' ``so,'' ``the,'' etc.

Suppose now that we only want ``important words.'' We can use word length as a

measure of importance (a real search engine would use a more complex

heuristic). Write a procedure that returns an index where all the words are N

or more letters long, taking the filename and N as arguments.

2.  Google also manages the GMail e-mail service, and they would like to

filter out as many spammers as possible. In this problem you will implement a

simple spam filtering idea with mapreduce so that it can be efficiently

applied to the multitudes of e-mail messages in GMail.

We want to produce a ``blacklist'' of e-mail addresses that are spamming GMail

users. Spam messages are sent to many addresses at once, and so a spam message

can be identified by having one of the most commonly-used subject lines. If an

address sends many messages with the most common subject lines, it is likely a

spammer address. We would like to find the top ten addresses that have sent

the most messages with frequently-recurring subject lines.

The input data is a collection of e-mail records in the file

/sample-emails. (These are simulated messages, not actual GMail data!)

Each e-mail record is a list with the format

(from-address to-address subject body)

where each element is a double-quoted string. The mapper function will be

applied to each e-mail record. An small example set of e-mail records is

below.

("cs61a-tb" "cs61a-tc" "mapreduce" "mapreduce is great! lucky students!")

("bot1337" "cs61a-ta" "free ipod now!" "buy herbal ipod enhancer!")

("bot1338" "cs61c-tf" "free ipod now!" "buy herbal ipod enhancer!")

...

2(a). Our first step is to produce a table with subject lines as keys and

counts of occurrences as values. Identify the intermediate key-value pairs and

write the mapper and reducer functions.

2(b). From our tabulation of subject line occurrences in 2(a), we want to find

the most common subject lines in the table. We can perform a sort by using the

fact that MapReduce sorts by intermediate keys into the reducer groups.

Identify the intermediate key-value pairs and write the mapper and reducer

functions.

2(c). Finally, assuming you've moved the ten most common subject lines into a

list, we want to make a table with from-addresses as keys and counts of

e-mails sent with common subject lines as values. You don't need to sort the

table, since the procedures would be identical to those in 2(b). Identify the

intermediate key-value pairs and write the mapper and reducer functions.

3.  Sometimes either the map or reduce operation is trivial, such as an

identity function. Is MapReduce still the best way to handle those cases,

or would it be better to provide separate map and groupreduce

functions as we did for the non-parallel toy version?  Why?

4.  Could you use MapReduce to perform a parallelized Sieve of Eratosthenes?

If so, describe the process briefly. If not, why not?  (Don't try to solve

this problem by quizmanship; there's a case to be made for both answers!)