CS61A Summer 2011 Homework 11
Topic: Streams & Map Reduce
Reading: Abelson & Sussman, Section 3.5.1-3.5.3, 3.5.5
MapReduce paper in course reader.
Please go to lab this week: you’ll need the instructions in order to test your homework questions regarding mapreduce.
1. Exercises 3.50, 3.51, 3.52, 3.53, 3.54, 3.55, 3.56, 3.64, 3.66, 3.68
2. Write and test two functions to manipulate nonnegative proper fractions.
The first function, fract-stream, will take as its argument a list of
two nonnegative integers, the numerator and the denominator, in which
the numerator is less than the denominator. It will return
an infinite stream of decimal digits representing the decimal expansion of
the fraction. The second function, approximation, will take two
arguments: a fraction stream and a nonnegative integer (num).
It will return a list (not a stream) containing the first num
digits of the decimal expansion.
(fract-stream '(1 7))} should return the stream representing the
decimal expansion of 1/7, which is 0.142857142857142857...
(stream-car (fract-stream '(1 7)))} should return 1.
(stream-car (stream-cdr (stream-cdr (fract-stream '(1 7))))) should return 2.
(approximation (fract-stream '(1 7)) 4)} should return (1 4 2 8).
(approximation (fract-stream '(1 2)) 4)} should return (5 0 0 0).
Extra for Experts:
1. Do exercises 3.59--3.62.
2. Consider this procedure:
(define (hanoi-stream n)
(if (= n 0)
(stream-append (hanoi-stream (- n 1))
(cons-stream n (hanoi-stream (- n 1))))))
It generates finite streams; here are the first few values:
(hanoi-stream 1) (1)
(hanoi-stream 2) (1 2 1)
(hanoi-stream 3) (1 2 1 3 1 2 1)
(hanoi-stream 4) (1 2 1 3 1 2 1 4 1 2 1 3 1 2 1)
Notice that each of these starts with the same values as the one above it,
followed by some more values. There is no reason why this pattern can't be
continued to generate an infinite stream whose first 2^n - 1 elements are
(hanoi-stream~n)}. Generate this stream.
Keep in mind: (mapreduce mapper reducer basecase input)
mapper - must return a LIST of kv-pairs!
[Sorting is done automatically based on kv-keys!]
reducer - procedure of two arguments, works like accumulate (by keeping track of
the current value & returning it at the end)
1(a). Google uses MapReduce for a variety of search-engine-related tasks
involving processing large data sets. An example of such a task is building an
inverted word index. For each word in a given set of documents, we want to
generate a key-value pair, where the key is the word and the value is all
documents in which the word appears. Write a procedure that takes a directory
name as argument and returns the inverted index stream.
1(b). The list above will include words like ``a,'' ``so,'' ``the,'' etc.
Suppose now that we only want ``important words.'' We can use word length as a
measure of importance (a real search engine would use a more complex
heuristic). Write a procedure that returns an index where all the words are N
or more letters long, taking the filename and N as arguments.
2. Google also manages the GMail e-mail service, and they would like to
filter out as many spammers as possible. In this problem you will implement a
simple spam filtering idea with mapreduce so that it can be efficiently
applied to the multitudes of e-mail messages in GMail.
We want to produce a ``blacklist'' of e-mail addresses that are spamming GMail
users. Spam messages are sent to many addresses at once, and so a spam message
can be identified by having one of the most commonly-used subject lines. If an
address sends many messages with the most common subject lines, it is likely a
spammer address. We would like to find the top ten addresses that have sent
the most messages with frequently-recurring subject lines.
The input data is a collection of e-mail records in the file
/sample-emails. (These are simulated messages, not actual GMail data!)
Each e-mail record is a list with the format
(from-address to-address subject body)
where each element is a double-quoted string. The mapper function will be
applied to each e-mail record. An small example set of e-mail records is
("cs61a-tb" "cs61a-tc" "mapreduce" "mapreduce is great! lucky students!")
("bot1337" "cs61a-ta" "free ipod now!" "buy herbal ipod enhancer!")
("bot1338" "cs61c-tf" "free ipod now!" "buy herbal ipod enhancer!")
2(a). Our first step is to produce a table with subject lines as keys and
counts of occurrences as values. Identify the intermediate key-value pairs and
write the mapper and reducer functions.
2(b). From our tabulation of subject line occurrences in 2(a), we want to find
the most common subject lines in the table. We can perform a sort by using the
fact that MapReduce sorts by intermediate keys into the reducer groups.
Identify the intermediate key-value pairs and write the mapper and reducer
2(c). Finally, assuming you've moved the ten most common subject lines into a
list, we want to make a table with from-addresses as keys and counts of
e-mails sent with common subject lines as values. You don't need to sort the
table, since the procedures would be identical to those in 2(b). Identify the
intermediate key-value pairs and write the mapper and reducer functions.
3. Sometimes either the map or reduce operation is trivial, such as an
identity function. Is MapReduce still the best way to handle those cases,
or would it be better to provide separate map and groupreduce
functions as we did for the non-parallel toy version? Why?
4. Could you use MapReduce to perform a parallelized Sieve of Eratosthenes?
If so, describe the process briefly. If not, why not? (Don't try to solve
this problem by quizmanship; there's a case to be made for both answers!)