Exercise set 3

Exercise set 3 (last updated 2015-04-10)

Return your solutions via Moodle page no later than 9.4. at 2pm strictly. We do not count belated returnings. You have to use .pdf file type for any written answers, and .scala for your code. Do not return any project or object files, etc. Mark your full name and student number clearly to all your solution files. Prepare to explain and discuss your solutions on Friday exercise session. Each exercise will be graded pass/fail.

Programming Guidelines

Place your code in a single .scala file, called Solutions.scala, in the package “bdf.exercise3.yourname”, i.e. start the file with:

package bdf.exercise3.yourname

The file should include one scala program with a main function, called Solutions, and any functions, classes, case classes, and objects that you need.

Example file:

Solutions.scala

package bdf.exercise3.eemillagerspetz

/* Definitions of all other objects, classes, global functions etc go here */

/* Finally we have the Solutions object */

object Solutions{
/* Definitions of other functions of Solutions go here */

def main(args: Array[String]) {

val conf = new SparkConf().setAppName(getClass.getName).setMaster("local[2]")

val sc = new SparkContext(conf)

/* Run exercise2 code */

exercise2(sc)

/* Run exercise3 code */

exercise3(sc)

/* Run exercise4 code */

exercise4(sc)

/* Run exercise5 code */

exercise5(sc)

/* Run exercise6 code */

exercise6(sc)

}

This way we can import and run all of the code of all the students in a single Eclipse project.

All these exercises have to be done using Spark and Scala programming languages. Please, return your answers as .scala files (NO class files or project files). If asked so, return also a written solution in a .pdf file. NO CODE AS PDF, because they are impossible to run then.

TIP: to allow testing locally and running the same program in Ukko, give the master url to the program as an argument (either in Eclipse’s Run configurations - arguments or on the command line):

/someplace/spark-1.3.0-bin-hadoop2.4/bin/spark-submit \

--master "local[2]" … thejar.jar \

"local[2]" # the second time for your program if run without spark-submit

# or

/someplace/spark-1.3.0-bin-hadoop2.4/bin/spark-submit \

--master "spark://ukko123.hpc.cs.helsinki.fi:7077" … thejar.jar \

"spark://ukko123.hpc.cs.helsinki.fi:7077" # the second time for your program if run without spark-submit

You can also give the data files as an argument to the program, and read it in main:
val master = args(0)
val data = args(1)
val dataRdd = sc.textFile(data)

…

Exercise 1

Read the tutorial of submitting Spark programs to the cluster: http://spark.apache.org/docs/latest/submitting-applications.html

Write a Spark function that

Takes paths for input and output files as parameters
Reads text from the input file (words separated by semicolon, at least three words per line, at least some tens of lines) as RDD[String]
Parses these Strings so that the result is RDD[Array[String]]
Saves these Arrays to the output path as Object files

Now “pack” your simple program to a jar file. Use eclipse’s export function, the command line jar tool, or maven, or your favorite compression program. Generate some test data for yourself. Run your function in the Ukko cluster using spark-submit, at least three remote worker nodes (in addition to the master node), and “client” as a deploy mode. Create an ssh connection to ukko and call spark-submit there (i.e. not on your own laptop).

Return your code as .scala file and include the spark-submit command you used (with all the arguments, options, file paths etc.) as a comment in the top of the file.

TIP: You need to configure files from spark-xxx/conf (at least, update slaves.template to be a file called slaves, where each slaves are given in their own rows) and run stuff from spark-xxx/sbin to start your cluster. The Spark master’s name is in form

spark://ukkoXXX.hpc.cs.helsinki.fi:7077

You can give your slaves as whatever the command hostname returns. Read also other tips previously in this exercise set...

Exercise 2

Continue to work in the Ukko cluster (or any other remote set of servers you have access to). Improve to your function (from the exercise 1):

Make a function, that changes a letter of a word, for example, each “a” letter are first replaced by “b”. This function takes only one word as a parameter and reads the replacement letter as a class variable
Create var replacement = “a” to a class variable and broadcast it
Call letter changing function three times for all the words in your dataset, every time changing the replacement letter and broadcast it again
Print the two first elements between each function call
Note: Only the change from the last call should be visible, i. e. the previous call’s change to the word is forgotten in the next.

Example
word -> bord
word -> wold
word -> ward

If your broadcast works correctly, you should get updated letters without giving the replacement as a parameter (that’s the alternative way to do this, but in some cases it is not just possible or easy to give a huge list of parameters).

Return your code as a solution.

Exercise 3

Choose a Wikipedia article you like (or today’s featured article, if cannot make a decision) and download it to your computer. On the department’s systems and Linux you can also use the command wget url. Parse your article so that you will have separated headings and words belonging to that section, as an RDD[heading or subheading, section’s words]. Clean up the document, removing HTML tags and other non-article content. You can use any library for Scala or Java that does this, or parse the text by hand.

Addition: You are not required to save the heading structure, so you can handle each header level (h1, h2, h3) as equivalent.

Then, compute a probability distribution for “small” words of each section:

and, or, the, a, an, at, on, in

Plot your findings and return them as a short report (plots + some description text) as a .pdf file. Return also your code as a .scala file.

Exercise 4

Continue working with your Wikipedia article from the exercise 3. Now implement your own function to compute Hamming distance, that is one way to present similarity of words: http://en.wikipedia.org/wiki/Hamming_distance

Addition: For the words of different size, match the shorter one to the substrings of the longer one, and return the average Hamming distance.

Implement k-means clustering for Spark (presented in the couple of tutorials) that uses the Hamming distance as a similarity/distance metric. You can use median of the cluster as a definition for a new centroid, e.g. sort the words in the cluster by alphabetic order and pick one from the middle.

Cluster all the words from you Wikipedia article. Test with different size of k.

As a solution, evaluate your clusters answering at least these questions:

What kind of clusters can you find?
Do your clusters match to the sections’ subheadings?
Which k gives you the best (or the most reasonable) clusters?
Addition: Evaluate your heuristic. Do the Hamming distance work there? Would you like to try some other metric instead?

Return your answers as a .pdf file, and also your code as .scala file.

Hint: When you compare Arrays in Scala, you cannot use == because it compares Objects, not the elements of the Arrays. That's a little weird, because == works for Strings pretty well even if they are Objects too. To compare arrays, use a.deep == b.deep.

Exercise 5

In this exercise, use the MovieLens dataset used in the exercise set 2. Create your own similarity metric and cluster MovieLens users based on it using k-medoids: http://en.wikipedia.org/wiki/K-medoids

Use at least 10 clusters. The metric should assign each user (or user pair, your choice) a single number. Try to find a metric that results in clusters with funny or understandable user -> genre associations. Example (probably meaningless) metric pseudocode: zipcode + (male?1:0)*1000+age*10

As a solution, present your metric and evaluate your clusters answering at least these questions:

Which users are the cluster medoids (aka centroids, or cluster heads)?
What are the most common favorite movie genres in the clusters?
What are the most common occupations in the cluster? Are the users mostly highly educated in some clusters and unemployed in others? Don’t just use the occupation as the metric.

Return your answers as a pdf file AND separate .scala file(s) for your code (metric, clustering algorithm, and a main for running your results).

Exercise 6

Create a case class that allows iterative update of the expected value (ev), error, and number of samples of a statistical distribution. Support both removal and addition via overloaded - and + operators.

http://math.stackexchange.com/questions/102978/incremental-computation-of-standard-deviation/103025#103025

The link above outlines the solution to iteratively update

the average (or expected value)
the standard deviation or standard error of the mean
the number of samples

when given the average, square difference sum, and the number of samples from two different sets of samples.

Create a case class that allows adding two such summaries together so that the result is the same as when you have all of the individual samples at hand instead of the summaries, e.g. for the two data sets

1 2 3 8 4 3 2 1

2 4 3 1 5 7 4 2

The averages are 3 and 3.5, the number of samples is 8 for both of them, and the standard deviations are 2.27 and 1.93. The square difference sums are 36 and 26.

The joint average is 3.25, the number of samples 16, and the standard deviation 2.049. The combined square difference sum is 63.

Your task is to create a case class such that it passes this test:

// IMPLEMENT THIS:

case class Stat(name:String, ev:Double, squares:Double, samples:Double){

def +(b:Stat) = { … }

def -(b:Stat) = { … }

def stddev() = {...}

}

// TESTS:

val one = Stat("one", 3, 36, 8)

val two = Stat("two", 3.5, 26, 8)

val combined = one + two

val known = Stat("comb", 3.25, 63, 16)

combined.ev == known.ev

combined.stddev == known.stddev

combined.squares == known.squares

combined.samples == known.samples

val oneBySubtraction = combined - two

oneBySubtraction.ev == one.ev

oneBySubtraction.stddev == one.stddev

oneBySubtraction.squares == one.squares

oneBySubtraction.samples == one.samples

Return your code as a solution.