Applied Data Analysis (CS401)
Robert West
Lecture 13
Scaling up
9 Dec 2020
Scaling up
Announcements
Feedback
Give us feedback on this lecture here: https://go.epfl.ch/ada2020-lec13-feedback
So far in this class...
The big-data problem
Data is growing faster than computation speed
Growing data sources
Cheap hard-disk storage
Stalling CPU speeds
RAM bottlenecks
Examples
Facebook’s daily logs: 60 TB
1000 Genomes project: 200 TB
Google Web index: 100+ PB
Cost of 1 TB of disk: $50
Time to read 1 TB from disk: 3 hours (100 MB/s)
The big-data problem
A single machine can no longer store, let alone process, all the data
The only solution is to distribute the data over a large cluster of machines
But how much data should you get?
Of course, “it depends”, but for many applications the answer is:
As much as you can get
Big data about people (text, Web, social media) tends to follow power law statistics.
[Figure: log-log plot of the number of search queries with frequency >= x vs. search-query frequency x]
Most queries occur only once or twice
59% of all Web search queries are unique
17% of all queries were made only twice
8% were made three times
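Statistics like these are frequency-of-frequency counts. On a toy query log (the queries below are made up for illustration) they can be computed with two nested Counters:

```python
from collections import Counter

# Toy query log (made-up queries; real logs have billions of entries)
log = ["cats", "weather", "cats", "ada exam", "weather",
       "cheap flights", "cats", "sgd proof", "epfl"]

freq = Counter(log)                    # how often each query occurs
freq_of_freq = Counter(freq.values())  # how many queries occur exactly x times

# Fraction of distinct queries that were seen only once
unique_share = freq_of_freq[1] / len(freq)
print(unique_share)
```

Plotting `freq_of_freq` on log-log axes for a real query log yields the roughly straight line characteristic of a power law.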
Hardware for big data
Budget (a.k.a. commodity) hardware
Not "gold-plated" (a.k.a. custom)
Many low-end servers
Easy to add capacity
Cheaper per CPU and per disk
Increased complexity in software
Google Corkboard server: Steve Jurvetson/Flickr
Problems with cheap hardware
Failures, e.g. of disks and machines (Google numbers)
Commodity network speeds (1-10 Gb/s) are slow compared to RAM
Uneven performance
Disclaimer: these numbers are constantly changing thanks to new technology!
Google datacenter
How to program this beast?
What’s hard about cluster computing?
How do we split work across machines?
How do we deal with failures?
How do you count the number of occurrences of each word in a document?
“I am Sam
I am Sam
Sam I am
Do you like
Green eggs and ham?”
I: 3
am: 3
Sam: 3
do: 1
you: 1
like: 1
…
A hashtable (a.k.a. dict)!
“I am Sam
I am Sam
Sam I am
Do you like
Green eggs and ham?”

Scanning the text word by word, the hashtable grows:
{} → {I: 1} → {I: 1, am: 1} → {I: 1, am: 1, Sam: 1} → {I: 2, am: 1, Sam: 1} → …
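On a single machine this is a few lines of plain Python (a sketch using the lecture's example text):

```python
text = "I am Sam I am Sam Sam I am Do you like Green eggs and ham"

# One pass over the words, accumulating counts in a dict
counts = {}
for word in text.split():
    counts[word] = counts.get(word, 0) + 1

print(counts["I"], counts["am"], counts["Sam"])  # 3 3 3
```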
What if the document is really big?
Split the text into chunks, count each chunk separately, then merge the per-chunk hashtables:

“I am Sam
I am Sam
Sam I am
Do you like
Green eggs and ham?
I do not like them
Sam I am
I do not like
Green eggs and ham
Would you like…”

Per-chunk counts:
{I: 3, am: 3, Sam: 3, …}   {do: 2, …}   {Sam: 1, …}   {Would: 1, …}

Merged:
{I: 6, am: 4, Sam: 4, do: 3, …}
“Divide and Conquer”

“I am Sam
I am Sam
Sam I am
Do you like
Green eggs and ham?
I do not like them
Sam I am
I do not like
Green eggs and ham
Would you like…”

MAP: each chunk is counted independently:
{I: 3, am: 3, …}   {do: 1, you: 1, …}   {Sam: 1, I: 1, …}   {Would: 1, you: 1, …}

REDUCE: the per-chunk counts are merged, each reducer handling a subset of the words:
{I: 6, do: 3, …}   {am: 5, Sam: 4, …}   {you: 2, …}   {Would: 1, …}
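The whole scheme can be simulated on one machine (plain Python standing in for the cluster; hashing each word to pick a reducer mimics the shuffle between the MAP and REDUCE phases):

```python
from collections import Counter

chunks = [
    "I am Sam I am Sam Sam I am",
    "Do you like Green eggs and ham",
    "I do not like them Sam I am",
    "I do not like Green eggs and ham Would you like",
]

# MAP: each "machine" counts its own chunk independently (case-normalized)
partial = [Counter(chunk.lower().split()) for chunk in chunks]

# SHUFFLE: route every (word, count) pair to a reducer chosen by hashing
# the word, so all counts for the same word reach the same reducer
n_reducers = 3
buckets = [[] for _ in range(n_reducers)]
for counts in partial:
    for word, c in counts.items():
        buckets[hash(word) % n_reducers].append((word, c))

# REDUCE: each reducer sums the counts for the words it received
totals = Counter()
for bucket in buckets:
    for word, c in bucket:
        totals[word] += c

print(totals["i"], totals["do"], totals["sam"])  # 6 3 4
```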
What’s hard about cluster computing?
How to divide work across machines?
How to deal with failures?
Solution: MapReduce
Jeff Dean [facts]
Applied Machine Learning Days ’19 [link]
How to deal with failures?
[Diagram: the reduce tasks from the word-count example; one task fails before producing its output]

Just launch another task!
How to deal with slow tasks?
[Diagram: the same tasks; one of them runs much more slowly than the others]
Just launch another task!
Solution: MapReduce
Jeff Dean
More complex jobs must be broken into a sequence of MapReduce jobs
Example task
Suppose you have user info in one file, website logs in another, and you need to find the top 5 pages most visited by users aged 18-25.
Load Users
Load Pages
Filter by age
Join on name
Group on url
Count visits
Order by visits
Take top 5
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In MapReduce
Enter: Spark
from operator import add
from pyspark import SparkContext

sc = SparkContext()
print("I am a regular Python program, using the pyspark lib")

users = (sc.textFile('users.tsv')               # user <TAB> age
           .map(lambda s: tuple(s.split('\t')))
           .filter(lambda ua: 18 <= int(ua[1]) <= 25))
pages = (sc.textFile('pageviews.tsv')           # user <TAB> url
           .map(lambda s: tuple(s.split('\t'))))
counts = (users.join(pages)                     # (user, (age, url))
               .map(lambda x: (x[1][1], 1))     # (url, 1)
               .reduceByKey(add)
               .takeOrdered(5, key=lambda kv: -kv[1]))
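For intuition, the same pipeline on plain Python lists (toy, made-up users and URLs; not Spark):

```python
from collections import Counter

users = [("alice", 20), ("bob", 30), ("carol", 23)]      # (user, age)
pages = [("alice", "/a"), ("alice", "/b"), ("bob", "/a"),
         ("carol", "/b"), ("carol", "/b")]               # (user, url)

# Filter users by age
young = {u for u, age in users if 18 <= age <= 25}

# Join on user, then count visits per url
visit_counts = Counter(url for u, url in pages if u in young)

# Order by visits, take the top 5
top5 = visit_counts.most_common(5)
print(top5)  # [('/b', 3), ('/a', 1)]
```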
RDD: resilient distributed dataset
Spark architecture
Your Python script runs in the driver; RDD operations are run in executors
RDD operations
Lazy execution [unrelated]
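Spark evaluates RDDs lazily: transformations only record what to do, and work happens when an action requests a result. A rough single-machine analogy using Python generators (not Spark itself):

```python
data = range(10)

# "Transformations": building the pipeline does no work yet
doubled = (x * 2 for x in data)          # like rdd.map(lambda x: x * 2)
large = (x for x in doubled if x > 10)   # like .filter(lambda x: x > 10)

# Only the "action" (list(), playing the role of collect()) runs the pipeline
result = list(large)
print(result)  # [12, 14, 16, 18]
```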
RDD transformations [full list]
POLLING TIME!
Why relative fraction, and not absolute number?
RDD transformations [full list]
RDD actions [full list]
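As a toy illustration of the transformation/action split, here is plain Python mimicking flatMap, map, and reduceByKey on a list (collect plays the role of the action):

```python
from itertools import groupby

data = ["I am Sam", "I am Sam", "Sam I am"]

# flatMap: one input line -> many output words
words = [w for line in data for w in line.split()]

# map: word -> (word, 1)
pairs = [(w, 1) for w in words]

# reduceByKey: group the pairs by key and sum the counts
pairs.sort(key=lambda kv: kv[0])
counts = {k: sum(c for _, c in grp)
          for k, grp in groupby(pairs, key=lambda kv: kv[0])}

# collect (an action) would bring the result back to the driver
print(counts)  # {'I': 3, 'Sam': 3, 'am': 3}
```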
Broadcast variables
Accumulators
RDD persistence

Without persistence, the map(f1) transformation is executed twice:

rdd2 = rdd1.map(f1)
list1 = rdd2.filter(f2).collect()
list2 = rdd2.filter(f3).collect()

With persist(), the result of map(f1) is cached and reused (can choose between memory and disk):

rdd2 = rdd1.map(f1)
rdd2.persist()
list1 = rdd2.filter(f2).collect()
list2 = rdd2.filter(f3).collect()
Spark DataFrames
Spark SQL
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)
df = sqlContext.sql("SELECT id, COUNT(*) AS cnt FROM table1 GROUP BY id")
Spark's Machine Learning Toolkit
MLlib: Algorithms [more details]
Classification
Regression
Unsupervised learning
Optimizers
Example:
Logistic regression with MLLib
from pyspark.mllib.classification import LogisticRegressionWithSGD

trainData = sc.textFile("...").map(...)
testData = sc.textFile("...").map(...)

model = LogisticRegressionWithSGD.train(trainData)
predictions = model.predict(testData)
Remarks
Cluster etiquette
Feedback
Give us feedback on this lecture here: https://go.epfl.ch/ada2020-lec13-feedback