Lightning-fast cluster computing
Thomas Cooper
PhD Student
Newcastle University
Cloud Computing for Big Data
Background

Map Reduce Word Count

Map phase — each mapper applies w → (w, 1) to the words in its input split:
  ("is", 1), ("python", 1), ("awesome", 1)
  ("is", 1), ("java", 1), ("verbose", 1)
  ("is", 1), ("scala", 1), ("unreadable", 1)
  ("yeah", 1), ("python", 1), ("awesome", 1)

Reduce phase — values with the same key are merged with a, b → a + b:
  ("yeah", 1), ("python", 2), ("is", 3), ("awesome", 2)
  ("java", 1), ("scala", 1), ("verbose", 1), ("unreadable", 1)
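The pipeline in the diagram above can be sketched in plain Python — a toy, single-process sketch of the shape of the computation (map, shuffle by key, reduce), not a MapReduce implementation:

```python
from collections import defaultdict
from functools import reduce

lines = ["python is awesome", "java is verbose",
         "scala is unreadable", "yeah python awesome"]

# Map phase: emit (w, 1) for every word
mapped = [(w, 1) for line in lines for w in line.split()]

# Shuffle: group the pairs by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: fold each group with a, b -> a + b
counts = {key: reduce(lambda a, b: a + b, values)
          for key, values in groups.items()}

print(counts["is"])      # 3
print(counts["python"])  # 2
```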
Background
Good for acyclic data flows (a single pass over the data)
For example:
Word counts
Log processing
Bad for iterative data flows (multiple passes), because intermediate results are repeatedly written to and re-read from disk between passes
For example:
PageRank
Parameter optimisation in machine learning algorithms
Interactive analytics
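PageRank illustrates why multi-pass workloads suffer under MapReduce: every iteration consumes the ranks produced by the previous one, so a disk-based framework re-reads the whole dataset each pass. A toy power-iteration sketch on a made-up three-page graph (plain Python, illustrative only):

```python
# links[p] = pages that p links to (a made-up three-page web)
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {p: 1.0 / len(links) for p in links}
damping = 0.85

# Each iteration needs every rank computed in the previous pass
for _ in range(20):
    contribs = {p: 0.0 for p in links}
    for page, outgoing in links.items():
        share = ranks[page] / len(outgoing)
        for target in outgoing:
            contribs[target] += share
    ranks = {p: (1 - damping) / len(links) + damping * c
             for p, c in contribs.items()}

# The ranks remain a probability distribution (they sum to 1)
print(round(sum(ranks.values()), 6))  # 1.0
```

In Spark the ranks can simply stay cached in memory across iterations; in classic MapReduce each pass is a fresh job that reloads its input from disk.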
What is Spark?
Resilient Distributed Datasets (RDDs)

An RDD is a read-only collection of records, partitioned across the machines of the cluster.

Transformations (create an RDD from stable storage, or build a new RDD from an existing one):
map
filter
join
reduceByKey

Actions (return a value to the driver or write results to storage):
count
collect
save
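Transformations are lazy: nothing is computed until an action forces a result. The behaviour can be mimicked with Python generators — a sketch of the idea only, not the Spark API:

```python
log = []

def trace(x):
    log.append(x)  # record which elements are actually touched
    return x

data = range(5)

# "Transformations": building the pipeline runs nothing yet
squared = (trace(x) ** 2 for x in data)
evens = (x for x in squared if x % 2 == 0)
assert log == []  # no element has been touched so far

# "Action": forcing a result pulls data through the whole pipeline
result = sum(evens)
print(result)  # 0 + 4 + 16 = 20
print(log)     # [0, 1, 2, 3, 4]
```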
Map Reduce Word Count (the same pipeline as in the Background section)
Resilient Distributed Datasets (RDDs)
text_file = sc.textFile("hdfs://...")
words = text_file.flatMap(lambda line: line.split(" "))
word_tuples = words.map(lambda word: (word, 1))
counts = word_tuples.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
The resulting lineage graph: text_file → flatMap → words → map → word_tuples → reduceByKey → counts
Dependencies
Narrow Dependency — each partition of the parent RDD is used by at most one partition of the child RDD (e.g. map or filter)
Wide Dependency — a partition of the child RDD may depend on many partitions of the parent, requiring a shuffle (e.g. join or groupByKey)
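The difference can be simulated with partitions as plain Python lists (a toy model, not Spark internals): a narrow step transforms each partition independently, while a wide step must regroup records by key across all partitions.

```python
from collections import Counter

partitions = [["is", "python"], ["is", "java"], ["is", "scala"]]

# Narrow dependency: each output partition needs exactly one input partition
mapped = [[(w, 1) for w in part] for part in partitions]

# Wide dependency: every output partition may need records from every
# input partition, so data is shuffled (hash-partitioned by key) first
num_out = 2
shuffled = [[] for _ in range(num_out)]
for part in mapped:
    for key, value in part:
        shuffled[hash(key) % num_out].append((key, value))

# reduceByKey within each post-shuffle partition
reduced = [Counter() for _ in range(num_out)]
for i, part in enumerate(shuffled):
    for key, value in part:
        reduced[i][key] += value

total = Counter()
for c in reduced:
    total += c
print(total["is"])  # 3
```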
Staging

Spark splits the lineage graph into stages at wide dependencies; consecutive narrow transformations are pipelined together within a single stage.

Stage 1 (narrow steps pipelined together):
text_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))

Stage 2 (begins at the reduceByKey shuffle):
word_tuples.reduceByKey(lambda a, b: a + b)

Lineage: text_file → flatMap → words → map → word_tuples → reduceByKey → counts
Fault Tolerance

If a partition of an RDD is lost, Spark does not need to replicate the data: it recomputes just that partition by replaying the lineage that produced it.

text_file → flatMap → words → map → word_tuples → reduceByKey → counts
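Lineage-based recovery can be sketched in plain Python. In this toy model (none of these names are Spark's) an "RDD" remembers its parent and the function that derived it, so a lost partition can be rebuilt on demand:

```python
class ToyRDD:
    """A partitioned dataset that remembers how it was computed."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.partitions = partitions  # may be dropped to simulate loss
        self.parent = parent          # lineage: where the data came from
        self.fn = fn                  # lineage: how it was derived

    def map_partitions(self, fn):
        return ToyRDD(partitions=[fn(p) for p in self.partitions],
                      parent=self, fn=fn)

    def recompute(self, i):
        """Rebuild one lost partition by replaying the lineage."""
        return self.fn(self.parent.partitions[i])

base = ToyRDD(partitions=[[1, 2], [3, 4]])
doubled = base.map_partitions(lambda p: [x * 2 for x in p])

doubled.partitions[1] = None                  # simulate losing a partition
doubled.partitions[1] = doubled.recompute(1)  # replay lineage for it only
print(doubled.partitions[1])  # [6, 8]
```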
Spark Components
Spark SQL
Spark MLlib
Machine Learning
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)

# Evaluate the model on the training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))

# Save and load the model
model.save(sc, "myModelPath")
sameModel = LogisticRegressionModel.load(sc, "myModelPath")
Spark Streaming
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two worker threads and a batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the counts for each batch (an output operation is required
# before the computation can start)
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

To feed the stream, run `nc -lk 9999` in another terminal and type some text.
Developing for Spark
Submit to a standalone cluster:

$ ./bin/spark-submit --master spark://207.184.161.138:7077 \
    --py-files dependencies \
    path-to-my-script/myscript.py args

Run locally with 4 worker threads:

$ ./bin/spark-submit --master local[4] \
    --py-files dependencies \
    path-to-my-script/myscript.py args
Interactive shell using IPython:

$ PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark

Or an IPython/Jupyter notebook:

$ PYSPARK_DRIVER_PYTHON=ipython \
  PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
  ./bin/pyspark
Code Dojo
http://tomcooper.org.uk/2015/11/11/Python-North-East.html