1 of 21

2 of 21

Lightning-fast cluster computing

Thomas Cooper

PhD Student

Newcastle University

Cloud Computing for Big Data

@tomncooper | tom.n.cooper@gmail.com | www.tomcooper.org.uk

3 of 21

  • Background to large scale processing
  • What is Apache Spark?
  • Developing for Spark
  • Code Dojo

4 of 21

Background

[Diagram: each Map task emits w → (w, 1) for every word w in its input split — e.g. ("is", 1), ("python", 1), ("awesome", 1), ("java", 1), ("verbose", 1), ("scala", 1), ("unreadable", 1), ("yeah", 1) — and the Reduce tasks sum the counts with a, b → a + b, giving ("is", 3), ("python", 2), ("awesome", 2), ("yeah", 1), ("java", 1), ("scala", 1), ("verbose", 1), ("unreadable", 1)]

Map Reduce
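
To make the diagram concrete, here is a minimal local Python sketch (not Hadoop code; the sentence list is reconstructed from the pairs above) of the same two functions:

from functools import reduce
from itertools import groupby

sentences = ["python is awesome", "java is verbose",
             "scala is unreadable", "yeah python awesome"]

# Map: emit (w, 1) for every word w
pairs = [(word, 1) for sentence in sentences for word in sentence.split(" ")]

# Shuffle: group the pairs by key (word)
pairs.sort(key=lambda kv: kv[0])
grouped = groupby(pairs, key=lambda kv: kv[0])

# Reduce: combine the counts for each word with a, b -> a + b
counts = {word: reduce(lambda a, b: a + b, (count for _, count in group))
          for word, group in grouped}

print(counts)   # {'awesome': 2, 'is': 3, 'java': 1, ...}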

5 of 21

Background

Good for acyclic data flows (one pass), for example:

  • Word counts
  • Log processing

Bad for iterative data flows (multi-pass), due to repeated loading from disk, for example:

  • PageRank
  • Parameter optimisation in machine learning algorithms
  • Interactive analytics
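
As a toy illustration (plain Python, all names illustrative) of why multi-pass flows hurt: an iterative parameter update reads the whole dataset once per iteration, and in MapReduce each of those passes is a separate job that reloads its input from disk:

# Illustrative only: a gradient-style update that repeatedly passes over the data.
# In MapReduce every pass below would be a new job re-reading its input from disk;
# Spark (next slides) can keep the dataset cached in memory between passes.
data = [1.0, 2.0, 3.0, 4.0]          # stands in for a large on-disk dataset
param = 0.0
for iteration in range(10):          # multi-pass / iterative data flow
    for record in data:              # one full pass over the data
        param += 0.1 * (record - param)
print(param)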

6 of 21

What is Spark?

  • Speed
  • Fault tolerance
  • Interactive

7 of 21

Resilient Distributed Datasets (RDDs)

[Diagram: an RDD is split into partitions, and those partitions are distributed across the nodes of the cluster]
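
A small PySpark sketch (assuming the SparkContext sc used in the later slides) showing that an RDD is made of partitions spread over the cluster:

# Distribute a local collection across the cluster as an RDD with 4 partitions
numbers = sc.parallelize(range(100), 4)

print(numbers.getNumPartitions())    # 4
print(numbers.glom().collect())      # the elements held by each partition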

8 of 21

Resilient Distributed Datasets (RDDs)

Creation: load data into an RDD (e.g. from a text file or an existing collection)

Transformations (produce a new RDD): map, filter, join, reduceByKey

Actions (return a result to the driver or save it): count, collect, save
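
A minimal sketch (assuming sc again): transformations such as filter only describe a new RDD, while actions such as count or collect actually compute and return a result:

# Transformation: describes a new RDD, nothing is executed yet
evens = sc.parallelize(range(10)).filter(lambda x: x % 2 == 0)

# Actions: trigger the computation and return results to the driver
print(evens.count())     # 5
print(evens.collect())   # [0, 2, 4, 6, 8]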

9 of 21

Background

[Diagram repeated from earlier: MapReduce word count — Map tasks emit w → (w, 1), Reduce tasks sum the counts with a, b → a + b]

Map Reduce Word Count

10 of 21

Resilient Distributed Datasets (RDDs)

# Read the input text from HDFS
text_file = sc.textFile("hdfs://...")

# Split each line into words
words = text_file.flatMap(lambda line: line.split(" "))

# Pair each word with a count of 1
word_tuples = words.map(lambda word: (word, 1))

# Sum the counts for each word
counts = word_tuples.reduceByKey(lambda a, b: a + b)

# Write the results back to HDFS
counts.saveAsTextFile("hdfs://...")

[Lineage diagram: text_file →flatMap→ words →map→ word_tuples →reduceByKey→ counts]

11 of 21

Dependencies

Narrow dependency: each partition of the child RDD depends on a small, fixed set of parent partitions (e.g. map or filter)

Wide dependency: a partition of the child RDD may depend on data in every parent partition, so a shuffle is needed (e.g. join or groupByKey)
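
A hedged sketch of the difference (assuming sc): map stays within each partition, while groupByKey has to shuffle records between partitions:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow dependency: each output partition is built from a single input partition
doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))

# Wide dependency: values for the same key may sit in different partitions,
# so groupByKey moves data across the cluster (a shuffle)
grouped = pairs.groupByKey().mapValues(list)

print(doubled.collect())   # [('a', 2), ('b', 4), ('a', 6)]
print(grouped.collect())   # [('a', [1, 3]), ('b', [2])] (order may vary)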

12 of 21

Staging

[Diagram: the word count lineage — text_file →flatMap→ words →map→ word_tuples →reduceByKey→ counts — divided into stages (Stage 1 to Stage 4); steps joined by narrow dependencies can be pipelined within one stage, while the wide reduceByKey dependency marks a stage boundary]

text_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))

word_tuples.reduceByKey(lambda a, b: a + b)
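
One way to see the plan Spark builds (a small sketch reusing the counts RDD from the previous slide):

# Print the chain of RDDs behind counts; indentation in the output marks
# the shuffle (stage) boundaries
print(counts.toDebugString())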

13 of 21

Fault Tolerance

[Diagram: the same lineage — text_file →flatMap→ words →map→ word_tuples →reduceByKey→ counts; if a partition of an RDD is lost, Spark can recompute just that partition by replaying its lineage]

14 of 21

Spark Components

15 of 21

Spark SQL

  • Library for structured data
  • Create DataFrames from JSON etc.
  • Use standard SQL queries (see the sketch below)

Spark MLlib

  • Provides libraries for working with RDDs or DataFrames
  • Basic statistics
  • Classification and regression
  • Clustering
  • Dimensionality reduction
  • Feature extraction
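
A minimal Spark SQL sketch using the 1.x-era API (people.json is a hypothetical input file; sc is the SparkContext from the other examples):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Build a DataFrame from a JSON file (one JSON object per line)
df = sqlContext.read.json("people.json")

# Register it as a temporary table and use a standard SQL query
df.registerTempTable("people")
adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()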

16 of 21

Machine Learning

from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)

# Evaluate the model on the training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))

# Save and load the model
model.save(sc, "myModelPath")
sameModel = LogisticRegressionModel.load(sc, "myModelPath")

17 of 21

Spark Streaming

18 of 21

Spark Streaming

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two worker threads and a batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten counts of each batch to the console
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

19 of 21

Developing for Spark

20 of 21

Developing for Spark

# Submit a script to a standalone Spark cluster
$ ./bin/spark-submit --master spark://207.184.161.138:7077 \
    --py-files dependencies \
    path-to-my-script/myscript.py args

# Run the same script locally, using 4 worker threads
$ ./bin/spark-submit --master local[4] \
    --py-files dependencies \
    path-to-my-script/myscript.py args

# Interactive PySpark shell using IPython as the driver
$ PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark

# Interactive PySpark session inside an IPython/Jupyter notebook
$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark
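
For completeness, a hedged sketch of what path-to-my-script/myscript.py could contain (all names illustrative); unlike the interactive shells, a submitted script creates its own SparkContext:

# myscript.py (illustrative): word count over the paths given as arguments
import sys
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("WordCount")
    sc = SparkContext(conf=conf)

    text_file = sc.textFile(sys.argv[1])
    counts = (text_file.flatMap(lambda line: line.split(" "))
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile(sys.argv[2])

    sc.stop()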

21 of 21

Code Dojo

http://tomcooper.org.uk/2015/11/11/Python-North-East.html