1 of 21

2 of 21

Lightning-fast cluster computing

Thomas Cooper

PhD Student

Newcastle University

Cloud Computing for Big Data

@tomncooper | tom.n.cooper@gmail.com | www.tomcooper.org.uk

3 of 21

  • Background to large scale processing
  • What is Apache Spark?
  • Developing for Spark
  • Code Dojo

4 of 21

Background

[Diagram: each Map task emits w → (w, 1) for every word w in its input split — e.g. ("is", 1), ("python", 1), ("awesome", 1), ("java", 1), ("verbose", 1), ("scala", 1), ("unreadable", 1), ("yeah", 1) — and the Reduce tasks sum the counts with a, b → a + b, giving ("is", 3), ("python", 2), ("awesome", 2), ("yeah", 1), ("java", 1), ("scala", 1), ("verbose", 1), ("unreadable", 1)]

Map Reduce
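
To make the diagram concrete, here is a minimal local Python sketch (not Hadoop code; the sentence list is reconstructed from the pairs above) of the same two functions:

from functools import reduce
from itertools import groupby

sentences = ["python is awesome", "java is verbose",
             "scala is unreadable", "yeah python awesome"]

# Map: emit (w, 1) for every word w
pairs = [(word, 1) for sentence in sentences for word in sentence.split(" ")]

# Shuffle: group the pairs by key (word)
pairs.sort(key=lambda kv: kv[0])
grouped = groupby(pairs, key=lambda kv: kv[0])

# Reduce: combine the counts for each word with a, b -> a + b
counts = {word: reduce(lambda a, b: a + b, (count for _, count in group))
          for word, group in grouped}

print(counts)   # {'awesome': 2, 'is': 3, 'java': 1, ...}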

5 of 21

Background

Good for acyclic data flows (one pass), for example:

  • Word counts
  • Log processing

Bad for iterative data flows (multi-pass), due to repeated loading from disk, for example:

  • PageRank
  • Parameter optimisation in machine learning algorithms
  • Interactive analytics
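
As a toy illustration (plain Python, all names illustrative) of why multi-pass flows hurt: an iterative parameter update reads the whole dataset once per iteration, and in MapReduce each of those passes is a separate job that reloads its input from disk:

# Illustrative only: a gradient-style update that repeatedly passes over the data.
# In MapReduce every pass below would be a new job re-reading its input from disk;
# Spark (next slides) can keep the dataset cached in memory between passes.
data = [1.0, 2.0, 3.0, 4.0]          # stands in for a large on-disk dataset
param = 0.0
for iteration in range(10):          # multi-pass / iterative data flow
    for record in data:              # one full pass over the data
        param += 0.1 * (record - param)
print(param)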

6 of 21

What is Spark?

  • Speed
  • Fault tolerance
  • Interactive

7 of 21

Resilient Distributed Datasets (RDDs)

[Diagram: an RDD is split into partitions, and those partitions are distributed across the nodes of the cluster]
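
A small PySpark sketch (assuming the SparkContext sc used in the later slides) showing that an RDD is made of partitions spread over the cluster:

# Distribute a local collection across the cluster as an RDD with 4 partitions
numbers = sc.parallelize(range(100), 4)

print(numbers.getNumPartitions())    # 4
print(numbers.glom().collect())      # the elements held by each partition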

8 of 21

Resilient Distributed Datasets (RDDs)

Creation: load data into an RDD (e.g. from a text file or an existing collection)

Transformations (produce a new RDD): map, filter, join, reduceByKey

Actions (return a result to the driver or save it): count, collect, save
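
A minimal sketch (assuming sc again): transformations such as filter only describe a new RDD, while actions such as count or collect actually compute and return a result:

# Transformation: describes a new RDD, nothing is executed yet
evens = sc.parallelize(range(10)).filter(lambda x: x % 2 == 0)

# Actions: trigger the computation and return results to the driver
print(evens.count())     # 5
print(evens.collect())   # [0, 2, 4, 6, 8]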

9 of 21

Background

[Diagram repeated from earlier: MapReduce word count — Map tasks emit w → (w, 1), Reduce tasks sum the counts with a, b → a + b]

Map Reduce Word Count

10 of 21

Resilient Distributed Datasets (RDDs)

# Read the input text from HDFS
text_file = sc.textFile("hdfs://...")

# Split each line into words
words = text_file.flatMap(lambda line: line.split(" "))

# Pair each word with a count of 1
word_tuples = words.map(lambda word: (word, 1))

# Sum the counts for each word
counts = word_tuples.reduceByKey(lambda a, b: a + b)

# Write the results back to HDFS
counts.saveAsTextFile("hdfs://...")

[Lineage diagram: text_file →flatMap→ words →map→ word_tuples →reduceByKey→ counts]

11 of 21

Dependencies

Narrow dependency: each partition of the child RDD depends on a small, fixed set of parent partitions (e.g. map or filter)

Wide dependency: a partition of the child RDD may depend on data in every parent partition, so a shuffle is needed (e.g. join or groupByKey)
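
A hedged sketch of the difference (assuming sc): map stays within each partition, while groupByKey has to shuffle records between partitions:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow dependency: each output partition is built from a single input partition
doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))

# Wide dependency: values for the same key may sit in different partitions,
# so groupByKey moves data across the cluster (a shuffle)
grouped = pairs.groupByKey().mapValues(list)

print(doubled.collect())   # [('a', 2), ('b', 4), ('a', 6)]
print(grouped.collect())   # [('a', [1, 3]), ('b', [2])] (order may vary)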

12 of 21

Staging

[Diagram: the word count lineage — text_file →flatMap→ words →map→ word_tuples →reduceByKey→ counts — divided into stages (Stage 1 to Stage 4); steps joined by narrow dependencies can be pipelined within one stage, while the wide reduceByKey dependency marks a stage boundary]

text_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))

word_tuples.reduceByKey(lambda a, b: a + b)
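
One way to see the plan Spark builds (a small sketch reusing the counts RDD from the previous slide):

# Print the chain of RDDs behind counts; indentation in the output marks
# the shuffle (stage) boundaries
print(counts.toDebugString())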

13 of 21

Fault Tolerance

[Diagram: the same lineage — text_file →flatMap→ words →map→ word_tuples →reduceByKey→ counts; if a partition of an RDD is lost, Spark can recompute just that partition by replaying its lineage]

14 of 21

Spark Components

15 of 21

Spark SQL

  • Library for structured data
  • Create DataFrames from JSON etc.
  • Use standard SQL queries (see the sketch below)

Spark MLlib

  • Provides libraries for working with RDDs or DataFrames
  • Basic statistics
  • Classification and regression
  • Clustering
  • Dimensionality reduction
  • Feature extraction
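
A minimal Spark SQL sketch using the 1.x-era API (people.json is a hypothetical input file; sc is the SparkContext from the other examples):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Build a DataFrame from a JSON file (one JSON object per line)
df = sqlContext.read.json("people.json")

# Register it as a temporary table and use a standard SQL query
df.registerTempTable("people")
adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()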

16 of 21

Machine Learning

from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)

# Evaluate the model on the training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))

# Save and load the model
model.save(sc, "myModelPath")
sameModel = LogisticRegressionModel.load(sc, "myModelPath")

17 of 21

Spark Streaming

18 of 21

Spark Streaming

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two worker threads and a batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten counts of each batch to the console
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

19 of 21

Developing for Spark

20 of 21

Developing for Spark

# Submit a script to a standalone Spark cluster
$ ./bin/spark-submit --master spark://207.184.161.138:7077 \
    --py-files dependencies \
    path-to-my-script/myscript.py args

# Run the same script locally, using 4 worker threads
$ ./bin/spark-submit --master local[4] \
    --py-files dependencies \
    path-to-my-script/myscript.py args

# Interactive PySpark shell using IPython as the driver
$ PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark

# Interactive PySpark session inside an IPython/Jupyter notebook
$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark
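
For completeness, a hedged sketch of what path-to-my-script/myscript.py could contain (all names illustrative); unlike the interactive shells, a submitted script creates its own SparkContext:

# myscript.py (illustrative): word count over the paths given as arguments
import sys
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("WordCount")
    sc = SparkContext(conf=conf)

    text_file = sc.textFile(sys.argv[1])
    counts = (text_file.flatMap(lambda line: line.split(" "))
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile(sys.argv[2])

    sc.stop()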

21 of 21

Code Dojo

http://tomcooper.org.uk/2015/11/11/Python-North-East.html