Scaling up
Spark Facts
Spark is fast
In 2014, Spark beat Hadoop MapReduce by sorting 100TB of data in 23 minutes, and it has improved considerably since then.
The previous world record was 72 minutes, set by a Hadoop MapReduce cluster of 2100 nodes. This means that Spark sorted the same data 3X faster using 10X fewer machines.
Spark supports multiple languages
Transparent scalability
Multiple ways to deploy a job (local mode, standalone cluster, YARN, Mesos, Kubernetes)
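As a rough sketch (not from the original slides): the same PySpark application code runs unchanged on a laptop or on a cluster; only the master URL passed to the session builder changes.
from pyspark.sql import SparkSession

# The master URL is the only thing that changes between a laptop and a cluster:
# "local[*]" for local mode, "yarn" on a YARN cluster, "spark://host:7077" on a
# standalone cluster. The application logic below stays the same.
spark = (SparkSession.builder
         .appName("word-count")
         .master("local[*]")
         .getOrCreate())

# Classic word count, expressed once and scaled transparently by Spark
# ("README.md" is a hypothetical input file)
counts = (spark.read.text("README.md").rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
print(counts.take(5))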
Spark offers a unified stack
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data.
# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files
path = "examples/src/main/resources/people.json"
peopleDF = spark.read.json(path)
peopleDF.createOrReplaceTempView("people")
# SQL statements can be run via the sql method of the SparkSession (spark)
teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
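The same query can also be expressed through the DataFrame abstraction mentioned above; a small sketch (not from the original slides):
# Equivalent query expressed with DataFrame operations instead of a SQL string
peopleDF.filter("age BETWEEN 13 AND 19").select("name").show()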
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
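A minimal sketch of the classic streaming word count, adapted from the Spark Streaming examples; it assumes a text source on localhost:9999 (e.g. started with nc -lk 9999).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# Count words arriving on a TCP socket (e.g. fed by `nc -lk 9999`)
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()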
Netflix: Near Real-Time Recommendations Using Apache Spark Streaming
https://databricks.com/session/near-real-time-netflix-recommendations-using-apache-spark-streaming
Pinterest: How Pinterest Measures Real-Time User Engagement
https://medium.com/pinterest-engineering/building-a-new-experiment-pipeline-with-spark-b04e1a7d639a
Uber: Detecting Abuse at Scale
MLlib is Spark's machine learning library. Its goal is to make practical machine learning scalable and easy.
from pyspark.ml.regression import LinearRegression
# Load training data
training = spark.read.format("libsvm")\
.load("data/mllib/sample_linear_regression_data.txt")
# Elastic-net regularized linear regression (elasticNetParam mixes L1 and L2 penalties)
lr = LinearRegression(regParam=0.3, elasticNetParam=0.8)
# Fit the model
lrModel = lr.fit(training)
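A possible follow-up (not shown on the original slide): inspect the fitted parameters and use the model to score data.
# Learned parameters
print("Coefficients:", lrModel.coefficients)
print("Intercept:", lrModel.intercept)
# Apply the model; here we simply reuse the training DataFrame for illustration
lrModel.transform(training).select("features", "label", "prediction").show(5)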
GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.
import org.apache.spark.graphx.GraphLoader

// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
// Join the ranks with the usernames
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
// Print the result
println(ranksByUsername.collect().mkString("\n"))
PageRank
Installing PySpark
pip install pyspark
or
conda install pyspark
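After installing, a quick local sanity check (a sketch, not from the original slides):
from pyspark.sql import SparkSession

# Start a local session and run a trivial job to verify the installation
spark = SparkSession.builder.master("local[*]").appName("sanity-check").getOrCreate()
print(spark.range(100).count())  # expected output: 100
spark.stop()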
Useful links
Live Tutorial
https://go.epfl.ch/ada19_spark
Configuration advice
https://dlab.epfl.ch/2017-09-30-what-i-learned-from-processing-big-data-with-spark/
Original documentation
https://spark.apache.org/docs/latest/