Scaling up
Spark Facts
Spark is fast
In 2014, Spark beat Hadoop MapReduce by sorting 100TB of data in 23 minutes, and it has improved considerably since then.
The previous world record was 72 minutes, set by a Hadoop MapReduce cluster of 2100 nodes. This means that Spark sorted the same data 3X faster using 10X fewer machines.
Spark supports multiple languages
Transparent scalability
Multiple ways to deploy a job (local mode, standalone cluster, YARN, Mesos, Kubernetes)
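As a rough sketch (not from the original slides): the same PySpark application code runs unchanged on a laptop or on a cluster; only the master URL passed to the session builder changes.
from pyspark.sql import SparkSession

# The master URL is the only thing that changes between a laptop and a cluster:
# "local[*]" for local mode, "yarn" on a YARN cluster, "spark://host:7077" on a
# standalone cluster. The application logic below stays the same.
spark = (SparkSession.builder
         .appName("word-count")
         .master("local[*]")
         .getOrCreate())

# Classic word count, expressed once and scaled transparently by Spark
# ("README.md" is a hypothetical input file)
counts = (spark.read.text("README.md").rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
print(counts.take(5))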
Spark offers a unified stack
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data.
# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files
path = "examples/src/main/resources/people.json"
peopleDF = spark.read.json(path)
peopleDF.createOrReplaceTempView("people")
# SQL statements can be run via the sql method of the SparkSession (spark)
teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
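The same query can also be expressed through the DataFrame abstraction mentioned above; a small sketch (not from the original slides):
# Equivalent query expressed with DataFrame operations instead of a SQL string
peopleDF.filter("age BETWEEN 13 AND 19").select("name").show()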
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
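A minimal sketch of the classic streaming word count, adapted from the Spark Streaming examples; it assumes a text source on localhost:9999 (e.g. started with nc -lk 9999).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# Count words arriving on a TCP socket (e.g. fed by `nc -lk 9999`)
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()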
Netflix: Near Real-Time Recommendations Using Apache Spark Streaming
https://databricks.com/session/near-real-time-netflix-recommendations-using-apache-spark-streaming
Pinterest: How Pinterest Measures Real-Time User Engagement
https://medium.com/pinterest-engineering/building-a-new-experiment-pipeline-with-spark-b04e1a7d639a
Uber: Detecting Abuse at Scale
MLlib is Spark's machine learning library. Its goal is to make practical machine learning scalable and easy.
from pyspark.ml.regression import LinearRegression
# Load training data
training = spark.read.format("libsvm")\
.load("data/mllib/sample_linear_regression_data.txt")
# Elastic-net regularized linear regression (elasticNetParam mixes L1 and L2 penalties)
lr = LinearRegression(regParam=0.3, elasticNetParam=0.8)
# Fit the model
lrModel = lr.fit(training)
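A possible follow-up (not shown on the original slide): inspect the fitted parameters and use the model to score data.
# Learned parameters
print("Coefficients:", lrModel.coefficients)
print("Intercept:", lrModel.intercept)
# Apply the model; here we simply reuse the training DataFrame for illustration
lrModel.transform(training).select("features", "label", "prediction").show(5)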
GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.
import org.apache.spark.graphx.GraphLoader

// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
// Join the ranks with the usernames
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
// Print the result
println(ranksByUsername.collect().mkString("\n"))
PageRank
Installing PySpark
pip install pyspark
or
conda install pyspark
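After installing, a quick local sanity check (a sketch, not from the original slides):
from pyspark.sql import SparkSession

# Start a local session and run a trivial job to verify the installation
spark = SparkSession.builder.master("local[*]").appName("sanity-check").getOrCreate()
print(spark.range(100).count())  # expected output: 100
spark.stop()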
Useful links
Live Tutorial
https://go.epfl.ch/ada19_spark
Configuration advice
https://dlab.epfl.ch/2017-09-30-what-i-learned-from-processing-big-data-with-spark/
Original documentation
https://spark.apache.org/docs/latest/