Sparking Insights in Data
Micah Whitacre
@mkwhit
Batch Oriented
Iterative Algorithms Difficult
Custom Execution Engines
Slow startup and I/O times
How Spark is Known..
In Memory
100x Faster than MapReduce
SQL, streaming, and complex analytics
A fast and general engine for large-scale data processing.
Spark has an advanced Directed Acyclic Graph (DAG) execution engine that supports cyclic data flow and in-memory computing.
RDD
Resilient Distributed Dataset
Fault Tolerant
Locality Aware Scheduling
Scalability
Applications with working sets
(Parallel ops on intermediate results)
Options?
Distributed Shared Memory + Checkpointing
[Diagram: nodes with memory and storage; state is checkpointed to storage and replicated to other nodes]
Or
Log Updates
Log Updates (coarse-grained)
Immutable/Read Only
Partitioned
Bad for async updates to shared state
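The "log" Spark keeps is the RDD lineage: a record of coarse-grained transformations rather than fine-grained writes. A minimal sketch, assuming a SparkContext named ctx like the one created on the next slides; toDebugString is a real RDD method that prints the lineage.

val doubled = ctx.parallelize(1 to 10).map(_ * 2).filter(_ > 5)
println(doubled.toDebugString)  // prints the chain of coarse-grained ops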
Stable RDD
Parallelized Collection
Input (HDFS, Files, JDBC)
val ctx = new SparkContext(master, "Spark App", conf)
SparkContext (creates Spark Application)
Cluster URL
Application Name
Broadcast Values
Accumulators
Creates RDDs
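Broadcast values and accumulators are created from the context. A minimal sketch using the ctx from above; the map contents and the counter are placeholders.

val lookup = ctx.broadcast(Map("a" -> 1, "b" -> 2))  // read-only copy shipped to executors
val badRecords = ctx.accumulator(0)                  // tasks may only add; driver reads badRecords.value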
val list = List(1, 2, 3, 4, 5)
val rdd = ctx.parallelize(list)
val rawRDD = ctx.textFile(path)
Transformations
lazily executed
map, filter, flatMap, union, groupByKey, sample
Actions
return values to driver
reduce, collect, count, take
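A minimal sketch of the lazy/eager split, assuming the ctx from earlier: transformations only record the computation, and the action triggers it.

val nums = ctx.parallelize(1 to 100)
val squares = nums.map(n => n * n)      // transformation: recorded, nothing runs
val evens = squares.filter(_ % 2 == 0)  // transformation: still nothing runs
val total = evens.count()               // action: executes and returns to the driver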
Stable RDD → map → modelRDD → filter → filteredRDD → count

val modelRDD = rawRDD.map(s => parse(s))
val filteredRDD = modelRDD.filter(m => m.cost > 0)
val count = filteredRDD.count
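The parse helper and cost field are not defined on the slides; a hypothetical sketch of what they might look like, with Model and parse as made-up names.

case class Model(id: String, cost: Double)  // hypothetical record type
def parse(line: String): Model = {          // hypothetical "id,cost" parser
  val fields = line.split(",")
  Model(fields(0), fields(1).toDouble)
}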
RDDs are created each time an action runs
filteredRDD.count
RDDs can be persisted
Partitions recomputed on failure
filteredRDD.persist()
filteredRDD.cache() (persist with the default storage level)
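A sketch of why persisting helps, using the filteredRDD and hypothetical Model from above: cache once, then run several actions without recomputing the lineage.

filteredRDD.cache()
val n = filteredRDD.count()    // first action materializes the partitions
val totalCost = filteredRDD
  .map(m => m.cost)
  .reduce(_ + _)               // reuses the cached partitions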
Storage Levels
MEMORY_ONLY (default)
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
MEMORY_ONLY_2
MEMORY_AND_DISK_2
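A minimal sketch of picking a non-default level instead of cache(); StorageLevel lives in org.apache.spark.storage.

import org.apache.spark.storage.StorageLevel
filteredRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spills to disk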
RDD lifecycle in memory is tied to the SparkContext
Text & Sequence Files
JavaPairRDD => HDFS
RDD.foreach(f => …)
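A sketch of writing results back out, assuming the filteredRDD and hypothetical Model from earlier; saveAsTextFile writes one part file per partition, and the output path is a placeholder.

filteredRDD
  .map(m => s"${m.id},${m.cost}")
  .saveAsTextFile("hdfs:///tmp/models")  // placeholder path
filteredRDD.foreach(m => println(m))     // runs on the executors, not the driver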
Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Multitenancy
Simple Cluster FIFO
In App FIFO/Fair
Static vs Dynamic Resources
Memory & CPU
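In-app FIFO vs. fair scheduling is controlled through SparkConf; spark.scheduler.mode is a real Spark setting. A minimal sketch:

import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("Spark App")
  .set("spark.scheduler.mode", "FAIR")  // default is FIFO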
Ecosystem
Apache Crunch
Pipeline pipeline = new MRPipeline(Driver.class, conf);
Pipeline p = new SparkPipeline(master, "App");
Spark Streaming
Near Real Time Processing
Discretized Stream
(DStream)
RDD @ T0
RDD @ T1
RDD @ T2
RDD @ T3
All RDD operations available
Window
[Diagram: sliding windows over the batch RDDs: Window @ T1, Window @ T3]
val ctx = new StreamingContext(sparkConf, Seconds(1))
val strm = ctx.socketStream(..)
val out = strm.map(f => …)
...
ctx.start()
ctx.awaitTermination()
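A sketch of the window operations from the diagram, continuing from the out stream above: a 30-second window that slides every 10 seconds (both must be multiples of the 1-second batch interval).

val windowed = out.window(Seconds(30), Seconds(10))
windowed.count().print()  // element count for each window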
Stream Sources
Full recovery depends on the source
Kafka, Flume, Twitter, MQTT, ZeroMQ
Exactly once processing
At least once persistence
[Diagram: a stream Source delivering data to Slave nodes]
Create RDDs from SQL

val youngUsers = sc.sql2rdd(
  "SELECT * FROM users WHERE age < 20")
Spark SQL (1.0)
SchemaRDD
Only supports RDDs, Parquet, Hive
Planned as the future platform for Shark
case class Person(name: String, age: Int)
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))  // schema comes from the case class
people.registerAsTable("people")
val teenagers = sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")
Spark SQL (1.0)
val teenagers = people
  .where('age >= 10)
  .where('age <= 19)
  .select('name)
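Either form returns a SchemaRDD, which is still an RDD; a sketch of pulling the rows back to the driver, mirroring the Spark 1.0 documentation example:

teenagers.map(t => "Name: " + t(0)).collect().foreach(println)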
GraphX & Bagel
MLlib
Linear Regression
Binary Classification
Clustering
Collaborative Filtering
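A minimal MLlib sketch for the clustering case; KMeans and Vectors are real MLlib classes, while the input path and parameters are placeholders.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = ctx.textFile("data/points.txt")  // placeholder: space-separated doubles per line
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
val model = KMeans.train(points, 2, 20)       // k = 2 clusters, 20 iterations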
Save
Compile
Package
Ship
Execute
Repeat
Who likes a REPL?
Spark Shell
Scala & Python
Experiment First
Lower barrier to entry
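A sketch of the shell workflow (bin/spark-shell ships with Spark and predefines sc; the file name is a placeholder):

$ bin/spark-shell
scala> val lines = sc.textFile("README.md")
scala> lines.filter(_.contains("Spark")).count()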
Lessons
Know your data
Know your data
Know your data
Know your computations
Example Project
Links
Standalone
[Diagram: a Master with four Slaves; a standby Master coordinated through Zookeeper for HA]
YARN
[Diagram, yarn-client mode: the Driver runs outside the cluster; the Resource Manager places an Application Master on one node and Slaves on the others]
[Diagram, yarn-cluster mode: the Application Master and Driver share one node; Slaves run on the others]
Mesos
Fine Grained
[Diagram: Mesos Master with Slaves; many short-lived Spark tasks scheduled across the slaves]
Coarse Grained
[Diagram: Mesos Master with Slaves; one long-running Spark executor per slave]