1 of 123

Sparking Insights in Data

Micah Whitacre

@mkwhit

2 of 123

3 of 123

4 of 123

5 of 123

6 of 123

7 of 123

8 of 123

9 of 123

Slow startup and I/O times

10 of 123

Batch Oriented

Slow startup and I/O times

11 of 123

Batch Oriented

Iterative Algorithms Difficult

Slow startup and I/O times

12 of 123

Batch Oriented

Iterative Algorithms Difficult

Custom Execution Engines

Slow startup and I/O times

13 of 123

14 of 123

How Spark is Known..

15 of 123

In Memory

How Spark is Known..

16 of 123

In Memory

SQL, streaming, and complex analytics

How Spark is Known..

17 of 123

In Memory

100x Faster than MapReduce

SQL, streaming, and complex analytics

How Spark is Known..

18 of 123

A fast and general engine for large-scale data processing.

19 of 123

Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

20 of 123

Spark has an advanced Directed Acyclic Graph execution engine that supports cyclic data flow and in-memory computing.

21 of 123

Spark has an advanced Directed Acyclic Graph execution engine that supports cyclic data flow and in-memory computing.

22 of 123

RDD

23 of 123

RDD

Resilient Distributed Dataset

24 of 123

Locality Aware Scheduling

25 of 123

Locality Aware Scheduling

Scalability

26 of 123

Fault Tolerant

Locality Aware Scheduling

Scalability

27 of 123

Fault Tolerant

Locality Aware Scheduling

Scalability

Applications with working sets

(Parallel ops on intermediate results)

28 of 123

Fault Tolerant

Locality Aware Scheduling

Scalability

Applications with working sets

(Parallel ops on intermediate results)

29 of 123

Log Updates

Options?

Distributed Shared Memory + Checkpointing

30 of 123

31 of 123

32 of 123

Node

Memory

Storage

33 of 123

Node

Memory

Storage

Checkpoint

34 of 123

Node

Memory

Storage

Node

Storage

Checkpoint

Replicate

35 of 123

Or

36 of 123

37 of 123

Log Updates

Options?

Distributed Shared Memory + Checkpointing

38 of 123

Log

(coarse-grained)

Updates

39 of 123

Immutable/Read Only

Partitioned

Bad for async updates to shared state
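The idea of logging coarse-grained updates instead of checkpointing can be sketched with a toy model. This is plain Scala, not the Spark API; the names `Dataset`, `Source`, `Mapped`, and `Filtered` are made up for illustration. Each dataset remembers only its parent and the function applied to it, so a lost partition can be rebuilt by replaying that lineage for just that partition.

```scala
// A toy model of lineage-based recovery; plain Scala, not the Spark API.
object LineageSketch {
  // A dataset is either stable source data or one coarse-grained
  // transformation applied to every element of a parent dataset.
  sealed trait Dataset[A] {
    def numPartitions: Int
    // Rebuild a single partition by walking the lineage back to the source.
    def compute(partition: Int): Vector[A]
  }

  final case class Source[A](partitions: Vector[Vector[A]]) extends Dataset[A] {
    def numPartitions: Int = partitions.length
    def compute(partition: Int): Vector[A] = partitions(partition)
  }

  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Vector[B] = parent.compute(partition).map(f)
  }

  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Vector[A] = parent.compute(partition).filter(p)
  }

  // Pretend the given partition of the final dataset was lost and rebuild it.
  def recover(partition: Int): Vector[Int] = {
    val source   = Source(Vector(Vector(1, 2, 3), Vector(4, 5, 6)))
    val doubled  = Mapped(source, (x: Int) => x * 2)
    val filtered = Filtered(doubled, (x: Int) => x > 4)
    filtered.compute(partition)
  }
}
```

Because every dataset is immutable and partitioned, recovery never touches the other partitions, which is also why this model suits bulk transformations better than asynchronous updates to shared state.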

40 of 123

Stable RDD

Parallelized Collection

Input (HDFS, Files, JDBC)

41 of 123

val ctx = new SparkContext(master, "Spark App", conf)

42 of 123

SparkContext

Application Name

(creates Spark Application)

Cluster URL

Broadcast Values

Accumulators

Creates RDDs
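A minimal sketch of those SparkContext responsibilities, assuming `spark-core` is on the classpath and using the local master URL `local[2]` in place of a real cluster URL. The lookup table and the object name `ContextSketch` are invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ContextSketch {
  def run(): Array[String] = {
    // "local[2]" runs Spark inside this JVM with two worker threads.
    val conf = new SparkConf().setMaster("local[2]").setAppName("Spark App")
    val ctx  = new SparkContext(conf)
    try {
      // Broadcast a read-only lookup table to the executors once.
      val names = ctx.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))
      // The context is also the factory for RDDs.
      val rdd = ctx.parallelize(List(1, 2, 3, 4))
      rdd.map(i => names.value.getOrElse(i, "unknown")).collect()
    } finally {
      ctx.stop() // release the context so another one can be created
    }
  }
}
```

Calling `ContextSketch.run()` returns the looked-up names in the order of the input list.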

43 of 123

val list = List(1, 2, 3, 4, 5)

val rdd = ctx.parallelize(list)

44 of 123

val rawRDD = ctx.textFile(path)

45 of 123

Transformations

Actions

46 of 123

Transformations

Actions

map, filter, flatMap, union, groupByKey, sample

reduce, collect, count, take

47 of 123

Transformations

Actions

lazily executed

return values to driver
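One way to observe the lazy/eager split is to count how often the function passed to `map` actually runs. The sketch below assumes Spark 2.0 or later (for `longAccumulator`) running in local mode; the accumulator name and `LazySketch` are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazySketch {
  // Returns (map calls before the action, map calls after it, reduce result).
  def run(): (Long, Long, Int) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-sketch")
    val ctx  = new SparkContext(conf)
    try {
      val calls = ctx.longAccumulator("map calls")
      // Transformation: only recorded in the lineage, nothing runs yet.
      val doubled = ctx.parallelize(List(1, 2, 3, 4, 5)).map { x =>
        calls.add(1)
        x * 2
      }
      val before: Long = calls.value
      // Action: triggers the job and returns a value to the driver.
      val sum = doubled.reduce(_ + _)
      val after: Long = calls.value
      (before, after, sum)
    } finally {
      ctx.stop()
    }
  }
}
```

The counter stays at zero until `reduce` runs, at which point the map function has been applied once per element.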

48 of 123

Stable RDD

49 of 123

Stable RDD

map

modelRDD

50 of 123

val modelRDD = rawRDD.map(s => parse(s))

51 of 123

Stable RDD

map

filter

modelRDD

filteredRDD

52 of 123

val filteredRDD = modelRDD.filter(m => m.cost > 0)

53 of 123

Stable RDD

map

filter

count

modelRDD

filteredRDD

54 of 123

val count = filteredRDD.count
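Put together, the map, filter, and count steps might look like the sketch below. The `Model` case class and the comma-separated `parse` function are assumptions standing in for whatever the slides' `parse(s)` does, and an in-memory list replaces `textFile(path)` so the example needs no input file.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PipelineSketch {
  // Hypothetical record type; the slides only show that it has a `cost`.
  final case class Model(name: String, cost: Double)

  // Hypothetical parser for lines of the form "name,cost".
  def parse(line: String): Model = {
    val Array(name, cost) = line.split(",")
    Model(name.trim, cost.trim.toDouble)
  }

  def run(): Long = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("pipeline-sketch")
    val ctx  = new SparkContext(conf)
    try {
      val rawRDD      = ctx.parallelize(Seq("widget,2.50", "sample,0", "gadget,10"))
      val modelRDD    = rawRDD.map(s => parse(s))          // transformation
      val filteredRDD = modelRDD.filter(m => m.cost > 0)   // transformation
      filteredRDD.count()                                  // action: runs the job
    } finally {
      ctx.stop()
    }
  }
}
```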

55 of 123

RDDs are created

56 of 123

filteredRDD.count

57 of 123

filteredRDD.count

58 of 123

RDDs can be persisted

Partitions recomputed on failure

filteredRDD.persist()

59 of 123

Default storage level

filteredRDD.cache()

60 of 123

MEMORY_ONLY (default)

MEMORY_ONLY_SER

Storage Levels

61 of 123

MEMORY_ONLY (default)

MEMORY_AND_DISK

MEMORY_ONLY_SER

MEMORY_AND_DISK_SER

Storage Levels

62 of 123

MEMORY_ONLY (default)

MEMORY_AND_DISK

MEMORY_ONLY_SER

MEMORY_AND_DISK_SER

DISK_ONLY

Storage Levels

63 of 123

MEMORY_ONLY (default)

MEMORY_AND_DISK

MEMORY_ONLY_SER

MEMORY_AND_DISK_SER

DISK_ONLY

MEMORY_ONLY_2

MEMORY_AND_DISK_2

Storage Levels
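Choosing a non-default storage level is a matter of passing a `StorageLevel` constant to `persist`. A small sketch in local mode, with the data and the object name `PersistSketch` chosen for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Returns (was the level applied?, first count, second count).
  def run(): (Boolean, Long, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-sketch")
    val ctx  = new SparkContext(conf)
    try {
      val evens = ctx.parallelize(1 to 1000).filter(_ % 2 == 0)
      // Keep serialized partitions in memory and spill to disk if they do not fit.
      evens.persist(StorageLevel.MEMORY_AND_DISK_SER)
      val levelSet = evens.getStorageLevel == StorageLevel.MEMORY_AND_DISK_SER
      val first  = evens.count() // computes the partitions and stores them
      val second = evens.count() // can be served from the persisted partitions
      evens.unpersist()
      (levelSet, first, second)
    } finally {
      ctx.stop()
    }
  }
}
```

For RDDs, `cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`.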

64 of 123

An RDD's lifecycle in memory is tied to its SparkContext

65 of 123

Text & Sequence Files

JavaPairRDD => HDFS

RDD.foreach(f => …)
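As a sketch of the text-file path, the example below writes an RDD with `saveAsTextFile` and reads it back. It assumes local mode and a temporary directory on the local file system; an `hdfs://` path would be used the same way on a cluster.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object SaveSketch {
  def run(): List[String] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("save-sketch")
    val ctx  = new SparkContext(conf)
    try {
      // saveAsTextFile will not overwrite an existing directory,
      // so point it at a fresh sub-directory of a temp folder.
      val out = Files.createTempDirectory("spark-save").resolve("lines").toString
      // Two partitions produce two part files under `out`.
      ctx.parallelize(Seq("a", "b", "c"), 2).saveAsTextFile(out)
      // Reading the directory back picks up every part file.
      ctx.textFile(out).collect().toList.sorted
    } finally {
      ctx.stop()
    }
  }
}
```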

66 of 123

Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

67 of 123

68 of 123

69 of 123

70 of 123

Multitenancy

Simple Cluster FIFO

In App FIFO/Fair

Static vs Dynamic Resources

Memory & CPU
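Within one application, the choice between FIFO and fair scheduling is a configuration property. A minimal sketch, assuming `spark-core` on the classpath; the application name is invented:

```scala
import org.apache.spark.SparkConf

object SchedulingSketch {
  // Jobs inside an application are scheduled FIFO unless
  // spark.scheduler.mode is switched to FAIR.
  def fairConf(): SparkConf =
    new SparkConf()
      .setAppName("shared-app")
      .set("spark.scheduler.mode", "FAIR")
}
```

With fair scheduling enabled, a thread can route its jobs to a named pool by calling `setLocalProperty("spark.scheduler.pool", ...)` on the SparkContext before submitting them.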

71 of 123

Ecosystem

72 of 123

Apache Crunch

Pipeline pipeline = new MRPipeline(Driver.class, conf);

Pipeline p = new SparkPipeline(master, "App");

73 of 123

Spark Streaming

Near Real Time Processing

74 of 123

75 of 123

76 of 123

77 of 123

Discretized Stream

(DStream)

RDD @ T0

RDD @ T1

RDD @ T2

RDD @ T3

All RDD operations available

78 of 123

Window

RDD @ T0

RDD @ T1

RDD @ T2

RDD @ T3

Window @ T1

Window @ T3
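The windowing picture can be modelled with ordinary collections. This is a plain-Scala sketch rather than the Spark Streaming API: each inner sequence stands for the RDD of one batch interval, and a window is the union of the batches it covers. In Spark Streaming itself the corresponding operation is `window(windowLength, slideInterval)` on a DStream.

```scala
// Plain-Scala model of DStream windows; not the Spark Streaming API.
object WindowSketch {
  // `length` and `slide` are measured in batch intervals.
  def windows[A](batches: Seq[Seq[A]], length: Int, slide: Int): List[Seq[A]] =
    batches.sliding(length, slide).map(_.flatten).toList

  // Four batches (T0..T3) with a window of two batches sliding by two,
  // matching "Window @ T1" and "Window @ T3" on the slide.
  def example(): List[Seq[Int]] =
    windows(Seq(Seq(1, 2), Seq(3), Seq(4, 5), Seq(6)), length = 2, slide = 2)
}
```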

79 of 123

val ctx = new StreamingContext(sparkConf, Seconds(1))

80 of 123

val strm = ctx.socketStream(..)

val out = strm.map(f => …)

...

ctx.start()

ctx.awaitTermination()

81 of 123

Stream Sources

Absolute recovery based on source

Kafka, Flume, Twitter, MQTT, ZeroMQ

82 of 123

Exactly once processing

At least once persistence

83 of 123

Source

Slave

84 of 123

Source

Slave

85 of 123

Source

Slave

86 of 123

Source

Slave

87 of 123

88 of 123

Source

Slave

89 of 123

Source

Slave

90 of 123

Source

Slave

91 of 123

Source

Slave

92 of 123

93 of 123

94 of 123

95 of 123

96 of 123

Create RDDs from SQL

val youngUsers = sc.sql2rdd("SELECT * FROM users WHERE age < 20")

97 of 123

Spark SQL (1.0)

SchemaRDD

Only supports RDDs, Parquet, Hive

Includes future platform for Shark

98 of 123

val people = sc.textFile("people.txt")

people.registerAsTable("people")

val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

Spark SQL (1.0)

99 of 123

val teenagers = people

.where('age >= 10)

.where('age <= 19)

.select('name)
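The snippets above use the Spark 1.0 API (`SchemaRDD`, `registerAsTable`). For comparison, here is a sketch of the same teenagers query against later releases, assuming Spark 2.0 or newer with `spark-sql` on the classpath. The `Person` case class and the sample rows are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  // Hypothetical schema for the "people" table.
  final case class Person(name: String, age: Int)

  def run(): List[String] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sql-sketch")
      .getOrCreate()
    try {
      import spark.implicits._
      val people = Seq(
        Person("Ann", 12), Person("Bob", 15), Person("Cy", 19), Person("Di", 30)
      ).toDF()
      // Later-API counterpart of registering the data as a table.
      people.createOrReplaceTempView("people")
      spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
        .collect()
        .map(_.getString(0))
        .toList
        .sorted
    } finally {
      spark.stop()
    }
  }
}
```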

100 of 123

GraphX & Bagel

101 of 123

GraphX & Bagel

102 of 123

MLlib

Linear Regression

Binary Classification

Clustering

Collaborative Filtering

103 of 123

Save

Compile

Package

Ship

Execute

Repeat

104 of 123

Who likes a REPL?

105 of 123

Spark Shell

Scala & Python

Experiment First

Lower barrier to entry

106 of 123

Lessons

107 of 123

Lessons

Know your data

108 of 123

Lessons

Know your data

109 of 123

Lessons

Know your data

110 of 123

Lessons

Know your data

Know your …

111 of 123

Lessons

Know your data

Know your computations

112 of 123

113 of 123

Example Project

https://github.com/mkwhitacre/spark-examples

114 of 123

Links

Spark Homepage: http://spark.apache.org/
Spark YouTube Channel: https://www.youtube.com/user/TheApacheSpark
Spark Sponsor Blog: http://databricks.com/blog
Spark is Hard: http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html

115 of 123

-1 0 +1

116 of 123

117 of 123

118 of 123

Master

Slave

Standalone

119 of 123

Master

Slave

Standalone

Master

Zookeeper

120 of 123

Resource Manager

Node

Application Master

Node

Slave

Node

Slave

Node

Slave

YARN

Driver

121 of 123

Resource Manager

Node

Application Master + Driver

Node

Slave

Node

Slave

Node

Slave

YARN

122 of 123

Mesos Master

Slave

Spark

Slave

Fine Grained

Spark

123 of 123

Mesos Master

Slave

Spark

Slave

Coarse Grained

Spark
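From the application's side, the deployment options above mostly differ in the master URL handed to the SparkContext. The helper below is a plain-Scala sketch of the documented URL shapes; the object and function names are invented, and host names and ports are placeholders.

```scala
// Builds master URL strings; plain Scala, no Spark dependency.
object MasterUrlSketch {
  // Run inside the driver JVM with the given number of threads.
  def local(threads: Int): String = s"local[$threads]"

  // Standalone cluster. With ZooKeeper-backed standby masters,
  // all masters are listed, separated by commas.
  def standalone(masters: Seq[(String, Int)]): String =
    "spark://" + masters.map { case (host, port) => s"$host:$port" }.mkString(",")

  // Mesos cluster; coarse- versus fine-grained mode is chosen separately
  // through the spark.mesos.coarse configuration property.
  def mesos(host: String, port: Int): String = s"mesos://$host:$port"
}
```

YARN is the odd one out: the cluster location comes from the Hadoop configuration rather than the URL, and whether the driver runs inside the Application Master is decided by the deploy mode (client or cluster).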