1
Musketeer
all for one, one for all in data processing systems
Ionel Gog Malte Schwarzkopf Natacha Crooks �Matthew P. Grosvenor Allen Clement Steven Hand
Meet Kermit from Sesame, Inc.
2
CamSaS
Behold the chaos!
3
HDFS/GFS
Execution engine
Giraph
PowerGraph
GraphChi
Pregel
Graph processing
Storage system
Hive
Pig
Tenzing
Scope
SparkSQL
Sawzall
FlumeJava
DryadLINQ
GraphLINQ
Lindi
Shark
GraphX
Hadoop/MR
Dryad
Spark
Dremel
Naiad
Metis
CIEL
Language/library
CamSaS
4
Which system� is the right �one to use?
CamSaS
Join between two data sets
5
PULL DATA FROM HDFS
LOAD DATA
RUNTIME
PUSH DATA TO HDFS
Makespan
Less is better
Local cluster�7 nodes
(EC2 results similar)
CamSaS
Join between two data sets
6
PULL DATA FROM HDFS
LOAD DATA
RUNTIME
PUSH DATA TO HDFS
Makespan
BEST
Local cluster�7 nodes
(EC2 results similar)
Asymmetric:�5 million ⋈ 69 million rows
Less is better
CamSaS
Join between two data sets
7
PULL DATA FROM HDFS
LOAD DATA
RUNTIME
PUSH DATA TO HDFS
Makespan
BEST
BEST
Local cluster�7 nodes
(EC2 results similar)
Asymmetric:�5 million ⋈ 69 million rows
Symmetric: �39 million ⋈ 39 million rows
Less is better
CamSaS
PageRank over Twitter graph
8
(42M vertices, 1.4B edges)
100 nodes
CamSaS
PageRank over Twitter graph
9
(42M vertices, 1.4B edges)
100 nodes
CamSaS
PageRank over Twitter graph
10
(42M vertices, 1.4B edges)
100 nodes
CamSaS
PageRank over Twitter graph
11
(42M vertices, 1.4B edges)
100 nodes
16 nodes
Faster than Hadoop on 100 nodes
CamSaS
PageRank over Twitter graph
12
(42M vertices, 1.4B edges)
100 nodes
16 nodes
CamSaS
PageRank over Twitter graph
13
(42M vertices, 1.4B edges)
100 nodes
16 nodes
1 node
FASTEST
MOST�EFFICIENT
For two simple computations, �five different systems can be best!
CamSaS
14
How can we help Kermit choose?
CamSaS
Behold the chaos!
15
HDFS/GFS
Execution engine
Language/library
Giraph
PowerGraph
GraphChi
Pregel
Graph processing
Storage system
Hive
Pig
Tenzing
Scope
SparkSQL
Sawzall
FlumeJava
DryadLINQ
GraphLINQ
Lindi
Shark
GraphX
Hadoop/MR
Dryad
Spark
Dremel
Naiad
Metis
CIEL
CamSaS
Front-end languages
16
HDFS/GFS
Execution engine
Language/library
Giraph
PowerGraph
GraphChi
Pregel
Graph processing
Storage system
Hive
Pig
Tenzing
Scope
SparkSQL
Sawzall
FlumeJava
DryadLINQ
GraphLINQ
Lindi
Shark
GraphX
Hadoop/MR
Dryad
Spark
Dremel
Naiad
Metis
CIEL
CamSaS
Back-end execution engines
17
HDFS/GFS
Execution engine
Language/library
Giraph
PowerGraph
GraphChi
Pregel
Graph processing
Storage system
Hive
Pig
Tenzing
Scope
SparkSQL
Sawzall
FlumeJava
DryadLINQ
GraphLINQ
Lindi
Shark
GraphX
Hadoop/MR
Dryad
Spark
Dremel
Naiad
Metis
CIEL
CamSaS
18
Hadoop/MR
Dryad
Spark
Hive
Pig
Scope
GraphX
Sawzall
FlumeJava
DryadLINQ
Dremel
SparkSQL
Giraph
PowerGraph
GraphChi
Pregel
Shark
Naiad
GraphLINQ
Lindi
CIEL
X-Stream
Metis
HDFS/GFS
Front-end
Back-end
Fixed binding
CamSaS
Musketeer
19
Hadoop/MR
Dryad
Spark
Hive
Pig
Scope
GraphX
Sawzall
FlumeJava
DryadLINQ
Dremel
SparkSQL
Giraph
PowerGraph
GraphChi
Pregel
Shark
Naiad
GraphLINQ
Lindi
CIEL
X-Stream
Metis
HDFS/GFS
Front-end
Back-end
Intermediate representation
Intermediate representation
Lindi
Hive
SQL DSL
GAS DSL
Hadoop/MR
Spark
PowerGraph
GraphChi
Metis
Naiad
CamSaS
Musketeer @ 10,000ft
20
SQL�query
GAS�kernel
translate to IR
SQL�query
GAS�kernel
translate to IR
1
CamSaS
Musketeer @ 10,000ft
21
SQL�query
GAS�kernel
translate to IR
optimize IR
optimize IR
2
CamSaS
Musketeer @ 10,000ft
22
SQL�query
GAS�kernel
optimize IR
translate to IR
optimize IR
recognize idioms
recognize idioms
3
CamSaS
Musketeer @ 10,000ft
23
SQL�query
GAS�kernel
optimize IR
translate to IR
recognize idioms
merge operators
merge operators
4
CamSaS
Musketeer @ 10,000ft
24
SQL�query
GAS�kernel
optimize IR
translate to IR
recognize idioms
merge operators
map to systems
map to systems
5
CamSaS
Musketeer @ 10,000ft
25
SQL�query
GAS�kernel
optimize IR
recognize idioms
merge operators
map to systems
translate to IR
expand�templates
dispatch�jobs
expand�templates
dispatch�jobs
6
CamSaS
Musketeer @ 10,000ft
26
map to systems
SQL�query
GAS�kernel
translate to IR
optimize IR
recognize idioms
merge operators
expand�templates
dispatch�jobs
CamSaS
Intermediate representation (IR)
27
SQL�query
GAS�kernel
translate to IR
1
CamSaS
Intermediate representation (IR)
28
CamSaS
29
map to systems
5
CamSaS
Mapping to systems
30
GraphChi job
Hadoop job
Equivalent to k-way graph partitioning, which is NP-hard :-(
edges
SUM
MULTIPLY
AGGREGATE
PROJECT
DIVIDE
JOIN
WHILE
COUNT
JOIN
vertices
CamSaS
System bindings
31
CamSaS
Job boundaries
32
CamSaS
Job boundaries
33
CamSaS
Dynamic heuristic
34
COUNT
Dynamic �heuristic
Hadoop job
GraphChi job
2
JOIN
WHILE
JOIN
DIVIDE
PROJECT
AGGREGATE
MULTIPLY
SUM
SUM
MULTIPLY
AGGREGATE
PROJECT
DIVIDE
JOIN
WHILE
JOIN
COUNT
Topological sort
1
edges
SUM
MULTIPLY
AGGREGATE
PROJECT
DIVIDE
JOIN
WHILE
COUNT
JOIN
vertices
CamSaS
Job generation
35
expand�templates
dispatch�jobs
6
CamSaS
Job generation
36
public IEnumerable<{{OUTPUT_TYPE}}> {{FUN_NAME}}(...) {
{{VAL_TYPE}} outValue = {{INIT_VAL}};
foreach (var value in values)
{{UPDATE_VAL}};
yield return new
+
public IEnumerable<LongTriple> SumRatings(LongPair movIds, IEnumerable<LongTriple> ratings) {
long allRatings = 0;
foreach (var rating in ratings)
allRatings += rating.second;
yield return new LongTriple(movIds.s, movIds.t, allRatings);
}
Auto-generated job implementation
Dispatch
CamSaS
Evaluation
37
CamSaS
TPC-H (Q17): Business decisions
38
Less is better
100 nodes
CamSaS
TPC-H (Q17): Business decisions
39
Less is better
100 nodes
CamSaS
TPC-H (Q17): Business decisions
40
100 nodes
uses Naiad
Less is better
CamSaS
Netflix movie recommendation
41
100 nodes
Less is better
CamSaS
Netflix movie recommendation
42
100 nodes
Less is better
CamSaS
Netflix movie recommendation
43
100 nodes
Less is better
CamSaS
Netflix movie recommendation
44
uses Naiad
100 nodes
Less is better
CamSaS
PageRank on Twitter graph
45
100 nodes
Less is better
CamSaS
PageRank on Twitter graph
46
100 nodes
Less is better
CamSaS
PageRank on Twitter graph
47
100 nodes
16 nodes
Less is better
CamSaS
PageRank on Twitter graph
48
100 nodes
16 nodes
Less is better
CamSaS
PageRank on Twitter graph
49
100 nodes
16 nodes
1 node
Less is better
CamSaS
PageRank on Twitter graph
50
100 nodes
16 nodes
1 node
Less is better
CamSaS
Generated code overhead
51
5-20% overhead
can be improved
PageRank on Twitter graph
Less is better
CamSaS
Automatic mapping
52
33 workflow setups
Run graph computations on specialized systems
Small computations on single machine systems
Large computations on scalable systems
CamSaS
Automatic mapping
53
Musketeer
33 workflow setups
CamSaS
Automatic mapping
54
Musketeer
33 workflow setups
CamSaS
Automatic mapping
55
Musketeer
33 workflow setups
�Always within 10% of best choice
CamSaS
Summary and conclusions
56
@CamSysAtScale
@ICGog
CamSaS
57
��Backup slides
CamSaS
Scheduler runtime
58
Less is better
Note: log scale
CamSaS
Scheduler limitations
59
3. JOIN
2. UNION
3. JOIN
2. UNION
Topological sort
Heuristic
Optimal
JOIN
JOIN
UNION
PROJECT
1. JOIN
4. PROJECT
1. JOIN
4. PROJECT
CamSaS
Do we want to combine frameworks?
60
CamSaS
Operator merge
61
PageRank on
LiveJournal (4.8M vertices, 69M edges)
Top-shopper
CamSaS
62
CamSaS