ddR and Spark
Clark Fitzgerald
2016 R Consortium Summer Intern
PhD Student - Statistics Dept - UC Davis
GitHub, Twitter @clarkfitzg
[Figure: serialized R objects shown as byte arrays, indexed 1, 2, ..., n]
My summer goal was to write a Spark backend to ddR
1) Create distributed objects
useBackend("spark", ...)
d_object <- dlist(1:10, rnorm(5))
2) Apply arbitrary R code to these distributed objects
dlapply(d_object, function(x){ … })
Choosing an R interface to Spark
sparkapi
SparkR
Patch upstream in SparkR: https://github.com/apache/spark/pull/14783
We can use a Spark RDD to implement an R list
RDD stands for “resilient distributed dataset”
Standalone rddlist package here: https://github.com/clarkfitzg/rddlist
sc <- sparkapi::start_shell(master = "local")
x <- list(1:10, letters, rnorm(10))
library(rddlist)
xrdd <- rddlist(sc, x)
# lapply method on RDD
first3 <- lapply(xrdd, function(x) x[1:3])
rddlist stores binary R objects in a Spark pairRDD
[Diagram: a local R list with elements 1, 2, ..., n (for example lm(y ~ x), rnorm(100), and the function zapsmall) corresponds to a Spark pairRDD of (index, byte array) pairs. serialize turns each R object into a byte array; unserialize turns it back into an R object. Example byte array:]
58 0a 00 00 00 02 00 03 03 01 00 02 03 00 00 00 00 0d 00 00 00 0a 00 00 00
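The storage idea above (R objects kept as byte arrays, keyed by index) can be imitated locally with base R alone. The following is only a sketch under that assumption: it uses a plain R list in place of a Spark pairRDD, and the helper names to_pairs, from_pairs, and lapply_pairs are hypothetical, not functions from rddlist.

```r
# Local stand-in for a pairRDD: each element is an (index, byte array) pair.
# to_pairs(), from_pairs() and lapply_pairs() are hypothetical helper names.
to_pairs <- function(x) {
  lapply(seq_along(x), function(i) {
    list(index = i, bytes = serialize(x[[i]], connection = NULL))
  })
}

# Recover the original R objects from the byte arrays.
from_pairs <- function(pairs) {
  lapply(pairs, function(p) unserialize(p$bytes))
}

# Apply f to every element: unserialize, call f, serialize the result.
lapply_pairs <- function(pairs, f) {
  lapply(pairs, function(p) {
    value <- f(unserialize(p$bytes))
    list(index = p$index, bytes = serialize(value, connection = NULL))
  })
}

x <- list(1:10, letters, rnorm(10))
pairs <- to_pairs(x)

# Same idea as the lapply call on the rddlist, but entirely local.
first3 <- from_pairs(lapply_pairs(pairs, function(v) v[1:3]))

# Functions survive the round trip too.
zap <- unserialize(serialize(zapsmall, connection = NULL))
zapped <- zap(c(1, 1e-20))
```

Each byte array starts with 58 0a, the ASCII characters X and newline that R writes at the start of its default XDR serialization format. Because any R object can be serialized this way, the values in such a structure can be vectors, model fits, or functions.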
sparklite copies the API of R’s parallel package
Standalone sparklite package here: https://github.com/clarkfitzg/sparklite
[Diagram: the local R session calls clusterApply(data, f); the parallel Spark workers evaluate f(data[1]), f(data[2]), ..., f(data[n]).]
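For comparison, this is roughly what the same pattern looks like with base R's parallel package, which takes the cluster object as its first argument. It is a sketch that assumes two local worker processes can be started; these stand in for Spark workers, and the data and f are made up for illustration.

```r
library(parallel)

# Two local worker processes stand in for Spark workers here.
cl <- makeCluster(2)

data <- list(1:10, 11:20, 21:30)
f <- function(x) sum(x)

# Each element of data is sent to a worker, which evaluates f on it.
# Results come back to the local session as a list, in input order.
result <- clusterApply(cl, data, f)

stopCluster(cl)
```

The result should match lapply(data, f) run in the local session; only the place where f is evaluated changes.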
Demo
Feedback?
Any interest in collaboration on applications of these packages?
So can we all go use ddR with Spark?
Not yet!
I spent the remaining time working on ddR internals.
Changes here: https://github.com/vertica/ddR/pull/15
Suggestions for internal ddR improvements
Details can be found here:
Thanks! Questions?