1 of 12

ddR and Spark

Clark Fitzgerald

2016 R Consortium Summer Intern

PhD Student - Statistics Dept - UC Davis

GitHub, Twitter: @clarkfitzg

2 of 12

3 of 12

My summer goal was to write a Spark backend to ddR

  1. Store distributed lists, arrays, and data frames in Spark

useBackend("spark", ...)

d_object <- dlist(1:10, rnorm(5))

  2. Apply arbitrary R code to these distributed objects

dlapply(d_object, function(x) { ... })
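
Put together, the intended workflow is roughly the following sketch; the "spark" backend string mirrors the call above, and collect() is ddR's existing function for gathering a distributed object back locally.

library(ddR)
useBackend("spark")                               # the planned Spark backend
d_object <- dlist(1:10, rnorm(5))                 # a distributed list with two elements
d_result <- dlapply(d_object, function(x) x^2)    # arbitrary R code runs on each element
collect(d_result)                                 # bring the results back as a local list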

4 of 12

Choosing an R interface to Spark

sparkapi

  • Direct low-level Java access (see the sketch after this list)
  • Lighter dependency

SparkR

  • Useful for data frames
  • Evolving API
  • Speed and support for User Defined Functions
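
To illustrate sparkapi's low-level access, a sketch using its invoke functions; start_shell() appears on the next slide, and invoke_static() is taken from the sparkapi package of that time (the API later moved into sparklyr), so treat the exact calls as an assumption.

library(sparkapi)
sc <- start_shell(master = "local")
# Call an arbitrary static Java method through the Spark JVM backend:
invoke_static(sc, "java.lang.System", "getProperty", "java.version")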

5 of 12

We can use a Spark RDD to implement an R list

RDD is a “resilient distributed dataset”

Standalone rddlist package here: https://github.com/clarkfitzg/rddlist

sc <- sparkapi::start_shell(master = "local")

x <- list(1:10, letters, rnorm(10))

library(rddlist)

xrdd <- rddlist(sc, x)

# lapply method on RDD

first3 <- lapply(xrdd, function(x) x[1:3])
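
For comparison, the same operation on the plain local list, in base R only (no rddlist API assumed); the distributed version is meant to give the matching result.

first3_local <- lapply(x, function(x) x[1:3])
str(first3_local)   # 1 2 3, "a" "b" "c", and the first three normal draws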

6 of 12

rddlist stores binary R objects in a Spark pairRDD

[Diagram: each element of a local R list (e.g. lm(y ~ x), rnorm(100), the function zapsmall) is converted with serialize() into a byte array (58 0a ...) and stored under its index (1, 2, ..., n) as a key-value pair in a Spark pairRDD; unserialize() turns the byte arrays back into the local R list.]
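
In base R terms the scheme is just serialize() and unserialize() applied element by element; a quick sketch (the cars data is used only so the lm() example from the figure runs as written):

x <- list(lm(dist ~ speed, data = cars), rnorm(100), zapsmall)
bytes <- lapply(x, serialize, connection = NULL)   # one raw byte array per element
bytes[[2]][1:2]                                    # 58 0a: the header bytes shown above
identical(unserialize(bytes[[2]]), x[[2]])         # TRUE: the round trip is lossless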

7 of 12

sparklite mirrors the API of R's parallel package

Standalone sparklite package here: https://github.com/clarkfitzg/sparklite

[Diagram: the local R session calls clusterApply(data, f); parallel Spark workers each evaluate one piece, f(data[1]), f(data[2]), ..., f(data[n]).]
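
For reference, the parallel-package pattern that sparklite mirrors, written with base R's own cluster functions; local worker processes stand in for the Spark workers.

library(parallel)
cl <- makeCluster(2)                        # two local workers instead of Spark executors
clusterApply(cl, 1:4, function(i) i^2)      # evaluates f(data[1]), ..., f(data[n]) in parallel
stopCluster(cl)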

8 of 12

Demo

9 of 12

Feedback?

Any interest in collaboration on applications of these packages?

10 of 12

So can we all go use ddR with Spark?

Not yet!

I spent the remaining time working on ddR internals.

11 of 12

Suggestions for internal ddR improvements

  • Use the “chunks”
  • Simplify the Map step
  • Put dlists, darrays, and dframes into an OO model

Details can be found here:

12 of 12

Thanks! Questions?


Contact: Clark Fitzgerald

clarkfitzg@gmail.com