1 of 21

Reproducible Data Science in the Cloud

Daniel Whitenack, @dwhitena

Data Scientist and Advocate, @pachydermIO

2 of 21

Outline

  1. Why do we care about Reproducibility?
  2. How can we achieve Reproducibility?
  3. Demo with R and Pachyderm
  4. Resources

@dwhitena, @pachydermIO, @RPILally

3 of 21

Why do we care about Reproducibility?


4 of 21

How can we achieve Reproducibility?

(at scale, in the cloud)


6 of 21

Demo


7 of 21

iris.csv

1.3,1.4,...


9 of 21

iris.csv (1.3,1.4,...) -> train.R -> model.rda, model.txt


10 of 21

iris.csv (1.3,1.4,...) -> train.R -> model.rda, model.txt -> infer.R


11 of 21

iris.csv (1.3,1.4,...) -> train.R -> model.rda, model.txt

model.rda, model.txt, 1.csv (1.3,1.4,...) -> infer.R -> 1 (setosa)
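
The train/infer split on this slide can be sketched as two functions joined by a serialized model. The talk's scripts are written in R; the Python below is only an illustration, not the talk's code. The nearest-centroid classifier is a stand-in for whatever model train.R fits, `pickle` stands in for model.rda, and the sample rows are made up for the example.

```python
import pickle
from collections import defaultdict


def train(rows):
    """Fit a nearest-centroid model: one mean feature vector per label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in rows:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + x for s, x in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def infer(model, features):
    """Return the label whose centroid is closest to `features`."""
    def sq_dist(centroid):
        return sum((c - x) ** 2 for c, x in zip(centroid, features))
    return min(model, key=lambda label: sq_dist(model[label]))


# Illustrative rows only (two features per flower), not data from the talk.
training_rows = [
    ([1.4, 0.2], "setosa"),
    ([1.3, 0.2], "setosa"),
    ([4.7, 1.4], "versicolor"),
    ([4.5, 1.5], "versicolor"),
]

# Training step: fit and serialize (the role model.rda plays on the slide).
model_blob = pickle.dumps(train(training_rows))

# Inference step: load the saved model and label one new row.
model = pickle.loads(model_blob)
prediction = infer(model, [1.3, 0.3])
print(prediction)
```

Keeping the two steps apart means inference can be rerun on new attribute files without retraining, as long as the saved model is versioned alongside the data that produced it.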


13 of 21

… enter Pachyderm

An open-source framework for distributed processing and data versioning, built on containers.


14 of 21

Pachyderm

training repo (iris.csv) -> model pipeline (Running train.R) -> model repo (model.rda, model.txt)

model repo, attributes repo (1.csv) -> inference pipeline (Running infer.R) -> inference repo (1)


15 of 21

Pachyderm

training repo (iris.csv) -> model pipeline (Running train.R) -> model repo (model.rda, model.txt)

model repo, attributes repo (1.csv) -> inference pipeline (Inference 1, Inference 2, ... Inference N) -> inference repo (1)


16 of 21

Pachyderm

training repo -> model pipeline -> model repo

model repo, attributes repo -> inference pipeline -> inference repo


17 of 21

Pachyderm

training repo -> model pipeline -> model repo

model repo, attributes repo -> inference pipeline -> inference repo

inference repo -> plots pipeline -> plots repo


18 of 21

Pachyderm

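raw_data repo -> training pipeline -> training repo

training repo -> model pipeline -> model repo

model repo, attributes repo -> inference pipeline -> inference repo

inference repo -> plots pipeline -> plots repo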


19 of 21

Pachyderm

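raw_data repo -> training pipeline -> training repo

raw_attr repo -> attributes pipeline (#!/bin/bash) -> attributes repo

training repo -> model pipeline -> model repo

model repo, attributes repo -> inference pipeline -> inference repo

inference repo -> plots pipeline -> plots repo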


20 of 21

Pachyderm

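raw_data repo -> training pipeline -> training repo

raw_attr repo -> attributes pipeline -> attributes repo

training repo -> model pipeline -> model repo

model repo, attributes repo -> inference pipeline -> inference repo

inference repo -> plots pipeline -> plots repo

Additional labels on this slide: attributes, attributes, inference, training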
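
A new commit to an input repo triggers the pipelines downstream of it. The Python sketch below walks one reading of the repos named on these slides to list what would be recomputed. The edges are an assumption inferred from the slide labels, not confirmed by the talk, and the code does not use any Pachyderm API.

```python
from collections import deque

# Hypothetical repo-to-repo edges read off the slide labels. Each pipeline
# writes to an output repo of the same name, so only repos are listed.
DOWNSTREAM = {
    "raw_data": ["training"],
    "raw_attr": ["attributes"],
    "training": ["model"],
    "attributes": ["inference"],
    "model": ["inference"],
    "inference": ["plots"],
    "plots": [],
}


def affected_by(repo):
    """Return every repo downstream of `repo`, in breadth-first order."""
    seen = {repo}
    order = []
    queue = deque([repo])
    while queue:
        for child in DOWNSTREAM[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(affected_by("raw_attr"))
print(affected_by("raw_data"))
```

Under these assumed edges, new attribute files leave the trained model untouched, while new raw training data flows through the model and on to inference and plots.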


21 of 21

Conclusion/Resources
