1 of 33

Fully-Reproducible ML Deployments with Spark, Pachyderm and MLeap

Hollin Wilkins and Daniel Whitenack

2 of 33

Introductions

Hollin Wilkins

Co-Founder, Combust

@combustml

Dan Whitenack

Data Scientist, Pachyderm

@pachydermio

3 of 33

Our Talk in 3 Parts

  • Reproducibility in the Context of ML

  • A Specific ML Use Case

  • Demonstration of Reproducible ML Deployment

4 of 33

Reproducibility in the Context of ML

8 of 33

What is ML Reproducibility?

  • Consistent Results

  • Data Provenance

  • Versioned History
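
As a rough illustration of the properties listed above, the sketch below fingerprints a training run from its data, code, and parameters, so that identical inputs map to the same identifier and any change yields a new one. It is plain Python rather than the talk's Scala, and `run_fingerprint`, the placeholder code bytes, and the `seed` parameter are hypothetical, not taken from the speakers' code.

```python
import hashlib
import json


def run_fingerprint(data: bytes, code: bytes, params: dict) -> str:
    """Return a deterministic ID for one training run (hypothetical helper)."""
    digest = hashlib.sha256()
    # Hash each input separately first so the field boundaries stay unambiguous.
    for part in (data, code, json.dumps(params, sort_keys=True).encode("utf-8")):
        digest.update(hashlib.sha256(part).digest())
    return digest.hexdigest()


# Illustrative inputs: only the two CSV fields visible on the slides are used;
# the elided columns are left out rather than guessed.
data_v1 = b'"1000.00","2007-05-26"\n'
code_v1 = b"// placeholder for the contents of train.scala\n"
params = {"seed": 42}

first = run_fingerprint(data_v1, code_v1, params)
again = run_fingerprint(data_v1, code_v1, params)
changed = run_fingerprint(data_v1 + b'"3900.00","2007-05-27"\n', code_v1, params)

print(first == again)    # same data, code, and parameters
print(first == changed)  # one extra row of data
```

Storing such an identifier next to each model would be one way to tie a result back to what produced it.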

12 of 33

Why should we care?

  • Collaboration/Creativity

  • Compliance

  • Unique Insights

15 of 33

We Propose that...

Reproducibility is essential for ML pipelines so that they can be replayed, modified, tuned, and tracked over time.

Currently it is difficult to do this with standard tooling.

16 of 33

A Specific ML Use Case

18 of 33

[Diagram: train.csv -> train.scala -> model]

train.csv

"1000.00","2007-05-26",...

"1000.00","2007-05-26",...

...

model

hdfs://PRODUCTION/lending_club/models/2017/05/02/22/13/model/*

20 of 33

[Diagram: train.csv -> train.scala -> model; model + test.csv -> score.scala -> test_score]

train.csv

"1000.00","2007-05-26",...

"1000.00","2007-05-26",...

...

test.csv

"3900.00","2007-05-27",...

"11000.00","2007-05-27",...

...

test_score

0.77

/lending_club/data/validation/production/2016/09/07/test.csv

/lending_club/data/validation/production/2016/09/08/test.csv

/lending_club/data/validation/production/2016/09/08/results.csv

22 of 33

[Diagram: train.csv -> train.scala -> model; model + test.csv -> score.scala -> test_score; model + 1.csv -> 1 (rejected)]

train.csv

"1000.00","2007-05-26",...

"1000.00","2007-05-26",...

...

test.csv

"3900.00","2007-05-27",...

"11000.00","2007-05-27",...

...

test_score

0.77

1.csv

"3900.00","2007-05-27",...

"11000.00","2007-05-27",...

...

1

rejected

24 of 33

… enter MLeap + Pachyderm

Open source frameworks for reproducible ML deployments, data pipelines, and data versioning

25 of 33

MLeap is...

A Serialization Framework For Machine Learning Pipelines

An Execution Engine for Machine Learning Pipelines
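
The two roles named above, serializing a pipeline and executing it, can be illustrated with a toy example. This sketch does not use MLeap's API or bundle format; the JSON layout, the `scale` and `threshold` stages, their parameter values, and all function names are assumptions made for illustration, written in plain Python.

```python
import json

# Stage implementations known to the toy "execution engine" (hypothetical).
STAGES = {
    "scale": lambda x, factor: x * factor,
    "threshold": lambda x, cutoff: 1 if x >= cutoff else 0,
}


def serialize(pipeline: list) -> str:
    """Turn a list of (stage name, parameters) pairs into a JSON string."""
    return json.dumps([{"op": op, "params": params} for op, params in pipeline])


def execute(bundle: str, value: float):
    """Rebuild the pipeline from JSON and run one value through every stage."""
    for stage in json.loads(bundle):
        value = STAGES[stage["op"]](value, **stage["params"])
    return value


# "Training" side: describe the fitted pipeline and write it out.
bundle = serialize([("scale", {"factor": 0.001}), ("threshold", {"cutoff": 5.0})])

# "Serving" side: only the serialized bundle is needed to score new values.
print(execute(bundle, 3900.0))
print(execute(bundle, 11000.0))
```

The point of the split is that the serving side depends only on the serialized description and a small runtime, not on the environment that trained the pipeline.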

26 of 33

Pachyderm is...

Containerized Data Pipelines

Data Versioning
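
Data versioning, as named above, can be pictured as an append-only chain of commits, each identified by a hash of its parent and its content. The sketch below is a toy model in plain Python, not Pachyderm's implementation or API; the `Repo` class, its methods, and the sample rows are hypothetical.

```python
import hashlib


class Repo:
    """Toy versioned data repository: an append-only chain of commits."""

    def __init__(self):
        self.commits = {}  # commit id -> (parent id or None, file contents)
        self.head = None   # id of the most recent commit

    def commit(self, contents: bytes) -> str:
        """Store a new version and return its content-derived commit id."""
        parent = (self.head or "").encode("utf-8")
        commit_id = hashlib.sha256(parent + contents).hexdigest()[:12]
        self.commits[commit_id] = (self.head, contents)
        self.head = commit_id
        return commit_id

    def read(self, commit_id: str) -> bytes:
        """Return the data exactly as it was at the given commit."""
        return self.commits[commit_id][1]

    def history(self) -> list:
        """List commit ids from newest to oldest by following parent links."""
        ids, current = [], self.head
        while current is not None:
            ids.append(current)
            current = self.commits[current][0]
        return ids


training = Repo()
v1 = training.commit(b'"1000.00","2007-05-26"\n')
v2 = training.commit(b'"1000.00","2007-05-26"\n"3900.00","2007-05-27"\n')

print(training.history() == [v2, v1])  # newest commit first
print(training.read(v1))               # the first version is still retrievable
```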

27 of 33

[Diagram: Pachyderm pipeline]

training (train.csv) -> model (train.scala) -> model (MLeap bundle)

model (MLeap bundle) + test (test.csv) -> score (score.scala) -> score (score)
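
The flow on this slide (training data to model, then model plus test data to score) can be sketched as a chain of steps whose outputs record the exact input versions they came from. This is a plain-Python toy under stated assumptions, not Pachyderm's pipeline specification or MLeap's API; `run_step`, `source`, the stand-in `train` and `score` functions, and the sample rows are all hypothetical.

```python
import hashlib


def version(data: bytes) -> str:
    """Short content hash used as a version identifier."""
    return hashlib.sha256(data).hexdigest()[:12]


def source(data: bytes) -> dict:
    """Wrap raw input data together with its version."""
    return {"data": data, "version": version(data)}


def run_step(name, func, inputs):
    """Run one pipeline step and attach provenance to its output.

    `inputs` maps an input name to a dict with "data" and "version" keys;
    the data values are passed to `func` in insertion order.
    """
    output = func(*(item["data"] for item in inputs.values()))
    return {
        "step": name,
        "data": output,
        "version": version(output),
        "provenance": {key: item["version"] for key, item in inputs.items()},
    }


# Stand-ins for train.scala and score.scala: the "model" is just a row count.
def train(train_csv: bytes) -> bytes:
    return str(len(train_csv.splitlines())).encode("utf-8")


def score(model: bytes, test_csv: bytes) -> bytes:
    return model + b":" + str(len(test_csv.splitlines())).encode("utf-8")


train_csv = source(b'"1000.00","2007-05-26"\n"1000.00","2007-05-26"\n')
test_csv = source(b'"3900.00","2007-05-27"\n"11000.00","2007-05-27"\n')

model = run_step("model", train, {"training": train_csv})
result = run_step("score", score, {"model": model, "test": test_csv})

# The score can be traced back to the model and test data that produced it.
print(result["provenance"])
```

Because each output names the versions of its inputs, a changed `train.csv` would show up as a new model version and, in turn, a new score version.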

28 of 33

[Diagram: Pachyderm pipeline with MLeap Serving]

training -> model (pipeline) -> model

model + test -> score (pipeline) -> score

MLeap Serving: {"3900.00","2007-05-27",...} -> {rejected...}

29 of 33

Existing Solutions, Comparison

Solutions compared: Plain Spark, Prediction.io, Data Robot, Model DB, Pachyderm + MLeap

Criteria: Data Versioning, Model Versioning, Open Sourced, Works with ML Pipelines, Commercial Support

[Comparison table; cell values not recoverable]

30 of 33

Demonstration of Reproducible ML Deployment

31 of 33

[Diagram: Pachyderm pipeline with MLeap Serving]

training -> model (pipeline) -> model

model + test -> score (pipeline) -> score

MLeap Serving: {"3900.00","2007-05-27",...} -> {rejected...}

32 of 33

Git Repositories

33 of 33

Thank You.

Hollin Wilkins

Combust, @combustml, combust.ml

Daniel Whitenack

Pachyderm, @pachydermIO, pachyderm.io
