Fully-Reproducible ML Deployments with Spark, Pachyderm and MLeap
Hollin Wilkins and Daniel Whitenack
Introductions
Hollin Wilkins
Co-Founder, Combust
@combustml
Dan Whitenack
Data Scientist, Pachyderm
@pachydermio
Our Talk in 3 Parts
Reproducibility in the Context of ML
What is ML Reproducibility?
Why should we care?
We Propose that...
Reproducibility is essential for ML pipelines, so that they can be replayed, modified, tuned and tracked over time.
Currently it is difficult to do this with standard tooling.
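As a minimal sketch of what "replayed and tracked" could mean in practice, the Python below fingerprints a training run from its three inputs: the data, the code, and the parameters. Nothing here comes from Pachyderm or MLeap. The function name `run_fingerprint`, the parameter names, and the sample values are hypothetical, and SHA-256 is only one reasonable choice of hash.

```python
import hashlib
import json


def run_fingerprint(data: bytes, code: bytes, params: dict) -> str:
    """Return a hex digest identifying one (data, code, params) combination."""
    h = hashlib.sha256()
    # Hash each input's own digest so bytes cannot shift between inputs unnoticed.
    for part in (data, code, json.dumps(params, sort_keys=True).encode("utf-8")):
        h.update(hashlib.sha256(part).digest())
    return h.hexdigest()


# Hypothetical inputs, for illustration only.
data_v1 = b'"1000.00","2007-05-26"\n'
data_v2 = b'"1000.00","2007-05-26"\n"3900.00","2007-05-27"\n'
code = b"// contents of a training script"
params = {"max_depth": 5, "seed": 42}

# Identical inputs give the same fingerprint; changed data gives a new one.
same = run_fingerprint(data_v1, code, params) == run_fingerprint(data_v1, code, dict(params))
changed = run_fingerprint(data_v1, code, params) != run_fingerprint(data_v2, code, params)
print(same, changed)
```

Stored next to a model, a fingerprint like this would show whether two models came from identical inputs, which a timestamp alone cannot.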
A Specific ML Use Case
[Diagram, built up over several slides]
train.csv -> train.scala -> model
    train.csv rows: "1000.00","2007-05-26",...
    model location: hdfs://PRODUCTION/lending_club/models/2017/05/02/22/13/model/*
model + test.csv -> score.scala -> test_score: 0.77
    test.csv rows: "3900.00","2007-05-27",... and "11000.00","2007-05-27",...
    files: /lending_club/data/validation/production/2016/09/07/test.csv
           /lending_club/data/validation/production/2016/09/08/test.csv
           /lending_club/data/validation/production/2016/09/08/results.csv
1.csv -> 1 -> rejected
    1.csv rows: "3900.00","2007-05-27",... and "11000.00","2007-05-27",...
… enter MLeap + Pachyderm
Open source frameworks for reproducible ML deployments, data pipelines, and data versioning
MLeap is...
A Serialization Framework For Machine Learning Pipelines
An Execution Engine for Machine Learning Pipelines
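To make the two roles concrete, here is a toy sketch in Python using only the standard library. It is not MLeap's API or its bundle format: the `bundle.json` layout, the `scale` and `threshold` stage names, the "approved" label, and all numbers are invented for illustration. It only shows the general idea of writing a fitted pipeline to a portable archive and then executing it without the training code.

```python
import io
import json
import zipfile

# A "fitted" pipeline described purely as data (hypothetical stages and values).
PIPELINE = [
    {"op": "scale", "mean": 5000.0, "std": 2500.0},
    {"op": "threshold", "cutoff": 0.0, "above": "rejected", "below": "approved"},
]


def serialize(pipeline: list) -> bytes:
    """Write the pipeline description into an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("bundle.json", json.dumps({"version": 1, "stages": pipeline}))
    return buf.getvalue()


def load(blob: bytes) -> list:
    """Read the stages back from an archive produced by serialize()."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return json.loads(zf.read("bundle.json"))["stages"]


def execute(stages: list, value: float):
    """Run a single value through each stage in order."""
    for stage in stages:
        if stage["op"] == "scale":
            value = (value - stage["mean"]) / stage["std"]
        elif stage["op"] == "threshold":
            value = stage["above"] if value > stage["cutoff"] else stage["below"]
        else:
            raise ValueError(f"unknown stage: {stage['op']}")
    return value


blob = serialize(PIPELINE)   # serialization step
stages = load(blob)          # the executing side needs only the archive
print(execute(stages, 11000.0))
print(execute(stages, 3900.0))
```

The point of the split is that `execute` depends only on the archive contents, so the same artifact can be scored in a batch job or behind a service.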
Pachyderm is...
Containerized Data Pipelines
Data Versioning
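The data versioning idea can be illustrated with a toy, in-memory store in Python. This is an assumption-laden sketch, not how Pachyderm is implemented: the `VersionedRepo` class, its methods, and the sample rows are hypothetical. It shows only the property that matters for reproducibility, namely that every commit is an immutable snapshot that can be read back later by its ID.

```python
import hashlib


class VersionedRepo:
    """Toy append-only store: each commit is an immutable snapshot of files."""

    def __init__(self):
        # commit id -> {"parent": id or None, "files": {path: bytes}}
        self._commits = {}
        self.head = None

    def commit(self, files: dict) -> str:
        """Snapshot the current head plus the given files; return the new commit id."""
        snapshot = dict(self._commits[self.head]["files"]) if self.head else {}
        snapshot.update(files)
        # The id depends on the parent commit and on every file's content.
        h = hashlib.sha256((self.head or "").encode("utf-8"))
        for path in sorted(snapshot):
            h.update(path.encode("utf-8"))
            h.update(hashlib.sha256(snapshot[path]).digest())
        commit_id = h.hexdigest()[:12]
        self._commits[commit_id] = {"parent": self.head, "files": snapshot}
        self.head = commit_id
        return commit_id

    def read(self, commit_id: str, path: str) -> bytes:
        """Return a file exactly as it was at the given commit."""
        return self._commits[commit_id]["files"][path]


# Hypothetical sample rows, for illustration only.
repo = VersionedRepo()
c1 = repo.commit({"train.csv": b'"1000.00","2007-05-26"\n'})
c2 = repo.commit({"train.csv": b'"1000.00","2007-05-26"\n"3900.00","2007-05-27"\n'})

# The earlier version stays readable after the file has changed.
print(repo.read(c1, "train.csv") != repo.read(c2, "train.csv"))
```

A model trained against `c1` can then be traced to exactly the bytes it saw, even after `train.csv` has moved on.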
[Diagram: the workflow as Pachyderm repositories and pipelines]
training (train.csv) -> model (train.scala) -> model (MLeap bundle)
model (MLeap bundle) + test (test.csv) -> score (score.scala) -> score
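The training, model, test and score stages named in the diagram above could be declared along the following lines. This Python sketch builds two pipeline specs as dictionaries and prints them as JSON. The field names (`pipeline.name`, `transform.image`, `transform.cmd`, `input.pfs`, `input.cross`) follow the shape of Pachyderm pipeline specs as documented for later 1.x releases and should be checked against the installed version. The image names and the `run-training` and `run-scoring` commands are hypothetical placeholders, not taken from the talk or the demo repository.

```python
import json


def pfs_input(repo: str, glob: str = "/") -> dict:
    """Describe one versioned input repository."""
    return {"pfs": {"repo": repo, "glob": glob}}


def pipeline_spec(name: str, image: str, cmd: list, inputs: list) -> dict:
    """Build one pipeline spec; several inputs are combined with a cross."""
    spec_input = inputs[0] if len(inputs) == 1 else {"cross": inputs}
    return {
        "pipeline": {"name": name},
        "transform": {"image": image, "cmd": cmd},
        "input": spec_input,
    }


# Image names and commands are hypothetical placeholders.
model = pipeline_spec(
    "model",
    "example/train:latest",
    ["/bin/sh", "-c", "run-training /pfs/training /pfs/out"],
    [pfs_input("training")],
)
score = pipeline_spec(
    "score",
    "example/score:latest",
    ["/bin/sh", "-c", "run-scoring /pfs/model /pfs/test /pfs/out"],
    [pfs_input("model"), pfs_input("test")],
)

print(json.dumps(model, indent=2))
print(json.dumps(score, indent=2))
```

Because the `score` pipeline reads from the `model` repository rather than from a hand-maintained path, a new commit to `training` can propagate through both stages with its provenance recorded.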
[Diagram: Pachyderm pipeline alongside MLeap Serving]
Pachyderm: training -> model -> model; model + test -> score -> score
MLeap Serving: {"3900.00","2007-05-27",...} -> {rejected...}
Existing Solutions, Comparison
| | Plain Spark | Prediction.io | DataRobot | ModelDB | Pachyderm + MLeap |
| Data Versioning | | | | | |
| Model Versioning | | | | | |
| Open Source | | | | | |
| Works with ML Pipelines | | | | | |
| Commercial Support | | | | | |
Demonstration of Reproducible ML Deployment
Git Repositories
Pachyderm: https://github.com/pachyderm/pachyderm
MLeap: https://github.com/combust/mleap
Demo: https://github.com/combust/pachyderm-mleap-demo
Thank You.
Hollin Wilkins
Combust, @combustml, combust.ml
Daniel Whitenack
Pachyderm, @pachydermIO, pachyderm.io