Production-Ready BIG ML Workflows
From Zero to Hero
Daniel Marcous, Google, Waze, Data Wizard
dmarcous@gmail.com / dmarcous@google.com
2017
What’s a Data Wizard you ask?
Gain Actionable Insights!
What’s here?
Example Use Case: Optimizing Waze ETA Prediction
Methodology: Deploying to production, step by step
Pitfalls: What to look out for in both methodology and code
Use Cases: Showing off what we actually do in Waze Analytics
Based on tough lessons learned, plus recommendations and input from Google experts.
A car goes from Chinatown to Times Square. How long will it take to arrive?
Why Big ML?
Bigger Is Better!
Challenges
Bigger is harder
Solution = Workflow
Measure first, optimize second.
Before you start
Remember: desired short-term behaviour does not imply long-term behaviour.
Measure
Preprocess
(parse, clean, join, etc.)
[Diagram: "Naive" per-model preprocessing vs. preprocessing once into a shared feature "Matrix"]
Monitor.
Visualise - easiest way to measure quickly
Dashboard monitoring
The dashboard should support picking different models and comparing metrics.
Pick models to compare
Statistical tests on distributions
t.test / AUC
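For instance, a two-sample t-test can tell you whether two models' error distributions really differ, rather than eyeballing dashboard curves. A minimal sketch in Scala using Apache Commons Math; the error values here are made up, and in practice they would come from the score files:

```scala
import org.apache.commons.math3.stat.inference.TTest

// Hypothetical: absolute ETA errors (minutes) for two models on the same eval set
val errorsModelA = Array(2.1, 3.4, 1.8, 5.0, 2.7, 3.1)
val errorsModelB = Array(1.9, 2.8, 1.5, 4.2, 2.3, 2.6)

// Welch's two-sample t-test: two-sided p-value for "the mean errors differ"
val pValue = new TTest().tTest(errorsModelA, errorsModelB)
if (pValue < 0.05) println(f"Models differ significantly (p = $pValue%.4f)")
else println(f"No significant difference detected (p = $pValue%.4f)")
```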
Dashboard monitoring
The dashboard should also support time-series anomaly detection and impact analysis (e.g., when deploying a new model).
Start small and grow.
Getting a feel
Advanced variable selection with regularisation techniques in R.
Coefficients - judged by significance
Coefficient shrunk to zero = variable not entered into the model
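The deck does this in R; a sketch of the same idea in Spark/Scala (the language the production pipeline uses later) would be a lasso fit, which drives weak coefficients to exactly zero. Column names and the regularisation strength are assumptions:

```scala
import org.apache.spark.ml.regression.LinearRegression

// Lasso: elasticNetParam = 1.0 means pure L1 regularisation,
// which shrinks uninformative coefficients to exactly zero
val lasso = new LinearRegression()
  .setFeaturesCol("features")   // assumed assembled feature vector column
  .setLabelCol("eta_minutes")   // hypothetical label
  .setRegParam(0.1)
  .setElasticNetParam(1.0)

val lassoModel = lasso.fit(trainDF)  // trainDF: an existing training DataFrame

// A zero coefficient = the variable did not enter the model
lassoModel.coefficients.toArray.zipWithIndex.foreach { case (w, i) =>
  println(s"feature $i: " + (if (w == 0.0) "dropped" else f"kept (w = $w%.4f)"))
}
```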
Getting a feel
Trying modeling techniques in R.
Root mean square error (RMSE)
Lower = better (roughly, when compared on the same data)
Fit a gradient boosted trees model
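The slide shows this in R; an equivalent gradient boosted trees fit plus RMSE evaluation in Spark/Scala might look like the sketch below. Column names, maxIter, and the train/test DataFrames are assumptions:

```scala
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Fit a gradient boosted trees regression model
val gbt = new GBTRegressor()
  .setFeaturesCol("features")
  .setLabelCol("eta_minutes")   // hypothetical label
  .setMaxIter(100)

val gbtModel = gbt.fit(trainDF)

// RMSE on held-out data: lower = better, on the same data and label scale
val rmse = new RegressionEvaluator()
  .setLabelCol("eta_minutes")
  .setMetricName("rmse")
  .evaluate(gbtModel.transform(testDF))
println(s"RMSE = $rmse")
```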
Getting a feel
Modeling bigger data with R, using parallelism.
Fit and combine 6 random forest models (10k trees each) in parallel
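The deck's version leans on R's parallelism (fit several forests, then combine them). In Spark the same effect comes for free, since tree construction is already distributed across the cluster; a sketch, with a hypothetical label column and tree count:

```scala
import org.apache.spark.ml.regression.RandomForestRegressor

// One distributed forest replaces R's "fit N forests in parallel and combine" trick
val rf = new RandomForestRegressor()
  .setFeaturesCol("features")
  .setLabelCol("eta_minutes")   // hypothetical label
  .setNumTrees(600)             // assumption; size this to your cluster

val rfModel = rf.fit(trainDF)
```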
Start with a flow.
Basic moving parts
[Diagram: data sources 1..N → Preprocess → Feature matrix → Training → Models 1..N → Scoring → Predictions 1..N → Serving DB → Application, with a Dashboard, a Feedback loop, and a Conf holding user/model assignments around the core]
Good ML code trumps performance.
Why so many parts you ask?
Test your infrastructure.
Set up a baseline.
Start with a neutral launch
You are here:
Remember: you are running with a naive model. Anything better than the old model / random is OK.
Go to work.
Coffee recommended at this point.
Optimize
What? How?
Spark ML
Building a training pipeline with spark.ml.
Create dummy variables
Required response label format
The ML model itself
Labels back to readable format
Assembled training pipeline
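A minimal sketch of such a spark.ml pipeline, matching the callouts above. Column names ("city", "distance_km", "hour_of_day", "label") and the training DataFrame are hypothetical:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, OneHotEncoder, StringIndexer, VectorAssembler}

// Create dummy variables from a categorical feature
val cityIndexer = new StringIndexer().setInputCol("city").setOutputCol("city_idx")
val cityEncoder = new OneHotEncoder().setInputCol("city_idx").setOutputCol("city_vec")

// Assemble everything into the single vector column spark.ml expects
val assembler = new VectorAssembler()
  .setInputCols(Array("city_vec", "distance_km", "hour_of_day"))
  .setOutputCol("features")

// Required response label format: spark.ml wants an indexed label
val labelIndexer = new StringIndexer()
  .setInputCol("label").setOutputCol("indexedLabel").fit(trainDF)

// The ML model itself
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel").setFeaturesCol("features")

// Labels back to readable format
val labelConverter = new IndexToString()
  .setInputCol("prediction").setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Assembled training pipeline
val pipeline = new Pipeline()
  .setStages(Array(cityIndexer, cityEncoder, assembler, labelIndexer, rf, labelConverter))
val pipelineModel = pipeline.fit(trainDF)
```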
Spark ML
Cross-validate, grid search params and evaluate metrics.
Grid search with reference to ML model stage (RF)
Metrics to evaluate
Yes, you can definitely extend and add your own metrics.
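A sketch of what that cross-validation might look like, continuing the hypothetical pipeline above; grid values and fold count are made up:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Grid search with reference to the RF stage of the pipeline
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(100, 300))
  .addGrid(rf.maxDepth, Array(5, 10))
  .build()

// The evaluator defines the metric the search optimizes;
// implementing your own Evaluator is how you add custom metrics
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel").setMetricName("f1")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainDF)
```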
A/B
Test your changes
Compare to baseline
A/B Infrastructure
Setting up a very basic A/B testing infrastructure on top of the modeling wrapper presented earlier.
The conf holds a mapping of:
model -> user_id/subject list
Score in parallel (inside a map)
Distributed = awesome.
Fancy Scala union for all score files
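A rough sketch of that idea. The model names, user IDs, and paths are invented, and `inputDF` and `models` (a map from model name to fitted PipelineModel) are assumed to exist:

```scala
import org.apache.spark.sql.functions.lit

// Conf: model name -> the users assigned to that test arm (hypothetical values)
val assignments: Map[String, Seq[Long]] = Map(
  "baseline_rf"   -> Seq(1L, 2L, 3L),
  "candidate_gbt" -> Seq(4L, 5L, 6L))

// Score each arm on its own user slice (inside a map), tag rows with the
// model name, then union all score DataFrames into one
val scored = assignments.map { case (modelName, userIds) =>
  val slice = inputDF.filter(inputDF("user_id").isin(userIds: _*))
  models(modelName).transform(slice).withColumn("model", lit(modelName))
}.reduce(_ union _)

scored.write.parquet("/scores/ab_run")  // hypothetical output path
```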
Ad-Hoc statistics
Enter Apache Zeppelin
Playing with it
Read a Parquet file, show statistics, register it as a table, and run SparkSQL on it.
Parquet - already has its schema inside
For use in SparkSQL
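A minimal version of that notebook paragraph in Scala, assuming the `spark` session Zeppelin provides; the path and query are invented:

```scala
// Parquet already carries its schema, so no parsing boilerplate is needed
val df = spark.read.parquet("/data/eta_scores")   // hypothetical path

df.describe().show()                              // quick summary statistics

// Register as a table for use in SparkSQL
df.createOrReplaceTempView("eta_scores")
spark.sql("SELECT model, avg(error) AS avg_error FROM eta_scores GROUP BY model").show()
```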
Putting it all together
Work Process
Step by step for deploying your big ML workflows to production, ready for operations and optimisations.
Possible Pitfalls
Use Cases
@Waze
Irregular Traffic Events
Major events causing out-of-the-ordinary traffic
Dangerous Places
Finding the most dangerous places, using custom-developed clustering algorithms
Parking Places Detection
Parking entrance
Parking lot
Street parking
Speed Limits Inference
[Diagram: Waze segment data → Machine learning → Speed limit prediction → Community verification → Show in app]
Text Mining - Store Sentiments
Text Mining - Sentiment by Time & Place
Code & Slides: https://github.com/dmarcous/BigMLFlow/
Daniel Marcous
dmarcous@google.com
dmarcous@gmail.com