1 of 50

Production-Ready BIG ML Workflows

From Zero to Hero

Daniel Marcous
Google, Waze, Data Wizard

dmarcous@gmail.com / dmarcous@google.com

2017

2 of 50

What’s a Data Wizard you ask?

Gain Actionable Insights!

3 of 50

What’s here?

4 of 50

What’s here?

Example Use Case - Optimizing Waze ETA Prediction

Methodology - Deploying to production, step by step

Pitfalls - What to look out for, in both methodology and code

Use Cases - Showing off what we actually do in Waze Analytics

Based on tough lessons learned and on recommendations and input from Google experts.

5 of 50

A car goes from Chinatown to Times Square. How long will it take to arrive?

  • Ride features
  • Historical data
  • Real time data
  • User features
  • Map features

6 of 50

Why Big ML?

7 of 50

Bigger Is Better!

  • More processing power
    • Grid search all the parameters you ever wanted.
    • Cross validate in parallel with no extra effort.
  • Keep training until you hit 0 training error
    • Some models do not overfit even when optimised until training error reaches 0.
      • Random Forests - more trees
      • Neural Nets - more iterations
  • Handle BIG data
    • Tons of training data
      • We store EVERYTHING
      • No need for sampling (and no risk of sampling the wrong population)!
    • Millions of features
      • e.g. text processing with TF-IDF (see the sketch below)
    • Some models (e.g. artificial neural networks) don't perform well without training on a lot of data.
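To make the "millions of features" point concrete, here is a minimal spark.ml sketch (not code from the talk; the DataFrame `reports` and its "report_text" column are assumptions) that turns free text into a sparse TF-IDF feature vector:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

    // Assumed input: a DataFrame `reports` with a free-text column "report_text".
    val tokenizer = new Tokenizer()
      .setInputCol("report_text")
      .setOutputCol("words")

    // Hash tokens into a very wide, sparse feature space - millions of features are cheap here.
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("tf")
      .setNumFeatures(1 << 20)

    // Reweight raw term frequencies by inverse document frequency.
    val idf = new IDF()
      .setInputCol("tf")
      .setOutputCol("tfidf_features")

    val tfidfModel = new Pipeline().setStages(Array(tokenizer, hashingTF, idf)).fit(reports)
    val featurized = tfidfModel.transform(reports)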

8 of 50

Challenges

9 of 50

Bigger is harder

  • Skill gap - big data engineer (Scala/Java) vs. researcher/PhD (Python/R)
  • Curse of dimensionality
    • Some algorithms require exponential time/memory as dimensions grow
    • Harder and more important to tell what's gold and what's noise
    • Unbalanced data goes a long way with more records
  • Big model != small model
    • Different parameter settings
    • Different metric readings
      • Different implementations (distributed vs. in-memory)
      • Different programming languages (different heuristics)
    • Different populations trained on (sampling)

10 of 50

Solution = Workflow

11 of 50

Measure first, optimize second.

12 of 50

Before you start

  • Create example input
    • Raw input

  • Set up your metrics (see the evaluator sketch at the end of this slide)
    • Derived from business needs
    • Confusion matrix
    • Precision / recall
      • Per-class metrics
    • AUC
    • Coverage
      • Number of subjects affected
      • Sometimes measured as average precision per K random subjects.

  • Create example output
    • Featured input
    • Prediction rows

Remember: desired short-term behaviour does not imply long-term behaviour.

[Slide diagram: Measure, Preprocess (parse, clean, join, etc.), Naive, Matrix - steps numbered 1-3]
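A minimal sketch (assumed usage, not code from the talk) of reading some of these metrics with Spark's built-in evaluators; `scored` and its column names are made up:

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.mllib.evaluation.MulticlassMetrics

    // Assumed input: a DataFrame `scored` with columns "label", "prediction" and "rawPrediction".

    // AUC via the spark.ml evaluator.
    val auc = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderROC")
      .evaluate(scored)

    // Confusion matrix and per-class precision/recall via the mllib metrics helper.
    val predictionAndLabels = scored.select("prediction", "label").rdd
      .map(row => (row.getDouble(0), row.getDouble(1)))
    val metrics = new MulticlassMetrics(predictionAndLabels)
    println(metrics.confusionMatrix)
    metrics.labels.foreach { l =>
      println(s"class $l: precision = ${metrics.precision(l)}, recall = ${metrics.recall(l)}")
    }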

13 of 50

Preprocess

  • Example Data :
    • Input - {user_id: 1 , from: 32.113, 34.818, to: 32.113, 34.802, time_start: 17:15}
    • Features - {from_neighbourhood: Neot-Afeka, to_neighbourhood: Ramat-Aviv, hour_start: 17, minute_start: 15, air_distance: 1.5km, road_distance: 2.9km, heavy_traffic: NO}
    • Output - {ETA: 17:25.43}
    • Actual - {ETA: 17:26.34}
  • Metric - absolute error
    • Error - {ETA: 17:26.34} - {ETA: 17:25.43} = {00:00.51}
    • Average error over 10K examples: ~65 secs
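A tiny sketch (illustrative only; `rides` and its columns are assumptions) of computing that average absolute error at scale in Spark:

    import org.apache.spark.sql.functions.{abs, avg, col}

    // Assumed input: a DataFrame `rides` with epoch-second columns "predicted_eta" and "actual_eta".
    val avgAbsErrorSecs = rides
      .select(abs(col("predicted_eta") - col("actual_eta")).as("abs_error_secs"))
      .agg(avg("abs_error_secs"))
      .first()
      .getDouble(0)

    println(s"Average absolute ETA error: $avgAbsErrorSecs seconds")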


14 of 50

Monitor.

15 of 50

Visualise - easiest way to measure quickly

  • Set up your dashboard
    • Amounts of input data
      • Before/after joining
    • Amounts of output data
    • Metrics (see "Measure first, optimize second")
  • Different model comparison - what's best, when and where
  • Timeseries analysis
    • Anomaly detection - does a metric suddenly change drastically?
    • Impact analysis - did deploying a model have a significant effect on a metric?
  • Shiny @Waze

16 of 50

Dashboard monitoring

The dashboard should support picking different models and comparing their metrics.

[Screenshot callouts: pick models to compare; statistical tests on distributions (t.test / AUC)]
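As an illustration of such a test, a minimal sketch using Apache Commons Math from Scala (an assumption - the Shiny dashboard itself would more likely call R's t.test); `scoredA`/`scoredB` are made-up DataFrames of per-ride absolute errors:

    import org.apache.commons.math3.stat.inference.TTest

    // Absolute ETA errors (seconds) for two models over the same period.
    val errorsModelA = scoredA.select("abs_error_secs").rdd.map(_.getDouble(0)).collect()
    val errorsModelB = scoredB.select("abs_error_secs").rdd.map(_.getDouble(0)).collect()

    // Two-sided t-test: is the difference in mean error statistically significant?
    val pValue = new TTest().tTest(errorsModelA, errorsModelB)
    println(s"p-value for the difference in mean error: $pValue")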

17 of 50

Dashboard monitoring

The dashboard should support time series anomaly detection and impact analysis (e.g. of deploying a new model).

18 of 50

Start small and grow.

19 of 50

Getting a feel

Advanced variable selection with regularisation techniques in R.

[Output screenshot callouts: intercepts - by significance; no intercept = not entered into the model]

20 of 50

Getting a feel

Trying modeling techniques in R.

[Code screenshot callouts: root mean square error - lower = better (~ kinda); fit a gradient boosted trees model]

21 of 50

Getting a feel

Modeling bigger data with R, using parallelism.

[Code screenshot callout: fit and combine 6 random forest models (10k trees each) in parallel]

22 of 50

Start with a flow.

23 of 50

Basic moving parts

[Architecture diagram - the basic moving parts: Data source 1..N, Preprocess, Feature matrix, Training, Models 1..N, Scoring, Predictions 1..N, Serving DB, Application, Dashboard, Conf. (user/model assignments), Feedback loop]

24 of 50

Good ML code trumps performance.

25 of 50

Why so many parts you ask?

  • Scaling
  • Fault tolerance
    • Failed preprocessing/training doesn't affect the serving model
    • Rerunning only failed parts
  • Different logical parts - different processes (see "Clean Code" by Uncle Bob)
  • Easier to tweak and deploy changes

26 of 50

Test your infrastructure.

27 of 50

Set up a baseline.

Start with a neutral launch

28 of 50

You are here:

  • Take a snapshot of your metric reads:
    • The ones you chose earlier in the process as important to you
      • Confusion matrix
      • Weighted average % classified correctly
      • % subject coverage
  • Latency (see the timing sketch below)
    • Building the feature matrix on the last day's data takes X minutes
    • Training takes X hours
    • Serving predictions on Y records takes X seconds

Remember: you are running with a naive model. Anything better than the old model / random is OK.
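A trivial sketch (not from the slides; `buildFeatureMatrix`, `trainNaiveModel` and `lastDayData` are hypothetical) of capturing those latency numbers around each stage:

    // Minimal helper: time a stage and print the duration next to the metric snapshot.
    def timed[T](stage: String)(block: => T): T = {
      val start = System.nanoTime()
      val result = block
      println(s"$stage took ${(System.nanoTime() - start) / 1e9} seconds")
      result
    }

    val featureMatrix = timed("build feature matrix") { buildFeatureMatrix(lastDayData) }
    val naiveModel    = timed("training")             { trainNaiveModel(featureMatrix) }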

29 of 50

Go to work.

Coffee recommended at this point.

30 of 50

Optimize

What? How?

  • Grid search over parameters
  • Evaluate metrics
    • Using a predefined Spark Evaluator
    • Using user-defined metrics
  • Cross validate everything
  • Tweak preprocessing (mainly features) - see the transformer sketch below
    • Feature engineering
    • Feature transformers
      • Discretize / Normalise
    • Feature selectors
    • Available in the latest Apache Spark releases
  • Tweak training
    • Different models
    • Different model parameters
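A minimal sketch of such spark.ml feature transformers (column names and parameter values are assumptions); each of these is just another pipeline stage:

    import org.apache.spark.ml.feature.{QuantileDiscretizer, StandardScaler, VectorAssembler}

    // Discretize a continuous feature into buckets.
    val discretizer = new QuantileDiscretizer()
      .setInputCol("air_distance")
      .setOutputCol("air_distance_bucket")
      .setNumBuckets(10)

    // Assemble raw numeric columns into a single feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("road_distance", "hour_start", "air_distance_bucket"))
      .setOutputCol("raw_features")

    // Normalise the assembled vector to zero mean / unit variance.
    val scaler = new StandardScaler()
      .setInputCol("raw_features")
      .setOutputCol("features")
      .setWithMean(true)
      .setWithStd(true)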

31 of 50

Spark ML

Building a training pipeline with spark.ml.

[Code screenshot callouts: create dummy variables; required response label format; the ML model itself; labels back to readable format; assembled training pipeline]
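The slide itself is a code screenshot; a hedged sketch of what such a spark.ml training pipeline typically looks like (stage choices, `trainingData` and column names are assumptions, not the talk's exact code):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.{IndexToString, OneHotEncoder, StringIndexer, VectorAssembler}

    // Create dummy variables: index a categorical column, then one-hot encode it.
    val neighbourhoodIndexer = new StringIndexer()
      .setInputCol("from_neighbourhood").setOutputCol("from_neighbourhood_idx")
    val neighbourhoodEncoder = new OneHotEncoder()
      .setInputCol("from_neighbourhood_idx").setOutputCol("from_neighbourhood_vec")

    // Required response label format: labels must be indexed doubles.
    val labelIndexer = new StringIndexer()
      .setInputCol("label").setOutputCol("indexedLabel")
      .fit(trainingData)

    val assembler = new VectorAssembler()
      .setInputCols(Array("from_neighbourhood_vec", "road_distance", "hour_start"))
      .setOutputCol("features")

    // The ML model itself.
    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel").setFeaturesCol("features")

    // Labels back to readable format.
    val labelConverter = new IndexToString()
      .setInputCol("prediction").setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    // Assembled training pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(neighbourhoodIndexer, neighbourhoodEncoder, labelIndexer, assembler, rf, labelConverter))

    val pipelineModel = pipeline.fit(trainingData)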

32 of 50

Spark ML

Cross-validate, grid search params and evaluate metrics.

[Code screenshot callouts: grid search with reference to the ML model stage (RF); metrics to evaluate; yes, you can definitely extend and add your own metrics]
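A hedged sketch of that pattern, reusing the `rf` stage and `pipeline` from the previous sketch (grid values and the metric are illustrative; custom metrics would mean writing your own Evaluator subclass):

    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // Grid search with reference to the ML model stage (RF).
    val paramGrid = new ParamGridBuilder()
      .addGrid(rf.numTrees, Array(100, 500, 1000))
      .addGrid(rf.maxDepth, Array(5, 10, 20))
      .build()

    // Metric to evaluate.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("f1")

    // Cross validate everything: every parameter combination, k folds each.
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(5)

    val bestModel = cv.fit(trainingData)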

33 of 50

A/B

Test your changes

34 of 50

Compare to baseline

  • Best testing - a production A/B test
    • Run the current production model and the new model in parallel
    • Local ETA model (averaging road ETAs) vs. Global ETA model (error on the full ride) vs. Hybrid
  • Metric improvements (remember your dashboard?)
    • Local ETA model ~65s error vs. Global ETA model ~60s error vs. Hybrid ~58.6s error
  • Deploy / revert = update user assignments
    • Based on new metrics / feedback loop if possible

35 of 50

A/B Infrastructures

Setting up a very basic A/B testing infrastructure, built on the modelling wrapper presented earlier.

[Code screenshot callouts: Conf holds a mapping of model -> user_id/subject list; score in parallel (inside a map) - distributed = awesome; fancy Scala union over all score files]
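A hedged sketch of that pattern (the conf map, the `models` lookup, `featureMatrix` and the `score` wrapper are all assumptions standing in for the talk's actual code):

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, lit}

    // Conf: which model serves which users (model name -> user_id list).
    val assignments: Map[String, Seq[Long]] = Map(
      "local_eta"  -> Seq(1L, 2L, 3L),
      "global_eta" -> Seq(4L, 5L, 6L)
    )

    // Hypothetical scoring wrapper: apply a fitted PipelineModel to one user group.
    def score(modelName: String, userIds: Seq[Long]): DataFrame = {
      val model = models(modelName) // assumed: Map[String, PipelineModel]
      val slice = featureMatrix.filter(col("user_id").isin(userIds: _*))
      model.transform(slice).withColumn("model_name", lit(modelName))
    }

    // Score each group in parallel (inside a map) and union all score sets.
    val allScores: DataFrame = assignments.par
      .map { case (modelName, userIds) => score(modelName, userIds) }
      .reduce(_ unionAll _)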

36 of 50

Ad-Hoc statistics

37 of 50

Enter Apache Zeppelin

  • If you wrote your code right, you can easily reuse it in a notebook!
  • Answer ad-hoc questions
    • How many predictions did you output last month?
    • How many new users had a prediction with probability > 0.7?
    • How accurate were we on last month's predictions? (join with real data)
  • No need to rebuild anything!

38 of 50

Playing with it

Read a Parquet file, show statistics, register it as a table and run Spark SQL on it.

[Code screenshot callouts: Parquet already has a schema inside; register for usage in Spark SQL]
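A minimal sketch of that flow in Spark 1.6 style, matching the versions stated later in the slides (the path, table name and columns are made up):

    // In a Zeppelin notebook, `sc` and `sqlContext` are provided for you.
    val predictions = sqlContext.read.parquet("/data/eta/predictions/2017-01/")

    // Parquet already carries a schema, so the data is immediately queryable.
    predictions.printSchema()
    predictions.describe("abs_error_secs").show()

    // Register as a table for ad-hoc Spark SQL (e.g. the questions from the previous slide).
    predictions.registerTempTable("predictions")
    sqlContext.sql(
      "SELECT COUNT(*) AS high_confidence_predictions FROM predictions WHERE probability > 0.7"
    ).show()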

39 of 50

Putting it all together

40 of 50

Work Process

A step-by-step process for deploying your big ML workflows to production, ready for operations and optimisation.

  1. Measure first, optimize second.
    1. Define metrics.
    2. Preprocess data (using examples)
    3. Monitor. (dashboard setup)
  2. Start small and grow.
  3. Start with a flow.
    • Good ML code trumps performance.
    • Test your infrastructure.
  4. Set up a baseline.
  5. Go to work.
    • Optimize.
    • A/B.
      1. Test new flow in parallel to existing flow.
      2. Update user assignments.
  6. Watch. Iterate. (see 5.)

41 of 50

Possible Pitfalls

  • Code produced with
    • Apache Spark 1.6 / Scala 2.11.4
  • RDD vs. DataFrame
    • Enter the "Dataset API" (v2.0+)
  • mllib vs. spark.ml
    • Always use spark.ml if the functionality exists there
  • Algorithmic richness
  • Using Parquet
    • Intermediate outputs
  • Unbalanced partitions
    • Stuck on reduce
    • Stragglers
  • Parameter tuning (see the conf sketch below)
    • spark.sql.shuffle.partitions
    • Executors
    • Driver vs. executor memory
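A hedged example of setting those knobs in code (the values are placeholders, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("eta-training")
      .set("spark.sql.shuffle.partitions", "400") // partitions used after shuffles/joins
      .set("spark.executor.instances", "20")      // number of executors
      .set("spark.executor.memory", "8g")         // memory per executor
      .set("spark.driver.memory", "4g")           // driver memory (collects/broadcasts live here)

    val sc = new SparkContext(conf)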

42 of 50

Use Cases

@Waze

43 of 50

Irregular Traffic Events

Major events causing out-of-the-ordinary traffic

  • Exploration tool over time & space
  • Seasonal traffic anomaly detection

44 of 50

Dangerous Places

Find the most dangerous places, using custom-developed clustering algorithms

  • Alert authorities / users
  • Compare & share with 3rd parties (NYPD)

45 of 50

Parking Places Detection

Parking entrance

Parking lot

Street parking

46 of 50

Speed Limits Inference

[Flow diagram: Waze Segment Data -> Machine Learning -> Speed Limit Prediction; prediction + Waze Segment Data -> Community Verification -> Show in App]

47 of 50

Text Mining - Store Sentiments

48 of 50

Text Mining - Sentiment by Time & Place

49 of 50

50 of 50

Code & Slides: https://github.com/dmarcous/BigMLFlow/

Daniel Marcous
dmarcous@google.com

dmarcous@gmail.com