1 of 50

Production-Ready BIG ML Workflows

From Zero to Hero

Daniel Marcous
Google, Waze, Data Wizard

dmarcous@gmail.com / dmarcous@google.com

2017

2 of 50

What’s a Data Wizard you ask?

Gain Actionable Insights!

3 of 50

What’s here?

4 of 50

What’s here?

Example Use Case - Optimizing Waze ETA Prediction

Methodology - Deploying to production, step by step

Pitfalls - What to look out for, in both methodology and code

Use Cases - Showing off what we actually do in Waze Analytics

Based on tough lessons learned and on recommendations and input from Google experts.

5 of 50

A car goes from Chinatown to Times Square. How long will it take to arrive?

  • Ride features
  • Historical data
  • Real time data
  • User features
  • Map features

6 of 50

Why Big ML?

7 of 50

Bigger Is Better!

  • More processing power
    • Grid search all the parameters you ever wanted.
    • Cross validate in parallel with no extra effort.
  • Keep training until you hit 0 training error
    • Some models do not overfit even when optimised until training error reaches 0.
      • Random Forests - more trees
      • Neural Nets - more iterations
  • Handle BIG data
    • Tons of training data
      • We store EVERYTHING
      • No need for sampling (and no risk of sampling the wrong population)!
    • Millions of features
      • e.g. text processing with TF-IDF (see the sketch below)
    • Some models (e.g. artificial neural networks) don't perform well without training on a lot of data.
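To make the "millions of features" point concrete, here is a minimal spark.ml sketch (not code from the talk; the DataFrame `reports` and its "report_text" column are assumptions) that turns free text into a sparse TF-IDF feature vector:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

    // Assumed input: a DataFrame `reports` with a free-text column "report_text".
    val tokenizer = new Tokenizer()
      .setInputCol("report_text")
      .setOutputCol("words")

    // Hash tokens into a very wide, sparse feature space - millions of features are cheap here.
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("tf")
      .setNumFeatures(1 << 20)

    // Reweight raw term frequencies by inverse document frequency.
    val idf = new IDF()
      .setInputCol("tf")
      .setOutputCol("tfidf_features")

    val tfidfModel = new Pipeline().setStages(Array(tokenizer, hashingTF, idf)).fit(reports)
    val featurized = tfidfModel.transform(reports)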

8 of 50

Challenges

9 of 50

Bigger is harder

  • Skill gap - big data engineer (Scala/Java) vs. researcher/PhD (Python/R)
  • Curse of dimensionality
    • Some algorithms require exponential time/memory as dimensions grow
    • Harder and more important to tell what's gold and what's noise
    • Unbalanced data goes a long way with more records
  • Big model != small model
    • Different parameter settings
    • Different metric readings
      • Different implementations (distributed vs. in-memory)
      • Different programming languages (different heuristics)
    • Different populations trained on (sampling)

10 of 50

Solution = Workflow

11 of 50

Measure first, optimize second.

12 of 50

Before you start

  • Create example input
    • Raw input

  • Set up your metrics (see the evaluator sketch at the end of this slide)
    • Derived from business needs
    • Confusion matrix
    • Precision / recall
      • Per-class metrics
    • AUC
    • Coverage
      • Number of subjects affected
      • Sometimes measured as average precision per K random subjects.

  • Create example output
    • Featured input
    • Prediction rows

Remember: desired short-term behaviour does not imply long-term behaviour.

[Slide diagram: Measure, Preprocess (parse, clean, join, etc.), Naive, Matrix - steps numbered 1-3]
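A minimal sketch (assumed usage, not code from the talk) of reading some of these metrics with Spark's built-in evaluators; `scored` and its column names are made up:

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.mllib.evaluation.MulticlassMetrics

    // Assumed input: a DataFrame `scored` with columns "label", "prediction" and "rawPrediction".

    // AUC via the spark.ml evaluator.
    val auc = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderROC")
      .evaluate(scored)

    // Confusion matrix and per-class precision/recall via the mllib metrics helper.
    val predictionAndLabels = scored.select("prediction", "label").rdd
      .map(row => (row.getDouble(0), row.getDouble(1)))
    val metrics = new MulticlassMetrics(predictionAndLabels)
    println(metrics.confusionMatrix)
    metrics.labels.foreach { l =>
      println(s"class $l: precision = ${metrics.precision(l)}, recall = ${metrics.recall(l)}")
    }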

13 of 50

Preprocess

  • Example Data :
    • Input - {user_id: 1 , from: 32.113, 34.818, to: 32.113, 34.802, time_start: 17:15}
    • Features - {from_neighbourhood: Neot-Afeka, to_neighbourhood: Ramat-Aviv, hour_start: 17, minute_start: 15, air_distance: 1.5km, road_distance: 2.9km, heavy_traffic: NO}
    • Output - {ETA: 17:25.43}
    • Actual - {ETA: 17:26.34}
  • Metric - absolute error
    • Error - {ETA: 17:26.34} - {ETA: 17:25.43} = {00:00.51}
    • Average error over 10K examples: ~65 secs
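A tiny sketch (illustrative only; `rides` and its columns are assumptions) of computing that average absolute error at scale in Spark:

    import org.apache.spark.sql.functions.{abs, avg, col}

    // Assumed input: a DataFrame `rides` with epoch-second columns "predicted_eta" and "actual_eta".
    val avgAbsErrorSecs = rides
      .select(abs(col("predicted_eta") - col("actual_eta")).as("abs_error_secs"))
      .agg(avg("abs_error_secs"))
      .first()
      .getDouble(0)

    println(s"Average absolute ETA error: $avgAbsErrorSecs seconds")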


14 of 50

Monitor.

15 of 50

Visualise - easiest way to measure quickly

  • Set up your dashboard
    • Amounts of input data
      • Before/after joining
    • Amounts of output data
    • Metrics (see "Measure first, optimize second")
  • Different model comparison - what's best, when and where
  • Timeseries analysis
    • Anomaly detection - does a metric suddenly change drastically?
    • Impact analysis - did deploying a model have a significant effect on a metric?
  • Shiny @Waze

16 of 50

Dashboard monitoring

The dashboard should support picking different models and comparing their metrics.

[Screenshot callouts: pick models to compare; statistical tests on distributions (t.test / AUC)]
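As an illustration of such a test, a minimal sketch using Apache Commons Math from Scala (an assumption - the Shiny dashboard itself would more likely call R's t.test); `scoredA`/`scoredB` are made-up DataFrames of per-ride absolute errors:

    import org.apache.commons.math3.stat.inference.TTest

    // Absolute ETA errors (seconds) for two models over the same period.
    val errorsModelA = scoredA.select("abs_error_secs").rdd.map(_.getDouble(0)).collect()
    val errorsModelB = scoredB.select("abs_error_secs").rdd.map(_.getDouble(0)).collect()

    // Two-sided t-test: is the difference in mean error statistically significant?
    val pValue = new TTest().tTest(errorsModelA, errorsModelB)
    println(s"p-value for the difference in mean error: $pValue")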

17 of 50

Dashboard monitoring

The dashboard should support time series anomaly detection and impact analysis (e.g. of deploying a new model).

18 of 50

Start small and grow.

19 of 50

Getting a feel

Advanced variable selection with regularisation techniques in R.

[Output screenshot callouts: intercepts - by significance; no intercept = not entered into the model]

20 of 50

Getting a feel

Trying modeling techniques in R.

[Code screenshot callouts: root mean square error - lower = better (~ kinda); fit a gradient boosted trees model]

21 of 50

Getting a feel

Modeling bigger data with R, using parallelism.

[Code screenshot callout: fit and combine 6 random forest models (10k trees each) in parallel]

22 of 50

Start with a flow.

23 of 50

Basic moving parts

[Architecture diagram - the basic moving parts: Data source 1..N, Preprocess, Feature matrix, Training, Models 1..N, Scoring, Predictions 1..N, Serving DB, Application, Dashboard, Conf. (user/model assignments), Feedback loop]

24 of 50

Good ML code trumps performance.

25 of 50

Why so many parts you ask?

  • Scaling
  • Fault tolerance
    • Failed preprocessing/training doesn't affect the serving model
    • Rerunning only failed parts
  • Different logical parts - different processes (see "Clean Code" by Uncle Bob)
  • Easier to tweak and deploy changes

26 of 50

Test your infrastructure.

27 of 50

Set up a baseline.

Start with a neutral launch

28 of 50

You are here:

  • Take a snapshot of your metric reads:
    • The ones you chose earlier in the process as important to you
      • Confusion matrix
      • Weighted average % classified correctly
      • % subject coverage
  • Latency (see the timing sketch below)
    • Building the feature matrix on the last day's data takes X minutes
    • Training takes X hours
    • Serving predictions on Y records takes X seconds

Remember: you are running with a naive model. Anything better than the old model / random is OK.
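A trivial sketch (not from the slides; `buildFeatureMatrix`, `trainNaiveModel` and `lastDayData` are hypothetical) of capturing those latency numbers around each stage:

    // Minimal helper: time a stage and print the duration next to the metric snapshot.
    def timed[T](stage: String)(block: => T): T = {
      val start = System.nanoTime()
      val result = block
      println(s"$stage took ${(System.nanoTime() - start) / 1e9} seconds")
      result
    }

    val featureMatrix = timed("build feature matrix") { buildFeatureMatrix(lastDayData) }
    val naiveModel    = timed("training")             { trainNaiveModel(featureMatrix) }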

29 of 50

Go to work.

Coffee recommended at this point.

30 of 50

Optimize

What? How?

  • Grid search over parameters
  • Evaluate metrics
    • Using a predefined Spark Evaluator
    • Using user-defined metrics
  • Cross validate everything
  • Tweak preprocessing (mainly features) - see the transformer sketch below
    • Feature engineering
    • Feature transformers
      • Discretize / Normalise
    • Feature selectors
    • Available in the latest Apache Spark releases
  • Tweak training
    • Different models
    • Different model parameters
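A minimal sketch of such spark.ml feature transformers (column names and parameter values are assumptions); each of these is just another pipeline stage:

    import org.apache.spark.ml.feature.{QuantileDiscretizer, StandardScaler, VectorAssembler}

    // Discretize a continuous feature into buckets.
    val discretizer = new QuantileDiscretizer()
      .setInputCol("air_distance")
      .setOutputCol("air_distance_bucket")
      .setNumBuckets(10)

    // Assemble raw numeric columns into a single feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("road_distance", "hour_start", "air_distance_bucket"))
      .setOutputCol("raw_features")

    // Normalise the assembled vector to zero mean / unit variance.
    val scaler = new StandardScaler()
      .setInputCol("raw_features")
      .setOutputCol("features")
      .setWithMean(true)
      .setWithStd(true)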

31 of 50

Spark ML

Building a training pipeline with spark.ml.

[Code screenshot callouts: create dummy variables; required response label format; the ML model itself; labels back to readable format; assembled training pipeline]
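The slide itself is a code screenshot; a hedged sketch of what such a spark.ml training pipeline typically looks like (stage choices, `trainingData` and column names are assumptions, not the talk's exact code):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.{IndexToString, OneHotEncoder, StringIndexer, VectorAssembler}

    // Create dummy variables: index a categorical column, then one-hot encode it.
    val neighbourhoodIndexer = new StringIndexer()
      .setInputCol("from_neighbourhood").setOutputCol("from_neighbourhood_idx")
    val neighbourhoodEncoder = new OneHotEncoder()
      .setInputCol("from_neighbourhood_idx").setOutputCol("from_neighbourhood_vec")

    // Required response label format: labels must be indexed doubles.
    val labelIndexer = new StringIndexer()
      .setInputCol("label").setOutputCol("indexedLabel")
      .fit(trainingData)

    val assembler = new VectorAssembler()
      .setInputCols(Array("from_neighbourhood_vec", "road_distance", "hour_start"))
      .setOutputCol("features")

    // The ML model itself.
    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel").setFeaturesCol("features")

    // Labels back to readable format.
    val labelConverter = new IndexToString()
      .setInputCol("prediction").setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    // Assembled training pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(neighbourhoodIndexer, neighbourhoodEncoder, labelIndexer, assembler, rf, labelConverter))

    val pipelineModel = pipeline.fit(trainingData)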

32 of 50

Spark ML

Cross-validate, grid search params and evaluate metrics.

[Code screenshot callouts: grid search with reference to the ML model stage (RF); metrics to evaluate; yes, you can definitely extend and add your own metrics]
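A hedged sketch of that pattern, reusing the `rf` stage and `pipeline` from the previous sketch (grid values and the metric are illustrative; custom metrics would mean writing your own Evaluator subclass):

    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // Grid search with reference to the ML model stage (RF).
    val paramGrid = new ParamGridBuilder()
      .addGrid(rf.numTrees, Array(100, 500, 1000))
      .addGrid(rf.maxDepth, Array(5, 10, 20))
      .build()

    // Metric to evaluate.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("f1")

    // Cross validate everything: every parameter combination, k folds each.
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(5)

    val bestModel = cv.fit(trainingData)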

33 of 50

A/B

Test your changes

34 of 50

Compare to baseline

  • Best testing - a production A/B test
    • Run the current production model and the new model in parallel
    • Local ETA model (averaging road ETAs) vs. Global ETA model (error on the full ride) vs. Hybrid
  • Metric improvements (remember your dashboard?)
    • Local ETA model ~65s error vs. Global ETA model ~60s error vs. Hybrid ~58.6s error
  • Deploy / revert = update user assignments
    • Based on new metrics / feedback loop if possible

35 of 50

A/B Infrastructures

Setting up a very basic A/B testing infrastructure, built on the modelling wrapper presented earlier.

[Code screenshot callouts: Conf holds a mapping of model -> user_id/subject list; score in parallel (inside a map) - distributed = awesome; fancy Scala union over all score files]
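A hedged sketch of that pattern (the conf map, the `models` lookup, `featureMatrix` and the `score` wrapper are all assumptions standing in for the talk's actual code):

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, lit}

    // Conf: which model serves which users (model name -> user_id list).
    val assignments: Map[String, Seq[Long]] = Map(
      "local_eta"  -> Seq(1L, 2L, 3L),
      "global_eta" -> Seq(4L, 5L, 6L)
    )

    // Hypothetical scoring wrapper: apply a fitted PipelineModel to one user group.
    def score(modelName: String, userIds: Seq[Long]): DataFrame = {
      val model = models(modelName) // assumed: Map[String, PipelineModel]
      val slice = featureMatrix.filter(col("user_id").isin(userIds: _*))
      model.transform(slice).withColumn("model_name", lit(modelName))
    }

    // Score each group in parallel (inside a map) and union all score sets.
    val allScores: DataFrame = assignments.par
      .map { case (modelName, userIds) => score(modelName, userIds) }
      .reduce(_ unionAll _)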

36 of 50

Ad-Hoc statistics

37 of 50

Enter Apache Zeppelin

  • If you wrote your code right, you can easily reuse it in a notebook!
  • Answer ad-hoc questions
    • How many predictions did you output last month?
    • How many new users had a prediction with probability > 0.7?
    • How accurate were we on last month's predictions? (join with real data)
  • No need to rebuild anything!

38 of 50

Playing with it

Read a Parquet file, show statistics, register it as a table and run Spark SQL on it.

[Code screenshot callouts: Parquet already has a schema inside; register for usage in Spark SQL]
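A minimal sketch of that flow in Spark 1.6 style, matching the versions stated later in the slides (the path, table name and columns are made up):

    // In a Zeppelin notebook, `sc` and `sqlContext` are provided for you.
    val predictions = sqlContext.read.parquet("/data/eta/predictions/2017-01/")

    // Parquet already carries a schema, so the data is immediately queryable.
    predictions.printSchema()
    predictions.describe("abs_error_secs").show()

    // Register as a table for ad-hoc Spark SQL (e.g. the questions from the previous slide).
    predictions.registerTempTable("predictions")
    sqlContext.sql(
      "SELECT COUNT(*) AS high_confidence_predictions FROM predictions WHERE probability > 0.7"
    ).show()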

39 of 50

Putting it all together

40 of 50

Work Process

A step-by-step process for deploying your big ML workflows to production, ready for operations and optimisation.

  1. Measure first, optimize second.
    1. Define metrics.
    2. Preprocess data (using examples)
    3. Monitor. (dashboard setup)
  2. Start small and grow.
  3. Start with a flow.
    • Good ML code trumps performance.
    • Test your infrastructure.
  4. Set up a baseline.
  5. Go to work.
    • Optimize.
    • A/B.
      1. Test new flow in parallel to existing flow.
      2. Update user assignments.
  6. Watch. Iterate. (see 5.)

41 of 50

Possible Pitfalls

  • Code produced with
    • Apache Spark 1.6 / Scala 2.11.4
  • RDD vs. DataFrame
    • Enter the "Dataset API" (v2.0+)
  • mllib vs. spark.ml
    • Always use spark.ml if the functionality exists there
  • Algorithmic richness
  • Using Parquet
    • Intermediate outputs
  • Unbalanced partitions
    • Stuck on reduce
    • Stragglers
  • Parameter tuning (see the conf sketch below)
    • spark.sql.shuffle.partitions
    • Executors
    • Driver vs. executor memory
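A hedged example of setting those knobs in code (the values are placeholders, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("eta-training")
      .set("spark.sql.shuffle.partitions", "400") // partitions used after shuffles/joins
      .set("spark.executor.instances", "20")      // number of executors
      .set("spark.executor.memory", "8g")         // memory per executor
      .set("spark.driver.memory", "4g")           // driver memory (collects/broadcasts live here)

    val sc = new SparkContext(conf)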

42 of 50

Use Cases

@Waze

43 of 50

Irregular Traffic Events

Major events causing out-of-the-ordinary traffic

  • Exploration tool over time & space
  • Seasonal traffic anomaly detection

44 of 50

Dangerous Places

Find the most dangerous places, using custom-developed clustering algorithms

  • Alert authorities / users
  • Compare & share with 3rd parties (NYPD)

45 of 50

Parking Places Detection

Parking entrance

Parking lot

Street parking

46 of 50

Speed Limits Inference

[Flow diagram: Waze Segment Data -> Machine Learning -> Speed Limit Prediction; prediction + Waze Segment Data -> Community Verification -> Show in App]

47 of 50

Text Mining - Store Sentiments

48 of 50

Text Mining - Sentiment by Time & Place

49 of 50

50 of 50

Code & Slides: https://github.com/dmarcous/BigMLFlow/

Daniel Marcous
dmarcous@google.com

dmarcous@gmail.com