Mo models, mo problems: tracking the quality of models in production
The tweet
Pain Points
- Delays in tracking or “getting to” models = decay
- A model's lifecycle gets harder to manage once it's deployed to production
- Model quality and model maintenance are “criminally underrepresented”
Ideas
- Definition of done updated to include a plan for tracking model performance (since model builder is most familiar with data and munging process for model)
- Checklist (à la The Checklist Manifesto)
- Assign expiration dates to model outputs when they’re created
- Definition of done updated to include a feature dictionary
- Serve an alternative baseline (like popular pages/items) to a small but statistically significant set of users and compare lift (in CTR or pageviews-per-session) against that baseline; retrain the model if lift declines over a significant time period
- Look into QSAR (pharma concept for time-series)
- Run two different but useful models side by side
- Champion challenger systems as part of the platform
- Apache Airflow DAGs / pipelines to test integrity of DB queries and data munging tasks before they hit models
- Quick check-in meetings at key development milestones (right people in the room)
- “You just need to make the process formal enough that people think about it, talk about it, and prepare for it.”
- If there is an acceptable error margin / rate, use this when setting up alerts
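The integrity-check idea above (test DB queries and munging output before they hit models) can be sketched as a standalone validation task that an Airflow DAG could call before training or scoring. The `validate_features` name and the schema format are assumptions for illustration, not from the source:

```python
import pandas as pd

def validate_features(df, schema):
    """Fail fast before features reach the model.

    schema (hypothetical format): {column: (min, max, max_null_fraction)}
    Returns a list of human-readable problems; empty list means the data passed.
    """
    problems = []
    for col, (lo, hi, max_null) in schema.items():
        if col not in df.columns:
            problems.append(f"{col}: missing column")
            continue
        # share of missing values, compared to the acceptable margin
        null_frac = df[col].isna().mean()
        if null_frac > max_null:
            problems.append(f"{col}: {null_frac:.1%} null")
        # range check on the non-null values
        vals = df[col].dropna()
        if len(vals) and (vals.min() < lo or vals.max() > hi):
            problems.append(f"{col}: values outside [{lo}, {hi}]")
    return problems
```

An Airflow task would simply raise (and so fail the DAG) if the returned list is non-empty, which keeps bad batches from ever reaching the model.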
We use the Population Stability Index (PSI) as a metric to measure feature and score drift (it also tells us when something has gone wrong). We also validate against business outcomes.
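PSI has no single canonical implementation; a minimal sketch, binning the current sample against quantiles of the baseline and applying the common rules of thumb (< 0.1 stable, 0.1–0.25 moderate shift, > 0.25 major shift), might look like:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a current sample."""
    # Bin edges from the baseline's quantiles (avoids empty baseline bins)
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))

    # Clip both samples into the baseline's range so every point is counted
    e_counts = np.histogram(np.clip(expected, edges[0], edges[-1]), bins=edges)[0]
    a_counts = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0]
    e_pct = e_counts / len(expected)
    a_pct = a_counts / len(actual)

    # Floor tiny proportions to avoid log(0)
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)

    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

The same function works for feature drift (baseline = training data) and score drift (baseline = scores at validation time).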
What to monitor
- Average error over the last x days
- Accuracy loss
- Check distribution against baseline (KS or goodness-of-fit tests, or integral-based tests like an area metric on the CDF)
- Anomaly detection
- % of missing and imputed
- Production errors in line with test set errors
- Precision / recall / etc.
- Model features / weights:
    - Distribution of model features
    - Anomaly detection
- How sensitive is the model's accuracy across different test datasets?
- Check distribution against baseline (KS or goodness-of-fit tests, or integral-based tests like an area metric on the CDF)
- Anomaly detection
- % of missing and imputed
- Concept drift of y-values
- Data within limits (if applicable)
    - Check over time
    - Check compared to model test / training set
- Multi-dimensional outlier detection (like HDOutliers)
- Operational: usage changes
- Technical: load balancing, response times, latency, errors, etc.
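The "check distribution against baseline" item above can be done with a two-sample Kolmogorov–Smirnov test using nothing but NumPy; the `ks_statistic` / `ks_drifted` names and the 1% significance level are illustrative (scipy.stats.ks_2samp computes the same statistic plus an exact p-value):

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = np.sort(sample_a), np.sort(sample_b)
    points = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, points, side="right") / len(a)
    cdf_b = np.searchsorted(b, points, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def ks_drifted(baseline, live, alpha=0.01):
    """Flag drift by comparing the statistic to the large-sample critical value."""
    n, m = len(baseline), len(live)
    critical = np.sqrt(-np.log(alpha / 2) / 2) * np.sqrt((n + m) / (n * m))
    return ks_statistic(baseline, live) > critical
```

Here `baseline` would be a feature (or score) column from the training/test set and `live` the same column from recent production traffic.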
Creating monitoring
- Out of the box / automation
- Cron job that checks lift over time and sends an email if lift is “out of control” based on Shewhart control chart rules
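A minimal version of that cron job's control-chart logic, using only two classic Shewhart / Western Electric rules; the function name and the choice of rules are assumptions:

```python
import numpy as np

def shewhart_alarms(lifts, center, sigma):
    """Check a series of daily lift values against two Shewhart rules:

    1. any point beyond the 3-sigma control limits
    2. eight consecutive points on the same side of the center line
    """
    values = np.asarray(lifts, dtype=float)
    alarms = []

    # Rule 1: a single point outside the control limits
    if np.any(np.abs(values - center) > 3 * sigma):
        alarms.append("point beyond 3-sigma limits")

    # Rule 2: a run of 8 points all above (or all below) the center line
    side = np.sign(values - center)
    run = 1
    for prev, cur in zip(side, side[1:]):
        run = run + 1 if cur == prev != 0 else 1
        if run >= 8:
            alarms.append("run of 8 on one side of the center line")
            break
    return alarms
```

The cron job would estimate `center` and `sigma` from a stable historical window and email whenever the returned list is non-empty; the "acceptable error margin" idea above sets how those limits are tuned.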
Delivery / Tools
- Monitor in the office (“over our heads”)
- Automated posting of results into Slack channel (“helps with more of the team having context”)
- Model quality dashboards (powered by Looker) and a weekly review
- Kibana
- Owl Analytics (both models and datasets)
- Anaconda, Civis, Domino (from Andrew Therriault’s deck)
- Quilt
- ModelDB
Who
- One person leads on building a model and also builds the methods to check its accuracy and update it when needed -- moving away from this so that everyone builds and can update
- “Models are owned by the team, not an individual”
- “Builders and maintainers are the same people and act on those changes”
Resources

