Mo models, mo problems: tracking the quality of models in production

The tweet

Pain Points

  • Delays in tracking or “getting to” models = decay
  • Lifecycle of a model gets harder to manage once deployed (in production)
  • Model quality and model maintenance are “criminally underrepresented”

Ideas

  • Definition of done updated to include a plan for tracking model performance (since the model builder is most familiar with the data and the munging process)
  • Checklist (à la The Checklist Manifesto)
  • Assign expiration dates to model outputs when they’re created
  • Definition of done updated to include a feature dictionary
  • Serve an alternative baseline (like popular pages/items) to a small but statistically significant set of users and compare lift (in CTR or pageviews-per-session) to the baseline -- retrain the model if lift declines over a significant time period
  • Look into QSAR (pharma concept for time-series)
  • Run two different but useful models side by side
  • Champion challenger systems as part of the platform
  • Apache Airflow DAGs / pipelines to test integrity of DB queries and data munging tasks before they hit models
  • Quick check-in meetings at key development milestones (right people in the room)
  • “You just need to make the process formal enough that people think about it, talk about it, and prepare for it.”
  • If there is an acceptable error margin / rate, use this when setting up alerts
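The baseline-comparison idea above can be sketched as a two-proportion z-test on CTR; `ctr_lift` and the example numbers are illustrative, not from the notes:

```python
import math

def ctr_lift(model_clicks, model_views, base_clicks, base_views):
    """Lift of the model's CTR over the baseline's, with a two-proportion
    z-test to check that the difference is statistically significant."""
    p_m = model_clicks / model_views
    p_b = base_clicks / base_views
    lift = p_m / p_b - 1.0
    # Pooled standard error for the difference in proportions
    p = (model_clicks + base_clicks) / (model_views + base_views)
    se = math.sqrt(p * (1 - p) * (1 / model_views + 1 / base_views))
    z = (p_m - p_b) / se
    return lift, z

lift, z = ctr_lift(1200, 20_000, 500, 10_000)
print(f"lift={lift:+.1%}, z={z:.1f}")
```

Tracked daily, a declining `lift` series (while `z` stays significant) is the retraining trigger the bullet describes.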

We use PSI (Population Stability Index) as a metric to measure feature and score drift; it also tells us when something has gone wrong. We also validate against business outcomes.
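The PSI calculation isn't spelled out in these notes; a minimal sketch of the usual formulation, binning production values against baseline deciles (the `psi` function and its arguments are my naming):

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline (expected) sample and a
    production (actual) sample of a feature or score."""
    # Bin edges come from the baseline distribution's quantiles
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the baseline range
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # eps guards against empty bins (log of zero)
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
drifted = rng.normal(0.5, 1, 10_000)  # mean shift in production
print(psi(baseline, baseline))  # ~0: stable
print(psi(baseline, drifted))   # noticeably above 0.1: investigate
```

A common rule of thumb (not from these notes): PSI below 0.1 is stable, 0.1 to 0.25 is a moderate shift, above 0.25 is a major shift.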

What to monitor

  • Model outputs:
      • Average error over the last x days
      • Accuracy loss
      • Check distribution against baseline (KS or goodness-of-fit tests, or integral-based tests like the “area metric” on CDFs)
      • Anomaly detection
      • % missing and imputed
      • Production errors in line with test-set errors
      • Precision / recall / etc.
  • Model features / weights:
      • Distribution of model features
      • Anomaly detection
      • How sensitive is the model’s accuracy to different test datasets?
  • Model input data:
      • Check distribution against baseline (KS or goodness-of-fit tests, or integral-based tests like the “area metric” on CDFs)
      • Anomaly detection
      • % missing and imputed
      • Concept drift of y-values
      • Data within limits (if applicable)
      • Check over time
      • Check compared to the model’s test / training sets
      • Multi-dimensional outlier detection (like HDOutliers)
  • Operational: usage changes
  • Technical: load balancing, response times, latency, errors, etc.
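The “check distribution against baseline” items can be implemented with a two-sample KS statistic; a minimal sketch (equivalent to what `scipy.stats.ks_2samp` reports, while the “area metric” variant would integrate the CDF gap instead of taking its maximum):

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical gap
    between the two empirical CDFs."""
    a, b = np.sort(sample_a), np.sort(sample_b)
    grid = np.concatenate([a, b])  # evaluate at every observed value
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(1)
train = rng.normal(0, 1, 5_000)       # training-time feature distribution
prod_ok = rng.normal(0, 1, 5_000)     # production sample, no drift
prod_drift = rng.normal(1, 1, 5_000)  # production sample with a mean shift
print(ks_statistic(train, prod_ok))     # small: distributions agree
print(ks_statistic(train, prod_drift))  # large: flag for investigation
```

The sample names and thresholds here are illustrative; in practice you would compare each production window against the model's training or test set, as the list above suggests.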

Creating monitoring

  • Out of the box / automation
  • Cronjob that checks lifts over time and emails if lift is “out of control” based on Shewhart control chart rules
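The notes don't say which Shewhart rules the cronjob applies; a sketch of the simplest one (a point beyond three standard deviations of the historical mean), with the function name and data invented for illustration:

```python
import statistics

def out_of_control(history, latest, sigmas=3.0):
    """Simplest Shewhart control-chart rule: flag the latest lift
    measurement if it falls more than `sigmas` standard deviations from
    the historical mean. (Fuller rule sets, e.g. the Western Electric
    rules, also flag runs and trends on one side of the mean.)"""
    mean = statistics.fmean(history)
    sd = statistics.stdev(history)
    return abs(latest - mean) > sigmas * sd

daily_lift = [0.21, 0.19, 0.20, 0.22, 0.18, 0.20, 0.21, 0.19]
print(out_of_control(daily_lift, 0.20))  # False: within control limits
print(out_of_control(daily_lift, 0.05))  # True: send the alert email
```

The cronjob would append each day's measured lift to `daily_lift` and email when this returns True.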

Delivery / Tools

  • Monitor in the office (“over our heads”)
  • Automated posting of results into Slack channel (“helps with more of the team having context”)
  • Model quality dashboards (powered by Looker) and a weekly review
  • Kibana
  • Owl Analytics (both models and datasets)
  • Anaconda, Civis, Domino (from Andrew Therriault’s deck)
  • Quilt
  • ModelDB

Who

  • One person leads on building the model and also builds the methods to check accuracy and update when needed -- teams are moving away from this so that everyone builds and can update
  • “Models are owned by the team, not an individual”
  • “Builders and maintainers are the same people and act on those changes”
  • Data ops
  • Model manager

Resources