Mo models, mo problems: tracking the quality of models in production
The tweet
Pain Points
- Delays in tracking or “getting to” models = decay
- A model's lifecycle gets harder to manage once it's deployed to production
- Model quality and model maintenance are “criminally underrepresented”
Ideas
- Definition of done updated to include a plan for tracking model performance (since model builder is most familiar with data and munging process for model)
- Checklist (à la The Checklist Manifesto)
- Assign expiration dates to model outputs when they’re created
- Definition of done updated to include a feature dictionary
- Serve an alternative baseline (like popular pages/items) to a small but statistically significant set of users and compare lift (in CTR or pageviews-per-session) against that baseline; retrain the model if lift declines over a significant time period
- Look into QSAR (pharma concept for time-series)
- Run two different but useful models side by side
- Champion challenger systems as part of the platform
- Apache Airflow DAGs / pipelines to test integrity of DB queries and data munging tasks before they hit models
- Quick check-in meetings at key development milestones (right people in the room)
- “You just need to make the process formal enough that people think about it, talk about it, and prepare for it.”
- If there is an acceptable error margin / rate, use this when setting up alerts
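The integrity-check idea above (test DB queries and munging output before they hit models) can be sketched as a standalone validation task that an Airflow DAG could call before training or scoring. The `validate_features` name and the schema format are assumptions for illustration, not from the source:

```python
import pandas as pd

def validate_features(df, schema):
    """Fail fast before features reach the model.

    schema (hypothetical format): {column: (min, max, max_null_fraction)}
    Returns a list of human-readable problems; empty list means the data passed.
    """
    problems = []
    for col, (lo, hi, max_null) in schema.items():
        if col not in df.columns:
            problems.append(f"{col}: missing column")
            continue
        # share of missing values, compared to the acceptable margin
        null_frac = df[col].isna().mean()
        if null_frac > max_null:
            problems.append(f"{col}: {null_frac:.1%} null")
        # range check on the non-null values
        vals = df[col].dropna()
        if len(vals) and (vals.min() < lo or vals.max() > hi):
            problems.append(f"{col}: values outside [{lo}, {hi}]")
    return problems
```

An Airflow task would simply raise (and so fail the DAG) if the returned list is non-empty, which keeps bad batches from ever reaching the model.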
We use the Population Stability Index (PSI) as a metric to measure feature and score drift (it also tells us when something has gone wrong). We also validate against business outcomes.
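PSI has no single canonical implementation; a minimal sketch, binning the current sample against quantiles of the baseline and applying the common rules of thumb (< 0.1 stable, 0.1–0.25 moderate shift, > 0.25 major shift), might look like:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a current sample."""
    # Bin edges from the baseline's quantiles (avoids empty baseline bins)
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))

    # Clip both samples into the baseline's range so every point is counted
    e_counts = np.histogram(np.clip(expected, edges[0], edges[-1]), bins=edges)[0]
    a_counts = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0]
    e_pct = e_counts / len(expected)
    a_pct = a_counts / len(actual)

    # Floor tiny proportions to avoid log(0)
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)

    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

The same function works for feature drift (baseline = training data) and score drift (baseline = scores at validation time).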
What to monitor
- Average error over the last x days
- Accuracy loss
- Check distribution against baseline (KS or goodness-of-fit tests, or integral-based tests like an area metric on the CDF)
- Anomaly detection
- % of missing and imputed
- Production errors in line with test set errors
- Precision / recall / etc.
- Model features / weights:
    - Distribution of model features
    - Anomaly detection
- How sensitive is the model's accuracy across different test datasets?
- Check distribution against baseline (KS or goodness-of-fit tests, or integral-based tests like an area metric on the CDF)
- Anomaly detection
- % of missing and imputed
- Concept drift of y-values
- Data within limits (if applicable)
    - Check over time
    - Check compared to model test / training set
- Multi-dimensional outlier detection (like HDOutliers)
- Operational: usage changes
- Technical: load balancing, response times, latency, errors, etc.
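The "check distribution against baseline" item above can be done with a two-sample Kolmogorov–Smirnov test using nothing but NumPy; the `ks_statistic` / `ks_drifted` names and the 1% significance level are illustrative (scipy.stats.ks_2samp computes the same statistic plus an exact p-value):

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = np.sort(sample_a), np.sort(sample_b)
    points = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, points, side="right") / len(a)
    cdf_b = np.searchsorted(b, points, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def ks_drifted(baseline, live, alpha=0.01):
    """Flag drift by comparing the statistic to the large-sample critical value."""
    n, m = len(baseline), len(live)
    critical = np.sqrt(-np.log(alpha / 2) / 2) * np.sqrt((n + m) / (n * m))
    return ks_statistic(baseline, live) > critical
```

Here `baseline` would be a feature (or score) column from the training/test set and `live` the same column from recent production traffic.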
Creating monitoring
- Out of the box / automation
- Cron job that checks lift over time and sends an email if lift is “out of control” based on Shewhart control chart rules
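A minimal version of that cron job's control-chart logic, using only two classic Shewhart / Western Electric rules; the function name and the choice of rules are assumptions:

```python
import numpy as np

def shewhart_alarms(lifts, center, sigma):
    """Check a series of daily lift values against two Shewhart rules:

    1. any point beyond the 3-sigma control limits
    2. eight consecutive points on the same side of the center line
    """
    values = np.asarray(lifts, dtype=float)
    alarms = []

    # Rule 1: a single point outside the control limits
    if np.any(np.abs(values - center) > 3 * sigma):
        alarms.append("point beyond 3-sigma limits")

    # Rule 2: a run of 8 points all above (or all below) the center line
    side = np.sign(values - center)
    run = 1
    for prev, cur in zip(side, side[1:]):
        run = run + 1 if cur == prev != 0 else 1
        if run >= 8:
            alarms.append("run of 8 on one side of the center line")
            break
    return alarms
```

The cron job would estimate `center` and `sigma` from a stable historical window and email whenever the returned list is non-empty; the "acceptable error margin" idea above sets how those limits are tuned.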
Delivery / Tools
- Monitor in the office (“over our heads”)
- Automated posting of results into Slack channel (“helps with more of the team having context”)
- Model quality dashboards (powered by Looker) and a weekly review
- Kibana
- Owl Analytics (both models and datasets)
- Anaconda, Civis, Domino (from Andrew Therriault’s deck)
- Quilt
- ModelDB
Who
- One person leads on building a model and also builds the methods to check its accuracy and update it when needed -- moving away from this so that everyone builds and can update
- “Models are owned by the team, not an individual”
- “Builders and maintainers are the same people and act on those changes”
Resources

