Data Quality
Machine Learning in Production • Christian Kaestner & Bogdan Vasilescu, Carnegie Mellon University • Fall 2025
Administrivia
Complete the Team Citizenship Evaluation if you haven’t yet
A/B Experiments: What if...?
Confidence in A/B Experiments
Group A
classic personalized content recommendation model
2158 Users
average 3:13 min time on site
Group B
updated personalized content recommendation model
10 Users
average 3:24 min time on site
What's the problem with comparing only the averages?
Analyzing Results: Stats 101
Analyzing Results: Stats 101
Significance testing also helps with comparisons: when the confidence intervals largely overlap, we cannot conclude that there is an actual difference between the groups.
This is quantified by the p-value:
Stats 101: How to compute p-value?
Parametric tests: Assume the compared groups are normally distributed with equal variances.
Non-parametric tests: Do not assume a normal distribution.
t-test
We will ask for a statistical test in M3 – many libraries implement them!
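As a minimal sketch of such a test, the following implements Welch's two-sample t-statistic with only the standard library; the p-value uses a normal approximation of the t-distribution, which is only reasonable for large samples. In practice you would call `scipy.stats.ttest_ind(a, b, equal_var=False)`, which computes the exact p-value. The sample values below are hypothetical time-on-site measurements.

```python
import math
import statistics
from statistics import NormalDist

def welch_t_test(a, b):
    """Welch's t-statistic with an approximate two-sided p-value.

    Normal approximation of the t-distribution (assumption: large
    samples). For real analyses, prefer
    scipy.stats.ttest_ind(a, b, equal_var=False).
    """
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    se = math.sqrt(var_a / len(a) + var_b / len(b))  # std. error of the difference
    t = (mean_a - mean_b) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))  # two-sided
    return t, p

# Hypothetical time-on-site samples (seconds) for groups A and B:
t, p = welch_t_test([193, 205, 188, 201, 196], [204, 212, 198, 210, 206])
```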
Decision tree of tests
Many other factors matter, e.g., dependent vs. independent measures
How many samples needed?
Too few?
Noise and random results!
Too many?
Risk of spreading bad designs!
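The sample-size tradeoff can be made concrete with a power analysis. Below is a hedged sketch using the standard normal-approximation formula for a two-sample t-test; libraries such as statsmodels (`TTestIndPower().solve_power`) give exact values.

```python
import math
from statistics import NormalDist

def samples_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sample t-test.

    Uses the normal approximation n ≈ 2 * ((z_{1-α/2} + z_power) / d)²
    (assumption: the usual large-sample formula); statsmodels'
    TTestIndPower().solve_power gives exact values.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # ≈ 1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)          # ≈ 0.84 for power = 0.8
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Small effects need large groups; large effects need far fewer samples:
small = samples_per_group(0.2)  # hundreds of users per group
large = samples_per_group(0.8)  # a couple dozen per group
```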
Concurrent A/B testing
Multiple experiments at the same time
Other Experiments in Production
Chaos experiments
Shadow releases / traffic teeing
Canary releases
Chaos Experiments
Deliberate introduction of faults in production to test robustness.
Chaos Experiments for ML Components?
Shadow releases / traffic teeing
Run both models in parallel
Use predictions of old model in production
Compare differences between model predictions
If possible, compare against ground truth labels/telemetry
Examples?
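The teeing pattern above can be sketched in a few lines: serve the old model's answer, run the new model in the shadow, and log both for offline comparison. All names here (`old_model`, `new_model`, the log structure) are hypothetical.

```python
def shadow_predict(request, old_model, new_model, log):
    """Serve the old model's prediction; run the new model in the shadow.

    old_model / new_model are hypothetical callables; `log` collects
    (input, old, new) triples for offline comparison.
    """
    old = old_model(request)
    new = new_model(request)       # shadow result, never shown to users
    log.append((request, old, new))
    return old                     # production answer comes from the old model

def disagreement_rate(log):
    """Fraction of logged requests where the two models differ."""
    return sum(1 for _, old, new in log if old != new) / len(log)
```

If ground-truth labels arrive later (e.g., via telemetry), the same log lets you compare each model's accuracy on identical traffic.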
Canary Releases
Release new version to small percentage of population (like A/B testing)
Automatically roll back if quality measures degrade
Automatically and incrementally increase deployment to 100% otherwise
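A minimal sketch of the two mechanics above: stable assignment of a user fraction to the canary, and an automated rollback decision. This is illustrative; in practice a load balancer or feature-flag service does the routing, and the quality measure and tolerance are assumptions you must pick for your system.

```python
import hashlib

def in_canary(user_id, fraction):
    """Deterministically assign a stable fraction of users to the canary.

    Hashing the user id keeps each user in the same group across
    requests (sticky assignment).
    """
    h = hashlib.sha256(str(user_id).encode()).digest()
    return h[0] / 256 < fraction

def rollout_decision(canary_error_rate, baseline_error_rate, tolerance=0.01):
    """Roll back if the canary degrades quality beyond a tolerance;
    otherwise incrementally increase its traffic share."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return "rollback"
    return "increase-traffic"
```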
Advice for Experimenting in Production
Minimize blast radius (canary, A/B, chaos expr)
Automate experiments and deployments
Allow for quick rollback of poor models (continuous delivery, containers, load balancers, versioning)
Make decisions with confidence, compare distributions
Monitor, monitor, monitor
More Quality Assurance...Data Quality
Readings
Required reading:
Recommended reading:
Learning Goals
Poor Data Quality has Consequences
(often delayed, hard-to-fix consequences)
Garbage in → Garbage Out
Example: systematic bias in training.
Poor data quality leads to poor models
Often not detectable in offline evaluation
Causes problems in production - now difficult to correct
Data Quality is a System-Wide Concern
Data Cascades
"Compounding events causing negative, downstream effects from data issues, that result in technical debt over time."
Common Data Cascades
Physical world brittleness
Inadequate domain expertise
Conflicting reward systems
Poor (cross-org.) documentation
Sambasivan, N., et al. (2021). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proc. Conference on Human Factors in Computing Systems.
Interacting with physical world brittleness
Brittle deployments interacting with not-digitised physical worlds
e.g., an AI model for the COVID-19 pandemic on day 1 versus day 100 required a total change in various assumptions, since the pandemic and human responses were volatile and dynamic
Inadequate application-domain expertise
AI practitioners are responsible for data sense-making in contexts in which they do not have domain expertise.
Conflicting Reward Systems
Misaligned incentives and priorities between practitioners, domain experts, and field partners.
e.g., when a clinician spends a lot of time punching in data, not paying attention to the patient, that has a human cost
Poor Cross-organizational Documentation
Lack of documentation across cross-organizational relations causes a lack of shared understanding of metadata
e.g., missing metadata and collaborators changing the schema without understanding its context led to the loss of four months of precious medical robotics data collection
Case Study: Inventory Management
Goal: Train an ML model to predict future sales; make decisions about what to (re)stock/when/how many...
Discussion: Possible Data Cascades?
Data Documentation
Let's use data documentation as an entry point to discuss what aspects of data we care about.
Data Quality is a System-Wide Concern
Data flows across components, e.g., from user interface into database to crowd-sourced labeling team into ML pipeline
Humans interacting with the system
Organizational practices
Documentation at the interfaces is important
Data Quality Documentation
Teams rarely document expectations of data quantity or quality.
Data quality tests are rare, but some teams adopt defensive monitoring.
Several proposals exist for documenting datasets, including Datasheets for Datasets and the Dataset Nutrition Label
Data Card
Data Cards are for fostering transparent, purposeful and human-centered documentation of datasets within the practical contexts of industry and research. They are structured summaries of essential facts about various aspects of ML datasets…provide explanations of processes and rationales that shape the data and consequently the models.
Entries in data card
Very good reference for data quality, but way too difficult to fill out for every dataset so usually ignored...
Data Cards give you an idea of what might impact data quality
We will touch on some:
Understand and improve data quality
Assuming you don't have the best-documented and cleanest data in the world (typical!), how do you evaluate your data and how do you clean it?
Data cleaning and repairing account for about 60% of the work of data scientists.
"Everyone wants to do the model work, not the data work"
Own experience?
Accuracy vs. Precision
Accuracy: Reported values (on average) represent real value
Precision: Repeated measurements yield the same result
Accurate, but imprecise: Q. How to deal with this issue?
Inaccurate, but precise: ?
(CC-BY-4.0 by Arbeck)
Data Accuracy and Precision: Impact on ML
More data → better models (up to a point, with diminishing returns)
Noisy data (imprecise) → less confident models, more data needed
Inaccurate data → misleading models, biased models
Invest in data quality, not just quantity
Dealing with noisy data
Where does noise come from and how do we fix it?
What do we mean by clean data?
Accuracy: The data was recorded correctly.
Completeness: All relevant data was recorded.
Uniqueness: The entries are recorded once.
Consistency: The data agrees with itself.
Timeliness: The data is kept up to date.
Challenge from collection: Data comes from many sources
e.g. For the inventory system:
Challenge from collection: Data comes from many sources
These sources have different levels of reliability and quality
What happens: Data is noisy
Wrong results and computations, crashes
Duplicate data, near-duplicate data
Out of order data
Data format invalid
Two levels of data precision
Data Integrity / Schema
Ensuring basic consistency about shape and types
Schema in Relational Databases
Data Schema
Define the expected format of data
Data can be automatically checked against schema
Protects against change; explicit interface between components
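A minimal sketch of checking records against an expected schema, using only the standard library; the field names for the inventory scenario are hypothetical. Real pipelines use database constraints or schema libraries (discussed below) instead of hand-rolled checks.

```python
# Expected schema for inventory sale records (hypothetical field names):
SCHEMA = {"product_id": int, "quantity": int, "sale_date": str}

def check_schema(record, schema=SCHEMA):
    """Return a list of schema violations (missing fields, wrong types)
    for one record; an empty list means the record conforms."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(record[field]).__name__}")
    return errors

check_schema({"product_id": 42, "quantity": 3, "sale_date": "2025-10-01"})  # ok
check_schema({"product_id": "42", "quantity": 3})  # wrong type + missing field
```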
Data Schema Constraints for Inventory System?
Schema Problems: Uniqueness, data format, integrity, …
Dirty Data: Example
Problems with this data? Which problems are schema problems?
What Happens When New Data Violates Schema?
Modern Databases: Schema-Less
Also vector databases, schema-aware databases that basically store long text in each cell, etc.
Image source: https://www.kdnuggets.com/2021/05/nosql-know-it-all-compendium.html
Schema-Less Data Exchange
Schema-Less Data Exchange
Q. Benefits? Drawbacks?
Schema Library: Apache Avro
Schema Library: Apache Avro
Schema specification in JSON format
Serialization and deserialization with automated checking
Native support in Kafka
Benefits
Drawbacks
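An Avro schema is itself JSON. Below is a sketch of what such a definition might look like for the inventory example (the record and field names are hypothetical), parsed here with the standard library; actual serialization, deserialization, and validation would use a library such as fastavro or the official avro package.

```python
import json

# Hypothetical Avro schema for inventory sale records, written as JSON:
AVRO_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Sale",
  "fields": [
    {"name": "product_id", "type": "long"},
    {"name": "quantity",   "type": "int"},
    {"name": "sale_date",  "type": "string"},
    {"name": "discount",   "type": ["null", "double"], "default": null}
  ]
}
""")

# A library like fastavro uses this schema to serialize records compactly
# and to reject records that do not match the declared shape and types.
field_names = [f["name"] for f in AVRO_SCHEMA["fields"]]
```

Note the union type `["null", "double"]`: Avro makes optionality explicit, so a missing discount is a declared possibility rather than a silent quality problem.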
Many Schema Libraries/Formats
Examples
Schemas are useful not just for databases, but also for data transmission
https://openai.com/index/introducing-structured-outputs-in-the-api/
Summary: Schema
Basic structure and type definition of data
Well supported in databases and many tools
Very low bar for data quality
Wrong and Inconsistent Data
Application- and domain-specific data issues
Dirty Data: Example
Problems with the data beyond schema problems?
Wrong and Inconsistent Data
Q. How can we detect and fix these problems?
Discussion: Wrong and Inconsistent Data?
Data Cleaning Overview
Data analysis / Error detection
Error repair
Error Detection Examples
Illegal values: min, max, variance, deviations, cardinality
Misspelling: sorting + manual inspection, dictionary lookup
Missing values: null values, default values
Duplication: sorting, edit distance, normalization
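The heuristics above can be sketched as simple instance-level checks with the standard library (the records and thresholds below are hypothetical); real pipelines use pandas or dedicated tools like Great Expectations for this.

```python
from collections import Counter

def detect_errors(rows, column, min_value=None, max_value=None):
    """Flag missing values, out-of-range values, and exact duplicates
    in one column of a list of dict records (simple heuristics only)."""
    issues = []
    values = [row.get(column) for row in rows]
    for i, v in enumerate(values):
        if v is None:
            issues.append((i, "missing value"))
        elif min_value is not None and v < min_value:
            issues.append((i, f"below minimum {min_value}"))
        elif max_value is not None and v > max_value:
            issues.append((i, f"above maximum {max_value}"))
    for v, count in Counter(v for v in values if v is not None).items():
        if count > 1:
            issues.append((v, f"duplicate value ({count} occurrences)"))
    return issues
```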
Example Tool: Great Expectations
Supports schema validation and custom instance-level checks.
Example Tool: Great Expectations
Rule-based detection: Data Quality Rules
Rules can be used to reject data or repair it
Invariants on data that must hold
Typically about relationships of multiple attributes or data sources, e.g.,
Classic integrity constraints in databases or conditional constraints
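One classic family of such rules are functional dependencies: one attribute should determine another. A sketch of checking such a rule, with the zip-code-determines-city example as a hypothetical illustration:

```python
from collections import defaultdict

def dependency_violations(rows, lhs, rhs):
    """Find violations of the rule 'lhs determines rhs',
    e.g., the same zip code must always map to the same city."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}

rows = [
    {"zip": "15213", "city": "Pittsburgh"},
    {"zip": "15213", "city": "Pittsburg"},   # inconsistent spelling
    {"zip": "10001", "city": "New York"},
]
dependency_violations(rows, "zip", "city")  # flags zip 15213
```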
ML-based for Detecting Inconsistencies
Example: HoloClean
Discovery of Data Quality Rules
Rules directly taken from external databases
Given clean data,
Given mostly clean data (probabilistic view),
Given labeled dirty data or user feedback,
Discussion: Data Quality Rules?
Dealing with Drift
Why does my model begin to perform poorly over time?
A very particular form of data accuracy problem (the data becomes wrong), caused not by human creators but by the changing world. Very prevalent and affects the product!
Data changes
System objective changes over time
Software components are upgraded or replaced
Prediction models change
Quality of supplied data changes
User behavior changes
Assumptions about the environment no longer hold
Users can react to model output; or try to game/deceive the model
Examples in inventory system?
Types of Drift
Drift & Model Decay
Concept drift (or concept shift)
Data drift (or covariate shift, virtual drift, distribution shift, or population drift)
Upstream data changes
How do we fix these drifts?
On Terminology
Concept and data drift are separate concepts
In practice and literature, not always clearly distinguished
Colloquially, "drift" encompasses all forms of model degradation and environment change
Define the term for your target audience
Breakout: Drift in the Inventory System
What kind of drift might be expected?
As a group, tagging members, write plausible examples in #lecture:
Watch for Degradation in Prediction Accuracy
Indicators of Concept Drift
How to detect concept drift in production?
Indicators of Concept Drift
Model degradations observed with telemetry
Telemetry indicates different outputs over time for similar inputs
Differences in influential features and feature importance over time
Relabeling training data changes labels
Interpretable ML models indicate rules that no longer fit
(many papers on this topic, typically on statistical detection)
Indicators of Data Drift
How to detect data drift in production?
Indicators of Data Drift
Model degradations observed with telemetry
Distance between input distribution and training distribution increases
Average confidence of model predictions declines
Relabeling of training data retains stable labels
Detecting Data Drift
Data Distribution Analysis
Plot distributions of features (histograms, density plots, kernel density estimation)
Define distance function between inputs and identify distance to closest training data (e.g., energy distance, see also kNN)
Anomaly detection and "out of distribution" detection
Compare distribution of output labels
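One widely used distance between a training distribution and live inputs is the population stability index (PSI). A stdlib sketch, binning both samples on the training data's range; the interpretation thresholds in the docstring are a common rule of thumb, not a standard, and vary by team.

```python
import math

def population_stability_index(train, live, bins=10):
    """PSI between a training sample and live inputs for one feature.

    Rule of thumb (assumption, varies by team): PSI < 0.1 stable,
    0.1–0.25 moderate shift, > 0.25 significant drift.
    """
    lo, hi = min(train), max(train)
    width = (hi - lo) / bins or 1.0

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # clip out-of-range live values into the edge bins
            i = min(int((x - lo) / width), bins - 1)
            counts[max(i, 0)] += 1
        # small smoothing constant avoids log(0) for empty bins
        return [(c + 1e-4) / (len(sample) + bins * 1e-4) for c in counts]

    p, q = proportions(train), proportions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

Computed per feature on a sliding window of production inputs, a rising PSI is a trigger for investigation or retraining.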
Microsoft Azure Data Drift Dashboard
Dealing with Drift
Regularly retrain model on recent data
Involve humans when increasing inconsistencies detected
Monitoring, monitoring, monitoring!
Preview: Inaccurate Data can also be caused by factors other than drift
How do you detect and fix more systemic data quality issues?
Challenge from collection and processing: Bias
The concept of "raw data" is misleading; data is always some proxy we actively collect to represent the world. Someone decides...
These decisions change what you can do with the data
Recommended Reading: Gitelman, Lisa, Virginia Jackson, Daniel Rosenberg, Travis D. Williams, Kevin R. Brine, Mary Poovey, Matthew Stanley et al. "Data bite man: The work of sustaining a long-term study." In "Raw Data" Is an Oxymoron, (2013), MIT Press: 147-166.
What happens: Data is inaccurate (for what you want to do)
Missing data
Biased data
Systematic errors in data distribution
Summary
Data quality is a system-level concern
Data from many sources, often inaccurate, imprecise, inconsistent, incomplete, … – many different forms of data quality problems
Many mechanisms for enforcing consistency and cleaning
Concept and data drift are key challenges → monitor
Further Readings