1 of 95

Data Quality

  • Machine Learning in Production

2 of 95

Administrivia

Complete the Team Citizenship Evaluation if you haven’t yet


3 of 95

A/B Experiments: What if...?

  • ... we had plenty of subjects for experiments
  • ... we could randomly assign subjects to treatment/control groups without them knowing
  • ... we could analyze small individual changes and keep everything else constant


4 of 95

Confidence in A/B Experiments

Group A: classic personalized content recommendation model, 2158 users, average 3:13 min time on site

Group B: updated personalized content recommendation model, 10 users, average 3:24 min time on site

What's the problem of comparing the average?


5 of 95

Analyzing Results: Stats 101

  • All data follow some distribution.
  • Normal distributions are very common (e.g., most students score near the average, not near the maximum).
  • When we draw samples from the same distribution, we can get different samples just by chance.
  • Statistics tells us whether “interesting” patterns we observe in our samples actually exist in the population or are just sampling noise, and gives us numbers to express that uncertainty.
  • Confidence interval: given the average of a sample (e.g., the average on-site time of 10 users), what range plausibly contains the average of the whole population (all users)? (See the sketch below.)
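As an illustration of the confidence-interval idea above, here is a minimal sketch (with made-up time-on-site values, assuming SciPy is available) that computes a 95% confidence interval for the mean:

```python
import numpy as np
from scipy import stats

# made-up on-site times (in seconds) for 10 users in one experiment group
times = np.array([180, 200, 215, 190, 230, 205, 185, 240, 210, 195])

mean = times.mean()
sem = stats.sem(times)  # standard error of the mean
low, high = stats.t.interval(0.95, df=len(times) - 1, loc=mean, scale=sem)
print(f"sample mean = {mean:.1f}s, 95% CI = ({low:.1f}s, {high:.1f}s)")
```

With only 10 users the interval is wide; with thousands of users it shrinks, which is exactly why the group sizes on the previous slide matter.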


6 of 95

Analyzing Results: Stats 101

Significance testing also helps with comparisons. When the confidence intervals largely overlap, we cannot conclude that there is an actual difference between the groups.

This is quantified by p-value:

  • Misconception: “what is the probability a difference between two samples is by chance?” (lower = higher likelihood of a sig. difference)
  • Reality: the p-value starts with the assumption that there's no real difference (null hypothesis), then asks “how likely would our observed data be in that scenario?”


7 of 95

Stats 101: How to compute p-value?

Parametric tests: Assume the compared groups are normally distributed and have equal variances.

  • Tests: t-test, ANOVA, & linear regression.
  • For: model accuracy, human click streams
  • More sensitive and powerful when the requirement is met.

Non-parametric tests: Does not assume normal distribution.

  • For: ordinal or categorical data like Likert-scale ratings
  • Tests: e.g., the Wilcoxon signed-rank test compares users’ ordinal Likert-scale ratings


8 of 95

t-test

We will ask for a statistical test in M3; many libraries implement these tests! (See the sketch below.)
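A minimal sketch of what such a library call looks like, assuming SciPy and made-up time-on-site numbers (not the course's reference solution for M3):

```python
from scipy import stats

# time on site (seconds) for two experiment groups (illustrative numbers)
group_a = [193, 205, 188, 210, 199, 201, 185, 220, 195, 208]
group_b = [204, 215, 198, 225, 210, 212, 196, 230, 205, 218]

t_stat, p_parametric = stats.ttest_ind(group_a, group_b)    # parametric t-test
u_stat, p_nonparam = stats.mannwhitneyu(group_a, group_b)   # non-parametric alternative for unpaired groups
print(f"t-test p = {p_parametric:.3f}, Mann-Whitney U p = {p_nonparam:.3f}")
```

(For paired ordinal data, such as the same users rating two versions, scipy.stats.wilcoxon would be the non-parametric choice.)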


9 of 95

Decision tree of tests

Many, many other factors matter, e.g., dependent vs. independent measures


10 of 95

How many samples needed?

Too few?

Noise and random results!

Too many?

Risk of spreading bad designs!
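One common way to answer this question is an a priori power analysis. A hedged sketch using statsmodels; the effect size, power, and alpha values are illustrative assumptions:

```python
from statsmodels.stats.power import TTestIndPower

# How many users per group do we need to detect a "small" effect (Cohen's d = 0.2)
# with 80% power at a 5% significance level?
n_per_group = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(f"about {n_per_group:.0f} users per group")
```

Smaller expected effects require much larger groups; larger effects can be detected with fewer users, limiting how many users are exposed to a potentially bad design.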


11 of 95

Concurrent A/B testing

Multiple experiments at the same time

  • Independent experiments on different populations – interactions not explored
  • Multi-factorial designs, well understood but typically too complex, e.g., not all combinations valid or interesting
  • Grouping in sets of experiments (layers)


12 of 95

Other Experiments in Production

Chaos experiments

Shadow releases / traffic teeing

Canary releases


13 of 95

Chaos Experiments

Deliberate introduction of faults in production to test robustness.


14 of 95

Chaos Experiments for ML Components?


15 of 95

Shadow releases / traffic teeing

Run both models in parallel

Use predictions of old model in production

Compare differences between model predictions

If possible, compare against ground truth labels/telemetry
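A minimal sketch of the idea; the model functions and feature names are hypothetical stand-ins, not an actual serving framework:

```python
import logging

def old_model(features):
    # current production model (stub)
    return round(1.2 * features["recent_sales"])

def new_model(features):
    # candidate model running in shadow mode (stub)
    return round(1.1 * features["recent_sales"] + 0.5 * features["seasonality"])

def predict_with_shadow(features):
    production_prediction = old_model(features)
    shadow_prediction = new_model(features)            # computed, but never shown to users
    if shadow_prediction != production_prediction:     # log disagreements for offline analysis
        logging.info("shadow disagreement: prod=%s shadow=%s features=%s",
                     production_prediction, shadow_prediction, features)
    return production_prediction                       # only the old model affects behavior

print(predict_with_shadow({"recent_sales": 10, "seasonality": 4}))
```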

Examples?


16 of 95

Canary Releases

Release new version to small percentage of population (like A/B testing)

Automatically roll back if quality measures degrade

Automatically and incrementally increase deployment to 100% otherwise
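A rough sketch of the control loop behind a canary release; all functions, steps, and thresholds here are hypothetical, and real deployments would use load balancers, feature flags, or a service mesh for the routing step:

```python
import time

def set_traffic_share(percent):
    """Stub: reconfigure routing so `percent` of requests hit the new model."""
    print(f"routing {percent}% of traffic to the new model")

def quality_ok():
    """Stub: compare telemetry of canary vs. baseline (error rate, latency, proxy accuracy)."""
    return True

def canary_rollout(steps=(1, 5, 25, 50, 100), wait_minutes=30):
    for percent in steps:
        set_traffic_share(percent)
        time.sleep(wait_minutes * 60)   # let telemetry accumulate at this traffic share
        if not quality_ok():            # degradation detected -> automatic rollback
            set_traffic_share(0)
            return "rolled back"
    return "fully deployed"
```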


17 of 95

Advice for Experimenting in Production

Minimize blast radius (canary, A/B, chaos expr)

Automate experiments and deployments

Allow for quick rollback of poor models (continuous delivery, containers, load balancers, versioning)

Make decisions with confidence, compare distributions

Monitor, monitor, monitor


18 of 95

More Quality Assurance...Data Quality


19 of 95

Readings

Required reading:

Recommended reading:

  • Schelter, S., Lange, D., Schmidt, P., Celikel, M., Biessmann, F., and Grafberger, A. (2018). Automating large-scale data quality verification. Proceedings of the VLDB Endowment, 11(12), pp. 1781-1794.


20 of 95

Learning Goals

  • Consider data quality as part of a system; design an organization that values data quality
  • Distinguish precision and accuracy; understand the tradeoff between better models and more data
  • Use schema languages to enforce data schemas
  • Design and implement automated quality assurance steps that check data schema conformance and distributions
  • Devise infrastructure for detecting data drift and schema violations


21 of 95

Poor Data Quality has Consequences

(often delayed, hard-to-fix consequences)


22 of 95

Garbage in → Garbage Out

Example: systematic bias in training.

Poor data quality leads to poor models

Often not detectable in offline evaluation

Causes problems in production - now difficult to correct


23 of 95

Data Quality is a System-Wide Concern


24 of 95

Data Cascades

"Compounding events causing negative, downstream effects from data issues, that result in technical debt over time."


25 of 95

Common Data Cascades

Physical world brittleness

  • Idealized data, ignoring realities and change of real-world data
  • Static data, one time learning mindset, no planning for evolution

Inadequate domain expertise

  • Not understanding data and its context
  • Involving experts only late for troubleshooting

Conflicting reward systems

  • Missing incentives for data quality
  • Not recognizing the importance of data quality, discarded as a technicality
  • Missing data literacy with partners

Poor (cross-org.) documentation

  • Conflicts at team/organization boundary
  • Undetected drift

Sambasivan, N., et al. (2021). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proc. Conference on Human Factors in Computing Systems.


26 of 95

Interacting with physical world brittleness

Brittle deployments interacting with not-digitised physical worlds

  • Time to manifest: 2-3 years to emerge, almost always in the production stage
  • Impact: Complete model failure, abandonment of projects, harms to beneficiaries from mispredictions
  • Triggers: drifts (defined more formally later!)
    • Hardware drifts (rain, wind, fingerprints, shadows)
    • Environmental drifts (lighting, temperature, humidity)
    • Social drifts (new regulations, new user behaviors)
  • Address: Monitor data source, retrain models, introduce noise in data

e.g., an AI model for the COVID-19 pandemic on day 1 versus day 100 required completely different assumptions, since the pandemic and human responses were volatile and dynamic


27 of 95

Inadequate application-domain expertise

AI practitioners are responsible for data sense-making in contexts in which they do not have domain expertise.

  • Time to manifest: After building models through client feedback & system performance
  • Impact: Costly modification (improve labels, collect more data), Unanticipated downstream impacts
  • Triggers:
    • Subjectivity in ground truths (e.g., decision history on insurance companies’ claims)
    • Poor application-domain expertise in finding representative data (e.g., cannot take 90% of the data from one hospital and generalize for the entire world!)
  • Address: Faithfully document data sources, involve domain experts in data collection


28 of 95

Conflicting Reward Systems

Misaligned incentives and priorities between practitioners, domain experts, and field partners.

  • Time to manifest: Model deployment
  • Impact: Costly iterations, moving to an alternate data source, quitting the project
  • Triggers: Need annotation but...
    • Inserted as extraneous work
    • Not compensated well
    • Competing priority with partners’ primary responsibility
  • Address: Provide incentives & training

e.g., when a clinician spends a lot of time punching in data, not paying attention to the patient, that has a human cost


29 of 95

Poor Cross-organizational Documentation

Lack of documentation across cross-organizational relationships, leading to a poor understanding of metadata

  • Time to manifest: during manual reviews or by “chance”
  • Impact: Wasted time and effort from using incorrect data, being blocked on building models, and discarding subsets or entire datasets
  • Triggers:
    • Inherited datasets lacking critical details
    • Field partners not being aware of constraints in achieving good-quality AI
  • Address: Create a data curation plan in advance and take ample field notes in order to create reproducible assets for data

e.g., a lack of metadata and collaborators changing the schema without understanding its context led to the loss of four months of precious medical robotics data collection


30 of 95

Case Study: Inventory Management

Goal: Train an ML model to predict future sales; make decisions about what to (re)stock/when/how many...


31 of 95

Discussion: Possible Data Cascades?

  • Interacting with physical world brittleness
  • Inadequate domain expertise
  • Conflicting reward systems
  • Poor (cross-organizational) documentation


32 of 95

Data Documentation

Let's use data documentation as an entry point to discuss what aspects of data we care about.


33 of 95

Data Quality is a System-Wide Concern

Data flows across components, e.g., from the user interface into a database, to a crowd-sourced labeling team, and into the ML pipeline

Humans interacting with the system

  • Entering data, labeling data
  • Observed with sensors/telemetry
  • Incentives, power structures, recognition

Organizational practices

  • Value, attention, and resources given to data quality

Documentation at the interfaces is important


34 of 95

Data Quality Documentation

Teams rarely document expectations of data quantity or quality.

Data quality tests are rare, but some teams adopt defensive monitoring.

  • Local tests about assumed structure and distribution of data
  • Identify drift early and reach out to producing teams

Several ideas for documenting distributions, including Datasheets and Dataset Nutrition Label

  • Mostly focused on static datasets, describing origin, considerations, labeling procedures, and distributions


35 of 95

Data Card

Data Cards are for fostering transparent, purposeful and human-centered documentation of datasets within the practical contexts of industry and research. They are structured summaries of essential facts about various aspects of ML datasets…provide explanations of processes and rationales that shape the data and consequently the models.


36 of 95

Entries in data card

A very good reference for data quality, but far too difficult to fill out for every dataset, so usually ignored...


37 of 95

Data Cards give you an idea of what might impact data quality

We will touch on some:

  • Data quality, in terms of noise (e.g. from upstream data source, from human labelers)
  • Data drifts (static dataset and its source, vs. the world now)
  • Data quality, in terms of distributions (disagreements between annotators, biases toward certain distributions, annotation incentives that favor easy labels, etc.)
  • Data curation (shape the data towards what you want)


38 of 95

Understand and improve data quality

Assuming you don't have the best-documented and cleanest data in the world (typical!), how do you evaluate your data and how do you clean it?


39 of 95

Data cleaning and repairing account for about 60% of the work of data scientists.

"Everyone wants to do the model work, not the data work"

Own experience?


40 of 95

Accuracy vs. Precision

Accuracy: Reported values (on average) represent real value

Precision: Repeated measurements yield the same result

Accurate, but imprecise: Q. How to deal with this issue?

Inaccurate, but precise: ?

(CC-BY-4.0 by Arbeck)


41 of 95

Data Accuracy and Precision: Impact on ML

More data → better models (up to a point, diminishing effects)

Noisy data (imprecise) → less confident models, more data needed

  • some ML techniques are more or less robust to noise (more on robustness in a later lecture)

Inaccurate data → misleading models, biased models

  • Need the “right” data

Invest in data quality, not just quantity


42 of 95

Dealing with noisy data

Where does noise come from and how do we fix it?


43 of 95

What do we mean by clean data?

Accuracy: The data was recorded correctly.

Completeness: All relevant data was recorded.

Uniqueness: The entries are recorded once.

Consistency: The data agrees with itself.

Timeliness: The data is kept up to date.


44 of 95

Challenge from collection: Data comes from many sources

e.g. For the inventory system:


45 of 95

Challenge from collection: Data comes from many sources

  • Manually entered
  • Generated through actions in IT systems
  • Logging information, traces of user interactions
  • Sensor data
  • Crowdsourced

These sources have different levels of reliability and quality


46 of 95

What happens: Data is noisy

Wrong results and computations, crashes

Duplicate data, near-duplicate data

Out of order data

Data format invalid


47 of 95

Two levels of data precision

  1. Data Integrity / Schema: Ensuring basic consistency about shape and types
  2. Wrong and inconsistent data: Application- and domain-specific data issues


48 of 95

Data Integrity / Schema

Ensuring basic consistency about shape and types


49 of 95

Schema in Relational Databases


50 of 95

Data Schema

Define the expected format of data

  • expected fields and their types
  • expected ranges for values
  • constraints among values (within and across sources)

Data can be automatically checked against schema

Protects against change; explicit interface between components
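A minimal, hand-rolled sketch of such a check for inventory-style records; the field names and ranges are illustrative, and in practice a schema library would do this:

```python
# expected fields, their types, and (optional) valid value ranges
SCHEMA = {
    "product_id": (int, (1, None)),
    "quantity": (int, (0, 100_000)),
    "unit_price": (float, (0.0, None)),
    "category": (str, None),
}

def schema_violations(record):
    problems = []
    for field, (expected_type, value_range) in SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}, got {type(value).__name__}")
            continue
        if value_range is not None:
            low, high = value_range
            if (low is not None and value < low) or (high is not None and value > high):
                problems.append(f"{field}: value {value} out of range {value_range}")
    return problems

# flags the negative quantity and the missing category field
print(schema_violations({"product_id": 42, "quantity": -3, "unit_price": 1.99}))
```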


51 of 95

Data Schema Constraints for Inventory System?


52 of 95

Schema Problems: Uniqueness, data format, integrity, …

  • Illegal attribute values: bdate=30.13.70
  • Violated attribute dependencies: age=22, bdate=12.02.70
  • Uniqueness violation: (name=“John Smith”, SSN=“123456”), (name=“Peter Miller”, SSN=“123456”)
  • Referential integrity violation: emp=(name=“John Smith”, deptno=127) if department 127 not defined


53 of 95

Dirty Data: Example

Problems with this data? Which Problems are Schema Problems?


54 of 95

What Happens When New Data Violates Schema?


55 of 95

Modern Databases: Schema-Less

Also vector databases, schema-aware databases that basically store long text in each cell, etc.

Image source: https://www.kdnuggets.com/2021/05/nosql-know-it-all-compendium.html


56 of 95

Schema-Less Data Exchange

  • CSV files
  • Key-value stores (JSON, XML, NoSQL databases)
  • Message brokers
  • REST API calls
  • R/Pandas Dataframes


57 of 95

Schema-Less Data Exchange

Q. Benefits? Drawbacks?


58 of 95

Schema Library: Apache Avro


59 of 95

Schema Library: Apache Avro

Schema specification in JSON format

Serialization and deserialization with automated checking

Native support in Kafka

Benefits

  • Serialization in space efficient format
  • APIs for most languages (ORM-like)
  • Versioning constraints on schemas

Drawbacks

  • Reading/writing overhead
  • Binary data format, extra tools needed for reading
  • Requires external schema and maintenance
  • Learning overhead
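As a rough sketch of the JSON schema specification mentioned above, here is a small Avro schema for a hypothetical sale record, used through the fastavro library (one of several Python options; the field names and values are illustrative assumptions):

```python
import io
import fastavro

schema = {
    "type": "record",
    "name": "Sale",
    "fields": [
        {"name": "product_id", "type": "int"},
        {"name": "quantity", "type": "int"},
        {"name": "timestamp", "type": "long"},
        {"name": "store", "type": ["null", "string"], "default": None},
    ],
}
parsed = fastavro.parse_schema(schema)

records = [{"product_id": 42, "quantity": 3, "timestamp": 1735689600, "store": "Pittsburgh"}]
buffer = io.BytesIO()
fastavro.writer(buffer, parsed, records)   # serialization fails if a record violates the schema
buffer.seek(0)
print(list(fastavro.reader(buffer)))
```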


60 of 95

Many Schema Libraries/Formats

Examples

  • Avro
  • XML Schema
  • Protobuf
  • Thrift
  • Parquet
  • ORC


61 of 95

Schemas are not just for databases, but also for data transmission

https://openai.com/index/introducing-structured-outputs-in-the-api/


62 of 95

Summary: Schema

Basic structure and type definition of data

Well supported in databases and many tools

Very low bar for data quality


63 of 95

Wrong and Inconsistent Data

Application- and domain-specific data issues


64 of 95

Dirty Data: Example

Problems with the data beyond schema problems?


65 of 95

Wrong and Inconsistent Data

  • Missing values: phone=999-9999999
  • Misspellings: city=Pittsburg
  • Misfielded values: city=USA
  • Duplicate records: name=John Smith, name=J. Smith
  • Wrong reference: emp=(name=“John Smith”, deptno=127) if department 127 defined but wrong

Q. How can we detect and fix these problems?


66 of 95

Discussion: Wrong and Inconsistent Data?


67 of 95

Data Cleaning Overview

Data analysis / Error detection

  • Usually focused on specific kind of problems, e.g., duplication, typos, missing values, distribution shift
  • Detection in input data vs detection in later stages (more context)

Error repair

  • Repair data vs repair rules, one at a time or holistic
  • Data transformation or mapping
  • Automated vs human guided


68 of 95

Error Detection Examples

Illegal values: min, max, variance, deviations, cardinality

Misspelling: sorting + manual inspection, dictionary lookup

Missing values: null values, default values

Duplication: sorting, edit distance, normalization
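A few of these detection heuristics in a short pandas sketch; the small example table is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Pittsburgh", "Pittsburg", "New York", "Pittsburgh", None],
    "age": [34, 29, 41, 34, 220],
    "email": ["a@x.com", "b@x.com", "c@x.com", "a@x.com", "d@x.com"],
})

print(df.isna().sum())                                   # missing values per column
print(df[(df["age"] < 0) | (df["age"] > 120)])           # illegal values outside a plausible range
print(df[df.duplicated(subset=["email"], keep=False)])   # duplicate records sharing a key
print(df["city"].value_counts())                         # rare spellings stand out for manual review
```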


69 of 95

Example Tool: Great Expectations

Supports schema validation and custom instance-level checks.
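A hedged sketch of what such checks look like in Great Expectations' older pandas-style API (newer releases restructure this around validators and expectation suites, so treat the exact calls as an assumption; the data and thresholds are illustrative):

```python
import pandas as pd
import great_expectations as ge

df = ge.from_pandas(pd.DataFrame({
    "product_id": [1, 2, 3],
    "quantity": [5, 0, -2],
    "category": ["food", "toys", "food"],
}))

print(df.expect_column_values_to_not_be_null("product_id"))
print(df.expect_column_values_to_be_between("quantity", min_value=0, max_value=100000))
print(df.expect_column_values_to_be_in_set("category", ["food", "toys", "clothing"]))
```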


70 of 95

Example Tool: Great Expectations


71 of 95

Rule-based detection: Data Quality Rules

Rules can be used to reject data or repair it

Invariants on data that must hold

Typically about relationships of multiple attributes or data sources, e.g.:

  • ZIP code and city name should correspond
  • User ID should refer to existing user
  • SSN should be unique
  • For two people in the same state, the person with the lower income should not have the higher tax rate

Classic integrity constraints in databases or conditional constraints
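A small pandas sketch of checking a few such invariants; the reference data and column names are illustrative:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["John Smith", "Peter Miller", "Jane Doe"],
    "ssn": ["123456", "123456", "654321"],
    "zip": ["15213", "15213", "90210"],
    "city": ["Pittsburgh", "Pittsburgh", "New York"],
    "deptno": [127, 3, 3],
})
departments = pd.DataFrame({"deptno": [1, 2, 3]})
zip_directory = {"15213": "Pittsburgh", "90210": "Beverly Hills"}

print(people[people.duplicated(subset=["ssn"], keep=False)])       # SSN should be unique
print(people[people["zip"].map(zip_directory) != people["city"]])  # ZIP code and city should correspond
print(people[~people["deptno"].isin(departments["deptno"])])       # department must exist (referential integrity)
```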


72 of 95

ML-based for Detecting Inconsistencies


73 of 95

Example: HoloClean

  • User provides rules as integrity constraints (e.g., "two entries with the same name can't have different city")
  • Detect violations of the rules in the data; also detect statistical outliers
  • Automatically generate repair candidates (with probabilities)


74 of 95

Discovery of Data Quality Rules

Rules directly taken from external databases

  • e.g. zip code directory

Given clean data,

  • several algorithms that find functional relationships (𝑋⇒𝑌) among columns
  • algorithms that find conditional relationships (if 𝑍 then 𝑋⇒𝑌)
  • algorithms that find denial constraints (𝑋 and 𝑌 cannot co-occur in a row)

Given mostly clean data (probabilistic view),

  • algorithms to find likely rules (e.g., association rule mining)
  • outlier and anomaly detection

Given labeled dirty data or user feedback,

  • supervised and active learning to learn and revise rules
  • supervised learning to learn repairs (e.g., spell checking)


75 of 95

Discussion: Data Quality Rules?


76 of 95

Dealing with Drift

Why does my model begin to perform poorly over time?

A very particular form of data accuracy problem (the data becomes wrong over time), caused not by human creators but by a changing world. Very prevalent, and it affects the product!


77 of 95

Data changes

System objective changes over time

Software components are upgraded or replaced

Prediction models change

Quality of supplied data changes

User behavior changes

Assumptions about the environment no longer hold

Users can react to model output; or try to game/deceive the model

Examples in inventory system?


78 of 95

Types of Drift


79 of 95

Drift & Model Decay

Concept drift (or concept shift)

  • the properties to be predicted change over time (e.g., what counts as credit card fraud)
  • model has not learned the relevant concepts
  • over time: different expected outputs for same inputs

Data drift (or covariate shift, virtual drift, distribution shift, or population drift)

  • characteristics of the input data change (e.g., customers with face masks)
  • input data differs from training data
  • over time: predictions less confident, further from training data

Upstream data changes

  • external changes in data pipeline (e.g., format changes in weather service)
  • model interprets input data incorrectly
  • over time: abrupt changes due to faulty inputs

How do we fix these drifts?


80 of 95

On Terminology

Concept and data drift are separate concepts

In practice and literature, not always clearly distinguished

Colloquially encompasses all forms of model degradations and environment changes

Define term for target audience


81 of 95

Breakout: Drift in the Inventory System

What kind of drift might be expected?

As a group, tagging members, write plausible examples in #lecture:

  • Concept Drift:
  • Data Drift:
  • Upstream data changes:


82 of 95

Watch for Degradation in Prediction Accuracy


83 of 95

Indicators of Concept Drift

How to detect concept drift in production?


84 of 95

Indicators of Concept Drift

Model degradations observed with telemetry

Telemetry indicates different outputs over time for similar inputs

Differences in influential features and feature importance over time

Relabeling training data changes labels

Interpretable ML models indicate rules that no longer fit

(many papers on this topic, typically on statistical detection)


85 of 95

Indicators of Data Drift

How to detect data drift in production?


86 of 95

Indicators of Data Drift

Model degradations observed with telemetry

Distance between input distribution and training distribution increases

Average confidence of model predictions declines

Relabeling of training data retains stable labels


87 of 95

Detecting Data Drift

  • Compare distributions over time with a statistical test (e.g., t-test; see the sketch below)
  • Detect both sudden jumps and gradual changes
  • Distributions can be manually specified or learned (see invariant detection)
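A minimal sketch of such a distribution comparison, here with a two-sample Kolmogorov-Smirnov test (a common alternative to the t-test for comparing whole distributions; the feature values are made up):

```python
from scipy import stats

# values of one input feature at training time vs. in recent production traffic (illustrative)
train_feature = [3.1, 2.9, 3.4, 3.0, 2.8, 3.2, 3.3, 2.7, 3.1, 3.0]
prod_feature = [4.0, 3.8, 4.2, 3.9, 4.1, 3.7, 4.3, 4.0, 3.9, 4.1]

statistic, p_value = stats.ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"possible data drift (KS statistic = {statistic:.2f}, p = {p_value:.4f})")
```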


88 of 95

Data Distribution Analysis

Plot distributions of features (histograms, density plots, kernel density estimation)

  • Identify which features drift

Define distance function between inputs and identify distance to closest training data (e.g., energy distance, see also kNN)

Anomaly detection and "out of distribution" detection

Compare distribution of output labels
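A sketch of a per-feature distance computation with the energy distance mentioned above (synthetic data; in practice the "training" and "production" samples would come from logs):

```python
import numpy as np
from scipy.stats import energy_distance

rng = np.random.default_rng(0)
training = {"price": rng.normal(10, 2, 1000), "quantity": rng.poisson(3, 1000)}
production = {"price": rng.normal(13, 2, 1000), "quantity": rng.poisson(3, 1000)}

# rank features by how far their production distribution moved away from training
for feature in training:
    dist = energy_distance(training[feature], production[feature])
    print(f"{feature}: energy distance = {dist:.3f}")
```

In this synthetic example only "price" has drifted, so it shows a much larger distance than "quantity".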


89 of 95

Microsoft Azure Data Drift Dashboard


90 of 95

Dealing with Drift

Regularly retrain model on recent data

  • Use evaluation in production to detect decaying model performance

Involve humans when increasing inconsistencies are detected

  • Monitoring thresholds, automation

Monitoring, monitoring, monitoring!


91 of 95

Preview: Inaccurate Data can also be caused by factors other than drift

How do you detect and fix more systemic data quality issues?


92 of 95

Challenge from collection and processing: Bias

The concept of "raw data" might be misleading; data is always some proxy we actively collect to represent the world. Someone decides...

  • What data to collect
  • How to collect them
  • What scale to use

They change what you can do with the data

Recommended Reading: Gitelman, Lisa, Virginia Jackson, Daniel Rosenberg, Travis D. Williams, Kevin R. Brine, Mary Poovey, Matthew Stanley et al. "Data bite man: The work of sustaining a long-term study." In "Raw Data" Is an Oxymoron, (2013), MIT Press: 147-166.


93 of 95

What happens: Data is inaccurate (for what you want to do)

Missing data

Biased data

Systematic errors in data distribution


94 of 95

Summary

Data quality is a system-level concern

  • Data quality at the interface between components
  • Documentation and monitoring often poor
  • Involves organizational structures, incentives, ethics, …

Data from many sources, often inaccurate, imprecise, inconsistent, incomplete, … – many different forms of data quality problems

Many mechanisms for enforcing consistency and cleaning

  • Data schema ensures format consistency
  • Data quality rules ensure invariants across data points

Concept and data drift are key challenges → monitor


95 of 95

Further Readings

  • Schelter, S., Lange, D., Schmidt, P., Celikel, M., Biessmann, F. and Grafberger, A., 2018. Automating large-scale data quality verification. Proceedings of the VLDB Endowment, 11(12), pp.1781-1794.
  • Polyzotis, Neoklis, Martin Zinkevich, Sudip Roy, Eric Breck, and Steven Whang. "Data validation for machine learning." Proceedings of Machine Learning and Systems 1 (2019): 334-347.
  • Polyzotis, Neoklis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2017. “Data Management Challenges in Production Machine Learning.” In Proceedings of the 2017 ACM International Conference on Management of Data, 1723–26. ACM.
  • Theo Rekatsinas, Ihab Ilyas, and Chris Ré, “HoloClean - Weakly Supervised Data Repairing.” Blog, 2017.
  • Ilyas, Ihab F., and Xu Chu. Data cleaning. Morgan & Claypool, 2019.
  • Moreno-Torres, Jose G., Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V. Chawla, and Francisco Herrera. "A unifying view on dataset shift in classification." Pattern recognition 45, no. 1 (2012): 521-530.
  • Vogelsang, Andreas, and Markus Borg. "Requirements Engineering for Machine Learning: Perspectives from Data Scientists." In Proc. of the 6th International Workshop on Artificial Intelligence for Requirements Engineering (AIRE), 2019.
  • Humbatova, Nargiz, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, Andrea Stocco, and Paolo Tonella. "Taxonomy of real faults in deep learning systems." In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 1110-1121. 2020.
