1 of 95

Data Quality

  • Machine Learning in Production

2 of 95

Administrivia

Complete the Team Citizenship Evaluation if you haven’t yet


3 of 95

A/B Experiments: What if...?

  • ... we had plenty of subjects for experiments
  • ... we could randomly assign subjects to treatment/control groups without them knowing
  • ... we could analyze small individual changes and keep everything else constant


4 of 95

Confidence in A/B Experiments

Group A: classic personalized content recommendation model, 2158 users, average 3:13 min time on site

Group B: updated personalized content recommendation model, 10 users, average 3:24 min time on site

What's the problem of comparing the average?


5 of 95

Analyzing Results: Stats 101

  • All data follow some distribution.
  • Normal distributions are very common (e.g., most students score near the average, not near the maximum).
  • When we draw samples from the same distribution, we can get different samples just by chance.
  • Statistics tells us whether “interesting” patterns we observe in our samples actually exist in the population or are just sampling noise, and gives us numbers to express that uncertainty.
  • Confidence interval: given the average of a sample (e.g., the average on-site time of 10 users), what range plausibly contains the average of the whole population (all users)? (See the sketch below.)
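As an illustration of the confidence-interval idea above, here is a minimal sketch (with made-up time-on-site values, assuming SciPy is available) that computes a 95% confidence interval for the mean:

```python
import numpy as np
from scipy import stats

# made-up on-site times (in seconds) for 10 users in one experiment group
times = np.array([180, 200, 215, 190, 230, 205, 185, 240, 210, 195])

mean = times.mean()
sem = stats.sem(times)  # standard error of the mean
low, high = stats.t.interval(0.95, df=len(times) - 1, loc=mean, scale=sem)
print(f"sample mean = {mean:.1f}s, 95% CI = ({low:.1f}s, {high:.1f}s)")
```

With only 10 users the interval is wide; with thousands of users it shrinks, which is exactly why the group sizes on the previous slide matter.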


6 of 95

Analyzing Results: Stats 101

Significance testing also helps with comparisons. When the confidence intervals largely overlap, we cannot conclude that there is an actual difference between the groups.

This is quantified by p-value:

  • Misconception: “what is the probability a difference between two samples is by chance?” (lower = higher likelihood of a sig. difference)
  • Reality: the p-value starts with the assumption that there's no real difference (null hypothesis), then asks “how likely would our observed data be in that scenario?”


7 of 95

Stats 101: How to compute p-value?

Parametric tests: Assume the compared groups are normally distributed and have equal variances.

  • Tests: t-test, ANOVA, & linear regression.
  • For: model accuracy, human click streams
  • More sensitive and powerful when the requirement is met.

Non-parametric tests: Does not assume normal distribution.

  • For: ordinal or categorical data like Likert-scale ratings
  • Tests: e.g., the Wilcoxon signed-rank test compares users’ ordinal Likert-scale ratings


8 of 95

t-test

We will ask for a statistical test in M3; many libraries implement these tests! (See the sketch below.)
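A minimal sketch of what such a library call looks like, assuming SciPy and made-up time-on-site numbers (not the course's reference solution for M3):

```python
from scipy import stats

# time on site (seconds) for two experiment groups (illustrative numbers)
group_a = [193, 205, 188, 210, 199, 201, 185, 220, 195, 208]
group_b = [204, 215, 198, 225, 210, 212, 196, 230, 205, 218]

t_stat, p_parametric = stats.ttest_ind(group_a, group_b)    # parametric t-test
u_stat, p_nonparam = stats.mannwhitneyu(group_a, group_b)   # non-parametric alternative for unpaired groups
print(f"t-test p = {p_parametric:.3f}, Mann-Whitney U p = {p_nonparam:.3f}")
```

(For paired ordinal data, such as the same users rating two versions, scipy.stats.wilcoxon would be the non-parametric choice.)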


9 of 95

Decision tree of tests

Many, many other factors matter, e.g., dependent vs. independent measures


10 of 95

How many samples needed?

Too few?

Noise and random results!

Too many?

Risk of spreading bad designs!
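One common way to answer this question is an a priori power analysis. A hedged sketch using statsmodels; the effect size, power, and alpha values are illustrative assumptions:

```python
from statsmodels.stats.power import TTestIndPower

# How many users per group do we need to detect a "small" effect (Cohen's d = 0.2)
# with 80% power at a 5% significance level?
n_per_group = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(f"about {n_per_group:.0f} users per group")
```

Smaller expected effects require much larger groups; larger effects can be detected with fewer users, limiting how many users are exposed to a potentially bad design.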


11 of 95

Concurrent A/B testing

Multiple experiments at the same time

  • Independent experiments on different populations – interactions not explored
  • Multi-factorial designs, well understood but typically too complex, e.g., not all combinations valid or interesting
  • Grouping in sets of experiments (layers)


12 of 95

Other Experiments in Production

Chaos experiments

Shadow releases / traffic teeing

Canary releases


13 of 95

Chaos Experiments

Deliberate introduction of faults in production to test robustness.


14 of 95

Chaos Experiments for ML Components?


15 of 95

Shadow releases / traffic teeing

Run both models in parallel

Use predictions of old model in production

Compare differences between model predictions

If possible, compare against ground truth labels/telemetry
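A minimal sketch of the idea; the model functions and feature names are hypothetical stand-ins, not an actual serving framework:

```python
import logging

def old_model(features):
    # current production model (stub)
    return round(1.2 * features["recent_sales"])

def new_model(features):
    # candidate model running in shadow mode (stub)
    return round(1.1 * features["recent_sales"] + 0.5 * features["seasonality"])

def predict_with_shadow(features):
    production_prediction = old_model(features)
    shadow_prediction = new_model(features)            # computed, but never shown to users
    if shadow_prediction != production_prediction:     # log disagreements for offline analysis
        logging.info("shadow disagreement: prod=%s shadow=%s features=%s",
                     production_prediction, shadow_prediction, features)
    return production_prediction                       # only the old model affects behavior

print(predict_with_shadow({"recent_sales": 10, "seasonality": 4}))
```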

Examples?


16 of 95

Canary Releases

Release new version to small percentage of population (like A/B testing)

Automatically roll back if quality measures degrade

Automatically and incrementally increase deployment to 100% otherwise
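A rough sketch of the control loop behind a canary release; all functions, steps, and thresholds here are hypothetical, and real deployments would use load balancers, feature flags, or a service mesh for the routing step:

```python
import time

def set_traffic_share(percent):
    """Stub: reconfigure routing so `percent` of requests hit the new model."""
    print(f"routing {percent}% of traffic to the new model")

def quality_ok():
    """Stub: compare telemetry of canary vs. baseline (error rate, latency, proxy accuracy)."""
    return True

def canary_rollout(steps=(1, 5, 25, 50, 100), wait_minutes=30):
    for percent in steps:
        set_traffic_share(percent)
        time.sleep(wait_minutes * 60)   # let telemetry accumulate at this traffic share
        if not quality_ok():            # degradation detected -> automatic rollback
            set_traffic_share(0)
            return "rolled back"
    return "fully deployed"
```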


17 of 95

Advice for Experimenting in Production

Minimize blast radius (canary, A/B, chaos expr)

Automate experiments and deployments

Allow for quick rollback of poor models (continuous delivery, containers, load balancers, versioning)

Make decisions with confidence, compare distributions

Monitor, monitor, monitor


18 of 95

More Quality Assurance...Data Quality


19 of 95

Readings

Required reading:

Recommended reading:

  • Schelter, S., Lange, D., Schmidt, P., Celikel, M., Biessmann, F., and Grafberger, A. (2018). Automating large-scale data quality verification. Proceedings of the VLDB Endowment, 11(12), pp. 1781-1794.


20 of 95

Learning Goals

  • Consider data quality as part of a system; design an organization that values data quality
  • Distinguish precision and accuracy; understand the tradeoff between better models and more data
  • Use schema languages to enforce data schemas
  • Design and implement automated quality assurance steps that check data schema conformance and distributions
  • Devise infrastructure for detecting data drift and schema violations


21 of 95

Poor Data Quality has Consequences

(often delayed, hard-to-fix consequences)


22 of 95

Garbage in → Garbage Out

Example: systematic bias in training.

Poor data quality leads to poor models

Often not detectable in offline evaluation

Causes problems in production - now difficult to correct


23 of 95

Data Quality is a System-Wide Concern


24 of 95

Data Cascades

"Compounding events causing negative, downstream effects from data issues, that result in technical debt over time."


25 of 95

Common Data Cascades

Physical world brittleness

  • Idealized data, ignoring realities and change of real-world data
  • Static data, one time learning mindset, no planning for evolution

Inadequate domain expertise

  • Not understanding data and its context
  • Involving experts only late for troubleshooting

Conflicting reward systems

  • Missing incentives for data quality
  • Not recognizing the importance of data quality, discarded as a technicality
  • Missing data literacy with partners

Poor (cross-org.) documentation

  • Conflicts at team/organization boundary
  • Undetected drift

Sambasivan, N., et al. (2021). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proc. Conference on Human Factors in Computing Systems.


26 of 95

Interacting with physical world brittleness

Brittle deployments interacting with not-digitised physical worlds

  • Time to manifest: 2-3 years to emerge, almost always in the production stage
  • Impact: Complete model failure, abandonment of projects, harms to beneficiaries from mispredictions
  • Triggers: drifts (defined more formally later!)
    • Hardware drifts (rain, wind, fingerprints, shadows)
    • Environmental drifts (lighting, temperature, humidity)
    • Social drifts (new regulations, new user behaviors)
  • Address: Monitor data source, retrain models, introduce noise in data

e.g., an AI model for the COVID-19 pandemic on day 1 versus day 100 required completely different assumptions, since the pandemic and human responses were volatile and dynamic


27 of 95

Inadequate application-domain expertise

AI practitioners are responsible for data sense-making in contexts in which they do not have domain expertise.

  • Time to manifest: After building models through client feedback & system performance
  • Impact: Costly modification (improve labels, collect more data), Unanticipated downstream impacts
  • Triggers:
    • Subjectivity in ground truths (e.g., decision history on insurance companies’ claims)
    • Poor application-domain expertise in finding representative data (e.g., cannot take 90% of the data from one hospital and generalize for the entire world!)
  • Address: Faithfully document data sources, involve domain experts in data collection


28 of 95

Conflicting Reward Systems

Misaligned incentives and priorities between practitioners, domain experts, and field partners.

  • Time to manifest: Model deployment
  • Impact: Costly iterations, moving to an alternate data source, quitting the project
  • Triggers: Need annotation but...
    • Inserted as extraneous work
    • Not compensated well
    • Competing priority with partners’ primary responsibility
  • Address: Provide incentives & training

e.g., when a clinician spends a lot of time punching in data, not paying attention to the patient, that has a human cost


29 of 95

Poor Cross-organizational Documentation

Lack of documentation across cross-organizational relationships, leading to a poor understanding of metadata

  • Time to manifest: during manual reviews or by “chance”
  • Impact: Wasted time and effort from using incorrect data, being blocked on building models, and discarding subsets or entire datasets
  • Triggers:
    • Inherited datasets lacking critical details
    • Field partners not being aware of constraints in achieving good-quality AI
  • Address: Create a data curation plan in advance and take ample field notes in order to create reproducible assets for data

e.g., a lack of metadata and collaborators changing the schema without understanding its context led to the loss of four months of precious medical robotics data collection


30 of 95

Case Study: Inventory Management

Goal: Train an ML model to predict future sales; make decisions about what to (re)stock/when/how many...


31 of 95

Discussion: Possible Data Cascades?

  • Interacting with physical world brittleness
  • Inadequate domain expertise
  • Conflicting reward systems
  • Poor (cross-organizational) documentation


32 of 95

Data Documentation

Let's use data documentation as an entry point to discuss what aspects of data we care about.


33 of 95

Data Quality is a System-Wide Concern

Data flows across components, e.g., from the user interface into a database, to a crowd-sourced labeling team, and into the ML pipeline

Humans interacting with the system

  • Entering data, labeling data
  • Observed with sensors/telemetry
  • Incentives, power structures, recognition

Organizational practices

  • Value, attention, and resources given to data quality

Documentation at the interfaces is important


34 of 95

Data Quality Documentation

Teams rarely document expectations of data quantity or quality.

Data quality tests are rare, but some teams adopt defensive monitoring.

  • Local tests about assumed structure and distribution of data
  • Identify drift early and reach out to producing teams

Several ideas for documenting distributions, including Datasheets and Dataset Nutrition Label

  • Mostly focused on static datasets, describing origin, considerations, labeling procedures, and distributions


35 of 95

Data Card

Data Cards are for fostering transparent, purposeful and human-centered documentation of datasets within the practical contexts of industry and research. They are structured summaries of essential facts about various aspects of ML datasets…provide explanations of processes and rationales that shape the data and consequently the models.


36 of 95

Entries in data card

A very good reference for data quality, but far too difficult to fill out for every dataset, so usually ignored...


37 of 95

Data Cards give you an idea of what might impact data quality

We will touch on some:

  • Data quality, in terms of noise (e.g. from upstream data source, from human labelers)
  • Data drifts (static dataset and its source, vs. the world now)
  • Data quality, in terms of distributions (disagreements between annotators, biases toward certain distributions, annotation incentives that favor easy labels, etc.)
  • Data curation (shape the data towards what you want)


38 of 95

Understand and improve data quality

Assuming you don't have the best-documented and cleanest data in the world (typical!), how do you evaluate your data and how do you clean it?


39 of 95

Data cleaning and repairing account for about 60% of the work of data scientists.

"Everyone wants to do the model work, not the data work"

Own experience?


40 of 95

Accuracy vs. Precision

Accuracy: Reported values (on average) represent real value

Precision: Repeated measurements yield the same result

Accurate, but imprecise: Q. How to deal with this issue?

Inaccurate, but precise: ?

(CC-BY-4.0 by Arbeck)


41 of 95

Data Accuracy and Precision: Impact on ML

More data → better models (up to a point, diminishing effects)

Noisy data (imprecise) → less confident models, more data needed

  • some ML techniques are more or less robust to noise (more on robustness in a later lecture)

Inaccurate data → misleading models, biased models

  • Need the “right” data

Invest in data quality, not just quantity


42 of 95

Dealing with noisy data

Where does noise come from and how do we fix it?


43 of 95

What do we mean by clean data?

Accuracy: The data was recorded correctly.

Completeness: All relevant data was recorded.

Uniqueness: The entries are recorded once.

Consistency: The data agrees with itself.

Timeliness: The data is kept up to date.


44 of 95

Challenge from collection: Data comes from many sources

e.g. For the inventory system:


45 of 95

Challenge from collection: Data comes from many sources

  • Manually entered
  • Generated through actions in IT systems
  • Logging information, traces of user interactions
  • Sensor data
  • Crowdsourced

These sources have different levels of reliability and quality


46 of 95

What happens: Data is noisy

Wrong results and computations, crashes

Duplicate data, near-duplicate data

Out of order data

Data format invalid


47 of 95

Two levels of data precision

  1. Data Integrity / Schema: Ensuring basic consistency about shape and types
  2. Wrong and inconsistent data: Application- and domain-specific data issues


48 of 95

Data Integrity / Schema

Ensuring basic consistency about shape and types


49 of 95

Schema in Relational Databases


50 of 95

Data Schema

Define the expected format of data

  • expected fields and their types
  • expected ranges for values
  • constraints among values (within and across sources)

Data can be automatically checked against schema

Protects against change; explicit interface between components
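A minimal, hand-rolled sketch of such a check for inventory-style records; the field names and ranges are illustrative, and in practice a schema library would do this:

```python
# expected fields, their types, and (optional) valid value ranges
SCHEMA = {
    "product_id": (int, (1, None)),
    "quantity": (int, (0, 100_000)),
    "unit_price": (float, (0.0, None)),
    "category": (str, None),
}

def schema_violations(record):
    problems = []
    for field, (expected_type, value_range) in SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}, got {type(value).__name__}")
            continue
        if value_range is not None:
            low, high = value_range
            if (low is not None and value < low) or (high is not None and value > high):
                problems.append(f"{field}: value {value} out of range {value_range}")
    return problems

# flags the negative quantity and the missing category field
print(schema_violations({"product_id": 42, "quantity": -3, "unit_price": 1.99}))
```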


51 of 95

Data Schema Constraints for Inventory System?


52 of 95

Schema Problems: Uniqueness, data format, integrity, …

  • Illegal attribute values: bdate=30.13.70
  • Violated attribute dependencies: age=22, bdate=12.02.70
  • Uniqueness violation: (name=“John Smith”, SSN=“123456”), (name=“Peter Miller”, SSN=“123456”)
  • Referential integrity violation: emp=(name=“John Smith”, deptno=127) if department 127 not defined


53 of 95

Dirty Data: Example

Problems with this data? Which Problems are Schema Problems?


54 of 95

What Happens When New Data Violates Schema?


55 of 95

Modern Databases: Schema-Less

Also vector databases, schema-aware databases that basically store long text in each cell, etc.

Image source: https://www.kdnuggets.com/2021/05/nosql-know-it-all-compendium.html


56 of 95

Schema-Less Data Exchange

  • CSV files
  • Key-value stores (JSON, XML, NoSQL databases)
  • Message brokers
  • REST API calls
  • R/Pandas Dataframes


57 of 95

Schema-Less Data Exchange

Q. Benefits? Drawbacks?


58 of 95

Schema Library: Apache Avro


59 of 95

Schema Library: Apache Avro

Schema specification in JSON format

Serialization and deserialization with automated checking

Native support in Kafka

Benefits

  • Serialization in space efficient format
  • APIs for most languages (ORM-like)
  • Versioning constraints on schemas

Drawbacks

  • Reading/writing overhead
  • Binary data format, extra tools needed for reading
  • Requires external schema and maintenance
  • Learning overhead
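As a rough sketch of the JSON schema specification mentioned above, here is a small Avro schema for a hypothetical sale record, used through the fastavro library (one of several Python options; the field names and values are illustrative assumptions):

```python
import io
import fastavro

schema = {
    "type": "record",
    "name": "Sale",
    "fields": [
        {"name": "product_id", "type": "int"},
        {"name": "quantity", "type": "int"},
        {"name": "timestamp", "type": "long"},
        {"name": "store", "type": ["null", "string"], "default": None},
    ],
}
parsed = fastavro.parse_schema(schema)

records = [{"product_id": 42, "quantity": 3, "timestamp": 1735689600, "store": "Pittsburgh"}]
buffer = io.BytesIO()
fastavro.writer(buffer, parsed, records)   # serialization fails if a record violates the schema
buffer.seek(0)
print(list(fastavro.reader(buffer)))
```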


60 of 95

Many Schema Libraries/Formats

Examples

  • Avro
  • XML Schema
  • Protobuf
  • Thrift
  • Parquet
  • ORC


61 of 95

Schemas are not just for databases, but also for data transmission

https://openai.com/index/introducing-structured-outputs-in-the-api/


62 of 95

Summary: Schema

Basic structure and type definition of data

Well supported in databases and many tools

Very low bar for data quality


63 of 95

Wrong and Inconsistent Data

Application- and domain-specific data issues


64 of 95

Dirty Data: Example

Problems with the data beyond schema problems?


65 of 95

Wrong and Inconsistent Data

  • Missing values: phone=999-9999999
  • Misspellings: city=Pittsburg
  • Misfielded values: city=USA
  • Duplicate records: name=John Smith, name=J. Smith
  • Wrong reference: emp=(name=“John Smith”, deptno=127) if department 127 defined but wrong

Q. How can we detect and fix these problems?


66 of 95

Discussion: Wrong and Inconsistent Data?


67 of 95

Data Cleaning Overview

Data analysis / Error detection

  • Usually focused on specific kind of problems, e.g., duplication, typos, missing values, distribution shift
  • Detection in input data vs detection in later stages (more context)

Error repair

  • Repair data vs repair rules, one at a time or holistic
  • Data transformation or mapping
  • Automated vs human guided


68 of 95

Error Detection Examples

Illegal values: min, max, variance, deviations, cardinality

Misspelling: sorting + manual inspection, dictionary lookup

Missing values: null values, default values

Duplication: sorting, edit distance, normalization
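A few of these detection heuristics in a short pandas sketch; the small example table is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Pittsburgh", "Pittsburg", "New York", "Pittsburgh", None],
    "age": [34, 29, 41, 34, 220],
    "email": ["a@x.com", "b@x.com", "c@x.com", "a@x.com", "d@x.com"],
})

print(df.isna().sum())                                   # missing values per column
print(df[(df["age"] < 0) | (df["age"] > 120)])           # illegal values outside a plausible range
print(df[df.duplicated(subset=["email"], keep=False)])   # duplicate records sharing a key
print(df["city"].value_counts())                         # rare spellings stand out for manual review
```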


69 of 95

Example Tool: Great Expectations

Supports schema validation and custom instance-level checks.
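A hedged sketch of what such checks look like in Great Expectations' older pandas-style API (newer releases restructure this around validators and expectation suites, so treat the exact calls as an assumption; the data and thresholds are illustrative):

```python
import pandas as pd
import great_expectations as ge

df = ge.from_pandas(pd.DataFrame({
    "product_id": [1, 2, 3],
    "quantity": [5, 0, -2],
    "category": ["food", "toys", "food"],
}))

print(df.expect_column_values_to_not_be_null("product_id"))
print(df.expect_column_values_to_be_between("quantity", min_value=0, max_value=100000))
print(df.expect_column_values_to_be_in_set("category", ["food", "toys", "clothing"]))
```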


70 of 95

Example Tool: Great Expectations


71 of 95

Rule-based detection: Data Quality Rules

Rules can be used to reject data or repair it

Invariants on data that must hold

Typically about relationships of multiple attributes or data sources, e.g.:

  • ZIP code and city name should correspond
  • User ID should refer to existing user
  • SSN should be unique
  • For two people in the same state, the person with the lower income should not have the higher tax rate

Classic integrity constraints in databases or conditional constraints
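A small pandas sketch of checking a few such invariants; the reference data and column names are illustrative:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["John Smith", "Peter Miller", "Jane Doe"],
    "ssn": ["123456", "123456", "654321"],
    "zip": ["15213", "15213", "90210"],
    "city": ["Pittsburgh", "Pittsburgh", "New York"],
    "deptno": [127, 3, 3],
})
departments = pd.DataFrame({"deptno": [1, 2, 3]})
zip_directory = {"15213": "Pittsburgh", "90210": "Beverly Hills"}

print(people[people.duplicated(subset=["ssn"], keep=False)])       # SSN should be unique
print(people[people["zip"].map(zip_directory) != people["city"]])  # ZIP code and city should correspond
print(people[~people["deptno"].isin(departments["deptno"])])       # department must exist (referential integrity)
```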


72 of 95

ML-based for Detecting Inconsistencies


73 of 95

Example: HoloClean

  • User provides rules as integrity constraints (e.g., "two entries with the same name can't have different city")
  • Detect violations of the rules in the data; also detect statistical outliers
  • Automatically generate repair candidates (with probabilities)


74 of 95

Discovery of Data Quality Rules

Rules directly taken from external databases

  • e.g. zip code directory

Given clean data,

  • several algorithms that find functional relationships (𝑋⇒𝑌) among columns
  • algorithms that find conditional relationships (if 𝑍 then 𝑋⇒𝑌)
  • algorithms that find denial constraints (𝑋 and 𝑌 cannot co-occur in a row)

Given mostly clean data (probabilistic view),

  • algorithms to find likely rules (e.g., association rule mining)
  • outlier and anomaly detection

Given labeled dirty data or user feedback,

  • supervised and active learning to learn and revise rules
  • supervised learning to learn repairs (e.g., spell checking)


75 of 95

Discussion: Data Quality Rules?


76 of 95

Dealing with Drift

Why does my model begin to perform poorly over time?

A very particular form of data accuracy problem (the data becomes wrong over time), caused not by human creators but by a changing world. Very prevalent, and it affects the product!


77 of 95

Data changes

System objective changes over time

Software components are upgraded or replaced

Prediction models change

Quality of supplied data changes

User behavior changes

Assumptions about the environment no longer hold

Users can react to model output; or try to game/deceive the model

Examples in inventory system?


78 of 95

Types of Drift


79 of 95

Drift & Model Decay

Concept drift (or concept shift)

  • the properties to be predicted change over time (e.g., what counts as credit card fraud)
  • model has not learned the relevant concepts
  • over time: different expected outputs for same inputs

Data drift (or covariate shift, virtual drift, distribution shift, or population drift)

  • characteristics of the input data change (e.g., customers with face masks)
  • input data differs from training data
  • over time: predictions less confident, further from training data

Upstream data changes

  • external changes in data pipeline (e.g., format changes in weather service)
  • model interprets input data incorrectly
  • over time: abrupt changes due to faulty inputs

How do we fix these drifts?


80 of 95

On Terminology

Concept and data drift are separate concepts

In practice and literature, not always clearly distinguished

Colloquially encompasses all forms of model degradations and environment changes

Define term for target audience


81 of 95

Breakout: Drift in the Inventory System

What kind of drift might be expected?

As a group, tagging members, write plausible examples in #lecture:

  • Concept Drift:
  • Data Drift:
  • Upstream data changes:


82 of 95

Watch for Degradation in Prediction Accuracy


83 of 95

Indicators of Concept Drift

How to detect concept drift in production?


84 of 95

Indicators of Concept Drift

Model degradations observed with telemetry

Telemetry indicates different outputs over time for similar inputs

Differences in influential features and feature importance over time

Relabeling training data changes labels

Interpretable ML models indicate rules that no longer fit

(many papers on this topic, typically on statistical detection)


85 of 95

Indicators of Data Drift

How to detect data drift in production?


86 of 95

Indicators of Data Drift

Model degradations observed with telemetry

Distance between input distribution and training distribution increases

Average confidence of model predictions declines

Relabeling of training data retains stable labels


87 of 95

Detecting Data Drift

  • Compare distributions over time with a statistical test (e.g., t-test; see the sketch below)
  • Detect both sudden jumps and gradual changes
  • Distributions can be manually specified or learned (see invariant detection)
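A minimal sketch of such a distribution comparison, here with a two-sample Kolmogorov-Smirnov test (a common alternative to the t-test for comparing whole distributions; the feature values are made up):

```python
from scipy import stats

# values of one input feature at training time vs. in recent production traffic (illustrative)
train_feature = [3.1, 2.9, 3.4, 3.0, 2.8, 3.2, 3.3, 2.7, 3.1, 3.0]
prod_feature = [4.0, 3.8, 4.2, 3.9, 4.1, 3.7, 4.3, 4.0, 3.9, 4.1]

statistic, p_value = stats.ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"possible data drift (KS statistic = {statistic:.2f}, p = {p_value:.4f})")
```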


88 of 95

Data Distribution Analysis

Plot distributions of features (histograms, density plots, kernel density estimation)

  • Identify which features drift

Define distance function between inputs and identify distance to closest training data (e.g., energy distance, see also kNN)

Anomaly detection and "out of distribution" detection

Compare distribution of output labels
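A sketch of a per-feature distance computation with the energy distance mentioned above (synthetic data; in practice the "training" and "production" samples would come from logs):

```python
import numpy as np
from scipy.stats import energy_distance

rng = np.random.default_rng(0)
training = {"price": rng.normal(10, 2, 1000), "quantity": rng.poisson(3, 1000)}
production = {"price": rng.normal(13, 2, 1000), "quantity": rng.poisson(3, 1000)}

# rank features by how far their production distribution moved away from training
for feature in training:
    dist = energy_distance(training[feature], production[feature])
    print(f"{feature}: energy distance = {dist:.3f}")
```

In this synthetic example only "price" has drifted, so it shows a much larger distance than "quantity".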


89 of 95

Microsoft Azure Data Drift Dashboard


90 of 95

Dealing with Drift

Regularly retrain model on recent data

  • Use evaluation in production to detect decaying model performance

Involve humans when increasing inconsistencies are detected

  • Monitoring thresholds, automation

Monitoring, monitoring, monitoring!


91 of 95

Preview: Inaccurate Data can also be caused by factors other than drift

How do you detect and fix more systemic data quality issues?


92 of 95

Challenge from collection and processing: Bias

The concept of "raw data" might be misleading; data is always some proxy we actively collect to represent the world. Someone decides...

  • What data to collect
  • How to collect them
  • What scale to use

They change what you can do with the data

Recommended Reading: Gitelman, Lisa, Virginia Jackson, Daniel Rosenberg, Travis D. Williams, Kevin R. Brine, Mary Poovey, Matthew Stanley et al. "Data bite man: The work of sustaining a long-term study." In "Raw Data" Is an Oxymoron, (2013), MIT Press: 147-166.


93 of 95

What happens: Data is inaccurate (for what you want to do)

Missing data

Biased data

Systematic errors in data distribution


94 of 95

Summary

Data quality is a system-level concern

  • Data quality at the interface between components
  • Documentation and monitoring often poor
  • Involves organizational structures, incentives, ethics, …

Data from many sources, often inaccurate, imprecise, inconsistent, incomplete, … – many different forms of data quality problems

Many mechanisms for enforcing consistency and cleaning

  • Data schema ensures format consistency
  • Data quality rules ensure invariants across data points

Concept and data drift are key challenges → monitor


95 of 95

Further Readings

  • Schelter, S., Lange, D., Schmidt, P., Celikel, M., Biessmann, F. and Grafberger, A., 2018. Automating large-scale data quality verification. Proceedings of the VLDB Endowment, 11(12), pp.1781-1794.
  • Polyzotis, Neoklis, Martin Zinkevich, Sudip Roy, Eric Breck, and Steven Whang. "Data validation for machine learning." Proceedings of Machine Learning and Systems 1 (2019): 334-347.
  • Polyzotis, Neoklis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2017. “Data Management Challenges in Production Machine Learning.” In Proceedings of the 2017 ACM International Conference on Management of Data, 1723–26. ACM.
  • Theo Rekatsinas, Ihab Ilyas, and Chris Ré, “HoloClean - Weakly Supervised Data Repairing.” Blog, 2017.
  • Ilyas, Ihab F., and Xu Chu. Data cleaning. Morgan & Claypool, 2019.
  • Moreno-Torres, Jose G., Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V. Chawla, and Francisco Herrera. "A unifying view on dataset shift in classification." Pattern recognition 45, no. 1 (2012): 521-530.
  • Vogelsang, Andreas, and Markus Borg. "Requirements Engineering for Machine Learning: Perspectives from Data Scientists." In Proc. of the 6th International Workshop on Artificial Intelligence for Requirements Engineering (AIRE), 2019.
  • Humbatova, Nargiz, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, Andrea Stocco, and Paolo Tonella. "Taxonomy of real faults in deep learning systems." In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 1110-1121. 2020.
