1 of 104

Machine Learning Systems Design

Lecture 10: Data Distribution Shifts & Monitoring

CS 329S (Chip Huyen, 2022) | cs329s.stanford.edu

2 of 104

Zoom etiquette

We appreciate it if you keep videos on!

  • More visual feedback for us to adjust materials
  • Better learning environment
  • Better sense of who you’re with in class!

2

3 of 104

Agenda

  1. Natural labels & feedback loops
  2. Causes of ML failures
  3. Breakout exercise
  4. Data distribution shifts
  5. Monitoring & observability

3

Lecture notes are on the course website / syllabus

4 of 104

Natural labels & feedback loops

4

5 of 104

Natural labels

  • The model’s predictions can be automatically evaluated or partially evaluated by the system.
  • Examples:
    • ETA
    • Ride demand prediction
    • Stock price prediction
    • Ads CTR
    • Recommender system

5

6 of 104

Natural labels

  • You can engineer a task to have natural labels

6

7 of 104

Natural labels: surprisingly common

⚠ Biases ⚠

  • Small sample size
  • Companies might only use ML for tasks with natural labels

7

8 of 104

Delayed labels

Time

Prediction is served

Feedback is provided

Feedback loop length

8

9 of 104

Delayed labels

  • Short feedback loop: minutes -> hours
    • Reddit / Twitter / TikTok’s recommender systems
  • Long feedback loop: weeks -> months
    • Stitch Fix’s recommender systems
    • Fraud detection

Time

Prediction is served

Feedback is provided

Feedback loop length

9

10 of 104

10

11 of 104

⚠ Labels are often assumed ⚠

  • Recommendation:
    • Click -> good rec
    • After X minutes, no click -> bad rec

Speed vs. accuracy tradeoff for the window length X:

  • Too short -> false negatives
  • Too long -> slow feedback

11

12 of 104

⚠ Labels are often assumed ⚠

  • Recommendation:
    • Click -> good rec
    • After X minutes, no click -> bad rec
  • Window length X:
    • Too short -> false negatives
    • Too long -> slow feedback

12

13 of 104

Causes of ML failures

13

14 of 104

14

15 of 104

“Guests complained their robot room assistants thought snoring sounds were commands and would wake them up repeatedly during the night.”

15

16 of 104

What is an ML failure?

A failure happens when one or more expectations of the system are violated.

Two types of expectations:

  • Operational metrics: e.g. average latency, throughput, uptime
  • ML metrics: e.g. accuracy, F1, BLEU score

16

17 of 104

What is an ML failure?

A failure happens when one or more expectations of the system are violated.

  • Traditional software: mostly operational metrics
  • ML systems: operational + ML metrics
    • Ops: returns an English translation within 100ms latency on average
    • ML: BLEU score of 55 (out of 100)

17

18 of 104

ML system failures

  • If you enter a sentence and get no translation back -> ops failure
  • If one translation is incorrect -> ML failure?

18

19 of 104

ML system failures

  • If you enter a sentence and get no translation back -> ops failure
  • If one translation is incorrect -> ML failure?
  • Not necessarily: expected BLEU score < 100
  • ML failure if translations are consistently incorrect

19

20 of 104

20

Ops failures: visible
  • 404, timeout, segfault, OOM, etc.

ML failures: often invisible

21 of 104

Causes of ops failures (software system failures)

  • Dependency failures
  • Deployment failures
  • Hardware failures
  • Network failure: downtime / crash

21

22 of 104

Causes of ops failures (software system failures)

  • Dependency failures
  • Deployment failures
  • Hardware failures
  • Network failure: downtime / crash

As tooling & best practices around ML production mature, there will be less surface area for software failures

22

23 of 104

ML-specific failures (during/post deployment)

  1. Production data differing from training data
  2. Edge cases
  3. Degenerate feedback loops

We’ve already covered problems pre-deployment in previous lectures!

23

24 of 104

Production data differing from training data

  • Train-serving skew:
    • Model performing well during development but poorly in production
  • Data distribution shifts
    • Model performing well when first deployed, but poorly over time
    • ⚠ What looks like data shifts might be caused by human errors ⚠

24

25 of 104

Production data differing from training data

  • Train-serving skew:
    • Model performing well during development but poorly in production
  • Data distribution shifts
    • Model performing well when first deployed, but poorly over time
    • ⚠ What looks like data shifts might be caused by human errors ⚠

Common & crucial. Will go into detail!

25

26 of 104

Edge cases

  • Self-driving car (yearly)
    • Safely: 99.99%
    • Fatal accidents: 0.01%

Zoom poll: Would you use this car?

26

27 of 104

Edge case vs. outlier

  • Outliers
    • Refer to inputs
    • Options to ignore/remove
  • Edge cases
    • Refer to outputs
    • Can’t ignore/remove

27

28 of 104

Degenerate feedback loops

  • When predictions influence the feedback, which is then used to extract labels to train the next iteration of the model
  • Common in tasks with natural labels

Predictions

Users’ feedback

Training data

28

29 of 104

Degenerate feedback loops: recsys

  • Originally, A is ranked marginally higher than B -> model recommends A
  • After a while, A is ranked much higher than B

Model recommends item A

User clicks on A

Model confirms A is good

29

30 of 104

Degenerate feedback loops: recsys

  • Originally, A is ranked marginally higher than B -> model recommends A
  • After a while, A is ranked much higher than B

Model recommends item A

User clicks on A

Model confirms A is good

Over time, recommendations become more homogenous

30

31 of 104

Degenerate feedback loops: resume screening

  • Originally, model thinks X is a good feature
  • Model only picks resumes with X
  • Hiring managers only see resumes with X, so only people with X are hired
  • Model confirms that X is good

Replace X with:

  • Has a name that is typically used for males
  • Went to Stanford
  • Worked at Google

31

32 of 104

Degenerate feedback loops: resume screening

  • Originally, model thinks X is a good feature
  • Model only picks resumes with X
  • Hiring managers only see resumes with X, so only people with X are hired
  • Model confirms that X is good

Tracking feature importance might help!

32

33 of 104

Detecting degenerate feedback loops

Only arise once models are in production -> hard to detect during training

Predictions

Users’ feedback

Training data

33

34 of 104

Degenerate feedback loops: detect

  • Average Rec Popularity (ARP)
    • Average popularity of the recommended items
  • Average Percentage of Long Tail Items (APLT)
    • average % of long tail items being recommended
  • Hit rate against popularity
    • Accuracy based on recommended items’ popularity buckets
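
A minimal sketch of the ARP and APLT metrics above, assuming hypothetical inputs: `recommendations` (one list of recommended item IDs per request), `item_popularity` (item ID -> historical interaction count), and `long_tail_items` (the set of items outside the popular head):

```python
import numpy as np

def average_rec_popularity(recommendations, item_popularity):
    """ARP: average popularity of all recommended items."""
    pops = [item_popularity.get(item, 0)
            for recs in recommendations for item in recs]
    return float(np.mean(pops))

def average_pct_long_tail(recommendations, long_tail_items):
    """APLT: average % of long-tail items per recommendation list."""
    pcts = [np.mean([item in long_tail_items for item in recs])
            for recs in recommendations]
    return float(np.mean(pcts))
```

Tracked over time, a rising ARP (or a falling APLT) suggests recommendations are drifting toward already-popular items.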

34

35 of 104

Degenerate feedback loops: mitigate

  1. Randomization
  2. Positional features

35

36 of 104

Randomization

  • Degenerate feedback loops increase output homogeneity
  • Combat homogeneity by introducing randomness in predictions

36

37 of 104

Randomization

  • Degenerate feedback loops increase output homogeneity
  • Combat homogeneity by introducing randomness in predictions
  • Recsys: show users random items & use feedback to determine items’ quality
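
One simple way to introduce that randomness is an epsilon-greedy step on top of the model’s ranking; a minimal sketch (`candidates`, `score_fn`, and the 10% exploration rate are hypothetical):

```python
import random

def recommend(candidates, score_fn, k=10, epsilon=0.1):
    """Rank candidates by model score, then with probability epsilon per slot
    swap in a random candidate so less-shown items still get exposure."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    recs = ranked[:k]
    for i in range(len(recs)):
        if random.random() < epsilon:
            recs[i] = random.choice(candidates)
    return recs
```

The randomly shown items’ feedback then gives a less biased estimate of their quality, at some cost to short-term engagement.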

37

38 of 104

Positional features

  • If a prediction’s position affects its feedback in any way, encode it.
    • Numerical: e.g. position 1, 2, 3, …
    • Boolean: e.g. whether it was shown in the first position or not

38

39 of 104

Positional features: naive

39

ID | Song      | Genre | Year | Artist          | User       | 1st Position | Click
---+-----------+-------+------+-----------------+------------+--------------+------
 1 | Shallow   | Pop   | 2020 | Lady Gaga       | listenr32  | False        | No
 2 | Good Vibe | Funk  | 2019 | Funk Overlord   | listenr32  | False        | No
 3 | Beat It   | Rock  | 1989 | Michael Jackson | fancypants | False        | No
 4 | In Bloom  | Rock  | 1991 | Nirvana         | fancypants | True         | Yes
 5 | Shallow   | Pop   | 2020 | Lady Gaga       | listenr32  | True         | Yes

40 of 104

Positional features: naive

Problem: this feature isn’t available at inference time. What value should it take?

40


41 of 104

Positional features: naive

Set to False during inference

41


42 of 104

Positional features: 2 models

  1. Predicts the probability that the user will see and consider a recommendation given its position.
  2. Predicts the probability that the user will click on the item given that they saw and considered it.

Model 2 doesn’t use positional features
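
A minimal sketch of how the two models’ predictions could be combined, assuming `p_consider` and `p_click` are the two trained models described above (both names are hypothetical):

```python
def click_probability(item_features, position, p_consider, p_click):
    """P(click) = P(sees & considers | position) * P(click | seen & considered).
    Only the first model uses the position; the second is position-independent."""
    return p_consider(position) * p_click(item_features)
```

At serving time you can rank candidates by p_click alone, and bring in p_consider only where the position-dependent correction matters.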

42

43 of 104

Breakout exercise

43

44 of 104

How might degenerate feedback loops occur? (10 mins)

  1. Build a system to predict stock prices and use the predictions to make buy/sell decisions.
  2. Use text scraped from the Internet to train a language model, then use the same language model to generate posts.

Discuss how you might mitigate the consequences of these feedback loops.

44

45 of 104

Data distribution shifts

45

46 of 104

  • Source distribution: data the model is trained on
  • Target distribution: data the model runs inference on

46

47 of 104

Supervised learning: P(X, Y)

  1. P(X, Y) = P(Y|X)P(X)
  2. P(X, Y) = P(X|Y)P(Y)

47

48 of 104

Types of data distribution shifts

48

Type            | Meaning                                | Decomposition
----------------+----------------------------------------+----------------------
Covariate shift | P(X) changes, P(Y|X) remains the same  | P(X, Y) = P(Y|X)P(X)
Label shift     | P(Y) changes, P(X|Y) remains the same  | P(X, Y) = P(X|Y)P(Y)
Concept drift   | P(Y|X) changes, P(X) remains the same  | P(X, Y) = P(Y|X)P(X)

49 of 104

Covariate shift

  • Statistics: a covariate is an independent variable that can influence the outcome of a given statistical trial.
  • Supervised ML: input features are covariates

49

  • P(X) changes
  • P(Y|X) remains the same

50 of 104

Covariate shift

  • Statistics: a covariate is an independent variable that can influence the outcome of a given statistical trial.
  • Supervised ML: input features are covariates
  • Input distribution changes, but for a given input, output is the same

50

  • P(X) changes
  • P(Y|X) remains the same

51 of 104

Covariate shift: example

  • Predicts P(cancer | patient)
  • P(age > 40): training > production
  • P(cancer | age > 40): training = production

51

  • P(X) changes
  • P(Y|X) remains the same

52 of 104

Covariate shift: causes (training)

  • Data collection
    • E.g. women >40 are encouraged by doctors to get checkups
    • Closely related to sampling biases
  • Training techniques
    • E.g. oversampling of rare classes
  • Learning process
    • E.g. active learning

52

  • Predicts P(cancer | patient)
  • P(age > 40):
    • training > production
  • P(cancer | age > 40):
    • training = production
  • P(X) changes
  • P(Y|X) remains the same

53 of 104

Covariate shift: causes (prod)

Changes in environments

  • Ex 1: P(convert to paid user | free user)
    • New marketing campaign attracts users with higher income
      • P(high income) increases
      • P(convert to paid user | high income) remains the same

53

  • P(X) changes
  • P(Y|X) remains the same

54 of 104

Covariate shift: causes (prod)

Changes in environments

  • Ex 2: P(Covid | coughing sound)
    • Training data from clinics, production data from phone recordings
      • P(coughing sound) changes
      • P(Covid | coughing sound) remains the same

54

  • P(X) changes
  • P(Y|X) remains the same

55 of 104

Covariate shift

  • Research setting: if you know in advance how the production data will differ from the training data, you can use importance weighting
  • Production: unlikely to know how a distribution will change in advance
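
For illustration only, a minimal sketch of importance weighting under the (rarely realistic) assumption that you can estimate both input densities; `p_train` and `p_prod` are hypothetical density functions, and many scikit-learn estimators accept the resulting weights via `fit(..., sample_weight=w)`:

```python
import numpy as np

def importance_weights(X, p_train, p_prod, clip=10.0):
    """Reweight training examples by p_prod(x) / p_train(x) so the training
    loss approximates the expected loss under the production distribution."""
    w = np.array([p_prod(x) / p_train(x) for x in X])
    return np.clip(w, 0.0, clip)  # clip extreme ratios to limit variance
```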

55

56 of 104

Label shift

  • Output distribution changes but for a given output, input distribution stays the same.

56

  • P(Y) changes
  • P(X|Y) remains the same

57 of 104

Label shift & covariate shift

  • Predicts P(cancer | patient)
  • P(age > 40): training > production
  • P(cancer | age > 40): training = production
  • P(cancer): training > production
  • P(age > 40 | cancer): training = production

57

  • P(Y) changes
  • P(X|Y) remains the same
  • P(X) changes
  • P(Y|X) remains the same

A change in P(X) often leads to a change in P(Y), so covariate shift often comes with label shift

58 of 104

Label shift & covariate shift

  • Predicts P(cancer | patient)
  • New preventive drug: reducing P(cancer | patient) for all patients
  • P(age > 40): training > production
  • P(cancer | age > 40): training > production
  • P(cancer): training > production
  • P(age > 40 | cancer): training = production

58

  • P(X) changes
  • P(Y|X) remains the same

Not all label shifts are covariate shifts!

  • P(Y) changes
  • P(X|Y) remains the same

59 of 104

Concept Drift

  • Same input, expecting different output
  • P(houses in SF) remains the same
  • Covid causes people to leave SF, housing prices drop
    • P($5M | houses in SF)
      • Pre-covid: high
      • During-covid: low

59

  • P(X) remains the same
  • P(Y|X) changes

60 of 104

Concept Drift

  • Concept drifts can be cyclic & seasonal
    • Ride sharing demands high during rush hours, low otherwise
    • Flight ticket prices high during holidays, low otherwise

60

  • P(X) remains the same
  • P(Y|X) changes

61 of 104

General data changes

  • Feature change
    • A feature is added/removed/updated

61

62 of 104

General data changes

  • Feature change
    • A feature is added/removed/updated
  • Label schema change
    • Original: {“POSITIVE”: 0, “NEGATIVE”: 1}
    • New: {“POSITIVE”: 0, “NEGATIVE”: 1, “NEUTRAL”: 2}

62

63 of 104

Detecting data distribution shifts

How to determine that two distributions are different?

63

64 of 104

Detecting data distribution shifts

How to determine that two distributions are different?

  1. Compare statistics: mean, median, variance, quantiles, skewness, kurtosis, …
    • Compute these stats on the training data, then compare them with the same stats computed on production data
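
A minimal sketch of such a comparison with pandas, assuming `train_df` and `prod_df` hold the same numeric features for the training set and a recent production window (the 25% relative-difference threshold is an arbitrary placeholder):

```python
import pandas as pd

def drifting_features(train_df, prod_df, rel_tol=0.25):
    """Flag features whose basic statistics moved noticeably between
    the training data and the production window."""
    rows = ["mean", "std", "50%", "min", "max"]
    train_stats = train_df.describe().loc[rows]
    prod_stats = prod_df.describe().loc[rows]
    rel_diff = (prod_stats - train_stats).abs() / (train_stats.abs() + 1e-8)
    # keep only features where at least one statistic exceeds the tolerance
    return rel_diff.loc[:, (rel_diff > rel_tol).any()]
```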

64

65 of 104

Detecting data distribution shifts

How to determine that two distributions are different?

  • Compare statistics: mean, median, variance, quantiles, skewness, kurtosis, …
    • Not universal: only useful for distributions where these statistics are meaningful

65

66 of 104

Detecting data distribution shifts

How to determine that two distributions are different?

  • Compare statistics: mean, median, variance, quantiles, skewness, kurtosis, …
    • Not universal: only useful for distributions where these statistics are meaningful
    • Inconclusive: if statistics differ, distributions differ. If statistics are the same, distributions can still differ.

66

67 of 104

Cumulative vs. sliding metrics

  • Cumulative: accumulated over all data seen so far, so recent changes get smoothed out
  • Sliding: reset at each new time window

67

This image is based on an example from MadeWithML (Goku Mohandas).
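
A minimal sketch contrasting the two, assuming `correct` is a 0/1 array recording whether each prediction (in time order) turned out to be right:

```python
import numpy as np

def cumulative_accuracy(correct):
    """Accuracy over everything seen so far; recent changes get smoothed out."""
    correct = np.asarray(correct, dtype=float)
    return np.cumsum(correct) / np.arange(1, len(correct) + 1)

def sliding_accuracy(correct, window=100):
    """Accuracy over only the most recent `window` predictions."""
    correct = np.asarray(correct, dtype=float)
    return np.array([correct[max(0, i - window + 1): i + 1].mean()
                     for i in range(len(correct))])
```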

68 of 104

Detecting data distribution shifts

How to determine that two distributions are different?

  • Compare statistics: mean, median, variance, quantiles, skewness, kurtosis, …
  • Two-sample hypothesis test
    • Determine whether the difference between two populations is statistically significant
    • If yes, likely from two distinct distributions

68

E.g.

  1. Data from yesterday
  2. Data from today

69 of 104

Two-sample test: KS test (Kolmogorov–Smirnov)

  • Pros
    • Doesn’t require any parameters of the underlying distribution
    • Doesn’t make assumptions about distribution
  • Cons
    • Only works with one-dimensional data

69

  • Useful for prediction & label distributions
  • Not so useful for features
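
A minimal sketch of a KS two-sample test on a one-dimensional prediction distribution with scipy (the samples and the 0.01 significance threshold are placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
yesterday_scores = rng.normal(0.0, 1.0, size=5_000)  # reference window
today_scores = rng.normal(0.3, 1.0, size=5_000)      # current window

statistic, p_value = stats.ks_2samp(yesterday_scores, today_scores)
if p_value < 0.01:
    print(f"Possible shift: KS statistic={statistic:.3f}, p-value={p_value:.2e}")
```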

70 of 104

Two-sample test

70

alibi-detect (open source)

Most tests work better on low-dimensional data, so dimensionality reduction is recommended beforehand!

71 of 104

Not all shifts are equal

  • Sudden shifts vs. gradual shifts
    • Sudden shifts are easier to detect than gradual shifts

71

72 of 104

Not all shifts are equal

  • Sudden shifts vs. gradual shifts
  • Spatial shifts vs. temporal shifts

72

  • Spatial: new device (e.g. mobile vs. desktop), new users (e.g. new country)
  • Temporal: same users, same device, but behaviors change over time

73 of 104

Temporal shifts: time window scale matters

73

[Figure: the same target distribution compared against the source distribution at two time-window scales; with one window it looks like a shift, with the other it doesn’t]

74 of 104

Temporal shifts: time window scale matters

74

Difficulty is compounded by seasonal variation

75 of 104

Temporal shifts: time window scale matters

  • Too short window: false alarms of shifts
  • Too long window: takes long to detect shifts

75

  • Granularity level: hourly, daily

76 of 104

Temporal shifts: time window scale matters

  • Too short window: false alarms of shifts
  • Too long window: takes long to detect shifts

76

  • Granularity level: hourly, daily
  • Merge shorter time scale windows -> larger time scale window
  • Root cause analysis (RCA): automatically analyze various window sizes

77 of 104

Addressing data distribution shifts

  1. Train model using a massive dataset

77

78 of 104

Addressing data distribution shifts

  • Train model using a massive dataset
  • Retrain model with new data from new distribution
    • Mode
      • Train from scratch
      • Fine-tune

78

79 of 104

Addressing data distribution shifts

  • Train model using a massive dataset
  • Retrain model with new data from new distribution
    • Mode
    • Data
      • Use data from when data started to shift
      • Use data from the last X days/weeks/months
      • Use data from the last fine-tuning point

79

Need to figure out not just when to retrain models, but also how to retrain them and on what data

80 of 104

Monitoring & Observability

80

81 of 104

Monitoring vs. observability

  • Monitoring: tracking, measuring, and logging different metrics that can help us determine when something goes wrong
  • Observability: setting up our system in a way that gives us visibility into our system to investigate what went wrong

81

82 of 104

Monitoring vs. observability

  • Monitoring: tracking, measuring, and logging different metrics that can help us determine when something goes wrong
  • Observability: setting up our system in a way that gives us visibility into our system to investigate what went wrong

82

Instrumentation

  • adding timers to your functions
  • counting NaNs in your features
  • logging unusual events e.g. very long inputs

83 of 104

Monitoring vs. observability

  • Monitoring: tracking, measuring, and logging different metrics that can help us determine when something goes wrong
  • Observability: setting up our system in a way that gives us visibility into our system to investigate what went wrong

83

Observability is part of monitoring

84 of 104

Monitoring is all about metrics

  • Operational metrics
  • ML-specific metrics

84

85 of 104

Operational metrics

  • Latency
  • Throughput
  • Requests per minute/hour/day
  • % requests that return with a 2XX code
  • CPU/GPU utilization
  • Memory utilization
  • Availability
  • etc.

85

86 of 104

Operational metrics

  • Latency
  • Throughput
  • Requests per minute/hour/day
  • % requests that return with a 2XX code
  • CPU/GPU utilization
  • Memory utilization
  • Availability
  • etc.

86

SLA example

  • Up means:
    • median latency <200ms
    • 99th percentile <2s
  • 99.99% uptime (four-nines)

SLA for ML?
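
A minimal sketch of checking the latency half of such an SLA from a window of observed request latencies (`latencies_ms` is a hypothetical input; the targets mirror the example above):

```python
import numpy as np

def check_latency_slo(latencies_ms, p50_target_ms=200.0, p99_target_ms=2000.0):
    """Compare median and 99th-percentile latency against the SLA targets."""
    p50 = float(np.percentile(latencies_ms, 50))
    p99 = float(np.percentile(latencies_ms, 99))
    return {"p50_ms": p50, "p99_ms": p99,
            "meets_slo": p50 < p50_target_ms and p99 < p99_target_ms}
```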

87 of 104

ML metrics: what to monitor

87

88 of 104

Monitoring #1: accuracy-related metrics

  • Most direct way to monitor a model’s performance
    • Can only be computed as fast as feedback becomes available

88

89 of 104

Monitoring #1: accuracy-related metrics

  • Most direct way to monitor a model’s performance
  • Collect as much feedback as possible
  • Example: YouTube video recommendations
    • Click through rate
    • Duration watched
    • Completion rate
    • Take rate

89

90 of 104

Monitoring #2: predictions

  • Predictions are low-dim: easy to visualize, compute stats, and do two-sample tests
  • Changes in prediction dist. generally mean changes in input dist.

90

91 of 104

Monitoring #2: predictions

  • Predictions are low-dim: easy to visualize, compute stats, and do two-sample tests
  • Changes in prediction dist. generally mean changes in input dist.
  • Monitor odd things in predictions
    • E.g. if predictions are all False in the last 10 mins
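
A minimal sketch of that last check: warn when every prediction in the recent window is identical (the class, the 10-minute window, and the 100-prediction minimum are hypothetical choices):

```python
import time
from collections import deque

class ConstantPredictionMonitor:
    """Alert if all predictions in the last `window_seconds` are identical."""
    def __init__(self, window_seconds=600, min_count=100):
        self.window_seconds = window_seconds
        self.min_count = min_count
        self.buffer = deque()  # (timestamp, prediction)

    def observe(self, prediction, now=None):
        now = time.time() if now is None else now
        self.buffer.append((now, prediction))
        # drop predictions older than the window
        while self.buffer and self.buffer[0][0] < now - self.window_seconds:
            self.buffer.popleft()
        distinct = {p for _, p in self.buffer}
        if len(self.buffer) >= self.min_count and len(distinct) == 1:
            print(f"ALERT: all {len(self.buffer)} predictions in the last "
                  f"{self.window_seconds}s were {distinct.pop()!r}")
```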

91

92 of 104

Monitoring #3: features

  • Most monitoring tools focus on monitoring features
  • Feature schema expectations
    • Generated from the source distribution
    • If violated in production, possibly something is wrong
  • Example expectations
    • Common sense: e.g. “the” is the most common word in English
    • min, max, or median values of a feature are in [a, b]
    • All values of a feature satisfy a regex
    • Categorical data belongs to a predefined set
    • FEATURE_A > FEATURE_B
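
A minimal sketch of encoding a few expectations like these as plain Python checks (the feature names, ranges, and regex are hypothetical):

```python
import re

# Expectations derived from the source (training) distribution.
EXPECTATIONS = {
    "age": lambda v: 0 <= v <= 120,
    "country": lambda v: v in {"US", "CA", "VN", "IN"},
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}

def violated_expectations(row):
    """Return the names of features in `row` that violate their expectation."""
    return [name for name, check in EXPECTATIONS.items()
            if name in row and not check(row[name])]
```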

92

93 of 104

Generate expectations with profiling & visualization

  • Examining data & collecting:
    • statistics
    • informative summaries
  • pandas-profiling
  • facets
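
A minimal usage sketch for pandas-profiling (the package is now published as ydata-profiling; the file paths are hypothetical):

```python
import pandas as pd
from pandas_profiling import ProfileReport

# Profile the training features to get statistics and summaries that can
# seed feature expectations ("training_features.parquet" is a placeholder).
df = pd.read_parquet("training_features.parquet")
profile = ProfileReport(df, title="Training features", minimal=True)
profile.to_file("training_features_profile.html")
```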

93

94 of 104

Monitoring #3: features

  • Feature schema expectations

94

95 of 104

Monitoring #3: features schema with pydantic

95

https://pydantic-docs.helpmanual.io/usage/validators/
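
A minimal sketch of a feature schema with a pydantic (v1-style) validator, loosely mirroring the song-recommendation features from the earlier slides; the year bounds are arbitrary sanity limits:

```python
from pydantic import BaseModel, validator

class SongFeatures(BaseModel):
    song: str
    genre: str
    year: int
    artist: str
    first_position: bool

    @validator("year")
    def year_in_range(cls, v):
        # reject obviously corrupted values before they reach the model
        if not 1900 <= v <= 2030:
            raise ValueError(f"year out of expected range: {v}")
        return v

# SongFeatures(song="Shallow", genre="Pop", year=2020,
#              artist="Lady Gaga", first_position=True)   # passes
# ...the same record with year=20200 raises a ValidationError.
```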

96 of 104

Monitoring #3: features schema with TFX

96

How To Evaluate MLOps Tools (Hamel Husain, CS 329S Lecture 9, 2022)

97 of 104

Feature monitoring problems

  1. Compute & memory cost
    1. 100s models, each with 100s features
    2. Computing stats for 10000s of features is costly

97

98 of 104

Feature monitoring problems

  • Compute & memory cost
  • Alert fatigue
    • Most expectation violations are benign

98

99 of 104

Feature monitoring problems

  • Compute & memory cost
  • Alert fatigue
  • Schema management
    • Feature schema changes over time
    • Need to find a way to map feature to schema version

99

100 of 104

Monitoring toolbox: logs

  • Log everything
  • A stream processing problem

100

“If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it.”

Ian Malpass (Etsy 2011)

Vladimir Kazanov (Badoo 2019)

101 of 104

Monitoring toolbox: dashboards

  • Make monitoring accessible to non-engineering stakeholders
  • Good for visualizations but insufficient for discovering distribution shifts

101

102 of 104

Monitoring toolbox: alerts

  • 3 components
    • Alert policy: condition for alert
    • Notification channels
    • Description
  • Alert fatigue
    • How to send only meaningful alerts?

102

## Recommender model accuracy below 90%

${timestamp}: This alert originated from the service ${service-name}

103 of 104

Monitoring -> Continual Learning

  • Monitoring is passive
    • Wait for a shift to happen to detect it
  • Continual learning is active
    • Update your models to address shifts

103

104 of 104

Machine Learning Systems Design

Next class:

  • Continual Learning
  • Data Distribution Shifts on Streams with Shreya Shankar

cs329s.stanford.edu | Chip Huyen