1 of 104

Machine Learning Systems Design

Lecture 10: Data Distribution Shifts & Monitoring

CS 329S (Chip Huyen, 2022) | cs329s.stanford.edu

2 of 104

Zoom etiquette

We appreciate it if you keep videos on!

  • More visual feedback for us to adjust materials
  • Better learning environment
  • Better sense of who you’re with in class!

2

3 of 104

Agenda

  1. Natural labels & feedback loops
  2. Causes of ML failures
  3. Breakout exercise
  4. Data distribution shifts
  5. Monitoring & observability

3

Lecture notes are on the course website / syllabus

4 of 104

Natural labels & feedback loops

4

5 of 104

Natural labels

  • The model’s predictions can be automatically evaluated or partially evaluated by the system.
  • Examples:
    • ETA
    • Ride demand prediction
    • Stock price prediction
    • Ads CTR
    • Recommender system

5

6 of 104

Natural labels

  • You can engineer a task to have natural labels

6

7 of 104

Natural labels: surprisingly common

⚠ Biases ⚠

  • Small sample size
  • Companies might only use ML for tasks with natural labels

7

8 of 104

Delayed labels

Time

Prediction is served

Feedback is provided

Feedback loop length

8

9 of 104

Delayed labels

  • Short feedback loop: minutes -> hours
    • Reddit / Twitter / TikTok’s recommender systems
  • Long feedback loop: weeks -> months
    • Stitch Fix’s recommender systems
    • Fraud detection

Time

Prediction is served

Feedback is provided

Feedback loop length

9

10 of 104

10

11 of 104

⚠ Labels are often assumed ⚠

  • Recommendation:
    • Click -> good rec
    • After X minutes, no click -> bad rec

Speed vs. accuracy tradeoff for the window length X:

  • Too short -> false negatives
  • Too long -> slow feedback

11

12 of 104

⚠ Labels are often assumed ⚠

  • Recommendation:
    • Click -> good rec
    • After X minutes, no click -> bad rec
  • Window length X:
    • Too short -> false negatives
    • Too long -> slow feedback

12

13 of 104

Causes of ML failures

13

14 of 104

14

15 of 104

“Guests complained their robot room assistants thought snoring sounds were commands and would wake them up repeatedly during the night.”

15

16 of 104

What is an ML failure?

A failure happens when one or more expectations of the system are violated.

Two types of expectations:

  • Operational metrics: e.g. average latency, throughput, uptime
  • ML metrics: e.g. accuracy, F1, BLEU score

16

17 of 104

What is an ML failure?

A failure happens when one or more expectations of the system are violated.

  • Traditional software: mostly operational metrics
  • ML systems: operational + ML metrics
    • Ops: returns an English translation within 100ms latency on average
    • ML: BLEU score of 55 (out of 100)

17

18 of 104

ML system failures

  • If you enter a sentence and get no translation back -> ops failure
  • If one translation is incorrect -> ML failure?

18

19 of 104

ML system failures

  • If you enter a sentence and get no translation back -> ops failure
  • If one translation is incorrect -> ML failure?
  • Not necessarily: expected BLEU score < 100
  • ML failure if translations are consistently incorrect

19

20 of 104

20

Ops failures: visible
  • 404, timeout, segfault, OOM, etc.

ML failures: often invisible

21 of 104

Causes of ops failures (software system failures)

  • Dependency failures
  • Deployment failures
  • Hardware failures
  • Network failure: downtime / crash

21

22 of 104

Causes of ops failures (software system failures)

  • Dependency failures
  • Deployment failures
  • Hardware failures
  • Network failure: downtime / crash

As tooling & best practices around ML production mature, there will be less surface area for software failures

22

23 of 104

ML-specific failures (during/post deployment)

  1. Production data differing from training data
  2. Edge cases
  3. Degenerate feedback loops

We’ve already covered problems pre-deployment in previous lectures!

23

24 of 104

Production data differing from training data

  • Train-serving skew:
    • Model performing well during development but poorly in production
  • Data distribution shifts
    • Model performing well when first deployed, but poorly over time
    • ⚠ What looks like data shifts might be caused by human errors ⚠

24

25 of 104

Production data differing from training data

  • Train-serving skew:
    • Model performing well during development but poorly in production
  • Data distribution shifts
    • Model performing well when first deployed, but poorly over time
    • ⚠ What looks like data shifts might be caused by human errors ⚠

Common & crucial. Will go into detail!

25

26 of 104

Edge cases

  • Self-driving car (yearly)
    • Safely: 99.99%
    • Fatal accidents: 0.01%

Zoom poll: Would you use this car?

26

27 of 104

Edge case vs. outlier

  • Outliers
    • Refer to inputs
    • Options to ignore/remove
  • Edge cases
    • Refer to outputs
    • Can’t ignore/remove

27

28 of 104

Degenerate feedback loops

  • When predictions influence the feedback, which is then used to extract labels to train the next iteration of the model
  • Common in tasks with natural labels

Predictions

Users’ feedback

Training data

28

29 of 104

Degenerate feedback loops: recsys

  • Originally, A is ranked marginally higher than B -> model recommends A
  • After a while, A is ranked much higher than B

Model recommends item A

User clicks on A

Model confirms A is good

29

30 of 104

Degenerate feedback loops: recsys

  • Originally, A is ranked marginally higher than B -> model recommends A
  • After a while, A is ranked much higher than B

Model recommends item A

User clicks on A

Model confirms A is good

Over time, recommendations become more homogenous

30

31 of 104

Degenerate feedback loops: resume screening

  • Originally, model thinks X is a good feature
  • Model only picks resumes with X
  • Hiring managers only see resumes with X, so only people with X are hired
  • Model confirms that X is good

Replace X with:

  • Has a name that is typically used for males
  • Went to Stanford
  • Worked at Google

31

32 of 104

Degenerate feedback loops: resume screening

  • Originally, model thinks X is a good feature
  • Model only picks resumes with X
  • Hiring managers only see resumes with X, so only people with X are hired
  • Model confirms that X is good

Tracking feature importance might help!

32

33 of 104

Detecting degenerate feedback loops

Only arise once models are in production -> hard to detect during training

Predictions

Users’ feedback

Training data

33

34 of 104

Degenerate feedback loops: detect

  • Average Rec Popularity (ARP)
    • Average popularity of the recommended items
  • Average Percentage of Long Tail Items (APLT)
    • average % of long tail items being recommended
  • Hit rate against popularity
    • Accuracy based on recommended items’ popularity buckets
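
A minimal sketch of the ARP and APLT metrics above, assuming hypothetical inputs: `recommendations` (one list of recommended item IDs per request), `item_popularity` (item ID -> historical interaction count), and `long_tail_items` (the set of items outside the popular head):

```python
import numpy as np

def average_rec_popularity(recommendations, item_popularity):
    """ARP: average popularity of all recommended items."""
    pops = [item_popularity.get(item, 0)
            for recs in recommendations for item in recs]
    return float(np.mean(pops))

def average_pct_long_tail(recommendations, long_tail_items):
    """APLT: average % of long-tail items per recommendation list."""
    pcts = [np.mean([item in long_tail_items for item in recs])
            for recs in recommendations]
    return float(np.mean(pcts))
```

Tracked over time, a rising ARP (or a falling APLT) suggests recommendations are drifting toward already-popular items.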

34

35 of 104

Degenerate feedback loops: mitigate

  1. Randomization
  2. Positional features

35

36 of 104

Randomization

  • Degenerate feedback loops increase output homogeneity
  • Combat homogeneity by introducing randomness in predictions

36

37 of 104

Randomization

  • Degenerate feedback loops increase output homogeneity
  • Combat homogeneity by introducing randomness in predictions
  • Recsys: show users random items & use feedback to determine items’ quality
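
One simple way to introduce that randomness is an epsilon-greedy step on top of the model’s ranking; a minimal sketch (`candidates`, `score_fn`, and the 10% exploration rate are hypothetical):

```python
import random

def recommend(candidates, score_fn, k=10, epsilon=0.1):
    """Rank candidates by model score, then with probability epsilon per slot
    swap in a random candidate so less-shown items still get exposure."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    recs = ranked[:k]
    for i in range(len(recs)):
        if random.random() < epsilon:
            recs[i] = random.choice(candidates)
    return recs
```

The randomly shown items’ feedback then gives a less biased estimate of their quality, at some cost to short-term engagement.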

37

38 of 104

Positional features

  • If a prediction’s position affects its feedback in any way, encode it.
    • Numerical: e.g. position 1, 2, 3, …
    • Boolean: e.g. whether it was shown in the first position or not

38

39 of 104

Positional features: naive

39

ID | Song      | Genre | Year | Artist          | User       | 1st Position | Click
---+-----------+-------+------+-----------------+------------+--------------+------
 1 | Shallow   | Pop   | 2020 | Lady Gaga       | listenr32  | False        | No
 2 | Good Vibe | Funk  | 2019 | Funk Overlord   | listenr32  | False        | No
 3 | Beat It   | Rock  | 1989 | Michael Jackson | fancypants | False        | No
 4 | In Bloom  | Rock  | 1991 | Nirvana         | fancypants | True         | Yes
 5 | Shallow   | Pop   | 2020 | Lady Gaga       | listenr32  | True         | Yes

40 of 104

Positional features: naive

Problem: this feature isn’t available at inference time. What value should it take?

40


41 of 104

Positional features: naive

Set to False during inference

41


42 of 104

Positional features: 2 models

  1. Predicts the probability that the user will see and consider a recommendation given its position.
  2. Predicts the probability that the user will click on the item given that they saw and considered it.

Model 2 doesn’t use positional features
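
A minimal sketch of how the two models’ predictions could be combined, assuming `p_consider` and `p_click` are the two trained models described above (both names are hypothetical):

```python
def click_probability(item_features, position, p_consider, p_click):
    """P(click) = P(sees & considers | position) * P(click | seen & considered).
    Only the first model uses the position; the second is position-independent."""
    return p_consider(position) * p_click(item_features)
```

At serving time you can rank candidates by p_click alone, and bring in p_consider only where the position-dependent correction matters.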

42

43 of 104

Breakout exercise

43

44 of 104

How might degenerate feedback loops occur? (10 mins)

  1. Build a system to predict stock prices and use the predictions to make buy/sell decisions.
  2. Use text scraped from the Internet to train a language model, then use the same language model to generate posts.

Discuss how you might mitigate the consequences of these feedback loops.

44

45 of 104

Data distribution shifts

45

46 of 104

  • Source distribution: data the model is trained on
  • Target distribution: data the model runs inference on

46

47 of 104

Supervised learning: P(X, Y)

  1. P(X, Y) = P(Y|X)P(X)
  2. P(X, Y) = P(X|Y)P(Y)

47

48 of 104

Types of data distribution shifts

48

Type            | Meaning                                | Decomposition
----------------+----------------------------------------+----------------------
Covariate shift | P(X) changes, P(Y|X) remains the same  | P(X, Y) = P(Y|X)P(X)
Label shift     | P(Y) changes, P(X|Y) remains the same  | P(X, Y) = P(X|Y)P(Y)
Concept drift   | P(Y|X) changes, P(X) remains the same  | P(X, Y) = P(Y|X)P(X)

49 of 104

Covariate shift

  • Statistics: a covariate is an independent variable that can influence the outcome of a given statistical trial.
  • Supervised ML: input features are covariates

49

  • P(X) changes
  • P(Y|X) remains the same

50 of 104

Covariate shift

  • Statistics: a covariate is an independent variable that can influence the outcome of a given statistical trial.
  • Supervised ML: input features are covariates
  • Input distribution changes, but for a given input, output is the same

50

  • P(X) changes
  • P(Y|X) remains the same

51 of 104

Covariate shift: example

  • Predicts P(cancer | patient)
  • P(age > 40): training > production
  • P(cancer | age > 40): training = production

51

  • P(X) changes
  • P(Y|X) remains the same

52 of 104

Covariate shift: causes (training)

  • Data collection
    • E.g. women >40 are encouraged by doctors to get checkups
    • Closely related to sampling biases
  • Training techniques
    • E.g. oversampling of rare classes
  • Learning process
    • E.g. active learning

52

  • Predicts P(cancer | patient)
  • P(age > 40):
    • training > production
  • P(cancer | age > 40):
    • training = production
  • P(X) changes
  • P(Y|X) remains the same

53 of 104

Covariate shift: causes (prod)

Changes in environments

  • Ex 1: P(convert to paid user | free user)
    • New marketing campaign attracts users with higher income
      • P(high income) increases
      • P(convert to paid user | high income) remains the same

53

  • P(X) changes
  • P(Y|X) remains the same

54 of 104

Covariate shift: causes (prod)

Changes in environments

  • Ex 2: P(Covid | coughing sound)
    • Training data from clinics, production data from phone recordings
      • P(coughing sound) changes
      • P(Covid | coughing sound) remains the same

54

  • P(X) changes
  • P(Y|X) remains the same

55 of 104

Covariate shift

  • Research setting: if you know in advance how the production data will differ from the training data, you can use importance weighting
  • Production: unlikely to know how a distribution will change in advance
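
For illustration only, a minimal sketch of importance weighting under the (rarely realistic) assumption that you can estimate both input densities; `p_train` and `p_prod` are hypothetical density functions, and many scikit-learn estimators accept the resulting weights via `fit(..., sample_weight=w)`:

```python
import numpy as np

def importance_weights(X, p_train, p_prod, clip=10.0):
    """Reweight training examples by p_prod(x) / p_train(x) so the training
    loss approximates the expected loss under the production distribution."""
    w = np.array([p_prod(x) / p_train(x) for x in X])
    return np.clip(w, 0.0, clip)  # clip extreme ratios to limit variance
```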

55

56 of 104

Label shift

  • Output distribution changes but for a given output, input distribution stays the same.

56

  • P(Y) changes
  • P(X|Y) remains the same

57 of 104

Label shift & covariate shift

  • Predicts P(cancer | patient)
  • P(age > 40): training > production
  • P(cancer | age > 40): training = production
  • P(cancer): training > production
  • P(age > 40 | cancer): training = production

57

  • P(Y) changes
  • P(X|Y) remains the same
  • P(X) changes
  • P(Y|X) remains the same

A change in P(X) often leads to a change in P(Y), so covariate shift often comes with label shift

58 of 104

Label shift & covariate shift

  • Predicts P(cancer | patient)
  • New preventive drug: reducing P(cancer | patient) for all patients
  • P(age > 40): training > production
  • P(cancer | age > 40): training > production
  • P(cancer): training > production
  • P(age > 40 | cancer): training = production

58

  • P(X) changes
  • P(Y|X) remains the same

Not all label shifts are covariate shifts!

  • P(Y) changes
  • P(X|Y) remains the same

59 of 104

Concept Drift

  • Same input, expecting different output
  • P(houses in SF) remains the same
  • Covid causes people to leave SF, housing prices drop
    • P($5M | houses in SF)
      • Pre-covid: high
      • During-covid: low

59

  • P(X) remains the same
  • P(Y|X) changes

60 of 104

Concept Drift

  • Concept drifts can be cyclic & seasonal
    • Ride sharing demands high during rush hours, low otherwise
    • Flight ticket prices high during holidays, low otherwise

60

  • P(X) remains the same
  • P(Y|X) changes

61 of 104

General data changes

  • Feature change
    • A feature is added/removed/updated

61

62 of 104

General data changes

  • Feature change
    • A feature is added/removed/updated
  • Label schema change
    • Original: {“POSITIVE”: 0, “NEGATIVE”: 1}
    • New: {“POSITIVE”: 0, “NEGATIVE”: 1, “NEUTRAL”: 2}

62

63 of 104

Detecting data distribution shifts

How to determine that two distributions are different?

63

64 of 104

Detecting data distribution shifts

How to determine that two distributions are different?

  1. Compare statistics: mean, median, variance, quantiles, skewness, kurtosis, …
    • Compute these stats on the training data, then compare them with the same stats computed on production data
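
A minimal sketch of such a comparison with pandas, assuming `train_df` and `prod_df` hold the same numeric features for the training set and a recent production window (the 25% relative-difference threshold is an arbitrary placeholder):

```python
import pandas as pd

def drifting_features(train_df, prod_df, rel_tol=0.25):
    """Flag features whose basic statistics moved noticeably between
    the training data and the production window."""
    rows = ["mean", "std", "50%", "min", "max"]
    train_stats = train_df.describe().loc[rows]
    prod_stats = prod_df.describe().loc[rows]
    rel_diff = (prod_stats - train_stats).abs() / (train_stats.abs() + 1e-8)
    # keep only features where at least one statistic exceeds the tolerance
    return rel_diff.loc[:, (rel_diff > rel_tol).any()]
```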

64

65 of 104

Detecting data distribution shifts

How to determine that two distributions are different?

  • Compare statistics: mean, median, variance, quantiles, skewness, kurtosis, …
    • Not universal: only useful for distributions where these statistics are meaningful

65

66 of 104

Detecting data distribution shifts

How to determine that two distributions are different?

  • Compare statistics: mean, median, variance, quantiles, skewness, kurtosis, …
    • Not universal: only useful for distributions where these statistics are meaningful
    • Inconclusive: if statistics differ, distributions differ. If statistics are the same, distributions can still differ.

66

67 of 104

Cumulative vs. sliding metrics

  • Cumulative: accumulated over all data seen so far, so recent changes get smoothed out
  • Sliding: reset at each new time window

67

This image is based on an example from MadeWithML (Goku Mohandas).
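
A minimal sketch contrasting the two, assuming `correct` is a 0/1 array recording whether each prediction (in time order) turned out to be right:

```python
import numpy as np

def cumulative_accuracy(correct):
    """Accuracy over everything seen so far; recent changes get smoothed out."""
    correct = np.asarray(correct, dtype=float)
    return np.cumsum(correct) / np.arange(1, len(correct) + 1)

def sliding_accuracy(correct, window=100):
    """Accuracy over only the most recent `window` predictions."""
    correct = np.asarray(correct, dtype=float)
    return np.array([correct[max(0, i - window + 1): i + 1].mean()
                     for i in range(len(correct))])
```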

68 of 104

Detecting data distribution shifts

How to determine that two distributions are different?

  • Compare statistics: mean, median, variance, quantiles, skewness, kurtosis, …
  • Two-sample hypothesis test
    • Determine whether the difference between two populations is statistically significant
    • If yes, likely from two distinct distributions

68

E.g.

  1. Data from yesterday
  2. Data from today

69 of 104

Two-sample test: KS test (Kolmogorov–Smirnov)

  • Pros
    • Doesn’t require any parameters of the underlying distribution
    • Doesn’t make assumptions about distribution
  • Cons
    • Only works with one-dimensional data

69

  • Useful for prediction & label distributions
  • Not so useful for features
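
A minimal sketch of a KS two-sample test on a one-dimensional prediction distribution with scipy (the samples and the 0.01 significance threshold are placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
yesterday_scores = rng.normal(0.0, 1.0, size=5_000)  # reference window
today_scores = rng.normal(0.3, 1.0, size=5_000)      # current window

statistic, p_value = stats.ks_2samp(yesterday_scores, today_scores)
if p_value < 0.01:
    print(f"Possible shift: KS statistic={statistic:.3f}, p-value={p_value:.2e}")
```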

70 of 104

Two-sample test

70

alibi-detect (open source)

Most tests work better on low-dimensional data, so dimensionality reduction is recommended beforehand!

71 of 104

Not all shifts are equal

  • Sudden shifts vs. gradual shifts
    • Sudden shifts are easier to detect than gradual shifts

71

72 of 104

Not all shifts are equal

  • Sudden shifts vs. gradual shifts
  • Spatial shifts vs. temporal shifts

72

  • Spatial: new device (e.g. mobile vs. desktop), new users (e.g. new country)
  • Temporal: same users, same device, but behaviors change over time

73 of 104

Temporal shifts: time window scale matters

73

[Figure: the same target distribution compared against the source distribution at two time-window scales; with one window it looks like a shift, with the other it doesn’t]

74 of 104

Temporal shifts: time window scale matters

74

Difficulty is compounded by seasonal variation

75 of 104

Temporal shifts: time window scale matters

  • Too short window: false alarms of shifts
  • Too long window: takes long to detect shifts

75

  • Granularity level: hourly, daily

76 of 104

Temporal shifts: time window scale matters

  • Too short window: false alarms of shifts
  • Too long window: takes long to detect shifts

76

  • Granularity level: hourly, daily
  • Merge shorter time scale windows -> larger time scale window
  • Root cause analysis (RCA): automatically analyze various window sizes

77 of 104

Addressing data distribution shifts

  1. Train model using a massive dataset

77

78 of 104

Addressing data distribution shifts

  • Train model using a massive dataset
  • Retrain model with new data from new distribution
    • Mode
      • Train from scratch
      • Fine-tune

78

79 of 104

Addressing data distribution shifts

  • Train model using a massive dataset
  • Retrain model with new data from new distribution
    • Mode
    • Data
      • Use data from when data started to shift
      • Use data from the last X days/weeks/months
      • Use data from the last fine-tuning point

79

Need to figure out not just when to retrain models, but also how to retrain them and on what data

80 of 104

Monitoring & Observability

80

81 of 104

Monitoring vs. observability

  • Monitoring: tracking, measuring, and logging different metrics that can help us determine when something goes wrong
  • Observability: setting up our system in a way that gives us visibility into our system to investigate what went wrong

81

82 of 104

Monitoring vs. observability

  • Monitoring: tracking, measuring, and logging different metrics that can help us determine when something goes wrong
  • Observability: setting up our system in a way that gives us visibility into our system to investigate what went wrong

82

Instrumentation

  • adding timers to your functions
  • counting NaNs in your features
  • logging unusual events e.g. very long inputs

83 of 104

Monitoring vs. observability

  • Monitoring: tracking, measuring, and logging different metrics that can help us determine when something goes wrong
  • Observability: setting up our system in a way that gives us visibility into our system to investigate what went wrong

83

Observability is part of monitoring

84 of 104

Monitoring is all about metrics

  • Operational metrics
  • ML-specific metrics

84

85 of 104

Operational metrics

  • Latency
  • Throughput
  • Requests per minute/hour/day
  • % requests that return with a 2XX code
  • CPU/GPU utilization
  • Memory utilization
  • Availability
  • etc.

85

86 of 104

Operational metrics

  • Latency
  • Throughput
  • Requests per minute/hour/day
  • % requests that return with a 2XX code
  • CPU/GPU utilization
  • Memory utilization
  • Availability
  • etc.

86

SLA example

  • Up means:
    • median latency <200ms
    • 99th percentile <2s
  • 99.99% uptime (four-nines)

SLA for ML?
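
A minimal sketch of checking the latency half of such an SLA from a window of observed request latencies (`latencies_ms` is a hypothetical input; the targets mirror the example above):

```python
import numpy as np

def check_latency_slo(latencies_ms, p50_target_ms=200.0, p99_target_ms=2000.0):
    """Compare median and 99th-percentile latency against the SLA targets."""
    p50 = float(np.percentile(latencies_ms, 50))
    p99 = float(np.percentile(latencies_ms, 99))
    return {"p50_ms": p50, "p99_ms": p99,
            "meets_slo": p50 < p50_target_ms and p99 < p99_target_ms}
```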

87 of 104

ML metrics: what to monitor

87

88 of 104

Monitoring #1: accuracy-related metrics

  • Most direct way to monitor a model’s performance
    • Can only be computed as fast as feedback becomes available

88

89 of 104

Monitoring #1: accuracy-related metrics

  • Most direct way to monitor a model’s performance
  • Collect as much feedback as possible
  • Example: YouTube video recommendations
    • Click through rate
    • Duration watched
    • Completion rate
    • Take rate

89

90 of 104

Monitoring #2: predictions

  • Predictions are low-dim: easy to visualize, compute stats, and do two-sample tests
  • Changes in prediction dist. generally mean changes in input dist.

90

91 of 104

Monitoring #2: predictions

  • Predictions are low-dim: easy to visualize, compute stats, and do two-sample tests
  • Changes in prediction dist. generally mean changes in input dist.
  • Monitor odd things in predictions
    • E.g. if predictions are all False in the last 10 mins
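
A minimal sketch of that last check: warn when every prediction in the recent window is identical (the class, the 10-minute window, and the 100-prediction minimum are hypothetical choices):

```python
import time
from collections import deque

class ConstantPredictionMonitor:
    """Alert if all predictions in the last `window_seconds` are identical."""
    def __init__(self, window_seconds=600, min_count=100):
        self.window_seconds = window_seconds
        self.min_count = min_count
        self.buffer = deque()  # (timestamp, prediction)

    def observe(self, prediction, now=None):
        now = time.time() if now is None else now
        self.buffer.append((now, prediction))
        # drop predictions older than the window
        while self.buffer and self.buffer[0][0] < now - self.window_seconds:
            self.buffer.popleft()
        distinct = {p for _, p in self.buffer}
        if len(self.buffer) >= self.min_count and len(distinct) == 1:
            print(f"ALERT: all {len(self.buffer)} predictions in the last "
                  f"{self.window_seconds}s were {distinct.pop()!r}")
```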

91

92 of 104

Monitoring #3: features

  • Most monitoring tools focus on monitoring features
  • Feature schema expectations
    • Generated from the source distribution
    • If violated in production, possibly something is wrong
  • Example expectations
    • Common sense: e.g. “the” is the most common word in English
    • min, max, or median values of a feature are in [a, b]
    • All values of a feature satisfy a regex
    • Categorical data belongs to a predefined set
    • FEATURE_A > FEATURE_B
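
A minimal sketch of encoding a few expectations like these as plain Python checks (the feature names, ranges, and regex are hypothetical):

```python
import re

# Expectations derived from the source (training) distribution.
EXPECTATIONS = {
    "age": lambda v: 0 <= v <= 120,
    "country": lambda v: v in {"US", "CA", "VN", "IN"},
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}

def violated_expectations(row):
    """Return the names of features in `row` that violate their expectation."""
    return [name for name, check in EXPECTATIONS.items()
            if name in row and not check(row[name])]
```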

92

93 of 104

Generate expectations with profiling & visualization

  • Examining data & collecting:
    • statistics
    • informative summaries
  • pandas-profiling
  • facets
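
A minimal usage sketch for pandas-profiling (the package is now published as ydata-profiling; the file paths are hypothetical):

```python
import pandas as pd
from pandas_profiling import ProfileReport

# Profile the training features to get statistics and summaries that can
# seed feature expectations ("training_features.parquet" is a placeholder).
df = pd.read_parquet("training_features.parquet")
profile = ProfileReport(df, title="Training features", minimal=True)
profile.to_file("training_features_profile.html")
```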

93

94 of 104

Monitoring #3: features

  • Feature schema expectations

94

95 of 104

Monitoring #3: features schema with pydantic

95

https://pydantic-docs.helpmanual.io/usage/validators/
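
A minimal sketch of a feature schema with a pydantic (v1-style) validator, loosely mirroring the song-recommendation features from the earlier slides; the year bounds are arbitrary sanity limits:

```python
from pydantic import BaseModel, validator

class SongFeatures(BaseModel):
    song: str
    genre: str
    year: int
    artist: str
    first_position: bool

    @validator("year")
    def year_in_range(cls, v):
        # reject obviously corrupted values before they reach the model
        if not 1900 <= v <= 2030:
            raise ValueError(f"year out of expected range: {v}")
        return v

# SongFeatures(song="Shallow", genre="Pop", year=2020,
#              artist="Lady Gaga", first_position=True)   # passes
# ...the same record with year=20200 raises a ValidationError.
```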

96 of 104

Monitoring #3: features schema with TFX

96

How To Evaluate MLOps Tools (Hamel Husain, CS 329S Lecture 9, 2022)

97 of 104

Feature monitoring problems

  1. Compute & memory cost
    1. 100s models, each with 100s features
    2. Computing stats for 10000s of features is costly

97

98 of 104

Feature monitoring problems

  • Compute & memory cost
  • Alert fatigue
    • Most expectation violations are benign

98

99 of 104

Feature monitoring problems

  • Compute & memory cost
  • Alert fatigue
  • Schema management
    • Feature schema changes over time
    • Need to find a way to map feature to schema version

99

100 of 104

Monitoring toolbox: logs

  • Log everything
  • A stream processing problem

100

“If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it.”

Ian Malpass (Etsy 2011)

Vladimir Kazanov (Badoo 2019)

101 of 104

Monitoring toolbox: dashboards

  • Make monitoring accessible to non-engineering stakeholders
  • Good for visualizations but insufficient for discovering distribution shifts

101

102 of 104

Monitoring toolbox: alerts

  • 3 components
    • Alert policy: condition for alert
    • Notification channels
    • Description
  • Alert fatigue
    • How to send only meaningful alerts?

102

## Recommender model accuracy below 90%

${timestamp}: This alert originated from the service ${service-name}

103 of 104

Monitoring -> Continual Learning

  • Monitoring is passive
    • Wait for a shift to happen to detect it
  • Continual learning is active
    • Update your models to address shifts

103

104 of 104

Machine Learning Systems Design

Next class:

  • Continual Learning
  • Data Distribution Shifts on Streams with Shreya Shankar

cs329s.stanford.edu | Chip Huyen