Machine Learning Systems Design
Lecture 10: Data Distribution Shifts & Monitoring
CS 329S (Chip Huyen, 2022) | cs329s.stanford.edu
Zoom etiquette
We appreciate it if you keep your video on!
2
Agenda
3
Lecture notes are on the course website / syllabus
Natural labels & feedback loops
4
Natural labels
5
Natural labels
6
Natural labels: surprisingly common
⚠ Biases ⚠
7
Delayed labels
Time
Prediction is served
Feedback is provided
Feedback loop length
8
10
⚠ Labels are often assumed ⚠
Speed vs. accuracy tradeoff
Too short → false negatives
Too long → slow feedback
11
Causes of ML failures
13
14
“Guests complained their robot room assistants thought snoring sounds were commands and would wake them up repeatedly during the night.”
15
What is an ML failure?
A failure happens when one or more expectations of the system are violated.
Two types of expectations: operational expectations (e.g., latency, throughput, uptime) and ML performance expectations (e.g., accuracy)
16
ML system failures
18
ML system failures
19
20
Ops failures | ML failures
Visible | Often invisible
Causes of ops failures (software system failures)
21
Causes of ops failures (software system failures)
As tooling & best practices around ML production mature, there will be less surface area for software failures
22
60 out of 96 ML system failures were non-ML failures (Papasian & Underwood, 2020)
ML-specific failures (during/post deployment)
We’ve already covered problems pre-deployment in previous lectures!
23
Production data differing from training data
24
Production data differing from training data
Common & crucial. Will go into detail!
25
Edge cases
Zoom poll: Would you use this car?
26
Edge case vs. outlier
27
Degenerate feedback loops
Predictions → users' feedback → training data (loop)
28
Degenerate feedback loops: recsys
Model recommends item A
User clicks on A
Model confirms A is good
29
Degenerate feedback loops: recsys
Over time, recommendations become more homogeneous
30
Degenerate feedback loops: resume screening
Replace X with:
31
Degenerate feedback loops: resume screening
Tracking feature importance might help!
32
Detecting degenerate feedback loops
Degenerate feedback loops only arise once models are in production, which makes them hard to detect during training.
Predictions → users' feedback → training data (loop)
33
Degenerate feedback loops: detect
34
Beyond NDCG: behavioral testing of recommender systems with RecList (Chia et al., 2021)
Degenerate feedback loops: mitigate
35
Randomization
36
Randomization
37
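A minimal sketch (not from the lecture) of one way to add randomization: with a small probability, serve a random candidate instead of the model's top pick, so the feedback isn't generated entirely by the model's own past recommendations. The names `candidates`, `rank`, and `epsilon` are illustrative.

```python
import random

def choose_item_to_show(candidates, rank, epsilon=0.05):
    """Epsilon-greedy serving: a sketch of adding randomization.

    With probability epsilon, show a uniformly random candidate instead of the
    model's top-ranked one. The randomly shown items generate feedback that is
    not conditioned on what the model already recommends.
    """
    if random.random() < epsilon:
        return random.choice(candidates)   # exploration: break the feedback loop
    return rank(candidates)[0]             # exploitation: model's top pick
```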
Positional features
38
Positional features: naive
39
ID | Song | Genre | Year | Artist | User | 1st Position | Click |
1 | Shallow | Pop | 2020 | Lady Gaga | listenr32 | False | No |
2 | Good Vibe | Funk | 2019 | Funk Overlord | listenr32 | False | No |
3 | Beat It | Rock | 1989 | Michael Jackson | fancypants | False | No |
4 | In Bloom | Rock | 1991 | Nirvana | fancypants | True | Yes |
5 | Shallow | Pop | 2020 | Lady Gaga | listenr32 | True | Yes |
Positional features: naive
Problem: the model doesn't have this feature during inference, because the position isn't known until the recommendations are served.
Naive fix: set "1st Position" to False for every candidate during inference.
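A minimal sketch of the naive single-model approach above, assuming scikit-learn and toy columns loosely based on the table (the column names and encoding are illustrative, not from the lecture):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy version of the table above; "1st_position" is the positional feature.
train = pd.DataFrame({
    "genre_pop":    [1, 0, 0, 0, 1],
    "year":         [2020, 2019, 1989, 1991, 2020],
    "1st_position": [0, 0, 0, 1, 1],   # known at training time from the serving logs
    "click":        [0, 0, 0, 1, 1],
})
model = LogisticRegression().fit(train.drop(columns="click"), train["click"])

# At inference the position isn't known yet, so the naive fix is to score
# every candidate with the positional feature set to False (0).
candidates = pd.DataFrame({
    "genre_pop":    [1, 0],
    "year":         [2021, 1991],
    "1st_position": [0, 0],
})
click_probability = model.predict_proba(candidates)[:, 1]
```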
Positional features: 2 models
Model 1 predicts the probability that an item is seen and considered, taking its position into account; Model 2 predicts the probability of a click given that the item was seen and considered, and doesn't use positional features at all
42
Breakout exercise
43
How might degenerate feedback loops occur? (10 mins)
Discuss how you might mitigate the consequences of these feedback loops.
44
Data distribution shifts
45
46
Supervised learning: P(X, Y)
47
Types of data distribution shifts
48
Type | Meaning | Decomposition
Covariate shift | P(X) changes, but P(Y|X) stays the same | P(X, Y) = P(Y|X)P(X)
Label shift | P(Y) changes, but P(X|Y) stays the same | P(X, Y) = P(X|Y)P(Y)
Concept drift | P(Y|X) changes, but P(X) stays the same | P(X, Y) = P(Y|X)P(X)
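A small simulation (not from the lecture) can make the decompositions concrete: keep the labeling rule P(Y|X) fixed and change only P(X), which produces a covariate shift.

```python
import numpy as np

rng = np.random.default_rng(0)

def label(x):
    # Fixed labeling rule, so P(Y|X) is identical in both environments.
    return (x > 1.0).astype(int)

x_train = rng.normal(loc=0.0, scale=1.0, size=10_000)  # source (training) inputs
x_prod = rng.normal(loc=1.0, scale=1.0, size=10_000)   # production inputs: P(X) has shifted

y_train, y_prod = label(x_train), label(x_prod)

# Covariate shift: P(X) changed while P(Y|X) stayed the same.
# Because P(X) changed, P(Y) also changed (far more positives in production),
# which is why covariate shift often comes with label shift.
print(f"positive rate: train={y_train.mean():.2f}, prod={y_prod.mean():.2f}")
```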
Covariate shift
49
Covariate shift
50
Covariate shift: example
51
Covariate shift: causes (training)
52
Covariate shift: causes (prod)
Changes in environments
53
Covariate shift
55
Label shift
56
Label shift & covariate shift
57
A change in P(X) often leads to a change in P(Y), so covariate shift often coincides with label shift
Label shift & covariate shift
58
Not all label shifts are covariate shifts!
Concept Drift
59
Concept Drift
60
General data changes
61
General data changes
62
Detecting data distribution shifts
How to determine that two distributions are different?
63
Cumulative vs. sliding metrics
67
This image is based on an example from MadeWithML (Goku Mohandas).
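A sketch of the difference (hypothetical helper, assuming a stream of per-request accuracy values): cumulative metrics average over everything seen so far and can mask recent degradation, while sliding-window metrics only look at the most recent requests.

```python
from collections import deque

def cumulative_and_sliding(values, window=200):
    """Yield (cumulative_mean, sliding_window_mean) after each observation."""
    total, count = 0.0, 0
    recent = deque(maxlen=window)
    for v in values:
        total += v
        count += 1
        recent.append(v)
        yield total / count, sum(recent) / len(recent)

# Toy stream of per-request accuracy: quality drops sharply partway through.
stream = [1.0] * 500 + [0.5] * 200
for cum, sliding in cumulative_and_sliding(stream):
    pass  # in practice, log/plot both series over time

print(f"cumulative={cum:.2f}, sliding={sliding:.2f}")  # ~0.86 vs 0.50: cumulative masks the drop
```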
Detecting data distribution shifts
How to determine that two distributions are different?
68
E.g. a two-sample test such as the Kolmogorov–Smirnov (KS) test
69
Two-sample test
70
alibi-detect (open source)
Most tests work better on low-dimensional data, so dimensionality reduction is recommended beforehand!
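A minimal sketch of a two-sample KS test on a single feature using scipy (alibi-detect wraps similar tests with multivariate corrections); the 0.05 threshold is just the usual convention, not from the lecture.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # source window
prod_feature = rng.normal(loc=0.3, scale=1.0, size=5_000)   # target window

# Two-sample KS test: are these samples drawn from the same distribution?
stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.05:
    print(f"Possible shift detected (KS statistic={stat:.3f}, p={p_value:.4f})")
else:
    print("No significant shift detected")
```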
Not all shifts are equal
71
Not all shifts are equal
72
E.g. same users, same device, but behaviors change over time
Temporal shifts: time window scale matters
73
(Figure: a target distribution compared against two candidate source distributions — one suggests a shift, the other doesn't, depending on the time window used)
Temporal shifts: time window scale matters
74
Difficulty is compounded by seasonal variation
Addressing data distribution shifts
77
Addressing data distribution shifts
79
Need to figure out not just when to retrain models, but also how to retrain them and on what data
Monitoring & Observability
80
Monitoring vs. observability
81
Monitoring vs. observability
82
Instrumentation
Monitoring vs. observability
83
Observability is part of monitoring
Monitoring is all about metrics
84
Operational metrics
85
Operational metrics
86
SLA example
SLA for ML?
ML metrics: what to monitor
87
Monitoring #1: accuracy-related metrics
88
Monitoring #1: accuracy-related metrics
89
Monitoring #2: predictions
90
Monitoring #2: predictions
91
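Predictions are low-dimensional, so they are cheap to monitor. A sketch (names and thresholds are hypothetical) that tracks the positive-prediction rate over the most recent requests and flags large deviations from a baseline:

```python
from collections import deque

class PredictionRateMonitor:
    """Track the positive-prediction rate over recent requests and flag drift."""

    def __init__(self, baseline_rate, window=1000, tolerance=0.10):
        self.baseline_rate = baseline_rate   # e.g., positive rate on the validation set
        self.recent = deque(maxlen=window)
        self.tolerance = tolerance

    def observe(self, prediction):
        """Record one binary prediction; return True if the recent rate has drifted."""
        self.recent.append(prediction)
        rate = sum(self.recent) / len(self.recent)
        return abs(rate - self.baseline_rate) > self.tolerance

monitor = PredictionRateMonitor(baseline_rate=0.05)
# In the serving loop: if monitor.observe(pred): trigger an alert / investigation.
```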
Monitoring #3: features
92
Generate expectations with profiling & visualization
93
Monitoring #3: features
94
Monitoring #3: features schema with pydantic
95
https://pydantic-docs.helpmanual.io/usage/validators/
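A sketch of what a feature schema with pydantic validators might look like (the field names and bounds are made up for illustration):

```python
from pydantic import BaseModel, ValidationError, validator

class SongFeatures(BaseModel):
    """Expected schema for one feature row; pydantic raises on violations."""
    year: int
    genre: str
    first_position: bool

    @validator("year")
    def year_in_expected_range(cls, v):
        # Made-up expectation for illustration: release year should be plausible.
        if not 1900 <= v <= 2025:
            raise ValueError(f"year {v} is outside the expected range")
        return v

try:
    SongFeatures(year=1850, genre="Rock", first_position=True)
except ValidationError as err:
    print(err)  # surfaces the schema violation instead of silently passing bad features
```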
Monitoring #3: features schema with TFX
96
How To Evaluate MLOps Tools (Hamel Husain, CS 329S Lecture 9, 2022)
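TFX's data validation library (TFDV) can infer a schema from training statistics and check serving data against it. A rough sketch with toy DataFrames (the data and column names are illustrative):

```python
import pandas as pd
import tensorflow_data_validation as tfdv

train_df = pd.DataFrame({"year": [2019, 2020, 1991], "genre": ["Funk", "Pop", "Rock"]})
serving_df = pd.DataFrame({"year": [2021, 2022], "genre": ["Pop", "K-Pop"]})

# Infer a schema (expected types, domains) from training-data statistics...
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
schema = tfdv.infer_schema(train_stats)

# ...then check serving data against it and surface any anomalies
# (e.g., a genre value never seen during training).
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
anomalies = tfdv.validate_statistics(serving_stats, schema)
tfdv.display_anomalies(anomalies)
```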
Feature monitoring problems
97
Feature monitoring problems
98
Feature monitoring problems
99
Monitoring toolbox: logs
100
“If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it.”
Ian Malpass (Etsy 2011)
Vladimir Kazanov (Badoo 2019)
Monitoring toolbox: dashboards
101
Monitoring toolbox: alerts
102
## Recommender model accuracy below 90%
${timestamp}: This alert originated from the service ${service-name}
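A toy sketch of turning the alert policy above into code; the accuracy source, service name, and notification channel are all placeholders.

```python
import datetime

ACCURACY_THRESHOLD = 0.90  # alert policy from the slide: accuracy below 90%

def check_accuracy_alert(current_accuracy, service_name, notify):
    """Evaluate the alert condition; send the alert description if it fires."""
    if current_accuracy < ACCURACY_THRESHOLD:
        timestamp = datetime.datetime.utcnow().isoformat()
        notify(f"{timestamp}: Recommender model accuracy {current_accuracy:.1%} "
               f"is below {ACCURACY_THRESHOLD:.0%}. "
               f"This alert originated from the service {service_name}")

check_accuracy_alert(0.87, "recsys-serving", notify=print)  # notify could post to Slack/PagerDuty
```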
Monitoring → Continual Learning
103
Machine Learning Systems Design
Next class: Continual Learning
cs329s.stanford.edu | Chip Huyen