Predicting When Machine Learning Models Fail
© 2019 NAVER LABS. All rights reserved.
About Me
http://hadyelsahar.io
[Timeline: Masters and internships (2013-2018); PhD (2019); now Research Scientist]
My interests: Controlled Natural Language Generation, Domain Adaptation
Scribe: a Wikipedia editor for small languages with AI features.
3
Accuracy of AI models can degrade within days when production data differs from training data.
Motivation
Partially funded through the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 786741 (SMOOTH Platform).
Joint work with
The Hathaway Effect: How Anne Gives Warren Buffett a Rise https://www.huffpost.com/entry/the-hathaway-effect-how-a_b_830041
PS: this story is likely bogus, according to other investigations.
Algorithmic trading failure: Mirvish deduced that the algorithm used by the automated trading software was unable to differentiate content generated about Anne Hathaway, which ended up affecting the data for Berkshire Hathaway stock values as well.
Anne Hathaway
American actress
Google apologises for Photos app's racist blunder
https://www.bbc.com/news/technology-33347866
‘Rosetta Stone’ of AI slammed for ‘racist’ classification of people’s faces
600,000 Images Removed from AI Database After Art Project Exposes Racist Bias
Mis-classification of under-represented groups in data
BIKE SHARING NEWS
New feature from Lime lets Dallas riders rate bad scooter parking jobs
Maintaining ML models in production
OpenScale: monitoring AI deployed in business applications.
Maintaining ML models in production
Maintaining ML models in production
11
[Diagram of the maintenance loop: model in production → sample from data in production → manually annotate evaluation datasets → evaluate the model in production → performance?]
Maintaining ML models in production
12
[Diagram of the maintenance loop: model in production → sample from data in production → manually annotate evaluation datasets → evaluate the model in production → performance?]
This maintenance loop is:
The Goal of this Talk
13
Estimate the performance drop of a trained model on the target domain, without labeling any new examples.
How?
Domain-shift ➡️ Domain-shift detection metric ➡️ Drop in performance
14
“Domain-shift” “Data fracture” “Dataset shifts” “Changing environments”
Why does training data differ from data seen at production time?
Expectations
15
ML
Production
Data Monster/God
Figure inspired by Zack Lipton’s slides
Same Data Monster/God
Reality
16
ML
Production
Data Monster/God
Figure inspired by Zack Lipton’s slides
?
Domain-shift
Ps(x, y)
Pt(x, y)
Different Data Monster/God
Background: Domain-shift
17
Domain-shift
Multidisciplinary field, with no standard terminology:
Domain-shift = Data fracture = Dataset shifts = Changing environments
Types of Domain-shift:
18
Domain-shift
Source data ≠ Target data
Source data = Ps(x,y)
Target data = Pt(x,y)
Recall Bayes’ rule:
P(x,y) = P(y|x) P(x) = P(x|y) P(y)
19
#1 Causes of domain-shift: Covariate Shift
Motivation: HR Job candidate screening
20
The most common type of domain shift, but not the only one.
i.e., the model encounters unseen input examples in the target domain.
Ps(x) ≠ Pt(x)
but
Ps(y|x) = Pt(y|x)
Ps(x)
Pt(x)
#2 Causes of domain-shift: Label Shift
Motivation: Pneumonia prediction
Ps(y) ≈ 0.05% positive
Epidemic
Pt(y) ≈ 20% positive
21
Ps(x|y) = Pt(x|y)
but,
Ps(y) ≠ Pt(y)
#3 Causes of domain-shift: Concept Shift
Motivation: sentiment analysis
22
Ps(y|x) ≠ Pt(y|x)
but,
Ps(x) = Pt(x)
Source (“Mobile phones”): Ps(Positive | “long duration”) ⬆️
Target (“Customer service”): Pt(Positive | “long duration”) ⬇️
Or even a mix of different types of domain shift.
23
ML
Production
?
[Moreno-Torres et al., 2012; Quiñonero-Candela et al., 2009]
Covariate-Shift
Label shift
Concept shift
deal with it!
deal with it!
deal with it!
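To make the three definitions above concrete, here is a minimal, purely illustrative sketch (not from the talk) that simulates each kind of shift on a toy 1-D binary classification problem; all distributions and parameters are assumptions chosen for illustration.

```python
# Minimal sketch (illustrative): simulating the three kinds of domain shift.
import numpy as np

rng = np.random.default_rng(0)

def sample_source(n=1000):
    # Source Ps(x, y): Ps(y=1) = 0.5, class-conditional Gaussians N(2y, 1).
    y = rng.binomial(1, 0.5, size=n)
    x = rng.normal(loc=2.0 * y, scale=1.0)
    return x, y

def sample_covariate_shift(n=1000):
    # Pt(x) != Ps(x): inputs come from a different region of the input space...
    x = rng.normal(loc=3.0, scale=1.5, size=n)
    # ...but the labeling rule is unchanged, Pt(y|x) = Ps(y|x)
    # (posterior of the two source Gaussians with equal priors).
    p_y1 = 1.0 / (1.0 + np.exp(-2.0 * (x - 1.0)))
    y = rng.binomial(1, p_y1)
    return x, y

def sample_label_shift(n=1000, p_pos=0.20):
    # Pt(y) != Ps(y) (e.g. an epidemic raises the positive rate)...
    y = rng.binomial(1, p_pos, size=n)
    # ...but Pt(x|y) = Ps(x|y) stays the same.
    x = rng.normal(loc=2.0 * y, scale=1.0)
    return x, y

def sample_concept_shift(n=1000):
    # Pt(x) = Ps(x) (same input mixture) but Pt(y|x) != Ps(y|x):
    # the relation between inputs and labels is inverted.
    y = rng.binomial(1, 0.5, size=n)
    x = rng.normal(loc=2.0 * (1 - y), scale=1.0)
    return x, y
```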
24
Domain-shift detection metrics
Domain-shift detection Metrics
25
Measures of amplitude of domain shift, without target domain labels
26
1. H-Divergence, “A-distance” (Kifer et al., 2004; Ben-David et al., 2006-2010)
27
Proxy A-distance (PAD)
28
Sentiment Classifier
(Ctask)
performance drop = Accuracy_source - Accuracy_target
labeled (x,y)
unlabeled (x)
?
(x,y) ~ Ps(x,y)
x ~ Pt(x)
Proxy A-distance (PAD)
29
Sentiment Classifier
(Ctask)
performance drop
Domain Binary Classifier (Cdomain)
PAD = 1 - 2 ε(Cdomain)
?
(x,y) ~ Ps(x,y)
x ~ Pt(x)
Proxy A-distance (PAD)
30
Sentiment Classifier
performance drop
Domain Binary Classifier
PAD = 1 - 2 ε(Cdomain)
xs
xt
NEG
POS
?
x ~ Ps(x)
x ~ Pt(x)
Sentiment Classifier
(Ctask)
Domain Binary Classifier (Cdomain)
(x,y) ~ Ps(x,y)
Proxy A-distance (PAD)
31
Sentiment Classifier
performance drop
Domain Binary Classifier
PAD = 1 - 2 ε(Cdomain)
xs
xt
NEG
POS
?
x ~ Ps(x)
x ~ Pt(x)
Sentiment Classifier
(Ctask)
Domain Binary Classifier (Cdomain)
(x,y) ~ Ps(x,y)
(Blitzer et al. 2007)
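As a rough sketch of how PAD can be computed in practice (assuming scikit-learn, TF-IDF features, and a logistic-regression domain classifier, which are illustrative choices rather than the talk's exact setup): train a binary source-vs-target classifier on unlabeled texts from both domains, measure its held-out error ε(Cdomain), and apply the normalization used on the slides, PAD = 1 - 2 ε(Cdomain).

```python
# Minimal sketch (illustrative): Proxy A-distance from unlabeled texts.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(source_texts, target_texts, seed=0):
    texts = list(source_texts) + list(target_texts)
    # Domain labels: 0 = source, 1 = target.
    d = np.array([0] * len(source_texts) + [1] * len(target_texts))

    X = TfidfVectorizer(max_features=20000).fit_transform(texts)
    X_tr, X_te, d_tr, d_te = train_test_split(
        X, d, test_size=0.5, random_state=seed, stratify=d)

    # Binary domain classifier C_domain trained to tell the domains apart.
    clf = LogisticRegression(max_iter=1000).fit(X_tr, d_tr)
    err = 1.0 - clf.score(X_te, d_te)   # held-out domain error ε(C_domain)

    # Normalization used in the talk: PAD = 1 - 2ε (max value 1 when ε = 0).
    return 1.0 - 2.0 * err
```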
Proxy A-distance (PAD) - Goes Wrong
32
Sentiment Classifier
performance drop
Domain Binary Classifier
PAD = 1
e.g. add a task-irrelevant but domain-discriminating feature:
PAD = max value, always!
?
x ~ Ps(x)
x ~ Pt(x)
(x,y) ~ Ps(x,y)
Proxy A-distance Rectified (PAD*)
33
In-Domain
Out-Domain
performance drop
PAD*
Sentiment Classifier
~
~
Domain Classifier
Proxy A-distance Rectified (PAD*)
34
In-Domain
Out-Domain
performance drop
PAD*
~
task relevant encoder
~
Task encoder with fixed weights
linear layer for domain discrimination
Sentiment Classifier
Domain Classifier
Proxy A-distance Rectified (PAD*)
35
In-Domain
Out-Domain
performance drop
PAD*
~
task relevant encoder
~
Task encoder with fixed weights
linear layer for domain discrimination
Sentiment Classifier
Domain Classifier
Hint:
Optimizing both objectives simultaneously yields DANN (Domain-Adversarial Neural Networks)
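A sketch of the rectified metric under one plausible reading of the slides: the domain classifier is restricted to a linear layer on top of the task encoder's frozen representations, so only task-relevant features can contribute to the measured shift. The `encode` function below is a hypothetical stand-in for the frozen task encoder.

```python
# Minimal sketch (illustrative): PAD* on frozen task-encoder features.
# `encode(texts) -> np.ndarray [n, d]` is assumed to run the task model's
# encoder with fixed weights and return one feature vector per example.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def rectified_pad(encode, source_texts, target_texts, seed=0):
    Z = np.vstack([encode(source_texts), encode(target_texts)])
    d = np.array([0] * len(source_texts) + [1] * len(target_texts))

    Z_tr, Z_te, d_tr, d_te = train_test_split(
        Z, d, test_size=0.5, random_state=seed, stratify=d)

    # Linear layer for domain discrimination on top of task-relevant features.
    probe = LogisticRegression(max_iter=1000).fit(Z_tr, d_tr)
    err = 1.0 - probe.score(Z_te, d_te)
    return 1.0 - 2.0 * err   # same normalization as PAD
```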
2. Confidence-based metrics
36
Difference in confidence scores (CONF)
37
In-Domain
Out-Domain
~
Sentiment Classifier
Avg Confidence In-domain = 0.9
Avg Confidence Out-of-domain = 0.7
CONF = (0.9 - 0.7) = 0.2
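A minimal sketch of CONF, assuming a scikit-learn-style `predict_proba`: the metric is simply the gap between the average maximum class probability on in-domain data and on the unlabeled out-of-domain data.

```python
# Minimal sketch (illustrative): difference in average confidence (CONF).
import numpy as np

def conf_metric(predict_proba, in_domain_x, out_domain_x):
    # Confidence of an example = max class probability assigned by the model.
    conf_in = np.max(predict_proba(in_domain_x), axis=1).mean()
    conf_out = np.max(predict_proba(out_domain_x), axis=1).mean()
    return conf_in - conf_out   # e.g. 0.9 - 0.7 = 0.2
```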
Modern NN prediction probabilities (confidence scores) are not calibrated.
38
Solution: Calibration (Temperature scaling)
39
~
Sentiment Classifier
x: Came for lunch with my sister. We loved our Thai-style mains which were amazing with lots of flavour, very impressive for a vegetarian restaurant. But the service was below average and the chips were too terrible to finish. When we arrived at 1.40, we had to wait 20 minutes while they got our table ready. OK, so we didn't have a reservation, but the restaurant was only half full. There was no reason to make us wait at all.
logits
softmax
P(positive | x ) = 0.99
Trained sentiment classifier model.
uncalibrated confidence scores
Solution: Calibration (Temperature scaling)
40
~
Sentiment Classifier
x: Came for lunch with my sister. We loved our Thai-style mains which were amazing with lots of flavour, very impressive for a vegetarian restaurant. But the service was below average and the chips were too terrible to finish. When we arrived at 1.40, we had to wait 20 minutes while they got our table ready. OK, so we didn't have a reservation, but the restaurant was only half full. There was no reason to make us wait at all.
P(positive | x ) = 0.85
Calibration
calibrated (smoothed) confidence scores
Freeze Weights
➗ t
scaled logits
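A sketch of temperature scaling following the standard recipe (model weights frozen, a single scalar t fitted on held-out labeled in-domain logits by minimizing negative log-likelihood); the scipy-based optimization and the function names are illustrative assumptions.

```python
# Minimal sketch (illustrative): temperature scaling of a trained classifier.
# The model's weights stay frozen; only a scalar temperature t is learned on
# a held-out, labeled, in-domain validation set by minimizing the NLL.
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels):
    def nll(t):
        probs = softmax(val_logits / t)
        return -np.mean(np.log(probs[np.arange(len(val_labels)), val_labels] + 1e-12))
    # Search for the temperature that minimizes the validation NLL.
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

def calibrated_proba(logits, t):
    return softmax(logits / t)   # scaled logits -> calibrated confidence scores
```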
Difference in Calibrated confidence scores (CONF_CALIB)
41
In-Domain
Out-Domain
Sentiment Classifier
Avg calib. Confidence In-domain = 0.85
Avg calib. Confidence Out-of-domain = 0.75
CONF_CALIB = (0.85 - 0.75) = 0.1
~
➗ t
3. Reverse Classification Accuracy
42
Reverse Classification Accuracy (RCA)
“reverse testing” (Fan and Davidson, 2006; Zhong et al., 2010)
43
In-Domain
Out-Domain
Sentiment Classifier
Train
Inference
Out-Domain*
Pseudo-labels for the out-of-domain dataset.
Sentiment Classifier*
Train
Evaluate
Reverse Classification Accuracy (RCA)
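A sketch of RCA under the flow just described, assuming scikit-learn-style classifiers and pre-extracted feature matrices: the production model pseudo-labels the unlabeled out-of-domain data, a fresh classifier is trained on those pseudo-labels, and its accuracy back on the labeled in-domain evaluation set is the label-free signal. `make_classifier` is a hypothetical factory for a new model of the same family.

```python
# Minimal sketch (illustrative): Reverse Classification Accuracy (RCA).
# `model` is the classifier trained on the source domain (in production);
# `make_classifier()` returns a fresh, untrained classifier of the same family.
# All *_x arguments are feature matrices; source_eval_y are true source labels.

def reverse_classification_accuracy(model, make_classifier,
                                    target_x, source_eval_x, source_eval_y):
    # 1) Inference: pseudo-label the unlabeled out-of-domain data.
    pseudo_y = model.predict(target_x)

    # 2) Train a new classifier on (target_x, pseudo-labels).
    reverse_model = make_classifier().fit(target_x, pseudo_y)

    # 3) Evaluate the reverse classifier on the labeled in-domain evaluation set.
    #    A low score suggests the pseudo-labels (and hence the model) degrade on Dt.
    return reverse_model.score(source_eval_x, source_eval_y)
```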
Reverse Classification Accuracy - Rectified (RCA*)
44
Domain-shift detection Metrics
45
PAD | Proxy A-distance |
CONF | Average Difference in confidence scores |
RCA | Reverse classification accuracy |
Can measure the amplitude of domain shift, without labels from the target domain
We test 3 domain shift measures from 3 different families
46
Can we use domain-shift detection metrics to predict performance drop?
Domain shift ∝ Drop in performance?
47
[Scatter plot: values of each domain-shift detection metric vs. accuracy drop. Training domain: DVD; eval. domain: restaurants.]
48
Challenge 1: Some domain-shift detection metrics are task-irrelevant.
A large detected domain shift does not necessarily mean a large performance drop.
We exploit this using adversarial domain-shifts
49
[Scatter plots: Proxy A-Distance (PAD) vs. accuracy drop. False alarms: PAD = 1 (its max value) even for low accuracy drops, in particular under adversarial domain shift.]
Robust to task-irrelevant domain shifts
50
[Scatter plots: Before: Proxy A-Distance (PAD) vs. accuracy drop, with false alarms (PAD = 1, its max value, even for low accuracy drops). After: Modified Proxy A-Distance (PAD*) vs. accuracy drop, more robust to adversarial covariate shifts.]
Task relevant features for domain shift detection
51
PAD | Proxy A-distance |
PAD* | Proxy A-distance using task-learned features | Mod
CONF | Average difference in confidence scores |
CONF_CALIB | Average difference in calibrated confidence scores | Mod
RCA | Reverse classification accuracy |
RCA* | Reverse classification accuracy, rectified | Mod
Solution: task calibration of domain-shift detection metrics (3 new proposed metrics)
All can be computed without labels from the target domain
52
Challenge 2: mapping domain-shift detection metrics ➡️ actual drop in performance.
i.e., if PAD = 0.2, then the drop in performance = ?
53
Values of domain-shift metrics are model-dependent
[Scatter plot: values of a domain-shift detection metric vs. accuracy drop for two model configurations:
(1) Training domain: DVD; Architecture: BiLSTM, L=5; Embeddings: ELMo; Seed: 666
(2) Training domain: Books; Architecture: LSTM, L=2; Embeddings: RAND; Seed: 123]
54
Solution: estimating the drop through regression
[Pipeline diagram:
- Model in production, trained on the source domain Ds: x, y ~ Ps(x,y)
- Evaluation datasets from the available source domains: x, y ~ Ps2(x,y); x, y ~ Ps3(x,y); ...
- Target domain Dt (without labels): x ~ Pt(x)
- For each source evaluation dataset: calculate the domain-shift detection metric value and the accuracy drop
- Regression: domain-shift detection metric value → accuracy drop
- Output: estimated accuracy drop of the model on Dt]
55
Solution: estimating the drop through regression (same pipeline as above)
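A sketch of the regression step, assuming the setup above: each held-out source evaluation domain contributes one (metric value, observed accuracy drop) pair, a regressor is fit on those pairs, and the drop on Dt is predicted from its label-free metric value. The linear model is an illustrative choice, not necessarily the one used in the talk.

```python
# Minimal sketch (illustrative): estimating the accuracy drop on the target
# domain Dt by regressing observed drops on domain-shift metric values.
import numpy as np
from sklearn.linear_model import LinearRegression

def estimate_drop(metric_values_src, accuracy_drops_src, metric_value_target):
    # metric_values_src[i]  : domain-shift metric between Ds and the i-th
    #                         labeled source evaluation domain (Ps2, Ps3, ...)
    # accuracy_drops_src[i] : measured accuracy drop on that evaluation domain
    X = np.asarray(metric_values_src).reshape(-1, 1)
    y = np.asarray(accuracy_drops_src)

    reg = LinearRegression().fit(X, y)
    # Predict the drop on Dt from its (label-free) metric value.
    return float(reg.predict(np.array([[metric_value_target]]))[0])
```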
56
Experiments
Domain shift = Drop in performance?
57
We ran some experiments to simulate domain shift
Sentiment Classification
POS tagging
* 10x more examples and 40x more domain-shift scenarios than Blitzer et al. (2007)
- English language #BenderRule
58
Error of performance drop estimation
Mean absolute error (MAE) and max error (Max) of the performance drop prediction.
Lower is better
59
How many evaluation datasets are required?
[Plots: mean absolute error vs. number of source-domain evaluation datasets used for regression.
Sentiment Analysis: 22 datasets: error = 2.1%; 4 datasets: error = 3.1%.
POS Tagging: 7 datasets: error = 0.89%; 4 datasets: error = 1.03%.]
Wrap up
60
Analysis: domain-shift detection metrics are model- and task-dependent.
We propose: task-calibrated domain-shift detection metrics (PAD*, CONF_CALIB, RCA*) and a regression-based estimate of the performance drop.
61
THANK YOU
To Annotate or Not? Predicting Performance Drop under Domain Shift
References
62
H-divergence:
Reverse testing:
Confidence Calibration and out of distribution detection: