1 of 62

Hady Elsahar

hady.elsahar@naverlabs.com

@hadyelsahar

Predicting When Machine Learning Models Fail

© 2019 NAVER LABS. All rights reserved.

2 of 62

About Me

http://hadyelsahar.io

[Career timeline: internships (2018), PhD (2019), Research Scientist (now)]

My interests:

Controlled Natural Language Generation

  • Controlled Decoding
  • Self-supervised Summarization
  • Distributional Reinforcement Learning
  • Zero-shot generation
  • Sequence level training

Domain Adaptation

  • Domain shift detection
  • Model Calibration

Scribe: a Wikipedia editor for small languages with AI features.

[Timeline: Masters (2013-2014), Scribe (2015)]

3 of 62


Accuracy of AI models can degrade within days when production data differs from training data.

Motivation

Matthias Gallé

matthias.galle@naverlabs.com

@mgalle

Partially funded through the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 786741 (SMOOTH Platform).

Joint work with

4 of 62

5 of 62

The Hathaway Effect: How Anne Gives Warren Buffett a Rise https://www.huffpost.com/entry/the-hathaway-effect-how-a_b_830041

PS: this story is likely bogus, according to other investigations.

Algorithmic trading failure: Mirvish deduced that the algorithm used by the automated trading software was unable to differentiate content generated about Anne Hathaway from content about Berkshire Hathaway, and so news about the actress was affecting Berkshire Hathaway stock values too.

Anne Hathaway

American actress

6 of 62

Google apologises for Photos app's racist blunder

https://www.bbc.com/news/technology-33347866

‘Rosetta Stone’ of AI slammed for ‘racist’ classification of people’s faces

https://www.telegraph.co.uk/technology/2019/09/17/rosetta-stone-ai-slammed-racist-classification-peoples-faces/

600,000 Images Removed from AI Database After Art Project Exposes Racist Bias

https://hyperallergic.com/518822/600000-images-removed-from-ai-database-after-art-project-exposes-racist-bias/

Mis-classification of under-represented groups in data

7 of 62

BIKE SHARING NEWS

New feature from Lime lets Dallas riders rate bad scooter parking jobs

8 of 62

9 of 62

Maintaining ML models in production

10 of 62

OpenScale: monitoring AI deployed in business applications.

Maintaining ML models in production

11 of 62

Maintaining ML models in production


Model in production

Manually annotate evaluation datasets

Evaluate Model in production

Sample from data in production

Performance?

12 of 62

Maintaining ML models in production


Model in production

Manually annotate evaluation datasets

Evaluate Model in production

Sample from data in production

Performance?

This maintenance loop is:

  • Costly
  • Prohibitively slow (annotation takes time)
  • Never-ending!

13 of 62

The Goal of this Talk


Estimate the performance drop of a trained model on the target domain, without labeling any new examples.

How ?

Domain-shift ➡️ Domain-shift detection metric ➡️ Drop in performance

14 of 62


“Domain-shift” “Data fracture” “Dataset shifts” “Changing environments”

Why does training data differ from data seen at production time?

15 of 62

Expectations


ML

Production

Data Monster/God

Figure inspired by Zack Lipton’s slides

Same Data Monster/God

16 of 62

Reality


ML

Production

Data Monster/God

Figure inspired by Zack Lipton’s slides

?

Domain-shift

Ps(x, y)

Pt(x, y)

Different Data Monster/God

17 of 62

Background: Domain-shift


18 of 62

Domain-shift

Multidisciplinary field, with no standard terminology:

Domain-shift = Data fracture = Dataset shift = Changing environments

Types of domain-shift:

  • Covariate shift
  • Label shift
  • Concept shift
  • Mix of all above


19 of 62

Domain-shift

Source data ≠ Target data

Source Data = Ps(x,y)

Target Data = Pt(x,y)

Recall the product rule of probability (from which Bayes’ rule follows):

P(x,y) = P(y|x) P(x) = P(x|y) P(y)


20 of 62

#1 Causes of domain-shift: Covariate Shift

Motivation: HR Job candidate screening


The most common type of domain shift, but not the only one.

i.e., the model encounters unseen input examples in the target domain.

Ps(x) ≠ Pt(x)

but,

Ps(y|x) = Pt(y|x)

Ps(x)

Pt(x)

21 of 62

#2 Causes of domain-shift: Label Shift

Motivation: Pneumonia prediction

Ps(y): ~0.05% positive

Epidemic

Pt(y): ~20% positive


Ps(x|y) = Pt(x|y)

but,

Ps(y) ≠ Pt(y)

22 of 62

#3 Causes of domain-shift: Concept Shift

Motivation: sentiment analysis


Ps(y|x) ≠ Pt(y|x)

but,

Ps(x) = Pt(x)

Source: “Mobile phones” reviews

Ps( Positive | ”long duration” ) ⬆️ (long battery duration is good)

Target: “Customer service” reviews

Pt( Positive | ”long duration” ) ⬇️ (long call/waiting duration is bad)

23 of 62

Or even,

A mix between different types of domain-shift


ML

Production

?

[Moreno-Torres et al., 2012; Quiñonero-Candela et al., 2009]

Covariate-Shift

Label shift

Concept shift

deal with it!

deal with it!

deal with it!

24 of 62


Domain-shift detection metrics

25 of 62

Domain-shift detection Metrics


Measures of the amplitude of domain shift, computed without target-domain labels:

  1. H-divergence | A-distance [Kifer et al. 2004; Ben-David et al. 2006-2010]

  2. Confidence estimation for out-of-distribution detection [Subramanya et al. 2017; Liang et al. 2017; Chen et al. 2018; Lee et al. 2018]

  3. Reverse Classification Accuracy [Fan and Davidson 2006; Zhong et al. 2010]

26 of 62

  1. H-Divergence


27 of 62

H-Divergence “A-distance” (Kifer et al. 2004; Ben-David et al. 2006-2010)

  • A measure of domain discrepancy
  • “Upper bounds” the performance drop due to domain shift (it is not the exact performance drop)
  • Has theoretical guarantees, but is hard to compute exactly
  • Approximated by the “Proxy A-distance” → works in practice, but without theoretical guarantees
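For reference, a standard form of the quantity being approximated, following the definitions in Kifer et al. (2004) and Ben-David et al. (2006-2010); the notation below is a reconstruction, not copied from these slides:

```latex
% H-divergence between source D_S and target D_T over hypothesis class H:
d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) =
  2 \sup_{h \in \mathcal{H}}
  \left| \Pr_{x \sim \mathcal{D}_S}[h(x) = 1] - \Pr_{x \sim \mathcal{D}_T}[h(x) = 1] \right|
```

Computing the supremum exactly is intractable for rich hypothesis classes, which is why the proxy on the next slides trains a single domain classifier instead.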


28 of 62

Proxy A-distance (PAD)


Sentiment Classifier

(C_task)

performance drop = Accuracy_source - Accuracy_target

labeled (x,y)

unlabeled (x)

?

(x,y) ~ Ps(x,y)

x ~ Pt(x)

29 of 62

Proxy A-distance (PAD)


Sentiment Classifier

(C_task)

performance drop

Domain Binary Classifier (C_domain)

PAD = 1 - 2 ε(C_domain)

?

(x,y) ~ Ps(x,y)

x ~ Pt(x)

30 of 62

Proxy A-distance (PAD)


Sentiment Classifier

performance drop

Domain Binary Classifier

PAD = 1 - 2 ε(C_domain)

x_s

x_t

NEG

POS

?

x ~ Ps(x)

x ~ Pt(x)

Sentiment Classifier

(C_task)

Domain Binary Classifier (C_domain)

(x,y) ~ Ps(x,y)

31 of 62

Proxy A-distance (PAD)

(Blitzer et al. 2007)
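A minimal sketch of computing PAD as used on these slides (PAD = 1 - 2ε, where ε is the domain classifier's error); the bag-of-words features, logistic-regression domain classifier, and all names are illustrative choices, not the exact setup of Blitzer et al.:

```python
# Minimal sketch of Proxy A-distance (PAD) for text inputs.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(source_texts, target_texts):
    texts = list(source_texts) + list(target_texts)
    # Label 0 = source domain, 1 = target domain.
    domains = np.array([0] * len(source_texts) + [1] * len(target_texts))
    X = TfidfVectorizer(max_features=20000).fit_transform(texts)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, domains, test_size=0.5, stratify=domains, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    error = 1.0 - clf.score(X_te, y_te)   # domain classification error
    return 1.0 - 2.0 * error              # PAD normalization used on these slides
```

If the two domains are indistinguishable, the domain classifier's error approaches 0.5 and PAD approaches 0; if they are trivially separable, the error approaches 0 and PAD saturates at its maximum of 1.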

32 of 62

Proxy A-distance (PAD) - Goes Wrong


Sentiment Classifier

performance drop

Domain Binary Classifier

PAD = 1

Add: a task-irrelevant but domain-discriminating feature.

PAD = max value, always!

e.g.

  • Date
  • Title
  • Emojis

?

x ~ Ps(x)

x ~ Pt(x)

(x,y) ~ Ps(x,y)

33 of 62

Proxy A-distance Rectified (PAD*)


In-Domain

Out-Domain

performance drop

PAD*

Sentiment Classifier


Domain Classifier

34 of 62

Proxy A-distance Rectified (PAD*)


In-Domain

Out-Domain

performance drop

PAD*


task relevant encoder


Task encoder with fixed weights

linear layer for domain discrimination

Sentiment Classifier

Domain Classifier

35 of 62

Proxy A-distance Rectified (PAD*)

Hint:

Optimizing both objectives simultaneously yields DANN (Domain-Adversarial Neural Networks; Ganin et al., 2016).
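A minimal sketch of PAD*, assuming `encode` is a hypothetical helper exposing the trained task model's frozen feature extractor; only a linear probe is trained for domain discrimination, matching the "linear layer for domain discrimination" above:

```python
# Minimal sketch of PAD*: same as PAD, but the domain classifier is only a
# linear probe over the task model's frozen features. `encode` is a
# hypothetical helper mapping a text to its task-encoder representation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def rectified_pad(encode, source_texts, target_texts):
    Z = np.vstack([encode(t) for t in list(source_texts) + list(target_texts)])
    domains = np.array([0] * len(source_texts) + [1] * len(target_texts))
    Z_tr, Z_te, y_tr, y_te = train_test_split(
        Z, domains, test_size=0.5, stratify=domains, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)  # linear layer only
    error = 1.0 - probe.score(Z_te, y_te)
    return 1.0 - 2.0 * error
```

Because the frozen encoder was trained for the task, features that discriminate domains but are irrelevant to the task (dates, titles, emojis) are less likely to survive into the representation, which is what makes PAD* more robust.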

36 of 62

2. Confidence-based metrics

36

37 of 62

Difference in confidence scores (CONF)

37

In-Domain

Out-Domain


Sentiment Classifier

Avg confidence in-domain = 0.9

Avg confidence out-of-domain = 0.7

CONF = (0.9 - 0.7) = 0.2
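A minimal sketch of CONF, assuming `predict_proba` is a hypothetical helper returning the model's class probabilities for a batch of inputs:

```python
# Minimal sketch of CONF: the gap between average top-class confidence on
# in-domain vs. out-of-domain inputs. `predict_proba` is a hypothetical
# helper returning an (N, num_classes) array of probabilities.
import numpy as np

def conf_metric(predict_proba, in_x, out_x):
    conf_in = np.max(predict_proba(in_x), axis=1).mean()
    conf_out = np.max(predict_proba(out_x), axis=1).mean()
    return conf_in - conf_out   # e.g. 0.9 - 0.7 = 0.2
```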

38 of 62

Modern NNs’ prediction probabilities are not calibrated (Guo et al., 2017).


39 of 62

Solution: Calibration (Temperature scaling)



Sentiment Classifier

x: Came for lunch with my sister. We loved our Thai-style mains which were amazing with lots of flavour, very impressive for a vegetarian restaurant. But the service was below average and the chips were too terrible to finish. When we arrived at 1.40, we had to wait 20 minutes while they got our table ready. OK, so we didn't have a reservation, but the restaurant was only half full. There was no reason to make us wait at all.

logits

softmax

P(positive | x ) = 0.99

Trained sentiment classifier model.

uncalibrated confidence scores

40 of 62

Solution: Calibration (Temperature scaling)



Sentiment Classifier

x: (the same review as on the previous slide)

P(positive | x ) = 0.85

Calibration

Calibrated (smoothed) confidence scores

Freeze Weights

➗ t

scaled logits

  • t > 0: the temperature
  • t is a single parameter, learned on a validation set
  • The scalar t can be replaced by a full linear layer: “Platt scaling” (Platt, 1999)
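A minimal PyTorch sketch of fitting the temperature, assuming `logits` and `labels` come from a held-out validation set of the frozen trained model; the LBFGS setup is a common choice for this, and the details here are illustrative rather than this talk's exact implementation:

```python
# Minimal sketch of temperature scaling. `logits` (N x C float tensor) and
# `labels` (N long tensor) come from a held-out validation set; the trained
# model's weights stay frozen and only the scalar t is learned.
import torch

def fit_temperature(logits, labels):
    t = torch.ones(1, requires_grad=True)          # temperature, t > 0
    optimizer = torch.optim.LBFGS([t], lr=0.01, max_iter=100)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / t, labels)             # NLL of the scaled logits
        loss.backward()
        return loss

    optimizer.step(closure)
    return t.detach()

# At test time: calibrated probabilities = softmax(logits / t)
```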

41 of 62

Difference in Calibrated confidence scores (CONF_CALIB)


In-Domain

Out-Domain

Sentiment Classifier

Avg calibrated confidence in-domain = 0.85

Avg calibrated confidence out-of-domain = 0.75

CONF_CALIB = (0.85 - 0.75) = 0.1

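CONF_CALIB is then just CONF computed on temperature-scaled probabilities; a sketch combining the two helpers above (`logits_fn` is a hypothetical function returning raw logits for a batch, `t` comes from `fit_temperature`):

```python
# CONF_CALIB: the CONF metric computed with temperature-scaled probabilities.
import torch

def conf_calib(logits_fn, t, in_x, out_x):
    p_in = torch.softmax(logits_fn(in_x) / t, dim=1)
    p_out = torch.softmax(logits_fn(out_x) / t, dim=1)
    return (p_in.max(dim=1).values.mean()
            - p_out.max(dim=1).values.mean())   # e.g. 0.85 - 0.75 = 0.1
```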

42 of 62

3. Reverse Classification Accuracy


43 of 62

Reverse Classification Accuracy (RCA)

“reverse testing” (Fan and Davidson, 2006; Zhong et al., 2010)


In-Domain

Out-Domain

Sentiment Classifier

Train

Inference

Out-Domain*

Pseudo labels for out-domain dataset.

Sentiment Classifier*

Train

Evaluate

Reverse Classification Accuracy (RCA)
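A minimal sketch of the RCA loop above, where `train` and `evaluate` are hypothetical stand-ins for a real training and evaluation routine:

```python
# Minimal sketch of Reverse Classification Accuracy (RCA).
def reverse_classification_accuracy(train, evaluate, in_x, in_y, out_x):
    model = train(in_x, in_y)                   # 1. train on labeled in-domain data
    pseudo_y = model.predict(out_x)             # 2. pseudo-label the out-domain data
    reverse_model = train(out_x, pseudo_y)      # 3. retrain on the pseudo labels
    return evaluate(reverse_model, in_x, in_y)  # 4. accuracy back on in-domain data
```

A low RCA suggests the pseudo labels, and hence the original model, degrade under the shift.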

44 of 62

Reverse Classification Accuracy - Rectified (RCA*)

  • Our adapted version of RCA
  • Details skipped here (left for the discussion) to avoid confusion


45 of 62

Domain-shift detection Metrics


PAD

Proxy A-distance

CONF

Average Difference in confidence scores

RCA

Reverse classification accuracy

All can measure the amplitude of domain shift without labels from the target domain.

We test 3 domain-shift measures, one from each of these 3 families.

46 of 62


Can we use domain-shift detection metrics to predict performance drop?

47 of 62

Domain shift ➡️ Drop in performance?


[Scatter plot: Accuracy Drop (y-axis) vs. value of each domain-shift detection metric (x-axis)]

  • Each point corresponds to a single domain-shift scenario.
  • Each color is a single model, trained on one in-domain dataset and tested on several out-of-domain datasets.

Training domain: DVD

Eval. domain: restaurants

48 of 62


Challenge 1: Some domain-shift detection metrics are task-irrelevant.

A large detected domain-shift does not necessarily mean a large performance drop.

49 of 62

We demonstrate this failure mode using adversarial domain-shifts


[Scatter plots: Accuracy Drop (y-axis) vs. Proxy A-Distance (PAD) (x-axis), before and after adding an adversarial domain shift]

False alarms: PAD = 1 (its max value) even for small accuracy drops.

50 of 62

Robust to task-irrelevant domain-shifts


[Scatter plots: Accuracy Drop vs. PAD (before) and Accuracy Drop vs. Modified Proxy A-Distance (PAD*) (after)]

Before: false alarms; PAD = 1 (max value) even for small accuracy drops.

After: PAD* is more robust to adversarial covariate-shifts, since it uses task-relevant features for domain-shift detection.

51 of 62


Solution: task calibration of domain-shift detection metrics (3 newly proposed metrics, marked as modified):

  • PAD: Proxy A-distance
  • PAD* (modified): Proxy A-distance using task-learned features
  • CONF: Average difference in confidence scores
  • CONF_CALIB (modified): Average difference in calibrated confidence scores
  • RCA: Reverse classification accuracy
  • RCA* (modified): Reverse classification accuracy, rectified

All can be computed without labels from the target domain

52 of 62


Challenge 2: Mapping domain-shift detection metrics ➡️ actual drop in performance.

I.e., if PAD = 0.2, then what is the drop in performance?

53 of 62


Values of domain-shift metrics are model-dependent

[Scatter plot: Accuracy Drop (y-axis) vs. value of a domain-shift detection metric (x-axis)]

  • Each color is a single model trained on an in-domain dataset and tested on several out-of-domain datasets
  • A general correlation exists, but it is much stronger within a single model

Example models (one per color):
  • Training domain: DVD; Architecture: BiLSTM, L=5; Embeddings: ELMo; Seed: 666
  • Training domain: Books; Architecture: LSTM, L=2; Embeddings: random; Seed: 123

54 of 62


Solution: Estimating the drop through regression

  • The model in production is trained on the source domain Ds: (x, y) ~ Ps(x, y)
  • Collect evaluation datasets from the available source domains: (x, y) ~ Ps(x, y), Ps2(x, y), Ps3(x, y), ...
  • For each evaluation dataset, compute the domain-shift detection metric value and the true accuracy drop
  • Fit a regression from metric value to accuracy drop
  • For the target domain Dt (without labels, x ~ Pt(x)): calculate the domain-shift detection metric and use the regression to output the estimated accuracy drop of the model on Dt
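A minimal sketch of this regression step, assuming `metric(model, x)` computes one of the detection metrics above on unlabeled inputs and `eval_sets` holds the labeled source-domain evaluation datasets; the linear regressor and all names here are illustrative, not necessarily the paper's exact setup:

```python
# Minimal sketch: fit a regressor from domain-shift metric value to accuracy
# drop using labeled source-domain evaluation sets, then apply it to the
# unlabeled target domain. `metric` and `model` are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_drop_estimator(model, metric, eval_sets, source_accuracy):
    shifts, drops = [], []
    for x, y in eval_sets:                    # labeled (x, y) ~ Ps2, Ps3, ...
        shifts.append(metric(model, x))       # domain-shift metric value
        drops.append(source_accuracy - model.score(x, y))  # true accuracy drop
    return LinearRegression().fit(np.array(shifts).reshape(-1, 1), drops)

# At deployment time, on the unlabeled target domain Dt:
# est_drop = regressor.predict([[metric(model, target_x)]])[0]
```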

55 of 62


Solution: Estimating the drop through regression (continued)

The source-domain evaluation datasets have a non-negligible annotation cost, but:

  • They are annotated only once
  • They can be sampled with bias from Ds
  • Only a few are required

56 of 62


Experiments

  • Can we accurately estimate the performance drop?
  • How many evaluation datasets are required?

57 of 62

Domain shift = Drop in performance?


We ran some experiments to simulate domain shift

Sentiment Classification

  • Amazon + Yelp + IMDB
  • 5.8 million sentences
  • 24 domains (e.g. Books, DVDs, Restaurants, Movies)
  • 552 domain-shift scenarios*

POS tagging

  • Universal Dependencies dataset (English)
  • 474K tokens
  • 8 domains (e.g. News, Academic, Blog, Email, Fiction)
  • 56 domain-shift scenarios

* 10x more examples and 40x more domain-shift scenarios than (Blitzer et al., 2007)

- English language #BenderRule

58 of 62


Error of performance drop estimation

Mean absolute error (MAE) and max error (Max) of the performance drop prediction.

Lower is better

59 of 62


How many evaluation datasets are required?

[Plots: Mean absolute error (y-axis) vs. number of source-domain evaluation datasets used for regression (x-axis)]

Sentiment Analysis:
  • 22 datasets → Error = 2.1%
  • 4 datasets → Error = 3.1%

POS Tagging:
  • 7 datasets → Error = 0.89%
  • 4 datasets → Error = 1.03%

60 of 62

Wrap up


| Analysis

Domain-shift detection metrics are model- and task-dependent

| We propose

  • Task-dependent modifications to domain-shift detection metrics
  • A method to predict the performance drop of ML models
    • Cheap & fast: no target-domain annotations needed
    • Accurate: ±2.15% for sentiment analysis & ±0.89% for POS tagging

61 of 62


Hady Elsahar

hady.elsahar@naverlabs.com

@hadyelsahar

THANK YOU

To Annotate or Not? Predicting Performance Drop under Domain Shift

  • Paper : bit.ly/hadyelsahar-emnlp2019-paper1
  • Blog : bit.ly/hadyelsahar-emnlp2019-blog1

62 of 62

References


H-divergence:

  • Kifer et al. 2004 - Detecting change in data streams. VLDB
  • Ben-David et al. 2006 - Analysis of representations for domain adaptation
  • Ben-David et al. 2010 - A theory of learning from different domains. Machine Learning
  • Blitzer et al. 2007 - Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification

Reverse testing:

  • Fan and Davidson 2006 - Reverse testing: An efficient framework to select amongst classifiers under sample selection bias
  • Zhong et al. 2010 - Cross validation framework to choose amongst models and datasets for transfer learning
  • Bhaskaruni et al. 2018 - Prediction Qualities without Ground Truth: A Revisit of the Reverse Testing Framework

Confidence Calibration and out of distribution detection:

  • Chen et al. 2018 - Confidence Scoring using Whitebox Meta-models with Linear Classifier Probes
  • Lee et al. 2018 - Training Confidence-calibrated Classifiers for Detecting Out-of-distribution Samples
  • DeVries et al. 2018 - Learning Confidence for Out-of-Distribution Detection in Neural Networks
  • Mandelbaum et al. 2017 - Distance-based Confidence Score for Neural Network Classifiers
  • Subramanya et al. 2017 - Confidence Estimation in Deep Neural Networks via Density Modelling
  • Guo et al. 2017 - On Calibration of Modern Neural Networks
  • Liang et al. 2017 - Principled Detection of Out-of-Distribution Examples in Neural Networks
  • Liang et al. 2017 - Enhancing the Reliability of Out-of-distribution Image Detection in Neural Networks
  • Lakshminarayanan et al. 2016 - Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
  • Hendrycks et al. 2016 - A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks