1 of 62

Hady Elsahar

hady.elsahar@naverlabs.com

@hadyelsahar

Predicting When Machine Learning Models Fail

© 2019 NAVER LABS. All rights reserved.

2 of 62

About Me

http://hadyelsahar.io

[Career timeline: internships (2018), PhD (2019), Research Scientist (now)]

My interests:

Controlled Natural Language Generation

  • Controlled Decoding
  • Self-supervised Summarization
  • Distributional Reinforcement Learning
  • Zero-shot generation
  • Sequence level training

Domain Adaptation

  • Domain shift detection
  • Model Calibration

Scribe: a Wikipedia editor for small languages with AI features.

[Timeline: Masters (2013-2014), Scribe (2015)]

3 of 62


Accuracy of AI models can degrade within days when production data differs from training data.

Motivation

Matthias Gallé

matthias.galle@naverlabs.com

@mgalle

Partially funded through the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 786741 (SMOOTH Platform).

Joint work with

4 of 62

5 of 62

The Hathaway Effect: How Anne Gives Warren Buffett a Rise https://www.huffpost.com/entry/the-hathaway-effect-how-a_b_830041

PS: this story is likely bogus, according to other investigations.

Algorithmic trading failure: Mirvish deduced that the algorithm used by the automated trading software was unable to differentiate content generated about Anne Hathaway from content about Berkshire Hathaway, and so news about the actress was affecting Berkshire Hathaway stock values too.

Anne Hathaway

American actress

6 of 62

Google apologises for Photos app's racist blunder

https://www.bbc.com/news/technology-33347866

‘Rosetta Stone’ of AI slammed for ‘racist’ classification of people’s faces

https://www.telegraph.co.uk/technology/2019/09/17/rosetta-stone-ai-slammed-racist-classification-peoples-faces/

600,000 Images Removed from AI Database After Art Project Exposes Racist Bias

https://hyperallergic.com/518822/600000-images-removed-from-ai-database-after-art-project-exposes-racist-bias/

Mis-classification of under-represented groups in data

7 of 62

BIKE SHARING NEWS

New feature from Lime lets Dallas riders rate bad scooter parking jobs

8 of 62

9 of 62

Maintaining ML models in production

10 of 62

OpenScale: monitoring AI deployed in business applications.

Maintaining ML models in production

11 of 62

Maintaining ML models in production


Model in production

Manually annotate evaluation datasets

Evaluate Model in production

Sample from data in production

Performance?

12 of 62

Maintaining ML models in production


Model in production

Manually annotate evaluation datasets

Evaluate Model in production

Sample from data in production

Performance?

This maintenance loop is:

  • Costly
  • Prohibitively slow (annotation takes time)
  • Never-ending!

13 of 62

The Goal of this Talk


Estimate the performance drop of a trained model on the target domain, without labeling any new examples.

How ?

Domain-shift ➡️ Domain-shift detection metric ➡️ Drop in performance

14 of 62


“Domain-shift” “Data fracture” “Dataset shifts” “Changing environments”

Why does training data differ from data seen at production time?

15 of 62

Expectations


ML

Production

Data Monster/God

Figure inspired by Zack Lipton’s slides

Same Data Monster/God

16 of 62

Reality


ML

Production

Data Monster/God

Figure inspired by Zack Lipton’s slides

?

Domain-shift

Ps(x, y)

Pt(x, y)

Different Data Monster/God

17 of 62

Background: Domain-shift


18 of 62

Domain-shift

Multidisciplinary field, with no standard terminology:

Domain-shift = Data fracture = Dataset shift = Changing environments

Types of domain-shift:

  • Covariate shift
  • Label shift
  • Concept shift
  • Mix of all above


19 of 62

Domain-shift

Source data ≠ Target data

Source Data = Ps(x,y)

Target Data = Pt(x,y)

Recall the product rule of probability (from which Bayes’ rule follows):

P(x,y) = P(y|x) P(x) = P(x|y) P(y)


20 of 62

#1 Causes of domain-shift: Covariate Shift

Motivation: HR Job candidate screening


The most common type of domain shift, but not the only one.

i.e., the model encounters unseen input examples in the target domain.

Ps(x) ≠ Pt(x)

but,

Ps(y|x) = Pt(y|x)

Ps(x)

Pt(x)

21 of 62

#2 Causes of domain-shift: Label Shift

Motivation: Pneumonia prediction

Ps(y): ~0.05% positive

Epidemic

Pt(y): ~20% positive


Ps(x|y) = Pt(x|y)

but,

Ps(y) ≠ Pt(y)

22 of 62

#3 Causes of domain-shift: Concept Shift

Motivation: sentiment analysis


Ps(y|x) ≠ Pt(y|x)

but,

Ps(x) = Pt(x)

Source: “Mobile phones” reviews

Ps( Positive | ”long duration” ) ⬆️ (long battery duration is good)

Target: “Customer service” reviews

Pt( Positive | ”long duration” ) ⬇️ (long call/waiting duration is bad)

23 of 62

Or even,

A mix between different types of domain-shift


ML

Production

?

[Moreno-Torres et al., 2012; Quiñonero-Candela et al., 2009]

Covariate-Shift

Label shift

Concept shift

deal with it!

deal with it!

deal with it!

24 of 62


Domain-shift detection metrics

25 of 62

Domain-shift detection Metrics


Measures of the amplitude of domain shift, computed without target-domain labels:

  1. H-divergence | A-distance [Kifer et al. 2004; Ben-David et al. 2006-2010]

  2. Confidence estimation for out-of-distribution detection [Subramanya et al. 2017; Liang et al. 2017; Chen et al. 2018; Lee et al. 2018]

  3. Reverse Classification Accuracy [Fan and Davidson 2006; Zhong et al. 2010]

26 of 62

  1. H-Divergence


27 of 62

H-Divergence “A-distance” (Kifer et al. 2004; Ben-David et al. 2006-2010)

  • A measure of domain discrepancy
  • “Upper bounds” the performance drop due to domain shift (it is not the exact performance drop)
  • Has theoretical guarantees, but is hard to compute exactly
  • Approximated by the “Proxy A-distance” → works in practice, but without theoretical guarantees
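For reference, a standard form of the quantity being approximated, following the definitions in Kifer et al. (2004) and Ben-David et al. (2006-2010); the notation below is a reconstruction, not copied from these slides:

```latex
% H-divergence between source D_S and target D_T over hypothesis class H:
d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) =
  2 \sup_{h \in \mathcal{H}}
  \left| \Pr_{x \sim \mathcal{D}_S}[h(x) = 1] - \Pr_{x \sim \mathcal{D}_T}[h(x) = 1] \right|
```

Computing the supremum exactly is intractable for rich hypothesis classes, which is why the proxy on the next slides trains a single domain classifier instead.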


28 of 62

Proxy A-distance (PAD)


Sentiment Classifier

(C_task)

performance drop = Accuracy_source - Accuracy_target

labeled (x,y)

unlabeled (x)

?

(x,y) ~ Ps(x,y)

x ~ Pt(x)

29 of 62

Proxy A-distance (PAD)


Sentiment Classifier

(C_task)

performance drop

Domain Binary Classifier (C_domain)

PAD = 1 - 2 ε(C_domain)

?

(x,y) ~ Ps(x,y)

x ~ Pt(x)

30 of 62

Proxy A-distance (PAD)


Sentiment Classifier

performance drop

Domain Binary Classifier

PAD = 1 - 2 ε(C_domain)

x_s

x_t

NEG

POS

?

x ~ Ps(x)

x ~ Pt(x)

Sentiment Classifier

(C_task)

Domain Binary Classifier (C_domain)

(x,y) ~ Ps(x,y)

31 of 62

Proxy A-distance (PAD)

(Blitzer et al. 2007)
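A minimal sketch of computing PAD as used on these slides (PAD = 1 - 2ε, where ε is the domain classifier's error); the bag-of-words features, logistic-regression domain classifier, and all names are illustrative choices, not the exact setup of Blitzer et al.:

```python
# Minimal sketch of Proxy A-distance (PAD) for text inputs.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(source_texts, target_texts):
    texts = list(source_texts) + list(target_texts)
    # Label 0 = source domain, 1 = target domain.
    domains = np.array([0] * len(source_texts) + [1] * len(target_texts))
    X = TfidfVectorizer(max_features=20000).fit_transform(texts)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, domains, test_size=0.5, stratify=domains, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    error = 1.0 - clf.score(X_te, y_te)   # domain classification error
    return 1.0 - 2.0 * error              # PAD normalization used on these slides
```

If the two domains are indistinguishable, the domain classifier's error approaches 0.5 and PAD approaches 0; if they are trivially separable, the error approaches 0 and PAD saturates at its maximum of 1.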

32 of 62

Proxy A-distance (PAD) - Goes Wrong


Sentiment Classifier

performance drop

Domain Binary Classifier

PAD = 1

Add: a task-irrelevant but domain-discriminating feature.

PAD = max value, always!

e.g.

  • Date
  • Title
  • Emojis

?

x ~ Ps(x)

x ~ Pt(x)

(x,y) ~ Ps(x,y)

33 of 62

Proxy A-distance Rectified (PAD*)


In-Domain

Out-Domain

performance drop

PAD*

Sentiment Classifier


Domain Classifier

34 of 62

Proxy A-distance Rectified (PAD*)


In-Domain

Out-Domain

performance drop

PAD*


task relevant encoder


Task encoder with fixed weights

linear layer for domain discrimination

Sentiment Classifier

Domain Classifier

35 of 62

Proxy A-distance Rectified (PAD*)

Hint:

Optimizing both objectives simultaneously yields DANN (Domain-Adversarial Neural Networks; Ganin et al., 2016).
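A minimal sketch of PAD*, assuming `encode` is a hypothetical helper exposing the trained task model's frozen feature extractor; only a linear probe is trained for domain discrimination, matching the "linear layer for domain discrimination" above:

```python
# Minimal sketch of PAD*: same as PAD, but the domain classifier is only a
# linear probe over the task model's frozen features. `encode` is a
# hypothetical helper mapping a text to its task-encoder representation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def rectified_pad(encode, source_texts, target_texts):
    Z = np.vstack([encode(t) for t in list(source_texts) + list(target_texts)])
    domains = np.array([0] * len(source_texts) + [1] * len(target_texts))
    Z_tr, Z_te, y_tr, y_te = train_test_split(
        Z, domains, test_size=0.5, stratify=domains, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)  # linear layer only
    error = 1.0 - probe.score(Z_te, y_te)
    return 1.0 - 2.0 * error
```

Because the frozen encoder was trained for the task, features that discriminate domains but are irrelevant to the task (dates, titles, emojis) are less likely to survive into the representation, which is what makes PAD* more robust.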

36 of 62

2. Confidence-based metrics

36

37 of 62

Difference in confidence scores (CONF)

37

In-Domain

Out-Domain


Sentiment Classifier

Avg confidence in-domain = 0.9

Avg confidence out-of-domain = 0.7

CONF = (0.9 - 0.7) = 0.2
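A minimal sketch of CONF, assuming `predict_proba` is a hypothetical helper returning the model's class probabilities for a batch of inputs:

```python
# Minimal sketch of CONF: the gap between average top-class confidence on
# in-domain vs. out-of-domain inputs. `predict_proba` is a hypothetical
# helper returning an (N, num_classes) array of probabilities.
import numpy as np

def conf_metric(predict_proba, in_x, out_x):
    conf_in = np.max(predict_proba(in_x), axis=1).mean()
    conf_out = np.max(predict_proba(out_x), axis=1).mean()
    return conf_in - conf_out   # e.g. 0.9 - 0.7 = 0.2
```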

38 of 62

Modern NNs’ prediction probabilities are not calibrated (Guo et al., 2017).


39 of 62

Solution: Calibration (Temperature scaling)



Sentiment Classifier

x: Came for lunch with my sister. We loved our Thai-style mains which were amazing with lots of flavour, very impressive for a vegetarian restaurant. But the service was below average and the chips were too terrible to finish. When we arrived at 1.40, we had to wait 20 minutes while they got our table ready. OK, so we didn't have a reservation, but the restaurant was only half full. There was no reason to make us wait at all.

logits

softmax

P(positive | x ) = 0.99

Trained sentiment classifier model.

uncalibrated confidence scores

40 of 62

Solution: Calibration (Temperature scaling)



Sentiment Classifier

x: (the same review as on the previous slide)

P(positive | x ) = 0.85

Calibration

Calibrated (smoothed) confidence scores

Freeze Weights

➗ t

scaled logits

  • t > 0: the temperature
  • t is a single parameter, learned on a validation set
  • The scalar t can be replaced by a full linear layer: “Platt scaling” (Platt, 1999)
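A minimal PyTorch sketch of fitting the temperature, assuming `logits` and `labels` come from a held-out validation set of the frozen trained model; the LBFGS setup is a common choice for this, and the details here are illustrative rather than this talk's exact implementation:

```python
# Minimal sketch of temperature scaling. `logits` (N x C float tensor) and
# `labels` (N long tensor) come from a held-out validation set; the trained
# model's weights stay frozen and only the scalar t is learned.
import torch

def fit_temperature(logits, labels):
    t = torch.ones(1, requires_grad=True)          # temperature, t > 0
    optimizer = torch.optim.LBFGS([t], lr=0.01, max_iter=100)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / t, labels)             # NLL of the scaled logits
        loss.backward()
        return loss

    optimizer.step(closure)
    return t.detach()

# At test time: calibrated probabilities = softmax(logits / t)
```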

41 of 62

Difference in Calibrated confidence scores (CONF_CALIB)


In-Domain

Out-Domain

Sentiment Classifier

Avg calibrated confidence in-domain = 0.85

Avg calibrated confidence out-of-domain = 0.75

CONF_CALIB = (0.85 - 0.75) = 0.1

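CONF_CALIB is then just CONF computed on temperature-scaled probabilities; a sketch combining the two helpers above (`logits_fn` is a hypothetical function returning raw logits for a batch, `t` comes from `fit_temperature`):

```python
# CONF_CALIB: the CONF metric computed with temperature-scaled probabilities.
import torch

def conf_calib(logits_fn, t, in_x, out_x):
    p_in = torch.softmax(logits_fn(in_x) / t, dim=1)
    p_out = torch.softmax(logits_fn(out_x) / t, dim=1)
    return (p_in.max(dim=1).values.mean()
            - p_out.max(dim=1).values.mean())   # e.g. 0.85 - 0.75 = 0.1
```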

42 of 62

3. Reverse Classification Accuracy


43 of 62

Reverse Classification Accuracy (RCA)

“reverse testing” (Fan and Davidson, 2006; Zhong et al., 2010)


In-Domain

Out-Domain

Sentiment Classifier

Train

Inference

Out-Domain*

Pseudo labels for out-domain dataset.

Sentiment Classifier*

Train

Evaluate

Reverse Classification Accuracy (RCA)
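A minimal sketch of the RCA loop above, where `train` and `evaluate` are hypothetical stand-ins for a real training and evaluation routine:

```python
# Minimal sketch of Reverse Classification Accuracy (RCA).
def reverse_classification_accuracy(train, evaluate, in_x, in_y, out_x):
    model = train(in_x, in_y)                   # 1. train on labeled in-domain data
    pseudo_y = model.predict(out_x)             # 2. pseudo-label the out-domain data
    reverse_model = train(out_x, pseudo_y)      # 3. retrain on the pseudo labels
    return evaluate(reverse_model, in_x, in_y)  # 4. accuracy back on in-domain data
```

A low RCA suggests the pseudo labels, and hence the original model, degrade under the shift.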

44 of 62

Reverse Classification Accuracy - Rectified (RCA*)

  • Our adapted version of RCA
  • Details skipped here (left for the discussion) to avoid confusion


45 of 62

Domain-shift detection Metrics


PAD

Proxy A-distance

CONF

Average Difference in confidence scores

RCA

Reverse classification accuracy

All can measure the amplitude of domain shift without labels from the target domain.

We test 3 domain-shift measures, one from each of these 3 families.

46 of 62


Can we use domain-shift detection metrics to predict performance drop?

47 of 62

Domain shift ➡️ Drop in performance?


[Scatter plot: Accuracy Drop (y-axis) vs. value of each domain-shift detection metric (x-axis)]

  • Each point corresponds to a single domain-shift scenario.
  • Each color is a single model, trained on one in-domain dataset and tested on several out-of-domain datasets.

Training domain: DVD

Eval. domain: restaurants

48 of 62


Challenge 1: Some domain-shift detection metrics are task-irrelevant.

A large detected domain-shift does not necessarily mean a large performance drop.

49 of 62

We demonstrate this failure mode using adversarial domain-shifts


[Scatter plots: Accuracy Drop (y-axis) vs. Proxy A-Distance (PAD) (x-axis), before and after adding an adversarial domain shift]

False alarms: PAD = 1 (its max value) even for small accuracy drops.

50 of 62

Robust to task-irrelevant domain-shifts


[Scatter plots: Accuracy Drop vs. PAD (before) and Accuracy Drop vs. Modified Proxy A-Distance (PAD*) (after)]

Before: false alarms; PAD = 1 (max value) even for small accuracy drops.

After: PAD* is more robust to adversarial covariate-shifts, since it uses task-relevant features for domain-shift detection.

51 of 62


Solution: task calibration of domain-shift detection metrics (3 newly proposed metrics, marked as modified):

  • PAD: Proxy A-distance
  • PAD* (modified): Proxy A-distance using task-learned features
  • CONF: Average difference in confidence scores
  • CONF_CALIB (modified): Average difference in calibrated confidence scores
  • RCA: Reverse classification accuracy
  • RCA* (modified): Reverse classification accuracy, rectified

All can be computed without labels from the target domain

52 of 62


Challenge 2: Mapping domain-shift detection metrics ➡️ actual drop in performance.

I.e., if PAD = 0.2, then what is the drop in performance?

53 of 62


Values of domain-shift metrics are model-dependent

[Scatter plot: Accuracy Drop (y-axis) vs. value of a domain-shift detection metric (x-axis)]

  • Each color is a single model trained on an in-domain dataset and tested on several out-of-domain datasets
  • A general correlation exists, but it is much stronger within a single model

Example models (one per color):
  • Training domain: DVD; Architecture: BiLSTM, L=5; Embeddings: ELMo; Seed: 666
  • Training domain: Books; Architecture: LSTM, L=2; Embeddings: random; Seed: 123

54 of 62


Solution: Estimating the drop through regression

  • The model in production is trained on the source domain Ds: (x, y) ~ Ps(x, y)
  • Collect evaluation datasets from the available source domains: (x, y) ~ Ps(x, y), Ps2(x, y), Ps3(x, y), ...
  • For each evaluation dataset, compute the domain-shift detection metric value and the true accuracy drop
  • Fit a regression from metric value to accuracy drop
  • For the target domain Dt (without labels, x ~ Pt(x)): calculate the domain-shift detection metric and use the regression to output the estimated accuracy drop of the model on Dt
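A minimal sketch of this regression step, assuming `metric(model, x)` computes one of the detection metrics above on unlabeled inputs and `eval_sets` holds the labeled source-domain evaluation datasets; the linear regressor and all names here are illustrative, not necessarily the paper's exact setup:

```python
# Minimal sketch: fit a regressor from domain-shift metric value to accuracy
# drop using labeled source-domain evaluation sets, then apply it to the
# unlabeled target domain. `metric` and `model` are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_drop_estimator(model, metric, eval_sets, source_accuracy):
    shifts, drops = [], []
    for x, y in eval_sets:                    # labeled (x, y) ~ Ps2, Ps3, ...
        shifts.append(metric(model, x))       # domain-shift metric value
        drops.append(source_accuracy - model.score(x, y))  # true accuracy drop
    return LinearRegression().fit(np.array(shifts).reshape(-1, 1), drops)

# At deployment time, on the unlabeled target domain Dt:
# est_drop = regressor.predict([[metric(model, target_x)]])[0]
```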

55 of 62


Solution: Estimating the drop through regression (continued)

The source-domain evaluation datasets have a non-negligible annotation cost, but:

  • They are annotated only once
  • They can be sampled with bias from Ds
  • Only a few are required

56 of 62


Experiments

  • Can we accurately estimate the performance drop?
  • How many evaluation datasets are required?

57 of 62

Domain shift = Drop in performance?


We ran some experiments to simulate domain shift

Sentiment Classification

  • Amazon + Yelp + IMDB
  • 5.8 million sentences
  • 24 domains (e.g. Books, DVDs, Restaurants, Movies)
  • 552 domain-shift scenarios*

POS tagging

  • Universal Dependencies dataset (English)
  • 474K tokens
  • 8 domains (e.g. News, Academic, Blog, Email, Fiction)
  • 56 domain-shift scenarios

* 10x more examples and 40x more domain-shift scenarios than (Blitzer et al., 2007)

- English language #BenderRule

58 of 62


Error of performance drop estimation

Mean absolute error (MAE) and max error (Max) of the performance drop prediction.

Lower is better

59 of 62


How many evaluation datasets are required?

[Plots: Mean absolute error (y-axis) vs. number of source-domain evaluation datasets used for regression (x-axis)]

Sentiment Analysis:
  • 22 datasets → Error = 2.1%
  • 4 datasets → Error = 3.1%

POS Tagging:
  • 7 datasets → Error = 0.89%
  • 4 datasets → Error = 1.03%

60 of 62

Wrap up


| Analysis

Domain-shift detection metrics are model- and task-dependent

| We propose

  • Task-dependent modifications to domain-shift detection metrics
  • A method to predict the performance drop of ML models
    • Cheap & fast: no target-domain annotations needed
    • Accurate: ±2.15% for sentiment analysis & ±0.89% for POS tagging

61 of 62


Hady Elsahar

hady.elsahar@naverlabs.com

@hadyelsahar

THANK YOU

To Annotate or Not? Predicting Performance Drop under Domain Shift

  • Paper : bit.ly/hadyelsahar-emnlp2019-paper1
  • Blog : bit.ly/hadyelsahar-emnlp2019-blog1

62 of 62

References


H-divergence:

  • Kifer et al. 2004 - Detecting change in data streams. VLDB
  • Ben-David et al. 2006 - Analysis of representations for domain adaptation
  • Ben-David et al. 2010 - A theory of learning from different domains. Machine Learning
  • Blitzer et al. 2007 - Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification

Reverse testing:

  • Fan and Davidson 2006 - Reverse testing: An efficient framework to select amongst classifiers under sample selection bias
  • Zhong et al. 2010 - Cross validation framework to choose amongst models and datasets for transfer learning
  • Bhaskaruni et al. 2018 - Prediction Qualities without Ground Truth: A Revisit of the Reverse Testing Framework

Confidence Calibration and out of distribution detection:

  • Chen et al. 2018 - Confidence Scoring using Whitebox Meta-models with Linear Classifier Probes
  • Lee et al. 2018 - Training Confidence-calibrated Classifiers for Detecting Out-of-distribution Samples
  • DeVries et al. 2018 - Learning Confidence for Out-of-Distribution Detection in Neural Networks
  • Mandelbaum et al. 2017 - Distance-based Confidence Score for Neural Network Classifiers
  • Subramanya et al. 2017 - Confidence Estimation in Deep Neural Networks via Density Modelling
  • Guo et al. 2017 - On Calibration of Modern Neural Networks
  • Liang et al. 2017 - Principled Detection of Out-of-Distribution Examples in Neural Networks
  • Liang et al. 2017 - Enhancing the Reliability of Out-of-distribution Image Detection in Neural Networks
  • Lakshminarayanan et al. 2016 - Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
  • Hendrycks et al. 2016 - A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks