1 of 271

Interpreting Machine Learning Models: State-of-the-Art, Challenges, Opportunities

Hima Lakkaraju

2 of 271

Schedule for Today

  • 9:00am to 10:20am: Introduction; Overview of Inherently Interpretable Models

  • 10:20am to 10:40am: Break

  • 10:40am to 12:00pm: Overview of Post hoc Explanation Methods

  • 12:00pm to 1:00pm: Lunch

  • 1:05pm to 1:25pm: Breakout Groups

  • 1:25pm to 2:45pm: Evaluating and Analyzing Model Interpretations and Explanations

  • 2:45pm to 3:00pm: Break

  • 3:00pm to 4:00pm: Analyzing Model Interpretations and Explanations, and Future Research Directions

3 of 271

Motivation


Machine Learning is EVERYWHERE!!

4 of 271

Is Model Understanding Needed Everywhere?


5 of 271

When and Why Model Understanding?

  • Not all applications require model understanding
    • E.g., ad/product/friend recommendations
    • No human intervention

  • Model understanding is not needed because:
    • There are little to no consequences for incorrect predictions
    • The problem is well studied and models are extensively validated in real-world applications → we trust model predictions

[ Weller 2017, Lipton 2017, Doshi-Velez and Kim 2016 ]

6 of 271

When and Why Model Understanding?

ML is increasingly being employed in complex high-stakes settings


7 of 271

When and Why Model Understanding?

  • High-stakes decision-making settings
    • Impact on human lives/health/finances
    • Settings relatively less well studied, models not extensively validated

  • Accuracy alone is no longer enough
    • Train/test data may not be representative of data encountered in practice

  • Auxiliary criteria are also critical:
    • Nondiscrimination
    • Right to explanation
    • Safety

8 of 271

When and Why Model Understanding?

  • Auxiliary criteria are often hard to quantify (completely)
    • E.g.: Impossible to predict/enumerate all scenarios violating safety of an autonomous car

  • Incompleteness in problem formalization
    • Hinders optimization and evaluation
    • Incompleteness ≠ Uncertainty; Uncertainty can be quantified

9 of 271

When and Why Model Understanding?


Model understanding becomes critical when:

  1. Models are not extensively validated in applications, and train/test data are not representative of the data encountered in practice

  2. Key criteria are hard to quantify, and we need to rely on a "you will know it when you see it" approach

10 of 271

Example: Why Model Understanding?

Input → Predictive Model → Prediction = Siberian Husky

Model Understanding: "This model is relying on incorrect features to make this prediction!! Let me fix the model."

Model understanding facilitates debugging.

11 of 271

Example: Why Model Understanding?

Defendant Details → Predictive Model → Prediction = Risky to Release

Model Understanding (top features: Race, Crimes, Gender): "This prediction is biased. Race and gender are being used to make the prediction!!"

Model understanding facilitates bias detection.

[ Larson et al. 2016 ]

12 of 271

Example: Why Model Understanding?

Loan Applicant Details → Predictive Model → Prediction = Denied Loan

Model Understanding: "Increase salary by 50K + pay credit card bills on time for next 3 months to get a loan."

Loan Applicant: "I have some means for recourse. Let me go and work on my promotion and pay my bills on time."

Model understanding helps provide recourse to individuals who are adversely affected by model predictions.

13 of 271

Example: Why Model Understanding?

Patient Data → Predictive Model → Predictions

Patient records (age, gender, symptom) such as "25, Female, Cold", "32, Male, No", "31, Male, Cough", ... are labeled Healthy/Sick by the model. The model's learned logic:

If gender = female: if ID_num > 200, then sick
If gender = male: if cold = true and cough = true, then sick

Model Understanding: "This model is using irrelevant features when predicting on the female subpopulation. I should not trust its predictions for that group."

Model understanding helps assess if and when to trust model predictions when making decisions.

14 of 271

Example: Why Model Understanding?

Same setup as the previous example: Patient Data → Predictive Model → Predictions, with the learned logic

If gender = female: if ID_num > 200, then sick
If gender = male: if cold = true and cough = true, then sick

Model Understanding (regulator): "This model is using irrelevant features when predicting on the female subpopulation. This cannot be approved!"

Model understanding allows us to vet models to determine if they are suitable for deployment in real world.

15 of 271

Summary: Why Model Understanding?

Utility:

  • Debugging
  • Bias Detection
  • Recourse
  • If and when to trust model predictions
  • Vet models to assess suitability for deployment

Stakeholders:

  • End users (e.g., loan applicants)
  • Decision makers (e.g., doctors, judges)
  • Regulatory agencies (e.g., FDA, European Commission)
  • Researchers and engineers

16 of 271

Achieving Model Understanding

Take 1: Build inherently interpretable predictive models

[ Letham and Rudin 2015; Lakkaraju et al. 2016 ]

17 of 271

Achieving Model Understanding

Take 2: Explain pre-built models in a post-hoc manner


Explainer

18 of 271

Inherently Interpretable Models vs.

Post hoc Explanations

In certain settings, accuracy-interpretability trade-offs may exist.

Example

[ Cireşan et al. 2012, Caruana et al. 2006, Frosst et al. 2017, Stewart 2020 ]

19 of 271

Inherently Interpretable Models vs.

Post hoc Explanations

On some tasks, complex models might achieve higher accuracy; on others, one can build models that are both interpretable and accurate.

20 of 271

Inherently Interpretable Models vs.

Post hoc Explanations

Sometimes, you don’t have enough data to build your model from scratch.

And, all you have is a (proprietary) black box!


21 of 271

Inherently Interpretable Models vs.

Post hoc Explanations


If you can build an interpretable model which is also adequately accurate for your setting, DO IT!

Otherwise, post hoc explanations come to the rescue!

22 of 271

Agenda

  • Inherently Interpretable Models
  • Post hoc Explanation Methods
  • Evaluating Model Interpretations/Explanations
  • Empirically & Theoretically Analyzing Interpretations/Explanations
  • Future of Model Understanding



24 of 271

Inherently Interpretable Models

  • Rule Based Models
  • Risk Scores
  • Generalized Additive Models
  • Prototype Based Models
  • Attention Based Models


26 of 271

Bayesian Rule Lists

  • A rule list classifier for stroke prediction

[Letham et al. 2016]

27 of 271

Bayesian Rule Lists

  • A generative model designed to produce rule lists (if/else-if) that strike a balance between accuracy, interpretability, and computation

  • What about using other, similar models?
    • Decision trees (CART, C5.0, etc.) employ greedy construction methods
    • Greedy construction is computationally cheap, but it affects the quality of the solution – both accuracy and interpretability

28 of 271

Bayesian Rule Lists: Generative Model

The rule list draws its rules from a set of pre-mined antecedents.

Model parameters are inferred using the Metropolis-Hastings algorithm, which is a Markov Chain Monte Carlo (MCMC) sampling method.

29 of 271

Pre-mined Antecedents

  • A major source of practical feasibility: pre-mined antecedents
    • Reduces model space
    • Complexity of problem depends on number of pre-mined antecedents

  • As long as pre-mined set is expressive, accurate decision list can be found + smaller model space means better generalization (Vapnik, 1995)


30 of 271

Interpretable Decision Sets

  • A decision set classifier for disease diagnosis

[Lakkaraju et al. 2016]

31 of 271

Interpretable Decision Sets: Desiderata

  • Optimize for the following criteria
    • Recall
    • Precision
    • Distinctness
    • Parsimony
    • Class Coverage

  • Recall and Precision → accurate predictions

  • Distinctness, Parsimony, and Class Coverage → interpretability

32 of 271

IDS: Objective Function



38 of 271

IDS: Optimization Procedure

  • The problem is a non-normal, non-monotone, submodular optimization problem

  • Maximizing a non-monotone submodular function is NP-hard

  • Local search method which iteratively adds and removes elements until convergence
    • Provides a 2/5 approximation


39 of 271

Inherently Interpretable Models

  • Rule Based Models
  • Risk Scores
  • Generalized Additive Models
  • Prototype Based Models
  • Attention Based Models

40 of 271

Risk Scores: Motivation

  • Risk scores are widely used in medicine and criminal justice
    • E.g., assess risk of mortality in ICU, assess the risk of recidivism

  • Adoption → decision makers find them easy to understand

  • Until very recently, risk scores were constructed manually by domain experts. Can we learn them in a data-driven fashion?

41 of 271

Risk Scores: Examples

  • Recidivism

  • Loan Default


[Ustun and Rudin, 2016]

42 of 271

Objective function to learn risk scores

The objective turns out to be a mixed integer program, and is optimized using a cutting plane method and a branch-and-bound technique.
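The slide's equation itself is an image and not reproduced above; as a hedged sketch, the RiskSLIM-style objective of Ustun and Rudin has roughly the following form (the integer bounds on the coefficients are illustrative):

```latex
\min_{w}\;\; \frac{1}{n}\sum_{i=1}^{n}\log\!\big(1+\exp(-y_i\, w^{\top} x_i)\big)
\;+\; C_0\,\lVert w\rVert_0
\quad\text{s.t.}\quad w_j \in \{-5,\dots,5\}\ \text{for all } j
```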

43 of 271

Inherently Interpretable Models

  • Rule Based Models
  • Risk Scores
  • Generalized Additive Models
  • Prototype Based Models
  • Attention Based Models

44 of 271

Generalized Additive Models (GAMs)

[Lou et al., 2012; Caruana et al., 2015]

45 of 271

Formulation and Characteristics of GAMs

g is a link function, e.g., the identity function in the case of regression, or the logit log(y / (1 − y)) in the case of classification; each fi is a shape function.
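The formula on the slide is an image; the standard GAM form it refers to is:

```latex
g\big(\mathbb{E}[y]\big) \;=\; \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_d(x_d)
```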

46 of 271

GAMs and GA2Ms

  • While GAMs model first order terms, GA2Ms model second order feature interactions as well.

47 of 271

GAMs and GA2Ms

  • Learning:
    • Represent each component as a spline
    • Least squares formulation; the optimization problem balances smoothness and empirical error

  • GA2Ms: Build a GAM first, and then detect and rank all possible pairs of interactions in the residual
    • Choose the top k pairs
    • k is determined by cross-validation

48 of 271

Inherently Interpretable Models

  • Rule Based Models
  • Risk Scores
  • Generalized Additive Models
  • Prototype Based Models
  • Attention Based Models

49 of 271

Prototype Selection for Interpretable Classification

  • The goal here is to identify K prototypes (instances) from the data such that a new instance, assigned the same label as its closest prototype, will be correctly classified with high probability

  • Let each instance "cover" the ε-neighborhood around it

  • Once we define the neighborhood covered by each instance, this problem becomes similar to the problem of finding rule sets, and can be solved analogously

[Bien et al., 2012]

51 of 271

Prototype Layers in Deep Learning Models

[Li et al. 2017, Chen et al. 2019]

56 of 271

Inherently Interpretable Models

  • Rule Based Models
  • Risk Scores
  • Generalized Additive Models
  • Prototype Based Models
  • Attention Based Models

57 of 271

Attention Layers in Deep Learning Models

  • Let us consider the example of machine translation

[ Bahdanau et al. 2016; Xu et al. 2015 ]

Input (I, am, Bob) → Encoder hidden states (h1, h2, h3) → Context Vector (C) → Decoder hidden states (s1, s2, s3) → Outputs (Je, suis, Bob)

58 of 271

Attention Layers in Deep Learning Models

  • Let us consider the example of machine translation

[ Bahdanau et al. 2016; Xu et al. 2015 ]

Input (I, am, Bob) → Encoder hidden states (h1, h2, h3) → per-step Context Vectors (c1, c2, c3) → Decoder hidden states (s1, s2, s3) → Outputs (Je, suis, Bob)

59 of 271

Attention Layers in Deep Learning Models

  • The context vector corresponding to decoder state si can be written as c_i = Σ_j α_ij h_j

  • α_ij captures the attention placed on input token j when determining the decoder hidden state si; it can be computed as a softmax of the "match" between s_{i−1} and h_j: α_ij = softmax_j( a(s_{i−1}, h_j) )

60 of 271

Inherently Interpretable Models

  • Rule Based Models
  • Risk Scores
  • Generalized Additive Models
  • Prototype Based Models
  • Attention Based Models

61 of 271

Agenda

  • Inherently Interpretable Models
  • Post hoc Explanation Methods
  • Evaluating Model Interpretations/Explanations
  • Empirically & Theoretically Analyzing Interpretations/Explanations
  • Future of Model Understanding



63 of 271

What is an Explanation?

Definition: Interpretable description of the model behavior

Classifier → Explanation → User

A good explanation must be faithful (to the classifier) and understandable (to the user).

64 of 271

What is an Explanation?

Definition: Interpretable description of the model behavior

Candidate ways to describe a model's behavior to a user:

  • Send all the model parameters θ?
  • Send many example predictions?
  • Summarize with a program/rule/tree
  • Select the most important features/points
  • Describe how to flip the model prediction
  • ...

65 of 271

Local Explanations vs. Global Explanations

Local Explanations:
  • Explain individual predictions
  • Help unearth biases in the local neighborhood of a given instance
  • Help vet if individual predictions are being made for the right reasons

Global Explanations:
  • Explain the complete behavior of the model
  • Help shed light on big-picture biases affecting larger subgroups
  • Help vet if the model, at a high level, is suitable for deployment

66 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals


72 of 271

LIME: Local Interpretable Model-Agnostic Explanations


  1. Sample points around xi
  2. Use model to predict labels for each sample
  3. Weigh samples according to distance to xi
  4. Learn simple linear model on weighted samples
  5. Use simple linear model to explain
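A minimal sketch of these five steps for a tabular instance, assuming a fitted black box `model` exposing scikit-learn's `predict_proba` interface (all names and hyperparameters here are illustrative, not the LIME library's API):

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(model, x, n_samples=1000, kernel_width=0.75, scale=1.0):
    rng = np.random.default_rng(0)
    # 1. Sample points around x_i
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    # 2. Use the model to predict labels for each sample
    y = model.predict_proba(Z)[:, 1]
    # 3. Weigh samples according to distance to x_i (exponential kernel)
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)
    # 4. Learn a simple linear model on the weighted samples
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=w)
    # 5. Use the simple linear model's coefficients as the explanation
    return surrogate.coef_
```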

73 of 271

Predict Wolf vs Husky


Only 1 mistake!

74 of 271

Predict Wolf vs Husky


We’ve built a great snow detector…

75 of 271

SHAP: Shapley Values as Importance

Marginal contribution of each feature towards the prediction,

averaged over all possible permutations.

Attributes the prediction to each of the features.

Example: with feature coalition O, the prediction is P(y) = 0.8; adding feature x_i gives P(y) = 0.9, so the marginal contribution is M(x_i, O) = 0.9 − 0.8 = 0.1.
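A brute-force sketch of this definition, tractable only for a handful of features; the black box `model` and the `baseline` values used for "absent" features are assumptions of the sketch:

```python
import numpy as np
from itertools import permutations

def shapley_values(model, x, baseline):
    d = len(x)
    def value(S):
        # prediction with features in S present, the rest set to baseline
        z = baseline.copy()
        idx = list(S)
        z[idx] = x[idx]
        return model.predict_proba(z.reshape(1, -1))[0, 1]
    phi = np.zeros(d)
    perms = list(permutations(range(d)))
    for order in perms:
        S, before = [], value([])
        for i in order:
            S.append(i)
            after = value(S)
            phi[i] += after - before   # marginal contribution of feature i
            before = after
    return phi / len(perms)            # average over all permutations
```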

76 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

77 of 271

Anchors

  • Perturb a given instance x to generate a local neighborhood

  • Identify an “anchor” rule which has the maximum coverage of the local neighborhood and also achieves a high precision.
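A sketch of the two quantities being traded off; `perturb` generates the local neighborhood and `rule` is a hypothetical {feature_index: value} condition set (illustrative names, not the Anchors library's API):

```python
import numpy as np

def precision_and_coverage(model, x, rule, perturb, n_samples=1000):
    Z = perturb(x, n_samples)                      # local neighborhood of x
    pred_x = model.predict(x.reshape(1, -1))[0]
    mask = np.ones(len(Z), dtype=bool)
    for j, v in rule.items():                      # does z satisfy the anchor?
        mask &= (Z[:, j] == v)
    coverage = mask.mean()                         # fraction of neighborhood covered
    if not mask.any():
        return 0.0, 0.0
    # precision: how often the model agrees with f(x) when the rule holds
    precision = (model.predict(Z[mask]) == pred_x).mean()
    return precision, coverage
```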

78 of 271

Salary Prediction

79 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

80 of 271

Saliency Map Overview


Input

Model

Predictions

Junco Bird

81 of 271

Saliency Map Overview


What parts of the input are most relevant for the model’s prediction: ‘Junco Bird’?

Input

Model

Predictions

Junco Bird


83 of 271

Modern DNN Setting

Input → Model → class-specific logit → Predictions (Junco Bird)

84 of 271

Input-Gradient

Input → Model → Predictions (Junco Bird)

Input-Gradient: the gradient of the class logit with respect to the input. It has the same dimension as the input; visualize it as a heatmap.
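A minimal PyTorch sketch, assuming `model` returns class logits for a batched image tensor `x` (names illustrative):

```python
import torch

def input_gradient(model, x, target_class):
    # x: (1, C, H, W) image; returns d(logit)/d(input), same shape as x
    x = x.clone().detach().requires_grad_(True)
    logit = model(x)[0, target_class]   # class-specific logit
    logit.backward()
    return x.grad.detach()

# Visualize as a heatmap, e.g. input_gradient(model, x, c).abs().sum(dim=1)
```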


86 of 271

Input-Gradient


Challenges

  • Visually noisy & difficult to interpret.
  • ‘Gradient saturation.’

87 of 271

SmoothGrad

Input → Model → Predictions (Junco Bird)

SmoothGrad: add Gaussian noise to the input several times and average the input-gradients of the 'noisy' inputs.
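A sketch of SmoothGrad, reusing the `input_gradient` helper from the sketch above (noise scale and sample count are illustrative):

```python
import torch

def smoothgrad(model, x, target_class, n_samples=25, sigma=0.15):
    grads = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)   # Gaussian-noised copy
        grads += input_gradient(model, noisy, target_class)
    return grads / n_samples                      # averaged saliency map
```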


89 of 271

Integrated Gradients

Input → Model → Predictions (Junco Bird)

Integrated Gradients: choose a baseline input and compute a path integral – the 'sum' of interpolated gradients along the path from the baseline to the input.
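A sketch approximating the path integral with a Riemann sum over m interpolation steps, again reusing `input_gradient` from above:

```python
import torch

def integrated_gradients(model, x, baseline, target_class, m=50):
    total = torch.zeros_like(x)
    for k in range(1, m + 1):
        z = baseline + (k / m) * (x - baseline)   # interpolated input
        total += input_gradient(model, z, target_class)
    return (x - baseline) * total / m             # scale by input difference
```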


91 of 271

Gradient-Input

Input → Model → Predictions (Junco Bird)

Gradient-Input: element-wise product of the input-gradient (of the logit) and the input.


93 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

94 of 271

Prototypes/Example Based Post hoc Explanations


Use examples (synthetic or natural) to explain individual predictions

    • Influence Functions (Koh & Liang 2017)
      • Identify instances in the training set that are responsible for the prediction of a given test instance

    • Activation Maximization (Erhan et al. 2009)
      • Identify examples (synthetic or natural) that strongly activate a function (neuron) of interest

95 of 271

Training Point Ranking via Influence Functions

Which training data points have the most ‘influence’ on the test loss?

Input → Model → Predictions (Junco Bird)


97 of 271

Training Point Ranking via Influence Functions

Influence Function: classic tool used in robust statistics for assessing the effect of a sample on regression parameters (Cook & Weisberg, 1980).

Instead of refitting model for every data point, Cook’s distance provides analytical alternative.



100 of 271

Training Point Ranking via Influence Functions

Koh & Liang (2017) extend the 'Cook's distance' insight to the modern machine learning setting.

Consider upweighting a training sample point z by ε in the ERM objective. Comparing the upweighted ERM solution to the original one yields the influence of the training point on the parameters, and in turn on a test input's loss.
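The equations on the slide are images; the standard formulas from Koh & Liang (2017) are:

```latex
\hat{\theta}_{\epsilon,z} \;=\; \arg\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n} L(z_i,\theta) \;+\; \epsilon\, L(z,\theta)

\mathcal{I}_{\text{up,params}}(z) \;=\; \left.\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0} \;=\; -H_{\hat{\theta}}^{-1}\,\nabla_{\theta} L(z,\hat{\theta})

\mathcal{I}_{\text{up,loss}}(z,z_{\text{test}}) \;=\; -\nabla_{\theta} L(z_{\text{test}},\hat{\theta})^{\top} H_{\hat{\theta}}^{-1}\, \nabla_{\theta} L(z,\hat{\theta})
```

where H is the Hessian of the empirical risk at the ERM solution.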

101 of 271

Training Point Ranking via Influence Functions

Applications:

  • compute self-influence to identify mislabelled examples;

  • diagnose possible domain mismatch;

  • craft training-time poisoning examples.


102 of 271

Challenges and Other Approaches

Influence function challenges:

  1. Scalability: computing Hessian-vector products can be tedious in practice.

103 of 271

Challenges and Other Approaches

Influence function challenges:

  1. Scalability: computing Hessian-vector products can be tedious in practice.

  2. Non-convexity: possibly loose approximation for 'deeper' networks (Basu et al. 2020).

Alternatives:

  • Representer Points (Yeh et al. 2018)

  • TracIn (Pruthi et al., NeurIPS 2020)


105 of 271

Activation Maximization

These approaches identify examples, synthetic or natural, that strongly activate a function (neuron) of interest.

Implementation Flavors:

  • Search for natural examples within a specified set (training or validation corpus) that strongly activate a neuron of interest;

  • Synthesize examples, typically via gradient descent, that strongly activate a neuron of interest.


106 of 271

Feature Visualization


107 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

108 of 271

Counterfactual Explanations


What features need to be changed and by how much to flip a model’s prediction?

[Goyal et al., 2019]

109 of 271

Counterfactual Explanations

As ML models are increasingly deployed to make high-stakes decisions (e.g., on loan applications), it becomes important to provide recourse to affected individuals.

Counterfactual Explanations: what features need to be changed, and by how much, to flip a model's prediction (i.e., to reverse an unfavorable outcome)?

110 of 271

Counterfactual Explanations

Loan Application → Predictive Model f(x) → Deny Loan

Counterfactual Generation Algorithm → Recourse for the Applicant: "Increase your salary by 5K & pay your credit card bills on time for the next 3 months."

111 of 271

Generating Counterfactual Explanations: Intuition

Proposed solutions differ on:

  1. How to choose among candidate counterfactuals?

  2. How much access is needed to the underlying predictive model?

112 of 271

Take 1: Minimum Distance Counterfactuals

Wachter et al.'s formulation: find the counterfactual x' closest to the original instance x that achieves the desired outcome y' under the predictive model f:

argmin_{x'} max_λ  λ (f(x') − y')² + d(x, x')

The choice of distance metric d dictates what kinds of counterfactuals are chosen. Wachter et al. use normalized Manhattan distance.

113 of 271

Take 1: Minimum Distance Counterfactuals

Wachter et al. solve a differentiable, unconstrained version of the objective using the Adam optimization algorithm with random restarts.

This method requires access to gradients of the underlying predictive model.
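A minimal sketch of this gradient-based search, assuming a differentiable torch `model` that returns a scalar probability; the fixed λ and other hyperparameters are illustrative (Wachter et al. adapt λ rather than fixing it):

```python
import torch

def counterfactual(model, x, y_target=1.0, lam=10.0, steps=500, lr=0.01):
    x_cf = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_cf], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = model(x_cf).squeeze()
        # prediction loss pushes f(x') to the desired outcome;
        # the L1 distance term keeps x' close to the original instance
        loss = lam * (pred - y_target) ** 2 + (x_cf - x).abs().sum()
        loss.backward()
        opt.step()
    return x_cf.detach()
```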

114 of 271

Take 1: Minimum Distance Counterfactuals


Not feasible to act upon these features!

115 of 271

Take 2: Feasible and Least Cost Counterfactuals

  • The set of feasible counterfactuals is an input from the end user
    • E.g., changes to race or gender are not feasible

  • Cost is modeled as total log-percentile shift
    • Changes become harder when starting off from a higher percentile value

116 of 271

Take 2: Feasible and Least Cost Counterfactuals

  • Ustun et al. only consider the case where the model is a linear classifier
    • The objective is formulated as an integer program (IP) and optimized using CPLEX

  • Requires complete access to the linear classifier, i.e., its weight vector

117 of 271

Take 2: Feasible and Least Cost Counterfactuals

Question: What if we have a black box or a non-linear classifier?

Answer: Generate a local linear approximation of the model (e.g., using LIME) and then apply Ustun et al.'s framework.

118 of 271

Take 2: Feasible and Least Cost Counterfactuals


Changing one feature without affecting another might not be possible!

119 of 271

Take 3: Causally Feasible Counterfactuals

Recourse: reduce current debt from $3250 to $1000.

After 1 year – Loan Applicant: "My current debt has reduced to $1000. Please give me the loan."
Predictive Model f(x): "Your age increased by 1 year and the recourse is no longer valid! Sorry!"

It is important to account for feature interactions when generating counterfactuals! But how?!

120 of 271

Take 3: Causally Feasible Counterfactuals

Here, candidates are restricted to the set of causally feasible counterfactuals permitted according to a given Structural Causal Model (SCM).

Question: What if we don't have access to the structural causal model?

121 of 271

Counterfactuals on Data Manifold

  • Generated counterfactuals should lie on the data manifold
  • Construct Variational Autoencoders (VAEs) to map input instances to latent space
  • Search for counterfactuals in the latent space
  • Once a counterfactual is found, map it back to the input space using the decoder

[ Verma et al., 2020; Pawelczyk et al., 2020 ]

122 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

123 of 271

Global Explanations

  • Explain the complete behavior of a given (black box) model
    • Provide a bird’s eye view of model behavior

  • Help detect big picture model biases persistent across larger subgroups of the population
    • Impractical to manually inspect local explanations of several instances to ascertain big picture biases!

  • Global explanations are complementary to local explanations


124 of 271

Local vs. Global Explanations

Local Explanations:
  • Explain individual predictions
  • Help unearth biases in the local neighborhood of a given instance
  • Help vet if individual predictions are being made for the right reasons

Global Explanations:
  • Explain the complete behavior of the model
  • Help shed light on big-picture biases affecting larger subgroups
  • Help vet if the model, at a high level, is suitable for deployment

125 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

126 of 271

Global Explanation as a Collection of Local Explanations

How to generate a global explanation of a (black box) model?

  • Generate a local explanation for every instance in the data using one of the approaches discussed earlier

  • Pick a subset of k local explanations to constitute the global explanation


What local explanation technique to use?

How to choose the subset of k local explanations?

127 of 271

Global Explanations from Local Feature Importances: SP-LIME

LIME explains a single prediction – the local behavior for a single instance. We can't examine all explanations, so SP-LIME instead picks k explanations to show to the user. The chosen set should be:

  • Representative – should summarize the model's global behavior
  • Diverse – should not be redundant in their descriptions

SP-LIME uses submodular optimization and greedily picks the k explanations. (Model agnostic.)
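A sketch of the greedy submodular pick, assuming `W` is an (instances × features) matrix of LIME coefficients; the coverage objective below follows the spirit of SP-LIME, with names being illustrative:

```python
import numpy as np

def submodular_pick(W, k):
    # global feature importance: sqrt of summed absolute attributions
    importance = np.sqrt(np.abs(W).sum(axis=0))
    chosen = []
    for _ in range(k):
        best_i, best_gain = None, -1.0
        for i in range(W.shape[0]):
            if i in chosen:
                continue
            # features covered by the candidate set of explanations
            covered = np.abs(W)[chosen + [i]].max(axis=0) > 0
            gain = importance[covered].sum()   # weighted feature coverage
            if gain > best_gain:
                best_i, best_gain = i, gain
        chosen.append(best_i)                  # greedy selection
    return chosen
```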

128 of 271

Global Explanations from Local Rule Sets: SP-Anchor

  • Use the Anchors algorithm discussed earlier to obtain local rule sets for every instance in the data

  • Use the same greedy procedure to select a subset of k local rule sets to constitute the global explanation

129 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

130 of 271

Representation Based Approaches

  • Derive model understanding by analyzing intermediate representations of a DNN.

  • Determine model’s reliance on ‘concepts’ that are semantically meaningful to humans.


131 of 271

Representation Based Explanations

[Kim et al., 2018]

Prediction: Zebra (0.97). How important is the notion of "stripes" for this prediction?

132 of 271

Representation Based Explanations: TCAV

Collect examples of the concept "stripes" and random examples, and train a linear classifier to separate their activations at a chosen layer.

The vector orthogonal to the decision boundary, pointing towards the "stripes" class, quantifies the concept "stripes".

Compute directional derivatives along this vector to determine the importance of the notion of stripes for any given prediction.
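A TCAV-style sketch: `acts_concept`/`acts_random` are layer activations for concept and random examples, and `grads` are gradients of the class logit with respect to activations at the same layer (all assumed inputs; illustrative names, not the TCAV library's API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cav(acts_concept, acts_random):
    X = np.vstack([acts_concept, acts_random])
    y = np.r_[np.ones(len(acts_concept)), np.zeros(len(acts_random))]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_[0]
    return v / np.linalg.norm(v)   # unit normal to the decision boundary

def tcav_score(grads, v):
    # fraction of inputs whose class logit increases along the concept
    return float((grads @ v > 0).mean())
```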

133 of 271

Quantitative Testing with Concept Activation Vectors (TCAV)

TCAV measures the sensitivity of a model's prediction to a user-provided concept using the model's internal representations.

134 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

135 of 271

Model Distillation for Generating Global Explanations

Data (feature vectors v1, v2, ...; v11, v12, ...) → Predictive Model f(x) → Model Predictions (Label 1, Label 1, ..., Label 2, ...)

Explainer: a simpler, interpretable model which is optimized to mimic the model predictions.

136 of 271

Generalized Additive Models as Global Explanations

Data → Black Box Model → Model Predictions → Explainer: a GAM fit to the predictions. (Model agnostic.)

[Tan et al., 2019]

137 of 271

Generalized Additive Models as Global Explanations: Shape Functions for Predicting Bike Demand

[Tan et al., 2019]

138 of 271

Generalized Additive Models as Global Explanations: Shape Functions for Predicting Bike Demand

How does bike demand vary as a function of temperature?

[Tan et al., 2019]

139 of 271

Generalized Additive Models as Global Explanations

Generalized Additive Model (GAM):

g(E[y]) = β0 + Σ_i f_i(x_i) + Σ_{i<j} f_ij(x_i, x_j)

i.e., shape functions of individual features plus higher-order feature interaction terms. Fit this model to the predictions of the black box to obtain the shape functions.

[Tan et al., 2019]

140 of 271

Decision Trees as Global Explanations

Data → Black Box Model → Model Predictions → Explainer: a decision tree fit to the predictions. (Model agnostic.)

[ Bastani et al., 2019 ]
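A minimal distillation sketch: fit an interpretable surrogate (here a shallow scikit-learn decision tree) to a black box's predictions and report fidelity, i.e., how often the surrogate matches the black box (names illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def distill(black_box, X, max_depth=4):
    y_bb = black_box.predict(X)                  # labels from the black box
    surrogate = DecisionTreeClassifier(max_depth=max_depth).fit(X, y_bb)
    fidelity = (surrogate.predict(X) == y_bb).mean()
    return surrogate, fidelity
```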

141 of 271

Customizable Decision Sets as Global Explanations

Data → Black Box Model → Model Predictions → Explainer: a customizable decision set fit to the predictions. (Model agnostic.)

142 of 271

Customizable Decision Sets as Global Explanations

Each rule in the explanation pairs a Subgroup Descriptor (outer condition) with Decision Logic (inner if-then rules).

143 of 271

Customizable Decision Sets as Global Explanations


Explain how the model behaves across patient subgroups with different values of smoking and exercise

144 of 271

Customizable Decision Sets as Global Explanations:

Desiderata & Optimization Problem

Fidelity: describe model behavior accurately → minimize the number of instances for which the explanation's label ≠ the model's prediction.

Unambiguity: no contradicting explanations → minimize the number of duplicate rules applicable to each instance.

Simplicity: users should be able to look at the explanation and reason about model behavior → minimize the number of conditions in rules; constraints on the number of rules & subgroups.

Customizability: users should be able to understand model behavior across various subgroups of interest → outer rules comprise only features of user interest (candidate set restricted).

145 of 271

Customizable Decision Sets as Global Explanations

  • The complete optimization problem is non-negative, non-normal, non-monotone, and submodular with matroid constraints

  • Solved using the well-known smooth local search algorithm (Feige et al., 2007) with the best known optimality guarantees

146 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

147 of 271

Counterfactual Explanations

Denied loan applications → Predictive Model f(x) → Counterfactual Generation Algorithm → Recourses

Decision Maker (or Regulatory Authority): How do the recourses permitted by the model vary across racial & gender subgroups? Are there any biases against certain demographics?

148 of 271

Customizable Global Summaries of Counterfactuals

Denied loan applications → Predictive Model f(x) → Algorithm for generating global summaries of counterfactuals

How do the recourses permitted by the model vary across racial & gender subgroups? Are there any biases against certain demographics?

149 of 271

Customizable Global Summaries of Counterfactuals

"This model is biased! It requires certain demographics to 'act upon' a lot more features than others."

Each summary rule pairs a Subgroup Descriptor with Recourse Rules.

150 of 271

Customizable Global Summaries of Counterfactuals:

Desiderata & Optimization Problem

Recourse Correctness: prescribed recourses should obtain desirable outcomes → minimize the number of applicants for whom the prescribed recourse does not lead to the desired outcome.

Recourse Coverage: (almost all) applicants should be provided with recourses → minimize the number of applicants for whom no recourse exists (i.e., who satisfy no rule).

Minimal Recourse Costs: acting upon a prescribed recourse should not be impractical or terribly expensive → minimize total feature costs as well as the magnitude of changes in feature values.

Interpretability of Summaries: summaries should be readily understandable to stakeholders (e.g., decision makers/regulatory authorities) → constraints on the number of rules, the number of conditions in rules, and the number of subgroups.

Customizability: stakeholders should be able to understand model behavior across various subgroups of interest → outer rules comprise only features of stakeholder interest (candidate set restricted).

151 of 271

Customizable Global Summaries of Counterfactuals

  • The complete optimization problem is non-negative, non-normal, non-monotone, and submodular with matroid constraints

  • Solved using the well-known smooth local search algorithm (Feige et al., 2007) with the best known optimality guarantees

152 of 271

Breakout Groups

  • What concepts/ideas/approaches from our morning discussion stood out to you?

  • We discussed different basic units of interpretation -- prototypes, rules, risk scores, shape functions (GAMs), feature importances
    • Are some of these more suited to certain data modalities (e.g., tabular, images, text) than others?

  • What could be some potential vulnerabilities/drawbacks of inherently interpretable models and post hoc explanation methods?

  • Given the diversity of the methods we discussed, how do we go about evaluating inherently interpretable models and post hoc explanation methods?

153 of 271

Agenda

  • Inherently Interpretable Models
  • Post hoc Explanation Methods
  • Evaluating Model Interpretations/Explanations
  • Empirically & Theoretically Analyzing Interpretations/Explanations
  • Future of Model Understanding


154 of 271

Evaluating Model Interpretations/Explanations

  • Evaluating the meaningfulness or correctness of explanations

    • Diverse ways of doing this depending on the type of model interpretation/explanation

  • Evaluating the interpretability of explanations


156 of 271

Evaluating Interpretability

  • Functionally-grounded evaluation: quantitative metrics – e.g., number of rules or prototypes → lower is better!

  • Human-grounded evaluation: binary forced choice, forward simulation/prediction, counterfactual simulation

  • Application-grounded evaluation: Domain expert with exact application task or simpler/partial task

157 of 271

Evaluating Inherently Interpretable Models

  • Evaluating the accuracy of the resulting model

  • Evaluating the interpretability of the resulting model

  • Do we need to evaluate the “correctness” or “meaningfulness” of the resulting interpretations?

158 of 271

Evaluating Bayesian Rule Lists

  • A rule list classifier for stroke prediction

[Letham et al. 2016]

159 of 271

Evaluating Interpretable Decision Sets

  • A decision set classifier for disease diagnosis

[Lakkaraju et al. 2016]

160 of 271

Evaluating Interpretability of Bayesian Rule Lists and Interpretable Decision Sets

  • Number of rules, predicates, etc. → lower is better!

  • User studies to compare Interpretable Decision Sets to Bayesian Decision Lists (Letham et al.)

  • Each user is randomly assigned one of the two models

  • 10 objective and 2 descriptive questions per user

161 of 271

Interface for Objective Questions


162 of 271

Interface for Descriptive Questions


163 of 271

User Study Results

Task        | Metric                  | Our Approach | Bayesian Decision Lists
Descriptive | Human Accuracy          | 0.81         | 0.17
Descriptive | Avg. Time Spent (secs.) | 113.4        | 396.86
Descriptive | Avg. # of Words         | 31.11        | 120.57
Objective   | Human Accuracy          | 0.97         | 0.82
Objective   | Avg. Time Spent (secs.) | 28.18        | 36.34

Objective questions: 17% more accurate, 22% faster. Descriptive questions: 74% fewer words, 71% faster.

164 of 271

Evaluating Prototype and Attention Layers

  • Are prototypes and attention weights always meaningful?

  • Do attention weights correlate with other measures of feature importance? E.g., gradients

  • Would alternative attention weights yield different predictions?

[Jain and Wallace, 2019] – their answer to the last two questions: No!!

165 of 271

Evaluating Post hoc Explanations

  • Evaluating the faithfulness (or correctness) of post hoc explanations

  • Evaluating the stability of post hoc explanations

  • Evaluating the fairness of post hoc explanations

  • Evaluating the interpretability of post hoc explanations

[Agarwal et al., 2022]


167 of 271

Evaluating Faithfulness of Post hoc Explanations – Ground Truth

When ground-truth feature importances are available, faithfulness can be measured as the Spearman rank correlation coefficient, computed over the features of interest, between the explanation and the ground truth.
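A tiny sketch of this metric, with made-up attribution values and ground-truth weights purely for illustration:

```python
import numpy as np
from scipy.stats import spearmanr

explanation = np.array([0.70, 0.10, 0.90, 0.05])   # attribution per feature
ground_truth = np.array([0.80, 0.20, 0.85, 0.00])  # e.g., linear model weights
rho, _ = spearmanr(explanation, ground_truth)
print(f"faithfulness (Spearman rho) = {rho:.2f}")
```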

168 of 271

Evaluating Faithfulness of Post hoc Explanations – Explanations as Models

  • If the explanation is itself a model (e.g., linear model fit by LIME), we can compute the fraction of instances for which the labels assigned by explanation model match those assigned by the underlying model


169 of 271

Evaluating Faithfulness of Post hoc Explanations

  • What if we do not have any ground truth?

  • What if explanations cannot be considered as models that output predictions?

170 of 271

How important are selected features?

  • Deletion: remove important features and see what happens..

(Plot: prediction probability as a function of the % of pixels deleted.)
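A sketch of the deletion metric: remove the most important pixels first and track the drop in predicted class probability – for a faithful explanation, the probability should fall quickly. `predict_prob` and `baseline` are assumptions of the sketch:

```python
import numpy as np

def deletion_curve(predict_prob, x, importance, baseline=0.0, steps=20):
    order = np.argsort(-importance.ravel())     # most important pixels first
    x_flat = x.ravel().copy()
    probs = [predict_prob(x_flat.reshape(x.shape))]
    for chunk in np.array_split(order, steps):
        x_flat[chunk] = baseline                # delete a batch of pixels
        probs.append(predict_prob(x_flat.reshape(x.shape)))
    return np.array(probs)                      # plot vs. % of pixels deleted
```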


176 of 271

How important are selected features?

  • Insertion: add important features and see what happens..

(Plot: prediction probability as a function of the % of pixels inserted.)


181 of 271

Evaluating Stability of Post hoc Explanations

  • Are post hoc explanations unstable w.r.t. small input perturbations?

[Alvarez-Melis, 2018; Agarwal et al., 2022]

Local Lipschitz Constant: L(x) = max_{x' : ||x' − x|| ≤ ε} ||E(x') − E(x)|| / ||x' − x||, where x is the input and E is the post hoc explanation function.
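A sketch of an empirical estimate of this quantity: sample small perturbations and take the worst-case ratio of explanation change to input change. `E` is any explanation function (e.g., a LIME wrapper); names are illustrative:

```python
import numpy as np

def local_lipschitz(E, x, eps=0.1, n_samples=100, seed=0):
    rng = np.random.default_rng(seed)
    e_x = E(x)
    worst = 0.0
    for _ in range(n_samples):
        x_p = x + rng.uniform(-eps, eps, size=x.shape)   # nearby input
        ratio = np.linalg.norm(E(x_p) - e_x) / np.linalg.norm(x_p - x)
        worst = max(worst, ratio)                        # worst-case change
    return worst
```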

182 of 271

Evaluating Stability of Post hoc Explanations

  • What if the underlying model itself is unstable?

  • Relative Output Stability: Denominator accounts for changes in the prediction probabilities

  • Relative Representation Stability: Denominator accounts for changes in the intermediate representations of the underlying model

[Agarwal et al., 2022]

183 of 271

Evaluating Fairness of Post hoc Explanations

  • Compute mean faithfulness/stability metrics for instances from majority and minority groups (e.g., race A vs. race B, male vs. female)

  • If the difference between the two means is statistically significant, then there is unfairness in the post hoc explanations

  • Why/when can such unfairness occur?

[Dai et al., 2022]

184 of 271

Evaluating Interpretability of Post hoc Explanations


185 of 271

Predicting Behavior (“Simulation”)

Data → Classifier → Predictions & Explanations → show to user.

Then, on new data, the user guesses what the classifier would do; compare the user's guesses against the classifier's actual predictions (accuracy).


187 of 271

Human-AI Collaboration

  • Are Explanations Useful for Making Decisions?
    • For tasks where the algorithms are not reliable by themselves


188 of 271

Human-AI Collaboration

  • Deception Detection: Identify fake reviews online
    • Are Humans better detectors with explanations?


https://machineintheloop.com/deception/

189 of 271

Can we improve the accuracy of decisions using feature attribution-based explanations?

  • Prediction Problem: Is a given patient likely to be diagnosed with breast cancer within 2 years?

  • User studies carried out with about 78 doctors (Residents, Internal Medicine)

  • Each doctor looks at 10 patient records from historical data and makes predictions for each of them.


Lakkaraju et al., 2022

190 of 271

Can we improve the accuracy of decisions using feature attribution-based explanations?

(Figure: doctor decision accuracies of 78.32%, 82.02%, and 93.11% are shown alongside the prediction "At Risk (0.91)" and important features – ESR, Family Risk, Chronic Health Conditions; model accuracy: 88.92%.)

191 of 271

Can we improve the accuracy of decisions using feature attribution-based explanations?

(Figure: the same setup, but the important features shown are spurious – Appointment time, Appointment day, Zip code, Doctor ID > 150; doctor accuracies of 78.32%, 82.02%, and 93.11% appear alongside a model accuracy of 88.92%.)

192 of 271

Challenges of Evaluating Interpretable Models/Post hoc Explanation Methods

  • Evaluating interpretations/explanations still an ongoing endeavor

  • Parameter settings heavily influence the resulting interpretations/explanations

  • Diversity of explanation/interpretation methods → diverse metrics

  • User studies are not consistent
    • Affected by choice of: UI, phrasing, visualization, population, incentives, …

  • All of the above lead to conflicting findings

193 of 271

Open Source Tools for Quantitative Evaluation

  • Post hoc explanation methods: OpenXAI (https://open-xai.github.io/) – 22 metrics (faithfulness, stability, fairness); public dashboards comparing various explanation methods on different metrics; 11 lines of code to evaluate explanation quality

  • Other XAI libraries: Captum, Quantus, SHAP benchmark, ERASER (NLP)

194 of 271

Agenda

  • Inherently Interpretable Models
  • Post hoc Explanation Methods
  • Evaluating Model Interpretations/Explanations
  • Empirically & Theoretically Analyzing Interpretations/Explanations
  • Future of Model Understanding


195 of 271

Empirically Analyzing Interpretations/Explanations

  • There has been a lot of recent focus on analyzing the behavior of post hoc explanation methods.

  • Empirical studies analyzing the faithfulness, stability, fairness, adversarial vulnerabilities, and utility of post hoc explanation methods.

  • Several studies demonstrate limitations of existing post hoc methods.

196 of 271

Limitations: Faithfulness


Gradient ⊙ Input

Guided Backprop

Guided GradCAM

Model parameter randomization test


198 of 271

Limitations: Faithfulness

Model parameter randomization test: do the saliency maps produced by Gradient ⊙ Input, Guided Backprop, and Guided GradCAM change when the model's parameters are randomized?

No!!

199 of 271

Limitations: Faithfulness


Randomizing class labels of instances also didn’t impact explanations!

200 of 271

Limitations: Stability

Are post hoc explanations unstable w.r.t. small non-adversarial perturbations of the input, the model, or the hyperparameters?

Measure the Local Lipschitz Constant (defined earlier) around the input, where the explanation function E is LIME, SHAP, Gradient, etc.

201 of 271

Limitations: Stability

  • Perturbation approaches like LIME can be unstable.

(Estimate over 100 tests for an MNIST model.)

202 of 271

Limitations: Stability – Problem is Worse!


[Slack et al., 2020]

Many = 250 perturbations; Few = 25 perturbations;

When you repeatedly run LIME on the same instance, you get different explanations (blue region)

Problem with having too few perturbations? If so, what is the optimal number of perturbations?

203 of 271


Post-hoc Explanations are Fragile

Post-hoc explanations can be easily manipulated.


207 of 271


Adversarial Attacks on Explanations

Adversarial Attack of Ghorbani et al. 2018

Minimally modify the input with a small perturbation without changing the model prediction.


210 of 271


Scaffolding attack used to hide classifier dependence on gender.

Adversarial Classifiers to fool LIME & SHAP

211 of 271

Vulnerabilities of LIME/SHAP: Intuition


Several perturbed data points are out of distribution (OOD)!

212 of 271

Vulnerabilities of LIME/SHAP: Intuition


Adversaries can exploit this and build a classifier that is biased on in-sample data points and unbiased on OOD samples!

213 of 271

Building Adversarial Classifiers

  • Setting:

    • Adversary wants to deploy a biased classifier f in real world.
      • E.g., uses only race to make decisions

    • Adversary must provide black box access to customers and regulators who may use post hoc techniques (GDPR).

    • Goal of adversary is to fool post hoc explanation techniques and hide underlying biases of f


214 of 271

Building Adversarial Classifiers

  • Input: Adversary provides us with the biased classifier f, an input dataset X sampled from real world input distribution Xdist

  • Output: Scaffolded classifier e which behaves exactly like f when making predictions on instances sampled from Xdist but will not reveal underlying biases of f when probed with perturbation-based post hoc explanation techniques.
    • e is the adversarial classifier


215 of 271

Building Adversarial Classifiers

  • The adversarial classifier e can be defined as: e(x) = f(x) if x comes from the real-world input distribution, and ψ(x) otherwise

  • f is the biased classifier input by the adversary

  • ψ is the unbiased classifier (e.g., it only uses features uncorrelated with sensitive attributes)


219 of 271


Sensitivity to Hyperparameters

Explanations can be highly sensitive to hyperparameters such as random seed, number of perturbations, patch size, etc.


221 of 271

Utility: High fidelity explanations can mislead

The true classifier relies on race, while a high-fidelity 'misleading' explanation hides this. In a bail adjudication task, such misleading high-fidelity explanations improve end-user (domain expert) trust.

222 of 271

Utility: Post hoc Explanations Instill Over Trust

  • Domain experts and end users seem to over-trust explanations, and the underlying models, based on those explanations

    • Data scientists over-trusted explanations without even comprehending them – "Participants trusted the tools because of their visualizations and their public availability."

[Kaur et al., 2020; Bucinca et al., 2020]

223 of 271

Responses from Data Scientists Using Explainability Tools

(GAM and SHAP)

[Kaur et al., 2020]

224 of 271


Utility: Explanations for Debugging

In a housing price prediction task, Amazon mechanical turkers are unable to use linear model coefficients to diagnose model mistakes.

225 of 271

Utility: Explanations for Debugging

In a dog breeds classification task, users familiar with machine learning rely on labels, instead of saliency maps, for diagnosing model errors.


227 of 271

Conflicting Evidence on Utility of Explanations

  • Mixed evidence:
        • simulation and benchmark studies show that explanations are useful for debugging;
        • however, recent user studies show limited utility in practice.

  • Rigorous user studies and pilots with end-users can continue to help provide feedback to researchers on what to address (see: Alqaraawi et al. 2020, Bhatt et al. 2020 & Kaur et al. 2020).

229 of 271

Utility: Disagreement Problem in XAI

  • Study to understand:

    • If and how often feature attribution-based explanation methods disagree with each other in practice

    • What constitutes disagreement between these explanations, and how to formalize the notion of explanation disagreement based on practitioner inputs

    • How practitioners resolve explanation disagreement

[Krishna and Han et al., 2022]

230 of 271

Practitioner Inputs on Explanation Disagreement

  • 30-minute semi-structured interviews with 25 data scientists

  • 84% of participants said they often encountered disagreement between explanation methods

  • Characterizing disagreement:
    • Top features are different
    • Ordering among top features is different
    • Direction of top feature contributions is different
    • Relative ordering of features of interest is different


231 of 271

How do Practitioners Resolve Disagreements?

  • Online user study where 25 users were shown explanations that disagree and asked to make a choice, and explain why

  • Practitioners are choosing methods due to:
    • Associated theory or publication time (33%)
    • Explanations matching human intuition better (32%)
    • Type of data (23%)
      • E.g., LIME or SHAP are better for tabular data


232 of 271

How do Practitioners Resolve Disagreements?


233 of 271


Empirical Analysis: Summary

  • Faithfulness/Fidelity
      • Some explanation methods do not ‘reflect’ the underlying model.

  • Fragility
      • Post-hoc explanations can be easily manipulated.

  • Stability
      • Slight changes to inputs can cause large changes in explanations.

  • Useful in practice?

234 of 271

Theoretically Analyzing Interpretable Models

  • Two main classes of theoretical results:
    • Interpretable models learned using certain algorithms are certifiably optimal
      • E.g., rule lists (Angelino et al., 2018)
    • No accuracy-interpretability tradeoffs in certain settings
      • E.g., reinforcement learning for mazes (Mansour et al., 2022)

235 of 271

Theoretical Analysis of Tabular LIME w.r.t. Linear Models

  • Theoretical analysis of LIME
    • “black box” is a linear model
    • data is tabular and discretized

  • Obtained closed-form solutions of the average coefficients of the “surrogate” model (explanation output by LIME)

  • The coefficients obtained are proportional to the gradient of the function to be explained

  • Local error of surrogate model is bounded away from zero with high probability

[Garreau et al., 2020]

236 of 271

Unification and Robustness of LIME and SmoothGrad

  • C-LIME (a continuous variant of LIME) and SmoothGrad converge to the same explanation in expectation

  • At expectation, the resulting explanations are provably robust according to the notion of Lipschitz continuity

  • Finite sample complexity bounds for the number of perturbed samples required for SmoothGrad and C-LIME to converge to their expected output

[Agarwal et al., 2020]

237 of 271

Function Approximation Perspective to Characterizing Post hoc Explanation Methods

  • Various feature attribution methods (e.g., LIME, C-LIME, KernelSHAP, Occlusion, Vanilla Gradients, Gradient times Input, SmoothGrad, Integrated Gradients) are essentially local linear function approximations.

  • But…

[Han et al., 2022]

238 of 271

Function Approximation Perspective to Characterizing Post hoc Explanation Methods

  • But, they adopt different loss functions, and local neighborhoods

[Han et al., 2022]

239 of 271

Function Approximation Perspective to Characterizing Post hoc Explanation Methods

  • No Free Lunch Theorem for Explanation Methods: No single method can perform optimally across all neighborhoods


240 of 271

Agenda

  • Inherently Interpretable Models
  • Post hoc Explanation Methods
  • Evaluating Model Interpretations/Explanations
  • Empirically & Theoretically Analyzing Interpretations/Explanations
  • Future of Model Understanding

240

241 of 271

Future of Model Understanding

241

  • Methods for More Reliable Post hoc Explanations
  • Theoretical Analysis of the Behavior of Interpretable Models & Explanation Methods
  • Model Understanding Beyond Classification
  • Intersections with Model Privacy
  • Intersections with Model Fairness
  • Empirical Evaluation of the Correctness & Utility of Model Interpretations/Explanations
  • Intersections with Model Robustness
  • Characterizing Similarities and Differences Between Various Methods
  • New Interfaces, Tools, Benchmarks for Model Understanding

242 of 271

Methods for More Reliable Post hoc Explanations

  • We have seen several limitations in the behavior of post hoc explanation methods – e.g., they can be unstable, inconsistent, fragile, and unfaithful

  • While there are already attempts to address some of these limitations, more work is needed

242

243 of 271

Challenges with LIME: Stability

  • Perturbation approaches like LIME/SHAP are unstable

243

[Alvarez-Melis and Jaakkola, 2018]

244 of 271

Challenges with LIME: Consistency

244

[Slack et al., 2020]

Many = 250 perturbations; Few = 25 perturbations. When you repeatedly run LIME on the same instance, you get different explanations (blue region).
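To see this concretely, here is a minimal sketch using the open-source lime package with a scikit-learn classifier; the dataset and model choices are illustrative. Re-explaining the same instance with only a few perturbations typically yields different top features across runs:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)
explainer = LimeTabularExplainer(
    data.data, feature_names=list(data.feature_names), mode="classification"
)
x = data.data[0]

# Explain the same instance three times with only 25 perturbations each;
# the reported top-3 features often differ from run to run.
for _ in range(3):
    exp = explainer.explain_instance(
        x, model.predict_proba, num_features=3, num_samples=25
    )
    print([name for name, _ in exp.as_list()])
```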

245 of 271

Challenges with LIME: Consistency

245

Is the problem that we use too few perturbations?

What is the optimal number of perturbations?

Can we just use a very large number of perturbations?

246 of 271

Challenges with LIME: Scalability

  • Querying complex models (e.g., Inception Network, ResNet, AlexNet) repeatedly for labels can be computationally prohibitive

  • Large number of perturbations 🡪 Large number of model queries

246

Generating reliable explanations using LIME can be computationally expensive!

247 of 271

Explanations with Guarantees: BayesLIME and BayesSHAP

  • Intuition: Instead of point estimates of feature importances, model them as distributions

247

248 of 271

BayesLIME and BayesSHAP

  • Construct a Bayesian locally weighted regression that can accommodate LIME/SHAP weighting functions

248

[Model schematic: black-box predictions on perturbations are regressed onto feature importances via a weighting function, with priors on the feature importances and their uncertainty]
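A minimal sketch of such a Bayesian weighted regression, assuming (for brevity) unit observation noise and an isotropic Gaussian prior; names and simplifications are mine, not the authors' implementation:

```python
import numpy as np

def bayes_lime_posterior(Z, y, weights, prior_var=10.0):
    """Closed-form posterior over surrogate coefficients phi.

    Z       : (n, d) perturbations of the instance being explained
    y       : (n,)   black-box predictions for those perturbations
    weights : (n,)   LIME/SHAP-style proximity weights
    """
    W = np.diag(weights)
    d = Z.shape[1]
    # Conjugate update for a N(0, prior_var * I) prior on phi,
    # assuming unit observation noise for simplicity
    precision = Z.T @ W @ Z + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)        # posterior covariance
    mean = cov @ (Z.T @ W @ y)            # posterior mean
    return mean, cov

# The posterior mean recovers the weighted least-squares estimate that
# LIME/SHAP compute; the covariance adds uncertainty on top of it.
```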

249 of 271

BayesLIME and BayesSHAP: Inference

  • Conjugacy yields the following posteriors

  • All parameters can be computed in closed form
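Filling in the standard conjugate update for weighted Bayesian linear regression (notation mine; the paper's exact parameterization may differ):

```latex
% Model: y = Z\phi + \varepsilon with proximity weights on the diagonal of W.
% Prior: \phi \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2 V_0), \quad
%        \sigma^2 \sim \mathrm{InvGamma}(a_0, b_0).
V_n = \big(Z^\top W Z + V_0^{-1}\big)^{-1}, \qquad
\mu_n = V_n\big(Z^\top W y + V_0^{-1}\mu_0\big)
% With a flat prior (\mu_0 = 0, V_0^{-1} \to 0), the posterior mean \mu_n
% reduces to the weighted least-squares estimate that LIME & SHAP compute.
```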

249

These are the same equations used in LIME & SHAP!

250 of 271

Estimating the Required Number of Perturbations

250

Estimate the number of perturbations required to reach a user-specified uncertainty level.
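One simple way to operationalize this, reusing bayes_lime_posterior from the sketch above (the paper derives a direct closed-form estimate; this stopping rule is an illustrative stand-in):

```python
import numpy as np

def enough_perturbations(cov, tol=0.05, z=1.96):
    """True once every feature's 95% credible interval is narrower than tol."""
    widths = 2 * z * np.sqrt(np.diag(cov))
    return bool(np.all(widths < tol))
```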

251 of 271

Improving Efficiency: Focused Sampling

  • Instead of sampling perturbations randomly and querying the black box on all of them, choose the points the learning algorithm is most uncertain about and query only their labels from the black box (see the sketch below)

251

This approach allows us to construct explanations with user-defined levels of confidence in an efficient manner!
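A sketch of one round of such focused sampling, reusing bayes_lime_posterior from above; the uncertainty criterion (posterior predictive variance) and batch size are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np

def focused_sampling_round(Z, y, weights, candidates, black_box, batch=10):
    """Label only the candidate perturbations the posterior is least sure about."""
    mean, cov = bayes_lime_posterior(Z, y, weights)
    # Posterior predictive variance of each candidate under the surrogate
    var = np.einsum("ij,jk,ik->i", candidates, cov, candidates)
    pick = np.argsort(-var)[:batch]        # most uncertain candidates
    new_y = black_box(candidates[pick])    # the only black-box queries made
    Z = np.vstack([Z, candidates[pick]])
    y = np.concatenate([y, new_y])
    # Proximity weights for the new rows would be recomputed by the caller
    return Z, y
```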

252 of 271

Other Questions

  • Can we construct post hoc explanations that are provably robust to various adversarial attacks discussed earlier?

  • Can we construct post hoc explanations that can guarantee faithfulness, stability, and fairness simultaneously?

252

253 of 271

Future of Model Understanding

253


254 of 271

Theoretical Analysis of the Behavior of Explanations/Models

  • We discussed some recent theoretical results earlier. Despite these, several important questions remain unanswered:

  • Can we characterize the conditions under which each post hoc explanation method (un)successfully captures the behavior of the underlying model?

  • Given the properties of the underlying model and the data distribution, can we theoretically determine which explanation method should be employed?

  • Can we theoretically analyze the nature of the prototypes/attention weights learned by deep nets with added prototype/attention layers? When are these meaningful, and when are they spurious?

254

255 of 271

Future of Model Understanding

255


256 of 271

Empirical Analysis of Correctness/Utility

  • While there is already a lot of work on the empirical analysis of correctness/utility for post hoc explanation methods, there is still no clear characterization of which methods (if any) are correct/useful under what conditions.

  • There is even less work on the empirical analysis of the correctness/utility of the interpretations generated by inherently interpretable models. For instance, are the prototypes generated by adding prototype layers correct and meaningful? Can they be leveraged in real-world applications? What about attention weights?

256

257 of 271

Future of Model Understanding

257


258 of 271

Characterizing Similarities and Differences

  • Several post hoc explanation methods exist, employing diverse algorithms and definitions of what constitutes an explanation. Under what conditions do these methods generate similar outputs (e.g., top-K features)?

  • Multiple interpretable models output natural/synthetic prototypes (e.g., Li et al., Chen et al.). When do they generate similar answers, and why?

258

259 of 271

Future of Model Understanding

259


260 of 271

Model Understanding Beyond Classification

  • How should we think about interpretability in the context of large language models and foundation models? What is even feasible here?

  • There is already active work on interpretability in RL and GNNs. However, there is very little research analyzing the correctness/utility of these explanations.

  • Given that even simple interpretable models and post hoc explanations suffer from so many limitations, how do we ensure that explanations for more complex models are reliable?

260

[Coppens et al., 2019; Amir et al., 2018]

[Ying et al., 2019]

261 of 271

Future of Model Understanding

261


262 of 271

Intersections with Model Robustness

  • Are inherently interpretable models with prototype/attention layers more robust than those without these layers? If so, why?

  • Are there any inherent trade-offs between (certain kinds of) model interpretability and model robustness? Or do these aspects reinforce each other?

  • Prior works show that counterfactual explanation generation algorithms can output adversarial examples. What is the impact of adversarially robust models on these explanations? [Pawelczyk et al., 2022]

262

263 of 271

Future of Model Understanding

263


264 of 271

Intersections with Model Fairness

  • It is often hypothesized that model interpretations and explanations help unearth the unfairness and biases of underlying models. However, there is little to no empirical research demonstrating this.

  • We need more empirical evaluations and user studies to determine how interpretations and explanations can complement statistical notions of fairness in identifying racial/gender biases.

  • How does the fairness (statistical) of inherently interpretable models compare with that of vanilla models? Are there any inherent trade-offs between (certain kinds of) model interpretability and model fairness? Or do these aspects reinforce each other?

264

265 of 271

Future of Model Understanding

265


266 of 271

Intersections with Differential Privacy

  • Model interpretations and explanations could potentially expose sensitive information in the underlying datasets.

  • There is little to no research on the privacy implications of interpretable models and/or explanations. What kinds of privacy attacks (e.g., membership inference, model inversion, etc.) do they enable?

  • Do differentially private models help thwart these attacks? If so, under what conditions? Should we construct differentially private explanations?

266

[Harder et al., 2020; Patel et al., 2020]

267 of 271

Future of Model Understanding

267


268 of 271

New Interfaces, Tools, Benchmarks for Model Understanding

  • Can we construct more interactive interfaces for end users to engage with models? What would be the nature of such interactions? [demo]

  • As model interpretations and explanations are employed in different settings, we need new benchmarks and tools that enable comparison of the faithfulness, stability, fairness, and utility of various methods. How do we enable that?

268

[Lakkaraju et al., 2022; Slack et al., 2022]

269 of 271

Some Parting Thoughts..

  • There has been renewed interest in model understanding over the past half decade, thanks to ML models being deployed in healthcare and other high-stakes settings

  • As ML models continue to grow more complex and find more applications, the need for model understanding is only going to rise

  • Lots of interesting and open problems waiting to be solved

  • You can approach the field of XAI from diverse perspectives: theory, algorithms, HCI, or interdisciplinary research – there is room for everyone! ☺

269

270 of 271

Thank You!

  • Acknowledgements: Special thanks to Julius Adebayo, Chirag Agarwal, Shalmali Joshi, and Sameer Singh for co-developing and co-presenting sub-parts of this tutorial at NeurIPS, AAAI, and FAccT conferences.

  • Course on interpretability and explainability: https://interpretable-ml-class.github.io/

  • More tutorials on interpretability and explainability: https://explainml-tutorial.github.io/

  • Trustworthy ML Initiative: https://www.trustworthyml.org/
    • Lots of resources and seminar series on topics related to explainability, fairness, adversarial robustness, differential privacy, causality etc.

270

271 of 271