1 of 271

Interpreting Machine Learning Models: State-of-the-Art, Challenges, Opportunities

Hima Lakkaraju

2 of 271

Schedule for Today

  • 9:00am to 10:20am: Introduction; Overview of Inherently Interpretable Models

  • 10:20am to 10:40am: Break

  • 10:40am to 12:00pm: Overview of Post hoc Explanation Methods

  • 12:00pm to 1:00pm: Lunch

  • 1:05pm to 1:25pm: Breakout Groups

  • 1:25pm to 2:45pm: Evaluating and Analyzing Model Interpretations and Explanations

  • 2:45pm to 3:00pm: Break

  • 3:00pm to 4:00pm: Analyzing Model Interpretations and Explanations, and Future Research Directions

3 of 271

Motivation


Machine Learning is EVERYWHERE!!

4 of 271

Is Model Understanding Needed Everywhere?


5 of 271

When and Why Model Understanding?

  • Not all applications require model understanding
    • E.g., ad/product/friend recommendations
    • No human intervention

  • Model understanding is not needed because:
    • There are little to no consequences for incorrect predictions
    • The problem is well studied and models are extensively validated in real-world applications → we trust model predictions

[ Weller 2017, Lipton 2017, Doshi-Velez and Kim 2016 ]

6 of 271

When and Why Model Understanding?

ML is increasingly being employed in complex high-stakes settings


7 of 271

When and Why Model Understanding?

  • High-stakes decision-making settings
    • Impact on human lives/health/finances
    • Settings relatively less well studied, models not extensively validated

  • Accuracy alone is no longer enough
    • Train/test data may not be representative of data encountered in practice

  • Auxiliary criteria are also critical:
    • Nondiscrimination
    • Right to explanation
    • Safety

8 of 271

When and Why Model Understanding?

  • Auxiliary criteria are often hard to quantify (completely)
    • E.g.: Impossible to predict/enumerate all scenarios violating safety of an autonomous car

  • Incompleteness in problem formalization
    • Hinders optimization and evaluation
    • Incompleteness ≠ Uncertainty; Uncertainty can be quantified

9 of 271

When and Why Model Understanding?


Model understanding becomes critical when:

  1. Models are not extensively validated in applications, and train/test data are not representative of the data encountered in practice

  2. Key criteria are hard to quantify, and we need to rely on a "you will know it when you see it" approach

10 of 271

Example: Why Model Understanding?

Input → Predictive Model → Prediction = Siberian Husky

Model Understanding: "This model is relying on incorrect features to make this prediction!! Let me fix the model."

Model understanding facilitates debugging.

11 of 271

Example: Why Model Understanding?

Defendant Details → Predictive Model → Prediction = Risky to Release

Model Understanding (top features: Race, Crimes, Gender): "This prediction is biased. Race and gender are being used to make the prediction!!"

Model understanding facilitates bias detection.

[ Larson et al. 2016 ]

12 of 271

Example: Why Model Understanding?

Loan Applicant Details → Predictive Model → Prediction = Denied Loan

Model Understanding: "Increase salary by 50K + pay credit card bills on time for next 3 months to get a loan."

Loan Applicant: "I have some means for recourse. Let me go and work on my promotion and pay my bills on time."

Model understanding helps provide recourse to individuals who are adversely affected by model predictions.

13 of 271

Example: Why Model Understanding?

Patient Data → Predictive Model → Predictions

Patient records (age, gender, symptom) such as "25, Female, Cold", "32, Male, No", "31, Male, Cough", ... are labeled Healthy/Sick by the model. The model's learned logic:

If gender = female: if ID_num > 200, then sick
If gender = male: if cold = true and cough = true, then sick

Model Understanding: "This model is using irrelevant features when predicting on the female subpopulation. I should not trust its predictions for that group."

Model understanding helps assess if and when to trust model predictions when making decisions.

14 of 271

Example: Why Model Understanding?

Same setup as the previous example: Patient Data → Predictive Model → Predictions, with the learned logic

If gender = female: if ID_num > 200, then sick
If gender = male: if cold = true and cough = true, then sick

Model Understanding (regulator): "This model is using irrelevant features when predicting on the female subpopulation. This cannot be approved!"

Model understanding allows us to vet models to determine if they are suitable for deployment in real world.

15 of 271

Summary: Why Model Understanding?

Utility:

  • Debugging
  • Bias Detection
  • Recourse
  • If and when to trust model predictions
  • Vet models to assess suitability for deployment

Stakeholders:

  • End users (e.g., loan applicants)
  • Decision makers (e.g., doctors, judges)
  • Regulatory agencies (e.g., FDA, European Commission)
  • Researchers and engineers

16 of 271

Achieving Model Understanding

Take 1: Build inherently interpretable predictive models

[ Letham and Rudin 2015; Lakkaraju et al. 2016 ]

17 of 271

Achieving Model Understanding

Take 2: Explain pre-built models in a post-hoc manner


Explainer

18 of 271

Inherently Interpretable Models vs.

Post hoc Explanations

In certain settings, accuracy-interpretability trade-offs may exist.

Example

[ Cireşan et al. 2012, Caruana et al. 2006, Frosst et al. 2017, Stewart 2020 ]

19 of 271

Inherently Interpretable Models vs.

Post hoc Explanations

On some tasks, complex models might achieve higher accuracy; on others, one can build models that are both interpretable and accurate.

20 of 271

Inherently Interpretable Models vs.

Post hoc Explanations

Sometimes, you don’t have enough data to build your model from scratch.

And, all you have is a (proprietary) black box!


21 of 271

Inherently Interpretable Models vs.

Post hoc Explanations


If you can build an interpretable model which is also adequately accurate for your setting, DO IT!

Otherwise, post hoc explanations come to the rescue!

22 of 271

Agenda

  • Inherently Interpretable Models
  • Post hoc Explanation Methods
  • Evaluating Model Interpretations/Explanations
  • Empirically & Theoretically Analyzing Interpretations/Explanations
  • Future of Model Understanding



24 of 271

Inherently Interpretable Models

  • Rule Based Models
  • Risk Scores
  • Generalized Additive Models
  • Prototype Based Models
  • Attention Based Models


26 of 271

Bayesian Rule Lists

  • A rule list classifier for stroke prediction

[Letham et al. 2016]

27 of 271

Bayesian Rule Lists

  • A generative model designed to produce rule lists (if/else-if) that strike a balance between accuracy, interpretability, and computation

  • What about using other, similar models?
    • Decision trees (CART, C5.0, etc.) employ greedy construction methods
    • Greedy construction is computationally cheap, but it affects the quality of the solution – both accuracy and interpretability

28 of 271

Bayesian Rule Lists: Generative Model

The rule list draws its rules from a set of pre-mined antecedents.

Model parameters are inferred using the Metropolis-Hastings algorithm, which is a Markov Chain Monte Carlo (MCMC) sampling method.

29 of 271

Pre-mined Antecedents

  • A major source of practical feasibility: pre-mined antecedents
    • Reduces model space
    • Complexity of problem depends on number of pre-mined antecedents

  • As long as pre-mined set is expressive, accurate decision list can be found + smaller model space means better generalization (Vapnik, 1995)


30 of 271

Interpretable Decision Sets

  • A decision set classifier for disease diagnosis

[Lakkaraju et al. 2016]

31 of 271

Interpretable Decision Sets: Desiderata

  • Optimize for the following criteria
    • Recall
    • Precision
    • Distinctness
    • Parsimony
    • Class Coverage

  • Recall and Precision → accurate predictions

  • Distinctness, Parsimony, and Class Coverage → interpretability

32 of 271

IDS: Objective Function



38 of 271

IDS: Optimization Procedure

  • The problem is a non-normal, non-monotone, submodular optimization problem

  • Maximizing a non-monotone submodular function is NP-hard

  • Local search method which iteratively adds and removes elements until convergence
    • Provides a 2/5 approximation


39 of 271

Inherently Interpretable Models

  • Rule Based Models
  • Risk Scores
  • Generalized Additive Models
  • Prototype Based Models
  • Attention Based Models

40 of 271

Risk Scores: Motivation

  • Risk scores are widely used in medicine and criminal justice
    • E.g., assess risk of mortality in ICU, assess the risk of recidivism

  • Adoption → decision makers find them easy to understand

  • Until very recently, risk scores were constructed manually by domain experts. Can we learn them in a data-driven fashion?

41 of 271

Risk Scores: Examples

  • Recidivism

  • Loan Default


[Ustun and Rudin, 2016]

42 of 271

Objective function to learn risk scores

The objective turns out to be a mixed integer program, and is optimized using a cutting plane method and a branch-and-bound technique.
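The slide's equation itself is an image and not reproduced above; as a hedged sketch, the RiskSLIM-style objective of Ustun and Rudin has roughly the following form (the integer bounds on the coefficients are illustrative):

```latex
\min_{w}\;\; \frac{1}{n}\sum_{i=1}^{n}\log\!\big(1+\exp(-y_i\, w^{\top} x_i)\big)
\;+\; C_0\,\lVert w\rVert_0
\quad\text{s.t.}\quad w_j \in \{-5,\dots,5\}\ \text{for all } j
```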

43 of 271

Inherently Interpretable Models

  • Rule Based Models
  • Risk Scores
  • Generalized Additive Models
  • Prototype Based Models
  • Attention Based Models

44 of 271

Generalized Additive Models (GAMs)

[Lou et al., 2012; Caruana et al., 2015]

45 of 271

Formulation and Characteristics of GAMs

g is a link function, e.g., the identity function in the case of regression, or the logit log(y / (1 − y)) in the case of classification; each fi is a shape function.
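The formula on the slide is an image; the standard GAM form it refers to is:

```latex
g\big(\mathbb{E}[y]\big) \;=\; \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_d(x_d)
```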

46 of 271

GAMs and GA2Ms

  • While GAMs model first order terms, GA2Ms model second order feature interactions as well.

47 of 271

GAMs and GA2Ms

  • Learning:
    • Represent each component as a spline
    • Least squares formulation; the optimization problem balances smoothness and empirical error

  • GA2Ms: Build a GAM first, and then detect and rank all possible pairs of interactions in the residual
    • Choose the top k pairs
    • k is determined by cross-validation

48 of 271

Inherently Interpretable Models

  • Rule Based Models
  • Risk Scores
  • Generalized Additive Models
  • Prototype Based Models
  • Attention Based Models

49 of 271

Prototype Selection for Interpretable Classification

  • The goal here is to identify K prototypes (instances) from the data such that a new instance, assigned the same label as its closest prototype, will be correctly classified with high probability

  • Let each instance "cover" the ε-neighborhood around it

  • Once we define the neighborhood covered by each instance, this problem becomes similar to the problem of finding rule sets, and can be solved analogously

[Bien et al., 2012]

51 of 271

Prototype Layers in Deep Learning Models

[Li et al. 2017, Chen et al. 2019]

56 of 271

Inherently Interpretable Models

  • Rule Based Models
  • Risk Scores
  • Generalized Additive Models
  • Prototype Based Models
  • Attention Based Models

57 of 271

Attention Layers in Deep Learning Models

  • Let us consider the example of machine translation

[ Bahdanau et al. 2016; Xu et al. 2015 ]

Input (I, am, Bob) → Encoder hidden states (h1, h2, h3) → Context Vector (C) → Decoder hidden states (s1, s2, s3) → Outputs (Je, suis, Bob)

58 of 271

Attention Layers in Deep Learning Models

  • Let us consider the example of machine translation

[ Bahdanau et al. 2016; Xu et al. 2015 ]

Input (I, am, Bob) → Encoder hidden states (h1, h2, h3) → per-step Context Vectors (c1, c2, c3) → Decoder hidden states (s1, s2, s3) → Outputs (Je, suis, Bob)

59 of 271

Attention Layers in Deep Learning Models

  • The context vector corresponding to decoder state si can be written as c_i = Σ_j α_ij h_j

  • α_ij captures the attention placed on input token j when determining the decoder hidden state si; it can be computed as a softmax of the "match" between s_{i−1} and h_j: α_ij = softmax_j( a(s_{i−1}, h_j) )

60 of 271

Inherently Interpretable Models

  • Rule Based Models
  • Risk Scores
  • Generalized Additive Models
  • Prototype Based Models
  • Attention Based Models

61 of 271

Agenda

  • Inherently Interpretable Models
  • Post hoc Explanation Methods
  • Evaluating Model Interpretations/Explanations
  • Empirically & Theoretically Analyzing Interpretations/Explanations
  • Future of Model Understanding



63 of 271

What is an Explanation?

Definition: Interpretable description of the model behavior

Classifier → Explanation → User

A good explanation must be faithful (to the classifier) and understandable (to the user).

64 of 271

What is an Explanation?

Definition: Interpretable description of the model behavior

Candidate ways to describe a model's behavior to a user:

  • Send all the model parameters θ?
  • Send many example predictions?
  • Summarize with a program/rule/tree
  • Select the most important features/points
  • Describe how to flip the model prediction
  • ...

65 of 271

Local Explanations vs. Global Explanations

Local Explanations:
  • Explain individual predictions
  • Help unearth biases in the local neighborhood of a given instance
  • Help vet if individual predictions are being made for the right reasons

Global Explanations:
  • Explain the complete behavior of the model
  • Help shed light on big-picture biases affecting larger subgroups
  • Help vet if the model, at a high level, is suitable for deployment

66 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals


72 of 271

LIME: Local Interpretable Model-Agnostic Explanations


  1. Sample points around xi
  2. Use model to predict labels for each sample
  3. Weigh samples according to distance to xi
  4. Learn simple linear model on weighted samples
  5. Use simple linear model to explain
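A minimal sketch of these five steps for a tabular instance, assuming a fitted black box `model` exposing scikit-learn's `predict_proba` interface (all names and hyperparameters here are illustrative, not the LIME library's API):

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(model, x, n_samples=1000, kernel_width=0.75, scale=1.0):
    rng = np.random.default_rng(0)
    # 1. Sample points around x_i
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    # 2. Use the model to predict labels for each sample
    y = model.predict_proba(Z)[:, 1]
    # 3. Weigh samples according to distance to x_i (exponential kernel)
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)
    # 4. Learn a simple linear model on the weighted samples
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=w)
    # 5. Use the simple linear model's coefficients as the explanation
    return surrogate.coef_
```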

73 of 271

Predict Wolf vs Husky


Only 1 mistake!

74 of 271

Predict Wolf vs Husky


We’ve built a great snow detector…

75 of 271

SHAP: Shapley Values as Importance

Marginal contribution of each feature towards the prediction,

averaged over all possible permutations.

Attributes the prediction to each of the features.

Example: with feature coalition O, the prediction is P(y) = 0.8; adding feature x_i gives P(y) = 0.9, so the marginal contribution is M(x_i, O) = 0.9 − 0.8 = 0.1.
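A brute-force sketch of this definition, tractable only for a handful of features; the black box `model` and the `baseline` values used for "absent" features are assumptions of the sketch:

```python
import numpy as np
from itertools import permutations

def shapley_values(model, x, baseline):
    d = len(x)
    def value(S):
        # prediction with features in S present, the rest set to baseline
        z = baseline.copy()
        idx = list(S)
        z[idx] = x[idx]
        return model.predict_proba(z.reshape(1, -1))[0, 1]
    phi = np.zeros(d)
    perms = list(permutations(range(d)))
    for order in perms:
        S, before = [], value([])
        for i in order:
            S.append(i)
            after = value(S)
            phi[i] += after - before   # marginal contribution of feature i
            before = after
    return phi / len(perms)            # average over all permutations
```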

76 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

77 of 271

Anchors

  • Perturb a given instance x to generate a local neighborhood

  • Identify an “anchor” rule which has the maximum coverage of the local neighborhood and also achieves a high precision.
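A sketch of the two quantities being traded off; `perturb` generates the local neighborhood and `rule` is a hypothetical {feature_index: value} condition set (illustrative names, not the Anchors library's API):

```python
import numpy as np

def precision_and_coverage(model, x, rule, perturb, n_samples=1000):
    Z = perturb(x, n_samples)                      # local neighborhood of x
    pred_x = model.predict(x.reshape(1, -1))[0]
    mask = np.ones(len(Z), dtype=bool)
    for j, v in rule.items():                      # does z satisfy the anchor?
        mask &= (Z[:, j] == v)
    coverage = mask.mean()                         # fraction of neighborhood covered
    if not mask.any():
        return 0.0, 0.0
    # precision: how often the model agrees with f(x) when the rule holds
    precision = (model.predict(Z[mask]) == pred_x).mean()
    return precision, coverage
```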

78 of 271

Salary Prediction

79 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

80 of 271

Saliency Map Overview


Input

Model

Predictions

Junco Bird

81 of 271

Saliency Map Overview


What parts of the input are most relevant for the model’s prediction: ‘Junco Bird’?

Input

Model

Predictions

Junco Bird


83 of 271

Modern DNN Setting

Input → Model → class-specific logit → Predictions (Junco Bird)

84 of 271

Input-Gradient

Input → Model → Predictions (Junco Bird)

Input-Gradient: the gradient of the class logit with respect to the input. It has the same dimension as the input; visualize it as a heatmap.
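A minimal PyTorch sketch, assuming `model` returns class logits for a batched image tensor `x` (names illustrative):

```python
import torch

def input_gradient(model, x, target_class):
    # x: (1, C, H, W) image; returns d(logit)/d(input), same shape as x
    x = x.clone().detach().requires_grad_(True)
    logit = model(x)[0, target_class]   # class-specific logit
    logit.backward()
    return x.grad.detach()

# Visualize as a heatmap, e.g. input_gradient(model, x, c).abs().sum(dim=1)
```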


86 of 271

Input-Gradient


Challenges

  • Visually noisy & difficult to interpret.
  • ‘Gradient saturation.’

87 of 271

SmoothGrad

Input → Model → Predictions (Junco Bird)

SmoothGrad: add Gaussian noise to the input several times and average the input-gradients of the 'noisy' inputs.
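A sketch of SmoothGrad, reusing the `input_gradient` helper from the sketch above (noise scale and sample count are illustrative):

```python
import torch

def smoothgrad(model, x, target_class, n_samples=25, sigma=0.15):
    grads = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)   # Gaussian-noised copy
        grads += input_gradient(model, noisy, target_class)
    return grads / n_samples                      # averaged saliency map
```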


89 of 271

Integrated Gradients

Input → Model → Predictions (Junco Bird)

Integrated Gradients: choose a baseline input and compute a path integral – the 'sum' of interpolated gradients along the path from the baseline to the input.
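A sketch approximating the path integral with a Riemann sum over m interpolation steps, again reusing `input_gradient` from above:

```python
import torch

def integrated_gradients(model, x, baseline, target_class, m=50):
    total = torch.zeros_like(x)
    for k in range(1, m + 1):
        z = baseline + (k / m) * (x - baseline)   # interpolated input
        total += input_gradient(model, z, target_class)
    return (x - baseline) * total / m             # scale by input difference
```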


91 of 271

Gradient-Input

Input → Model → Predictions (Junco Bird)

Gradient-Input: element-wise product of the input-gradient (of the logit) and the input.


93 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

94 of 271

Prototypes/Example Based Post hoc Explanations


Use examples (synthetic or natural) to explain individual predictions

    • Influence Functions (Koh & Liang 2017)
      • Identify instances in the training set that are responsible for the prediction of a given test instance

    • Activation Maximization (Erhan et al. 2009)
      • Identify examples (synthetic or natural) that strongly activate a function (neuron) of interest

95 of 271

Training Point Ranking via Influence Functions

Which training data points have the most ‘influence’ on the test loss?

Input → Model → Predictions (Junco Bird)


97 of 271

Training Point Ranking via Influence Functions

Influence Function: classic tool used in robust statistics for assessing the effect of a sample on regression parameters (Cook & Weisberg, 1980).

Instead of refitting model for every data point, Cook’s distance provides analytical alternative.



100 of 271

Training Point Ranking via Influence Functions

Koh & Liang (2017) extend the 'Cook's distance' insight to the modern machine learning setting.

Consider upweighting a training sample point z by ε in the ERM objective. Comparing the upweighted ERM solution to the original one yields the influence of the training point on the parameters, and in turn on a test input's loss.
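The equations on the slide are images; the standard formulas from Koh & Liang (2017) are:

```latex
\hat{\theta}_{\epsilon,z} \;=\; \arg\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n} L(z_i,\theta) \;+\; \epsilon\, L(z,\theta)

\mathcal{I}_{\text{up,params}}(z) \;=\; \left.\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0} \;=\; -H_{\hat{\theta}}^{-1}\,\nabla_{\theta} L(z,\hat{\theta})

\mathcal{I}_{\text{up,loss}}(z,z_{\text{test}}) \;=\; -\nabla_{\theta} L(z_{\text{test}},\hat{\theta})^{\top} H_{\hat{\theta}}^{-1}\, \nabla_{\theta} L(z,\hat{\theta})
```

where H is the Hessian of the empirical risk at the ERM solution.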

101 of 271

Training Point Ranking via Influence Functions

Applications:

  • compute self-influence to identify mislabelled examples;

  • diagnose possible domain mismatch;

  • craft training-time poisoning examples.


102 of 271

Challenges and Other Approaches

Influence function challenges:

  1. Scalability: computing Hessian-vector products can be tedious in practice.

103 of 271

Challenges and Other Approaches

Influence function challenges:

  1. Scalability: computing Hessian-vector products can be tedious in practice.

  2. Non-convexity: possibly loose approximation for 'deeper' networks (Basu et al. 2020).

Alternatives:

  • Representer Points (Yeh et al. 2018)

  • TracIn (Pruthi et al., NeurIPS 2020)


105 of 271

Activation Maximization

These approaches identify examples, synthetic or natural, that strongly activate a function (neuron) of interest.

Implementation Flavors:

  • Search for natural examples within a specified set (training or validation corpus) that strongly activate a neuron of interest;

  • Synthesize examples, typically via gradient descent, that strongly activate a neuron of interest.


106 of 271

Feature Visualization


107 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

108 of 271

Counterfactual Explanations


What features need to be changed and by how much to flip a model’s prediction?

[Goyal et al., 2019]

109 of 271

Counterfactual Explanations

As ML models are increasingly deployed to make high-stakes decisions (e.g., on loan applications), it becomes important to provide recourse to affected individuals.

Counterfactual Explanations: what features need to be changed, and by how much, to flip a model's prediction (i.e., to reverse an unfavorable outcome)?

110 of 271

Counterfactual Explanations

Loan Application → Predictive Model f(x) → Deny Loan

Counterfactual Generation Algorithm → Recourse for the Applicant: "Increase your salary by 5K & pay your credit card bills on time for the next 3 months."

111 of 271

Generating Counterfactual Explanations: Intuition

Proposed solutions differ on:

  1. How to choose among candidate counterfactuals?

  2. How much access is needed to the underlying predictive model?

112 of 271

Take 1: Minimum Distance Counterfactuals

Wachter et al.'s formulation: find the counterfactual x' closest to the original instance x that achieves the desired outcome y' under the predictive model f:

argmin_{x'} max_λ  λ (f(x') − y')² + d(x, x')

The choice of distance metric d dictates what kinds of counterfactuals are chosen. Wachter et al. use normalized Manhattan distance.

113 of 271

Take 1: Minimum Distance Counterfactuals

Wachter et al. solve a differentiable, unconstrained version of the objective using the Adam optimization algorithm with random restarts.

This method requires access to gradients of the underlying predictive model.
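A minimal sketch of this gradient-based search, assuming a differentiable torch `model` that returns a scalar probability; the fixed λ and other hyperparameters are illustrative (Wachter et al. adapt λ rather than fixing it):

```python
import torch

def counterfactual(model, x, y_target=1.0, lam=10.0, steps=500, lr=0.01):
    x_cf = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_cf], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = model(x_cf).squeeze()
        # prediction loss pushes f(x') to the desired outcome;
        # the L1 distance term keeps x' close to the original instance
        loss = lam * (pred - y_target) ** 2 + (x_cf - x).abs().sum()
        loss.backward()
        opt.step()
    return x_cf.detach()
```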

114 of 271

Take 1: Minimum Distance Counterfactuals


Not feasible to act upon these features!

115 of 271

Take 2: Feasible and Least Cost Counterfactuals

  • The set of feasible counterfactuals is an input from the end user
    • E.g., changes to race or gender are not feasible

  • Cost is modeled as total log-percentile shift
    • Changes become harder when starting off from a higher percentile value

116 of 271

Take 2: Feasible and Least Cost Counterfactuals

  • Ustun et al. only consider the case where the model is a linear classifier
    • The objective is formulated as an integer program (IP) and optimized using CPLEX

  • Requires complete access to the linear classifier, i.e., its weight vector

117 of 271

Take 2: Feasible and Least Cost Counterfactuals

Question: What if we have a black box or a non-linear classifier?

Answer: Generate a local linear approximation of the model (e.g., using LIME) and then apply Ustun et al.'s framework.

118 of 271

Take 2: Feasible and Least Cost Counterfactuals


Changing one feature without affecting another might not be possible!

119 of 271

Take 3: Causally Feasible Counterfactuals

Recourse: reduce current debt from $3250 to $1000.

After 1 year – Loan Applicant: "My current debt has reduced to $1000. Please give me the loan."
Predictive Model f(x): "Your age increased by 1 year and the recourse is no longer valid! Sorry!"

It is important to account for feature interactions when generating counterfactuals! But how?!

120 of 271

Take 3: Causally Feasible Counterfactuals

Here, candidates are restricted to the set of causally feasible counterfactuals permitted according to a given Structural Causal Model (SCM).

Question: What if we don't have access to the structural causal model?

121 of 271

Counterfactuals on Data Manifold

  • Generated counterfactuals should lie on the data manifold
  • Construct Variational Autoencoders (VAEs) to map input instances to latent space
  • Search for counterfactuals in the latent space
  • Once a counterfactual is found, map it back to the input space using the decoder

[ Verma et al., 2020; Pawelczyk et al., 2020 ]

122 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

123 of 271

Global Explanations

  • Explain the complete behavior of a given (black box) model
    • Provide a bird’s eye view of model behavior

  • Help detect big picture model biases persistent across larger subgroups of the population
    • Impractical to manually inspect local explanations of several instances to ascertain big picture biases!

  • Global explanations are complementary to local explanations


124 of 271

Local vs. Global Explanations

Local Explanations:
  • Explain individual predictions
  • Help unearth biases in the local neighborhood of a given instance
  • Help vet if individual predictions are being made for the right reasons

Global Explanations:
  • Explain the complete behavior of the model
  • Help shed light on big-picture biases affecting larger subgroups
  • Help vet if the model, at a high level, is suitable for deployment

125 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

126 of 271

Global Explanation as a Collection of Local Explanations

How to generate a global explanation of a (black box) model?

  • Generate a local explanation for every instance in the data using one of the approaches discussed earlier

  • Pick a subset of k local explanations to constitute the global explanation


What local explanation technique to use?

How to choose the subset of k local explanations?

127 of 271

Global Explanations from Local Feature Importances: SP-LIME

LIME explains a single prediction – the local behavior for a single instance. We can't examine all explanations, so SP-LIME instead picks k explanations to show to the user. The chosen set should be:

  • Representative – should summarize the model's global behavior
  • Diverse – should not be redundant in their descriptions

SP-LIME uses submodular optimization and greedily picks the k explanations. (Model agnostic.)
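A sketch of the greedy submodular pick, assuming `W` is an (instances × features) matrix of LIME coefficients; the coverage objective below follows the spirit of SP-LIME, with names being illustrative:

```python
import numpy as np

def submodular_pick(W, k):
    # global feature importance: sqrt of summed absolute attributions
    importance = np.sqrt(np.abs(W).sum(axis=0))
    chosen = []
    for _ in range(k):
        best_i, best_gain = None, -1.0
        for i in range(W.shape[0]):
            if i in chosen:
                continue
            # features covered by the candidate set of explanations
            covered = np.abs(W)[chosen + [i]].max(axis=0) > 0
            gain = importance[covered].sum()   # weighted feature coverage
            if gain > best_gain:
                best_i, best_gain = i, gain
        chosen.append(best_i)                  # greedy selection
    return chosen
```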

128 of 271

Global Explanations from Local Rule Sets: SP-Anchor

  • Use the Anchors algorithm discussed earlier to obtain local rule sets for every instance in the data

  • Use the same greedy procedure to select a subset of k local rule sets to constitute the global explanation

129 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

130 of 271

Representation Based Approaches

  • Derive model understanding by analyzing intermediate representations of a DNN.

  • Determine model’s reliance on ‘concepts’ that are semantically meaningful to humans.


131 of 271

Representation Based Explanations

[Kim et al., 2018]

Prediction: Zebra (0.97). How important is the notion of "stripes" for this prediction?

132 of 271

Representation Based Explanations: TCAV

Collect examples of the concept "stripes" and random examples, and train a linear classifier to separate their activations at a chosen layer.

The vector orthogonal to the decision boundary, pointing towards the "stripes" class, quantifies the concept "stripes".

Compute directional derivatives along this vector to determine the importance of the notion of stripes for any given prediction.
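A TCAV-style sketch: `acts_concept`/`acts_random` are layer activations for concept and random examples, and `grads` are gradients of the class logit with respect to activations at the same layer (all assumed inputs; illustrative names, not the TCAV library's API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cav(acts_concept, acts_random):
    X = np.vstack([acts_concept, acts_random])
    y = np.r_[np.ones(len(acts_concept)), np.zeros(len(acts_random))]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_[0]
    return v / np.linalg.norm(v)   # unit normal to the decision boundary

def tcav_score(grads, v):
    # fraction of inputs whose class logit increases along the concept
    return float((grads @ v > 0).mean())
```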

133 of 271

Quantitative Testing with Concept Activation Vectors (TCAV)

TCAV measures the sensitivity of a model's prediction to a user-provided concept using the model's internal representations.

134 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

135 of 271

Model Distillation for Generating Global Explanations

Data (feature vectors v1, v2, ...; v11, v12, ...) → Predictive Model f(x) → Model Predictions (Label 1, Label 1, ..., Label 2, ...)

Explainer: a simpler, interpretable model which is optimized to mimic the model predictions.

136 of 271

Generalized Additive Models as Global Explanations

Data → Black Box Model → Model Predictions → Explainer: a GAM fit to the predictions. (Model agnostic.)

[Tan et al., 2019]

137 of 271

Generalized Additive Models as Global Explanations: Shape Functions for Predicting Bike Demand

[Tan et al., 2019]

138 of 271

Generalized Additive Models as Global Explanations: Shape Functions for Predicting Bike Demand

How does bike demand vary as a function of temperature?

[Tan et al., 2019]

139 of 271

Generalized Additive Models as Global Explanations

Generalized Additive Model (GAM):

g(E[y]) = β0 + Σ_i f_i(x_i) + Σ_{i<j} f_ij(x_i, x_j)

i.e., shape functions of individual features plus higher-order feature interaction terms. Fit this model to the predictions of the black box to obtain the shape functions.

[Tan et al., 2019]

140 of 271

Decision Trees as Global Explanations

Data → Black Box Model → Model Predictions → Explainer: a decision tree fit to the predictions. (Model agnostic.)

[ Bastani et al., 2019 ]
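A minimal distillation sketch: fit an interpretable surrogate (here a shallow scikit-learn decision tree) to a black box's predictions and report fidelity, i.e., how often the surrogate matches the black box (names illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def distill(black_box, X, max_depth=4):
    y_bb = black_box.predict(X)                  # labels from the black box
    surrogate = DecisionTreeClassifier(max_depth=max_depth).fit(X, y_bb)
    fidelity = (surrogate.predict(X) == y_bb).mean()
    return surrogate, fidelity
```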

141 of 271

Customizable Decision Sets as Global Explanations

Data → Black Box Model → Model Predictions → Explainer: a customizable decision set fit to the predictions. (Model agnostic.)

142 of 271

Customizable Decision Sets as Global Explanations

Each rule in the explanation pairs a Subgroup Descriptor (outer condition) with Decision Logic (inner if-then rules).

143 of 271

Customizable Decision Sets as Global Explanations


Explain how the model behaves across patient subgroups with different values of smoking and exercise

144 of 271

Customizable Decision Sets as Global Explanations:

Desiderata & Optimization Problem

Fidelity: describe model behavior accurately → minimize the number of instances for which the explanation's label ≠ the model's prediction.

Unambiguity: no contradicting explanations → minimize the number of duplicate rules applicable to each instance.

Simplicity: users should be able to look at the explanation and reason about model behavior → minimize the number of conditions in rules; constraints on the number of rules & subgroups.

Customizability: users should be able to understand model behavior across various subgroups of interest → outer rules comprise only features of user interest (candidate set restricted).

145 of 271

Customizable Decision Sets as Global Explanations

  • The complete optimization problem is non-negative, non-normal, non-monotone, and submodular with matroid constraints

  • Solved using the well-known smooth local search algorithm (Feige et al., 2007) with the best known optimality guarantees

146 of 271

Approaches for Post hoc Explainability

Local Explanations

  • Feature Importances
  • Rule Based
  • Saliency Maps
  • Prototypes/Example Based
  • Counterfactuals


Global Explanations

  • Collection of Local Explanations
  • Representation Based
  • Model Distillation
  • Summaries of Counterfactuals

147 of 271

Counterfactual Explanations

Denied loan applications → Predictive Model f(x) → Counterfactual Generation Algorithm → Recourses

Decision Maker (or Regulatory Authority): How do the recourses permitted by the model vary across racial & gender subgroups? Are there any biases against certain demographics?

148 of 271

Customizable Global Summaries of Counterfactuals

Denied loan applications → Predictive Model f(x) → Algorithm for generating global summaries of counterfactuals

How do the recourses permitted by the model vary across racial & gender subgroups? Are there any biases against certain demographics?

149 of 271

Customizable Global Summaries of Counterfactuals

"This model is biased! It requires certain demographics to 'act upon' a lot more features than others."

Each summary rule pairs a Subgroup Descriptor with Recourse Rules.

150 of 271

Customizable Global Summaries of Counterfactuals:

Desiderata & Optimization Problem

Recourse Correctness: prescribed recourses should obtain desirable outcomes → minimize the number of applicants for whom the prescribed recourse does not lead to the desired outcome.

Recourse Coverage: (almost all) applicants should be provided with recourses → minimize the number of applicants for whom no recourse exists (i.e., who satisfy no rule).

Minimal Recourse Costs: acting upon a prescribed recourse should not be impractical or terribly expensive → minimize total feature costs as well as the magnitude of changes in feature values.

Interpretability of Summaries: summaries should be readily understandable to stakeholders (e.g., decision makers/regulatory authorities) → constraints on the number of rules, the number of conditions in rules, and the number of subgroups.

Customizability: stakeholders should be able to understand model behavior across various subgroups of interest → outer rules comprise only features of stakeholder interest (candidate set restricted).

151 of 271

Customizable Global Summaries of Counterfactuals

  • The complete optimization problem is non-negative, non-normal, non-monotone, and submodular with matroid constraints

  • Solved using the well-known smooth local search algorithm (Feige et al., 2007) with the best known optimality guarantees

152 of 271

Breakout Groups

  • What concepts/ideas/approaches from our morning discussion stood out to you?

  • We discussed different basic units of interpretation -- prototypes, rules, risk scores, shape functions (GAMs), feature importances
    • Are some of these more suited to certain data modalities (e.g., tabular, images, text) than others?

  • What could be some potential vulnerabilities/drawbacks of inherently interpretable models and post hoc explanation methods?

  • Given the diversity of the methods we discussed, how do we go about evaluating inherently interpretable models and post hoc explanation methods?

153 of 271

Agenda

  • Inherently Interpretable Models
  • Post hoc Explanation Methods
  • Evaluating Model Interpretations/Explanations
  • Empirically & Theoretically Analyzing Interpretations/Explanations
  • Future of Model Understanding


154 of 271

Evaluating Model Interpretations/Explanations

  • Evaluating the meaningfulness or correctness of explanations

    • Diverse ways of doing this depending on the type of model interpretation/explanation

  • Evaluating the interpretability of explanations


156 of 271

Evaluating Interpretability

  • Functionally-grounded evaluation: quantitative metrics – e.g., number of rules or prototypes → lower is better!

  • Human-grounded evaluation: binary forced choice, forward simulation/prediction, counterfactual simulation

  • Application-grounded evaluation: Domain expert with exact application task or simpler/partial task

157 of 271

Evaluating Inherently Interpretable Models

  • Evaluating the accuracy of the resulting model

  • Evaluating the interpretability of the resulting model

  • Do we need to evaluate the “correctness” or “meaningfulness” of the resulting interpretations?

158 of 271

Evaluating Bayesian Rule Lists

  • A rule list classifier for stroke prediction

[Letham et al. 2016]

159 of 271

Evaluating Interpretable Decision Sets

  • A decision set classifier for disease diagnosis

[Lakkaraju et al. 2016]

160 of 271

Evaluating Interpretability of Bayesian Rule Lists and Interpretable Decision Sets

  • Number of rules, predicates, etc. → lower is better!

  • User studies to compare Interpretable Decision Sets to Bayesian Decision Lists (Letham et al.)

  • Each user is randomly assigned one of the two models

  • 10 objective and 2 descriptive questions per user

161 of 271

Interface for Objective Questions


162 of 271

Interface for Descriptive Questions


163 of 271

User Study Results

Task        | Metric                  | Our Approach | Bayesian Decision Lists
Descriptive | Human Accuracy          | 0.81         | 0.17
Descriptive | Avg. Time Spent (secs.) | 113.4        | 396.86
Descriptive | Avg. # of Words         | 31.11        | 120.57
Objective   | Human Accuracy          | 0.97         | 0.82
Objective   | Avg. Time Spent (secs.) | 28.18        | 36.34

Objective questions: 17% more accurate, 22% faster. Descriptive questions: 74% fewer words, 71% faster.

164 of 271

Evaluating Prototype and Attention Layers

  • Are prototypes and attention weights always meaningful?

  • Do attention weights correlate with other measures of feature importance? E.g., gradients

  • Would alternative attention weights yield different predictions?

[Jain and Wallace, 2019] – their answer to the last two questions: No!!

165 of 271

Evaluating Post hoc Explanations

  • Evaluating the faithfulness (or correctness) of post hoc explanations

  • Evaluating the stability of post hoc explanations

  • Evaluating the fairness of post hoc explanations

  • Evaluating the interpretability of post hoc explanations

[Agarwal et al., 2022]


167 of 271

Evaluating Faithfulness of Post hoc Explanations – Ground Truth

When ground-truth feature importances are available, faithfulness can be measured as the Spearman rank correlation coefficient, computed over the features of interest, between the explanation and the ground truth.
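A tiny sketch of this metric, with made-up attribution values and ground-truth weights purely for illustration:

```python
import numpy as np
from scipy.stats import spearmanr

explanation = np.array([0.70, 0.10, 0.90, 0.05])   # attribution per feature
ground_truth = np.array([0.80, 0.20, 0.85, 0.00])  # e.g., linear model weights
rho, _ = spearmanr(explanation, ground_truth)
print(f"faithfulness (Spearman rho) = {rho:.2f}")
```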

168 of 271

Evaluating Faithfulness of Post hoc Explanations – Explanations as Models

  • If the explanation is itself a model (e.g., linear model fit by LIME), we can compute the fraction of instances for which the labels assigned by explanation model match those assigned by the underlying model


169 of 271

Evaluating Faithfulness of Post hoc Explanations

  • What if we do not have any ground truth?

  • What if explanations cannot be considered as models that output predictions?

170 of 271

How important are selected features?

  • Deletion: remove important features and see what happens..

(Plot: prediction probability as a function of the % of pixels deleted.)
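A sketch of the deletion metric: remove the most important pixels first and track the drop in predicted class probability – for a faithful explanation, the probability should fall quickly. `predict_prob` and `baseline` are assumptions of the sketch:

```python
import numpy as np

def deletion_curve(predict_prob, x, importance, baseline=0.0, steps=20):
    order = np.argsort(-importance.ravel())     # most important pixels first
    x_flat = x.ravel().copy()
    probs = [predict_prob(x_flat.reshape(x.shape))]
    for chunk in np.array_split(order, steps):
        x_flat[chunk] = baseline                # delete a batch of pixels
        probs.append(predict_prob(x_flat.reshape(x.shape)))
    return np.array(probs)                      # plot vs. % of pixels deleted
```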


176 of 271

How important are selected features?

  • Insertion: add important features and see what happens..

(Plot: prediction probability as a function of the % of pixels inserted.)


181 of 271

Evaluating Stability of Post hoc Explanations

  • Are post hoc explanations unstable w.r.t. small input perturbations?

[Alvarez-Melis, 2018; Agarwal et al., 2022]

Local Lipschitz Constant: L(x) = max_{x' : ||x' − x|| ≤ ε} ||E(x') − E(x)|| / ||x' − x||, where x is the input and E is the post hoc explanation function.
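A sketch of an empirical estimate of this quantity: sample small perturbations and take the worst-case ratio of explanation change to input change. `E` is any explanation function (e.g., a LIME wrapper); names are illustrative:

```python
import numpy as np

def local_lipschitz(E, x, eps=0.1, n_samples=100, seed=0):
    rng = np.random.default_rng(seed)
    e_x = E(x)
    worst = 0.0
    for _ in range(n_samples):
        x_p = x + rng.uniform(-eps, eps, size=x.shape)   # nearby input
        ratio = np.linalg.norm(E(x_p) - e_x) / np.linalg.norm(x_p - x)
        worst = max(worst, ratio)                        # worst-case change
    return worst
```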

182 of 271

Evaluating Stability of Post hoc Explanations

  • What if the underlying model itself is unstable?

  • Relative Output Stability: Denominator accounts for changes in the prediction probabilities

  • Relative Representation Stability: Denominator accounts for changes in the intermediate representations of the underlying model

[Agarwal et al., 2022]

183 of 271

Evaluating Fairness of Post hoc Explanations

  • Compute mean faithfulness/stability metrics for instances from majority and minority groups (e.g., race A vs. race B, male vs. female)

  • If the difference between the two means is statistically significant, then there is unfairness in the post hoc explanations

  • Why/when can such unfairness occur?

[Dai et al., 2022]

184 of 271

Evaluating Interpretability of Post hoc Explanations


185 of 271

Predicting Behavior (“Simulation”)

Data → Classifier → Predictions & Explanations → show to user.

Then, on new data, the user guesses what the classifier would do; compare the user's guesses against the classifier's actual predictions (accuracy).


187 of 271

Human-AI Collaboration

  • Are Explanations Useful for Making Decisions?
    • For tasks where the algorithms are not reliable by themselves


188 of 271

Human-AI Collaboration

  • Deception Detection: Identify fake reviews online
    • Are Humans better detectors with explanations?


https://machineintheloop.com/deception/

189 of 271

Can we improve the accuracy of decisions using feature attribution-based explanations?

  • Prediction Problem: Is a given patient likely to be diagnosed with breast cancer within 2 years?

  • User studies carried out with about 78 doctors (Residents, Internal Medicine)

  • Each doctor looks at 10 patient records from historical data and makes predictions for each of them.


Lakkaraju et al., 2022

190 of 271

Can we improve the accuracy of decisions using feature attribution-based explanations?

(Figure: doctor decision accuracies of 78.32%, 82.02%, and 93.11% are shown alongside the prediction "At Risk (0.91)" and important features – ESR, Family Risk, Chronic Health Conditions; model accuracy: 88.92%.)

191 of 271

Can we improve the accuracy of decisions using feature attribution-based explanations?

(Figure: the same setup, but the important features shown are spurious – Appointment time, Appointment day, Zip code, Doctor ID > 150; doctor accuracies of 78.32%, 82.02%, and 93.11% appear alongside a model accuracy of 88.92%.)

192 of 271

Challenges of Evaluating Interpretable Models/Post hoc Explanation Methods

  • Evaluating interpretations/explanations still an ongoing endeavor

  • Parameter settings heavily influence the resulting interpretations/explanations

  • Diversity of explanation/interpretation methods → diverse metrics

  • User studies are not consistent
    • Affected by choice of: UI, phrasing, visualization, population, incentives, …

  • All of the above lead to conflicting findings

193 of 271

Open Source Tools for Quantitative Evaluation

  • Post hoc explanation methods: OpenXAI (https://open-xai.github.io/) – 22 metrics (faithfulness, stability, fairness); public dashboards comparing various explanation methods on different metrics; 11 lines of code to evaluate explanation quality

  • Other XAI libraries: Captum, Quantus, SHAP benchmark, ERASER (NLP)

194 of 271

Agenda

  • Inherently Interpretable Models
  • Post hoc Explanation Methods
  • Evaluating Model Interpretations/Explanations
  • Empirically & Theoretically Analyzing Interpretations/Explanations
  • Future of Model Understanding


195 of 271

Empirically Analyzing Interpretations/Explanations

  • There has been a lot of recent focus on analyzing the behavior of post hoc explanation methods.

  • Empirical studies analyzing the faithfulness, stability, fairness, adversarial vulnerabilities, and utility of post hoc explanation methods.

  • Several studies demonstrate limitations of existing post hoc methods.

196 of 271

Limitations: Faithfulness


Gradient ⊙ Input

Guided Backprop

Guided GradCAM

Model parameter randomization test


198 of 271

Limitations: Faithfulness

Model parameter randomization test: do the saliency maps produced by Gradient ⊙ Input, Guided Backprop, and Guided GradCAM change when the model's parameters are randomized?

No!!

199 of 271

Limitations: Faithfulness


Randomizing class labels of instances also didn’t impact explanations!

200 of 271

Limitations: Stability

Are post hoc explanations unstable w.r.t. small non-adversarial perturbations of the input, the model, or the hyperparameters?

Measure the Local Lipschitz Constant (defined earlier) around the input, where the explanation function E is LIME, SHAP, Gradient, etc.

201 of 271

Limitations: Stability

  • Perturbation approaches like LIME can be unstable.

(Estimate over 100 tests for an MNIST model.)

202 of 271

Limitations: Stability – Problem is Worse!


[Slack et al., 2020]

Many = 250 perturbations; Few = 25 perturbations;

When you repeatedly run LIME on the same instance, you get different explanations (blue region)

Problem with having too few perturbations? If so, what is the optimal number of perturbations?

203 of 271


Post-hoc Explanations are Fragile

Post-hoc explanations can be easily manipulated.


207 of 271


Adversarial Attacks on Explanations

Adversarial Attack of Ghorbani et al. 2018

Minimally modify the input with a small perturbation without changing the model prediction.


210 of 271


Scaffolding attack used to hide classifier dependence on gender.

Adversarial Classifiers to fool LIME & SHAP

211 of 271

Vulnerabilities of LIME/SHAP: Intuition


Several perturbed data points are out of distribution (OOD)!

212 of 271

Vulnerabilities of LIME/SHAP: Intuition


Adversaries can exploit this and build a classifier that is biased on in-sample data points and unbiased on OOD samples!

213 of 271

Building Adversarial Classifiers

  • Setting:

    • Adversary wants to deploy a biased classifier f in real world.
      • E.g., uses only race to make decisions

    • Adversary must provide black box access to customers and regulators who may use post hoc techniques (GDPR).

    • Goal of adversary is to fool post hoc explanation techniques and hide underlying biases of f


214 of 271

Building Adversarial Classifiers

  • Input: Adversary provides us with the biased classifier f, an input dataset X sampled from real world input distribution Xdist

  • Output: Scaffolded classifier e which behaves exactly like f when making predictions on instances sampled from Xdist but will not reveal underlying biases of f when probed with perturbation-based post hoc explanation techniques.
    • e is the adversarial classifier


215 of 271

Building Adversarial Classifiers

  • The adversarial classifier e can be defined as: e(x) = f(x) if x comes from the real-world input distribution, and ψ(x) otherwise

  • f is the biased classifier input by the adversary

  • ψ is the unbiased classifier (e.g., it only uses features uncorrelated with sensitive attributes)


219 of 271


Sensitivity to Hyperparameters

Explanations can be highly sensitive to hyperparameters such as random seed, number of perturbations, patch size, etc.


221 of 271

Utility: High fidelity explanations can mislead

The true classifier relies on race, while a high-fidelity 'misleading' explanation hides this. In a bail adjudication task, such misleading high-fidelity explanations improve end-user (domain expert) trust.

222 of 271

Utility: Post hoc Explanations Instill Over Trust

  • Domain experts and end users seem to over-trust explanations, and the underlying models, based on those explanations

    • Data scientists over-trusted explanations without even comprehending them – "Participants trusted the tools because of their visualizations and their public availability."

[Kaur et al., 2020; Bucinca et al., 2020]

223 of 271

Responses from Data Scientists Using Explainability Tools

(GAM and SHAP)

[Kaur et al., 2020]

224 of 271


Utility: Explanations for Debugging

In a housing price prediction task, Amazon mechanical turkers are unable to use linear model coefficients to diagnose model mistakes.

225 of 271

Utility: Explanations for Debugging

In a dog breeds classification task, users familiar with machine learning rely on labels, instead of saliency maps, for diagnosing model errors.


227 of 271

Conflicting Evidence on Utility of Explanations

  • Mixed evidence:
        • simulation and benchmark studies show that explanations are useful for debugging;
        • however, recent user studies show limited utility in practice.

  • Rigorous user studies and pilots with end-users can continue to help provide feedback to researchers on what to address (see: Alqaraawi et al. 2020, Bhatt et al. 2020 & Kaur et al. 2020).

229 of 271

Utility: Disagreement Problem in XAI

  • Study to understand:

    • If and how often feature attribution-based explanation methods disagree with each other in practice

    • What constitutes disagreement between these explanations, and how to formalize the notion of explanation disagreement based on practitioner inputs

    • How practitioners resolve explanation disagreement

[Krishna and Han et al., 2022]

230 of 271

Practitioner Inputs on Explanation Disagreement

  • 30-minute semi-structured interviews with 25 data scientists

  • 84% of participants said they often encountered disagreement between explanation methods

  • Characterizing disagreement:
    • Top features are different
    • Ordering among top features is different
    • Direction of top feature contributions is different
    • Relative ordering of features of interest is different


231 of 271

How do Practitioners Resolve Disagreements?

  • Online user study where 25 users were shown explanations that disagree and asked to make a choice, and explain why

  • Practitioners are choosing methods due to:
    • Associated theory or publication time (33%)
    • Explanations matching human intuition better (32%)
    • Type of data (23%)
      • E.g., LIME or SHAP are better for tabular data


232 of 271

How do Practitioners Resolve Disagreements?


233 of 271


Empirical Analysis: Summary

  • Faithfulness/Fidelity
      • Some explanation methods do not ‘reflect’ the underlying model.

  • Fragility
      • Post-hoc explanations can be easily manipulated.

  • Stability
      • Slight changes to inputs can cause large changes in explanations.

  • Useful in practice?

234 of 271

Theoretically Analyzing Interpretable Models

  • Two main classes of theoretical results:
    • Interpretable models learned using certain algorithms are certifiably optimal
      • E.g., rule lists (Angelino et al., 2018)
    • No accuracy-interpretability tradeoffs in certain settings
      • E.g., reinforcement learning for mazes (Mansour et al., 2022)

235 of 271

Theoretical Analysis of Tabular LIME w.r.t. Linear Models

  • Theoretical analysis of LIME
    • “black box” is a linear model
    • data is tabular and discretized

  • Obtained closed-form solutions of the average coefficients of the “surrogate” model (explanation output by LIME)

  • The coefficients obtained are proportional to the gradient of the function to be explained

  • Local error of surrogate model is bounded away from zero with high probability

[Garreau et al., 2020]

236 of 271

Unification and Robustness of LIME and SmoothGrad

  • C-LIME (a continuous variant of LIME) and SmoothGrad converge to the same explanation in expectation

  • At expectation, the resulting explanations are provably robust according to the notion of Lipschitz continuity

  • Finite sample complexity bounds for the number of perturbed samples required for SmoothGrad and C-LIME to converge to their expected output

[Agarwal et al., 2020]

237 of 271

Function Approximation Perspective to Characterizing Post hoc Explanation Methods

  • Various feature attribution methods (e.g., LIME, C-LIME, KernelSHAP, Occlusion, Vanilla Gradients, Gradient times Input, SmoothGrad, Integrated Gradients) are essentially local linear function approximations.

  • But…

[Han et al., 2022]

238 of 271

Function Approximation Perspective to Characterizing Post hoc Explanation Methods

  • But, they adopt different loss functions, and local neighborhoods

[Han et al., 2022]

239 of 271

Function Approximation Perspective to Characterizing Post hoc Explanation Methods

  • No Free Lunch Theorem for Explanation Methods: No single method can perform optimally across all neighborhoods


240 of 271

Agenda

  • Inherently Interpretable Models
  • Post hoc Explanation Methods
  • Evaluating Model Interpretations/Explanations
  • Empirically & Theoretically Analyzing Interpretations/Explanations
  • Future of Model Understanding

240

241 of 271

Future of Model Understanding

241

  • Methods for More Reliable Post hoc Explanations
  • Theoretical Analysis of the Behavior of Interpretable Models & Explanation Methods
  • Model Understanding Beyond Classification
  • Intersections with Model Privacy
  • Intersections with Model Fairness
  • Empirical Evaluation of the Correctness & Utility of Model Interpretations/Explanations
  • Intersections with Model Robustness
  • Characterizing Similarities and Differences Between Various Methods
  • New Interfaces, Tools, Benchmarks for Model Understanding

242 of 271

Methods for More Reliable Post hoc Explanations

  • We have seen several limitations in the behavior of post hoc explanation methods – e.g., they can be unstable, inconsistent, fragile, and unfaithful

  • While there are already attempts to address some of these limitations, more work is needed

242

243 of 271

Challenges with LIME: Stability

  • Perturbation approaches like LIME/SHAP are unstable

243

[Alvarez-Melis and Jaakkola, 2018]

244 of 271

Challenges with LIME: Consistency

244

[Slack et al., 2020]

Many = 250 perturbations; Few = 25 perturbations. When you repeatedly run LIME on the same instance, you get different explanations (blue region).
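To see this concretely, here is a minimal sketch using the open-source lime package with a scikit-learn classifier; the dataset and model choices are illustrative. Re-explaining the same instance with only a few perturbations typically yields different top features across runs:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)
explainer = LimeTabularExplainer(
    data.data, feature_names=list(data.feature_names), mode="classification"
)
x = data.data[0]

# Explain the same instance three times with only 25 perturbations each;
# the reported top-3 features often differ from run to run.
for _ in range(3):
    exp = explainer.explain_instance(
        x, model.predict_proba, num_features=3, num_samples=25
    )
    print([name for name, _ in exp.as_list()])
```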

245 of 271

Challenges with LIME: Consistency

245

Is the problem that we use too few perturbations?

What is the optimal number of perturbations?

Can we just use a very large number of perturbations?

246 of 271

Challenges with LIME: Scalability

  • Querying complex models (e.g., Inception Network, ResNet, AlexNet) repeatedly for labels can be computationally prohibitive

  • Large number of perturbations 🡪 Large number of model queries

246

Generating reliable explanations using LIME can be computationally expensive!

247 of 271

Explanations with Guarantees: BayesLIME and BayesSHAP

  • Intuition: Instead of point estimates of feature importances, model them as distributions

247

248 of 271

BayesLIME and BayesSHAP

  • Construct a Bayesian locally weighted regression that can accommodate LIME/SHAP weighting functions

248

[Model schematic: black-box predictions on perturbations are regressed onto feature importances via a weighting function, with priors on the feature importances and their uncertainty]
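A minimal sketch of such a Bayesian weighted regression, assuming (for brevity) unit observation noise and an isotropic Gaussian prior; names and simplifications are mine, not the authors' implementation:

```python
import numpy as np

def bayes_lime_posterior(Z, y, weights, prior_var=10.0):
    """Closed-form posterior over surrogate coefficients phi.

    Z       : (n, d) perturbations of the instance being explained
    y       : (n,)   black-box predictions for those perturbations
    weights : (n,)   LIME/SHAP-style proximity weights
    """
    W = np.diag(weights)
    d = Z.shape[1]
    # Conjugate update for a N(0, prior_var * I) prior on phi,
    # assuming unit observation noise for simplicity
    precision = Z.T @ W @ Z + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)        # posterior covariance
    mean = cov @ (Z.T @ W @ y)            # posterior mean
    return mean, cov

# The posterior mean recovers the weighted least-squares estimate that
# LIME/SHAP compute; the covariance adds uncertainty on top of it.
```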

249 of 271

BayesLIME and BayesSHAP: Inference

  • Conjugacy yields the following posteriors

  • All parameters can be computed in closed form
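Filling in the standard conjugate update for weighted Bayesian linear regression (notation mine; the paper's exact parameterization may differ):

```latex
% Model: y = Z\phi + \varepsilon with proximity weights on the diagonal of W.
% Prior: \phi \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2 V_0), \quad
%        \sigma^2 \sim \mathrm{InvGamma}(a_0, b_0).
V_n = \big(Z^\top W Z + V_0^{-1}\big)^{-1}, \qquad
\mu_n = V_n\big(Z^\top W y + V_0^{-1}\mu_0\big)
% With a flat prior (\mu_0 = 0, V_0^{-1} \to 0), the posterior mean \mu_n
% reduces to the weighted least-squares estimate that LIME & SHAP compute.
```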

249

These are the same equations used in LIME & SHAP!

250 of 271

Estimating the Required Number of Perturbations

250

Estimate the number of perturbations required to reach a user-specified uncertainty level.
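One simple way to operationalize this, reusing bayes_lime_posterior from the sketch above (the paper derives a direct closed-form estimate; this stopping rule is an illustrative stand-in):

```python
import numpy as np

def enough_perturbations(cov, tol=0.05, z=1.96):
    """True once every feature's 95% credible interval is narrower than tol."""
    widths = 2 * z * np.sqrt(np.diag(cov))
    return bool(np.all(widths < tol))
```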

251 of 271

Improving Efficiency: Focused Sampling

  • Instead of sampling perturbations randomly and querying the black box on all of them, choose the points the learning algorithm is most uncertain about and query only their labels from the black box (see the sketch below)

251

This approach allows us to construct explanations with user-defined levels of confidence in an efficient manner!
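A sketch of one round of such focused sampling, reusing bayes_lime_posterior from above; the uncertainty criterion (posterior predictive variance) and batch size are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np

def focused_sampling_round(Z, y, weights, candidates, black_box, batch=10):
    """Label only the candidate perturbations the posterior is least sure about."""
    mean, cov = bayes_lime_posterior(Z, y, weights)
    # Posterior predictive variance of each candidate under the surrogate
    var = np.einsum("ij,jk,ik->i", candidates, cov, candidates)
    pick = np.argsort(-var)[:batch]        # most uncertain candidates
    new_y = black_box(candidates[pick])    # the only black-box queries made
    Z = np.vstack([Z, candidates[pick]])
    y = np.concatenate([y, new_y])
    # Proximity weights for the new rows would be recomputed by the caller
    return Z, y
```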

252 of 271

Other Questions

  • Can we construct post hoc explanations that are provably robust to various adversarial attacks discussed earlier?

  • Can we construct post hoc explanations that can guarantee faithfulness, stability, and fairness simultaneously?

252

253 of 271

Future of Model Understanding

253


254 of 271

Theoretical Analysis of the Behavior of Explanations/Models

  • We discussed some recent theoretical results earlier. Despite these, several important questions remain unanswered:

  • Can we characterize the conditions under which each post hoc explanation method (un)successfully captures the behavior of the underlying model?

  • Given the properties of the underlying model and the data distribution, can we theoretically determine which explanation method should be employed?

  • Can we theoretically analyze the nature of the prototypes/attention weights learned by deep nets with added prototype/attention layers? When are these meaningful, and when are they spurious?

254

255 of 271

Future of Model Understanding

255


256 of 271

Empirical Analysis of Correctness/Utility

  • While there is already a lot of work on the empirical analysis of correctness/utility for post hoc explanation methods, there is still no clear characterization of which methods (if any) are correct/useful under what conditions.

  • There is even less work on the empirical analysis of the correctness/utility of the interpretations generated by inherently interpretable models. For instance, are the prototypes generated by adding prototype layers correct and meaningful? Can they be leveraged in real-world applications? What about attention weights?

256

257 of 271

Future of Model Understanding

257


258 of 271

Characterizing Similarities and Differences

  • Several post hoc explanation methods exist, employing diverse algorithms and definitions of what constitutes an explanation. Under what conditions do these methods generate similar outputs (e.g., top-K features)?

  • Multiple interpretable models output natural/synthetic prototypes (e.g., Li et al., Chen et al.). When do they generate similar answers, and why?

258

259 of 271

Future of Model Understanding

259


260 of 271

Model Understanding Beyond Classification

  • How should we think about interpretability in the context of large language models and foundation models? What is even feasible here?

  • There is already active work on interpretability in RL and GNNs. However, there is very little research analyzing the correctness/utility of these explanations.

  • Given that even simple interpretable models and post hoc explanations suffer from so many limitations, how do we ensure that explanations for more complex models are reliable?

260

[Coppens et al., 2019; Amir et al., 2018]

[Ying et al., 2019]

261 of 271

Future of Model Understanding

261


262 of 271

Intersections with Model Robustness

  • Are inherently interpretable models with prototype/attention layers more robust than those without these layers? If so, why?

  • Are there any inherent trade-offs between (certain kinds of) model interpretability and model robustness? Or do these aspects reinforce each other?

  • Prior works show that counterfactual explanation generation algorithms can output adversarial examples. What is the impact of adversarially robust models on these explanations? [Pawelczyk et al., 2022]

262

263 of 271

Future of Model Understanding

263


264 of 271

Intersections with Model Fairness

  • It is often hypothesized that model interpretations and explanations help unearth the unfairness and biases of underlying models. However, there is little to no empirical research demonstrating this.

  • We need more empirical evaluations and user studies to determine how interpretations and explanations can complement statistical notions of fairness in identifying racial/gender biases.

  • How does the fairness (statistical) of inherently interpretable models compare with that of vanilla models? Are there any inherent trade-offs between (certain kinds of) model interpretability and model fairness? Or do these aspects reinforce each other?

264

265 of 271

Future of Model Understanding

265


266 of 271

Intersections with Differential Privacy

  • Model interpretations and explanations could potentially expose sensitive information in the underlying datasets.

  • There is little to no research on the privacy implications of interpretable models and/or explanations. What kinds of privacy attacks (e.g., membership inference, model inversion, etc.) do they enable?

  • Do differentially private models help thwart these attacks? If so, under what conditions? Should we construct differentially private explanations?

266

[Harder et al., 2020; Patel et al., 2020]

267 of 271

Future of Model Understanding

267


268 of 271

New Interfaces, Tools, Benchmarks for Model Understanding

  • Can we construct more interactive interfaces for end users to engage with models? What would be the nature of such interactions? [demo]

  • As model interpretations and explanations are employed in different settings, we need new benchmarks and tools that enable comparison of the faithfulness, stability, fairness, and utility of various methods. How do we enable that?

268

[Lakkaraju et al., 2022; Slack et al., 2022]

269 of 271

Some Parting Thoughts..

  • There has been renewed interest in model understanding over the past half decade, thanks to ML models being deployed in healthcare and other high-stakes settings

  • As ML models continue to grow more complex and find more applications, the need for model understanding is only going to rise

  • Lots of interesting and open problems waiting to be solved

  • You can approach the field of XAI from diverse perspectives: theory, algorithms, HCI, or interdisciplinary research – there is room for everyone! ☺

269

270 of 271

Thank You!

  • Acknowledgements: Special thanks to Julius Adebayo, Chirag Agarwal, Shalmali Joshi, and Sameer Singh for co-developing and co-presenting sub-parts of this tutorial at NeurIPS, AAAI, and FAccT conferences.

  • Course on interpretability and explainability: https://interpretable-ml-class.github.io/

  • More tutorials on interpretability and explainability: https://explainml-tutorial.github.io/

  • Trustworthy ML Initiative: https://www.trustworthyml.org/
    • Lots of resources and seminar series on topics related to explainability, fairness, adversarial robustness, differential privacy, causality etc.

270

271 of 271