1 of 19

Rethinking the Role of Gradient-based Attribution Methods for Model Interpretability

Suraj Srinivas¹ & François Fleuret²
¹Idiap Research Institute & EPFL, ²University of Geneva

2 of 19

Saliency Maps for Model Interpretability


[Diagram: input image → Deep Neural Network → Saliency Algorithm → highlight important regions]

3 of 19

Input-gradient Saliency


Simonyan et al., Deep inside convolutional networks: Visualising image classification models and saliency maps, 2013

[Figure: input image x, neural network, and the resulting saliency map S]
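For concreteness, the input-gradient saliency map is the gradient of the class score with respect to the input, S(x) = ∇x f_y(x). A minimal PyTorch-style sketch, with illustrative function and variable names (not from the slides):

```python
import torch

def input_gradient_saliency(model, x, target_class):
    """Input-gradient saliency: S = d f_y(x) / d x, reduced over channels."""
    x = x.clone().requires_grad_(True)       # track gradients w.r.t. the input
    logits = model(x)                         # pre-softmax scores f(x), shape (1, num_classes)
    score = logits[0, target_class]           # logit of the class of interest
    grad, = torch.autograd.grad(score, x)     # gradient of that logit w.r.t. the pixels
    return grad.abs().amax(dim=1)             # max over channels -> one value per pixel
```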

4 of 19

Why are gradients highly structured anyway?


5 of 19

Gradient Structure is Arbitrary


Pre-softmax (logit) gradients can be arbitrary, even if the model generalizes perfectly! This also holds for post-softmax gradients (see paper for details).

Arbitrary!
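One way to see this (a sketch of the argument; the precise construction is in the paper): adding the same input-dependent function g(x) to every logit leaves the softmax output, and hence the predictions, unchanged, while shifting every logit gradient by ∇x g(x), which can be chosen almost arbitrarily:

f̃_i(x) = f_i(x) + g(x) for all classes i
⇒ softmax(f̃(x)) = softmax(f(x)), but ∇x f̃_i(x) = ∇x f_i(x) + ∇x g(x)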

6 of 19

Gradient Structure is Arbitrary


[Figure: input image; logit-gradients of a standard model; logit-gradients of a model with “fooled” gradients]

Logit gradients don’t need to encode relevant information, but they still do. Why?

7 of 19

Generative Models hidden within Discriminative Models


8 of 19

Implicit Density Models within Discriminative Models


Grathwohl et al., Your Classifier is Secretly an Energy-based Model and You Should Treat it Like One, ICLR 2020
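A sketch of the construction (following Grathwohl et al.; the exact assumptions are in the paper): a softmax classifier with logits f_i(x) implicitly defines a class-conditional density, and the logit gradient is exactly the score of that implicit density:

p(x | y = i) ∝ exp(f_i(x))
⇒ ∇x log p(x | y = i) = ∇x f_i(x), since the normalizing constant does not depend on x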

9 of 19

Hypothesis

Hypothesis: the structure of logit-gradients is due to their alignment with the ground-truth gradients of the log data density, ∇x log pdata(x).

A concrete test: increasing gradient alignment must improve gradient interpretability, and decreasing this alignment must degrade it.


10 of 19

Training Energy-based Models


11 of 19

Energy-based Generative Models

  • Sampling via MCMC: use Langevin dynamics (“noisy gradient ascent”); see the sketch after this list
  • Training using:
    • Approximate maximum likelihood: requires MCMC to estimate the gradient of the log normalizing constant
    • Score-matching: does not require the normalizing constant, but is unstable
    • Noise Contrastive Estimation, minimizing Stein discrepancy, etc.
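A minimal sketch of the Langevin update (“noisy gradient ascent” on log p(x)); grad_log_p, the step size, and the number of steps are illustrative placeholders. Under the implicit-density view above, grad_log_p(x) could be the gradient of the chosen class logit.

```python
import torch

def langevin_sample(grad_log_p, x0, step_size=1e-2, n_steps=100):
    """Langevin dynamics: x <- x + (eps/2) * grad log p(x) + sqrt(eps) * noise."""
    x = x0.clone()
    for _ in range(n_steps):
        noise = torch.randn_like(x)
        x = x + 0.5 * step_size * grad_log_p(x) + (step_size ** 0.5) * noise
    return x
```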


12 of 19

Score-Matching


Aapo Hyvärinen, “Estimation of non-normalized statistical models by score matching”, Journal of Machine Learning Research, 6(Apr):695–709, 2005
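For reference, the score-matching objective (Hyvärinen, 2005) in its two equivalent forms; the second, trace form is the one optimized in practice:

J(θ) = ½ E_pdata [ ‖ ∇x log pθ(x) − ∇x log pdata(x) ‖² ]
     = E_pdata [ tr(∇x² log pθ(x)) + ½ ‖ ∇x log pθ(x) ‖² ] + const.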

  • Hessian computation is intractable for deep models!
  • Trace of Hessian is unbounded below

The trace form does not require the data score ∇x log pdata(x)

Alignment of gradients is a generative modelling principle!

13 of 19

Regularized Score-Matching

  • Efficient estimation of the Hessian trace: Hutchinson’s trick + a Taylor-series (finite-difference) approximation (see the sketch below)
  • Regularization of the Hessian trace (which is unbounded below in the plain objective)
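A minimal sketch of how the two pieces combine, assuming f returns a scalar (e.g. a log-density or energy); the symmetric finite-difference form and σ are illustrative choices, not necessarily the exact estimator used in the paper:

```python
import torch

def hessian_trace_estimate(f, x, sigma=1e-2, n_samples=1):
    """Estimate tr(Hessian of f at x).

    Hutchinson's trick:  tr(H) = E_v[v^T H v]  for v ~ N(0, I).
    Taylor series:       v^T H v ~ (f(x + s v) + f(x - s v) - 2 f(x)) / s^2.
    """
    estimate = 0.0
    for _ in range(n_samples):
        v = torch.randn_like(x)
        estimate = estimate + (f(x + sigma * v) + f(x - sigma * v) - 2.0 * f(x)) / sigma ** 2
    return estimate / n_samples
```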

14 of 19

Interpretability vs Generative Modelling


Interpretability | Generative Modelling
Logit-gradients | Gradient of log p(x)
“Deep dream” visualization by activation maximization | MCMC sampling by Langevin dynamics
Pixel perturbation test | Density ratio test

Simonyan et al., Deep inside convolutional networks: Visualising image classification models and saliency maps, 2013

15 of 19

Experiments

We compare generative capabilities and gradient interpretability across four models:

  • Baseline unregularized model
  • Score-matching regularized model
  • Anti-score-matching regularized model
  • Gradient norm regularized model


16 of 19

Effect on Generative Modelling


  • Models assign high likelihoods to noisy points!
  • This tendency is reduced for score-matching models and increased for anti-score-matching models
  • Sample quality is measured using GAN-test
  • Sample quality improves with score-matching and deteriorates with anti-score-matching

Shmelkov et al., How good is my GAN?, ECCV 2018

17 of 19

Effect on Gradient Interpretability


  • A proxy for gradient interpretability is the pixel perturbation test, which masks unimportant pixels and checks accuracy (higher is better); see the sketch after this list
  • Score-matching improves on this metric, while anti-score-matching degrades it

This confirms our hypothesis that implicit density modelling influences gradient interpretability.
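A minimal sketch of such a perturbation test, assuming a batched classifier, per-pixel saliency maps, and zero-masking of unimportant pixels; names and the keep fraction are illustrative and may differ from the exact protocol in the paper:

```python
import torch

def pixel_perturbation_accuracy(model, x, y, saliency, keep_fraction=0.1):
    """Mask all but the most-salient pixels and report accuracy (higher is better)."""
    flat = saliency.flatten(1)                      # (batch, H*W)
    k = max(1, int(keep_fraction * flat.shape[1]))  # number of pixels to keep
    thresh = flat.topk(k, dim=1).values[:, -1]      # per-image k-th largest saliency
    mask = (flat >= thresh[:, None]).float().view_as(saliency)
    x_masked = x * mask.unsqueeze(1)                # broadcast mask over channels
    preds = model(x_masked).argmax(dim=1)
    return (preds == y).float().mean().item()
```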

18 of 19

Effect on Gradient Interpretability


19 of 19

Conclusion

  • We present evidence that logit-gradient interpretability is strongly related to the implicit class-conditional density model p(x|y), and not to the discriminative model p(y|x) that the gradients are typically used to interpret.
  • Broad message: gradient structure depends on factors outside the discriminative properties of the model.
  • Open Question: What causes approximate energy-based training in standard models?
