1 of 71

Trustworthy ML

Winter Semester 2022-2023

University of Tübingen

Lecturer : Seong Joon Oh

2 of 71

By Monday 21 November 2022:

Please send an email with the following information to stai.there@gmail.com.

  • Your university email address
  • Your favourite email address
  • Your matriculation number

Sorry for the short notice.

3 of 71

Tutorial today

Alex will stay to answer questions after the lecture.

4 of 71

Summary of last lecture

  • Adversarial generalisation measures the worst-case generalisation within a set of possible environments.
  • White-box attacks in pixel space:
    • FGSM
    • PGD
  • White-box attacks in other strategy spaces:
    • Physical attacks
    • Optical-flow attacks
  • Black-box attacks.
  • Defense method – Adversarial training.

5 of 71

Obfuscated gradients: Breaking defense again!

  • ICLR’18 was teeming with defense methods!
  • This ICML 2018 paper: “7 of 9 ICLR’18 defenses do not work”.

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

6 of 71

Obfuscated gradients: Breaking defense again!

Why are these defenses not effective?

  • The defenses are specifically targeted against gradient-based attacks.
  • They only make the gradient malfunction, misleading gradient-based attacks.
  • The model itself is still vulnerable.
  • One can use slight modifications of gradient-based attacks to break the defenses again.

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

7 of 71

Obfuscated gradients: Breaking defense again!

[Figure: the broken defenses implicitly equate "model is safe" with "no gradient-based algorithm (e.g. PGD) can find an adversarial example x_ADV near x_Benign". Actual safety requires "model is safe" = "there is no attack x_ADV within the attack space".]

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

8 of 71

Example: Input transformations

Apply image transformations (and random combinations of them) to the input; a minimal sketch follows the list.

  • Cropping and rescaling.
  • Bit-depth reduction.
  • JPEG encoding + decoding.
  • Remove random pixels and restore them via total-variation (TV) minimisation.
  • Image quilting: reconstruct images with small patches from other images in a database.
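A minimal sketch of two of the transformations above used as a preprocessing defense, assuming a float image in [0, 1] and an arbitrary classifier model; the helper names and the PIL/NumPy choices are illustrative, not the exact setup of the ICLR 2018 paper.

import io
import numpy as np
from PIL import Image

def reduce_bit_depth(x, bits=3):
    # Quantise a float image in [0, 1] to 2**bits levels per channel.
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def jpeg_round_trip(x, quality=75):
    # Encode to JPEG and decode again; lossy compression discards much of the
    # high-frequency component where adversarial noise tends to live.
    img = Image.fromarray((x * 255).astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf), dtype=np.float32) / 255.0

def defended_predict(model, x):
    # Transform the input before handing it to the (unchanged) classifier.
    return model(jpeg_round_trip(reduce_bit_depth(x)))

Note that the model itself is untouched; only the input pipeline changes.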

Countering adversarial images using input transformations. ICLR 2018.

9 of 71

Example: Input transformations

  • You’re *removing* adversarial effects from the input image.

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

[Diagram: Input → Transformation → Transformed (safe) input → Model → Correct prediction]

10 of 71

Example: Input transformations

  • A successful defense!
  • Or so it seems.

Countering adversarial images using input transformations. ICLR 2018.

11 of 71

Example: Input transformations

  • Cropping and rescaling:
    • Differentiable transformation – you can attack the joint network.

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

[Diagram: Input → Transformation → Transformed (safe) input → Model → Correct prediction. The whole pipeline is attacked with PGD by computing the gradient through the differentiable transformation.]
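Since cropping and rescaling are differentiable, the attacker can simply run PGD on the composed pipeline (transformation followed by model). A minimal PyTorch sketch, assuming a classifier model that returns logits and using bilinear rescaling as a stand-in for the paper's crop-and-rescale defense:

import torch
import torch.nn.functional as F

def pgd_through_transform(model, x, y, eps=8/255, alpha=2/255, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Differentiable stand-in for crop + rescale: downscale, then upscale back.
        small = F.interpolate(x_adv, scale_factor=0.5, mode="bilinear", align_corners=False)
        t_x = F.interpolate(small, size=x.shape[-2:], mode="bilinear", align_corners=False)
        loss = F.cross_entropy(model(t_x), y)        # loss of the composed pipeline
        grad = torch.autograd.grad(loss, x_adv)[0]   # gradient w.r.t. the *original* input
        x_adv = x_adv.detach() + alpha * grad.sign() # ascent step on the loss
        x_adv = x + (x_adv - x).clamp(-eps, eps)     # project back into the L_inf ball
        x_adv = x_adv.clamp(0, 1).detach()
    return x_adv

The defense fails because the gradient of the whole pipeline is still exact.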

12 of 71

Example: Input transformations

  • Other discrete transformations:
    • Differentiate “through” quantisation layers. (Called straight-through estimator)

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

Directly backpropagating through the transformation is not possible for JPEG, quantisation, …

[Diagram: Input → Transformation → Transformed (safe) input → Model → Correct prediction. The pipeline is attacked with PGD, which requires a gradient through the non-differentiable transformation.]

13 of 71

Example: Input transformations

  • Other discrete transformations:
    • Differentiate “through” quantisation layers. (Called straight-through estimator)

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

[Diagram as on the previous slide: the pipeline is attacked with PGD, with the straight-through (ST) estimator used to compute the gradient through the transformation.]

ST estimator: forward pass y = t(x) (e.g. JPEG round trip); backward pass treats t as the identity, i.e. ∇_x f(t(x)) ≈ ∇_u f(u) |_{u = t(x)}.

14 of 71

Example: Input transformations

Straight-through estimator.

  • Forward: JPEG encoding and decoding. Not differentiable, but close to an identity mapping.

  • Backward: gradient computed as if the forward pass were the identity mapping.
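A minimal PyTorch sketch of the straight-through estimator, assuming some non-differentiable, roughly identity-preserving transform such as the jpeg_round_trip helper sketched earlier (adapted to tensors); the class name and wiring are illustrative:

import torch

class StraightThroughTransform(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Real, non-differentiable forward pass (e.g. JPEG encode + decode),
        # executed outside the autograd graph.
        x_np = x.detach().cpu().numpy()
        out = jpeg_round_trip(x_np)  # assumed helper; approximately the identity
        return torch.as_tensor(out, dtype=x.dtype, device=x.device)

    @staticmethod
    def backward(ctx, grad_output):
        # Backward pretends the forward pass was the identity mapping,
        # so the incoming gradient is passed straight through.
        return grad_output

# Usage inside an attack: logits = model(StraightThroughTransform.apply(x_adv))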

15 of 71

Example: Input transformations

  • Random mixture of transformations:
    • Perform expectation over transformations (EOT):

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

[Diagram: Input → Random transformation → Transformed (safe) input → Model → Correct prediction. EOT attack: attack the pipeline with PGD, taking the expectation of the gradient over the random transformations.]

16 of 71

Example: Input transformations

Randomness in defense:

  • Makes the model a gray box.
  • The attacker may still craft an attack that is effective against the whole set of randomised defenses.
  • One can do this with expectation over transformations (EoT); see the sketch below.
  • With sufficient attacker capacity, the defense may still be ineffective.
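A minimal sketch of the EoT gradient, assuming the randomised defense is exposed as a (differentiable or straight-through) sampler random_transform and that model returns logits; one PGD step then uses this averaged gradient:

import torch
import torch.nn.functional as F

def eot_gradient(model, x_adv, y, random_transform, n_samples=30):
    # Estimate E_t[ grad_x L(model(t(x)), y) ] by averaging over sampled transforms.
    x_adv = x_adv.clone().detach().requires_grad_(True)
    loss = 0.0
    for _ in range(n_samples):
        loss = loss + F.cross_entropy(model(random_transform(x_adv)), y)
    loss = loss / n_samples
    return torch.autograd.grad(loss, x_adv)[0]

# One PGD step with the EoT gradient (alpha: step size):
# x_adv = (x_adv + alpha * eot_gradient(model, x_adv, y, random_transform).sign()).clamp(0, 1)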

17 of 71

Example: Input transformations

  • When the ICML 2018 attack is applied, adversarial accuracy drops to 0%!

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

[Results figure: adversarial accuracy of the input-transformation defenses under the ICML 2018 attack.]

18 of 71

Adversarial training is still an effective defense.

  • Adversarial training does not introduce obfuscated gradients.
  • It was hard for the ICML 2018 attacks to achieve higher success rates against adversarially trained models!
  • → AT is an effective defense.
  • Caveat: Adversarial training is very difficult to perform at scale.

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

19 of 71

What if the set of transformations is gigantic?

Adversary: EoT effectively beats a defender with a good number of candidate transformations.

Defender: But what if the defender employs an exponential number of possible transformations?

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

20 of 71

Barrage of Random Transforms (BaRT)

  • Introduce 10 groups of possible image transformations.
    • Color Precision Reduction
    • JPEG Noise
    • Swirl
    • Noise Injection
    • FFT Perturbation
    • Zoom
    • Color Space
    • Contrast
    • Grayscale
    • Denoising
  • And apply them in a random sequence (see the sketch below).
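A minimal sketch of the barrage idea, assuming each group above is available as a callable; selecting a random subset in a random order is the core trick (BaRT additionally randomises each transform's own parameters):

import random

def barrage(x, transform_groups, k=5):
    # transform_groups: list of callables, one per group listed above (placeholders).
    # Apply a random subset of k groups, in a random order, sequentially.
    chosen = random.sample(transform_groups, k)  # random subset, in random order
    for transform in chosen:
        x = transform(x)
    return x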

Barrage of Random Transforms for Adversarially Robust Defense. CVPR 2019.

21 of 71

Barrage of Random Transforms (BaRT)

  • BaRT defends a model against PGD.
  • BaRT defends against ICML 2018 attacks (EOT) designed to break gradient obfuscations.

Barrage of Random Transforms for Adversarially Robust Defense. CVPR 2019.

22 of 71

History of adversarial robustness in ML

  • 2014. Attack: first attack (L-BFGS attack).
  • 2015. Attack: first practical attack (FGSM attack).
  • 2016. Attacks: first black-box attack (substitute model); stronger iterative attacks (DeepFool). Defense: first defense (distillation).

23 of 71

History of adversarial robustness in ML

  • 2017. Attack: strong attack (PGD). Defenses: strong defense (adversarial training); adversarial input detection methods.
  • 2018. Attack: obfuscated gradients shows that the defenses at ICLR’18 are mostly ineffective. Defenses at ICLR’18: input perturbation, adversarial input detection, adversarial training, …
  • 2019. Defense: Barrage of Random Transforms (BaRT): you just need to apply transforms sequentially.

24 of 71

History of adversarial robustness in ML

2020 onwards: Stop the cat-and-mouse game!

It’s a dead end.

Alternatives:

  1. Diversify and randomise (e.g. BaRT)
  2. Certified defenses
  3. Deal with realistic threats, rather than unrealistic worst-case threats.

25 of 71

Certified defenses

Two-layer neural network: f(x) = V σ(Wx).

  • V and W are matrices.
  • σ(·) is a non-linearity with bounded gradients, e.g. ReLU or sigmoid.

We write the following for the worst-case adversarial attack.
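The slide's equation is not reproduced here. A common way to write the worst-case attack objective in this setting (an assumption about the slide's exact notation, following the usual margin formulation) is

    A(x) = \max_{\tilde{x} \,:\, \|\tilde{x} - x\|_\infty \le \epsilon} \Big( f(\tilde{x})_{y'} - f(\tilde{x})_{y} \Big),

where y is the true label, y' a competing label, and \epsilon the attack budget; the prediction at x is certifiably robust if this quantity is negative for every y'.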

Certified defenses against adversarial examples. ICLR 2018.

26 of 71

Certified defenses

They derive an upper bound on the severity of any adversarial attack within the budget; this bound serves as the certificate.

Certified defenses against adversarial examples. ICLR 2018.

27 of 71

Certified defenses

Certified defenses against adversarial examples. ICLR 2018.

Based on the final bound, one may formulate a training loss that also penalises the certified bound.

Given the values V[t], W[t], and c[t] at training iteration t, one obtains a guarantee that holds for any attack A.

28 of 71

Towards less pessimistic defense

The attacks considered so far are unrealistically strong (worst case). Instead, more work focuses on:

  • Defense against black-box attacks.
  • Defense against non-adversarial, non-worst-case perturbations:
    • OOD generalisation
    • Domain generalisation
    • Cross-bias generalisation

29 of 71

Summary

  • DNNs are vulnerable within a small Lp ball.
  • Attacks and defenses tend to be a cat and mouse game.
  • People seek alternative directions, such as domain generalisation.

30 of 71

31 of 71

Explainability

32 of 71

A system

[Diagram: Input → System → Output]

33 of 71

(Relatively) well-understood system

[Diagram: Central banks → policy rates and asset purchases → price control (monetary policy transmission).]

https://www.ecb.europa.eu/mopo/intro/transmission/html/index.en.html

34 of 71

Good understanding → Know how to control

[Diagram as on the previous slide: Central banks → policy rates and asset purchases → price control.]

https://www.ecb.europa.eu/mopo/intro/transmission/html/index.en.html

35 of 71

A black box system

[Diagram: Input → System (black box) → Output]

36 of 71

We don’t understand → We can’t control

[Diagram: Input → System (black box) → Output]

https://news.columbia.edu/news/researchers-unveil-tool-debug-black-box-deep-learning-algorithms

37 of 71

Two ways to control undefined behaviour in OOD

1. An infinite list of unit tests and data augmentation.

Goal: Let a model work well in any new environment.

Evaluation: Introduce ImageNet-A,B,C,D, ….

Model: Augment ImageNet-A,B,C,D, ….

https://github.com/hendrycks/robustness

38 of 71

Two ways to control undefined behaviour in OOD

2. Understand and fix.

Goal: Let a model work well in any new environment.

Evaluation: Examine cues utilised by the model (explainability).

Model: Regularise the model to choose generalisable cues (feature selection).

→ Perhaps more scalable?

https://github.com/hendrycks/robustness

39 of 71

Explainability as a base tool for many applications

Applications requiring the selection of right features:

  • Fairness and demographic biases.
  • Robustness to distribution shifts.

Applications requiring better understanding and controllability:

  • Quick adaptation of models to downstream tasks (e.g. GPT-3 and LLMs).
  • ML for science; discovering scientific facts from high-dimensional data.

Towards A Rigorous Science of Interpretable Machine Learning.

40 of 71

Explainability as a base tool for many applications

Applications requiring better understanding of training data:

  • Detection of private information.
  • Attribution of original authors in training data.

Applications requiring greater user trust:

  • ML-human expert symbiosis.
  • Finance, law, and medical applications.

Towards A Rigorous Science of Interpretable Machine Learning.

41 of 71

Explanation as a data subject’s right.

European Union regulations on algorithmic decision-making and a “right to explanation”. 2016

GDPR

42 of 71

Explanation as a data subject’s right.

  • Critical decisions about humans are made by automated systems using their personal data.
  • Three key barriers to transparency:
    • Intentional concealment; decision-making procedures are kept from public scrutiny.
    • Gaps in technical literacy; for most people, reading the code is insufficient.
    • Mismatch between the actual inner workings of models and the demands of human-scale reasoning and styles of interpretation.

European Union regulations on algorithmic decision-making and a “right to explanation”. 2016

https://www.knime.com/blog/banks-use-xai-transparent-credit-scoring.

43 of 71

When is an explanation needed?

  • “Explanations may highlight an incompleteness”.
  • Explanations are typically required when something does not work as expected.

Keil FC. Explanation and understanding. Annu Rev Psychol. 2006.

Towards A Rigorous Science of Interpretable Machine Learning.

44 of 71

When is an explanation not needed?

Examples: ad servers, postal code sorting, aircraft collision avoidance systems.

Why?

  1. There are no significant consequences for unacceptable results; OR
  2. The problem is sufficiently well-studied and validated in real applications that we trust the system’s decision, even if the system is not perfect.

Towards A Rigorous Science of Interpretable Machine Learning.

45 of 71

Defining explainability

Which algorithms are generally deemed explainable?

  • Sparse linear models with human-understandable features (a minimal sketch follows below).
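A minimal sketch of why such a model counts as explainable: after fitting a Lasso, the few non-zero coefficients are the explanation. The data and feature names below are synthetic placeholders.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                  # 10 candidate features
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)
explanation = {f"feature_{i}": round(w, 2)
               for i, w in enumerate(model.coef_) if w != 0.0}
print(explanation)   # only the few features that actually drive the prediction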

Towards A Rigorous Science of Interpretable Machine Learning.

https://cims.nyu.edu/~cfgranda/pages/OBDA_spring16/material/lecture_4.pdf

46 of 71

Defining explainability

Which algorithms are generally deemed explainable?

  • Decision trees with human-understandable criteria.

Towards A Rigorous Science of Interpretable Machine Learning, https://forum.huawei.com/enterprise/en/machine-learning-algorithms-decision-trees/thread/710283-895

47 of 71

Defining explainability

What then is explainability?

No easy answer.

Let’s start with how humans explain to each other.

Explanation in Artificial Intelligence: Insights from the Social Sciences

48 of 71

Human-to-human explanations

Tim Miller’s work on the status of XAI from the social-science perspective.

Highly recommended for those working in XAI.

Explanation = “an answer to a why–question”.

Explanation in Artificial Intelligence: Insights from the Social Sciences

49 of 71

Why do humans need explanations?

Malle [112, Chapter 3]: people ask for explanations for two main reasons.

1. To find meaning: to reconcile the contradictions or inconsistencies between elements of our knowledge structures.

2. To manage social interaction: to create a shared meaning of something, and to change others’ beliefs & impressions, their emotions, or to influence their actions.

Both are important for explainable AI systems.

Explanation in Artificial Intelligence: Insights from the Social Sciences

50 of 71

Characteristics of human-to-human explanations

  • Contrastive.
  • Selective.
  • Social, context-dependent, and interactive.

Explanation in Artificial Intelligence: Insights from the Social Sciences

51 of 71

Contrastive explanations

Explanations are sought in response to particular counterfactual cases.

People do not ask: Why did event P happen?

They ask: Why did event P happen instead of some event Q?

Even if the apparent format is “Why P?”, it usually implies “Why P rather than Q?”

Explanation in Artificial Intelligence: Insights from the Social Sciences

52 of 71

Contrastive explanations

The foil Q, the alternative case, is often implied.

For the question “Why did Elizabeth open the door?”, there are many possible foils.

  • “Why did Elizabeth open the door, rather than leave it closed?”
  • “Why did Elizabeth open the door rather than the window?”
  • “Why did Elizabeth open the door rather than Michael opening it?”

Explanation in Artificial Intelligence: Insights from the Social Sciences

53 of 71

Contrastive explanations

For XAI, we ask questions like: Why was this image categorised as class A?

Perhaps less ambiguous to ask: Why is this image categorised as A, not B?

Or

Will this image still be categorised as A even if the image is modified?

Keep CALM and Improve Visual Feature Attribution. ICCV 2021.

54 of 71

Selective explanations

  • People rarely expect an explanation that consists of an actual and complete cause of an event.
  • Humans are adept at selecting one or two causes from a sometimes infinite number of causes to be the explanation.
  • Causal chains are often too large to comprehend.
  • Principle of simplicity – Do not overwhelm human users.

Explanation in Artificial Intelligence: Insights from the Social Sciences

55 of 71

Social explanations.

  • Findings from philosophy, psychology, and cognitive studies:
    • People employ cognitive biases and social expectations.
    • Explanations are a transfer of knowledge, presented as part of a conversation or interaction.
    • Explanations are thus presented relative to the explainer’s beliefs about the explainee’s beliefs.

Explanation in Artificial Intelligence: Insights from the Social Sciences

56 of 71

Interactive explanations

  • Let’s say there are many causes.
  • Explainee cares only about a small subset relevant to the context.
  • Explainer selects a subset of this subset based on other criteria.
  • Explainer and explainee may interact and argue about this explanation.

Explanation in Artificial Intelligence: Insights from the Social Sciences

57 of 71

A good video summary on explanations

Richard Feynman: Explanations are

  • Context dependent;
  • Social;
  • Counterfactual.

There’s no single correct answer to “why?”.

58 of 71

What are good explanations?

  • Soundness (faithfulness, correctness): identifies the true cause for an event.
    • Primary focus of XAI evaluation.
    • Turns out to be high on the list of desiderata from domain experts (second ref below).
    • Not the only criterion for a good explanation (Hilton [73]).
  • Simplicity: cite fewer causes.
  • Generality: explain many events.
  • Relevance: aligned with the final goal.

Explanation in Artificial Intelligence: Insights from the Social Sciences.
Rethinking Explainability as a Dialogue: A Practitioner’s Perspective. 2022.

59 of 71

Revisiting intrinsically explainable models.

Towards A Rigorous Science of Interpretable Machine Learning.

https://cims.nyu.edu/~cfgranda/pages/OBDA_spring16/material/lecture_4.pdf

Sparse linear model

Decision trees

60 of 71

Do they explain well?

  • Soundness: By definition, every feature is a sound cause for the outcome.
  • Simplicity: One can control it via the number of features included.
  • Generality: By definition, whenever the cited causes happen, similar outcomes will follow.
  • Contrastive: One can simulate contrastive reasoning from the ground up.
  • Social & interactive: Not by default; additional module required.
  • Relevance: Depends on the final goal (To debug? To understand? …).

Explanation in Artificial Intelligence: Insights from the Social Sciences

61 of 71

Types of model explainability

  • Intrinsic vs post-hoc.
  • Global vs local.
  • Attributing to training sample vs test sample.

62 of 71

Intrinsic vs post-hoc explainability

  • Intrinsic explainability means the model is interpretable (sound, simple, and general) by definition. Examples: sparse linear models, decision trees.
  • Post-hoc explainability means the model itself lacks interpretability and one is trying to explain its behaviour post-hoc.

63 of 71

Terminology: Explainability vs Interpretability.

Biran and Cotton [9]. Interpretability: the degree to which an observer can understand the cause of a decision.

Lipton [103]. Explanation is post-hoc interpretability.

Justification: Explains why a decision is good, but does not necessarily aim to give an explanation of the actual decision-making process [9].

Explanation in Artificial Intelligence: Insights from the Social Sciences

64 of 71

Global vs local explainability.

Global

  • Example: Epidemic models (the SIR model).
  • Overall understanding of the mechanism.
  • Often impossible for complex, black-box deep models.
  • Useful for scientific understanding and simulation of counterfactuals.

Towards A Rigorous Science of Interpretable Machine Learning. Modeling Transmission Dynamics and Control of Vector-Borne Neglected Tropical Diseases.

65 of 71

Global vs local explainability.

Local

  • Example: Why did my loan get rejected?
  • Understanding of the mechanism behind a particular case.
  • Feasible even for complex deep models.
  • We focus on local explainability in this lecture.

Towards A Rigorous Science of Interpretable Machine Learning.

66 of 71

Attributing to training sample vs test sample

A model is a function.

[Diagram: Input → Model → Output]

67 of 71

Attributing to training sample vs test sample

The model is, again, an output of a training algorithm.

[Diagram: Input → System → Output, where the System itself is produced from the Training data.]

68 of 71

Attributing to training sample vs test sample

We write a model prediction as a function of two variables.

Y = Model(X; θ) = Model(X; θ(X_tr))

We can trace back the output Y for X to either

  • a particular feature x_i in the test sample X = [x_1, x_2, …, x_D]; or
  • a particular training sample X_tr,i in the training set X_tr = {X_tr,1, X_tr,2, …, X_tr,N}.

One may also attribute the prediction to a particular parameter θ_j, but individual parameters are often not very interpretable to humans.
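A minimal sketch of the first option, attributing the prediction to test-sample features via a plain input gradient (saliency); attributing to training samples (e.g. with influence functions) additionally needs access to θ(X_tr) and is not shown. Here, model is an assumed differentiable classifier returning logits.

import torch

def input_gradient_attribution(model, x, target_class):
    # |d f_y(X) / d x_i| for every input feature x_i of the test sample X.
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[0, target_class]   # scalar logit f_y(X) for a batch of one
    score.backward()
    return x.grad.abs()                 # per-feature attribution map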

69 of 71

What’s the current status of XAI techniques?

Despite the recent growth spurt in the field of XAI, studies examining how people actually interact with AI explanations have found popular XAI techniques to be ineffective [6, 80, 111], potentially risky [50, 95], and underused in real-world contexts [58].

Expanding Explainability: Towards Social Transparency in AI systems. CHI 2021.

70 of 71

What’s the current status of XAI techniques?

The field has been critiqued for its techno-centric view, where “inmates [are running] the asylum” [70], based on the impression that XAI researchers often develop explanations based on their own intuition rather than the situated needs of their intended audience.

Solutionism (always seeking technical solutions) and Formalism (seeking abstract, mathematical solutions) [32, 87] are likely to further widen these gaps.

Expanding Explainability: Towards Social Transparency in AI systems. CHI 2021.

71 of 71

Summary

  • We motivate the study of ML explainability.
  • ML explainability research has been criticised for being ineffective in practice.
  • We have prepared ourselves to study ML explainability from more human- and application-oriented perspectives.
  • Keeping this in mind, let’s dive into ML explainability techniques next time.