1 of 90

CSCI-SHU 376: Natural Language Processing

Hua Shen

2026-04-28

Spring 2026

Lecture 18 & 19: Mechanistic Interpretability Approaches & Human-Centered and Interactive Explanation

Contents adapted from UC Berkeley Agentic AI Course

2 of 90

Today’s Plan

  1. Motivation: Addressing the "Black Box" Problem
  2. Taxonomy: An Overview of Interpretation Methodologies
  3. Deep Dive: Explainability for LLMs
  4. Assessment: Methods for Evaluating Explanations

3 of 90

Outline

  • Motivation: Addressing the "Black Box" Problem
  • Taxonomy: An Overview of Interpretation Methodologies (25 min)
  • Deep Dive: Explainability for LLMs (25 min)
  • Assessment: Methods for Evaluating Explanations (15 min)

4 of 90

Why do we need AI interpretability?

5 of 90

The usefulness of AI Interpretability for humans is crucial

6 of 90

Trajectory of Explainable AI (XAI) algorithms

7 of 90

Interpretability vs. Explanations

(Interchangeable)

8 of 90

What are the desired properties of AI explanations?

Share Your Thoughts 🙌

9 of 90

Desiderata of AI Explanations

  • Faithful
    • Accurately represent the model mechanisms
  • Plausible
    • Coherent with human understanding
  • Useful
    • Helpful for human perception, AI improvement, or downstream tasks

10 of 90

Faithful AI Explanations

Faithfulness: Accurately represent the model mechanisms

11 of 90

Plausible AI Explanations

Plausibility: Coherent with human understanding, also referred to as persuasiveness or understandability

12 of 90

Faithfulness vs. Plausibility?

Application categorization w.r.t. the desired levels of faithfulness (left) and plausibility (right) in explanations provided by LLMs.

  • High-stakes applications like healthcare, finance, and legal demand high faithfulness to ensure the accuracy of the LLM’s output due to the critical nature of decisions made in these fields.
  • Conversely, recreational and educational applications like storytelling, educational LLMs, and creativity prioritize plausibility to enhance user engagement.
  • Agarwal, Chirag, Sree Harsha Tanneru, and Himabindu Lakkaraju. "Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models." arXiv preprint arXiv:2402.04614 (2024).

13 of 90

Faithfulness vs. Plausibility?

It is possible that AI explanations are:

  • faithful but NOT plausible,
  • or plausible but NOT faithful
  • Agarwal, Chirag, Sree Harsha Tanneru, and Himabindu Lakkaraju. "Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models." arXiv preprint arXiv:2402.04614 (2024).

14 of 90

Useful AI Explanations

Usefulness:

  • Helpful for AI: improve AI performance
  • Helpful for humans: improve human understanding
  • Helpful for the human-AI team: empower human-AI collaboration

15 of 90

Useful AI Explanations

16 of 90

Outline

  • Motivation: Addressing the "Black Box" Problem (10 min)
  • Taxonomy: An Overview of Interpretation Methodologies
  • Deep Dive: Explainability for LLMs (25 min)
  • Assessment: Methods for Evaluating Explanations (15 min)

17 of 90

What AI explanations have you encountered or used?

Share Your Thoughts 🙌

18 of 90

Different ways to categorize AI explanations

How would you explain this AI prediction to a user?

19 of 90

A survey of 200+ Explainable AI (XAI) papers

20 of 90

A collection of Explainable AI (XAI) types

21 of 90

Different ways to categorize AI explanations

  • Global vs. Local Explanations
  • Post-hoc vs. Self-Explanations
  • Various Types of Explanations

22 of 90

Global Explanations

How does the AI system generally make predictions?

(Yang, C., et al., 2018)

(Ribeiro, M. T., et al., 2018)

23 of 90

Local Explanations

How does the AI system make this specific prediction?

How can we attribute the AI’s prediction to the input features?

What are the most relevant training examples contributing to this AI prediction?

(Koh, P. W., & Liang, P., 2017)

(Wallace, Eric, et al., 2019)

24 of 90

Post-hoc Explanations

  • Shen, Hua, et al. "Are shortest rationales the best explanations for human understanding?" ACL 2022

(Shen, Hua, et al. 2022)

25 of 90

Self-Explaining Models

  • Shen, Hua, et al. "Are shortest rationales the best explanations for human understanding?" ACL 2022

26 of 90

Explanations of large language models focus primarily on post-hoc explanations, both global and local

27 of 90

Outline

  • Motivation: Addressing the "Black Box" Problem (10 min)
  • Taxonomy: An Overview of Interpretation Methodologies (25 min)
  • Deep Dive: Explainability for LLMs
  • Assessment: Methods for Evaluating Explanations (15 min)

28 of 90

Explanations in the Era of LLMs

  • Rationale-based Explanations
    • Feature attributions / Extractive rationales
    • Free-text explanations
    • Structured explanations
  • Data attribution
    • Data Influence (Influence Functions)
  • Transformer understanding
    • Neuron-level interpretability: Sparse Autoencoders
    • Causal Mediation
    • Transformer-oriented Interpretation

29 of 90

Rationale-based Explanations

  • Feature attributions (Post-hoc)
  • Extractive rationales (Self-Explanation)
  • DeYoung, Jay, et al. "ERASER: A benchmark to evaluate rationalized NLP models." ACL 2020

30 of 90

Feature Attributions (Post-hoc Explanation)

  • Removal- / Perturbation- based Explanations
  • SHAP (SHapley Additive exPlanation)
  • Gradient-based Explanations

31 of 90

Feature Attributions (Post-hoc Explanation)

  • Covert, Ian, Scott Lundberg, and Su-In Lee. "Explaining by removing: A unified framework for model explanation." Journal of Machine Learning Research 22.209 (2021): 1-90.
  • Lundberg, Scott M., and Su-In Lee. "A unified approach to interpreting model predictions." Advances in neural information processing systems 30 (2017).
  • Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. "Axiomatic attribution for deep networks." International conference on machine learning. PMLR, 2017.
  • Removal- / Perturbation- based Explanations
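
A minimal sketch of one removal-based attribution, leave-one-out occlusion: each token is scored by how much the model output drops when that token is deleted. The scoring function `toy_sentiment_score` and its weights are hypothetical stand-ins for a real model, not anything from the cited papers.

```python
def toy_sentiment_score(tokens):
    """Hypothetical stand-in for a model: returns a positive-class score."""
    weights = {"great": 2.0, "good": 1.0, "boring": -1.5}
    return sum(weights.get(t, 0.0) for t in tokens)


def leave_one_out(tokens, score_fn):
    """Attribute to each token the score change caused by removing it."""
    base = score_fn(tokens)
    attributions = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]  # input with token i removed
        attributions.append(base - score_fn(reduced))
    return attributions


tokens = ["the", "movie", "was", "great", "but", "boring"]
attr = leave_one_out(tokens, toy_sentiment_score)
for tok, a in zip(tokens, attr):
    print(f"{tok:>6}: {a:+.2f}")
```

Each attribution costs one extra forward pass, so the cost grows linearly with input length. Real removal-based methods also differ in how a feature is "removed" (deleted, masked, or marginalized), which this sketch does not explore.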

32 of 90

Feature Attributions (Post-hoc Explanation)

  • SHAP (SHapley Additive exPlanation)
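
A minimal sketch of the Shapley value that SHAP builds on, computed exactly by enumerating every coalition of features. This is not the SHAP library's API; `toy_value` is a hypothetical value function standing in for "model output when only these features are present", and exact enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(n_features, value_fn):
    """Exact Shapley values by enumerating all coalitions (exponential cost)."""
    players = list(range(n_features))
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(coalition):
    """Hypothetical model output given the set of present features."""
    v = 0.0
    if 0 in coalition:
        v += 1.0
    if 1 in coalition:
        v += 2.0
    if 0 in coalition and 2 in coalition:
        v += 3.0  # interaction between features 0 and 2
    return v


phi = shapley_values(3, toy_value)
print(phi)
```

The attributions sum to the difference between the full-coalition value and the empty-coalition value, and the interaction term is split evenly between features 0 and 2.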

33 of 90

Feature Attributions (Post-hoc Explanation)

  • Gradient-based Explanations
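
A minimal sketch of integrated gradients (Sundararajan et al., 2017), one gradient-based attribution: gradients are averaged along a straight path from a baseline to the input and scaled by the input difference. The logistic model, its weights `W`, and the all-zeros baseline are assumptions made for illustration; a real use would take gradients from an autodiff framework.

```python
import math

W = [1.5, -2.0, 0.5]  # hypothetical model weights


def model(x):
    """Toy differentiable model: logistic regression without a bias term."""
    z = sum(w * xi for w, xi in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))


def grad(x):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x)
    return [p * (1.0 - p) * w for w in W]


def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann approximation of the path integral of gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
ig = integrated_gradients(x, baseline)
print(ig, sum(ig), model(x) - model(baseline))
```

The printed sum of attributions should closely match the difference in model output between input and baseline, which is the completeness property the method is designed to satisfy.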

34 of 90

Feature Attributions (Post-hoc Explanation)

Challenges:

  • Computational cost
  • Low efficiency in long context
  • No model access (gradients, attention scores, etc.)

35 of 90

Extractive rationales (Self-Explanation)

Lei, Tao, Regina Barzilay, and Tommi Jaakkola. "Rationalizing neural predictions." EMNLP, 2016.
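
A minimal sketch of the regularizer that Lei et al. (2016) place on the binary rationale mask: one term penalizes the number of selected tokens and another penalizes switches between selected and unselected tokens, which favors short, contiguous rationales. The function name and the default coefficients are illustrative assumptions.

```python
def rationale_penalty(mask, lambda_len=1.0, lambda_cont=1.0):
    """Length term plus continuity term over a 0/1 token-selection mask."""
    length = sum(mask)
    transitions = sum(abs(mask[t] - mask[t - 1]) for t in range(1, len(mask)))
    return lambda_len * length + lambda_cont * transitions


contiguous = [0, 1, 1, 1, 0, 0]  # one span of three tokens
scattered = [1, 0, 1, 0, 1, 0]   # three isolated tokens
print(rationale_penalty(contiguous), rationale_penalty(scattered))
```

Both masks select three tokens, but the scattered one is penalized more because it switches on and off more often.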

36 of 90

Extractive rationales (Self-Explanation)

  • Shen, Hua, et al. "Are shortest rationales the best explanations for human understanding?" ACL 2022

37 of 90

Extractive rationales (Self-Explanation)

  • Shen, Hua, et al. "Are shortest rationales the best explanations for human understanding?" ACL 2022

How to control explanation length?

38 of 90

Free-text Explanations

What does a free-text Explanation look like?

39 of 90

Free-text Explanations

How to Generate Free-text Explanations?

  • Kumar, Sawan, and Partha Talukdar. "NILE: Natural language inference with faithful natural language explanations." arXiv preprint arXiv:2005.12116 (2020).

40 of 90

Free-text Explanations

  • Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.

In-context learning / Few-shot prompting

41 of 90

Prompting for Explanations

  • Wiegreffe, Sarah, et al. "Reframing human-AI collaboration for generating free-text explanations." NAACL 2022
  • Marasović, Ana, et al. "Few-shot self-rationalization with natural language prompts." Findings of the Association for Computational Linguistics: NAACL 2022.

42 of 90

Chain of Thought (CoT) - based Explanations

  • Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." Advances in neural information processing systems 35 (2022): 24824-24837.
  • Kojima, Takeshi, et al. "Large language models are zero-shot reasoners." Advances in neural information processing systems 35 (2022): 22199-22213.
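
A minimal sketch of how the two prompting styles cited above differ at the prompt level: few-shot CoT (Wei et al.) prepends worked examples with their reasoning, while zero-shot CoT (Kojima et al.) appends the trigger phrase "Let's think step by step." The helper names and the demonstration question are made up for illustration, and no model is called.

```python
def few_shot_cot_prompt(demos, question):
    """Build a prompt from (question, rationale, answer) demonstrations."""
    parts = []
    for q, rationale, answer in demos:
        parts.append(f"Q: {q}\nA: {rationale} The answer is {answer}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def zero_shot_cot_prompt(question):
    """Append the zero-shot CoT trigger phrase to the question."""
    return f"Q: {question}\nA: Let's think step by step."


demos = [(
    "A shelf holds 4 boxes with 6 pens each. How many pens are there?",
    "There are 4 boxes and each has 6 pens, so 4 * 6 = 24.",
    "24",
)]
question = "A class has 3 rows of 7 desks. How many desks are there?"
few_shot = few_shot_cot_prompt(demos, question)
zero_shot = zero_shot_cot_prompt(question)
print(few_shot)
print(zero_shot)
```

The text the model generates after either prompt is what gets read as the explanation. Whether that text faithfully reflects the model's computation is a separate question.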

43 of 90

CoT + Question Decomposition

44 of 90

CoT + Vote and Rank
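
A minimal sketch of one way to vote over and rank chains of thought: sample several reasoning chains, extract each final answer, and rank the answers by how often they occur. Reading the slide title as majority voting of this kind is an assumption, and the chains below are hard-coded rather than sampled from a model.

```python
from collections import Counter


def extract_answer(chain):
    """Pull the text after the last 'The answer is' marker, if present."""
    marker = "The answer is"
    idx = chain.rfind(marker)
    if idx == -1:
        return None
    return chain[idx + len(marker):].strip().rstrip(".")


def vote_and_rank(chains):
    """Rank final answers by the number of chains that reach them."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    return Counter(answers).most_common()


chains = [
    "5 plus 2 cans of 3 is 5 + 6 = 11. The answer is 11.",
    "2 cans of 3 is 6, and 5 + 6 = 11. The answer is 11.",
    "5 + 2 + 3 = 10. The answer is 10.",
]
ranked = vote_and_rank(chains)
print(ranked)
```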

45 of 90

Structured Explanations

  • Traditionally: train models to iteratively generate intermediate steps
    • ProofWriter
    • EntailmentWriter
  • Need a lot of training data

Tafjord, Oyvind, Bhavana Dalvi, and Peter Clark. "Proofwriter: Generating implications, proofs, and abductive statements over natural language." ACL-IJCNLP. 2021.

Dalvi, Bhavana, et al. "Explaining answers with entailment trees." arXiv preprint arXiv:2104.08661 (2021).

46 of 90

Structured Explanations

47 of 90

Logically-Constrained Reasoning

  • Jung, Jaehun, et al. "Maieutic prompting: Logically consistent reasoning with recursive explanations." EMNLP 2022.
  • Ye, Xi, et al. "Satlm: Satisfiability-aided language models using declarative prompting." NeurIPS 2023
  • Maieutic Prompting
  • See also SatLM [Ye et al., 2023]

48 of 90

Symbolically-Aided Reasoning

  • Chen, Wenhu, et al. "Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks." arXiv preprint arXiv:2211.12588 (2022).
  • Gao, Luyu, et al. "Pal: Program-aided language models." International Conference on Machine Learning. PMLR, 2023.
  • Program of Thoughts (PoT) Prompting

See also: Program-Aided LM/PAL [Gao et al., 2023]
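
A minimal sketch of the idea behind program-based prompting: the model writes a short program as its reasoning, and an interpreter, not the model, carries out the computation. The program text below is hard-coded to stand in for model output, and storing the result in a variable named `ans` is a convention assumed here.

```python
# Stand-in for a program a model might generate for:
# "What is 1000 invested at 5% annual interest worth after 3 years?"
generated_program = """
principal = 1000
rate = 0.05
years = 3
ans = principal * (1 + rate) ** years
"""


def run_program(source):
    """Execute generated code and return its `ans` variable.

    Emptying the builtins limits what the code can call, but it is NOT a
    real sandbox; untrusted code needs proper isolation.
    """
    namespace = {}
    exec(source, {"__builtins__": {}}, namespace)
    return namespace["ans"]


result = run_program(generated_program)
print(round(result, 3))
```

The program itself doubles as a structured explanation: each line is an inspectable step, and the arithmetic is exact rather than generated token by token.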

49 of 90

Data Influence

  • Seminal Work: Influence Functions

[Koh and Liang, 2017]

50 of 90

Data Influence

  • Koh, Pang Wei, and Percy Liang. "Understanding black-box predictions via influence functions." ICML. PMLR, 2017.
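
For reference, the central quantity in Koh and Liang (2017) can be written as follows, where $L$ is the loss, $z_1, \dots, z_n$ are the training points, and $\hat{\theta}$ is the empirical risk minimizer. The influence of upweighting a training point $z$ on the loss at a test point $z_{\text{test}}$ is a Hessian-weighted inner product of two gradients:

```latex
\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta),
\qquad
H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta})

\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})
  = -\,\nabla_{\theta} L(z_{\text{test}}, \hat{\theta})^{\top}
     \, H_{\hat{\theta}}^{-1} \,
     \nabla_{\theta} L(z, \hat{\theta})
```

A large negative value means that upweighting $z$ would lower the test loss, so $z$ is helpful for that prediction. The expensive part in practice is the inverse-Hessian-vector product, which is approximated rather than computed exactly for large models.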

51 of 90

Data Influence: Explaining LLMs’ Completions

  • Grosse, Roger, et al. "Studying large language model generalization with influence functions." arXiv preprint arXiv:2308.03296 (2023).

52 of 90

Transformer Understanding

Neuron-level interpretability: Sparse Autoencoders

  • Transformer Circuits Thread: https://transformer-circuits.pub/
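
A minimal sketch of a sparse autoencoder forward pass and loss, assuming a common formulation: a ReLU encoder into an overcomplete feature space, a linear decoder, and a loss that adds an L1 sparsity penalty on the feature activations to the reconstruction error. The tiny hand-picked weights are hypothetical; real SAEs are trained on model activations with far larger dictionaries.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activation x into sparse features f, then reconstruct it."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [h + b for h, b in zip(matvec(W_enc, centered), b_enc)]
    f = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat


def sae_loss(x, f, x_hat, l1_coeff):
    """Squared reconstruction error plus L1 penalty on feature activations."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(a) for a in f)
    return recon + l1_coeff * sparsity


# 2-dimensional activation, 4 features (overcomplete dictionary)
W_enc = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1, -1, 0, 0], [0, 0, 1, -1]]
b_dec = [0.0, 0.0]

x = [0.5, -2.0]
f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=0.1)
print(f, x_hat, loss)
```

Only two of the four features fire for this input, and the reconstruction is exact, so the loss here comes entirely from the sparsity term.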

53 of 90

Sparse Autoencoders

54 of 90

Sparse Autoencoders

55 of 90

Sparse Autoencoders

56 of 90

Course Final Project & Scoring Updates

  1. The lowest quiz score will be dropped
  2. Upload the project presentation video to YouTube

57 of 90

Transformer Understanding

58 of 90

Transformer Understanding

The Three Layer Causal Hierarchy

https://web.cs.ucla.edu/~kaoru/3-layer-causal-hierarchy.pdf

59 of 90

Causal Mediation

60 of 90

Causal Mediation

61 of 90

Causal Mediation

62 of 90

Causal Mediation

63 of 90

Causal Mediation

64 of 90

Causal Mediation in Transformers
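
A minimal sketch of the intervention at the core of causal mediation analysis in networks, often called activation patching: run the model on a corrupted input, overwrite one internal activation with its value from a clean run, and measure how much of the clean output is restored. The two-unit network and its weights are a hypothetical toy, not a transformer.

```python
def hidden_layer(x):
    """Two hidden units with hypothetical weights and ReLU."""
    h0 = max(0.0, 1.0 * x[0] + 0.0 * x[1])
    h1 = max(0.0, 0.0 * x[0] + 1.0 * x[1])
    return [h0, h1]


def output_layer(h):
    """Output depends strongly on unit 0 and weakly on unit 1."""
    return 2.0 * h[0] + 0.1 * h[1]


def run(x, patch=None):
    """Forward pass; optionally overwrite one hidden unit as (index, value)."""
    h = hidden_layer(x)
    if patch is not None:
        idx, value = patch
        h[idx] = value
    return output_layer(h)


clean, corrupted = [1.0, 1.0], [0.0, 0.0]
clean_h = hidden_layer(clean)
clean_out, corrupted_out = run(clean), run(corrupted)

effects = []
for i in range(len(clean_h)):
    patched_out = run(corrupted, patch=(i, clean_h[i]))
    # fraction of the clean-vs-corrupted gap restored by patching unit i
    effects.append((patched_out - corrupted_out) / (clean_out - corrupted_out))
print(effects)
```

Unit 0 restores most of the gap, so it is the main mediator of the input's effect on the output in this toy. In a transformer the patched site would be an attention head, an MLP output, or a residual stream position.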

65 of 90

Causal Mediation in Transformers

66 of 90

Causal Mediation in Transformers

67 of 90

Causal Mediation in Transformers

68 of 90

Causal Mediation in Transformers

69 of 90

Transformer Understanding

Transformer Residual Stream and Linear Structure

  • Grosse, Roger, et al. "Studying large language model generalization with influence functions." arXiv preprint arXiv:2308.03296 (2023).

One-Layer Attention-Only Transformers
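
A minimal sketch of why the residual stream's linear structure matters, assuming a one-layer attention-only model with layer normalization ignored: the final residual is the token embedding plus each head's output, and the logits are a linear map of that sum, so each component's contribution to the logits can be read off separately. All vectors and the unembedding matrix `W_U` are made-up toy values.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def vec_add(*vectors):
    """Elementwise sum of any number of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


embed = [1.0, 0.0]                       # token embedding (d_model = 2)
head_outputs = {
    "head_0": [0.0, 2.0],                # what each head writes to the stream
    "head_1": [-0.5, 0.5],
}
W_U = [[1.0, 1.0], [0.0, -1.0], [2.0, 0.0]]  # unembedding: 3 vocab x 2 d_model

residual = vec_add(embed, *head_outputs.values())
logits = matvec(W_U, residual)

# Linearity lets us attribute logits to each component separately.
contribs = {"embed": matvec(W_U, embed)}
for name, out in head_outputs.items():
    contribs[name] = matvec(W_U, out)
print(logits, contribs)
```

Summing the per-component contributions recovers the full logits exactly, which is what makes this kind of decomposition possible.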

70 of 90

Outline

  • Motivation: Addressing the "Black Box" Problem (10 min)
  • Taxonomy: An Overview of Interpretation Methodologies (25 min)
  • Deep Dive: Explainability for LLMs (25 min)
  • Assessment: Methods for Evaluating Explanations

71 of 90

Evaluating AI Explanations

  • Faithful
  • Plausible
  • Informative
  • Useful

72 of 90

Evaluating Faithfulness of AI Explanations

DeYoung, Jay, et al. "ERASER: A benchmark to evaluate rationalized NLP models." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.

  • Sufficiency

  • Comprehensiveness
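
A minimal sketch of the two ERASER faithfulness metrics. Comprehensiveness is the drop in predicted probability when the rationale tokens are removed (higher is better); sufficiency is the drop when only the rationale tokens are kept (lower is better). The classifier `toy_prob` and its weights are hypothetical stand-ins for a real model.

```python
import math


def toy_prob(tokens):
    """Hypothetical classifier: probability of the positive class."""
    weights = {"great": 2.0, "fun": 1.0, "boring": -1.5}
    z = sum(weights.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))


def comprehensiveness(tokens, rationale_idx, prob_fn):
    """p(y | x) - p(y | x with the rationale removed)."""
    without = [t for i, t in enumerate(tokens) if i not in rationale_idx]
    return prob_fn(tokens) - prob_fn(without)


def sufficiency(tokens, rationale_idx, prob_fn):
    """p(y | x) - p(y | rationale only)."""
    only = [t for i, t in enumerate(tokens) if i in rationale_idx]
    return prob_fn(tokens) - prob_fn(only)


tokens = ["a", "great", "and", "fun", "movie"]
rationale = {1, 3}  # indices of "great" and "fun"
comp = comprehensiveness(tokens, rationale, toy_prob)
suff = sufficiency(tokens, rationale, toy_prob)
print(f"comprehensiveness={comp:.3f}, sufficiency={suff:.3f}")
```

Here the rationale alone reproduces the full prediction (sufficiency of zero) and removing it pushes the prediction back to chance, so this rationale scores well on both metrics.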

73 of 90

Evaluating Plausibility of AI Explanations
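
A minimal sketch of one common plausibility measure: agreement between the model's rationale and a human-annotated rationale, scored as token-level F1 over selected token indices. The function name and the example index sets are illustrative assumptions.

```python
def token_f1(predicted_idx, human_idx):
    """Token-level F1 between predicted and human rationale token indices."""
    predicted, human = set(predicted_idx), set(human_idx)
    overlap = len(predicted & human)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(human)
    return 2 * precision * recall / (precision + recall)


predicted = {1, 3, 4}  # tokens the model highlighted
human = {1, 3}         # tokens a human annotator highlighted
score = token_f1(predicted, human)
print(round(score, 3))
```

A high score says the explanation looks right to people. It says nothing about whether the model actually relied on those tokens, which is what the faithfulness metrics measure.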

74 of 90

Evaluating Informativeness of AI Explanations

75 of 90

Evaluating Utility of AI Explanations

76 of 90

Outline

  • Motivation: Addressing the "Black Box" Problem (10 min)
  • Taxonomy: An Overview of Interpretation Methodologies (25 min)
  • Deep Dive: Explainability for LLMs (25 min)
  • Assessment: Methods for Evaluating Explanations (15 min)
  • Human-Centered Explanations:

Are AI Explanations useful for humans? Why and How?

89 of 90

Thank you for taking the course!

Best of luck with your exams!

90 of 90

Course Feedback

Please submit your feedback on this course to help us improve it.

Thank you!