1 of 90

CSCI-SHU 376: Natural Language Processing

Hua Shen

Course Agenda: 2026 Spring-NLP-[CSCI-SHU-376]-Class Schedule

2026-04-28

Spring 2026

Lecture 18 & 19: Mechanistic Interpretability Approaches & Human-Centered and Interactive Explanation

Contents adapted from UC Berkeley Agentic AI Course

2 of 90

Today’s Plan

Motivation: Addressing the "Black Box" Problem
Taxonomy: An Overview of Interpretation Methodologies
Deep Dive: Explainability for LLMs
Assessment: Methods for Evaluating Explanations

3 of 90

Outline

Motivation: Addressing the "Black Box" Problem
Taxonomy: An Overview of Interpretation Methodologies (25 min)
Deep Dive: Explainability for LLMs (25 min)
Assessment: Methods for Evaluating Explanations (15 min)

4 of 90

Why do we need AI interpretability?

5 of 90

The usefulness of AI Interpretability for humans is crucial

6 of 90

Trajectory of Explainable AI (XAI) algorithms

7 of 90

Interpretability vs. Explanations

(Interchangeable)

8 of 90

What are the desired properties of AI explanations?

Share Your Thoughts 🙌

9 of 90

Desiderata of AI Explanations

Faithful

Accurately represent the model mechanisms

Plausible

Coherent with human understanding

Useful

Helpful for human perception, or AI improvement, or downstream tasks

10 of 90

Faithful AI Explanations

Faithfulness: Accurately represent the model mechanisms

Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? ACL 2020
Towards Faithful Model Explanation in NLP: A Survey. Computational Linguistics, 2024

11 of 90

Plausible AI Explanations

Plausibility: Coherent with human understanding, also referred to as persuasiveness or understandability

Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? ACL 2020

12 of 90

Faithfulness vs. Plausibility?

Application categorization w.r.t. the desired levels of faithfulness (left) and plausibility (right) in explanations provided by LLMs.

High-stakes applications like healthcare, finance, and legal demand high faithfulness to ensure the accuracy of the LLM’s output due to the critical nature of decisions made in these fields.
Conversely, recreational and educational applications like storytelling, educational LLMs, and creativity prioritize plausibility to enhance user engagement.

Agarwal, Chirag, Sree Harsha Tanneru, and Himabindu Lakkaraju. "Faithfulness vs. plausibility: On the (un) reliability of explanations from large language models." arXiv preprint arXiv:2402.04614 (2024).

13 of 90

Faithfulness vs. Plausibility?

It is possible that AI explanations are:

faithful but NOT plausible,
or plausible but NOT faithful

Agarwal, Chirag, Sree Harsha Tanneru, and Himabindu Lakkaraju. "Faithfulness vs. plausibility: On the (un) reliability of explanations from large language models." arXiv preprint arXiv:2402.04614 (2024).

14 of 90

Useful AI Explanations

Usefulness:

Helpful for AI: improve AI performance
Helpful for humans: improve human understanding
Helpful for the human-AI team: empower human-AI collaborations

Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? ACL 2020
Towards Faithful Model Explanation in NLP: A Survey. Computational Linguistics, 2024

15 of 90

Useful AI Explanations

16 of 90

Outline

Motivation: Addressing the "Black Box" Problem (10 min)
Taxonomy: An Overview of Interpretation Methodologies
Deep Dive: Explainability for LLMs (25 min)
Assessment: Methods for Evaluating Explanations (15 min)

17 of 90

Which AI explanations have you encountered or used?

Share Your Thoughts 🙌

18 of 90

Different ways to categorize AI explanations

How would you explain this AI prediction to a user?

19 of 90

A survey of 200+ Explainable AI (XAI) papers

Website: https://human-centered-exnlp.github.io/
Shen, Hua, and Ting-Hao 'Kenneth' Huang. "Explaining the road not taken." CHI HCXAI 2021

20 of 90

A collection of Explainable AI (XAI) types

Website: https://human-centered-exnlp.github.io/
Shen, Hua, and Ting-Hao 'Kenneth' Huang. "Explaining the road not taken." CHI HCXAI 2021

21 of 90

Different ways to categorize AI explanations

Global vs. Local Explanations
Post-hoc vs. Self-Explanations
Various Types of Explanations

Website: https://human-centered-exnlp.github.io/
Shen, Hua, and Ting-Hao 'Kenneth' Huang. "Explaining the road not taken." CHI HCXAI 2021

22 of 90

Global Explanations

How does the AI system generally make predictions?

(Yang, C., et al, 2018)

(Ribeiro, M. T., et al, 2018)

23 of 90

Local Explanations

How does the AI system make this specific prediction?

How can we attribute the AI’s prediction to the input features?

What are the most relevant training examples contributing to this AI prediction?

(Koh, P. W., & Liang, P, 2017)

(Wallace, Eric, et al, 2019)

24 of 90

Post-hoc Explanations

Shen, Hua, et al. "Are shortest rationales the best explanations for human understanding?." ACL 2022

(Shen, Hua, et al. 2022)

25 of 90

Self-Explaining Models

Shen, Hua, et al. "Are shortest rationales the best explanations for human understanding?." ACL 2022

26 of 90

Explanations for large language models primarily focus on post-hoc explanations, covering both global and local explanations

27 of 90

Outline

Motivation: Addressing the "Black Box" Problem (10 min)
Taxonomy: An Overview of Interpretation Methodologies (25 min)
Deep Dive: Explainability for LLMs
Assessment: Methods for Evaluating Explanations (15 min)

28 of 90

Explanations in the Era of LLMs

Rationale-based Explanations

Feature attributions / Extractive rationales
Free-text explanations
Structured explanations

Data attribution

Data Influence (Influence Functions)

Transformer understanding

Neuron-level interpretability: Sparse Autoencoders
Causal Mediation
Transformer-oriented Interpretation

29 of 90

Rationale-based Explanations

Feature attributions (Post-hoc)

DeYoung, Jay, et al. "ERASER: A benchmark to evaluate rationalized NLP models." ACL 2020

Extractive rationales (Self-Explanation)

30 of 90

Feature Attributions (Post-hoc Explanation)

Removal- / Perturbation- based Explanations
SHAP (SHapley Additive exPlanation)
Gradient-based Explanations

31 of 90

Feature Attributions (Post-hoc Explanation)

Covert, Ian, Scott Lundberg, and Su-In Lee. "Explaining by removing: A unified framework for model explanation." Journal of Machine Learning Research 22.209 (2021): 1-90.
Lundberg, Scott M., and Su-In Lee. "A unified approach to interpreting model predictions." Advances in neural information processing systems 30 (2017).
Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. "Axiomatic attribution for deep networks." International conference on machine learning. PMLR, 2017.

Removal- / Perturbation- based Explanations
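As a minimal sketch of the removal idea: delete one token at a time and record how much the model output drops. The lexicon-based scorer `toy_score` below is a hypothetical stand-in for a real classifier, and leave-one-out is only one of the removal strategies covered by the framework of Covert et al.

```python
from typing import Callable, List

# Hypothetical toy "model": a lexicon-based sentiment scorer, not a real classifier.
LEXICON = {"great": 2.0, "good": 1.0, "boring": -1.5, "terrible": -2.0}


def toy_score(tokens: List[str]) -> float:
    """Return a sentiment score; larger means more positive."""
    return sum(LEXICON.get(t, 0.0) for t in tokens)


def leave_one_out(tokens: List[str], model: Callable[[List[str]], float]) -> List[float]:
    """Attribution of token i = model(full input) - model(input without token i)."""
    full = model(tokens)
    return [full - model(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


tokens = ["the", "plot", "was", "great", "but", "boring"]
attributions = leave_one_out(tokens, toy_score)
for tok, attr in zip(tokens, attributions):
    print(f"{tok:>8s} {attr:+.1f}")
```

Each attribution costs one extra model call, so the cost grows linearly with input length; this is one source of the computational-cost challenge for long contexts.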

32 of 90

Feature Attributions (Post-hoc Explanation)

SHAP (SHapley Additive exPlanation)
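SHAP builds on Shapley values from cooperative game theory. The sketch below computes exact Shapley values by enumerating every coalition of features, which is only feasible for a handful of features; practical SHAP implementations approximate this quantity. The value function `toy_value` is a hypothetical example, not taken from the paper.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List


def shapley_values(n: int, value: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Shapley values for n features by enumerating all coalitions."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_value(present: FrozenSet[int]) -> float:
    """Hypothetical model output when only the features in `present` are kept."""
    v = 0.0
    if 0 in present:
        v += 1.0
    if 1 in present:
        v += 2.0
    if 1 in present and 2 in present:  # interaction between features 1 and 2
        v += 4.0
    return v


phi = shapley_values(3, toy_value)
everything = frozenset({0, 1, 2})
print(phi, sum(phi), toy_value(everything) - toy_value(frozenset()))
```

The attributions sum to the difference between the full-input output and the empty-input output, and the interaction term is split evenly between the two features that create it.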

33 of 90

Feature Attributions (Post-hoc Explanation)

Gradient-based Explanations
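One gradient-based method is Integrated Gradients (Sundararajan et al., 2017), which averages gradients along a straight path from a baseline to the input and scales by the input difference. The sketch below assumes a toy logistic model with hypothetical weights `W` and an all-zeros baseline, and approximates the path integral with a midpoint Riemann sum.

```python
import math
from typing import List

W = [1.5, -2.0, 0.5]  # hypothetical weights of a toy logistic model


def f(x: List[float]) -> float:
    """Model output: sigmoid of a linear score."""
    return 1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(W, x))))


def grad_f(x: List[float]) -> List[float]:
    """Analytic gradient of f with respect to the input."""
    p = f(x)
    return [p * (1.0 - p) * w for w in W]


def integrated_gradients(x: List[float], baseline: List[float], steps: int = 200) -> List[float]:
    """Midpoint Riemann approximation of the path integral of gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad_f(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
ig = integrated_gradients(x, baseline)
print(ig, sum(ig), f(x) - f(baseline))
```

The attributions sum (up to numerical error) to f(x) - f(baseline), the completeness axiom from the paper. Note that the method needs gradient access to the model.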

34 of 90

Feature Attributions (Post-hoc Explanation)

Challenges:

Computational cost
Low efficiency in long context
No model access (gradients, attention scores, etc.)

35 of 90

Extractive rationales (Self-Explanation)

Lei, Tao, Regina Barzilay, and Tommi Jaakkola. "Rationalizing neural predictions." EMNLP, 2016.

36 of 90

Extractive rationales (Self-Explanation)

Shen, Hua, et al. "Are shortest rationales the best explanations for human understanding?." ACL 2022

37 of 90

Extractive rationales (Self-Explanation)

Shen, Hua, et al. "Are shortest rationales the best explanations for human understanding?." ACL 2022

How to control explanation length?

38 of 90

Free-text Explanations

What does a free-text Explanation look like?

39 of 90

Free-text Explanations

How to Generate Free-text Explanations?

Kumar, Sawan, and Partha Talukdar. "NILE: Natural language inference with faithful natural language explanations." arXiv preprint arXiv:2005.12116 (2020).

40 of 90

Free-text Explanations

Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.

In-context learning / Few-shot prompting

41 of 90

Prompting for Explanations

Wiegreffe, Sarah, et al. "Reframing human-AI collaboration for generating free-text explanations." NAACL 2022
Marasović, Ana, et al. "Few-shot self-rationalization with natural language prompts." Findings of the Association for Computational Linguistics: NAACL 2022.
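A few-shot prompt for free-text explanations can be assembled from a handful of demonstrations, each pairing a question with an answer and an explanation. The layout and the demonstration below are hypothetical illustrations, not the exact templates used in the cited papers.

```python
from typing import List, Tuple


def build_explanation_prompt(demos: List[Tuple[str, str, str]], question: str) -> str:
    """Format (question, answer, explanation) demonstrations, then append a new question."""
    blocks = [f"Question: {q}\nAnswer: {a}\nExplanation: {e}" for q, a, e in demos]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical demonstration written for this sketch.
demos = [
    ("Can a penguin fly?", "No",
     "Penguins are birds, but their wings are adapted for swimming."),
]
prompt = build_explanation_prompt(demos, "Can an ostrich fly?")
print(prompt)
```

The prompt ends at "Answer:" so that the model is expected to continue with an answer followed by an explanation in the same format as the demonstrations.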

42 of 90

Chain of Thought (CoT) - based Explanations

Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." Advances in neural information processing systems 35 (2022): 24824-24837.
Kojima, Takeshi, et al. "Large language models are zero-shot reasoners." Advances in neural information processing systems 35 (2022): 22199-22213.
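Zero-shot CoT (Kojima et al., 2022) works in two stages: a trigger phrase first elicits the reasoning, then a second prompt extracts the final answer. The sketch below only builds the two prompt strings; the `reasoning` text is a hypothetical model output, and the answer-extraction wording is simplified.

```python
TRIGGER = "Let's think step by step."


def reasoning_prompt(question: str) -> str:
    """Stage 1: ask the model to produce its reasoning."""
    return f"Q: {question}\nA: {TRIGGER}"


def answer_prompt(question: str, reasoning: str) -> str:
    """Stage 2: append the generated reasoning and ask for the final answer."""
    return f"{reasoning_prompt(question)} {reasoning}\nTherefore, the answer is"


question = "A pack has 12 pencils. How many pencils are in 3 packs?"
reasoning = "Each pack has 12 pencils, so 3 packs have 3 * 12 = 36 pencils."  # hypothetical output
stage1 = reasoning_prompt(question)
stage2 = answer_prompt(question, reasoning)
print(stage2)
```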

43 of 90

CoT + Question Decomposition

44 of 90

CoT + Vote and Rank
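The voting step can be sketched as a majority vote over the final answers of several sampled reasoning chains. The `sampled` answers below are hypothetical, and real systems may also rank or weight chains rather than counting them equally.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent final answer and its share of the votes."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(answers)


# Hypothetical final answers extracted from five sampled reasoning chains.
sampled = ["18", "18", "20", "18 ", "17"]
best, share = majority_vote(sampled)
print(best, share)
```

With tied counts, `Counter.most_common` returns the answer that was seen first, so a production version would need an explicit tie-breaking rule.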

45 of 90

Structured Explanations

Tafjord, Oyvind, Bhavana Dalvi, and Peter Clark. "Proofwriter: Generating implications, proofs, and abductive statements over natural language." ACL-IJCNLP. 2021.

Dalvi, Bhavana, et al. "Explaining answers with entailment trees." arXiv preprint arXiv:2104.08661 (2021).

Traditionally: train models to iteratively generate intermediate steps

ProofWriter

EntailmentWriter

Need a lot of training data

46 of 90

Structured Explanations

47 of 90

Logically-Constrained Reasoning

Jung, Jaehun, et al. "Maieutic prompting: Logically consistent reasoning with recursive explanations." EMNLP 2022.
Ye, Xi, et al. "Satlm: Satisfiability-aided language models using declarative prompting." NeurIPS 2023

Maieutic Prompting

See also SatLM [Ye et al., 2023]

48 of 90

Symbolically-Aided Reasoning

Chen, Wenhu, et al. "Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks." arXiv preprint arXiv:2211.12588 (2022).
Gao, Luyu, et al. "Pal: Program-aided language models." International Conference on Machine Learning. PMLR, 2023.

Program of Thoughts (PoT) Prompting

See also: Program-Aided LM/PAL [Gao et al., 2023]
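The core idea of program-based prompting is that the model writes a program and an interpreter does the arithmetic. In the sketch below, `generated_program` is a hypothetical string standing in for model output, and reading the result from a variable named `ans` is a convention chosen for this example.

```python
# Hypothetical program text standing in for what a model might generate.
generated_program = """
principal = 1000
rate = 0.05
years = 3
ans = principal * (1 + rate) ** years
"""


def run_program(source: str) -> object:
    """Execute generated code in a fresh namespace and return the variable `ans`."""
    # Removing builtins is a small precaution only; it is not a real sandbox.
    namespace = {"__builtins__": {}}
    exec(source, namespace)
    return namespace["ans"]


result = run_program(generated_program)
print(result)
```

The generated program doubles as a structured explanation: each intermediate variable is a step that a reader can check. Executing untrusted model output needs proper isolation in practice.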

49 of 90

Data Influence

[Interpreting Predictions of NLP Models EMNLP’20 Tutorial]
[NAACL 24 Tutorial: Explanations in the Era of Large Language Models]
Koh, Pang Wei, and Percy Liang. "Understanding black-box predictions via influence functions." ICML. PMLR, 2017.

Seminal Work: Influence Functions

[Koh and Liang, 2017]
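Koh and Liang estimate the effect of upweighting a training point z on the loss at a test point as -grad L(z_test)^T H^{-1} grad L(z), where H is the Hessian of the average training loss. The sketch below applies this to a one-parameter least-squares model so that H is a scalar; the data and the outlier are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y)


def fit(data: List[Point]) -> float:
    """Closed-form least squares for the model y = theta * x."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)


def loss(theta: float, z: Point) -> float:
    x, y = z
    return 0.5 * (theta * x - y) ** 2


def grad(theta: float, z: Point) -> float:
    x, y = z
    return (theta * x - y) * x


def influence_on_test_loss(data: List[Point], z_train: Point, z_test: Point) -> float:
    """-grad L(z_test) * H^{-1} * grad L(z_train), with H the mean second derivative."""
    theta = fit(data)
    hessian = sum(x * x for x, _ in data) / len(data)
    return -grad(theta, z_test) * grad(theta, z_train) / hessian


# Hypothetical data; the last training point is an outlier.
train = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (2.0, 4.0)]
z_test = (2.5, 2.5)
influences = [influence_on_test_loss(train, z, z_test) for z in train]

# Removing a point corresponds to upweighting it by -1/n (first-order estimate).
predicted_change = -influences[4] / len(train)
actual_change = loss(fit(train[:4]), z_test) - loss(fit(train), z_test)
print(influences, predicted_change, actual_change)
```

The outlier gets the largest-magnitude influence, and the estimate agrees in sign with actually retraining without it. The magnitudes differ because the estimate is only first-order.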

50 of 90

Data Influence

Koh, Pang Wei, and Percy Liang. "Understanding black-box predictions via influence functions." ICML. PMLR, 2017.

51 of 90

Data Influence: Explaining LLMs’ Completions

Grosse, Roger, et al. "Studying large language model generalization with influence functions." arXiv preprint arXiv:2308.03296 (2023).

52 of 90

Transformer Understanding

Neuron-level interpretability: Sparse Autoencoders

Transformer Circuits Thread: https://transformer-circuits.pub/
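A sparse autoencoder maps a model activation to a larger set of mostly inactive features and reconstructs the activation from them, trained with a reconstruction loss plus an L1 sparsity penalty. The sketch below shows one common formulation of the forward pass and loss. The weights are hand-picked and untrained, and the 2-dimensional activation is hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sae_forward(x: Vector, w_enc: Matrix, b_enc: Vector,
                w_dec: Matrix, b_dec: Vector) -> Tuple[Vector, Vector]:
    """features = ReLU(W_enc (x - b_dec) + b_enc); recon = W_dec features + b_dec."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [h + b for h, b in zip(matvec(w_enc, centered), b_enc)]
    features = [max(0.0, p) for p in pre]
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon


def sae_loss(x: Vector, features: Vector, recon: Vector, l1_coeff: float) -> float:
    """Squared reconstruction error plus L1 penalty on feature activations."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon))
    return mse + l1_coeff * sum(abs(f) for f in features)


# Hand-picked weights: four dictionary features for the directions +x, -x, +y, -y.
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [0.8, -0.3]  # hypothetical activation vector
features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
total_loss = sae_loss(x, features, recon, l1_coeff=0.1)
print(features, recon, total_loss)
```

Only two of the four features fire for this input, and each active feature corresponds to a single direction, which is the property that makes the learned dictionary easier to interpret than raw neurons.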

53 of 90

Sparse Autoencoders

54 of 90

Sparse Autoencoders

55 of 90

Sparse Autoencoders

56 of 90

Course Final Project & Scoring Updates

Drop the lowest quiz score
Upload the project presentation video to YouTube

57 of 90

Transformer Understanding

58 of 90

Transformer Understanding

The Three Layer Causal Hierarchy

https://web.cs.ucla.edu/~kaoru/3-layer-causal-hierarchy.pdf

59 of 90

Causal Mediation
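One common way to operationalize causal mediation in neural networks is activation patching: run the model on a corrupted input, overwrite a single hidden unit with its value from the clean run, and measure how much of the output difference is restored. The tiny network and inputs below are hypothetical and stand in for a real model.

```python
from typing import Dict, List, Optional

# Hypothetical two-layer network: 2 inputs -> 3 ReLU hidden units -> 1 output.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_OUT = [2.0, -1.0, 0.5]


def hidden(x: List[float]) -> List[float]:
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_IN]


def run(x: List[float], patch: Optional[Dict[int, float]] = None) -> float:
    """Forward pass; `patch` overwrites selected hidden units (the intervention)."""
    h = hidden(x)
    for idx, val in (patch or {}).items():
        h[idx] = val
    return sum(v * hi for v, hi in zip(W_OUT, h))


x_clean = [1.0, 0.0]
x_corrupt = [0.0, 1.0]
h_clean = hidden(x_clean)

# Total effect of swapping the input, and the part mediated by each hidden unit.
total_effect = run(x_clean) - run(x_corrupt)
indirect = [run(x_corrupt, {j: h_clean[j]}) - run(x_corrupt) for j in range(len(W_OUT))]
print(total_effect, indirect)
```

Unit 2 has the same activation on both inputs, so it mediates none of the effect. In this toy case the readout is linear in the hidden units, so the indirect effects add up to the total effect; that does not hold in general for deeper models.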

60 of 90

Causal Mediation

61 of 90

Causal Mediation

62 of 90

Causal Mediation

63 of 90

Causal Mediation

64 of 90

Causal Mediation in Transformers

65 of 90

Causal Mediation in Transformers

66 of 90

Causal Mediation in Transformers

67 of 90

Causal Mediation in Transformers

68 of 90

Causal Mediation in Transformers

69 of 90

Transformer Understanding

Transformer Residual Stream and Linear Structure

Grosse, Roger, et al. "Studying large language model generalization with influence functions." arXiv preprint arXiv:2308.03296 (2023).

One-Layer Attention-Only Transformers
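In a one-layer attention-only transformer, each head adds its output to the residual stream, and the logits are a linear readout of that stream. The logits therefore split into a direct embedding-to-unembedding path plus one term per head. The sketch below checks this with small random weights; the sizes and token ids are hypothetical, and positional embeddings and layer norm are omitted for brevity.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def causal_softmax(scores: Matrix) -> Matrix:
    """Row-wise softmax where position i only attends to positions <= i."""
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        out.append([e / z for e in exps] + [0.0] * (len(row) - i - 1))
    return out


def head_output(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix, w_o: Matrix) -> Matrix:
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    k_t = [list(col) for col in zip(*k)]
    scale = math.sqrt(len(w_q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    return matmul(matmul(causal_softmax(scores), v), w_o)


rng = random.Random(0)
vocab, d_model, d_head, n_heads = 5, 4, 2, 2
W_E = rand_matrix(vocab, d_model, rng)   # embedding
W_U = rand_matrix(d_model, vocab, rng)   # unembedding
heads = []
for _ in range(n_heads):
    w_q, w_k, w_v = (rand_matrix(d_model, d_head, rng) for _ in range(3))
    heads.append((w_q, w_k, w_v, rand_matrix(d_head, d_model, rng)))

tokens = [0, 3, 1]
x0 = [W_E[t] for t in tokens]                     # residual stream after embedding
head_outs = [head_output(x0, *h) for h in heads]
x1 = x0
for h in head_outs:                               # each head writes into the stream
    x1 = add(x1, h)
logits = matmul(x1, W_U)

# Linear decomposition: direct path plus one contribution per head.
recombined = matmul(x0, W_U)
for h in head_outs:
    recombined = add(recombined, matmul(h, W_U))
max_diff = max(abs(a - b) for ra, rb in zip(logits, recombined) for a, b in zip(ra, rb))
print(max_diff)
```

Because the decomposition is exact, each head's contribution to a specific logit can be read off separately, which is what makes this small architecture a convenient starting point for circuit analysis.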

70 of 90

Outline

Motivation: Addressing the "Black Box" Problem (10 min)
Taxonomy: An Overview of Interpretation Methodologies (25 min)
Deep Dive: Explainability for LLMs (25 min)
Assessment: Methods for Evaluating Explanations

71 of 90

Evaluating AI Explanations

Faithful
Plausible
Informativeness
Useful

72 of 90

Evaluating Faithfulness of AI Explanations

DeYoung, Jay, et al. "ERASER: A benchmark to evaluate rationalized NLP models." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.

Sufficiency

Comprehensiveness
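In ERASER, comprehensiveness is the drop in the predicted-class probability when the rationale tokens are removed, and sufficiency is the drop when only the rationale tokens are kept. A faithful rationale should score high on comprehensiveness and close to zero on sufficiency. The sketch below uses a hypothetical lexicon-based model, `prob_positive`, whose predicted class is "positive" for the example input.

```python
import math
from typing import Callable, List, Set

LEXICON = {"great": 2.0, "moving": 1.0, "dull": -2.0}  # hypothetical toy weights


def prob_positive(tokens: List[str]) -> float:
    """Toy classifier: probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-sum(LEXICON.get(t, 0.0) for t in tokens)))


def comprehensiveness(tokens: List[str], rationale: Set[int],
                      model: Callable[[List[str]], float]) -> float:
    """model(full input) - model(input with the rationale removed)."""
    rest = [t for i, t in enumerate(tokens) if i not in rationale]
    return model(tokens) - model(rest)


def sufficiency(tokens: List[str], rationale: Set[int],
                model: Callable[[List[str]], float]) -> float:
    """model(full input) - model(rationale only)."""
    only = [t for i, t in enumerate(tokens) if i in rationale]
    return model(tokens) - model(only)


tokens = ["the", "acting", "was", "great", "and", "moving"]
good, bad = {3, 5}, {0, 1}  # indices of "great"/"moving" vs. "the"/"acting"
comp_good = comprehensiveness(tokens, good, prob_positive)
suff_good = sufficiency(tokens, good, prob_positive)
comp_bad = comprehensiveness(tokens, bad, prob_positive)
suff_bad = sufficiency(tokens, bad, prob_positive)
print(comp_good, suff_good, comp_bad, suff_bad)
```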

73 of 90

Evaluating Plausibility of AI Explanations
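One common automatic proxy for plausibility, used for example in ERASER, is the overlap between the tokens a model highlights and the tokens human annotators marked as the rationale. The sketch below computes token-level F1 on hypothetical index sets. Human-subject ratings are another way to assess plausibility, and this metric does not replace them.

```python
from typing import Set


def token_f1(predicted: Set[int], human: Set[int]) -> float:
    """F1 between predicted rationale token indices and human-annotated ones."""
    overlap = len(predicted & human)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(human)
    return 2 * precision * recall / (precision + recall)


predicted = {1, 3, 5}  # hypothetical tokens highlighted by the model
human = {3, 5}         # hypothetical tokens marked by an annotator
f1 = token_f1(predicted, human)
print(f1)
```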

74 of 90

Evaluating Informativeness of AI Explanations

75 of 90

Evaluating Utility of AI Explanations

76 of 90

Outline

Motivation: Addressing the "Black Box" Problem (10 min)
Taxonomy: An Overview of Interpretation Methodologies (25 min)
Deep Dive: Explainability for LLMs (25 min)
Assessment: Methods for Evaluating Explanations (15 min)
Human-Centered Explanations:

Are AI Explanations useful for humans? Why and How?

89 of 90

Thank you for taking the course!

Best of luck with your exams!

90 of 90

Course Feedback

Please submit your feedback on this course to help us improve it.

Thank you!
