CSCI-SHU 376: Natural Language Processing
Hua Shen
Course Agenda: 2026 Spring-NLP-[CSCI-SHU-376]-Class Schedule
2026-04-28
Spring 2026
Lecture 18 & 19: Mechanistic Interpretability Approaches & Human-Centered and Interactive Explanation
Contents adapted from UC Berkeley Agentic AI Course
Today’s Plan
Recommended Reading:
NAACL 24 Tutorial: Explanations in the Era of Large Language Models
Outline
Why do we need AI interpretability?
Usefulness to humans is a crucial goal of AI interpretability
Trajectory of Explainable AI (XAI) algorithms
Interpretability vs. Explanations
(Interchangeable)
What are the desired properties of AI explanations?
Share Your Thoughts 🙌
Desiderata of AI Explanations
Faithful AI Explanations
Faithfulness: The explanation accurately represents the model's actual mechanisms
Plausible AI Explanations
Plausibility: Coherent with human understanding, also referred to as persuasiveness or understandability
Faithfulness vs. Plausibility?
Application categorization w.r.t. the desired levels of faithfulness (left) and plausibility (right) in explanations provided by LLMs.
Faithfulness vs. Plausibility?
It is possible that AI explanations are:
Useful AI Explanations
Usefulness:
Outline
What AI explanations have you encountered or used?
Share Your Thoughts 🙌
Different ways to categorize AI explanations
How would you explain this AI prediction to a user?
A survey of 200+ Explainable AI (XAI) papers
A collection of Explainable AI (XAI) types
Different ways to categorize AI explanations
Global Explanations
How does the AI system generally make predictions?
(Yang, C., et al, 2018)
(Ribeiro, M. T., et al, 2018)
Local Explanations
How does the AI system make this specific prediction?
How can we attribute the AI’s prediction to the input features?
What are the most relevant training examples contributing to this AI prediction?
(Koh, P. W., & Liang, P, 2017)
(Wallace, Eric, et al, 2019)
Post-hoc Explanations
(Shen, Hua, et al. 2022)
Self-Explaining Models
Explanations for Large Language Models are primarily post-hoc, covering both global and local explanations
Outline
Explanations in the Era of LLMs
Rationale-based Explanations
Feature Attributions (Post-hoc Explanation)
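One simple post-hoc attribution method is leave-one-out (occlusion): delete each input token in turn and record how much the model's score drops. The sketch below is a minimal illustration, not a specific method from the slides; `TOY_WEIGHTS` and `toy_score` are hypothetical stand-ins for a real model's output score.

```python
from typing import Callable, List, Tuple

# Hypothetical toy "model": a bag-of-words sentiment score (assumption for illustration).
TOY_WEIGHTS = {"great": 2.0, "good": 1.0, "boring": -1.5}

def toy_score(tokens: List[str]) -> float:
    """Sum of per-token weights; unknown tokens contribute 0."""
    return sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)

def leave_one_out(
    tokens: List[str], score_fn: Callable[[List[str]], float]
) -> List[Tuple[str, float]]:
    """Attribution of token i = score(full input) - score(input without token i)."""
    base = score_fn(tokens)
    attributions = []
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        attributions.append((tok, base - score_fn(reduced)))
    return attributions

tokens = "the movie was great but boring".split()
attributions = leave_one_out(tokens, toy_score)
for tok, value in attributions:
    print(f"{tok:>8s}  {value:+.2f}")
```

Positive values mark tokens that push the score up, negative values mark tokens that push it down. With a real model, removing a token can put the input off-distribution, which is one reason such attributions need careful reading.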
Challenges:
Extractive rationales (Self-Explanation)
Lei, Tao, Regina Barzilay, and Tommi Jaakkola. "Rationalizing neural predictions." EMNLP, 2016.
How to control explanation length?
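One common answer is to add a penalty on the binary rationale mask to the training loss. The sketch below follows the general form of the sparsity and continuity regularizers in Lei et al. (2016): one term penalizes the number of selected tokens, the other penalizes switches between selected and unselected tokens. The function name and the coefficient values are assumptions chosen for illustration.

```python
from typing import List

def rationale_penalty(
    mask: List[int], lambda_sparsity: float, lambda_continuity: float
) -> float:
    """Penalty on a 0/1 token-selection mask.

    lambda_sparsity   * (number of selected tokens)       -> shorter rationales
    lambda_continuity * (number of 0/1 switches in mask)  -> contiguous rationales
    """
    length = sum(mask)
    transitions = sum(abs(mask[t] - mask[t - 1]) for t in range(1, len(mask)))
    return lambda_sparsity * length + lambda_continuity * transitions

# Two masks that both select three tokens (illustrative values).
contiguous = rationale_penalty([0, 1, 1, 1, 0], lambda_sparsity=0.1, lambda_continuity=0.2)
scattered = rationale_penalty([1, 0, 1, 0, 1], lambda_sparsity=0.1, lambda_continuity=0.2)
print(contiguous, scattered)
```

Both masks have the same length, but the scattered one pays a higher continuity cost. Raising `lambda_sparsity` shortens rationales overall.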
Free-text Explanations
What does a free-text Explanation look like?
How to Generate Free-text Explanations?
In-context learning / Few-shot prompting
Prompting for Explanations
Chain of Thought (CoT) - based Explanations
CoT + Question Decomposition
CoT + Vote and Rank
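A minimal sketch of few-shot CoT prompting combined with voting over sampled chains. It assumes "vote and rank" means sampling several reasoning chains and ranking the final answers by frequency (in the style of self-consistency); the slides may intend a different scheme. No model is called: `sampled_completions` is hard-coded, hypothetical output, and the "The answer is" marker is an assumed output convention.

```python
from collections import Counter
from typing import List, Optional, Tuple

def build_cot_prompt(demos: List[Tuple[str, str, str]], question: str) -> str:
    """Few-shot prompt where each demo shows a rationale before its answer."""
    parts = [f"Q: {q}\nA: {rationale} The answer is {a}." for q, rationale, a in demos]
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

def extract_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'The answer is', or None if it is missing."""
    marker = "The answer is "
    idx = completion.rfind(marker)
    if idx == -1:
        return None
    return completion[idx + len(marker):].strip().rstrip(".")

def vote_and_rank(completions: List[str]) -> List[Tuple[str, int]]:
    """Rank final answers by how many sampled chains reached them."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    return Counter(answers).most_common()

demos = [("What is 2 + 2?", "2 plus 2 equals 4.", "4")]
prompt = build_cot_prompt(demos, "What is 3 + 4?")

# Hypothetical completions, standing in for several sampled model outputs.
sampled_completions = [
    "3 plus 4 equals 7. The answer is 7.",
    "Start at 3 and count up 4. The answer is 7.",
    "3 plus 4 is 8. The answer is 8.",
    "I am not sure.",
]
ranked = vote_and_rank(sampled_completions)
print(ranked)
```

The chain attached to the winning answer can then be shown as the explanation, with the caveat that agreement among chains does not by itself make the explanation faithful.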
Structured Explanations
ProofWriter
Tafjord, Oyvind, Bhavana Dalvi, and Peter Clark. "ProofWriter: Generating implications, proofs, and abductive statements over natural language." ACL-IJCNLP, 2021.
EntailmentWriter
Dalvi, Bhavana, et al. "Explaining answers with entailment trees." arXiv preprint arXiv:2104.08661, 2021.
Structured Explanations
Logically-Constrained Reasoning
Symbolically-Aided Reasoning
See also: Program-Aided Language Models (PAL) [Gao et al., 2023]
Data Influence
[Koh and Liang, 2017]
Data Influence: Explaining LLMs’ Completions
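To make data influence concrete, here is a minimal sketch of the influence-function estimate from Koh and Liang (2017) on a one-parameter least-squares model, where the Hessian is a scalar and can be inverted exactly. The training and test points are made-up values; for LLMs the Hessian must be approximated, which this toy example does not show.

```python
from typing import List, Tuple

# Toy data (x, y) for the model y_hat = theta * x; the last point is an outlier.
train: List[Tuple[float, float]] = [(1.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
test_point: Tuple[float, float] = (1.0, 1.0)

# Closed-form minimizer of the mean of 0.5 * (theta * x - y)^2.
theta = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def grad(point: Tuple[float, float], theta: float) -> float:
    """Gradient of 0.5 * (theta * x - y)^2 with respect to theta."""
    x, y = point
    return (theta * x - y) * x

# Hessian of the mean training loss (a scalar here).
hessian = sum(x * x for x, _ in train) / len(train)

# Influence of upweighting a training point on the test loss:
#   -grad(test) * H^{-1} * grad(train point)
influences = [-grad(test_point, theta) * grad(z, theta) / hessian for z in train]
for z, value in zip(train, influences):
    print(z, f"{value:+.4f}")
```

A positive value means upweighting that training point would increase the test loss (a harmful point), and a negative value means it would decrease it (a helpful point). Here the outlier `(1.0, 3.0)` is the only harmful point.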
Transformer Understanding
Neuron-level interpretability: Sparse Autoencoders
Sparse Autoencoders
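A minimal forward-pass sketch of a sparse autoencoder: an activation vector is encoded into a wider, non-negative feature vector, decoded back, and trained with reconstruction error plus an L1 sparsity penalty. The weights below are hand-picked for illustration rather than learned, and subtracting the decoder bias before encoding is one common convention, assumed here.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]

def matvec(M: Matrix, v: Vector) -> Vector:
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def sae_forward(
    x: Vector, W_enc: Matrix, b_enc: Vector, W_dec: Matrix, b_dec: Vector
) -> Tuple[Vector, Vector]:
    """features = ReLU(W_enc (x - b_dec) + b_enc); recon = W_dec features + b_dec."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [h + b for h, b in zip(matvec(W_enc, centered), b_enc)]
    features = [max(0.0, p) for p in pre]
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon

def sae_loss(x: Vector, features: Vector, recon: Vector, l1_coeff: float) -> float:
    """Squared reconstruction error plus L1 penalty on feature activations."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon))
    return mse + l1_coeff * sum(abs(f) for f in features)

# 2-dimensional activation, 4 features (overcomplete dictionary, illustrative values).
W_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
W_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

x = [2.0, -3.0]
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=0.1)
print(features, recon, loss)
```

Only two of the four features fire for this input, and each active feature corresponds to one direction in activation space. That sparsity is what makes individual features candidates for human-readable interpretation.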
Course Final Project & Scoring Updates
Transformer Understanding
The Three Layer Causal Hierarchy
https://web.cs.ucla.edu/~kaoru/3-layer-causal-hierarchy.pdf
Causal Mediation
Causal Mediation in Transformers
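The core operation in causal mediation analysis is an intervention on an intermediate value: run the model on one input while overriding a hidden component with the value it takes on another input. The sketch below uses a hypothetical two-step toy model (`mediator` and `output` are made up) in place of a transformer, where the mediator would be a neuron, attention head, or layer activation.

```python
from typing import Optional

def mediator(x: float) -> float:
    """Hidden component computed from the input (stand-in for a neuron or head)."""
    return 2.0 * x

def output(x: float, h: float) -> float:
    """Output depends on the input both directly and through the mediator h."""
    return 3.0 * h + 1.0 * x

def run(x: float, h_override: Optional[float] = None) -> float:
    """Forward pass; optionally replace (patch) the mediator's value."""
    h = mediator(x) if h_override is None else h_override
    return output(x, h)

x_clean, x_counterfactual = 1.0, 2.0
y_clean = run(x_clean)

# Total effect: change the input everywhere.
total_effect = run(x_counterfactual) - y_clean
# Indirect effect: keep the input, set the mediator to its counterfactual value.
indirect_effect = run(x_clean, h_override=mediator(x_counterfactual)) - y_clean
# Direct effect: change the input, hold the mediator at its clean value.
direct_effect = run(x_counterfactual, h_override=mediator(x_clean)) - y_clean

print(total_effect, indirect_effect, direct_effect)
```

In this linear toy the total effect splits exactly into direct plus indirect parts. In a real transformer the decomposition is generally not additive, so effects are usually reported per component rather than summed.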
Transformer Understanding
Transformer Residual Stream and Linear Structure
One-Layer Attention-Only Transformers
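A small numeric sketch of why the residual stream's linear structure helps interpretation: in a one-layer attention-only model, the final residual vector is the token embedding plus each head's output, so the logits split into one additive contribution per component. All vectors and the unembedding matrix `W_U` below are made-up values, and layer normalization is ignored as a simplifying assumption.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]

def matvec(M: Matrix, v: Vector) -> Vector:
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def add(*vecs: Vector) -> Vector:
    return [sum(column) for column in zip(*vecs)]

# What each component writes into the residual stream (illustrative values).
components: Dict[str, Vector] = {
    "embedding": [1.0, 0.0],
    "head_0": [0.5, 0.5],
    "head_1": [-1.0, 2.0],
}
# Unembedding matrix: residual dimension 2 -> vocabulary of size 3.
W_U: Matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

residual = add(*components.values())
logits = matvec(W_U, residual)

# Because unembedding is linear, each component has its own logit contribution.
contributions = {name: matvec(W_U, vec) for name, vec in components.items()}
summed = add(*contributions.values())

print(logits)
print(contributions)
```

Reading `contributions` shows which component pushes which vocabulary item up or down, for example `head_1` raising the logit of token 1 while lowering the logit of token 0.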
Outline
Evaluating AI Explanations
Evaluating Faithfulness of AI Explanations
DeYoung, Jay, et al. "ERASER: A benchmark to evaluate rationalized NLP models." ACL, 2020.
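ERASER measures faithfulness with two quantities computed from model probabilities: comprehensiveness (how much the predicted-class probability drops when the rationale is removed) and sufficiency (how much it drops when only the rationale is kept). A minimal sketch follows; `toy_prob` and its weights are hypothetical stand-ins for a real classifier.

```python
import math
from typing import Callable, List, Set

# Hypothetical toy classifier: probability of the "positive" class.
TOY_WEIGHTS = {"great": 2.0, "boring": -1.5}

def toy_prob(tokens: List[str]) -> float:
    score = sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))

def comprehensiveness(
    tokens: List[str], rationale: Set[int], prob_fn: Callable[[List[str]], float]
) -> float:
    """p(full input) - p(input with rationale tokens removed). Higher is better."""
    remaining = [t for i, t in enumerate(tokens) if i not in rationale]
    return prob_fn(tokens) - prob_fn(remaining)

def sufficiency(
    tokens: List[str], rationale: Set[int], prob_fn: Callable[[List[str]], float]
) -> float:
    """p(full input) - p(rationale tokens only). Lower is better."""
    kept = [t for i, t in enumerate(tokens) if i in rationale]
    return prob_fn(tokens) - prob_fn(kept)

tokens = "the movie was great".split()
rationale = {3}  # index of "great"
comp = comprehensiveness(tokens, rationale, toy_prob)
suff = sufficiency(tokens, rationale, toy_prob)
print(f"comprehensiveness={comp:.4f}  sufficiency={suff:.4f}")
```

Here the rationale scores well on both counts: removing "great" drops the probability to chance, and keeping only "great" leaves the prediction unchanged.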
Evaluating Plausibility of AI Explanations
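Plausibility is often measured as agreement between the model's rationale and a human-annotated one, for example with token-level F1 as in ERASER. A minimal sketch, assuming both rationales are given as sets of token indices (the index values are made up):

```python
from typing import Set

def token_f1(predicted: Set[int], human: Set[int]) -> float:
    """Token-level F1 between predicted and human rationale token indices."""
    overlap = len(predicted & human)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(human)
    return 2 * precision * recall / (precision + recall)

predicted_rationale = {1, 2, 3}
human_rationale = {2, 3, 4}
f1 = token_f1(predicted_rationale, human_rationale)
print(f"token F1 = {f1:.3f}")
```

A high score says the explanation matches what people would point to. It says nothing about whether the model actually relied on those tokens, which is the faithfulness question.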
Evaluating Informativeness of AI Explanations
Evaluating Utility of AI Explanations
Outline
Are AI Explanations useful for humans? Why and How?
Thank you for taking the course!
Best of luck with your exams!
Course Feedback
Please share your feedback on this course to help us improve it.
Thank you!