Applied Physics 293: Explainable AI

Instructor: Surya Ganguli

Stanford University

General review/perspective articles

  • Review articles
  • Perspective pieces
  • Roadmaps
  • Paper lists
  • Conferences

Motivations: Foundation models in neuroscience give us big data and big models, but do they give us understanding?

  • Task trained models in neuroscience across the years
  • Complex models fit to neural data, including foundation models
  • EEG
  • fMRI
  • Single-cell electrophysiology
  • Basic theories of transfer learning explaining how data from other sessions/subjects/species might help

Feature attribution: How does a network output depend on input features?

  • Perturbation-based approaches
  • Gradient-based approaches (a minimal saliency sketch follows this list)
  • Approximation-based approaches
  • Unified view and perspectives
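
To make the gradient-based category concrete, here is a minimal sketch of a vanilla saliency map in PyTorch. The classifier `model`, the input tensor, and the target class are placeholder assumptions for illustration, not anything specified in the course materials.

```python
# Minimal sketch of gradient-based feature attribution (vanilla saliency),
# assuming a generic PyTorch classifier `model` and an input batch `x`.
import torch


def saliency_map(model, x, target_class):
    """Return |d score_target / d x|, the simplest gradient-based attribution."""
    model.eval()
    x = x.clone().detach().requires_grad_(True)
    scores = model(x)                         # assumed shape: (batch, num_classes)
    scores[:, target_class].sum().backward()  # gradient of the target score
    return x.grad.abs()                       # attribution, same shape as the input
```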

Data attribution: Which training data points support a test prediction?
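
As one illustration of this question, below is a hedged sketch of a TracIn-style score, which approximates a training point's influence on a test prediction by the dot product of their loss gradients at a model checkpoint; `model` and `loss_fn` are generic placeholders.

```python
# Sketch of a TracIn-style data-attribution score at a single checkpoint.
# `model` and `loss_fn` are placeholder names; examples are (input, label) pairs.
import torch


def grad_vector(model, loss_fn, x, y):
    """Flattened gradient of the loss on one example w.r.t. all trainable parameters."""
    loss = loss_fn(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def tracin_score(model, loss_fn, train_example, test_example):
    """Larger positive score -> the training example supports the test prediction more."""
    g_train = grad_vector(model, loss_fn, *train_example)
    g_test = grad_vector(model, loss_fn, *test_example)
    return torch.dot(g_train, g_test).item()
```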

Discovery of Concepts
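
As one very simple instance of concept discovery, the sketch below builds a difference-of-means concept direction in activation space and scores activations by projection onto it; shapes and names are illustrative assumptions.

```python
# Sketch of a simple concept-direction probe in activation space.
# Inputs are (n, d) matrices of hidden activations; all names are illustrative.
import torch


def concept_direction(acts_with_concept, acts_without_concept):
    """Difference of mean activations, normalized to a unit concept vector."""
    direction = acts_with_concept.mean(dim=0) - acts_without_concept.mean(dim=0)
    return direction / direction.norm()


def concept_score(activations, direction):
    """Project (n, d) activations onto the concept direction; higher = more of the concept."""
    return activations @ direction
```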

Introduction to Interpretability in transformers

  • Introductory material
  • Connections to earlier and simpler ideas
  • Early interpretation of transformers (Induction Heads; see the diagnostic sketch after this list)
  • RASP interpretation
  • Connections to modern Hopfield model
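
As a concrete companion to the Induction Heads reading, here is a toy diagnostic that scores one attention head's pattern on a repeated random token sequence; the tensor layout and the scoring rule are assumptions made for illustration.

```python
# Toy induction-head diagnostic. On a sequence that repeats itself exactly
# (length 2 * half_len), an induction head attends from each token in the
# second half back to the position just after that token's first occurrence.
# `attn` is one head's (seq, seq) attention pattern; higher scores suggest
# induction-like behavior.
import torch


def induction_score(attn: torch.Tensor, half_len: int) -> float:
    score = 0.0
    for query_pos in range(half_len, 2 * half_len):
        target_pos = query_pos - half_len + 1  # position after the first occurrence
        score += attn[query_pos, target_pos].item()
    return score / half_len
```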

Sparse Autoencoders
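
A minimal sketch of a sparse autoencoder trained on model activations, with an L1 penalty encouraging sparse, potentially interpretable features; the dimensions and penalty weight are illustrative, not taken from the readings.

```python
# Minimal sparse autoencoder (SAE) over model activations.
# Dimensions and the L1 coefficient are illustrative placeholders.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse feature code
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the feature code."""
    recon_loss = ((reconstruction - activations) ** 2).mean()
    sparsity = features.abs().mean()
    return recon_loss + l1_coeff * sparsity
```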

Causal analysis, editing and control 

  • Perturbation-based approaches (an activation-patching sketch follows this list)
  • Gradient-based approaches
  • Approximation-based approaches
  • Causal abstractions
  • More model editing
  • Model steering
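
To illustrate the perturbation-based causal approach mentioned above, here is a sketch of activation patching with PyTorch forward hooks: cache a module's activation on a clean input, splice it into a run on a corrupted input, and measure how much of the clean behavior is restored. The module, inputs, and metric are placeholders, and the sketch assumes the module returns a single tensor.

```python
# Sketch of activation patching with forward hooks. Assumes `module` is a
# submodule of `model` that returns a single tensor, and that the clean and
# corrupted inputs have matching shapes.
import torch


def patch_activation(model, module, clean_input, corrupted_input, metric):
    cached = {}

    def save_hook(mod, inputs, output):
        cached["act"] = output               # remember the clean activation

    def patch_hook(mod, inputs, output):
        return cached["act"]                 # overwrite with the clean activation

    with torch.no_grad():
        handle = module.register_forward_hook(save_hook)
        model(clean_input)                   # clean run: cache the activation
        handle.remove()

        handle = module.register_forward_hook(patch_hook)
        patched_output = model(corrupted_input)  # corrupted run, one clean activation
        handle.remove()

    return metric(patched_output)            # e.g., the logit difference of interest
```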

Evaluation of model explanations
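
One common faithfulness check for explanations is a deletion curve: remove input features in order of attributed importance and track how quickly the target score falls. The sketch below assumes a PyTorch classifier, a single input with a batch dimension of one, and an attribution tensor shaped like the input.

```python
# Sketch of a deletion-curve faithfulness check for a feature attribution.
# Assumes `x` and `attribution` have shape (1, ...) and `model(x)` returns
# (1, num_classes) scores; all names are placeholders.
import torch


def deletion_curve(model, x, attribution, target_class, steps=10):
    """Zero out the most-attributed features first; return the target scores."""
    order = attribution.flatten().argsort(descending=True)
    x_work = x.clone().flatten()
    chunk = max(1, order.numel() // steps)
    scores = []
    with torch.no_grad():
        for start in range(0, order.numel(), chunk):
            x_work[order[start:start + chunk]] = 0.0
            out = model(x_work.view_as(x))
            scores.append(out[0, target_class].item())
    return scores  # a faithful attribution should make this drop quickly
```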

Circuit discovery

Computational complexity issues in interpretability 

Comparing representations across models
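
A standard tool for this comparison is linear centered kernel alignment (CKA), sketched below for two activation matrices computed on the same inputs; shapes and names are illustrative.

```python
# Linear CKA between two representations of the same n inputs.
# X: (n, d1) and Y: (n, d2) activation matrices; returns a value in [0, 1].
import torch


def linear_cka(X, Y):
    X = X - X.mean(dim=0, keepdim=True)      # center each feature
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(X.T @ Y) ** 2  # ||X^T Y||_F^2
    norm_x = torch.linalg.norm(X.T @ X)      # ||X^T X||_F
    norm_y = torch.linalg.norm(Y.T @ Y)      # ||Y^T Y||_F
    return (cross / (norm_x * norm_y)).item()
```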

Discovering and understanding interesting behaviors

  • Behavior discovery through “psychology” experiments on LLMs
  • Understanding specific, interesting behaviors

Cautionary tales in explainability

Automated Interpretability Agents

Reasoning