Applied Physics 293: Explainable AI

Instructor: Surya Ganguli

Stanford University

General review/perspective articles

  • Review articles
  • Perspective pieces
  • Roadmaps
  • Paper lists
  • Conferences

Motivations: Foundation models in neuroscience give us big data and big models, but do they give us understanding?

  • Task trained models in neuroscience across the years
  • Complex models fit to neural data, including foundation models
  • EEG
  • fMRI
  • Single-cell electrophysiology
  • Basic theories of transfer learning explaining how data from other sessions/subjects/species might help

Feature attribution: How does a network output depend on input features?

  • Perturbation-based approaches
  • Gradient-based approaches (a minimal saliency sketch follows this list)
  • Approximation-based approaches
  • Unified view and perspectives
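
To make the gradient-based category concrete, here is a minimal sketch of a vanilla saliency map in PyTorch. The classifier `model`, the input tensor, and the target class are placeholder assumptions for illustration, not anything specified in the course materials.

```python
# Minimal sketch of gradient-based feature attribution (vanilla saliency),
# assuming a generic PyTorch classifier `model` and an input batch `x`.
import torch


def saliency_map(model, x, target_class):
    """Return |d score_target / d x|, the simplest gradient-based attribution."""
    model.eval()
    x = x.clone().detach().requires_grad_(True)
    scores = model(x)                         # assumed shape: (batch, num_classes)
    scores[:, target_class].sum().backward()  # gradient of the target score
    return x.grad.abs()                       # attribution, same shape as the input
```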

Data attribution: Which training data points support a test prediction?
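
As one illustration of this question, below is a hedged sketch of a TracIn-style score, which approximates a training point's influence on a test prediction by the dot product of their loss gradients at a model checkpoint; `model` and `loss_fn` are generic placeholders.

```python
# Sketch of a TracIn-style data-attribution score at a single checkpoint.
# `model` and `loss_fn` are placeholder names; examples are (input, label) pairs.
import torch


def grad_vector(model, loss_fn, x, y):
    """Flattened gradient of the loss on one example w.r.t. all trainable parameters."""
    loss = loss_fn(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def tracin_score(model, loss_fn, train_example, test_example):
    """Larger positive score -> the training example supports the test prediction more."""
    g_train = grad_vector(model, loss_fn, *train_example)
    g_test = grad_vector(model, loss_fn, *test_example)
    return torch.dot(g_train, g_test).item()
```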

Discovery of Concepts
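
As one very simple instance of concept discovery, the sketch below builds a difference-of-means concept direction in activation space and scores activations by projection onto it; shapes and names are illustrative assumptions.

```python
# Sketch of a simple concept-direction probe in activation space.
# Inputs are (n, d) matrices of hidden activations; all names are illustrative.
import torch


def concept_direction(acts_with_concept, acts_without_concept):
    """Difference of mean activations, normalized to a unit concept vector."""
    direction = acts_with_concept.mean(dim=0) - acts_without_concept.mean(dim=0)
    return direction / direction.norm()


def concept_score(activations, direction):
    """Project (n, d) activations onto the concept direction; higher = more of the concept."""
    return activations @ direction
```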

Introduction to Interpretability in transformers

  • Introductory material
  • Connections to earlier and simpler ideas
  • Early interpretation of transformers (Induction Heads; see the diagnostic sketch after this list)
  • RASP interpretation
  • Connections to modern Hopfield model
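
As a concrete companion to the Induction Heads reading, here is a toy diagnostic that scores one attention head's pattern on a repeated random token sequence; the tensor layout and the scoring rule are assumptions made for illustration.

```python
# Toy induction-head diagnostic. On a sequence that repeats itself exactly
# (length 2 * half_len), an induction head attends from each token in the
# second half back to the position just after that token's first occurrence.
# `attn` is one head's (seq, seq) attention pattern; higher scores suggest
# induction-like behavior.
import torch


def induction_score(attn: torch.Tensor, half_len: int) -> float:
    score = 0.0
    for query_pos in range(half_len, 2 * half_len):
        target_pos = query_pos - half_len + 1  # position after the first occurrence
        score += attn[query_pos, target_pos].item()
    return score / half_len
```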

Sparse Autoencoders
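
A minimal sketch of a sparse autoencoder trained on model activations, with an L1 penalty encouraging sparse, potentially interpretable features; the dimensions and penalty weight are illustrative, not taken from the readings.

```python
# Minimal sparse autoencoder (SAE) over model activations.
# Dimensions and the L1 coefficient are illustrative placeholders.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse feature code
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the feature code."""
    recon_loss = ((reconstruction - activations) ** 2).mean()
    sparsity = features.abs().mean()
    return recon_loss + l1_coeff * sparsity
```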

Causal analysis, editing and control 

  • Perturbation-based approaches (an activation-patching sketch follows this list)
  • Gradient-based approaches
  • Approximation-based approaches
  • Causal abstractions
  • More model editing
  • Model steering
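
To illustrate the perturbation-based causal approach mentioned above, here is a sketch of activation patching with PyTorch forward hooks: cache a module's activation on a clean input, splice it into a run on a corrupted input, and measure how much of the clean behavior is restored. The module, inputs, and metric are placeholders, and the sketch assumes the module returns a single tensor.

```python
# Sketch of activation patching with forward hooks. Assumes `module` is a
# submodule of `model` that returns a single tensor, and that the clean and
# corrupted inputs have matching shapes.
import torch


def patch_activation(model, module, clean_input, corrupted_input, metric):
    cached = {}

    def save_hook(mod, inputs, output):
        cached["act"] = output               # remember the clean activation

    def patch_hook(mod, inputs, output):
        return cached["act"]                 # overwrite with the clean activation

    with torch.no_grad():
        handle = module.register_forward_hook(save_hook)
        model(clean_input)                   # clean run: cache the activation
        handle.remove()

        handle = module.register_forward_hook(patch_hook)
        patched_output = model(corrupted_input)  # corrupted run, one clean activation
        handle.remove()

    return metric(patched_output)            # e.g., the logit difference of interest
```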

Evaluation of model explanations
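
One common faithfulness check for explanations is a deletion curve: remove input features in order of attributed importance and track how quickly the target score falls. The sketch below assumes a PyTorch classifier, a single input with a batch dimension of one, and an attribution tensor shaped like the input.

```python
# Sketch of a deletion-curve faithfulness check for a feature attribution.
# Assumes `x` and `attribution` have shape (1, ...) and `model(x)` returns
# (1, num_classes) scores; all names are placeholders.
import torch


def deletion_curve(model, x, attribution, target_class, steps=10):
    """Zero out the most-attributed features first; return the target scores."""
    order = attribution.flatten().argsort(descending=True)
    x_work = x.clone().flatten()
    chunk = max(1, order.numel() // steps)
    scores = []
    with torch.no_grad():
        for start in range(0, order.numel(), chunk):
            x_work[order[start:start + chunk]] = 0.0
            out = model(x_work.view_as(x))
            scores.append(out[0, target_class].item())
    return scores  # a faithful attribution should make this drop quickly
```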

Circuit discovery

Computational complexity issues in interpretability 

Comparing representations across models
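
A standard tool for this comparison is linear centered kernel alignment (CKA), sketched below for two activation matrices computed on the same inputs; shapes and names are illustrative.

```python
# Linear CKA between two representations of the same n inputs.
# X: (n, d1) and Y: (n, d2) activation matrices; returns a value in [0, 1].
import torch


def linear_cka(X, Y):
    X = X - X.mean(dim=0, keepdim=True)      # center each feature
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(X.T @ Y) ** 2  # ||X^T Y||_F^2
    norm_x = torch.linalg.norm(X.T @ X)      # ||X^T X||_F
    norm_y = torch.linalg.norm(Y.T @ Y)      # ||Y^T Y||_F
    return (cross / (norm_x * norm_y)).item()
```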

Discovering and understanding interesting behaviors

  • Behavior discovery through “psychology” experiments on LLMs
  • Understanding specific, interesting behaviors

Cautionary tales in explainability

Automated Interpretability Agents

Reasoning