Applied Physics 293: Explainable AI
Instructor: Surya Ganguli
Stanford University
General review/perspective articles
Motivations: Foundation models in neuroscience: big data, big models, but understanding?
Feature attribution: How does a network output depend on input features?
Data attribution: Which training data points support a test prediction?
Discovery of Concepts
Introduction to Interpretability in transformers
Sparse Autoencoders
Causal analysis, editing and control
Evaluation of model explanations
Circuit discovery
Computational complexity issues in interpretability
Comparing representations across models
Discovering and understanding interesting behaviors
Cautionary tales in explainability
Automated Interpretability Agents
Reasoning