Mechanistic Interpretability & Mathematics: A Whirlwind Tour
Slides: https://neelnanda.io/maths-mech-interp-slides
Sequence: https://neelnanda.io/concrete-open-problems
Getting Started: https://neelnanda.io/getting-started
What is Mechanistic Interpretability?
Key Claim: Mechanistic Understanding of Neural Networks is Possible
Setup
The Fourier Multiplication Algorithm for Modular Addition
Mystery: Why do models grok?
(Power et al, 2022)
Background
Inspiration: Mechanistic Interpretability
A Growing Area of Research
Proposal: Apply mechanistic interpretability techniques to reverse-engineer modular addition, to understand grokking
Getting Traction
Understanding the Embedding
t-SNE of embedding, Power et al
Principal components of embedding, Liu et al
Insight: Apply the Fourier Transform
[Figure: Fourier components of the embedding, at initialisation vs in the grokked model]
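A minimal sketch of this analysis (illustrative, not the original code; shapes, the modulus p = 113, and the random stand-in for W_E are assumptions): write the embedding in a Fourier basis over the input tokens and look at the norm of each component. In a grokked model a handful of frequencies dominate; at initialisation the spectrum is flat.

```python
import torch

# Illustrative sketch, not the original analysis code.
# Assume W_E has shape [p, d_model]: one row per input token 0..p-1.
p, d_model = 113, 128
W_E = torch.randn(p, d_model)  # stand-in for a trained embedding matrix

# Build a real Fourier basis over the p tokens: constant, cos(wk), sin(wk).
ks = torch.arange(p).float()
basis = [torch.ones(p)]
for w in range(1, p // 2 + 1):
    basis.append(torch.cos(2 * torch.pi * w * ks / p))
    basis.append(torch.sin(2 * torch.pi * w * ks / p))
fourier_basis = torch.stack(basis)               # [p, p]
fourier_basis = fourier_basis / fourier_basis.norm(dim=-1, keepdim=True)

# Change of basis: each row is now a Fourier component of the embedding.
W_E_fourier = fourier_basis @ W_E                # [p, d_model]
component_norms = W_E_fourier.norm(dim=-1)       # one norm per Fourier component

# In a grokked modular-addition model a few frequencies dominate these norms;
# at initialisation the spectrum is roughly flat.
print(component_norms.topk(10).indices)
```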
Logits are Sums of Cosines
Explains 95% of the variance!
Why cos?
What are the logits?
Inputs a,b -> output c
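A sketch of the trig identity behind the cosine structure (notation mine: ω = 2πk/p for a key frequency k):

$$\cos(\omega a)\cos(\omega b) - \sin(\omega a)\sin(\omega b) = \cos\big(\omega(a+b)\big)$$
$$\cos\big(\omega(a+b)\big)\cos(\omega c) + \sin\big(\omega(a+b)\big)\sin(\omega c) = \cos\big(\omega(a+b-c)\big)$$

Summing cos(ω(a+b−c)) over the key frequencies gives logits that constructively interfere exactly at c ≡ a+b (mod p) and roughly cancel elsewhere.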
The Learned Algorithm
The Fourier Multiplication Algorithm for Modular Addition
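A minimal numerical sketch of the algorithm as described (the modulus and key frequencies below are placeholders, not a trained model's weights): compute cos/sin of each input at a few key frequencies, combine them with the product-to-sum identities, and read off logits proportional to cos(ω(a+b−c)); the argmax recovers (a+b) mod p.

```python
import numpy as np

# Sketch of Fourier multiplication for modular addition (illustrative only).
p = 113
key_freqs = [14, 35, 41, 52]  # placeholder frequencies; a trained model picks its own

def fourier_add_logits(a: int, b: int) -> np.ndarray:
    c = np.arange(p)
    logits = np.zeros(p)
    for k in key_freqs:
        w = 2 * np.pi * k / p
        # cos(w a) cos(w b) - sin(w a) sin(w b) = cos(w (a + b))
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # cos(w (a + b)) cos(w c) + sin(w (a + b)) sin(w c) = cos(w (a + b - c))
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return logits

a, b = 17, 99
assert fourier_add_logits(a, b).argmax() == (a + b) % p
```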
Lines of Evidence
Understanding Grokking
Mystery: Why do models grok?
Question: Is mechanistic understanding useful?
The Three Phases of Grokking
Two learned algorithms: Memorising circuit and Fourier circuit
Understanding Grokking via Progress Measures
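The paper's progress measures (restricted and excluded loss) need the full model; as a simpler illustrative stand-in (my construction, not the paper's metric), one can track how concentrated the embedding's Fourier spectrum is over training. It rises sharply as the Fourier circuit forms, well before test loss drops.

```python
import torch

def fourier_concentration(W_E: torch.Tensor, fourier_basis: torch.Tensor, top_k: int = 8) -> float:
    """Illustrative progress measure (not the paper's restricted/excluded loss):
    fraction of the embedding's Fourier power captured by its top_k components."""
    norms = (fourier_basis @ W_E).norm(dim=-1)        # norm of each Fourier component
    total = norms.pow(2).sum()
    top = norms.pow(2).topk(top_k).values.sum()
    return (top / total).item()

# Usage: log fourier_concentration(model_W_E, fourier_basis) every few steps,
# with fourier_basis built as in the earlier sketch.
```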
Generalising the Algorithm with Representation Theory
Follow Up: A Toy Model of Universality (Chughtai et al)
The Group Composition via Representations Algorithm
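A sketch of the algorithm in symbols (my notation; ρ is a faithful unitary irreducible representation of the group):

$$\text{logit}(c \mid a, b) \;\propto\; \operatorname{tr}\!\big(\rho(a)\,\rho(b)\,\rho(c)^{-1}\big)$$

For a unitary representation the trace is maximised when ρ(a)ρ(b)ρ(c)^{-1} is the identity, i.e. when c = ab, so taking the argmax over logits recovers group composition.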
Analysing Learned Representations in S5 Composition
Testing Universality: Which Representations Are Learned?
Key Takeaway 1: Deep Learning Can Learn Rich Mathematical Structure
Key Takeaway 2: Mechanistic Understanding Is Possible
A Growing Area of Research
What is Mechanistic Interpretability?
Motivating Mechanistic Interpretability
Goal: Understand Model Cognition
Is it aligned, or telling us what we want to hear?
Motivation
Interpretability in a Post-GPT-4 World
Inputs and Outputs Are Not Enough
How do models represent their thoughts?
Understanding Superposition
Approach: Forming conceptual frameworks
Q: Can we do it for neuron superposition?
E.g. n sparse binary inputs, C(n,2) outputs for x_i AND x_j
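A hedged sketch of that toy setup (sizes, sparsity, and names are illustrative choices, not any specific paper's code): n sparse binary inputs, one target per pair (i, j) equal to x_i AND x_j, and a one-layer MLP with fewer neurons than outputs, so the ANDs must be computed in superposition.

```python
import itertools
import torch

# Illustrative toy setup for studying computation in superposition.
n, p_on, d_mlp = 20, 0.05, 40
pairs = list(itertools.combinations(range(n), 2))    # C(n, 2) = 190 target ANDs

def sample_batch(batch_size: int = 256):
    x = (torch.rand(batch_size, n) < p_on).float()   # sparse binary inputs
    y = torch.stack([x[:, i] * x[:, j] for i, j in pairs], dim=-1)
    return x, y

model = torch.nn.Sequential(
    torch.nn.Linear(n, d_mlp),        # d_mlp << C(n, 2): ANDs must share neurons
    torch.nn.ReLU(),
    torch.nn.Linear(d_mlp, len(pairs)),
)
x, y = sample_batch()
loss = torch.nn.functional.mse_loss(model(x), y)     # train on this to study how ANDs pack into neurons
```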
Conceptual Frameworks:
Geometry of Superposition
Approach: Extracting features from superposition
Finding an interpretable, overcomplete basis
Q: What to learn from neuroscience, compressed sensing, etc?
[Diagram: data ≈ reconstruction as a sparse combination of dictionary elements]
Example: Compound Word Detection
How big a deal is this?
Extracting Features in Superposition
Idea: Train a Sparse Autoencoder
Key Idea: True Features Are Sparse
Decomposing Language Models With Dictionary Learning (Bricken et al)
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al)
L1 Regularisation
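A minimal sparse autoencoder sketch in the spirit of these papers (hyperparameters and names are illustrative; the exact bias, normalisation, and resampling choices differ between Bricken et al and Cunningham et al):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder on model activations with an L1 sparsity penalty.
    Illustrative sketch; implementation details vary by paper."""
    def __init__(self, d_act: int, d_dict: int):
        super().__init__()
        self.W_enc = nn.Linear(d_act, d_dict)   # d_dict >> d_act: overcomplete dictionary
        self.W_dec = nn.Linear(d_dict, d_act)

    def forward(self, x):
        f = torch.relu(self.W_enc(x))           # non-negative feature activations
        x_hat = self.W_dec(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()   # reconstruction error
    sparsity = f.abs().sum(dim=-1).mean()           # L1 penalty pushes feature activations to be sparse
    return recon + l1_coeff * sparsity

# Usage sketch: train on cached MLP activations, then inspect which inputs
# activate each dictionary feature (e.g. the Arabic feature below).
```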
Finding Monosemantic Features: The Arabic Feature
Feature Splitting: Surprising Geometric Structure
Example: A Split Feature
Fact Localisation: Study Computation in Superposition
Nanda & Rajamanoharan, forthcoming
How Do Language Models Store & Recall Facts?
Fact Localisation (Nanda & Rajamanoharan, forthcoming)
Distilling to a Toy Model
[Diagram: frozen random "embeddings" for integers 0 and 1 are summed into a residual stream, processed by MLP 0 and MLP 1, and read off by a binary linear classifier]
Fact Localisation (Nanda & Rajamanoharan, forthcoming)
MLP(x) = W_out GELU(W_in x + b_in)
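The same block written as PyTorch, as a sketch (the dimensions are placeholders; the slide's formula omits an output bias, so the sketch does too):

```python
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    # Sketch of MLP(x) = W_out · GELU(W_in · x + b_in); d_model and d_mlp are placeholders.
    def __init__(self, d_model: int = 64, d_mlp: int = 256):
        super().__init__()
        self.W_in = nn.Linear(d_model, d_mlp)                # W_in x + b_in
        self.W_out = nn.Linear(d_mlp, d_model, bias=False)   # W_out, no output bias, matching the slide
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.W_out(self.act(self.W_in(x)))
```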
Case Study: Emergent World Representations in Othello-GPT
Networks have real underlying principles with predictive power
Seemingly Non-Linear Representations?!
Linear Representation Hypothesis:
Models represent features as directions in space
Models have underlying principles with predictive power
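A hedged sketch of the kind of linear probe used to test this (shapes, the 3-way mine/theirs/empty labelling, and names are illustrative, not the original code): if a board-state feature is represented as a direction, a single linear map on the residual stream should recover it.

```python
import torch
import torch.nn as nn

# Illustrative linear probe: residual-stream activations -> board-state classes.
d_model, n_squares, n_classes = 512, 64, 3          # placeholder sizes
probe = nn.Linear(d_model, n_squares * n_classes)

def probe_loss(resid: torch.Tensor, board_labels: torch.Tensor) -> torch.Tensor:
    # resid: [batch, d_model]; board_labels: [batch, n_squares], integer classes in {0, 1, 2}
    logits = probe(resid).view(-1, n_squares, n_classes)
    return nn.functional.cross_entropy(logits.flatten(0, 1), board_labels.flatten())

# If a purely linear probe reaches high accuracy, the feature is (close to) a
# direction in activation space, supporting the linear representation hypothesis.
```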
Case Study: Emergent World Representations in Othello
Key takeaway: Models have universal principles with predictive power
Learning More
Model Debugging
Fixing behaviour without breaking everything else
Challenge: Memory editing = Deletion + Insertion
ROME Tries Memory Editing
Memory insertion
True memory editing?
Challenge: What Does Finetuning Do To Circuits?
Are circuits rearranged, or formed anew?
Challenge: Fixing Bad In-Context Learning
Finding the minimal edit
Examples: Sycophancy, less educated, buggy code