Mechanistic Interpretability & Mathematics: A Whirlwind Tour
Slides: https://neelnanda.io/maths-mech-interp-slides
Sequence: https://neelnanda.io/concrete-open-problems
Getting Started: https://neelnanda.io/getting-started
What is Mechanistic Interpretability?
Key Claim: Mechanistic Understanding of Neural Networks is Possible
Setup
The Fourier Multiplication Algorithm for Modular Addition
Mystery: Why do models grok?
(Power et al, 2022)
Background
Inspiration: Mechanistic Interpretability
A Growing Area of Research
Proposal: Apply mechanistic interpretability techniques to reverse-engineer modular addition, to understand grokking
Getting Traction
Understanding the Embedding
t-SNE of embedding, Power et al
Principal components of embedding, Liu et al
Insight: Apply the Fourier Transform
[Figure: Fourier components of the embedding, at initialisation vs in the grokked model]
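A minimal sketch of this analysis (illustrative, not the original code; shapes, the modulus p = 113, and the random stand-in for W_E are assumptions): write the embedding in a Fourier basis over the input tokens and look at the norm of each component. In a grokked model a handful of frequencies dominate; at initialisation the spectrum is flat.

```python
import torch

# Illustrative sketch, not the original analysis code.
# Assume W_E has shape [p, d_model]: one row per input token 0..p-1.
p, d_model = 113, 128
W_E = torch.randn(p, d_model)  # stand-in for a trained embedding matrix

# Build a real Fourier basis over the p tokens: constant, cos(wk), sin(wk).
ks = torch.arange(p).float()
basis = [torch.ones(p)]
for w in range(1, p // 2 + 1):
    basis.append(torch.cos(2 * torch.pi * w * ks / p))
    basis.append(torch.sin(2 * torch.pi * w * ks / p))
fourier_basis = torch.stack(basis)               # [p, p]
fourier_basis = fourier_basis / fourier_basis.norm(dim=-1, keepdim=True)

# Change of basis: each row is now a Fourier component of the embedding.
W_E_fourier = fourier_basis @ W_E                # [p, d_model]
component_norms = W_E_fourier.norm(dim=-1)       # one norm per Fourier component

# In a grokked modular-addition model a few frequencies dominate these norms;
# at initialisation the spectrum is roughly flat.
print(component_norms.topk(10).indices)
```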
Logits are Sums of Cosines
Explains 95% of the variance!
Why cos?
What are the logits?
Inputs a,b -> output c
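A sketch of the trig identity behind the cosine structure (notation mine: ω = 2πk/p for a key frequency k):

$$\cos(\omega a)\cos(\omega b) - \sin(\omega a)\sin(\omega b) = \cos\big(\omega(a+b)\big)$$
$$\cos\big(\omega(a+b)\big)\cos(\omega c) + \sin\big(\omega(a+b)\big)\sin(\omega c) = \cos\big(\omega(a+b-c)\big)$$

Summing cos(ω(a+b−c)) over the key frequencies gives logits that constructively interfere exactly at c ≡ a+b (mod p) and roughly cancel elsewhere.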
The Learned Algorithm
The Fourier Multiplication Algorithm for Modular Addition
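A minimal numerical sketch of the algorithm as described (the modulus and key frequencies below are placeholders, not a trained model's weights): compute cos/sin of each input at a few key frequencies, combine them with the product-to-sum identities, and read off logits proportional to cos(ω(a+b−c)); the argmax recovers (a+b) mod p.

```python
import numpy as np

# Sketch of Fourier multiplication for modular addition (illustrative only).
p = 113
key_freqs = [14, 35, 41, 52]  # placeholder frequencies; a trained model picks its own

def fourier_add_logits(a: int, b: int) -> np.ndarray:
    c = np.arange(p)
    logits = np.zeros(p)
    for k in key_freqs:
        w = 2 * np.pi * k / p
        # cos(w a) cos(w b) - sin(w a) sin(w b) = cos(w (a + b))
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # cos(w (a + b)) cos(w c) + sin(w (a + b)) sin(w c) = cos(w (a + b - c))
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return logits

a, b = 17, 99
assert fourier_add_logits(a, b).argmax() == (a + b) % p
```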
Lines of Evidence
Understanding Grokking
Mystery: Why do models grok?
Question: Is mechanistic understanding useful?
The Three Phases of Grokking
Two learned algorithms: Memorising circuit and Fourier circuit
Understanding Grokking via Progress Measures
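The paper's progress measures (restricted and excluded loss) need the full model; as a simpler illustrative stand-in (my construction, not the paper's metric), one can track how concentrated the embedding's Fourier spectrum is over training. It rises sharply as the Fourier circuit forms, well before test loss drops.

```python
import torch

def fourier_concentration(W_E: torch.Tensor, fourier_basis: torch.Tensor, top_k: int = 8) -> float:
    """Illustrative progress measure (not the paper's restricted/excluded loss):
    fraction of the embedding's Fourier power captured by its top_k components."""
    norms = (fourier_basis @ W_E).norm(dim=-1)        # norm of each Fourier component
    total = norms.pow(2).sum()
    top = norms.pow(2).topk(top_k).values.sum()
    return (top / total).item()

# Usage: log fourier_concentration(model_W_E, fourier_basis) every few steps,
# with fourier_basis built as in the earlier sketch.
```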
Generalising the Algorithm with Representation Theory
Follow Up: A Toy Model of Universality (Chughtai et al)
The Group Composition via Representations Algorithm
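A sketch of the algorithm in symbols (my notation; ρ is a faithful unitary irreducible representation of the group):

$$\text{logit}(c \mid a, b) \;\propto\; \operatorname{tr}\!\big(\rho(a)\,\rho(b)\,\rho(c)^{-1}\big)$$

For a unitary representation the trace is maximised when ρ(a)ρ(b)ρ(c)^{-1} is the identity, i.e. when c = ab, so taking the argmax over logits recovers group composition.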
Analysing Learned Representations in S5 Composition
Testing Universality: Which Representations Are Learned?
Key Takeaway 1: Deep Learning Can Learn Rich Mathematical Structure
Key Takeaway 2: Mechanistic Understanding Is Possible
A Growing Area of Research
What is Mechanistic Interpretability?
Motivating Mechanistic Interpretability
Goal: Understand Model Cognition
Is it aligned, or telling us what we want to hear?
Motivation
Interpretability in a Post-GPT-4 World
Inputs and Outputs Are Not Enough
How do models represent their thoughts?
Understanding Superposition
Approach: Forming conceptual frameworks
Q: Can we do it for neuron superposition?
E.g. n sparse binary inputs, C(n,2) outputs for x_i AND x_j
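A hedged sketch of that toy setup (sizes, sparsity, and names are illustrative choices, not any specific paper's code): n sparse binary inputs, one target per pair (i, j) equal to x_i AND x_j, and a one-layer MLP with fewer neurons than outputs, so the ANDs must be computed in superposition.

```python
import itertools
import torch

# Illustrative toy setup for studying computation in superposition.
n, p_on, d_mlp = 20, 0.05, 40
pairs = list(itertools.combinations(range(n), 2))    # C(n, 2) = 190 target ANDs

def sample_batch(batch_size: int = 256):
    x = (torch.rand(batch_size, n) < p_on).float()   # sparse binary inputs
    y = torch.stack([x[:, i] * x[:, j] for i, j in pairs], dim=-1)
    return x, y

model = torch.nn.Sequential(
    torch.nn.Linear(n, d_mlp),        # d_mlp << C(n, 2): ANDs must share neurons
    torch.nn.ReLU(),
    torch.nn.Linear(d_mlp, len(pairs)),
)
x, y = sample_batch()
loss = torch.nn.functional.mse_loss(model(x), y)     # train on this to study how ANDs pack into neurons
```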
Conceptual Frameworks:
Geometry of Superposition
Approach: Extracting features from superposition
Finding an interpretable, overcomplete basis
Q: What to learn from neuroscience, compressed sensing, etc?
[Diagram: data ≈ reconstruction as a sparse combination of dictionary elements]
Example: Compound Word Detection
How big a deal is this?
Extracting Features in Superposition
Idea: Train a Sparse Autoencoder
Key Idea: True Features Are Sparse
Decomposing Language Models With Dictionary Learning (Bricken et al)
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al)
L1 Regularisation
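A minimal sparse autoencoder sketch in the spirit of these papers (hyperparameters and names are illustrative; the exact bias, normalisation, and resampling choices differ between Bricken et al and Cunningham et al):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder on model activations with an L1 sparsity penalty.
    Illustrative sketch; implementation details vary by paper."""
    def __init__(self, d_act: int, d_dict: int):
        super().__init__()
        self.W_enc = nn.Linear(d_act, d_dict)   # d_dict >> d_act: overcomplete dictionary
        self.W_dec = nn.Linear(d_dict, d_act)

    def forward(self, x):
        f = torch.relu(self.W_enc(x))           # non-negative feature activations
        x_hat = self.W_dec(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()   # reconstruction error
    sparsity = f.abs().sum(dim=-1).mean()           # L1 penalty pushes feature activations to be sparse
    return recon + l1_coeff * sparsity

# Usage sketch: train on cached MLP activations, then inspect which inputs
# activate each dictionary feature (e.g. the Arabic feature below).
```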
Finding Monosemantic Features: The Arabic Feature
Feature Splitting: Surprising Geometric Structure
Example: A Split Feature
Fact Localisation: Study Computation in Superposition
Nanda & Rajamanoharan, forthcoming
How Do Language Models Store & Recall Facts?
Fact Localisation (Nanda & Rajamanoharan, forthcoming)
Distilling to a Toy Model
[Diagram: frozen random "embeddings" for integers 0 and 1 are summed into a residual stream, processed by MLP 0 and MLP 1, and read off by a binary linear classifier]
Fact Localisation (Nanda & Rajamanoharan, forthcoming)
MLP(x) = W_out GELU(W_in x + b_in)
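The same block written as PyTorch, as a sketch (the dimensions are placeholders; the slide's formula omits an output bias, so the sketch does too):

```python
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    # Sketch of MLP(x) = W_out · GELU(W_in · x + b_in); d_model and d_mlp are placeholders.
    def __init__(self, d_model: int = 64, d_mlp: int = 256):
        super().__init__()
        self.W_in = nn.Linear(d_model, d_mlp)                # W_in x + b_in
        self.W_out = nn.Linear(d_mlp, d_model, bias=False)   # W_out, no output bias, matching the slide
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.W_out(self.act(self.W_in(x)))
```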
Case Study: Emergent World Representations in Othello-GPT
Networks have real underlying principles with predictive power
Seemingly Non-Linear Representations?!
Linear Representation Hypothesis:
Models represent features as directions in space
Models have underlying principles with predictive power
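A hedged sketch of the kind of linear probe used to test this (shapes, the 3-way mine/theirs/empty labelling, and names are illustrative, not the original code): if a board-state feature is represented as a direction, a single linear map on the residual stream should recover it.

```python
import torch
import torch.nn as nn

# Illustrative linear probe: residual-stream activations -> board-state classes.
d_model, n_squares, n_classes = 512, 64, 3          # placeholder sizes
probe = nn.Linear(d_model, n_squares * n_classes)

def probe_loss(resid: torch.Tensor, board_labels: torch.Tensor) -> torch.Tensor:
    # resid: [batch, d_model]; board_labels: [batch, n_squares], integer classes in {0, 1, 2}
    logits = probe(resid).view(-1, n_squares, n_classes)
    return nn.functional.cross_entropy(logits.flatten(0, 1), board_labels.flatten())

# If a purely linear probe reaches high accuracy, the feature is (close to) a
# direction in activation space, supporting the linear representation hypothesis.
```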
Case Study: Emergent World Representations in Othello
Key takeaway: Models have universal principles with predictive power
Learning More
Model Debugging
Fixing behaviour without breaking everything else
Challenge: Memory editing = Deletion + Insertion
ROME Tries Memory Editing
Memory insertion
True memory editing?
Challenge: What Does Finetuning Do To Circuits?
Are circuits rearranged, or formed anew?
Challenge: Fixing Bad In-Context Learning
Finding the minimal edit
Examples: Sycophancy, less educated, buggy code