1 of 66

Mechanistic Interpretability & Mathematics: A Whirlwind Tour

2 of 66

What is Mechanistic Interpretability?

  • Hypothesis: Models learn human-comprehensible algorithms and can be understood, if we learn how to make them legible
    • Achieve rigorous understanding without tricking ourselves
  • Goal: Reverse engineer trained neural networks, like decompiling a program binary into source code

3 of 66

Key Claim: Mechanistic Understanding of Neural Networks is Possible

4 of 66

Setup

5 of 66

The Fourier Multiplication Algorithm for Modular Addition

6 of 66

Mystery: Why do models grok?

(Power et al, 2022)

7 of 66

Background

8 of 66

Inspiration: Mechanistic Interpretability

  • Hypothesis: Models learn human-comprehensible algorithms and can be understood, if we learn how to make them legible
    • Achieve rigorous understanding without tricking ourselves
  • Goal: Reverse engineer trained neural networks, like decompiling a program binary into source code

9 of 66

A Growing Area of Research

10 of 66

A Growing Area of Research

11 of 66

Proposal: Apply mechanistic interpretability techniques to reverse-engineer modular addition, to understand grokking

12 of 66

Getting Traction

13 of 66

Understanding the Embedding

t-SNE of embedding, Power et al

Principal components of embedding, Liu et al

14 of 66

Insight: Apply the Fourier Transform

At initialisation

Grokked model

15 of 66

Logits are Sums of Cosines

Explains 95% of the variance!

What are the logits?

Inputs a,b -> output c

16 of 66

Logits are Sums of Cosines

Explains 95% of the variance!

Why cos?

What are the logits?

Inputs a,b -> output c
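
In symbols, the claim is roughly the following (my paraphrase of the fitted form; the coefficients α_k and the handful of key frequencies w_k = 2πk/p, with p the modulus, are read off from the trained model):

logits(a, b)[c] ≈ Σ_k α_k · cos(w_k · (a + b − c)), summing over the key frequencies k

Each cosine term is maximised exactly when c ≡ a + b (mod p), so summing a few of them picks out the correct answer.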

17 of 66

The Learned Algorithm

18 of 66

The Fourier Multiplication Algorithm for Modular Addition
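
A minimal numerical sketch of the algorithm (my illustration, not the trained model's weights: the modulus p = 113 and the particular key frequencies are placeholder choices, and every frequency gets equal weight):

```python
import numpy as np

p = 113                   # modulus; a placeholder choice
key_freqs = [14, 35, 41]  # placeholder "key" frequencies; a trained model picks its own handful

def fourier_mult_logits(a, b):
    """Logits over c for (a + b) mod p, via Fourier multiplication."""
    c = np.arange(p)
    logits = np.zeros(p)
    for k in key_freqs:
        w = 2 * np.pi * k / p
        # The embedding represents a and b as cos(wa), sin(wa), cos(wb), sin(wb).
        # Multiplying (done by the MLP in the real model) gives cos(w(a+b)), sin(w(a+b)):
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # The unembedding forms cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc) = cos(w(a+b-c)),
        # which is maximised exactly when c = (a + b) mod p.
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return logits

a, b = 27, 95
assert np.argmax(fourier_mult_logits(a, b)) == (a + b) % p
```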

19 of 66

Lines of Evidence

  1. Suggestive Evidence: Surprising Periodicity
  2. Mechanistic Evidence: Reading off the algorithm from model weights

20 of 66

Lines of Evidence

  • Suggestive Evidence: Surprising Periodicity
  • Mechanistic Evidence: Composing Model Weights
  • Zooming In: Approximating Neurons with Sines and Cosines

21 of 66

Lines of Evidence

  • Suggestive Evidence: Surprising Periodicity
  • Mechanistic Evidence: Composing Model Weights
  • Zooming In: Approximating Neurons with Sines and Cosines
  • Correctness checks: Ablations

22 of 66

Understanding Grokking

23 of 66

Mystery: Why do models grok?

Question: Is mechanistic understanding useful?

  1. It’s a random walk (Millidge, 2022)
  2. It takes time to learn representations (Liu et al, 2022)
  3. Double descent (Davies et al, 2022)
  4. The slingshot mechanism (Thilak et al, 2022)
  5. The weight norm is too high (Liu et al, 2023)

24 of 66

The Three Phases of Grokking

Two learned algorithms: Memorising circuit and Fourier circuit

  1. Memorisation
  2. Circuit Formation
  3. Cleanup

25 of 66

Understanding Grokking via Progress Measures

  • Progress Measure: A smooth metric that identifies hidden progress
  • Key: Using our mechanistic understanding to derive the measures
  • Circuit formation, tracked with Excluded loss: removes the Fourier circuit's contribution to the logits

26 of 66

Understanding Grokking via Progress Measures

  • Progress Measure: A smooth metric that identifies hidden progress
  • Key: Using our mechanistic understanding to derive the measures
  • Circuit formation, tracked with Excluded loss: removes the Fourier circuit's contribution to the logits
  • Cleanup, tracked with Restricted loss: removes the memorisation circuit's contribution (sketch of both measures below)
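
A rough sketch of how such progress measures can be computed (a simplification of the actual metrics: I assume access to the full p × p × p logit tensor and the model's key frequencies, and project onto the cos/sin(w(a+b)) directions over the input grid; the function and variable names are my own):

```python
import numpy as np

def ablate_fourier_circuit(logits, p, key_freqs, keep_only=False):
    """logits: array [p, p, p] with logits[a, b, c] from the trained model.
    Project each output logit, viewed as a function of (a, b), onto the directions
    cos(w(a+b)) and sin(w(a+b)) for the key frequencies w = 2*pi*k/p, then either
    remove those components (-> excluded loss) or keep only them (-> restricted loss)."""
    a = np.arange(p)[:, None]
    b = np.arange(p)[None, :]
    component = np.zeros(logits.shape)
    for k in key_freqs:
        w = 2 * np.pi * k / p
        for basis in (np.cos(w * (a + b)), np.sin(w * (a + b))):
            basis = basis / np.linalg.norm(basis)
            coef = np.einsum('abc,ab->c', logits, basis)   # coefficient per output class c
            component += basis[:, :, None] * coef[None, None, :]
    return component if keep_only else logits - component

def modular_addition_loss(logits, p):
    """Mean cross-entropy when the correct label is c = (a + b) mod p."""
    a = np.arange(p)[:, None]
    b = np.arange(p)[None, :]
    labels = (a + b) % p
    z = logits - logits.max(-1, keepdims=True)
    logprobs = z - np.log(np.exp(z).sum(-1, keepdims=True))
    return -np.take_along_axis(logprobs, labels[:, :, None], axis=-1).mean()

# Excluded loss: ablate the Fourier circuit, evaluate on the training pairs.
# Restricted loss: keep only the Fourier circuit, evaluate on held-out pairs.
```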

27 of 66

Generalising the Algorithm with Representation Theory

Follow Up: A Toy Model of Universality (Chughtai et al)

28 of 66

The Group Composition via Representations Algorithm
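
The core identity behind the algorithm can be checked numerically; a sketch below (my illustration: I use the natural permutation-matrix representation of S5 for concreteness, whereas the trained networks learn various irreducible representations). The "logit" tr(ρ(a)ρ(b)ρ(c)⁻¹) is maximised exactly at c = a∘b:

```python
import numpy as np
from itertools import permutations

n = 5
group = list(permutations(range(n)))   # the 120 elements of S5

def rho(perm):
    """Permutation-matrix representation: rho(p) @ rho(q) = rho(p o q)."""
    m = np.zeros((n, n))
    m[np.array(perm), np.arange(n)] = 1.0
    return m

def compose(p, q):
    """(p o q)(i) = p(q(i))"""
    return tuple(p[q[i]] for i in range(n))

a, b = group[17], group[101]           # two arbitrary group elements
# "Logit" for each candidate c: tr(rho(a) rho(b) rho(c)^{-1}); permutation matrices
# are orthogonal, so rho(c)^{-1} = rho(c).T. The trace equals the number of fixed
# points of a o b o c^{-1}, which is maximised (= n) exactly when c = a o b.
logits = [np.trace(rho(a) @ rho(b) @ rho(c).T) for c in group]
assert group[int(np.argmax(logits))] == compose(a, b)
```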

29 of 66

Analysing Learned Representations in S5 Composition

30 of 66

Testing Universality: Which Representations Are Learned?

31 of 66

Key Takeaway 1: Deep Learning Can Learn Rich Mathematical Structure

Key Takeaway 2: Mechanistic Understanding Is Possible

32 of 66

A Growing Area of Research

33 of 66

A Growing Area of Research

34 of 66

What is Mechanistic Interpretability?

  • Goal: Reverse engineer neural networks
    • Like reverse-engineering a compiled program binary to source code
  • Hypothesis: Models learn human-comprehensible algorithms and can be understood, if we learn how to make them legible
  • Understanding features - the variables inside the model
  • Understanding circuits - the algorithms learned to compute features
  • Forming an epistemic foundation for a rigorous science of interpretability

35 of 66

Motivating Mechanistic Interpretability

Goal: Understand Model Cognition

Is it aligned, or telling us what we want to hear?

36 of 66

Motivation

  • Key Q: What should interpretability look like in a post-GPT-4 world?
  • Large, generative language models are a big deal
  • Models will keep scaling. What work done now will matter in the future?
    • Emergent capabilities keep arising
    • Many mundane problems go away
    • A single massive foundation model

37 of 66

Interpretability in a Post-GPT-4 World

Inputs and Outputs Are Not Enough

38 of 66

How do models represent their thoughts?

Understanding Superposition

39 of 66

How do models represent thoughts?

Understanding Superposition

  • Goal: Decompose models into independently meaningful + composable units/features
    • Curse of dimensionality -> this is crucial
  • Hope: Neurons = features
  • Problem: Polysemanticity

40 of 66

How do models represent thoughts?

Understanding Superposition

  • Goal: Decompose models into composable + meaningful units
    • Curse of dimensionality -> this is crucial
  • Hope: Neurons = features
  • Problem: Polysemanticity
  • Hypothesis: Superposition (illustrated below)
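
A quick numerical illustration of why superposition is plausible (my sketch: many more nearly-orthogonal directions than dimensions, so features can share the space at the cost of a little interference):

```python
import numpy as np

d, n_features = 256, 4096                # far more features than dimensions
dirs = np.random.randn(n_features, d)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

sims = dirs @ dirs.T
np.fill_diagonal(sims, 0)
print("max interference between feature directions:", np.abs(sims).max())
# Typically around 0.3 here: each of the 4096 "features" gets its own direction in
# a 256-dimensional space, at the cost of small interference between them.
```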

41 of 66

Approach: Forming conceptual frameworks

Q: Can we do it for neuron superposition?

Eg n sparse binary inputs, nC2 outputs for x_i AND x_j
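
A sketch of the toy setup this describes (my reading of the slide; n, the sparsity level, and the batch size are placeholder choices):

```python
import numpy as np
from itertools import combinations

n, sparsity, batch = 20, 0.05, 512       # placeholder choices
pairs = list(combinations(range(n), 2))  # the n-choose-2 = 190 target outputs

# n sparse binary inputs: each feature is on independently, with low probability
x = (np.random.rand(batch, n) < sparsity).astype(np.float32)

# Targets: y_ij = x_i AND x_j, one output per pair (i, j)
y = np.stack([x[:, i] * x[:, j] for i, j in pairs], axis=1)

print(x.shape, y.shape)  # (512, 20), (512, 190) -- can a model with far fewer than
                         # 190 neurons compute all 190 ANDs? That is the question.
```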

42 of 66

Conceptual Frameworks:

Geometry of Superposition

43 of 66

Approach: Extracting features from superposition

Finding an interpretable, overcomplete basis

Q: What to learn from neuroscience, compressed sensing, etc?

Diagram: Data ≈ Dictionary × sparse codes -> Reconstruction

44 of 66

Example: Compound Word Detection

How big a deal is this?

45 of 66

Extracting Features in Superposition

46 of 66

Idea: Train a Sparse Autoencoder

47 of 66

Key Idea: True Features Are Sparse

L1 Regularisation
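
A minimal sparse autoencoder sketch (my simplified version, not any particular codebase: d_model, the expansion factor, and the L1 coefficient are placeholder hyperparameters, and real setups add details such as decoder weight normalisation):

```python
import torch
import torch.nn as nn

d_model, expansion, l1_coeff = 512, 8, 3e-4     # placeholder hyperparameters
d_hidden = d_model * expansion                  # overcomplete dictionary of features

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        f = torch.relu(self.enc(acts))          # feature activations (want these sparse)
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

acts = torch.randn(1024, d_model)               # stand-in for real MLP activations
for _ in range(100):
    recon, f = sae(acts)
    # reconstruction loss + L1 penalty pushing feature activations towards sparsity
    loss = (recon - acts).pow(2).mean() + l1_coeff * f.abs().sum(-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```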

48 of 66

Finding Monosemantic Features: The Arabic Feature

49 of 66

Feature Splitting: Surprising Geometric Structure

50 of 66

Example: A Split Feature

51 of 66

Fact Localisation: Studying Computation in Superposition

(Nanda & Rajamanoharan, forthcoming)

52 of 66

How Do Language Models Store & Recall Facts?

  • Eg “Fact: Michael Jordan plays the sport of” -> “basketball”

53 of 66

How Do Language Models Store & Recall Facts?

  • Eg “Fact: Michael Jordan plays the sport of” -> “basketball”
  • Sport is linearly separable on the Jordan token!
  • Mystery: How do the early MLP layers combine the name tokens so that the sport becomes linearly separable? (probe sketch below)

Fact Localisation (Nanda & Rajamanoharan, forthcoming)
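
"Linearly separable" is operationalised with a linear probe; a sketch of such a probe below (hypothetical setup: the activations and sport labels are stand-ins for residual-stream vectors collected at the final name token):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one residual-stream vector per athlete, taken at the final
# name token (e.g. " Jordan"), plus a sport label (0 = basketball, 1 = football, ...).
d_model, n_athletes = 768, 1500
acts = np.random.randn(n_athletes, d_model)        # stand-in for real activations
labels = np.random.randint(0, 3, size=n_athletes)  # stand-in for real sport labels

probe = LogisticRegression(max_iter=1000).fit(acts[:1000], labels[:1000])
print("probe accuracy:", probe.score(acts[1000:], labels[1000:]))
# With real activations, high held-out accuracy is the evidence that "sport" is
# linearly separable on the name token.
```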

54 of 66

Distilling to a Toy Model

  • Memorise a lookup table from {0, …, n-1} x {0, …, n-1} -> {0, 1}
  • Mystery: Can we reverse-engineer it?

Architecture (diagram): Integer 0 "embedding" and Integer 1 "embedding" (both frozen, random) are summed, passed through MLP 0 and then MLP 1 (each added back into the residual stream), and read off by a binary linear classifier.

Fact Localisation (Nanda & Rajamanoharan, forthcoming)

MLP(x) = W_out GELU(W_in x + b_in)
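
A sketch of this toy model in code (my reconstruction from the diagram above; d_model, d_mlp, and n are placeholder sizes, and the residual additions follow the '+' nodes in the diagram):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n, d_model, d_mlp = 100, 128, 512                  # placeholder sizes

class MLP(nn.Module):
    # MLP(x) = W_out GELU(W_in x + b_in)
    def __init__(self):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_mlp)
        self.w_out = nn.Linear(d_mlp, d_model, bias=False)
    def forward(self, x):
        return self.w_out(F.gelu(self.w_in(x)))

class ToyFactModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Two frozen, random "embeddings" for the two integer tokens
        self.emb0 = nn.Embedding(n, d_model); self.emb0.weight.requires_grad_(False)
        self.emb1 = nn.Embedding(n, d_model); self.emb1.weight.requires_grad_(False)
        self.mlp0, self.mlp1 = MLP(), MLP()
        self.classifier = nn.Linear(d_model, 1)    # binary linear classifier
    def forward(self, a, b):
        x = self.emb0(a) + self.emb1(b)            # residual stream: summed embeddings
        x = x + self.mlp0(x)                       # MLPs write back into the residual stream
        x = x + self.mlp1(x)
        return self.classifier(x)                  # logit for the {0, 1} label

# Task: memorise a random lookup table {0, ..., n-1} x {0, ..., n-1} -> {0, 1}
table = torch.randint(0, 2, (n, n)).float()
```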

55 of 66

Case Study: Emergent World Representations in Othello

56 of 66

Case Study: Emergent World Representations in Othello-GPT

Networks have real underlying principles with predictive power

Seemingly Non-Linear Representations?!

57 of 66

Linear Representation Hypothesis:

Models represent features as directions in space

Models have underlying principles with predictive power

58 of 66

  • My colour vs theirs
  • Linear representation hypothesis (sketched after this list)
    • Generalises
    • Survived falsification
    • Has predictive power
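
A sketch of what the hypothesis buys in practice (my illustration, not the Othello-GPT code: if a feature is a direction, you can read it with a dot product and intervene by editing the component along that direction):

```python
import numpy as np

d_model = 512
act = np.random.randn(d_model)         # stand-in for a residual stream activation
direction = np.random.randn(d_model)   # stand-in for a probe-derived feature direction
direction /= np.linalg.norm(direction)

# Read: the feature's value is (approximately) the projection onto its direction
value = act @ direction

# Intervene: remove the current component and write in a new value, leaving the rest
# of the activation untouched; if the hypothesis holds, downstream behaviour changes
# as though the feature itself were flipped (e.g. "my colour" becomes "their colour").
new_value = -value
act_edited = act - value * direction + new_value * direction
```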

59 of 66

Case Study: Emergent World Representations in Othello

Key takeaway: Models have universal principles with predictive power

60 of 66

Learning More

  • 200 Concrete Open Problems in Mechanistic Interpretability
  • Getting Started in Mechanistic Interpretability
  • A Comprehensive Mechanistic Interpretability Explainer
    • https://neelnanda.io/glossary
  • TransformerLens
  • Slides: https://neelnanda.io/whirlwind-slides

61 of 66

Model Debugging

Fixing behaviour without breaking everything else

62 of 66

Challenge: Memory editing = Deletion + Insertion

ROME Tries Memory Editing

63 of 66

Challenge: Memory editing = Deletion + Insertion

Memory insertion

64 of 66

Challenge: Memory editing = Deletion + Insertion

True memory editing?

Challenge:

  • Zoom in: How are facts represented in neurons?
    • Probing + interventions + how it’s used
    • Superposition?
  • A minimal deletion
  • A minimal insertion

65 of 66

Challenge: What Does Finetuning Do To Circuits?

Are circuits rearranged, or formed anew?

66 of 66

Challenge: Fixing Bad In-Context Learning

Finding the minimal edit

Examples: Sycophancy, less educated, buggy code

Challenge:

  • Identify key heads
  • Find the minimal edit
  • Superposition? Context-specific?