1 of 84

Mechanistic Interpretability

The big ideas, schools of thought, and areas people care about and why

2 of 84

Table of contents

  1. History (2 slides)
  2. Characterising mechanistic interpretability, and progress in it (10 slides)
  3. Give my takes on the big areas in mech interp (in progress, likely ~60 slides)
    1. Brief background
    2. What’s the main idea, what’s the point
    3. Best existing work in this area
    4. Point score:
      1. How many low hanging fruit remain? (/10)
      2. How excited am I about work in this area? (/10)
      3. How hard would it be to pull it off successfully? (/10)
    5. Advice: How to choose a good research direction in this area
  4. Bonus: WTF is attention superposition? Why is it so difficult? (31 slides)


3 of 84

History

4 of 84

Chris Olah

5 of 84

Other OGs

Distill
  • Alexander Mordvintsev - DeepDream, precursor to the Distill work
  • Nick Cammarata - Cofounder of mech interp: led most of the Distill circuits work
  • Ludwig Schubert - Core infrastructure
  • Shan Carter…
  • Gabriel Goh…
  • Chelsea Voss…

transformer-circuits
  • Christine Olsson - Experimental work to produce the insights the framework was built from
  • Neel Nanda - Conceptual work to form the framework with Chris
  • Nelson Elhage - Core infrastructure

6 of 84

The prehistory (biography of the OG)

2010 & 2011 - Chris Olah does 3D printing in high school
2012 - Chris gets a Thiel Fellowship for 3D printing. AI is a side hobby at this point
2013 - Cold emails Yoshua Bengio at MILA. Gets into deep learning
2014 - Google Brain internship, explores visualizing neural networks
2015 - Hops on Alexander Mordvintsev (AM)'s DeepDream, generalizes it
2016 - Co-founds Distill. Writes Concrete Problems in AI Safety with Dario Amodei
2017 - Separately, Transformers are invented.
2018 - Creates Lucid with AM & Ludwig. Joins OpenAI. Mech interp work starts here.
2019 - Chris moves to a mentorship role. Nick Cammarata drives the Vision Circuits work
2020 - Circuits thread on Distill
2021 - Distill goes on indefinite hiatus. Chris joins Anthropic. Starts transformer-circuits

7 of 84

transformer-circuits

LessWrong posts commenting on mech interp date back to 2018, and other AI safety people were aware of this work, but as far as I can tell no one else actually did mech interp, so I won't cover what anyone else did before 2022

March 2020 - April 2021 - Original Distill circuits thread
Dec 2021 - First transformer-circuits post. Discovery of induction heads.
March 2022 - Induction heads, developmental study of ICL
June 2022 - Softmax Linear Units
September 2022 - Toy Models of Superposition
(Small posts fill a ~1 year gap: Double Descent, May circuits updates, July updates)
October 2023 - Towards Monosemanticity
January 2024 - January updates

8 of 84

Leonard’s Taxonomy of MechInt

  • Sharkey’s Fundamental vs Applied mechanistic interpretability
  • When in the training process: Intrinsic vs Developmental vs Post-Hoc
  • Which model? How much of behavior explained? Which data distribution?

9 of 84

A different perspective

“Solving MechInterp” involves filling out this ternary plot

[Ternary plot; works placed on it include: Sudden Drops in the Loss (Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs); Distill Circuits Thread; In-Context Learning and Induction Heads; Studying Large Language Model Generalization with Influence Functions; Toy Models of Superposition; A Mathematical Framework for Transformer Circuits; Interpretability in the Wild (a circuit for indirect object identification in GPT-2 small); Towards Monosemanticity; Representation Engineering; Singular Learning Theory; Delays, Detours, and Forks in the Road (Latent State Models of Training Dynamics)]

10 of 84

A different perspective

“Solving MechInterp” involves filling out this ternary plot

Wherever there’s a gap, there’s an opportunity. We can talk about what filling these gaps might look like.

[Ternary plot with the same example works as before]

11 of 84

A different perspective

“Solving MechInterp” involves filling out this ternary plot

Micro~Macro: A conflation of:

  • Toy->algorithmic->SLM->SOTA progression
  • Comprehensiveness of explanation
  • Distributional coverage
  • Capability coverage
  • Conceptual proximity

12 of 84

A different perspective

“Solving MechInterp” involves filling out this ternary plot

13 of 84

A different perspective

“Solving MechInterp” involves filling out this ternary plot

[Ternary plot with the same example works as before]

14 of 84

A different perspective

“Solving MechInterp” involves filling out this ternary plot

“Grokking literature”

15 of 84

A different perspective

“Solving MechInterp” involves filling out this ternary plot

E.g. influence functions

16 of 84

The main schools of thought

  • ELK: “Zero-shot transferability of probes between positions”
  • Circuit analysis
  • Grokking: hybrid between dev-interp and comprehensive understanding
  • Use bigger and bigger models
  • Inductive biases

Circuit analysis: head specialization, suppression, world modeling, reasoning, behavioral, safety-relevant, attention superposition, circuits across tokens, factual recall, in-context learning, macro structure, modular circuits, dev-interp, trojan discovery

17 of 84

Things we want to understand

  • Circuits ✓✓✓
  • Grokking ✓✓✓
  • Backup behavior __
  • SAEs & Dictionary Learning ✓✓
  • Feature Composition __
  • Modularity __
  • The training process __
  • Other things: dev-interp,
  • Attention Superposition ✓✓✓

✓✓✓ extremely done. ✓✓ mostly done. ✓ needs polishing. __ untouched


18 of 84

The big areas in mech interp

For each thing we want to understand, I’ll give:

    • Brief background
    • What’s the main idea, what’s the point
    • Best existing work in this area
    • Point score:
      • How many low hanging fruit remain? (/10)
      • How excited am I about work in this area? (/10)
      • How hard would it be to pull it off successfully? (/10)
    • Advice: How to choose a good research direction in this area
    • Bonus: who’s working in this now


19 of 84

Table of contents

Circuit analysis: head specialization, suppression, world modeling, reasoning, behavioral, safety-relevant, attention superposition, FSA, factual recall, in-context learning, backup behavior, macro structure, modular circuits, dev-interp, trojan discovery, feature circuits, computation in superposition, ELK

Circuit studies done: ORION (question answering, type hint understanding, factual recall, variable binding, translation, induction), “greater than”, IOI.
Approaches: patching (activation, attribution, path, attention pattern, request, cross-prompt, patchscopes), direct effect, logit/prob diff, … (a minimal activation-patching sketch follows below)
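Since patching is the workhorse approach here, a minimal activation-patching sketch may help. It is written against the TransformerLens API (HookedTransformer, run_with_cache, run_with_hooks); the GPT-2 model, the IOI-style prompts, and the choice of resid_pre as the hook point are illustrative assumptions rather than a fixed recipe, and the same loop works with raw PyTorch hooks.

```python
# Hedged sketch: activation patching on an IOI-style prompt pair, assuming TransformerLens.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "When John and Mary went to the store, John gave a drink to"
corrupt_prompt = "When John and Mary went to the store, Mary gave a drink to"
answer, distractor = " Mary", " John"

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)
_, clean_cache = model.run_with_cache(clean_tokens)

def logit_diff(logits):
    # Higher = model prefers the correct indirect object over the distractor.
    last = logits[0, -1]
    return (last[model.to_single_token(answer)] - last[model.to_single_token(distractor)]).item()

def patch_resid(resid, hook, pos):
    # Overwrite one residual-stream position with its value from the clean run.
    resid[:, pos] = clean_cache[hook.name][:, pos]
    return resid

n_pos = clean_tokens.shape[1]
results = torch.zeros(model.cfg.n_layers, n_pos)
for layer in range(model.cfg.n_layers):
    for pos in range(n_pos):
        hook_name = utils.get_act_name("resid_pre", layer)
        patched_logits = model.run_with_hooks(
            corrupt_tokens,
            fwd_hooks=[(hook_name, lambda resid, hook, p=pos: patch_resid(resid, hook, p))],
        )
        results[layer, pos] = logit_diff(patched_logits)
# Large entries mark (layer, position) sites whose clean activation restores the clean behaviour.
```

Roughly speaking, the other variants in the list change what gets patched (attention patterns, specific paths/edges, or gradient-based approximations in the case of attribution patching) rather than this basic loop.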

20 of 84

Circuit analysis


21 of 84

Circuit analysis

b) Arguably, the point of doing a bunch of circuit analyses on a bunch of models is to gather empirical data on the typical phenomena you’ll encounter in an LLM

22 of 84

Circuit analysis

b) Arguably, the point of doing a bunch of circuit analyses on a bunch of models is to gather empirical data on the typical phenomena you’ll encounter in an LLM

c-kinda) Circuit analyses done on “big models”: pronoun selection, “greater than”, indirect object identification, ORION (question answering, type hint understanding, factual recall, variable binding, translation, induction), docstring, time, arithmetic, multiple choice QA

23 of 84

Circuit analysis

b) Arguably, the point of doing a bunch of circuit analyses on a bunch of models is to gather data on the typical phenomena you’ll encounter

c) Circuit analyses done on “big models”: pronoun selection, “greater than”, indirect object identification, ORION (question answering, type hint understanding, factual recall, variable binding, translation, induction), docstring, time, arithmetic, multiple choice QA

Themes studied: Memorization, iterative refinement, factual recall, etc

24 of 84

Circuit analysis

b) Arguably, the point of doing a bunch of circuit analyses on a bunch of models is to gather data on the typical phenomena you’ll encounter

c) Circuit analyses done on “big models”: pronoun selection, “greater than”, indirect object identification, ORION (question answering, type hint understanding, factual recall, variable binding, translation, induction), docstring, time, arithmetic, multiple choice QA

Themes studied: Memorization, iterative refinement, etc

Key discoveries: Head specialization, backup behavior, “hydra effect”, etc

25 of 84

Circuit analysis

b) Arguably, the point of doing a bunch of circuit analyses on a bunch of models is to gather data on the typical phenomena you’ll encounter

c) Circuit analyses done on “big models”: pronoun selection, “greater than”, indirect object identification, ORION (question answering, type hint understanding, factual recall, variable binding, translation, induction), docstring, time, arithmetic, multiple choice QA
Themes studied: Memorization, iterative refinement, etc
Key discoveries: Head specialization, backup behavior, “hydra effect”, etc

d) LHF 2/10, Excitement 2/10, Difficulty 4/10

26 of 84

Circuit analysis

b) Arguably, the point of doing a bunch of circuit analyses on a bunch of models is to gather data on the typical phenomena you’ll encounter

c) Circuit analyses done on “big models”: pronoun selection, “greater than”, indirect object identification, ORION (question answering, type hint understanding, factual recall, variable binding, translation, induction), docstring, time, arithmetic, multiple choice QA
Themes studied: Memorization, iterative refinement, etc
Key discoveries: Head specialization, backup behavior, “hydra effect”, etc

d) LHF 2/10, Excitement 2/10, Difficulty 4/10

e) We’re done doing random circuit analyses IMO. If you still want to do circuit analysis, you should explicitly be looking for counterexamples, which takes a more conceptually motivated approach (e.g. inspired by specifics of the architecture)

27 of 84

Circuit analysis

b) Arguably, the point of doing a bunch of circuit analyses on a bunch of models is to gather data on the typical phenomena you’ll encounter

c) Circuit analyses done on “big models”: pronoun selection, “greater than”, indirect object identification, ORION (question answering, type hint understanding, factual recall, variable binding, translation, induction), docstring, time, arithmetic, multiple choice QA
Themes studied: Memorization, iterative refinement, etc
Key discoveries: Head specialization, backup behavior, “hydra effect”, etc

d) LHF 2/10, Excitement 2/10, Difficulty 4/10

e) We’re done doing random circuit analyses IMO. If you still want to do circuit analysis, you should explicitly be looking for counterexamples, which takes a more conceptually motivated approach (e.g. inspired by specifics of the architecture)

f) Far too many people

28 of 84

Grokking

a) Neutral definition of grokking: “eventual recovery from overfitting”

29 of 84

Grokking

a) Neutral definition of grokking: “eventual recovery from overfitting”

b) Main idea: 100% test accuracy means a plausibly clean ground truth exists, and it’s tractable to fully reverse engineer it (still extremely hard).

Secondary idea: studying the nature of phase transitions.
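To make the setting concrete, here is a hedged sketch of the standard grokking recipe: a small network on modular addition, a small train fraction, and heavy weight decay. Everything here (the MLP instead of a transformer, the 30% split, the weight-decay value) is an illustrative assumption; whether and when the delayed jump in test accuracy appears is quite sensitive to these choices.

```python
# Hedged sketch of the standard grokking setup: modular addition, a small network,
# a small train split, and heavy weight decay. Hyperparameters are illustrative only.
import torch
import torch.nn as nn

P = 97  # modulus for (a + b) mod P
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
x = torch.cat([nn.functional.one_hot(pairs[:, 0], P),
               nn.functional.one_hot(pairs[:, 1], P)], dim=-1).float()

perm = torch.randperm(len(x))
n_train = int(0.3 * len(x))  # small train fraction encourages memorization first
train_idx, test_idx = perm[:n_train], perm[n_train:]

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50_000):
    opt.zero_grad()
    loss = loss_fn(model(x[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        with torch.no_grad():
            test_acc = (model(x[test_idx]).argmax(-1) == labels[test_idx]).float().mean().item()
        # Typical trace: train accuracy saturates early, test accuracy sits near chance
        # for a long plateau, then jumps -- the "eventual recovery from overfitting".
        print(step, round(loss.item(), 4), round(test_acc, 3))
```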

30 of 84

Grokking

a) Neutral definition of grokking: “eventual recovery from overfitting”

b) Main idea: 100% test accuracy means a plausibly clean ground truth exists, and it’s tractable to fully reverse engineer it (still extremely hard). Secondary idea: studying the nature of phase transitions.

c) Main work in this area:
[May 2022] Towards Understanding Grokking
[Aug 2022] Progress Measures for Grokking via Mechanistic Interpretability
[May 2023] Grokking of Hierarchical Structure in Vanilla Transformers
[Nov 2023] Feature emergence via margin maximization (insane)

31 of 84

Grokking

a) Neutral definition of grokking: “eventual recovery from overfitting”

b) Main idea: 100% test accuracy means a plausibly clean ground truth exists, and it’s tractable to fully reverse engineer it (still extremely hard). Secondary idea: studying the nature of phase transitions.

c) Main work in this area:
[May 2022] Towards Understanding Grokking
[Aug 2022] Progress Measures for Grokking via Mechanistic Interpretability
[May 2023] Grokking of Hierarchical Structure in Vanilla Transformers
[Nov 2023] Feature emergence via margin maximization
d) LHF 6/10, excitement 4/10, difficulty 5/10

32 of 84

Grokking

a) Neutral definition of grokking: “eventual recovery from overfitting”

b) Main idea: 100% test accuracy means a plausibly clean ground truth exists, and it’s tractable to fully reverse engineer it (still extremely hard). Secondary idea: studying the nature of phase transitions.

c) Main work in this area:
[May 2022] Towards Understanding Grokking
[Aug 2022] Progress Measures for Grokking via Mechanistic Interpretability
[May 2023] Grokking of Hierarchical Structure in Vanilla Transformers
[Nov 2023] Feature emergence via margin maximization

d) LHF 6/10, excitement 4/10, difficulty 5/10

e) How to choose a good research direction? Think “counterexamples in topology”: constructing counterintuitive toy models is the most useful work in this area, TBH.

33 of 84

Grokking

e) Open questions (off the top of my head): TBD

34 of 84

Backup behavior

a) Discovered in Interpretability in the Wild: Indirect Object Identification

35 of 84

Backup behavior

a) Discovered in Interpretability in the Wild: Indirect Object Identification

b) Pretty important

36 of 84

Backup behavior

a) Discovered in Interpretability in the Wild: Indirect Object Identification

b) Pretty important

c) Main work in this area:
[] Hydra Effect
[] Comprehensively understanding an attention head

37 of 84

SAEs & Dictionary learning

a) Not a new idea: it’s existed for over 20 years. Also, Anthropic wasn’t first; HAL did it before them.

38 of 84

SAEs & Dictionary learning

a) Not a new idea: it’s existed for over 20 years. Also, Anthropic wasn’t first; HAL did it before them.

b) Main idea: lets us decompose arbitrary activation spaces into a combination of a few feature directions. Sparsity makes doing interp with this realistic, and good sparsity (kind of) implies interpretability.
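For concreteness, here is a minimal sparse-autoencoder sketch, loosely in the style of the Towards Monosemanticity setup: encode activations into a wide, non-negative feature basis, reconstruct, and trade reconstruction error against an L1 sparsity penalty. The dimensions, expansion factor, penalty coefficient, and the random stand-in activations are all illustrative assumptions.

```python
# Hedged sketch of a sparse autoencoder (SAE) for decomposing model activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_dict) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_dict, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_dict))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, acts):
        # Encode to sparse, non-negative feature activations, then reconstruct.
        feats = torch.relu((acts - self.b_dec) @ self.W_enc + self.b_enc)
        recon = feats @ self.W_dec + self.b_dec
        return recon, feats

def sae_loss(acts, recon, feats, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty: sparsity is bought at the price of
    # some reconstruction fidelity, which is the whole interpretability bet.
    return ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().sum(dim=-1).mean()

# Usage: in practice `acts` would be cached MLP or residual-stream activations.
sae = SparseAutoencoder(d_model=512, d_dict=8 * 512)  # overcomplete dictionary
acts = torch.randn(1024, 512)                         # stand-in batch of activations
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()
```

The published recipes add further details (decoder-norm constraints, resampling dead features, sweeps of the L1 coefficient), but this objective is the core of it.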

39 of 84

SAEs & Dictionary learning

a) Not a new idea: it’s existed for over 20 years. Also, Anthropic wasn’t first; HAL did it before them.

b) Main idea: lets us decompose arbitrary activation spaces into a combination of a few feature directions. Sparsity makes doing interp with this realistic, and good sparsity (kind of) implies interpretability.

c) Main work in this area:
[April 2023] Emergence of Sparse Representations from Noise
[July 2023] (tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders
[Oct 2023] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

40 of 84

SAEs & Dictionary learning

a) Not a new idea: it’s existed for over 20 years. Also, Anthropic wasn’t first; HAL did it before them.

b) Main idea: lets us decompose arbitrary activation spaces into a combination of a few feature directions. Sparsity makes doing interp with this realistic, and good sparsity (kind of) implies interpretability.

c) Main work in this area:
[April 2023] Emergence of Sparse Representations from Noise
[July 2023] (tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders
[Sept 2023] Interpreting OpenAI's Whisper
[Oct 2023] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

d) LHF 10/10, excitement 6/10, difficulty 9/10

41 of 84

SAEs & Dictionary learning

a) Not a new idea: it’s existed for over 20 years. Also, Anthropic wasn’t first; HAL did it before them.
b) Main idea: lets us decompose arbitrary activation spaces into a combination of a few feature directions. Sparsity makes doing interp with this realistic, and good sparsity (kind of) implies interpretability.
c) Main work in this area:
[April 2023] Emergence of Sparse Representations from Noise
[July 2023] (tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders
[Sept 2023] Interpreting OpenAI's Whisper
[Oct 2023] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

d) LHF 10/10, excitement 6/10, difficulty 9/10

e) How to choose a good research direction?
  • Have a plan beyond “train SAEs” and “improve automated circuit discovery”
  • DON’T rely on existing mech interp literature: it was built for a world without SAEs
  • Take inspiration from outside domains to maximize the variance of ideas tried

42 of 84

SAEs & Dictionary learning

a) Not a new idea: it’s existed for over 20 years. Also, Anthropic wasn’t first; HAL did it before them.
b) Main idea: lets us decompose arbitrary activation spaces into a combination of a few feature directions. Sparsity makes doing interp with this realistic, and good sparsity (kind of) implies interpretability.
c) Main work in this area:
[April 2023] Emergence of Sparse Representations from Noise
[July 2023] (tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders
[Sept 2023] Interpreting OpenAI's Whisper
[Oct 2023] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

d) How much richness is left? 10/10. How excited am I? 6/10
e) How to choose a good research direction?
  • Have a plan beyond “train SAEs” and “improve automated circuit discovery”
  • DON’T rely on existing mech interp literature: it was built for a world without SAEs
  • Take inspiration from outside domains to maximize the variance of ideas tried

  • Ask an expert in sparse coding (from the old literature): “what do we do next?”

43 of 84

SAEs & Dictionary learning

a) Not a new idea: it’s existed for over 20 years. Also, Anthropic wasn’t first; HAL did it before them.
b) Main idea: lets us decompose arbitrary activation spaces into a combination of a few feature directions. Sparsity makes doing interp with this realistic, and good sparsity (kind of) implies interpretability.
c) Main work in this area:
[April 2023] Emergence of Sparse Representations from Noise
[July 2023] (tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders
[Sept 2023] Interpreting OpenAI's Whisper
[Oct 2023] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

d) How much richness is left? 10/10. How excited am I? 6/10
e) How to choose a good research direction?
  • Have a plan beyond “train SAEs” and “improve automated circuit discovery”
  • DON’T rely on existing mech interp literature: it was built for a world without SAEs
  • Take inspiration from outside domains to maximize the variance of ideas tried

Note: SAEs don’t solve mech interp

44 of 84

Dictionary learning open problems

  • Dictionary learning scaling: little progress outside Anthropic because no one wants to invest heavily in this. Low hanging fruit exists but it takes creativity
  • Dictionary learning science / motifs
    • What even is a feature? What assumptions do these concepts make and where does it lead when we take these assumptions to their conclusions?
    • How can we identify whether dictionary features are functionally relevant?
    • How do architectural decisions impact the nature of the feature space?
    • How do properties of the data distribution impact the feature space?
      • Low hanging fruit: What gets recovered in other modalities?
  • How do we effectively search over this space? Intuitively describing activations is bad lol
  • Alternate uses: sparse MLP to replace/skip modules entirely?

45 of 84

Feature composition

a) AKA computation in superposition, feature circuits


46 of 84

Modularity

a) AKA circuit composition: how do we combine the information from all these existing circuit studies?


47 of 84

The training process

a)


48 of 84

WTF is Attention superposition?

49 of 84

Attention superposition

I happen to be an expert on this topic (doesn’t mean much)

50 of 84

Attention superposition

I happen to be an expert on this topic (doesn’t mean much)

(1) how complicated a circuit is should scale with how "difficult" the task in question is

51 of 84

Attention superposition

I happen to be an expert on this topic (doesn’t mean much)

(1) how complicated a circuit is should scale with how "difficult" the task in question is

(2) a model will learn less local, more distributed representations when there's pressure to do so


52 of 84

Attention superposition

I happen to be an expert on this topic (doesn’t mean much)

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so

(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal


53 of 84

Attention superposition

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

54 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

55 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition.

56 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition.

Notably he seems pretty concerned about it, and his galaxy brain intuition tells him this is an important future bottleneck for the mech interp agenda that remains even if SAEs address normal superposition

57 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition.
Notably he seems pretty concerned about it, and his galaxy brain intuition tells him this is an important future bottleneck for the mech interp agenda that remains even if SAEs address normal superposition.

I lowkey disagree but what do I know

58 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition.
Notably he seems pretty concerned about it, and his galaxy brain intuition tells him this is an important future bottleneck for the mech interp agenda that remains even if SAEs address normal superposition.

(I lowkey disagree but what do I know)

He spends a bunch of time on this with Adam Jermyn & Tom Conerly

59 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition

60 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition

[May 2023] Anthropic “constructs” toy model. OV-incoherence is bad terminology

61 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition
[May 2023] May update: Anthropic “constructs” toy model. OV-incoherence is bad terminology

[July 2023] July update: previous example is not a real toy model of attn superposition

62 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition
[May 2023] May update: Anthropic “constructs” toy model. OV-incoherence is bad terminology
[July 2023] July update: previous example is not a real toy model of attn superposition

[July 2023] Me, Narmeen, and faul_sname successfully construct the first actual (OV-incoherent) toy model of attention superposition for an alignment jam

63 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition
[May 2023] May update: Anthropic “constructs” toy model. OV-incoherence is bad terminology
[July 2023] July update: previous example is not a real toy model of attn superposition
[July 2023] Me, Narmeen, and faul_sname construct an “OV-incoherent” toy model for an alignment jam

[Aug 2023] Lauren Greenspan & Keith Wynroe construct an “OV-coherent” toy model

64 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition
[May 2023] May update: Anthropic “constructs” toy model. OV-incoherence is bad terminology
[July 2023] July update: previous example is not a real toy model of attn superposition
[July 2023] Me, Narmeen, and faul_sname construct an “OV-incoherent” toy model for an alignment jam
[Aug 2023] Lauren Greenspan & Keith Wynroe construct an “OV-coherent” toy model

65 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition
[May 2023] May update: Anthropic “constructs” toy model. OV-incoherence is bad terminology
[July 2023] July update: previous example is not a real toy model of attn superposition
[July 2023] Me, Narmeen, and faul_sname construct an “OV-incoherent” toy model for an alignment jam
[Aug 2023] Lauren Greenspan & Keith Wynroe construct an “OV-coherent” toy model

WTF is OV-coherence or incoherence?

66 of 84

OV-(in)coherence

OV-(in)coherence points to two distinct forms of information bottlenecks, and thus two distinct forms of laterally distributed representations (keys vs queries)

67 of 84

OV-(in)coherence

OV-(in)coherence points to two distinct forms of information bottlenecks, and thus two distinct forms of laterally distributed representations (keys vs queries)

Recall the mathematical framework paper:

source::key
destination::query

68 of 84

OV-(in)coherence

OV-(in)coherence points to two distinct forms of information bottlenecks, and thus two distinct forms of laterally distributed representations (keys vs queries)

The softmax in attention makes it really hard to split attention evenly between positions.

Attention heads compute in parallel, making dependencies “impossible”, but when there’s a will there’s a way.

It would take a lot of pressure though
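A tiny numeric illustration of the first point (values are made up): an even split needs the pre-softmax scores to be nearly identical, and even modest gaps get exponentiated into lopsided attention.

```python
# How quickly softmax concentrates attention as the score gap grows.
import torch

scores = torch.tensor([[2.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
print(torch.softmax(scores, dim=-1))
# ~[[0.50, 0.50], [0.73, 0.27], [0.95, 0.05]]: a 1-3 logit gap already pushes
# most of the attention onto a single position.
```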

69 of 84

OV-(in)coherence

Basically you either have d_head too small, forcing information to be scattered over many heads, or n_heads too small, forcing information to be scattered over many positions. (In practice, on SOTA models the worry is that you split both ways; see the dimension-counting sketch below.)

source::key
destination::query

OV-Coherent superposition: multiple sources, fixed destination; happens when n_feats >> n_heads

OV-incoherent superposition: fixed source, multiple destinations; happens when n_feats >> d_head
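A dimension-counting sketch of these two regimes. All numbers are invented for illustration, and n_feats is just a stand-in for however many task-relevant features attention needs to move.

```python
# Illustrative dimension counting for the two attention-superposition bottlenecks.
import torch

n_heads, d_model, d_head = 4, 64, 16   # toy sizes; d_model = n_heads * d_head
n_feats = 100                           # features the task "wants" to move around

# OV-incoherent regime (n_feats >> d_head): a single head's OV channel (rank <= d_head)
# cannot carry everything a fixed source needs to send to its many destinations,
# so the load spreads across heads.
print("features per OV channel:", n_feats / d_head)   # >> 1

# OV-coherent regime (n_feats >> n_heads): more distinct sources need attending to
# than there are heads, so single heads end up mixing several sources for one query.
print("features per head:", n_feats / n_heads)         # >> 1

# For reference, each head's OV map is a linear map on the residual stream with rank <= d_head:
W_V = torch.randn(n_heads, d_model, d_head)
W_O = torch.randn(n_heads, d_head, d_model)
W_OV = torch.einsum("hmd,hdn->hmn", W_V, W_O)
print(W_OV.shape)   # [n_heads, d_model, d_model], each slice has rank at most d_head
```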

70 of 84

Attention superposition occurs when several heads’ QK circuits learn to form destructive interference patterns with each other to encode hierarchical dependencies, so that the OV maps aren’t actually effectively linear

OV-Coherent superposition: multiple sources/keys, fixed destination/query
OV-incoherent superposition: fixed source/key, multiple destinations/queries

71 of 84

Chris says this isn’t really what he was looking for though. He wanted more natural geometry to form, reminiscent of Toy Models of Superposition.

That would take even more pressure though

Attention superposition occurs when several heads’ QK circuits learn to form destructive interference patterns with each other to encode hierarchical dependencies, so that the OV maps aren’t actually effectively linear

OV-Coherent superposition: multiple sources/keys, fixed destination/query
OV-incoherent superposition: fixed source/key, multiple destinations/queries

72 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition
[May 2023] May update: Anthropic “constructs” toy model. OV-incoherence is bad terminology
[July 2023] July update: previous example is not a real toy model of attn superposition
[July 2023] Me, Narmeen, and faul_sname construct an “OV-incoherent” toy model for an alignment jam
[Aug 2023] Lauren Greenspan & Keith Wynroe construct an “OV-coherent” toy model

[Sept 2023] We (+Kunvar & Adam Jermyn) all kinda work on it together with limited success. I quit to do something else

73 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition
[May 2023] May update: Anthropic “constructs” toy model. OV-incoherence is bad terminology
[July 2023] July update: previous example is not a real toy model of attn superposition
[July 2023] Me, Narmeen, and faul_sname construct a toy model for an alignment jam
[Aug 2023] Lauren Greenspan & Keith Wynroe construct an “OV-coherent” toy model
[Sept 2023] We all kinda work on it together with limited success

[Oct 2023] Towards monosemanticity: future work comments at the bottom

74 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition
[May 2023] May update: Anthropic “constructs” toy model. OV-incoherence is bad terminology
[July 2023] July update: previous example is not a real toy model of attn superposition
[July 2023] Me, Narmeen, and faul_sname construct a toy model for an alignment jam
[Aug 2023] Lauren Greenspan & Keith Wynroe construct an “OV-coherent” toy model
[Sept 2023] We all kinda work on it together with limited success
[Oct 2023] Towards monosemanticity: future work comments at the bottom

[Jan 2024] Anthropic update: they made real progress. Extremely preliminary results.
They successfully “forced” TMS-style geometry formation, shady procedure though

75 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition
[May 2023] May update: Anthropic “constructs” toy model. OV-incoherence is bad terminology
[July 2023] July update: previous example is not a real toy model of attn superposition
[July 2023] Me, Narmeen, and faul_sname construct a toy model for an alignment jam
[Aug 2023] Lauren Greenspan & Keith Wynroe construct an “OV-coherent” toy model
[Sept 2023] We all kinda work on it together with limited success
[Oct 2023] Towards monosemanticity: future work comments at the bottom
[Jan 2024] Anthropic update: they made real progress. Extremely preliminary results

76 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition
[May 2023] May update: Anthropic “constructs” toy model. OV-incoherence is bad terminology
[July 2023] July update: previous example is not a real toy model of attn superposition
[July 2023] Me, Narmeen, and faul_sname construct a toy model for an alignment jam
[Aug 2023] Lauren Greenspan & Keith Wynroe construct an “OV-coherent” toy model.
[Sept 2023] We all kinda work on it together with limited success.
[Oct 2023] Towards monosemanticity: future work comments at the bottom
[Jan 2024] Anthropic update: they made real progress. Extremely preliminary results

Why is it so hard? Because we don’t actually know what an “attention feature” is

77 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition
[May 2023] May update: Anthropic “constructs” toy model. OV-incoherence is bad terminology
[July 2023] July update: previous example is not a real toy model of attn superposition
[July 2023] Me, Narmeen, and faul_sname construct a toy model for an alignment jam
[Aug 2023] Lauren Greenspan & Keith Wynroe construct an “OV-coherent” toy model.
[Sept 2023] We all kinda work on it together with limited success.
[Oct 2023] Towards monosemanticity: future work comments at the bottom
[Jan 2024] Anthropic update: they made progress. Extremely preliminary results

[Jan 2024] Toward A Mathematical Framework for Computation in Superposition

78 of 84

Laterally distributed attention features

(1) how complicated a circuit is should scale with how "difficult" the task in question is
(2) a model will learn less localized, more distributed representations when there's pressure to do so
(3) it takes a lot more pressure to learn laterally distributed representations than longitudinal

[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition
[May 2023] May update: Anthropic “constructs” toy model. OV-incoherence is bad terminology
[July 2023] July update: previous example is not a real toy model of attn superposition
[July 2023] Me, Narmeen, and faul_sname construct a toy model for an alignment jam
[Aug 2023] Lauren Greenspan & Keith Wynroe construct an “OV-coherent” toy model.
[Sept 2023] We all kinda work on it together with limited success.
[Oct 2023] Towards monosemanticity: future work comments at the bottom
[Jan 2024] Anthropic update: they made progress. Extremely preliminary results
[Jan 2024] Toward A Mathematical Framework for Computation in Superposition

I’m not actually sure if they’re talking about the same kind of attention superposition?

79 of 84

Big mech interp open problems (opinion)

  • Dictionary learning science
  • “Zero-shot transferability of probes between positions”

80 of 84

What constitutes “mechanistic” understanding?

Examples of the strongest successes:

  • [Feb 2024, Tegmark] Synthesizing the program of the learned model
  • [Jan 2021, Olah] Ability to (blindly) hand write the model weights
  • [Dec 2023, Reddy] Provide an analytic form
  • [Nov 2023, Morwani] Proof of a learned circuit being optimal (learning-wise)

81 of 84

Best mech interp papers spotlight

Examples of the strongest successes:

  • [Jan 2021, Cammarata] Curve Circuits
  • [Dec 2023, Reddy] Mechanistic Basis
  • [Nov 2022, Wang] Interpretability in the Wild: IOI
  • [Dec 2023, Variengien] ORION task constellations

82 of 84

What does “mechanistic” mean?

  • Find correlations
  • Find causations and dependencies
  • Causal understanding
  • Sufficient conditions
    • Finding a circuit
  • Necessary conditions
    • Could it have been anything else?

83 of 84

What constitutes “ambitious” MI understanding?

Let’s find a north star, something far too ambitious, and land somewhere close.

  • Synthesize the program of a deep neural network
  • Write (by hand) the model weights for a given task, from a MI explanation
  • Trace the learning dynamics of capability emergence down to the optimizer
  • Trace model outputs to the specific training data examples
  • Provide a mathematical proof of a learned circuit/subnetwork being optimal

84 of 84

What constitutes “mechanistic” understanding?

Examples of the strongest successes:

  • [Feb 2024, Tegmark] Synthesizing the program of the learned model
  • [Jan 2021, Olah] Ability to (blindly) hand write the model weights
  • [Dec 2023, Reddy] Provide an analytic form
  • [Nov 2023, Morwani] Proof of a learned circuit being optimal (learning-wise)