Mechanistic Interpretability
The big ideas, schools of thought, and areas people care about and why
Table of contents
History
Chris Olah
Other OGs
Distill
Alexander Mordvintsev - DeepDream, precursor to the Distill work
Nick Cammarata - Cofounder of mech interp: led most of the Distill circuits work
Ludwig Schubert - Core infrastructure
Shan Carter…
Gabriel Goh…
Chelsea Voss…

transformer-circuits
Catherine Olsson - Experimental work to produce the insights the framework was built from
Neel Nanda - Conceptual work to form the framework w/ Chris
Nelson Elhage - Core infrastructure
The prehistory (biography of the OG)
2010 & 2011 - Chris Olah does 3D printing in high school
2012 - Chris gets a Thiel Fellowship for 3D printing. AI is a side hobby at this point
2013 - Cold emails Yoshua Bengio @ MILA. Gets into deep learning
2014 - Google Brain internship, explores visualizing neural networks
2015 - Hops on Alexander Mordvintsev (AM)’s DeepDream, generalizes it
2016 - Co-founds Distill. Writes Concrete Problems in AI Safety with Dario Amodei
2017 - Separately, Transformers are invented
2018 - Creates Lucid w/ AM & Ludwig. Joins OpenAI. Mech interp work starts here
2019 - Chris moves to a mentorship role. Nick Cammarata drives the Vision Circuits work
2020 - Circuits thread on Distill
2021 - Distill goes on indefinite hiatus. Chris joins Anthropic. Starts transformer-circuits
transformer-circuits
LW posts commenting on mech interp date back to 2018, and other AI safety people were aware of this work, but as far as I can tell no one else actually did mech interp, so I won’t cover what other people did before 2022
March 2020 - April 2021 — Original Distill circuits thread
Dec 2021 - First transformer-circuits post. Discovery of induction heads
March 2022 - Induction heads, developmental study of ICL
June 2022 - Softmax Linear Units
September 2022 - Toy Models of Superposition
(Small posts fill a ~1 year gap: Double Descent, May circuits updates, July updates)
October 2023 - Towards Monosemanticity
January 2024 - January updates
Leonard’s Taxonomy of MechInt
A different perspective
“Solving MechInterp” involves filling out this ternary plot
Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs
Distill Circuits Thread
In-Context Learning and Induction Heads
Studying Large Language Model Generalization with Influence Functions
Toy Models of Superposition
Mathematical Framework for Transformer Circuits
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Towards Monosemanticity: Decomposing Language Models with Sparse Dictionary Learning
Representation Engineering
Singular Learning Theory
Delays, Detours, and Forks in the Road: Latent State Models of Training Dynamics
Wherever there’s a gap, there’s an opportunity. We can talk about what filling these gaps might look like.
Micro~Macro: A conflation of:
E.g. influence functions
The main schools of thought
ELK: “zero-shot transferability of probes between positions”
Circuit analysis
Grokking: a hybrid between dev-interp and comprehensive understanding
Use bigger and bigger models
Inductive biases
Circuit analysis: head specialization, suppression, world modeling, reasoning, behavioral, safety-relevant, attention superposition, circuits across tokens, factual recall, in-context learning, macro structure, modular circuits, dev-interp, trojan discovery
Things we want to understand
✓✓✓ extremely done. ✓✓ mostly done. ✓ needs polishing. __ untouched
The big areas in mech interp
For each thing we want to understand, I’ll give: a) a neutral definition, b) the main idea, c) the main work in the area, d) ratings for low-hanging fruit (LHF), excitement, and difficulty, e) how to choose a good research direction, and f) who’s working on it.
Table of contents
Circuit analysis: head specialization, suppression, world modeling, reasoning, behavioral, safety-relevant, attention superposition, FSA, factual recall, in-context learning, backup behavior, macro structure, modular circuits, dev-interp, trojan discovery, feature circuits, computation in superposition, ELK
Circuit studies done: ORION (question answering, type hint understanding, factual recall, variable binding, translation, induction), “greater than”, IOI.
Approaches: patching (activation, attribution, path, attention pattern, request, cross-prompt, patchscopes), direct effect, logit/prob diff
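As a concrete illustration of the first approach on that list, here is a minimal activation-patching sketch in the TransformerLens style. The model choice, prompts, layer, and position are illustrative assumptions, and exact hook names and signatures should be checked against the library docs.

from transformer_lens import HookedTransformer, utils

# Illustrative IOI-style prompts; " Mary" is the correct completion for the clean prompt.
model = HookedTransformer.from_pretrained("gpt2")
clean = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")

# Cache every activation from the clean run.
_, clean_cache = model.run_with_cache(clean)

def patch_resid(resid, hook, pos=-1):
    # Overwrite the corrupted run's residual stream at one position
    # with the clean run's activation at the same hook point.
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

layer = 6  # arbitrary layer, purely for illustration
patched_logits = model.run_with_hooks(
    corrupt, fwd_hooks=[(utils.get_act_name("resid_pre", layer), patch_resid)]
)

# Compare the " Mary" vs " John" logit difference at the final position against the
# clean and corrupted baselines to estimate how much this site matters for the task.
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")
print((patched_logits[0, -1, mary] - patched_logits[0, -1, john]).item())

The other patching flavors (attribution, path, attention-pattern, etc.) vary what gets overwritten and how effects are attributed, but follow the same clean-run/corrupted-run template.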
Circuit analysis
b) Arguably, the point of doing a bunch of circuit analyses on a bunch of models is to gather empirical data on the typical phenomena you’ll encounter in an LLM
c) Circuit analyses done on “big models”: pronoun selection, “greater than”, indirect object identification, ORION (question answering, type hint understanding, factual recall, variable binding, translation, induction), docstring, time, arithmetic, multiple choice QA
Themes studied: memorization, iterative refinement, factual recall, etc.
Key discoveries: head specialization, backup behavior, the “hydra effect”, etc.
d) LHF 2/10, excitement 2/10, difficulty 4/10
e) We’re done doing random circuit analyses IMO. If you still want to do circuit analysis, you should explicitly be looking for counterexamples, which takes a more conceptually motivated approach (e.g. inspired by specifics of the architecture)
f) Far too many people
Grokking
a) Neutral definition of grokking: “eventual recovery from overfitting” (see the sketch after this section)
b) Main idea: a 100% test solution means a plausibly clean ground truth exists, and it’s tractable to fully reverse engineer it (still extremely hard). Secondary idea: studying the nature of phase transitions.
c) Main work in this area:
[May 2022] Towards Understanding Grokking
[Aug 2022] Progress Measures for Grokking via Mechanistic Interpretability
[May 2023] Grokking of Hierarchical Structure in Vanilla Transformers
[Nov 2023] Feature emergence via margin maximization (insane)
d) LHF 6/10, excitement 4/10, difficulty 5/10
e) How to choose a good research direction? Think “Counterexamples in Topology”
Constructing counterintuitive toy models is the most useful work in this area TBH
Grokking
Open questions (off the top of my head): TBD
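To make “eventual recovery from overfitting” concrete, here is a hedged sketch of the standard modular-addition setup from the grokking literature (an assumption borrowed from that literature, not from these slides). The architecture, training fraction, and weight decay are illustrative, and the exact values needed to see the delayed test-accuracy jump vary.

import torch
import torch.nn as nn

P = 97                 # modulus for (a + b) mod P
FRAC_TRAIN = 0.3       # small train split encourages memorization first

pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = int(FRAC_TRAIN * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

def encode(idx):
    # One-hot encode both operands and concatenate them.
    a = nn.functional.one_hot(pairs[idx, 0], P).float()
    b = nn.functional.one_hot(pairs[idx, 1], P).float()
    return torch.cat([a, b], dim=-1)

X_train, y_train = encode(train_idx), labels[train_idx]
X_test, y_test = encode(test_idx), labels[test_idx]

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(50_000):
    logits = model(X_train)
    loss = nn.functional.cross_entropy(logits, y_train)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (logits.argmax(-1) == y_train).float().mean().item()
            test_acc = (model(X_test).argmax(-1) == y_test).float().mean().item()
        # Typical grokking signature: train_acc hits 1.0 early, test_acc jumps much later.
        print(step, round(loss.item(), 4), round(train_acc, 3), round(test_acc, 3))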
Backup behavior
a) Discovered in Interpretability in the Wild: Indirect Object Identification
b) Pretty important
c) Main work in this area:
[] The Hydra Effect
[] Comprehensively understanding an attention head
SAEs & Dictionary learning
a) Not a new idea: it’s existed for over 20 years. Also Anthropic wasn’t first, HAL did it before them
b) Main idea: lets us decompose arbitrary activation spaces into combinations of a few feature directions at a time. Sparsity makes doing interp with this realistic, and good sparsity (kind of) implies interpretability (see the sketch after this section)
c) Main work in this area:
[April 2023] Emergence of Sparse Representations from Noise
[July 2023] (tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders
[Sept 2023] Interpreting OpenAI’s Whisper
[Oct 2023] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
d) LHF 10/10, excitement 6/10, difficulty 9/10
e) How to choose a good research direction?
* Have a plan beyond “train SAEs” and “improve automated circuit discovery”
* DON’T rely on existing mech interp literature: it was built for a world without SAEs
* Take inspiration from outside domains to maximize the variance of ideas tried
* Ask an expert in sparse coding (from the old literature): “what do we do next?”
Note: SAEs don’t solve mech interp
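A minimal sketch of what “decompose activations into sparse combinations of feature directions” cashes out to, assuming the simple ReLU-autoencoder-plus-L1-penalty recipe from the dictionary-learning line of work. All names and hyperparameters (d_model, expansion factor, l1_coeff, the random stand-in “activations”) are illustrative, not taken from any of the papers above.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        # Encode into an overcomplete, non-negative feature basis.
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode back into the original activation space.
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that buys sparsity.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

acts = torch.randn(4096, 512)          # stand-in for cached MLP/residual-stream activations
sae = SparseAutoencoder(d_model=512, d_hidden=8 * 512)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for _ in range(100):
    x_hat, f = sae(acts)
    loss = sae_loss(acts, x_hat, f)
    opt.zero_grad(); loss.backward(); opt.step()

Interpretability then comes from looking at which inputs most strongly activate each learned feature, i.e. each row of W_dec.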
Dictionary learning open problems
Feature composition
a) AKA computation in superposition, feature circuits
Modularity
a) AKA circuit composition: how do we combine the information from all these existing circuit studies?
The training process
a)
WTF is Attention superposition?
Attention superposition
I happen to be an expert on this topic (doesn’t mean much)
(1) How complicated a circuit is should scale with how “difficult” the task in question is
(2) A model will learn less localized, more distributed representations when there’s pressure to do so
(3) It takes a lot more pressure to learn laterally distributed representations than longitudinal ones
Laterally distributed attention features
[Feb 2023] Chris Olah speculates that attention heads can exhibit superposition. Notably he seems pretty concerned about it, and his galaxy brain intuition tells him this is an important future bottleneck for the mech interp agenda that remains even if SAEs address normal superposition. (I lowkey disagree but what do I know.) He spends a bunch of time on this with Adam Jermyn & Tom Conerly
[May 2023] May update: Anthropic “constructs” a toy model. OV-incoherence is bad terminology
[July 2023] July update: the previous example is not a real toy model of attention superposition
[July 2023] Me, Narmeen, and faul_sname successfully construct the first actual “OV-incoherent” toy model of attention superposition for an alignment jam
[Aug 2023] Lauren Greenspan & Keith Wynroe construct an “OV-coherent” toy model
WTF is OV-coherence or incoherence?
OV-(in)coherence
OV-(in)coherence points to two distinct forms of information bottlenecks, and thus two distinct forms of laterally distributed representations (keys vs queries)
Recall the mathematical framework paper:
source::key
destination::query
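For readers who don’t have the framework paper loaded: ignoring layer norm, masking, and scaling, each attention head factors into a QK circuit (where to attend) and an OV circuit (what to move), which is where the source::key / destination::query framing comes from. The notation below follows the paper but is lightly simplified.

\[
h(x) = (A \otimes W_{OV})\,x, \qquad A = \operatorname{softmax}\!\left(x^{\top} W_{QK}\, x\right),
\]
\[
W_{QK} = W_{Q}^{\top} W_{K}, \qquad W_{OV} = W_{O} W_{V}.
\]

The QK circuit scores source (key) positions for each destination (query) position, and the per-head OV map is linear in whatever gets read from the attended-to sources; the question below is what happens when features stop respecting this per-head factorization.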
The softmax in attention makes it really hard to split attention evenly between positions. Attention heads compute in parallel, making dependencies “impossible” – but when there’s a will there’s a way.
It would take a lot of pressure though
Basically you either have d_head too small, to scatter information over many heads, or you have n_heads too small, to scatter information over many positions. (In practice on SOTA models the worry is you split both ways)
OV-coherent superposition: multiple sources/keys, fixed destination/query; happens when n_feats >> n_heads
OV-incoherent superposition: fixed source/key, multiple destinations/queries; happens when n_feats >> d_head
Attention superposition occurs when several heads’ QK circuits learn to form destructive interference patterns with each other to encode hierarchical dependencies, so that the OV maps aren’t actually effectively linear
Chris says this isn’t really what he was looking for though. He wanted more natural geometry to form, reminiscent of Toy Models of Superposition.
That would take even more pressure though
Laterally distributed attention features
[Sept 2023] We (+ Kunvar & Adam Jermyn) all kinda work on it together with limited success. I quit to do something else
[Oct 2023] Towards Monosemanticity: future work comments at the bottom
[Jan 2024] Anthropic update: they made real progress. Extremely preliminary results. They successfully “forced” TMS-style geometry formation, shady procedure though
Why is it so hard? Because we don’t actually know what an “attention feature” is
[Jan 2024] Toward A Mathematical Framework for Computation in Superposition. I’m not actually sure if they’re talking about the same kind of attention superposition?
Big mech interp open problems (opinion)
What constitutes “mechanistic” understanding?
Best mech interp papers spotlight
What does “mechanistic” mean?
What constitutes “ambitious” MI understanding?
Let’s find a north star, something far too ambitious, and land somewhere close.
What constitutes “mechanistic” understanding?