Neural Circuits in Language Models
Neural Mechanics
Week 8 (10th March)
Historical Context
Rumelhart et al. reverse-engineered each connection weight to make the point that backpropagation can learn a genuine algorithm.
Chris Olah et al. asked the same question roughly 30 years later for the Inception model.
Can we reverse engineer neurons and their connections to understand the underlying algorithms?
What is a Circuit?
[Figure: Transformer Model. tokens → embed → Layer 1 → Layer 2 → Layer 3 → unembed → logits; each layer contains heads h0, h1, … and an MLP m, whose outputs are added (+) back into the stream.]
A circuit is a subgraph of the model's computational graph that connects the inputs to the logits.
[Figure: transformer diagram (tokens, embed, Layers 1 to 3 with heads h0, h1, … and MLP m, unembed, logits) with the circuit subgraph highlighted.]
Jasmine, Rice: What counts as a circuit?
How do you find circuits in LMs?
Experiment Setup
John and Mary went to the store. John gave a drink to ____
Path Patching: Circuit Discovery Algorithm (Intuition)
[Figure: computational graph from tokens to logits.]
What are the important edges in a computational graph?
Circuit edges: the edges whose patching caused the greatest degradation in the logit values.
Path Patching Algorithm
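The patching intuition can be sketched on a toy graph. This is a minimal, hedged illustration, not the algorithm as presented on the slides: the graph, its weights, the inputs, and the names `run`, `edge_effects`, and `circuit_edges` are all hypothetical, a scalar output stands in for the logits, and single edges are patched rather than full sender-to-receiver paths. Each edge is fed, in turn, the activation its sender had on a corrupted input, and the resulting drop in the output is recorded.

```python
# Toy computational graph (hypothetical): each node sums its incoming edges,
# and each edge multiplies the sender's activation by a weight.
WEIGHTS = {
    ("in", "a"): 1.0,
    ("in", "b"): 0.5,
    ("a", "out"): 2.0,
    ("b", "out"): 0.25,
}
NODE_ORDER = ["a", "b", "out"]  # topological order after the input node


def run(x, patch_edge=None, corrupt_acts=None):
    """Forward pass; optionally feed one edge the sender's corrupted activation."""
    acts = {"in": x}
    for node in NODE_ORDER:
        total = 0.0
        for (src, dst), weight in WEIGHTS.items():
            if dst != node:
                continue
            patched = (src, dst) == patch_edge
            src_act = corrupt_acts[src] if patched else acts[src]
            total += weight * src_act
        acts[node] = total
    return acts


def edge_effects(clean_x, corrupt_x):
    """Drop in the output when each edge is patched, one at a time."""
    clean_out = run(clean_x)["out"]
    corrupt_acts = run(corrupt_x)
    return {
        edge: clean_out - run(clean_x, edge, corrupt_acts)["out"]
        for edge in WEIGHTS
    }


def circuit_edges(effects, threshold):
    """Keep the edges whose patching degrades the output by at least threshold."""
    return [edge for edge, drop in effects.items() if drop >= threshold]


effects = edge_effects(clean_x=2.0, corrupt_x=0.0)
print(effects)
print(circuit_edges(effects, threshold=1.0))
```

In this toy, patching either edge on the path in → a → out removes most of the output, so those two edges are the ones kept; the threshold of 1.0 is an arbitrary choice for the example.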
Which heads directly affect the output? (Name Mover)
Which heads affect the Name Mover heads?
Iterative path patching yields the circuit
Arya, Ananya, Jesseba: Generalizability (task)?
Haoyu: Retraining impact?
Avery, Christopher, Rice: Generalizability (models)?
Backup Mechanism
Claire: Comment about redundancy
Circuit Validation (Faithfulness)
Faithfulness of circuit C = F(C) / F(M) = 0.87, where F(·) is the mean logit difference.
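The ratio can be computed directly once logits are available for the full model and for the circuit run in isolation. The sketch below assumes the logit difference is taken between the indirect object and the subject name, as in the John and Mary prompt; the helper names and all logit values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean


def logit_diff(logits, correct, distractor):
    """Logit of the correct name minus logit of the distractor name."""
    return logits[correct] - logits[distractor]


def mean_logit_diff(batch):
    """F(): average logit difference over (logits, correct, distractor) triples."""
    return mean(logit_diff(logits, c, d) for logits, c, d in batch)


def faithfulness(circuit_batch, model_batch):
    """F(C) / F(M): share of the model's performance recovered by the circuit."""
    return mean_logit_diff(circuit_batch) / mean_logit_diff(model_batch)


# Made-up logits for two prompts whose correct completion is "Mary".
model_batch = [
    ({"Mary": 4.0, "John": 0.0}, "Mary", "John"),
    ({"Mary": 5.0, "John": 1.0}, "Mary", "John"),
]
circuit_batch = [
    ({"Mary": 3.5, "John": 0.0}, "Mary", "John"),
    ({"Mary": 4.0, "John": 0.5}, "Mary", "John"),
]

print(faithfulness(circuit_batch, model_batch))  # 0.875 for these made-up values
```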
Yuqi: operationalize faithfulness?
Circuit Validation (Completeness)
[Figure: input I feeds two components, C1 and C2, whose outputs are combined (+) into output O.]
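A hedged statement of the criterion, assuming the slide follows the definition used in the circuit analysis that identified the Name Mover heads: a circuit C is complete if it keeps behaving like the full model M after any subset K of its components is knocked out of both. K is notation introduced here; F is the mean logit difference.

```latex
\[
\forall K \subseteq C:\quad
\bigl|\, F(C \setminus K) - F(M \setminus K) \,\bigr| \ \text{is small}
\]
```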
Circuit Validation (Minimality)
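A hedged statement of the criterion, again assuming the definition from the circuit analysis that identified the Name Mover heads: a circuit C is minimal if every component v matters in at least one context, meaning there is some subset K of the other components such that additionally removing v changes F substantially. The symbols v and K are notation introduced here; F is the mean logit difference.

```latex
\[
\forall v \in C,\ \exists K \subseteq C \setminus \{v\}:\quad
\bigl|\, F\bigl(C \setminus (K \cup \{v\})\bigr) - F(C \setminus K) \,\bigr| \ \text{is large}
\]
```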
Can we automate Circuit Discovery?
Yiqian: human-driven or automated interpretability?
Circuit Discovery Overview
ACDC automates step 3.
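The automated step can be sketched as a greedy pruning loop. This is a simplified illustration under stated assumptions, not ACDC itself: ACDC is usually described as comparing output distributions with a KL divergence while sweeping edges from the output backwards, whereas here an absolute difference in a scalar output stands in for that metric and a removed edge simply carries zero in place of a corrupted activation. The toy graph, the threshold `tau`, and all function names are hypothetical.

```python
# Hypothetical toy graph: each node sums weight * sender activation
# over its incoming edges.
WEIGHTS = {
    ("in", "a"): 1.0,
    ("in", "b"): 0.5,
    ("a", "out"): 2.0,
    ("b", "out"): 0.25,
}
NODE_ORDER = ["a", "b", "out"]  # topological order after the input node
# Edges listed from the output backwards, the order in which they are tested.
EDGES_OUTPUT_FIRST = [("a", "out"), ("b", "out"), ("in", "a"), ("in", "b")]


def forward(x, kept):
    """Forward pass in which edges outside `kept` contribute nothing."""
    acts = {"in": x}
    for node in NODE_ORDER:
        acts[node] = sum(
            weight * acts[src]
            for (src, dst), weight in WEIGHTS.items()
            if dst == node and (src, dst) in kept
        )
    return acts["out"]


def divergence(x, kept):
    """Stand-in metric: distance between full-model and subgraph outputs."""
    return abs(forward(x, set(WEIGHTS)) - forward(x, kept))


def acdc_prune(x, edges_output_first, tau):
    """Drop an edge for good if removing it raises the divergence by < tau."""
    kept = set(edges_output_first)
    for edge in edges_output_first:
        candidate = kept - {edge}
        if divergence(x, candidate) - divergence(x, kept) < tau:
            kept = candidate
    return kept


circuit = acdc_prune(2.0, EDGES_OUTPUT_FIRST, tau=0.5)
print(sorted(circuit))
```

The threshold controls the trade-off: a larger `tau` prunes more aggressively and yields a smaller subgraph, at the cost of a larger gap to the full model.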
Evaluating ACDC
Courtney: ground truth circuit?
Problems with ACDC?
Solving ACDC and its evaluation issues
Solving Scalability with EAP
Edge Importance =
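For reference, edge attribution patching (EAP) is commonly written as a first-order approximation of the effect of patching an edge, which needs only one forward pass on each input and one backward pass instead of one forward pass per edge. The form below is an assumption about what the slide shows, and the notation is introduced here: z_u is the output of the upstream node u on the clean input s, z'_u is its output on the corrupted input, and the gradient of the metric L is taken with respect to the input of the downstream node v on the clean run.

```latex
\[
\text{Edge Importance}(u \to v) \;=\; (z'_u - z_u)^{\top}\, \nabla_{v} L(s)
\]
```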
EAP Issue
EAP-IG (Integrated Gradients)
Edge Importance =
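A small numerical sketch shows why averaging gradients along a path can help. It assumes the usual integrated-gradients form, in which the single clean-run gradient of EAP is replaced by the mean gradient at m points interpolated between the corrupted and clean activations. Everything here is a hypothetical toy: a scalar activation, a ReLU standing in for the metric, and the names `eap_score` and `eap_ig_score`. One failure mode often cited for plain gradient attribution is a zero gradient at the clean input, which this toy reproduces.

```python
def relu_grad(z):
    """Gradient of the toy metric L(z) = max(0, z)."""
    return 1.0 if z > 0 else 0.0


def eap_score(z_clean, z_corrupt, grad):
    """EAP: activation difference times the gradient at the clean activation."""
    return (z_corrupt - z_clean) * grad(z_clean)


def eap_ig_score(z_clean, z_corrupt, grad, steps):
    """EAP-IG: activation difference times the mean gradient along the path
    from the corrupted activation to the clean one."""
    total = 0.0
    for k in range(1, steps + 1):
        z = z_corrupt + (k / steps) * (z_clean - z_corrupt)
        total += grad(z)
    return (z_corrupt - z_clean) * total / steps


z_clean, z_corrupt = -1.0, 1.0
# The true change in the toy metric is max(0, 1) - max(0, -1) = 1.
print(eap_score(z_clean, z_corrupt, relu_grad))                # 0.0
print(eap_ig_score(z_clean, z_corrupt, relu_grad, steps=8))    # 0.75
```

With more interpolation steps the EAP-IG estimate in this toy moves closer to the true change of 1, while the EAP score stays at zero because the gradient vanishes at the clean activation.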
Solving Disconnected Components Issue with EAP-IG
Identifying Negative Components with EAP-IG
Using the absolute value of the EAP-IG scores solves the negative component identification issue.
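The effect of ranking by magnitude can be seen in a few lines. This is only an illustrative sketch: the edge names and scores are made up, and `top_k_edges` is a hypothetical helper, not part of any EAP-IG implementation.

```python
def top_k_edges(scores, k, use_abs=True):
    """Return the k edges with the largest score (or largest magnitude)."""
    if use_abs:
        key = lambda edge: abs(scores[edge])
    else:
        key = lambda edge: scores[edge]
    return sorted(scores, key=key, reverse=True)[:k]


# Made-up scores: edge_b strongly hurts the metric, so its score is negative.
scores = {"edge_a": 3.0, "edge_b": -2.5, "edge_c": 0.4}

print(top_k_edges(scores, k=2, use_abs=False))  # signed ranking skips edge_b
print(top_k_edges(scores, k=2, use_abs=True))   # magnitude ranking keeps it
```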
EAP-IG finds more faithful circuits
Overlap of EAP-IG circuits with ground truth is low
Overlap and/or faithfulness?
Grace, Luze: Overlap => faithfulness?
Is finding the circuit the end goal?
Can we reverse engineer neurons and their connections to understand the underlying algorithms?
Is this what we want?
No! We want to understand the mechanisms
Value Info.
Position Info.
Position Info.
Circuit Discovery is a step towards understanding mechanisms
Tapan, Rice: Label components + create counterfactuals
More recent works on circuit discovery
Supports MLP Transcoder only!