3 of 47

Historical Context

Chris Olah et al. asked the same question ~30 years later for the Inception model.

Can we reverse engineer neurons and their connections to understand the underlying algorithms?

4 of 47

Historical Context

How are features in different layers connected?

They studied specific sub-networks of a model, which they referred to as circuits, to understand how early-layer features give rise to later-layer ones.

5 of 47

What is a Circuit?

[Figure: Transformer Model. tokens -> embed -> Layer 1 -> Layer 2 -> Layer 3 -> unembed -> logits; each layer contains h⁰, h¹, …, and MLP m.]

6 of 47

A circuit is a subgraph that connects the inputs to the logits

[Figure: transformer diagram (tokens -> embed -> Layer 1 -> Layer 2 -> Layer 3 -> unembed -> logits; each layer contains h⁰, h¹, …, and MLP m) with a subgraph labeled "Circuit".]
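One way to make this definition concrete is to treat the model as a directed graph and a circuit as a subset of its edges that still links the tokens to the logits. The sketch below assumes a made-up three-layer graph; node names such as `L1.h0` and the helper functions are hypothetical and only mirror the diagram.

```python
from collections import deque

# Hypothetical computational graph: each edge is (sender, receiver).
MODEL_EDGES = {
    ("tokens", "embed"),
    ("embed", "L1.h0"), ("embed", "L1.h1"), ("embed", "L1.mlp"),
    ("L1.h0", "L2.h1"), ("L1.h1", "L2.h0"), ("L1.mlp", "L2.mlp"),
    ("L2.h0", "L3.h0"), ("L2.h1", "L3.mlp"), ("L2.mlp", "L3.h1"),
    ("L3.h0", "unembed"), ("L3.h1", "unembed"), ("L3.mlp", "unembed"),
    ("unembed", "logits"),
}

def connects(edges, source="tokens", sink="logits"):
    """Return True if `sink` is reachable from `source` using only `edges`."""
    children = {}
    for sender, receiver in edges:
        children.setdefault(sender, []).append(receiver)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            return True
        for nxt in children.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def is_circuit(candidate, model_edges=MODEL_EDGES):
    """Here a candidate counts if it is a subgraph of the model and links tokens to logits."""
    return candidate <= model_edges and connects(candidate)

circuit = {
    ("tokens", "embed"), ("embed", "L1.h1"), ("L1.h1", "L2.h0"),
    ("L2.h0", "L3.h0"), ("L3.h0", "unembed"), ("unembed", "logits"),
}
broken = circuit - {("L2.h0", "L3.h0")}   # removing one edge disconnects the path
print(is_circuit(circuit), is_circuit(broken))
```

This connectivity check is only the structural half of the definition; whether the subgraph actually explains the behavior is a separate, empirical question.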

7 of 47

A circuit is a subgraph that connects the inputs to the logits

[Figure: same transformer diagram as on the previous slide, with the subgraph labeled "Circuit".]

Jasmine, Rice: What counts as a circuit?

8 of 47

How do you find circuits in an LM?

9 of 47

Experiment Setup

Indirect Object Identification task:

John and Mary went to the store. John gave a drink to ____

GPT-2 small (12 layers, 12 attention heads per layer)

Can perform the task very well (~99%)

Metric: Logit difference (Mary - John), i.e. the indirect object's logit minus the subject's logit
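As a rough illustration of the metric, the sketch below computes a logit difference between two candidate names at the final position and averages it over prompts. The token ids, the tiny vocabulary, and the logit values are all made up; the function takes both ids as arguments, so either sign convention can be used by swapping them.

```python
import numpy as np

def logit_difference(logits, first_id, second_id):
    """Logit of `first_id` minus logit of `second_id` at the final position.

    logits: array of shape (batch, seq_len, vocab_size).
    """
    final = logits[:, -1, :]
    return final[:, first_id] - final[:, second_id]

# Made-up numbers: a 5-token vocabulary where ids 3 and 4 stand for the two names.
MARY, JOHN = 3, 4
fake_logits = np.zeros((2, 6, 5))
fake_logits[0, -1, MARY], fake_logits[0, -1, JOHN] = 5.0, 1.5
fake_logits[1, -1, MARY], fake_logits[1, -1, JOHN] = 4.0, 2.0

per_prompt = logit_difference(fake_logits, MARY, JOHN)   # one value per prompt
mean_diff = per_prompt.mean()                            # averaged over the dataset
```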

Path Patching: Circuit Discovery Algorithm (Intuition)

[Figure: paths from tokens to logits, with the circuit edges marked.]

17 of 47

Path Patching Algorithm

18 of 47

Path Patching Algorithm

19 of 47

Which heads directly affect the output? (Name Mover)

Receiver: Final logit
Sender: Each head in the model
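The sender/receiver setup can be sketched on a toy residual model. This is a minimal sketch, not GPT-2: the weights are random, there is no LayerNorm, the "heads" are plain `tanh` maps, and all names (`forward`, `direct_effect`) are hypothetical. It patches only the direct path from one sender head to the final logits, keeping every other head at its clean cached value.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_HEADS, VOCAB = 8, 3, 5  # toy sizes, not GPT-2's

# Toy residual model: two layers of "heads", random weights, no LayerNorm.
W1 = rng.normal(size=(N_HEADS, D, D)) / np.sqrt(D)
W2 = rng.normal(size=(N_HEADS, D, D)) / np.sqrt(D)
W_U = rng.normal(size=(VOCAB, D))

def forward(x):
    """Run the toy model and cache every head output."""
    out1 = np.tanh(W1 @ x)            # (N_HEADS, D): layer-1 heads read the embedding
    resid1 = x + out1.sum(axis=0)
    out2 = np.tanh(W2 @ resid1)       # layer-2 heads read the layer-1 residual stream
    resid2 = resid1 + out2.sum(axis=0)
    return {"out": np.stack([out1, out2]), "resid_final": resid2, "logits": W_U @ resid2}

def logit_diff(logits, a=0, b=1):
    return logits[a] - logits[b]

def direct_effect(clean, corrupt, layer, head):
    """Patch only the sender -> final-logit path with the corrupted head output.

    All other heads keep their clean cached outputs, so downstream heads never
    see the change; that is what separates this from plain activation patching.
    """
    delta = corrupt["out"][layer, head] - clean["out"][layer, head]
    patched_logits = W_U @ (clean["resid_final"] + delta)
    return logit_diff(patched_logits) - logit_diff(clean["logits"])

x_clean, x_corrupt = rng.normal(size=D), rng.normal(size=D)
clean, corrupt = forward(x_clean), forward(x_corrupt)

# One score per (layer, head): the change in logit difference caused by the patch.
effects = np.array([[direct_effect(clean, corrupt, layer, head)
                     for head in range(N_HEADS)] for layer in range(2)])
```

Heads with the largest change in logit difference would be the candidates for directly affecting the output. In a real model the final LayerNorm also has to be handled, which this toy leaves out.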

20 of 47

Which heads affect the Name Mover heads?

Receiver: Query vector of Name Mover heads (Why Query?)
Sender: Each head in the model

21 of 47

Iterative path patching yields the circuit

Arya, Ananya, Jesseba: Generalizability (task)?

Haoyu: Retraining impact?

22 of 47

Avery, Christopher, Rice: Generalizability (models)?

23 of 47

Backup Mechanism

Claire: Comment about redundancy

24 of 47

Circuit Validation (Faithfulness)

Faithfulness: Is the identified circuit faithful to the underlying model?

Can we recover the model’s performance using only the circuit?

Faithfulness of circuit C = F(C) / F(M) = 0.87

where F() is the mean logit difference.
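Reading the formula as the ratio of F(C) to F(M), the computation can be sketched as follows. The per-prompt logit differences are made up for illustration and do not reproduce the reported value; `faithfulness` is a hypothetical helper name.

```python
import numpy as np

def faithfulness(circuit_logit_diffs, model_logit_diffs):
    """Ratio F(C) / F(M), with F taken as the mean logit difference."""
    return float(np.mean(circuit_logit_diffs) / np.mean(model_logit_diffs))

# Made-up per-prompt logit differences, for illustration only.
model_diffs = np.array([3.0, 4.0, 5.0])     # full model M
circuit_diffs = np.array([2.5, 3.5, 4.5])   # circuit C, everything else ablated
score = faithfulness(circuit_diffs, model_diffs)   # 3.5 / 4.0
```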

Yuqi: operationalize faithfulness?

25 of 47

Circuit Validation (Completeness)

Finding either path will result in full faithfulness, but won’t provide the complete picture.

26 of 47

Circuit Validation (Minimality)

A faithful and complete circuit can still include redundant components.

27 of 47

Can we automate Circuit Discovery?

Yiqian: human-driven or automated interpretability?

28 of 47

Circuit Discovery Overview

1. Define the task, dataset, metric, and model(s) that can do the task.
2. Unit of analysis (layers, heads, neurons, subspaces, etc.) -> Computational graph.
3. Perform patching experiments to identify the circuit.

29 of 47

ACDC automates step 3.

1. Define the task, dataset, metric, and model(s) that can do the task.
2. Unit of analysis (layers, heads, neurons, subspaces, etc.) -> Computational graph.
3. Perform patching experiments to identify the circuit.
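The automated patching step can be pictured as a greedy pruning loop: visit edges starting from the output, tentatively remove each one, and keep it removed if the metric barely changes. The sketch below is a simplification in that spirit, not the ACDC implementation. The graph, the per-edge importances, the `toy_divergence` stand-in for the real patched-model metric, and the threshold `tau` are all assumptions.

```python
# Hypothetical graph: edges listed from the output backwards, with made-up importances.
EDGE_IMPORTANCE = {
    ("L3.h0", "logits"): 0.90,
    ("L3.h1", "logits"): 0.01,
    ("L2.h0", "L3.h0"): 0.70,
    ("L2.h1", "L3.h0"): 0.02,
    ("L1.h0", "L2.h0"): 0.50,
    ("L1.h1", "L2.h0"): 0.00,
}

def toy_divergence(kept_edges):
    """Stand-in for the metric: damage grows with the importance of removed edges."""
    return sum(w for e, w in EDGE_IMPORTANCE.items() if e not in kept_edges)

def acdc_like_prune(edges_output_first, divergence, tau):
    """Greedily drop each edge whose removal raises the divergence by less than tau."""
    kept = set(edges_output_first)
    current = divergence(kept)
    for edge in edges_output_first:
        trial = kept - {edge}
        new = divergence(trial)          # one metric evaluation per candidate edge
        if new - current < tau:
            kept, current = trial, new   # cheap to remove: make the removal permanent
    return kept

circuit = acdc_like_prune(list(EDGE_IMPORTANCE), toy_divergence, tau=0.05)
print(sorted(circuit))
```

With these made-up numbers only the three edges with importance above the threshold survive. The one-evaluation-per-edge loop is also where the cost of this style of search comes from.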

31 of 47

Evaluating ACDC

Use previously discovered circuits as ground truth.
Compute the ROC curve of the identified circuit against the ground-truth circuit.
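Treating circuit discovery as edge classification, an ROC curve can be traced by sweeping a threshold over per-edge scores and comparing against the reference circuit. The sketch below assumes every ground-truth edge has a score and that there are no ties; the edge names and scores are made up.

```python
def roc_points(edge_scores, ground_truth):
    """Return (FPR, TPR) pairs obtained by thresholding edge scores.

    edge_scores: dict edge -> importance score from the discovery method.
    ground_truth: set of edges in the reference circuit.
    """
    positives = len(ground_truth)
    negatives = len(edge_scores) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one edge at a time, highest score first.
    for edge, _ in sorted(edge_scores.items(), key=lambda kv: -kv[1]):
        if edge in ground_truth:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up scores for four hypothetical edges; "a" and "b" form the reference circuit.
scores = {"a": 0.9, "c": 0.6, "b": 0.4, "d": 0.1}
truth = {"a", "b"}
curve = roc_points(scores, truth)
area = auc(curve)
```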

Courtney: ground truth circuit?

32 of 47

Problems with ACDC?

33 of 47

Problems with ACDC?

Shuyi: Scalability of ACDC to frontier models? (~10 mins for IOI on GPT2-small)
Claire: Negative Components are not identifiable.
Disconnected circuit components.
Evaluation requires ground-truth circuits, which are scarce and error-prone.

34 of 47

Solving ACDC and its evaluation issues

35 of 47

Solving Scalability with EAP

ACDC requires a single forward pass for each edge in the model.
EAP only needs:

Two forward passes (clean and corrupted)
One backward pass

Edge Importance =
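A common way to write the EAP score for an edge is as a first-order approximation of the patching effect: the change in the sender's activation, (z' - z), dotted with the gradient of the metric evaluated at the clean activation z. The sketch below checks that approximation on a toy metric. The metric, its analytic gradient, and the activation values are assumptions made for illustration.

```python
import numpy as np

def metric(z):
    """Hypothetical scalar metric as a function of one edge's activation vector."""
    return float(np.sum(np.tanh(z)))

def metric_grad(z):
    """Analytic gradient of `metric` (what a single backward pass would provide)."""
    return 1.0 - np.tanh(z) ** 2

def eap_score(z_clean, z_corrupt):
    """First-order estimate of metric(z_corrupt) - metric(z_clean)."""
    return float((z_corrupt - z_clean) @ metric_grad(z_clean))

z = np.array([0.10, -0.20, 0.05])        # clean activation (made up)
z_prime = np.array([0.15, -0.10, 0.00])  # corrupted activation (made up)

approx = eap_score(z, z_prime)
exact = metric(z_prime) - metric(z)      # what actually patching this edge would measure
print(approx, exact)
```

Because the gradient comes from one backward pass and the two activations from the clean and corrupted forward passes, every edge can be scored without a separate run per edge. The estimate is only as good as the linear approximation, which holds here because z and z' are close.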

36 of 47

EAP Issue

The derivative at z is ~0 => EAP treats the edge as unimportant.
However, the edge seems important for the corrupted sample!

37 of 47

EAP-IG (Integrated Gradients)

The IG score is computed by averaging the EAP score over a straight-line path from z to z’.

Edge Importance =
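The averaging can be illustrated on a one-dimensional toy where the metric saturates at the clean activation. This is a sketch under assumptions: the `tanh` metric, the values of z and z', the step count, and interpolating the activation directly are all illustrative choices rather than the published setup.

```python
import numpy as np

def metric(z):
    """Hypothetical saturating metric of a single scalar activation."""
    return float(np.tanh(z))

def metric_grad(z):
    return float(1.0 - np.tanh(z) ** 2)

def eap_score(z, z_prime):
    """Plain EAP: gradient taken only at the clean activation z."""
    return (z_prime - z) * metric_grad(z)

def eap_ig_score(z, z_prime, steps=50):
    """Average the gradient at `steps` points on the straight line from z to z_prime."""
    alphas = (np.arange(steps) + 0.5) / steps   # midpoints of equal sub-intervals
    grads = [metric_grad(z + a * (z_prime - z)) for a in alphas]
    return (z_prime - z) * float(np.mean(grads))

z, z_prime = 6.0, 0.0   # made-up values: the clean activation sits in the flat region
exact = metric(z_prime) - metric(z)   # true effect of swapping z for z_prime
plain = eap_score(z, z_prime)         # near zero, because the gradient at z vanishes
ig = eap_ig_score(z, z_prime)         # close to the true effect
print(exact, plain, ig)
```

The plain score misses the edge because it only looks at the flat region around z, while the path average picks up the steep part of the curve between z and z'. The cost is one gradient evaluation per interpolation step.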

38 of 47

Solving Disconnected Components Issue with EAP-IG

39 of 47

Identifying Negative Components with EAP-IG

Using the absolute EAP-IG score solves the negative-component identification issue.

40 of 47

EAP-IG finds more faithful circuits

41 of 47

Overlap of EAP-IG circuits with ground truth is low

42 of 47

Overlap and/or faithfulness?

Grace, Luze: Overlap => faithfulness?

43 of 47

Overlap and/or faithfulness?

Faithfulness directly measures whether the circuit reproduces the model’s behavior when everything else is ablated.
Overlap only measures structural similarity, which might not reflect causal importance.
Example:

Circuit A: 10 edges (5 critical + 5 not so critical)
Circuit B: 20 edges (includes only 5 not so critical edges of A)
Overlap = 50% | Faithfulness = ~0%

Prioritize faithfulness. When possible report both measures.
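The arithmetic in the example can be checked with plain sets. Edge names are hypothetical, and `toy_faithfulness` is an assumed stand-in that credits a circuit only for the critical edges it keeps, which is a simplification of measuring faithfulness on a real model.

```python
# Hypothetical edge ids: A has 5 critical (c*) and 5 less critical (n*) edges.
critical = {f"c{i}" for i in range(5)}
less_critical = {f"n{i}" for i in range(5)}
circuit_a = critical | less_critical                       # 10 edges
circuit_b = less_critical | {f"x{i}" for i in range(15)}   # 20 edges, none critical

def overlap(reference, candidate):
    """Fraction of the reference circuit's edges that the candidate also contains."""
    return len(reference & candidate) / len(reference)

def toy_faithfulness(candidate):
    """Assumed toy model: behavior is recovered in proportion to critical edges kept."""
    return len(candidate & critical) / len(critical)

overlap_ab = overlap(circuit_a, circuit_b)   # 5 shared edges out of A's 10
faith_b = toy_faithfulness(circuit_b)        # B keeps no critical edge
print(overlap_ab, faith_b)
```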

44 of 47

Is finding the circuit the end goal?

Can we reverse engineer neurons and their connections to understand the underlying algorithms?

Is this what we want?

45 of 47

No! We want to understand the mechanisms

Value Info.

Position Info.

46 of 47

Circuit Discovery is a step towards understanding mechanisms

1. Define the task, dataset, metric, and model(s) that can do the task.
2. Unit of analysis (layers, heads, neurons, subspaces, etc.) -> Computational graph.
3. Perform patching experiments to identify the circuit.
4. Identify the functionalities/roles of circuit components.

This is more challenging because it is still hypothesis-driven.
You need to come up with the abstract causal model / circuit component functionality -> Create appropriate counterfactuals -> Test it using activation patching!
BIG SCOPE FOR AUTOMATION!
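The hypothesis -> counterfactual -> patching loop can be sketched on a deliberately tiny toy. Everything here is assumed for illustration: a two-unit "model" in which unit 0 carries value information and unit 1 carries position information, and a counterfactual input that differs from the base input only in position.

```python
import numpy as np

def toy_model(value, position, patch=None):
    """Hypothetical 2-unit model: unit 0 carries value info, unit 1 position info."""
    hidden = np.array([float(value), float(position)])
    if patch is not None:
        unit, activation = patch
        hidden[unit] = activation               # activation patching: overwrite one unit
    return hidden, hidden[0] * 10 + hidden[1]   # output mixes both pieces of information

# Counterfactual pair that differs only in position.
base_hidden, base_out = toy_model(value=3, position=1)
cf_hidden, cf_out = toy_model(value=3, position=2)

# Hypothesis: unit 1 encodes position. Patching it from the counterfactual run
# should move the base output to the counterfactual output; patching unit 0 should not.
_, out_patch_pos = toy_model(3, 1, patch=(1, cf_hidden[1]))
_, out_patch_val = toy_model(3, 1, patch=(0, cf_hidden[0]))
print(base_out, cf_out, out_patch_pos, out_patch_val)
```

If patching the hypothesized component reproduces the counterfactual output and patching the others does not, the hypothesis about that component's role survives this test. In a real model both the hypothesis and the counterfactual dataset have to be designed by hand.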

Tapan, Rice: Label components + create counterfactuals
