1 of 47

Neural Circuits in Language Models

Neural Mechanics

Week 8 (10th March)

2 of 47

Historical Context

Rumelhart et al. reverse-engineered each connection weight to make the point that backpropagation can learn a genuine algorithm.

3 of 47

Historical Context

Chris Olah et al. asked the same question ~30 years later for the Inception model.

Can we reverse engineer neurons and their connections to understand the underlying algorithms?

4 of 47

Historical Context

  • How are features in different layers connected?

  • Studied specific sub-networks of a model, which they referred to as circuits, to understand how early-layer features give rise to later-layer ones.

5 of 47

What is a Circuit?

[Figure: Transformer Model. tokens -> embed -> Layer 1 -> Layer 2 -> Layer 3 -> unembed -> logits. Each layer contains attention heads h0 and h1 and an MLP m, whose outputs are added (+) back into the stream.]

6 of 47

A circuit is a subgraph that connects the inputs to the logits

[Figure: transformer diagram (tokens -> embed -> Layers 1-3 with heads h0, h1 and MLP m -> unembed -> logits), labeled "Circuit"]

7 of 47

A circuit is a subgraph that connects the inputs to the logits

[Figure: transformer diagram labeled "Circuit", as on the previous slide]

Jasmine, Rice: What counts as a circuit?

8 of 47

How do you find circuits in an LM?

9 of 47

Experiment Setup


  • Indirect Object Identification task:

John and Mary went to the store. John gave a drink to ____

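  • GPT2-small (12 layers, 12 attention heads per layer)
    • Can perform the task very well (~99%)

  • Metric: Logit difference between the correct name (the indirect object, Mary in the prompt above) and the incorrect name (the subject, John)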
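The metric above can be sketched in a few lines. This is a minimal illustration, not GPT2-small output: the three-token vocabulary, the logit values, and the helper names `logit_difference` and `mean_logit_difference` are all hypothetical.

```python
# Hypothetical toy vocabulary; a real model would use its tokenizer's ids.
VOCAB = {"John": 0, "Mary": 1, "store": 2}


def logit_difference(logits, vocab, correct, incorrect):
    """Return logit(correct name) - logit(incorrect name) for one prompt."""
    return logits[vocab[correct]] - logits[vocab[incorrect]]


def mean_logit_difference(batch, vocab):
    """Average the logit difference over (logits, correct, incorrect) triples."""
    diffs = [logit_difference(lg, vocab, c, i) for lg, c, i in batch]
    return sum(diffs) / len(diffs)


# Made-up final-position logits for two IOI-style prompts.
batch = [
    ([1.0, 4.0, 0.5], "Mary", "John"),  # "... John gave a drink to" -> Mary
    ([3.5, 0.5, 0.0], "John", "Mary"),  # "... Mary gave a drink to" -> John
]
score = mean_logit_difference(batch, VOCAB)
print(score)
```

A positive score means the model prefers the correct name on average; a score near zero means it cannot tell the two names apart.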

10 of 47

Path Patching: Circuit Discovery Algorithm (Intuition)

[Figure: computational graph from tokens to logits]

What are the important edges in a computational graph?

11 of 47

Path Patching: Circuit Discovery Algorithm (Intuition)

[Figure: two copies of the tokens -> logits computational graph]


15 of 47

Path Patching: Circuit Discovery Algorithm (Intuition)

[Figure: tokens -> logits computational graph with the circuit edges marked]

Circuit edges: the edges whose patching caused the greatest degradation in logit values.
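The intuition on these slides can be sketched on a tiny graph: patch one edge at a time with its value from a corrupted run, and keep the edges whose patching hurts the output most. Everything below is a hypothetical toy (a four-edge linear graph with made-up weights, and a threshold of 1.0 chosen by hand). Real path patching works on the paths of a transformer's computational graph.

```python
# Hypothetical toy graph: x -> {a, b} -> out, with made-up edge weights.
WEIGHTS = {("x", "a"): 2.0, ("x", "b"): 0.1, ("a", "out"): 1.5, ("b", "out"): 1.0}
ORDER = ["a", "b", "out"]  # topological order of the non-input nodes


def forward(x, patch=None, corrupt_acts=None):
    """Run the graph; the patched edge carries its sender's corrupted value."""
    acts = {"x": x}
    for node in ORDER:
        total = 0.0
        for (u, v), w in WEIGHTS.items():
            if v != node:
                continue
            source = corrupt_acts[u] if (u, v) == patch else acts[u]
            total += w * source
        acts[node] = total
    return acts


clean = forward(1.0)      # clean run
corrupt = forward(0.0)    # corrupted run (input replaced)

# Degradation of the output when each edge is patched in isolation.
drops = {
    edge: clean["out"] - forward(1.0, patch=edge, corrupt_acts=corrupt)["out"]
    for edge in WEIGHTS
}
circuit_edges = {edge for edge, drop in drops.items() if drop > 1.0}
print(sorted(circuit_edges))
```

In this toy, the two edges on the x -> a -> out path cause almost all of the degradation, so they are the ones kept.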

16 of 47

Path Patching: Circuit Discovery Algorithm (Intuition)

[Figure: tokens -> logits computational graph, labeled "Circuit Edges"]

17 of 47

Path Patching Algorithm

18 of 47

Path Patching Algorithm

19 of 47

Which heads directly affect the output? (Name Mover)

  • Receiver: Final logit
  • Sender: Each head in the model


20 of 47

Which heads affect the Name Mover heads?

  • Receiver: Query vector of Name Mover heads (Why Query?)
  • Sender: Each head in the model


21 of 47

Iterative path patching yields the circuit

Arya, Ananya, Jesseba: Generalizability (task)?

Haoyu: Retraining impact?

22 of 47


Avery, Christopher, Rice: Generalizability (models)?

23 of 47

Backup Mechanism


Claire: Comment about redundancy

24 of 47

Circuit Validation (Faithfulness)

  • Faithfulness: Is the identified circuit faithful to the underlying model?
    • Can we recover the model’s performance using only the circuit?

$\text{Faithfulness}(C) = \frac{F(C)}{F(M)} = 0.87$

where $F(\cdot)$ is the mean logit difference, $C$ is the circuit, and $M$ is the full model.
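A minimal sketch of this ratio, assuming per-prompt logit differences have already been measured for the full model and for the circuit alone (everything outside the circuit ablated). The numbers are made up and do not reproduce the 0.87 above. The helper names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def faithfulness(circuit_logit_diffs, model_logit_diffs):
    """F(C) / F(M), where F is the mean logit difference."""
    return mean(circuit_logit_diffs) / mean(model_logit_diffs)


# Made-up per-prompt logit differences.
model_diffs = [3.0, 4.0, 5.0]     # full model M
circuit_diffs = [2.5, 3.0, 4.1]   # circuit C only, rest of the model ablated

score = faithfulness(circuit_diffs, model_diffs)
print(round(score, 2))
```

A score near 1 means the circuit alone recovers most of the model's behavior on the task.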

Yuqi: operationalize faithfulness?

25 of 47

Circuit Validation (Completeness)

[Figure: input I feeds two parallel paths, C1 and C2, whose outputs are combined (+) into output O]

  • Finding either path alone yields full faithfulness, but does not give the complete picture.
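The two-path picture can be made concrete with a toy in which C1 and C2 are redundant. The saturating metric `F` is a made-up assumption. The knockout comparison |F(C \ K) - F(M \ K)| is one common way to operationalize completeness, and is used here as an assumption rather than as the slide's own definition.

```python
MODEL = frozenset({"C1", "C2"})  # the full model has both parallel paths


def F(active_paths):
    """Toy metric: each active path contributes 1.0, and the output saturates at 1.0."""
    return min(1.0, float(len(active_paths & MODEL)))


def faithfulness(circuit):
    """F(C) / F(M) for the toy metric."""
    return F(circuit) / F(MODEL)


def incompleteness(circuit, knocked_out):
    """|F(C \\ K) - F(M \\ K)|: large values mean the circuit misses something."""
    return abs(F(circuit - knocked_out) - F(MODEL - knocked_out))


partial = frozenset({"C1"})
k = frozenset({"C1"})

print(faithfulness(partial))        # the single path already looks fully faithful
print(incompleteness(partial, k))   # knocking out C1 exposes the missing C2
print(incompleteness(MODEL, k))     # the two-path circuit behaves like the model
```

The single-path circuit scores full faithfulness, yet after knocking out C1 the model still works through C2 while the circuit does not, which is exactly the gap that completeness is meant to catch.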

26 of 47

Circuit Validation (Minimality)

  • A faithful and complete circuit can still include redundant components.

27 of 47

Can we automate Circuit Discovery?


Yiqian: human-driven or automated interpretability?

28 of 47

Circuit Discovery Overview

  1. Define the task, dataset, metric, and model(s) that can do the task.
  2. Unit of analysis (layers, heads, neurons, subspaces, etc.) -> Computational graph.
  3. Perform patching experiments to identify the circuit.


29 of 47

ACDC automates step 3.

  1. Define the task, dataset, metric, and model(s) that can do the task.
  2. Unit of analysis (layers, heads, neurons, subspaces, etc.) -> Computational graph.
  3. Perform patching experiments to identify the circuit.
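To give a feel for what automating step 3 looks like, here is a simplified greedy pruning loop in the spirit of ACDC. It is a sketch under stated assumptions: the four-edge graph, its weights, and the threshold `tau` are hypothetical, and the gap is a plain absolute difference in the output, whereas ACDC itself compares output distributions (for example with KL divergence).

```python
# Hypothetical toy graph: x -> {a, b} -> out, with made-up edge weights.
WEIGHTS = {("x", "a"): 2.0, ("x", "b"): 0.1, ("a", "out"): 1.5, ("b", "out"): 1.0}
ORDER = ["a", "b", "out"]  # topological order of the non-input nodes


def run(x_clean, x_corrupt, removed):
    """Output of the subgraph; removed edges carry corrupted activations."""
    corrupt = {"x": x_corrupt}
    sub = {"x": x_clean}
    for node in ORDER:
        corrupt[node] = 0.0
        sub[node] = 0.0
        for (u, v), w in WEIGHTS.items():
            if v != node:
                continue
            corrupt[node] += w * corrupt[u]
            sub[node] += w * (corrupt[u] if (u, v) in removed else sub[u])
    return sub["out"]


def prune(x_clean, x_corrupt, tau):
    """Visit edges from the output backwards; drop those that barely change the output."""
    full_output = run(x_clean, x_corrupt, removed=set())
    edges = sorted(WEIGHTS, key=lambda e: ORDER.index(e[1]), reverse=True)
    removed, current_gap = set(), 0.0
    for edge in edges:
        trial = removed | {edge}
        gap = abs(run(x_clean, x_corrupt, trial) - full_output)
        if gap - current_gap < tau:  # removing this edge costs less than tau
            removed, current_gap = trial, gap
    return set(WEIGHTS) - removed


circuit = prune(x_clean=1.0, x_corrupt=0.0, tau=0.5)
print(sorted(circuit))
```

The threshold controls the size of the result: a larger `tau` prunes more aggressively, a smaller one keeps more edges.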


30 of 47


31 of 47

Evaluating ACDC

  • Use previously discovered circuits as ground truth.
  • Compute the ROC curve of the identified circuit against the ground-truth circuit.
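One way to read this evaluation is as edge-level binary classification: each edge gets an importance score, the ground-truth circuit supplies the labels, and sweeping a threshold traces out the ROC curve. The sketch below assumes that reading. The edge names, scores, and ground-truth set are made up.

```python
def roc_points(scores, truth):
    """(FPR, TPR) points obtained by sweeping a threshold over edge scores."""
    positives = sum(1 for edge in scores if edge in truth)
    negatives = len(scores) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores.values()), reverse=True):
        predicted = {edge for edge, s in scores.items() if s >= threshold}
        true_pos = len(predicted & truth)
        false_pos = len(predicted - truth)
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up edge scores and a made-up ground-truth circuit.
scores = {"e1": 0.9, "e2": 0.8, "e3": 0.3, "e4": 0.1}
ground_truth = {"e1", "e3"}

points = roc_points(scores, ground_truth)
area = auc(points)
print(points, area)
```

Note that this score is only as good as the ground-truth circuit it is measured against.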


Courtney: ground truth circuit?

32 of 47

Problems with ACDC?


33 of 47

Problems with ACDC?

  • Shuyi: Scalability of ACDC to frontier models? (~10 mins for IOI on GPT2-small)
  • Claire: Negative Components are not identifiable.
  • Disconnected circuit components.
  • Evaluation requires ground-truth circuits, which are scarce and error-prone.

34 of 47

Solving ACDC and its evaluation issues


35 of 47

Solving Scalability with EAP

  • ACDC requires a separate forward pass for each edge in the model.
  • EAP only needs:
    • Two forward passes
    • One backward pass

Edge Importance = $(z' - z)^\top \nabla_z L(z)$
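As a hedged illustration of why two forward passes and one backward pass suffice, the sketch below scores one edge as the activation difference (corrupt minus clean) times the gradient of the metric at the clean activation, which is how the EAP score is commonly stated. The metric, its analytic gradient (standing in for the backward pass), and the activations are all hypothetical.

```python
import math


def metric(z):
    """Toy downstream metric as a function of one edge's activation vector."""
    return math.tanh(0.5 * z[0]) + 0.1 * z[1]


def grad_metric(z):
    """Analytic gradient of `metric`; stands in for the single backward pass."""
    return [0.5 * (1.0 - math.tanh(0.5 * z[0]) ** 2), 0.1]


def eap_score(z_clean, z_corrupt):
    """First-order estimate of the metric change if the edge were patched."""
    grad = grad_metric(z_clean)
    return sum((zc - z) * g for z, zc, g in zip(z_clean, z_corrupt, grad))


z_clean = [0.2, 1.0]    # activation from the clean forward pass
z_corrupt = [0.0, 0.0]  # activation from the corrupted forward pass

approx = eap_score(z_clean, z_corrupt)
exact = metric(z_corrupt) - metric(z_clean)  # what actual patching would measure
print(approx, exact)
```

Here the linear estimate is close to the patched value because the metric is nearly linear between the two activations; the estimate degrades when that is not the case.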

36 of 47

EAP Issue

  • Derivative at z is ~0 => EAP marks the edge as unimportant.
  • However, it seems important for the corrupt sample!

37 of 47

EAP-IG (Integrated Gradients)

  • The IG score is computed by averaging the EAP score over points on the straight-line path from z to z’.

Edge Importance = $(z' - z)^\top \frac{1}{m} \sum_{k=1}^{m} \nabla L\left(z + \tfrac{k}{m}(z' - z)\right)$
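The saturation problem and the integrated-gradients fix can be seen on a one-dimensional toy. The `tanh` metric, the activations, and the step count are hypothetical choices made so that the gradient at the clean activation is nearly zero while the true effect of patching is large.

```python
import math


def metric(z):
    """Toy saturating metric of a scalar edge activation."""
    return math.tanh(z)


def grad_metric(z):
    """Derivative of tanh; close to 0 when |z| is large."""
    return 1.0 - math.tanh(z) ** 2


def eap_score(z_clean, z_corrupt):
    """Gradient at the clean activation times the activation difference."""
    return (z_corrupt - z_clean) * grad_metric(z_clean)


def eap_ig_score(z_clean, z_corrupt, steps=50):
    """Average the gradient over points on the straight line from z to z'."""
    avg_grad = sum(
        grad_metric(z_clean + (k / steps) * (z_corrupt - z_clean))
        for k in range(1, steps + 1)
    ) / steps
    return (z_corrupt - z_clean) * avg_grad


z_clean, z_corrupt = 4.0, 0.0  # clean activation sits in the saturated region
eap = eap_score(z_clean, z_corrupt)
eap_ig = eap_ig_score(z_clean, z_corrupt)
exact = metric(z_corrupt) - metric(z_clean)
print(eap, eap_ig, exact)
```

Plain EAP reports a near-zero score, while the path-averaged score lands close to the true change; more steps tighten the estimate at the cost of more gradient evaluations.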

38 of 47

Solving Disconnected Components Issue with EAP-IG


39 of 47

Identifying Negative Components with EAP-IG

Using the absolute value of the EAP-IG score solves the negative-component identification issue.

40 of 47

EAP-IG finds more faithful circuits


41 of 47

Overlap of EAP-IG circuits with ground truth is low


42 of 47

Overlap and/or faithfulness?


Grace, Luze: Overlap => faithfulness?

43 of 47

Overlap and/or faithfulness?


  • Faithfulness directly measures whether the circuit reproduces the model’s behavior when everything else is ablated.
  • Overlap only measures structural similarity, which might not reflect causal importance.
  • Example:
    • Circuit A: 10 edges (5 critical + 5 not-so-critical)
    • Circuit B: 20 edges (shares only the 5 not-so-critical edges of A)
    • Overlap = 50% | Faithfulness = ~0%
  • Prioritize faithfulness. When possible, report both measures.
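The example above can be reproduced with plain sets. The edge names are made up, overlap is taken as the fraction of A's edges that also appear in B, and `toy_faithfulness` is a deliberately crude stand-in that assumes only the critical edges carry the behavior.

```python
# Circuit A: 10 edges, of which a0..a4 are critical (hypothetical labels).
circuit_a = {f"a{i}" for i in range(10)}
critical = {f"a{i}" for i in range(5)}

# Circuit B: 20 edges, sharing only A's 5 not-so-critical edges (a5..a9).
circuit_b = {f"a{i}" for i in range(5, 10)} | {f"b{i}" for i in range(15)}


def overlap(reference, candidate):
    """Fraction of the reference circuit's edges that the candidate contains."""
    return len(reference & candidate) / len(reference)


def toy_faithfulness(circuit):
    """Crude stand-in: share of critical edges that the circuit contains."""
    return len(circuit & critical) / len(critical)


print(len(circuit_b))                 # 20 edges
print(overlap(circuit_a, circuit_b))  # structural similarity
print(toy_faithfulness(circuit_b))    # behavior recovered by B
print(toy_faithfulness(circuit_a))    # behavior recovered by A
```

B looks half-similar to A structurally but recovers none of the behavior, which is why overlap alone can mislead.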

44 of 47

Is finding the circuit the end goal?

Can we reverse engineer neurons and their connections to understand the underlying algorithms?

Is this what we want?

45 of 47

No! We want to understand the mechanisms

[Figure: circuit diagram with components labeled "Value Info." and "Position Info."]

46 of 47

Circuit Discovery is a step towards understanding mechanisms

  1. Define the task, dataset, metric, and model(s) that can do the task.
  2. Unit of analysis (layers, heads, neurons, subspaces, etc.) -> Computational graph.
  3. Perform patching experiments to identify the circuit.
  4. Identify the functionalities/roles of circuit components.
    1. This is more challenging because it is still hypothesis-driven.
    2. You need to come up with the abstract causal model / circuit component functionality -> Create appropriate counterfactuals -> Test it using activation patching!
    3. BIG SCOPE FOR AUTOMATION!


Tapan, Rice: Label components + create counterfactuals

47 of 47

More recent works on circuit discovery


Supports MLP Transcoder only!