1 of 18

Probing

Detecting the presence of information

Neural Mechanics Spring 2026

2 of 18

Basic probing: Does z=f(x) “know” concept C?

Three steps for classic probing (TCAV illustrations)

  1. Define concept C with classification data D = {x: c}

  2. Train a probe g(z) = c, as g(f(x)) → c, on half of D

  3. Measure the accuracy of c =? g(f(x)) on the other half of D

Higher accuracy = “C is known better”
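The three-step recipe can be sketched end-to-end. This is a minimal illustration with synthetic activations standing in for z = f(x); names like `probe` are ours, not a standard API:

```python
# Sketch of the classic probing recipe on synthetic data.
# In practice z would be a model's hidden activations for inputs x.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# (1) Concept dataset D = {x: c}; here z stands in for f(x).
n, d = 1000, 64
c = rng.integers(0, 2, size=n)                          # binary concept labels
direction = rng.normal(size=d)
z = rng.normal(size=(n, d)) + np.outer(c, direction)    # activations that encode c

# (2) Train the probe g on half of D.
z_train, z_test, c_train, c_test = train_test_split(
    z, c, test_size=0.5, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(z_train, c_train)

# (3) Measure accuracy of g on the held-out half.
acc = probe.score(z_test, c_test)
print(f"probe accuracy: {acc:.2f}")   # high accuracy => "C is known better"
```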

Jasmine: the hope for probing?

Kai: abstract concepts, e.g. "beautiful"?

Claire: human-imperceptible features??

Isaac: human framing?

3 of 18

Classic insights from probes: machine translation

Credit: Yonatan Belinkov (read his authoritative surveys of probing methods)

Armita: TCAV expand to LLMs? Where should concept examples come from?

4 of 18

Machine Translation: Morphology

Concept example: Is the word Uppercase? Plural? Past-tense?

Why is probe accuracy dropping at deeper layers?

5 of 18

Machine Translation: Syntactic Relations

Concept example: does the word have a dependency on another word?

Why is probe accuracy rising at deeper layers?

6 of 18

Probing for past tokens

does “mitt” encode “mitt”?

does “mitt” encode “inter”?

does “mitt” encode “the”?

does “mitt” encode “ent”?

Wait: why are results so different when repeating this experiment at the last token of a multi-token word?
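The past-token probes above can be sketched as follows (synthetic hidden states that carry a noisy copy of the previous token's embedding; in a real experiment h comes from the transformer):

```python
# Sketch: probe the hidden state at position t for the token at position t-1.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
vocab, d, n = 10, 32, 2000

prev_tokens = rng.integers(0, vocab, size=n)   # token identities at t-1
emb = rng.normal(size=(vocab, d))              # toy token embeddings

# Assume each hidden state retains a noisy copy of the previous token's embedding.
h = emb[prev_tokens] + 0.5 * rng.normal(size=(n, d))

# Multiclass probe: recover the previous token's identity from h.
probe = LogisticRegression(max_iter=2000).fit(h[:1000], prev_tokens[:1000])
acc = probe.score(h[1000:], prev_tokens[1000:])
print(f"previous-token probe accuracy: {acc:.2f}")   # chance would be 0.10
```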

7 of 18

Discussion about probing

Why would a paper use probes?

  • Isaac: is probing second-best?

  • Haoyu: “use” vs correlation?
  • Ananya: “causal” probes?

Tishby’s “Information Bottleneck”

8 of 18

TCAV adds a bit of causality

TCAV doesn’t ask about the accuracy of the probe.

Instead, it asks what effect perturbations along the probe direction have on the outputs.

Gather a set of behaviors of interest (zebra predictions)

Ask: how does the output change if the representation is tweaked in the direction of the probe?
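A rough numpy sketch of this idea (not the TCAV paper's implementation: the "network head" is a toy ReLU readout, and the CAV is approximated by a difference of class means rather than a trained linear classifier):

```python
# Toy TCAV-style computation (illustrative names; not the paper's API).
import numpy as np

rng = np.random.default_rng(0)
d = 32

# Toy "rest of the network": a fixed ReLU readout standing in for the
# layers above the probed representation.
w_out = rng.normal(size=d)
def head(z):
    return np.maximum(z, 0.0) @ w_out

# Activations for concept examples vs. random examples.
concept_shift = rng.normal(size=d)
z_concept = rng.normal(size=(200, d)) + concept_shift
z_random = rng.normal(size=(200, d))

# CAV: normal of a linear separator; a difference of class means is used
# here as a cheap stand-in for a trained linear probe.
cav = z_concept.mean(axis=0) - z_random.mean(axis=0)
cav /= np.linalg.norm(cav)

# For the behavior of interest, check how often the output rises when the
# representation is tweaked in the CAV direction.
z_behavior = rng.normal(size=(500, d))
eps = 0.1
sensitivity = head(z_behavior + eps * cav) - head(z_behavior)
tcav_score = float((sensitivity > 0).mean())
print(f"TCAV score: {tcav_score:.2f}")
```

The score is the fraction of behavior-of-interest inputs whose output increases under the concept-direction nudge.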

Rice: spurious features?

9 of 18

Anecdote: the Othello Paper

[Kenneth Li 2023 “Emergent World” - Martin Wattenberg’s student]


10 of 18

The setup: Legal Othello Sequences

[Vaswani 2017, Radford 2018]

[Figure: a transformer, with per-token hidden states h_i^(l), attention, and MLP blocks, consuming the legal move sequence D3 C5 B6 E3 F5 E6 F4 D2 D6]

11 of 18

We asked: do transformers learn�a “world model” of Othello?

8-layer network

Train one on valid games

> 99% accuracy on the task

12 of 18

How to ask: The Probing Setup

[Figure: the same move sequence D3 C5 B6 E3 F5 E6 F4 D2 D6 fed to the trained transformer]

Task for the F5 state probe: “is F5 white/black/empty?”
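As a sketch of this probing task (entirely synthetic data and shapes; the real probe reads Othello-GPT's hidden states), a 3-way classifier predicts F5's state. An sklearn MLPClassifier stands in for the nonlinear probe:

```python
# Sketch of a board-square probe: hidden state -> {white, black, empty}.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n, d = 1500, 128
state = rng.integers(0, 3, size=n)   # 0=white, 1=black, 2=empty (F5's state)

# Synthetic hidden states whose distribution depends on F5's state.
basis = rng.normal(size=(3, d))
h = np.abs(rng.normal(size=(n, d))) * basis[state] + 0.3 * rng.normal(size=(n, d))

# Nonlinear (MLP) probe, trained on 1000 examples, tested on the rest.
probe = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
probe.fit(h[:1000], state[:1000])
acc = probe.score(h[1000:], state[1000:])
print(f"F5 state probe accuracy: {acc:.2f}")   # chance would be ~0.33
```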

13 of 18

The Probing Results (linear was bad)!

[Figure: reported probe accuracies across conditions: 98%, 79%, 91%, 76%]

We needed to use MLPs to make the point

What is wrong with this?

14 of 18

Discussion: What went wrong in Othello

Courtney: optimal probe complexity?

Luze: simple probes enough?

Arya: complexity? Memorization?

Neel Nanda blog post

15 of 18

Hewitt’s Control Tasks: a simple sanity check

  1. Define a “meaningless” concept C’ whose labels boil down to random memorization. That is the control task.
  2. Train a probe on your real concept C, and train a same-capacity probe on C’.
  3. Selectivity is the accuracy gap between the two.
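The control-task comparison can be sketched numerically. One simplification to flag: here the control labels are random per example, whereas Hewitt assigns random labels per word type so that memorization is possible; `selectivity` is in both cases the accuracy gap:

```python
# Sketch of a control-task selectivity measurement on synthetic activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 64

# Activations that genuinely encode the real concept C.
direction = rng.normal(size=d)
c_real = rng.integers(0, 2, size=n)
z = rng.normal(size=(n, d)) + np.outer(c_real, direction)

# Control task C': random labels with no relation to the activations.
c_control = rng.integers(0, 2, size=n)

def probe_acc(labels):
    """Train a same-capacity probe and score it on held-out data."""
    probe = LogisticRegression(max_iter=1000).fit(z[:1000], labels[:1000])
    return probe.score(z[1000:], labels[1000:])

acc_real, acc_control = probe_acc(c_real), probe_acc(c_control)
selectivity = acc_real - acc_control
print(f"real={acc_real:.2f} control={acc_control:.2f} selectivity={selectivity:.2f}")
```

A selective probe does well on C but no better than chance on C', so the gap is large.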

Grace: way to choose control task?

16 of 18

Why and when lower accuracy can be better

Hewitt: Accuracy is not the point of a probe.

Shuyi: Is 'Selectivity' fair?

Guangyuan: Regularization enough?

17 of 18

Hewitt’s probing takeaways

What about regularization?

Dropout?

Bigger models?

Bigger data sets?

Why are all the green lines

  • lower in accuracy?

  • higher in selectivity?

Jeseba: low selectivity meaning?

18 of 18

Probing for demographic concepts

Ayush: Can CAVs identify hidden biases?

Yunus: Probing ideology, personality?

Yida Chen: Designing a Dashboard https://arxiv.org/abs/2406.07882v3