Probing
Detecting the presence of information
Neural Mechanics Spring 2026
Basic probing: Does z=f(x) “know” concept C?
Three steps for classic probing (TCAV illustrations)
(1) Define concept C with classification data D = {x: c}
(2) Train a probe g(z) = c, i.e. g(f(x)) → c, on half of D
(3) Measure accuracy of g on the held-out half of D
Higher accuracy = “C is known better”
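A minimal sketch of the three probing steps, under stated assumptions: the representations z = f(x) are synthetic (no real network f is run), the concept shifts z along one hidden direction, and the probe g is a least-squares linear classifier. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Concept dataset D = {x: c}. ASSUMPTION: we skip f and synthesize z = f(x)
# directly, with the concept shifting z along one hidden direction.
n, d = 400, 16
c = rng.integers(0, 2, size=n)                      # concept labels, 0 or 1
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
z = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * c - 1, direction)

def add_bias(a):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([a, np.ones((len(a), 1))])

# (2) Train a linear probe g on half of D (least squares on +/-1 targets).
half = n // 2
w, *_ = np.linalg.lstsq(add_bias(z[:half]), 2.0 * c[:half] - 1.0, rcond=None)

# (3) Measure accuracy of g on the held-out half.
pred = (add_bias(z[half:]) @ w > 0).astype(int)
accuracy = float((pred == c[half:]).mean())
print(f"held-out probe accuracy: {accuracy:.3f}")
```

High held-out accuracy here only says the concept is linearly decodable from z; whether the model uses it is a separate question.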
Jasmine: the hope for probing?
Kai: abstract concepts, e.g. "beautiful"?
Claire: human-imperceptible features??
Isaac: human framing?
Classic insights from probes: machine translation
Credit: Yonatan Belinkov (read his authoritative surveys of probing methods)
Armita: TCAV expand to LLMs? Where should concept examples come from?
Machine Translation: Morphology
Concept example: Is the word Uppercase? Plural? Past-tense?
Why is probe accuracy dropping at deeper layers?
Machine Translation: Syntactic Relations
Concept example: does the word have a dependency on another word?
Why is probe accuracy rising at deeper layers?
Probing for past tokens
does “mitt” encode “mitt”?
does “mitt” encode “inter”?
does “mitt” encode “the”?
does “mitt” encode “ent”?
Wait: why are results so different when repeating this experiment at the last token of a multi-token word?
Feucht 2024 https://footprints.baulab.info
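A small sketch of how the probe training pairs for the questions above could be built: pair the hidden state at position t with the token at position t + offset. The tokenization "the / inter / mitt / ent" is an assumption read off the slide, and the hidden states are string placeholders, not real activations.

```python
def make_offset_pairs(tokens, hidden_states, offset):
    """Pair the hidden state at position t with the token at position t + offset."""
    pairs = []
    for t, h in enumerate(hidden_states):
        target = t + offset
        if 0 <= target < len(tokens):       # skip positions whose target falls off the sequence
            pairs.append((h, tokens[target]))
    return pairs

# ASSUMPTION: illustrative tokenization; placeholders stand in for activation vectors.
tokens = ["the", "inter", "mitt", "ent"]
hidden_states = [f"h_{tok}" for tok in tokens]

# What is the probe asked to predict from the hidden state at "mitt"?
targets_at_mitt = {}
for offset in (0, -1, -2, 1):
    pairs = dict(make_offset_pairs(tokens, hidden_states, offset))
    targets_at_mitt[offset] = pairs["h_mitt"]

print(targets_at_mitt)
```

One probe is trained per offset (and per layer), so the accuracies can be compared across offsets.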
Discussion about probing
Why would a paper use probes?
Tishby’s “Information Bottleneck”
TCAV adds a bit of causality
TCAV doesn’t ask about the accuracy of the probe.
Instead, it asks what effect perturbations have on outputs.
Gather a set of behaviors of interest (zebra predictions)
Ask: how does the output change if the representation is tweaked in the direction of the probe?
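A hedged sketch of that question in code. Assumptions: the concept direction `cav` and the head `zebra_logit` are hypothetical stand-ins (a real setup would use a trained probe direction and the actual downstream layers), and the sensitivity is a finite difference rather than the gradient-based directional derivative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# ASSUMPTION: hypothetical concept direction (e.g. from a "striped" probe), unit length.
cav = rng.normal(size=d)
cav /= np.linalg.norm(cav)

# ASSUMPTION: toy stand-in for the rest of the network, representation -> zebra logit.
w_out = 2.0 * cav + 0.1 * rng.normal(size=d)

def zebra_logit(z):
    return np.tanh(z @ w_out)

def sensitivity(z, eps=1e-3):
    """Finite-difference change in the logit when z is nudged along the concept direction."""
    return (zebra_logit(z + eps * cav) - zebra_logit(z)) / eps

# Representations for the behavior of interest (inputs predicted as zebra).
zebra_reps = rng.normal(size=(100, d))
sens = sensitivity(zebra_reps)

# Score: fraction of examples whose logit increases when pushed toward the concept.
tcav_score = float((sens > 0).mean())
print(f"TCAV-style score: {tcav_score:.2f}")
```

The score depends on the sign of the effect on the output, not on how accurately the probe classifies the concept.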
Rice: spurious features?
Anecdote: the Othello Paper
[Kenneth Li 2023 “Emergent World” - Martin Wattenberg’s student]
[Figure: move sequence D3, C5, B6, E3, F5, E6, F4, D2, D6; diagram labels: h_i^(l) state, attention, MLP]
The setup: Legal Othello Sequences
[Vaswani 2017, Radford 2018]
We asked: do transformers learn a “world model” of Othello?
8-layer network
Train one on valid games
> 99% accuracy on the task
[Figure: move sequence D3, C5, B6, E3, F5, E6, F4, D2, D6]
How to ask: The Probing Setup
Task for the F5 state probe: “is F5 white/black/empty?”
The Probing Results (linear was bad)!
[Figure labels: Acc: 98%, Acc: 79%, Acc: 91%, Acc: 76%]
We needed to use MLPs to make the point
What is wrong with this?
Discussion: What went wrong in Othello
Courtney: optimal probe complexity?
Luze: simple probes enough?
Arya: complexity? Memorization?
Neel Nanda blog post
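The Neel Nanda blog post is usually summarized as showing that linear probes do work on the Othello model once the target is "mine vs. theirs" (relative to the player to move) rather than "black vs. white". The toy below illustrates why that change of framing can matter. Everything in it is an assumption: the representation is synthetic, not Othello-GPT activations, empty squares are ignored, and the probe is a least-squares linear classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 16

# ASSUMPTION: the representation stores one square as "mine" (+1) vs "theirs" (-1)
# relative to the player to move, plus a separate feature for who is to move.
mine = rng.choice([-1.0, 1.0], size=n)
black_to_move = rng.choice([-1.0, 1.0], size=n)
u, v = np.linalg.qr(rng.normal(size=(d, 2)))[0].T      # two orthonormal directions
z = np.outer(mine, u) + np.outer(black_to_move, v) + 0.3 * rng.normal(size=(n, d))

# The square is black when (mine and black to move) or (theirs and white to move):
# an XOR of two features, which no single linear direction can separate.
is_black = mine * black_to_move

def linear_probe_accuracy(z, target):
    """Fit a least-squares linear probe on the first half, score on the second half."""
    half = len(z) // 2
    X = np.hstack([z, np.ones((len(z), 1))])
    w, *_ = np.linalg.lstsq(X[:half], target[:half], rcond=None)
    return float((np.sign(X[half:] @ w) == target[half:]).mean())

acc_black = linear_probe_accuracy(z, is_black)
acc_mine = linear_probe_accuracy(z, mine)
print(f"linear probe, black/white: {acc_black:.2f}")
print(f"linear probe, mine/theirs: {acc_mine:.2f}")
```

In this toy, a poor linear probe reflects the human choice of labels, not missing information in the representation.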
Hewitt’s Control Tasks: a simple sanity check
Grace: way to choose control task?
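A minimal sketch of a control task in the sense of Hewitt's sanity check: every word type gets one fixed random label regardless of context, so a probe can only do well on it by memorizing. Selectivity is the real-task accuracy minus the control-task accuracy. The sentence, label count, and accuracy numbers below are hypothetical.

```python
import random

def make_control_labels(words, num_labels, seed=0):
    """Assign each word TYPE one fixed random label, ignoring context."""
    rng = random.Random(seed)
    label_of_type = {}
    for w in words:
        if w not in label_of_type:
            label_of_type[w] = rng.randrange(num_labels)
    return [label_of_type[w] for w in words]

def selectivity(task_accuracy, control_accuracy):
    """How much better the probe does on the real task than on the memorization task."""
    return task_accuracy - control_accuracy

# ASSUMPTION: toy sentence and label count, for illustration only.
words = "the cat sat on the mat because the cat was tired".split()
control = make_control_labels(words, num_labels=5)
print(list(zip(words, control)))

# Hypothetical accuracies: a high-capacity probe vs. a simpler one.
print(selectivity(0.97, 0.92))   # accurate but barely selective
print(selectivity(0.95, 0.70))   # slightly less accurate, far more selective
```

A probe with high control-task accuracy has enough capacity to memorize word identities, so its real-task accuracy says less about the representation.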
Why and when lower accuracy can be better
Hewitt: Accuracy is not the point of a probe.
Shuyi: Is 'Selectivity' fair?
Guangyuan: Regularization enough?
Hewitt’s probing takeaways
What about regularization?
Dropout?
Bigger models?
Bigger data sets?
Why are all the green lines
lower in accuracy?
higher in selectivity?
Jeseba: low selectivity meaning?
Probing for demographic concepts
Ayush: Can CAVs identify hidden biases?
Yunus: Probing ideology, personality
Yida Chen: Designing a Dashboard https://arxiv.org/abs/2406.07882v3