LLMs & Interpretability:
Current research & beyond
Chandan Singh
Microsoft Research
What restricts the “context length”?
What are the nonlinearities?
What changes in a different modality (e.g. audio)?
Where are most of the parameters (MoE)?
Where is most of the computational cost?
KV-caching reduces the computation for each new token by reusing keys/values from earlier positions
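A minimal sketch of the KV-cache idea, assuming single-head attention with no projection matrices or batching (all names here are illustrative): keys/values for past tokens are cached, so each step computes attention for only the one new query.
```python
import numpy as np

def attend(q, K, V):
    # q: (d,), K/V: (t, d); softmax attention over all cached positions
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d, cache_K, cache_V = 16, [], []
rng = np.random.default_rng(0)
for step in range(5):                     # one new token per step
    x = rng.normal(size=d)                # new token's hidden state
    q, k, v = x, x, x                     # stand-ins for Wq@x, Wk@x, Wv@x
    cache_K.append(k); cache_V.append(v)  # O(1) append instead of recomputing
    out = attend(q, np.array(cache_K), np.array(cache_V))  # K, V for all t tokens
```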
Challenge: Can interpretation methods help us steer, edit, prune, or generally improve models reliably?
Prompting is a very strong baseline
Decomposing and Editing Predictions by Modeling Model Computation (Shah, Ilyas, Madry, 2024)
The Hydra Effect: Emergent Self-repair in Language Model Computations (McGrath, Rahtz, Kramar, Mikulik, Legg, 2023)
Outputs can be rewritten as a sum of compounding layer / attn. head contributions
Individual units can be interpretable
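A toy sketch of this decomposition (random linear blocks stand in for attention heads / MLPs; shapes are illustrative): because every block adds its output into the residual stream, the final state is exactly the embedding plus a sum of per-block contributions.
```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 8, 4
x0 = rng.normal(size=d)                       # token embedding
blocks = [lambda h, W=rng.normal(size=(d, d)) * 0.1: W @ h
          for _ in range(n_layers)]           # stand-ins for attn / MLP blocks

h, contributions = x0, []
for block in blocks:
    delta = block(h)                          # what this block writes to the stream
    contributions.append(delta)
    h = h + delta                             # residual connection

# final state decomposes exactly into embedding + per-block contributions
assert np.allclose(h, x0 + np.sum(contributions, axis=0))
```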
Outputs can be rewritten as a sum of SAE contributions, making things seem interpretable
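A minimal sketch of the SAE view, assuming an already-trained ReLU encoder/decoder (random weights stand in here): the reconstructed activation is a sparse sum of dictionary directions.
```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 64                                  # activation dim, dictionary size
W_enc, W_dec = rng.normal(size=(m, d)), rng.normal(size=(d, m))
b_enc = rng.normal(size=m)

h = rng.normal(size=d)                         # an LLM activation
f = np.maximum(0, W_enc @ h + b_enc)           # sparse feature activations
h_hat = W_dec @ f                              # reconstruction
# h_hat is a sum over active features of f_i * (dictionary direction i)
h_hat_sum = sum(f[i] * W_dec[:, i] for i in np.nonzero(f)[0])
assert np.allclose(h_hat, h_hat_sum)
```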
SAE critiques
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders (Wu...Jurafsky, Manning, Potts, 2025)
SAEs Can Interpret Randomly Initialized Transformers (Heap...Aitchison, 2025)
SAEs Trained on the Same Data Learn Different Features (Paulo & Belrose, 2025)
Intermediate representations are generally aligned, enabling some hacky interpretability
Logit lens (nostalgebraist, 2020; Dar, ..., Berant, 2022)
Tuned lens (Belrose...Steinhardt, 2023)
Future lens (Pal...Wallace, Bau, 2023)
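A hedged sketch of the logit lens: project each layer's residual-stream state through the final unembedding and see which token it already predicts. Shapes and the `ln` normalization stand-in are illustrative; real implementations hook a trained model's hidden states.
```python
import numpy as np

def logit_lens(hidden_states, W_U, ln=lambda h: h / np.linalg.norm(h)):
    # hidden_states: list of (d,) residual-stream states, one per layer
    # W_U: (vocab, d) unembedding; ln: stand-in for the final LayerNorm
    return [np.argmax(W_U @ ln(h)) for h in hidden_states]

rng = np.random.default_rng(0)
d, vocab, n_layers = 8, 100, 6
W_U = rng.normal(size=(vocab, d))
states = [rng.normal(size=d) for _ in range(n_layers)]
print(logit_lens(states, W_U))  # top token per layer: watch the prediction "form"
```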
ROME: Locating and Editing Factual Associations in GPT (meng, bau, andonian, & belinkov, 2022)
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models (Ghandeharioun, Caciularu, Pearce, Dixon, Geva, 2024)
LLMs/transformers can provide new, actionable info from data
If we are careful…
Concept-based models convert unstructured data into tabular datasets; we can then perform standard tabular analysis
Classification examples, with LLM answers to each concept question and the ground-truth (GT) label:

| Classification examples | Is the movie's dialogue gripping? {yes/no} | The soundtrack is {good/bad} | … | The plot is {fun/dull} | GT label |
| A boring film that contains no wit, only labored… | – | – | … | + | + |
| The greatest score and greatest plot of the year… | – | + | … | + | – |
| … | … | … | … | … | … |
| Was laughing throughout the entire movie… | + | + | … | – | + |
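A sketch of this featurization step; `ask_llm` is a hypothetical stub for whatever chat API is available, and each yes/no answer becomes one column of a tabular dataset.
```python
def ask_llm(text: str, question: str) -> bool:
    # hypothetical stub: call an LLM and parse its yes/no answer
    raise NotImplementedError

questions = [
    "Is the movie's dialogue gripping?",
    "Is the soundtrack good?",
    "Is the plot fun?",
]
reviews = ["A boring film that contains no wit, only labored...",
           "Was laughing throughout the entire movie..."]

# rows = examples, columns = concepts; fit any interpretable tabular model on X
X = [[int(ask_llm(r, q)) for q in questions] for r in reviews]
```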
Concept bottleneck models (Koh et al. 2020)
Post-hoc CBMs (Yuksekgonul et al. 2022)
CHiLL (McInerney et al. 2023)
Bayesian CBMs (Feng et al. 2024)
FunSearch: Mathematical discoveries from program search with LLMs (DeepMind, 2023)
D3 (Zhong, Snell, Klein, & Steinhardt, 2022) - finetune an LLM to directly describe the difference between two text distributions
iPrompt (Singh*, Morris*, ...Gao, 2022) - iteratively generate and test explanations for describing outputs from inputs
Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero (Schut et al. 2023)
TalkToEBM: LLMs Understand Glass-Box Models, Discover Surprises, and Suggest Repairs (Lengerich et al. 2023)
Data Formulator (Wang et al. 2025)
TabPFN (Hollmann et al. 2025)
GAMformer: In-Context Learning for GAMs (Mueller et al. 2024)
Learning a Decision Tree Algorithm with Transformers (Zhuang et al. 2024)
Follow-ups using LLMs for feature engineering…
Text is hard for decision trees / linear models
[Figure: a decision tree splitting on word indicators ("good", "great", "not bad") with yes/no branches and +/– leaves, applied to "This movie was awful"; bag-of-words text features are sparse and high-dimensional]
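A small illustration of the problem, using scikit-learn's CountVectorizer: the bag-of-words matrix is sparse and high-dimensional, and negations like "not bad" are split into separate tokens.
```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["This movie was awful", "not bad", "good", "great"]
vec = CountVectorizer()
X = vec.fit_transform(reviews)        # sparse matrix: n_reviews x vocab_size
print(X.shape, X.nnz)                 # most entries are zero
print(vec.get_feature_names_out())    # unigrams only: "not bad" -> "not", "bad"
```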
Augmenting interpretable models with large language models during training
(Singh, Askari, Caruana, & Gao, 2022)
Example keyphrases and their LLM-generated expansions:
not funny → lackluster writing, bad jokes, …
bad → terrible, dull, not good
scenic → artistic, picturesque, …
very funny → hilarious, ROFL, …
interesting sci-fi → futuristic, prescient, …
stellar plot → thrilling, gripping, plot-twist, …
great actor → tom hanks, leonardo dicaprio, …
Keyphrase expansion: screen candidate keyphrases by performance (e.g. keep "bad" among "terrible, dull, not good"), then have an LLM generate similar keyphrases: "terrible, awful, not good, nasty, unpleasant, …, icky, horrendous"
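A sketch of the expansion step; `generate` is a hypothetical stub for an LLM completion call, and the prompt wording is illustrative.
```python
def generate(prompt: str) -> str:
    # hypothetical stub for an LLM completion call
    raise NotImplementedError

def expand_keyphrase(phrase: str, n: int = 8) -> list[str]:
    prompt = f'List {n} short phrases similar in meaning to "{phrase}", comma-separated:'
    return [p.strip() for p in generate(prompt).split(",")]

# e.g. expand_keyphrase("bad") -> ["terrible", "awful", "not good", ...]
```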
Emerging issue: models struggle to handle long or rich contexts
Stopgap solutions
RAG
“Agents”
2020: test-time training (TTT): finetune the backbone per task / example before predicting
[Figures: test error (%) on CIFAR-10-C with corruptions; test error (%) on CIFAR-10-C for the three noise types, with gradually changing distribution]
2024: test-time training as a layer inside a sequence model
Single TTT layer: inner loop
General idea: at each token position, the inner weights are updated by predicting the current token, helping the model adapt to its context
f is linear or a 2-layer MLP; W_t is updated at each token position to minimize a token-level loss: W_t = W_{t-1} − η ∇ℓ(W_{t-1}; x_t)
The loss reconstructs a "corrupted version" x̃_t of the token input: ℓ(W; x_t) = ‖f(x̃_t; W) − x_t‖²
Single TTT layer: outer loop
Choosing a "corrupted version" is hard. We can instead metalearn low-rank projections θ_K, θ_V, θ_Q (shared across sequences) for inputs, labels, and outputs:
ℓ(W; x_t) = ‖f(θ_K x_t; W) − θ_V x_t‖², with input view θ_K x_t and label θ_V x_t; the output is z_t = f(θ_Q x_t; W_t)
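A minimal numpy sketch of the inner loop for linear f with the metalearned views above (dimensions, learning rate, and the zero initialization are illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
d, k, lr = 16, 4, 0.1
W = np.zeros((k, k))                       # inner weights W_t for linear f
theta_K, theta_V, theta_Q = (rng.normal(size=(k, d)) for _ in range(3))

def ttt_step(W, x):
    train_view, label_view = theta_K @ x, theta_V @ x
    resid = W @ train_view - label_view    # f(theta_K x; W) - theta_V x
    grad = np.outer(resid, train_view)     # gradient of 0.5 * ||resid||^2 wrt W
    W = W - lr * grad                      # one inner-loop gradient step
    return W, W @ (theta_Q @ x)            # output z_t = f(theta_Q x; W_t)

for t in range(8):                         # scan over token positions
    W, z_t = ttt_step(W, rng.normal(size=d))
```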
TTT-Linear has low perplexity vs FLOP usage
TTT-Linear does well in later positions in long contexts
Evaluations range from 125M to 1.3B parameters on the Pile and Books3 (a Pile subset)