LLMs & Interpretability:
Current research & beyond
Chandan Singh
Microsoft Research
What restricts the “context length”?
What are the nonlinearities?
What changes in a different modality (e.g. audio)?
Where are most of the parameters (MoE)?
Where is most of the computational cost?
KV-caching reduces the computation for each new token by reusing keys/values from earlier positions
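A minimal sketch of the KV-cache idea, assuming single-head attention with no projection matrices or batching (all names here are illustrative): keys/values for past tokens are cached, so each step computes attention for only the one new query.
```python
import numpy as np

def attend(q, K, V):
    # q: (d,), K/V: (t, d); softmax attention over all cached positions
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d, cache_K, cache_V = 16, [], []
rng = np.random.default_rng(0)
for step in range(5):                     # one new token per step
    x = rng.normal(size=d)                # new token's hidden state
    q, k, v = x, x, x                     # stand-ins for Wq@x, Wk@x, Wv@x
    cache_K.append(k); cache_V.append(v)  # O(1) append instead of recomputing
    out = attend(q, np.array(cache_K), np.array(cache_V))  # K, V for all t tokens
```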
Challenge: Can interpretation methods help us steer, edit, prune, or generally improve models reliably?
Prompting is a very strong baseline
Decomposing and Editing Predictions by Modeling Model Computation (Shah, Ilyas, Madry, 2024)
The Hydra Effect: Emergent Self-repair in Language Model Computations (McGrath, Rahtz, Kramar, Mikulik, Legg, 2023)
Outputs can be rewritten as a sum of compounding layer / attn. head contributions
Individual units can be interpretable
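A toy sketch of this decomposition (random linear blocks stand in for attention heads / MLPs; shapes are illustrative): because every block adds its output into the residual stream, the final state is exactly the embedding plus a sum of per-block contributions.
```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 8, 4
x0 = rng.normal(size=d)                       # token embedding
blocks = [lambda h, W=rng.normal(size=(d, d)) * 0.1: W @ h
          for _ in range(n_layers)]           # stand-ins for attn / MLP blocks

h, contributions = x0, []
for block in blocks:
    delta = block(h)                          # what this block writes to the stream
    contributions.append(delta)
    h = h + delta                             # residual connection

# final state decomposes exactly into embedding + per-block contributions
assert np.allclose(h, x0 + np.sum(contributions, axis=0))
```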
Outputs can be rewritten as a sum of SAE contributions, making things seem interpretable
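A minimal sketch of the SAE view, assuming an already-trained ReLU encoder/decoder (random weights stand in here): the reconstructed activation is a sparse sum of dictionary directions.
```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 64                                  # activation dim, dictionary size
W_enc, W_dec = rng.normal(size=(m, d)), rng.normal(size=(d, m))
b_enc = rng.normal(size=m)

h = rng.normal(size=d)                         # an LLM activation
f = np.maximum(0, W_enc @ h + b_enc)           # sparse feature activations
h_hat = W_dec @ f                              # reconstruction
# h_hat is a sum over active features of f_i * (dictionary direction i)
h_hat_sum = sum(f[i] * W_dec[:, i] for i in np.nonzero(f)[0])
assert np.allclose(h_hat, h_hat_sum)
```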
SAE critiques
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders (Wu...Jurafsky, Manning, Potts, 2025)
SAEs Can Interpret Randomly Initialized Transformers (Heap...Aitchison, 2025)
SAEs Trained on the Same Data Learn Different Features (Paulo & Belrose, 2025)
Intermediate representations are generally aligned, enabling some hacky interpretability
Logit lens (nostalgebraist, 2020; Dar, ..., Berant, 2022)
Tuned lens (Belrose...Steinhardt, 2023)
Future lens (Pal...Wallace, Bau, 2023)
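A hedged sketch of the logit lens: project each layer's residual-stream state through the final unembedding and see which token it already predicts. Shapes and the `ln` normalization stand-in are illustrative; real implementations hook a trained model's hidden states.
```python
import numpy as np

def logit_lens(hidden_states, W_U, ln=lambda h: h / np.linalg.norm(h)):
    # hidden_states: list of (d,) residual-stream states, one per layer
    # W_U: (vocab, d) unembedding; ln: stand-in for the final LayerNorm
    return [np.argmax(W_U @ ln(h)) for h in hidden_states]

rng = np.random.default_rng(0)
d, vocab, n_layers = 8, 100, 6
W_U = rng.normal(size=(vocab, d))
states = [rng.normal(size=d) for _ in range(n_layers)]
print(logit_lens(states, W_U))  # top token per layer: watch the prediction "form"
```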
ROME: Locating and Editing Factual Associations in GPT (meng, bau, andonian, & belinkov, 2022)
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models (Ghandeharioun, Caciularu, Pearce, Dixon, Geva, 2024)
LLMs/transformers can provide new, actionable info from data
If we are careful…
Concept-based models convert unstructured data into tabular datasets; we can then perform standard tabular analysis
Classification examples, with LLM answers to each concept question and the ground-truth (GT) label:

| Classification examples | Is the movie's dialogue gripping? {yes/no} | The soundtrack is {good/bad} | … | The plot is {fun/dull} | GT label |
| A boring film that contains no wit, only labored… | – | – | … | + | + |
| The greatest score and greatest plot of the year… | – | + | … | + | – |
| … | … | … | … | … | … |
| Was laughing throughout the entire movie… | + | + | … | – | + |
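A sketch of this featurization step; `ask_llm` is a hypothetical stub for whatever chat API is available, and each yes/no answer becomes one column of a tabular dataset.
```python
def ask_llm(text: str, question: str) -> bool:
    # hypothetical stub: call an LLM and parse its yes/no answer
    raise NotImplementedError

questions = [
    "Is the movie's dialogue gripping?",
    "Is the soundtrack good?",
    "Is the plot fun?",
]
reviews = ["A boring film that contains no wit, only labored...",
           "Was laughing throughout the entire movie..."]

# rows = examples, columns = concepts; fit any interpretable tabular model on X
X = [[int(ask_llm(r, q)) for q in questions] for r in reviews]
```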
Concept bottleneck models (Koh et al. 2020)
Post-hoc CBMs (Yuksekgonul et al. 2022)
CHiLL (McInerney et al. 2023)
Bayesian CBMs (Feng et al. 2024)
FunSearch: Mathematical discoveries from program search with LLMs (DeepMind, 2023)
D3 (Zhong, Snell, Klein, & Steinhardt, 2022) - finetune an LLM to directly describe the difference between two text distributions
iPrompt (Singh*, Morris*, ...Gao, 2022) - iteratively generate and test explanations for describing outputs from inputs
Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero (Schut et al. 2023)
TalkToEBM: LLMs Understand Glass-Box Models, Discover Surprises, and Suggest Repairs (Lengerich et al. 2023)
Data Formulator (Wang et al. 2025)
TabPFN (Hollmann et al. 2025)
GAMformer: In-Context Learning for GAMs (Mueller et al. 2024)
Learning a Decision Tree Algorithm with Transformers (Zhuang et al. 2024)
Follow-ups using LLMs for feature engineering…
Text is hard for decision trees / linear models
[Figure: a decision tree splitting on word indicators ("good", "great", "not bad") with yes/no branches and +/– leaves, applied to "This movie was awful"; bag-of-words text features are sparse and high-dimensional]
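A small illustration of the problem, using scikit-learn's CountVectorizer: the bag-of-words matrix is sparse and high-dimensional, and negations like "not bad" are split into separate tokens.
```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["This movie was awful", "not bad", "good", "great"]
vec = CountVectorizer()
X = vec.fit_transform(reviews)        # sparse matrix: n_reviews x vocab_size
print(X.shape, X.nnz)                 # most entries are zero
print(vec.get_feature_names_out())    # unigrams only: "not bad" -> "not", "bad"
```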
Augmenting interpretable models with large language models during training
(Singh, Askari, Caruana, & Gao, 2022)
Example keyphrases and their LLM-generated expansions:
not funny → lackluster writing, bad jokes, …
bad → terrible, dull, not good
scenic → artistic, picturesque, …
very funny → hilarious, ROFL, …
interesting sci-fi → futuristic, prescient, …
stellar plot → thrilling, gripping, plot-twist, …
great actor → tom hanks, leonardo dicaprio, …
Keyphrase expansion: screen candidate keyphrases by performance (e.g. keep "bad" among "terrible, dull, not good"), then have an LLM generate similar keyphrases: "terrible, awful, not good, nasty, unpleasant, …, icky, horrendous"
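A sketch of the expansion step; `generate` is a hypothetical stub for an LLM completion call, and the prompt wording is illustrative.
```python
def generate(prompt: str) -> str:
    # hypothetical stub for an LLM completion call
    raise NotImplementedError

def expand_keyphrase(phrase: str, n: int = 8) -> list[str]:
    prompt = f'List {n} short phrases similar in meaning to "{phrase}", comma-separated:'
    return [p.strip() for p in generate(prompt).split(",")]

# e.g. expand_keyphrase("bad") -> ["terrible", "awful", "not good", ...]
```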
Emerging issue: models struggle to handle long or rich contexts
Stopgap solutions
RAG
“Agents”
2020: test-time training (TTT): finetune the backbone per task / example before predicting
[Figures: test error (%) on CIFAR-10-C with corruptions; test error (%) on CIFAR-10-C for the three noise types, with gradually changing distribution]
2024: test-time training as a layer inside a sequence model
Single TTT layer: inner loop
General idea: at each token position, the inner weights are updated by predicting the current token, helping the model adapt to its context
f is linear or a 2-layer MLP; W_t is updated at each token position to minimize a token-level loss: W_t = W_{t-1} − η ∇ℓ(W_{t-1}; x_t)
The loss reconstructs a "corrupted version" x̃_t of the token input: ℓ(W; x_t) = ‖f(x̃_t; W) − x_t‖²
Single TTT layer: outer loop
Choosing a "corrupted version" is hard. We can instead metalearn low-rank projections θ_K, θ_V, θ_Q (shared across sequences) for inputs, labels, and outputs:
ℓ(W; x_t) = ‖f(θ_K x_t; W) − θ_V x_t‖², with input view θ_K x_t and label θ_V x_t; the output is z_t = f(θ_Q x_t; W_t)
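A minimal numpy sketch of the inner loop for linear f with the metalearned views above (dimensions, learning rate, and the zero initialization are illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
d, k, lr = 16, 4, 0.1
W = np.zeros((k, k))                       # inner weights W_t for linear f
theta_K, theta_V, theta_Q = (rng.normal(size=(k, d)) for _ in range(3))

def ttt_step(W, x):
    train_view, label_view = theta_K @ x, theta_V @ x
    resid = W @ train_view - label_view    # f(theta_K x; W) - theta_V x
    grad = np.outer(resid, train_view)     # gradient of 0.5 * ||resid||^2 wrt W
    W = W - lr * grad                      # one inner-loop gradient step
    return W, W @ (theta_Q @ x)            # output z_t = f(theta_Q x; W_t)

for t in range(8):                         # scan over token positions
    W, z_t = ttt_step(W, rng.normal(size=d))
```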
TTT-Linear has low perplexity vs FLOP usage
TTT-Linear does well in later positions in long contexts
Evaluations range from 125M to 1.3B parameters on the Pile and Books3 (a Pile subset)