Lecture 10
CS 263:
Advanced NLP
Saadia Gabriel
Announcements
Last Time
LLM Evaluation & Benchmarking
How do we determine if we’re making progress?
[Figure: Alice and Bob with two models 🤖 and outputs A and B]
Today
In Fall 2026 I’ll be teaching a new grad-level class that covers these topics in depth
LLM Interpretability
MODEL
What happened in between?
How did components of the input and the model contribute to this decision?
Input:
This movie vibrantly presents a fun reimagining of her life…
Output:
Predicted rating of 5/5
Why We Care
There are a few reasons:
Some models are inherently interpretable…
This means they not only offer an explanation for their decision-making, but this explanation is also faithful to both the outcome and the model internals.
Some models are inherently interpretable…
In a linear regression model, the contribution of each feature is clearly defined through its coefficient βj.
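A one-line illustration of why this holds (the intercept β₀ and feature values x_j are standard notation assumed here, not taken from the slide):

```latex
\hat{y} \;=\; \beta_0 + \sum_{j=1}^{p} \beta_j x_j
\qquad\Rightarrow\qquad
\text{contribution of feature } j \;=\; \beta_j x_j
```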
However, we usually work with much more complex, non-linear models.
Non-causal interventions
An advantage of attention is that it “builds in” interpretability
BertViz Demo
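A minimal sketch of how such a demo can be set up, assuming the `transformers` and `bertviz` packages; the checkpoint and example sentence are placeholders, and the view renders in a Jupyter notebook:

```python
# Sketch: extract and visualize self-attention weights for one sentence.
from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

model_name = "bert-base-uncased"                      # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer("This movie vibrantly presents a fun reimagining of her life",
                   return_tensors="pt")
outputs = model(**inputs)

# outputs.attentions: one (batch, num_heads, seq_len, seq_len) tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(outputs.attentions, tokens)                 # interactive view (in a notebook)
```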
However, attention weights are not always informative as explanation
Jain and Wallace, 2019
Shapley values
LIME
Feature Attribution
How LIME works
LIME fits an interpretable (linear) model to the local region around a specific prediction (shown in red).
Positive class
Negative class
Function f defines decision boundary
Feature Attribution
E.g. f(x=[1,0,1]) = positive, f([1,1,1]) = negative
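A minimal self-contained sketch of the LIME idea over such binary word-presence vectors; the black box `f`, the exponential kernel, and the sample count below are all illustrative stand-ins, not the `lime` library itself:

```python
# Sketch: LIME-style local explanation for one prediction.
# `f` is a stand-in black-box classifier over binary word-presence vectors.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def f(X):
    # Toy non-linear black box: returns P(positive) for each row of X.
    return 1 / (1 + np.exp(-(2 * X[:, 0] - 3 * X[:, 1] * X[:, 2] + 0.5)))

x = np.array([1, 0, 1, 1, 0])             # instance to explain (word present = 1)

# 1. Perturb the instance by randomly dropping words that are present.
Z = rng.integers(0, 2, size=(500, x.size)) * x

# 2. Weight each perturbed sample by its proximity to x (exponential kernel).
dist = np.abs(Z - x).sum(axis=1)
weights = np.exp(-(dist ** 2) / 2.0)

# 3. Fit a weighted linear model locally around x.
local_model = Ridge(alpha=1.0)
local_model.fit(Z, f(Z), sample_weight=weights)

# The coefficients are the per-word attributions for this one prediction.
print(local_model.coef_)
```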
Feature Attribution
How Integrated Gradients work
Works with either visual or textual data!
Feature Attribution
How Integrated Gradients work
Input x
Baseline
Which features of x are important?
From Stanford CS224U
Feature Attribution
How Integrated Gradients work
Interpolate points between x and the baseline, then accumulate gradients wrt these points
From Stanford CS224U
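The underlying definition (x′ is the baseline, F the model, both as on the previous slides); in practice the integral is approximated with an m-step Riemann sum:

```latex
\mathrm{IG}_i(x) \;=\; (x_i - x'_i)\int_{0}^{1}
  \frac{\partial F\big(x' + \alpha\,(x - x')\big)}{\partial x_i}\, d\alpha
\;\approx\; (x_i - x'_i)\,\frac{1}{m}\sum_{k=1}^{m}
  \frac{\partial F\big(x' + \tfrac{k}{m}(x - x')\big)}{\partial x_i}
```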
Feature Attribution
How Integrated Gradients work
https://arxiv.org/pdf/1703.01365
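A minimal PyTorch sketch of that approximation; `f` is assumed to be a batched, differentiable scalar-output model, and the step count and toy model are illustrative:

```python
# Sketch: integrated gradients via the Riemann-sum approximation.
# `f` is assumed to map a batch of inputs to a batch of scalar scores.
import torch

def integrated_gradients(f, x, baseline, steps=50):
    # 1. Interpolate `steps` points on the straight line from baseline to x.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    points = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)

    # 2. Accumulate gradients of the model output w.r.t. these points.
    grads = torch.autograd.grad(f(points).sum(), points)[0]

    # 3. Average the gradients and scale by the input-baseline difference.
    return (x - baseline) * grads.mean(dim=0)

# Toy usage with a linear "model" and an all-zeros baseline.
w = torch.tensor([1.0, -2.0, 3.0])
f = lambda batch: batch @ w
x = torch.tensor([0.5, 0.5, 0.5])
print(integrated_gradients(f, x, torch.zeros_like(x)))   # ≈ w * x for a linear model
```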
Feature Attribution
Drawbacks:
1. May locally explain a model, but not globally explain behavior
2. “Explanations” are sensitive to implementation decisions: in integrated gradients, what is a good baseline?
https://arxiv.org/pdf/1703.01365
Probing Classifiers
- Alain & Bengio (2016)
…thermometers used to measure the temperature simultaneously at many different locations.
Probing classifiers are supervised models trained to predict specific properties from a neural model’s representations. This can provide some evidence as to whether those representations encode information critical for tasks like POS tagging.
Image courtesy of John Hewitt
ŷ = softmax(W h_k + b)
(W,b): probe weights and bias
Probing Classifiers
Target Model
Input: features h_k from hidden layer k
Extract hidden representation
Linear Probe
Output: Ŷ
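A minimal sketch of this pipeline, assuming frozen BERT representations; the checkpoint, layer index, sentences, and binary property below are placeholders rather than a real POS-tagging setup:

```python
# Sketch: train a linear probe on frozen hidden representations h_k.
# The checkpoint, layer index k, and sentence-level labels are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentences = ["the movie was fun", "the plot made no sense",
             "a vibrant reimagining", "a dull and lifeless film"]
labels = [1, 0, 1, 0]                       # toy property to probe for

k = 8                                        # which hidden layer to probe
features = []
with torch.no_grad():
    for s in sentences:
        out = model(**tokenizer(s, return_tensors="pt"))
        h_k = out.hidden_states[k][0]        # (seq_len, hidden_dim) at layer k
        features.append(h_k.mean(dim=0).numpy())   # mean-pool into one vector

# The probe itself: a simple linear classifier on top of the frozen features.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe train accuracy:", probe.score(features, labels))
```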
Where could this go wrong?
Courtesy of John Hewitt
We really can’t say that the representations encode this information for sure, only that they are predictive…
Correlation ≠ Causation
Control representations attempt to isolate the effect of the probe learning a specific function from the property-specific expressiveness of the representation.
These “baseline representations” may be random inputs that show whether the probing classifier can make predictions as effectively (or more effectively) from meaningless noise.
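A minimal self-contained sketch of that comparison, with synthetic “learned” representations that genuinely encode the property versus Gaussian-noise controls (all sizes and numbers here are illustrative):

```python
# Sketch: compare a probe trained on informative representations
# against the same probe trained on random "control" representations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 64
y = rng.integers(0, 2, size=n)

# "Learned" representations: the property is (noisily) encoded in a few dims.
H_real = rng.normal(size=(n, d))
H_real[:, :4] += 2.0 * y[:, None]

# Control representations: pure noise, same shape, no information about y.
H_ctrl = rng.normal(size=(n, d))

for name, H in [("real", H_real), ("control", H_ctrl)]:
    Xtr, Xte, ytr, yte = train_test_split(H, y, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)
    print(f"{name} representations: probe accuracy = {acc:.2f}")
# A large gap suggests the probe is reading real structure, not just fitting noise.
```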
Causal interventions
http://xkcd.com/552/
[Figure: causal graph with nodes A, C, E and outcomes (nausea, pain relief)]
Causal Graph
Causal discovery is the process of identifying causal relationships between variables in a system (Pearl, 2009)
These relationships can be expressed as a directed acyclic graph (DAG), where the edges denote causal influences
https://arxiv.org/abs/2402.01207
Graph learning
[Figure: causal graph with treatment (ibuprofen), outcomes (nausea, pain relief), and flu as a confounder variable]
Causal Graph
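A tiny sketch of this graph as a DAG in code, using the `networkx` package; the exact edge set is one plausible reading of the figure, not ground truth:

```python
# Sketch: the ibuprofen example as a directed acyclic graph (DAG).
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("ibuprofen", "pain relief"),   # treatment -> outcome
    ("ibuprofen", "nausea"),        # treatment -> outcome (side effect)
    ("flu", "ibuprofen"),           # confounder influences the treatment...
    ("flu", "nausea"),              # ...and an outcome (assumed reading)
])
assert nx.is_directed_acyclic_graph(g)   # edges denote causal influences, no cycles
print(sorted(g.successors("ibuprofen")))
```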
Mediator
Direct effect
Indirect effect
Causal Mediation Analysis (Pearl, 2001) studies how mediators (e.g. “white collar” in the right graph) affect the outcome (e.g. “wage”).
Vig et al. (2020) explore in depth how this can be applied to study the impact of LLM internals (e.g. neurons and attention heads) on outputs.
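For reference, the two standard quantities from Pearl (2001); the notation here is assumed rather than taken from the slides, with Y_{t,m} the outcome when treatment is set to t and the mediator to m, and M_t the mediator’s value under treatment t:

```latex
\text{Natural direct effect:}\quad
\mathrm{NDE} = \mathbb{E}\!\left[Y_{t,\,M_{t^*}}\right] - \mathbb{E}\!\left[Y_{t^*,\,M_{t^*}}\right]
\qquad
\text{Natural indirect effect:}\quad
\mathrm{NIE} = \mathbb{E}\!\left[Y_{t^*,\,M_{t}}\right] - \mathbb{E}\!\left[Y_{t^*,\,M_{t^*}}\right]
```

The direct effect changes the treatment while holding the mediator at its control value; the indirect effect changes only the mediator. Measuring indirect effects of neurons and attention heads is the sense in which Vig et al. apply this to LLMs.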
Causal Graph
Causal inference is ultimately concerned with the effects of causes on specific units, that is, with ascertaining the value of the causal effect (Y_t(u) - Y_c(u)). It is frustrated by an inherent fact of observational life that I call the Fundamental Problem of Causal Inference.
Difference in value of response variable if exposed to t (treatment) vs. c (control)
- (Holland, 1986)
Fundamental Problem of Causal Inference. It is impossible to observe the value of Y_t(u) and Y_c(u) on the same unit and, therefore, it is impossible to observe the effect of t on u.
However, neural network internals can be manipulated in exactly this way!
Causal Inference for Understanding LLMs
Example courtesy of Dhanya Sridhar
Counterfactual Intervention
Where can we intervene in the transformer internals (“edit” the model) to achieve an equivalent transformation?
Causal Inference for Understanding LLMs
Courtesy of Jing Huang
Where is this used?
Model Editing
🧑 Query prompt: Who is the president of UC?
LLM (knowledge cut-off: January 2025)
Output: Michael Drake
As of August 2025:
LLMs are static, but factual knowledge is dynamic and can change over time
Training LLMs is very costly and time-consuming, so how do we update them?
Model Editing
https://arxiv.org/abs/2202.05262
How do we locate where factual knowledge is stored and retrieved?
Causal tracing
(uses mediation analysis)
Adding noise to hidden activations and then restoring them reveals their causal influence
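A heavily simplified sketch of that corrupt-then-restore loop on GPT-2 using forward hooks; the module paths and output formats assume the Hugging Face `GPT2LMHeadModel` layout, and the prompt, noise scale, token positions, and layers are illustrative:

```python
# Sketch: causal tracing — corrupt the subject embeddings with noise, then
# restore one layer's clean hidden state and see how much the answer recovers.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"
ids = tok(prompt, return_tensors="pt").input_ids
subj = slice(1, 3)                 # token positions of the subject (illustrative)

def answer_prob(hooks=()):
    # Run the model with the given (module, hook) pairs attached.
    handles = [m.register_forward_hook(h) for m, h in hooks]
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    for h in handles:
        h.remove()
    return torch.softmax(logits, dim=-1)[tok.encode(" Paris")[0]].item()

# Cache the clean hidden states of every transformer block.
clean = {}
def cache(layer):
    def hook(module, inputs, output):
        clean[layer] = output[0].detach().clone()
    return hook
p_clean = answer_prob([(blk, cache(i)) for i, blk in enumerate(model.transformer.h)])

# Corruption hook: add Gaussian noise to the subject token embeddings.
def corrupt(module, inputs, output):
    torch.manual_seed(0)           # same noise on every run, for comparability
    output = output.clone()
    output[:, subj] += 0.5 * torch.randn_like(output[:, subj])
    return output

# Restore hook: copy the clean hidden state back in at one layer.
def restore(layer):
    def hook(module, inputs, output):
        patched = output[0].clone()
        patched[:, subj] = clean[layer][:, subj]
        return (patched,) + output[1:]
    return hook

p_corrupt = answer_prob([(model.transformer.wte, corrupt)])
for layer in (0, 5, 10):
    p_restored = answer_prob([(model.transformer.wte, corrupt),
                              (model.transformer.h[layer], restore(layer))])
    print(f"layer {layer}: clean {p_clean:.3f}  corrupted {p_corrupt:.3f}  restored {p_restored:.3f}")
```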
Model Editing
https://arxiv.org/abs/2202.05262
We can then intervene on the identified components (e.g. targeted feed-forward weights) to update factual knowledge
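As a rough illustration of what such an edit can look like: a simplified rank-one update that forces the edited feed-forward matrix W to map a chosen key k* (representing the subject) to a new value v* (representing the new fact). ROME itself uses a covariance-weighted variant; this form is only for intuition:

```latex
W' \;=\; W \;+\; \frac{\left(v_* - W k_*\right) k_*^{\top}}{k_*^{\top} k_*}
\qquad\text{so that}\qquad W' k_* \;=\; v_*
```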
**In-class Coding Demo**
https://drive.google.com/file/d/1BCaGgA2xd4kxRGjU555F_BaJfWDBJPZE/view?usp=sharing
Note: this requires HuggingFace and wandb accounts.