1 of 41

Lecture 10

CS 263:

Advanced NLP

Saadia Gabriel

2 of 41

Announcements

  • Midterm on Wednesday; a review sheet with key information has been posted on the course website.
  • Midterm format:
      • You may stay for the entire hour and 50 minutes, but I expect it to be doable within an hour or less.
      • It will have 15 questions (mostly multiple-choice or open-ended, though two questions will require derivatives or matrix/vector operations).

3 of 41

Announcements

  • No class on Monday (Presidents’ Day).
  • Our third and final guest lecture will be next Wednesday.

4 of 41

Last Time

LLM Evaluation & Benchmarking

How do we determine if we’re making progress?

[Figure: two evaluators, Alice and Bob, comparing outputs from two models, A and B]

5 of 41

Today

In Fall 2026 I’ll be teaching a new grad-level class that covers these topics in depth.

LLM Interpretability

Input: “This movie vibrantly presents a fun reimagining of her life…”

MODEL

Output: Predicted rating of 5/5

What happened in between? How did components of the input and the model contribute to this decision?

6 of 41

Why We Care

There are a few reasons:

7 of 41

Some models are inherently interpretable…

This means they not only offer an explanation for their decision-making, but this explanation is also faithful to both the outcome and the model internals.

8 of 41

Some models are inherently interpretable…

In a linear regression model, the contribution of each feature is clearly defined through its coefficient β_j.
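As a quick illustration (with entirely made-up data), the per-feature contributions β_j · x_j can be read directly off a fitted linear model:

```python
import numpy as np

# Toy review-rating data: 3 hypothetical features (e.g. counts of "fun",
# "boring", "vibrant" words) and a 1-5 star rating. All numbers are made up.
X = np.array([[2., 0., 1.],
              [0., 3., 0.],
              [1., 0., 2.],
              [0., 1., 0.]])
y = np.array([5., 1., 5., 2.])

# Ordinary least squares with an intercept column.
X_aug = np.hstack([np.ones((len(X), 1)), X])
beta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
intercept, coefs = beta[0], beta[1:]

# The contribution of feature j to a prediction is exactly beta_j * x_j,
# so the explanation is faithful to the model by construction.
x_new = np.array([1., 0., 1.])
contributions = coefs * x_new
print("prediction:", intercept + contributions.sum())
print("per-feature contributions:", contributions)
```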

However, we usually work with much more complex, non-linear models.

9 of 41

Non-causal interventions

10 of 41

An advantage of attention is that it “builds in” interpretability

BertViz Demo

11 of 41

However, attention weights are not always informative as explanation

Jain and Wallace, 2019

12 of 41

Feature Attribution

  • Shapley values
  • LIME

13 of 41

Feature Attribution

How LIME works

LIME fits an interpretable (linear) model in the local region around a specific prediction (shown in red).

[Figure: positive and negative classes separated by a decision boundary defined by the function f]

14 of 41

Feature Attribution

E.g. f(x=[1,0,1]) = positive, f([1,1,1]) = negative
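A minimal LIME-style sketch of this local-surrogate idea; the black-box rule, sampling scheme, and kernel width below are chosen for illustration and are not LIME's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(z):
    # Stand-in for the opaque model f from the slide: "positive" iff feature 0
    # is on and feature 1 is off. This rule is hypothetical.
    return float(z[0] == 1 and z[1] == 0)

x = np.array([1, 0, 1])                      # instance to explain; f(x) = positive

# 1) Sample perturbed inputs around x (here: uniform over {0,1}^3).
Z = rng.integers(0, 2, size=(500, 3))

# 2) Query the black box on every perturbation.
y = np.array([black_box(z) for z in Z])

# 3) Weight samples by proximity to x (exponential kernel on Hamming distance).
dist = (Z != x).sum(axis=1)
w = np.exp(-dist / 0.75)

# 4) Fit a weighted linear surrogate g(z) = c0 + sum_j c_j * z_j near x.
Z_aug = np.hstack([np.ones((len(Z), 1)), Z])
coefs = np.linalg.solve(Z_aug.T @ (Z_aug * w[:, None]), Z_aug.T @ (w * y))

print("local feature attributions:", coefs[1:])  # feature 1 should come out negative
```

The surrogate's coefficients are the local explanation: they say which features pushed this particular prediction toward "positive" or "negative", not how f behaves globally.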

15 of 41

Feature Attribution

How Integrated Gradients work

Works with either visual or textual data!

16 of 41

Feature Attribution

How Integrated Gradients work

  • First you need a “baseline,” which serves as an uninformative input for comparison (See DeepLIFT paper for more on baselines)

17 of 41

Feature Attribution

How Integrated Gradients work

Input x

Baseline

Which features of x are important?

From Stanford CS224U

18 of 41

Feature Attribution

How Integrated Gradients work

Interpolate points between x and the baseline, then accumulate gradients with respect to these interpolated points

From Stanford CS224U

19 of 41

Feature Attribution

How Integrated Gradients work

  • First you need a “baseline,” which serves as an uninformative input for comparison (See DeepLIFT paper for more on baselines)

https://arxiv.org/pdf/1703.01365
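Putting the last few slides together, here is a compact sketch of integrated gradients on a tiny hand-built differentiable function (the network, input, and all-zeros baseline are purely illustrative):

```python
import torch

# Tiny hand-built differentiable "model" standing in for a real network.
torch.manual_seed(0)
W1, W2 = torch.randn(8, 4), torch.randn(1, 8)
def f(x):
    return (W2 @ torch.tanh(W1 @ x)).squeeze()

x = torch.tensor([1.0, -2.0, 0.5, 3.0])      # input to explain
baseline = torch.zeros_like(x)                # "uninformative" all-zeros baseline
m = 50                                        # number of interpolation steps

# Interpolate along the straight line from the baseline to x and accumulate
# gradients of f with respect to each interpolated point.
grads = torch.zeros_like(x)
for k in range(1, m + 1):
    point = baseline + (k / m) * (x - baseline)
    point.requires_grad_(True)
    f(point).backward()
    grads += point.grad

integrated_grads = (x - baseline) * (grads / m)   # per-feature attributions
print("IG attributions:", integrated_grads)

# Completeness check: attributions should sum (approximately) to f(x) - f(baseline).
print(integrated_grads.sum().item(), (f(x) - f(baseline)).item())
```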

20 of 41

Feature Attribution

Drawbacks:

1. May locally explain a model, but not globally explain behavior
2. “Explanations” are sensitive to implementation decisions (e.g., in integrated gradients, what is a good baseline?)

https://arxiv.org/pdf/1703.01365

21 of 41

Probing Classifiers

“…thermometers used to measure the temperature simultaneously at many different locations.”

- Alain & Bengio (2016)

22 of 41

Probing classifiers are supervised models trained to predict specific properties from a neural model’s representations. They can provide some evidence as to whether the representations encode information critical for tasks like POS tagging.

Image courtesy of John Hewitt

23 of 41

Probing Classifiers

Target Model: extract the hidden representation h_k from layer k.

Linear Probe:
  Input: features h_k from hidden layer k
  Output: Ŷ = softmax(W h_k + b), where (W, b) are the probe weights and bias
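A minimal sketch of such a linear probe in PyTorch; the hidden states and labels below are random stand-ins, so in practice you would substitute representations extracted from the frozen target model:

```python
import torch
import torch.nn as nn

# Linear-probe sketch matching the slide's softmax(W h_k + b). The hidden
# states h_k and POS labels below are random stand-ins for real extractions.
torch.manual_seed(0)
num_examples, hidden_dim, num_tags = 1000, 768, 17
h_k = torch.randn(num_examples, hidden_dim)             # frozen representations
labels = torch.randint(0, num_tags, (num_examples,))    # gold property labels

probe = nn.Linear(hidden_dim, num_tags)                 # the probe's (W, b)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                         # applies softmax internally

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(probe(h_k), labels)   # only the probe is trained; the model stays frozen
    loss.backward()
    optimizer.step()

accuracy = (probe(h_k).argmax(dim=-1) == labels).float().mean().item()
print(f"probe accuracy: {accuracy:.2f}")   # high accuracy suggests h_k is predictive of the property
```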

24 of 41

Where could this go wrong?

25 of 41

Courtesy of John Hewitt

26 of 41

Probing classifiers are supervised models trained to predict specific properties from a neural model’s representations. They can provide some evidence as to whether the representations encode information critical for tasks like POS tagging.

Image courtesy of John Hewitt

We really can’t say that the representations encode this information for sure, only that they are predictive…

Correlation ≠ Causation

27 of 41

Control Representations attempt to isolate the effect of learning a specific function from the property-specific expressiveness of the representation.

These “baseline representations” may be random inputs that show whether the probing classifier can make predictions just as effectively (or more effectively) from meaningless noise.
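A sketch of that comparison: train identical probes on the real representations and on random control representations and compare their accuracies (here both inputs are random stand-ins, so the numbers only illustrate the setup):

```python
import torch
import torch.nn as nn

def probe_accuracy(reps, labels, num_tags, epochs=50):
    """Train a fresh linear probe on `reps` and return its training accuracy."""
    probe = nn.Linear(reps.shape[1], num_tags)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(reps), labels)
        loss.backward()
        opt.step()
    return (probe(reps).argmax(-1) == labels).float().mean().item()

torch.manual_seed(0)
labels = torch.randint(0, 17, (1000,))
real_reps = torch.randn(1000, 768)      # stand-in: would be actual hidden states h_k
control_reps = torch.randn(1000, 768)   # random "baseline" representations (meaningless noise)

# If the probe does (nearly) as well on the control representations, then high
# accuracy on the real representations is weak evidence about what the model encodes.
print("real representations:   ", probe_accuracy(real_reps, labels, 17))
print("control representations:", probe_accuracy(control_reps, labels, 17))
```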

28 of 41

Causal interventions

http://xkcd.com/552/

29 of 41

Causal Graph

Causal discovery is the process of identifying causal relationships between variables in a system (Pearl, 2009).

These relationships can be expressed as a directed acyclic graph (DAG) where the edges denote causal influences.

https://arxiv.org/abs/2402.01207

[Figure: graph learning example with a treatment (ibuprofen), two outcomes (pain relief, nausea), and a confounder variable (flu)]

30 of 41

Causal Graph

[Figure: causal graph with a mediator, illustrating direct and indirect effects]

Causal Mediation Analysis (Pearl, 2001) studies how mediators (e.g. “white collar” in the right graph) affect the outcome (e.g. “wage”).

Vig et al. (2020) explore in depth how this can be applied to study the impact of LLM internals (e.g. neurons and attention heads) on outputs.
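A toy sketch of the kind of intervention this enables: patch one neuron's activation from a counterfactual run into the original run and measure the change in the output. The two-layer network, the chosen neuron, and the inputs are all made up; a real analysis would hook into a transformer's hidden states:

```python
import torch
import torch.nn as nn

# Toy stand-in for a stack of transformer components (hypothetical 2-layer net).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 2))

x_base = torch.tensor([[1.0, 0.0, 1.0, 0.0]])     # original input
x_alt  = torch.tensor([[0.0, 1.0, 0.0, 1.0]])     # counterfactual input

# 1) Record the hidden activation of layer 0 on the counterfactual run.
stash = {}
handle = model[0].register_forward_hook(lambda m, i, out: stash.update(alt=out.detach()))
model(x_alt)
handle.remove()

# 2) Re-run the original input, but patch neuron 3 of layer 0 with its
#    counterfactual value: an interchange intervention on one component.
def patch(module, inputs, output):
    output = output.clone()
    output[:, 3] = stash["alt"][:, 3]
    return output

handle = model[0].register_forward_hook(patch)
y_patched = model(x_base)
handle.remove()

y_base = model(x_base)
print("effect of intervening on neuron 3:", y_patched - y_base)
```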

31 of 41

Causal Graph

Causal inference is ultimately concerned with the effects of causes on specific units, that is, with ascertaining the value of the causal effect (Y_t(u) - Y_c(u)). It is frustrated by an inherent fact of observational life that I call the Fundamental Problem of Causal Inference.

Difference in the value of the response variable if exposed to t (treatment) vs. c (control)

- (Holland, 1986)

Fundamental Problem of Causal Inference. It is impossible to observe the value of Y_t(u) and Y_c(u) on the same unit and, therefore, it is impossible to observe the effect of t on u.

However, neural network internals are manipulable in this way!

32 of 41

Example courtesy of Dhanya Sridhar

Causal Inference for Understanding LLMs

33 of 41

Causal Inference for Understanding LLMs

Example courtesy of Dhanya Sridhar

34 of 41

Causal Inference for Understanding LLMs

Example courtesy of Dhanya Sridhar

Counterfactual Intervention

35 of 41

Causal Inference for Understanding LLMs

Example courtesy of Dhanya Sridhar

Where can we intervene in the transformer internals (“edit” the model) to achieve an equivalent transformation?

36 of 41

Causal Inference for Understanding LLMs

Courtesy of Jing Huang

37 of 41

Where is this used?

38 of 41

Model Editing

[Figure: a user 🧑 sends the query prompt “Who is the president of UC?” to an LLM with knowledge cut-off January 2025; the output is “Michael Drake”, which is out of date as of August 2025.]

LLMs are static, but factual knowledge is dynamic and can change over time.

Training LLMs is very costly and time-consuming, so how do we update them?

39 of 41

Model Editing

https://arxiv.org/abs/2202.05262

How do we locate where factual knowledge is stored and retrieved?

Causal tracing

(uses mediation analysis)

Adding noise to hidden activations and then restoring them reveals their causal influence
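A toy sketch of the noise-and-restore recipe on a hand-built two-layer network (this is only the intuition, not the paper's implementation; the model, input, and restored unit are all made up):

```python
import torch
import torch.nn as nn

# Causal-tracing sketch: 1) clean run, 2) corrupt the input with noise,
# 3) restore one clean hidden unit inside the corrupted run and see how much
# of the original output it recovers.
torch.manual_seed(0)
layer1, layer2 = nn.Linear(4, 8), nn.Linear(8, 2)

with torch.no_grad():
    e_clean = torch.tensor([[1.0, 0.5, -1.0, 2.0]])       # stand-in input embedding
    h_clean = torch.tanh(layer1(e_clean))                  # saved clean hidden state
    out_clean = layer2(h_clean)

    e_noisy = e_clean + 3.0 * torch.randn_like(e_clean)    # corrupt the input
    h_corrupt = torch.tanh(layer1(e_noisy))
    out_corrupt = layer2(h_corrupt)

    # Restore a single clean hidden unit inside the corrupted run.
    h_restored = h_corrupt.clone()
    h_restored[:, 3] = h_clean[:, 3]
    out_restored = layer2(h_restored)

print("clean:    ", out_clean)
print("corrupted:", out_corrupt)
print("restored: ", out_restored)   # movement back toward "clean" = causal influence of that unit
```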

40 of 41

Model Editing

https://arxiv.org/abs/2202.05262

We can then intervene on identified components (e.g. target feedforward weights) to update factual knowledge
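As a toy illustration of editing feedforward weights, the sketch below makes a rank-one update so that one particular "key" activation now maps to a new "value" vector. This is only the rough intuition behind locate-and-edit methods, not ROME's actual constrained update, and every tensor here is made up:

```python
import torch

torch.manual_seed(0)
W = torch.randn(6, 4)        # stand-in for an MLP projection weight in the model
k = torch.randn(4)           # "key": activation pattern associated with the edited fact
v_new = torch.randn(6)       # "value": desired new output for that key (the new fact)

# Rank-one update: W' = W + (v_new - W k) k^T / (k^T k),
# so that W' k == v_new, while inputs orthogonal to k are left unchanged.
residual = v_new - W @ k
W_edited = W + torch.outer(residual, k) / (k @ k)

print(torch.allclose(W_edited @ k, v_new, atol=1e-5))     # the key now retrieves the new value
x_other = torch.randn(4)
# Other inputs change only through their component along k.
print((W_edited @ x_other - W @ x_other).norm().item())
```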

41 of 41

**In-class Coding Demo**

https://drive.google.com/file/d/1BCaGgA2xd4kxRGjU555F_BaJfWDBJPZE/view?usp=sharing

Note: this requires HuggingFace and wandb accounts.