1 of 57

For later – set up using your laptop now

Go to colab.google.com

Secrets:

  • Enter your HF_TOKEN
  • Enter your NDIF_API_KEY

Submit your NDIF_API_KEY to the googleform: [TBD]

2 of 57

Neural Mechanics
Week 1: LLM Foundations and Logit Lens

Tuesday, January 13, 2026

David Bau

Northeastern University

3 of 57

How Goes Research Planning?

First Reading Questions:

1. Keivalya – on high-quality research: what is a "significant result"?

2. Jasmine – evaluation: "small world" beginnings and "open doors"?

3. Yiqian – which comes first in research, the hypothesis or the experiment?

(More discussion later.)

4 of 57

The Research Process is Iterative

  • A good research question is both Feasible and Interesting

  • Today: exploratory experiments are a toolkit for feasibility

[Diagram: a loop between Exploratory Experiment and Scaled-Up Experiments, gated by the questions "Feasible?" and "Interesting?"]

5 of 57

Today’s Goals

  1. Whirlwind review: Neural nets, Language models, Transformers
  2. First interpretability method: the Logit Lens
  3. Hands-on setup: Huggingface, NDIF tokens
  4. Lab: reproduce experiments, discuss what they tell us

6 of 57

One-Picture Neural Networks Review

An old idea: Frank Rosenblatt's 1958 Perceptron.

[Diagram: perceptron units stacked into a Multi-Layer Perceptron (MLP)]

Train by gradually adjusting weights to reduce errors on seen examples.

7 of 57

One-slide Language Model Review

[Diagram: a branching word graph ("When in Rome", "near home", "Back to back", "in town") with a transition probability on each edge]

How to pick what comes after "in" depends on what came before.
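The branching diagram above is essentially a bigram model: a table of next-word probabilities conditioned on the previous word. A minimal sketch in Python, with illustrative probabilities (not the exact numbers from the diagram):

```python
import random

# Toy bigram model: P(next word | previous word).
# Probabilities here are illustrative, not taken exactly from the slide.
bigram = {
    "When": {"in": 1.0},
    "in":   {"Rome": 0.6, "town": 0.4},
    "Back": {"to": 1.0},
    "to":   {"back": 0.5, "town": 0.5},
}

def next_word(prev, rng=random):
    """Sample the next word given the previous one."""
    choices = bigram[prev]
    words = list(choices)
    weights = [choices[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

print(next_word("When"))  # always "in"
print(next_word("in"))    # "Rome" or "town"
```

A full language model does the same thing, except the condition is the entire preceding context rather than one word.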

8 of 57

Language Models take Tokens to Probabilities

 

[Diagram: the input "a process called token" is encoded as a vector of preceding input text; the model outputs a vector of next-token probabilities ("ization", "izing", "-", …); sampling one and appending it yields "a process called tokenization", whose next predicted tokens include ".", ",", "that"]

Run a language model repeatedly to generate text!

Ayush Agrawal – tokenization question
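The "run repeatedly" loop can be sketched as follows. Here `predict_probs` is a hypothetical stand-in for a real language model, hard-coded to complete one phrase:

```python
# Greedy generation loop: run the model repeatedly, appending the most
# probable next token each time. `predict_probs` is a made-up stand-in
# for a real LM; it only knows how to finish one phrase.
def predict_probs(tokens):
    table = {
        ("a", "process", "called"): {"token": 0.7, "method": 0.3},
        ("a", "process", "called", "token"): {"ization": 0.8, "izing": 0.2},
    }
    return table.get(tuple(tokens), {".": 1.0})

def generate(tokens, steps=3):
    tokens = list(tokens)
    for _ in range(steps):
        probs = predict_probs(tokens)
        tokens.append(max(probs, key=probs.get))  # greedy: pick the argmax
    return tokens

print(generate(["a", "process", "called"], steps=2))
# ['a', 'process', 'called', 'token', 'ization']
```

Real systems often sample from the probabilities rather than always taking the argmax, which is what makes generated text vary run to run.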

9 of 57

Inside Transformer Language Models

[Diagram: the tokens "Miles Davis plays the" flow through the network to the predicted token "trumpet", with a guess produced for every word]

1. First the encoder turns each token ("the") into a vector of neural activations.

2. Then a series of neural layers mixes and transforms the vectors for each token.

3. Finally the decoder turns each vector into a prediction for the next word.

Detail: the LM estimates probabilities.

Haoyu He – internal vocab question

Ananya Malik – linear sep question

10 of 57

Important Transformer Pieces to Know

Encoder is a look-up table from tokens to vectors.

Typical vocabulary: 50,000 or 150,000 vectors in the table.

Attention is a neural network that “remembers” recent information from contextual tokens by (1) making a query vector (2) to match key vectors (3) and gather & add value vectors

MLP (multilayer perceptrons) are two-layer neural networks that match and modify single-token feature vectors

Decoder makes a vector of 50k/150k next-token probabilities

short-term�contextual memory

long-term�parametric memory

Yuchen Hou – memory question

11 of 57

Details of Self-Attention

  • We have some keys 𝑘1, 𝑘2, … , 𝑘𝑇. Each key is 𝑘𝑖 ∈ ℝ𝑑
  • We have some values 𝑣1, 𝑣2, … , 𝑣𝑇. Each value is 𝑣𝑖 ∈ ℝ𝑑
  • Attention operates on queries, keys, and values.
  • We have some queries 𝑞1, 𝑞2, … , 𝑞𝑇. Each query is 𝑞𝑖 ∈ ℝ𝑑
  • In self-attention, the queries, keys, and values are drawn from the same source.
    • For example, if the output of the previous layer is 𝑥1, … , 𝑥𝑇, (one vec per word) we could let 𝑣𝑖 = 𝑘𝑖 = 𝑞𝑖 = 𝑥𝑖 (If our V, K, Q matrices were all identity, we could use the same vectors for all of them!)
  • The (dot product) self-attention operation is as follows:

    Compute key-query affinities:
      𝑒𝑖𝑗 = 𝑞𝑖𝖳𝑘𝑗

    Compute attention weights from affinities (softmax):
      𝛼𝑖𝑗 = exp(𝑒𝑖𝑗) / Σ𝑗′ exp(𝑒𝑖𝑗′)

    Compute outputs as weighted sum of values:
      output𝑖 = Σ𝑗 𝛼𝑖𝑗 𝑣𝑗

John Hewitt

Luze – attention question
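The three steps above can be sketched in a few lines of NumPy, using the slide's simplification that the Q, K, and V matrices are identity maps (so 𝑞𝑖 = 𝑘𝑖 = 𝑣𝑖 = 𝑥𝑖):

```python
import numpy as np

def self_attention(X):
    """Dot-product self-attention with identity Q/K/V maps, so that
    q_i = k_i = v_i = x_i, as in the slide's simplification.
    X has shape (T, d): one row per token."""
    Q, K, V = X, X, X
    e = Q @ K.T                                   # affinities e_ij = q_i^T k_j
    e = e - e.max(axis=-1, keepdims=True)         # shift for numerical stability
    alpha = np.exp(e) / np.exp(e).sum(axis=-1, keepdims=True)  # softmax rows
    return alpha @ V                              # output_i = sum_j alpha_ij v_j

X = np.random.randn(4, 8)   # T=4 tokens, d=8 dimensions
out = self_attention(X)
print(out.shape)            # (4, 8)
```

Each output row is a convex combination of the value vectors, weighted by how well that token's query matches each key.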

12 of 57

Self-Attention Details

Step 1: create three vectors from each of the encoder's input vectors: a Query, a Key, and a Value (typically of smaller dimension), by multiplying the embedding by three matrices learned during training.

While processing each word, this allows the model to look at other positions in the input sequence for clues, to build a better encoding for this word.

13 of 57

Self-Attention

Step 2: calculate a score (like we have seen for regular attention!) that determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

Take the dot product of the query vector with the key vector of the respective word we're scoring.

E.g., Processing the self-attention for word “Thinking” in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

14 of 57

Self Attention

  • Step 3: divide the scores by the square root of the dimension of the key vectors (for more stable gradients).
  • Step 4: pass the result through a softmax operation (so all scores are positive and add up to 1).

Intuition: the softmax score determines how much each word will be expressed at this position.
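Steps 3-4 in a NumPy sketch; the example query and keys are made up for illustration:

```python
import numpy as np

def scaled_softmax_scores(q, keys):
    """Steps 3-4: scale the dot-product scores by sqrt(d_k), then softmax,
    yielding attention weights that are positive and sum to 1."""
    d_k = keys.shape[-1]
    scores = keys @ q / np.sqrt(d_k)      # step 3: scaled dot products
    exp = np.exp(scores - scores.max())   # subtract max for numerical stability
    return exp / exp.sum()                # step 4: softmax

q = np.array([1.0, 0.0, 1.0, 0.0])
keys = np.array([[1.0, 0.0, 1.0, 0.0],   # matches q well -> higher weight
                 [0.0, 1.0, 0.0, 1.0]])  # orthogonal to q -> lower weight
weights = scaled_softmax_scores(q, keys)
print(weights.sum())   # 1.0
```

The sqrt(d_k) scaling keeps the pre-softmax scores from growing with dimension, which would otherwise saturate the softmax.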

15 of 57

Self Attention

  • Step 6: sum up the weighted value vectors. This produces the output of the self-attention layer at this position

More details:

  • What we have seen for a word is done for all words (using matrices)
  • Need to encode position of words
  • And improved using a mechanism called “multi-headed” attention

(kind of like multiple filters for CNN)

see https://jalammar.github.io/illustrated-transformer/

16 of 57

Is Self-Attention All You Need? Not yet.

[Diagram: two stacked self-attention blocks over the words "The chef who … food", each word 𝑤𝑖 contributing its own query 𝑞𝑖, key 𝑘𝑖, and value 𝑣𝑖]

  • In the diagram at the right, we have stacked self-attention blocks, like we might stack LSTM layers.
  • Can self-attention be a drop-in replacement for recurrence?
  • No. It has a few issues, which we'll go through.
  • First, self-attention is an operation on sets. It has no inherent notion of order.

Self-attention doesn’t know the order of its inputs.

John Hewitt

17 of 57

Position Encoding

 

  • Sinusoidal position representations: concatenate sinusoidal functions of varying periods:

    𝑝𝑡 = [sin(𝜔1𝑡), cos(𝜔1𝑡), …, sin(𝜔𝑑/2 𝑡), cos(𝜔𝑑/2 𝑡)]𝖳

    where 𝑡 is the index in the sequence and the vector has dimension 𝑑.

Image: https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/

John Hewitt
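The encoding above can be sketched as follows, assuming the standard base-10000 frequency schedule from Vaswani et al. (2017), which the slide does not pin down:

```python
import numpy as np

def sinusoidal_position(t, d, base=10000.0):
    """Position vector p_t: interleaved sin/cos at d/2 frequencies.
    The base-10000 schedule is the common choice from Vaswani et al. (2017);
    other bases work too."""
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)  # omega_1 .. omega_{d/2}
    p = np.empty(d)
    p[0::2] = np.sin(freqs * t)   # even slots: sines
    p[1::2] = np.cos(freqs * t)   # odd slots: cosines
    return p

p = sinusoidal_position(t=5, d=8)
print(p.shape)  # (8,)
```

Because each frequency pair behaves like a rotation, p_{t+k} is a fixed linear function of p_t, which is the "linear relationships" property the linked blog post explores.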

18 of 57

Masking Attention to the Future

  • To use self-attention in decoders, we need to ensure we can’t peek at the future.
  • At every timestep, we could change the set of keys and queries to include only past words. (Inefficient!)

  • To enable parallelization, we mask out attention to future words by setting attention scores to −∞.

[Diagram: a lower-triangular grid of attention scores over "[START] The chef who": for encoding these words, we can look at these (not greyed out) preceding words; future positions are greyed out at −∞]

𝑒𝑖𝑗 = 𝑞𝑖𝖳𝑘𝑗  if 𝑗 < 𝑖;   −∞  if 𝑗 ≥ 𝑖

John Hewitt
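The masked-score formula above can be sketched in NumPy. Note that under this convention row 0 has no unmasked keys; in practice the [START] token fills that role:

```python
import numpy as np

def masked_scores(Q, K):
    """Compute e_ij = q_i^T k_j for j < i, and -inf for j >= i,
    matching the slide's masking convention. (Row 0 then has no
    unmasked keys; in practice a [START] token fills that role.)"""
    e = Q @ K.T
    T = e.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool))  # True where j >= i
    e[mask] = -np.inf
    return e

scores = masked_scores(np.eye(3), np.eye(3))
print(scores)  # upper triangle (including the diagonal) is -inf
```

After the softmax, the −∞ entries become exactly zero attention weight, so the whole sequence can be processed in one parallel pass without leaking future tokens.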

19 of 57

MLP (Feed-Forward) Modules

  • Note that there are no elementwise nonlinearities in self-attention; stacking more self-attention layers just re-averages value vectors

  • Easy fix: add a feed-forward network

to post-process each output vector.

𝑚𝑖 = MLP(output𝑖) = 𝑊2 ReLU(𝑊1 output𝑖 + 𝑏1) + 𝑏2

[Diagram: over the words "The chef who … food", each self-attention layer is followed by a position-wise FF block applied independently to every token]

Intuition: the FF network processes the result of attention

John Hewitt
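The FF formula above as a NumPy sketch, with made-up toy dimensions:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FF network: m_i = W2 @ ReLU(W1 @ x + b1) + b2,
    applied independently to each token's vector x."""
    h = np.maximum(0.0, W1 @ x + b1)   # ReLU supplies the nonlinearity
    return W2 @ h + b2

d, d_ff = 4, 16   # toy sizes; real transformers often use d_ff = 4*d
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_ff, d)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d, d_ff)), np.zeros(d)
x = rng.standard_normal(d)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (4,)
```

This is the piece that fixes the problem noted above: without the ReLU, stacking attention layers would only ever re-average value vectors.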

20 of 57

The “Residual Stream”

[Diagram: input tokens "Miles Davis plays the" predict "trumpet"; at each layer, the state ℎ𝑖(𝑙) flows from layer input to layer output, with the attention and MLP outputs added into it]

Every layer calculates a small “residual vector” to add to the stream (He 2015, Elhage 2021)

Rice Wang – privileged basis question

Grace – contribution, echo question
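The residual stream in code: each sublayer's output is added to the stream, not substituted for it. Here `attention` and `mlp` are toy stand-in callables, not real sublayers:

```python
import numpy as np

def transformer_layer(x, attention, mlp):
    """One layer of the residual stream: each sublayer computes a small
    "residual vector" that is *added* to the stream (He 2015, Elhage 2021).
    `attention` and `mlp` are stand-ins for the real sublayers."""
    x = x + attention(x)   # attention writes its contribution into the stream
    x = x + mlp(x)         # the MLP adds its contribution too
    return x

# Toy stand-ins: small residual updates leave the stream mostly intact.
x0 = np.ones(4)
out = transformer_layer(x0, attention=lambda x: 0.1 * x, mlp=lambda x: 0.1 * x)
print(out)   # close to x0: layers nudge the stream rather than overwrite it
```

This additive structure is what makes early-exit decoding and the logit lens possible: the stream at any layer is already in roughly the same vector space the decoder reads.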

21 of 57

“Early Exit Decoding”

[Diagram: for the prompt "Miles Davis plays the", decoding before the final layer already yields "trumpet"]

If you skip the last layer and decode early, it often already knows the prediction

(Panda 2016,�Elbayad 2020)

22 of 57

The “Logit Lens”

[Diagram: for "Miles Davis plays the", early-decoded predictions at successive layers include "Stein", "Miles", and "horn" before settling on "trumpet"]

If you decode each vector early, you can see how the prediction evolves (nostalgebrist 2020)

Isaac Dalke – logit lens early?

Avery Huang – logit lens timing?
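The logit lens is just the decoder applied everywhere: multiply *each* layer's hidden state by the unembedding matrix and softmax, instead of doing so only at the last layer. A toy sketch with a made-up unembedding matrix and made-up hidden states (real activations come from a model like Llama via the workbench or notebook below):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def logit_lens(hidden_per_layer, W_U, vocab):
    """Apply the decoder (unembedding W_U) to every layer's hidden state,
    not just the last one, to watch the prediction evolve."""
    readout = []
    for h in hidden_per_layer:
        probs = softmax(W_U @ h)
        readout.append(vocab[int(np.argmax(probs))])
    return readout

vocab = ["horn", "trumpet", "Miles"]
W_U = np.eye(3)   # toy unembedding: one vocab direction per axis
layers = [np.array([0.1, 0.0, 2.0]),   # early layer leans "Miles"
          np.array([2.0, 0.5, 0.1]),   # middle layer leans "horn"
          np.array([0.5, 3.0, 0.1])]   # last layer leans "trumpet"
print(logit_lens(layers, W_U, vocab))  # ['Miles', 'horn', 'trumpet']
```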

23 of 57

The “Logit Lens” grid

Logit lens lets you view a transformer as a grid of “next token” predictions.

One prediction for each layer at each token.

Yunus – circularity?

Claire – veracity?

Guangyuan – thinking?


25 of 57

Logit Lens on a Translation Task

Use logit lens to inspect a French 🡪 Chinese translation task

(Wendler 2024)

Predict the token after

Français: "fleur" - 中文:

(“中文” means “Chinese”)

花 is correct

In the middle: this is neither French nor Chinese!

26 of 57

Logit Lens on a Translation

Use logit lens to inspect a French 🡪 Chinese translation task

(Wendler 2024)

Predict the token after

Français: "fleur" - 中文:

("中文" means "Chinese")

花 is correct

In the middle: this is neither French nor Chinese!

Courtney – artifact?

Yuqi – other langs?

Christopher Curtis

Arya - multitoken?

Jesseba - SAEs?

27 of 57

“Do Llamas Work in English?”

(Wendler 2024)

On the y-axis: the probability decoded for the English or Chinese translation of the French word (averaged over many cases).

On the x-axis: which internal transformer layer.

[Plot annotations: 花 at the end; "flower" in English in the middle layers]

28 of 57

Try the Workbench Prototype

Open the document bit.ly/eab-lens

Open the workbench prototype workbench.ndif.us

29 of 57

Step 1: Login and Make a Workspace

Prerequisite: need to be logged into GitHub

Then: Create Workspace. Call it “Demo”.

30 of 57

The Three-Pane Interface

List of charts

Experiment Designer

Experiment Output

31 of 57

Select an LLM: Llama 3.1-8b


Model Selector.�Llama 3.1-8b has eight billion trained parameters.

32 of 57

Enter a “Cloze Prompt”

A “Cloze prompt” is a fill-in-the-blank text form designed to test LLM knowledge.

E.g., “Miles Davis plays the ____”

An LLM predicts the next word, so we leave the last word blank to test it.

Hit Enter to run the LLM

(This means it’s working)

33 of 57

Tokenization

LLMs break text into “tokens” to process. Internally, each token becomes a column (a sequence of vectors) of neurons.

As soon as you run the LLM, your text is tokenized.

Llama always requires a "begin-of-text" token, so that appears here.

The predicted tokens are shown here...

The LLM got it right!

34 of 57

The Logit Lens Heatmap

Input tokens

Darker squares mean “more confident predictions”

Output corner

The direction of info flow

Each square is a “representation vector” of neurons

35 of 57

Reading the Heatmap

The word shown in each square is the "decoded" vector. Here, layer 25 thinks that after "Miles Davis" you should say "Miles" again.

Layer 30 thinks that after "Miles Davis plays the" should come "horn".

Color shows confidence; it is not very confident about "horn".

36 of 57

Controlling the Heatmap

Drag a rectangle to focus on the last layers

Change the x-step stride to 1 to show every layer

Then click the "crop" button to zoom in

37 of 57

Controlling the Heatmap

But right before that it was “thinking” about predicting “horn”

The very last layer predicts “trumpet” a bit more confidently

38 of 57

Switching to the Line Plot

This token is highlighted to show the predictions after the token "the"

Click on “Line” for the line plot.

Lines show predictions by layer

39 of 57

Details in the line plot

But right before that it was “thinking” about predicting “horn”

An LLM doesn’t predict just one guess, but it assigns a probability to several guesses

40 of 57

The Many Guesses of an LLM

All these guesses materialized suddenly at the last layer

All the top guesses are selected here. The _ means “space”

41 of 57

Adding a token to track

x to remove blues and “space”

42 of 57

Adding a token to track

Type “horn” and then�select the “_horn” token that includes the space

43 of 57

Adding a token to track

Type “horn” and then�select the “_horn” token that includes the space

Also: add

“_Miles”

44 of 57

The story told by tokens

Layers 23-29 "think" about saying "Miles"

Layer 30 votes for “horn”

Layer 31 chooses “trumpet”

45 of 57

Try a Translation

Click in the empty part of the box to edit the text

Enter (copy from bit.ly/eab-lens):

Français: "fleur" - 中文: "

The answer should be 花 which means “flower” in Chinese

46 of 57

Try a Translation

We didn’t provide any input in English!

47 of 57

Zoom into the last ten layers

The quote is predicted but only at the very last layer

The Chinese word appears at layer 28, but it’s back to English at 29

48 of 57

Assemble a line plot

The Chinese prediction

Select “Line” and then add tokens of interest here: be sure to add the “space” version of ”_flower”

Predicting English

49 of 57

What Does this Teach Us?

Hypothesis: the presence of English between French and Chinese suggests a language-independent concept representation that sits between all languages.

How would we test this hypothesis?

50 of 57

Now Your Turn. https://bit.ly/eab-lens

Try investigating other prompts using the logit lens prototype. Visit: http://bit.ly/eab-lens

51 of 57

Now in a notebook

Go to colab.google.com

Secrets:

  • Enter your HF_TOKEN
  • Enter your NDIF_API_KEY

Submit your NDIF_API_KEY to the googleform: [TBD]

Then: https://bit.ly/4jCc5ZD

52 of 57

Logit Lens Research Example Notebook

https://bit.ly/4jCc5ZD

53 of 57

Capital of France: “a” at the last layer

54 of 57

Language translation: amor 🡪 amour

55 of 57

Pun: electrician swimmers

56 of 57

Neutral versus Punny Contexts

57 of 57

Representation hijacking bomb🡪carrot