For later – set up using your laptop now
Go to colab.google.com
Secrets:
Submit your NDIF_API_KEY to the Google Form: [TBD]
Neural Mechanics, Week 1: LLM Foundations and Logit Lens
Tuesday, January 13, 2026
David Bau
Northeastern University
How Goes Research Planning?
First Three Reading Questions:
1. Keivalya – on high-quality research. what is a “significant result?”
2. Jasmine – evaluation, “small world” beginning, “open doors”?
3. Yiqian – which comes first, the hypothesis or experiment, in research?
(More discussion later.)
The Research Process is Iterative
[Diagram: a loop: ask “Interesting?” and “Feasible?”, run an Exploratory Experiment, then Scaled-Up Experiments, and repeat.]
Today’s Goals
One-Picture Neural Networks Review
An old idea: Frank Rosenblatt’s 1958 Perceptron.
Train by gradually adjusting weights to reduce errors on seen examples.
Stack them and you get a Multi-Layer Perceptron (MLP).
One-slide Language Model Review
[Figure: a branching diagram of next-token probabilities over phrases like “When in Rome / town …” and “Back to back”, with branch probabilities such as p=0.6 and p=0.4.]
How to pick what comes after “in” depends on what came before.
Language Models take Tokens to Probabilities
[Figure: the text “a process called tokenization” split into tokens (“token”, “ization”, “izing”, “-”, “.”, “,”); the model maps a vector of preceding input text to a vector of next-token probabilities, from which we predict or sample the next token.]
Run a language model repeatedly to generate text!
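The “run it repeatedly” loop can be sketched in a few lines of Python. The bigram table below is entirely made up for illustration; a real LLM conditions on all preceding tokens, not just the last one:

```python
import numpy as np

# A made-up "language model": a bigram table from the last token to a
# probability distribution over next tokens (illustration only).
VOCAB = ["When", "in", "Rome", "town", "<eos>"]
PROBS = {
    "When": [0.0, 1.0, 0.0, 0.0, 0.0],  # "When" -> "in"
    "in":   [0.0, 0.0, 0.6, 0.4, 0.0],  # "in" -> "Rome" (0.6) or "town" (0.4)
    "Rome": [0.0, 0.0, 0.0, 0.0, 1.0],  # end after "Rome"
    "town": [0.0, 0.0, 0.0, 0.0, 1.0],  # end after "town"
}

def generate(prompt, max_tokens=10, rng=np.random.default_rng(0)):
    """Predict next-token probabilities, sample one, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        p = np.array(PROBS[tokens[-1]])           # vector of probabilities
        nxt = VOCAB[rng.choice(len(VOCAB), p=p)]  # sample the next token
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return tokens

print(generate(["When"]))  # either ['When', 'in', 'Rome'] or ['When', 'in', 'town']
```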
Ayush Agrawal – tokenization question
Inside Transformer Language Models
Example: “Miles Davis plays the ___” → “trumpet” (predicted) ✓
1. First the encoder turns each token (e.g. “the”) into a vector of neural activations.
2. Then a series of neural layers mixes and transforms the vectors for each token.
3. Finally the decoder turns each vector into a prediction for the next word: a guess for every word position.
Detail: LM estimates probabilities
Haoyu He – internal vocab question
Ananya Malik – linear sep question
Important Transformer Pieces to Know
Encoder is a look-up table from tokens to vectors.
Typical vocabulary: 50,000 or 150,000 vectors in the table.
Attention is a neural network that “remembers” recent information from contextual tokens by (1) making a query vector (2) to match key vectors (3) and gather & add value vectors
MLPs (multilayer perceptrons) are two-layer neural networks that match and modify single-token feature vectors
Decoder makes a vector of 50k/150k next-token probabilities
short-term contextual memory
long-term parametric memory
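A minimal sketch of the encoder (and its mirror-image decoder) as a plain table, with made-up sizes in place of the real 50k–150k vocabulary:

```python
import numpy as np

# A miniature encoder/decoder pair (sizes are illustrative; real models
# use ~50,000-150,000 vocabulary rows and hundreds of dimensions).
vocab_size, d_model = 10, 4
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((vocab_size, d_model))

# Encoder: a pure table look-up from token ids to vectors.
token_ids = np.array([3, 1, 4])
x = embedding_table[token_ids]               # shape (3, d_model)

# Decoder: vectors -> a vocab-sized vector of next-token probabilities.
# (Tying the decoder matrix to the embedding table is one common choice.)
logits = x @ embedding_table.T               # shape (3, vocab_size)
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs = probs / probs.sum(-1, keepdims=True)
```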
Yuchen Hou – memory question
Details of Self-Attention
Compute key-query affinities:
    e_ij = q_iᵀ k_j
Compute attention weights from affinities (softmax):
    α_ij = exp(e_ij) / Σ_{j'} exp(e_ij')
Compute outputs as weighted sum of values:
    output_i = Σ_j α_ij v_j
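In numpy, the three steps above are only a few lines; the division by √d is the standard scaling used in practice, though it is not shown in the slide’s formulas:

```python
import numpy as np

def self_attention(q, k, v):
    """Single-head self-attention following the formulas above:
    affinities e_ij = q_i . k_j, softmax over j, weighted sum of values."""
    e = q @ k.T                               # (T, T) key-query affinities
    e = e / np.sqrt(q.shape[-1])              # scaling used in practice
    a = np.exp(e - e.max(-1, keepdims=True))  # numerically stable softmax...
    a = a / a.sum(-1, keepdims=True)          # ...attention weights alpha_ij
    return a @ v                              # output_i = sum_j alpha_ij v_j

rng = np.random.default_rng(0)
T, d = 5, 8
q, k, v = rng.standard_normal((3, T, d))
out = self_attention(q, k, v)
assert out.shape == (T, d)
```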
John Hewitt
Luze – attention question
Self-Attention Details
Step 1: create three vectors from each of the encoder’s input vectors: a Query, a Key, and a Value (typically of smaller dimension), by multiplying the embedding by three matrices learned during training.
While processing each word, this lets the model look at other positions in the input sequence for clues to build a better encoding for this word.
Self-Attention
Step 2: calculate a score (like we have seen for regular attention!) that says how much focus to place on other parts of the input sentence as we encode a word at a certain position.
Take the dot product of the query vector with the key vector of the respective word we’re scoring.
E.g., processing the self-attention for the word “Thinking” in position #1, the first score is the dot product of q1 and k1; the second score is the dot product of q1 and k2.
Self-Attention
Intuition: softmax score determines how much each word will be expressed at this position.
Self-Attention
More details: transformers run several attention heads in parallel (kind of like multiple filters in a CNN).
See https://jalammar.github.io/illustrated-transformer/
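A sketch of that “multiple filters” idea, multi-head attention: the model dimension is split into several smaller heads that attend independently and are then concatenated (all sizes here are illustrative):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(-1, keepdims=True))
    return z / z.sum(-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, n_heads):
    """Split d_model into n_heads smaller heads, attend in each, concatenate."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    def split(m):                                 # (T, d) -> (heads, T, d_head)
        return m.reshape(T, n_heads, d // n_heads).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    a = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d // n_heads))
    out = a @ v                                   # (heads, T, d_head)
    return out.transpose(1, 0, 2).reshape(T, d)   # concatenate the heads

rng = np.random.default_rng(0)
T, d, H = 4, 16, 4
x = rng.standard_normal((T, d))
Wq, Wk, Wv = rng.standard_normal((3, d, d))
assert multi_head_attention(x, Wq, Wk, Wv, H).shape == (T, d)
```

(Real transformers also apply an output projection after the concatenation; it is omitted here for brevity.)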
Is Self-Attention All You Need? Not yet.
[Figure: input words w_1 … w_T (“The chef who … food”), each mapped to query, key, and value vectors (q_i, k_i, v_i), feeding two stacked self-attention blocks: a candidate replacement for recurrence, with problems we’ll go through.]
Self-attention doesn’t know the order of its inputs.
Position Encoding
p_t = [sin(ω_1 t), cos(ω_1 t), …, sin(ω_{d/2} t), cos(ω_{d/2} t)]ᵀ
where t is the index in the sequence and each pair of dimensions uses its own frequency ω_i.
Image: https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/
[Figure: heatmap of the encodings; x axis: index in the sequence, y axis: dimension.]
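A sketch of the sinusoidal encoding, using the common 10000^(−2i/d) frequency schedule (the slide does not pin down the ω values):

```python
import numpy as np

def positional_encoding(T, d):
    """Sinusoidal position encodings: p_t alternates sin/cos at
    frequencies omega_1..omega_{d/2} (the common 10000**(-2i/d) choice)."""
    t = np.arange(T)[:, None]            # index in the sequence
    i = np.arange(d // 2)[None, :]       # frequency index
    omega = 1.0 / (10000 ** (2 * i / d))
    p = np.zeros((T, d))
    p[:, 0::2] = np.sin(omega * t)       # even dimensions: sin
    p[:, 1::2] = np.cos(omega * t)       # odd dimensions: cos
    return p

pe = positional_encoding(T=50, d=16)
assert pe.shape == (50, 16)
```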
Masking Attention to the Future
To use self-attention in a language model, mask attention to the future by setting those scores to negative infinity:

e_ij = q_iᵀ k_j   if j < i
e_ij = −∞         if j ≥ i

[Figure: a triangular mask over “[START] The chef who”: for encoding each word, we can look only at the (not greyed out) words before it.]
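A sketch of the mask in numpy. One practical note: implementations usually allow j ≤ i (a position may attend to itself), which also avoids a fully masked first row; the slide’s strict j < i works because of the [START] token:

```python
import numpy as np

def causal_softmax(e):
    """Mask attention to the future: set e_ij = -inf for j > i, then softmax.
    (Letting a position see itself, j <= i, is the common implementation
    choice; the slide's strict j < i relies on a [START] token.)"""
    T = e.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where j > i
    e = np.where(future, -np.inf, e)
    z = np.exp(e - e.max(-1, keepdims=True))            # exp(-inf) -> 0
    return z / z.sum(-1, keepdims=True)

a = causal_softmax(np.random.default_rng(0).standard_normal((4, 4)))
assert np.allclose(np.triu(a, k=1), 0.0)  # no weight on future positions
```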
MLP (Feed-Forward) Modules
Each layer uses a feed-forward network to post-process each output vector:

m_i = MLP(output_i) = W_2 · ReLU(W_1 · output_i + b_1) + b_2
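The formula as code, applied position-wise to a whole sequence at once (sizes are illustrative; the hidden layer is typically about 4× the model dimension):

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """Two-layer feed-forward: m_i = W2 . ReLU(W1 . x_i + b1) + b2,
    applied independently to each token's vector."""
    h = np.maximum(0.0, x @ W1.T + b1)   # expand and rectify
    return h @ W2.T + b2                 # project back to model dimension

rng = np.random.default_rng(0)
d_model, d_hidden, T = 8, 32, 5          # hidden is ~4x wider, as is typical
x = rng.standard_normal((T, d_model))
W1, b1 = rng.standard_normal((d_hidden, d_model)), np.zeros(d_hidden)
W2, b2 = rng.standard_normal((d_model, d_hidden)), np.zeros(d_model)
assert mlp(x, W1, b1, W2, b2).shape == (T, d_model)
```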
[Figure: words w_1 … w_T (“The chef who … food”) flowing through alternating self-attention and position-wise feed-forward (FF) blocks.]
Intuition: the FF network processes the result of attention.
The “Residual Stream”
[Figure: the “Miles Davis plays the → trumpet (predicted)” example; each token’s hidden state h_i^(l) flows upward, and at every layer the attention and MLP blocks read the layer input and add their outputs to produce the layer output.]
Every layer calculates a small “residual vector” to add to the stream (He 2015, Elhage 2021).
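The bookkeeping can be sketched with random stand-ins for the attention and MLP sublayers; the point is that each sublayer adds its output to the running state rather than replacing it:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 8, 4

def attn_update(h): return 0.1 * rng.standard_normal(h.shape)  # stand-in
def mlp_update(h):  return 0.1 * rng.standard_normal(h.shape)  # stand-in

h = rng.standard_normal(d)          # h^(0): the encoded token
stream, updates = [h.copy()], []
for _ in range(n_layers):
    for sublayer in (attn_update, mlp_update):
        u = sublayer(h)             # each sublayer computes a small update...
        updates.append(u)
        h = h + u                   # ...and adds it to the stream
    stream.append(h.copy())

# The final state is the initial embedding plus the sum of every residual
# update, which is why intermediate states can already be decoded.
assert np.allclose(stream[-1], stream[0] + sum(updates))
```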
Rice Wang – privileged basis question
Grace – contribution, echo question
“Early Exit Decoding”
[Figure: the “Miles Davis plays the” example; decoding the next-to-last layer’s state already yields “trumpet”.]
If you skip the last layer and decode early, it often already knows the prediction (Panda 2016, Elbayad 2020).
The “Logit Lens”
[Figure: decoding every layer’s state for “Miles Davis plays the”: lower layers read e.g. “Stein” and “Miles”, later layers “horn”, and the top layer “trumpet”.]
If you decode each vector early, you can see how the prediction evolves (nostalgebrist 2020).
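A toy version of the logit lens: stack random residual updates and decode every intermediate state with the same final decoder matrix. The weights (and hence the decoded words) are random stand-ins; only the mechanics match the real technique:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy logit lens: decode EVERY layer's state with the one decoder matrix
# that the final layer uses, and watch the top guess evolve.
rng = np.random.default_rng(0)
vocab = ["Stein", "Miles", "horn", "trumpet"]   # illustrative mini-vocabulary
d, n_layers = 8, 4
decoder = rng.standard_normal((len(vocab), d))  # the "unembedding" matrix

h = rng.standard_normal(d)                      # state after the encoder
for layer in range(1, n_layers + 1):
    h = h + 0.5 * rng.standard_normal(d)        # stand-in residual update
    probs = softmax(decoder @ h)                # decode this layer early
    print(f"layer {layer}: top guess {vocab[int(np.argmax(probs))]!r} "
          f"(p={probs.max():.2f})")
```

With a real model the recipe is the same: run a forward pass with `output_hidden_states=True` (in Hugging Face Transformers) and apply the model’s unembedding, typically after the final layer norm, to each layer’s hidden state.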
Isaac Dalke – logit lens early?
Avery Huang – logit lens timing?
The “Logit Lens” grid
Logit lens lets you view a transformer as a grid of “next token” predictions.
One prediction for each layer at each token.
Yunus – circularity?
Claire – veracity?
Guangyuan – thinking?
Logit Lens on a Translation Task
Use logit lens to inspect a French → Chinese translation task (Wendler 2024).
Predict the token after: Français: "fleur" - 中文:
(“中文” means “Chinese”; the correct answer is 花, “flower”.)
In the middle layers, the decoded token is neither French nor Chinese!
Courtney – artifact?
Yuqi – other langs?
Christopher Curtis
Arya - multitoken?
Jesseba - SAEs?
“Do Llamas Work in English?”
(Wendler 2024)
On the y axis: the probability decoded for the English or Chinese translation of the French word (averaged over many cases).
On the x axis: which internal transformer layer.
[Plot: the English translation (“flower”) dominates the middle layers; 花 wins only at the end.]
Try the Workbench Prototype
Step 1: Login and Make a Workspace
Prerequisite: need to be logged into GitHub
Then: Create Workspace. Call it “Demo”.
The Three-Pane Interface
List of charts
Experiment Designer
Experiment Output
Select an LLM: Llama 3.1-8b
Model Selector. Llama 3.1-8b has eight billion trained parameters.
Enter a “Cloze Prompt”
A “cloze prompt” is a fill-in-the-blank text designed to test LLM knowledge.
E.g., “Miles Davis plays the ____”
An LLM predicts the next word, so we leave the last word blank to test it.
Hit Enter to run the LLM
(This means it’s working)
Tokenization
LLMs break text into “tokens” to process. Internally, each token becomes a column of neural activation vectors, one vector per layer.
As soon as you run the LLM, your text is tokenized.
Llama always requires a “begin-of-text” token, so one appears here.
The predicted tokens are shown here...
The LLM got it right!
The Logit Lens Heatmap
Input tokens
Darker squares mean “more confident predictions”
Output corner
The direction of info flow
Each square is a “representation vector” of neurons
Reading the Heatmap
The word shown in each square is the “decoded” vector. Here: Layer 25 thinks that after “Miles Davis”, you should say “Miles” again.
Layer 30 thinks that after “Miles Davis plays the” should come “horn”.
Color shows confidence; it is not very confident about “horn”.
Controlling the Heatmap
Drag a rectangle to focus on the last layers, then click the “crop” button to zoom in.
Change the x-step stride to 1 to show every layer.
Controlling the Heatmap
The very last layer predicts “trumpet” a bit more confidently.
But right before that, it was “thinking” about predicting “horn”.
Switching to the Line Plot
This token highlights to show the predictions after the token “the”
Click on “Line” for the line plot.
Lines show predictions by layer
Details in the line plot
An LLM doesn’t predict just one guess; it assigns a probability to several guesses.
Right before the last layer, it was “thinking” about predicting “horn”.
The Many Guesses of an LLM
All these guesses materialized suddenly at the last layer
All the top guesses are selected here. The _ means “space”
Adding a token to track
Click the × to remove “blues” and “space”
Adding a token to track
Type “horn” and then�select the “_horn” token that includes the space
Also: add “_Miles”
The story told by tokens
Layers 23-29 “think” about saying “Miles”.
Layer 30 votes for “horn”.
Layer 31 chooses “trumpet”.
Try a Translation
Click in the empty part of the box to edit the text
Enter (copy from bit.ly/eab-lens): Français: "fleur" - 中文: "
The answer should be 花 which means “flower” in Chinese
Try a Translation
We didn’t provide any input in English!
Zoom into the last ten layers
The quote is predicted but only at the very last layer
The Chinese word appears at layer 28, but it’s back to English at 29
Assemble a line plot
Select “Line” and then add tokens of interest; be sure to add the “space” version of “_flower”.
[Plot: one line for the Chinese prediction, one for the English prediction, across layers.]
What Does this Teach Us?
Hypothesis: the presence of English between French and Chinese suggests a language-independent concept representation that sits between all languages.
How would we test this hypothesis?
Now Your Turn. https://bit.ly/eab-lens
Try investigating other prompts using the logit lens prototype. Visit: http://bit.ly/eab-lens
Now in a notebook
Go to colab.google.com
Secrets:
Submit your NDIF_API_KEY to the Google Form: [TBD]
Then: https://bit.ly/4jCc5ZD
Logit Lens Research Example Notebook
Capital of France: “a” at the last layer
Language translation: amor 🡪 amour
Pun: electrician swimmers
Neutral versus Punny Contexts
Representation hijacking bomb🡪carrot