1 of 57

For later – set up using your laptop now

Go to colab.google.com

Secrets:

  • Enter your HF_TOKEN
  • Enter your NDIF_API_KEY

Submit your NDIF_API_KEY to the googleform: [TBD]

2 of 57

Neural Mechanics
Week 1: LLM Foundations and Logit Lens

Tuesday, January 13, 2026

David Bau

Northeastern University

3 of 57

How Goes Research Planning?

First Reading Questions:

1. Keivalya – on high-quality research: what is a "significant result"?

2. Jasmine – evaluation: "small world" beginnings and "open doors"?

3. Yiqian – which comes first in research, the hypothesis or the experiment?

(More discussion later.)

4 of 57

The Research Process is Iterative

  • A good research question is both Feasible and Interesting

  • Today: exploratory experiments are a toolkit for feasibility

[Diagram: a loop between Exploratory Experiment and Scaled-Up Experiments, gated by the questions "Feasible?" and "Interesting?"]

5 of 57

Today’s Goals

  1. Whirlwind review: Neural nets, Language models, Transformers
  2. First interpretability method: the Logit Lens
  3. Hands-on setup: Huggingface, NDIF tokens
  4. Lab: reproduce experiments, discuss what they tell us

6 of 57

One-Picture Neural Networks Review

An old idea: Frank Rosenblatt's 1958 Perceptron.

[Diagram: perceptron units stacked into a Multi-Layer Perceptron (MLP)]

Train by gradually adjusting weights to reduce errors on seen examples.

7 of 57

One-slide Language Model Review

[Diagram: a branching word graph ("When in Rome", "near home", "Back to back", "in town") with a transition probability on each edge]

How to pick what comes after "in" depends on what came before.
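The branching diagram above is essentially a bigram model: a table of next-word probabilities conditioned on the previous word. A minimal sketch in Python, with illustrative probabilities (not the exact numbers from the diagram):

```python
import random

# Toy bigram model: P(next word | previous word).
# Probabilities here are illustrative, not taken exactly from the slide.
bigram = {
    "When": {"in": 1.0},
    "in":   {"Rome": 0.6, "town": 0.4},
    "Back": {"to": 1.0},
    "to":   {"back": 0.5, "town": 0.5},
}

def next_word(prev, rng=random):
    """Sample the next word given the previous one."""
    choices = bigram[prev]
    words = list(choices)
    weights = [choices[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

print(next_word("When"))  # always "in"
print(next_word("in"))    # "Rome" or "town"
```

A full language model does the same thing, except the condition is the entire preceding context rather than one word.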

8 of 57

Language Models take Tokens to Probabilities

 

[Diagram: the input "a process called token" is encoded as a vector of preceding input text; the model outputs a vector of next-token probabilities ("ization", "izing", "-", …); sampling one and appending it yields "a process called tokenization", whose next predicted tokens include ".", ",", "that"]

Run a language model repeatedly to generate text!

Ayush Agrawal – tokenization question
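The "run repeatedly" loop can be sketched as follows. Here `predict_probs` is a hypothetical stand-in for a real language model, hard-coded to complete one phrase:

```python
# Greedy generation loop: run the model repeatedly, appending the most
# probable next token each time. `predict_probs` is a made-up stand-in
# for a real LM; it only knows how to finish one phrase.
def predict_probs(tokens):
    table = {
        ("a", "process", "called"): {"token": 0.7, "method": 0.3},
        ("a", "process", "called", "token"): {"ization": 0.8, "izing": 0.2},
    }
    return table.get(tuple(tokens), {".": 1.0})

def generate(tokens, steps=3):
    tokens = list(tokens)
    for _ in range(steps):
        probs = predict_probs(tokens)
        tokens.append(max(probs, key=probs.get))  # greedy: pick the argmax
    return tokens

print(generate(["a", "process", "called"], steps=2))
# ['a', 'process', 'called', 'token', 'ization']
```

Real systems often sample from the probabilities rather than always taking the argmax, which is what makes generated text vary run to run.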

9 of 57

Inside Transformer Language Models

[Diagram: the tokens "Miles Davis plays the" flow through the network to the predicted token "trumpet", with a guess produced for every word]

1. First the encoder turns each token ("the") into a vector of neural activations.

2. Then a series of neural layers mixes and transforms the vectors for each token.

3. Finally the decoder turns each vector into a prediction for the next word.

Detail: the LM estimates probabilities.

Haoyu He – internal vocab question

Ananya Malik – linear sep question

10 of 57

Important Transformer Pieces to Know

Encoder is a look-up table from tokens to vectors.

Typical vocabulary: 50,000 or 150,000 vectors in the table.

Attention is a neural network that “remembers” recent information from contextual tokens by (1) making a query vector (2) to match key vectors (3) and gather & add value vectors

MLP (multilayer perceptrons) are two-layer neural networks that match and modify single-token feature vectors

Decoder makes a vector of 50k/150k next-token probabilities

short-term�contextual memory

long-term�parametric memory

Yuchen Hou – memory question

11 of 57

Details of Self-Attention

  • We have some keys 𝑘1, 𝑘2, … , 𝑘𝑇. Each key is 𝑘𝑖 ∈ ℝ𝑑
  • We have some values 𝑣1, 𝑣2, … , 𝑣𝑇. Each value is 𝑣𝑖 ∈ ℝ𝑑
  • Attention operates on queries, keys, and values.
  • We have some queries 𝑞1, 𝑞2, … , 𝑞𝑇. Each query is 𝑞𝑖 ∈ ℝ𝑑
  • In self-attention, the queries, keys, and values are drawn from the same source.
    • For example, if the output of the previous layer is 𝑥1, … , 𝑥𝑇, (one vec per word) we could let 𝑣𝑖 = 𝑘𝑖 = 𝑞𝑖 = 𝑥𝑖 (If our V, K, Q matrices were all identity, we could use the same vectors for all of them!)
  • The (dot product) self-attention operation is as follows:

    Compute key-query affinities:
      𝑒𝑖𝑗 = 𝑞𝑖𝖳𝑘𝑗

    Compute attention weights from affinities (softmax):
      𝛼𝑖𝑗 = exp(𝑒𝑖𝑗) / Σ𝑗′ exp(𝑒𝑖𝑗′)

    Compute outputs as weighted sum of values:
      output𝑖 = Σ𝑗 𝛼𝑖𝑗 𝑣𝑗

John Hewitt

Luze – attention question
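The three steps above can be sketched in a few lines of NumPy, using the slide's simplification that the Q, K, and V matrices are identity maps (so 𝑞𝑖 = 𝑘𝑖 = 𝑣𝑖 = 𝑥𝑖):

```python
import numpy as np

def self_attention(X):
    """Dot-product self-attention with identity Q/K/V maps, so that
    q_i = k_i = v_i = x_i, as in the slide's simplification.
    X has shape (T, d): one row per token."""
    Q, K, V = X, X, X
    e = Q @ K.T                                   # affinities e_ij = q_i^T k_j
    e = e - e.max(axis=-1, keepdims=True)         # shift for numerical stability
    alpha = np.exp(e) / np.exp(e).sum(axis=-1, keepdims=True)  # softmax rows
    return alpha @ V                              # output_i = sum_j alpha_ij v_j

X = np.random.randn(4, 8)   # T=4 tokens, d=8 dimensions
out = self_attention(X)
print(out.shape)            # (4, 8)
```

Each output row is a convex combination of the value vectors, weighted by how well that token's query matches each key.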

12 of 57

Self-Attention Details

Step 1: create three vectors from each of the encoder's input vectors: a Query, a Key, and a Value (typically of smaller dimension), by multiplying the embedding by three matrices learned during training.

While processing each word, this allows the model to look at other positions in the input sequence for clues, to build a better encoding for this word.

13 of 57

Self-Attention

Step 2: calculate a score (like we have seen for regular attention!) that determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

Take the dot product of the query vector with the key vector of the respective word we're scoring.

E.g., Processing the self-attention for word “Thinking” in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

14 of 57

Self Attention

  • Step 3: divide the scores by the square root of the dimension of the key vectors (for more stable gradients).
  • Step 4: pass the result through a softmax operation (so all scores are positive and add up to 1).

Intuition: the softmax score determines how much each word will be expressed at this position.
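Steps 3-4 in a NumPy sketch; the example query and keys are made up for illustration:

```python
import numpy as np

def scaled_softmax_scores(q, keys):
    """Steps 3-4: scale the dot-product scores by sqrt(d_k), then softmax,
    yielding attention weights that are positive and sum to 1."""
    d_k = keys.shape[-1]
    scores = keys @ q / np.sqrt(d_k)      # step 3: scaled dot products
    exp = np.exp(scores - scores.max())   # subtract max for numerical stability
    return exp / exp.sum()                # step 4: softmax

q = np.array([1.0, 0.0, 1.0, 0.0])
keys = np.array([[1.0, 0.0, 1.0, 0.0],   # matches q well -> higher weight
                 [0.0, 1.0, 0.0, 1.0]])  # orthogonal to q -> lower weight
weights = scaled_softmax_scores(q, keys)
print(weights.sum())   # 1.0
```

The sqrt(d_k) scaling keeps the pre-softmax scores from growing with dimension, which would otherwise saturate the softmax.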

15 of 57

Self Attention

  • Step 6: sum up the weighted value vectors. This produces the output of the self-attention layer at this position

More details:

  • What we have seen for a word is done for all words (using matrices)
  • Need to encode position of words
  • And improved using a mechanism called “multi-headed” attention

(kind of like multiple filters for CNN)

see https://jalammar.github.io/illustrated-transformer/

16 of 57

Is Self-Attention All You Need? Not yet.

[Diagram: two stacked self-attention blocks over the words "The chef who … food", each word 𝑤𝑖 contributing its own query 𝑞𝑖, key 𝑘𝑖, and value 𝑣𝑖]

  • In the diagram at the right, we have stacked self-attention blocks, like we might stack LSTM layers.
  • Can self-attention be a drop-in replacement for recurrence?
  • No. It has a few issues, which we'll go through.
  • First, self-attention is an operation on sets. It has no inherent notion of order.

Self-attention doesn’t know the order of its inputs.

John Hewitt

17 of 57

Position Encoding

 

  • Sinusoidal position representations: concatenate sinusoidal functions of varying periods:

    𝑝𝑡 = [sin(𝜔1𝑡), cos(𝜔1𝑡), …, sin(𝜔𝑑/2 𝑡), cos(𝜔𝑑/2 𝑡)]𝖳

    where 𝑡 is the index in the sequence and the vector has dimension 𝑑.

Image: https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/

John Hewitt
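The encoding above can be sketched as follows, assuming the standard base-10000 frequency schedule from Vaswani et al. (2017), which the slide does not pin down:

```python
import numpy as np

def sinusoidal_position(t, d, base=10000.0):
    """Position vector p_t: interleaved sin/cos at d/2 frequencies.
    The base-10000 schedule is the common choice from Vaswani et al. (2017);
    other bases work too."""
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)  # omega_1 .. omega_{d/2}
    p = np.empty(d)
    p[0::2] = np.sin(freqs * t)   # even slots: sines
    p[1::2] = np.cos(freqs * t)   # odd slots: cosines
    return p

p = sinusoidal_position(t=5, d=8)
print(p.shape)  # (8,)
```

Because each frequency pair behaves like a rotation, p_{t+k} is a fixed linear function of p_t, which is the "linear relationships" property the linked blog post explores.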

18 of 57

Masking Attention to the Future

  • To use self-attention in decoders, we need to ensure we can’t peek at the future.
  • At every timestep, we could change the set of keys and queries to include only past words. (Inefficient!)

  • To enable parallelization, we mask out attention to future words by setting attention scores to −∞.

[Diagram: a lower-triangular grid of attention scores over "[START] The chef who": for encoding these words, we can look at these (not greyed out) preceding words; future positions are greyed out at −∞]

𝑒𝑖𝑗 = 𝑞𝑖𝖳𝑘𝑗  if 𝑗 < 𝑖;   −∞  if 𝑗 ≥ 𝑖

John Hewitt
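The masked-score formula above can be sketched in NumPy. Note that under this convention row 0 has no unmasked keys; in practice the [START] token fills that role:

```python
import numpy as np

def masked_scores(Q, K):
    """Compute e_ij = q_i^T k_j for j < i, and -inf for j >= i,
    matching the slide's masking convention. (Row 0 then has no
    unmasked keys; in practice a [START] token fills that role.)"""
    e = Q @ K.T
    T = e.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool))  # True where j >= i
    e[mask] = -np.inf
    return e

scores = masked_scores(np.eye(3), np.eye(3))
print(scores)  # upper triangle (including the diagonal) is -inf
```

After the softmax, the −∞ entries become exactly zero attention weight, so the whole sequence can be processed in one parallel pass without leaking future tokens.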

19 of 57

MLP (Feed-Forward) Modules

  • Note that there are no elementwise nonlinearities in self-attention; stacking more self-attention layers just re-averages value vectors

  • Easy fix: add a feed-forward network

to post-process each output vector.

𝑚𝑖 = MLP(output𝑖) = 𝑊2 ReLU(𝑊1 output𝑖 + 𝑏1) + 𝑏2

[Diagram: over the words "The chef who … food", each self-attention layer is followed by a position-wise FF block applied independently to every token]

Intuition: the FF network processes the result of attention

John Hewitt
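The FF formula above as a NumPy sketch, with made-up toy dimensions:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FF network: m_i = W2 @ ReLU(W1 @ x + b1) + b2,
    applied independently to each token's vector x."""
    h = np.maximum(0.0, W1 @ x + b1)   # ReLU supplies the nonlinearity
    return W2 @ h + b2

d, d_ff = 4, 16   # toy sizes; real transformers often use d_ff = 4*d
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_ff, d)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d, d_ff)), np.zeros(d)
x = rng.standard_normal(d)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (4,)
```

This is the piece that fixes the problem noted above: without the ReLU, stacking attention layers would only ever re-average value vectors.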

20 of 57

The “Residual Stream”

[Diagram: input tokens "Miles Davis plays the" predict "trumpet"; at each layer, the state ℎ𝑖(𝑙) flows from layer input to layer output, with the attention and MLP outputs added into it]

Every layer calculates a small “residual vector” to add to the stream (He 2015, Elhage 2021)

Rice Wang – privileged basis question

Grace – contribution, echo question
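The residual stream in code: each sublayer's output is added to the stream, not substituted for it. Here `attention` and `mlp` are toy stand-in callables, not real sublayers:

```python
import numpy as np

def transformer_layer(x, attention, mlp):
    """One layer of the residual stream: each sublayer computes a small
    "residual vector" that is *added* to the stream (He 2015, Elhage 2021).
    `attention` and `mlp` are stand-ins for the real sublayers."""
    x = x + attention(x)   # attention writes its contribution into the stream
    x = x + mlp(x)         # the MLP adds its contribution too
    return x

# Toy stand-ins: small residual updates leave the stream mostly intact.
x0 = np.ones(4)
out = transformer_layer(x0, attention=lambda x: 0.1 * x, mlp=lambda x: 0.1 * x)
print(out)   # close to x0: layers nudge the stream rather than overwrite it
```

This additive structure is what makes early-exit decoding and the logit lens possible: the stream at any layer is already in roughly the same vector space the decoder reads.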

21 of 57

“Early Exit Decoding”

[Diagram: for the prompt "Miles Davis plays the", decoding before the final layer already yields "trumpet"]

If you skip the last layer and decode early, it often already knows the prediction

(Panda 2016,�Elbayad 2020)

22 of 57

The “Logit Lens”

[Diagram: for "Miles Davis plays the", early-decoded predictions at successive layers include "Stein", "Miles", and "horn" before settling on "trumpet"]

If you decode each vector early, you can see how the prediction evolves (nostalgebrist 2020)

Isaac Dalke – logit lens early?

Avery Huang – logit lens timing?
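The logit lens is just the decoder applied everywhere: multiply *each* layer's hidden state by the unembedding matrix and softmax, instead of doing so only at the last layer. A toy sketch with a made-up unembedding matrix and made-up hidden states (real activations come from a model like Llama via the workbench or notebook below):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def logit_lens(hidden_per_layer, W_U, vocab):
    """Apply the decoder (unembedding W_U) to every layer's hidden state,
    not just the last one, to watch the prediction evolve."""
    readout = []
    for h in hidden_per_layer:
        probs = softmax(W_U @ h)
        readout.append(vocab[int(np.argmax(probs))])
    return readout

vocab = ["horn", "trumpet", "Miles"]
W_U = np.eye(3)   # toy unembedding: one vocab direction per axis
layers = [np.array([0.1, 0.0, 2.0]),   # early layer leans "Miles"
          np.array([2.0, 0.5, 0.1]),   # middle layer leans "horn"
          np.array([0.5, 3.0, 0.1])]   # last layer leans "trumpet"
print(logit_lens(layers, W_U, vocab))  # ['Miles', 'horn', 'trumpet']
```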

23 of 57

The “Logit Lens” grid

Logit lens lets you view a transformer as a grid of “next token” predictions.

One prediction for each layer at each token.

Yunus – circularity?

Claire – veracity?

Guangyuan – thinking?


25 of 57

Logit Lens on a Translation Task

Use logit lens to inspect a French 🡪 Chinese translation task

(Wendler 2024)

Predict the token after

Français: "fleur" - 中文:

(“中文” means “Chinese”)

花 is correct

In the middle: this is neither French nor Chinese!

26 of 57

Logit Lens on a Translation

Use logit lens to inspect a French 🡪 Chinese translation task

(Wendler 2024)

Predict the token after

Français: "fleur" - 中文:

("中文" means "Chinese")

花 is correct

In the middle: this is neither French nor Chinese!

Courtney – artifact?

Yuqi – other langs?

Christopher Curtis

Arya - multitoken?

Jesseba - SAEs?

27 of 57

“Do Llamas Work in English?”

(Wendler 2024)

On the y-axis: the probability decoded for the English or Chinese translation of the French word (averaged over many cases).

On the x-axis: which internal transformer layer.

[Plot annotations: 花 at the end; "flower" in English in the middle layers]

28 of 57

Try the Workbench Prototype

Open the document bit.ly/eab-lens

Open the workbench prototype workbench.ndif.us

29 of 57

Step 1: Login and Make a Workspace

Prerequisite: need to be logged into GitHub

Then: Create Workspace. Call it “Demo”.

30 of 57

The Three-Pane Interface

List of charts

Experiment Designer

Experiment Output

31 of 57

Select an LLM: Llama 3.1-8b


Model Selector.�Llama 3.1-8b has eight billion trained parameters.

32 of 57

Enter a “Cloze Prompt”

A “Cloze prompt” is a fill-in-the-blank text form designed to test LLM knowledge.

E.g., “Miles Davis plays the ____”

An LLM predicts the next word, so we leave the last word blank to test it.

Hit Enter to run the LLM

(This means it’s working)

33 of 57

Tokenization

LLMs break text into “tokens” to process. Internally, each token becomes a column (a sequence of vectors) of neurons.

As soon as you run the LLM, your text is tokenized.

Llama always requires a "begin-of-text" token, so that appears here.

The predicted tokens are shown here...

The LLM got it right!

34 of 57

The Logit Lens Heatmap

Input tokens

Darker squares mean “more confident predictions”

Output corner

The direction of info flow

Each square is a “representation vector” of neurons

35 of 57

Reading the Heatmap

The word shown in each square is the "decoded" vector. Here, layer 25 thinks that after "Miles Davis" you should say "Miles" again.

Layer 30 thinks that after "Miles Davis plays the" should come "horn".

Color shows confidence; it is not very confident about "horn".

36 of 57

Controlling the Heatmap

Drag a rectangle to focus on the last layers

Change the x-step stride to 1 to show every layer

Then click the "crop" button to zoom in

37 of 57

Controlling the Heatmap

But right before that it was “thinking” about predicting “horn”

The very last layer predicts “trumpet” a bit more confidently

38 of 57

Switching to the Line Plot

This token is highlighted to show the predictions after the token "the"

Click on “Line” for the line plot.

Lines show predictions by layer

39 of 57

Details in the line plot

But right before that it was “thinking” about predicting “horn”

An LLM doesn’t predict just one guess, but it assigns a probability to several guesses

40 of 57

The Many Guesses of an LLM

All these guesses materialized suddenly at the last layer

All the top guesses are selected here. The _ means “space”

41 of 57

Adding a token to track

x to remove blues and “space”

42 of 57

Adding a token to track

Type “horn” and then�select the “_horn” token that includes the space

43 of 57

Adding a token to track

Type “horn” and then�select the “_horn” token that includes the space

Also: add

“_Miles”

44 of 57

The story told by tokens

Layers 23-29 "think" about saying "Miles"

Layer 30 votes for “horn”

Layer 31 chooses “trumpet”

45 of 57

Try a Translation

Click in the empty part of the box to edit the text

Enter (copy from bit.ly/eab-lens):

Français: "fleur" - 中文: "

The answer should be 花 which means “flower” in Chinese

46 of 57

Try a Translation

We didn’t provide any input in English!

47 of 57

Zoom into the last ten layers

The quote is predicted but only at the very last layer

The Chinese word appears at layer 28, but it’s back to English at 29

48 of 57

Assemble a line plot

The Chinese prediction

Select “Line” and then add tokens of interest here: be sure to add the “space” version of ”_flower”

Predicting English

49 of 57

What Does this Teach Us?

Hypothesis: the presence of English between French and Chinese suggests a language-independent concept representation that sits between all languages.

How would we test this hypothesis?

50 of 57

Now Your Turn. https://bit.ly/eab-lens

Try investigating other prompts using the logit lens prototype. Visit: http://bit.ly/eab-lens

51 of 57

Now in a notebook

Go to colab.google.com

Secrets:

  • Enter your HF_TOKEN
  • Enter your NDIF_API_KEY

Submit your NDIF_API_KEY to the googleform: [TBD]

Then: https://bit.ly/4jCc5ZD

52 of 57

Logit Lens Research Example Notebook

https://bit.ly/4jCc5ZD

53 of 57

Capital of France: “a” at the last layer

54 of 57

Language translation: amor 🡪 amour

55 of 57

Pun: electrician swimmers

56 of 57

Neutral versus Punny Contexts

57 of 57

Representation hijacking bomb🡪carrot