Machine Learning: Week 14 – Large Language Models
Deep Learning
Attention
Attention in Cognitive Science
Issues with Recurrent Models (1)
Short-term dependencies: RNNs handle these well enough
Long-term dependencies: NOT handled well by RNNs
O(sequence length) steps for distant tokens to interact
Issues with Recurrent Models (2)
[Diagram: an unrolled recurrent network; the numbers 0, 1, 2, …, n indicate the minimum # of steps before a state can be computed]
Issues with Recurrent Models (3)
BiLSTM Encoder
LSTM Decoder
Solution: Attention
Encoder-Decoder with Attention
Solution: Attention
In a lookup table, we have a table of keys that map to values. The query matches one of the keys, returning its value.
In attention, the query matches all keys softly, each to a weight between 0 and 1. The keys’ values are multiplied by their weights and summed.
[Diagram, lookup table: the query (d) exactly matches one of the keys (a, b, c, d, e) and returns that key’s value (v4)]
[Diagram, attention: the query q is softly matched against all keys k1–k5; the values v1–v5 are multiplied by the resulting weights and summed into the output]
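The soft lookup above can be written as scaled dot-product attention. A minimal NumPy sketch (the 1/√d scaling follows Attention Is All You Need; all names and values here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, K, V):
    """Soft lookup: match q against every key, return the weighted sum of values."""
    d = K.shape[-1]
    weights = softmax(q @ K.T / np.sqrt(d))  # one weight in (0, 1) per key
    return weights @ V                       # weighted sum of the value vectors

# Toy example: 5 key/value pairs of dimension 4
rng = np.random.default_rng(0)
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
q = 10 * K[3]                # a query that strongly matches the 4th key
out = attention(q, K, V)     # the weight on key index 3 is the largest
```

Unlike the hard lookup table, every value contributes to the output, in proportion to how well its key matches the query.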
Attention for Machine Translation
https://alex.smola.org/talks/ICML19-attention.pdf
Sequence to Sequence Learning with Neural Networks (https://arxiv.org/pdf/1409.3215)
Attention for Machine Translation
https://alex.smola.org/talks/ICML19-attention.pdf
Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)
Effective Approaches to Attention-based Neural Machine Translation (https://arxiv.org/pdf/1508.04025)
Image generated by Gemini 3 Pro
Attention for Machine Translation
https://alex.smola.org/talks/ICML19-attention.pdf
Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)
Attention for Machine Translation
Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)
Attention for Image Captioning
Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention (https://arxiv.org/pdf/1502.03044)
Self-Attention as a New Building Block
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Encoder-Decoder with Attention
All words attend to all words in the previous layer
Self-Attention as a New Building Block
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Limitations of Self-Attention (1)
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
[Diagram: a query q is matched against the (k, v) pairs of every token in “went to HUFS LAI at 2025 and learned”]
No order information! There is just summation over the set.
Position Representation Vectors
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
In deep self-attention networks, we do this at the first layer! You could concatenate them as well, but people mostly just add…
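The sinusoidal position representations from Attention Is All You Need can be sketched as follows; these vectors are added (not concatenated) to the word embeddings at the first layer:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

# Added to the word embeddings so the model can distinguish positions
emb = np.random.randn(8, 16)                   # 8 tokens, model dim 16
emb_with_pos = emb + positional_encoding(8, 16)
```

Each position gets a distinct vector, so the otherwise order-blind summation can recover word order.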
Position Representation Vectors
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/pdf/1810.04805)
Limitations of Self-Attention (2)
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
[Diagram: query q, keys k1–k5, values v1–v5, output]
Even stacking multiple layers cannot introduce any non-linearity, because the output is still a summation of value vectors.
Adding nonlinearities in self-attention
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
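The usual fix is a position-wise feed-forward network applied after self-attention; the ReLU supplies the missing non-linearity. A sketch with illustrative dimensions:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: applied to every token vector independently.
    The ReLU is what breaks 'just a weighted sum of value vectors'."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 8, 32                  # the original paper uses d_ff = 4 * d_model
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)

x = rng.normal(size=(5, d_model))      # 5 token vectors out of self-attention
y = ffn(x, W1, b1, W2, b2)             # same shape, now non-linearly transformed
```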
Limitations of Self-Attention (3)
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Masking the Future in Self-Attention
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
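In a decoder, masking the future just means setting the attention scores for positions j > i to −∞ before the softmax, so future tokens receive zero weight (sketch):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores[mask] = -np.inf                            # future positions -> weight 0
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
out = causal_self_attention(Q, K, V)
# Position 0 can only attend to itself, so out[0] equals V[0] exactly
```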
“Transformer”
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Understanding Transformer Step-by-Step
Transformer Decoder
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Multi-Head Attention
Attention head 1 attends to entities
Attention head 2 attends to syntactically relevant words
[Diagram: two attention heads over the same sentence “went to HUFS LAI at 2025 and learned”; each head has its own query q and (k, v) pairs, so each head produces a different weighting over the tokens]
Multi-Head Attention
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Multi-head attention = running multiple self-attention heads in parallel
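A sketch of the idea: project the input, split the projection into per-head slices, attend within each head, then concatenate and project. Shapes are illustrative; real implementations use reshapes instead of an explicit loop over heads:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):                      # each head sees a d_head-dim slice
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])   # each head attends differently
    return np.concatenate(heads, axis=-1) @ Wo    # concatenate + output projection

rng = np.random.default_rng(0)
d_model, n_heads = 16, 4
X = rng.normal(size=(6, d_model))                 # 6 tokens
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
```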
Transformer …
Analysis of Transformer Attention
What Does BERT Look At? An Analysis of BERT’s Attention (https://arxiv.org/pdf/1906.04341)
Could We Trust Attention? (1)
Attention is not Explanation (https://arxiv.org/pdf/1902.10186)
Could We Trust Attention? (2)
Is Attention Interpretable? (https://arxiv.org/pdf/1906.03731)
Advanced Topics
Large Language Models
Today’s Topic
What is a Language Model?
Example: Smartphone Keyboard
What is a Language Model?
Autoregressive Language Models
Google Cloud Tech -Introduction to large language models (https://www.youtube.com/watch?v=zizonToFXDs)
Autoregressive Language Models
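An autoregressive LM factorizes P(w_1, …, w_n) = Π_t P(w_t | w_1, …, w_{t−1}) and generates left to right. A toy sketch with a hand-written bigram table (a real LLM conditions on the whole prefix, not just the last token, and its probabilities are learned):

```python
# Toy bigram "language model": P(next | current), purely illustrative
P = {
    "<s>":    {"I": 1.0},
    "I":      {"went": 1.0},
    "went":   {"to": 1.0},
    "to":     {"school": 1.0},
    "school": {"</s>": 1.0},
}

def generate(max_len=10):
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        dist = P[tokens[-1]]              # P(next token | context)
        nxt = max(dist, key=dist.get)     # greedy decoding: pick the most likely token
        tokens.append(nxt)
    return tokens

sentence = generate()
# -> ['<s>', 'I', 'went', 'to', 'school', '</s>']
```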
How to Train Language Models?
Generated by Gemini 3 Pro (Nano Banana Pro)
How to Train Language Models?
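Training is next-token prediction: minimize the average cross-entropy of the correct next tokens. A sketch (a model that is uniform over a vocabulary of size V has loss log V):

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy of the correct next tokens.
    logits: (seq_len, vocab_size) model scores; targets: (seq_len,) token ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Uniform logits over a vocabulary of 4 -> loss is log(4)
logits = np.zeros((3, 4))          # 3 positions, vocab size 4
targets = np.array([0, 1, 2])      # the "correct" next tokens
loss = next_token_loss(logits, targets)   # = log(4) ≈ 1.386
```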
Generated by Gemini 3 Pro (Nano Banana Pro)
Rethinking Next Token Prediction
Generated by Gemini 3 Pro (Nano Banana Pro)
Next Token Prediction = Learning to Solve Tasks
Multitask Learning Emerges from Raw Text
Next Token Prediction = Knowledge Injection
Pre-training with Next Token Prediction
From Language Model to “Large” Language Model
Generated by Gemini 3 Pro (Nano Banana Pro)
What Makes a Language Model “Large”?
Language Models are Few-Shot Learners (https://arxiv.org/pdf/2005.14165)
What Makes a Language Model “Large”?
Language Models are Few-Shot Learners (https://arxiv.org/pdf/2005.14165)
What Makes a Language Model “Large”?
Language Models are Few-Shot Learners (https://arxiv.org/pdf/2005.14165)
What Makes a Language Model “Large”?
Many-Shot In-Context Learning (https://arxiv.org/pdf/2404.11018)
What Makes a Language Model “Large”?
From Pre-trained LLMs to Practical Systems
How to Train Large Language Models?
Pre-training
Post-training
General Pre-training
Domain �Pre-training
Mid-training
Instruction Tuning
Preference Tuning
Reinforcement Learning
Pre-training
Post-training
Scaling Instruction-Finetuned Language Models (https://arxiv.org/pdf/2210.11416)
Post-training
https://huggingface.co/datasets/KORMo-Team/preference-dataset-qwen3/viewer/lose8B_win80B-A3B
https://huggingface.co/docs/transformers/main/chat_templating
User view
Data view
Model view
Post-training
Iterative Reasoning Preference Optimization (https://arxiv.org/pdf/2404.19733)
Post-training
Training language models to follow instructions with human feedback (https://arxiv.org/pdf/2203.02155)
Post-training
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (https://arxiv.org/pdf/2309.00267)
Post-training
MIT 6.S191 (Liquid AI): Large Language Models (https://www.youtube.com/watch?v=_HfdncCbMOE)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (https://arxiv.org/pdf/2305.18290)
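The DPO loss from the paper above is −log σ(β[(log π_θ(y_w|x) − log π_ref(y_w|x)) − (log π_θ(y_l|x) − log π_ref(y_l|x))]), where y_w is the chosen and y_l the rejected answer. A numeric sketch (the log-probabilities are made-up numbers):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: push the policy's (chosen - rejected) log-ratio above the reference's."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))  # -log sigmoid(beta * margin)

# Policy already prefers the chosen answer more than the reference does -> low loss
low = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
# Policy prefers the rejected answer -> high loss
high = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
```

No reward model and no RL loop are needed; the preference data is used directly.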
LLM-related Framework
Masked Language Models
Conditional Language Models
How to Use Large Language Models?
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (https://arxiv.org/pdf/2201.11903)
Large Language Models are Zero-Shot Reasoners (https://arxiv.org/pdf/2205.11916)
How to Use Large Language Models?
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (https://arxiv.org/pdf/2005.11401)
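The retrieval half of RAG can be sketched as nearest-neighbor search over document embeddings; the retrieved text is then prepended to the LLM prompt as context. The vectors below are toy stand-ins; a real system uses a trained encoder and a vector index:

```python
import numpy as np

def retrieve(query_emb, doc_embs, docs, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    D = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = D @ q                          # cosine similarity to each document
    top = np.argsort(-sims)[:k]           # indices of the most similar documents
    return [docs[i] for i in top]

docs = ["HUFS is in Seoul.",
        "Attention was introduced for NMT.",
        "KV caching speeds up decoding."]
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(3, 8))        # toy 8-dim "embeddings"
query_emb = doc_embs[1] + 0.01 * rng.normal(size=8)   # query near document 1
context = retrieve(query_emb, doc_embs, docs, k=1)
```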
How to Use Large Language Models?
Stanford CS230 | Autumn 2025 | Lecture 7: Agents, Prompts, and RAG.
How to Use Large Language Models?
Stanford CS230 | Autumn 2025 | Lecture 7: Agents, Prompts, and RAG.
How to Use Large Language Models?
Stanford CS230 | Autumn 2025 | Lecture 7: Agents, Prompts, and RAG.
How to Use Large Language Models?
Local LLM – Basics
https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct
→ MODEL_NAME = "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct"
Local LLM – Basics
KV Caching Explained: Optimizing Transformer Inference Efficiency (https://huggingface.co/blog/not-lain/kv-caching)
KV Caching
Standard Inference
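The difference can be sketched with a toy single-head decoder loop: standard inference recomputes K and V for the whole prefix at every step, while a KV cache computes them once per token and appends (illustrative weights and embeddings):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8
Wk, Wv, Wq = (rng.normal(size=(d, d)) for _ in range(3))

K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for step in range(5):                      # decode 5 tokens one at a time
    x = rng.normal(size=d)                 # embedding of the newest token
    # Standard inference would recompute K and V for ALL prefix tokens here.
    # With a KV cache we compute them only for the new token and append:
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    q = x @ Wq                             # only the new token needs a query
    out = softmax(q @ K_cache.T / np.sqrt(d)) @ V_cache
```

Per step, the cached version does O(1) K/V projections instead of O(prefix length), at the cost of keeping the cache in memory.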
Local LLM – transformers & generate()
By Me!
Next
LLMs
Today’s Topic
Further Topics
References