1 of 89

Machine Learning: Week 13 – Deep Learning (4)

Seungtaek Choi

Division of Language & AI at HUFS

seungtaek.choi@hufs.ac.kr

2 of 89

Deep Learning

Recurrent Neural Networks

3 of 89

Sequence Modeling

  • Sequence modeling is the task of predicting what comes next
    • E.g., “This morning I took my dog for a walk.”

    • E.g., given historical air quality, forecast air quality in next couple of hours.

given previous words 🡪 predict the next word

4 of 89

Sequence Modeling

  • To model sequences, we need to:
    • Handle variable-length sequences
    • Track long-term dependencies
    • Maintain information about order
    • Share parameters across the sequence

  • Solution:
    • Recurrent Neural Networks (RNNs)

5 of 89

A Recurrent Neural Network (RNN)

  • Apply a recurrence relation at every time step to process a sequence:

  • Note: the same function and set of parameters are used at every time step

    cell state:      h_t = f_W(h_{t−1}, x_t)
    old state:       h_{t−1}
    current input (input vector):  x_t
    output vector:   ŷ_t (read out from the cell state h_t)

6 of 89

RNN: State Update and Output

  • State update (cell state):  h_t = tanh(W_hh h_{t−1} + W_xh x_t)
  • Output vector:  ŷ_t = W_hy h_t

    where h_{t−1} is the old state, x_t is the current input (input vector), and W_hh, W_xh, W_hy are the shared weight matrices.

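To make the update concrete, here is a minimal NumPy sketch of a single RNN step following these equations (the shapes and names are illustrative assumptions):

    import numpy as np

    def rnn_step(h_prev, x_t, W_hh, W_xh, W_hy):
        """One step of a vanilla RNN (minimal sketch).
        h_prev: old state (d_h,)   x_t: current input (d_x,)
        W_hh: (d_h, d_h)  W_xh: (d_h, d_x)  W_hy: (d_y, d_h)  -- assumed shapes
        """
        h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # state update (cell state)
        y_t = W_hy @ h_t                            # output vector
        return h_t, y_t

    # The same function and parameters are reused at every time step, e.g.:
    # h = np.zeros(d_h)
    # for x_t in sequence:
    #     h, y = rnn_step(h, x_t, W_hh, W_xh, W_hy)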
7 of 89

RNN: Backpropagation Through Time

  • Forward pass through the entire sequence to compute the loss, then backward pass through the entire sequence (back through time) to compute the gradients
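As a reference for how gradients flow backward through time, the gradient of the loss at step T with respect to the recurrent weights decomposes into a sum whose terms contain a product of step-to-step Jacobians (the standard BPTT decomposition, using the notation above):

    \frac{\partial L_T}{\partial W_{hh}}
      = \sum_{k=1}^{T}
        \frac{\partial L_T}{\partial h_T}
        \left( \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}} \right)
        \frac{\partial h_k}{\partial W_{hh}}

The repeated product of Jacobians is exactly where the exploding/vanishing behavior on the next slides comes from.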

8 of 89

Standard RNN Gradient Flow: Exploding Gradients

  • Backpropagating to earlier time steps requires repeated multiplication by the recurrent weight matrix W_hh (and by the activation-function derivative)

    Many values > 1: exploding gradients 🡪 gradient clipping to scale big gradients

    Many values < 1: vanishing gradients 🡪 addressed via:
    1. Activation function
    2. Weight initialization
    3. Network architecture

9 of 89

Standard RNN Gradient Flow: Vanishing Gradients

  • The same repeated multiplication causes the opposite failure mode when the factors are small

    Many values > 1: exploding gradients 🡪 gradient clipping to scale big gradients

    Many values < 1: vanishing gradients 🡪 addressed via:
    1. Activation function
    2. Weight initialization
    3. Network architecture

10 of 89

Why is Exploding Gradient a Problem?

  • If the gradient becomes too big, then the SGD update step becomes too big:

    θ_new = θ_old − α ∇_θ J(θ)   (with a very large gradient ∇_θ J(θ))

  • This can cause bad updates: we take too large a step and reach a weird and bad parameter configuration (with large loss)
    • You think you’ve found a hill to climb, but suddenly you’re in Iowa.

  • In the worst case, this will result in Inf or NaN in your network (then you have to restart training from an earlier checkpoint)

11 of 89

Solution: Gradient Clipping

  • Gradient clipping: if the norm of the gradient is greater than some threshold, scale it down before applying SGD update

  • Intuition: take a step in the same direction, but a smaller step

  • In practice: remembering to clip gradients is important, but exploding gradients are an easy problem to solve.
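As a concrete illustration, here is a minimal NumPy sketch of gradient-norm clipping (the threshold of 5.0 is an arbitrary example value):

    import numpy as np

    def clip_gradients(grads, threshold=5.0):
        """Rescale gradients if their global L2 norm exceeds `threshold`.
        `grads` is assumed to be a list of gradient arrays, one per parameter.
        """
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > threshold:
            scale = threshold / total_norm      # same direction, smaller step
            grads = [g * scale for g in grads]
        return grads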

12 of 89

Vanishing Gradient Intuition

 

Vanishing gradient problem:

When the step-to-step terms ∂h_t/∂h_{t−1} are small, the gradient signal gets smaller and smaller as it backpropagates further back in time

13 of 89

Why is Vanishing Gradient a Problem?

Gradient signal from far away is lost because it’s much smaller than gradient signal from close-by.

So, model weights are basically updated only with respect to near effects, not long-term effects.

14 of 89

Long Short-Term Memory (LSTM)

  • In an LSTM network, recurrent modules contain gated cells that control the information flow

15 of 89

Long Short-Term Memory (LSTM)

  • Information is added to or removed from the cell state through structures called gates.

Gates optionally let information through, via a sigmoid layer and pointwise multiplication

16 of 89

LSTM: Cell State Matters

  •  

 

17 of 89

LSTM: Cell State Matters

  • You can think of the LSTM equations visually like this:
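Since the figure did not survive extraction, here is the standard LSTM formulation for reference (a sketch of the usual equations; the slide’s own notation may differ in details):

    \begin{aligned}
    f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
    i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
    o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
    \tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate cell state} \\
    c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state update} \\
    h_t &= o_t \odot \tanh(c_t) && \text{hidden state / output}
    \end{aligned}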

18 of 89

LSTM: Mitigate Vanishing Gradient

  • Vanilla RNNs: backpropagating from a later step to an earlier one multiplies the same factor (the recurrent weight matrix and the activation derivative) over and over
    • If these factors are mostly < 1, the product shrinks exponentially 🡪 … Vanish!
    • If these factors are mostly > 1, the product grows exponentially 🡪 … Explode!

19 of 89

LSTM: Mitigate Vanishing Gradient

  • Vanilla RNNs: the gradient is repeatedly multiplied by the same recurrent weights, so it vanishes or explodes

  • LSTM: the cell state is updated additively, c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t, so the gradient along the cell-state path is scaled by the forget gate f_t rather than by the recurrent weights

So…

We can keep information if we want! (by adjusting how much we forget, i.e., keeping the forget gate close to 1)

20 of 89

Supplementary: Residual Connection

  • Is vanishing/exploding gradient just an RNN problem?

  • No! It can be a problem for all neural architectures (including feed-forward and convolutional neural networks), especially very deep ones.
    • Due to chain rule / choice of nonlinearity function, gradient can become vanishingly small as it backpropagates
    • Thus, lower layers are learned very slowly (i.e., are hard to train)

  • Another solution: lots of new deep feedforward/convolutional architectures add more direct connections (thus allowing the gradient to flow)

  • For example:
    • Residual connections, a.k.a. “ResNet”
    • Also known as skip-connections
    • The identity connection preserves information by default
    • This makes deep networks much easier to train
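A minimal sketch of a residual (skip) connection, with hypothetical weight matrices W1 and W2, showing how the identity path preserves information by default:

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def residual_block(x, W1, W2):
        """output = x + F(x), where F is a small two-layer transformation.
        The identity path lets gradients flow directly to earlier layers.
        (Illustrative only; x is (d,), W1 is (d, k), W2 is (k, d).)
        """
        h = relu(x @ W1)       # transformation F(x)
        return x + h @ W2      # skip connection adds the input back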

21 of 89

RNN Applications

  • RNNs can be used as a sentence encoder model, e.g., for sentiment classification

Usually better: take the element-wise max or mean of all hidden states

22 of 89

RNN Applications

  • RNNs can be used to generate text based on other information
  • e.g., speech recognition, machine translation, summarization

23 of 89

RNN Applications

24 of 89

Bidirectional and Multi-layer RNNs

  • Motivation

We can regard this hidden state as a representation of the word “terribly” in the context of this sentence. We call this a contextual representation.

[Diagram: a left-to-right RNN over the sentence “the movie was terribly exciting !”, with an element-wise mean/max over the hidden states feeding the classifier.]

These contextual representations only contain information about the left context (e.g., “the movie was”).

What about right context?

In this example, “exciting” is in the right context and this modifies the meaning of “terribly” (from negative to positive)

25 of 89

Bidirectional and Multi-layer RNNs

This contextual representation of “terribly” has both left and right context!

Concatenated hidden states

Forward RNN

Backward RNN

26 of 89

Bidirectional RNNs

  • Forward RNN:  →h_t = RNN_FW(x_t, →h_{t−1})
  • Backward RNN:  ←h_t = RNN_BW(x_t, ←h_{t+1})
  • Concatenated hidden states:  h_t = [→h_t ; ←h_t]

This is a general notation to mean “compute one forward step of the RNN” – it could be a simple RNN or LSTM computation.

We regard this as “the hidden state” of a bidirectional RNN.

This is what we pass on the next parts of the network.

Generally, these two RNNs have separate weights

27 of 89

Bidirectional RNNs: Simplified Diagram

The two-way arrows indicate bidirectionality and the depicted hidden states are assumed to be the concatenated forwards+backwards states

28 of 89

Multi-layer RNNs

RNN layer 1

RNN layer 2

RNN layer 3

 

29 of 89

Multi-layer RNNs

  • Multi-layer or stacked RNNs allow a network to compute more complex representations – they work better than just one layer of high-dimensional encodings!
    • The lower RNNs should compute lower-level features and the higher RNNs should compute higher-level features.
  • High-performing RNNs are usually multi-layer (but aren’t as deep as convolutional or feed-forward networks)
  • For example: In a 2017 paper, Britz et al. find that for Neural Machine Translation, 2 to 4 layers is best for encoder RNN, and 4 layers is best for the decoder RNN
    • Often 2 layers is a lot better than 1, and 3 might be a little better than 2
    • Usually, skip-connections/dense-connections are needed to train deeper RNNs (e.g., 8 layers)
  • Transformer-based networks (e.g., BERT) are usually deeper, like 12 or 24 layers.
    • You will learn about Transformers later; they have a lot of skip-like connections.

30 of 89

RNN Limitations

  • Limitations
    • Encoding bottleneck: fixed-size hidden state 🡪 information loss
      • What we want instead: a continuous stream of information
    • Slow, no parallelization: step-by-step processing 🡪 slow on long sequences
      • What we want instead: parallelization
    • Not long memory: vanishing/exploding gradients 🡪 weak long-term dependencies
      • What we want instead: long memory

31 of 89

Deep Learning

Attention

32 of 89

Attention

33 of 89

Attention

34 of 89

Attention

generated by Gemini 3 Pro

35 of 89

Attention in Animals

  • Resource saving
    • Only need sensors where relevant bits are �(e.g., fovea vs. peripheral vision)
    • Only compute relevant bits of information �(e.g., fovea has many more ‘pixels’ than periphery)
  • Variable state manipulation
    • Manipulate environment (for all grains do: eat)
    • Learn modular subroutines (not state)

36 of 89

Attention in Science

  • Behavioral and cognitive process of selectively concentrating on a discrete aspect of information while ignoring other perceivable information
  • State of arousal
  • It is the taking possession by the mind in clear and vivid form of one out of what seem several simultaneous objects or trains of thought.
  • Focalization, the concentration of consciousness
  • The allocation of limited cognitive processing resources

37 of 89

Recap: Sentence Encoding

Is this the best way to model a sentence?

38 of 89

Recap: Image Encoding

Is this the best way to model an image?

39 of 89

Recap: Seq2Seq

Is this enough to preserve information in the input?

40 of 89

Issues with Recurrent Models (1)

  • Linear interaction distance
    • RNNs are unrolled “left-to-right”.
    • This encodes linear locality: a useful heuristic!
      • Nearby words often affect each other’s meanings

    • Problem: RNNs take O(sequence length) steps for distant word pairs to interact.
      • Hard to learn long-distance dependencies (because gradient problems!)

short-term dependency: an RNN handles this well (small interaction distance)

long-term dependency: NOT handled well by an RNN – the interaction distance is O(sequence length)

41 of 89

Issues with Recurrent Models (2)

  • Lack of parallelizability
    • Forward and backward passes have O(sequence length) unparallelizable operations
      • GPUs can perform a bunch of independent computations at once!
      • But future RNN hidden states can’t be computed in full before past RNN hidden states have been computed.
      • Inhibits training on very large datasets!

[Diagram: a chain of RNN hidden states labeled 0, 1, 2, …, n – the numbers indicate the minimum number of steps before a state can be computed.]

42 of 89

Issues with Recurrent Models (3)

  • Encoding bottleneck
    • It’s like a one-shot compression (sequence of features 🡪 a vector).
    • RNN + classification is like judging a book by its summary.

BiLSTM Encoder

LSTM Decoder

43 of 89

Solution: Attention

  • Core idea: on each step of the decoder, use a direct connection to the encoder to focus on (attend to) a particular part of the input sequence, instead of relying on a single fixed-size encoding.

Encoder-Decoder with Attention

44 of 89

Solution: Attention

  • We can think of attention as performing fuzzy lookup in a key-value store.

In a lookup table, we have a table of keys that map to values. The query matches one of the keys, returning its value.

In attention, the query matches all keys softly, to a weight between 0 and 1. The keys’ values are multiplied by the weights and summed.

[Diagram, left – exact lookup table: the query d matches the key d exactly and returns its value v4.
Diagram, right – attention: the query q is scored against all keys k1…k5; the resulting weights multiply the values v1…v5 and the output is their weighted sum.]
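A minimal NumPy sketch of this fuzzy lookup, using dot-product scoring as one common choice (function and variable names are illustrative, not from the slides):

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def attention(query, keys, values):
        """Attention as fuzzy key-value lookup.
        query: (d,)   keys: (n, d)   values: (n, d_v)
        Each key is scored against the query; softmax turns the scores into
        weights in [0, 1] that sum to 1; the output is the weighted sum of values.
        """
        scores = keys @ query        # similarity of the query to every key
        weights = softmax(scores)    # soft match instead of exact match
        return weights @ values      # weighted sum of values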

45 of 89

Advantages of Attention

  • Attention is better at preserving the input features (the values are combined directly, without being compressed through a nonlinear encoding)
  • Attention is better at adapting its weights to the query/task.
  • Attention is better at capturing long-range dependencies.
  • Attention is better for interpreting the model’s predictions.

46 of 89

Attention for Documents

  •  

Hierarchical Attention Networks for Document Classification (https://aclanthology.org/N16-1174.pdf)

word attention

sentence attention

47 of 89

Attention for Documents

  • Some words are more important than others.
  • Some sentences are more important than others.

Hierarchical Attention Networks for Document Classification (https://aclanthology.org/N16-1174.pdf)

48 of 89

Attention for Documents

Hierarchical Attention Networks for Document Classification (https://aclanthology.org/N16-1174.pdf)

‘good’ gets higher attention in positive reviews

‘bad’ gets higher attention in negative reviews

49 of 89

Attention for Graphs

Graph Attention Networks (https://arxiv.org/abs/1710.10903)

50 of 89

Attention for QA Systems

  • Question answering = given the passage, answer the question.

# Passage

Joe went to the kitchen.

Fred went to the kitchen.

Joe picked up the milk.

Joe travelled to the office.

Joe left the milk.

Joe went to the bathroom.

# Question

Where is the milk?

51 of 89

Attention for QA Systems

  • Simple attention selects sentences with ‘milk’.
  • Attention pooling doesn’t help much since it misses intermediate steps.

# Passage

Joe went to the kitchen.

Fred went to the kitchen.

Joe picked up the milk.

Joe travelled to the office.

Joe left the milk.

Joe went to the bathroom.

# Question

Where is the milk?

# Answer

?

52 of 89

Attention for QA Systems

  • The model needs to reason sequentially.
  • Multi-hop attention solves the problem.
    • Hop 1: Focuses on “milk” 🡪 finds “Joe left the milk.”
    • Hop 2: Focuses on “Joe” and temporal context 🡪 finds “Joe travelled to the office.”

iterative attention

 

53 of 89

Attention for QA Systems

  • Multiple computational steps (hops) are performed per output symbol.

54 of 89

Attention for Machine Translation

  • Sequence-to-sequence (seq2seq)
    • Encode source sequence to latent representation
    • Decode to target sequence one character at a time
    • Need memory for long sequences

https://alex.smola.org/talks/ICML19-attention.pdf

Sequence to Sequence Learning with Neural Networks (https://arxiv.org/pdf/1409.3215)

55 of 89

Attention for Machine Translation

  •  

https://alex.smola.org/talks/ICML19-attention.pdf

Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)

Effective Approaches to Attention-based Neural Machine Translation (https://arxiv.org/pdf/1508.04025)

Image generated by Gemini 3 Pro

56 of 89

Attention for Machine Translation

  • With attention, translation quality remains much more stable for long sentences.

https://alex.smola.org/talks/ICML19-attention.pdf

Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)

57 of 89

Attention for Machine Translation

  •  

Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)

58 of 89

Attention for Image Captioning

  • In computer vision, attention learns alignment between image region and word.

Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention (https://arxiv.org/pdf/1502.03044)

59 of 89

Self-Attention as a New Building Block

  • Prev: Attention on recurrent models
  • Today: attention only
    • Number of unparallelizable operations does not increase with sequence length.
    • Maximum interaction distance: O(1), since all words interact at every layer!
    • How? Treat each word’s representation as a query to access and incorporate information from a set of values.

Encoder-Decoder with Attention

All words attend to all words in previous layer

60 of 89

Self-Attention as a New Building Block

  •  
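As a reference for the missing formula, here is a standard formulation of self-attention in which each word representation x_i acts as a query over all positions (the slide’s exact notation is assumed to be equivalent):

    \begin{aligned}
    q_i &= W^{Q} x_i, \qquad k_j = W^{K} x_j, \qquad v_j = W^{V} x_j \\
    \alpha_{ij} &= \frac{\exp(q_i^\top k_j)}{\sum_{j'} \exp(q_i^\top k_{j'})}, \qquad
    o_i = \sum_j \alpha_{ij}\, v_j
    \end{aligned}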

61 of 89

Limitations of Self-Attention (1)

  • 1st challenge: self-attention doesn’t have an inherent notion of order!

[Diagram: self-attention over the tokens “went to HUFS LAI at 2025 and learned”, each token providing a key and a value (one query shown).]

No order information! There is just a summation over the set.

62 of 89

Position Representation Vectors

  • Represent each sequence index i as a position vector p_i, and combine it with the input embedding at that position: x̃_i = x_i + p_i

In deep self-attention networks, we do this at the first layer! You could concatenate them as well, but people mostly just add…

63 of 89

Position Representation Vectors

Attention Is All You Need (https://arxiv.org/pdf/1706.03762)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/pdf/1810.04805)

64 of 89

Position Representation Vectors

  • Sinusoidal position representations: concatenate sinusoidal functions of varying periods:

  • Pros:
    • Periodicity indicates that maybe “absolute position” isn’t as important
    • Maybe can extrapolate to longer sequences as periods restart!
  • Cons:
    • Not learnable; also the extrapolation doesn’t really work!
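For reference, the sinusoidal position encoding defined in “Attention Is All You Need” (cited on the previous slide), where pos is the position and i indexes pairs of dimensions of a d_model-dimensional embedding:

    PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right), \qquad
    PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right)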

65 of 89

Position Representation Vectors

  •  

66 of 89

Limitations of Self-Attention (2)

  • 2nd challenge: there are no nonlinearities. It’s all just weighted averages.

[Diagram: two self-attention layers stacked on top of each other; each output is a weighted sum of value vectors.]

Even stacking multiple layers cannot introduce any non-linearity because it’s still a summation of value vectors

67 of 89

Adding Nonlinearities in Self-Attention

  • Easy fix: apply the same position-wise feed-forward network to each self-attention output, e.g., FF(o_i) = W_2 ReLU(W_1 o_i + b_1) + b_2 (see the sketch below)
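A minimal NumPy sketch of that fix: the same small feed-forward network applied independently to each position’s self-attention output (the weight shapes are hypothetical):

    import numpy as np

    def position_wise_ffn(outputs, W1, b1, W2, b2):
        """Apply the same two-layer MLP to every position independently.
        outputs: (n, d) matrix of per-position self-attention outputs;
        W1: (d, d_ff), b1: (d_ff,), W2: (d_ff, d), b2: (d,) -- assumed shapes.
        """
        return np.maximum(0.0, outputs @ W1 + b1) @ W2 + b2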

68 of 89

Limitations of Self-Attention (3)

  • 3rd challenge: need to ensure we don’t “look at the future” when predicting a sequence

69 of 89

Masking the Future in Self-Attention

  • To use parallel matrix operations during training while still conditioning only on the past, mask out attention to future positions: set their attention scores to −∞ before the softmax, so their weights become 0 (see the sketch below).
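A minimal NumPy sketch of causal masking: future positions get a score of −∞ before the softmax, so their attention weights become exactly 0 (names and shapes are illustrative):

    import numpy as np

    def causal_self_attention_weights(scores):
        """Mask an (n, n) matrix of raw attention scores (float) so that
        position i can only attend to positions j <= i, then softmax row-wise.
        """
        n = scores.shape[0]
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future
        masked = np.where(mask, -np.inf, scores)           # -inf -> weight 0 after softmax
        e = np.exp(masked - masked.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

With this mask, all positions can still be computed in parallel during training.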

70 of 89

Self-Attention as a New Building Block

  • Barriers and solutions:
    • 1. Doesn’t have an inherent notion of order
      • -> Add position representations to the inputs
    • 2. No nonlinearities for deep learning magic. It’s all just weighted averages
      • -> Easy fix: apply the same feedforward network to each self-attention output
    • 3. Need to ensure we don’t “look at the future” when predicting a sequence
      • -> Mask out the future by artificially setting attention weights to 0!

71 of 89

“Transformer”

  • Self-attention:
    • The basis of the method.
  • Position representations:
    • Specify the sequence order, since self-attention is an unordered function of its inputs.
  • Nonlinearities:
    • At the output of the self-attention block
    • Frequently implemented as a simple feed-forward network.
  • Masking:
    • In order to parallelize operations while not looking at the future.
    • Keeps information about the future from “leaking” to the past.

72 of 89

Understanding Transformer Step-by-Step

73 of 89

Transformer Decoder

  • A Transformer decoder is how we’ll build systems like language models.
  • It’s a lot like our minimal self-attention architecture, but with a few more components.
  • The embeddings and position embeddings are identical.
  • We’ll next replace our self-attention with multi-head self-attention.

74 of 89

Multi-Head Attention

  • What if we want to attend to several places in the sentence at once, for different reasons? Use multiple attention “heads”, each with its own query/key/value transformations.

Attention head 1 �attends to entities

Attention head 2 attends to syntactically relevant words

75 of 89

[Diagram: one attention head over the tokens “went to HUFS LAI at 2025 and learned”, with a query, key, and value per token.]

76 of 89

[Diagram: a second attention head over the same tokens “went to HUFS LAI at 2025 and learned”.]

77 of 89

Multi-Head Attention

  •  

Multi-head attention = running multiple self-attention heads in parallel and combining their outputs
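A minimal NumPy sketch of multi-head self-attention with scaled dot-product scoring (the shapes and the simple head-slicing scheme are simplifying assumptions, not the exact Transformer implementation):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
        """X: (n, d) token representations; Wq, Wk, Wv, Wo: (d, d);
        d is assumed divisible by n_heads."""
        d = X.shape[1]
        dh = d // n_heads
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        outputs = []
        for h in range(n_heads):
            sl = slice(h * dh, (h + 1) * dh)               # this head's slice of the model dim
            scores = Q[:, sl] @ K[:, sl].T / np.sqrt(dh)   # scaled dot-product attention
            weights = softmax(scores, axis=-1)
            outputs.append(weights @ V[:, sl])             # weighted sum of values per head
        return np.concatenate(outputs, axis=-1) @ Wo       # concatenate heads, project back

In practice each head usually has its own learned projection matrices; slicing a single projection is just a compact way to sketch the idea.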

78 of 89

Transformer …

  • Today, we will focus more on “attention”, rather than “Transformer”.
  • However, it’s recommended to study the full details of Transformer, such as …
    • Low-rank computation in multi-head attention
    • Scaled dot-product attention (raw dot products between high-dimensional vectors get large, so they are scaled down)
    • Residual connections
    • Layer normalization
    • Cross-attention (the decoder attends to the encoder states)
    • Computation that grows quadratically with sequence length

79 of 89

80 of 89

Analysis of Transformer Attention

What Does BERT Look At? An Analysis of BERT’s Attention (https://arxiv.org/pdf/1906.04341)

81 of 89

Could We Trust Attention?

  • Attention looks like an explanation for its prediction.
    • We can visualize which source tokens the model “attends to”.

  • But: Can we trust these weights as explanations?

82 of 89

Could We Trust Attention? (1)

  • Is attention an explanation for its prediction?
    • If attention were a faithful explanation for its prediction, then perturbing attention should significantly change the prediction.
    • In that sense, (Jain & Wallace, 2019) claims attention weights are NOT reliably faithful explanations.
    • They constructed shuffled / adversarial attention distributions on a fixed model, but the predictions stayed almost the same.

Attention is not Explanation (https://arxiv.org/pdf/1902.10186)

83 of 89

Could We Trust Attention? (2)

  • Is a token with high attention really one that matters?
    • (Serrano & Smith, 2019) remove words in order of attention vs. gradients, and find that deleting high-attention words often barely affects the prediction.
    • => Attention weights are not a reliable importance ranking, so we should be careful using them as explanations.

Is Attention Interpretable? (https://arxiv.org/pdf/1906.03731)

84 of 89

Could We Trust Attention? (3)

  • Is attention concentrated on “meaningful” tokens?
    • (Xiao et al., 2024) found that many LLMs assign very high attention to a few special / initial tokens, regardless of the actual content (“attention sinks”).

Efficient Streaming Language Models with Attention Sinks (https://arxiv.org/pdf/2309.17453)

85 of 89

Could We Trust Attention? (4)

  • Is AI’s attention aligned with humans’ attention?
    • In visual question answering, AI’s attention maps show low correlation with humans’ attention (eye-gaze).

Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? (https://arxiv.org/pdf/1606.03556)

86 of 89

Could We Trust Attention? (5)

  • Is AI’s attention aligned with humans’ attention?
    • In natural language processing too, AI’s attention maps show low correlation with humans’ semantic importance.

What Does BERT Look At? An Analysis of BERT’s Attention (https://arxiv.org/pdf/1906.04341)

87 of 89

How Could We Make Attention Trustable? (1)

  •  

Attention is Not Only a Weight: Analyzing Transformers with Vector Norms (https://arxiv.org/pdf/2004.10102)

88 of 89

How Could We Make Attention Trustable? (2)

  •  

Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words? (https://aclanthology.org/2020.acl-main.419.pdf)

Deriving Machine Attention from Human Rationales (https://arxiv.org/pdf/1808.09367)

Less is More: Attention Supervision with Counterfactuals for Text Classification (https://aclanthology.org/2020.emnlp-main.543.pdf)

89 of 89

How Could We Make Attention Trustable? (3)

  • Multi-head self-attention can be supervised with linguistic information.

Linguistically-Informed Self-Attention for Semantic Role Labeling (https://arxiv.org/pdf/1804.08199)

+ https://www.youtube.com/watch?v=nfti9lEs8k8