Machine Learning: Week 13 – Deep Learning (4)
Deep Learning
Recurrent Neural Networks
Sequence Modeling
Given previous words, predict the next word.
Sequence Modeling
A Recurrent Neural Network (RNN)
[Figure: an RNN cell. The cell state is updated from the old state and the current input vector; an output vector is read out at each step.]
RNN: State Update and Output
[Figure: the cell state is updated from the old state and the current input vector; an output vector is read out at each step.]
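A minimal NumPy sketch of this update (the weight names W_xh, W_hh, W_hy and all sizes are illustrative, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 8, 3                   # illustrative sizes

W_xh = rng.normal(size=(d_h, d_in)) * 0.1    # current input -> state
W_hh = rng.normal(size=(d_h, d_h)) * 0.1     # old state -> new state
W_hy = rng.normal(size=(d_out, d_h)) * 0.1   # state -> output

def rnn_step(h_prev, x_t):
    """One recurrent step: the new state is a function of the old state
    and the current input; the output is read out from the new state."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

h = np.zeros(d_h)                            # initial state
for x in rng.normal(size=(5, d_in)):         # a length-5 input sequence
    h, y = rnn_step(h, x)                    # same weights at every step
```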
RNN: Backpropagation Through Time
Standard RNN Gradient Flow: Exploding Gradients
Many values > 1: exploding gradients (fix: gradient clipping to scale down big gradients)
Many values < 1: vanishing gradients
Standard RNN Gradient Flow: Vanishing Gradients
Many values > 1: exploding gradients (fix: gradient clipping to scale down big gradients)
Many values < 1: vanishing gradients
Why is Exploding Gradient a Problem?
Solution: Gradient Clipping
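A minimal sketch of clipping by global norm (the threshold 5.0 is an arbitrary illustrative choice):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Scale all gradients down together if their global L2 norm
    exceeds max_norm; small gradients pass through unchanged."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```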
Vanishing Gradient Intuition
Vanishing gradient problem:
When these terms are small, the gradient signal gets smaller and smaller as it backpropagates further back in time
Why is Vanishing Gradient a Problem?
Gradient signal from far away is lost because it’s much smaller than gradient signal from close-by.
So, model weights are basically updated only with respect to near effects, not long-term effects.
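A tiny numeric illustration of this, assuming each backprop step multiplies the gradient signal by a factor of about 0.5:

```python
grad = 1.0
for step in range(20):   # backpropagate through 20 time steps
    grad *= 0.5          # each step shrinks the signal
print(grad)              # ~9.5e-07: the far-away signal is effectively lost
```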
Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM)
Gates optionally let information through, via a sigmoid layer and pointwise multiplication
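A minimal sketch of one LSTM step; the dictionary layout of the per-gate parameters W, U, b is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, h_prev, x_t, W, U, b):
    """One LSTM step. Each gate is a sigmoid layer whose output in (0, 1)
    multiplies a signal pointwise, optionally letting information through."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate cell
    c_t = f * c_prev + i * g    # additive cell-state update, gated pointwise
    h_t = o * np.tanh(c_t)
    return c_t, h_t
```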
LSTM: Cell State Matters
LSTM: Cell State Matters
LSTM: Mitigate Vanishing Gradient
[Figure: terms in the gradient product that can vanish or explode]
LSTM: Mitigate Vanishing Gradient
So… we can keep information if we want! (by adjusting how much we forget)
Supplementary: Residual Connection
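A minimal sketch of the idea: the layer input is added back to the layer output, so gradients have a direct additive path around the layer (analogous to the LSTM cell state):

```python
def residual_block(x, layer):
    """Residual connection: output = input + layer(input).
    The identity path lets gradients flow through unchanged."""
    return x + layer(x)
```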
RNN Applications
Usually better: take the element-wise max or mean of all hidden states
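A minimal sketch of this pooling, assuming the hidden states are stacked into a (time, hidden) array:

```python
import numpy as np

hidden_states = np.random.default_rng(0).normal(size=(6, 8))  # (T, d_h)

mean_repr = hidden_states.mean(axis=0)   # element-wise mean over time
max_repr  = hidden_states.max(axis=0)    # element-wise max over time
```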
RNN Applications
RNN Applications
Bidirectional and Multi-layer RNNs
We can regard this hidden state as a representation of the word “terribly” in the context of this sentence. We call this a contextual representation.
[Figure: an RNN reads "the movie was terribly exciting !"; the element-wise mean/max of all hidden states forms the sentence representation]
These contextual representations only contain information about the left context (e.g., “the movie was”).
What about right context?
In this example, "exciting" is in the right context, and it modifies the meaning of "terribly" (from negative to positive).
Bidirectional and Multi-layer RNNs
This contextual representation of “terribly” has both left and right context!
[Figure: a forward RNN and a backward RNN; their hidden states are concatenated]
Bidirectional RNNs
This is a general notation to mean “compute one forward step of the RNN” – it could be a simple RNN or LSTM computation.
[Figure: a forward RNN and a backward RNN; their hidden states are concatenated]
We regard this as “the hidden state” of a bidirectional RNN.
This is what we pass on to the next parts of the network.
Generally, these two RNNs have separate weights
Bidirectional RNNs: Simplified Diagram
The two-way arrows indicate bidirectionality and the depicted hidden states are assumed to be the concatenated forwards+backwards states
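A minimal sketch, assuming forward_step and backward_step are separately parameterized step functions (per the note above, the two RNNs do not share weights):

```python
import numpy as np

def run_rnn(xs, step, h0):
    """Run a one-directional RNN and collect every hidden state."""
    hs, h = [], h0
    for x in xs:
        h = step(h, x)
        hs.append(h)
    return hs

def bidirectional_states(xs, forward_step, backward_step, h0):
    fwd = run_rnn(xs, forward_step, h0)
    bwd = run_rnn(xs[::-1], backward_step, h0)[::-1]  # re-align with input order
    # "the hidden state" at each position is the concatenation of both directions
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```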
Multi-layer RNNs
[Figure: three stacked RNN layers: layer 1, layer 2, layer 3]
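A minimal sketch of the stacking: each layer's hidden states become the next layer's input sequence (layer_steps holds one step function per layer; all names are illustrative):

```python
def multilayer_rnn(xs, layer_steps, h0s):
    """Run stacked RNN layers; layer k reads the hidden states of layer k-1."""
    seq = xs
    for step, h0 in zip(layer_steps, h0s):   # e.g., three stacked layers
        hs, h = [], h0
        for x in seq:
            h = step(h, x)
            hs.append(h)
        seq = hs
    return seq                               # hidden states of the top layer
```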
Multi-layer RNNs
RNN Limitations
Deep Learning
Attention
Attention
Attention
Attention
Image generated by Gemini 3 Pro
Attention in Animals
Attention in Science
Recap: Sentence Encoding
Is this the best way to model a sentence?
Recap: Image Encoding
Is this the best way to model an image?
Recap: Seq2Seq
Is this enough to preserve information in the input?
Issues with Recurrent Models (1)
short-term dependency: enough with an RNN
long-term dependency: NOT enough with an RNN
O(sequence length) steps for distant states to interact
Issues with Recurrent Models (2)
[Figure: numbers (0, 1, 2, …, n) indicate the minimum number of steps before each state can be computed]
Issues with Recurrent Models (3)
BiLSTM Encoder
LSTM Decoder
Solution: Attention
Encoder-Decoder with Attention
Solution: Attention
In a lookup table, we have a table of keys that map to values. The query matches one of the keys, returning its value.
In attention, the query matches all keys softly, to a weight between 0 and 1. The keys’ values are multiplied by the weights and summed.
[Figure: a lookup table; keys a, b, c, d, e map to values v1–v5, and the query d matches exactly one key, returning its value v4]
[Figure: attention; the query q is softly matched against all keys k1–k5, and the values v1–v5 are multiplied by the resulting weights and summed into the output (a weighted sum)]
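A minimal NumPy sketch of this soft lookup; dot-product scores are one common choice of matching function:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(q, K, V):
    """The query is softly matched against every key, producing weights
    between 0 and 1 that sum to 1; the output is the weighted sum of values."""
    weights = softmax(K @ q)   # one score per key
    return weights @ V         # weighted sum of the value vectors

rng = np.random.default_rng(0)
q = rng.normal(size=4)          # query
K = rng.normal(size=(5, 4))     # keys k1..k5
V = rng.normal(size=(5, 3))     # values v1..v5
output = attention(q, K, V)
```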
Advantages of Attention
Attention for Documents
Hierarchical Attention Networks for Document Classification (https://aclanthology.org/N16-1174.pdf)
word attention
sentence attention
Attention for Documents
Hierarchical Attention Networks for Document Classification (https://aclanthology.org/N16-1174.pdf)
Attention for Documents
Hierarchical Attention Networks for Document Classification (https://aclanthology.org/N16-1174.pdf)
‘good’ gets higher attention in positive reviews
‘bad’ gets higher attention in negative reviews
Attention for Graphs
Graph Attention Networks (https://arxiv.org/abs/1710.10903)
Attention for QA Systems
# Passage
Joe went to the kitchen.
Fred went to the kitchen.
Joe picked up the milk.
Joe travelled to the office.
Joe left the milk.
Joe went to the bathroom.
# Question
Where is the milk?
Attention for QA Systems
# Passage
Joe went to the kitchen.
Fred went to the kitchen.
Joe picked up the milk.
Joe travelled to the office.
Joe left the milk.
Joe went to the bathroom.
# Question
Where is the milk?
# Answer
?
Attention for QA Systems
https://alex.smola.org/talks/ICML19-attention.pdf
End-To-End Memory Networks (https://arxiv.org/pdf/1503.08895)
iterative attention
Attention for QA Systems
https://alex.smola.org/talks/ICML19-attention.pdf
End-To-End Memory Networks (https://arxiv.org/pdf/1503.08895)
Attention for Machine Translation
https://alex.smola.org/talks/ICML19-attention.pdf
Sequence to Sequence Learning with Neural Networks (https://arxiv.org/pdf/1409.3215)
Attention for Machine Translation
https://alex.smola.org/talks/ICML19-attention.pdf
Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)
Effective Approaches to Attention-based Neural Machine Translation (https://arxiv.org/pdf/1508.04025)
Image generated by Gemini 3 Pro
Attention for Machine Translation
https://alex.smola.org/talks/ICML19-attention.pdf
Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)
Attention for Machine Translation
Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)
Attention for Image Captioning
Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention (https://arxiv.org/pdf/1502.03044)
Self-Attention as a New Building Block
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Encoder-Decoder with Attention
All words attend to all words in the previous layer
Self-Attention as a New Building Block
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Limitations of Self-Attention (1)
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
[Figure: self-attention over the tokens "went to HUFS LAI at 2025 and learned"; every token contributes a key k and a value v, and the query q attends to all of them]
No order information! There is just a summation over the set
Position Representation Vectors
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
In deep self-attention networks, we add the position vectors to the input embeddings at the first layer! You could concatenate them as well, but people mostly just add…
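A minimal sketch of the sinusoidal position vectors from "Attention Is All You Need", added to the word embeddings at the first layer:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# embeddings: a (seq_len, d_model) array of word embeddings;
# we add the position vectors rather than concatenating them:
# x = embeddings + sinusoidal_positions(*embeddings.shape)
```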
Position Representation Vectors
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/pdf/1810.04805)
Position Representation Vectors
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Position Representation Vectors
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Limitations of Self-Attention (2)
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
[Figure: attention; query q, keys k1–k5, values v1–v5, and the output is a weighted sum of the value vectors]
Even stacking multiple layers cannot introduce any non-linearity, because the output is still a summation of value vectors.
Adding nonlinearities in self-attention
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
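A minimal sketch of the usual fix: a position-wise feed-forward network that applies the same two-layer MLP, with a ReLU, to every position (all sizes illustrative):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply an elementwise nonlinearity at every position, supplying
    the non-linearity that a pure weighted sum of values lacks."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2   # (T, d) -> (T, d)
```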
Limitations of Self-Attention (3)
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Masking the Future in Self-Attention
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
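A minimal sketch of the causal mask: attention scores for future positions are set to -inf before the softmax, so each position can only attend to itself and earlier positions:

```python
import numpy as np

def mask_future(scores):
    """Zero out (post-softmax) attention to future positions by setting
    their pre-softmax scores to -inf. scores has shape (T, T)."""
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # strictly above diagonal
    masked = scores.copy()
    masked[future] = -np.inf
    return masked
```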
Self-Attention as a New Building Block
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
“Transformer”
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Understanding Transformer Step-by-Step
Transformer Decoder
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Multi-Head Attention
Attention head 1 �attends to entities
Attention head 2 attends to syntactically relevant words
[Figure: two attention heads, each with its own q/k/v, over the tokens "went to HUFS LAI at 2025 and learned"]
Multi-Head Attention
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Multi-head attention = multiple self-attention operations run in parallel, with their outputs concatenated
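A minimal NumPy sketch with per-head projection matrices; the scaled dot-product scoring follows the paper, while the parameter layout is illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads):
    """Run several attention heads in parallel over the same input X
    (shape (T, d)) and concatenate their outputs. `heads` is a list of
    per-head projection triples (Wq, Wk, Wv)."""
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product
        outputs.append(softmax(scores) @ V)       # weighted sum per position
    return np.concatenate(outputs, axis=-1)       # (T, num_heads * d_head)
```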
Transformer …
Analysis of Transformer Attention
What Does BERT Look At? An Analysis of BERT’s Attention (https://arxiv.org/pdf/1906.04341)
Could We Trust Attention?
Could We Trust Attention? (1)
Attention is not Explanation (https://arxiv.org/pdf/1902.10186)
Could We Trust Attention? (2)
Is Attention Interpretable? (https://arxiv.org/pdf/1906.03731)
Could We Trust Attention? (3)
Efficient Streaming Language Models with Attention Sinks (https://arxiv.org/pdf/2309.17453)
Could We Trust Attention? (4)
Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? (https://arxiv.org/pdf/1606.03556)
Could We Trust Attention? (5)
What Does BERT Look At? An Analysis of BERT’s Attention (https://arxiv.org/pdf/1906.04341)
How Could We Make Attention Trustable? (1)
Attention is Not Only a Weight: Analyzing Transformers with Vector Norms (https://arxiv.org/pdf/2004.10102)
How Could We Make Attention Trustable? (2)
Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words? (https://aclanthology.org/2020.acl-main.419.pdf)
Deriving Machine Attention from Human Rationales (https://arxiv.org/pdf/1808.09367)
Less is More: Attention Supervision with Counterfactuals for Text Classification (https://aclanthology.org/2020.emnlp-main.543.pdf)
How Could We Make Attention Trustable? (3)
Linguistically-Informed Self-Attention for Semantic Role Labeling (https://arxiv.org/pdf/1804.08199)