1 of 89

Machine Learning: Week 13 – Deep Learning (4)

Seungtaek Choi

Division of Language & AI at HUFS

seungtaek.choi@hufs.ac.kr

2 of 89

Deep Learning

Recurrent Neural Networks

3 of 89

Sequence Modeling

  • Sequence modeling is the task of predicting what comes next
    • E.g., “This morning I took my dog for a walk.”

    • E.g., given historical air quality, forecast air quality in next couple of hours.

given previous words 🡪 predict the next word

4 of 89

Sequence Modeling

  • To model sequences, we need to:
    • Handle variable-length sequences
    • Track long-term dependencies
    • Maintain information about order
    • Share parameters across the sequence

  • Solution:
    • Recurrent Neural Networks (RNNs)

5 of 89

A Recurrent Neural Network (RNN)

  • Apply a recurrence relation at every time step to process a sequence:

  • Note: the same function and set of parameters are used at every time step

    cell state:      h_t = f_W(h_{t−1}, x_t)
    old state:       h_{t−1}
    current input (input vector):  x_t
    output vector:   ŷ_t (read out from the cell state h_t)

6 of 89

RNN: State Update and Output

  • State update (cell state):  h_t = tanh(W_hh h_{t−1} + W_xh x_t)
  • Output vector:  ŷ_t = W_hy h_t

    where h_{t−1} is the old state, x_t is the current input (input vector), and W_hh, W_xh, W_hy are the shared weight matrices.

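To make the update concrete, here is a minimal NumPy sketch of a single RNN step following these equations (the shapes and names are illustrative assumptions):

    import numpy as np

    def rnn_step(h_prev, x_t, W_hh, W_xh, W_hy):
        """One step of a vanilla RNN (minimal sketch).
        h_prev: old state (d_h,)   x_t: current input (d_x,)
        W_hh: (d_h, d_h)  W_xh: (d_h, d_x)  W_hy: (d_y, d_h)  -- assumed shapes
        """
        h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # state update (cell state)
        y_t = W_hy @ h_t                            # output vector
        return h_t, y_t

    # The same function and parameters are reused at every time step, e.g.:
    # h = np.zeros(d_h)
    # for x_t in sequence:
    #     h, y = rnn_step(h, x_t, W_hh, W_xh, W_hy)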
7 of 89

RNN: Backpropagation Through Time

  • Forward pass through the entire sequence to compute the loss, then backward pass through the entire sequence (back through time) to compute the gradients
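As a reference for how gradients flow backward through time, the gradient of the loss at step T with respect to the recurrent weights decomposes into a sum whose terms contain a product of step-to-step Jacobians (the standard BPTT decomposition, using the notation above):

    \frac{\partial L_T}{\partial W_{hh}}
      = \sum_{k=1}^{T}
        \frac{\partial L_T}{\partial h_T}
        \left( \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}} \right)
        \frac{\partial h_k}{\partial W_{hh}}

The repeated product of Jacobians is exactly where the exploding/vanishing behavior on the next slides comes from.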

8 of 89

Standard RNN Gradient Flow: Exploding Gradients

  • Backpropagating to earlier time steps requires repeated multiplication by the recurrent weight matrix W_hh (and by the activation-function derivative)

    Many values > 1: exploding gradients 🡪 gradient clipping to scale big gradients

    Many values < 1: vanishing gradients 🡪 addressed via:
    1. Activation function
    2. Weight initialization
    3. Network architecture

9 of 89

Standard RNN Gradient Flow: Vanishing Gradients

  • The same repeated multiplication causes the opposite failure mode when the factors are small

    Many values > 1: exploding gradients 🡪 gradient clipping to scale big gradients

    Many values < 1: vanishing gradients 🡪 addressed via:
    1. Activation function
    2. Weight initialization
    3. Network architecture

10 of 89

Why is Exploding Gradient a Problem?

  • If the gradient becomes too big, then the SGD update step becomes too big:

    θ_new = θ_old − α ∇_θ J(θ)   (with a very large gradient ∇_θ J(θ))

  • This can cause bad updates: we take too large a step and reach a weird and bad parameter configuration (with large loss)
    • You think you’ve found a hill to climb, but suddenly you’re in Iowa.

  • In the worst case, this will result in Inf or NaN in your network (then you have to restart training from an earlier checkpoint)

11 of 89

Solution: Gradient Clipping

  • Gradient clipping: if the norm of the gradient is greater than some threshold, scale it down before applying SGD update

  • Intuition: take a step in the same direction, but a smaller step

  • In practice: remembering to clip gradients is important, but exploding gradients are an easy problem to solve.
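As a concrete illustration, here is a minimal NumPy sketch of gradient-norm clipping (the threshold of 5.0 is an arbitrary example value):

    import numpy as np

    def clip_gradients(grads, threshold=5.0):
        """Rescale gradients if their global L2 norm exceeds `threshold`.
        `grads` is assumed to be a list of gradient arrays, one per parameter.
        """
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > threshold:
            scale = threshold / total_norm      # same direction, smaller step
            grads = [g * scale for g in grads]
        return grads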

12 of 89

Vanishing Gradient Intuition

 

Vanishing gradient problem:

When the step-to-step terms ∂h_t/∂h_{t−1} are small, the gradient signal gets smaller and smaller as it backpropagates further back in time

13 of 89

Why is Vanishing Gradient a Problem?

Gradient signal from far away is lost because it’s much smaller than gradient signal from close-by.

So, model weights are basically updated only with respect to near effects, not long-term effects.

14 of 89

Long Short-Term Memory (LSTM)

  • In an LSTM network, recurrent modules contain gated cells that control the information flow

15 of 89

Long Short-Term Memory (LSTM)

  • Information is added to or removed from the cell state through structures called gates.

Gates optionally let information through, via a sigmoid layer and pointwise multiplication

16 of 89

LSTM: Cell State Matters

  •  

 

17 of 89

LSTM: Cell State Matters

  • You can think of the LSTM equations visually like this:
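Since the figure did not survive extraction, here is the standard LSTM formulation for reference (a sketch of the usual equations; the slide’s own notation may differ in details):

    \begin{aligned}
    f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
    i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
    o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
    \tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate cell state} \\
    c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state update} \\
    h_t &= o_t \odot \tanh(c_t) && \text{hidden state / output}
    \end{aligned}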

18 of 89

LSTM: Mitigate Vanishing Gradient

  • Vanilla RNNs: backpropagating from a later step to an earlier one multiplies the same factor (the recurrent weight matrix and the activation derivative) over and over
    • If these factors are mostly < 1, the product shrinks exponentially 🡪 … Vanish!
    • If these factors are mostly > 1, the product grows exponentially 🡪 … Explode!

19 of 89

LSTM: Mitigate Vanishing Gradient

  • Vanilla RNNs: the gradient is repeatedly multiplied by the same recurrent weights, so it vanishes or explodes

  • LSTM: the cell state is updated additively, c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t, so the gradient along the cell-state path is scaled by the forget gate f_t rather than by the recurrent weights

So…

We can keep information if we want! (by adjusting how much we forget, i.e., keeping the forget gate close to 1)

20 of 89

Supplementary: Residual Connection

  • Is vanishing/exploding gradient just an RNN problem?

  • No! It can be a problem for all neural architectures (including feed-forward and convolutional neural networks), especially very deep ones.
    • Due to chain rule / choice of nonlinearity function, gradient can become vanishingly small as it backpropagates
    • Thus, lower layers are learned very slowly (i.e., are hard to train)

  • Another solution: lots of new deep feedforward/convolutional architectures add more direct connections (thus allowing the gradient to flow)

  • For example:
    • Residual connections, a.k.a. “ResNet”
    • Also known as skip-connections
    • The identity connection preserves information by default
    • This makes deep networks much easier to train
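A minimal sketch of a residual (skip) connection, with hypothetical weight matrices W1 and W2, showing how the identity path preserves information by default:

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def residual_block(x, W1, W2):
        """output = x + F(x), where F is a small two-layer transformation.
        The identity path lets gradients flow directly to earlier layers.
        (Illustrative only; x is (d,), W1 is (d, k), W2 is (k, d).)
        """
        h = relu(x @ W1)       # transformation F(x)
        return x + h @ W2      # skip connection adds the input back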

21 of 89

RNN Applications

  • RNNs can be used as a sentence encoder model, e.g., for sentiment classification

Usually better: take the element-wise max or mean of all hidden states

22 of 89

RNN Applications

  • RNNs can be used to generate text based on other information
  • e.g., speech recognition, machine translation, summarization

23 of 89

RNN Applications

24 of 89

Bidirectional and Multi-layer RNNs

  • Motivation

We can regard this hidden state as a representation of the word “terribly” in the context of this sentence. We call this a contextual representation.

[Diagram: a left-to-right RNN over the sentence “the movie was terribly exciting !”, with an element-wise mean/max over the hidden states feeding the classifier.]

These contextual representations only contain information about the left context (e.g., “the movie was”).

What about right context?

In this example, “exciting” is in the right context and this modifies the meaning of “terribly” (from negative to positive)

25 of 89

Bidirectional and Multi-layer RNNs

This contextual representation of “terribly” has both left and right context!

Concatenated hidden states

Forward RNN

Backward RNN

26 of 89

Bidirectional RNNs

  • Forward RNN:  →h_t = RNN_FW(x_t, →h_{t−1})
  • Backward RNN:  ←h_t = RNN_BW(x_t, ←h_{t+1})
  • Concatenated hidden states:  h_t = [→h_t ; ←h_t]

This is a general notation to mean “compute one forward step of the RNN” – it could be a simple RNN or LSTM computation.

We regard this as “the hidden state” of a bidirectional RNN.

This is what we pass on the next parts of the network.

Generally, these two RNNs have separate weights

27 of 89

Bidirectional RNNs: Simplified Diagram

The two-way arrows indicate bidirectionality and the depicted hidden states are assumed to be the concatenated forwards+backwards states

28 of 89

Multi-layer RNNs

RNN layer 1

RNN layer 2

RNN layer 3

 

29 of 89

Multi-layer RNNs

  • Multi-layer or stacked RNNs allow a network to compute more complex representations – they work better than just one layer of high-dimensional encodings!
    • The lower RNNs should compute lower-level features and the higher RNNs should compute higher-level features.
  • High-performing RNNs are usually multi-layer (but aren’t as deep as convolutional or feed-forward networks)
  • For example: In a 2017 paper, Britz et al. find that for Neural Machine Translation, 2 to 4 layers is best for encoder RNN, and 4 layers is best for the decoder RNN
    • Often 2 layers is a lot better than 1, and 3 might be a little better than 2
    • Usually, skip-connections/dense-connections are needed to train deeper RNNs (e.g., 8 layers)
  • Transformer-based networks (e.g., BERT) are usually deeper, like 12 or 24 layers.
    • You will learn about Transformers later; they have a lot of skip-like connections.

30 of 89

RNN Limitations

  • Limitations
    • Encoding bottleneck: fixed-size hidden state 🡪 information loss
      • What we want instead: a continuous stream of information
    • Slow, no parallelization: step-by-step processing 🡪 slow on long sequences
      • What we want instead: parallelization
    • Not long memory: vanishing/exploding gradients 🡪 weak long-term dependencies
      • What we want instead: long memory

31 of 89

Deep Learning

Attention

32 of 89

Attention

33 of 89

Attention

34 of 89

Attention

generated by Gemini 3 Pro

35 of 89

Attention in Animals

  • Resource saving
    • Only need sensors where relevant bits are �(e.g., fovea vs. peripheral vision)
    • Only compute relevant bits of information �(e.g., fovea has many more ‘pixels’ than periphery)
  • Variable state manipulation
    • Manipulate environment (for all grains do: eat)
    • Learn modular subroutines (not state)

36 of 89

Attention in Science

  • Behavioral and cognitive process of selectively concentrating on a discrete aspect of information while ignoring other perceivable information
  • State of arousal
  • It is the taking possession by the mind in clear and vivid form of one out of what seem several simultaneous objects or trains of thought.
  • Focalization, the concentration of consciousness
  • The allocation of limited cognitive processing resources

37 of 89

Recap: Sentence Encoding

Is this the best way to model a sentence?

38 of 89

Recap: Image Encoding

Is this the best way to model an image?

39 of 89

Recap: Seq2Seq

Is this enough to preserve information in the input?

40 of 89

Issues with Recurrent Models (1)

  • Linear interaction distance
    • RNNs are unrolled “left-to-right”.
    • This encodes linear locality: a useful heuristic!
      • Nearby words often affect each other’s meanings

    • Problem: RNNs take O(sequence length) steps for distant word pairs to interact.
      • Hard to learn long-distance dependencies (because gradient problems!)

short-term dependency: an RNN handles this well (small interaction distance)

long-term dependency: NOT handled well by an RNN – the interaction distance is O(sequence length)

41 of 89

Issues with Recurrent Models (2)

  • Lack of parallelizability
    • Forward and backward passes have O(sequence length) unparallelizable operations
      • GPUs can perform a bunch of independent computations at once!
      • But future RNN hidden states can’t be computed in full before past RNN hidden states have been computed.
      • Inhibits training on very large datasets!

[Diagram: a chain of RNN hidden states labeled 0, 1, 2, …, n – the numbers indicate the minimum number of steps before a state can be computed.]

42 of 89

Issues with Recurrent Models (3)

  • Encoding bottleneck
    • It’s like a one-shot compression (sequence of features 🡪 a vector).
    • RNN + classification is like judging a book by its summary.

BiLSTM Encoder

LSTM Decoder

43 of 89

Solution: Attention

  • Core idea: on each step of the decoder, use a direct connection to the encoder to focus on (attend to) a particular part of the input sequence, instead of relying on a single fixed-size encoding.

Encoder-Decoder with Attention

44 of 89

Solution: Attention

  • We can think of attention as performing fuzzy lookup in a key-value store.

In a lookup table, we have a table of keys that map to values. The query matches one of the keys, returning its value.

In attention, the query matches all keys softly, to a weight between 0 and 1. The keys’ values are multiplied by the weights and summed.

[Diagram, left – exact lookup table: the query d matches the key d exactly and returns its value v4.
Diagram, right – attention: the query q is scored against all keys k1…k5; the resulting weights multiply the values v1…v5 and the output is their weighted sum.]
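A minimal NumPy sketch of this fuzzy lookup, using dot-product scoring as one common choice (function and variable names are illustrative, not from the slides):

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def attention(query, keys, values):
        """Attention as fuzzy key-value lookup.
        query: (d,)   keys: (n, d)   values: (n, d_v)
        Each key is scored against the query; softmax turns the scores into
        weights in [0, 1] that sum to 1; the output is the weighted sum of values.
        """
        scores = keys @ query        # similarity of the query to every key
        weights = softmax(scores)    # soft match instead of exact match
        return weights @ values      # weighted sum of values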

45 of 89

Advantages of Attention

  • Attention is better at preserving the input features (the values are combined directly, without being compressed through a nonlinear encoding)
  • Attention is better at adapting its weights to the query/task.
  • Attention is better at capturing long-range dependencies.
  • Attention is better for interpreting the model’s predictions.

46 of 89

Attention for Documents

  •  

Hierarchical Attention Networks for Document Classification (https://aclanthology.org/N16-1174.pdf)

word attention

sentence attention

47 of 89

Attention for Documents

  • Some words are more important than others.
  • Some sentences are more important than others.

Hierarchical Attention Networks for Document Classification (https://aclanthology.org/N16-1174.pdf)

48 of 89

Attention for Documents

Hierarchical Attention Networks for Document Classification (https://aclanthology.org/N16-1174.pdf)

‘good’ gets higher attention in positive reviews

‘bad’ gets higher attention in negative reviews

49 of 89

Attention for Graphs

Graph Attention Networks (https://arxiv.org/abs/1710.10903)

50 of 89

Attention for QA Systems

  • Question answering = given the passage, answer the question.

# Passage

Joe went to the kitchen.

Fred went to the kitchen.

Joe picked up the milk.

Joe travelled to the office.

Joe left the milk.

Joe went to the bathroom.

# Question

Where is the milk?

51 of 89

Attention for QA Systems

  • Simple attention selects sentences with ‘milk’.
  • Attention pooling doesn’t help much since it misses intermediate steps.

# Passage

Joe went to the kitchen.

Fred went to the kitchen.

Joe picked up the milk.

Joe travelled to the office.

Joe left the milk.

Joe went to the bathroom.

# Question

Where is the milk?

# Answer

?

52 of 89

Attention for QA Systems

  • The model needs to reason sequentially.
  • Multi-hop attention solves the problem.
    • Hop 1: Focuses on “milk” 🡪 finds “Joe left the milk.”
    • Hop 2: Focuses on “Joe” and temporal context 🡪 finds “Joe travelled to the office.”

iterative attention

 

53 of 89

Attention for QA Systems

  • Multiple computational steps (hops) are performed per output symbol.

54 of 89

Attention for Machine Translation

  • Sequence-to-sequence (seq2seq)
    • Encode source sequence to latent representation
    • Decode to target sequence one character at a time
    • Need memory for long sequences

https://alex.smola.org/talks/ICML19-attention.pdf

Sequence to Sequence Learning with Neural Networks (https://arxiv.org/pdf/1409.3215)

55 of 89

Attention for Machine Translation

  •  

https://alex.smola.org/talks/ICML19-attention.pdf

Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)

Effective Approaches to Attention-based Neural Machine Translation (https://arxiv.org/pdf/1508.04025)

Image generated by Gemini 3 Pro

56 of 89

Attention for Machine Translation

  • With attention, translation quality remains much more stable for long sentences.

https://alex.smola.org/talks/ICML19-attention.pdf

Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)

57 of 89

Attention for Machine Translation

  •  

Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)

58 of 89

Attention for Image Captioning

  • In computer vision, attention learns alignment between image region and word.

Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention (https://arxiv.org/pdf/1502.03044)

59 of 89

Self-Attention as a New Building Block

  • Prev: Attention on recurrent models
  • Today: attention only
    • Number of unparallelizable operations does not increase with sequence length.
    • Maximum interaction distance: O(1), since all words interact at every layer!
    • How? Treat each word’s representation as a query to access and incorporate information from a set of values.

Encoder-Decoder with Attention

All words attend to all words in previous layer

60 of 89

Self-Attention as a New Building Block

  •  
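As a reference for the missing formula, here is a standard formulation of self-attention in which each word representation x_i acts as a query over all positions (the slide’s exact notation is assumed to be equivalent):

    \begin{aligned}
    q_i &= W^{Q} x_i, \qquad k_j = W^{K} x_j, \qquad v_j = W^{V} x_j \\
    \alpha_{ij} &= \frac{\exp(q_i^\top k_j)}{\sum_{j'} \exp(q_i^\top k_{j'})}, \qquad
    o_i = \sum_j \alpha_{ij}\, v_j
    \end{aligned}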

61 of 89

Limitations of Self-Attention (1)

  • 1st challenge: self-attention doesn’t have an inherent notion of order!

[Diagram: self-attention over the tokens “went to HUFS LAI at 2025 and learned”, each token providing a key and a value (one query shown).]

No order information! There is just a summation over the set.

62 of 89

Position Representation Vectors

  • Represent each sequence index i as a position vector p_i, and combine it with the input embedding at that position: x̃_i = x_i + p_i

In deep self-attention networks, we do this at the first layer! You could concatenate them as well, but people mostly just add…

63 of 89

Position Representation Vectors

Attention Is All You Need (https://arxiv.org/pdf/1706.03762)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/pdf/1810.04805)

64 of 89

Position Representation Vectors

  • Sinusoidal position representations: concatenate sinusoidal functions of varying periods:

  • Pros:
    • Periodicity indicates that maybe “absolute position” isn’t as important
    • Maybe can extrapolate to longer sequences as periods restart!
  • Cons:
    • Not learnable; also the extrapolation doesn’t really work!
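For reference, the sinusoidal position encoding defined in “Attention Is All You Need” (cited on the previous slide), where pos is the position and i indexes pairs of dimensions of a d_model-dimensional embedding:

    PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right), \qquad
    PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right)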

65 of 89

Position Representation Vectors

  •  

66 of 89

Limitations of Self-Attention (2)

  • 2nd challenge: there are no nonlinearities. It’s all just weighted averages.

[Diagram: two self-attention layers stacked on top of each other; each output is a weighted sum of value vectors.]

Even stacking multiple layers cannot introduce any non-linearity because it’s still a summation of value vectors

67 of 89

Adding Nonlinearities in Self-Attention

  • Easy fix: apply the same position-wise feed-forward network to each self-attention output, e.g., FF(o_i) = W_2 ReLU(W_1 o_i + b_1) + b_2 (see the sketch below)
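A minimal NumPy sketch of that fix: the same small feed-forward network applied independently to each position’s self-attention output (the weight shapes are hypothetical):

    import numpy as np

    def position_wise_ffn(outputs, W1, b1, W2, b2):
        """Apply the same two-layer MLP to every position independently.
        outputs: (n, d) matrix of per-position self-attention outputs;
        W1: (d, d_ff), b1: (d_ff,), W2: (d_ff, d), b2: (d,) -- assumed shapes.
        """
        return np.maximum(0.0, outputs @ W1 + b1) @ W2 + b2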

68 of 89

Limitations of Self-Attention (3)

  • 3rd challenge: need to ensure we don’t “look at the future” when predicting a sequence

69 of 89

Masking the Future in Self-Attention

  • To use parallel matrix operations during training while still conditioning only on the past, mask out attention to future positions: set their attention scores to −∞ before the softmax, so their weights become 0 (see the sketch below).
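A minimal NumPy sketch of causal masking: future positions get a score of −∞ before the softmax, so their attention weights become exactly 0 (names and shapes are illustrative):

    import numpy as np

    def causal_self_attention_weights(scores):
        """Mask an (n, n) matrix of raw attention scores (float) so that
        position i can only attend to positions j <= i, then softmax row-wise.
        """
        n = scores.shape[0]
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future
        masked = np.where(mask, -np.inf, scores)           # -inf -> weight 0 after softmax
        e = np.exp(masked - masked.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

With this mask, all positions can still be computed in parallel during training.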

70 of 89

Self-Attention as a New Building Block

  • Barriers and solutions:
    • 1. Doesn’t have an inherent notion of order
      • -> Add position representations to the inputs
    • 2. No nonlinearities for deep learning magic. It’s all just weighted averages
      • -> Easy fix: apply the same feedforward network to each self-attention output
    • 3. Need to ensure we don’t “look at the future” when predicting a sequence
      • -> Mask out the future by artificially setting attention weights to 0!

71 of 89

“Transformer”

  • Self-attention:
    • The basis of the method.
  • Position representations:
    • Specify the sequence order, since self-attention is an unordered function of its inputs.
  • Nonlinearities:
    • At the output of the self-attention block
    • Frequently implemented as a simple feed-forward network.
  • Masking:
    • In order to parallelize operations while not looking at the future.
    • Keeps information about the future from “leaking” to the past.

72 of 89

Understanding Transformer Step-by-Step

73 of 89

Transformer Decoder

  • A Transformer decoder is how we’ll build systems like language models.
  • It’s a lot like our minimal self-attention architecture, but with a few more components.
  • The embeddings and position embeddings are identical.
  • We’ll next replace our self-attention with multi-head self-attention.

74 of 89

Multi-Head Attention

  • What if we want to attend to several places in the sentence at once, for different reasons? Use multiple attention “heads”, each with its own query/key/value transformations.

Attention head 1 �attends to entities

Attention head 2 attends to syntactically relevant words

75 of 89

[Diagram: one attention head over the tokens “went to HUFS LAI at 2025 and learned”, with a query, key, and value per token.]

76 of 89

[Diagram: a second attention head over the same tokens “went to HUFS LAI at 2025 and learned”.]

77 of 89

Multi-Head Attention

  •  

Multi-head attention = running multiple self-attention heads in parallel and combining their outputs
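A minimal NumPy sketch of multi-head self-attention with scaled dot-product scoring (the shapes and the simple head-slicing scheme are simplifying assumptions, not the exact Transformer implementation):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
        """X: (n, d) token representations; Wq, Wk, Wv, Wo: (d, d);
        d is assumed divisible by n_heads."""
        d = X.shape[1]
        dh = d // n_heads
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        outputs = []
        for h in range(n_heads):
            sl = slice(h * dh, (h + 1) * dh)               # this head's slice of the model dim
            scores = Q[:, sl] @ K[:, sl].T / np.sqrt(dh)   # scaled dot-product attention
            weights = softmax(scores, axis=-1)
            outputs.append(weights @ V[:, sl])             # weighted sum of values per head
        return np.concatenate(outputs, axis=-1) @ Wo       # concatenate heads, project back

In practice each head usually has its own learned projection matrices; slicing a single projection is just a compact way to sketch the idea.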

78 of 89

Transformer …

  • Today, we will focus more on “attention”, rather than “Transformer”.
  • However, it’s recommended to study the full details of Transformer, such as …
    • Low-rank computation in multi-head attention
    • Scaled dot-product attention (raw dot products between high-dimensional vectors get large, so they are scaled down)
    • Residual connections
    • Layer normalization
    • Cross-attention (the decoder attends to the encoder states)
    • Computation that grows quadratically with sequence length

79 of 89

80 of 89

Analysis of Transformer Attention

What Does BERT Look At? An Analysis of BERT’s Attention (https://arxiv.org/pdf/1906.04341)

81 of 89

Could We Trust Attention?

  • Attention looks like an explanation for its prediction.
    • We can visualize which source tokens the model “attends to”.

  • But: Can we trust these weights as explanations?

82 of 89

Could We Trust Attention? (1)

  • Is attention an explanation for its prediction?
    • If attention were a faithful explanation for its prediction, then perturbing attention should significantly change the prediction.
    • In that sense, (Jain & Wallace, 2019) claims attention weights are NOT reliably faithful explanations.
    • They constructed shuffled / adversarial attention distributions on a fixed model, but the predictions stayed almost the same.

Attention is not Explanation (https://arxiv.org/pdf/1902.10186)

83 of 89

Could We Trust Attention? (2)

  • Is a token with high attention really one that matters?
    • (Serrano & Smith, 2019) remove words in order of attention vs. gradients, and find that deleting high-attention words often barely affects the prediction.
    • => Attention weights are not a reliable importance ranking, so we should be careful using them as explanations.

Is Attention Interpretable? (https://arxiv.org/pdf/1906.03731)

84 of 89

Could We Trust Attention? (3)

  • Is attention concentrated on “meaningful” tokens?
    • (Xiao et al., 2024) found that many LLMs assign very high attention to a few special / initial tokens, regardless of the actual content (“attention sinks”).

Efficient Streaming Language Models with Attention Sinks (https://arxiv.org/pdf/2309.17453)

85 of 89

Could We Trust Attention? (4)

  • Is AI’s attention aligned with humans’ attention?
    • In visual question answering, AI’s attention maps show low correlation with humans’ attention (eye-gaze).

Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? (https://arxiv.org/pdf/1606.03556)

86 of 89

Could We Trust Attention? (5)

  • Is AI’s attention aligned with humans’ attention?
    • In natural language processing too, AI’s attention maps show low correlation with humans’ semantic importance.

What Does BERT Look At? An Analysis of BERT’s Attention (https://arxiv.org/pdf/1906.04341)

87 of 89

How Could We Make Attention Trustable? (1)

  •  

Attention is Not Only a Weight: Analyzing Transformers with Vector Norms (https://arxiv.org/pdf/2004.10102)

88 of 89

How Could We Make Attention Trustable? (2)

  •  

Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words? (https://aclanthology.org/2020.acl-main.419.pdf)

Deriving Machine Attention from Human Rationales (https://arxiv.org/pdf/1808.09367)

Less is More: Attention Supervision with Counterfactuals for Text Classification (https://aclanthology.org/2020.emnlp-main.543.pdf)

89 of 89

How Could We Make Attention Trustable? (3)

  • Multi-head self-attention can be supervised with linguistic information.

Linguistically-Informed Self-Attention for Semantic Role Labeling (https://arxiv.org/pdf/1804.08199)

+ https://www.youtube.com/watch?v=nfti9lEs8k8