1 of 118

CS 6120

Lecture 9 (Topic 8.2): Modeling Sequences

2 of 118

Administrative

  • Hope you had a great Spring Break!
  • Today’s papers:
    • GPT Survey
    • Siamese Networks
  • Homeworks are due today; the RNN homework is due by next week.

3 of 118

Some People I Follow on the Internet

4 of 118

The Remainder of the Course

  • March 20, Next Week: Building Blocks of LLMs – the Transformer
  • March 27, Introduction to LLMs: Tuning and Inference
  • April 3, LLM Applications with RAG Systems
  • April 10, Lecturer on Travel, No Class
  • April 17, LLM Agents and LLMs in Practice
  • April 24, Project Presentations - Papers and Projects

5 of 118

LLM Project

  • Paper Idea OR Deploy an LLM RAG System
  • Project Proposals Due March 21
  • Project Website
    • Live Demonstration of Your RAG System:
      • Provide an endpoint
      • Submit the writeup and project repository

6 of 118

Recurrent Neural Networks (RNNs)

7 of 118

Named Entity Recognition

8 of 118

What is it good for?

9 of 118

Sequence to Sequence Modeling

10 of 118

Machine Translation with Sequence to Sequence Modeling

11 of 118

Machine Translation with Sequence to Sequence Modeling

  • One of the “holy grail” problems in artificial intelligence
  • Practical use case: Facilitate communication between people in the world
  • Extremely challenging (especially for low-resource languages)

12 of 118

How Difficult is Translation?

13 of 118

Modeling Sequences

Section 0: Administrivia

Section 1: Intro to Recurrent Neural Networks

Section 2: Gated RNNs and LSTMs

Section 3: Modernized Tech: Machine Translation

14 of 118

Feed Forward Networks vs RNNs

15 of 118

Recurrence in Neural Networks

Apply the same weights (U, V, W) repeatedly.

(Figure: unrolled RNN — an input sequence of any length feeds a chain of hidden states; per-step outputs are optional)
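A minimal sketch of this recurrence in NumPy (the weight names U, V, W follow the slide; the shapes, bias terms, and pre-softmax outputs are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def rnn_forward(xs, U, V, W, b_h, b_y):
    """Vanilla RNN forward pass over an input sequence of any length.

    The same U (input), W (recurrent), and V (output) weights are reused at every step.
    """
    h = np.zeros(W.shape[0])
    hidden_states, outputs = [], []
    for x in xs:
        h = np.tanh(W @ h + U @ x + b_h)    # update the hidden state
        hidden_states.append(h)
        outputs.append(V @ h + b_y)         # per-step output (optional, pre-softmax)
    return hidden_states, outputs
```

For a language model, the per-step outputs would additionally be passed through a softmax over the vocabulary.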

16 of 118

Terminology

(next section)

17 of 118

What are we predicting?

RNNs can be used for a variety of tasks, ranging from machine translation to caption generation. There are several common ways to arrange an RNN's inputs and outputs:

  • One to One: given the scores of a championship, you can predict the winner.
  • One to Many: given an image, you can predict what its caption is going to be.
  • Many to One: given a tweet, you can predict the sentiment of that tweet.
  • Many to Many: given an English sentence, you can translate it into German.
  • Many to Few / Few to Many: given a phrase, predict the full sentence, and vice versa.

18 of 118

One to Many: Microsoft CoCo Challenges

19 of 118

Many to One: Classification

20 of 118

OFA: A Mix of Applications

21 of 118

Sequence to Sequence

A Number of Applications:

  • Summarization
    • Initial generation approaches
  • Language Modeling
    • RNNs were a precursor and a step towards LLMs
  • Machine Translation
    • Today’s lecture / application

22 of 118

Some Vocabulary

Machine Translation (MT): The goal is to automatically translate text from one language to another.

Statistical Machine Translation (SMT): relies on statistical models to learn how to translate between languages.

Neural Machine Translation (NMT): the use of neural networks and large corpora to translate between languages.

Parallel Corpora: Bilingual text data with pairs of samples in each language

23 of 118

DeepSpeech: The First RNN at Scale

Hannun, Case, Catanzaro, Diamos, Elsen, Prenger, Satheesh, Sengupta, Coates, and Ng

Mozilla later open sourced the technology

24 of 118

The Broader Picture of Neural Machine Translation

  • Since 2015, transcription / translation models have largely been RNN-based

  • DeepSpeech (ASR): directly predicting the Chinese characters

You may know this guy

But at NEU, do you know this guy?

25 of 118

Neural Translation Models Today

  • Overwhelmingly Transformer-based
  • GNMT 2016
  • Microsoft + Amazon 2018/19
  • DeepL 2017/18

26 of 118

Section 1: Introduction to RNNs

  • A Brief Introduction to RNNs
  • Deconstructing the Basic RNN
  • Backpropagation through Time
  • Sequence to Sequence: Modeling & Inference
  • Bi-Directionality in RNNs

27 of 118

Recall Trigram Models
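As a reminder (standard n-gram notation, assumed here rather than taken from the slide), a trigram model conditions each word on only the previous two:

$$P(w_1, \dots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-2}, w_{t-1})$$

An RNN, by contrast, conditions on the entire history through its hidden state: $P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid h^{(t-1)})$.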

28 of 118

The Effective Memory of the Trigram vs RNNs

29 of 118

The Effective Memory of the Trigram vs RNNs

30 of 118

The Anatomy of the Basic RNN

31 of 118

Your Basic RNN

32 of 118

Recurrent Neural Language Models

33 of 118

Recurrent Neural Language Models

34 of 118

Recurrent Neural Language Models: Tying Word Weights

35 of 118

Performance on 5-Gram Prediction

36 of 118

Attributes of RNNs

37 of 118

Sequence Length vs RNN Prediction

  • How far into the future should you predict?
  • Character prediction: some consistent behavior

38 of 118

Implementation Notes: for Forward Propagation

Frameworks like TensorFlow require this type of abstraction

Parallel computation and GPU usage

tensorflow.scan()
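A minimal sketch of how the recurrence can be expressed with tf.scan (the sizes, weight names, and time-major layout are illustrative assumptions, not the course's reference implementation):

```python
import tensorflow as tf

hidden, inp, seq_len, batch = 16, 8, 10, 4      # illustrative sizes
W = tf.random.normal([hidden, hidden])           # recurrent weights
U = tf.random.normal([hidden, inp])              # input weights
b = tf.zeros([hidden])

def step(h_prev, x_t):
    # One RNN step: h_t = tanh(W h_{t-1} + U x_t + b)
    return tf.tanh(tf.matmul(h_prev, W, transpose_b=True) +
                   tf.matmul(x_t, U, transpose_b=True) + b)

xs = tf.random.normal([seq_len, batch, inp])     # time-major inputs
h0 = tf.zeros([batch, hidden])
hs = tf.scan(step, xs, initializer=h0)           # hidden states for every step
```

Expressing the loop as a scan lets the framework compile one graph for the whole sequence, which is what enables the parallel computation and GPU usage mentioned above.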

39 of 118

Section 1: Introduction to RNNs

  • A Brief Introduction to RNNs
  • Deconstructing the Basic RNN
  • Backpropagation through Time
  • Sequence to Sequence: Modeling & Inference
  • Bi-Directionality in RNNs

40 of 118

Training RNN Language Models

41 of 118

Training RNN Language Models

42 of 118

Backwards Pass: RNNs

43 of 118

Gradient calculations

Wx and Wh are the same at every step.

The gradient is proportional to the sum of partial-derivative products across time steps.

44 of 118

Gradient Derivation

Intermediate layer: $h^{(t)} = \tanh\big(W h^{(t-1)} + U x^{(t)} + b_1\big)$

Output layer: $\hat{y}^{(t)} = \mathrm{softmax}\big(V h^{(t)} + b_2\big)$

Output loss: $J^{(t)}(\theta) = -\sum_{w} y^{(t)}_{w} \log \hat{y}^{(t)}_{w}$

Loss function over time: $J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)$

(Figure: unrolled network with input x, input weights U, recurrent weights W, hidden state h(t), and output parameters V, b)

45 of 118

Gradient at Time t from the Contribution at Each Time Step

  • The gradient at time t sums the contributions from every earlier time step (see the sketch below)
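Written out (a sketch using the loss and weights defined above), the contribution of each earlier step k involves a product of Jacobians:

$$\frac{\partial J^{(t)}}{\partial W} = \sum_{k=1}^{t} \frac{\partial J^{(t)}}{\partial h^{(t)}} \left( \prod_{j=k+1}^{t} \frac{\partial h^{(j)}}{\partial h^{(j-1)}} \right) \frac{\partial h^{(k)}}{\partial W}$$

The repeated Jacobian product is exactly what shrinks (vanishing gradients) or blows up (exploding gradients) as t − k grows.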

46 of 118

Backpropagation through Time

47 of 118

Vanishing Gradients

48 of 118

Vanishing Gradients

49 of 118

Solving the Exploding Gradients
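A common remedy is gradient clipping by global norm; a minimal NumPy sketch (the threshold value is an illustrative assumption):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        grads = [g * scale for g in grads]
    return grads
```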

50 of 118

Truncated BPTT

51 of 118

Section 1: Introduction to RNNs

  • A Brief Introduction to RNNs
  • Deconstructing the Basic RNN
  • Backpropagation through Time
  • Sequence to Sequence: Modeling & Inference
  • Bi-Directionality in RNNs

52 of 118

An Overview of Sequence to Sequence Problems

53 of 118

Sequence to Sequence: Encoder

54 of 118

Sequence to Sequence: Decoder

55 of 118

Sequence to Sequence: Decoder

56 of 118

Sequence to Sequence: Encoder / Decoder

57 of 118

Sequence to Sequence: Encoder / Decoder

58 of 118

Sequence to Sequence: Encoder / Decoder Network

59 of 118

Applying the Loss Function for Seq2Seq RNNs
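In the standard setup (a sketch; the notation is assumed rather than taken from the slide figure), the decoder is trained with a per-token cross-entropy loss conditioned on the source sentence x and the gold target prefix:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P\big(y_t \mid y_1, \dots, y_{t-1}, x\big)$$

Feeding the gold prefix during training, rather than the model's own predictions, is commonly called teacher forcing.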

60 of 118

Training Seq2Seq Models

61 of 118

Inference / Decoding Seq2Seq RNNs

62 of 118

Beam Search: Keeping Track of Non-Maximal Solutions

63 of 118

Example: Beam Search

64 of 118

Example: Beam Search - Backtracking

65 of 118

Beam Search: Details
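A minimal beam search sketch (the step_fn interface, scoring, and length normalization here are illustrative assumptions, not a specific library API):

```python
def beam_search(step_fn, bos, eos, beam_size=4, max_len=20):
    """step_fn(prefix) -> list of (token, log_prob) continuations (hypothetical interface)."""
    beams = [([bos], 0.0)]                 # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix):
                candidates.append((prefix + [token], score + logp))
        # Keep only the beam_size highest-scoring partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score / len(prefix)))   # length-normalized
            else:
                beams.append((prefix, score))
        if not beams:
            break
    pool = finished or [(p, s / len(p)) for p, s in beams]
    return max(pool, key=lambda c: c[1])
```

Unlike greedy decoding, this keeps beam_size non-maximal partial hypotheses alive, and backtracking falls out of storing full prefixes rather than single tokens.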

66 of 118

Neural Machine Translation vs Statistical Machine Translation

67 of 118

NMT: The First Successful Application of Deep Learning in NLP

68 of 118

Machine Translation with Sequence to Sequence Modeling

  • One of the “holy grail” problems in artificial intelligence
  • Practical use case: Facilitate communication between people in the world
  • Extremely challenging (especially for low-resource languages)

69 of 118

How Difficult is Translation?

70 of 118

Basics of Machine Translation

71 of 118

Section 1: Introduction to RNNs

  • A Brief Introduction to RNNs
  • Deconstructing the Basic RNN
  • Backpropagation through Time
  • Sequence to Sequence: Modeling & Inference
  • Bi-Directionality in RNNs

72 of 118

Recap: Simple RNNs

73 of 118

Bi-Directional RNNs

74 of 118

Bi-Directional Networks in General
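A sketch under standard notation (two independent parameter sets, one per direction): run one RNN left-to-right, another right-to-left, and concatenate the two hidden states at each position:

$$\overrightarrow{h}_t = f\big(\overrightarrow{W}\,\overrightarrow{h}_{t-1} + \overrightarrow{U} x_t\big), \qquad \overleftarrow{h}_t = f\big(\overleftarrow{W}\,\overleftarrow{h}_{t+1} + \overleftarrow{U} x_t\big), \qquad h_t = \big[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\big]$$

Because the backward pass needs the entire input, this fits encoders and taggers but not left-to-right language modeling.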

75 of 118

Modeling Sequences

Section 0: Administrivia

Section 1: Intro to Recurrent Neural Networks

Section 2: Gated RNNs and LSTMs

Section 3: Modernized Tech: Machine Translation

76 of 118

Recap of Basic RNN Issues

77 of 118

Advanced RNN Variants

78 of 118

Section 2: Advanced RNNs

  • Long Short Term Memory (LSTM) Networks
  • Gated Recurrent Units (GRUs)
  • Comparing the two RNNs

79 of 118

Gated Recurrent Units (GRUs)

80 of 118

Gated Recurrent Units (GRUs)

  • GRUs = vanilla RNNs + gates
  • Keep / update relevant information in hidden states
  • Update and inference based on current inputs and prior time-step outputs

81 of 118

Introducing the Gated Recurrent Unit

(Figure: GRU cell, step 1 — the relevance gate and update gate are computed from the input x and the previous hidden state ht-1)

82 of 118

Introducing the Gated Recurrent Unit

(Figure: GRU cell, step 2 — a hidden state candidate h′ is formed from x and the relevance-gated ht-1)

83 of 118

Introducing the Gated Recurrent Unit

(Figure: GRU cell, step 3 — the update gate interpolates (1− and +) between ht-1 and the candidate to produce ht)

84 of 118

Introducing the Gated Recurrent Unit

(Figure: GRU cell, step 4 — ht is passed on as the final output)

85 of 118

Introducing the Gated Recurrent Unit

(Figure: the complete GRU cell — relevance gate, update gate, hidden state candidate, interpolation, and final output; see the equations below)
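Putting the diagram into equations (a sketch using one common convention; σ is the logistic sigmoid and ⊙ is elementwise multiplication):

Relevance (reset) gate: $r_t = \sigma(W_r x_t + U_r h_{t-1})$

Update gate: $z_t = \sigma(W_z x_t + U_z h_{t-1})$

Hidden state candidate: $\tilde{h}_t = \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1})\big)$

Final output: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

The 1− and + in the diagram are this interpolation between the previous hidden state and the candidate.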

86 of 118

Section 2: Advanced RNNs

  • Long Short Term Memory (LSTM) Networks
  • Gated Recurrent Units (GRUs)
  • Comparing the two RNNs

87 of 118

Long Short Term Memory (LSTM) RNNs

88 of 118

Where have you seen LSTMs?

89 of 118

Adding Cell State in the LSTM

  • Similar to the GRU: learn when to remember and when to forget … but
  • Adds a cell state (its memory). So, compared to basic RNNs, it has a cell state, a hidden state, and multiple gates
  • Three gates instead of two: input, forget, and output gates

90 of 118

Some Intuition

(Figure: gates controlling the cell & hidden units)

91 of 118

The Anatomy of the LSTM

(Figure: LSTM cell — inputs xt and ht-1, cell state ct-1 → ct, hidden state ht, output yt)

  • Forget gate: the input and previous state decide whether to forget or not
  • Think of it as: is the current context relevant to the prior information?
  • If it’s not, then just forget it (i.e., zero out the cell state)

92 of 118

The Anatomy of the LSTM

  • Input gate: identify what to put into memory
  • Think of it as: from the previous state and current input, what should I remember?
  • If it’s important: commit it

(Figure: LSTM cell — the input gate’s contribution is added (+) into the cell state ct)

93 of 118

The Anatomy of the LSTM

(Figure: LSTM cell — the output gate filters the cell state to produce ht and yt)

  • Output gate: what to output?
  • Think of it as: combine what I know (either positively or negatively) with the current output

94 of 118

The Anatomy of the LSTM

(Figure: the complete LSTM cell — inputs xt and ht-1, cell state ct-1 → ct, additive (+) update, hidden state ht, and output yt)

95 of 118

The LSTM Cell: Perspectives

  • Granular Control over Information Flow:
    • Separate gates: LSTMs have three gates (forget, input, output) compared to the GRU's two (update, reset).
    • Dedicated cell state: LSTMs maintain a separate cell state that acts as a kind of memory unit.
  • Superior Handling of Long-Term Dependencies:
    • Preserving gradients: the LSTM architecture helps mitigate vanishing gradients effectively.
  • Higher Capacity:
    • More parameters: LSTMs have roughly 4x the parameters of a vanilla RNN with the same hidden size.

96 of 118

Intuition Behind LSTMs

97 of 118

LSTM: Formulation
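The slide's figure is not reproduced here; a standard formulation (one common convention) is:

Forget gate: $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$

Input gate: $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$

Output gate: $o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$

Cell candidate: $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$

Cell state: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$

Hidden state: $h_t = o_t \odot \tanh(c_t)$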

98 of 118

LSTM Parameter Count
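A back-of-the-envelope count under the formulation above (assuming input dimension $d_x$ and hidden dimension $d_h$): each of the four blocks (forget, input, output, candidate) has a $W$ of size $d_h \times d_x$, a $U$ of size $d_h \times d_h$, and a bias of size $d_h$, so

$$\#\text{params} = 4\big(d_h d_x + d_h^2 + d_h\big)$$

roughly 4× a vanilla RNN cell of the same size. For example, $d_x = d_h = 256$ gives $4(256 \cdot 256 + 256^2 + 256) \approx 525{,}000$ parameters.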

99 of 118

The Range of Values

100 of 118

Section 2: Advanced RNNs

  • Long Short Term Memory (LSTM) Networks
  • Gated Recurrent Units (GRUs)
  • Comparing the two RNNs

101 of 118

The Architectures We Know

102 of 118

A Quick Note on the Order of Teaching

  • Both are widely used; GRUs are generally taught before LSTMs. Why?
  • GRUs: fewer gates & a single state → easier to understand
  • LSTMs: can be built on top of an understanding of GRUs
  • They both work comparably well (often at parity)

(Timeline: LSTM (2000) → GRU (2014))

103 of 118

Considerations and Factors

| Consideration           | RNNs                                                                   | LSTMs                                                                  |
|-------------------------|------------------------------------------------------------------------|------------------------------------------------------------------------|
| Task                    | Speed and efficiency (e.g., real-time sentiment analysis)              | Very long sequences; complex dependencies (e.g., machine translation)  |
| Dataset Size            | Smaller datasets - less prone to overfitting due to simpler structure  | Larger datasets - higher capacity due to complexity                    |
| Computational Resources | Train faster with less memory                                          | Able to memorize further at computational cost                         |
| Developer Preference    | Easier to understand                                                   | Been around longer                                                     |

Note: RNNs are generally much more difficult to train / “get right”

104 of 118

Music Modeling & Ubisoft Dataset

105 of 118

Convergence Speed of Various RNNs

  • Gating units are clearly helpful for convergence and performance
  • Between GRUs and LSTMs, the results are not necessarily conclusive
  • https://arxiv.org/pdf/1412.3555

106 of 118

Module Quiz

107 of 118

Lecture Summary

We learned about:

  • Recurrent Neural Networks: what they are…
  • Backpropagation: The Vanishing Gradient
  • More complex RNNs that remedy it with forget gates

108 of 118

Laboratory - Recurrent Neural Networks

109 of 118

Graveyard

110 of 118

References

111 of 118

112 of 118

113 of 118

114 of 118

115 of 118

Bi-Directional RNNs

116 of 118

117 of 118

Comparing GRU, LSTM

118 of 118

What do sequences do?