CS 6120
Lecture 9 (Topic 8.2): Modeling Sequences
Administrative
Some People I Follow on the Internet
The Remainder of the Course
LLM Project
Recurrent Neural Networks (RNNs)
Named Entity Recognition
What is it good for?
Sequence to Sequence Modeling
Machine Translation with Sequence to Sequence Modeling
How Difficult is Translation?
Modeling Sequences
Section 0: Administrivia
Section 1: Intro to Recurrent Neural Networks
Section 2: Gated RNNs and LSTMs
Section 3: Modernized Tech: Machine Translation
Feed Forward Networks vs RNNs
Recurrence in Neural Networks
Apply the same weights (U, V, W) repeatedly: an input sequence of any length produces a chain of hidden states, with per-step outputs optional.
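As a concrete sketch of this recurrence (illustrative code, not from the lecture; the variable names and the tanh nonlinearity are assumptions):

```python
import numpy as np

def rnn_forward(xs, U, W, V, b, c):
    """Run a simple (Elman) RNN over a sequence of input vectors.

    xs: list of input vectors x_t (each of shape [d_in])
    U:  input-to-hidden weights  [d_h, d_in]
    W:  hidden-to-hidden weights [d_h, d_h]
    V:  hidden-to-output weights [d_out, d_h]
    b, c: hidden and output biases
    """
    h = np.zeros(W.shape[0])            # initial hidden state h_0
    hs, ys = [], []
    for x in xs:                        # the same weights are reused at every step
        h = np.tanh(U @ x + W @ h + b)  # new hidden state
        ys.append(V @ h + c)            # per-step output (optional)
        hs.append(h)
    return hs, ys
```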
Terminology
(next section)
What are we predicting?
RNNs can be used for a variety of tasks, ranging from machine translation to caption generation. There are several ways to structure an RNN model:
One to Many: Microsoft COCO Challenge
Many to One: Classification
OFA: A Mix of Applications
Sequence to Sequence
A Number of Applications
Some Vocabulary
Machine Translation (MT): The goal is to automatically translate text from one language to another.
Statistical Machine Translation (SMT): relies on statistical models to learn how to translate between languages.
Neural Machine Translation (NMT): the use of neural networks and large corpora to translate between languages
Parallel Corpora: Bilingual text data consisting of aligned pairs of samples, one in each language
DeepSpeech: The First RNN at Scale
Hannun, Case, Catanzaro, Diamos, Elsen, Prenger, Satheesh, Sengupta, Coates, Andrew Y. Ng
Mozilla later open-sourced the technology
The Broader Picture of Neural Machine Translation
You may know this guy
But at NEU, do you know this guy?
Neural Translation Models Today
Section 1: Introduction to RNNs
Recall Trigram Models
The Effective Memory of the Trigram vs RNNs
The Anatomy of the Basic RNN
Your Basic RNN
Recurrent Neural Language Models
Recurrent Neural Language Models: Tying Word Weights
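Weight tying, presumably what this slide covers, reuses the input embedding matrix as the output projection, roughly halving the parameter count for large vocabularies (this requires the hidden size to match the embedding size):

\[
x_t = E_{w_t}, \qquad
P(w_{t+1} \mid h_t) = \mathrm{softmax}(E\, h_t), \qquad
E \in \mathbb{R}^{|V| \times d}
\]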
Performance on 5-Gram Prediction
Attributes of RNNs
Sequence Length vs RNN Prediction
Implementation Notes for Forward Propagation
Frameworks like TensorFlow require this type of abstraction
Parallel computation and GPU usage
tensorflow.scan()
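A minimal sketch of expressing the recurrence with tf.scan (the tensorflow.scan mentioned above); sizes and variable names are illustrative, not the lecture's code:

```python
import tensorflow as tf

d_in, d_h, T = 8, 16, 10                      # illustrative sizes
U = tf.random.normal([d_h, d_in])
W = tf.random.normal([d_h, d_h])
b = tf.zeros([d_h])

def step(h_prev, x_t):
    # one RNN step; tf.scan threads h_prev through the whole sequence
    return tf.tanh(tf.linalg.matvec(U, x_t) + tf.linalg.matvec(W, h_prev) + b)

xs = tf.random.normal([T, d_in])              # sequence laid out along axis 0
h0 = tf.zeros([d_h])
hs = tf.scan(step, xs, initializer=h0)        # all T hidden states, shape [T, d_h]
```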
Section 1: Introduction to RNNs
Training RNN Language Models
Backwards Pass: RNNs
Gradient calculations
Wx and Wh are the same at every step
The gradient is proportional to the sum of partial derivative products
Gradient Derivation
Intermediate layer: \( h^{(t)} = \sigma\left(W h^{(t-1)} + U x^{(t)}\right) \)
Output layer: \( \hat{y}^{(t)} = \mathrm{softmax}\left(V h^{(t)} + b\right) \)
Output loss: \( L^{(t)} = -\log \hat{y}^{(t)}_{w_{t+1}} \)
Loss function over time: \( L = \sum_{t=1}^{T} L^{(t)} \)
Gradient at Time t from the Contribution at Each Time Step
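A sketch of the standard backpropagation-through-time decomposition this slide illustrates, in the notation above:

\[
\frac{\partial L^{(t)}}{\partial W}
= \sum_{k=1}^{t}
  \frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}}\,
  \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}}
  \left(\prod_{j=k+1}^{t} \frac{\partial h^{(j)}}{\partial h^{(j-1)}}\right)
  \frac{\partial h^{(k)}}{\partial W}
\]

The repeated product of Jacobians \( \partial h^{(j)} / \partial h^{(j-1)} \) is what shrinks toward zero (vanishing) or blows up (exploding) as \( t - k \) grows.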
Backpropagation through Time
Vanishing Gradients
Solving the Exploding Gradients
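The usual remedy for exploding gradients is gradient clipping; a minimal sketch, assuming the gradients are NumPy arrays and a max-norm threshold chosen by the practitioner (the 5.0 here is illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```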
Truncated BPTT
Section 1: Introduction to RNNs
An Overview of Sequence to Sequence Problems
Sequence to Sequence: Encoder
Sequence to Sequence: Decoder
Sequence to Sequence: Encoder / Decoder
Sequence to Sequence: Encoder / Decoder Network
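A compact, illustrative sketch (not the lecture's code) of the hand-off: the encoder's final hidden state initializes the decoder, which then emits target tokens one at a time (greedy decoding shown; the beam search below improves on it). Function and variable names are assumptions.

```python
import numpy as np

def rnn_step(x, h, U, W, b):
    return np.tanh(U @ x + W @ h + b)

def encode(src_vecs, enc_params):
    """Run the encoder over source embeddings; return its final hidden state."""
    U, W, b = enc_params
    h = np.zeros(W.shape[0])
    for x in src_vecs:
        h = rnn_step(x, h, U, W, b)
    return h                              # the "context" handed to the decoder

def decode_greedy(h, embed, dec_params, V, c, bos_id, eos_id, max_len=50):
    """Greedy decoding: feed back the argmax token until EOS is produced."""
    U, W, b = dec_params
    token, out = bos_id, []
    for _ in range(max_len):
        h = rnn_step(embed[token], h, U, W, b)
        logits = V @ h + c                # scores over the target vocabulary
        token = int(np.argmax(logits))
        if token == eos_id:
            break
        out.append(token)
    return out
```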
Applying the Loss Function for Seq2Seq RNNs
Training Seq2Seq Models
Inference / Decoding Seq2Seq RNNs
Beam Search: Keeping Track of Non-Maximal Solutions
Example: Beam Search
Example: Beam Search - Backtracking
Beam Search: Details
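A hedged sketch of beam search: at each step, keep the beam_size highest-scoring partial hypotheses rather than only the single best; step_log_probs is an assumed scoring function wrapping the decoder.

```python
import numpy as np

def beam_search(step_log_probs, bos_id, eos_id, beam_size=4, max_len=50):
    """step_log_probs(prefix) -> log P(next token | prefix), an array over the vocab.
    Returns the highest-scoring completed hypothesis."""
    beams = [([bos_id], 0.0)]                      # (token prefix, summed log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = step_log_probs(prefix)
            # expand each live hypothesis by its beam_size best next tokens
            for tok in np.argsort(log_p)[-beam_size:]:
                candidates.append((prefix + [int(tok)], score + float(log_p[tok])))
        # keep only the beam_size best expansions; move finished ones aside
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[1])
    return best[0]
```

In practice, completed hypotheses are usually length-normalized (score divided by length) before the final comparison, since raw log-probabilities favor shorter outputs.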
Neural Machine Translation vs Statistical Machine Translation
NMT: The First Successful Application of Deep Learning in NLP
Machine Translation with Sequence to Sequence Modeling
How Difficult is Translation?
Basics of Machine Translation
Section 1: Introduction to RNNs
Recap: Simple RNNs
Bi-Directional RNNs
Bi-Directional Networks in General
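In a bi-directional RNN, one RNN reads the sequence left-to-right and a second reads it right-to-left; the representation at each position concatenates the two hidden states (the standard formulation, stated here for reference):

\[
\overrightarrow{h}_t = f\!\left(\overrightarrow{h}_{t-1}, x_t\right), \qquad
\overleftarrow{h}_t = g\!\left(\overleftarrow{h}_{t+1}, x_t\right), \qquad
h_t = \left[\overrightarrow{h}_t \, ; \, \overleftarrow{h}_t\right]
\]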
Modeling Sequences
Section 0: Administrivia
Section 1: Intro to Recurrent Neural Networks
Section 2: Gated RNNs and LSTMs
Section 3: Modernized Tech: Machine Translation
Recap of Basic RNN Issues
Advanced RNN Variants
Section 2: Advanced RNNs
Gated Recurrent Units (GRUs)
Introducing the Gated Recurrent Unit
[GRU diagram, built up over several slides: the inputs x and h_{t-1} feed a relevance (reset) gate and an update gate; the relevance gate controls how much of h_{t-1} enters the hidden state candidate; the update gate blends h_{t-1} with the candidate (weighted by z and 1-z) to form the final output h_t.]
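The standard GRU equations corresponding to the diagram (notation may differ slightly from the slides, and some presentations swap the roles of z and 1-z in the final blend):

\[
\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{relevance (reset) gate} \\
z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{update gate} \\
\tilde{h}_t &= \tanh\!\left(W_h x_t + U_h (r_t \odot h_{t-1})\right) && \text{hidden state candidate} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{final output}
\end{aligned}
\]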
Section 2: Advanced RNNs
Long Short Term Memory (LSTM) RNNs
Where Have You Seen LSTMs?
Adding Cell State in the LSTM
Some Intuition
[diagram: gates vs. the cell & hidden units]
The Anatomy of the LSTM
[LSTM cell diagram, built up over several slides: inputs x_t, h_{t-1}, and c_{t-1}; elementwise gate products (✕) and an addition (+) update the cell state; outputs h_t, c_t, and y_t.]
The LSTM Cell: Perspectives
Intuition Behind LSTMs
LSTM: Formulation
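The standard LSTM equations (notation may differ from the slides; biases written explicitly):

\[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{cell candidate} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state update} \\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state / output}
\end{aligned}
\]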
LSTM Parameter Count
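Counting parameters from the formulation above: each of the four blocks (three gates plus the cell candidate) has a weight matrix over the concatenation [x_t; h_{t-1}] and a bias, so the total is 4·(d_h·(d_x + d_h) + d_h). A quick check with illustrative sizes:

```python
def lstm_param_count(d_x, d_h):
    # four blocks (forget, input, output gates + cell candidate), each with
    # weights over the concatenated [x_t; h_{t-1}] plus a bias vector
    return 4 * (d_h * (d_x + d_h) + d_h)

print(lstm_param_count(100, 256))  # 100-dim inputs, 256-dim hidden -> 365568
```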
The Range of Values
Section 2: Advanced RNNs
The Architectures We Know
A Quick Note on the Order of Teaching
LSTM (2000) → GRU (2014)
Considerations and Factors
Consideration | RNNs | LSTMs |
Task | Speed and efficiency (e.g., real-time sentiment analysis) | Very long sequences, complex dependencies (e.g., machine translation) |
Dataset Size | Smaller datasets - less prone to overfitting due to simpler structure | Larger datasets - higher capacity due to complexity |
Computational Resources | Train faster with less memory | Retain longer-range context, at added computational cost |
Developer Preference | Easier to understand | Been around longer |
Note: RNNs are generally much more difficult to train / “get right”
Music Modeling & Ubisoft Dataset
Convergence Speed of Various RNNs
Module Quiz
Lecture Summary
We learned about:
Laboratory - Recurrent Neural Networks
Graveyard
References
Bi-Directional RNNs
Comparing GRU, LSTM
What sequences do?