1 of 118

CS 6120

Lecture 9 (Topic 8.2): Modeling Sequences

2 of 118

Administrative

  • Hope you had a great Spring Break!
  • Today’s papers:
    • GPT Survey
    • Siamese Networks
  • Homeworks are due today; the RNN homework is due by next week.

3 of 118

Some People I Follow on the Internet

4 of 118

The Remainder of the Course

  • March 20, Next Week: Building Blocks of LLMs – the Transformer
  • March 27, Introduction to LLMs: Tuning and Inference
  • April 3, LLM Applications with RAG Systems
  • April 10, Lecturer on Travel, No Class
  • April 17, LLM Agents and LLMs in Practice
  • April 24, Project Presentations - Papers and Projects

5 of 118

LLM Project

  • Paper Idea OR Deploy an LLM RAG System
  • Project Proposals Due March 21
  • Project Website
    • Live Demonstration of Your RAG System:
      • Provide an endpoint
      • Submit the writeup and project repository

6 of 118

Recurrent Neural Networks (RNNs)

7 of 118

Named Entity Recognition

8 of 118

What is it good for?

9 of 118

Sequence to Sequence Modeling

10 of 118

Machine Translation with Sequence to Sequence Modeling

11 of 118

Machine Translation with Sequence to Sequence Modeling

  • One of the “holy grail” problems in artificial intelligence
  • Practical use case: Facilitate communication between people in the world
  • Extremely challenging (especially for low-resource languages)

12 of 118

How Difficult is Translation?

13 of 118

Modeling Sequences

Section 0: Administrivia

Section 1: Intro to Recurrent Neural Networks

Section 2: Gated RNNs and LSTMs

Section 3: Modernized Tech: Machine Translation

14 of 118

Feed Forward Networks vs RNNs

15 of 118

Recurrence in Neural Networks

Apply the same weights (U, V, W) repeatedly.

(Figure: unrolled RNN — an input sequence of any length feeds a chain of hidden states; per-step outputs are optional)
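A minimal sketch of this recurrence in NumPy (the weight names U, V, W follow the slide; the shapes, bias terms, and pre-softmax outputs are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def rnn_forward(xs, U, V, W, b_h, b_y):
    """Vanilla RNN forward pass over an input sequence of any length.

    The same U (input), W (recurrent), and V (output) weights are reused at every step.
    """
    h = np.zeros(W.shape[0])
    hidden_states, outputs = [], []
    for x in xs:
        h = np.tanh(W @ h + U @ x + b_h)    # update the hidden state
        hidden_states.append(h)
        outputs.append(V @ h + b_y)         # per-step output (optional, pre-softmax)
    return hidden_states, outputs
```

For a language model, the per-step outputs would additionally be passed through a softmax over the vocabulary.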

16 of 118

Terminology

(next section)

17 of 118

What are we predicting?

RNNs can be used for a variety of tasks, ranging from machine translation to caption generation. There are several common ways to arrange an RNN's inputs and outputs:

  • One to One: given the scores of a championship, you can predict the winner.
  • One to Many: given an image, you can predict what its caption is going to be.
  • Many to One: given a tweet, you can predict the sentiment of that tweet.
  • Many to Many: given an English sentence, you can translate it into German.
  • Many to Few / Few to Many: given a phrase, predict the full sentence, and vice versa.

18 of 118

One to Many: Microsoft CoCo Challenges

19 of 118

Many to One: Classification

20 of 118

OFA: A Mix of Applications

21 of 118

Sequence to Sequence

A Number of Applications:

  • Summarization
    • Initial generation approaches
  • Language Modeling
    • RNNs were a precursor and a step towards LLMs
  • Machine Translation
    • Today’s lecture / application

22 of 118

Some Vocabulary

Machine Translation (MT): The goal is to automatically translate text from one language to another.

Statistical Machine Translation (SMT): relies on statistical models to learn how to translate between languages.

Neural Machine Translation (NMT): the use of neural networks and large corpora to translate between languages.

Parallel Corpora: Bilingual text data with pairs of samples in each language

23 of 118

DeepSpeech: The First RNN at Scale

Hannun, Case, Catanzaro, Diamos, Elsen, Prenger, Satheesh, Sengupta, Coates, and Ng

Mozilla later open sourced the technology

24 of 118

The Broader Picture of Neural Machine Translation

  • Since 2015, transcription / translation models have largely been RNN-based

  • DeepSpeech (ASR): directly predicting the Chinese characters

You may know this guy

But at NEU, do you know this guy?

25 of 118

Neural Translation Models Today

  • Overwhelmingly Transformer-based
  • GNMT 2016
  • Microsoft + Amazon 2018/19
  • DeepL 2017/18

26 of 118

Section 1: Introduction to RNNs

  • A Brief Introduction to RNNs
  • Deconstructing the Basic RNN
  • Backpropagation through Time
  • Sequence to Sequence: Modeling & Inference
  • Bi-Directionality in RNNs

27 of 118

Recall Trigram Models
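As a reminder (standard n-gram notation, assumed here rather than taken from the slide), a trigram model conditions each word on only the previous two:

$$P(w_1, \dots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-2}, w_{t-1})$$

An RNN, by contrast, conditions on the entire history through its hidden state: $P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid h^{(t-1)})$.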

28 of 118

The Effective Memory of the Trigram vs RNNs

29 of 118

The Effective Memory of the Trigram vs RNNs

30 of 118

The Anatomy of the Basic RNN

31 of 118

Your Basic RNN

32 of 118

Recurrent Neural Language Models

33 of 118

Recurrent Neural Language Models

34 of 118

Recurrent Neural Language Models: Tying Word Weights

35 of 118

Performance on 5-Gram Prediction

36 of 118

Attributes of RNNs

37 of 118

Sequence Length vs RNN Prediction

  • How far into the future should you predict?
  • Character prediction: some consistent behavior

38 of 118

Implementation Notes: for Forward Propagation

Frameworks like TensorFlow require this type of abstraction

Parallel computation and GPU usage

tensorflow.scan()
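A minimal sketch of how the recurrence can be expressed with tf.scan (the sizes, weight names, and time-major layout are illustrative assumptions, not the course's reference implementation):

```python
import tensorflow as tf

hidden, inp, seq_len, batch = 16, 8, 10, 4      # illustrative sizes
W = tf.random.normal([hidden, hidden])           # recurrent weights
U = tf.random.normal([hidden, inp])              # input weights
b = tf.zeros([hidden])

def step(h_prev, x_t):
    # One RNN step: h_t = tanh(W h_{t-1} + U x_t + b)
    return tf.tanh(tf.matmul(h_prev, W, transpose_b=True) +
                   tf.matmul(x_t, U, transpose_b=True) + b)

xs = tf.random.normal([seq_len, batch, inp])     # time-major inputs
h0 = tf.zeros([batch, hidden])
hs = tf.scan(step, xs, initializer=h0)           # hidden states for every step
```

Expressing the loop as a scan lets the framework compile one graph for the whole sequence, which is what enables the parallel computation and GPU usage mentioned above.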

39 of 118

Section 1: Introduction to RNNs

  • A Brief Introduction to RNNs
  • Deconstructing the Basic RNN
  • Backpropagation through Time
  • Sequence to Sequence: Modeling & Inference
  • Bi-Directionality in RNNs

40 of 118

Training RNN Language Models

41 of 118

Training RNN Language Models

42 of 118

Backwards Pass: RNNs

43 of 118

Gradient calculations

Wx and Wh are the same at every step.

The gradient is proportional to the sum of partial-derivative products across time steps.

44 of 118

Gradient Derivation

Intermediate layer: $h^{(t)} = \tanh\big(W h^{(t-1)} + U x^{(t)} + b_1\big)$

Output layer: $\hat{y}^{(t)} = \mathrm{softmax}\big(V h^{(t)} + b_2\big)$

Output loss: $J^{(t)}(\theta) = -\sum_{w} y^{(t)}_{w} \log \hat{y}^{(t)}_{w}$

Loss function over time: $J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)$

(Figure: unrolled network with input x, input weights U, recurrent weights W, hidden state h(t), and output parameters V, b)

45 of 118

Gradient at Time t from the Contribution at Each Time Step

  • The gradient at time t sums the contributions from every earlier time step (see the sketch below)
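Written out (a sketch using the loss and weights defined above), the contribution of each earlier step k involves a product of Jacobians:

$$\frac{\partial J^{(t)}}{\partial W} = \sum_{k=1}^{t} \frac{\partial J^{(t)}}{\partial h^{(t)}} \left( \prod_{j=k+1}^{t} \frac{\partial h^{(j)}}{\partial h^{(j-1)}} \right) \frac{\partial h^{(k)}}{\partial W}$$

The repeated Jacobian product is exactly what shrinks (vanishing gradients) or blows up (exploding gradients) as t − k grows.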

46 of 118

Backpropagation through Time

47 of 118

Vanishing Gradients

48 of 118

Vanishing Gradients

49 of 118

Solving the Exploding Gradients
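A common remedy is gradient clipping by global norm; a minimal NumPy sketch (the threshold value is an illustrative assumption):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        grads = [g * scale for g in grads]
    return grads
```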

50 of 118

Truncated BPTT

51 of 118

Section 1: Introduction to RNNs

  • A Brief Introduction to RNNs
  • Deconstructing the Basic RNN
  • Backpropagation through Time
  • Sequence to Sequence: Modeling & Inference
  • Bi-Directionality in RNNs

52 of 118

An Overview of Sequence to Sequence Problems

53 of 118

Sequence to Sequence: Encoder

54 of 118

Sequence to Sequence: Decoder

55 of 118

Sequence to Sequence: Decoder

56 of 118

Sequence to Sequence: Encoder / Decoder

57 of 118

Sequence to Sequence: Encoder / Decoder

58 of 118

Sequence to Sequence: Encoder / Decoder Network

59 of 118

Applying the Loss Function for Seq2Seq RNNs
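In the standard setup (a sketch; the notation is assumed rather than taken from the slide figure), the decoder is trained with a per-token cross-entropy loss conditioned on the source sentence x and the gold target prefix:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P\big(y_t \mid y_1, \dots, y_{t-1}, x\big)$$

Feeding the gold prefix during training, rather than the model's own predictions, is commonly called teacher forcing.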

60 of 118

Training Seq2Seq Models

61 of 118

Inference / Decoding Seq2Seq RNNs

62 of 118

Beam Search: Keeping Track of Non-Maximal Solutions

63 of 118

Example: Beam Search

64 of 118

Example: Beam Search - Backtracking

65 of 118

Beam Search: Details
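A minimal beam search sketch (the step_fn interface, scoring, and length normalization here are illustrative assumptions, not a specific library API):

```python
def beam_search(step_fn, bos, eos, beam_size=4, max_len=20):
    """step_fn(prefix) -> list of (token, log_prob) continuations (hypothetical interface)."""
    beams = [([bos], 0.0)]                 # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix):
                candidates.append((prefix + [token], score + logp))
        # Keep only the beam_size highest-scoring partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score / len(prefix)))   # length-normalized
            else:
                beams.append((prefix, score))
        if not beams:
            break
    pool = finished or [(p, s / len(p)) for p, s in beams]
    return max(pool, key=lambda c: c[1])
```

Unlike greedy decoding, this keeps beam_size non-maximal partial hypotheses alive, and backtracking falls out of storing full prefixes rather than single tokens.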

66 of 118

Neural Machine Translation vs Statistical Machine Translation

67 of 118

NMT: The First Successful Application of Deep Learning in NLP

68 of 118

Machine Translation with Sequence to Sequence Modeling

  • One of the “holy grail” problems in artificial intelligence
  • Practical use case: Facilitate communication between people in the world
  • Extremely challenging (especially for low-resource languages)

69 of 118

How Difficult is Translation?

70 of 118

Basics of Machine Translation

71 of 118

Section 1: Introduction to RNNs

  • A Brief Introduction to RNNs
  • Deconstructing the Basic RNN
  • Backpropagation through Time
  • Sequence to Sequence: Modeling & Inference
  • Bi-Directionality in RNNs

72 of 118

Recap: Simple RNNs

73 of 118

Bi-Directional RNNs

74 of 118

Bi-Directional Networks in General
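A sketch under standard notation (two independent parameter sets, one per direction): run one RNN left-to-right, another right-to-left, and concatenate the two hidden states at each position:

$$\overrightarrow{h}_t = f\big(\overrightarrow{W}\,\overrightarrow{h}_{t-1} + \overrightarrow{U} x_t\big), \qquad \overleftarrow{h}_t = f\big(\overleftarrow{W}\,\overleftarrow{h}_{t+1} + \overleftarrow{U} x_t\big), \qquad h_t = \big[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\big]$$

Because the backward pass needs the entire input, this fits encoders and taggers but not left-to-right language modeling.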

75 of 118

Modeling Sequences

Section 0: Administrivia

Section 1: Intro to Recurrent Neural Networks

Section 2: Gated RNNs and LSTMs

Section 3: Modernized Tech: Machine Translation

76 of 118

Recap of Basic RNN Issues

77 of 118

Advanced RNN Variants

78 of 118

Section 2: Advanced RNNs

  • Long Short Term Memory (LSTM) Networks
  • Gated Recurrent Units (GRUs)
  • Comparing the two RNNs

79 of 118

Gated Recurrent Units (GRUs)

80 of 118

Gated Recurrent Units (GRUs)

  • GRUs = vanilla RNNs + gates
  • Keep / update relevant information in hidden states
  • Update and inference based on current inputs and prior time-step outputs

81 of 118

Introducing the Gated Recurrent Unit

(Figure: GRU cell, step 1 — the relevance gate and update gate are computed from the input x and the previous hidden state ht-1)

82 of 118

Introducing the Gated Recurrent Unit

(Figure: GRU cell, step 2 — a hidden state candidate h′ is formed from x and the relevance-gated ht-1)

83 of 118

Introducing the Gated Recurrent Unit

(Figure: GRU cell, step 3 — the update gate interpolates (1− and +) between ht-1 and the candidate to produce ht)

84 of 118

Introducing the Gated Recurrent Unit

(Figure: GRU cell, step 4 — ht is passed on as the final output)

85 of 118

Introducing the Gated Recurrent Unit

(Figure: the complete GRU cell — relevance gate, update gate, hidden state candidate, interpolation, and final output; see the equations below)
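Putting the diagram into equations (a sketch using one common convention; σ is the logistic sigmoid and ⊙ is elementwise multiplication):

Relevance (reset) gate: $r_t = \sigma(W_r x_t + U_r h_{t-1})$

Update gate: $z_t = \sigma(W_z x_t + U_z h_{t-1})$

Hidden state candidate: $\tilde{h}_t = \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1})\big)$

Final output: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

The 1− and + in the diagram are this interpolation between the previous hidden state and the candidate.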

86 of 118

Section 2: Advanced RNNs

  • Long Short Term Memory (LSTM) Networks
  • Gated Recurrent Units (GRUs)
  • Comparing the two RNNs

87 of 118

Long Short Term Memory (LSTM) RNNs

88 of 118

Where have you seen LSTMs?

89 of 118

Adding Cell State in the LSTM

  • Similar to the GRU: learn when to remember and when to forget … but
  • Adds a cell state (its memory). So, compared to basic RNNs, it has a cell state, a hidden state, and multiple gates
  • Three gates instead of two: input, forget, and output gates

90 of 118

Some Intuition

(Figure: gates controlling the cell & hidden units)

91 of 118

The Anatomy of the LSTM

(Figure: LSTM cell — inputs xt and ht-1, cell state ct-1 → ct, hidden state ht, output yt)

  • Forget gate: the input and previous state decide whether to forget or not
  • Think of it as: is the current context relevant to the prior information?
  • If it’s not, then just forget it (i.e., zero out the cell state)

92 of 118

The Anatomy of the LSTM

  • Input gate: identify what to put into memory
  • Think of it as: from the previous state and current input, what should I remember?
  • If it’s important: commit it

(Figure: LSTM cell — the input gate’s contribution is added (+) into the cell state ct)

93 of 118

The Anatomy of the LSTM

(Figure: LSTM cell — the output gate filters the cell state to produce ht and yt)

  • Output gate: what to output?
  • Think of it as: combine what I know (either positively or negatively) with the current output

94 of 118

The Anatomy of the LSTM

(Figure: the complete LSTM cell — inputs xt and ht-1, cell state ct-1 → ct, additive (+) update, hidden state ht, and output yt)

95 of 118

The LSTM Cell: Perspectives

  • Granular Control over Information Flow:
    • Separate gates: LSTMs have three gates (forget, input, output) compared to the GRU's two (update, reset).
    • Dedicated cell state: LSTMs maintain a separate cell state that acts as a kind of memory unit.
  • Superior Handling of Long-Term Dependencies:
    • Preserving gradients: the LSTM architecture helps mitigate vanishing gradients effectively.
  • Higher Capacity:
    • More parameters: LSTMs have roughly 4x the parameters of a vanilla RNN with the same hidden size.

96 of 118

Intuition Behind LSTMs

97 of 118

LSTM: Formulation
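The slide's figure is not reproduced here; a standard formulation (one common convention) is:

Forget gate: $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$

Input gate: $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$

Output gate: $o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$

Cell candidate: $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$

Cell state: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$

Hidden state: $h_t = o_t \odot \tanh(c_t)$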

98 of 118

LSTM Parameter Count
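A back-of-the-envelope count under the formulation above (assuming input dimension $d_x$ and hidden dimension $d_h$): each of the four blocks (forget, input, output, candidate) has a $W$ of size $d_h \times d_x$, a $U$ of size $d_h \times d_h$, and a bias of size $d_h$, so

$$\#\text{params} = 4\big(d_h d_x + d_h^2 + d_h\big)$$

roughly 4× a vanilla RNN cell of the same size. For example, $d_x = d_h = 256$ gives $4(256 \cdot 256 + 256^2 + 256) \approx 525{,}000$ parameters.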

99 of 118

The Range of Values

100 of 118

Section 2: Advanced RNNs

  • Long Short Term Memory (LSTM) Networks
  • Gated Recurrent Units (GRUs)
  • Comparing the two RNNs

101 of 118

The Architectures We Know

102 of 118

A Quick Note on the Order of Teaching

  • Both are widely used; GRUs are generally taught before LSTMs. Why?
  • GRUs: fewer gates & a single state → easier to understand
  • LSTMs: can be built on top of an understanding of GRUs
  • They both work comparably well (often at parity)

(Timeline: LSTM (2000) → GRU (2014))

103 of 118

Considerations and Factors

| Consideration           | RNNs                                                                   | LSTMs                                                                  |
|-------------------------|------------------------------------------------------------------------|------------------------------------------------------------------------|
| Task                    | Speed and efficiency (e.g., real-time sentiment analysis)              | Very long sequences; complex dependencies (e.g., machine translation)  |
| Dataset Size            | Smaller datasets - less prone to overfitting due to simpler structure  | Larger datasets - higher capacity due to complexity                    |
| Computational Resources | Train faster with less memory                                          | Able to memorize further at computational cost                         |
| Developer Preference    | Easier to understand                                                   | Been around longer                                                     |

Note: RNNs are generally much more difficult to train / “get right”

104 of 118

Music Modeling & Ubisoft Dataset

105 of 118

Convergence Speed of Various RNNs

  • Gating units are clearly helpful for convergence and performance
  • Between GRUs and LSTMs, the results are not necessarily conclusive
  • https://arxiv.org/pdf/1412.3555

106 of 118

Module Quiz

107 of 118

Lecture Summary

We learned about:

  • Recurrent Neural Networks: what they are…
  • Backpropagation: The Vanishing Gradient
  • More complex RNNs that remedy it with forget gates

108 of 118

Laboratory - Recurrent Neural Networks

109 of 118

Graveyard

110 of 118

References

111 of 118

112 of 118

113 of 118

114 of 118

115 of 118

Bi-Directional RNNs

116 of 118

117 of 118

Comparing GRU, LSTM

118 of 118

What do sequences do?