1 of 30

CS458 Natural Language Processing

Lecture 11

RNNs

Krishnendu Ghosh

Department of Computer Science & Engineering

Indian Institute of Information Technology Dharwad

2 of 30

Simple Recurrent Networks (RNNs or Elman Nets)

3 of 30

Modeling Time in Neural Networks

Language is inherently temporal

Yet the simple NLP classifiers we've seen (for example, for sentiment analysis) mostly ignore time

  • (Feedforward neural LMs (and the transformers we'll see later) use a "moving window" approach to time.)

Here we introduce a deep learning architecture with a different way of representing time

  • RNNs and their variants like LSTMs

4 of 30

Recurrent Neural Networks (RNNs)

Any network that contains a cycle within its connections.

The value of some unit depends, directly or indirectly, on its own earlier outputs as an input.

5 of 30

Simple Recurrent Nets (Elman nets)

The hidden layer has a recurrence as part of its input

The activation value ht depends not only on xt but also on ht-1!

[Figure: a simple RNN cell with input xt, hidden state ht, and output yt]

6 of 30

Forward inference in simple RNNs

Very similar to the feedforward networks we've seen!
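
As a sketch, the standard forward-inference equations for a simple RNN, written with the weight matrices W (input to hidden), U (previous hidden to current hidden), and V (hidden to output) named on the training slide later in this deck:

  h_t = g(U h_{t-1} + W x_t)
  y_t = f(V h_t)

where g is the hidden-layer activation (e.g., tanh) and f is typically a softmax.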

7 of 30

Simple recurrent neural network illustrated as a feedforward network

8 of 30

Inference has to be incremental

Computing h at time t requires that we have already computed h at the previous time step!
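
A minimal NumPy sketch of this incremental loop (the tanh and softmax choices here are illustrative assumptions, not fixed by the slide):

    import numpy as np

    def rnn_forward(X, W, U, V, h0):
        """Run a simple RNN over the input sequence X, one step at a time."""
        h = h0
        ys = []
        for x_t in X:                               # step t cannot run before step t-1
            h = np.tanh(U @ h + W @ x_t)            # h_t depends on h_{t-1} and x_t
            z = V @ h
            ys.append(np.exp(z) / np.exp(z).sum())  # softmax output y_t
        return ys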

9 of 30

Training in simple RNNs

Just like feedforward training:

  • training set,
  • a loss function,
  • backpropagation

Weights that need to be updated:

  • W, the weights from the input layer to the hidden layer,
  • U, the weights from the previous hidden layer to the current hidden layer,
  • V, the weights from the hidden layer to the output layer.

10 of 30

Training in simple RNNs: unrolling in time

Unlike feedforward networks:

1. To compute the loss for the output at time t, we need the hidden layer from time t − 1.

2. The hidden layer at time t influences both the output at time t and the hidden layer at time t+1 (and hence the output and loss at t+1).

So: to measure the error accruing to ht,

  • we need to know its influence on both the current output and the ones that follow.

11 of 30

Unrolling in time (2)

We unroll the recurrent network into a feedforward computational graph, eliminating the recurrence:

  1. Given an input sequence,
  2. generate an unrolled feedforward network specific to that input,
  3. use the graph to train the weights directly via ordinary backprop (or to do forward inference), as in the sketch below.
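
A minimal PyTorch sketch of steps 1–3: the Python loop below is the unrolled graph, and autograd then runs ordinary backprop through it (the sizes and random data are illustrative placeholders, not from the slides):

    import torch
    import torch.nn.functional as F

    d_in, d, n_out, T = 8, 16, 100, 5               # illustrative sizes
    W = torch.randn(d, d_in, requires_grad=True)    # input -> hidden
    U = torch.randn(d, d, requires_grad=True)       # prev hidden -> hidden
    V = torch.randn(n_out, d, requires_grad=True)   # hidden -> output

    xs = torch.randn(T, d_in)                       # 1. a given input sequence
    targets = torch.randint(0, n_out, (T,))

    h = torch.zeros(d)
    loss = 0.0
    for t in range(T):                              # 2. unroll: one cell copy per step
        h = torch.tanh(U @ h + W @ xs[t])
        logits = V @ h
        loss = loss + F.cross_entropy(logits.unsqueeze(0), targets[t].unsqueeze(0))
    loss.backward()                                 # 3. ordinary backprop through the graph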

12 of 30

RNNs as Language Models

13 of 30

Reminder: Language Modeling

14 of 30

The size of the conditioning context for different LMs

The n-gram LM:

Context size is the n − 1 prior words we condition on.

The feedforward LM:

Context is the window size.

The RNN LM:

No fixed context size; ht-1 represents the entire history

15 of 30

FFN LMs vs RNN LMs

[Figure: a feedforward (FFN) LM and an RNN LM, side by side]

16 of 30

Forward inference in the RNN LM

Given an input X of N tokens represented as one-hot vectors

Use the embedding matrix E to get the embedding for the current token xt

Combine …
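
Putting the steps together, a sketch of the standard RNN LM forward pass at time t, with embedding matrix E and weights W, U, V as used elsewhere in this lecture:

  e_t = E x_t                    (look up the embedding of the current token)
  h_t = g(U h_{t-1} + W e_t)     (combine it with the prior hidden state)
  y_t = softmax(V h_t)           (probability distribution over the vocabulary)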

17 of 30

Shapes

  • e_t (the current token's embedding): d x 1
  • W and U: d x d
  • h_{t-1} and h_t: d x 1
  • V: |V| x d
  • y_t: |V| x 1

18 of 30

Computing the probability that the next word is word k
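
As a sketch, this is just the k-th component of the softmax output at time t:

  P(w_{t+1} = k | w_1, ..., w_t) = y_t[k]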

19 of 30

Training RNN LM

  • Self-supervision
    • take a corpus of text as training material
    • at each time step t, ask the model to predict the next word
  • Why it's called self-supervised: we don't need human labels; the text is its own supervision signal
  • We train the model to
    • minimize the error
    • in predicting the true next word in the training sequence,
    • using cross-entropy as the loss function.

20 of 30

Cross-entropy loss

The difference between:

  • a predicted probability distribution
  • the correct distribution.

CE loss for LMs is simpler!!!

  • the correct distribution yt is a one-hot vector over the vocabulary
    • where the entry for the actual next word is 1, and all the other entries are 0.
  • So the CE loss for LMs is determined only by the probability the model assigns to the correct next word.
  • So at time t, the CE loss is the negative log of that probability:
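
  L_CE(ŷ_t, y_t) = − log ŷ_t[w_{t+1}]

(written here with ŷ_t for the model's predicted distribution at time t, following the standard formulation)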

21 of 30

Teacher Forcing

We always give the model the correct history to predict the next word (rather than feeding the model its possibly buggy guess from the prior time step).

This is called teacher forcing (in training we force the context to be correct based on the gold words)

What teacher forcing looks like:

  • At word position t
  • the model takes as input the correct word wt together with ht−1, computes a probability distribution over possible next words
  • That gives the loss for the next token wt+1
  • Then we move on to the next word: we ignore what the model predicted and instead use the correct word wt+1, along with the prior history encoded in the hidden state, to estimate the probability of token wt+2.
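
A minimal PyTorch sketch of this loop (E, W, U, V and the gold sequence are illustrative stand-ins with the shapes used in this lecture):

    import torch
    import torch.nn.functional as F

    vocab, d = 100, 16                              # illustrative sizes
    E = torch.randn(d, vocab, requires_grad=True)   # embeddings, d x |V|
    W = torch.randn(d, d, requires_grad=True)
    U = torch.randn(d, d, requires_grad=True)
    V = torch.randn(vocab, d, requires_grad=True)

    gold = torch.randint(0, vocab, (6,))            # gold word ids w_1..w_6
    h = torch.zeros(d)
    loss = 0.0
    for t in range(len(gold) - 1):
        e = E[:, gold[t]]              # teacher forcing: embed the GOLD word w_t,
                                       # never the model's own previous guess
        h = torch.tanh(U @ h + W @ e)
        logits = V @ h                 # scores over possible next words
        loss = loss + F.cross_entropy(logits.unsqueeze(0), gold[t + 1].unsqueeze(0))
    loss.backward()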

22 of 30

Weight Tying

The input embedding matrix E and the final layer matrix V are similar

  • The columns of E are the word embeddings for each word in the vocabulary. E is [d x |V|]
  • The final layer matrix V gives a score (logit) for each word in the vocabulary. V is [|V| x d]

Instead of having separate E and V, we just tie them together, using Eᵀ (the transpose of E) instead of V:
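
  y_t = softmax(Eᵀ h_t)

(Eᵀ is [|V| x d], exactly the shape V would have, so the same parameters serve as both the input embeddings and the output layer)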

23 of 30

RNNs for Sequences

24 of 30

RNNs for Sequence Labeling

Assign a label to each element of a sequence

Part-of-speech tagging
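
As a sketch, the output layer is the same as in the LM case, but the softmax ranges over the tagset instead of the vocabulary (V_tag is a hypothetical name for that output matrix):

  y_t = softmax(V_tag h_t)       (distribution over POS tags for word t)
  tag_t = argmax_k y_t[k]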

25 of 30

RNNs for Sequence Classification

Text classification

Instead of taking the last hidden state, we could use some pooling function over all the output states, such as mean pooling
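
A sketch of the mean-pooling variant, with W_c as a hypothetical classification matrix:

  h_mean = (1/n) (h_1 + h_2 + ... + h_n)
  y = softmax(W_c h_mean)        (distribution over the text classes)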

26 of 30

Autoregressive Generation
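
A minimal NumPy sketch of autoregressive generation with the RNN LM above (the parameter names and start/end handling are illustrative assumptions):

    import numpy as np

    def softmax(z):
        z = z - z.max()
        return np.exp(z) / np.exp(z).sum()

    def generate(E, W, U, V, h0, start, end, max_len=50):
        """Sample a word, then feed it back in as the input at the next step."""
        h, w, out = h0, start, []
        for _ in range(max_len):
            e = E[:, w]                              # embedding of the last generated word
            h = np.tanh(U @ h + W @ e)
            y = softmax(V @ h)                       # distribution over the vocabulary
            w = int(np.random.choice(len(y), p=y))   # sample the next word
            if w == end:
                break
            out.append(w)
        return out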

27 of 30

Stacked RNNs

28 of 30

Bidirectional RNNs
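
As a sketch, the standard formulation runs one RNN left-to-right and another right-to-left and concatenates their states at each step:

  h_t = [h_t^fwd ; h_t^bwd]      (forward RNN's state concatenated with the backward RNN's state)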

29 of 30

Bidirectional RNNs for Classification

30 of 30

Thank You