1 of 30

CS458 Natural Language Processing

Lecture 11

RNNs

Krishnendu Ghosh

Department of Computer Science & Engineering

Indian Institute of Information Technology Dharwad

2 of 30

Simple Recurrent Networks (RNNs or Elman Nets)

3 of 30

Modeling Time in Neural Networks

Language is inherently temporal

Yet the simple NLP classifiers we've seen (for example, for sentiment analysis) mostly ignore time

  • (Feedforward neural LMs (and the transformers we'll see later) use a "moving window" approach to time.)

Here we introduce a deep learning architecture with a different way of representing time

  • RNNs and their variants like LSTMs

4 of 30

Recurrent Neural Networks (RNNs)

Any network that contains a cycle within its connections.

The value of some unit depends, directly or indirectly, on its own earlier outputs as an input.

5 of 30

Simple Recurrent Nets (Elman nets)

The hidden layer has a recurrence as part of its input

The activation value ht depends not only on xt but also on ht-1!

[Figure: a simple RNN cell with input xt, hidden state ht, and output yt]

6 of 30

Forward inference in simple RNNs

Very similar to the feedforward networks we've seen!
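
As a sketch, the standard forward-inference equations for a simple RNN, written with the weight matrices W (input to hidden), U (previous hidden to current hidden), and V (hidden to output) named on the training slide later in this deck:

  h_t = g(U h_{t-1} + W x_t)
  y_t = f(V h_t)

where g is the hidden-layer activation (e.g., tanh) and f is typically a softmax.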

7 of 30

Simple recurrent neural network illustrated as a feedforward network

8 of 30

Inference has to be incremental

Computing h at time t requires that we have already computed h at the previous time step!
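
A minimal NumPy sketch of this incremental loop (the tanh and softmax choices here are illustrative assumptions, not fixed by the slide):

    import numpy as np

    def rnn_forward(X, W, U, V, h0):
        """Run a simple RNN over the input sequence X, one step at a time."""
        h = h0
        ys = []
        for x_t in X:                               # step t cannot run before step t-1
            h = np.tanh(U @ h + W @ x_t)            # h_t depends on h_{t-1} and x_t
            z = V @ h
            ys.append(np.exp(z) / np.exp(z).sum())  # softmax output y_t
        return ys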

9 of 30

Training in simple RNNs

Just like feedforward training:

  • training set,
  • a loss function,
  • backpropagation

Weights that need to be updated:

  • W, the weights from the input layer to the hidden layer,
  • U, the weights from the previous hidden layer to the current hidden layer,
  • V, the weights from the hidden layer to the output layer.

10 of 30

Training in simple RNNs: unrolling in time

Unlike feedforward networks:

1. To compute the loss for the output at time t, we need the hidden layer from time t − 1.

2. The hidden layer at time t influences both the output at time t and the hidden layer at time t+1 (and hence the output and loss at t+1).

So: to measure the error accruing to ht,

  • we need to know its influence on both the current output and the ones that follow.

11 of 30

Unrolling in time (2)

We unroll the recurrent network into a feedforward computational graph, eliminating the recurrence:

  1. Given an input sequence,
  2. generate an unrolled feedforward network specific to that input,
  3. use the graph to train the weights directly via ordinary backprop (or to do forward inference), as in the sketch below.
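
A minimal PyTorch sketch of steps 1–3: the Python loop below is the unrolled graph, and autograd then runs ordinary backprop through it (the sizes and random data are illustrative placeholders, not from the slides):

    import torch
    import torch.nn.functional as F

    d_in, d, n_out, T = 8, 16, 100, 5               # illustrative sizes
    W = torch.randn(d, d_in, requires_grad=True)    # input -> hidden
    U = torch.randn(d, d, requires_grad=True)       # prev hidden -> hidden
    V = torch.randn(n_out, d, requires_grad=True)   # hidden -> output

    xs = torch.randn(T, d_in)                       # 1. a given input sequence
    targets = torch.randint(0, n_out, (T,))

    h = torch.zeros(d)
    loss = 0.0
    for t in range(T):                              # 2. unroll: one cell copy per step
        h = torch.tanh(U @ h + W @ xs[t])
        logits = V @ h
        loss = loss + F.cross_entropy(logits.unsqueeze(0), targets[t].unsqueeze(0))
    loss.backward()                                 # 3. ordinary backprop through the graph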

12 of 30

RNNs as Language Models

13 of 30

Reminder: Language Modeling

14 of 30

The size of the conditioning context for different LMs

The n-gram LM:

Context size is the n − 1 prior words we condition on.

The feedforward LM:

Context is the window size.

The RNN LM:

No fixed context size; ht-1 represents the entire history

15 of 30

FFN LMs vs RNN LMs

[Figure: a feedforward (FFN) LM and an RNN LM, side by side]

16 of 30

Forward inference in the RNN LM

Given an input X of N tokens represented as one-hot vectors

Use the embedding matrix E to get the embedding for the current token xt

Combine …
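
Putting the steps together, a sketch of the standard RNN LM forward pass at time t, with embedding matrix E and weights W, U, V as used elsewhere in this lecture:

  e_t = E x_t                    (look up the embedding of the current token)
  h_t = g(U h_{t-1} + W e_t)     (combine it with the prior hidden state)
  y_t = softmax(V h_t)           (probability distribution over the vocabulary)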

17 of 30

Shapes

  • e_t (the current token's embedding): d x 1
  • W and U: d x d
  • h_{t-1} and h_t: d x 1
  • V: |V| x d
  • y_t: |V| x 1

18 of 30

Computing the probability that the next word is word k
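
As a sketch, this is just the k-th component of the softmax output at time t:

  P(w_{t+1} = k | w_1, ..., w_t) = y_t[k]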

19 of 30

Training RNN LM

  • Self-supervision
    • take a corpus of text as training material
    • at each time step t, ask the model to predict the next word
  • Why it's called self-supervised: we don't need human labels; the text is its own supervision signal
  • We train the model to
    • minimize the error
    • in predicting the true next word in the training sequence,
    • using cross-entropy as the loss function.

20 of 30

Cross-entropy loss

The difference between:

  • a predicted probability distribution
  • the correct distribution.

CE loss for LMs is simpler!!!

  • the correct distribution yt is a one-hot vector over the vocabulary
    • where the entry for the actual next word is 1, and all the other entries are 0.
  • So the CE loss for LMs is determined only by the probability the model assigns to the correct next word.
  • So at time t, the CE loss is the negative log of that probability:
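
  L_CE(ŷ_t, y_t) = − log ŷ_t[w_{t+1}]

(written here with ŷ_t for the model's predicted distribution at time t, following the standard formulation)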

21 of 30

Teacher Forcing

We always give the model the correct history to predict the next word (rather than feeding the model its possibly buggy guess from the prior time step).

This is called teacher forcing (in training we force the context to be correct based on the gold words)

What teacher forcing looks like:

  • At word position t
  • the model takes as input the correct word wt together with ht−1, computes a probability distribution over possible next words
  • That gives the loss for the next token wt+1
  • Then we move on to the next word: we ignore what the model predicted and instead use the correct word wt+1, along with the prior history encoded in the hidden state, to estimate the probability of token wt+2.
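
A minimal PyTorch sketch of this loop (E, W, U, V and the gold sequence are illustrative stand-ins with the shapes used in this lecture):

    import torch
    import torch.nn.functional as F

    vocab, d = 100, 16                              # illustrative sizes
    E = torch.randn(d, vocab, requires_grad=True)   # embeddings, d x |V|
    W = torch.randn(d, d, requires_grad=True)
    U = torch.randn(d, d, requires_grad=True)
    V = torch.randn(vocab, d, requires_grad=True)

    gold = torch.randint(0, vocab, (6,))            # gold word ids w_1..w_6
    h = torch.zeros(d)
    loss = 0.0
    for t in range(len(gold) - 1):
        e = E[:, gold[t]]              # teacher forcing: embed the GOLD word w_t,
                                       # never the model's own previous guess
        h = torch.tanh(U @ h + W @ e)
        logits = V @ h                 # scores over possible next words
        loss = loss + F.cross_entropy(logits.unsqueeze(0), gold[t + 1].unsqueeze(0))
    loss.backward()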

22 of 30

Weight Tying

The input embedding matrix E and the final layer matrix V are similar

  • The columns of E are the word embeddings for each word in the vocabulary. E is [d x |V|]
  • The final layer matrix V gives a score (logit) for each word in the vocabulary. V is [|V| x d]

Instead of having separate E and V, we just tie them together, using Eᵀ (the transpose of E) instead of V:
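
  y_t = softmax(Eᵀ h_t)

(Eᵀ is [|V| x d], exactly the shape V would have, so the same parameters serve as both the input embeddings and the output layer)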

23 of 30

RNNs for Sequences

24 of 30

RNNs for Sequence Labeling

Assign a label to each element of a sequence

Part-of-speech tagging
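
As a sketch, the output layer is the same as in the LM case, but the softmax ranges over the tagset instead of the vocabulary (V_tag is a hypothetical name for that output matrix):

  y_t = softmax(V_tag h_t)       (distribution over POS tags for word t)
  tag_t = argmax_k y_t[k]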

25 of 30

RNNs for Sequence Classification

Text classification

Instead of taking the last hidden state, we could use some pooling function over all the output states, such as mean pooling
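
A sketch of the mean-pooling variant, with W_c as a hypothetical classification matrix:

  h_mean = (1/n) (h_1 + h_2 + ... + h_n)
  y = softmax(W_c h_mean)        (distribution over the text classes)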

26 of 30

Autoregressive Generation
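
A minimal NumPy sketch of autoregressive generation with the RNN LM above (the parameter names and start/end handling are illustrative assumptions):

    import numpy as np

    def softmax(z):
        z = z - z.max()
        return np.exp(z) / np.exp(z).sum()

    def generate(E, W, U, V, h0, start, end, max_len=50):
        """Sample a word, then feed it back in as the input at the next step."""
        h, w, out = h0, start, []
        for _ in range(max_len):
            e = E[:, w]                              # embedding of the last generated word
            h = np.tanh(U @ h + W @ e)
            y = softmax(V @ h)                       # distribution over the vocabulary
            w = int(np.random.choice(len(y), p=y))   # sample the next word
            if w == end:
                break
            out.append(w)
        return out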

27 of 30

Stacked RNNs

28 of 30

Bidirectional RNNs
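
As a sketch, the standard formulation runs one RNN left-to-right and another right-to-left and concatenates their states at each step:

  h_t = [h_t^fwd ; h_t^bwd]      (forward RNN's state concatenated with the backward RNN's state)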

29 of 30

Bidirectional RNNs for Classification

30 of 30

Thank You