CS458 Natural Language Processing
Lecture 11
RNNs
Krishnendu Ghosh
Department of Computer Science & Engineering
Indian Institute of Information Technology Dharwad
Simple Recurrent Networks (RNNs or Elman Nets)
Modeling Time in Neural Networks
Language is inherently temporal
Yet the simple NLP classifiers we've seen (for example, for sentiment analysis) mostly ignore time
Here we introduce a deep learning architecture with a different way of representing time
Recurrent Neural Networks (RNNs)
Any network that contains a cycle within its connections.
The value of some unit depends, directly or indirectly, on its own earlier outputs as an input.
Simple Recurrent Nets (Elman nets)
The hidden layer has a recurrence as part of its input
The activation value h_t depends not only on x_t but also on h_{t-1}!
[Figure: a simple RNN with input x_t, hidden state h_t, and output y_t; the hidden layer feeds back into itself]
Forward inference in simple RNNs
Very similar to the feedforward networks we've seen!
Simple recurrent neural network illustrated as a feedforward network
Inference has to be incremental
Computing h at time t requires that we have already computed h at the previous time step!
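A minimal sketch of this incremental forward pass in NumPy (the weight names W, U, V and the tanh/softmax choices are illustrative assumptions, not fixed by the slides):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(X, W, U, V, h0):
    # Forward inference in a simple (Elman) RNN over inputs x_1..x_N.
    h, outputs = h0, []
    for x_t in X:
        # h_t depends on the current input x_t AND the previous hidden state h_{t-1}
        h = np.tanh(W @ x_t + U @ h)
        # the output y_t is computed from the current hidden state
        outputs.append(softmax(V @ h))
    return outputs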
Training in simple RNNs
Just like feedforward training: run the network forward, compute a loss, and use backpropagation to update the weights.
Weights that need to be updated: W (input to hidden), U (hidden to hidden), and V (hidden to output).
Training in simple RNNs: unrolling in time
Unlike feedforward networks:
1. To compute the loss for the output at time t we need the hidden layer from time t − 1.
2. The hidden layer at time t influences both the output at time t and the hidden layer at time t+1 (and hence the output and loss at t+1).
So: to measure the error accruing to h_t, we need to know its influence not just on the current output but also on the ones that follow.
Unrolling in time (2)
We unroll a recurrent network into a feedforward computational graph, eliminating the recurrence.
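Unrolling can be pictured directly in code, continuing the NumPy sketch above: the forward pass stores every hidden state h_1..h_N, turning the recurrence into an ordinary feedforward computational graph through which the per-step losses can be backpropagated (backpropagation through time):

def unrolled_forward(X, targets, W, U, V, h0):
    # Unroll the RNN over the whole sequence, keeping every hidden state
    # so gradients can later flow back through all time steps.
    hs, total_loss = [h0], 0.0
    for x_t, gold in zip(X, targets):
        h = np.tanh(W @ x_t + U @ hs[-1])
        y = softmax(V @ h)
        hs.append(h)
        total_loss += -np.log(y[gold])   # loss at this time step
    return hs, total_loss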
RNNs as Language Models
Reminder: Language Modeling
The size of the conditioning context for different LMs
The n-gram LM:
Context size is the n − 1 prior words we condition on.
The feedforward LM:
Context size is the window size (a fixed number of prior words).
The RNN LM:
No fixed context size; h_{t-1} represents the entire history
FFN LMs vs RNN LMs
[Figure: a feedforward LM sees only a fixed window of prior words, while an RNN LM's hidden state carries the entire preceding history]
Forward inference in the RNN LM
Given an input X of N tokens, each represented as a one-hot vector
Use the embedding matrix E to get the embedding for the current token: e_t = E x_t
Combine e_t with the previous hidden state, then project to the vocabulary and apply softmax: h_t = g(U h_{t-1} + W e_t), y_t = softmax(V h_t)
Shapes:
e_t: d x 1
W: d x d
U: d x d
h_{t-1}: d x 1
h_t: d x 1
V: |V| x d
y_t: |V| x 1
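A sketch of one step of forward inference with exactly these shapes (NumPy again; the dimensions, variable names, and tanh nonlinearity are illustrative assumptions):

import numpy as np

d, vocab = 64, 10_000                     # hidden/embedding size d, vocabulary size |V|
E  = 0.01 * np.random.randn(d, vocab)     # embedding matrix,    d x |V|
W  = 0.01 * np.random.randn(d, d)         # embedding-to-hidden, d x d
U  = 0.01 * np.random.randn(d, d)         # hidden-to-hidden,    d x d
Vo = 0.01 * np.random.randn(vocab, d)     # hidden-to-output,    |V| x d

def lm_step(word_id, h_prev):
    e_t = E[:, word_id]                   # embedding of the current token, d x 1
    h_t = np.tanh(U @ h_prev + W @ e_t)   # new hidden state,               d x 1
    z   = Vo @ h_t                        # logits over the vocabulary,    |V| x 1
    y_t = np.exp(z - z.max()); y_t /= y_t.sum()   # softmax
    return h_t, y_t                       # y_t[k] = P(next word = k | history)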
Computing the probability that the next word is word k: P(w_{t+1} = k | w_1, ..., w_t) = y_t[k], the k-th component of the output distribution
Training RNN LM
Cross-entropy loss
The difference between: the predicted probability distribution y_t and the true (one-hot) distribution over the next word
CE loss for LMs is simpler!
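Because the true distribution puts all its probability mass on the single word that actually comes next, w_{t+1}, the cross-entropy at time t reduces to one term: L_CE = -log y_t[w_{t+1}], the negative log of the probability the model assigned to the correct next word.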
Teacher Forcing
We always give the model the correct history to predict the next word (rather than feeding the model its possibly buggy guess from the prior time step).
This is called teacher forcing (in training we force the context to be correct based on the gold words)
What teacher forcing looks like:
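A minimal teacher-forced training sketch, using PyTorch purely as an illustration (the module choices, sizes, and optimizer below are assumptions, not part of the lecture): the inputs are always the gold words, and the targets are the same words shifted by one position.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNLM(nn.Module):
    def __init__(self, vocab_size, d):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        self.rnn = nn.RNN(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab_size)

    def forward(self, tokens):              # tokens: (batch, seq_len) of word ids
        h, _ = self.rnn(self.emb(tokens))   # hidden states, (batch, seq_len, d)
        return self.out(h)                  # logits over the vocabulary

model = RNNLM(vocab_size=10_000, d=64)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def train_step(batch):                      # batch: (batch, seq_len) of gold word ids
    # Teacher forcing: condition on the gold history, predict the next gold word
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()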
Weight Tying
The input embedding matrix E and the final layer matrix V play similar roles: both relate words to the d-dimensional hidden space.
Instead of having separate E and V, we just tie them together, using E^T in place of V: y_t = softmax(E^T h_t)
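Continuing the illustrative PyTorch sketch above, tying is one line (note that nn.Embedding stores one embedding per row, i.e. E^T in the slide's column-vector notation, so sharing it with the output layer gives logits E^T h_t):

# The embedding weight (|V| x d) and the output Linear weight (|V| x d)
# have the same shape, so they can simply share parameters.
model.out.weight = model.emb.weight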
RNNs for Sequences
RNNs for Sequence Labeling
Assign a label to each element of a sequence
Part-of-speech tagging
RNNs for Sequence Classification
Text classification
Instead of taking just the last hidden state, we could use a pooling function over all the hidden states, such as mean pooling, as sketched below.
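A sketch of mean pooling for classification, in the same NumPy notation (the classifier weight matrix W_c is an illustrative assumption):

def classify_sequence(hidden_states, W_c):
    # hidden_states: list of h_1..h_N from the RNN; W_c: (num_classes x d)
    h_mean = np.mean(hidden_states, axis=0)   # mean pooling over all time steps
    z = W_c @ h_mean
    e = np.exp(z - z.max())
    return e / e.sum()                        # distribution over class labels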
Autoregressive Generation
Stacked RNNs
Bidirectional RNNs
Bidirectional RNNs for Classification
Thank You