Recurrent Neural Networks for text analysis
From idea to practice
ALEC RADFORD
Follow Along
Slides at: http://goo.gl/WLsUWv
How ML
-0.15, 0.2, 0, 1.5 → Numerical, great!
A, B, C, D → Categorical, great!
The cat sat on the mat. → Uhhh…….
How text is dealt with
(ML perspective)
Text → Features (BoW, TF-IDF, LSA, etc...) → Linear Model (SVM, softmax)
Structure is important!
“The cat sat on the mat.”
As a bag of words: {sat, the, on, mat, cat, the} - the order is lost.
Structure is hard
N-grams are the typical way of preserving some structure (see the sketch below).
Unigrams: sat, the, on, mat, cat
Bigrams: the cat, cat sat, sat on, on the, the mat
Beyond bi- or tri-grams, occurrences become very rare and dimensionality becomes huge (1-10 million+ features).
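As a concrete illustration, here is a minimal sketch of n-gram feature extraction using scikit-learn (my example, not from the original slides):

```python
# Minimal sketch of n-gram feature extraction (illustrative, not from the
# original slides).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat."]
# ngram_range=(1, 2) keeps unigrams and bigrams; widening the range makes
# individual n-grams rarer and the feature space much larger.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
# ['cat' 'cat sat' 'mat' 'on' 'on the' 'sat' 'sat on' 'the' 'the cat' 'the mat']
```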
How text is dealt with
(ML perspective)
Text → Features (BoW, TF-IDF, LSA, etc...) → Linear Model (SVM, softmax)
How text should be dealt with?
Text → RNN → Linear Model (SVM, softmax)
How an RNN works
[Diagram, built up over several slides: the words “the cat sat on the mat” are fed in one timestep at a time]
Activities: vectors of values at each node.
Projections: activities x weights, carrying values from node to node.
Input to hidden: each word’s vector is projected into the hidden layer.
Hidden to hidden: the hidden state is projected forward to the next timestep, so information propagates along the sequence.
The final hidden state is a learned representation of the sequence.
Hidden to output: the hidden state is projected to an output prediction (“cat” in the diagram).
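To make the moving parts concrete, here is a minimal numpy sketch of that forward pass (my illustration; the weight names are made up, not from the slides):

```python
# Minimal sketch of a simple RNN forward pass (illustrative only).
import numpy as np

embed_dim, hidden_dim, vocab_size = 3, 4, 6
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(embed_dim, hidden_dim))    # input to hidden
W_hh = rng.normal(size=(hidden_dim, hidden_dim))   # hidden to hidden
W_hy = rng.normal(size=(hidden_dim, vocab_size))   # hidden to output

def rnn_forward(xs):
    h = np.zeros(hidden_dim)               # initial hidden activities
    for x_t in xs:                         # one word vector per timestep
        # projection (activities x weights), then activation
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h                               # learned representation of the sequence

xs = rng.normal(size=(6, embed_dim))       # stand-ins for “the cat sat on the mat”
logits = rnn_forward(xs) @ W_hy            # hidden to output
```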
From text to RNN input
“The cat sat on the mat.”
Tokenize: the, cat, sat, on, the, mat, .
Assign index: 0, 1, 2, 3, 0, 4, 5 (note both occurrences of “the” get index 0)
Embedding lookup: each index selects a row of a learned matrix, so the string input becomes a sequence of rows (3-dim embeddings shown):
the → 2.5 0.3 -1.2
cat → 0.2 -3.3 0.7
sat → -4.1 1.6 2.8
on → 1.1 5.7 -0.2
the → 2.5 0.3 -1.2
mat → 1.4 0.6 -3.9
. → -3.8 1.5 0.1
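The same pipeline as a minimal numpy sketch (my illustration, reusing the example numbers from the slide):

```python
# Minimal sketch: tokenize -> assign index -> embedding lookup (illustrative).
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, ".": 5}
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]
indices = [vocab[t] for t in tokens]            # [0, 1, 2, 3, 0, 4, 5]

learned_matrix = np.array([[ 2.5,  0.3, -1.2],  # one row per vocab entry
                           [ 0.2, -3.3,  0.7],
                           [-4.1,  1.6,  2.8],
                           [ 1.1,  5.7, -0.2],
                           [ 1.4,  0.6, -3.9],
                           [-3.8,  1.5,  0.1]])

xs = learned_matrix[indices]  # (7, 3): repeated tokens share the same row
```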
You can stack them too
[Diagram: the same “the cat sat on the mat” network with a second recurrent layer - the hidden activities of one layer become the inputs to the next, before the final hidden-to-output projection (“cat”)]
But aren’t RNNs unstable?
Simple RNNs trained with SGD are unstable and difficult to train.
But modern RNNs with various tricks blow up much less often!
Simple Recurrent Unit
[Diagram: x_t and h_(t-1) are combined by element-wise addition and passed through an activation function to give h_t; the same unit repeats with x_(t+1) to give h_(t+1)]
Diagram key: “+” marks element-wise addition, followed by an activation function; lines are routes information can propagate along; the units along them are involved in modifying information flow and values.
Gated Recurrent Unit - GRU
[Diagram, shown for one timestep and then unrolled to the next: x_t and h_(t-1) feed a reset gate r and an update gate z; r modulates how much of h_(t-1) enters the candidate state h̃; z and 1-z blend the candidate with the previous state to give h_t; the same unit repeats with x_(t+1) to give h_(t+1)]
Diagram key:
+ : element-wise addition
⊙ : element-wise multiplication
lines: routes information can propagate along
gates: involved in modifying information flow and values
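In code, one GRU step looks roughly like this (a minimal numpy sketch following the common Cho et al. 2014 formulation; the exact blending convention is my assumption, since the slides only show the diagram):

```python
# Minimal sketch of a single GRU step (one common formulation; illustrative).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

embed_dim, hidden_dim = 3, 4
rng = np.random.default_rng(0)
W, Wr, Wz = (rng.normal(size=(embed_dim, hidden_dim)) for _ in range(3))
U, Ur, Uz = (rng.normal(size=(hidden_dim, hidden_dim)) for _ in range(3))

def gru_step(x_t, h_prev):
    r = sigmoid(x_t @ Wr + h_prev @ Ur)             # reset gate
    z = sigmoid(x_t @ Wz + h_prev @ Uz)             # update gate
    h_tilde = np.tanh(x_t @ W + (r * h_prev) @ U)   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde         # blend old and new state

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, embed_dim)):         # six word vectors
    h = gru_step(x_t, h)
```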
Gating is important
For sentiment analysis of longer sequences of text (a paragraph or so), a simple RNN has difficulty learning at all, while a gated RNN does so easily.
Which One?
There are two widely used types of gated RNNs: the LSTM and the GRU.
GRU is simpler, faster, and optimizes quicker (at least on sentiment).
Because it only has two gates (compared to four), it is approximately 1.5-1.75x faster in a Theano implementation.
If you have a huge dataset and don’t mind waiting, LSTM may be better in the long run due to its greater complexity - especially if you add peephole connections.
Exploding Gradients?
Exploding gradients are a major problem for traditional RNNs trained with SGD - one of the sources of RNNs’ reputation for being hard to train.
In 2012, R. Pascanu and T. Mikolov proposed clipping the norm of the gradient to alleviate this (sketched below).
Modern optimizers don’t seem to have this problem - at least for text classification.
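The clipping trick itself is tiny; a minimal sketch (my illustration, not from the slides):

```python
# Minimal sketch of gradient norm clipping (after Pascanu & Mikolov;
# illustrative only).
import numpy as np

def clip_norm(grad, threshold=5.0):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)  # rescale so the norm equals threshold
    return grad
```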
Better Gating Functions
Interesting paper at a NIPS workshop (Q. Lyu, J. Zhu): make the gates “steeper” so they change more rapidly from “off” to “on”, and the model learns to use them quicker.
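The general idea can be sketched by scaling the sigmoid’s input (my illustration; the paper’s exact formulation may differ):

```python
# Sketch of a “steeper” gate: a slope > 1 makes the sigmoid switch from
# ~0 to ~1 over a narrower input range (illustrative; the paper’s exact
# formulation may differ).
import numpy as np

def steep_sigmoid(a, slope=2.0):
    return 1.0 / (1.0 + np.exp(-slope * a))
```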
Better Initialization
Andrew Saxe last year showed that initializing weight matrices with random orthogonal matrices works better than random Gaussian (or uniform) matrices.
In addition, Richard Socher (and more recently Quoc Le) have used identity initialization schemes, which work great as well.
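Both initializations are only a few lines (my sketch, not from the slides):

```python
# Minimal sketch of orthogonal and identity initialization (illustrative).
import numpy as np

def orthogonal_init(n, rng=None):
    rng = rng or np.random.default_rng(0)
    a = rng.normal(size=(n, n))
    q, _ = np.linalg.qr(a)      # Q from a QR decomposition is orthogonal
    return q

def identity_init(n, scale=1.0):
    return scale * np.eye(n)    # recurrent step starts as “copy the state”

W_hh = orthogonal_init(512)     # e.g. for a 512-dim hidden state
```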
Understanding Optimizers
[Plot: the 2D moons dataset, courtesy of scikit-learn]
Comparing Optimizers
Adam (D. Kingma) combines the early optimization speed of Adagrad (J. Duchi) with the better later convergence of methods like Adadelta (M. Zeiler) and RMSprop (T. Tieleman).
Warning: the generalization performance of Adam seems slightly worse for smaller datasets.
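For reference, the core Adam update (a minimal sketch after Kingma & Ba; not from the slides):

```python
# Minimal sketch of the Adam update rule (after Kingma & Ba; illustrative).
import numpy as np

def adam_update(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # momentum-like first moment
    v = b2 * v + (1 - b2) * grad ** 2      # Adagrad/RMSprop-like second moment
    m_hat = m / (1 - b1 ** t)              # bias corrections keep early
    v_hat = v / (1 - b2 ** t)              # steps from being too small
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```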
It adds up
With all the tricks together, training is up to 10x more efficient than a naive implementation - and much more stable, rarely diverging.
In wall-clock terms it is around 7.5x faster, since the various tricks add a bit of computation time.
Too much? - Overfitting
RNNs can overfit very well, as we will see. As they continue to fit the training dataset, their performance on test data will plateau or even worsen.
Keep track of it using a validation set: save the model at each iteration over the training data and pick the earliest, best validation performance.
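That early-stopping loop, as a sketch (my illustration; `fit_one_epoch`, `evaluate`, `save`, and `load` are hypothetical helpers):

```python
# Minimal sketch of early stopping on validation score (illustrative;
# fit_one_epoch, evaluate, save, and load are hypothetical helpers).
n_epochs = 10
best_score, best_epoch = float("-inf"), 0
for epoch in range(n_epochs):
    model.fit_one_epoch(train_X, train_y)       # one pass over training data
    score = evaluate(model, valid_X, valid_y)   # validation performance
    save(model, "model_%d.pkl" % epoch)         # snapshot every epoch
    if score > best_score:
        best_score, best_epoch = score, epoch
model = load("model_%d.pkl" % best_epoch)       # earliest best checkpoint
```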
The Showdown
Model #1 - the linear model: bigrams, with grid search on min_df for the vectorizer and on the regularization coefficient for the model.
Model #2 - the RNN: 512-dim embedding → 512-dim hidden state → output. Configured using whatever I tried that worked :)
Adam, GRU, steeper sigmoid gates, and ortho/identity init are good defaults. (A Model #1-style baseline is sketched below.)
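A minimal scikit-learn sketch of such a baseline (my illustration; the slides don’t show the actual baseline code):

```python
# Minimal sketch of a bigram + linear-model baseline with grid search
# (illustrative; not the actual code behind the slides).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("vec", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
    ("clf", LinearSVC()),
])
params = {
    "vec__min_df": [1, 5, 10],       # grid search on min_df...
    "clf__C": [0.01, 0.1, 1.0, 10],  # ...and the regularization coefficient
}
search = GridSearchCV(pipeline, params, cv=5)
# search.fit(train_texts, train_labels)
```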
Sentiment & Helpfulness
Effect of Dataset Size
The Thing we don’t talk about
For 1 million paragraph-sized text examples to converge:
the RNN is about 250x slower on CPU than the linear model…
This is why we use GPUs.
Visualizing representations of words learned via sentiment
t-SNE - L.J.P. van der Maaten
[Plot: individual words colored by average sentiment, from negative to positive]
The model learns to separate negative and positive words - not too surprising.
[Plot: zoomed views showing clusters of quantities of time, qualifiers, product nouns, and punctuation]
Much cooler: the model also begins to learn components of language from only binary sentiment labels.
The library - Passage
An example
Sentiment analysis of movie reviews - 25K labeled examples
Built up one slide at a time, the steps are:
RNN imports
preprocessing
load training data
tokenize data
configure model
make and train model
load test data
predict on test data
(Assembled below.)
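Assembled, the example looks roughly like this (a sketch in the style of the Passage README of the time; `load_imdb_train`/`load_imdb_test` are hypothetical loaders, and exact layer/argument names should be treated as assumptions):

```python
# Sketch of the Passage example (README-style API of the time; treat exact
# names/arguments as assumptions, not a verbatim reference).
from passage.preprocessing import Tokenizer                    # RNN imports
from passage.layers import Embedding, GatedRecurrent, Dense
from passage.models import RNN

train_text, train_labels = load_imdb_train()  # hypothetical: load training data
test_text = load_imdb_test()                  # hypothetical: load test data

tokenizer = Tokenizer()                       # preprocessing
train_tokens = tokenizer.fit_transform(train_text)             # tokenize data

layers = [                                    # configure model
    Embedding(size=512, n_features=tokenizer.n_features),
    GatedRecurrent(size=512),
    Dense(size=1, activation='sigmoid'),
]

model = RNN(layers=layers, cost='BinaryCrossEntropy')  # make and train model
model.fit(train_tokens, train_labels)

preds = model.predict(tokenizer.transform(test_text)) # predict on test data
```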
The results
Top 10! - barely :)
Summary
Contact
alec@indico.io
We’re hiring!
Questions?