1 of 55

Recurrent Neural Networks for text analysis

From idea to practice

ALEC RADFORD

2 of 55

Follow Along

Slides at: http://goo.gl/WLsUWv

3 of 55

How ML

-0.15, 0.2, 0, 1.5 → Numerical, great!

A, B, C, D → Categorical, great!

The cat sat on the mat. → Uhhh…….

4 of 55

How text is dealt with

(ML perspective)

Text

Features

(bow, TFIDF, LSA, etc...)

Linear Model

(SVM, softmax)

5 of 55

Structure is important!

The cat sat on the mat.

[Diagram: bag-of-words view of the sentence, the tokens "sat", "the", "on", "mat", "cat", "the" with their order discarded]

  • For certain tasks, structure is essential:
    • Humor
    • Sarcasm

  • For certain tasks, n-grams can get you a long way:
    • Sentiment Analysis
    • Topic detection

  • Specific words can be strong indicators:
    • useless, fantastic (sentiment)
    • hoop, green tea, NASDAQ (topic)

6 of 55

Structure is hard

N-grams are the typical way of preserving some structure.

[Diagram: unigrams (sat, the, on, mat, cat) alongside bigrams (the cat, cat sat, sat on, on the, the mat)]

Beyond bi- or tri-grams, occurrences become very rare and dimensionality becomes huge (1-10 million+ features), as the sketch below illustrates.
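A rough illustration of that blow-up, using scikit-learn's CountVectorizer (a sketch, not part of the original slides):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Unigram vocabulary stays small...
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
print(len(unigrams.vocabulary_))

# ...but adding bigrams and trigrams multiplies the feature count quickly;
# on a real corpus this reaches millions of mostly-rare features.
uni_to_tri = CountVectorizer(ngram_range=(1, 3)).fit(docs)
print(len(uni_to_tri.vocabulary_))
```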

7 of 55

Structure is hard

8 of 55

How text is dealt with

(ML perspective)

Text

Features

(bow, TFIDF, LSA, etc...)

Linear Model

(SVM, softmax)

9 of 55

How text should be dealt with?

Text

RNN

Linear Model

(SVM, softmax)

10 of 55

How an RNN works

[Diagram: the input tokens "the", "cat", "sat", "on", "the", "mat"]

11 of 55

How an RNN works

[Diagram: the same input tokens, now with input-to-hidden connections]

12 of 55

How an RNN works

[Diagram: input-to-hidden connections plus hidden-to-hidden connections carrying state from one time step to the next]

13 of 55

How an RNN works

[Diagram: the same network unrolled across the full sequence, with input-to-hidden and hidden-to-hidden connections]

14 of 55

How an RNN works

[Diagram: arrows are projections (activities x weights), nodes are activities (vectors of values); input-to-hidden and hidden-to-hidden connections]

15 of 55

How an RNN works

[Diagram: projections (activities x weights) and activities (vectors of values); the final hidden activity is a learned representation of the sequence; input-to-hidden and hidden-to-hidden connections]

16 of 55

How an RNN works

[Diagram: a hidden-to-output projection is added, producing a prediction ("cat"); projections (activities x weights), activities (vectors of values), input-to-hidden and hidden-to-hidden connections]
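A minimal NumPy sketch of that forward pass (the weight names mirror the diagram's connection labels; sizes and the output layer are illustrative):

```python
import numpy as np

def rnn_forward(inputs, W_ih, W_hh, W_ho):
    """Run a simple RNN over a sequence of input vectors and return
    the output projection of the final hidden activity."""
    h = np.zeros(W_hh.shape[0])             # initial hidden activity
    for x in inputs:                        # one step per token
        # projections (activities x weights) of the input and the previous
        # hidden state, combined and squashed into the new hidden activity
        h = np.tanh(x @ W_ih + h @ W_hh)
    return h @ W_ho                         # hidden-to-output projection

# Toy usage: 6 tokens as 3-dim embeddings, 8-dim hidden state, 5-way output.
inputs = np.random.randn(6, 3)
W_ih = np.random.randn(3, 8) * 0.1
W_hh = np.random.randn(8, 8) * 0.1
W_ho = np.random.randn(8, 5) * 0.1
print(rnn_forward(inputs, W_ih, W_hh, W_ho))
```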

17 of 55

From text to RNN input

String input: "The cat sat on the mat."

Tokenize:      the  cat  sat  on  the  mat  .
Assign index:    0    1    2    3    0    4  5

Embedding lookup into a learned matrix (one row per index):

  0 →  2.5  0.3 -1.2   (the)
  1 →  0.2 -3.3  0.7   (cat)
  2 → -4.1  1.6  2.8   (sat)
  3 →  1.1  5.7 -0.2   (on)
  4 →  1.4  0.6 -3.9   (mat)
  5 → -3.8  1.5  0.1   (.)

The token sequence becomes the corresponding sequence of rows:

  2.5 0.3 -1.2 | 0.2 -3.3 0.7 | -4.1 1.6 2.8 | 1.1 5.7 -0.2 | 2.5 0.3 -1.2 | 1.4 0.6 -3.9 | -3.8 1.5 0.1
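The same pipeline as a minimal Python/NumPy sketch (the tokenizer, vocabulary, and 3-dim random embedding are illustrative):

```python
import numpy as np

sentence = "The cat sat on the mat."

# Tokenize (a naive whitespace split with "." peeled off, for illustration).
tokens = sentence.lower().replace(".", " .").split()

# Assign an index to each unique token as it is first seen.
vocab = {}
indices = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
print(tokens)    # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(indices)   # [0, 1, 2, 3, 0, 4, 5]

# Embedding lookup: each index selects one row of a learned matrix.
# The matrix is random here; in the model it is learned by backprop.
embedding = np.random.randn(len(vocab), 3)
inputs = embedding[indices]          # shape (7, 3): one vector per token
print(inputs.shape)
```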

18 of 55

You can stack them too

[Diagram: stacked recurrent layers over the sequence "the cat sat on the mat", with input-to-hidden, hidden-to-hidden, and hidden-to-output connections producing a prediction ("cat")]

19 of 55

But aren’t RNNs unstable?

Simple RNNs trained with SGD are unstable and difficult to train.

But modern RNNs with various tricks blow up much less often!

  • Gating Units
  • Gradient Clipping
  • Steeper gates
  • Better initialization
  • Better optimizers
  • Bigger datasets

20 of 55

Simple Recurrent Unit

[Diagram: a simple recurrent unit unrolled over two time steps: at each step the projected input x_t and the previous hidden state h_t-1 are combined by element-wise addition and passed through an activation function to produce h_t]

Legend: element-wise addition; activation function; routes along which information can propagate; elements involved in modifying information flow and values.
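In equation form, one standard way to write the step the diagram describes (input-to-hidden weights W_ih, hidden-to-hidden weights W_hh, activation f, bias omitted):

h_t = f(W_ih x_t + W_hh h_(t-1))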

21 of 55

Gated Recurrent Unit - GRU

[Diagram: a Gated Recurrent Unit: a reset gate r and an update gate z are computed from x_t and h_t-1; r controls how much of the previous state feeds the candidate state ~h, and z mixes the previous state and the candidate (weighted by z and 1-z) to produce h_t]

Legend: element-wise addition; element-wise multiplication; routes along which information can propagate; elements involved in modifying information flow and values.
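A minimal NumPy sketch of a single GRU step, following the Cho et al. formulation (weight shapes and names are illustrative; biases are omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wx_r, Wh_r, Wx_z, Wh_z, Wx_h, Wh_h):
    """One GRU step: gates are computed from the input and previous state,
    then the new state is a gated mix of the old state and a candidate."""
    r = sigmoid(x_t @ Wx_r + h_prev @ Wh_r)              # reset gate
    z = sigmoid(x_t @ Wx_z + h_prev @ Wh_z)              # update gate
    h_tilde = np.tanh(x_t @ Wx_h + (r * h_prev) @ Wh_h)  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde              # mix old and candidate

# Toy usage: 3-dim inputs (e.g. embeddings), 4-dim hidden state.
n_in, n_hid = 3, 4
params = [np.random.randn(a, b) * 0.1
          for a, b in [(n_in, n_hid), (n_hid, n_hid)] * 3]
h = np.zeros(n_hid)
for x in np.random.randn(6, n_in):   # a 6-token sequence of embeddings
    h = gru_step(x, h, *params)
print(h)
```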

22 of 55

Gated Recurrent Unit - GRU

[Diagram: the same GRU unrolled over two consecutive time steps, producing h_t from x_t and then h_t+1 from x_t+1]

23 of 55

Gating is important

For sentiment analysis of longer sequences of text (a paragraph or so), a simple RNN has difficulty learning at all, while a gated RNN does so easily.

24 of 55

Which One?

There are two main types of gated RNNs:

  • Gated Recurrent Units (GRU), introduced recently by K. Cho et al. and used for machine translation and speech recognition tasks.

  • Long Short-Term Memory (LSTM) by S. Hochreiter and J. Schmidhuber, which has been around since 1997 and has been used far more. Various modifications to it exist.

25 of 55

Which One?

The GRU is simpler, faster, and optimizes quicker (at least on sentiment).

Because it has only two gates (compared to the LSTM's four), a Theano implementation is approximately 1.5-1.75x faster.

If you have a huge dataset and don't mind waiting, the LSTM may be better in the long run due to its greater complexity, especially if you add peephole connections.

26 of 55

Exploding Gradients?

Exploding gradients are a major problem for traditional RNNs trained with SGD, and one source of RNNs' reputation for being hard to train.

In 2012, R. Pascanu and T. Mikolov proposed clipping the norm of the gradient to alleviate this.

Modern optimizers don't seem to have this problem, at least for text classification.
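A minimal sketch of clipping the global gradient norm (the threshold and the toy gradients are illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Toy usage: an "exploding" gradient gets rescaled down to norm 5.
grads = [np.random.randn(4, 4) * 100, np.random.randn(4) * 100]
clipped = clip_by_global_norm(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # ~5.0
```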

27 of 55

Better Gating Functions

Interesting paper at a NIPS workshop (Q. Lyu, J. Zhu): make the gates "steeper" so they change more rapidly from "off" to "on", so the model learns to use them quicker.
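One simple way to make a gate "steeper" (a sketch; the slope factor here is an assumption, not the paper's exact formulation) is to scale the pre-activation before the sigmoid:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def steep_sigmoid(x, slope=3.0):
    # A larger slope makes the gate switch from ~0 to ~1 over a narrower
    # input range, so it behaves more like a hard on/off switch.
    return sigmoid(slope * x)

x = np.linspace(-3, 3, 7)
print(sigmoid(x))
print(steep_sigmoid(x))
```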

28 of 55

Better Initialization

Andrew Saxe showed last year that initializing weight matrices with random orthogonal matrices works better than random Gaussian (or uniform) matrices.

In addition, Richard Socher (and more recently Quoc Le) have used identity initialization schemes, which work great as well.
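A minimal sketch of both schemes (square matrices assumed for simplicity):

```python
import numpy as np

def orthogonal_init(n):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(np.random.randn(n, n))
    return q

def identity_init(n, scale=1.0):
    """Identity (or scaled identity) initialization for recurrent weights."""
    return scale * np.eye(n)

W_hh = orthogonal_init(512)
print(np.allclose(W_hh @ W_hh.T, np.eye(512)))   # True: rows are orthonormal
```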

29 of 55

Understanding Optimizers

2D moons dataset

courtesy of scikit-learn

30 of 55

Comparing Optimizers

Adam (D. Kingma) combines the early optimization speed of Adagrad (J. Duchi) with the better later convergence of methods like Adadelta (M. Zeiler) and RMSprop (T. Tieleman).

Warning: the generalization performance of Adam seems slightly worse on smaller datasets.
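For reference, a minimal sketch of the Adam update rule (default hyperparameters from the paper; the toy objective is just for illustration):

```python
import numpy as np

def adam(grad_fn, w, steps=1000, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: per-parameter step sizes from running moment estimates."""
    m = np.zeros_like(w)   # first moment (mean of gradients)
    v = np.zeros_like(w)   # second moment (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)      # bias correction
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy usage: minimize ||w||^2, whose gradient is 2w.
print(adam(lambda w: 2 * w, np.array([1.0, -2.0])))
```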

31 of 55

It adds up

Training is up to 10x more efficient once you add all the tricks together, compared to a naive implementation, and it is much more stable: it rarely diverges.

In wall-clock terms it is around 7.5x faster, since the various tricks add a bit of computation time.

32 of 55

Too much? - Overfitting

RNNs can overfit very well, as we will see. As they continue to fit the training dataset, their performance on test data will plateau or even worsen.

Keep track of this with a validation set: save the model after each pass over the training data and pick the one with the earliest, best validation performance.
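A minimal sketch of that recipe (train_fn and validate_fn are placeholders for your own training and validation passes):

```python
import copy

def train_with_early_stopping(model, train_fn, validate_fn, n_epochs=20):
    """Train for n_epochs, checkpoint every epoch, and return the checkpoint
    with the best (and, on ties, earliest) validation score."""
    best_score, best_model = float("-inf"), copy.deepcopy(model)
    for epoch in range(n_epochs):
        train_fn(model)                 # one pass over the training data
        score = validate_fn(model)      # e.g. validation accuracy
        if score > best_score:          # strict '>' keeps the earliest best
            best_score, best_model = score, copy.deepcopy(model)
    return best_model, best_score
```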

33 of 55

The Showdown

Model #1: linear model on bigrams, using grid search on min_df for the vectorizer and the regularization coefficient for the model.

Model #2: RNN with a 512-dim embedding feeding a 512-dim hidden state and an output layer, using whatever I tried that worked :)

Adam, GRU, steeper sigmoid gates, and ortho/identity init are good defaults.
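For concreteness, a sketch of what Model #1 could look like in scikit-learn (the exact classifier and grid values are assumptions, not necessarily the ones used in the talk):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Linear baseline: bigram features plus a linear classifier, grid searching
# the vectorizer's min_df and the model's regularization strength C.
pipeline = Pipeline([
    ("vec", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipeline, {
    "vec__min_df": [1, 3, 10],
    "clf__C": [0.1, 1.0, 10.0],
}, cv=3)
# grid.fit(train_texts, train_labels)   # train_texts / train_labels assumed
```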

34 of 55

Sentiment & Helpfulness

35 of 55

Effect of Dataset Size

  • RNNs have poor generalization properties on small datasets.
    • With 1K labeled examples, 25-50% worse than a linear model…

  • RNNs have better generalization properties on large datasets.
    • With 1M labeled examples, 0-30% better than a linear model.

  • The crossover happens somewhere between 10K and 1M examples.
    • It depends on the dataset.

36 of 55

The Thing we don’t talk about

For 1 million paragraph-sized text examples to converge:

  • Linear model takes 30 minutes on a single CPU core.
  • RNN takes 90 minutes on a Titan X.
  • RNN takes five days on a single CPU core.

RNN is about 250x slower on CPU than linear model…

This is why we use GPUs

37 of 55

Visualizing representations of words learned via sentiment

t-SNE (L.J.P. van der Maaten)

Individual words colored by average sentiment

38 of 55

[t-SNE plot: words separate into negative and positive regions]

The model learns to separate negative and positive words; not too surprising.

39 of 55

[t-SNE plot: clusters of quantities of time, qualifiers, product nouns, and punctuation]

Much cooler: the model also begins to learn components of language from only binary sentiment labels.

40 of 55

The library - Passage

  • Tiny RNN library built on top of Theano
  • https://github.com/IndicoDataSolutions/Passage
  • Still alpha - we’re working on it!
  • Supports simple, LSTM, and GRU recurrent layers
  • Supports multiple recurrent layers
  • Supports deep input to and deep output from hidden layers
    • no deep transitions currently
  • Supports embedding and one-hot input representations
  • Can be used for both regression and classification problems
    • Regression needs preprocessing for stability - working on it
  • Much more in the pipeline

41 of 55

An example

Sentiment analysis of movie reviews - 25K labeled examples

42 of 55

43 of 55

[Code screenshot: RNN imports]

44 of 55

[Code screenshot: RNN imports, preprocessing]

45 of 55

[Code screenshot: RNN imports, preprocessing, load training data]

46 of 55

[Code screenshot: RNN imports, preprocessing, load training data, tokenize data]

47 of 55

[Code screenshot: RNN imports, preprocessing, load training data, tokenize data, configure model]

48 of 55

[Code screenshot: RNN imports, preprocessing, load training data, tokenize data, configure model, make and train model]

49 of 55

[Code screenshot: RNN imports, preprocessing, load training data, tokenize data, configure model, make and train model, load test data]

50 of 55

[Code screenshot: RNN imports, preprocessing, load training data, tokenize data, configure model, make and train model, load test data, predict on test data]
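Putting the pieces together, a sketch of what the full script might look like with Passage's early-alpha API (names and signatures are from memory of the README and may have changed; the file and column names are assumptions):

```python
import pandas as pd

from passage.preprocessing import Tokenizer
from passage.layers import Embedding, GatedRecurrent, Dense
from passage.models import RNN
from passage.utils import save

# Load training data (file name and column names are assumptions).
train = pd.read_csv("labeledTrainData.tsv", sep="\t")

# Tokenize data: fit a vocabulary and convert texts to index sequences.
tokenizer = Tokenizer(min_df=10, max_features=100000)
train_tokens = tokenizer.fit_transform(train["review"])

# Configure model: embedding -> GRU -> sigmoid output for binary sentiment.
layers = [
    Embedding(size=256, n_features=tokenizer.n_features),
    GatedRecurrent(size=256),
    Dense(size=1, activation="sigmoid"),
]

# Make and train model.
model = RNN(layers=layers, cost="BinaryCrossEntropy")
model.fit(train_tokens, train["sentiment"])
save(model, "sentiment.pkl")

# Load test data and predict on it.
test = pd.read_csv("testData.tsv", sep="\t")
predictions = model.predict(tokenizer.transform(test["review"]))
```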

51 of 55

The results

Top 10! - barely :)

52 of 55

Summary

  • RNNs look to be a competitive tool for text analysis in certain situations.
  • Especially if you have a large (1M+ example) dataset
    • A GPU or great patience is essential
  • Otherwise they can be difficult to justify over linear models:
    • Speed
    • Complexity
    • Poor generalization on small datasets

53 of 55

Contact

alec@indico.io

54 of 55

We’re hiring!

  • Data Engineer
  • Infrastructure Engineer

  • Interested?
    • contact@indico.io (or talk-to/email me after pres.)

55 of 55

Questions?