1 of 37

Convolutional Neural Networks for NLP

Human Language Technologies

Giuseppe Attardi

Università di Pisa

Some slides from Christopher Manning

2 of 37

Dealing with Sequences: Idea

  • Main CNN idea:
    • What if we compute vectors for every possible word subsequence of a certain length?
    • Example: for “tentative deal reached to keep government open”, compute vectors for:

tentative deal reached, deal reached to, reached to keep, to keep government, keep government open

  • Regardless of whether the phrase is grammatical
  • Not very linguistically or cognitively plausible
  • Then group them afterwards

Slide from Chris Manning

3 of 37

CNN

  • Convolution is classically used to extract features from images
    • Models position-invariant identification

  • 2D example

Yellow color and red numbers show the filter (= kernel) weights

Green shows the input

Pink shows the output

Slide from Chris Manning

 

4 of 37

Convolutional Neural Network

5 of 37

Convolutional Neural Network

  • A convolutional layer in a NN is composed of a set of filters.
    • A filter combines a "local" selection of input values into an output value.
    • Each filter is swept across the whole input (see the sketch below).
  • During training, each filter specializes in recognizing some relevant combination of features.
  • CNNs work well on stationary features, i.e., features whose usefulness does not depend on their position in the input.
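As an illustration of the sweep, here is a minimal NumPy sketch (toy values, not taken from the slides): the same filter produces its peak response wherever its pattern occurs, regardless of position.

import numpy as np

x = np.array([0, 1, 2, 0, 0, 0, 1, 2, 0])  # the pattern [1, 2] occurs at two positions
w = np.array([1, 2])                        # a filter of size 2

# Sweep the filter across the input: one dot product per window
out = np.array([w @ x[i:i + len(w)] for i in range(len(x) - len(w) + 1)])
print(out)  # [2 5 2 0 0 2 5 2] -> the peak value 5 appears at both positions of the pattern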

6 of 37

A 1D convolution for text

Word embeddings (dimension 4):

Not        0.2   0.1  −0.3   0.4
going      0.5   0.2  −0.3  −0.1
to        −0.1  −0.3  −0.2   0.4
the        0.3  −0.3   0.1   0.1
beach      0.2  −0.3   0.4   0.2
tomorrow   0.1   0.2  −0.1  −0.1
:-(       −0.4  −0.4   0.2   0.3

Apply a filter (or kernel) of size 3, i.e. 3 words × 4 embedding dimensions of weights:

 3   1   2  −3
−1   2   1  −3
 1   1  −1   1

One output value per window of 3 consecutive words:

w1,w2,w3   −1.0
w2,w3,w4   −0.5
w3,w4,w5   −3.6
w4,w5,w6   −0.2
w5,w6,w7    0.3

A bias is then added and a non-linearity (ReLU) is applied to each output.

Worked computation for the first window, w1,w2,w3 = −1.0:

Not:    0.2 × 3 + 0.1 × 1 + (−0.3) × 2 + 0.4 × (−3)        = −1.1
going:  0.5 × (−1) + 0.2 × 2 + (−0.3) × 1 + (−0.1) × (−3)  = −0.1
to:     (−0.1) × 1 + (−0.3) × 1 + (−0.2) × (−1) + 0.4 × 1  =  0.2
Σ = −1.1 + (−0.1) + 0.2 = −1.0
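These numbers can be checked with a few lines of NumPy; the embedding and filter values below are copied from the slide.

import numpy as np

# Word embeddings: one row per word, dimension 4
X = np.array([[ 0.2,  0.1, -0.3,  0.4],   # Not
              [ 0.5,  0.2, -0.3, -0.1],   # going
              [-0.1, -0.3, -0.2,  0.4],   # to
              [ 0.3, -0.3,  0.1,  0.1],   # the
              [ 0.2, -0.3,  0.4,  0.2],   # beach
              [ 0.1,  0.2, -0.1, -0.1],   # tomorrow
              [-0.4, -0.4,  0.2,  0.3]])  # :-(

# One filter of size 3 (3 words x 4 embedding dimensions)
W = np.array([[ 3, 1,  2, -3],
              [-1, 2,  1, -3],
              [ 1, 1, -1,  1]])

# One output per window of 3 consecutive words: sum of element-wise products
out = np.array([np.sum(W * X[i:i + 3]) for i in range(len(X) - 2)])
print(out.round(1))  # -1.0  -0.5  -3.6  -0.2  0.3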

7 of 37

Filters

Filters have additional parameters that define:

  • behavior at the start/end of documents (padding)
  • size of the sweep step (stride)
  • possible presence of holes in the filter window (dilation).

A filter of size 5 is applied to every sequence of 5 consecutive words in a text.

3 filters of size 5 applied to a text of 10 words produce 18 output values. Why? (See the formula sketched below.)
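The relation between these parameters and the number of outputs per filter can be written as a small helper (the formula below is the standard one used by deep-learning libraries):

def conv1d_output_length(n_words, kernel_size, padding=0, stride=1, dilation=1):
    """Number of positions a 1D filter produces when swept over n_words inputs."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (n_words + 2 * padding - effective_kernel) // stride + 1

# The question above: 3 filters of size 5 over a text of 10 words, no padding
positions = conv1d_output_length(n_words=10, kernel_size=5)  # 6 windows
print(3 * positions)                                          # 18 output values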

8 of 37

1D convolution for text with padding

Pad the input with an all-zero vector at each end:

0          0.0   0.0   0.0   0.0
Not        0.2   0.1  −0.3   0.4
going      0.5   0.2  −0.3  −0.1
to        −0.1  −0.3  −0.2   0.4
the        0.3  −0.3   0.1   0.1
beach      0.2  −0.3   0.4   0.2
tomorrow   0.1   0.2  −0.1  −0.1
:-(       −0.4  −0.4   0.2   0.3
0          0.0   0.0   0.0   0.0

Apply the same filter (or kernel) of size 3:

 3   1   2  −3
−1   2   1  −3
 1   1  −1   1

With padding there is now one output value per word:

0,w1,w2    −0.6
w1,w2,w3   −1.0
w2,w3,w4   −0.5
w3,w4,w5   −3.6
w4,w5,w6   −0.2
w5,w6,w7    0.3
w6,w7,0    −0.5

9 of 37

3-channel 1D convolution with padding

Padded input: the same word embeddings as before, with an all-zero vector at each end.

Apply 3 filters (or kernels) of size 3:

Filter 1:
 3   1   2  −3
−1   2   1  −3
 1   1  −1   1

Filter 2:
 1   0   0   1
 1   0  −1  −1
 0   1   0   1

Filter 3:
 1  −1   2  −1
 1   0  −1   3
 0   2   2   1

Output: one row per window, one column (channel) per filter:

0,w1,w2    −0.6   0.2   1.4
w1,w2,w3   −1.0   1.6  −1.0
w2,w3,w4   −0.5  −0.1   0.8
w3,w4,w5   −3.6   0.3   0.3
w4,w5,w6   −0.2   0.1   1.2
w5,w6,w7    0.3   0.6   0.9
w6,w7,0    −0.5  −0.9   0.1

10 of 37

conv1d, padded, with max pooling over time

The same padded input and the same 3 filters of size 3 as on the previous slide give the same 7 × 3 output:

0,w1,w2    −0.6   0.2   1.4
w1,w2,w3   −1.0   1.6  −1.0
w2,w3,w4   −0.5  −0.1   0.8
w3,w4,w5   −3.6   0.3   0.3
w4,w5,w6   −0.2   0.1   1.2
w5,w6,w7    0.3   0.6   0.9
w6,w7,0    −0.5  −0.9   0.1

Max pool over time (the maximum of each column, i.e. per filter):

0.3   1.6   1.4

11 of 37

conv1d, padded, average pooling over time

The same padded input, the same 3 filters of size 3 and the same 7 × 3 output as on the previous slide.

Average pooling over time (the mean of each column, i.e. per filter):

−0.87   0.26   0.53

12 of 37

conv1d, padded, max pooling over time, stride = 2

The same padded input and the same 3 filters of size 3, but with stride = 2 the filters are applied only to every other window:

0,w1,w2    −0.6   0.2   1.4
w2,w3,w4   −0.5  −0.1   0.8
w4,w5,w6   −0.2   0.1   1.2
w6,w7,0    −0.5  −0.9   0.1

Max pool over time:

−0.2   0.2   1.4
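The last few slides can be reproduced with a short NumPy sketch (embedding and filter values copied from the slides): a padded convolution with 3 filters, followed by max pooling, average pooling, and the stride-2 variant.

import numpy as np

X = np.array([[ 0.2,  0.1, -0.3,  0.4],   # Not
              [ 0.5,  0.2, -0.3, -0.1],   # going
              [-0.1, -0.3, -0.2,  0.4],   # to
              [ 0.3, -0.3,  0.1,  0.1],   # the
              [ 0.2, -0.3,  0.4,  0.2],   # beach
              [ 0.1,  0.2, -0.1, -0.1],   # tomorrow
              [-0.4, -0.4,  0.2,  0.3]])  # :-(

# Three filters, each of size 3 over the 4 embedding dimensions
F = np.array([[[ 3,  1,  2, -3], [-1,  2,  1, -3], [ 1,  1, -1,  1]],
              [[ 1,  0,  0,  1], [ 1,  0, -1, -1], [ 0,  1,  0,  1]],
              [[ 1, -1,  2, -1], [ 1,  0, -1,  3], [ 0,  2,  2,  1]]])

# Pad with one all-zero vector at each end, then compute one row per window, one column per filter
Xp = np.pad(X, ((1, 1), (0, 0)))
feature_map = np.array([[np.sum(f * Xp[i:i + 3]) for f in F]
                        for i in range(len(Xp) - 2)])
print(feature_map.round(1))                    # the 7 x 3 table of the slides

print(feature_map.max(axis=0).round(1))        # max pooling over time:     0.3  1.6  1.4
print(feature_map.mean(axis=0).round(2))       # average pooling over time: -0.87  0.26  0.53
print(feature_map[::2].max(axis=0).round(1))   # stride 2, then max pool:   -0.2  0.2  1.4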

13 of 37

Keras

import tensorflow as tf
from tensorflow.keras.layers import Conv1D

batch_size = 16
word_embed_size = 4
seq_len = 7
input = tf.random.normal((batch_size, seq_len, word_embed_size))

kernel_size = 3
conv1 = Conv1D(3, kernel_size)  # can add: padding='same'
hidden1 = conv1(input)                    # shape (batch_size, seq_len - 2, 3)
hidden2 = tf.reduce_max(hidden1, axis=1)  # max pool over time: shape (batch_size, 3)

14 of 37

PyTorch

import torch
from torch.nn import Conv1d

batch_size = 16
word_embed_size = 4
seq_len = 7
input = torch.randn(batch_size, word_embed_size, seq_len)

conv1 = Conv1d(in_channels=word_embed_size, out_channels=3,
               kernel_size=3)  # can add: padding=1
hidden1 = conv1(input)                      # shape (batch_size, 3, seq_len - 2)
hidden2 = torch.max(hidden1, dim=2).values  # max pool over time: shape (batch_size, 3)

15 of 37

Single Layer CNN for Sentence Classification

  • Yoon Kim (2014): Convolutional Neural Networks for Sentence Classification. EMNLP 2014. https://arxiv.org/pdf/1408.5882.pdf
  • A variant of convolutional NNs of Collobert, Weston et al. (2011) Natural Language Processing (almost) from Scratch.
  • Goal: Sentence classification:
    • Mainly positive or negative sentiment of a sentence
  • Other tasks, e.g.:
    • Classifying sentences as subjective or objective language
    • Question classification: is the question about a person, a location, a number, ...

16 of 37

Code

See notebook:

http://medialab.di.unipi.it:8000/hub/user-redirect/lab/tree/HLT/Lectures/CnnNLP.ipynb

17 of 37

Sentiment Analysis on Tweets

18 of 37

Evolution

  • SemEval Shared Task Competition
    • 2013, Task 2
    • 2014, Task 9
    • 2015, Task 10
    • 2016
    • 2017
  • Evolution of technology:
    • Top system in 2013: SVM with sentiment lexicons and many lexical features
    • Top system in 2016: CNN with word embeddings
    • In 2017: most systems used CNN or variants

19 of 37

SemEval 2013 – Examples

 0   will testdrive the new Nokia N9 phone with our newest app starting on Thursday :-)

−1   RT @arodsf: no way to underestimate the madness and cynicism and frank and open loathing of country

 1   I feel like a kid before xmas, i cannot wait to get one RT @NokiaKnowings: In case you missed it...No...

20 of 37

SemEval 2013, Task 2

Best Submission:

NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets, Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu, In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), June 2013, Atlanta, USA.

Approach:

SVM with lots of handcrafted features:

    • word ngrams, char ngrams, all-caps, POS, elongated words, emoticons, negation, etc.
    • sentiment lexicon: counts of polarized words, sum of scores, max score, last score

21 of 37

SemEval 2015 – Task 10

Best Submission:

A. Severyn, A. Moschitti. 2015. UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 464–469, Denver, Colorado, June 4-5, 2015. https://www.aclweb.org/anthology/S15-2079

22 of 37

CNN for Sentiment Classification

  1. Embedding layer, ℝ^d (d = 300)
  2. Convolutional layer with ReLU activation: multiple filters F sliding over windows of h words,

     c_i = f(F ⊙ S[i : i+h−1] + b)

     where S is the sentence matrix, ⊙ denotes the element-wise (Frobenius) product summed over the window, b is a bias and f the non-linearity
  3. Max-pooling layer
  4. Dropout layer
  5. Linear layer with tanh activation
  6. Softmax layer

[Architecture diagram for “Not going to the beach tomorrow :-(”: embeddings for each word → convolutional layer with multiple filters → max over time pooling → multilayer perceptron with dropout → sentiment output (+ / −). S is the sentence matrix; the filter F is combined with S via the Frobenius element-wise matrix product.]
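A minimal PyTorch sketch of this kind of architecture (layer sizes, window widths and class count below are illustrative assumptions, not the exact configuration of the paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SentimentCNN(nn.Module):
    """Embeddings -> convolutions of several widths + ReLU -> max over time
    -> dropout -> hidden layer with tanh -> softmax over classes."""
    def __init__(self, vocab_size, embed_dim=300, num_filters=100,
                 windows=(3, 4, 5), hidden=100, num_classes=3, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, h) for h in windows])
        self.dropout = nn.Dropout(dropout)
        self.hidden = nn.Linear(num_filters * len(windows), hidden)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)      # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        h = self.dropout(torch.cat(pooled, dim=1))  # (batch, num_filters * len(windows))
        h = torch.tanh(self.hidden(h))
        return F.log_softmax(self.out(h), dim=1)

model = SentimentCNN(vocab_size=10000)
scores = model(torch.randint(0, 10000, (16, 20)))   # a batch of 16 sentences of 20 tokens
print(scores.shape)                                 # torch.Size([16, 3])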

23 of 37

Distant Supervision

  • A. Severyn and A. Moschitti, UNITN at SemEval 2015 Task 10.
  • Word embeddings learned from plain text carry no information about sentiment
  • Distant-supervision approach: their convolutional neural network is used to further refine the embeddings
  • Collected 10M tweets containing emoticons; the emoticons are used as distantly supervised (noisy) labels to train sentiment-aware embeddings (see the sketch below)
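The labeling idea in a few lines (a hedged sketch; the emoticon lists are illustrative, not those of the paper):

POSITIVE = {":)", ":-)", ":D", "(:"}
NEGATIVE = {":(", ":-(", ":'("}

def noisy_label(tweet):
    """Use emoticons as distant (noisy) sentiment labels; discard ambiguous tweets."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # no emoticon, or conflicting emoticons: not used for training

print(noisy_label("Not going to the beach tomorrow :-("))  # negative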

24 of 37

Results of UNITN on SemEval 2015

Subtask                     Dataset      Score   Rank
Phrase-level (subtask A)    Twitter 15   84.79   1
Message-level (subtask B)   Twitter 15   64.59   2

25 of 37

Sentiment Specific Word Embeddings

  • Sentiment-specific (SS) word embeddings use a corpus annotated with polarities (e.g. tweets)
  • SS word embeddings achieve SotA accuracy on tweet sentiment classification
  • G. Attardi, D. Sartiano. UniPi at SemEval 2016 Task 4: Convolutional Neural Networks for Sentiment Classification. https://www.aclweb.org/anthology/S/S16/S16-1033.pdf

[Figure: window-based network over “the cat sits on”, with lookup table U, trained with a combined objective: LM likelihood + polarity.]
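One common way to write such a combined objective (a sketch following a Collobert-Weston style ranking loss with an added polarity term; the weighting alpha and the exact form are assumptions, not necessarily the formulation used in the papers above):

import torch
import torch.nn.functional as F

def combined_loss(score_true, score_corrupt, polarity_logits, polarity_target, alpha=0.5):
    """LM-style hinge ranking loss plus a polarity classification loss.

    score_true / score_corrupt: scores of the observed window vs. a window whose
    centre word was replaced by a random word; polarity_logits / polarity_target:
    predicted polarity and (distant) gold polarity of the window's tweet."""
    lm_loss = torch.clamp(1 - score_true + score_corrupt, min=0).mean()
    polarity_loss = F.cross_entropy(polarity_logits, polarity_target)
    return alpha * lm_loss + (1 - alpha) * polarity_loss

# Dummy usage with random scores and labels
loss = combined_loss(torch.randn(32), torch.randn(32),
                     torch.randn(32, 3), torch.randint(0, 3, (32,)))
print(loss)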

26 of 37

Learning SS Embeddings


 

27 of 37

SemEval 2015 Sentiment on Tweets

Team                   Phrase-Level Polarity   Tweet
Attardi (unofficial)   –                       67.28
UNITN                  84.79                   64.59
KLUEless               84.51                   61.20
IOA                    82.76                   62.62
WarwickDCS             82.46                   57.62
Webis                  –                       64.84

28 of 37

SwissCheese at SemEval 2016

Three-phase procedure:

    • Creation of word embeddings to initialize the first layer: word2vec on an unlabeled corpus of 200M tweets.
    • Distantly supervised phase: the network weights and word embeddings are trained to capture aspects related to sentiment; emoticons are used to infer the polarity of a balanced set of 90M tweets.
    • Supervised phase: the network is trained on the provided supervised training data.

29 of 37

Ensemble of Classifiers

  • Combines the outputs of two 2-layer CNNs with similar architectures but different choices of certain parameters (such as the number of convolutional filters).
  • The networks were also initialized with different word embeddings and used slightly different training data in the distantly supervised phase.
  • A total of 7 outputs were combined.

30 of 37

Results

 

System                      2013                  2014                                    2015        2016 Tweet
                            Tweet      SMS        Tweet      Sarcasm    LiveJournal      Tweet       Avg F1     Acc
SwissCheese (combination)   70.05      63.72      71.62      56.61      69.57            67.11       63.31      64.61
SwissCheese (single)        67.00      69.12      62.00      71.32      61.01            57.19       –          –
UniPI (score, rank)         59.2 (18)  58.5 (11)  62.7 (18)  38.1 (25)  65.4 (12)        58.6 (19)   57.1 (18)  63.9 (3)
UniPI SWE                   64.2       60.6       68.4       48.1       66.8             63.5        59.2       65.2

31 of 37

Breakdown over all test sets

SwissCheese   Prec.   Rec.    F1
positive      67.48   74.14   70.66
negative      53.26   67.86   59.68
neutral       71.47   59.51   64.94
Avg F1                        65.17
Accuracy                      64.62

UniPI 3       Prec.   Rec.    F1
positive      70.88   65.35   68.00
negative      50.29   58.93   54.27
neutral       68.02   68.12   68.07
Avg F1                        61.14
Accuracy                      65.64

32 of 37

Sentiment Classification from a single neuron

  • A character-level LSTM with 4096 units was trained on 82 million reviews from Amazon.
  • The model is trained only to predict the next character in the text.
  • After training, one of the units had a very high correlation with sentiment, yielding state-of-the-art accuracy when used as a classifier.
  • The model can be used to generate text.
  • By setting the value of the sentiment unit, one can control the sentiment of the generated text.

Blog post: Radford et al. Learning to Generate Reviews and Discovering Sentiment. arXiv:1704.01444

33 of 37

Follow up

Zhang and Wallace (2015) A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification

https://arxiv.org/pdf/1510.03820.pdf

34 of 37

Regularization


35 of 37

A pitfall when fine-tuning word vectors

  • Setting: We are training a logistic regression classification model for movie review sentiment using single words.
  • In the training data we have “TV” and “telly”
  • In the testing data we have “television”
  • In the pre-trained word vector space, all three are similar:

  • Question: What happens when we update the word vectors?

[Figure: “TV”, “telly” and “television” plotted close together in the pre-trained vector space.]

36 of 37

A pitfall when fine-tuning word vectors

  • Question: What happens when we update the word vectors?
  • Answer:
    • Those words that are in the training data move around
      • “TV” and “telly”
    • Words not in the training data stay where they were
      • “television”

[Figure: after fine-tuning, “TV” and “telly” have moved, while “television” remains where it was in the pre-trained space.]

37 of 37

What to do

  • Question: Should I use available “pre-trained” word vectors?
  • Answer:
    • Almost always, yes!
    • They are trained on a huge amount of data, so they will know about words not in your training data and will know more about the words that are in your training data
    • Have 100s of millions of words of training data? Then it is okay to start from random vectors
  • Question: Should I update (“fine-tune”) my own word vectors?
  • Answer:
    • If you only have a small training data set, don't train (update) the word vectors
    • If you have a large dataset, it probably works better to train = update = fine-tune the word vectors to the task (see the sketch below)
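In PyTorch terms, the two options look like this (a sketch; pretrained_vectors stands for whatever pre-trained embedding matrix you load):

import torch
import torch.nn as nn

pretrained_vectors = torch.randn(10000, 300)  # placeholder for real pre-trained embeddings

# Small training set: keep the pre-trained vectors frozen (not updated by backprop)
embed_frozen = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

# Large training set: allow the vectors to be updated (fine-tuned) with the task
embed_tuned = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)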