Convolutional Neural Networks for NLP
Human Language Technologies
Giuseppe Attardi
Università di Pisa
Some slides from Christopher Manning
Dealing with Sequences: Idea
tentative deal reached, deal reached to, reached to keep, to keep government, keep government open
Slide from Chris Manning
CNN
[Figure: a 2D convolution example - the filter (= kernel) weights slide over the input to produce each output value.]
Convolutional Neural Network
A 1D convolution for text
Input: one 4-dimensional embedding per word:
Not | 0.2 | 0.1 | −0.3 | 0.4 |
going | 0.5 | 0.2 | −0.3 | −0.1 |
to | −0.1 | −0.3 | −0.2 | 0.4 |
the | 0.3 | −0.3 | 0.1 | 0.1 |
beach | 0.2 | −0.3 | 0.4 | 0.2 |
tomorrow | 0.1 | 0.2 | −0.1 | −0.1 |
:-( | −0.4 | −0.4 | 0.2 | 0.3 |

Apply a filter (or kernel) of size 3, i.e. a 3×4 weight matrix:
3 | 1 | 2 | −3 |
−1 | 2 | 1 | −3 |
1 | 1 | −1 | 1 |
Sliding the filter over the text yields one value per window of 3 words; adding a bias and applying a non-linearity (ReLU) gives the final activations:
window | conv | + bias, ReLU |
w1,w2,w3 | −1.0 | 0.0 |
w2,w3,w4 | −0.5 | 0.1 |
w3,w4,w5 | −3.6 | 0.0 |
w4,w5,w6 | −0.2 | 0.6 |
w5,w6,w7 | 0.3 | 1.6 |
Worked computation for the first window (Not, going, to):
0.2×3 + 0.1×1 + (−0.3)×2 + 0.4×(−3) = −1.1
0.5×(−1) + 0.2×2 + (−0.3)×1 + (−0.1)×(−3) = −0.1
(−0.1)×1 + (−0.3)×1 + (−0.2)×(−1) + 0.4×1 = 0.2
Σ = −1.1 + (−0.1) + 0.2 = −1.0, the first filter output
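The sliding-window computation above can be checked with a few lines of NumPy (a sketch: `X` and `W` hold the embedding table and the filter weights from this slide):

```python
import numpy as np

# Embeddings, one 4-dim vector per word: Not going to the beach tomorrow :-(
X = np.array([[ 0.2,  0.1, -0.3,  0.4],
              [ 0.5,  0.2, -0.3, -0.1],
              [-0.1, -0.3, -0.2,  0.4],
              [ 0.3, -0.3,  0.1,  0.1],
              [ 0.2, -0.3,  0.4,  0.2],
              [ 0.1,  0.2, -0.1, -0.1],
              [-0.4, -0.4,  0.2,  0.3]])

# Filter (kernel) of size 3: a 3x4 weight matrix
W = np.array([[ 3.0,  1.0,  2.0, -3.0],
              [-1.0,  2.0,  1.0, -3.0],
              [ 1.0,  1.0, -1.0,  1.0]])

k = len(W)
# One value per window of k consecutive words: elementwise product, then sum
conv = np.array([np.sum(X[i:i+k] * W) for i in range(len(X) - k + 1)])
# conv matches the slide: -1.0, -0.5, -3.6, -0.2, 0.3
```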
Filters
Filters have additional parameters, such as the stride (how far the window moves between applications) and the padding (zeros added at the sequence boundaries).
A filter of size 5 is applied to every sequence of 5 consecutive words in a text.
3 filters of size 5 applied to a text of 10 words produce 18 output values. Why? Each filter fits in 10 − 5 + 1 = 6 positions, and 6 × 3 = 18.
1D convolution for text with padding
Input padded with a zero vector at each end:
0 | 0.0 | 0.0 | 0.0 | 0.0 |
Not | 0.2 | 0.1 | −0.3 | 0.4 |
going | 0.5 | 0.2 | −0.3 | −0.1 |
to | −0.1 | −0.3 | −0.2 | 0.4 |
the | 0.3 | −0.3 | 0.1 | 0.1 |
beach | 0.2 | −0.3 | 0.4 | 0.2 |
tomorrow | 0.1 | 0.2 | −0.1 | −0.1 |
:-( | −0.4 | −0.4 | 0.2 | 0.3 |
0 | 0.0 | 0.0 | 0.0 | 0.0 |

Apply the same filter (or kernel) of size 3:
3 | 1 | 2 | −3 |
−1 | 2 | 1 | −3 |
1 | 1 | −1 | 1 |
With padding there is one output per word (7 values):
0,w1,w2 | −0.6 |
w1,w2,w3 | −1.0 |
w2,w3,w4 | −0.5 |
w3,w4,w5 | −3.6 |
w4,w5,w6 | −0.2 |
w5,w6,w7 | 0.3 |
w6,w7,0 | −0.5 |
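The padded variant is the same computation after prepending and appending a zero vector (a sketch; embeddings and filter as on the slide):

```python
import numpy as np

X = np.array([[ 0.2,  0.1, -0.3,  0.4],   # Not
              [ 0.5,  0.2, -0.3, -0.1],   # going
              [-0.1, -0.3, -0.2,  0.4],   # to
              [ 0.3, -0.3,  0.1,  0.1],   # the
              [ 0.2, -0.3,  0.4,  0.2],   # beach
              [ 0.1,  0.2, -0.1, -0.1],   # tomorrow
              [-0.4, -0.4,  0.2,  0.3]])  # :-(

W = np.array([[ 3.0,  1.0,  2.0, -3.0],
              [-1.0,  2.0,  1.0, -3.0],
              [ 1.0,  1.0, -1.0,  1.0]])

# Zero-pad one row at each end so that every word yields one output
Xp = np.pad(X, ((1, 1), (0, 0)))
k = len(W)
conv = np.array([np.sum(Xp[i:i+k] * W) for i in range(len(Xp) - k + 1)])
# 7 outputs: -0.6, -1.0, -0.5, -3.6, -0.2, 0.3, -0.5
```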
3-channel 1D convolution with padding
The same zero-padded input as above.
Apply 3 filters (or kernels) of size 3:
3 | 1 | 2 | −3 |
−1 | 2 | 1 | −3 |
1 | 1 | −1 | 1 |

1 | 0 | 0 | 1 |
1 | 0 | −1 | −1 |
0 | 1 | 0 | 1 |

1 | −1 | 2 | −1 |
1 | 0 | −1 | 3 |
0 | 2 | 2 | 1 |

Each window now yields 3 values, one per filter:
0,w1,w2 | −0.6 | 0.2 | 1.4 |
w1,w2,w3 | −1.0 | 1.6 | −1.0 |
w2,w3,w4 | −0.5 | −0.1 | 0.8 |
w3,w4,w5 | −3.6 | 0.3 | 0.3 |
w4,w5,w6 | −0.2 | 0.1 | 1.2 |
w5,w6,w7 | 0.3 | 0.6 | 0.9 |
w6,w7,0 | −0.5 | −0.9 | 0.1 |
conv1d, padded, with max pooling over time
The same 3-filter padded convolution as above produces a 7×3 output; max pooling over time keeps, for each filter, its maximum value across all windows:
Max pool | 0.3 | 1.6 | 1.4 |
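The 3-filter padded convolution with max pooling can be sketched in NumPy (embeddings and the three filter matrices as on the slides):

```python
import numpy as np

X = np.array([[ 0.2,  0.1, -0.3,  0.4],
              [ 0.5,  0.2, -0.3, -0.1],
              [-0.1, -0.3, -0.2,  0.4],
              [ 0.3, -0.3,  0.1,  0.1],
              [ 0.2, -0.3,  0.4,  0.2],
              [ 0.1,  0.2, -0.1, -0.1],
              [-0.4, -0.4,  0.2,  0.3]])

# Three filters, each a 3x4 weight matrix
F = np.array([[[ 3.0,  1.0,  2.0, -3.0],
               [-1.0,  2.0,  1.0, -3.0],
               [ 1.0,  1.0, -1.0,  1.0]],
              [[ 1.0,  0.0,  0.0,  1.0],
               [ 1.0,  0.0, -1.0, -1.0],
               [ 0.0,  1.0,  0.0,  1.0]],
              [[ 1.0, -1.0,  2.0, -1.0],
               [ 1.0,  0.0, -1.0,  3.0],
               [ 0.0,  2.0,  2.0,  1.0]]])

Xp = np.pad(X, ((1, 1), (0, 0)))   # zero padding
k = 3
conv = np.array([[np.sum(Xp[i:i+k] * f) for f in F]
                 for i in range(len(Xp) - k + 1)])  # shape (7, 3)
pooled = conv.max(axis=0)          # max over time, one value per filter
# pooled matches the slide: 0.3, 1.6, 1.4
```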
conv1d, padded, average pooling over time
Starting from the same 7×3 convolution output as above, average pooling instead keeps, for each filter, the mean of its 7 output values:
average | −0.87 | 0.26 | 0.53 |
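Average pooling only changes the last step of the previous sketch: take the mean over time instead of the max (same embeddings and filters as on the slides):

```python
import numpy as np

X = np.array([[ 0.2,  0.1, -0.3,  0.4],
              [ 0.5,  0.2, -0.3, -0.1],
              [-0.1, -0.3, -0.2,  0.4],
              [ 0.3, -0.3,  0.1,  0.1],
              [ 0.2, -0.3,  0.4,  0.2],
              [ 0.1,  0.2, -0.1, -0.1],
              [-0.4, -0.4,  0.2,  0.3]])

F = np.array([[[ 3.0,  1.0,  2.0, -3.0],
               [-1.0,  2.0,  1.0, -3.0],
               [ 1.0,  1.0, -1.0,  1.0]],
              [[ 1.0,  0.0,  0.0,  1.0],
               [ 1.0,  0.0, -1.0, -1.0],
               [ 0.0,  1.0,  0.0,  1.0]],
              [[ 1.0, -1.0,  2.0, -1.0],
               [ 1.0,  0.0, -1.0,  3.0],
               [ 0.0,  2.0,  2.0,  1.0]]])

Xp = np.pad(X, ((1, 1), (0, 0)))
k = 3
conv = np.array([[np.sum(Xp[i:i+k] * f) for f in F]
                 for i in range(len(Xp) - k + 1)])
pooled = conv.mean(axis=0)   # average over time, one value per filter
# pooled rounds to the slide values: -0.87, 0.26, 0.53
```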
conv1d, padded, max pooling, stride = 2
With stride 2 the filter is applied every other position, so only 4 windows are computed (the pooled values are the columnwise maxima, so the pooling shown is max, not average):
0,w1,w2 | −0.6 | 0.2 | 1.4 |
w2,w3,w4 | −0.5 | −0.1 | 0.8 |
w4,w5,w6 | −0.2 | 0.1 | 1.2 |
w6,w7,0 | −0.5 | −0.9 | 0.1 |
Max pool | −0.2 | 0.2 | 1.4 |
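A stride of 2 only changes which window positions are evaluated; in the sketch below (same embeddings and filters as on the slides) the window start index advances by 2:

```python
import numpy as np

X = np.array([[ 0.2,  0.1, -0.3,  0.4],
              [ 0.5,  0.2, -0.3, -0.1],
              [-0.1, -0.3, -0.2,  0.4],
              [ 0.3, -0.3,  0.1,  0.1],
              [ 0.2, -0.3,  0.4,  0.2],
              [ 0.1,  0.2, -0.1, -0.1],
              [-0.4, -0.4,  0.2,  0.3]])

F = np.array([[[ 3.0,  1.0,  2.0, -3.0],
               [-1.0,  2.0,  1.0, -3.0],
               [ 1.0,  1.0, -1.0,  1.0]],
              [[ 1.0,  0.0,  0.0,  1.0],
               [ 1.0,  0.0, -1.0, -1.0],
               [ 0.0,  1.0,  0.0,  1.0]],
              [[ 1.0, -1.0,  2.0, -1.0],
               [ 1.0,  0.0, -1.0,  3.0],
               [ 0.0,  2.0,  2.0,  1.0]]])

Xp = np.pad(X, ((1, 1), (0, 0)))
k, stride = 3, 2
conv = np.array([[np.sum(Xp[i:i+k] * f) for f in F]
                 for i in range(0, len(Xp) - k + 1, stride)])  # shape (4, 3)
pooled = conv.max(axis=0)   # max pool: -0.2, 0.2, 1.4
```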
Keras
import tensorflow as tf
from tensorflow.keras.layers import Conv1D

batch_size = 16
word_embed_size = 4
seq_len = 7
input = tf.random.normal((batch_size, seq_len, word_embed_size))
kernel_size = 3
conv1 = Conv1D(3, kernel_size)  # can add: padding='same'
hidden1 = conv1(input)          # shape (16, 5, 3)
hidden2 = tf.reduce_max(hidden1, axis=1)  # max pool over time: (16, 3)
PyTorch
import torch
from torch.nn import Conv1d

batch_size = 16
word_embed_size = 4
seq_len = 7
input = torch.randn(batch_size, word_embed_size, seq_len)  # channels first
conv1 = Conv1d(in_channels=word_embed_size, out_channels=3,
               kernel_size=3)  # can add: padding=1
hidden1 = conv1(input)                  # shape (16, 3, 5)
hidden2, _ = torch.max(hidden1, dim=2)  # max pool over time: (16, 3)
Single Layer CNN for Sentence Classification
Code
See notebook:
http://medialab.di.unipi.it:8000/hub/user-redirect/lab/tree/HLT/Lectures/CnnNLP.ipynb
Sentiment Analysis on Tweets
Evolution
Semeval 2013 - Examples
Polarity | Tweet |
0 | will testdrive the new Nokia N9 phone with our newest app starting on Thursday :-) |
-1 | RT @arodsf: no way to underestimate the madness and cynicism and frank and open loathing of country |
1 | I feel like a kid before xmas, i cannot wait to get one RT @NokiaKnowings: In case you missed it...No... |
SemEval 2013, Task 2
NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets, Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu, In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), June 2013, Atlanta, USA.
Approach:
SVM with a large set of handcrafted features (sentiment lexicons, word and character n-grams, POS tags, etc.)
SemEval 2015 – Task 10
Best Submission:
A. Severyn, A. Moschitti. 2015. UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 464–469, Denver, Colorado, June 4-5, 2015. https://www.aclweb.org/anthology/S15-2079
CNN for Sentiment Classification
Multiple filters of sliding windows of various sizes h
c_i = f(F ⊗ S_{i:i+h−1} + b)
[Figure: the network architecture - embeddings for each word form the sentence matrix S ("Not going to the beach tomorrow :-("); a convolutional layer with multiple filters is followed by max-over-time pooling and a multilayer perceptron with dropout that outputs the polarity (+/−). ⊗ denotes the Frobenius elementwise matrix product.]
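In code, each feature c_i amounts to the Frobenius elementwise product of the filter F with a window of the sentence matrix S, summed and passed through the non-linearity f. A minimal sketch with hypothetical toy values (S, F, and b are made up for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical toy values: a 3-word sentence with 2-dim embeddings,
# one filter F spanning h = 2 words, and a bias b
S = np.array([[ 0.1, -0.2],
              [ 0.3,  0.4],
              [-0.5,  0.2]])
F = np.array([[1.0, 0.0],
              [0.0, 1.0]])
b = 0.1
h = len(F)

# c_i = f(F (x) S[i:i+h-1] + b): elementwise product with a window, summed
c = np.array([relu(np.sum(F * S[i:i+h]) + b) for i in range(len(S) - h + 1)])
```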
Distant Supervision
Results of UNITN on SemEval 2015
Phrase-level subtask A:
Dataset | Score | Rank |
Twitter 15 | 84.79 | 1 |

Message-level subtask B:
Dataset | Score | Rank |
Twitter 15 | 64.59 | 2 |
Sentiment Specific Word Embeddings
[Figure: a window-based neural LM over the context "the cat sits on", trained with a combined objective: LM likelihood + polarity of the text.]
Learning SS Embeddings
Semeval 2015 Sentiment on Tweets
Team | Phrase-level (subtask A) | Message-level (subtask B) |
Attardi (unofficial) | | 67.28 |
UNITN | 84.79 | 64.59 |
KLUEless | 84.51 | 61.20 |
IOA | 82.76 | 62.62 |
WarwickDCS | 82.46 | 57.62 |
Webis | | 64.84 |
SwissCheese at SemEval 2016
Three-phase procedure:
Ensemble of Classifiers
Results
System | 2013 Tweet | 2013 SMS | 2014 Tweet | 2014 Sarcasm | 2014 LiveJournal | 2015 Tweet | 2016 Tweet Avg F1 | 2016 Tweet Acc |
SwissCheese Combination | 70.05 | 63.72 | 71.62 | 56.61 | 69.57 | 67.11 | 63.31 | 64.61 |
SwissCheese single | 67.00 | | 69.12 | 62.00 | 71.32 | 61.01 | 57.19 | |
UniPI | 59.2 (18) | 58.5 (11) | 62.7 (18) | 38.1 (25) | 65.4 (12) | 58.6 (19) | 57.1 (18) | 63.9 (3) |
UniPI SWE | 64.2 | 60.6 | 68.4 | 48.1 | 66.8 | 63.5 | 59.2 | 65.2 |
Breakdown over all test sets
SwissCheese | Prec. | Rec. | F1 |
positive | 67.48 | 74.14 | 70.66 |
negative | 53.26 | 67.86 | 59.68 |
neutral | 71.47 | 59.51 | 64.94 |
Avg F1 | 65.17 |
Accuracy | 64.62 |
UniPI 3 | Prec. | Rec. | F1 |
positive | 70.88 | 65.35 | 68.00 |
negative | 50.29 | 58.93 | 54.27 |
neutral | 68.02 | 68.12 | 68.07 |
Avg F1 | 61.14 |
Accuracy | 65.64 |
Sentiment Classification from a single neuron
Blog post: Radford et al. Learning to Generate Reviews and Discovering Sentiment. arXiv:1704.01444
Follow up
Zhang and Wallace (2015) A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification
https://arxiv.org/pdf/1510.03820.pdf
Regularization
A pitfall when fine-tuning word vectors
[Figure: "TV", "telly" and "television" start close together in embedding space; fine-tuning moves only the words that occur in the training data, e.g. "TV" and "telly", leaving "television" behind.]
What to do