1 of 37

Convolutional Neural Networks for NLP

Human Language Technologies

Giuseppe Attardi

Università di Pisa

Some slides from Christopher Manning

2 of 37

Dealing with Sequences: Idea

  • Main CNN idea:
    • What if we compute vectors for every possible word subsequence of a certain length?
    • Example: for “tentative deal reached to keep government open”, compute vectors for:

tentative deal reached, deal reached to, reached to keep, to keep government, keep government open

  • Regardless of whether the phrase is grammatical
  • Not very linguistically or cognitively plausible
  • Then group them afterwards

Slide from Chris Manning

3 of 37

CNN

  • Convolution is classically used to extract features from images
    • Models position-invariant identification

  • 2D example

Yellow color and red numbers show the filter (= kernel) weights

Green shows the input

Pink shows the output

Slide from Chris Manning

 

4 of 37

Convolutional Neural Network

5 of 37

Convolutional Neural Network

  • A convolutional layer in a NN is composed of a set of filters.
    • A filter combines a "local" selection of input values into an output value.
    • Each filter is swept across the whole input (see the sketch below).
  • During training, each filter specializes in recognizing some relevant combination of features.
  • CNNs work well on stationary features, i.e., features whose usefulness does not depend on their position in the input.
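As an illustration of the sweep, here is a minimal NumPy sketch (toy values, not taken from the slides): the same filter produces its peak response wherever its pattern occurs, regardless of position.

import numpy as np

x = np.array([0, 1, 2, 0, 0, 0, 1, 2, 0])  # the pattern [1, 2] occurs at two positions
w = np.array([1, 2])                        # a filter of size 2

# Sweep the filter across the input: one dot product per window
out = np.array([w @ x[i:i + len(w)] for i in range(len(x) - len(w) + 1)])
print(out)  # [2 5 2 0 0 2 5 2] -> the peak value 5 appears at both positions of the pattern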

6 of 37

A 1D convolution for text

Word embeddings (dimension 4):

Not        0.2   0.1  −0.3   0.4
going      0.5   0.2  −0.3  −0.1
to        −0.1  −0.3  −0.2   0.4
the        0.3  −0.3   0.1   0.1
beach      0.2  −0.3   0.4   0.2
tomorrow   0.1   0.2  −0.1  −0.1
:-(       −0.4  −0.4   0.2   0.3

Apply a filter (or kernel) of size 3, i.e. 3 words × 4 embedding dimensions of weights:

 3   1   2  −3
−1   2   1  −3
 1   1  −1   1

One output value per window of 3 consecutive words:

w1,w2,w3   −1.0
w2,w3,w4   −0.5
w3,w4,w5   −3.6
w4,w5,w6   −0.2
w5,w6,w7    0.3

A bias is then added and a non-linearity (ReLU) is applied to each output.

Worked computation for the first window, w1,w2,w3 = −1.0:

Not:    0.2 × 3 + 0.1 × 1 + (−0.3) × 2 + 0.4 × (−3)        = −1.1
going:  0.5 × (−1) + 0.2 × 2 + (−0.3) × 1 + (−0.1) × (−3)  = −0.1
to:     (−0.1) × 1 + (−0.3) × 1 + (−0.2) × (−1) + 0.4 × 1  =  0.2
Σ = −1.1 + (−0.1) + 0.2 = −1.0
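These numbers can be checked with a few lines of NumPy; the embedding and filter values below are copied from the slide.

import numpy as np

# Word embeddings: one row per word, dimension 4
X = np.array([[ 0.2,  0.1, -0.3,  0.4],   # Not
              [ 0.5,  0.2, -0.3, -0.1],   # going
              [-0.1, -0.3, -0.2,  0.4],   # to
              [ 0.3, -0.3,  0.1,  0.1],   # the
              [ 0.2, -0.3,  0.4,  0.2],   # beach
              [ 0.1,  0.2, -0.1, -0.1],   # tomorrow
              [-0.4, -0.4,  0.2,  0.3]])  # :-(

# One filter of size 3 (3 words x 4 embedding dimensions)
W = np.array([[ 3, 1,  2, -3],
              [-1, 2,  1, -3],
              [ 1, 1, -1,  1]])

# One output per window of 3 consecutive words: sum of element-wise products
out = np.array([np.sum(W * X[i:i + 3]) for i in range(len(X) - 2)])
print(out.round(1))  # -1.0  -0.5  -3.6  -0.2  0.3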

7 of 37

Filters

Filters have additional parameters that define:

  • behavior at the start/end of documents (padding)
  • size of the sweep step (stride)
  • possible presence of holes in the filter window (dilation).

A filter of size 5 is applied to every sequence of 5 consecutive words in a text.

3 filters of size 5 applied to a text of 10 words produce 18 output values. Why? (See the formula sketched below.)
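The relation between these parameters and the number of outputs per filter can be written as a small helper (the formula below is the standard one used by deep-learning libraries):

def conv1d_output_length(n_words, kernel_size, padding=0, stride=1, dilation=1):
    """Number of positions a 1D filter produces when swept over n_words inputs."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (n_words + 2 * padding - effective_kernel) // stride + 1

# The question above: 3 filters of size 5 over a text of 10 words, no padding
positions = conv1d_output_length(n_words=10, kernel_size=5)  # 6 windows
print(3 * positions)                                          # 18 output values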

8 of 37

1D convolution for text with padding

Pad the input with an all-zero vector at each end:

0          0.0   0.0   0.0   0.0
Not        0.2   0.1  −0.3   0.4
going      0.5   0.2  −0.3  −0.1
to        −0.1  −0.3  −0.2   0.4
the        0.3  −0.3   0.1   0.1
beach      0.2  −0.3   0.4   0.2
tomorrow   0.1   0.2  −0.1  −0.1
:-(       −0.4  −0.4   0.2   0.3
0          0.0   0.0   0.0   0.0

Apply the same filter (or kernel) of size 3:

 3   1   2  −3
−1   2   1  −3
 1   1  −1   1

With padding there is now one output value per word:

0,w1,w2    −0.6
w1,w2,w3   −1.0
w2,w3,w4   −0.5
w3,w4,w5   −3.6
w4,w5,w6   −0.2
w5,w6,w7    0.3
w6,w7,0    −0.5

9 of 37

3-channel 1D convolution with padding

Padded input: the same word embeddings as before, with an all-zero vector at each end.

Apply 3 filters (or kernels) of size 3:

Filter 1:
 3   1   2  −3
−1   2   1  −3
 1   1  −1   1

Filter 2:
 1   0   0   1
 1   0  −1  −1
 0   1   0   1

Filter 3:
 1  −1   2  −1
 1   0  −1   3
 0   2   2   1

Output: one row per window, one column (channel) per filter:

0,w1,w2    −0.6   0.2   1.4
w1,w2,w3   −1.0   1.6  −1.0
w2,w3,w4   −0.5  −0.1   0.8
w3,w4,w5   −3.6   0.3   0.3
w4,w5,w6   −0.2   0.1   1.2
w5,w6,w7    0.3   0.6   0.9
w6,w7,0    −0.5  −0.9   0.1

10 of 37

conv1d, padded, with max pooling over time

The same padded input and the same 3 filters of size 3 as on the previous slide give the same 7 × 3 output:

0,w1,w2    −0.6   0.2   1.4
w1,w2,w3   −1.0   1.6  −1.0
w2,w3,w4   −0.5  −0.1   0.8
w3,w4,w5   −3.6   0.3   0.3
w4,w5,w6   −0.2   0.1   1.2
w5,w6,w7    0.3   0.6   0.9
w6,w7,0    −0.5  −0.9   0.1

Max pool over time (the maximum of each column, i.e. per filter):

0.3   1.6   1.4

11 of 37

conv1d, padded, average pooling over time

The same padded input, the same 3 filters of size 3 and the same 7 × 3 output as on the previous slide.

Average pooling over time (the mean of each column, i.e. per filter):

−0.87   0.26   0.53

12 of 37

conv1d, padded, max pooling over time, stride = 2

The same padded input and the same 3 filters of size 3, but with stride = 2 the filters are applied only to every other window:

0,w1,w2    −0.6   0.2   1.4
w2,w3,w4   −0.5  −0.1   0.8
w4,w5,w6   −0.2   0.1   1.2
w6,w7,0    −0.5  −0.9   0.1

Max pool over time:

−0.2   0.2   1.4
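The last few slides can be reproduced with a short NumPy sketch (embedding and filter values copied from the slides): a padded convolution with 3 filters, followed by max pooling, average pooling, and the stride-2 variant.

import numpy as np

X = np.array([[ 0.2,  0.1, -0.3,  0.4],   # Not
              [ 0.5,  0.2, -0.3, -0.1],   # going
              [-0.1, -0.3, -0.2,  0.4],   # to
              [ 0.3, -0.3,  0.1,  0.1],   # the
              [ 0.2, -0.3,  0.4,  0.2],   # beach
              [ 0.1,  0.2, -0.1, -0.1],   # tomorrow
              [-0.4, -0.4,  0.2,  0.3]])  # :-(

# Three filters, each of size 3 over the 4 embedding dimensions
F = np.array([[[ 3,  1,  2, -3], [-1,  2,  1, -3], [ 1,  1, -1,  1]],
              [[ 1,  0,  0,  1], [ 1,  0, -1, -1], [ 0,  1,  0,  1]],
              [[ 1, -1,  2, -1], [ 1,  0, -1,  3], [ 0,  2,  2,  1]]])

# Pad with one all-zero vector at each end, then compute one row per window, one column per filter
Xp = np.pad(X, ((1, 1), (0, 0)))
feature_map = np.array([[np.sum(f * Xp[i:i + 3]) for f in F]
                        for i in range(len(Xp) - 2)])
print(feature_map.round(1))                    # the 7 x 3 table of the slides

print(feature_map.max(axis=0).round(1))        # max pooling over time:     0.3  1.6  1.4
print(feature_map.mean(axis=0).round(2))       # average pooling over time: -0.87  0.26  0.53
print(feature_map[::2].max(axis=0).round(1))   # stride 2, then max pool:   -0.2  0.2  1.4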

13 of 37

Keras

import tensorflow as tf
from tensorflow.keras.layers import Conv1D

batch_size = 16
word_embed_size = 4
seq_len = 7
input = tf.random.normal((batch_size, seq_len, word_embed_size))

kernel_size = 3
conv1 = Conv1D(3, kernel_size)  # can add: padding='same'
hidden1 = conv1(input)                    # shape (batch_size, seq_len - 2, 3)
hidden2 = tf.reduce_max(hidden1, axis=1)  # max pool over time: shape (batch_size, 3)

14 of 37

PyTorch

import torch
from torch.nn import Conv1d

batch_size = 16
word_embed_size = 4
seq_len = 7
input = torch.randn(batch_size, word_embed_size, seq_len)

conv1 = Conv1d(in_channels=word_embed_size, out_channels=3,
               kernel_size=3)  # can add: padding=1
hidden1 = conv1(input)                      # shape (batch_size, 3, seq_len - 2)
hidden2 = torch.max(hidden1, dim=2).values  # max pool over time: shape (batch_size, 3)

15 of 37

Single Layer CNN for Sentence Classification

  • Yoon Kim (2014): Convolutional Neural Networks for Sentence Classification. EMNLP 2014. https://arxiv.org/pdf/1408.5882.pdf
  • A variant of convolutional NNs of Collobert, Weston et al. (2011) Natural Language Processing (almost) from Scratch.
  • Goal: Sentence classification:
    • Mainly positive or negative sentiment of a sentence
  • Other tasks, e.g.:
    • Classifying sentences as subjective or objective language
    • Question classification: is the question about a person, a location, a number, ...

16 of 37

Code

See notebook:

http://medialab.di.unipi.it:8000/hub/user-redirect/lab/tree/HLT/Lectures/CnnNLP.ipynb

17 of 37

Sentiment Analysis on Tweets

18 of 37

Evolution

  • SemEval Shared Task Competition
    • 2013, Task 2
    • 2014, Task 9
    • 2015, Task 10
    • 2016
    • 2017
  • Evolution of technology:
    • Top system in 2013: SVM with sentiment lexicons and many lexical features
    • Top system in 2016: CNN with word embeddings
    • In 2017: most systems used CNN or variants

19 of 37

SemEval 2013 – Examples

 0   will testdrive the new Nokia N9 phone with our newest app starting on Thursday :-)

−1   RT @arodsf: no way to underestimate the madness and cynicism and frank and open loathing of country

 1   I feel like a kid before xmas, i cannot wait to get one RT @NokiaKnowings: In case you missed it...No...

20 of 37

SemEval 2013, Task 2

Best Submission:

NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets, Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu, In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), June 2013, Atlanta, USA.

Approach:

SVM with lots of handcrafted features:

    • word ngrams, char ngrams, all-caps, POS, elongated words, emoticons, negation, etc.
    • sentiment lexicon: counts of polarized words, sum of scores, max score, last score

21 of 37

SemEval 2015 – Task 10

Best Submission:

A. Severyn, A. Moschitti. 2015. UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 464–469, Denver, Colorado, June 4-5, 2015. https://www.aclweb.org/anthology/S15-2079

22 of 37

CNN for Sentiment Classification

  1. Embedding layer, ℝ^d (d = 300)
  2. Convolutional layer with ReLU activation: multiple filters F sliding over windows of h words,

     c_i = f(F ⊙ S[i : i+h−1] + b)

     where S is the sentence matrix, ⊙ denotes the element-wise (Frobenius) product summed over the window, b is a bias and f the non-linearity
  3. Max-pooling layer
  4. Dropout layer
  5. Linear layer with tanh activation
  6. Softmax layer

[Architecture diagram for “Not going to the beach tomorrow :-(”: embeddings for each word → convolutional layer with multiple filters → max over time pooling → multilayer perceptron with dropout → sentiment output (+ / −). S is the sentence matrix; the filter F is combined with S via the Frobenius element-wise matrix product.]
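A minimal PyTorch sketch of this kind of architecture (layer sizes, window widths and class count below are illustrative assumptions, not the exact configuration of the paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SentimentCNN(nn.Module):
    """Embeddings -> convolutions of several widths + ReLU -> max over time
    -> dropout -> hidden layer with tanh -> softmax over classes."""
    def __init__(self, vocab_size, embed_dim=300, num_filters=100,
                 windows=(3, 4, 5), hidden=100, num_classes=3, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, h) for h in windows])
        self.dropout = nn.Dropout(dropout)
        self.hidden = nn.Linear(num_filters * len(windows), hidden)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)      # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        h = self.dropout(torch.cat(pooled, dim=1))  # (batch, num_filters * len(windows))
        h = torch.tanh(self.hidden(h))
        return F.log_softmax(self.out(h), dim=1)

model = SentimentCNN(vocab_size=10000)
scores = model(torch.randint(0, 10000, (16, 20)))   # a batch of 16 sentences of 20 tokens
print(scores.shape)                                 # torch.Size([16, 3])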

23 of 37

Distant Supervision

  • A. Severyn and A. Moschitti, UNITN at SemEval 2015 Task 10.
  • Word embeddings learned from plain text carry no information about sentiment
  • Distant-supervision approach: their convolutional neural network is used to further refine the embeddings
  • Collected 10M tweets containing emoticons; the emoticons are used as distantly supervised (noisy) labels to train sentiment-aware embeddings (see the sketch below)
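The labeling idea in a few lines (a hedged sketch; the emoticon lists are illustrative, not those of the paper):

POSITIVE = {":)", ":-)", ":D", "(:"}
NEGATIVE = {":(", ":-(", ":'("}

def noisy_label(tweet):
    """Use emoticons as distant (noisy) sentiment labels; discard ambiguous tweets."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # no emoticon, or conflicting emoticons: not used for training

print(noisy_label("Not going to the beach tomorrow :-("))  # negative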

24 of 37

Results of UNITN on SemEval 2015

Subtask                     Dataset      Score   Rank
Phrase-level (subtask A)    Twitter 15   84.79   1
Message-level (subtask B)   Twitter 15   64.59   2

25 of 37

Sentiment Specific Word Embeddings

  • Sentiment-specific (SS) word embeddings use a corpus annotated with polarities (e.g. tweets)
  • SS word embeddings achieve SotA accuracy on tweet sentiment classification
  • G. Attardi, D. Sartiano. UniPi at SemEval 2016 Task 4: Convolutional Neural Networks for Sentiment Classification. https://www.aclweb.org/anthology/S/S16/S16-1033.pdf

[Figure: window-based network over “the cat sits on”, with lookup table U, trained with a combined objective: LM likelihood + polarity.]
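One common way to write such a combined objective (a sketch following a Collobert-Weston style ranking loss with an added polarity term; the weighting alpha and the exact form are assumptions, not necessarily the formulation used in the papers above):

import torch
import torch.nn.functional as F

def combined_loss(score_true, score_corrupt, polarity_logits, polarity_target, alpha=0.5):
    """LM-style hinge ranking loss plus a polarity classification loss.

    score_true / score_corrupt: scores of the observed window vs. a window whose
    centre word was replaced by a random word; polarity_logits / polarity_target:
    predicted polarity and (distant) gold polarity of the window's tweet."""
    lm_loss = torch.clamp(1 - score_true + score_corrupt, min=0).mean()
    polarity_loss = F.cross_entropy(polarity_logits, polarity_target)
    return alpha * lm_loss + (1 - alpha) * polarity_loss

# Dummy usage with random scores and labels
loss = combined_loss(torch.randn(32), torch.randn(32),
                     torch.randn(32, 3), torch.randint(0, 3, (32,)))
print(loss)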

26 of 37

Learning SS Embeddings


 

27 of 37

SemEval 2015 Sentiment on Tweets

Team                   Phrase-Level Polarity   Tweet
Attardi (unofficial)   –                       67.28
UNITN                  84.79                   64.59
KLUEless               84.51                   61.20
IOA                    82.76                   62.62
WarwickDCS             82.46                   57.62
Webis                  –                       64.84

28 of 37

SwissCheese at SemEval 2016

Three-phase procedure:

    • Creation of word embeddings to initialize the first layer: word2vec on an unlabeled corpus of 200M tweets.
    • Distantly supervised phase: the network weights and word embeddings are trained to capture aspects related to sentiment; emoticons are used to infer the polarity of a balanced set of 90M tweets.
    • Supervised phase: the network is trained on the provided supervised training data.

29 of 37

Ensemble of Classifiers

  • Combines the outputs of two 2-layer CNNs with similar architectures but different choices of certain parameters (such as the number of convolutional filters).
  • The networks were also initialized with different word embeddings and used slightly different training data in the distantly supervised phase.
  • A total of 7 outputs were combined.

30 of 37

Results

 

System                      2013                  2014                                    2015        2016 Tweet
                            Tweet      SMS        Tweet      Sarcasm    LiveJournal      Tweet       Avg F1     Acc
SwissCheese (combination)   70.05      63.72      71.62      56.61      69.57            67.11       63.31      64.61
SwissCheese (single)        67.00      69.12      62.00      71.32      61.01            57.19       –          –
UniPI (score, rank)         59.2 (18)  58.5 (11)  62.7 (18)  38.1 (25)  65.4 (12)        58.6 (19)   57.1 (18)  63.9 (3)
UniPI SWE                   64.2       60.6       68.4       48.1       66.8             63.5        59.2       65.2

31 of 37

Breakdown over all test sets

SwissCheese   Prec.   Rec.    F1
positive      67.48   74.14   70.66
negative      53.26   67.86   59.68
neutral       71.47   59.51   64.94
Avg F1                        65.17
Accuracy                      64.62

UniPI 3       Prec.   Rec.    F1
positive      70.88   65.35   68.00
negative      50.29   58.93   54.27
neutral       68.02   68.12   68.07
Avg F1                        61.14
Accuracy                      65.64

32 of 37

Sentiment Classification from a single neuron

  • A character-level LSTM with 4096 units was trained on 82 million reviews from Amazon.
  • The model is trained only to predict the next character in the text.
  • After training, one of the units had a very high correlation with sentiment, yielding state-of-the-art accuracy when used as a classifier.
  • The model can be used to generate text.
  • By setting the value of the sentiment unit, one can control the sentiment of the generated text.

Blog post: Radford et al. Learning to Generate Reviews and Discovering Sentiment. arXiv:1704.01444

33 of 37

Follow up

Zhang and Wallace (2015) A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification

https://arxiv.org/pdf/1510.03820.pdf

34 of 37

Regularization


35 of 37

A pitfall when fine-tuning word vectors

  • Setting: We are training a logistic regression classification model for movie review sentiment using single words.
  • In the training data we have “TV” and “telly”
  • In the testing data we have “television”
  • In the pre-trained word vector space, all three are similar:

  • Question: What happens when we update the word vectors?

[Figure: “TV”, “telly” and “television” plotted close together in the pre-trained vector space.]

36 of 37

A pitfall when fine-tuning word vectors

  • Question: What happens when we update the word vectors?
  • Answer:
    • Those words that are in the training data move around
      • “TV” and “telly”
    • Words not in the training data stay where they were
      • “television”

[Figure: after fine-tuning, “TV” and “telly” have moved, while “television” remains where it was in the pre-trained space.]

37 of 37

What to do

  • Question: Should I use available “pre-trained” word vectors?
  • Answer:
    • Almost always, yes!
    • They are trained on a huge amount of data, so they will know about words not in your training data and will know more about the words that are in your training data
    • Have 100s of millions of words of training data? Then it is okay to start from random vectors
  • Question: Should I update (“fine-tune”) my own word vectors?
  • Answer:
    • If you only have a small training data set, don't train (update) the word vectors
    • If you have a large dataset, it probably works better to train = update = fine-tune the word vectors to the task (see the sketch below)
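In PyTorch terms, the two options look like this (a sketch; pretrained_vectors stands for whatever pre-trained embedding matrix you load):

import torch
import torch.nn as nn

pretrained_vectors = torch.randn(10000, 300)  # placeholder for real pre-trained embeddings

# Small training set: keep the pre-trained vectors frozen (not updated by backprop)
embed_frozen = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

# Large training set: allow the vectors to be updated (fine-tuned) with the task
embed_tuned = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)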