1 of 65

Self-Introduction and Research Topics

D3 - Jorge Balazs

2018-11-30

2 of 65

Contents


  • About me

  • RepEval Shared Task, EMNLP 2017 Workshop
    Refining Raw Sentence Representations for Textual Entailment Recognition via Attention

  • Implicit Emotion Shared Task, EMNLP 2018 Workshop (2nd place out of 26 teams; best system analysis paper)
    IIIDYT at IEST 2018: Implicit Emotion Classification With Deep Contextualized Word Representations

  • Current research: On combining character- and word-level representations.

3 of 65

About me


4 of 65

About me


My name is Jorge [ˈxoɾxe]

But you can call me George, ホルヘ, or じょうじ

5 of 65

About me


I come from Chile!

6 of 65

About me


I graduated as an Industrial Engineer from the University of Chile (4 years of Bachelor's + 2 years of specialization)


8 of 65

About me - Research Interests


  • I’m interested in NLP as a way to understand human cognition.

“There are any number of questions that might lead one to undertake a study of language. Personally, I am primarily intrigued by the possibility of learning something, from the study of language, that will bring to light inherent properties of the human mind.”

Noam Chomsky, “Language and Mind”

  • I’m particularly interested in the learning of representations at different hierarchy levels (character, word, sentence, document)

9 of 65

Contents


  • About me

  • RepEval Shared Task, EMNLP 2017 Workshop
    Refining Raw Sentence Representations for Textual Entailment Recognition via Attention

  • Implicit Emotion Shared Task, EMNLP 2018 Workshop (2nd place out of 26 teams; best system analysis paper)
    IIIDYT at IEST 2018: Implicit Emotion Classification With Deep Contextualized Word Representations

  • Current research: On combining character- and word-level representations.

10 of 65

RepEval Shared Task

Refining Raw Sentence Representations for Textual Entailment Recognition via Attention


11 of 65

RepEval - Task Description


Task:

classify pairs of sentences into one of three categories: Entailment, Contradiction, or Neutral.

Dataset:

SNLI and MultiNLI

Examples:

Premise: At the other end of Pennsylvania Avenue, people began to line up for a White House tour.

Hypothesis: People formed a line at the end of Pennsylvania Avenue.

Label: entailment

Premise: This site includes a list of all award winners and a searchable database of Government Executive articles.

Hypothesis: The Government Executive articles housed on the website are not able to be searched.

Label: contradiction

Premise: The new rights are nice enough

Hypothesis: Everyone really likes the newest benefits

Label: neutral

12 of 65

RepEval - Approach 1


[BiMPM diagram: the four matching mechanisms (Full, Maxpooling, Attentive, Max-Attentive)]

First attempt: IBM’s BiMPM model [1]

  • State of the art for the NLI task at the time.

13 of 65

RepEval - Approach 1 Results


First attempt: IBM’s BiMPM model [1]

  • Validation results:

    Method                        Accuracy (%)
    CBOW Baseline                 64.7
    ESIM (tree-based encoder)     72.2
    Our implementation of BiMPM   73.4

14 of 65

RepEval


First attempt: IBM’s BiMPM model [1]

Problem: this kind of model was not allowed in the RepEval competition.

The purpose of the competition was to create good models for encoding single sentences into vectors; BiMPM matches the two sentences against each other and does not produce such single-sentence representations.

15 of 65

RepEval - Approach 2


[Diagram: Word Encoder, where each word is represented by a word embedding (EMB) together with a character-level LSTM encoding]

Second attempt: character-aware inner-attention mechanism [2]

16 of 65

RepEval - Approach 2


[Diagram: Word Encoder → Context Layer (BiLSTM)]

Second attempt: character-aware inner-attention mechanism [2]

17 of 65

RepEval - Approach 2


[Diagram: Word Encoder → Context Layer (BiLSTM) → Sentence Encoder (Inner Attention + Pooling)]

Second attempt: character-aware inner-attention mechanism [2]

18 of 65

RepEval - Approach 2


[Diagram: Word Encoder → Context Layer → Sentence Encoder → Feature Extractor]

Second attempt: character-aware inner-attention mechanism [2]

19 of 65

RepEval - Approach 2


[Diagram: Word Encoder → Context Layer → Sentence Encoder → Feature Extractor → Classifier (Linear → Softmax & Argmax → Label)]

Second attempt: character-aware inner-attention mechanism [2]
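To make the "inner attention" idea concrete, here is a minimal PyTorch sketch of a self-attentive sentence encoder with this general shape. The scoring function (a single linear layer over the BiLSTM states) is an assumption for illustration; it is not necessarily the exact parameterization used in the submitted system.

```python
# Minimal sketch (not the authors' exact code) of an inner-attention sentence
# encoder: a BiLSTM contextualizes word vectors, attention weights are computed
# from the hidden states themselves, and their weighted sum yields a
# fixed-size sentence vector.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InnerAttentionEncoder(nn.Module):
    def __init__(self, word_dim: int, hidden_dim: int):
        super().__init__()
        self.context = nn.LSTM(word_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # Attention scores come from the BiLSTM states alone
        # ("inner" attention: the other sentence is not involved).
        self.attn = nn.Linear(2 * hidden_dim, 1)

    def forward(self, words: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # words: (batch, seq_len, word_dim); mask: (batch, seq_len),
        # 1 for real tokens and 0 for padding.
        states, _ = self.context(words)                  # (B, T, 2H)
        scores = self.attn(states).squeeze(-1)           # (B, T)
        scores = scores.masked_fill(mask == 0, -1e9)
        alpha = F.softmax(scores, dim=-1)                # (B, T)
        return torch.bmm(alpha.unsqueeze(1), states).squeeze(1)  # (B, 2H)


if __name__ == "__main__":
    enc = InnerAttentionEncoder(word_dim=300, hidden_dim=300)
    x = torch.randn(2, 7, 300)
    mask = torch.ones(2, 7)
    print(enc(x, mask).shape)  # torch.Size([2, 600])
```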

20 of 65

RepEval - Approach 2 Results


Inner-attention mechanism [2]

  • This model does encode single sentences into vectors
  • It is simpler than IBM’s BiMPM model
  • Achieves slightly better validation results than ESIM

    Method                                            Accuracy (%)
    CBOW Baseline                                     64.7
    ESIM (tree-based encoder)                         72.2
    Our implementation of BiMPM                       73.4
    Our implementation of the Inner-Attention model   72.3

21 of 65

RepEval - Approach 2 Results


  • We placed 3rd out of the 5 participating teams.

  • The higher-ranked teams used stacked LSTMs to add context to word representations
  • The best team used an ensemble of models exploiting character-level information

[Table: RepEval results on the test set for each team (Nangia et al., 2017)]

22 of 65

Contents


  • About me

  • RepEval Shared Task, EMNLP 2017 Workshop
    Refining Raw Sentence Representations for Textual Entailment Recognition via Attention

  • Implicit Emotion Shared Task, EMNLP 2018 Workshop (2nd place out of 26 teams; best system analysis paper)
    IIIDYT at IEST 2018: Implicit Emotion Classification With Deep Contextualized Word Representations

  • Current research: On combining character- and word-level representations.

23 of 65

Implicit Emotion Shared Task

Implicit Emotion Classification with Deep Contextualized Representations


24 of 65

Implicit Emotion ST - Task Description


Task:

Given a tweet with a word removed, predict the emotion expressed by that word. Classes are: sad, fear, disgust, surprise, anger, and joy.

Dataset:

Tweets with a hidden word, and a label indicating the emotion of the removed word

Examples:

It's [#TRIGGERWORD#] when you feel like you are invisible to others. → sad

My step mom got so [#TRIGGERWORD#] when she came home from work and saw that the boys didn't come to Austin with me. → sad

We are so [#TRIGGERWORD#] that people must think we are on good drugs or just really good actors. → joy

25 of 65

Implicit Emotion ST - Proposed Architecture


[Architecture diagram: tweet tokens (It’s, [#TRIGGERWORD#], when, ...) → Word Encoder (ELMo Layer) → Context Layer (BiLSTM) → Sentence Encoder (Max Pooling) → Classifier (Linear → Softmax & Argmax → Label)]
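As a rough illustration of the pipeline above, the following PyTorch sketch assumes ELMo embeddings have already been computed for each token (the real system plugs an ELMo layer in at that point). Class and parameter names, dimensions, and the absence of dropout are illustrative assumptions, not the tuned configuration.

```python
# Rough sketch of an ELMo -> BiLSTM -> max pooling -> linear classifier,
# operating on precomputed ELMo embeddings.
import torch
import torch.nn as nn


class IESTClassifier(nn.Module):
    EMOTIONS = ["sad", "fear", "disgust", "surprise", "anger", "joy"]

    def __init__(self, elmo_dim: int = 1024, hidden_dim: int = 512):
        super().__init__()
        self.context = nn.LSTM(elmo_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, len(self.EMOTIONS))

    def forward(self, elmo_embeddings: torch.Tensor) -> torch.Tensor:
        # elmo_embeddings: (batch, seq_len, elmo_dim)
        states, _ = self.context(elmo_embeddings)   # (B, T, 2H)
        sentence, _ = states.max(dim=1)             # max pooling over time
        return self.classifier(sentence)            # (B, 6) emotion logits
```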

26 of 65

Implicit Emotion ST - Results


27 of 65

Implicit Emotion ST - Info Sources


28 of 65

Implicit Emotion ST - Methods


29 of 65

Implicit Emotion ST - Tools


30 of 65

Implicit Emotion ST - Ablation Study


31 of 65

Implicit Emotion ST - Dropout Ablation Study


32 of 65

Implicit Emotion ST - Confusion Matrix


33 of 65

Implicit Emotion ST - Annotation Artifact


A separate joy cluster corresponds to the sentences containing the “un[#TRIGGERWORD#]” pattern.

34 of 65

Implicit Emotion ST - Emoji


Different emoji affect classification performance in different ways

😷💕😍❤️😡😢😭😒😩😂😅😕

35 of 65

Contents


  • About me

  • RepEval Shared Task, EMNLP 2017 Workshop
    Refining Raw Sentence Representations for Textual Entailment Recognition via Attention

  • Implicit Emotion Shared Task, EMNLP 2018 Workshop (2nd place out of 26 teams; best system analysis paper)
    IIIDYT at IEST 2018: Implicit Emotion Classification With Deep Contextualized Word Representations

  • Current research: On combining character- and word-level representations.

36 of 65

Gating Mechanisms for Combining Character and Word-level Word Representations


37 of 65

Problem


Incorporating subword information (characters, morphemes, byte-pair encodings, etc.) has been shown to produce better word representations; however:

  • There is no principled way of combining representations at different levels of hierarchy in NLP.

  • Researchers usually just concatenate them or use a method without theoretical support.

  • It is difficult to know the impact a combination technique will have on a given task.

38 of 65

Research Question


Are there any fundamental principles underlying the way in which we combine representations in NLP?

A first step towards answering the question above would be to answer:

What is the best way in which we can combine character and word-level representations?

39 of 65

Research Question


What is the best way in which we can combine character and word-level representations?

[Diagram: candidate ways of combining the word vector w and the character-level vector c, namely concatenation, sum, element-wise multiplication, and a scalar-weighted sum g x c + (1 - g) x w]
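For concreteness, here is a toy PyTorch illustration of these candidate operators, assuming the word vector w and the character-level vector c share the same dimensionality d:

```python
# Toy illustration of the candidate combination operators; concatenation is
# the only one that changes the output dimensionality.
import torch

d = 300
w = torch.randn(d)                   # word-level representation (e.g. GloVe)
c = torch.randn(d)                   # character-level representation (e.g. char LSTM)
g = torch.sigmoid(torch.randn(()))   # a scalar gate in (0, 1)

concat = torch.cat([w, c])           # dimension 2d
summed = w + c                       # dimension d
product = w * c                      # element-wise product, dimension d
gated = g * c + (1 - g) * w          # scalar-weighted sum, dimension d
```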

40 of 65

Approach


  • Train models with different combination methods for character and word representations

  • Test these trained models on several simple transfer tasks

  • Training sets: SNLI & MultiNLI

  • Train several models with the same settings but different random seeds to control for randomness in parameter initialization. This makes the results statistically more robust.

41 of 65

Architecture


[Architecture diagram: Word Encoder (the "?" box, i.e. the combination method under study) → Context Layer (BiLSTM) → Sentence Encoder (Max Pooling) → Feature Extractor → Classifier (Linear → Softmax & Argmax → Label)]

Architecture based on Conneau et al., 2017
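A compact PyTorch sketch of this pipeline, with the word encoder left as a pluggable module (the "?" box). The [u; v; |u - v|; u * v] feature extractor is the standard choice in Conneau et al. (2017); whether exactly the same feature set is used here is an assumption, and the class name and hidden size are illustrative.

```python
# Conneau et al. (2017)-style NLI pipeline sketch around an arbitrary word
# encoder: BiLSTM context layer, max-pooling sentence encoder, pairwise
# feature extractor, linear classifier over 3 labels.
import torch
import torch.nn as nn


class NLIModel(nn.Module):
    def __init__(self, word_encoder: nn.Module, word_dim: int,
                 hidden_dim: int = 2048, n_classes: int = 3):
        super().__init__()
        self.word_encoder = word_encoder                 # the "?" box
        self.context = nn.LSTM(word_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.classifier = nn.Linear(4 * 2 * hidden_dim, n_classes)

    def encode(self, tokens) -> torch.Tensor:
        # word_encoder is assumed to return (batch, seq_len, word_dim).
        states, _ = self.context(self.word_encoder(tokens))
        return states.max(dim=1).values                  # max pooling

    def forward(self, premise, hypothesis) -> torch.Tensor:
        u, v = self.encode(premise), self.encode(hypothesis)
        features = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)
        return self.classifier(features)                 # entailment logits
```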

42 of 65

Word Encoder Variations


What is the best way of combining word and character level vectors?

[Diagram: word encoder variants]

  • Word-only: word embedding (EMB) alone
  • Char-only: character-level LSTM alone
  • Concat: [word embedding; character-level LSTM output]
  • Scalar gate: a Linear (d x 1) layer produces a scalar gate g; output = g x c + (1 - g) x w
  • Vector gate: a Linear (d x d) layer produces a gate vector g; output = g ⊙ c + (1 - g) ⊙ w (element-wise)
43 of 65

Results


44 of 65

Analysis


  • Training on MultiNLI is better than training on SNLI (9/11 tasks); tasks related to SICK are a special case.
  • The best models include character-aware representations most of the time (10/11)
  • Concat seems to be the best-performing method most of the time (6/11)
  • Gating mechanisms are better than concatenation in 4/11 tasks:
    • Scalar gate: 1/11 (SICKR)
    • Vector gate: 3/11 (SST2, SST5, STSB)
    • Why?

45 of 65

Overview of the Reviews


We submitted a previous version of this research as a short paper to EMNLP 2018, but it was rejected because:

  • Analysis of the gating mechanisms was expected
  • Insights from experiment analyses were insufficient
  • Poor comparisons with previous models
  • Results were not convincing enough

46 of 65

Vector gate representations

[Figure: word representations learned with the vector gate (test accuracy: 84.4)]

47 of 65

Vector gate representations

[Figure: vector gate word representations, continued (test accuracy: 84.4)]

48 of 65

Vector gate representations

[Figure: vector gate word representations, continued (test accuracy: 84.4)]

49 of 65

Character only representations

[Figure: character-only word representations, for reference only (test accuracy: 79.4)]

50 of 65

Concat representations

[Figure: concat word representations (test accuracy: 84.6)]

51 of 65

Concat representations

[Figure: concat word representations, continued (test accuracy: 84.6)]

52 of 65

Concat representations (randomly initialized word embeddings)

[Figure: concat word representations with randomly initialized word embeddings (test accuracy: 81.6)]

53 of 65

Concat representations (randomly initialized word embeddings + norm.)

[Figure: concat word representations with randomly initialized word embeddings + normalization (test accuracy: 79.3)]

54 of 65

Concat representations (GloVe pre-trained word embeddings + norm.)

[Figure: concat word representations with GloVe pre-trained word embeddings + normalization (test accuracy: 84.0)]

55 of 65

Concat representations (MultiNLI)

[Figure: concat word representations, trained on MultiNLI]

56 of 65

Discussion


Characters don’t seem to help in the SNLI task

Possible causes:

  • English is a mostly analytic language, meaning it has few inflections (words change little when conjugating verbs, for example); therefore the patterns learned from characters might not add much beyond the “information about the world” already encoded in the pre-trained GloVe embeddings.

  • SNLI is a dataset created from image descriptions. Such descriptive language might contain patterns at the word level significant enough for an LSTM to capture, without the need for extra information from characters.

  • The character-level LSTM may be trying to produce vectors that are too large for the information characters carry (character embedding dim = 50, LSTM output dim = 300).

However, characters do help in MultiNLI

  • Possibly because the language in this dataset is more complex than that of SNLI, since it comes from more varied sources (telephone conversations, newspapers, etc.).

57 of 65

Discussion


The distributions of character- and word-level representations seem to differ greatly

  • This is expected, since the pre-trained GloVe word representations were trained on a different dataset using a different method.

Normalizing word- and character-level representations worsens classification results

  • This is probably because the norm of the GloVe embeddings encodes important information (a minimal sketch of this normalization step appears after the figure below).

[Figure: 2D PCA projections of word representations learned by different architectures. Panels: cat, normalized cat, vector gate, normalized vector gate. “words” corresponds to GloVe word embeddings, and “chars” corresponds to word representations built from character-level vectors.]
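A minimal sketch of what the “+ norm.” variants presumably do: L2-normalize the word- and character-level vectors before combining them. Exactly where normalization is applied in the pipeline is inferred from the slide titles, so treat this as an assumption.

```python
# Unit-normalize word-level and character-level vectors before combining them.
import torch
import torch.nn.functional as F

w = torch.randn(32, 300)               # batch of word-level vectors (e.g. GloVe)
c = torch.randn(32, 300)               # batch of character-level vectors
w_norm = F.normalize(w, p=2, dim=-1)   # unit-length word vectors
c_norm = F.normalize(c, p=2, dim=-1)   # unit-length character vectors
combined = torch.cat([w_norm, c_norm], dim=-1)
```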

58 of 65

Ongoing research


We could apply domain adaptation knowledge to our problem.

We want to adapt the character-level domain to the word-level domain.

Ganin et al. (2016) state that:

“for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains.”

[Figure (same as previous slide): 2D PCA projections of word representations learned by different architectures. Panels: cat, normalized cat, vector gate, normalized vector gate.]

59 of 65

Ongoing research


60 of 65

Ongoing research


61 of 65

Ongoing research


[Architecture diagram: the GloVe vector and the character-level representation are combined (Concat, Scalar Gate, or Vector Gate) → BiLSTM → Max Pooling → Linear → Softmax & Argmax → Label. A Gradient Reversal Layer branches off the word representations into a Discriminator that predicts whether each representation came from characters or words (label: char or word).]
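The gradient reversal layer from Ganin et al. (2016) can be sketched as follows in PyTorch; the scaling factor lambda and its schedule are hyperparameters the slide does not specify, and the usage lines are illustrative.

```python
# Gradient reversal layer: identity on the forward pass, reversed (and scaled)
# gradient on the backward pass, so the encoder learns features that fool the
# char/word discriminator.
import torch


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign of the gradient flowing back into the encoder.
        return -ctx.lambd * grad_output, None


def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)


# Hypothetical usage:
#   reversed_features = grad_reverse(word_representations)
#   discriminator_logits = discriminator(reversed_features)  # char vs. word
```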

62 of 65

Ongoing research

[Figure: Concat non-adapted (accuracy 0.8507) vs. Concat adapted (accuracy 0.853)]

63 of 65

Latest Findings


Domain-adversarial training forces the model to use character-level word representations.

Preliminary results show that adversarial training can improve results on the NLI downstream task for some models.

Domain-adversarial training is more difficult to achieve when gating mechanisms are present.

64 of 65

Future Research Directions


  • Use known tricks to better optimize the adversarial domain classification.

  • Evaluate representations on word-based tasks (as opposed to sentence-evaluation tasks).

  • Build a theory around what I have been doing. For example, training gating mechanisms corresponds to learning probability distributions conditioned on both words and characters.

Goal: submit results to NAACL (December 10, 2018)

65 of 65

Thank you
