1 of 20

WebNLG: Natural Language Generation

Team Members:

Agnese Chiatti

Thanh Tran

Tejas Mahale

Rajeev Bhatt Ambati

1

2 of 20

Motivation

  • Preprocessing
    • Submissions are simple NMT models.
    • Their high BLEU scores show the importance of preprocessing.

  • Solutions
    • Improved Delexicalisation
    • Aggregation
    • Constituency Parsed Trees


Method             BLEU

UPF-FORGE          35.70
MELBOURNE          33.27
PKU WRITER         25.36
ADAPT              10.53
TILB-NMT           25.12
Interim Baseline    1.56

3 of 20

Motivation

  • Problem of unknown tokens.
    • Neural Network architectures cannot handle out-of-vocabulary (OOV) words.


4 of 20

Motivation

  • Misplacing of nouns.
    • Embeddings of similar words are clustered.

  • Solutions
    • Pointer Generator Networks
    • Improved Data Preparation:
      • Delexicalisation, Aggregation, grammar-based templates

5 of 20

Outline

  • Data Preprocessing
    • Improved Delexicalisation
    • Aggregation
    • Constituency Parsed Trees
  • Models
    • Attention
    • Pointer Generator Networks
  • Results
  • Conclusion


6 of 20

Improved Delexicalisation

  • Example of input:
    • TRIPLES:
      • Addiction_(journal) | publisher | Wiley-Blackwell
      • Addiction_(journal) | ISSN_number | "1360-0443"
      • Addiction_(journal) | LCCN_number | 93645978
      • Addiction_(journal) | abbreviation | "Addiction"
    • LEX:
      • The Addiction Journal is published by Wiley - Blackwell and is abbreviated to Addiction. The ISSN number is 1360 - 0443 and the LCCN number is 93645978 .


7 of 20

Improved Delexicalisation

  • Example of delexicalisation:
    • TRIPLES: ENTITY1 UNIVERSITY publisher ENTITY2 ORGANIZATION ENTITY1 UNIVERSITY issn number ENTITY3 UNK ENTITY1 UNIVERSITY lccn number ENTITY4 NUMBER ENTITY1 UNIVERSITY abbreviation ENTITY5 UNK
    • LEX: The ENTITY5 Journal is published by ENTITY2 and is abbreviated to ENTITY1 number is ENTITY3 and the LCCN number is ENTITY4.
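The delexicalisation step above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical helper names; the actual pipeline also assigns semantic type tags (UNIVERSITY, ORGANIZATION, …) and handles tokenisation differences, which are omitted here.

```python
# Minimal delexicalisation sketch (hypothetical; the real pipeline also
# attaches semantic type tags and handles tokenisation mismatches).
def delexicalise(triples, lex):
    """Replace entity strings with ENTITYk placeholders in triples and lex."""
    entity_ids = {}
    out_triples = []
    for subj, prop, obj in triples:
        for ent in (subj, obj):
            if ent not in entity_ids:
                entity_ids[ent] = f"ENTITY{len(entity_ids) + 1}"
        out_triples.append((entity_ids[subj], prop, entity_ids[obj]))
    for ent, placeholder in entity_ids.items():
        # Underscores in DBpedia entity names correspond to spaces in text
        surface = ent.strip('"').replace("_", " ")
        lex = lex.replace(surface, placeholder)
    return out_triples, lex

triples = [("Addiction_(journal)", "publisher", "Wiley-Blackwell")]
delex_triples, delex_lex = delexicalise(
    triples, "Addiction (journal) is published by Wiley-Blackwell.")
# delex_lex == "ENTITY1 is published by ENTITY2."
```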


8 of 20

Aggregation

  • Example of aggregation:
    • TRIPLES: ENTITY1 UNIVERSITY publisher ENTITY2 ORGANIZATION ENTITY1 UNIVERSITY issn number ENTITY3 UNK ENTITY1 UNIVERSITY lccn number ENTITY4 NUMBER ENTITY1 UNIVERSITY abbreviation ENTITY5 UNK
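One common form of aggregation, sketched below, groups triples that share a subject so the subject token is emitted only once in the input sequence. The exact grouping scheme used here is an assumption; type tags are omitted for brevity.

```python
# Sketch of subject-based aggregation (the exact scheme is an assumption):
# triples sharing a subject are grouped so the subject appears only once.
def aggregate(triples):
    grouped = {}  # dicts preserve insertion order in Python 3.7+
    for subj, prop, obj in triples:
        grouped.setdefault(subj, []).append((prop, obj))
    tokens = []
    for subj, pairs in grouped.items():
        tokens.append(subj)
        for prop, obj in pairs:
            tokens.extend([prop, obj])
    return " ".join(tokens)

seq = aggregate([("ENTITY1", "publisher", "ENTITY2"),
                 ("ENTITY1", "abbreviation", "ENTITY5")])
# seq == "ENTITY1 publisher ENTITY2 abbreviation ENTITY5"
```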


9 of 20

Constituency Parsed Trees

  • Inspired by observations on UPF-FORGE (Mille and Dasiopoulou, 2017)
    • But manual template generation is inefficient
    • Could it create synergy if combined with an approach similar to UMEL?

  • From sentences to constituents
  • CFG (terminal/non-terminal symbols)

  • On the other hand:
    • Longer sequences require more iterations (increased computational cost)
    • Risk of underfitting


10 of 20

Constituency Parsed Trees for WebNLG

  • On training:
    • Lex sentences are parsed into constituency trees with the Stanford CoreNLP toolkit (Manning et al., 2014)
  • On Dev/Test:
    • Incoming (delexicalised) triple sequences are POS tagged
    • WordNet-based similarity with 50 randomly picked training examples is computed (pairwise comparison)
    • We followed the Mihalcea et al. (2006) approach for similarity computation
      • Corpus-based similarity to try to capture semantic similarity
      • WordNet is queried for token-level similarity
      • Then, for each token in a sequence, find its most similar word in the other sequence
      • Scores are averaged, normalised by sequence length
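The similarity computation above can be sketched as follows. The `word_sim` function stands in for a WordNet-based token similarity (which would need a WordNet interface such as NLTK's); a toy exact-match similarity is used here so the example stays self-contained.

```python
# Sketch of the Mihalcea-style sequence similarity described above.
# `word_sim` is a stand-in for WordNet-based token similarity; a toy
# exact-match score keeps the example self-contained.
def word_sim(a, b):
    return 1.0 if a == b else 0.0

def sequence_similarity(seq_a, seq_b):
    """For each token, take its best match in the other sequence,
    average per direction (normalised by length), then symmetrise."""
    def directed(src, tgt):
        if not src:
            return 0.0
        return sum(max(word_sim(w, v) for v in tgt) for w in src) / len(src)
    return 0.5 * (directed(seq_a, seq_b) + directed(seq_b, seq_a))

score = sequence_similarity("ENTITY1 publisher ENTITY2".split(),
                            "ENTITY1 abbreviation ENTITY2".split())
# score == 2/3: two of three tokens have an exact match in each direction
```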


11 of 20

Attention Model

  • Problem formulation

  • Attention distribution
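The equations on this slide were images and did not survive extraction. In standard notation (an assumption, following Bahdanau et al., 2015 and See et al., 2017), with encoder hidden states $h_i$ and decoder state $s_t$, the attention distribution is:

```latex
% Reconstructed in standard notation (assumed; original equations were images)
e_i^t = \mathrm{score}(s_t, h_i), \qquad
a^t = \mathrm{softmax}(e^t)
```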


12 of 20

Attention Model

  • Context vector

  • Scoring function
    • Bahdanau et al.: feed-forward network

    • Luong et al.:
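The original equations here were also images. The standard forms, which these slides presumably follow, are:

```latex
% Standard forms (assumed; original equations were images)
% Context vector: attention-weighted sum of encoder states
c_t = \sum_i a_i^t h_i
% Bahdanau et al. (2015): additive scoring via a feed-forward network
\mathrm{score}(s_t, h_i) = v^\top \tanh(W_s s_t + W_h h_i)
% Luong et al. (2015): multiplicative (bilinear) scoring
\mathrm{score}(s_t, h_i) = s_t^\top W h_i
```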


13 of 20

Pointer Generator Networks

  • Generation probability

  • Total probability distribution

  • Log-likelihood
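The three quantities above are, in the standard pointer-generator formulation (See et al., 2017; assumed to match the slide, whose equations were images):

```latex
% Standard pointer-generator equations (See et al., 2017; assumed)
% Generation probability from context vector c_t, decoder state s_t, input x_t:
p_{\mathrm{gen}} = \sigma\!\left(w_c^\top c_t + w_s^\top s_t + w_x^\top x_t + b\right)
% Total distribution: mix of vocabulary and copy (attention) distributions:
P(w) = p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w)
     + (1 - p_{\mathrm{gen}}) \sum_{i:\, w_i = w} a_i^t
% Per-step negative log-likelihood of the target word w_t^*:
\mathrm{loss}_t = -\log P(w_t^*)
```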


14 of 20

Visualization

  • Interactive Visualization


15 of 20

Visualization


16 of 20

Results

With our evaluation schema (Short: < 100 characters; Long: otherwise)


17 of 20

Results

With the WebNLG Challenge evaluation script (tends to overestimate)


18 of 20

Lessons Learned

  • Delexicalisation improved performance on both sets
  • Delexicalisation + aggregation was more effective, overall, than constituency trees
  • But the constituency-based solution was the only one that performed better on longer sequences than on shorter ones
  • The attention-based solution led to our top performance and would have placed 1st on unseen classes in the challenge
  • Terms were successfully copied from input to output when using the Pointer Generator Network
  • Team distribution of work was great! ☺


19 of 20

Next Steps

  • Test the attention seq2seq model on constituency trees (Ctrees)
  • For constituency parsing:
    • average across the top-K candidates instead of picking only the most similar tree
    • consider auxiliary data for development and test instead of deriving it from the training lex

  • Fine-tune the Pointer Generator Network
  • Or combine it with the attention approach
  • Test Recursive Neural Networks on this task


20 of 20

Thank you!

Q&A?
