
OUTLINEGEN

MULTILINGUAL OUTLINE GENERATION FOR ENCYCLOPEDIC TEXT IN LOW RESOURCE LANGUAGES

BY: SHIVANSH S, DHAVAL TAUNK, MANISH GUPTA, VASUDEVA VARMA

INFORMATION RETRIEVAL AND EXTRACTION LAB, IIIT HYDERABAD


MOTIVATION

In the current information age, there is an abundance of content and web pages available in English. However, communities speaking low-resource languages struggle for representation on the web.

Our goal is to bridge this gap via automatic content generation. To achieve this, our focus is primarily on Wikipedia, since it serves as an important source of free, reliable and neutral information.



WIKIOUTLINES

  • WikiOutlines: a multilingual outline-generation dataset built from minimal information (article title, language, and domain).


Dataset schema: Source = <Article Title, Language & Domain>; Target = Article Outline.
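For concreteness, a hypothetical record in this schema (the title, domain, and section names are invented for illustration; the dataset's actual serialization may differ):

    # Hypothetical WikiOutlines record (values invented for illustration).
    record = {
        "source": {"title": "Infosys", "language": "en", "domain": "companies"},
        "target": ["History", "Products", "Acquisitions", "References"],
    }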


WHY A NEW DATASET?

Comparison between the current and previous datasets.


[Figure: percentage of outlines identical to the most popular outline, across many languages and domains, illustrating the diversity within outlines.]


WIKIOUTLINES STATS

Distribution of the number of sections, language-wise and domain-wise.


Total number of articles in the dataset.



OUTLINEGEN

We propose the task of multilingual outline generation for Wikipedia articles from minimal information, using WikiOutlines.

In this task, we experiment with generative and statistical models to find the best-performing pipeline for outline generation. We use minimal information so that humans (and automated systems) can more easily start writing Wikipedia articles.



OVERVIEW

[Figure: overview of the OutlineGen pipeline.]


WEIGHTED FINITE STATE AUTOMATA


[Figure: WFSA example model.]


[Figure: WFSA for English, Companies.]
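The slides present the automata only as figures; as a rough illustration of the idea, here is a minimal sketch of a WFSA built per (language, domain) pair, assuming states are section titles with start/end markers, edge weights are transition counts from training outlines, and decoding greedily follows the heaviest outgoing edge. The paper's exact construction and decoding (e.g., a full highest-weight path search) may differ:

    from collections import defaultdict

    START, END = "<s>", "</s>"

    def build_wfsa(outlines):
        """Count section-to-section transitions over training outlines."""
        weights = defaultdict(lambda: defaultdict(int))
        for sections in outlines:
            path = [START] + sections + [END]
            for prev, nxt in zip(path, path[1:]):
                weights[prev][nxt] += 1
        return weights

    def best_outline(weights, max_len=10):
        """Greedy decode: follow the heaviest outgoing edge from each state."""
        outline, state = [], START
        for _ in range(max_len):
            nexts = weights.get(state)
            if not nexts:
                break
            state = max(nexts, key=nexts.get)
            if state == END:
                break
            outline.append(state)
        return outline

    # Toy training outlines for one (language, domain) pair.
    train = [
        ["History", "Products", "References"],
        ["History", "Operations", "References"],
        ["History", "Products", "References"],
    ]
    print(best_outline(build_wfsa(train)))  # ['History', 'Products', 'References']

Because the decoder always follows the single heaviest path, every query for the same (language, domain) pair yields the same outline, which motivates the generative methods on the next slide.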


GENERATIVE METHODS

  • The main problem with the WFSA is that it produces a single outline per (language, domain) pair.
  • We therefore try multilingual supervised outline generation using mT5 and mBART.
  • For this, we experiment with mBART-base and mT5-large.
  • The final input to the model is <Article Title, Language, Domain>; a minimal inference sketch follows this list.
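A minimal inference sketch with Hugging Face Transformers. The serialization of <Article Title, Language, Domain> into one prompt string and the checkpoint name are assumptions; the paper's fine-tuned models and exact input format may differ:

    from transformers import MT5ForConditionalGeneration, MT5Tokenizer

    # Load a base mT5 checkpoint; in practice a checkpoint fine-tuned
    # on WikiOutlines would be used here.
    tokenizer = MT5Tokenizer.from_pretrained("google/mt5-large")
    model = MT5ForConditionalGeneration.from_pretrained("google/mt5-large")

    # Hypothetical serialization of <Article Title, Language, Domain>.
    prompt = "Infosys <sep> en <sep> companies"

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=64, num_beams=4)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))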



METRICS USED

ROUGE-L: Overlap-based metric (longest common subsequence) measuring the correctness of generated text with respect to the gold text.

METEOR: Improves word matching between prediction and reference by using synonyms, stemming, word-order swapping, etc.

BLEU: Compares n-grams between generation and reference based on n-gram precision.

XLM-SCORE: A variation of BERTScore used to measure semantic similarity; we use XLM instead of BERT for multilingual outputs.
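As an illustration, a minimal evaluation sketch for two of these metrics, assuming the rouge_score and sacrebleu packages (the paper's actual evaluation scripts are not shown in the slides):

    import sacrebleu
    from rouge_score import rouge_scorer

    prediction = "Introduction History Career Awards"
    reference = "Introduction Early life Career Awards"

    # ROUGE-L: F-score over the longest common subsequence.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    print(scorer.score(reference, prediction)["rougeL"].fmeasure)

    # BLEU: n-gram precision with a brevity penalty.
    print(sacrebleu.corpus_bleu([prediction], [[reference]]).score)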


RESULTS


Overall results across all methods; mT5 clearly performs the best.


EXAMPLE GENERATIONS


Generated outlines using our best-performing model (mT5).


SUMMARY

  • We developed two multilingual models to generate Wikipedia article outlines using minimal information.
  • We proposed a WFSA-based statistical method and a generative transformer-based method.



THANK YOU