OUTLINEGEN: MULTILINGUAL OUTLINE GENERATION FOR ENCYCLOPEDIC TEXT IN LOW RESOURCE LANGUAGES
BY: SHIVANSH S, DHAVAL TAUNK, MANISH GUPTA, VASUDEVA VARMA
INFORMATION RETRIEVAL AND EXTRACTION LAB, IIIT HYDERABAD
MOTIVATION
In the current information age, there is an abundance of content and web pages available in English. However, communities speaking low-resource languages struggle for representation on the web.
Our goal is to bridge this gap via automatic content generation. To achieve this, we focus primarily on Wikipedia, since it serves as an important source of free, reliable, and neutral information.
WIKIOUTLINES
Dataset format: the article title and the language & domain form the source; the article outline is the target.
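A minimal sketch of what a single WikiOutlines record could look like under this source/target split; the field names, values, and the <sep> delimiter are illustrative assumptions, not the dataset's actual schema.

# Hypothetical WikiOutlines record; field names and values are
# illustrative assumptions, not the dataset's actual schema.
record = {
    "title": "Infosys",        # article title (part of the source)
    "language": "en",          # language code (part of the source)
    "domain": "companies",     # Wikipedia domain (part of the source)
    "outline": [               # ordered section titles (the target)
        "History",
        "Products and services",
        "Awards",
        "References",
    ],
}

# Source = title + language + domain; target = the outline as one sequence.
source = f"{record['title']} | {record['language']} | {record['domain']}"
target = " <sep> ".join(record["outline"])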
WHY A NEW DATASET?
Comparison between the current and previous datasets.
Diversity within outlines: percentage of outlines that are identical to the most popular outline, across many languages and domains.
WIKIOUTLINES STATS
Distribution of the number of sections, language-wise and domain-wise.
Total number of articles in the dataset.
OUTLINEGEN
We propose the task of multilingual outline generation for Wikipedia articles from minimal information, using our WikiOutlines dataset.
In this task, we experiment with generative and statistical models to find the best-performing pipeline for outline generation. We use minimal information so that humans (and automated systems) can more easily start writing Wikipedia articles.
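A minimal sketch of how the generative pipeline could be run with mT5, assuming the source is a flat "title | language | domain" string and the outline is decoded as a <sep>-separated sequence of section titles; the checkpoint, input template, and delimiter are assumptions, not the exact setup from the paper.

from transformers import AutoTokenizer, MT5ForConditionalGeneration

# Hypothetical configuration: checkpoint, input template, and <sep>
# delimiter are assumptions, not the paper's exact setup.
model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Minimal input: only the article title, language, and domain.
source = "Infosys | en | companies"
inputs = tokenizer(source, return_tensors="pt")

# After fine-tuning on WikiOutlines, the model would emit section titles
# separated by a delimiter; an off-the-shelf checkpoint will not.
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
outline = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(outline.split(" <sep> "))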
OVERVIEW
WEIGHTED FINITE STATE AUTOMATA
WFSA Example Model
WFSA for English, Companies
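A minimal sketch of the WFSA baseline, assuming the automaton is estimated from bigram counts over section-title sequences in the training outlines of one (language, domain) bucket and decoded greedily; the state space, weighting, and decoding strategy are illustrative assumptions, not the exact model from the paper.

from collections import defaultdict

# Toy training outlines for one bucket, e.g. English / Companies.
train_outlines = [
    ["History", "Products", "References"],
    ["History", "Operations", "Products", "References"],
    ["History", "Products", "Awards", "References"],
]

# WFSA transition weights: counts of consecutive section titles,
# with <s> and </s> as assumed start/end states.
weights = defaultdict(lambda: defaultdict(int))
for outline in train_outlines:
    path = ["<s>"] + outline + ["</s>"]
    for prev, nxt in zip(path, path[1:]):
        weights[prev][nxt] += 1

def decode(max_sections=8):
    """Greedily follow the highest-weight outgoing transition from each state."""
    state, outline = "<s>", []
    for _ in range(max_sections):
        if not weights[state]:
            break
        state = max(weights[state], key=weights[state].get)
        if state == "</s>":
            break
        outline.append(state)
    return outline

print(decode())  # ['History', 'Products', 'References'] on the toy counts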
GENERATIVE METHODS
METRICS USED
ROUGE-L: Overlap-based metric (longest common subsequence) that measures the correctness of the generated text with respect to the gold text.
METEOR: Improves word matching between prediction and reference by using synonyms, stemming, word-order swapping, etc.
BLEU: Measures n-gram precision of the generation with respect to the reference.
XLM-SCORE: A variation of BERTScore used to measure semantic similarity; we use XLM instead of BERT to handle multilingual outputs.
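A minimal sketch of how these four metrics could be computed with common open-source packages (rouge-score, NLTK, bert-score); the paper's exact scoring scripts and tokenization may differ, and using xlm-roberta-base as the encoder for XLM-Score is an assumption.

import nltk
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from bert_score import score as bert_score

nltk.download("wordnet", quiet=True)  # METEOR needs WordNet for synonym matching

reference = "History <sep> Products and services <sep> Awards <sep> References"
prediction = "History <sep> Products <sep> Awards <sep> References"

# ROUGE-L: longest-common-subsequence overlap with the gold outline.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print("ROUGE-L:", rouge.score(reference, prediction)["rougeL"].fmeasure)

# BLEU: n-gram precision of the generation (smoothed for short outputs).
print("BLEU:", sentence_bleu([reference.split()], prediction.split(),
                             smoothing_function=SmoothingFunction().method1))

# METEOR: unigram matching with stemming and synonyms.
print("METEOR:", meteor_score([reference.split()], prediction.split()))

# XLM-Score: BERTScore computed with an XLM-R encoder (checkpoint is an assumption).
P, R, F1 = bert_score([prediction], [reference],
                      model_type="xlm-roberta-base", num_layers=12)
print("XLM-Score F1:", F1.item())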
RESULTS
Overall results across all methods. Clearly mT5 performs the best.
EXAMPLE GENERATIONS
Generated outlines using our best-performing model (mT5).
SUMMARY
THANK YOU