OUTLINEGEN: MULTILINGUAL OUTLINE GENERATION FOR ENCYCLOPEDIC TEXT IN LOW RESOURCE LANGUAGES
BY: SHIVANSH S, DHAVAL TAUNK, MANISH GUPTA, VASUDEVA VARMA
INFORMATION RETRIEVAL AND EXTRACTION LAB, IIIT HYDERABAD
MOTIVATION
In the current information age, there is an abundance of content and web pages available in English. However, communities speaking low-resource languages struggle for representation on the web.
Our goal is to bridge this gap via automatic content generation. To achieve this, we focus primarily on Wikipedia, since it serves as an important source of free, reliable, and neutral information.
WIKIOUTLINES
Dataset format: the article title and the language & domain form the source; the article outline is the target.
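A minimal sketch of what a single WikiOutlines record could look like under this source/target split; the field names, values, and the <sep> delimiter are illustrative assumptions, not the dataset's actual schema.

# Hypothetical WikiOutlines record; field names and values are
# illustrative assumptions, not the dataset's actual schema.
record = {
    "title": "Infosys",        # article title (part of the source)
    "language": "en",          # language code (part of the source)
    "domain": "companies",     # Wikipedia domain (part of the source)
    "outline": [               # ordered section titles (the target)
        "History",
        "Products and services",
        "Awards",
        "References",
    ],
}

# Source = title + language + domain; target = the outline as one sequence.
source = f"{record['title']} | {record['language']} | {record['domain']}"
target = " <sep> ".join(record["outline"])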
WHY A NEW DATASET?
Comparison between the current and previous datasets.
Diversity within outlines: percentage of outlines that are identical to the most popular outline, across many languages and domains.
WIKIOUTLINES STATS
Distribution of the number of sections, language-wise and domain-wise.
Total number of articles in the dataset.
OUTLINEGEN
We propose the task of multilingual outline generation for Wikipedia articles from minimal information, using our WikiOutlines dataset.
In this task, we experiment with generative and statistical models to find the best-performing pipeline for outline generation. We use minimal information so that humans (and automated systems) can more easily start writing Wikipedia articles.
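A minimal sketch of how the generative pipeline could be run with mT5, assuming the source is a flat "title | language | domain" string and the outline is decoded as a <sep>-separated sequence of section titles; the checkpoint, input template, and delimiter are assumptions, not the exact setup from the paper.

from transformers import AutoTokenizer, MT5ForConditionalGeneration

# Hypothetical configuration: checkpoint, input template, and <sep>
# delimiter are assumptions, not the paper's exact setup.
model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Minimal input: only the article title, language, and domain.
source = "Infosys | en | companies"
inputs = tokenizer(source, return_tensors="pt")

# After fine-tuning on WikiOutlines, the model would emit section titles
# separated by a delimiter; an off-the-shelf checkpoint will not.
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
outline = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(outline.split(" <sep> "))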
OVERVIEW
WEIGHTED FINITE STATE AUTOMATA
WFSA Example Model
WFSA for English, Companies
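A minimal sketch of the WFSA baseline, assuming the automaton is estimated from bigram counts over section-title sequences in the training outlines of one (language, domain) bucket and decoded greedily; the state space, weighting, and decoding strategy are illustrative assumptions, not the exact model from the paper.

from collections import defaultdict

# Toy training outlines for one bucket, e.g. English / Companies.
train_outlines = [
    ["History", "Products", "References"],
    ["History", "Operations", "Products", "References"],
    ["History", "Products", "Awards", "References"],
]

# WFSA transition weights: counts of consecutive section titles,
# with <s> and </s> as assumed start/end states.
weights = defaultdict(lambda: defaultdict(int))
for outline in train_outlines:
    path = ["<s>"] + outline + ["</s>"]
    for prev, nxt in zip(path, path[1:]):
        weights[prev][nxt] += 1

def decode(max_sections=8):
    """Greedily follow the highest-weight outgoing transition from each state."""
    state, outline = "<s>", []
    for _ in range(max_sections):
        if not weights[state]:
            break
        state = max(weights[state], key=weights[state].get)
        if state == "</s>":
            break
        outline.append(state)
    return outline

print(decode())  # ['History', 'Products', 'References'] on the toy counts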
GENERATIVE METHODS
METRICS USED
ROUGE-L: Overlap-based metric (longest common subsequence) that measures the correctness of the generated text with respect to the gold text.
METEOR: Improves word matching between prediction and reference by using synonyms, stemming, word-order swapping, etc.
BLEU: Measures n-gram precision of the generation with respect to the reference.
XLM-SCORE: A variation of BERTScore used to measure semantic similarity; we use XLM instead of BERT to handle multilingual outputs.
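A minimal sketch of how these four metrics could be computed with common open-source packages (rouge-score, NLTK, bert-score); the paper's exact scoring scripts and tokenization may differ, and using xlm-roberta-base as the encoder for XLM-Score is an assumption.

import nltk
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from bert_score import score as bert_score

nltk.download("wordnet", quiet=True)  # METEOR needs WordNet for synonym matching

reference = "History <sep> Products and services <sep> Awards <sep> References"
prediction = "History <sep> Products <sep> Awards <sep> References"

# ROUGE-L: longest-common-subsequence overlap with the gold outline.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print("ROUGE-L:", rouge.score(reference, prediction)["rougeL"].fmeasure)

# BLEU: n-gram precision of the generation (smoothed for short outputs).
print("BLEU:", sentence_bleu([reference.split()], prediction.split(),
                             smoothing_function=SmoothingFunction().method1))

# METEOR: unigram matching with stemming and synonyms.
print("METEOR:", meteor_score([reference.split()], prediction.split()))

# XLM-Score: BERTScore computed with an XLM-R encoder (checkpoint is an assumption).
P, R, F1 = bert_score([prediction], [reference],
                      model_type="xlm-roberta-base", num_layers=12)
print("XLM-Score F1:", F1.item())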
RESULTS
Overall results across all methods. Clearly mT5 performs the best.
EXAMPLE GENERATIONS
Generated outlines using our best-performing model (mT5).
SUMMARY
THANK YOU