XWikiGen: Cross-lingual Summarization for Encyclopedic Text
Generation in Low Resource Languages
Dhaval Taunk | Shivprasad Sagare | Anupam Patil
Shivansh Subramanian | Manish Gupta | Vasudeva Varma
WebConf 2023
Wikipedia for Indian Languages
[Figure: number of Wikipedia articles per language; log-scale count of new articles created or edits made]
Huge information divide across languages
Need to increase human editor productivity
Encyclopedic generation
Neutral Point of View
Encyclopedic writing style
Factual correctness and grounding
Encyclopedic generation is not the same as general text generation
Wikipedia and encyclopedia specific requirements
Solution Approach
Generated article stubs
Use SOTA Text Generation Models
Relevant data source
Generate Wikipedia-style text in Indic languages using the created dataset with the help of multilingual language models.
Generating such articles requires a relevant dataset, but no such dataset existed.
Recent state-of-the-art multilingual language models can be used for generation.
These article stubs can then be reviewed and edited by humans
Created a dataset from Wikipedia articles by pairing each article's section text with the text of its cited references.
Set of reference URLs
Section title: <hindi> परिचय | <english> Introduction | ….. | <tamil> அறிமுகம்
XWikiGen
<hindi> रॉजर फ़ेडरर (जन्म 8 अगस्त 1981) एक व्यवसायिक स्विस टेनिस खिलाड़ी हैं, जिनकी वर्तमान में एटीपी वरीयता 2 है। उनके नाम 2 फ़रवरी 2004 से 17 अगस्त 2008 तक 237 हफ़्तों तक प्रथम वरीयता पर रहने का रिकॉर्ड है। फ़ेडरर को व्यापक रूप से इस युग के महानतम एकल खिलाड़ी के रूप में जाना जाता है।
<english> Roger Federer (born 8 August 1981) is a Swiss former professional tennis player. He was ranked world No. 1 by the Association of Tennis Professionals (ATP) for 310 weeks, including a record 237 consecutive weeks, and finished as the year-end No. 1 five times.
ரோஜர் ஃபெடரர் (பிறப்பு - ஆகத்து 8, 1981) சுவிட்சர்லாந்தைச் சேர்ந்த டென்னிசு வீரர். 20 கிராண்ட் சிலாம் எனப்படும் பெருவெற்றித் தொடர்களை வென்றுள்ளார். மேலும், மொத்தம் 302 வாரங்கள் தரவரிசைப் பட்டியலில் முதல் இடம் பிடித்தவராகவும், தொடர்ச்சியாக 237 வாரங்கள் தரவரிசைப் பட்டியலில் முதலிடம் பெற்றிருந்தமையும் இவரது முக்கிய சாதனைகளுள் ஒன்றாகும்.
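The dataset pairs a section title and scraped reference text with the target section summary. A minimal sketch of what one such training instance might look like — the field names here are illustrative assumptions, not XWikiRef's actual schema:

```python
# Illustrative shape of one XWikiRef-style training instance.
# All field names are assumptions for illustration only.
example_instance = {
    "language": "hi",
    "domain": "sportsman",
    "title": "Roger Federer",
    "section_title": "परिचय",  # "Introduction"
    "references": [            # text scraped from the cited URLs
        "Roger Federer (born 8 August 1981) is a Swiss former professional tennis player.",
        "He was ranked world No. 1 by the ATP for 310 weeks.",
    ],
    "target_summary": "रॉजर फ़ेडरर (जन्म 8 अगस्त 1981) एक व्यवसायिक स्विस टेनिस खिलाड़ी हैं।",
}

def is_valid_instance(inst: dict) -> bool:
    """Basic sanity check: all fields present and references non-empty."""
    required = {"language", "domain", "title", "section_title",
                "references", "target_summary"}
    return required <= inst.keys() and len(inst["references"]) > 0
```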
Building the dataset
Wikidata API
Wikipedia dump
Wikipedia
Article
Preprocessing
URLs
XWikiRef: Multilingual, Multi-document, Multi-domain Dataset
Domain
books
films
politicians
sportsman
writers
Languages
bn
hi
ml
mr
or
pa
ta
en
XWikiRef
~69K articles
~105K section-specific summaries
Data Stats
| Domain | bn | hi | ml | mr | or | pa | ta | en | Total |
| Books | 313 | 922 | 458 | 87 | 73 | 221 | 493 | 1467 | 4034 |
| Film | 1501 | 1025 | 2919 | 480 | 794 | 421 | 3733 | 1810 | 12683 |
| Politicians | 2006 | 3927 | 2513 | 988 | 1060 | 1123 | 4932 | 1628 | 18177 |
| Sportsman | 5470 | 6334 | 1783 | 2280 | 319 | 1975 | 2552 | 919 | 21632 |
| Writers | 1603 | 2024 | 2251 | 784 | 498 | 2245 | 1940 | 714 | 12059 |
| Total | 10893 | 14232 | 9924 | 4619 | 2744 | 5985 | 13650 | 6538 | 68585 |
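The row and column totals in the stats table can be checked mechanically. A quick sanity-check sketch over the per-domain counts (columns ordered bn, hi, ml, mr, or, pa, ta, en):

```python
# Per-domain article counts from the XWikiRef stats table.
counts = {
    "books":       [313, 922, 458, 87, 73, 221, 493, 1467],
    "film":        [1501, 1025, 2919, 480, 794, 421, 3733, 1810],
    "politicians": [2006, 3927, 2513, 988, 1060, 1123, 4932, 1628],
    "sportsman":   [5470, 6334, 1783, 2280, 319, 1975, 2552, 919],
    "writers":     [1603, 2024, 2251, 784, 498, 2245, 1940, 714],
}

# Sum each domain's row, then sum the rows for the grand total.
row_totals = {domain: sum(row) for domain, row in counts.items()}
grand_total = sum(row_totals.values())
```

The computed totals match the table: e.g. books sum to 4034 and the grand total is 68585 articles.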
XWikiGen Pipeline
2 Stage pipeline
Cross-lingual - Multi-document Summarization based approach
Approximately 90% of reference text in Indic Wikipedia is in English
Methodology (Extractive Stage – Salience [1])
Concatenate the section title with each reference-text sentence.
Pass each pair through a language model to obtain a score.
Reverse-sort the sentences by score (cross-entropy loss).
Pick the top-K sentences as the output of this stage.
[1] QA-GNN: Question Answering using Language Models and Knowledge Graphs, Yasunaga et al., NAACL 2021
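The steps above can be sketched as follows. The language-model scorer is abstracted into a pluggable `score_fn` (a hypothetical name); the toy scorer below uses word overlap purely for illustration, whereas the actual stage scores with an LM's cross-entropy:

```python
from typing import Callable, List

def salience_extract(section_title: str,
                     sentences: List[str],
                     score_fn: Callable[[str], float],
                     top_k: int = 5) -> List[str]:
    """Salience-style extractive stage (simplified sketch).

    Each sentence is concatenated with the section title, scored
    (higher = more salient), reverse-sorted, and the top-K kept.
    """
    scored = [(score_fn(f"{section_title} </s> {sent}"), sent)
              for sent in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # reverse sort by score
    return [sent for _, sent in scored[:top_k]]

# Toy stand-in scorer: unigram overlap with the title. A real system
# would use a language model's (negated) cross-entropy loss here.
def toy_score(text: str) -> float:
    title, _, sent = text.partition(" </s> ")
    return len(set(title.lower().split()) & set(sent.lower().split()))
```

For example, with `section_title="Tennis career"`, sentences mentioning both title words are ranked first by the toy scorer.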
Methodology (Extractive Stage – HipoRank [2])
[2] Discourse-Aware Unsupervised Summarization for Long Scientific Documents, Dong et al., EACL 2021
Directed hierarchical graph
Asymmetric Edge Weighting
Ranking Algorithm
Importance Calculation
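The components above can be illustrated with a heavily simplified, single-section sketch: sentences form a directed graph, edge weights are asymmetric (edges pointing toward sentences near the section boundary are boosted, reflecting the discourse bias that leading sentences matter more), and importance is the weighted sum of incoming similarities. The `boundary_boost` value and bag-of-words similarity are illustrative assumptions, not the paper's exact formulation:

```python
import math
from typing import Dict, List

def _bow(sent: str) -> Dict[str, int]:
    """Bag-of-words term counts for a sentence."""
    vec: Dict[str, int] = {}
    for tok in sent.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def _cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def hiporank_importance(sentences: List[str],
                        boundary_boost: float = 0.5) -> List[float]:
    """Simplified HipoRank-style importance over one section."""
    vecs = [_bow(s) for s in sentences]
    n = len(sentences)
    scores = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            sim = _cosine(vecs[i], vecs[j])
            # Asymmetric weighting: the edge j -> i contributes more
            # when i sits closer to the section boundary (position 0).
            weight = 1.0 + boundary_boost * (1.0 - i / max(n - 1, 1))
            scores[i] += weight * sim
    return scores
```

Sentences near the start of the section that are also similar to their neighbours end up with the highest importance, mirroring HipoRank's boundary-biased ranking.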
Underlying pre-trained models for the abstractive stage
mBART
mT5
Experimental Settings
Multi-lingual setting
Multi-domain setting
Multi-lingual - Multi-domain setting
Evaluation Metrics
ROUGE
chrF++
METEOR
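To make the first metric concrete, here is a from-scratch ROUGE-1 F1 sketch (unigram overlap between candidate and reference). Actual evaluation should use standard packages (e.g. a ROUGE implementation and sacrebleu's chrF++) so numbers stay comparable across papers; this is only illustrative:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For instance, a candidate covering two of three reference unigrams with no spurious words scores F1 = 0.8.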
Results
Overall results for all experiment settings
Detailed results for multi-lingual - multi-domain setting (HipoRank + mBART)
Example Predictions
Contributions
These models significantly reduce the manual effort required to write a Wikipedia article, helping the community efficiently grow Wikipedia content for low-resource languages.
The XWikiGen pipeline to generate Wikipedia articles from citations.
The XWikiRef dataset for the task of cross-lingual multi-document summarization.
Codebase: https://github.com/DhavalTaunk08/XWikiGen
Corresponding author: Dhaval Taunk (dhaval.taunk@research.iiit.ac.in)
Thank you