1 of 22

XWikiGen: Cross-lingual Summarization for Encyclopedic Text

Generation in Low Resource Languages

Dhaval Taunk | Shivprasad Sagare | Anupam Patil

Shivansh Subramanian | Manish Gupta | Vasudeva Varma

WebConf 2023

2 of 22

Wikipedia for Indian Languages

Number of Wikipedia articles per language

Log of the number of new articles created or edits made

Huge information divide. Need to increase human productivity.

3 of 22

Encyclopedic generation

Neutral Point of View

Encyclopedic writing style

Factual correctness and grounding

Encyclopedic generation is not the same as general text generation

Wikipedia and encyclopedia specific requirements

4 of 22

Solution Approach

Relevant data source

    • Generating articles requires a relevant dataset; no such dataset existed.
    • Created a dataset from Wikipedia articles by pairing each article's section text with its corresponding citation text.

Use SOTA text generation models

    • Recent state-of-the-art multilingual language models can be used for generation.
    • Generate Wikipedia-style text in Indic languages from the created dataset with the help of multilingual language models.

Generated article stubs

    • The generated article stubs can then be reviewed and edited by humans.

5 of 22

Set of reference URLs

  1. https://www.atptour.com/en/players/roger-federer/f324/bio
  2. https://www.rediff.com/sports/2005/jul/04wimb1.htm
  3. https://usatoday30.usatoday.com/sports/tennis/wimb/2005-07-03-Roddick-marvels_x.htm

Section title: <hindi> परिचय <english> Introduction <tamil> அறிமுகம்

XWikiGen

<hindi> रॉजर फ़ेडरर (जन्म 8 अगस्त 1981) एक व्यवसायिक स्विस टेनिस खिलाड़ी हैं, जिनकी वर्तमान में एटीपी वरीयता 2 है। उनके नाम 2 फ़रवरी 2004 से 17 अगस्त 2008 तक 237 हफ़्तों तक प्रथम वरीयता पर रहने का रिकॉर्ड है। फ़ेडरर को व्यापक रूप से इस युग के महानतम एकल खिलाड़ी के रूप में जाना जाता है।

<english> Roger Federer (born 8 August 1981) is a Swiss former professional tennis player. He was ranked world No. 1 by the Association of Tennis Professionals (ATP) for 310 weeks, including a record 237 consecutive weeks, and finished as the year-end No. 1 five times.

<tamil> ரோஜர் ஃபெடரர் (பிறப்பு - ஆகத்து 8, 1981) சுவிட்சர்லாந்தைச் சேர்ந்த டென்னிசு வீரர். 20 கிராண்ட் சிலாம் எனப்படும் பெருவெற்றித் தொடர்களை வென்றுள்ளார். மேலும், மொத்தம் 302 வாரங்கள் தரவரிசைப் பட்டியலில் முதல் இடம் பிடித்தவராகவும், தொடர்ச்சியாக 237 வாரங்கள் தரவரிசைப் பட்டியலில் முதலிடம் பெற்றிருந்தமையும் இவரது முக்கிய சாதனைகளுள் ஒன்றாகும்.


6 of 22

Building the dataset


Wikidata API

    • Select Wikidata entities

Wikipedia dump

    • Extract Wikipedia articles using selected entities

Wikipedia

Article

    • Extract the section text

Preprocessing

    • Retrieve the corresponding reference URLs for each section

URLs

    • Scrape the URLs to get the citation text (a code sketch of the full pipeline follows)
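A minimal sketch of this pipeline, assuming the public Wikidata SPARQL endpoint for entity selection and requests/BeautifulSoup for scraping citation pages; the property IDs, helper names, and instance schema below are illustrative assumptions, not the exact code used to build XWikiRef.

```python
# Illustrative dataset-building sketch (not the paper's exact implementation).
import requests
from bs4 import BeautifulSoup

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def select_entities(occupation_qid: str, limit: int = 100) -> list[str]:
    """Select Wikidata entity IDs with a given occupation (e.g. writers)."""
    query = f"""
    SELECT ?item WHERE {{
      ?item wdt:P106 wd:{occupation_qid} .   # P106 = occupation
    }} LIMIT {limit}
    """
    resp = requests.get(WIKIDATA_SPARQL,
                        params={"query": query, "format": "json"}, timeout=30)
    return [b["item"]["value"].rsplit("/", 1)[-1]
            for b in resp.json()["results"]["bindings"]]

def scrape_citation_text(url: str) -> str:
    """Fetch a cited reference URL and keep its visible paragraph text."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))

# One training instance per (article, section): the section's scraped reference
# texts paired with the section text extracted from the Wikipedia dump, e.g.
# {"entity": qid, "language": "hi", "domain": "writers",
#  "section_title": ..., "references": [...], "section_text": ...}
```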

7 of 22

XWikiRef: A Multilingual, Multi-document, Multi-domain Dataset

Domain

books

films

politicians

sportsmen

writers

Languages

bn

hi

ml

mr

or

pa

ta

en

XWikiRef

~69K articles

~105K section-specific summaries

8 of 22

Data Stats

  • Table showing the number of Wikipedia articles per domain per language in our dataset. A total of 68,585 articles are present in our dataset.

Domain         bn      hi      ml      mr      or      pa      ta      en    Total
Books          313     922     458      87      73     221     493    1467     4034
Films         1501    1025    2919     480     794     421    3733    1810    12683
Politicians   2006    3927    2513     988    1060    1123    4932    1628    18177
Sportsmen     5470    6334    1783    2280     319    1975    2552     919    21632
Writers       1603    2024    2251     784     498    2245    1940     714    12059
Total        10893   14232    9924    4619    2744    5985   13650    6538    68585

9 of 22

XWikiGen Pipeline

Two-stage pipeline

Cross-lingual, multi-document summarization based approach

Approximately 90% of reference text in Indic Wikipedia is in English

10 of 22

Methodology (Extractive Stage – Salience [1])

Concatenate the section title with each reference text sentence.

Pass the concatenation through a language model to get a score (cross-entropy loss).

Reverse-sort the sentences based on these scores.

Pick the top-K sentences as the output of this stage (sketched in code below).

  • We experimented with two different extractive summarization approaches: (1) Salience, (2) HipoRank.
    • The section title plays a critical role in identifying what information is available in the corresponding section text.
  • Therefore, we needed a way to integrate the section title into the extractive summarization.

[1] QA-GNN: Question Answering using Language Models and Knowledge Graphs, Yasunaga et al., NAACL 2021
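A minimal sketch of this salience scoring, assuming a HuggingFace causal language model as the scorer and negative mean cross-entropy of the title-plus-sentence string as the salience score; the model name and scoring convention are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of salience-based extraction (illustrative scorer and settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")              # assumed scorer
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def salience(section_title: str, sentence: str) -> float:
    """Score a sentence by the LM cross-entropy of 'title. sentence'."""
    enc = tokenizer(f"{section_title}. {sentence}", return_tensors="pt",
                    truncation=True, max_length=256)
    loss = model(**enc, labels=enc["input_ids"]).loss           # mean cross-entropy
    return -loss.item()                                         # higher = more salient

def extract_top_k(section_title: str, sentences: list[str], k: int = 20) -> list[str]:
    """Reverse-sort sentences by salience and keep the top-K."""
    return sorted(sentences, key=lambda s: salience(section_title, s),
                  reverse=True)[:k]
```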

11 of 22

Methodology (Extractive Stage – HipoRank [2])

  • HipoRank is an unsupervised, graph-based extractive summarization technique (a simplified code sketch follows at the end of this slide).

Directed hierarchical graph

    • Intra-section links
    • Inter-section links

Asymmetric Edge Weighting

    • Over sentences
    • Over sections

Ranking Algorithm

    • Cosine Similarity

Importance Calculation

    • Weighted sum of inter-sectional and intra-sectional centrality scores
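The following is a simplified, illustrative HipoRank-style scorer: cosine similarity over sentence embeddings, an asymmetric positional boost for sentences near section boundaries, and importance as a weighted sum of intra- and inter-sectional centrality. The embedding model, boundary function, and mixing weights are assumptions, not the original implementation.

```python
# Simplified HipoRank-style importance scoring (illustrative only).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed encoder

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def hiporank_scores(sections, alpha=1.0, beta=0.5, lam=0.7):
    """sections: list of lists of sentences; returns (sentence, score) pairs, sorted."""
    sent_embs = [encoder.encode(sents) for sents in sections]
    sec_embs = [e.mean(axis=0) for e in sent_embs]          # section centroids
    results = []
    for si, sents in enumerate(sections):
        n = len(sents)
        for i, sent in enumerate(sents):
            # Asymmetric positional boost: sentences near a section boundary
            # (start or end) receive more incoming edge weight.
            boundary = 1.0 / (1.0 + min(i, n - 1 - i))
            intra = sum(cosine(sent_embs[si][i], sent_embs[si][j])
                        for j in range(n) if j != i) * (alpha * boundary)
            inter = sum(cosine(sent_embs[si][i], sec_embs[sj])
                        for sj in range(len(sections)) if sj != si) * beta
            # Importance = weighted sum of intra- and inter-sectional centrality.
            results.append((sent, lam * intra + (1 - lam) * inter))
    return sorted(results, key=lambda x: x[1], reverse=True)
```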

12 of 22

Underlying pre-trained models for the abstractive stage

  • We experiment with both mT5 and mBART in our abstractive summarization stage to generate the Wikipedia article text (a minimal loading and generation sketch follows).

mBART

    • Multilingual variant of BART
    • Autoregressive seq2seq model
    • Seq2seq denoising auto-encoder

mT5

    • Multilingual variant of T5
    • Pre-trained on the mC4 corpus derived from Common Crawl
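A minimal sketch of the abstractive stage using mT5 from HuggingFace transformers; mBART can be swapped in via MBartForConditionalGeneration. The input serialization (section title followed by the extracted sentences) and the generation settings are illustrative assumptions and presuppose a model fine-tuned on XWikiRef.

```python
# Illustrative abstractive-stage sketch with mT5 (assumes prior fine-tuning).
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

def generate_section(section_title: str, extracted_sentences: list[str],
                     max_new_tokens: int = 256) -> str:
    """Generate section text conditioned on the title and extracted references."""
    source = section_title + " </s> " + " ".join(extracted_sentences)
    inputs = tokenizer(source, return_tensors="pt",
                       truncation=True, max_length=1024)
    output_ids = model.generate(**inputs, num_beams=4,
                                max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```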

13 of 22

Experimental Settings

Multi-lingual setting

    • Combine all languages for each domain

Multi-domain setting

    • Combine all domains for each language

Multi-lingual, multi-domain setting

    • Combine all (language, domain) pairs (a small grouping sketch follows)
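An illustrative grouping of dataset instances into the three training settings; the field names "language" and "domain" are assumed for the dataset schema.

```python
# Illustrative grouping of instances into the three experimental settings.
from collections import defaultdict

def build_settings(instances):
    multilingual = defaultdict(list)   # one training set per domain, all languages
    multidomain = defaultdict(list)    # one training set per language, all domains
    ml_md = []                         # a single training set over all pairs
    for ex in instances:
        multilingual[ex["domain"]].append(ex)
        multidomain[ex["language"]].append(ex)
        ml_md.append(ex)
    return multilingual, multidomain, ml_md
```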

14 of 22

Evaluation Metrics

ROUGE

chrF++

METEOR
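A hedged sketch of how these metrics can be computed with common open-source packages (rouge-score, sacrebleu, nltk); the paper may use different implementations and tokenization, particularly for Indic-language text.

```python
# Illustrative metric computation (not necessarily the paper's exact setup).
# METEOR needs: nltk.download("wordnet")
from rouge_score import rouge_scorer
from sacrebleu.metrics import CHRF
from nltk.translate.meteor_score import meteor_score

def evaluate(hypothesis: str, reference: str) -> dict:
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    rouge_l = rouge.score(reference, hypothesis)["rougeL"].fmeasure
    chrf_pp = CHRF(word_order=2).corpus_score([hypothesis], [[reference]]).score
    meteor = meteor_score([reference.split()], hypothesis.split())
    return {"ROUGE-L": rouge_l, "chrF++": chrf_pp, "METEOR": meteor}
```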

15 of 22

Results


Overall results for all experiment settings

16 of 22

Results


Detailed results for multi-lingual - multi-domain setting (HipoRank + mBART)

17 of 22

Example Predictions


18 of 22

Example Predictions


19 of 22

Example Predictions


20 of 22

Contributions

These models significantly reduce the manual effort required to write a Wikipedia article and thus help the community efficiently enhance Wikipedia content for low-resource languages.

The XWikiGen pipeline to generate Wikipedia articles from citations.

The XWikiRef dataset for the task of cross-lingual multi-document summarization.


Codebase:  https://github.com/DhavalTaunk08/XWikiGen

Corresponding author: Dhaval Taunk (dhaval.taunk@research.iiit.ac.in) 

21 of 22

Thank you

22 of 22


Example Figure