1 of 18

L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi

2 of 18

Agenda

  • Introduction
  • Motivation
  • Techniques Explored
  • XL-Sum Dataset
  • Methodology
  • Metrics Used
  • XL-Sum vs MahaSum
  • Conclusion

3 of 18

Introduction

  • A comprehensive collection of more than 25,000 diverse Marathi news articles.
  • Trained an IndicBART model optimized for Indic languages.
  • Publicly Available:

Both the dataset and models are available for public use [GitHub: L3Cube-Pune/MarathiNLP].

4 of 18

Motivation

Limited Resources: Indic languages like Marathi lack sufficient datasets and NLP models.

Need for Domain-Specific Data: Essential for capturing linguistic nuances and advancing Marathi NLP.

Boosting Research: Promote tools, applications, and collaboration for Indic languages.

5 of 18

Techniques Explored

  • Singular Value Decomposition (SVD): Used for dimensionality reduction and summarization in earlier statistical approaches (see the sketch after this list).
  • BART Model: A transformer-based encoder-decoder model, effective for abstractive summarization tasks.
  • T5 Model: A pre-trained text-to-text transformer that excels at summarization by reframing NLP tasks as text generation.
  • Graph-Based Approaches: Use graph representations of text to identify key sentences and summarize content.
  • Neural Networks (Bi-Directional LSTM): Recurrent architectures that capture contextual information in both forward and backward directions for text summarization.
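
A minimal sketch of the SVD technique above, in the spirit of Gong & Liu's LSA summarizer: build a term-sentence matrix, decompose it, and keep the strongest sentence per latent concept. The whitespace tokenization and raw term counts are simplifying assumptions for illustration, not details from the works surveyed here.

    # LSA-style extractive summarization sketch (assumes whitespace tokens).
    import numpy as np

    def svd_summarize(sentences, k=2):
        # Vocabulary over all sentences.
        vocab = sorted({w for s in sentences for w in s.split()})
        index = {w: i for i, w in enumerate(vocab)}

        # Term-sentence matrix A: A[i, j] = count of term i in sentence j.
        A = np.zeros((len(vocab), len(sentences)))
        for j, s in enumerate(sentences):
            for w in s.split():
                A[index[w], j] += 1

        # Rows of Vt are latent concepts expressed over the sentences.
        _, _, Vt = np.linalg.svd(A, full_matrices=False)

        # For each of the top-k concepts, keep its highest-scoring sentence.
        picked = {int(np.argmax(np.abs(Vt[c]))) for c in range(min(k, len(Vt)))}
        return [sentences[i] for i in sorted(picked)]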

6 of 18

Methodology

7 of 18

8 of 18

Data Acquisition: XL-Sum Marathi

  • Large-scale multilingual dataset with over 1 million article-summary pairs across 44 languages.
  • Includes 10,903 pairs for Marathi sourced from BBC Marathi.
  • Summaries are abstractive, concise, and capture key points effectively.
  • Extensively human-evaluated for high quality.
  • Used in our research to benchmark IndicBART performance for Marathi summarization.
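
A minimal loading sketch, assuming XL-Sum's public release on the Hugging Face Hub under csebuetnlp/xlsum with a marathi config (the dataset ID and field names follow that release, not these slides):

    # Load the Marathi split of XL-Sum from the Hugging Face Hub.
    from datasets import load_dataset

    xlsum_mr = load_dataset("csebuetnlp/xlsum", "marathi")
    print(xlsum_mr)                          # train/validation/test splits
    print(xlsum_mr["train"][0]["summary"])   # one reference summary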

9 of 18

Data Acquisition: MahaSum

  • Manually curated from prominent Marathi news portals, including Lokmat and Loksatta.
  • Comprises 25,374 news articles, each with a headline, a concise summary, and the full article text.

10 of 18

Preprocessing

[Workflow diagram]

11 of 18

Tokenization

Used the IndicBART tokenizer, tailored for Marathi text.

Included:

  • Special tokens for Marathi.
  • Padding tokens for uniform sequence lengths.
  • BOS and EOS tokens for sequence demarcation.

Data split:

  • 80% Training
  • 20% Evaluation
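
A minimal sketch of this step, assuming the public ai4bharat/IndicBART conventions (the <2mr> language tag and explicit </s> markers follow that model card, and the sequence lengths here are illustrative, not taken from the slides):

    # Tokenize article/summary pairs for IndicBART and split 80/20.
    from transformers import AlbertTokenizer

    tokenizer = AlbertTokenizer.from_pretrained(
        "ai4bharat/IndicBART", do_lower_case=False, use_fast=False, keep_accents=True
    )

    def encode_pair(article, summary, max_in=512, max_out=128):
        # IndicBART appends the language tag after the end-of-sequence token.
        enc = tokenizer(article + " </s> <2mr>", add_special_tokens=False,
                        max_length=max_in, truncation=True, padding="max_length")
        dec = tokenizer("<2mr> " + summary + " </s>", add_special_tokens=False,
                        max_length=max_out, truncation=True, padding="max_length")
        enc["labels"] = dec["input_ids"]  # real training masks pad labels as -100
        return enc

    # 80% training / 20% evaluation, as described above.
    # encoded = [encode_pair(a, s) for a, s in pairs]
    # cut = int(0.8 * len(encoded))
    # train_set, eval_set = encoded[:cut], encoded[cut:]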

12 of 18

Model Training

  • Models trained using Seq2SeqTrainer with a batch size of 4 and 3 epochs.
  • Incorporated 500 warm-up steps for stabilized learning.
  • Progress tracked via logging (every 100 steps) and checkpoints (every 1000 steps).
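
A sketch of this configuration with the Hugging Face Seq2SeqTrainer; the batch size, epochs, warm-up, logging, and checkpoint intervals come from the slide, while the model ID and output directory are illustrative assumptions:

    from transformers import (AutoModelForSeq2SeqLM, Seq2SeqTrainer,
                              Seq2SeqTrainingArguments)

    model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/IndicBART")

    args = Seq2SeqTrainingArguments(
        output_dir="./indicbart-mahasum",   # assumed path
        per_device_train_batch_size=4,      # batch size of 4
        num_train_epochs=3,                 # 3 epochs
        warmup_steps=500,                   # 500 warm-up steps
        logging_steps=100,                  # log every 100 steps
        save_steps=1000,                    # checkpoint every 1000 steps
    )

    trainer = Seq2SeqTrainer(model=model, args=args,
                             train_dataset=train_set,  # from the tokenization step
                             eval_dataset=eval_set)
    trainer.train()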

13 of 18

Metrics Used

ROUGE is primarily recall-based and was originally designed for evaluating text summarization.

  • ROUGE-1: The overlap of unigrams (single words) between the generated summary and the reference summary.
  • ROUGE-2: The overlap of bigrams (two consecutive words) between the generated summary and the reference summary.
  • ROUGE-L: The Longest Common Subsequence (LCS) between the generated summary and the reference summary.
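
A self-contained sketch of what the three scores measure, using whitespace tokens and toy strings (note: the stock English ROUGE tokenizer drops Devanagari, so real Marathi evaluation needs a multilingual ROUGE variant such as the one released with XL-Sum):

    # Recall-oriented ROUGE-1/2/L over whitespace tokens.
    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def ngram_recall(cand, ref, n):
        # Fraction of reference n-grams also present in the candidate (clipped).
        c, r = ngrams(cand, n), ngrams(ref, n)
        hits = sum(min(c.count(g), r.count(g)) for g in set(r))
        return hits / len(r) if r else 0.0

    def lcs_len(a, b):
        # Dynamic-programming longest-common-subsequence length.
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a):
            for j, y in enumerate(b):
                dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
        return dp[-1][-1]

    # Toy example; real use compares model output to the gold summary.
    cand = "heavy rain lashed mumbai and local trains were delayed".split()
    ref = "heavy rain in mumbai delayed local trains".split()

    rouge1 = ngram_recall(cand, ref, 1)     # unigram recall
    rouge2 = ngram_recall(cand, ref, 2)     # bigram recall
    rougeL = lcs_len(cand, ref) / len(ref)  # LCS-based recall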

14 of 18

Results

15 of 18

16 of 18

XL-Sum vs MahaSum

17 of 18

Conclusion

  • MahaSum Dataset: The first gold-standard dataset for abstractive summarization in Marathi, with 25,374 well-curated news articles from diverse sources.
  • IndicBART Model: Demonstrated significant performance improvements when trained on high-quality, language-specific datasets.
  • Future Research Implications: Curated, large-scale datasets are essential for advancing NLP in low-resource languages, paving the way for better summarization models and broader real-world applications.

18 of 18

Thank You!