L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi
Agenda
Introduction
Both the dataset and the models are available for public use. [GitHub: L3Cube-Pune/MarathiNLP]
Motivation
Limited Resources: Indic languages like Marathi lack sufficient datasets and NLP models.
Need for Domain-Specific Data: Essential for capturing linguistic nuances and advancing Marathi NLP.
Boosting Research: Promote tools, applications, and collaboration for Indic languages.
Techniques Explored
METHODOLOGY
Data Acquisition: XL-Sum Marathi
Data Acquisition: MahaSum
Preprocessing
Workflow:
Tokenization
Used the IndicBART tokenizer, tailored for Marathi text.
Included:
Data split:
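The slide does not state the split ratios used. As an illustration only, a conventional 80/10/10 train/validation/test split (hypothetical numbers, not the paper's actual split) can be sketched as:

```python
import random

def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and partition records into train/val/test sets.

    The 80/10/10 ratio is illustrative; the actual L3Cube-MahaSum
    split may differ.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Dummy article/summary pairs standing in for the real corpus
data = [{"article": f"article {i}", "summary": f"summary {i}"} for i in range(100)]
train_set, val_set, test_set = split_dataset(data)
```

A fixed random seed keeps the partition reproducible across runs, which matters when comparing models trained on the same split.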
Model training
Metric used
ROUGE scores are primarily recall-based; the metric was originally designed with text summarization in mind.
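To illustrate why ROUGE leans on recall, a minimal unigram ROUGE-1 computation (a sketch, not the official ROUGE toolkit used for reported scores) can be written as:

```python
from collections import Counter

def rouge1(reference, candidate):
    """Compute ROUGE-1 recall, precision, and F1 over unigrams.

    Minimal sketch for illustration; published results use the
    official ROUGE implementation.
    """
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Overlap counts each shared unigram up to its min frequency
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
```

Recall here measures how much of the reference summary's content the generated summary recovers, which is why ROUGE suits summarization evaluation.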
Results
XL-Sum vs. MahaSum
Conclusion
Thank You!