Svetla Koeva (1), Ivelina Stoyanova (2), Jordan Kralev (3)
(1, 2) Institute for Bulgarian Language, Bulgarian Academy of Sciences
(3) IBL – Bulgarian Academy of Sciences / Technical University of Sofia
svetla@dcl.bas.bg | iva@dcl.bas.bg | jkralev@dcl.bas.bg
IfGPT: A Dataset in Bulgarian for Large Language Models
Advancing NLP for Low-Resource Languages LowResNLP 2025 @ RANLP 2025
International Jubilee Conference of the Institute for Bulgarian Language
LowResNLP 2025 | 13 September 2025
Motivation
Developing datasets for LLMs is a major challenge for languages with limited resources:
Objective
The main objective is to compile IfGPT, a large dataset for Bulgarian that combines existing corpora and datasets with newly compiled ones, and to ensure that the texts are clean, deduplicated, of high quality, and supplied with extensive metadata.
The aim is to spare different users from compiling the same datasets and repeating the same cleaning effort, and to make the data easy to reuse for different application tasks.
Existing large datasets
There are many large and widely used text databases:
However:
Existing datasets of Bulgarian
Compiling new datasets of Bulgarian
There are additional sources for datasets in Bulgarian:
Compilation of new datasets:
Improving the quality of IfGPT: Deduplication and cleaning up
Removing duplicate texts from the dataset improves the performance of LLMs. A two-step procedure is applied:
Improving the quality of the texts:
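The two deduplication steps are not spelled out here. As one possible reading, the following is a minimal sketch that assumes the first step drops exact duplicates by hashing normalized text and the second drops near-duplicates by Jaccard similarity over word shingles. The function names, the shingle size of 3 and the 0.8 threshold are illustrative assumptions, not the IfGPT pipeline.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts, threshold=0.8, n=3):
    """Keep the first occurrence of each text, in input order."""
    # Step 1 (assumed): exact duplicates, detected by a hash of the normalized text.
    seen_hashes = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique.append(text)

    # Step 2 (assumed): near-duplicates, detected by pairwise shingle overlap.
    # This comparison is quadratic; a large corpus would need an index such as MinHash.
    kept, kept_shingles = [], []
    for text in unique:
        candidate = shingles(text, n)
        if all(jaccard(candidate, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(candidate)
    return kept
```

Keeping the first occurrence makes the result deterministic for a fixed input order. The threshold would have to be tuned on real Bulgarian data.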
Improving the quality of IfGPT: PII and Bias information
Personally identifiable information is identified and handled as follows:
Bias information is treated in a similar way:
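How PII is actually handled is not detailed here. As a rough illustration only, the sketch below assumes a rule-based pass that masks e-mail addresses and Bulgarian-style phone numbers and records the outcome in a boolean `PIInformation` flag. The regular expressions, the placeholder tokens and the `pii_types` key are hypothetical, and the boolean type of `PIInformation` is an assumption.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# "+359" or a leading "0", then 8 or 9 digits, optionally separated by spaces or hyphens.
PHONE_RE = re.compile(r"(?<!\d)(?:\+359|0)(?:[ -]?\d){8,9}(?!\d)")


def mask_pii(text):
    """Replace detected PII with placeholders and return (masked_text, metadata)."""
    found = []

    def masker(kind):
        def replace(match):
            found.append(kind)
            return f"[{kind}]"
        return replace

    masked = EMAIL_RE.sub(masker("EMAIL"), text)
    masked = PHONE_RE.sub(masker("PHONE"), masked)
    metadata = {
        "PIInformation": bool(found),      # assumed to be a boolean flag
        "pii_types": sorted(set(found)),   # hypothetical helper field
    }
    return masked, metadata
```

A comparable pass for biased content would need lexicons or a classifier rather than regular expressions. Only the flagging of the result in the metadata would look alike.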
IfGPT: Current structure
Current structure of the IfGPT dataset:
Source | # texts | # tokens | License |
MARCELL | 25K | 45M | Public domain |
CURLICAT | 113K | 35M | Creative Commons (CC) |
BulNC Administrative | 17K | 79M | Public domain |
BulNC Wikipedia | 89K | 41M | CC / GNU |
BulNC Subtitles | 146K | 27M | OPUS |
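For a quick sense of scale, the rounded figures in the rows listed above can be summed. This is a small sketch over those five rows only; it assumes the K and M suffixes mean thousands and millions, and the totals inherit the rounding of the table.

```python
# Rows copied from the table: (source, number of texts, number of tokens).
ROWS = [
    ("MARCELL", "25K", "45M"),
    ("CURLICAT", "113K", "35M"),
    ("BulNC Administrative", "17K", "79M"),
    ("BulNC Wikipedia", "89K", "41M"),
    ("BulNC Subtitles", "146K", "27M"),
]

SUFFIX = {"K": 1_000, "M": 1_000_000}


def parse_count(value):
    """Convert a rounded figure such as '25K' or '45M' to an integer."""
    return int(value[:-1]) * SUFFIX[value[-1]]


def summarize(rows):
    """Return (total texts, total tokens) over the given rows."""
    texts = sum(parse_count(n_texts) for _, n_texts, _ in rows)
    tokens = sum(parse_count(n_tokens) for _, _, n_tokens in rows)
    return texts, tokens


total_texts, total_tokens = summarize(ROWS)
print(total_texts, total_tokens)  # 390000 227000000
```

Over these rows the dataset holds roughly 390K texts and 227M tokens.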
IfGPT: Metadata management
The metadata are stored in a Neo4j graph database with a schema capturing the key metadata entries and their connections.
DocumentNode: Identifier, Licence, PublicationDate, DocumentTitle, Source, Medium, Url, Domain, Keywords, NumberWords, NumberSentences, NumberTokens, PIInformation, BiasedInformation, Author, Style, Type, Subdomain, TranslatedDocument, LicenseLink, ParagraphNumber, TaskCategories
DomainNode: Name, ParentCategory
SourceNode: Name, Url
LicenceNode: Type
Relations:
Document – Domain: BELONGS_TO
Domain – Domain: SUBCATEGORY_OF
Document – Licence: LICENSED_WITH
Document – Author: WRITTEN_BY
Document – Source: PUBLISHED_IN
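To illustrate how such a schema could be queried, here is a hedged sketch that builds a parameterized Cypher query for all documents in a domain or any of its subdomains, optionally restricted to one licence type. It assumes the node labels are exactly `DocumentNode`, `DomainNode` and `LicenceNode`, and that relations point from the first-named node to the second (for example from document to domain). The function name and the sample values are hypothetical.

```python
def build_domain_query(domain, licence_type=None):
    """Return (cypher, params) selecting documents under a domain and its subdomains.

    Assumed, not confirmed by the schema description: the node labels and the
    direction of BELONGS_TO, SUBCATEGORY_OF and LICENSED_WITH.
    """
    lines = [
        # "*0.." follows zero or more SUBCATEGORY_OF hops, so the domain itself
        # and every nested subdomain are matched.
        "MATCH (d:DocumentNode)-[:BELONGS_TO]->(:DomainNode)"
        "-[:SUBCATEGORY_OF*0..]->(top:DomainNode {Name: $domain})",
    ]
    params = {"domain": domain}
    if licence_type is not None:
        lines.append("MATCH (d)-[:LICENSED_WITH]->(l:LicenceNode {Type: $licence})")
        params["licence"] = licence_type
    lines.append(
        "RETURN d.Identifier AS id, d.DocumentTitle AS title, d.NumberTokens AS tokens"
    )
    return "\n".join(lines), params


# Hypothetical values: "Law" as a domain name, "Public domain" as a licence type.
query, params = build_domain_query("Law", "Public domain")
print(query)
```

Passing values as parameters rather than formatting them into the query string avoids quoting problems and lets the database reuse the query plan. With the official Neo4j Python driver, the pair could be handed to `session.run(query, params)`.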
IfGPT: Online search interface
IfGPT: Future development
The future development of the IfGPT dataset includes:
IfGPT is designed as a large dataset with rich metadata, clearly defined tasks and thematic domains, so that suitable documents can be searched and retrieved efficiently.
It supports fast and efficient fine-tuning of LLMs and retrieval-augmented generation (RAG).
Acknowledgments