IfGPT: A Dataset in Bulgarian for Large Language Models

Svetla Koeva1, Ivelina Stoyanova2, Jordan Kralev3

1, 2: Institute for Bulgarian Language, Bulgarian Academy of Sciences
3: IBL – Bulgarian Academy of Sciences / Technical University of Sofia

svetla@dcl.bas.bg | iva@dcl.bas.bg | jkralev@dcl.bas.bg

Advancing NLP for Low-Resource Languages (LowResNLP 2025) @ RANLP 2025

International Jubilee Conference of the Institute for Bulgarian Language

LowResNLP 2025 | 13 September 2025

Developing datasets for LLMs is a major challenge for languages with limited resources:

  • Data scarcity – there are few sources for compiling large datasets for pre-training and fine-tuning LLMs;
  • Copyright restrictions – difficult to find datasets that do not raise copyright issues;
  • Quality of the data – freely accessible data is often noisy and inhomogeneous; procedures for data cleansing and selecting only high-quality texts further limit the scope of the data.

Motivation


The main objective is to compile a large dataset for Bulgarian, IfGPT, combining existing corpora and datasets with newly compiled ones, ensuring that the texts are clean, deduplicated and of high quality, and supplied with extensive metadata.

The aim is to avoid redundant compilation of datasets by different users, to eliminate the duplicated effort of cleansing the data, and to facilitate reuse of the data for different application tasks.

Objective


There are many large and widely used text databases:

  • CommonCrawl – raw web-page data and metadata; massive quantity of data but low quality; datasets derived from CommonCrawl include:
    • OSCAR (Open Super-large Crawled Aggregated coRpus) – large multilingual corpus created by language classification and filtering of the CommonCrawl dataset.
    • C4 and mC4 – created with heuristic methods that filter out non-linguistic content, and extensively deduplicated.
    • CC100 – provides monolingual data for more than 100 languages.
  • Pile – an 825 GB English text corpus developed for LLM training.
  • MassiveText – a collection of large English language text datasets from various sources, including websites, books, news articles and code.

Existing large datasets


However:

  • They rarely include Bulgarian data.
  • Where Bulgarian is present in multilingual datasets, it makes up only a very small proportion of the data.
  • Most of the existing datasets have already been used for LLM pretraining.

Existing large datasets


  • Bulgarian National Corpus (BulNC, 420 mln. tokens) contains a wide range of texts of different sizes, styles, time periods (synchronic and diachronic) and licences. Each text in the collection is labelled with metadata.
  • General News in Bulgarian (600 mln. tokens) contains news from different thematic domains. The news items and their metadata were collected automatically from various (mainly Bulgarian) Internet sources, approx. 2 mln. web pages.

Existing datasets of Bulgarian


  • Bulgarian CURLICAT – Curated Multilingual Language Resources for CEF.AT (35 mln. tokens) consists of texts from various sources divided into seven thematic domains: Culture, Education, European Union, Finance, Politics, Economy and Science.
  • Bulgarian MARCELL – Multilingual resources for CEF.AT in the legal domain (45 mln. tokens) consists of legislative documents extracted from the Bulgarian State Gazette, documents from official institutions such as the government, the Bulgarian National Assembly, the Constitutional Court, etc.

Existing datasets of Bulgarian


There are additional sources for datasets in Bulgarian:

  • ELG,
  • CLARIN,
  • GitHub,
  • HuggingFace, etc.

Compilation of new datasets:

  • Public administrative and governmental data,
  • Websites and technical documentation,
  • Media websites,
  • Open science portal.

Compiling new datasets of Bulgarian


Removing duplicate texts from the dataset improves the performance of LLMs. A two-step procedure is applied:

  • Prefiltering based on metadata – year of publication, source, etc.
  • Main deduplication based on MinHash combined with Locality-Sensitive Hashing (LSH).

Improving the quality of the texts:

  • Removing boilerplate,
  • Removing web elements (navigation, formatting, etc.),
  • Converting all sources (PDF, word-processing documents, etc.) into plain text; OCR-ed texts are avoided due to their lower quality.
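The main deduplication step can be illustrated with a self-contained sketch. This is not the production pipeline (which may rely on an off-the-shelf MinHash/LSH implementation); the shingle size, signature length and band layout below are illustrative assumptions:

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64  # MinHash signature length (illustrative choice)
BANDS = 32       # LSH bands; 64 / 32 = 2 signature rows per band

def shingles(text, k=3):
    """Word k-grams: the units whose overlap MinHash estimates."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text):
    """For each of NUM_HASHES seeded hash functions, keep the minimum
    hash value over all shingles of the text."""
    shingle_set = shingles(text)
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(NUM_HASHES)
    ]

def candidate_duplicates(docs):
    """LSH: documents whose signatures agree on at least one band become
    candidate duplicate pairs, to be verified and pruned afterwards."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        for b in range(BANDS):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    return {tuple(sorted(ids)) for ids in buckets.values() if len(ids) > 1}
```

Near-identical texts (e.g. a news item reposted by several outlets) agree on most signature rows and collide in at least one band, while unrelated documents almost never do.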

Improving the quality of IfGPT: Deduplication and cleaning up


Personally identifiable information is identified and handled as follows:

  • MAPA anonymisation package for Bulgarian.
  • Naive rule-based methods detect sentences that contain potentially sensitive information.
  • The number of sentences with such information is counted and the output is the proportion of these sentences in the text.

Bias information is treated in a similar way:

  • Potentially biased or abusive sentences are identified using lexical resources and rule-based methods.
  • The number of sentences containing potential bias is counted and the output is the proportion of these sentences in the text.
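The proportion-based output can be sketched as follows. The actual pipeline uses the MAPA anonymisation package and curated lexical resources; the regular expressions below are hypothetical stand-ins for illustration, as is the naive sentence splitter:

```python
import re

# Hypothetical rule set for illustration; the real pipeline relies on the
# MAPA anonymisation package and richer lexical resources for Bulgarian.
PII_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),   # e-mail address
    re.compile(r"\+?\d[\d\s-]{7,}\d"),            # phone-like digit run
    re.compile(r"\b\d{10}\b"),                    # 10-digit personal number
]

def flagged_proportion(text, patterns=PII_PATTERNS):
    """Share of sentences matching at least one pattern; a naive split on
    sentence-final punctuation stands in for a proper sentence splitter."""
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    if not sentences:
        return 0.0
    flagged = sum(1 for s in sentences if any(p.search(s) for p in patterns))
    return flagged / len(sentences)
```

The same scheme works for bias scoring by swapping the pattern list for a lexicon of potentially biased or abusive expressions.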

Improving the quality of IfGPT: PII and Bias information


Current structure of the IfGPT dataset:

IfGPT: Current structure

Source                 # texts   # tokens   License
MARCELL                25K       45M        Public domain
CURLICAT               113K      35M        Creative Commons (CC)
BulNC Administrative   17K       79M        Public domain
BulNC Wikipedia        89K       41M        CC / GNU
BulNC Subtitles        146K      27M        OPUS


The metadata are stored in a Neo4j graph database with a schema capturing the key metadata entries and their connections.

  • Document nodes with properties describing the source and properties of the text.
  • Author nodes providing details of the authors, biography, etc.
  • Domain nodes defining a shallow hierarchical structure of domains.
  • Licence nodes defining the licence used.
  • Source nodes providing the name and URL of the source.

IfGPT: Metadata management


IfGPT: Metadata management

DocumentNode – properties: Identifier, Licence, PublicationDate, DocumentTitle, Source, Medium, Url, Domain, Keywords, NumberWords, NumberSentences, NumberTokens, PIInformation, BiasedInformation, Author, Style, Type, Subdomain, TranslatedDocument, LicenseLink, ParagraphNumber, TaskCategories
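The property list above maps naturally onto a record type. A minimal sketch in Python, with field names taken from the slide; the types and defaults are assumptions, since the slide does not specify them:

```python
from dataclasses import dataclass, field

@dataclass
class DocumentNode:
    """Document metadata record; field names follow the IfGPT schema,
    types and defaults are illustrative assumptions."""
    Identifier: str
    Licence: str = ""
    LicenseLink: str = ""
    PublicationDate: str = ""
    DocumentTitle: str = ""
    Source: str = ""
    Medium: str = ""
    Url: str = ""
    Domain: str = ""
    Subdomain: str = ""
    Keywords: list = field(default_factory=list)
    NumberWords: int = 0
    NumberSentences: int = 0
    NumberTokens: int = 0
    ParagraphNumber: int = 0
    PIInformation: float = 0.0      # proportion of sentences with PII
    BiasedInformation: float = 0.0  # proportion of potentially biased sentences
    Author: str = ""
    Style: str = ""
    Type: str = ""
    TranslatedDocument: bool = False
    TaskCategories: list = field(default_factory=list)
```

Storing the PII and bias scores as proportions lets downstream users filter documents against a threshold rather than a binary flag.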


IfGPT: Metadata management

DomainNode – properties: Name, ParentCategory

SourceNode – properties: Name, Url

LicenceNode – properties: Type

Relations:

  • Document – Domain: BELONGS_TO
  • Domain – Domain: SUBCATEGORY_OF
  • Document – Licence: LICENSED_WITH
  • Document – Author: WRITTEN_BY
  • Document – Source: PUBLISHED_IN
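The node labels and relation types above are enough to generate loading statements. A sketch, assuming only the labels and relation names shown on this slide; actually executing the statement would require a live Neo4j instance and the official Python driver, which is omitted here:

```python
# Relation types between node labels, as listed in the schema above.
SCHEMA = {
    ("Document", "Domain"): "BELONGS_TO",
    ("Domain", "Domain"): "SUBCATEGORY_OF",
    ("Document", "Licence"): "LICENSED_WITH",
    ("Document", "Author"): "WRITTEN_BY",
    ("Document", "Source"): "PUBLISHED_IN",
}

def link_document_cypher():
    """Build one Cypher statement that MERGEs a Document node and links it
    to its Domain, Licence, Author and Source nodes via the schema relations."""
    parts = ["MERGE (d:Document {Identifier: $Identifier})"]
    # key property used to MERGE each target node (per the node definitions above)
    targets = {"Domain": "Name", "Licence": "Type", "Author": "Name", "Source": "Name"}
    for label, key in targets.items():
        rel = SCHEMA[("Document", label)]
        var = label.lower()
        parts.append(f"MERGE ({var}:{label} {{{key}: ${label}}})")
        parts.append(f"MERGE (d)-[:{rel}]->({var})")
    return "\n".join(parts)
```

With the `neo4j` driver such a statement would be run once per document, passing the metadata values as query parameters.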


IfGPT: Online search interface

[Screenshots of the online search interface]

The future development of the IfGPT dataset includes:

  • Expanding the dataset with new text samples.
  • Completing the metadata description for some empty metadata categories.
  • Detailed description of the Task Categories for which a given text sample is suitable (e.g. question answering).

IfGPT: Future development

IfGPT is created as a large dataset equipped with rich metadata that enables efficient search and retrieval of suitable documents by clearly defined tasks and thematic domains.

It enables fast and efficient fine-tuning of LLMs and retrieval-augmented generation (RAG).


Acknowledgments

  • The work is part of the project Infrastructure for Fine-tuning Pre-trained Large Language Models, Grant Agreement No. ПВУ – 55 from 12.12.2024 /BG-RRP-2.017-0030-C01/.

  • https://ifgpt.dcl.bas.bg
