1 of 24

An Improved Bulgarian Natural Language Processing Pipeline

Melania Berbatova, Filip Ivanov

Sofia University St. Kliment Ohridski

2 of 24

Table of contents

01 Introduction
02 Pipeline implementation
03 Experiments and evaluation
04 Conclusion and future work

3 of 24

01 Introduction

4 of 24

Language pipelines

A language pipeline consists of a sequence of steps targeted towards processing and analyzing natural language data. A typical language pipeline might include steps such as tokenization, part-of-speech tagging, parsing, and semantic analysis among others.

These steps are used as a preprocessing stage in many different tasks and applications that involve analyzing human language.


Figure 1: An example of a language pipeline

5 of 24

Research motivation

There are previous works on building a Bulgarian pipeline, based either on earlier versions of spaCy (Popov, 2020) or on custom-built software (Simov, 2012). Since these works were published, however, new neural algorithms for tasks such as lemmatization have been put into practice, which improve performance and also facilitate automatic evaluation.

A language pipeline with good performance is crucial both for conducting research in natural language processing and related areas and for creating software applications for various purposes.

6 of 24

Research goals

The goals of the current work are:

• to create an open-source pipeline for the Bulgarian language in spaCy v.3;

• to improve available lists of tokenizer exceptions, stop words and regular expressions for handling specific symbols and punctuation;

• to switch from rule-based to neural edit-tree lemmatization;

• to create custom modules for sentence splitting and for word sense disambiguation.

7 of 24

02 Pipeline implementation

8 of 24

Overview

The pipeline consists of two rule-based and four trainable components.

The machine learning components are trained on data from two language resources – BulTreeBank and BulNet.

All components share the same token representations, which are based on pretrained Bulgarian fastText word embeddings.

Tokenization

Sentence Splitting

Part-of-speech Tagging

Dependency Parsing

Lemmatization

Word Sense Disambiguation

Figure 2: Sequence of the steps of the developed Bulgarian language pipeline
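In spaCy v3 such a component sequence is declared in the training config. The fragment below is only an illustrative sketch, not the authors' actual configuration: the `sentence_splitter` and `wsd` factory names stand in for the custom components and are assumptions, while `trainable_lemmatizer` is the name of spaCy's built-in edit-tree lemmatizer factory.

```ini
# hypothetical excerpt of a spaCy v3 config.cfg for such a pipeline
[nlp]
lang = "bg"
pipeline = ["sentence_splitter","tagger","morphologizer","parser","trainable_lemmatizer","wsd"]
# the tokenizer is configured separately, under [nlp.tokenizer]

[components.trainable_lemmatizer]
factory = "trainable_lemmatizer"   # spaCy's neural edit-tree lemmatizer

[components.wsd]
factory = "wsd"                    # custom word sense disambiguation component
```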

9 of 24

Training data

(a) BulTreeBank

The data in the Bulgarian treebank consists of a total of 11 138 sentences, of which 81% are from Bulgarian newspapers, 16% from fiction texts, and 3% from administrative documents. Every token is annotated with its lemma, part-of-speech tag, list of morphological features and dependency features.

Figure 3: A sentence example from BulTreeBank.

# text = На заека му омръзна да студува.
1	На	на	ADP	R	_	2	case	2:case	_
2	заека	заек	NOUN	Ncmh	Definite=Def|Gender=Masc|Number=Sing	4	iobj	4:iobj	_
3	му	аз	PRON	Ppetds3m	Case=Dat|Gender=Masc|Number=Sing|Person=3|PronType=Prs	4	expl	4:expl	_
4	омръзна	омръзне-ми	VERB	Vnpif-o3s	Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act	0	root	0:root	_
5	да	да	AUX	Tx	_	6	aux	6:aux	_
6	студува	студувам	VERB	Vpiif-r3s	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	4	csubj	4:csubj	SpaceAfter=No
7	.	.	PUNCT	punct	_	4	punct	4:punct	_
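Each token line of such a CoNLL-U file holds ten tab-separated fields. A minimal stdlib-only sketch of reading the lemma and POS columns (the field names follow the CoNLL-U format; the parsing helper itself is ours, not part of BulTreeBank tooling):

```python
# minimal CoNLL-U token-line parser; the ten standard columns are
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu_line(line: str) -> dict:
    values = line.rstrip("\n").split("\t")
    assert len(values) == len(FIELDS), "a token line has exactly 10 columns"
    return dict(zip(FIELDS, values))

token = parse_conllu_line("2\tзаека\tзаек\tNOUN\tNcmh\t"
                          "Definite=Def|Gender=Masc|Number=Sing\t4\tiobj\t4:iobj\t_")
print(token["lemma"], token["upos"])  # заек NOUN
```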

10 of 24

Training data

(b) BulNet

BulNet is the Bulgarian version of the semantic database WordNet. It contains different meanings for 41 797 words in Bulgarian.

Lemma | Identifier        | Description
ходя  | btbwn-038000007-v | Преминавам определено разстояние (обикновено с превозно средство или с животно). (I cover a certain distance, usually by vehicle or riding an animal.)
ходя  | btbwn-041000314-v | В любовни отношения съм с някого. (I am in a romantic relationship with someone.)
ходя  | btbwn-038000141-v | Движа се като правя стъпка след стъпка в равномерен ритъм. (I move by taking step after step in a steady rhythm.)
ходя  | btbwn-038000422-v | Насочвам се към определено място или човек с някаква цел. (I head toward a certain place or person with some purpose.)

Table 1. Possible meanings of the verb „ходя“ (to go), available in the Bulgarian version of WordNet.
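A sense inventory like this can be viewed as a mapping from a lemma to its candidate (identifier, gloss) pairs. A toy stdlib sketch (the helper function and the two-sense subset are illustrative, not BulNet's actual API):

```python
# toy in-memory view of a BulNet-style sense inventory:
# one lemma maps to several (sense identifier, gloss) pairs
senses = {
    "ходя": [
        ("btbwn-038000141-v",
         "Движа се като правя стъпка след стъпка в равномерен ритъм."),
        ("btbwn-041000314-v",
         "В любовни отношения съм с някого."),
    ],
}

def candidate_senses(lemma: str):
    # return only the sense identifiers for a lemma (empty if unknown)
    return [sense_id for sense_id, _gloss in senses.get(lemma, [])]

print(candidate_senses("ходя"))
# ['btbwn-038000141-v', 'btbwn-041000314-v']
```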

11 of 24

Rule-based components

(a) Tokenizer

The main goal of the custom Bulgarian tokenizer is to model the Bulgarian language correctly. It consists of lists of special cases (metrics, abbreviations, titles), a list of stop words, regular expressions for special symbols and punctuation, as well as rules for the tokenization process.

„      | Започнах | обучение | в     | СУ      | на    | 01.10.2014 | година | “      | .
PREFIX | TOKEN    | TOKEN    | TOKEN | SPECIAL | TOKEN | TOKEN      | TOKEN  | SUFFIX | SUFFIX

Table 2. An example of a tokenized sentence.
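The interplay of special cases and punctuation rules can be illustrated with a small stdlib-only sketch. This is not the pipeline's actual tokenizer (which builds on spaCy's); the `SPECIAL` set and the date pattern here are tiny illustrative stand-ins for its exception lists:

```python
import re

# special cases that must survive as single tokens (abbreviations, dates);
# this set is illustrative, not the pipeline's full exception list
SPECIAL = {"СУ", "г.", "Св."}
DATE = re.compile(r"\d{2}\.\d{2}\.\d{4}")

def tokenize(text: str):
    tokens = []
    for chunk in text.split():
        if chunk in SPECIAL or DATE.fullmatch(chunk):
            tokens.append(chunk)          # keep special cases intact
        else:
            # split trailing punctuation into its own suffix token(s)
            m = re.fullmatch(r"(.+?)([.,!?]*)", chunk)
            tokens.append(m.group(1))
            tokens.extend(m.group(2))
    return tokens

print(tokenize("Започнах обучение в СУ на 01.10.2014 година."))
# ['Започнах', 'обучение', 'в', 'СУ', 'на', '01.10.2014', 'година', '.']
```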

12 of 24

Rule-based components

(b) Sentence Splitter

The Sentence Splitter is a custom-developed rule-based component. It consists of rules for treating punctuation and a variety of edge cases connected to the use of initials and abbreviations. In this manner, it is able to avoid splitting sentences where the dot is used in abbreviations, such as:

  • Св. Николай Чудотворец е роден 15 март 270 г. в Патара, Ликия.

(St. Nicholas the Wonderworker was born on March 15, 270 in Patara, Lycia.)
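The abbreviation rule can be sketched in a few lines of stdlib Python. This is a simplified illustration of the idea, not the component's actual rule set; the `ABBREVIATIONS` set is a small assumed subset:

```python
import re

# abbreviations after which a dot does not end the sentence
# (an illustrative subset; the pipeline's actual lists are larger)
ABBREVIATIONS = {"Св", "г", "ул", "проф"}

def split_sentences(text: str):
    sentences, start = [], 0
    for m in re.finditer(r"\.\s+", text):
        words = text[start:m.start()].split()
        before = words[-1] if words else ""
        if before in ABBREVIATIONS:
            continue  # the dot belongs to an abbreviation, keep scanning
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Св. Николай е роден 270 г. в Патара, Ликия. Това е втора фраза."))
# ['Св. Николай е роден 270 г. в Патара, Ликия.', 'Това е втора фраза.']
```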

13 of 24

Trainable components

(a) POS-tagger and morphologizer

The part-of-speech tagger and morphologizer components are implemented with spaCy's tagger model, which uses a linear layer with softmax activation to predict tag scores for every token's vector.

The POS tagging module uses as features the token vectors, as well as information from the morphologizer, which is a trainable component that predicts morphological features and fine-grained POS tags following the Universal Dependencies UPOS and FEATS annotation guidelines.
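The scoring step described above (linear layer, then softmax over tag scores) can be sketched with plain Python. The tag set, weights and toy vector below are made up for illustration; a real model learns the weights and works over much larger vectors:

```python
import math

TAGS = ["NOUN", "VERB", "ADP", "PUNCT"]

def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]  # numerically stable
    total = sum(exps)
    return [e / total for e in exps]

def predict_tag(token_vector, weights, bias):
    # linear layer: one score per tag, then softmax into probabilities
    scores = [sum(w * x for w, x in zip(row, token_vector)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(scores)
    return TAGS[probs.index(max(probs))], probs

# toy 3-dimensional "token vector" with made-up weights
tag, probs = predict_tag([0.2, 0.9, -0.1],
                         weights=[[1.0, 0.5, 0.0],    # NOUN
                                  [0.0, 2.0, 0.0],    # VERB
                                  [0.0, 0.0, 1.0],    # ADP
                                  [-1.0, -1.0, 0.0]], # PUNCT
                         bias=[0.0, 0.0, 0.0, 0.0])
print(tag)  # VERB
```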

14 of 24

Trainable components

(b) Dependency parser

A dependency parser (DEP) is a model which marks the relationships between "head" words and the words that modify those heads. The spaCy parser uses a modification of the non-monotonic arc-eager transition system, which jointly learns sentence segmentation and labelled dependency parsing.

Figure 4: A visual example of POS tagging and dependency parsing
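To build intuition for the transition system, here is a sketch of the standard (monotonic) arc-eager transitions applied to a toy Bulgarian sentence; spaCy's actual parser uses a non-monotonic variant and learns which transition to take, whereas here the transition sequence is supplied by hand:

```python
def arc_eager(n_tokens, transitions):
    """Apply arc-eager transitions; tokens are 1..n, 0 is the artificial ROOT."""
    stack, buffer, arcs = [0], list(range(1, n_tokens + 1)), []
    for action, *label in transitions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":    # head = buffer front, dependent = stack top
            arcs.append((buffer[0], label[0], stack.pop()))
        elif action == "RIGHT-ARC":   # head = stack top, dependent = buffer front
            arcs.append((stack[-1], label[0], buffer[0]))
            stack.append(buffer.pop(0))
        elif action == "REDUCE":
            stack.pop()
    return arcs

# "Котката спи ." -> "спи" is the root, "Котката" its subject, "." its punctuation
arcs = arc_eager(3, [("SHIFT",), ("LEFT-ARC", "nsubj"),
                     ("RIGHT-ARC", "root"), ("RIGHT-ARC", "punct")])
print(arcs)  # [(2, 'nsubj', 1), (0, 'root', 2), (2, 'punct', 3)]
```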

15 of 24

Trainable components

(c) Lemmatizer

In the Neural edit-tree lemmatization algorithm, the lemmatization task is treated as a classification problem. The classes represent all learned edit trees, and the softmax function is used for computing the probability distribution over all trees for a particular token.

An edit tree consists of two types of nodes: interior nodes, which split the string into a prefix, an infix, and a suffix, and leaf nodes, which apply the learned string transformation.


Figure 5: Edit tree for the inflected form “най-добрият“ (the best) and its lemma “добър“ (good).
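Applying an edit tree to an inflected form can be sketched as follows. This is a simplified stdlib illustration of the node types described above, with a hand-built tree for the "най-добрият" → "добър" example (a real lemmatizer learns the trees from data and may nest interior nodes more deeply):

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    replacement: str  # leaf node: replace the matched substring outright

@dataclass
class Interior:
    # interior node: keep the middle part, recurse on prefix and suffix
    prefix_len: int
    suffix_len: int
    left: Union["Interior", Leaf]
    right: Union["Interior", Leaf]

def apply_tree(node, s: str) -> str:
    if isinstance(node, Leaf):
        return node.replacement
    prefix = s[:node.prefix_len]
    middle = s[node.prefix_len:len(s) - node.suffix_len]
    suffix = s[len(s) - node.suffix_len:]
    return apply_tree(node.left, prefix) + middle + apply_tree(node.right, suffix)

# hand-built tree for "най-добрият" -> "добър":
# drop the prefix "най-", keep "доб", rewrite the suffix "рият" as "ър"
tree = Interior(prefix_len=4, suffix_len=4, left=Leaf(""), right=Leaf("ър"))
print(apply_tree(tree, "най-добрият"))  # добър
```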

16 of 24

Additional components

  1. Word Sense Disambiguation


Figure 6: An overall view of the WSD system. Different approaches can be used for the sense selection.

17 of 24

03 Experiments

18 of 24

Experiments

(a) Word embeddings

For the WSD model, which uses information from the preceding steps of the pipeline, we experimented with different pretrained word embedding architectures, such as BERT, RoBERTa, Flair and fastText.

(b) WSD algorithms

For the word sense disambiguation task, we experimented with two main types of algorithms - graph-based (PageRank) and similarity-based (cosine similarity).
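The similarity-based approach reduces to comparing a context embedding against an embedding per candidate sense and picking the closest one. A toy stdlib sketch (the 3-dimensional "embeddings" are made up; real ones would come from fastText, Flair or RoBERTa as above):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pick_sense(context_vec, sense_vecs):
    """Choose the sense whose gloss embedding is closest to the context embedding."""
    return max(sense_vecs, key=lambda sid: cosine(context_vec, sense_vecs[sid]))

# toy vectors for two senses of "ходя" (walking vs. dating)
sense_vecs = {
    "btbwn-038000141-v": [0.9, 0.1, 0.0],   # walking sense
    "btbwn-041000314-v": [0.0, 0.2, 0.9],   # dating sense
}
print(pick_sense([0.8, 0.3, 0.1], sense_vecs))  # btbwn-038000141-v
```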

19 of 24

Automatic Evaluation

Algorithm                   | WSD accuracy
Cos similarity - FastText   | 0.6524
Cos similarity - Flair      | 0.6399
Cos similarity - RoBERTa    | 0.6282
PageRank - undirected graph | 0.5876
PageRank - directed graph   | 0.6113

Table 3. Performance comparison of different approaches for solving the WSD task.

Metric | Pipeline-2020 | Pipeline-2023
TOK    | -             | 99.97
POS    | 94.49         | 98.35
LEMMA  | -             | 93.11
UAS    | 89.71         | 89.96
LAS    | 83.95         | 84.88
WSD    | -             | 65.24

Table 4. Comparison of the results of the current pipeline (2023) and the previous implementation (2020).

20 of 24

Error analysis

(a) Lemmatization

We identified three main types of mistakes made by the lemmatizer:

  1. Errors caused by the reflexive suffixes "(се)" and "се" in reflexive verbs - the forms in the data on which the lemmatizer was trained differ from the way they are written in BulNet.
  2. Incorrect suggestions for the base form of an existing word, such as "булеварда" instead of "булевард" and "пейзажа" instead of "пейзаж".
  3. Predictions of non-existent words, such as "лип" instead of "липа" and "щайг" instead of "щайга".

21 of 24

Error analysis

(b) Word Sense disambiguation

There are two main sources of errors of the WSD module:

  1. Multi-token expressions in BulNet, such as "черен чай" (black tea) and "маслодайни рози" (oil-bearing roses).
  2. Overlapping senses – often there are cases where the meaning of a word in a particular sentence can fall in more than one of the predefined senses.

Example: Пристъпих напред и вдигнах ръка. (I stepped forward and raised my hand.)

Expected sense: btbwn-038000141-v - Движа се като правя стъпка след стъпка в равномерен ритъм. (I move by taking step after step in a steady rhythm.)

Predicted sense: btbwn-038000146-v - Правя, направям една или няколко стъпки, обикновено в посоката, към която гледам, към която съм обърнат. (I take one or more steps, usually in the direction I am looking or facing.)

Table 5. Example of a case where the predicted and the expected sense are very close.

22 of 24

04 Conclusion and future work

23 of 24

Conclusion and future work

The presented pipeline can be used in multiple ways, some of which include:

  • in sentiment analysis and hate speech detection tasks, by lemmatizing text and searching in a list of predefined signaling words;
  • in machine translation, to find the right meaning of an ambiguous word and produce the right translation;
  • in text categorization tasks, by providing additional information about the text, such as additional features of the words and sentences of its contents.

These and other applications can be built by extending the pipeline with additional components (such as one for text categorization) or by integrating it with other systems.

24 of 24

Thanks!

Acknowledgements

An initial version of the sentencizer and the lists of tokenizer exceptions were compiled by Luboslav Krastev and Daniel Traikov from Apis Europe. Data from BulNet was provided with the kind assistance of Prof. Dr. Kiril Simov from the Institute of Information and Communication Technologies of the Bulgarian Academy of Sciences.

CREDITS: This presentation template was created by Slidesgo and includes icons by Flaticon, infographics & images by Freepik and content by Eliana Delacour