An Improved Bulgarian Natural Language Processing Pipeline
Melania Berbatova, Filip Ivanov
Sofia University St. Kliment Ohridski
Table of contents
01 Introduction
02 Pipeline implementation
03 Experiments and evaluation
04 Conclusion and future work
01 Introduction
Language pipelines
A language pipeline consists of a sequence of steps targeted towards processing and analyzing natural language data. A typical language pipeline might include steps such as tokenization, part-of-speech tagging, parsing, and semantic analysis among others.
These steps are used as a preprocessing stage in many different tasks and applications that involve analyzing human language.
Figure 1: An example of a language pipeline
Research motivation
There are previous works on building a Bulgarian pipeline based on earlier versions of spaCy (Popov, 2020) or on custom-built software (Simov, 2012). However, since these works were published, new neural algorithms for tasks such as lemmatization have been put into practice, which improve performance and also facilitate automatic evaluation.
A language pipeline with good performance is crucial both for conducting research in natural language processing and related areas and for creating software applications for various purposes.
Research goals
The goals of the current work are:
• to create an open-source pipeline for the Bulgarian language in spaCy v.3;
• to improve available lists of tokenizer exceptions, stop words and regular expressions for handling specific symbols and punctuation;
• to switch from rule-based to neural edit-tree lemmatization;
• to create custom modules for sentence splitting and for word sense disambiguation.
02 Pipeline implementation
Overview
The pipeline consists of two rule-based and four trainable components.
The machine learning components are trained on data from two language resources – BulTreeBank and BulNet.
All components share the same token representations, which are based on pretrained Bulgarian fastText word embeddings.
Tokenization
Sentence Splitting
Part-of-speech Tagging
Dependency Parsing
Lemmatization
Word Sense Disambiguation
Figure 2: Sequence of the steps of the developed Bulgarian language pipeline
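As a rough sketch of how the components chain together, each step can be seen as a function that annotates a shared document object and passes it on, loosely mirroring how spaCy passes a Doc through its pipeline. The component bodies below are placeholders for illustration, not the actual trained models:

```python
def tokenize(doc):
    # placeholder: the real pipeline uses the custom Bulgarian tokenizer
    doc["tokens"] = doc["text"].split()
    return doc

def tag(doc):
    # placeholder: the real pipeline predicts UPOS tags with a trained tagger
    doc["pos"] = ["X"] * len(doc["tokens"])
    return doc

# ...followed by the parser, lemmatizer, and WSD components
PIPELINE = [tokenize, tag]

def run(text):
    doc = {"text": text}
    for component in PIPELINE:
        doc = component(doc)
    return doc
```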
Training data
(a) BulTreeBank
The data in the Bulgarian treebank consists of a total of 11 138 sentences, of which 81% come from Bulgarian newspapers, 16% from fiction texts, and 3% from administrative documents. Every token is annotated with its lemma, part-of-speech tag, list of morphological features, and dependency relations.
Figure 3: A sentence example from BulTreeBank.
# text = На заека му омръзна да студува.
1	На	на	ADP	R	_	2	case	2:case	_
2	заека	заек	NOUN	Ncmh	Definite=Def|Gender=Masc|Number=Sing	4	iobj	4:iobj	_
3	му	аз	PRON	Ppetds3m	Case=Dat|Gender=Masc|Number=Sing|Person=3|PronType=Prs	4	expl	4:expl	_
4	омръзна	омръзне-ми	VERB	Vnpif-o3s	Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act	0	root	0:root	_
5	да	да	AUX	Tx	_	6	aux	6:aux	_
6	студува	студувам	VERB	Vpiif-r3s	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	4	csubj	4:csubj	SpaceAfter=No
7	.	.	PUNCT	punct	_	4	punct	4:punct	_
Training data
(b) BulNet
BulNet is the Bulgarian version of the semantic database WordNet. It contains different meanings for 41 797 words in Bulgarian.
Lemma | Identifier | Description |
ходя | btbwn-038000007-v | Преминавам определено разстояние (обикновено с превозно средство или с животно). (I cover a certain distance, usually by vehicle or on an animal.) |
ходя | btbwn-041000314-v | В любовни отношения съм с някого. (I am in a romantic relationship with someone.) |
ходя | btbwn-038000141-v | Движа се като правя стъпка след стъпка в равномерен ритъм. (I move by taking step after step in a steady rhythm.) |
ходя | btbwn-038000422-v | Насочвам се към определено място или човек с някаква цел. (I head toward a particular place or person with some purpose.) |
Table 1. Possible meanings of the verb „ходя“ (to go), available in the Bulgarian version of WordNet.
Rule-based components
(a) Tokenizer
The main goal of the custom Bulgarian tokenizer is to correctly model the specifics of the Bulgarian language. It consists of lists of special cases (metrics, abbreviations, titles), a list of stop words, regular expressions for special symbols and punctuation, as well as rules for the tokenization process.
„ | Започнах | обучение | в | СУ | на | 01.10.2014 | година | . | “ |
PREFIX | TOKEN | TOKEN | TOKEN | SPECIAL | TOKEN | TOKEN | TOKEN | SUFFIX | SUFFIX |
Table 2. An example of a tokenized sentence.
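The peeling of prefix and suffix punctuation around protected special cases can be sketched in plain Python as follows. The exception list, prefix/suffix character sets, and date regex below are illustrative stand-ins, not the actual resources used in the pipeline:

```python
import re

PREFIXES = '„"(«'                       # characters split off the front
SUFFIXES = '".,!?;:)»“'                 # characters split off the end
EXCEPTIONS = {"СУ", "г.", "св."}        # hypothetical special-case entries
DATE_RE = re.compile(r"\d{2}\.\d{2}\.\d{4}$")  # keep dates like 01.10.2014 whole

def tokenize(text):
    tokens = []
    for chunk in text.split():
        prefix, suffix = [], []
        while chunk and chunk[0] in PREFIXES:
            prefix.append(chunk[0])
            chunk = chunk[1:]
        while (chunk and chunk[-1] in SUFFIXES
               and chunk not in EXCEPTIONS and not DATE_RE.match(chunk)):
            suffix.append(chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(prefix)
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(suffix))
    return tokens
```

Running it on the sentence from Table 2 reproduces the prefix/token/suffix segmentation shown there.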
Rule-based components
(b) Sentence Splitter
The Sentence Splitter is a custom-developed rule-based component. It consists of rules for handling punctuation and a variety of edge cases connected to the use of initials and abbreviations. In this manner, it avoids splitting sentences where the dot is part of an abbreviation, such as:
Св. Николай Чудотворец е роден на 15 март 270 г. в Патара, Ликия. (St. Nicholas the Wonderworker was born on March 15, 270 in Patara, Lycia.)
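A minimal sketch of such an abbreviation-aware splitting rule is shown below; the abbreviation list is a hypothetical fragment, not the pipeline's full resource:

```python
import re

# hypothetical abbreviation list (the real component uses a much larger one)
ABBREVIATIONS = {"св.", "г.", "гр.", "ул.", "проф."}

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]+\s+", text):
        last_word = text[start:m.start() + 1].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the dot belongs to an abbreviation: do not split here
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences
```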
Trainable components
(a) POS-tagger and morphologizer
The part-of-speech tagger and morphologizer components are implemented with spaCy's tagger model, which uses a linear layer with softmax activation to predict tag scores for every token's vector.
The POS tagging module uses the token vectors as features, as well as information from the morphologizer, a trainable component that predicts morphological features and fine-grained POS tags following the Universal Dependencies UPOS and FEATS annotation guidelines.
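The scoring step can be illustrated with a toy linear layer followed by a softmax over tag scores. The tags, weights, and input vector below are made up for illustration and are not the trained model's parameters:

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of raw scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# toy linear layer: one weight row and one bias per tag (illustrative values)
TAGS = ["NOUN", "VERB", "ADP"]
W = [[0.9, -0.2], [0.1, 0.8], [-0.5, -0.4]]
b = [0.1, 0.0, -0.2]

def predict_tag(vector):
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, vector)) + bias
              for row, bias in zip(W, b)]
    probs = softmax(scores)
    return TAGS[probs.index(max(probs))], probs
```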
Trainable components
(b) Dependency parser
A dependency parser (DEP) is a model that marks the relationships between "head" words and the words that modify those heads. The spaCy parser uses a modification of the non-monotonic arc-eager transition system, which jointly learns sentence segmentation and labelled dependency parsing.
Figure 4: A visual example of POS tagging and dependency parsing
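The plain arc-eager transition system that the spaCy parser builds on can be sketched as follows. This is a minimal, unlearned version for illustration; spaCy's actual parser is a learned, non-monotonic variant:

```python
class ArcEagerState:
    """Minimal sketch of the arc-eager transition system."""

    def __init__(self, n_words):
        self.stack = []
        self.buffer = list(range(n_words))
        self.heads = {}  # dependent index -> (head index, label)

    def shift(self):
        self.stack.append(self.buffer.pop(0))

    def left_arc(self, label):
        # the buffer front becomes the head of the stack top, which is popped
        dep = self.stack.pop()
        self.heads[dep] = (self.buffer[0], label)

    def right_arc(self, label):
        # the stack top becomes the head of the buffer front, which is shifted
        dep = self.buffer.pop(0)
        self.heads[dep] = (self.stack[-1], label)
        self.stack.append(dep)

    def reduce(self):
        # pop a stack item that has already received its head
        self.stack.pop()
```

For the toy sentence „Тя чете книга“ (She reads a book), the sequence shift, left_arc("nsubj"), shift, right_arc("obj") attaches both „Тя“ and „книга“ to the verb.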
Trainable components
(c) Lemmatizer
In the neural edit-tree lemmatization algorithm, the lemmatization task is treated as a classification problem. The classes represent all learned edit trees, and the softmax function is used for computing a probability distribution over all trees for a particular token.
An edit tree consists of two types of nodes: interior nodes, which split the string into a prefix, an infix, and a suffix, and leaf nodes, which apply the learned string transformation.
Figure 5: Edit tree for the inflected form “най-добрият“ (the best) and its lemma “добър“ (good).
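An edit tree can be sketched as a nested tuple with interior "split" nodes and leaf "replace" nodes. The tree below is hand-built to mirror the example in the figure; the real component learns its trees from the treebank data:

```python
def apply_edit_tree(tree, s):
    """Apply an edit tree to an inflected form (illustrative sketch)."""
    if tree[0] == "replace":             # leaf node: rewrite this segment
        _, old, new = tree
        assert s == old, "edit tree does not match this form"
        return new
    _, i, left, right = tree             # interior node: split and recurse
    return apply_edit_tree(left, s[:i]) + apply_edit_tree(right, s[i:])

# A possible edit tree for "най-добрият" -> "добър": strip the superlative
# prefix "най-", keep the stem "доб", and replace the suffix "рият" with "ър".
TREE = ("split", 4,
        ("replace", "най-", ""),
        ("split", 3,
         ("replace", "доб", "доб"),
         ("replace", "рият", "ър")))
```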
Additional components
Figure 6: An overall view of the WSD system. Different approaches can be used for the sense selection.
03 Experiments
Experiments
(a) Word embeddings
To improve the performance of the WSD model, which uses information from the preceding steps of the pipeline, we experimented with different pretrained word embedding architectures, such as BERT, RoBERTa, Flair, and fastText.
(b) WSD algorithms
For the word sense disambiguation task, we experimented with two main types of algorithms: graph-based (PageRank) and similarity-based (cosine similarity).
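The similarity-based variant can be sketched as follows: the sense whose embedding has the highest cosine similarity to the context embedding is selected. The vectors below are toy values, not actual embeddings:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def disambiguate(context_vector, sense_vectors):
    # choose the sense whose embedding is closest to the context embedding
    return max(sense_vectors,
               key=lambda sense_id: cosine(context_vector, sense_vectors[sense_id]))
```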
Automatic Evaluation
Algorithm | WSD accuracy |
Cos similarity - fastText | 0.6524 |
Cos similarity - Flair | 0.6399 |
Cos similarity - RoBERTa | 0.6282 |
PageRank - undirected graph | 0.5876 |
PageRank - directed graph | 0.6113 |
Table 3. Performance comparison of different approaches for solving the WSD task.
Metric | Pipeline-2020 | Pipeline-2023 |
TOK | - | 99.97 |
POS | 94.49 | 98.35 |
LEMMA | - | 93.11 |
UAS | 89.71 | 89.96 |
LAS | 83.95 | 84.88 |
WSD | - | 65.24 |
Table 4. Comparison of the results of the current pipeline (2023) and the previous implementation (2020).
Error analysis
(a) Lemmatization
We identified three main types of mistakes made by the lemmatizer.
Error analysis
(b) Word Sense disambiguation
There are two main sources of errors in the WSD module:
Example | Expected sense | Predicted sense |
Пристъпих напред и вдигнах ръка. (I stepped forward and raised my hand.) | btbwn-038000141-v Движа се като правя стъпка след стъпка в равномерен ритъм. (I move by taking step after step in a steady rhythm.) | btbwn-038000146-v Правя, направям една или няколко стъпки, обикновено в посоката, към която гледам, към която съм обърнат. (I take one or more steps, usually in the direction I am looking or facing.) |
Table 5. Example of a case where the predicted and the expected sense are very close.
04 Conclusion and future work
Conclusion and future work
The presented pipeline can be used in multiple ways, some of which include:
These and other applications can be built by extending the pipeline with additional components (such as one for text categorization) or by integrating it with other systems.
Thanks!
Acknowledgements
An initial version of the sentencizer and the lists of tokenizer exceptions were compiled by Luboslav Krastev and Daniel Traikov from Apis Europe. Data from BulNet is provided with the kind assistance of Prof. Dr. Kiril Simov from the Institute of Information and Communication Technologies of the Bulgarian Academy of Sciences.