1 of 19

XI INTERNATIONAL CONFERENCE

“INFORMATION TECHNOLOGY AND IMPLEMENTATION” (IT&I-2024)

Toxicity Detection for Ukrainian-Language Texts in the TextAttributor System

Nataliia Darchuk, Oksana Zuban, Valentyna Robeiko, Yuliia Tsyhvintseva, Mykola Sazhok

2 of 19

Outline

What toxicity is and why it matters
A lexicon-based methodology
Machine learning methods
Experiments, implementations, integration, comparisons
Conclusions and future work

3 of 19

Toxic text

Contains indications of aggressive communication (harassment, threats, obscenities, cyberbullying, trolling, outrage, and identity-based hate speech).
Verbalizes negative facts, emotions, and assessments. Such destructive, emotion-generating words cause the recipient to feel anxiety, fear, confusion, shame, guilt, and a sense of being oppressed or controlled.
Both computer-aided and fully automatic determination of text tonality is crucial, particularly in wartime, for:
detecting destructive and otherwise harmful information that is injected into the conscious or subconscious mind and leads to an inadequate perception of reality
preventing negative impacts on an individual's psychological state and on public consciousness
avoiding infringement of the rights and legitimate interests of users, society, and the state

4 of 19

A lexicon-based methodology

Three lexicographic lists of toxic words are developed based on:

a text sample of approximately two million words, comprising blogs, news sites, online publications, social-network comments on online publications, and so forth;
a database of semantic taxa, compiled from Ukrainian journalistic texts totaling 40,000 words;
a tonality dictionary of Ukrainian vocabulary compiled by O. Tolochko.

5 of 19

A lexicon-based methodology

1. The "Dictionary of Emotionogens":

over 5,000 lexemes that verbalize negative facts, emotions, and assessments, causing the recipient to feel anxiety, fear, confusion, shame, guilt, and a sense of being oppressed or controlled.

6 of 19

A lexicon-based methodology

2. The "Hate Speech Dictionary":

3,000 lexemes that verbalize aggressive communication, including harassment, threats, obscenities, cyberbullying, trolling, outrage, and identity-based hate speech.

7 of 19

A lexicon-based methodology

Hate Speech Dictionary

8 of 19

A lexicon-based methodology

3. The "Dictionary of Toxic Compounds":

1,500 stable phrases that idiomatically reflect toxic sentiment. The items in the list are classified by 26 semantic features.

9 of 19

Implementation within the TextAttributor system

A lexicon-based method for detecting toxicity in Ukrainian-language texts was implemented within the TextAttributor 1.0 system as a rule-based module.

10 of 19

Implementation within the TextAttributor system

Automatic analysis of Ukrainian-language text

• text attribution

• setting the toxicity index of the text

• linguistic examination of the text

https://ta.mova.info

11 of 19

Rule-Based Toxicity Detection Module

The text toxicity index and the statistical map of the linguistic expert report on text toxicity.

emotionogens – 18;
vulgarisms – 2;
negative names for a person by comparison with mythical creatures – 1;
negative names for a person based on intellectual ability – 2;
negative names for a person based on health characteristics – 1;
negative names for a person (sarcasm, idiomatic expression) – 1;
negative names for a person based on body parts or physiological processes – 1
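
A statistical map like the one above can be read as a tally of lexicon hits per semantic category. The following is a minimal sketch of that idea, not the TextAttributor implementation: the tiny English lexicons, the function names, and the surface-form matching are all assumptions made for illustration. A real module for Ukrainian would presumably match lemmas rather than raw word forms.

```python
import re
from collections import Counter

# Hypothetical miniature lexicons (placeholders only); the real dictionaries
# hold thousands of Ukrainian lexemes and phrases and are not reproduced here.
WORD_LEXICON = {
    "fear": "emotionogens",
    "shame": "emotionogens",
    "idiot": "negative names for a person based on intellectual ability",
}
PHRASE_LEXICON = {
    ("snake", "in", "the", "grass"):
        "negative names for a person (sarcasm, idiomatic expression)",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def statistical_map(text):
    """Count lexicon hits per category; phrases are tried before single words."""
    tokens = tokenize(text)
    counts = Counter()
    i = 0
    while i < len(tokens):
        matched = False
        for phrase, category in PHRASE_LEXICON.items():
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                counts[category] += 1
                i += n  # skip the whole phrase so its words are not recounted
                matched = True
                break
        if not matched:
            category = WORD_LEXICON.get(tokens[i])
            if category is not None:
                counts[category] += 1
            i += 1
    return counts


def format_map(counts):
    """Render the tally in the 'category – count;' style used on the slide."""
    return "\n".join(f"{cat} – {n};" for cat, n in counts.most_common())


if __name__ == "__main__":
    sample = "Fear and shame, you idiot! A snake in the grass."
    print(format_map(statistical_map(sample)))
```

Trying phrases before single words keeps an idiom from being counted twice, once as a compound and once through its component words.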

12 of 19

Rule-Based Toxicity Detection Module

Text fragment analyzed in the rule-based toxicity detection module

13 of 19

Rule-Based Toxicity Detection Module

Text fragment analyzed in the rule-based toxicity detection module

emotionogens – 18;
vulgarisms – 2;
sexism – 2

14 of 19

Machine learning methods

Classic methods: Naive Bayes approach
Deep learning: fastText (Word2Vec with subword embeddings, continuous bag of words)
LLM prompting: Llama 3.2 with a clear instruction (e.g., “Classify the following sentence as 'toxic' or 'non-toxic'”)
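
To make the classical baseline concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing, written in plain Python. It is an illustration under stated assumptions, not the authors' code: the class name, the tokenizer, and the four English training sentences are hypothetical, and the slides do not say which Naive Bayes variant or features were used.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts (illustrative only)."""

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in tokenize(text):
                if w in self.vocab:                # ignore unseen words
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Hypothetical toy data, far smaller than the training sets on the slides.
    texts = [
        "you are a stupid fool",
        "what a stupid idea you fool",
        "thank you for the kind help",
        "have a nice day friend",
    ]
    labels = ["toxic", "toxic", "non-toxic", "non-toxic"]
    model = NaiveBayes().fit(texts, labels)
    print(model.predict("stupid fool"))
```

Working in log space avoids numeric underflow on long documents, and the smoothing term keeps a word seen in only one class from driving the other class's probability to zero.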

15 of 19

Machine Learning experiments

Data

Dataset	Sample	Documents	Words	Toxic documents
A	Training 1	600	94,905	192
B	Training 2	11,387	1.8 million	2,155
A, B	Test	68	10,800	34

Results

Technique	F1	Precision	Recall
Classical	71.2	60.3	85.1
Deep Learning	79.4	82.3	76.9
Prompting	75.3	74.3	76.5
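
For readers less familiar with the three metrics in the results table, the sketch below shows how precision, recall, and F1 are conventionally derived from confusion counts for the toxic class. The slides give neither the confusion counts nor the averaging scheme behind the reported figures, so the counts used here are hypothetical and are not expected to reproduce any row of the table.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from confusion counts.

    tp: toxic documents correctly flagged
    fp: non-toxic documents wrongly flagged
    fn: toxic documents that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts for a test set with 34 toxic documents (tp + fn = 34).
    p, r, f = precision_recall_f1(tp=26, fp=6, fn=8)
    print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Because F1 is a harmonic mean, it stays close to the lower of the two values, which is why a technique with high recall but modest precision can still trail on F1.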

16 of 19

Web interface

Your choice: General toxicity

New task

Text:

Results

Check out!

Look, katsaps.

you still have a few underdogs grazing on the channel, so look, tell your friends who did it.

What is in the photo is the consequence.

What is in the video in the post above is a consequence.

The cause is the effect.

As soon as this crap in the photo stops flying to our cities and killing our people, it will stop flying to you too.

Before you started bombing Khikhlov, you didn't get any flying?

17 of 19

Synergy?

Scatter plot with the Pearson correlation coefficient for all 520 collected customer texts

Scatter plot with the Pearson correlation coefficient for the 329 customer texts on which the two methods gave similar results (toxic / low-toxicity / neutral)
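
Assuming the coefficient shown on these plots is Pearson's correlation between the scores that the two modules assign to the same texts, it could be computed as in the sketch below. The function name and the five score pairs are hypothetical; the 520 collected texts and their scores are not available from the slides.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-text scores from the two modules (one pair per text).
    rule_based = [0.10, 0.40, 0.35, 0.80, 0.60]
    ml_based = [0.20, 0.50, 0.30, 0.90, 0.70]
    print(f"r = {pearson_r(rule_based, ml_based):.3f}")
```

A value near 1 would indicate that the rule-based and machine learning modules rank texts similarly, while a lower value would point to the kind of discrepancy that motivates comparing the two.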

18 of 19

Conclusions

Rule-based and machine learning approaches both provide valuable insights into text toxicity
The rule-based approach excels in providing a detailed linguistic analysis of toxic vocabulary based on a lexicographic database, making it suitable for in-depth expert analysis
Machine learning techniques provide a scalable solution for handling large volumes of text, offering an efficient and automated method of classifying toxic content
Certain discrepancies in toxicity assessment underline the need for further refinement in both modules to improve the overall system accuracy and reliability
Further research includes exploring more LLMs, improving toxicity index estimation, chunking lengthy texts, developing a hybrid model, and enhancing the UI (more intuitive design, heatmaps, graphs, etc.)

19 of 19

Thank you!

Automatic analysis of Ukrainian-language text

• text attribution

• setting the toxicity index of the text

• linguistic examination of the text

https://ta.mova.info