1 of 19

XI INTERNATIONAL CONFERENCE

“INFORMATION TECHNOLOGY AND IMPLEMENTATION” (IT&I-2024)

Toxicity Detection for Ukrainian-Language Texts in the TextAttributor System

Nataliia Darchuk, Oksana Zuban, Valentyna Robeiko, Yuliia Tsyhvintseva, Mykola Sazhok

2 of 19

Outline

  • What toxicity is and why it matters
  • A lexicon-based methodology
  • Machine learning methods
  • Experiments, implementation, integration, comparisons
  • Conclusions and future work

3 of 19

Toxic text

  • Contains indications of aggressive communication (harassment, threats, obscenities, cyberbullying, trolling, outrage, and identity-based hate speech).
  • Verbalizes negative facts, emotions, and assessments. These destructive emotion-generating words cause the recipient to experience anxiety, fear, confusion, shame, guilt, oppression, and control.
  • Both computer-aided and automatic determination of text tonality are crucial, particularly in wartime, for:
      • detecting destructive and other harmful information injected into the conscious or subconscious mind, which leads to an inadequate perception of reality
      • preventing negative impacts on an individual's psychological state and on public consciousness
      • avoiding infringement upon the rights and legitimate interests of users, society, and the state

4 of 19

A lexicon-based methodology

Three lexicographic lists of toxic words are developed based on:

  • a text sample of approximately two million words, comprising texts from blogs, news sites, online publications, comments on online publications from social networks, and so forth;
  • a database of semantic taxa, compiled from Ukrainian journalistic texts totaling 40,000 words;
  • a tonality dictionary of Ukrainian vocabulary compiled by O. Tolochko.

5 of 19

A lexicon-based methodology

1. The "Dictionary of Emotionogens":

over 5,000 lexemes that verbalize negative facts, emotions, and assessments, causing the recipient to experience anxiety, fear, confusion, shame, guilt, oppression, and control.

6 of 19

A lexicon-based methodology

2. The "Hate Speech Dictionary":

3,000 lexemes that verbalize aggressive communication, including harassment, threats, obscenities, cyberbullying, trolling, outrage, and identity-based hate speech.

7 of 19

A lexicon-based methodology

Hate Speech Dictionary

8 of 19

A lexicon-based methodology

3. The "Dictionary of Toxic Compounds":

1,500 stable phrases that idiomatically reflect toxic sentiment. The items in the list are classified by 26 semantic features.

9 of 19

Implementation within the TextAttributor system

A lexicon-based method for toxicity detection in Ukrainian-language texts was implemented within the TextAttributor 1.0 system as a rule-based module.

10 of 19

Implementation within the TextAttributor system

Automatic analysis of the Ukrainian-language text

• text attribution

• setting the toxicity index of the text

• linguistic examination of the text

11 of 19

Rule-Based Toxicity Detection Module

The text toxicity index and the statistical map of the linguistic expert report on text toxicity.

  • emotionogens – 18;
  • vulgarisms – 2;
  • negative names for a person by comparison with mythical creatures – 1;
  • negative names for a person based on intellectual ability – 2;
  • negative names for a person based on health characteristics – 1;
  • negative names for a person (sarcasm, idiomatic expression) – 1;
  • negative names for a person based on body parts or physiological processes – 1
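The per-category counts above can be illustrated with a minimal sketch of lexicon lookup. Everything in it is an assumption made for illustration: the tiny English lexicon, the function names, and in particular the `toxicity_index` formula (share of tokens found in the lexicon), since the slides do not state how TextAttributor computes its index. A real module for Ukrainian would also need lemmatization and matching of multi-word compounds.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon (word -> semantic category); placeholder English
# entries stand in for the Ukrainian dictionaries described in the slides.
TOXIC_LEXICON = {
    "fear": "emotionogens",
    "threat": "emotionogens",
    "crap": "vulgarisms",
    "idiot": "negative names for a person based on intellectual ability",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def category_counts(text, lexicon=TOXIC_LEXICON):
    """Count lexicon hits per semantic category (the 'statistical map')."""
    return Counter(lexicon[tok] for tok in tokenize(text) if tok in lexicon)


def toxicity_index(text, lexicon=TOXIC_LEXICON):
    """Assumed index: fraction of tokens that appear in the lexicon."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok in lexicon) / len(tokens)


sample = "You idiot, this crap spreads fear and fear"
for category, count in category_counts(sample).items():
    print(f"{category} - {count}")
print(f"toxicity index: {toxicity_index(sample):.2f}")
```

Keeping the category label next to each lexicon entry is what allows the report to list counts per semantic class rather than a single score.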

12 of 19

Rule-Based Toxicity Detection Module

Text fragment analyzed in the rule-based toxicity detection module

13 of 19

Rule-Based Toxicity Detection Module

Text fragment analyzed in the rule-based toxicity detection module

  • emotionogens – 18;
  • vulgarisms – 2;
  • sexism – 2

14 of 19

Machine learning methods

  • Classic methods: Naive Bayes approach
  • Deep learning: fastText (Word2Vec with subword embeddings, continuous bag of words)
  • LLM prompting: Llama 3.2 with a clear instruction (for example, “Classify the following sentence as 'toxic' or 'non-toxic'”)
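As an illustration of the classical approach listed above, here is a minimal multinomial Naive Bayes sketch with Laplace smoothing, written with the standard library only. The class name, the toy English training sentences, and the smoothing value are assumptions for illustration; the slides do not describe the actual feature set or implementation used in the experiments.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.doc_counts = Counter()              # documents per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            denom = (sum(self.word_counts[label].values())
                     + self.alpha * len(self.vocab))
            for tok in tokenize(text):
                if tok in self.vocab:  # tokens unseen in training are skipped
                    count = self.word_counts[label][tok]
                    score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for this example only.
train_texts = [
    "you are a stupid idiot",
    "shut up you idiot",
    "thank you for the helpful report",
    "the weather is nice today",
]
train_labels = ["toxic", "toxic", "non-toxic", "non-toxic"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("what an idiot"))
print(model.predict("thank you for the report"))
```

Working in log space avoids numeric underflow on long documents, and the smoothing term keeps a word seen in only one class from zeroing out the other class's score.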

15 of 19

Machine Learning experiments

Data

Datasets  Samples     Documents  Words     Toxic Documents
A         Training 1  600        94,905    192
B         Training 2  11,387     1.8 mln   2,155
A, B      Test        68         10,800    34

Results

Technique      F1 (%)  Precision (%)  Recall (%)
Classical      71.2    60.3           85.1
Deep Learning  79.4    82.3           76.9
Prompting      75.3    74.3           76.5
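For readers comparing the three techniques, the sketch below shows how F1 relates to precision and recall as their harmonic mean. The function name is an assumption, and the slides do not say how the reported F1 values were averaged (for example, per class or overall), so recomputed values may differ slightly from the reported ones.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the results table, in percent.
reported = {
    "Classical": (60.3, 85.1),
    "Deep Learning": (82.3, 76.9),
    "Prompting": (74.3, 76.5),
}

for technique, (precision, recall) in reported.items():
    print(f"{technique}: F1 = {f1_score(precision, recall):.1f}")
```

Because the harmonic mean is pulled toward the lower of the two inputs, the classical method's high recall does not make up for its lower precision.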

16 of 19

Web-interface

Interface labels: "Your choice: General toxicity", "New task", "Text:", "Results", "Check out!"

Sample input text:

Look, katsaps.

you still have a few underdogs grazing on the channel, so look, tell your friends who did it.

What is in the photo is the consequence.

What is in the video in the post above is a consequence.

The cause is the effect.

As soon as this crap in the photo stops flying to our cities and killing our people, it will stop flying to you too.

Before you started bombing Khikhlov, you didn't get any flying?

17 of 19

Synergy?

Scatter plot with Pearson's correlation coefficient for all 520 collected customer texts

Scatter plot with Pearson's correlation coefficient for the 329 customer texts on which the two methods gave similar results in classifying a text as toxic, low-toxicity, or neutral
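Agreement between the two modules, as plotted above, can be quantified with the Pearson correlation coefficient of their per-text toxicity scores. The sketch below is illustrative only: the function name and the score lists are hypothetical, not data from the 520 collected texts.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-text toxicity scores from the two modules.
rule_based_scores = [0.10, 0.40, 0.35, 0.80]
ml_scores = [0.20, 0.50, 0.30, 0.90]

print(f"Pearson r = {pearson_r(rule_based_scores, ml_scores):.3f}")
```

A value near 1 would mean the two modules rank texts similarly. Lower values point to texts where they disagree, and those are the natural candidates for manual review.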

18 of 19

Conclusions

  • Rule-based and machine learning approaches both provide valuable insights into text toxicity
  • The rule-based approach excels in providing a detailed linguistic analysis of toxic vocabulary based on a lexicographic database, making it suitable for in-depth expert analysis
  • Machine learning techniques provide a scalable solution for handling large volumes of text, offering an efficient and automated method of classifying toxic content
  • Certain discrepancies in toxicity assessment underline the need for further refinement in both modules to improve the overall system accuracy and reliability
  • Further research includes exploring more LLMs, improving toxicity index estimation, chunking lengthy texts, developing a hybrid model, and enhancing the UI (more intuitive design, heatmaps, graphs, etc.)

19 of 19

Thank you!
