XI INTERNATIONAL CONFERENCE
“INFORMATION TECHNOLOGY AND IMPLEMENTATION” (IT&I-2024)
Toxicity Detection for Ukrainian-Language Texts in the TextAttributor System
Nataliia Darchuk, Oksana Zuban, Valentyna Robeiko, Yuliia Tsyhvintseva, Mykola Sazhok
Outline
Toxic text
A lexicon-based methodology
Three lexicographic lists of toxic words are developed based on:
1. The "Dictionary of Emotionogens":
over 5,000 lexemes that verbalize negative facts, emotions, and assessments, causing the recipient to experience anxiety, fear, confusion, shame, guilt, oppression, and control.
2. The "Hate Speech Dictionary":
3,000 lexemes that verbalize aggressive communication, including harassment, threats, obscenities, cyberbullying, trolling, outrage, and identity-based hate text.
Hate Speech Dictionary
3. The "Dictionary of Toxic Compounds":
1,500 stable phrases that idiomatically reflect toxic sentiment. The items in the list are classified by 26 semantic features.
Implementation within the TextAttributor system
A lexicon-based method for detecting toxicity in Ukrainian-language texts was implemented in the TextAttributor 1.0 system as a rule-based module.
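The lexicon lookup behind such a rule-based module can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the three tiny word lists stand in for the real dictionaries, the names `count_matches` and `toxic_share` are hypothetical, and the share of matched items is only a stand-in for the toxicity index, whose actual formula is not given in these slides. A real module would also lemmatize the text before lookup.

```python
import re
from collections import Counter

# Hypothetical toy lexicons; the real dictionaries hold thousands of entries.
EMOTIONOGENS = {"страх", "тривога"}            # "fear", "anxiety"
HATE_SPEECH = {"покидьок"}                     # a mild insult, for illustration
TOXIC_COMPOUNDS = {("промивати", "мізки")}     # "to brainwash" (stable phrase)


def tokenize(text):
    """Lowercase the text and split it into word tokens (no lemmatization)."""
    return re.findall(r"\w+", text.lower())


def count_matches(text):
    """Return the tokens and the number of hits per lexicon."""
    tokens = tokenize(text)
    counts = Counter()
    for token in tokens:
        if token in EMOTIONOGENS:
            counts["emotionogen"] += 1
        if token in HATE_SPEECH:
            counts["hate_speech"] += 1
    # Multi-word entries are matched as contiguous token sequences.
    for phrase in TOXIC_COMPOUNDS:
        n = len(phrase)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == phrase:
                counts["toxic_compound"] += 1
    return tokens, counts


def toxic_share(text):
    """Assumed stand-in score: lexicon hits divided by the number of tokens."""
    tokens, counts = count_matches(text)
    if not tokens:
        return 0.0
    return sum(counts.values()) / len(tokens)
```

Calling `count_matches` on a text returns per-dictionary counts, which is the kind of data a statistical map of toxic items could be built from.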
Automatic analysis of the Ukrainian-language text
• text attribution
• setting the toxicity index of the text
• linguistic examination of the text
Rule-Based Toxicity Detection Module
The text toxicity index and the statistical map of the linguistic expert report on text toxicity.
Text fragment analyzed in the rule-based toxicity detection module
Machine learning methods
Machine Learning experiments
Data
Dataset | Sample | Documents | Words | Toxic documents |
A | Training 1 | 600 | 94,905 | 192 |
B | Training 2 | 11,387 | 1.8 million | 2,155 |
A, B | Test | 68 | 10,800 | 34 |
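The class balance of each sample follows directly from the document counts in the table. The short sketch below computes the share of toxic documents; the helper name `toxic_percentage` is hypothetical, and the numbers are copied from the table.

```python
# (documents, toxic documents) per sample, copied from the table above.
SAMPLES = {
    "Training 1 (A)": (600, 192),
    "Training 2 (B)": (11387, 2155),
    "Test (A, B)": (68, 34),
}


def toxic_percentage(documents, toxic):
    """Share of toxic documents in a sample, in percent."""
    return 100.0 * toxic / documents


for name, (documents, toxic) in SAMPLES.items():
    print(f"{name}: {toxic_percentage(documents, toxic):.1f}% toxic")
```

By this count the test sample is balanced (half of its documents are toxic), while both training samples contain a minority of toxic documents.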
Technique | F1 | Precision | Recall |
Classical | 71.2 | 60.3 | 85.1 |
Deep Learning | 79.4 | 82.3 | 76.9 |
Prompting | 75.3 | 74.3 | 76.5 |
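For reference, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values in the table, assuming they are percentages; the function name `f1_score` is hypothetical. The recomputed values need not match the reported F1 column exactly, since the slides do not state how the scores were averaged or rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
TECHNIQUES = {
    "Classical": (60.3, 85.1),
    "Deep Learning": (82.3, 76.9),
    "Prompting": (74.3, 76.5),
}

for name, (precision, recall) in TECHNIQUES.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.1f}")
```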
Results
Web-interface
Your choice: General toxicity
New task
Text:
Results
Check out!
Look, katsaps.
you still have a few underdogs grazing on the channel, so look, tell your friends who did it.
What is in the photo is the consequence.
What is in the video in the post above is a consequence.
The cause is the effect.
As soon as this crap in the photo stops flying to our cities and killing our people, it will stop flying to you too.
Before you started bombing Khikhlov, you didn't get any flying?
Synergy?
Scatter plot with Pearson's coefficient of variation for all 520 collected customer texts
Scatter plot with Pearson's coefficient of variation for the 329 customer texts on which the two methods gave similar results when classifying a text as toxic, low-toxic, or neutral
Conclusions
Thank you!