FORUM FOR INFORMATION RETRIEVAL EVALUATION, 2023
Multilingual Hate Speech and Offensive Language Detection of Low Resource Languages
CEUR Workshop Proceedings (CEUR-WS.org)
Abhinav Reddy Gutha
Indian Institute of Technology, Goa
December 17th at 9:45 AM,
Goa Business School, Goa University, Panjim
Agenda
1. Introduction
2. About Us
3. Track Description
4. Data Description
5. Preprocessing
6. Methodologies
7. Results
8. Conclusion
About Us
Team Code Fellas
Sai Adarsh
Ananya Alekar
Dinesh Reddy
Abhinav Reddy
Dr. Clint Pazhayidam George
Assistant Professor at IIT Goa
Hate Speech Detection
Social media enables users to share opinions anonymously, contributing to the misuse of platforms for spreading hate. This has led to an increase in hate crimes and offensive content on platforms like Twitter, Facebook, and Reddit. The repercussions extend beyond individual users, affecting the broader public with rising cases of mental health issues. Consequently, there is a pressing need for effective hate speech detection.
Bodo
Assamese
Bengali
Our Task
We decided to focus on Task 4, which involves identifying hate speech in Bengali, Bodo, and Assamese. This is a binary classification task: each dataset (one per language) comprises sentences labelled as hate or offensive (HOF) or not hate (NOT). The data primarily originates from comments on platforms like Twitter, Facebook, and YouTube.
Dataset Information
Language | HOF | NOT | Total |
Assamese | 2347 | 1689 | 4036 |
Bengali | 766 | 515 | 1281 |
Bodo | 998 | 681 | 1679 |
Preprocessing Techniques Applied
We applied only minimal preprocessing, since aggressive preprocessing can strip away useful information. The techniques we applied throughout the process are listed below.
Removing usernames with regex.
Removing all the URLs and numerics using regex.
Shortening elongated words to their normal form, e.g. "hellooooooo" is converted to "hello".
Removing all newline characters from text.
Tokenizing tweets and padding sequences to convert text to tensors.
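The regex-based cleanup steps above can be sketched as follows. This is a minimal illustration, not the team's exact code; the function name and patterns are assumptions, and the repeated-character rule is a simple heuristic that collapses three or more repeats into one.

```python
import re

def preprocess(text: str) -> str:
    """Minimal cleanup mirroring the listed steps (illustrative sketch)."""
    text = re.sub(r"@\w+", "", text)                    # remove @usernames
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # remove URLs
    text = re.sub(r"\d+", "", text)                     # remove numerics
    text = re.sub(r"(.)\1{2,}", r"\1", text)            # hellooooooo -> hello
    text = text.replace("\n", " ")                      # drop newline characters
    return re.sub(r"\s+", " ", text).strip()            # normalize whitespace
```

Tokenization and padding (the last listed step) would follow this cleanup, turning each cleaned sentence into a fixed-length tensor of token ids.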
Our Methodologies
We progressed from basic machine learning models to complex deep learning and transformer-based architectures:
1. Machine translation, converting all languages into a single language.
2. Machine-learning models such as Logistic Regression, SVM, and Decision Tree classifiers.
3. LSTM/BiLSTM models combined with 1-D CNN layers for the deep-learning approach.
4. Transformer-based approach using pretrained BERT-based models.
Machine Translation
One technique we tried during the preprocessing stage was machine translation, but translation errors can harm the overall outcome of the model. The figure shown here is one example that could lead to a false interpretation by the model, so we decided not to pursue this approach.
Machine Learning Models
After applying vector embeddings, we trained the following models. The F1 scores we obtained were in the range 0.58-0.63. We also used an LSA-based approach to capture the underlying meaning of words and documents.
Logistic Regression
Support Vector Machines (SVM)
Decision Tree Classifier
XGBoost Classifier
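A minimal sketch of this baseline pipeline is shown below. The TF-IDF featurization and the toy sentences are assumptions for illustration; the slide only says "vector embeddings", and the real datasets are Assamese, Bengali, and Bodo comments, not English.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy placeholder data: 1 = HOF (hate/offensive), 0 = NOT
texts = ["you are awful", "have a nice day", "awful hateful stuff", "nice work"]
labels = [1, 0, 1, 0]

# TF-IDF features feeding a linear classifier; SVM, Decision Tree,
# or XGBoost could be swapped in for LogisticRegression here.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
```

In practice the pipeline would be fit on the training split of each language's dataset and scored with macro F1 on the held-out split.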
Deep Learning Architectures
After the ML-based models, our research turned to deep learning. After hyperparameter tuning and training the neural networks on the data, our F1 scores improved for all three languages: approximately 0.65-0.67 for Assamese, 0.58-0.62 for Bengali, and 0.80-0.83 for Bodo.
LSTM
LSTM with CNN-1D
BiLSTM
BiLSTM with CNN-1D
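The strongest of these variants, BiLSTM with CNN-1D, can be sketched as below. This is an illustrative PyTorch reconstruction under assumed sizes (vocabulary, embedding and hidden dimensions, kernel width), not the team's actual architecture or framework.

```python
import torch
import torch.nn as nn

class BiLSTMCNN(nn.Module):
    """Illustrative BiLSTM + 1-D CNN binary classifier; sizes are assumptions."""
    def __init__(self, vocab_size=5000, embed_dim=100, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.conv = nn.Conv1d(2 * hidden, 64, kernel_size=3, padding=1)
        self.fc = nn.Linear(64, 1)  # single logit: HOF vs NOT

    def forward(self, x):                    # x: (batch, seq_len) token ids
        h, _ = self.lstm(self.embed(x))      # (batch, seq_len, 2*hidden)
        h = torch.relu(self.conv(h.transpose(1, 2)))  # (batch, 64, seq_len)
        h = h.max(dim=2).values              # global max-pool over time
        return self.fc(h).squeeze(-1)        # (batch,) logits

model = BiLSTMCNN()
logits = model(torch.randint(1, 5000, (2, 20)))  # forward pass on dummy ids
```

Dropping the conv layer gives the plain BiLSTM variant, and setting `bidirectional=False` gives the LSTM variants listed above.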
Transformer Based Approach
Finally, our research turned to pretrained BERT-based embeddings. After hyperparameter tuning and fine-tuning these pretrained models on the data, our F1 scores improved further for Assamese and Bengali, with Assamese around 0.69 for some models and Bengali between 0.64 and 0.71.
BERT Base Multilingual Uncased
Assamese BERT, Bengali BERT
XLM-RoBERTa
DistilBERT
IndicBERT
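A hedged sketch of how such a pretrained checkpoint would be fine-tuned with the Hugging Face `transformers` Trainer API is shown below. The checkpoint name matches one model from the list above, but the hyperparameters (epochs, batch size, learning rate) are assumed placeholders, not the team's reported settings, and no training is run here.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT = "bert-base-multilingual-uncased"  # one of the models listed above

def build_trainer(train_dataset, eval_dataset):
    """Assemble a binary-classification fine-tuning run (hyperparameters assumed)."""
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForSequenceClassification.from_pretrained(
        CHECKPOINT, num_labels=2)  # HOF vs NOT
    args = TrainingArguments(output_dir="out",
                             num_train_epochs=3,
                             per_device_train_batch_size=16,
                             learning_rate=2e-5)
    return Trainer(model=model, args=args, train_dataset=train_dataset,
                   eval_dataset=eval_dataset, tokenizer=tokenizer)
```

Swapping `CHECKPOINT` for an Assamese/Bengali BERT, XLM-RoBERTa, DistilBERT, or IndicBERT checkpoint reuses the same recipe.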
Assamese: IndicBERT (BERT-based)
Bengali: Bengali MuRIL (BERT-based)
Bodo: BiLSTM (neural-network-based)
Our Results
Observations
1. Our research highlights the advantage of using specialized BERT models pre-trained on languages like Assamese and Bengali. These models, tailored to linguistic nuances, significantly enhance performance for these languages in the northeastern region of India, emphasizing the importance of language-specific pre-training in NLP tasks.
2. Bodo, a low-resource language in India, poses unique challenges as it lacks dedicated pre-trained models unlike Assamese and Bengali. In response, we favor neural network-based approaches for Bodo, outperforming BERT models in this context. While adapting existing BERT models to the Devanagari script is possible, our results suggest that neural network-based methods achieve better performance.
3. In addition to the models above, we are experimenting with larger transformer models in the BERT family. However, the relatively small dataset size leads to overfitting during training, which limits how effectively these models can be used.
Thank You !!