1 of 15

FORUM FOR INFORMATION RETRIEVAL EVALUATION, 2023

Multilingual Hate Speech and Offensive Language Detection of Low Resource Languages

CEUR Workshop Proceedings (CEUR-WS.org)

Abhinav Reddy Gutha

Indian Institute of Technology, Goa

December 17th at 9:45 AM,

Goa Business School, Goa University, Panjim

2 of 15

Agenda

1. Introduction

2. About Us

3. Track Description

4. Data Description

5. Preprocessing

6. Methodologies

7. Results

8. Conclusion

3 of 15

About Us

Team Code Fellas

Sai Adarsh

Ananya Alekar

Dinesh Reddy

Abhinav Reddy

Dr. Clint Pazhayidam George

Assistant Professor at IIT Goa

4 of 15

Hate Speech Detection

Social media enables users to share opinions anonymously, contributing to the misuse of platforms for spreading hate. This has led to an increase in hate crimes and offensive content on platforms like Twitter, Facebook, and Reddit. The repercussions extend beyond individual users, affecting the broader public with rising cases of mental health issues. Consequently, there is a pressing need for effective hate speech detection.

5 of 15

Our Task

We chose Task 4: identifying hate speech in Bengali, Bodo, and Assamese. This is a binary classification task, where each language's dataset comprises sentences labeled hate or offensive (HOF) or not hate (NOT). The data primarily originates from comments on platforms such as Twitter, Facebook, and YouTube.

6 of 15

Dataset Information

Language    HOF    NOT   Total

Assamese   2347   1689    4036

Bengali     766    515    1281

Bodo        998    681    1679

7 of 15

Preprocessing Techniques Applied

During our research, we applied only minimal preprocessing, since extensive preprocessing can strip away useful information. The techniques we applied are listed below.

Removing usernames with regex.

Removing all URLs and numerics using regex.

Shortening elongated words to their normal form, e.g., "hellooooooo" becomes "hello".

Removing all newline characters from the text.

Tokenizing tweets and padding sequences to convert text to tensors.
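A minimal sketch of the cleaning steps above, assuming Python's `re` module; the exact patterns we used are not shown on the slide, so these regexes are illustrative:

```python
import re

def preprocess(text: str) -> str:
    # Remove @usernames
    text = re.sub(r"@\w+", "", text)
    # Remove URLs
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Remove numerics
    text = re.sub(r"\d+", "", text)
    # Shorten elongated words: collapse any character repeated 3+ times
    # to a single occurrence, e.g. "hellooooooo" -> "hello"
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Remove newlines and collapse extra whitespace
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(preprocess("@user check https://x.com hellooooooo 123\nbye"))
# -> "check hello bye"
```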

8 of 15

Our Methodologies

We progressed from basic machine learning models to complex deep learning and transformer-based architectures:

1. Machine translation, converting all languages into a single language.

2. Machine learning models such as Logistic Regression, SVM, and Decision Tree classifiers.

3. LSTM/BiLSTM models combined with CNN-1D layers for the deep learning approach.

4. A transformer-based approach using pretrained BERT-based models.

9 of 15

Machine Translation

One technique we tried during the preprocessing stage was machine translation, but it can degrade the model's overall performance. The figure shown here is one such example that could lead the model to a false interpretation, so we decided against this approach.

10 of 15

Machine Learning Models

After computing vector embeddings, we applied the following models, obtaining F1 scores in the range of 0.58 to 0.63. We also used an LSA-based approach to capture the underlying meaning of words and documents.

Logistic Regression

Support Vector Machines (SVM)

Decision Tree Classifier

XGBoost Classifier
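A hypothetical illustration of this classical pipeline, assuming scikit-learn with TF-IDF features and a Logistic Regression baseline; the tiny toy corpus and labels here are made up for demonstration:

```python
# Toy sketch: TF-IDF features feeding one of the classical models listed above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative data only; the real datasets are the FIRE shared-task corpora.
texts = ["you are awful", "have a nice day", "I hate this", "what a lovely post"]
labels = ["HOF", "NOT", "HOF", "NOT"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["such a lovely day"]))
```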

11 of 15

Deep Learning Architectures

After the machine learning models, our research turned to deep learning. After hyperparameter tuning and training the neural networks on the data, F1 scores improved for all three languages: approximately 0.65 to 0.67 for Assamese, 0.58 to 0.62 for Bengali, and 0.80 to 0.83 for Bodo.

LSTM

LSTM with CNN-1D

BiLSTM

BiLSTM with CNN-1D
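A minimal sketch of the BiLSTM-with-CNN-1D variant, assuming PyTorch; the hyperparameters (vocabulary size, embedding dimension, hidden size) are illustrative guesses, not the values we used:

```python
import torch
import torch.nn as nn

class BiLSTMCNN(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=128, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, 64, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)  # single logit for HOF vs. NOT

    def forward(self, x):                             # x: (batch, seq_len) token ids
        e = self.embed(x).transpose(1, 2)             # (batch, embed_dim, seq_len)
        c = torch.relu(self.conv(e)).transpose(1, 2)  # (batch, seq_len, 64)
        _, (h, _) = self.lstm(c)                      # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=1)            # concatenate both directions
        return self.fc(h)                             # (batch, 1)

logits = BiLSTMCNN()(torch.randint(1, 20000, (4, 32)))
print(logits.shape)  # torch.Size([4, 1])
```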

12 of 15

Transformer Based Approach

Finally, our research turned to pretrained BERT-based models. After hyperparameter tuning and fine-tuning these pretrained models on the data, F1 scores improved for Assamese and Bengali, with Assamese reaching around 0.69 for some models and Bengali ranging from 0.64 to 0.71.

BERT Base Multilingual Uncased

Assamese BERT, Bengali BERT

XLM-RoBERTa

DistilBERT

IndicBERT
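A pseudocode-style outline (not runnable as-is: the checkpoint name and the datasets are placeholders) of how a pretrained BERT-family model is typically fine-tuned for this binary task with the Hugging Face transformers Trainer API:

```python
# Outline only; "checkpoint" stands for any of the models listed above.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "..."  # a multilingual or language-specific BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

# train_ds / eval_ds: tokenized datasets with HOF = 1, NOT = 0 (placeholders)
args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```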

13 of 15

Our Results

Assamese: IndicBERT (BERT-based)

Bengali: Bengali MuRIL (BERT-based)

Bodo: BiLSTM (neural network-based)

14 of 15

Observations

1. Our research highlights the advantage of using specialized BERT models pre-trained on languages like Assamese and Bengali. These models, tailored to linguistic nuances, significantly enhance performance for these languages in the northeastern region of India, emphasizing the importance of language-specific pre-training in NLP tasks.

2. Bodo, a low-resource language of India, poses unique challenges: unlike Assamese and Bengali, it lacks dedicated pre-trained models. In response, we favored neural network-based approaches for Bodo, which outperformed BERT models in this context. While adapting existing BERT models to the Devanagari script is possible, our results suggest that neural network-based methods achieve better performance.

3. In addition to the provided models, we are experimenting with large transformer models within the BERT family. However, the relatively small dataset size leads to overfitting during training, posing a challenge in utilizing these models effectively.

15 of 15

Thank You!