
Unmasking Sarcasm: Exploring mBERT+CNN and LSTM Models for Sarcasm Identification in Code-Mixed Tamil and Malayalam Texts

Conclusion and Future Work

Among the proposed models, the mBERT+CNN model outperformed the LSTM model, with macro F1 scores of 0.74 and 0.72 on the Tamil-English and Malayalam-English datasets, respectively
Advanced text augmentation techniques and context-aware models will be explored in future work

On social media platforms, users express their thoughts and emotions in diverse ways, including sentiments, sarcasm, signs of depression, and expressions of hatred. Sarcasm is a form of expression in which the intended meaning is opposite to the literal meaning of the words used, often to convey mockery or irony. This can significantly alter the perceived sentiment of a message, making it challenging for Natural Language Processing (NLP) tasks such as opinion and sentiment analysis. Detecting sarcasm is crucial because it helps in accurately interpreting user sentiments and can enhance the effectiveness of automated systems in processing and responding to user-generated content. In this direction, “Sarcasm Identification of Dravidian Languages in DravidianCodeMix@FIRE-2024”, a shared task organized at the Forum for Information Retrieval Evaluation (FIRE) 2024, invites the research community to address the challenges of sarcasm detection in code-mixed Dravidian languages (Tamil-English and Malayalam-English). In this paper, we, team MUCS, describe the models proposed for the shared task. Two distinct models are proposed for sarcasm identification: i) a Long Short-Term Memory (LSTM) model trained using Keras embeddings, and ii) an mBERT+CNN model that combines Transfer Learning (TL), by fine-tuning Multilingual Bidirectional Encoder Representations from Transformers (mBERT), with a Deep Learning (DL) classifier, a Convolutional Neural Network (CNN). Further, to overcome the class imbalance in the datasets, text augmentation is explored using the contextual word embedding augmenter from the Natural Language Processing Augmentation (NLPAug) library to increase the number of samples in the minority class. Among the proposed models, the mBERT+CNN model outperformed the LSTM model, with macro F1 scores of 0.74 and 0.72 on the Tamil-English and Malayalam-English datasets, respectively.

Abstract

Social networking sites have become a great source of user-generated textual content that lends itself to engagement via likes, shares, comments, and discussion
This textual content spans various topics such as hate speech, hope speech, and fake news, and sarcasm is one among them
Sarcastic comments often draw a contrast between the literal words expressed and their intended implication
Sarcasm detection in low-resource Dravidian languages like Tamil and Malayalam is made harder by the complexity of their linguistic structures, the rich variety of expressions used to convey sarcasm, and the unavailability of annotated data in these languages
Sample texts with their corresponding labels from the given code-mixed Tamil-English and Malayalam-English datasets are shown in Table 1

Introduction

Methodology

Statistics of the code-mixed Tamil-English and Malayalam-English datasets are shown in Table 2

Table 1: Samples of code-mixed Tamil-English and Malayalam-English comments along with their English translations and labels

Figure 1: Framework of the proposed LSTM and mBERT+CNN models

Table 2: Statistics of code-mixed Tamil-English and Malayalam-English dataset

Natural Language Processing Augmentation (NLPAug) - a Python library designed to facilitate text augmentation for NLP tasks
The contextual word embedding augmenter is used to augment the ‘Sarcastic’ class samples with the ‘insert’ option
This technique uses multilingual BERT (mBERT), a pretrained language model
In the Tamil-English and Malayalam-English datasets, the sarcastic samples were increased to 21,740 and 10,689 samples respectively
Examples of Tamil-English and Malayalam-English sample texts and their augmented versions produced by the NLPAug contextual word embedding augmenter are shown in Table 3
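The augmentation step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `make_insert_augmenter` and `oversample_minority`, the checkpoint name `bert-base-multilingual-cased`, and the random choice of which minority sample to augment are all assumptions. The NLPAug import is deferred so that the balancing logic can be tried with any text-to-text function.

```python
import random


def make_insert_augmenter(model_path="bert-base-multilingual-cased"):
    """Return a text -> text function backed by NLPAug (assumed checkpoint name)."""
    # Imported lazily: requires `pip install nlpaug transformers torch`.
    import nlpaug.augmenter.word as naw

    aug = naw.ContextualWordEmbsAug(model_path=model_path, action="insert")

    def augment(text):
        out = aug.augment(text)
        # Newer NLPAug releases return a list, older ones a string.
        return out[0] if isinstance(out, list) else out

    return augment


def oversample_minority(texts, labels, minority_label, target_count,
                        augment_fn, seed=0):
    """Append augmented minority samples until the class has target_count items."""
    rng = random.Random(seed)
    minority = [t for t, y in zip(texts, labels) if y == minority_label]
    if not minority:
        raise ValueError("no samples with the minority label")
    new_texts, new_labels = list(texts), list(labels)
    for _ in range(max(0, target_count - len(minority))):
        # Hypothetical policy: augment a randomly chosen minority sample.
        new_texts.append(augment_fn(rng.choice(minority)))
        new_labels.append(minority_label)
    return new_texts, new_labels
```

With NLPAug installed, `oversample_minority(texts, labels, "Sarcastic", target, make_insert_augmenter())` would grow the ‘Sarcastic’ class to `target` samples; the original samples are kept unchanged.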

Table 3: Sample text and augmented text in Tamil-English and Malayalam-English datasets

Pre-processing

User mentions (e.g., "@username"), numbers, URLs, and HTML tags are removed
Emojis are converted into their textual representations
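A minimal sketch of these cleaning steps is given below. The regular expressions, the order of the steps, and the use of the third-party `emoji` package for emoji conversion are assumptions; the poster does not state which tools were used. If `emoji` is not installed, the emoji step is skipped.

```python
import re

try:
    import emoji  # third-party; assumed choice for emoji-to-text conversion
except ImportError:
    emoji = None

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<[^>]+>")
MENTION_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\d+")


def preprocess(text):
    """Remove URLs, HTML tags, user mentions and numbers; spell out emojis."""
    text = URL_RE.sub(" ", text)      # URLs first, before other patterns split them
    text = HTML_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)  # mentions before numbers, so "@user123" goes whole
    text = NUMBER_RE.sub(" ", text)
    if emoji is not None:
        # e.g. a grinning-face emoji becomes the words of its emoji name
        text = emoji.demojize(text, delimiters=(" ", " "))
    return " ".join(text.split())     # collapse the whitespace left behind


print(preprocess("@user123 super movie <br> 100 days https://t.co/abc da"))
# super movie days da
```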

Text Augmentation

Model Building

The augmented dataset is pre-processed and used to construct two distinct models: i) LSTM and ii) mBERT+CNN.

LSTM - a Deep Learning Approach

The steps involved in implementing this approach are given below:

Text and Label Fusion
Feature Extraction - a Keras embedding layer converts raw text into numerical representations
Classifier Construction - LSTM classifier
Hyperparameters and their values used in the LSTM model are shown in Table 4
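The feature extraction and classifier steps above could look roughly like the sketch below. The whitespace tokenizer, the reserved ids, and every hyperparameter value (`embed_dim`, `lstm_units`, optimizer, loss) are placeholders chosen for illustration; the values actually used are those in Table 4. Keras is imported lazily inside `build_lstm_classifier`, so the encoding helpers run without TensorFlow.

```python
from collections import Counter

PAD, OOV = 0, 1  # reserved ids: padding and out-of-vocabulary (assumed convention)


def build_vocab(texts, max_words=None):
    """Map each whitespace token to an integer id, most frequent first."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common(max_words))}


def encode(text, vocab, max_len):
    """Turn text into a fixed-length list of ids, truncated or right-padded."""
    ids = [vocab.get(tok, OOV) for tok in text.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def build_lstm_classifier(vocab_size, embed_dim=100, lstm_units=64):
    """Embedding -> LSTM -> sigmoid output; all sizes are illustrative."""
    from tensorflow import keras  # lazy import: requires TensorFlow

    model = keras.Sequential([
        # mask_zero=True makes the LSTM ignore PAD positions
        keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim,
                               mask_zero=True),
        keras.layers.LSTM(lstm_units),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Here `vocab_size` should be `len(vocab) + 2` to account for the two reserved ids.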

mBERT+CNN model
mBERT - for feature extraction, CNN - for classification
Hyperparameters and their values used in mBERT+CNN model are shown in Table 5
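One common way to put a CNN classifier on top of mBERT is sketched below in PyTorch with Hugging Face `transformers`. This is an assumed architecture, not the authors' implementation: the checkpoint name, the kernel sizes, the number of filters, the dropout rate, and the max-over-time pooling are illustrative, and the real settings are those in Table 5. Heavy imports are deferred into the factory function.

```python
def pooled_feature_dim(kernel_sizes, num_filters):
    """Size of the vector after concatenating one max-pooled map per kernel size."""
    return len(kernel_sizes) * num_filters


def build_mbert_cnn(encoder_name="bert-base-multilingual-cased",
                    kernel_sizes=(2, 3, 4), num_filters=64,
                    num_classes=2, dropout=0.3):
    """Return an mBERT encoder followed by a 1D-CNN classification head."""
    # Lazy imports: require `pip install torch transformers`.
    import torch
    from torch import nn
    from transformers import AutoModel

    class MBertCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(encoder_name)
            hidden = self.encoder.config.hidden_size
            self.convs = nn.ModuleList(
                [nn.Conv1d(hidden, num_filters, k) for k in kernel_sizes]
            )
            self.dropout = nn.Dropout(dropout)
            self.out = nn.Linear(
                pooled_feature_dim(kernel_sizes, num_filters), num_classes
            )

        def forward(self, input_ids, attention_mask):
            # (batch, seq_len, hidden) token representations from mBERT
            states = self.encoder(
                input_ids=input_ids, attention_mask=attention_mask
            ).last_hidden_state
            x = states.transpose(1, 2)  # Conv1d expects (batch, channels, seq_len)
            # seq_len must be at least max(kernel_sizes)
            pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
            return self.out(self.dropout(torch.cat(pooled, dim=1)))

    return MBertCNN()
```

Because the encoder is an ordinary submodule, training the whole model end to end fine-tunes mBERT together with the CNN head.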

Table 5: Hyperparameters and their values used in mBERT+CNN model

Table 4: Hyperparameters and their values used in LSTM model

The framework of the proposed methodology is shown in Figure 1

Experiments and Results

Dataset: YouTube comments written in native as well as Roman scripts
The performances of the proposed models with and without augmentation on the Development and Test sets are shown in Tables 6 and 7, respectively
The comparison of macro F1 scores of all the participating teams for the sarcasm detection task in code-mixed Tamil-English and Malayalam-English text is shown in Figures 2(a) and 2(b), respectively
A few Malayalam-English samples misclassified by the mBERT+CNN model, with their actual and predicted labels and the probable reasons for misclassification, are shown in Table 8

Models	Languages
	Tamil-English						Malayalam-English
	Without Augmentation			With Augmentation			Without Augmentation			With Augmentation
	P	R	F1	P	R	F1	P	R	F1	P	R	F1
LSTM	0.47	0.50	0.21	0.63	0.50	0.21	0.59	0.50	0.16	0.59	0.50	0.16
mBERT+CNN	0.75	0.69	0.71	0.75	0.72	0.73	0.74	0.69	0.71	0.79	0.68	0.73
P: Precision; R: Recall; F1: Macro F1 score

Models	Languages
	Tamil-English						Malayalam-English
	Without Augmentation			With Augmentation			Without Augmentation			With Augmentation
	P	R	F1	P	R	F1	P	R	F1	P	R	F1
LSTM	0.14	0.50	0.21	0.60	0.59	0.45	0.59	0.50	0.15	0.55	0.51	0.19
mBERT+CNN	0.59	0.50	0.43	0.73	0.76	0.74	0.71	0.73	0.72	0.76	0.69	0.72
P: Precision; R: Recall; F1: Macro F1 score
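Macro-averaged scores are computed per class and then averaged, so the macro F1 is not the harmonic mean of the macro precision and macro recall shown in the same row. The sketch below uses toy labels (not the shared-task data) to show how a classifier that predicts a single class for every sample reaches a macro recall of 0.50 while its macro F1 stays far lower. The function name `macro_prf` is illustrative, and undefined ratios are counted as 0.

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0  # 0/0 treated as 0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example: 3 sarcastic and 7 non-sarcastic samples, all predicted sarcastic.
y_true = ["S"] * 3 + ["N"] * 7
y_pred = ["S"] * 10
p, r, f1 = macro_prf(y_true, y_pred)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.15 0.5 0.23
```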

Figure 2: Comparison of macro F1 scores of the participating teams in Tamil-English and Malayalam-English dataset

Table 8: Misclassified Malayalam-English samples using mBERT+CNN model

Table 7: Performances of the proposed models without and with augmentation on Tamil-English and Malayalam-English Test sets

Table 6: Performances of the proposed models without and with augmentation on Tamil-English and Malayalam-English Validation sets

References: 1) B. R. Chakravarthi, S. N, B. B, N. K, T. Durairaj, R. Ponnusamy, P. K. Kumaresan, K. K. Ponnusamy, C. Rajkumar, Overview of Sarcasm Identification of Dravidian Languages in DravidianCodeMix@FIRE-2024, in: Forum of Information Retrieval and Evaluation FIRE 2024, DAIICT, Gandhinagar, 2024.

Sonith D, Kavya G, and H L Shashirekha

Department of Computer Science, Mangalore University, Mangalore, Karnataka, India
