
Unmasking Sarcasm: Exploring mBERT+CNN and LSTM Models for Sarcasm Identification in Code-Mixed Tamil and Malayalam Texts

Conclusion and Future Work

  • Among the proposed models, the mBERT+CNN model outperformed the LSTM model, with macro F1 scores of 0.74 and 0.72 on the Tamil-English and Malayalam-English datasets, respectively
  • Advanced text augmentation and context-aware models will be explored further

Abstract

On social media platforms, users express their thoughts and emotions in diverse ways, including sentiment, sarcasm, signs of depression, and expressions of hatred. Sarcasm is a form of expression in which the intended meaning is the opposite of the literal meaning of the words used, often to convey mockery or irony. It can significantly alter the perceived sentiment of a message, which makes Natural Language Processing (NLP) tasks such as opinion and sentiment analysis challenging. Detecting sarcasm is crucial because it helps in accurately interpreting user sentiment and can make automated systems more effective at processing and responding to user-generated content. In this direction, "Sarcasm Identification of Dravidian Languages in DravidianCodeMix@FIRE-2024", a shared task organized at the Forum for Information Retrieval Evaluation (FIRE) 2024, invites the research community to address the challenges of sarcasm detection in code-mixed Dravidian languages (Tamil-English and Malayalam-English). In this paper, we, team MUCS, describe the models we proposed for the shared task. Two distinct models are proposed for sarcasm identification: i) a Long Short-Term Memory (LSTM) model trained using Keras embeddings, and ii) an mBERT+CNN model that combines Transfer Learning (TL), by fine-tuning Multilingual Bidirectional Encoder Representations from Transformers (mBERT), with a Deep Learning (DL) classifier, a Convolutional Neural Network (CNN). Further, to overcome the data imbalance in the datasets, text augmentation is explored using the contextual word embedding augmenter from the Natural Language Processing Augmentation (NLPAug) library to increase the number of samples in the minority class. Among the proposed models, the mBERT+CNN model outperformed the LSTM model, with macro F1 scores of 0.74 and 0.72 on the Tamil-English and Malayalam-English datasets, respectively.

Introduction

  • Social networking sites have become a great source of user-generated textual content that lends itself to engagement via likes, shares, comments, and discussion
  • This textual content spans various categories such as hate speech, hope speech, and fake news, and sarcasm is one of them
  • Sarcastic comments often draw a contrast between the literal words expressed and their intended implication
  • Sarcasm detection in low-resource Dravidian languages like Tamil and Malayalam is made harder by the complexity of their linguistic structures, the rich variety of expressions used to convey sarcasm in text, and the unavailability of annotated data in these languages
  • Sample texts with their corresponding labels from the given code-mixed Tamil-English and Malayalam-English datasets are shown in Table 1

Methodology

  • Statistics of the code-mixed Tamil-English and Malayalam-English datasets are shown in Table 2

Table 1: Samples of code-mixed Tamil-English and Malayalam-English comments along with their English translations and labels

Figure 1: Framework of the proposed LSTM and mBERT+CNN models

Table 2: Statistics of code-mixed Tamil-English and Malayalam-English dataset

  • Natural Language Processing Augmentation (NLPAug) is a Python library designed to facilitate text augmentation for NLP tasks
  • The contextual word embedding augmenter is used with the ‘insert’ option to augment the ‘Sarcastic’ class samples
  • This technique uses multilingual BERT (mBERT), a pretrained language model
  • In the Tamil-English and Malayalam-English datasets, the sarcastic samples were increased to 21,740 and 10,689 respectively
  • Examples of Tamil-English and Malayalam-English sample texts and their augmented versions produced by the NLPAug contextual word embedding augmenter are shown in Table 3
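The augmentation step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the checkpoint name `bert-base-multilingual-cased`, the helper names, and the way the target count is reached (random sampling from the minority class) are all assumptions. The NLPAug call is wrapped in a function with a lazy import so that the balancing logic can be read and tried without downloading a model.

```python
import random


def make_mbert_inserter(model_path="bert-base-multilingual-cased"):
    """Return a text -> text function backed by NLPAug (checkpoint name is an assumption)."""
    import nlpaug.augmenter.word as naw  # imported lazily; needs nlpaug and transformers

    aug = naw.ContextualWordEmbsAug(model_path=model_path, action="insert")

    def augment_one(text):
        out = aug.augment(text)
        # Depending on the nlpaug version, augment() returns a list or a single string.
        return out[0] if isinstance(out, list) else out

    return augment_one


def oversample_minority(texts, labels, minority_label, target_count, augment_one, seed=0):
    """Append augmented copies of minority-class texts until target_count is reached."""
    rng = random.Random(seed)
    minority = [t for t, y in zip(texts, labels) if y == minority_label]
    if not minority:
        raise ValueError("no samples with the minority label")
    new_texts, new_labels = list(texts), list(labels)
    count = len(minority)
    while count < target_count:
        new_texts.append(augment_one(rng.choice(minority)))
        new_labels.append(minority_label)
        count += 1
    return new_texts, new_labels
```

With NLPAug installed, `oversample_minority(texts, labels, "Sarcastic", target_count, make_mbert_inserter())` would grow the ‘Sarcastic’ class; any `text -> text` function can stand in for the augmenter while experimenting.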

Table 3: Sample texts and augmented texts in the Tamil-English and Malayalam-English datasets

Pre-processing

  • User mentions (e.g., "@username"), numbers, URLs, and HTML tags are removed
  • Emojis are converted into their textual representations
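The two pre-processing steps above might look like the sketch below. It is an assumption-laden illustration using only the standard library: the regular expressions are one plausible reading of "user mentions, numbers, URLs, and HTML tags", and the emoji step uses Unicode character names as a stand-in for whatever emoji-to-text mapping the authors used (a dedicated package such as `emoji`, with its `demojize` function, is a common alternative).

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
MENTION_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\d+")


def emoji_to_text(text):
    """Replace characters in Unicode category 'So' (most emojis) with their lowercase names."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            # Characters without a Unicode name are dropped.
            pieces.append(f" {unicodedata.name(ch, '').lower()} ")
        else:
            pieces.append(ch)
    return "".join(pieces)


def preprocess(text):
    """Remove URLs, HTML tags, mentions and numbers, then convert emojis to text."""
    # URLs go first so that digits inside a link are not handled separately.
    for pattern in (URL_RE, HTML_TAG_RE, MENTION_RE, NUMBER_RE):
        text = pattern.sub(" ", text)
    text = emoji_to_text(text)
    return " ".join(text.split())  # collapse the whitespace left by the removals
```

Tamil and Malayalam script characters are letters and combining marks, not category 'So', so native-script comments pass through unchanged. Category 'So' also covers a few non-emoji symbols, which a production pipeline would want to handle more precisely.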

Text Augmentation

Model Building

The augmented dataset is pre-processed and used to construct two distinct models: i) LSTM and ii) mBERT+CNN.

  • LSTM - a Deep Learning Approach

The steps involved in implementing this approach are given below:

  • Text and Label Fusion
  • Feature Extraction: Keras embedding is employed to convert raw text into numerical representations
  • Classifier Construction: LSTM classifier
  • Hyperparameters and their values used in the LSTM model are shown in Table 4
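As a rough sketch of the feature extraction and classifier steps above: text is first mapped to padded integer sequences, which a Keras `Embedding` layer then turns into dense vectors for the LSTM. The helper names, whitespace tokenization, and the values of `embed_dim` and `lstm_units` are placeholders chosen for illustration; the hyperparameters actually used are the ones in Table 4. Keras is imported lazily so the encoding helpers can be tried on their own.

```python
def build_vocab(texts):
    """Map each whitespace token to an id; 0 is reserved for padding, 1 for unknown tokens."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab) + 2)
    return vocab


def encode(text, vocab, max_len):
    """Convert a text into a fixed-length list of token ids (truncated, then zero-padded)."""
    ids = [vocab.get(token, 1) for token in text.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))


def build_lstm_classifier(vocab_size, embed_dim=100, lstm_units=64):
    """Embedding -> LSTM -> sigmoid output for binary sarcasm classification (placeholder sizes)."""
    from tensorflow import keras  # imported lazily; needs TensorFlow

    model = keras.Sequential([
        keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim, mask_zero=True),
        keras.layers.LSTM(lstm_units),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Here `vocab_size` should be `len(vocab) + 2` to account for the padding and unknown-token ids.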

  • mBERT+CNN model
  • mBERT - for feature extraction, CNN - for classification
  • Hyperparameters and their values used in mBERT+CNN model are shown in Table 5
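One way to combine mBERT features with a CNN classifier is sketched below. The source does not specify the framework or the architecture details, so everything here is an assumption: PyTorch with the Hugging Face `transformers` library, the `bert-base-multilingual-cased` checkpoint, and a set of parallel 1-D convolutions with max-over-time pooling. The kernel sizes and filter count are placeholders; Table 5 lists the values actually used.

```python
def pooled_feature_size(kernel_sizes, num_filters):
    """Size of the concatenated feature vector: one max-pooled vector per kernel size."""
    return len(kernel_sizes) * num_filters


def build_mbert_cnn(model_name="bert-base-multilingual-cased",
                    kernel_sizes=(2, 3, 4), num_filters=64, num_classes=2):
    """Build an mBERT encoder followed by a CNN classification head (hypothetical settings)."""
    import torch  # imported lazily; needs torch and transformers
    from torch import nn
    from transformers import AutoModel

    class MBertCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)  # fine-tuned with the head
            hidden = self.encoder.config.hidden_size
            self.convs = nn.ModuleList(
                [nn.Conv1d(hidden, num_filters, k) for k in kernel_sizes]
            )
            self.classifier = nn.Linear(
                pooled_feature_size(kernel_sizes, num_filters), num_classes
            )

        def forward(self, input_ids, attention_mask):
            states = self.encoder(
                input_ids=input_ids, attention_mask=attention_mask
            ).last_hidden_state              # (batch, tokens, hidden)
            x = states.transpose(1, 2)       # Conv1d expects (batch, hidden, tokens)
            pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
            return self.classifier(torch.cat(pooled, dim=1))  # class logits

    return MBertCNN()
```

Input sequences must be at least as long as the largest kernel size, which padding to a fixed length takes care of.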

Table 5: Hyperparameters and their values used in the mBERT+CNN model

Table 4: Hyperparameters and their values used in the LSTM model

  • The framework of the proposed methodology is shown in Figure 1

Experiments and Results

  • Dataset: YouTube comments with native as well as Roman scripts
  • The performances of the proposed models with and without augmentation on the Development and Test sets are shown in Tables 6 and 7 respectively
  • The comparison of macro F1 scores of all the participating teams for the sarcasm detection task on code-mixed Tamil-English and Malayalam-English text is shown in Figure 2 (a) and 2 (b) respectively
  • A few Malayalam-English samples misclassified by the mBERT+CNN model, with their actual and predicted labels and the probable reasons for misclassification, are shown in Table 8

Models      Tamil-English                               Malayalam-English
            Without Augmentation  With Augmentation     Without Augmentation  With Augmentation
            P     R     F1        P     R     F1        P     R     F1        P     R     F1
LSTM        0.47  0.50  0.21      0.63  0.50  0.21      0.59  0.50  0.16      0.59  0.50  0.16
mBERT+CNN   0.75  0.69  0.71      0.75  0.72  0.73      0.74  0.69  0.71      0.79  0.68  0.73

P: Precision; R: Recall; F1: Macro F1 score

Models      Tamil-English                               Malayalam-English
            Without Augmentation  With Augmentation     Without Augmentation  With Augmentation
            P     R     F1        P     R     F1        P     R     F1        P     R     F1
LSTM        0.14  0.50  0.21      0.60  0.59  0.45      0.59  0.50  0.15      0.55  0.51  0.19
mBERT+CNN   0.59  0.50  0.43      0.73  0.76  0.74      0.71  0.73  0.72      0.76  0.69  0.72

P: Precision; R: Recall; F1: Macro F1 score
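For readers unfamiliar with macro averaging, the sketch below shows how macro precision, recall, and F1 can be computed: each metric is calculated per class and the per-class values are averaged with equal weight. The function name, the label ‘Non-sarcastic’, and the toy data are illustrative assumptions, and macro F1 is taken here as the mean of per-class F1 scores, which is one common convention.

```python
def macro_prf(y_true, y_pred, labels):
    """Return (macro precision, macro recall, macro F1) over the given labels."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data: 7 non-sarcastic and 3 sarcastic comments, all predicted as the majority class.
y_true = ["Non-sarcastic"] * 7 + ["Sarcastic"] * 3
y_pred = ["Non-sarcastic"] * 10
p, r, f1 = macro_prf(y_true, y_pred, ["Non-sarcastic", "Sarcastic"])
```

In this toy case the macro recall is exactly 0.50, because one class is fully recalled and the other not at all. A recall of 0.50 also appears in several LSTM rows of the tables; single-class prediction is one possible reading of those rows, though the source does not state it.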

Figure 2: Comparison of macro F1 scores of the participating teams in Tamil-English and Malayalam-English dataset

Table 8: Misclassified Malayalam-English samples using mBERT+CNN model

Table 7: Performances of the proposed models without and with augmentation on Tamil-English and Malayalam-English Test sets

Table 6: Performances of the proposed models without and with augmentation on Tamil-English and Malayalam-English Validation sets

References: 1) B. R. Chakravarthi, S. N, B. B, N. K, T. Durairaj, R. Ponnusamy, P. K. Kumaresan, K. K. Ponnusamy, C. Rajkumar, Overview of Sarcasm Identification of Dravidian Languages in DravidianCodeMix@FIRE-2024, in: Forum of Information Retrieval and Evaluation FIRE 2024, DAIICT, Gandhinagar, 2024.

Sonith D, Kavya G, and H L Shashirekha

Department of Computer Science, Mangalore University, Mangalore, Karnataka, India
