Unmasking Sarcasm: Exploring mBERT+CNN and LSTM Models for Sarcasm Identification in Code-Mixed Tamil and Malayalam Texts
Abstract
On social media platforms, users express their thoughts and emotions in diverse ways, including sentiment, sarcasm, signs of depression, and expressions of hatred. Sarcasm is a form of expression in which the intended meaning is the opposite of the literal meaning of the words used, often to convey mockery or irony. It can significantly alter the perceived sentiment of a message, which makes it challenging for Natural Language Processing (NLP) tasks such as opinion and sentiment analysis. Detecting sarcasm is crucial because it helps in interpreting user sentiment accurately and can make automated systems more effective at processing and responding to user-generated content. In this direction, "Sarcasm Identification of Dravidian Languages in DravidianCodeMix@FIRE-2024", a shared task organized at the Forum for Information Retrieval Evaluation (FIRE) 2024, invites the research community to address the challenges of sarcasm detection in code-mixed Dravidian languages (Tamil-English and Malayalam-English). In this paper, we, team MUCS, describe the models we proposed for the shared task. Two distinct models are proposed for sarcasm identification: i) a Long Short-Term Memory (LSTM) model trained using Keras embeddings, and ii) an mBERT+CNN model that combines Transfer Learning (TL), by fine-tuning Multilingual Bidirectional Encoder Representations from Transformers (mBERT), with a Deep Learning (DL) classifier, a Convolutional Neural Network (CNN). Further, to overcome the data imbalance in the dataset, text augmentation is explored using the contextual word embedding augmenter from the Natural Language Processing Augmentation (NLPAug) library to increase the number of samples in the minority class. Of the proposed models, the mBERT+CNN model outperformed the LSTM model, with macro F1 scores of 0.74 and 0.72 on the Tamil-English and Malayalam-English datasets, respectively.
Conclusion and Future Work
Introduction
Methodology
Table 1: Samples of code-mixed Tamil-English and Malayalam-English comments along with their English translations and labels
Figure 1: Framework of the proposed LSTM and mBERT+CNN models
Table 2: Statistics of the code-mixed Tamil-English and Malayalam-English datasets
Table 3: Sample text and augmented text in the Tamil-English and Malayalam-English datasets
Pre-processing
Text Augmentation
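The abstract describes augmentation as a way to increase the number of minority-class samples. The following is a minimal sketch of that idea, not the authors' implementation: the function name `balance_minority`, its parameters, and the choice to top up the minority class until it matches the largest class are all assumptions made for illustration. The augmenter is passed in as a plain callable so the sketch runs without downloading a model.

```python
import random
from collections import Counter


def balance_minority(texts, labels, augment_fn, minority_label, seed=0):
    """Append augmented copies of minority-class texts until that class
    is as large as the largest class (hypothetical balancing policy)."""
    counts = Counter(labels)
    minority_texts = [t for t, y in zip(texts, labels) if y == minority_label]
    if not minority_texts:
        raise ValueError("no samples carry the minority label")

    target = max(counts.values())      # size of the largest class
    rng = random.Random(seed)          # seeded for reproducibility
    new_texts, new_labels = list(texts), list(labels)

    for _ in range(target - len(minority_texts)):
        source = rng.choice(minority_texts)    # pick an original minority text
        new_texts.append(augment_fn(source))   # add its augmented variant
        new_labels.append(minority_label)
    return new_texts, new_labels


# Assumed wiring to NLPAug (not executed here; the model name is an assumption):
#   import nlpaug.augmenter.word as naw
#   aug = naw.ContextualWordEmbsAug(
#       model_path="bert-base-multilingual-uncased", action="substitute")
#   augment_fn = lambda t: aug.augment(t)[0]  # some versions return a str, not a list
```

Augmentation of this kind should be applied to the training split only, so that validation and test scores are still measured on original comments.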
Model Building
The augmented dataset is pre-processed and used to construct two distinct models: i) LSTM and ii) mBERT+CNN.
The steps involved in implementing this approach are given below:
Table 4: Hyperparameters and their values used in the LSTM model
Table 5: Hyperparameters and their values used in the mBERT+CNN model
Experiments and Results
Model | Language | Augmentation | P | R | F1
LSTM | Tamil-English | Without | 0.47 | 0.50 | 0.21
LSTM | Tamil-English | With | 0.63 | 0.50 | 0.21
LSTM | Malayalam-English | Without | 0.59 | 0.50 | 0.16
LSTM | Malayalam-English | With | 0.59 | 0.50 | 0.16
mBERT+CNN | Tamil-English | Without | 0.75 | 0.69 | 0.71
mBERT+CNN | Tamil-English | With | 0.75 | 0.72 | 0.73
mBERT+CNN | Malayalam-English | Without | 0.74 | 0.69 | 0.71
mBERT+CNN | Malayalam-English | With | 0.79 | 0.68 | 0.73
P: Precision; R: Recall; F1: Macro F1 score
Model | Language | Augmentation | P | R | F1
LSTM | Tamil-English | Without | 0.14 | 0.50 | 0.21
LSTM | Tamil-English | With | 0.60 | 0.59 | 0.45
LSTM | Malayalam-English | Without | 0.59 | 0.50 | 0.15
LSTM | Malayalam-English | With | 0.55 | 0.51 | 0.19
mBERT+CNN | Tamil-English | Without | 0.59 | 0.50 | 0.43
mBERT+CNN | Tamil-English | With | 0.73 | 0.76 | 0.74
mBERT+CNN | Malayalam-English | Without | 0.71 | 0.73 | 0.72
mBERT+CNN | Malayalam-English | With | 0.76 | 0.69 | 0.72
P: Precision; R: Recall; F1: Macro F1 score
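For readers unfamiliar with macro averaging, the sketch below shows one common way to compute macro precision, recall, and F1: score each class separately, then take the unweighted mean. It is an illustration only; the function name `macro_prf` and the convention of scoring an undefined ratio as 0 are assumptions, and the source does not say which tool produced the reported numbers.

```python
def macro_prf(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1).

    Each class is scored on its own and the per-class scores are averaged
    with equal weight, so a small class counts as much as a large one.
    Ratios with a zero denominator are scored as 0.0 (assumed convention).
    """
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Synthetic example (not the shared-task data): 27 sarcastic and 73
# non-sarcastic comments, with a degenerate model that labels everything
# "sarcastic". One class gets recall 1 and the other 0, so macro recall is 0.5.
y_true = ["sarcastic"] * 27 + ["non-sarcastic"] * 73
y_pred = ["sarcastic"] * 100
p, r, f1 = macro_prf(y_true, y_pred)
```

A macro recall of exactly 0.50 on a two-class task is what a model produces when it assigns every sample to a single class. Several LSTM rows in the tables show that value, although the source does not state whether this is the cause.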
Figure 2: Comparison of macro F1 scores of the participating teams on the Tamil-English and Malayalam-English datasets
Table 6: Performance of the proposed models without and with augmentation on the Tamil-English and Malayalam-English Validation sets
Table 7: Performance of the proposed models without and with augmentation on the Tamil-English and Malayalam-English Test sets
Table 8: Misclassified Malayalam-English samples using the mBERT+CNN model
References: 1) B. R. Chakravarthi, S. N, B. B, N. K, T. Durairaj, R. Ponnusamy, P. K. Kumaresan, K. K. Ponnusamy, C. Rajkumar, Overview of Sarcasm Identification of Dravidian Languages in DravidianCodeMix@FIRE-2024, in: Forum of Information Retrieval and Evaluation FIRE 2024, DAIICT, Gandhinagar, 2024.
Sonith D, Kavya G, and H L Shashirekha
Department of Computer Science, Mangalore University, Mangalore, Karnataka, India