
TABLE OF CONTENTS

INTRODUCTION
A brief look into sentiment analysis using ML

METHODOLOGIES
The models and methods used in this study

RESULTS
Results and metrics achieved

INFERENCES
Our inferences and scope for future studies


INTRODUCTION


SENTIMENT ANALYSIS

Sentiment analysis tools classify text according to the emotion it expresses.
Models are either trained on task-specific datasets or built on rule-based lexicons.
Crucial for a range of applications: business and market research, politics, and social media monitoring.
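
As a minimal sketch of the rule-based lexicon approach mentioned above, the snippet below sums word polarities from a tiny lexicon. The lexicon, the function name `lexicon_sentiment`, and the example sentences are hypothetical and are not taken from this study.

```python
# Hypothetical toy lexicon: word -> polarity score (not from the study).
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "awful": -1, "hate": -1}


def lexicon_sentiment(text: str) -> str:
    """Label text by summing the polarity of the words found in the lexicon."""
    score = sum(LEXICON.get(word, 0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# A directly expressed emotion and a sarcastic remark (both invented examples).
direct = lexicon_sentiment("I love this song")
sarcastic = lexicon_sentiment("Oh great another flop I just love waiting")
print(direct, sarcastic)
```

Both sentences receive the same label because the scorer only sees the words "great" and "love", which is one reason surface-level polarity alone struggles with sarcasm.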


CHALLENGES IN IDENTIFYING SARCASM

Sentiment analysis works well when emotion is expressed directly, but sarcasm is harder to detect:

Sarcasm relies heavily on context
It often depends on tone and body language, which text does not carry
Sarcasm datasets are imbalanced, with sarcastic examples in the minority
Ambiguity is difficult for machines to pick up on


DATASET STATS

SARCASM IN TAMIL

Tamil has 19866 non-sarcastic comments and 7170 sarcastic comments

26.5% of the Tamil comments are sarcastic.

SARCASM IN MALAYALAM

Malayalam has 9798 non-sarcastic comments and 2259 sarcastic comments

About 18.7% of the Malayalam comments are sarcastic (2259 of 12057).

The graph highlights the imbalance between sarcastic and non-sarcastic comments.
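
The sarcastic share of each dataset can be recomputed directly from the counts on this slide. The sketch below also derives "balanced" class weights using the common heuristic n_samples / (n_classes * class_count). That weighting is an assumption added for illustration; the slides do not say how the imbalance was handled in the study.

```python
# Comment counts as reported on this slide.
COUNTS = {
    "Tamil": {"non_sarcastic": 19866, "sarcastic": 7170},
    "Malayalam": {"non_sarcastic": 9798, "sarcastic": 2259},
}


def sarcastic_share(counts: dict) -> float:
    """Fraction of comments labelled sarcastic."""
    total = counts["non_sarcastic"] + counts["sarcastic"]
    return counts["sarcastic"] / total


def balanced_class_weights(counts: dict) -> dict:
    """Assumed heuristic: n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}


shares = {lang: sarcastic_share(c) for lang, c in COUNTS.items()}
weights = {lang: balanced_class_weights(c) for lang, c in COUNTS.items()}

for lang in COUNTS:
    print(f"{lang}: {shares[lang]:.1%} sarcastic, class weights {weights[lang]}")
```

Under this heuristic the minority (sarcastic) class gets a weight above 1 and the majority class a weight below 1, so errors on sarcastic comments count for more during training.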


METHODOLOGIES


OUR TOP PERFORMING MODELS

MALAYALAM
CountVectorizer with a Multilayer Perceptron classifier

TAMIL
TF-IDF vectorizer with a Multilayer Perceptron classifier
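
The two vectorizer-plus-MLP combinations named above can be sketched with scikit-learn, assuming that is the library behind the CountVectorizer and TF-IDF names used here. The toy comments, the labels, and the hyperparameters (`hidden_layer_sizes`, `max_iter`, `random_state`) are all hypothetical; the slides do not state the settings used in the study.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy comments (1 = sarcastic, 0 = non-sarcastic); not from the dataset.
comments = [
    "wow what a masterpiece i only fell asleep twice",
    "oh sure best trailer ever said nobody",
    "great another remake just what we needed",
    "yeah right totally worth the three hour wait",
    "the songs are lovely and the acting is solid",
    "really enjoyed the trailer waiting for release",
    "nice background score and good visuals",
    "the lead actor did a fine job in this film",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def build_pipeline(vectorizer):
    """Chain a text vectorizer with an MLP classifier (assumed hyperparameters)."""
    mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
    return make_pipeline(vectorizer, mlp)


# One pipeline per feature representation.
count_mlp = build_pipeline(CountVectorizer()).fit(comments, labels)
tfidf_mlp = build_pipeline(TfidfVectorizer()).fit(comments, labels)

count_preds = count_mlp.predict(comments)
tfidf_preds = tfidf_mlp.predict(comments)
print(list(count_preds), list(tfidf_preds))
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned only from the training data passed to `fit`, which avoids leaking validation text into the features.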


Why TF-IDF Vectorizer?

Assigns weights to words based on their frequency in a document and rarity across a corpus.
Measures the importance of terms in a specific document relative to their occurrence in the entire dataset.
Contextual Significance: Captures the significance of words, emphasizing distinctive terms associated with sarcasm in a given context.
Multilingual Consideration: Operates on tokens rather than language-specific rules, so it can handle code-mixed comments containing a mix of languages.
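
The weighting idea above can be made concrete with a small worked example. This sketch uses the textbook form, tf = count / document length and idf = ln(N / df), on a hypothetical three-comment corpus; it is an illustration of the principle, not the exact formula used in the study.

```python
import math
from collections import Counter

# Hypothetical toy corpus of whitespace-tokenized comments.
docs = [
    "nice movie nice songs",
    "nice movie yeah right",
    "nice trailer",
]


def tf_idf(corpus):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenized = [doc.split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


tfidf_weights = tf_idf(docs)
for doc, w in zip(docs, tfidf_weights):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

The word "nice" appears in every comment, so its weight drops to zero, while "yeah" and "right" occur in only one comment and receive the largest weights there. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalization by default, so its numbers differ from this sketch even though the ranking idea is the same.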


Why CountVectorizer?

Represents text by counting the frequency of each word in a document. This captures term occurrence in a simple, straightforward representation of the document.
Contextual Significance: Captures word frequencies without explicitly modelling contextual relationships. In our task, where the language varies widely and the context can be diverse, a context-agnostic approach is beneficial.
Multilingual Consideration: CountVectorizer can represent the linguistic diversity in the dataset. This helps the model identify sarcasm across the varied language expressions commonly found in YouTube comments.
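
A count-based representation can be sketched in a few lines of plain Python. The function `count_vectors` and the two example comments are hypothetical; the point is only to show what "context-agnostic" means in practice.

```python
from collections import Counter


def count_vectors(corpus):
    """Build a sorted vocabulary and one term-count vector per document."""
    tokenized = [doc.lower().split() for doc in corpus]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Two hypothetical comments with the same words in a different order.
docs = ["super movie super songs", "songs super movie super"]
vocabulary, vectors = count_vectors(docs)
print(vocabulary)
print(vectors)
```

Both comments map to the same vector because only word counts are kept and word order is discarded.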


Why Multilayer Perceptron?

Neural network architecture adept at capturing complex patterns in data and learning intricate relationships within feature-rich datasets.
Non-linear Pattern Recognition: MLP excels at capturing non-linear relationships and can discern subtle linguistic nuances to identify sarcasm effectively.
Adaptability to High-Dimensional Data: TF-IDF generates high-dimensional features, and MLP leverages the weighted terms to distinguish between sarcastic and non-sarcastic expressions.
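
To illustrate what a non-linear relationship means here, the sketch below hand-wires a tiny perceptron with one hidden layer of two ReLU units so that it outputs 1 only when exactly one of two binary features is present (the XOR pattern), which no single linear boundary can separate. The weights are chosen by hand for illustration; they are not learned and not taken from the study.

```python
def relu(x: float) -> float:
    """Rectified linear unit: passes positive values, clips negatives to zero."""
    return max(0.0, x)


def tiny_mlp(x1: float, x2: float) -> float:
    """Two hidden ReLU units and one linear output, with hand-picked weights."""
    h1 = relu(x1 + x2)        # fires when at least one feature is present
    h2 = relu(x1 + x2 - 1.0)  # fires only when both features are present
    return h1 - 2.0 * h2      # cancels the output when both are present


outputs = {(a, b): tiny_mlp(a, b) for a in (0, 1) for b in (0, 1)}
for features, value in outputs.items():
    print(features, value)
```

A trained MLP learns weights of this kind from data, which lets it model interactions between terms rather than treating each term's contribution independently.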


OUR RESULTS


Successful models for each language:

Count Vectorizer and MLP Classifier
TF-IDF Vectorizer and MLP Classifier
TF-IDF Vectorizer and Random Forest Classifier

TAMIL

MALAYALAM

Count Vectorizer and MLP Classifier
Count Vectorizer with Logistic Regression
TF-IDF and MLP Classifier


Metric values obtained:


Our results

Our team (SSNCSE1) ranked highly in cross-linguistic sarcasm detection: 2nd for Tamil and 1st for Malayalam.
The validation accuracy for Tamil ranged from 0.72 to 0.78, with F1-scores between 0.73 and 0.77. In Malayalam, accuracy ranged from 0.72 to 0.85, with F1-scores between 0.60 and 0.77.
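
Accuracy and F1 can diverge on imbalanced data, which is worth keeping in mind when reading the ranges above. The sketch below computes both from confusion-matrix counts. The counts are hypothetical, and computing F1 for the sarcastic class alone is an assumption, since the slides do not say which F1 averaging was reported.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts for an imbalanced validation set
# (positive class = sarcastic): 50 sarcastic and 150 non-sarcastic comments.
tp, fp, fn, tn = 20, 10, 30, 140

acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

In this invented case most non-sarcastic comments are classified correctly, so accuracy stays high even though fewer than half of the sarcastic comments are found, and the F1-score reflects that gap.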


INFERENCES


Our study contribution

Challenges such as language diversity, code-mixing, and class imbalance were effectively addressed by the proposed approaches.
The study emphasizes the universality of these challenges in both Tamil and Malayalam, highlighting the need for multilingual techniques in sentiment analysis.


Future Scope

Our findings contribute to cross-linguistic sarcasm detection, with practical applications such as sentiment-driven content analysis and cyberbullying identification.
The study underscores the importance of flexible models capable of understanding language nuances in online interactions.
