
Forum for Information Retrieval Evaluation


Lakshmi S. Gopal, Research Scholar

Aswathy A, Research Scholar

Krishnendu K, Project Assistant

Dr. Hemalatha T, Assistant Professor

Amrita Center for Wireless Networks & Applications (AWNA), Amrita Vishwa Vidyapeetham, Amritapuri Campus


Our Center

We are a leading multidisciplinary research center at Amrita Vishwa Vidyapeetham. Our scientists, academicians, and students develop leading-edge technologies in wireless, sensors, computing, and networking to address grand challenges facing humanity today in disaster monitoring and preparedness, climate change, energy, water, and health.

  • 1+ Million Lives Impacted
  • 6 Products Commercialised
  • 100+ Million Research Funding
  • 450+ International Publications
  • 14+ Patents Granted

Sri Mata Amritanandamayi Devi, Chancellor, Amrita Vishwa Vidyapeetham


Our Disaster Management Related Projects

Amrita Wireless Sensor Network for Landslide Detection System for Himalayas

Landslide Multi-Hazard Risk Assessment, Preparedness and Early Warning in South Asia: Integrating Meteorology, Landscape and Society (LANDSLIP)

Flood Evac (Vulnerability of Transportation Structures, Warning and Evacuation in Case of Major Inland Flooding)

Micronet – Mobile Infrastructure for Coastal Region Offshore Communications & Networks


Content

Introduction

Task Details

Methodology

Exploratory Data Analysis

Data Preprocessing

Model Creation for Multi Label Classification

Results and Evaluation

Conclusion


Introduction

  • Vaccination is crucial for public health, saving lives and reducing disease burden.
  • Despite its effectiveness, vaccine hesitancy persists, fueled by easy access to both credible and misleading information in the digital age.

  • Crowdsourced data platforms provide valuable insights into vaccine-related experiences and concerns.

  • Machine learning is a vital tool in public health, capable of analyzing large datasets and revealing patterns in vaccine-related sentiments expressed on social media.


  • The research paper aims to contribute to understanding vaccine hesitancy by analyzing crowdsourced vaccine data using machine learning.

  • The study seeks a deeper understanding of concerns expressed through crowdsourcing to inform targeted public health interventions and communication strategies.

  • Leveraging machine learning algorithms, the research aims to shed light on intricate dynamics of vaccine hesitancy for more effective vaccination campaigns and improved public health outcomes.


Task Details

We aim to develop a multi-label classification model.

The classifier labels each tweet with the specific concern(s) about vaccines that it expresses.

A tweet can have more than one label (concern)


The following labels are in focus:

  • Unnecessary: Suggests immunizations aren't necessary or better alternatives exist.
  • Mandatory: Opposes making vaccinations obligatory.
  • Pharma: Big Pharma prioritizes profit over other considerations.
  • Conspiracy: Hints at a broader conspiracy beyond Big Pharma's profit motive.
  • Political: Raises concerns that governments and politicians use vaccines to further their agendas.
  • Country: Criticizes a vaccination based on its country of origin.
  • Rushed: Questions the adequacy of vaccine testing and the reliability of available data.
  • Ingredients: Concerns about vaccine contents or the utilized technology.
  • Side-effect: Worry about vaccine side effects, including potential deaths.
  • Ineffective: Concern that immunizations are ineffective and useless in certain cases.
  • Religious: Opposes vaccinations for religious reasons.
  • None: No explicit justification is provided.


Methodology

  • The proposed methodology performs multi-label classification on the given dataset.
  • The following steps were employed:
      • Exploratory Data Analysis (EDA)
      • Data Preprocessing
      • Model Creation for Multi-Label Classification


Exploratory Data Analysis


  • The given data initially had 3 columns: 'ID', 'tweet', and 'labels'
  • The data was transformed into a one-hot encoded dataset with 9921 rows and 14 columns
  • The data contained no null or NaN values
  • To understand the vocabulary used in the tweets, word clouds were generated
  • The figure shows the word clouds of the most frequent and least frequent labels in the dataset

[Figure: word clouds for the 'side-effect' and 'religious' labels]

  • 'side-effect' and 'religious' were the most and least frequent labels, respectively
  • Around 80% of tweets were found to have a single label
  • Balancing the given data may be necessary to ensure optimal performance
  • Terms such as 'vaccine', 'covid', and 'Pfizer' occur frequently in the word clouds
  • Keywords related to 'side-effect', such as 'death', 'adverse reaction', and 'blood clot', are frequent in the 'side-effect' word cloud
  • Keywords related to the 'religious' label, such as 'religion', 'faith', and 'psalm', were found to be less frequent in the 'religious' word cloud
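
The encoding and the label statistics described above can be sketched with pandas. This is a minimal illustration on made-up rows, assuming the 'labels' column stores space-separated label names; the actual file layout and the authors' code are not shown in the slides.

```python
import pandas as pd

# Toy stand-in for the task data (hypothetical rows, not the real dataset).
# Assumption: 'labels' holds space-separated concern names.
raw = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "tweet": ["t1", "t2", "t3", "t4", "t5"],
    "labels": ["side-effect", "side-effect ineffective", "religious",
               "side-effect", "none"],
})

# One-hot encode: one 0/1 column per label, kept next to ID and tweet.
one_hot = raw["labels"].str.get_dummies(sep=" ")
encoded = pd.concat([raw[["ID", "tweet"]], one_hot], axis=1)

# Label frequencies (most to least frequent) and share of single-label tweets.
label_counts = one_hot.sum().sort_values(ascending=False)
single_label_share = (one_hot.sum(axis=1) == 1).mean()

# Check for missing values anywhere in the encoded table.
has_missing = bool(encoded.isna().any().any())
```

With 12 labels plus 'ID' and 'tweet', the same transformation yields the 14 columns reported for the real data.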


Data Preprocessing

  • Basic data cleaning methods were employed using the Python libraries NLTK, re (regular expressions), and spaCy
  • Word tokenization
  • Lowercase conversion
  • Removal of URLs that often accompany a tweet (image and video links)
  • Removal of stopwords (the, a, is, etc.), except 'not' and 'no', to preserve context
  • Removal of special characters and smileys
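
A minimal sketch of these cleaning steps using only the standard library. The stopword set, the regular expressions, and the function name `clean_tweet` are illustrative assumptions; the slides use NLTK and spaCy, whose tokenizers and stopword lists are larger.

```python
import re
from typing import List

# Small illustrative stopword list (assumption; NLTK's English list is longer).
STOPWORDS = {"the", "a", "an", "is", "are", "was", "of", "to", "and", "in",
             "it", "no", "not"}
# Negations are kept so that "not safe" does not collapse to "safe".
KEEP = {"not", "no"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Anything that is not a lowercase letter, digit, or whitespace
# (punctuation, emoji, '@', '#') is treated as a special character.
SPECIAL_RE = re.compile(r"[^a-z0-9\s]")


def clean_tweet(text: str) -> List[str]:
    """Lowercase, strip URLs and special characters, tokenize, drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    tokens = text.split()  # whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS - KEEP]


tokens = clean_tweet("The vaccine is NOT safe :( https://t.co/abc #covid")
```

Here `tokens` keeps the negation while the URL, smiley, and hashtag symbol are dropped.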


Model Creation

  • Multi-label classification was explored using classifier chains and a multi-output classifier
  • Classifier chains offer a powerful approach to modelling interdependencies between labels in multi-label classification tasks
  • Each classifier predicts one label based on the input features
  • Predictions from previous classifiers are incorporated as additional features for subsequent classifiers
  • Example: train the first classifier (Ingredients) on the feature Tweet to predict the label Ingredients
  • Then train the second classifier (Side-effect) on the features Tweet and Ingredients to predict the label Side-effect
  • The key mathematical concept is the iterative updating of features and labels as each classifier is trained and used in the chain, capturing dependencies between labels in the multi-label classification task

[Figure: classifier-chain illustration for the example tweet "@cath__kath AstraZeneca is made with the kidney cells of a little girl aborted back in the 70s." (Ingredients = 1, Side-Effects = 0, Mandatory = 0), marking which columns act as Features and which as the Class at each step of the chain]
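
The chain idea can be sketched with scikit-learn's `ClassifierChain`. The tweets, the two-label setup, and the TF-IDF features are illustrative assumptions, since the slides do not state the feature representation. The sketch shows that the second classifier receives one feature more than the first: the 'ingredients' label.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Hypothetical cleaned tweets; label columns are [ingredients, side-effect].
tweets = [
    "what ingredients are in this vaccine",
    "fever and sore arm after the shot",
    "unknown ingredients and bad side effects",
    "got my booster today feeling fine",
    "nobody will list the ingredients",
    "headache and fatigue after second dose",
]
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])

# Dense array is fine for this tiny toy set.
X = TfidfVectorizer().fit_transform(tweets).toarray()

# order=[0, 1]: predict 'ingredients' first, then 'side-effect'.
# During fit, scikit-learn appends the true earlier labels as extra features;
# during predict, it appends the earlier classifiers' predictions instead.
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order=[0, 1], random_state=0)
chain.fit(X, Y)
pred = chain.predict(X)

first, second = chain.estimators_
n_text_features = X.shape[1]
```

Changing `order` changes which labels are available as features to which classifier, so the chain order is itself a modelling choice.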


Model Creation

  • Multi output classifier
  • The model used is typically an extension of traditional machine learning models, such as decision trees, random forests, support vector machines, or neural networks.

  • The model is trained on the dataset using a loss function that considers all the output labels simultaneously.

  • The loss function is a measure of the difference between predicted and true values for all outputs.

  • During training, the model adjusts its parameters to minimize this loss across all outputs.
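
One common way to write such a loss is as a sum of per-label losses. The symbols below (N tweets, L labels, per-label model f_i with parameters θ_i, per-label loss ℓ) are notation introduced here for illustration and do not come from the slides.

```latex
\mathcal{L}(\theta_1,\dots,\theta_L)
  = \frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{L}
    \ell\bigl(y_{n,i},\, f_i(x_n;\theta_i)\bigr)
```

When each label has its own parameters θ_i, as in this form, minimizing the total is equivalent to minimizing each label's loss separately. This matches how scikit-learn documents `MultiOutputClassifier`, which fits one estimator per target.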


Model Creation

  • Classification algorithms were wrapped in the classifier chain & multi output classifier
  • Classifier chain with Logistic Regression
  • In multi-label classification, logistic regression is extended by using a classifier chain to handle multiple labels.

  • Each classifier in the chain is a logistic regression model.

  • For label i, logistic regression predicts the probability P(Yi=1∣X,Y1,Y2,...,Yi−1)

  • Add the predicted probabilities as features for subsequent classifiers.
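
Written out, the chain corresponds to the product rule for the joint label distribution, with each factor modelled by a logistic regression. The weights w_i, bias b_i, and sigmoid σ are standard logistic-regression notation assumed here rather than taken from the slides.

```latex
P(Y_1,\dots,Y_L \mid X) = \prod_{i=1}^{L} P(Y_i \mid X, Y_1,\dots,Y_{i-1}),
\qquad
P(Y_i = 1 \mid X, Y_1,\dots,Y_{i-1})
  = \sigma\bigl(w_i^{\top}\,[X;\,Y_1,\dots,Y_{i-1}] + b_i\bigr)
```

Here [X; Y_1, ..., Y_{i-1}] denotes the input features concatenated with the earlier labels, and σ(z) = 1 / (1 + e^{-z}).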


Model Creation

  • Classifier chain with Support Vector Machines
  • SVM aims to find the hyperplane that maximally separates data points of different classes.

  • The hyperplane is the decision boundary, and the margin is the distance between the hyperplane and the nearest data points (support vectors).

  • SVM can efficiently handle non-linear relationships in the data using the kernel trick.

  • In a classifier chain, each classifier in the chain is an SVM model
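
For reference, the standard soft-margin formulation of a linear SVM for one binary label is sketched below. The notation (weights w, bias b, slack variables ξ_n, labels y_n in {-1, +1}) is assumed here; C is the regularization parameter that appears among the tuned parameters in the results.

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{n=1}^{N}\xi_n
\quad\text{subject to}\quad
y_n\,(w^{\top}x_n + b) \ge 1 - \xi_n,\qquad \xi_n \ge 0
```

The decision boundary is the hyperplane w^T x + b = 0, and under this scaling the distance from the hyperplane to the nearest points is 1/||w||, so minimizing ||w|| maximizes the margin.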


Model Creation

  • Multi output classifier with support vector machines
  • For each label i:
      • Train an SVM classifier on features X and the binary label Yi
      • Predict the binary label Yi from X alone; unlike in a classifier chain, labels predicted by the other classifiers are not added as features
      • Add the trained SVM model to the multi-output classifier
  • SVM's ability to handle non-linear relationships is leveraged within the multi-output framework
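
A minimal sketch of the per-label procedure, assuming scikit-learn's `MultiOutputClassifier` with a default RBF-kernel `SVC`. The slides do not state which SVM class or features were used, and the tweets are hypothetical. In this wrapper, each label's estimator is fit on X alone.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC

# Hypothetical cleaned tweets; label columns are [side-effect, ineffective].
tweets = [
    "fever and sore arm after the shot",
    "still caught covid after two doses",
    "blood clot risk and it does not even work",
    "got my booster today feeling fine",
    "headache and fatigue after second dose",
    "the vaccine does not stop infection",
]
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])

# Dense array is fine for this tiny toy set.
X = TfidfVectorizer().fit_transform(tweets).toarray()

# One independent SVC is cloned and fit per label column.
moc = MultiOutputClassifier(SVC())
moc.fit(X, Y)
pred = moc.predict(X)

# Every per-label estimator sees exactly the original features.
features_seen = [est.n_features_in_ for est in moc.estimators_]
```

Compared with a classifier chain, no label information flows between the estimators, so label dependencies are not modelled.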


Results & Evaluation

Logistic Regression wrapped in a classifier chain

Tuned Parameters: C: 0.1, 1, 1; penalty: 'l1', 'l2'

Scoring Metric    Default parameters    Hyperparameter-tuned
Accuracy          0.42                  0.48
Precision         0.67                  0.68
Recall            0.45                  0.54
F1 micro          0.54                  0.60

SVM wrapped in a classifier chain

Tuned Parameters: C: 1; dual=false; penalty: 'l1'

Scoring Metric    Default parameters    Hyperparameter-tuned
Accuracy          0.49                  0.49
Precision         0.67                  0.68
Recall            0.54                  0.54
F1 micro          0.60                  0.61
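
The slides do not say how the tuning was run. One way to search over `C` and `penalty` for a logistic-regression chain is a plain loop scored by micro F1 on a held-out split, sketched below. The synthetic data, the grid values, and the `liblinear` solver (chosen because it supports both 'l1' and 'l2') are assumptions for illustration.

```python
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

# Synthetic features and two labels that depend on them (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = np.column_stack([X[:, 0] + X[:, 1] > 0, X[:, 2] - X[:, 0] > 0]).astype(int)

X_tr, X_va, Y_tr, Y_va = train_test_split(X, Y, test_size=0.25, random_state=0)

# Try every (C, penalty) pair and record validation micro F1.
results = {}
for C, penalty in product([0.1, 1.0], ["l1", "l2"]):
    base = LogisticRegression(C=C, penalty=penalty, solver="liblinear")
    chain = ClassifierChain(base, random_state=0).fit(X_tr, Y_tr)
    results[(C, penalty)] = f1_score(Y_va, chain.predict(X_va), average="micro")

best_params = max(results, key=results.get)
```

A cross-validated search would give more stable estimates than a single split, at a higher training cost.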


Results & Evaluation

Learning curves of Logistic Regression wrapped in Classifier Chain.

(a) shows the learning curve of the model before fine tuning.

(b) shows the learning curve of the model after fine tuning.


Results & Evaluation

Learning curves of Support Vector Machines wrapped in Classifier Chain.

(a) shows the learning curve of the model before fine tuning.

(b) shows the learning curve of the model after fine tuning.


Results & Evaluation

SVM wrapped in a multi output classifier

Scoring Metric    Score
Accuracy          0.46
Precision         0.80
Recall            0.54
F1 micro          0.64

Hyperparameter tuning was not performed for the multi output classifier.

Final Results

Run File    Methodology                               F1 macro    Jacc
Model-1     SVM wrapped in Multi Output Classifier    0.38        0.45
Model-2     LR wrapped in Classifier Chain            0.38        0.41
Model-3     SVM wrapped in Classifier Chain           0.3         0.43
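
For reference, the reported metrics can be computed from 0/1 label matrices as sketched below. The exact variants are assumptions, because the slides do not define them: 'Accuracy' is taken as exact-match (subset) accuracy, precision and recall as micro-averaged, and 'Jacc' as the Jaccard score averaged over tweets. The toy matrices are made up.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def f1(tp, fp, fn):
    return safe_div(2 * tp, 2 * tp + fp + fn)


def multilabel_scores(y_true, y_pred):
    n_samples, n_labels = len(y_true), len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    exact, jaccard_sum = 0, 0.0
    for t_row, p_row in zip(y_true, y_pred):
        exact += int(t_row == p_row)  # all labels must match
        inter = sum(t and p for t, p in zip(t_row, p_row))
        union = sum(t or p for t, p in zip(t_row, p_row))
        jaccard_sum += safe_div(inter, union)
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            tp[j] += int(t == 1 and p == 1)
            fp[j] += int(t == 0 and p == 1)
            fn[j] += int(t == 1 and p == 0)
    TP, FP, FN = sum(tp), sum(fp), sum(fn)
    return {
        "accuracy": exact / n_samples,
        "precision_micro": safe_div(TP, TP + FP),
        "recall_micro": safe_div(TP, TP + FN),
        "f1_micro": f1(TP, FP, FN),          # pooled counts over all labels
        "f1_macro": sum(f1(*c) for c in zip(tp, fp, fn)) / n_labels,
        "jaccard_samples": jaccard_sum / n_samples,
    }


y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
scores = multilabel_scores(y_true, y_pred)
```

Macro F1 weights every label equally, so rare labels such as 'religious' pull it down more than micro F1, which is consistent with the gap between the micro and macro scores reported above.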


Conclusion

  • We used social media data related to Covid vaccines to explore public concerns about vaccination.
  • Tested multiple models with the available data.
  • Utilized a Classifier Chain model incorporating Support Vector Machines (SVM) and Logistic Regression.
  • Employed a Multi-Output Classifier wrapping Support Vector Machines.
  • Achieved the highest performance with the Multi-Output Classifier, with a micro F1 score of 0.64.


Conclusion

Opportunities for Improvement:

  • Acknowledged room for performance enhancement.
  • Suggested potential improvement avenues:
      • Enhancing the dataset quality.
      • Exploring additional preprocessing methods.
      • Considering data augmentation strategies.


Lakshmi S. Gopal

Amrita Vishwa Vidyapeetham

lakshmisgopal@am.amrita.edu