Lakshmi S. Gopal

Research Scholar

Aswathy A

Research Scholar

Krishnendu K

Project Assistant

Dr. Hemalatha T

Assistant Professor

Amrita Center for Wireless Networks & Applications (AWNA),

Amrita Vishwa Vidyapeetham,

Amritapuri Campus

Forum for Information Retrieval Evaluation

Our Center

A leading multi-disciplinary research center at Amrita Vishwa Vidyapeetham, our scientists, academicians and students innovate leading edge technologies in the field of wireless, sensors, computing, and networking to solve grand challenges facing humanity today in the areas of disaster monitoring & preparedness, climate change, energy, water, and health.

1⁺

Million Lives Impacted

Products Commercialised

100⁺

Million Research Funding

450⁺

International Publications

14⁺

Patents Granted

Sri Mata Amritanandamayi Devi, Chancellor,

Amrita Vishwa Vidyapeetham

Our Disaster Management Related Projects

Amrita Wireless Sensor Network for Landslide Detection System for Himalayas

Landslide Multi-Hazard Risk Assessment, Preparedness and Early Warning in South Asia: Integrating Meteorology, Landscape and Society (LANDSLIP)

Flood Evac (Vulnerability of Transportation Structures, Warning and Evacuation in Case of Major Inland Flooding)

Micronet – Mobile Infrastructure for Coastal Region Offshore Communications & Networks

Content

Introduction

Task Details

Methodology

Exploratory Data Analysis

Data Preprocessing

Model Creation for Multi Label Classification

Results and Evaluation

Conclusion

Introduction

Vaccination is crucial for public health, saving lives and reducing disease burden.
Despite the effectiveness of vaccines, hesitancy persists, fueled by easy access to both credible and misleading information in the digital age.

Crowdsourced data platforms provide valuable insights into vaccine-related experiences and concerns.

Machine learning is a vital tool in public health, capable of analyzing large datasets and revealing patterns in vaccine-related sentiments expressed on social media.

The research paper aims to contribute to understanding vaccine hesitancy by analyzing crowdsourced vaccine data using machine learning.

The study seeks a deeper understanding of concerns expressed through crowdsourcing to inform targeted public health interventions and communication strategies.

Leveraging machine learning algorithms, the research aims to shed light on intricate dynamics of vaccine hesitancy for more effective vaccination campaigns and improved public health outcomes.

Task Details

We aim to develop a multi-label classification model

The classifier labels each tweet with the specific concern(s) it raises about vaccines

A tweet can have more than one label (concern)

The following are the different labels in focus:

Unnecessary: Suggests immunizations aren't necessary or better alternatives exist.
Mandatory: Opposes making vaccinations obligatory.
Pharma: Big Pharma prioritizes profit over other considerations.
Conspiracy: Hints at a broader conspiracy beyond Big Pharma's profit motive.
Political: Raises concerns that governments and politicians use vaccines to further their agendas.
Country: Criticizes a vaccination based on its country of origin.
Rushed: Questions the adequacy of vaccine testing and the reliability of available data.
Ingredients: Concerns about vaccine contents or the utilized technology.
Side-effect: Worry about vaccine side effects, including potential deaths.
Ineffective: Concern that immunizations are ineffective and useless in certain cases.
Religious: Opposes vaccinations for religious reasons.
None: No explicit justification is provided.

Methodology

The proposed methodology performs multi-label classification on the given dataset.
The following steps were employed:

Exploratory Data Analysis (EDA)
Data Preprocessing
Model Creation for Multi Label Classification

Exploratory Data Analysis

The given data initially had 3 columns: ‘ID’, ‘tweet’ and ‘labels’
The data was transformed into a one-hot encoded dataset with 9921 rows and 14 columns
The data contained no null or NaN values
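
As a rough illustration of the transformation described above, the sketch below expands a 'labels' column into one 0/1 column per concern using only the Python standard library. The sample rows are invented, and both the lowercase label names and the space-separated label format are assumptions; the slides do not show the raw file.

```python
# The 12 concerns from the task description (lowercase names are an assumption).
LABELS = [
    "unnecessary", "mandatory", "pharma", "conspiracy", "political", "country",
    "rushed", "ingredients", "side-effect", "ineffective", "religious", "none",
]

def one_hot_rows(rows, label_sep=" "):
    """Expand each row's label string into one 0/1 column per label."""
    encoded = []
    for row in rows:
        present = set(row["labels"].split(label_sep))
        unknown = present - set(LABELS)
        if unknown:
            raise ValueError(f"unexpected label(s): {sorted(unknown)}")
        new_row = {"ID": row["ID"], "tweet": row["tweet"]}
        new_row.update({name: int(name in present) for name in LABELS})
        encoded.append(new_row)
    return encoded

# Invented example rows, not taken from the task data.
rows = [
    {"ID": "t1", "tweet": "Too many blood clots reported", "labels": "side-effect"},
    {"ID": "t2", "tweet": "Rushed trials, unknown contents", "labels": "rushed ingredients"},
]
encoded = one_hot_rows(rows)
print(len(encoded[0]))  # ID + tweet + 12 label columns
```

With two text columns and 12 label columns, each encoded row has 14 fields, which is consistent with the 14 columns reported above.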

To understand the vocabulary used in the tweets, word clouds were generated
The figure shows the word clouds of the most frequent and least frequent labels in the dataset

[Figure: word clouds for the 'Side-Effects' and 'Religious' labels]

We observed that ‘side-effect’ was the most frequent label and ‘religious’ the least frequent
Around 80% of tweets were found to have a single label
Balancing the given data may be necessary to ensure optimal performance
Terms like 'vaccine', 'covid', and 'Pfizer' occur frequently in the word clouds.
Keywords related to 'side-effect', such as 'death', 'adverse reaction', 'blood clot', etc., are frequently found in the 'side-effect' word cloud.
Keywords related to the 'religious' label, such as 'religion', 'faith', 'psalm', etc., were found to be less frequent in the 'religious' word cloud.
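
The label-frequency and single-label observations above could be computed along the following lines. This is a minimal standard-library sketch; the toy annotations are invented and only mimic the shape of the data.

```python
from collections import Counter

def label_stats(label_sets):
    """Return per-label counts and the share of tweets with exactly one label."""
    counts = Counter(label for labels in label_sets for label in labels)
    single = sum(1 for labels in label_sets if len(labels) == 1)
    return counts, single / len(label_sets)

# Invented toy annotations, not the task data.
label_sets = [
    {"side-effect"},
    {"side-effect", "rushed"},
    {"ineffective"},
    {"side-effect"},
    {"religious"},
]
counts, single_share = label_stats(label_sets)
print(counts.most_common(1))  # most frequent label and its count
print(f"{single_share:.0%} of tweets have a single label")
```

A strongly skewed `counts` is the kind of evidence behind the remark that balancing the data may be necessary.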

Data Preprocessing

Basic data cleaning methods were employed
Python libraries used: NLTK, regular expressions, spaCy

Word tokenization
Lowercase conversion
Removal of URLs that often accompany a tweet (image and video URLs)
Removal of stopwords (the, a, is, etc.), except ‘not’ and ‘no’, which are kept to maintain context
Removal of special characters and smileys
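
The cleaning steps above can be sketched with the standard library alone. The stopword list here is a small illustrative stand-in (NLTK's English list would normally supply it), the regular expressions are assumptions, and the order of the steps is one reasonable choice rather than the one used in the study.

```python
import re

# Small illustrative stopword list; a stand-in for NLTK's English stopwords.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "of", "to", "and", "in", "it", "not", "no"}
KEEP = {"not", "no"}  # negations are kept to preserve context

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")

def preprocess(tweet):
    """Lowercase, strip URLs and special characters, tokenize, drop stopwords."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)  # removes punctuation, smileys, '@' and '#'
    tokens = text.split()               # simple whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS or t in KEEP]

print(preprocess("The vaccine is NOT safe :( https://t.co/abc123 #covid"))
# ['vaccine', 'not', 'safe', 'covid']
```

Keeping 'not' and 'no' matters here: without them, "not safe" and "safe" would reduce to the same token.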

Model Creation

Multi-label classification was tried with classifier chains and a multi-output classifier
Classifier chains offer a powerful approach to address interdependencies between labels in multi-label classification tasks

Each classifier predicts one label based on the input features
Predictions from previous classifiers are incorporated as additional features for subsequent classifiers

Train the first classifier (Ingredients) on the feature Tweet to predict the label Ingredients.
Train the second classifier (Side-effect) on the features Tweet and Ingredients to predict the label Side-effect.
The key mathematical concept is the iterative updating of features and labels as each classifier is trained and used in the chain, which captures dependencies between labels in the multi-label classification task.
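
To make the feature updating concrete, here is a minimal hand-written chain, assuming NumPy and scikit-learn are available. The data is synthetic and stands in for vectorized tweets (the slides do not say how tweets were turned into features), and `fit_chain` and `predict_chain` are hypothetical helper names. scikit-learn's `ClassifierChain` packages the same idea.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_chain(X, Y):
    """Fit one classifier per label; classifier j also sees the true labels 0..j-1."""
    models = []
    for j in range(Y.shape[1]):
        X_aug = np.hstack([X, Y[:, :j]])      # features + earlier labels
        models.append(LogisticRegression().fit(X_aug, Y[:, j]))
    return models

def predict_chain(models, X):
    """Predict labels in chain order, feeding each prediction to later classifiers."""
    preds = np.zeros((X.shape[0], len(models)), dtype=int)
    for j, model in enumerate(models):
        X_aug = np.hstack([X, preds[:, :j]])  # features + earlier predictions
        preds[:, j] = model.predict(X_aug)
    return preds

# Synthetic stand-in data: 200 samples, 5 features, 3 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.column_stack([
    X[:, 0] > 0,                    # label 0 depends on the features only
    X[:, 1] > 0,
    (X[:, 0] > 0) & (X[:, 1] > 0),  # label 2 is tied to labels 0 and 1
]).astype(int)

models = fit_chain(X, Y)
pred = predict_chain(models, X)
print(pred.shape)             # one column per label
print(models[2].coef_.shape)  # 5 original features + 2 earlier labels
```

The third classifier has seven coefficients rather than five, which is where the dependency between labels enters the model.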

[Figure: classifier chain illustrated on the example tweet "@cath__kath AstraZeneca is made with the kidney cells of a little girl aborted back in the 70s." The tweet and the labels from earlier links in the chain (Ingredients, Side-Effects) form the Features; the next label (for example Mandatory) is the Class.]

Model Creation

Multi-output classifier

The model is typically an extension of a traditional machine learning model, such as a decision tree, random forest, support vector machine, or neural network.

The model is trained on the dataset using a loss that takes all the output labels into account.

The loss function measures the difference between predicted and true values over all outputs.

During training, the model adjusts its parameters to minimize this loss across all outputs.

Model Creation

Classification algorithms were wrapped in the classifier chain and the multi-output classifier
Classifier chain with Logistic Regression

In multi-label classification, logistic regression is extended by using a classifier chain to handle multiple labels.

Each classifier in the chain is a logistic regression model.

For label i, logistic regression predicts the probability P(Y_i = 1 | X, Y_1, Y_2, ..., Y_{i−1})

The predicted probabilities are added as features for subsequent classifiers.
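
The per-label probabilities above are the factors of the chain rule for the joint label distribution. Written out for L labels (12 in this task), with σ the logistic function and w_i, b_i the weights and bias of the i-th classifier (notation introduced here for illustration, not taken from the slides):

```latex
P(Y_1, \dots, Y_L \mid X) = \prod_{i=1}^{L} P(Y_i \mid X, Y_1, \dots, Y_{i-1}),
\qquad
P(Y_i = 1 \mid X, Y_1, \dots, Y_{i-1})
  = \sigma\!\left( w_i^{\top} \, [X;\, Y_1;\, \dots;\, Y_{i-1}] + b_i \right)
```

Here [X; Y_1; ...; Y_{i−1}] denotes the input features concatenated with the earlier labels.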

Model Creation

Classifier chain with Support Vector Machines

SVM aims to find the hyperplane that maximally separates data points of different classes.

The hyperplane is the decision boundary, and the margin is the distance between the hyperplane and the nearest data points (support vectors).

SVM can efficiently handle non-linear relationships in the data using the kernel trick.

In a classifier chain, each classifier in the chain is an SVM model

Model Creation

Multi-output classifier with support vector machines

For each label i:

Train an SVM classifier on features X and the binary label Yi

Predict the binary label Yi from X; unlike in a classifier chain, predictions for other labels are not used as features

Take the true values of the next label as the target for the next iteration

Add the trained SVM model to the multi-output classifier.

SVM's ability to handle non-linear relationships is leveraged within the multi-output framework
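
A minimal sketch of the per-label scheme, assuming scikit-learn's `MultiOutputClassifier`, which fits one independent copy of the base estimator per label. The slides say only "SVM", so the use of `LinearSVC` is an assumption, and the data is synthetic.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 200 samples, 5 features, 3 binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.column_stack([X[:, 0] > 0, X[:, 1] > 0, X[:, 2] > 0]).astype(int)

# One LinearSVC is cloned and fitted for each label column.
clf = MultiOutputClassifier(LinearSVC(dual=False))
clf.fit(X, Y)
pred = clf.predict(X)

print(len(clf.estimators_))          # one fitted SVM per label
print(clf.estimators_[0].coef_.shape)  # only the 5 input features, no other labels
print(pred.shape)
```

Each fitted estimator sees only the original features, which is the main difference from a classifier chain.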

Results & Evaluation

Logistic Regression wrapped in a classifier chain
Scoring Metric	Model with default parameter values	Hyper parameter tuned model
Accuracy	0.42	0.48
Precision	0.67	0.68
Recall	0.45	0.54
F1 micro	0.54	0.60

Tuned Parameters: C: 0.1, 1, 1; penalty: 'l1', 'l2'

SVM wrapped in a classifier chain
Scoring Metric	Model with default parameter values	Hyper parameter tuned model
Accuracy	0.49	0.49
Precision	0.67	0.68
Recall	0.54	0.54
F1 micro	0.60	0.61

Tuned Parameters: C: 1; dual=false; penalty: 'l1'
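
One way such tuning could be set up is a small manual search over `C` and `penalty`, scored by micro F1 on a held-out split. This is a sketch under assumptions: the grid below is hypothetical (it is not the grid from the slides), `LinearSVC` is assumed as the SVM implementation, the data is synthetic, and the slides do not say whether a hold-out split or cross-validation was used.

```python
from itertools import product

import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 300 samples, 6 features, 3 binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
Y = np.column_stack([X[:, 0] > 0, X[:, 1] + X[:, 2] > 0, X[:, 3] > 0.5]).astype(int)
X_tr, X_val, Y_tr, Y_val = train_test_split(X, Y, test_size=0.3, random_state=0)

# Hypothetical search grid, chosen only for illustration.
grid = {"C": [0.1, 1, 10], "penalty": ["l1", "l2"]}

results = []
for C, penalty in product(grid["C"], grid["penalty"]):
    base = LinearSVC(C=C, penalty=penalty, dual=False)  # dual=False is required for 'l1'
    chain = ClassifierChain(base, random_state=0).fit(X_tr, Y_tr)
    Y_hat = chain.predict(X_val).astype(int)
    score = f1_score(Y_val, Y_hat, average="micro")
    results.append((score, C, penalty))

best_score, best_C, best_penalty = max(results)
print(best_C, best_penalty, round(best_score, 2))
```

The base estimator is passed positionally because the keyword name of that argument differs between scikit-learn versions.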

Results & Evaluation

Learning curves of Logistic Regression wrapped in Classifier Chain.

(a) shows the learning curve of the model before fine tuning.

(b) shows the learning curve of the model after fine tuning.

Results & Evaluation

Learning curves of Support Vector Machines wrapped in Classifier Chain.

(a) shows the learning curve of the model before fine tuning.

(b) shows the learning curve of the model after fine tuning.

Results & Evaluation

SVM wrapped in a multi output classifier
Scoring Metric	Score
Accuracy	0.46
Precision	0.80
Recall	0.54
F1 micro	0.64

Hyper parameter tuning was not performed for the multi output classifier

Final Results
Run File	Methodology	F1 macro	Jaccard
Model-1	SVM wrapped in Multi Output Classifier	0.38	0.45
Model-2	LR wrapped in Classifier Chain	0.38	0.41
Model-3	SVM wrapped in Classifier Chain	0.3	0.43
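
For reference, the scores reported in these tables can be computed from 0/1 label matrices as sketched below, in plain Python. Two points are assumptions, since the slides do not state them: that "Accuracy" means exact-match (subset) accuracy, and that the Jaccard score is averaged per tweet. The toy matrices are invented.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def column_counts(y_true, y_pred, j):
    """True positives, false positives, false negatives for label column j."""
    pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, fn

def micro_f1(y_true, y_pred):
    """Pool the counts over all labels, then compute a single F1."""
    counts = [column_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    tp, fp, fn = (sum(c[k] for c in counts) for k in range(3))
    return f1(tp, fp, fn)

def macro_f1(y_true, y_pred):
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    return sum(f1(*column_counts(y_true, y_pred, j)) for j in range(n_labels)) / n_labels

def jaccard_samples(y_true, y_pred):
    """Mean over tweets of |true AND predicted| / |true OR predicted|."""
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
        union = sum(1 for a, b in zip(t, p) if a == 1 or b == 1)
        scores.append(inter / union if union else 1.0)  # empty-vs-empty counted as a match
    return sum(scores) / len(scores)

def subset_accuracy(y_true, y_pred):
    """Share of tweets whose full label set is predicted exactly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Invented toy example: 4 tweets, 3 labels.
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0]]
print(micro_f1(y_true, y_pred), round(macro_f1(y_true, y_pred), 3))
print(jaccard_samples(y_true, y_pred), subset_accuracy(y_true, y_pred))
```

Macro F1 weights every label equally, so rare labels such as 'religious' pull it down more than they pull down micro F1. If precision and recall are micro-averaged as well, micro F1 is their harmonic mean: the precision of 0.80 and recall of 0.54 reported for the multi-output SVM give 2 × 0.80 × 0.54 / (0.80 + 0.54) ≈ 0.64, matching the reported F1 micro.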
