An Analysis of COVID-19 Related Twitter Data for Asian Hate Speech Using Machine Learning Algorithms

10th Annual COE Graduate Poster Presentation Competition

Student(s): Sandeep Shah (MS)

Advisor(s): Xiaohong Yuan

Cross-Disciplinary Research Area: ML and Cyber security

Introduction

  • Cyberhate, cyberbullying, and cyberthreats are very common today as technology use grows.
  • In this research, we use Support Vector Machine (SVM) and Random Forest classification algorithms.
  • The main goal is to use a classification model and analyze the trend of tweets.
  • The dataset contains about 206 million COVID-19 tweets from the date range January 2020 to March 2021.
  • We predict hate speech using the trained classification model.

Random Forest

  • Random Forest (RF) is a machine learning algorithm that is used for classification and regression.
  • It builds many decision trees and combines their predictions to assign a class.
  • A decision tree is a branching structure in which each node tests whether a condition is true or false.
  • The appropriate number of decision trees depends on the number of instances and the features related to them.
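
The voting idea behind a random forest can be sketched in a few lines of Python. This is a toy illustration only: the one-rule "trees", the trigger words, and the names `stump_contains` and `forest_predict` are hypothetical and are not the trained model from this work. A real random forest also learns each tree from a random sample of the data and features, which is omitted here.

```python
from collections import Counter

def stump_contains(word, label_if_true, label_if_false):
    """Return a tiny hypothetical 'tree' with one true/false branch on `word`."""
    def predict(tweet):
        return label_if_true if word in tweet.lower().split() else label_if_false
    return predict

def forest_predict(trees, tweet):
    """Classify a tweet by majority vote over the individual tree predictions."""
    votes = Counter(tree(tweet) for tree in trees)
    return votes.most_common(1)[0][0]

# Hand-written stumps for illustration; a trained forest would learn its splits.
toy_forest = [
    stump_contains("blame", "hate", "neutral"),
    stump_contains("fault", "hate", "neutral"),
    stump_contains("solidarity", "counter-hate", "neutral"),
]

hate_like = forest_predict(toy_forest, "they are to blame it is their fault")
neutral_like = forest_predict(toy_forest, "stay safe everyone")
```

In the first call two of the three stumps vote "hate", so the forest outputs "hate"; in the second call all three vote "neutral".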

Support Vector Machine

  • Support Vector Machine (SVM) is also a machine learning algorithm that is used for classification and regression.
  • It finds a hyperplane that separates the classes with the largest possible margin.
  • The training instances closest to the hyperplane are called support vectors, and they determine its position.
  • A new instance is classified according to the side of the hyperplane on which it falls.
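
For the SVM named in this section's heading, the decision rule of a linear classifier can be sketched as below. The weights, bias, and feature values are made-up numbers chosen for illustration, and the function names are hypothetical. The sketch covers only a two-class case with labels -1 and +1; how the three classes in this work (hate, counter-hate, neutral) were handled is not stated in the poster.

```python
def decision_value(weights, bias, features):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def hinge_loss(weights, bias, features, label):
    """Hinge loss max(0, 1 - y * (w.x + b)) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - label * decision_value(weights, bias, features))

# Hypothetical parameters and a two-feature example (illustration only).
w, b = [2.0, -1.0], -0.5
x = [1.0, 0.5]

score = decision_value(w, b, x)
predicted = 1 if score >= 0 else -1
```

Training an SVM amounts to choosing the weights and bias that keep the hinge loss small while keeping the margin wide; that optimization step is not shown here.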

Dataset

  • The dataset was created by Ziems, Caleb, et al. [1]. It is the largest dataset of anti-Asian hate and counter speech on Twitter, and it was created during the COVID-19 pandemic.
  • The dataset contains 206 million tweets within the date range of January 2020 to March 2021.
  • The dataset contains a total of 206,348,116 tweets, of which 0.64% are classified as hateful tweets, 0.55% are classified as counter speech tweets, and the rest are neutral tweets.
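
To make the class imbalance concrete, the percentages above can be converted into approximate tweet counts. The sketch below uses only the figures quoted in this section; because the percentages are rounded to two decimals, the resulting counts are estimates rather than exact values. The variable names are illustrative.

```python
TOTAL_TWEETS = 206_348_116  # total reported in the dataset description
HATE_PCT = 0.64             # percent of tweets classified as hateful
COUNTER_PCT = 0.55          # percent of tweets classified as counter speech

# Approximate counts implied by the rounded percentages.
hate_count = round(TOTAL_TWEETS * HATE_PCT / 100)
counter_count = round(TOTAL_TWEETS * COUNTER_PCT / 100)
neutral_pct = round(100 - HATE_PCT - COUNTER_PCT, 2)

print(f"hateful tweets: about {hate_count:,}")
print(f"counter speech tweets: about {counter_count:,}")
print(f"neutral share: {neutral_pct}%")
```

This gives roughly 1.32 million hateful tweets and 1.13 million counter speech tweets, with about 98.81% of the data neutral.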

Vectorizer

  • Two vectorizers are used to extract the features of the tweets: a TF-IDF vectorizer and a count vectorizer.
  • The TF-IDF vectorizer learns the frequency of each word and returns a score for each word, weighting down words that are common across the dataset.
  • The count vectorizer simply counts the words, creating a bag of words, and returns the number of times each word occurs.
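
The difference between the two vectorizers can be sketched in plain Python. This is a minimal illustration using the textbook TF-IDF formula (term frequency times log(N / document frequency)); library implementations such as scikit-learn's apply additional smoothing and normalization, so their scores will differ. The toy tweets and function names are hypothetical and are not taken from the dataset.

```python
import math
from collections import Counter

def count_vectors(docs):
    """Bag of words: one {word: raw count} mapping per document."""
    return [Counter(doc.lower().split()) for doc in docs]

def tfidf_vectors(docs):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    counts = count_vectors(docs)
    n_docs = len(docs)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for c in counts for word in c)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in c.items()
        })
    return vectors

# Hypothetical toy tweets (illustration only).
docs = ["stay home stay safe", "stay kind", "wash hands"]
counts = count_vectors(docs)
tfidf = tfidf_vectors(docs)
```

In the first toy tweet, "stay" has the highest raw count, but its TF-IDF score is lower than that of "home" because "stay" also appears in another tweet.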

Results

The performance metrics used:
  • Accuracy is the percentage of correctly classified records over the total number of records.
  • Precision is the ratio of records correctly classified as hate (or counter-hate or neutral) over the total number of records classified as hate (or counter-hate or neutral).
  • Recall is the ratio of records correctly classified as hate (or counter-hate or neutral) over the total number of actual hate (or counter-hate or neutral) records.
  • F1-score is the harmonic mean of precision and recall, reported as a percentage.
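
These definitions can be computed directly from a confusion matrix, as the sketch below shows. The matrix values are invented for illustration and are NOT the results reported in Fig. 1 or Fig. 2; the function name `per_class_metrics` is also hypothetical. Values are returned as fractions and can be multiplied by 100 to report percentages.

```python
def per_class_metrics(confusion, labels):
    """Compute accuracy and per-class precision, recall, and F1.

    confusion[i][j] is the number of records whose actual class is
    labels[i] and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    report = {"accuracy": correct / total}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted_k = sum(row[k] for row in confusion)  # column sum
        actual_k = sum(confusion[k])                    # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Hypothetical confusion matrix (placeholder numbers, not the poster's data).
labels = ["hate", "counter-hate", "neutral"]
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 3, 17],
]
report = per_class_metrics(confusion, labels)
```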

Conclusion

  • In this research, we observed that the SVM classifier performs better than the Random Forest classifier in classifying hate-related tweets.
  • We also observed that the count vectorizer works better than the TF-IDF vectorizer for classifying tweets.
  • After predicting the hate-related tweets using the SVM model for the COVID-19 related dataset collected from April to November 2020, we found that hate and counter-hate speech during the months of June, July, October, and November is higher than in the other four months.
  • We hypothesize that there may be a correlation between hate-related speech and the number of coronavirus cases, as well as the events that caused high numbers of coronavirus cases.

Fig. 1 shows the confusion matrix and the prediction performance metrics using the SVM classifier, and Fig. 2 shows the confusion matrix and prediction performance metrics using the Random Forest classifier.

Fig. 1: Confusion Matrix and Performance Metrics for Support Vector Machine with Count Vectorizer.

Fig. 2: Confusion Matrix and Performance Metrics for Random Forest with Count Vectorizer.

Fig. 4. Number of Hate and Counter-Hate Tweets from Apr to Nov 2020.

Fig. 3. Percentage of Hate and Counter-Hate Tweets from Apr to Nov 2020.

Fig. 3 shows the percentage of hate tweets and counter-hate tweets for each month. The trend shows that people continued to post hate speech and counter-hate speech throughout the year. Fig. 4 shows the number of tweets related to hate and counter-hate from April to November 2020. We selected a different number of tweets for each month, and the results show that the numbers of hate and counter-hate tweets increase as the total number of tweets increases.
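
Because a different number of tweets was selected for each month, raw counts and percentages can tell different stories. The sketch below shows how the monthly percentages could be derived from counts. The numbers are invented placeholders and are NOT the values behind Fig. 3 or Fig. 4, and the function name is hypothetical.

```python
def monthly_percentages(monthly_counts):
    """Map each month to (hate %, counter-hate %) of that month's sample."""
    result = {}
    for month, c in monthly_counts.items():
        total = c["total"]
        result[month] = (100 * c["hate"] / total, 100 * c["counter"] / total)
    return result

# Placeholder counts for illustration only; not the poster's results.
toy_counts = {
    "Apr": {"total": 1000, "hate": 10, "counter": 5},
    "May": {"total": 4000, "hate": 20, "counter": 12},
}
pct = monthly_percentages(toy_counts)
```

In this toy example the number of hate tweets doubles from the first month to the second while its percentage halves, which is why it helps to read the count and percentage figures together.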