1 of 15

Digital Democracy

Using social media to improve political discourse

Yash Shah, Swetha Thomas, Hongyu Li, Raveena Kshatriya, and Abhi Thadeshwar

Advisors: Andres Abeliuk and Alex Spangher

2 of 15

Motivation

Twitter is one of the primary tools used for interactions between politicians and the public
Even though individual tweets have length restrictions, the volume of tweets can grow very rapidly. This is especially true during elections.
This makes it difficult for politicians to sift through tweets and find the ones that require a response
As a result, using Twitter to raise a topic of social importance with the people concerned often fails
This is the motivation and premise of our project

3 of 15

Proposed Solution

Social Media is About Creating Dialogues NOT Monologues

To attempt to solve this issue, we are working towards developing a dashboard for political candidates
The dashboard will aggregate all the tweets directed towards the candidate and cluster the relevant topics or keywords
Then relevant questions from each cluster will be highlighted to the candidate

4 of 15

Impact

The dashboard will make it easier for a candidate to see which topics are of social and political importance
By answering the selected tweets from each cluster, the politician can address the questions of the majority of the tweeting audience
The politician can also examine the different opinions the audience has about their actions or viewpoints, and determine what issues they should focus on in their speeches and campaigns
This will provide a politician with extremely valuable information about how the public feels about them and the issues they support, and could be a great resource for devising campaign strategies and speeches

5 of 15

The Dataset

Tweets scraped about different political candidates from the 2020 Presidential Election
Collected on days between October 2019 and February 2020
Most of the tweets were from or directed towards the candidates
We have this data for around 31 days, with around 1 million tweets/day
Each tweet has over 30 features like tweet id, user id, hashtags, mentions, location, and date and time.
Many of the tweets were simple retweets with no added comments

Special thanks to Ashok Deb for the data collection

6 of 15

Preliminary Data Analysis

We began by examining patterns in the dataset
We looked at data points such as the top hashtags, which candidates were tweeted about most often, and how many of the tweets actually contained original content

7 of 15

Preliminary Data Analysis

8 of 15

Preprocessing

Removing retweeted tweets
Removing very short tweets
Removing tweets in languages other than English
Focusing on a specific candidate (Trump, Bernie, etc.)
Detecting questions and removing tweets that do not require a response (still in progress)
Reducing number of tweets from 700k to approximately 100k per day
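
The filtering steps above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the tweet field names (`text`, `lang`, `is_retweet`), the `MIN_WORDS` cutoff, and the question-word list are all assumptions made for the example, and the question detector is only a crude stand-in for the step that is still in progress.

```python
import re

MIN_WORDS = 4  # assumed cutoff for "very short" tweets
QUESTION_WORDS = {"what", "why", "how", "when", "where", "who",
                  "will", "can", "do", "does", "is", "are"}  # assumed list


def is_retweet(tweet):
    # Assumption: a retweet carries a boolean flag or starts with "RT @".
    return tweet.get("is_retweet", False) or tweet["text"].startswith("RT @")


def looks_like_question(text):
    # Crude heuristic: a question mark anywhere, or an interrogative
    # first word once leading @mentions are stripped.
    body = re.sub(r"^(@\w+\s+)+", "", text.strip().lower())
    words = body.split()
    first = words[0].strip(".,!?") if words else ""
    return "?" in text or first in QUESTION_WORDS


def preprocess(tweets, candidate_handle):
    """Keep original, English, sufficiently long tweets that mention the candidate."""
    kept = []
    for tweet in tweets:
        text = tweet["text"]
        if is_retweet(tweet):
            continue
        if len(text.split()) < MIN_WORDS:
            continue
        if tweet.get("lang") != "en":  # assumed language tag on each tweet
            continue
        if candidate_handle.lower() not in text.lower():
            continue
        kept.append(tweet)
    return kept


sample = [
    {"text": "RT @berniesanders great speech", "lang": "en"},
    {"text": "@berniesanders ok", "lang": "en"},
    {"text": "@berniesanders hola como estas amigo mio", "lang": "es"},
    {"text": "@berniesanders what is your plan for healthcare?", "lang": "en"},
    {"text": "@berniesanders you gave a great speech today", "lang": "en"},
]
kept = preprocess(sample, "@berniesanders")
questions = [t for t in kept if looks_like_question(t["text"])]
```

Each filter is independent, so the order can be changed; running the cheapest checks first keeps the per-tweet cost low on large daily volumes.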

9 of 15

Approach

K-Means clustering over feature representations such as TF-IDF and word embeddings

Individual (pairwise) similarity using measures such as Jaccard and cosine similarity, again leveraging word embeddings such as GloVe and Word2Vec

10 of 15

1) K-Means Clustering

An unsupervised machine learning algorithm used to create K clusters of similar data points
After preprocessing, we used TF-IDF, word2vec and GloVe embeddings to create numerical feature vectors from the tweets

K-means clusters with TF-IDF feature vectors

K-means clusters with GloVe word embedding feature vectors
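
As a rough illustration of the TF-IDF variant, the sketch below builds normalized TF-IDF vectors and runs a plain K-means loop using only the Python standard library. It is not the project's implementation: the whitespace tokenizer, the smoothed IDF formula, the farthest-first initialization (chosen here to keep the example deterministic), and the toy documents are all assumptions. In practice a library implementation would typically be used instead.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalized TF-IDF vectors over a shared vocabulary."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed inverse document frequency (an assumed variant).
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain K-means; returns one cluster label per input vector."""
    # Farthest-first initialization: deterministic, assumed for this sketch.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


docs = [
    "healthcare plan medicare cost",
    "medicare healthcare cost hospital",
    "climate change carbon emissions",
    "carbon emissions climate policy",
]
labels = kmeans(tfidf_vectors(docs), k=2)
```

Swapping in word2vec or GloVe features would only change how the vectors are produced (for example, averaging the embeddings of a tweet's words); the clustering loop itself stays the same.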

11 of 15

Cluster Word Clouds

12 of 15

2) Individual Similarity

Represented tweets as vectors using word2vec embeddings.
Calculated cosine similarity between pairs of tweets for each individual candidate on a given day
Set a threshold value for the similarity measure to group similar tweets together.
The bottleneck of this approach is that the number of pairs grows quadratically with the number of tweets.
So checking pairwise similarity for every possible pair becomes computationally expensive.
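
A minimal sketch of the thresholded pairwise comparison is shown below. The toy two-dimensional vectors stand in for real tweet embeddings and the threshold value is an arbitrary assumption. The `pair_count` helper gives the number of comparisons, n(n-1)/2, which is what makes the all-pairs approach costly at around 100k tweets per day.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(vectors, threshold):
    """Return index pairs (i, j), i < j, whose cosine similarity meets the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(vectors), 2)
        if cosine(a, b) >= threshold
    ]


def pair_count(n):
    """Number of unordered pairs among n tweets."""
    return n * (n - 1) // 2


# Toy vectors standing in for tweet embeddings (assumed values).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
pairs = similar_pairs(vectors, threshold=0.8)
comparisons_per_day = pair_count(100_000)
```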

13 of 15

Example of Similar Tweet Pairs Obtained

@berniesanders if you send me enough money for my knee replacement bernie i will vote for you because you are the best in the world you will help everybody on the planet you’re a wonderful man but i need money now

Similar tweets obtained:

@berniesanders bernie i think that you are a great presidential candidate and you have great ideas if you can help me with this obamacare or abomination lost my doctor lost my health care now i need a hip replacement what do i do help me please if you care i will hear back
@berniesanders so let me know if you can help me because obama failed me and i will vote for you and all my friends all my cousins all my family will vote for you but i need a hip replacement please please help me contact me now please
@berniesanders bernie i think that you really really have something going for you especially when you talk about free healthcare right now i can’t even walk down the street i need help because obama crush me lost everything
@berniesanders bernie i pray for you every night before i go to sleep i need a hip replacement which i would’ve had if it wasn’t for obama care but that’s ok i believe in you you remind me of our savior

14 of 15

Conclusion

Conducted thorough analysis of the dataset and understood the trends present in the data.
Faced difficulties in determining whether the tweets grouped into each cluster made sense
Realized the importance of having a labeled dataset in order to obtain meaningful results
Began the process of clustering tweets and finding similarities between tweets to create a labeled dataset

15 of 15

Future Work

Clustering produces many clusters that do not make sense
However, it is the best option available to us for an unlabelled dataset
We are working on applying locality-sensitive hashing (LSH) and other viable similarity techniques to group similar tweets together
This will enable us to get a labelled dataset which can then be fed to any supervised machine learning algorithms
We are also working on methods to detect tweets that actually have content and are worth replying to vs. tweets that are hateful or simple comments.
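
One common form of locality-sensitive hashing for cosine similarity uses random hyperplanes, sketched below. This illustrates the general technique only and is not the variant the project will necessarily adopt: the number of bits, the seed, and the toy vectors are assumptions. Each vector is hashed to a bit signature, and only vectors sharing a bucket need to be compared, which avoids the all-pairs check.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def lsh_buckets(vectors, hyperplanes):
    """Group vector indices by signature; candidates share a bucket."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[signature(vec, hyperplanes)].append(idx)
    return dict(buckets)


planes = make_hyperplanes(dim=3, n_bits=8)
v = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]     # cosine similarity 1 with v
opposite = [-1.0, -2.0, -3.0]        # cosine similarity -1 with v
buckets = lsh_buckets([v, same_direction, opposite], planes)
```

More bits make buckets smaller and stricter; using several independent hash tables is a typical way to recover near neighbours that a single table would split apart.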