1 of 15

Digital Democracy

Using social media to improve political discourse

Yash Shah, Swetha Thomas, Hongyu Li, Raveena Kshatriya, and Abhi Thadeshwar

Advisors: Andres Abeliuk and Alex Spangher

2 of 15

Motivation

  • Twitter is one of the primary tools used for interactions between politicians and the public
  • Although each tweet is limited in length, the volume of tweets can grow very rapidly, especially during elections
  • This makes it difficult for politicians to sift through tweets and find the ones that require a response
  • As a result, using Twitter to bring a topic of social importance to the attention of the people concerned often fails
  • This is the motivation and premise of our project

3 of 15

Proposed Solution

Social Media is About Creating Dialogues NOT Monologues

  • To attempt to solve this issue, we are working towards developing a dashboard for political candidates
  • The dashboard will aggregate all the tweets directed towards the candidate and cluster the relevant topics or keywords
  • Relevant questions from each cluster will then be highlighted to the candidate

4 of 15

Impact

  • The dashboard will make it easier for a candidate to see which topics of social and political importance the public is raising
  • By answering the selected tweets from each cluster, the politician can address the questions of most of the tweeting audience
  • The politician can also examine the different opinions the audience has about their actions or viewpoints, and determine what issues they should focus on in their speeches and campaigns
  • This will provide a politician with extremely valuable information about how the public feels about them and the issues they support, and could be a great resource for devising campaign strategies and speeches

5 of 15

The Dataset

  • Tweets scraped about different political candidates from the 2020 Presidential Election
  • Collected on days between October 2019 and February 2020
  • Most of the tweets were from or directed towards the candidates
  • We have this data for around 31 days, with around 1 million tweets/day
  • Each tweet has over 30 features like tweet id, user id, hashtags, mentions, location, and date and time.
  • Many of the tweets were simple retweets with no added comments

Special thanks to Ashok Deb for the data recollection

6 of 15

Preliminary Data Analysis

  • We began by examining patterns in the dataset
  • We looked at data points such as the top hashtags, which candidates were tweeted about most often, and how many of the tweets actually contained original content

7 of 15

Preliminary Data Analysis

8 of 15

Preprocessing

  1. Removing retweeted tweets
  2. Removing very short tweets
  3. Removing tweets in languages other than English
  4. Focusing on a specific candidate (Trump, Bernie, etc.)
  5. Detecting questions and removing tweets that do not require a response (still in progress)
  6. Net result: a reduction from about 700k to approximately 100k tweets per day
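
The first four filtering steps above can be sketched as follows. This is a minimal illustration, not our actual pipeline: the field names `text`, `lang`, and `is_retweet`, the `min_words` cutoff, and the sample tweets are all assumptions made for the example.

```python
def preprocess(tweets, candidate_handle, min_words=4):
    """Keep original, English, sufficiently long tweets that mention one candidate.

    Field names ("text", "lang", "is_retweet") and min_words are hypothetical.
    """
    kept = []
    for tweet in tweets:
        text = tweet["text"].strip()
        if tweet.get("is_retweet") or text.lower().startswith("rt @"):
            continue  # step 1: drop retweets
        if len(text.split()) < min_words:
            continue  # step 2: drop very short tweets
        if tweet.get("lang") != "en":
            continue  # step 3: drop tweets not marked as English
        if candidate_handle.lower() not in text.lower():
            continue  # step 4: focus on a single candidate
        kept.append(tweet)
    return kept


# Invented sample records, for illustration only.
sample = [
    {"text": "RT @berniesanders: thank you Iowa", "lang": "en", "is_retweet": True},
    {"text": "@berniesanders ok", "lang": "en", "is_retweet": False},
    {"text": "@berniesanders ¿qué opina sobre la salud?", "lang": "es", "is_retweet": False},
    {"text": "@berniesanders what is your plan for student debt?", "lang": "en", "is_retweet": False},
]
filtered = preprocess(sample, "@berniesanders")
```

Only the last sample tweet survives all four filters. Question detection (step 5) is left out because that step is still in progress.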

9 of 15

Approach

  1. K-Means clustering on feature representations such as TF-IDF vectors and word embeddings

  2. Individual (pairwise) similarity using Jaccard and cosine measures, again leveraging word embeddings such as GloVe and Word2Vec
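
As a small illustration of one of the similarity measures named above, Jaccard similarity on word sets can be sketched like this. Whitespace tokenization and the two example strings are assumptions made for the sketch.

```python
def jaccard(text_a, text_b):
    """Jaccard similarity between the word sets of two texts."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 0.0  # convention chosen here for two empty texts
    return len(a & b) / len(a | b)


# Four shared words out of six distinct words overall.
score = jaccard("i need a hip replacement", "i need a knee replacement")
```

Jaccard only looks at exact word overlap, whereas cosine similarity over word embeddings can also capture related words that do not match exactly.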

10 of 15

1) K-Means Clustering
  • An unsupervised machine learning algorithm used to create K clusters of similar data points
  • After preprocessing, we used TF-IDF, word2vec and GloVe embeddings to create numerical feature vectors from the tweets

K-means clusters with TF-IDF feature vectors

K-means clusters with GloVe word embedding feature vectors
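
To make the TF-IDF variant concrete, here is a dependency-free sketch of the idea: build TF-IDF vectors, then run plain k-means (Lloyd's algorithm) on them. It is not the code behind the plots. The smoothed IDF formula, the farthest-first initialization (chosen only to keep the example deterministic), and the four toy documents are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn whitespace-tokenized documents into dense TF-IDF vectors."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    n_docs = len(docs)
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        length = max(len(toks), 1)
        # One common smoothed variant: tf * log((1 + N) / (1 + df)).
        vectors.append([
            (counts[tok] / length) * math.log((1 + n_docs) / (1 + doc_freq[tok]))
            for tok in vocab
        ])
    return vectors


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, n_iter=20):
    """Plain Lloyd's algorithm; returns one cluster label per vector."""
    # Farthest-first initialization (an assumption, used for determinism).
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors, key=lambda v: min(squared_distance(v, c) for c in centroids)
        )
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented toy documents, for illustration only.
docs = [
    "medicare for all healthcare plan",
    "healthcare plan medicare coverage",
    "tax cuts for billionaires",
    "billionaires tax cuts wealth",
]
labels = kmeans(tfidf_vectors(docs), k=2)
```

On this toy input the two healthcare documents share one label and the two tax documents share the other. The same `kmeans` function would accept averaged GloVe or word2vec vectors in place of the TF-IDF vectors.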

11 of 15

Cluster Word Clouds

12 of 15

2) Individual Similarity

  • Represented tweets as vectors using word2vec embeddings.
  • Calculated the cosine similarity between each pair of tweets directed at a given candidate on a given day
  • Set a threshold value for the similarity measure to group similar tweets together.
  • The bottleneck of this approach is that the number of pairs grows quadratically with the number of tweets: n tweets yield n(n-1)/2 pairs.
  • Checking the similarity of every possible pair therefore becomes computationally infeasible at this scale.
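
The thresholded pairwise comparison can be sketched as below. The 3-dimensional toy vectors stand in for word2vec-based tweet embeddings, and the 0.9 threshold is an arbitrary value picked for the example, not the one used in the project.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_pairs(vectors, threshold=0.9):
    """Return index pairs whose cosine similarity meets the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(vectors)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


# Toy vectors standing in for tweet embeddings (hypothetical values).
tweet_vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0]]
pairs = similar_pairs(tweet_vectors)

# Comparisons needed for n tweets: n * (n - 1) / 2.
n_comparisons = math.comb(100_000, 2)
```

For the roughly 100k tweets per day left after preprocessing, `n_comparisons` comes to about 5 billion pairs per day, which is why exhaustive comparison does not scale.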

13 of 15

Example of Similar Tweet Pairs Obtained

@berniesanders if you send me enough money for my knee replacement bernie i will vote for you because you are the best in the world you will help everybody on the planet you’re a wonderful man but i need money now

Similar tweets obtained:

  1. @berniesanders bernie i think that you are a great presidential candidate and you have great ideas if you can help me with this obamacare or abomination lost my doctor lost my health care now i need a hip replacement what do i do help me please if you care i will hear back
  2. @berniesanders so let me know if you can help me because obama failed me and i will vote for you and all my friends all my cousins all my family will vote for you but i need a hip replacement please please help me contact me now please
  3. @berniesanders bernie i think that you really really have something going for you especially when you talk about free healthcare right now i can’t even walk down the street i need help because obama crush me lost everything
  4. @berniesanders bernie i pray for you every night before i go to sleep i need a hip replacement which i would’ve had if it wasn’t for obama care but that’s ok i believe in you you remind me of our savior

14 of 15

Conclusion

  • Conducted thorough analysis of the dataset and understood the trends present in the data.
  • Faced difficulties in determining whether the tweets grouped into each cluster actually belonged together
  • Realized the importance of having a labeled dataset in order to obtain meaningful results
  • Began the process of clustering tweets and finding similarities between tweets to create a labeled dataset

15 of 15

Future Work

  • Clustering produces many clusters that do not make sense
  • However, it remains the best option for an unlabeled dataset
  • We are working on applying locality-sensitive hashing (LSH) and other viable similarity techniques to group similar tweets together
  • This will give us a labeled dataset that can then be fed to supervised machine learning algorithms
  • We are also working on methods to distinguish tweets that have substantive content and are worth replying to from tweets that are hateful or are simple comments.
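
For reference, one common form of locality-sensitive hashing is MinHash with banding, sketched below. The slides do not fix a particular LSH scheme, so the choice of MinHash, the MD5-based hash family, and the `num_hashes` and `bands` values are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def minhash_signature(tokens, num_hashes=32):
    """One minimum hash value per seeded hash function."""
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{tok}".encode()).digest()[:8], "big")
            for tok in tokens
        )
        for seed in range(num_hashes)
    ]


def lsh_candidate_pairs(texts, num_hashes=32, bands=16):
    """Return index pairs whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for idx, text in enumerate(texts):
        tokens = set(text.lower().split())
        if not tokens:
            continue  # nothing to hash for an empty tweet
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            # Tweets landing in the same (band, values) bucket become candidates.
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


# Invented texts: the first two contain the same words in a different order.
texts = [
    "i need a hip replacement",
    "a hip replacement i need",
    "tax the billionaires now",
]
candidates = lsh_candidate_pairs(texts)
```

Only tweets that share a bucket are compared, so most of the n(n-1)/2 pairs are never examined. Tweets with high word overlap are likely, but not guaranteed, to be proposed as candidates.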