Machine Learning for Disinformation Analysis
Hashtag Similarity Mapping via Dimensionality Reduction
Background
The First Trump Impeachment
We have previously collected 67.7M tweets about the First Trump Impeachment (including 3.6M user profiles), and stored them in a Google BigQuery database for further analysis.
Previous Research
Bot Classification
Previous research goals:
"Bots, Disinformation, and the First Trump Impeachment"
- Michael Rossetti, Tauhid Zaman (https://arxiv.org/abs/2204.08915)
Previous Research
Sentiment Classification
Previous research goal: predict the degree to which any given tweet expresses either Pro-Trump or Anti-Trump sentiments
[Figure: example sentiment scores of 0, 1, and 0.5]
Explore results on our website
Problem
The Problem
Evidence of Disinformation Spread
The Problem
What is Q-Anon?
Q-Anon is a popular conspiracy theory / disinformation campaign. Q-Anon supporters use hashtags such as #QANON, #WWG1WGA, #GREATAWAKENING, etc. to communicate and organize. We saw Q-Anon supporters participate in the attack on the US Capitol on January 6, 2021. How can we prevent events like this in the future?
The Problem
Challenges facing Social Networks and Online Platforms
Challenges in Trust and Safety:
Related Work
Hashtag Co-occurrence Analysis
"A network-based approach to QAnon user dynamics and topic diversity during the COVID-19 infodemic"
- Wentao Xu, Kazutoshi Sasahara (https://arxiv.org/abs/2111.00537)
[Figures: hashtag co-occurrence network; hashtag semantic map]
Research Plan
Objectives and Constraints
Goals / Objectives:
Assumptions / Constraints:
Research Plan
Our Contributions
Differentiating factors for this project:
Methods
Step 1: Data Collection
Twitter API Stream Listener
Step 2: One-Hot Encoding
BigQuery database records:
user_id | tag |
408 | #RESIST |
408 | #IMPEACH |
535 | #AUTHOR |
1150 | #RESIST |
1187 | #1A |
3301 | #RESIST |
4011 | #MAGA |
4011 | #TRUMP2020 |
4822 | #VOTEBLUE |
4822 | #NEVERTRUMP |
4822 | #RESIST |
6789 | #MAGA |
6789 | #QANON |
One-hot encodings (one row per hashtag, one column per user_id):
tag | 408 | 535 | 1150 | 1187 | 3301 | 4011 | 4822 | 6789 |
#1A | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
#AUTHOR | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
#IMPEACH | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
#MAGA | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
#NEVERTRUMP | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
#QANON | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
#RESIST | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
#TRUMP2020 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
#VOTEBLUE | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
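The pivot from (user_id, tag) records to a hashtag-by-user matrix can be sketched as follows. This is a minimal illustration using only the Python standard library and the toy rows from this slide; the function name `one_hot_encode` is hypothetical, and the project's actual pipeline (which reads from BigQuery) may differ.

```python
# Toy (user_id, tag) rows copied from the example records on this slide.
records = [
    (408, "#RESIST"), (408, "#IMPEACH"), (535, "#AUTHOR"), (1150, "#RESIST"),
    (1187, "#1A"), (3301, "#RESIST"), (4011, "#MAGA"), (4011, "#TRUMP2020"),
    (4822, "#VOTEBLUE"), (4822, "#NEVERTRUMP"), (4822, "#RESIST"),
    (6789, "#MAGA"), (6789, "#QANON"),
]


def one_hot_encode(rows):
    """Return (tags, user_ids, matrix), where matrix[i][j] is 1 if
    user_ids[j] lists tags[i] in their profile, and 0 otherwise."""
    user_ids = sorted({user_id for user_id, _ in rows})
    tags = sorted({tag for _, tag in rows})
    pairs = set(rows)  # fast membership test for (user_id, tag)
    matrix = [[1 if (u, t) in pairs else 0 for u in user_ids] for t in tags]
    return tags, user_ids, matrix


tags, user_ids, matrix = one_hot_encode(records)
for tag, row in zip(tags, matrix):
    print(f"{tag:<12}", row)
```

Each row of the matrix describes one hashtag in terms of the users who adopted it, so hashtags used by similar sets of users end up with similar rows.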
Step 3: Dimensionality Reduction
One-hot encodings:
tag | 408 | 535 | 1150 | 1187 | 3301 | 4011 | 4822 | 6789 |
#1A | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
#AUTHOR | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
#IMPEACH | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
#MAGA | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
#NEVERTRUMP | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
#QANON | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
#RESIST | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
#TRUMP2020 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
#VOTEBLUE | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Reduced "embeddings":
tag | X | Y |
#1A | 3.1256592 | 1.3281181 |
#AUTHOR | 4.894096 | 3.5918093 |
#IMPEACH | 3.2395177 | 4.7176585 |
#MAGA | 2.6817877 | 1.94366 |
#NEVERTRUMP | 4.229362 | 3.7070966 |
#QANON | 2.0955057 | 3.2642653 |
#RESIST | 3.4142375 | 5.007076 |
#TRUMP2020 | 2.9583123 | 2.3489096 |
#VOTEBLUE | 4.33293 | 4.940848 |
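To make the reduction step concrete, here is a minimal sketch that projects the toy one-hot matrix to two dimensions with PCA, computed via NumPy's singular value decomposition. PCA is used here only as a simple stand-in: the X and Y values shown on this slide were not produced by this code, and the helper name `pca_embed` is hypothetical.

```python
import numpy as np

tags = ["#1A", "#AUTHOR", "#IMPEACH", "#MAGA", "#NEVERTRUMP",
        "#QANON", "#RESIST", "#TRUMP2020", "#VOTEBLUE"]

# One row per hashtag, one column per user (same toy matrix as the slide).
one_hot = np.array([
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
], dtype=float)


def pca_embed(matrix, n_components=2):
    """Project each row onto the top principal components using SVD."""
    centered = matrix - matrix.mean(axis=0)  # center each user column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]  # principal component scores


embeddings = pca_embed(one_hot)
for tag, (x, y) in zip(tags, embeddings):
    print(f"{tag:<12} {x: .3f} {y: .3f}")
```

Hashtags with identical usage patterns (here #NEVERTRUMP and #VOTEBLUE, both used only by user 4822) land on the same point, which is the property the later plotting and clustering steps rely on.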
Step 4: Plotting the Embeddings
Reduced "embeddings":
tag | X | Y |
#1A | 3.1256592 | 1.3281181 |
#AUTHOR | 4.894096 | 3.5918093 |
#IMPEACH | 3.2395177 | 4.7176585 |
#MAGA | 2.6817877 | 1.94366 |
#NEVERTRUMP | 4.229362 | 3.7070966 |
#QANON | 2.0955057 | 3.2642653 |
#RESIST | 3.4142375 | 5.007076 |
#TRUMP2020 | 2.9583123 | 2.3489096 |
#VOTEBLUE | 4.33293 | 4.940848 |
Step 5: Clustering the Embeddings
Reduced "embeddings" with assigned clusters:
tag | X | Y | Cluster |
#1A | 3.1256592 | 1.3281181 | RED-1 |
#AUTHOR | 4.894096 | 3.5918093 | UNK |
#IMPEACH | 3.2395177 | 4.7176585 | BLUE-1 |
#MAGA | 2.6817877 | 1.94366 | RED-1 |
#NEVERTRUMP | 4.229362 | 3.7070966 | BLUE-2 |
#QANON | 2.0955057 | 3.2642653 | RED-2 |
#RESIST | 3.4142375 | 5.007076 | BLUE-1 |
#TRUMP2020 | 2.9583123 | 2.3489096 | RED-1 |
#VOTEBLUE | 4.33293 | 4.940848 | BLUE-2 |
Results / Demo!
Dimensionality Reduction Techniques
Dimensionality Reduction:
Principal Component Analysis (PCA)
Dimensionality Reduction:
t-distributed Stochastic Neighbor Embedding (t-SNE)
Dimensionality Reduction:
Uniform Manifold Approximation & Projection (UMAP)
UMAP Enhanced Clustering:
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
Clustering the embeddings helps us identify lesser-known terms, which can then be monitored.
Conclusions
Thank you!
Extra Slides / Appendix
Contact
Presentation by: Michael Rossetti
Contact: Email | LinkedIn | GitHub
Interests:
Data Collection
Impeachment Topics
Start Date | Term |
2019-12-12 | #FactsMatter |
2019-12-12 | #IGHearing |
2019-12-12 | #IGReport |
2019-12-12 | #ImpeachAndConvict |
2019-12-12 | #ImpeachAndConvictTrump |
2019-12-12 | #SenateHearing |
2019-12-12 | #TrumpImpeachment |
2019-12-12 | impeach |
2019-12-12 | impeached |
2019-12-12 | impeachment |
2019-12-12 | Trump to Pelosi |
2019-12-18 | #25thAmendmentNow |
2019-12-18 | #ImpeachAndRemove |
2019-12-18 | #ImpeachmentEve |
2019-12-18 | #ImpeachmentRally |
2019-12-18 | #NotAboveTheLaw |
2019-12-18 | #trumpletter |
2020-01-22 | #GOPCoverup |
2020-01-22 | #ShamTrial |
2020-02-06 | #AquittedForever |
2020-02-06 | #CountryOverParty |
2020-02-06 | #CoverUpGOP |
2020-02-06 | #MitchMcCoverup |
2020-02-06 | #MoscowMitch |
FYI - the term "rain" would also catch mentions of "#RainBows" (but not vice-versa)
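The note above can be illustrated with a tiny sketch. It assumes that a term matches a tweet whenever it appears as a case-insensitive substring of the text, which is what the note implies; the real matching rules of the streaming API may be stricter, and `term_matches` is a hypothetical helper.

```python
def term_matches(term, text):
    """Assumed rule: case-insensitive substring match."""
    return term.lower() in text.lower()


# The short term is contained in the longer hashtag...
print(term_matches("rain", "Chasing #RainBows today"))
# ...but the longer hashtag is not contained in a plain mention of "rain".
print(term_matches("#RainBows", "Heavy rain expected"))
```

Under this assumption, short generic terms cast a wider net than specific hashtags, which is worth keeping in mind when choosing collection terms.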
Data Collection
Impeachment Timeline
We collected tweets from December 12, 2019 to March 24, 2020.
This covers the full Senate trial, as well as the period leading up to it and the period after it.
Date | Event |
2019-10-28 | House approves resolution |
2019-11-13 | House Intelligence Committee Hearings |
2019-12-04 | House Judiciary Committee Hearings |
2019-12-12 | House Judiciary Committee Approves Articles |
2020-01-15 | House Sends Articles to Senate |
2020-01-15 | Senate Announces Rules |
2020-01-15 | Senate Approves Rules |
2020-01-23 | Opening Arguments |
2020-01-31 | Senate Blocks Witnesses |
2020-02-03 | Closing Remarks |
2020-02-05 | Senate Acquits |
Data Collection
Tweet Collection Timeline and Results
[Figure: tweet collection timeline, showing the development period and the primary (continuous) collection period]
In total, we collected 67.7M tweets authored by 3.6M unique users.
Data Analysis
Inclusion of Hashtags in User Profiles
"Top Tags" Limit | Distinct Hashtags | Distinct Users | Rows | Avg Hashtags Per User |
N/A | 258,622 | 451,698 | 1,288,844 | 2.85 |
100,000 | 100,000 | 413,107 | 1,130,222 | 2.74 |
50,000 | 50,000 | 396,724 | 1,070,454 | 2.70 |
10,000 | 10,000 | 358,549 | 919,697 | 2.57 |
5,000 | 5,000 | 339,183 | 845,728 | 2.49 |
1,000 | 1,000 | 286,740 | 656,976 | 2.29 |
250 | 250 | 236,988 | 506,119 | 2.14 |
100 | 100 | 204,423 | 419,540 | 2.05 |
75 | 75 | 196,717 | 394,703 | 2.01 |
50 | 50 | 183,516 | 360,424 | 1.96 |
25 | 25 | 165,458 | 297,095 | 1.80 |
10 | 10 | 139,761 | 216,441 | 1.55 |
Of the 3.6M users, 451K (12.5%) included at least one hashtag in their profile. They included 2.85 unique tags on average, for a total of 1.29M records.
We are most interested in the most frequently used hashtags, so we focus on the top 25 to 75 hashtags, which are used by fewer than 200K users.
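The "Top Tags" limit in the table above can be sketched as a filter over (user_id, tag) rows: keep only the tags used by the most users, then recount users and rows. This is a minimal standard-library illustration on toy data, with a hypothetical function name; it is not the query that produced the table.

```python
from collections import Counter

# Toy (user_id, tag) rows; the real input would be the profile hashtag records.
records = [
    (408, "#RESIST"), (408, "#IMPEACH"), (535, "#AUTHOR"), (1150, "#RESIST"),
    (1187, "#1A"), (3301, "#RESIST"), (4011, "#MAGA"), (4011, "#TRUMP2020"),
    (4822, "#VOTEBLUE"), (4822, "#NEVERTRUMP"), (4822, "#RESIST"),
    (6789, "#MAGA"), (6789, "#QANON"),
]


def top_tag_stats(rows, limit=None):
    """Keep rows whose tag is among the `limit` most widely used tags
    (all tags if limit is None) and summarize what remains."""
    rows = sorted(set(rows))  # count each (user, tag) pair once
    users_per_tag = Counter(tag for _, tag in rows)
    kept_tags = {tag for tag, _ in users_per_tag.most_common(limit)}
    kept = [(user, tag) for user, tag in rows if tag in kept_tags]
    users = {user for user, _ in kept}
    return {
        "distinct_hashtags": len(kept_tags),
        "distinct_users": len(users),
        "rows": len(kept),
        "avg_hashtags_per_user": len(kept) / len(users),
    }


print(top_tag_stats(records))           # no limit
print(top_tag_stats(records, limit=2))  # only the two most widely used tags
```

As in the table, tightening the limit drops both rows and users, because users whose only hashtags fall outside the top tags disappear from the data.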
Data Analysis
Top Hashtags found in User Profiles
Previous Analysis
Top Hashtags in Bot Profiles
Pro-Trump bots spread Q-Anon related content (via their User profiles)
Previous Analysis
Top Hashtags in Bot Tweets
Pro-Trump bots spread Q-Anon related content (via their Tweets)
Previous Sentiment Labeling Approach
Top Hashtags used by Anti-Trump Bot Cluster
Hashtag | Description |
#BIDEN2020 | |
#BLM | 'Black Lives Matter' |
#BLUEWAVE | |
#BLUEWAVE2020 | |
#DEMCAST | A left-leaning media outlet |
#FBR | 'Follow Back Resistance' |
#IMPEACH | |
#IMPEACHANDREMOVE | |
#IMPEACHTRUMP | |
#IMPEACHTRUMPNOW | |
#IMPOTUS | 'Impeached POTUS' |
#METOO | |
#NOTMYPRESIDENT | |
#RESIST | |
#RESISTANCE | |
#RESISTER | |
#THERESISTANCE | |
#VOTEBLUE | |
#VOTEBLUE2020 | |
#VOTEBLUENOMATTERWHO | |
#WTP2020 | 'We The People 2020' |
Previous Sentiment Labeling Approach
Top Hashtags used by Pro-Trump Bot Cluster
Hashtag | Description |
#1A | The First Amendment |
#2A | The Second Amendment |
#AMERICAFIRST | A Trump campaign slogan |
#BUILDKATESWALL | |
#BUILDTHEWALL | A Trump campaign slogan |
#CODEOFVETS | |
#CONSERVATIVE | |
#DEPLORABLE | Reference to a Hillary Clinton quote |
#DRAINTHESWAMP | A Trump campaign slogan |
#KAG | 'Keep America Great' |
#MAGA | 'Make America Great Again' |
#NRA | The National Rifle Association |
#PATRIOT | |
#POTUS45 | Refers to the 45th President (Trump) |
#QANON | Related to the Q-Anon conspiracy theory |
#THEGREATAWAKENING | Related to the Q-Anon conspiracy theory |
#TRUMP | |
#TRUMP2020 | |
#TRUMPTRAIN | |
#VETERAN | |
#WALKAWAY | |