1 of 42

Machine Learning for Disinformation Analysis

Hashtag Similarity Mapping via Dimensionality Reduction

2 of 42

Background

3 of 42

Background

The First Trump Impeachment

We previously collected 67.7M tweets about the First Trump Impeachment, authored by 3.6M users, and stored the tweets and user profiles in a Google BigQuery database for further analysis.

4 of 42

Previous Research

Bot Classification

Previous research goals:

  • estimate the likelihood that any given user is a "bot" (i.e. an automated account)
  • quantify the influence of the bots

"Bots, Disinformation, and the First Trump Impeachment"

- Michael Rossetti, Tauhid Zaman (https://arxiv.org/abs/2204.08915)

5 of 42

Previous Research

Sentiment Classification

Previous research goal: predict the degree to which any given tweet expresses either Pro-Trump or Anti-Trump sentiments

(Figure: sentiment score scale from 0 to 1, with 0.5 as the midpoint)

Explore results on our website

6 of 42

Problem

7 of 42

The Problem

Evidence of Disinformation Spread

8 of 42

The Problem

What is Q-Anon?

Q-Anon is a popular conspiracy theory / disinformation campaign. Q-Anon supporters use hashtags such as #QANON, #WWG1WGA, #GREATAWAKENING, etc. to communicate and organize. We saw Q-Anon supporters participate in the attack on the US Capitol on January 6, 2021. How can we prevent events like this in the future?

9 of 42

The Problem

Challenges facing Social Networks and Online Platforms

Challenges in Trust and Safety:

  1. How to conduct near real-time monitoring and moderation of disinformation and other prohibited content?
  2. How to keep up with malicious actors who evolve their tactics (e.g. using new hashtags over time)?
  3. How to reduce costs associated with this content moderation effort?

10 of 42

Related Work

Hashtag Co-occurrence Analysis

"A network-based approach to QAnon user dynamics and topic diversity during the COVID-19 infodemic"

- Wentao Xu, Kazutoshi Sasahara (https://arxiv.org/abs/2111.00537)

(Figures: hashtag co-occurrence network; hashtag semantic map)

11 of 42

Research Plan

Objectives and Constraints

Goals / Objectives:

  • Produce a two-dimensional mapping of the top hashtags from our Impeachment 2020 dataset
  • Visually compare the distances between hashtags on the map to determine which topics are closest (i.e. most similar) to Q-Anon related terms
  • Cluster the hashtags into more formal groups (STRETCH GOAL)

Assumptions / Constraints:

  • Use unsupervised machine learning methods only
  • Focus on the most frequently used hashtags only (because they are the most influential, and because we can only see so many points on a plot)

12 of 42

Research Plan

Our Contributions

Differentiating factors for this project:

  • Related work uses hashtags in tweets, but we use hashtags in user profiles because we believe they provide a stronger signal about a user's sentiments
  • Our dataset is unique, and focused on a domestic political event
  • Related work constructs and utilizes a hashtag co-occurrence network, while our approach is simpler and does not require this network graph compilation step
  • Related work focuses on hashtags co-occurring with "#QANON", but we focus on a broader range of top (i.e. most frequently used) hashtags, not just disinformation terms

13 of 42

Methods

14 of 42

Step 1: Data Collection

Twitter API Stream Listener

  • Software written in Python (open source)
  • Deployed to run continuously on a Heroku server
  • Fetches data from the Twitter API
  • Listens for any tweets mentioning Impeachment-related terms (see appendix)
  • Stores tweet and user profile data in a Google BigQuery database

15 of 42

Step 2: One Hot Encoding

BigQuery database records:

user_id  tag
408      #RESIST
408      #IMPEACH
535      #AUTHOR
1150     #RESIST
1187     #1A
3301     #RESIST
4011     #MAGA
4011     #TRUMP2020
4822     #VOTEBLUE
4822     #NEVERTRUMP
4822     #RESIST
6789     #MAGA
6789     #QANON

one-hot encodings:

tag           408   535  1150  1187  3301  4011  4822  6789
#1A             0     0     0     1     0     0     0     0
#AUTHOR         0     1     0     0     0     0     0     0
#IMPEACH        1     0     0     0     0     0     0     0
#MAGA           0     0     0     0     0     1     0     1
#NEVERTRUMP     0     0     0     0     0     0     1     0
#QANON          0     0     0     0     0     0     0     1
#RESIST         1     0     1     0     1     0     1     0
#TRUMP2020      0     0     0     0     0     1     0     0
#VOTEBLUE       0     0     0     0     0     0     1     0
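The encoding step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the function name `one_hot_encode` is hypothetical, and the records are the example rows from this slide rather than real query results.

```python
# Example (user_id, tag) records copied from the slide; not real query output.
records = [
    (408, "#RESIST"), (408, "#IMPEACH"), (535, "#AUTHOR"), (1150, "#RESIST"),
    (1187, "#1A"), (3301, "#RESIST"), (4011, "#MAGA"), (4011, "#TRUMP2020"),
    (4822, "#VOTEBLUE"), (4822, "#NEVERTRUMP"), (4822, "#RESIST"),
    (6789, "#MAGA"), (6789, "#QANON"),
]


def one_hot_encode(records):
    """Return (tags, user_ids, matrix) with one row per tag and one column per user."""
    user_ids = sorted({user_id for user_id, _ in records})
    tags = sorted({tag for _, tag in records})
    column = {user_id: j for j, user_id in enumerate(user_ids)}
    row = {tag: i for i, tag in enumerate(tags)}
    matrix = [[0] * len(user_ids) for _ in tags]
    for user_id, tag in records:
        # Mark that this user has this tag in their profile.
        matrix[row[tag]][column[user_id]] = 1
    return tags, user_ids, matrix


tags, user_ids, matrix = one_hot_encode(records)
for tag, values in zip(tags, matrix):
    print(f"{tag:<12}", *values)
```

Each row is a hashtag described by which users include it, so two hashtags used by the same users end up with similar rows. At the scale of the full dataset a sparse matrix representation would likely be needed; that is an assumption, not something the slides state.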

16 of 42

Step 3: Dimensionality Reduction

one-hot encodings:

tag           408   535  1150  1187  3301  4011  4822  6789
#1A             0     0     0     1     0     0     0     0
#AUTHOR         0     1     0     0     0     0     0     0
#IMPEACH        1     0     0     0     0     0     0     0
#MAGA           0     0     0     0     0     1     0     1
#NEVERTRUMP     0     0     0     0     0     0     1     0
#QANON          0     0     0     0     0     0     0     1
#RESIST         1     0     1     0     1     0     1     0
#TRUMP2020      0     0     0     0     0     1     0     0
#VOTEBLUE       0     0     0     0     0     0     1     0

reduced "embeddings":

tag          X          Y
#1A          3.1256592  1.3281181
#AUTHOR      4.894096   3.5918093
#IMPEACH     3.2395177  4.7176585
#MAGA        2.6817877  1.94366
#NEVERTRUMP  4.229362   3.7070966
#QANON       2.0955057  3.2642653
#RESIST      3.4142375  5.007076
#TRUMP2020   2.9583123  2.3489096
#VOTEBLUE    4.33293    4.940848
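To make the reduction step concrete, here is a minimal sketch that projects the example one-hot matrix onto two dimensions using PCA, computed with NumPy's singular value decomposition. The helper name `pca_2d` is hypothetical. The coordinates it prints will not match the X and Y values shown on the slide, which were produced by a different (unstated) method and run.

```python
import numpy as np

# Example one-hot matrix from the slide: rows are hashtags, columns are users.
tags = ["#1A", "#AUTHOR", "#IMPEACH", "#MAGA", "#NEVERTRUMP",
        "#QANON", "#RESIST", "#TRUMP2020", "#VOTEBLUE"]
one_hot = np.array([
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
], dtype=float)


def pca_2d(matrix):
    """Project the rows of `matrix` onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)  # center each column (user)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]  # scores on the two strongest components


embeddings = pca_2d(one_hot)
for tag, (x, y) in zip(tags, embeddings):
    print(f"{tag:<12} {x: .4f} {y: .4f}")
```

PCA is used here only because it needs nothing beyond NumPy; t-SNE or UMAP would slot into the same place, taking the one-hot matrix in and returning two coordinates per hashtag.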

17 of 42

Step 4: Plotting the Embeddings

reduced "embeddings":

tag          X          Y
#1A          3.1256592  1.3281181
#AUTHOR      4.894096   3.5918093
#IMPEACH     3.2395177  4.7176585
#MAGA        2.6817877  1.94366
#NEVERTRUMP  4.229362   3.7070966
#QANON       2.0955057  3.2642653
#RESIST      3.4142375  5.007076
#TRUMP2020   2.9583123  2.3489096
#VOTEBLUE    4.33293    4.940848
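Besides reading distances off a plot, the same comparison can be done numerically. The sketch below ranks hashtags by Euclidean distance from #QANON, using the example coordinates from this slide. The helper name `nearest_tags` is hypothetical, and the values are illustrative rather than results from the full dataset.

```python
import math

# Example 2D embeddings copied from the slide.
embeddings = {
    "#1A": (3.1256592, 1.3281181),
    "#AUTHOR": (4.894096, 3.5918093),
    "#IMPEACH": (3.2395177, 4.7176585),
    "#MAGA": (2.6817877, 1.94366),
    "#NEVERTRUMP": (4.229362, 3.7070966),
    "#QANON": (2.0955057, 3.2642653),
    "#RESIST": (3.4142375, 5.007076),
    "#TRUMP2020": (2.9583123, 2.3489096),
    "#VOTEBLUE": (4.33293, 4.940848),
}


def nearest_tags(embeddings, target, k=3):
    """Return the k tags closest to `target` as (tag, Euclidean distance) pairs."""
    origin = embeddings[target]
    distances = [
        (tag, math.dist(origin, point))
        for tag, point in embeddings.items()
        if tag != target
    ]
    return sorted(distances, key=lambda pair: pair[1])[:k]


for tag, distance in nearest_tags(embeddings, "#QANON"):
    print(f"{tag:<12} {distance:.3f}")
```

With these example coordinates, the two closest hashtags to #QANON are #TRUMP2020 and #MAGA.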

18 of 42

Step 5: Clustering the Embeddings

reduced "embeddings":

tag          X          Y          Cluster
#1A          3.1256592  1.3281181  RED-1
#AUTHOR      4.894096   3.5918093  UNK
#IMPEACH     3.2395177  4.7176585  BLUE-1
#MAGA        2.6817877  1.94366    RED-1
#NEVERTRUMP  4.229362   3.7070966  BLUE-2
#QANON       2.0955057  3.2642653  RED-2
#RESIST      3.4142375  5.007076   BLUE-1
#TRUMP2020   2.9583123  2.3489096  RED-1
#VOTEBLUE    4.33293    4.940848   BLUE-2
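As a rough illustration of density-based clustering on 2D embeddings, the sketch below implements plain DBSCAN in pure Python. This is a simpler stand-in for the HDBSCAN method named later in the deck, not that algorithm itself. The `eps` and `min_samples` values are assumptions chosen for this toy example, and the labels it produces are not expected to match the illustrative Cluster column above.

```python
import math

# Example 2D embeddings copied from the slide.
embeddings = {
    "#1A": (3.1256592, 1.3281181),
    "#AUTHOR": (4.894096, 3.5918093),
    "#IMPEACH": (3.2395177, 4.7176585),
    "#MAGA": (2.6817877, 1.94366),
    "#NEVERTRUMP": (4.229362, 3.7070966),
    "#QANON": (2.0955057, 3.2642653),
    "#RESIST": (3.4142375, 5.007076),
    "#TRUMP2020": (2.9583123, 2.3489096),
    "#VOTEBLUE": (4.33293, 4.940848),
}


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    names = list(points)
    # Each point counts as its own neighbor (distance 0 <= eps).
    neighbors = {
        a: [b for b in names if math.dist(points[a], points[b]) <= eps]
        for a in names
    }
    labels = {}
    cluster_id = 0
    for name in names:
        if name in labels or len(neighbors[name]) < min_samples:
            continue  # already assigned, or not a core point
        labels[name] = cluster_id
        frontier = list(neighbors[name])
        while frontier:
            other = frontier.pop()
            if other in labels:
                continue
            labels[other] = cluster_id
            if len(neighbors[other]) >= min_samples:
                frontier.extend(neighbors[other])  # core points keep expanding
        cluster_id += 1
    # Anything never reached from a core point is noise.
    return {name: labels.get(name, -1) for name in names}


labels = dbscan(embeddings, eps=0.8, min_samples=2)  # assumed parameters
for tag, label in labels.items():
    print(f"{tag:<12} {label}")
```

With these assumed parameters, #1A, #MAGA and #TRUMP2020 share a cluster, #IMPEACH and #RESIST share another, and #QANON is left as noise. HDBSCAN differs in that it does not need a fixed `eps` and can find clusters of varying density.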

19 of 42

Results / Demo!

20 of 42

Dimensionality Reduction Techniques

  • Principal Component Analysis (PCA)
    • Benefits: interpretable / explainable
    • Drawbacks: affected by outliers
  • t-distributed Stochastic Neighbor Embedding (t-SNE)
    • Benefits: more robust to outliers
    • Drawbacks: less interpretable / explainable
  • Uniform Manifold Approximation & Projection (UMAP)
    • Benefits: fast, scalable, may perform better on complex data
    • Drawbacks: less interpretable / explainable
  • etc.

21 of 42

Dimensionality Reduction:

Principal Component Analysis (PCA)

22 of 42

Dimensionality Reduction:

t-distributed Stochastic Neighbor Embedding (t-SNE)

23 of 42

Dimensionality Reduction:

Uniform Manifold Approximation & Projection (UMAP)

24 of 42

Dimensionality Reduction:

Uniform Manifold Approximation & Projection (UMAP)

25 of 42

Dimensionality Reduction:

Uniform Manifold Approximation & Projection (UMAP)

26 of 42

Dimensionality Reduction:

Uniform Manifold Approximation & Projection (UMAP)

Why stop in two dimensions when we can try three?

Let's explore via live demo!

27 of 42

UMAP Enhanced Clustering:

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)

Clustering of the embeddings helps us identify lesser known terms

… which can now be monitored

28 of 42

UMAP Enhanced Clustering:

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)

29 of 42

Conclusions

  • These methods can be used by social network trust and safety teams, and applied to any online discussion where hashtags are used
  • Dimensionality reduction, when applied to a matrix of hashtags and one-hot encoded user identifiers, can reveal which terms are related to each other, including which terms are related to known disinformation terms
  • These methods, due to their unsupervised nature, can surface new topics as they emerge over time, before human annotations are available
  • The unsupervised approach also saves time and costs related to human labeling

30 of 42

Thank you!

31 of 42

Extra Slides / Appendix

32 of 42

Contact

Presentation by: Michael Rossetti

Contact: Email | LinkedIn | GitHub

Interests:

  • Machine Learning (AI/ML)
  • Natural Language Processing (NLP)
  • Sentiment Analysis
  • Social and Information Networks

33 of 42

Data Collection

Impeachment Topics

Start Date   Term
2019-12-12   #FactsMatter
             #IGHearing
             #IGReport
             #ImpeachAndConvict
             #ImpeachAndConvictTrump
             #SenateHearing
             #TrumpImpeachment
             impeach
             impeached
             impeachment
             Trump to Pelosi
2019-12-18   #25thAmendmentNow
             #ImpeachAndRemove
             #ImpeachmentEve
             #ImpeachmentRally
             #NotAboveTheLaw
             #trumpletter
2020-01-22   #GOPCoverup
             #ShamTrial
2020-02-06   #AquittedForever
             #CountryOverParty
             #CoverUpGOP
             #MitchMcCoverup
             #MoscowMitch

FYI - the term "rain" would also catch mentions of "#RainBows" (but not vice-versa)

34 of 42

Data Collection

Impeachment Timeline

We collected tweets from December 12, 2019 to March 24, 2020.

This covers the full Senate trial, as well as the weeks leading up to it and the weeks that followed.

Date         Event
2019-10-28   House approves resolution
2019-11-13   House Intelligence Committee Hearings
2019-12-04   House Judiciary Committee Hearings
2019-12-12   House Judiciary Committee Approves Articles
2020-01-15   House Sends Articles to Senate
2020-01-15   Senate Announces Rules
2020-01-15   Senate Approves Rules
2020-01-23   Opening Arguments
2020-01-31   Senate Blocks Witnesses
2020-02-03   Closing Remarks
2020-02-05   Senate Acquits

35 of 42

Data Collection

Tweet Collection Timeline and Results

Primary Collection Period (Continuous Collection)

Development

In total, we collected 67.7M tweets authored by 3.6M unique users.

36 of 42

Data Analysis

Inclusion of Hashtags in User Profiles

"Top Tags" Limit  Distinct Hashtags  Distinct Users  Rows       Avg Hashtags Per User
N/A               258,622            451,698         1,288,844  2.85
100,000           100,000            413,107         1,130,222  2.74
50,000            50,000             396,724         1,070,454  2.70
10,000            10,000             358,549         919,697    2.57
5,000             5,000              339,183         845,728    2.49
1,000             1,000              286,740         656,976    2.29
250               250                236,988         506,119    2.14
100               100                204,423         419,540    2.05
75                75                 196,717         394,703    2.01
50                50                 183,516         360,424    1.96
25                25                 165,458         297,095    1.80
10                10                 139,761         216,441    1.55

Of the 3.6M users, 451K users (12.5%) included at least one hashtag in their profile. They included 2.85 unique tags on average, for a total of 1.29M records.

We are most interested in the most frequently used hashtags, so we will focus on the top 25-75 hashtag range, where fewer than 200K users use those top tags.
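The columns of the table above can be reproduced from (user_id, tag) records with a short aggregation. This is a minimal sketch on the toy records from the Step 2 slide, assuming each (user_id, tag) pair appears at most once. The function name `summarize_top_tags` and the dictionary keys are hypothetical.

```python
from collections import Counter

# Toy (user_id, tag) records from the Step 2 example; one row per unique pair.
records = [
    (408, "#RESIST"), (408, "#IMPEACH"), (535, "#AUTHOR"), (1150, "#RESIST"),
    (1187, "#1A"), (3301, "#RESIST"), (4011, "#MAGA"), (4011, "#TRUMP2020"),
    (4822, "#VOTEBLUE"), (4822, "#NEVERTRUMP"), (4822, "#RESIST"),
    (6789, "#MAGA"), (6789, "#QANON"),
]


def summarize_top_tags(records, limit=None):
    """Keep only the `limit` most frequent tags, then summarize what remains."""
    counts = Counter(tag for _, tag in records)
    # most_common(None) returns every tag, which mirrors the "N/A" row.
    top_tags = {tag for tag, _ in counts.most_common(limit)}
    kept = [(user_id, tag) for user_id, tag in records if tag in top_tags]
    users = {user_id for user_id, _ in kept}
    return {
        "distinct_hashtags": len(top_tags),
        "distinct_users": len(users),
        "rows": len(kept),
        "avg_hashtags_per_user": len(kept) / len(users),
    }


print(summarize_top_tags(records))           # no limit
print(summarize_top_tags(records, limit=2))  # top 2 tags only
```

The average is simply rows divided by distinct users, which is consistent with the first table row: 1,288,844 / 451,698 is approximately 2.85.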

37 of 42

Data Analysis

Inclusion of Hashtags in User Profiles

38 of 42

Data Analysis

Top Hashtags found in User Profiles

39 of 42

Previous Analysis

Top Hashtags in Bot Profiles

Pro-Trump bots spread Q-Anon related content (via their User profiles)

40 of 42

Previous Analysis

Top Hashtags in Bot Tweets

Pro-Trump bots spread Q-Anon related content (via their Tweets)

41 of 42

Previous Sentiment Labeling Approach

Top Hashtags used by Anti-Trump Bot Cluster

Hashtag                Description
#BIDEN2020
#BLM                   'Black Lives Matter'
#BLUEWAVE
#BLUEWAVE2020
#DEMCAST               A left-leaning media outlet
#FBR                   'Follow Back Resistance'
#IMPEACH
#IMPEACHANDREMOVE
#IMPEACHTRUMP
#IMPEACHTRUMPNOW
#IMPOTUS               'Impeached POTUS'
#METOO
#NOTMYPRESIDENT
#RESIST
#RESISTANCE
#RESISTER
#THERESISTANCE
#VOTEBLUE
#VOTEBLUE2020
#VOTEBLUENOMATTERWHO
#WTP2020               'We The People 2020'

42 of 42

Previous Sentiment Labeling Approach

Top Hashtags used by Pro-Trump Bot Cluster

Hashtag                Description
#1A                    The First Amendment
#2A                    The Second Amendment
#AMERICAFIRST          A Trump campaign slogan
#BUILDKATESWALL
#BUILDTHEWALL          A Trump campaign slogan
#CODEOFVETS
#CONSERVATIVE
#DEPLORABLE            Hillary Clinton quote
#DRAINTHESWAMP         A Trump campaign slogan
#KAG                   'Keep America Great'
#MAGA                  'Make America Great Again'
#NRA                   The National Rifle Association
#PATRIOT
#POTUS45               Refers to 45th President (Trump)
#QANON                 Related to Q-Anon conspiracy theory
#THEGREATAWAKENING     Related to Q-Anon conspiracy theory
#TRUMP
#TRUMP2020
#TRUMPTRAIN
#VETERAN
#WALKAWAY