1 of 21

Crowdsourced Fact-Checking at Twitter:

How Does the Crowd Compare With Experts?

Mohammed Saeed

Maelle Nicolas

Paolo Papotti

Nicolas Traub

Gianluca Demartini

SIGIR Student Travel Grant

2 of 21

Expert Fact-Checker

Claim is False!

3 of 21

Expert Fact-Checker

Machine-based Algorithm

Crowd of Non-Experts

BirdWatch

4 of 21

Research Questions

How are check-worthy claims selected by Birdwatch users?

Can the crowd identify check-worthy claims before experts do?

Are crowd workers able to reliably assess the veracity of a tweet?

What sources of information are used to support a fact-checking decision in Birdwatch and how reliable are they?

Fact-checking pipeline: Claim Selection → Evidence Retrieval → Claim Verification

5 of 21

BirdWatch Tour

Noter #1

Noter #2

Noter #3

Rater #1

Rater #2

A user writes a tweet.

BirdWatch users provide a note on the tweet.

Other BirdWatch users rate the notes.

A final verdict is associated with the tweet’s truthfulness.
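
The four steps above can be sketched as a small data model. This is an illustrative sketch only: the class names, the field names, and the rule that weights each note's label by its number of helpful ratings are assumptions made for the example, not Birdwatch's actual schema or aggregation algorithm.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

# Label values as they appear in the Birdwatch note form.
MISLEADING = "MISINFORMED_OR_POTENTIALLY_MISLEADING"
NOT_MISLEADING = "NOT_MISLEADING"


@dataclass
class Rating:
    rater_id: str
    helpful: bool  # hypothetical simplification of the rating form


@dataclass
class Note:
    noter_id: str
    label: str  # MISLEADING or NOT_MISLEADING
    ratings: List[Rating] = field(default_factory=list)


@dataclass
class Tweet:
    tweet_id: str
    notes: List[Note] = field(default_factory=list)


def verdict(tweet: Tweet) -> Optional[str]:
    """Assumed rule: the label backed by the most helpful ratings wins.

    Returns None when there are no helpful ratings or the top two labels tie.
    """
    weights = Counter()
    for note in tweet.notes:
        weights[note.label] += sum(1 for r in note.ratings if r.helpful)
    ranked = weights.most_common(2)
    if not ranked or ranked[0][1] == 0:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


example = Tweet(
    "t1",
    [
        Note("noter1", MISLEADING, [Rating("rater1", True), Rating("rater2", True)]),
        Note("noter2", NOT_MISLEADING, [Rating("rater1", True), Rating("rater2", False)]),
    ],
)
print(verdict(example))
```

Returning `None` on ties keeps the sketch honest about cases where the crowd does not reach a decision.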

6 of 21

Example (1/2)

Note #1

Potentially Misleading Dec 17

According to numerous independent sources, Trump lost the election. Politifact, 1/6/21: https://www.politifact.com/factchecks/2021/jan/07/donald-trump/trump-clings-fantasy-landslide-victory-egging-supp/ ”All 50 states and the District of Columbia have certified their election results, which Congress sought to finalize Jan. 6? There is no evidence that voter fraud affected that outcome.”

Given current evidence, I believe this tweet is:

NOT_MISLEADING

MISINFORMED_OR_POTENTIALLY_MISLEADING

I believe this tweet contains a digitally altered photo or video.

Did you link to sources you believe most people would consider trustworthy?

7 of 21

Example (2/2)

Rating #1 Dec 17

Do you agree with this note’s conclusion?

Is this note helpful?

HELPFUL

SOMEWHAT_HELPFUL

NOT_HELPFUL

Does this note cite high-quality sources?

Does the note directly address the tweet’s claim?

Is the note hard to understand?

Does the note contain spam, harassment, or abuse?

Does this note miss key points?

Fact-Check Nov 07

Claim: Donald Trump won the 2020 election, by a lot.

Verdict: Not Credible

Fact Checker: Lead Stories

Country: United States

Link: https://leadstories.com/hoax-alert/2020/11/fact-check-donald-j-trump-on-twitter-I-won-by-a-lot.html

8 of 21

Datasets

Two datasets:

BirdWatch data, containing notes and ratings for ~12K tweets
ClaimReview data, containing fact-checks for ~77K claims

Tweets and fact-checks are matched using mTurk.

We obtain 2208 tweets matched with ClaimReview fact-checks.

MisinfoMe: Who is Interacting with Misinformation? (ISWC 2019)

9 of 21

How are check-worthy claims selected by Birdwatch users?

Topic-wise analysis as a proxy for claim check-worthiness (BERTopic)

10 of 21

Can the crowd identify check-worthy claims before experts do?

We analyze tweets, Birdwatch notes, and ClaimReview fact-checks over time.

In the majority of cases, users spread false claims after those claims had already been fact-checked.

For 129/2208 tweets, Birdwatch users respond on average 10x faster than an expert.

We analyze tweets (T), Birdwatch notes (B), and ClaimReview fact-checks (C) over time. As a note can only occur after a tweet, there are three possible configurations:

(i) Tweet, then Birdwatch note, then ClaimReview fact-check (TBC)
(ii) Tweet, then ClaimReview fact-check, then Birdwatch note (TCB)
(iii) ClaimReview fact-check, then tweet, then Birdwatch note (CTB)

TBC:

There are 129/2208 tweets in our matched data for this case. For all of them, Birdwatch users respond much faster than experts; on average, a Birdwatch user responds 10x faster than an expert.

TCB:

In our dataset, a ClaimReview fact-check rarely occurs after a tweet and before a Birdwatch note. We observe faster responses from ClaimReview than from Birdwatch users for 26/2208 tweets.

CTB:

The majority of the matched tweets follow this pattern, most of them related to US politics and COVID-19. Several users tend to spread false claims even after they have been fact-checked, specifically claims about Trump winning the election.
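
The three configurations can be derived mechanically from the three timestamps of a matched tweet. The sketch below is a minimal illustration: the function names and the example timestamps are made up for the demonstration, and the speed-up is assumed to be the ratio of the two delays measured from the tweet.

```python
from datetime import datetime


def ordering(tweet_at: datetime, note_at: datetime, check_at: datetime) -> str:
    """Return 'TBC', 'TCB', or 'CTB' for one matched tweet."""
    if note_at < tweet_at:
        raise ValueError("a Birdwatch note cannot precede its tweet")
    if check_at < tweet_at:
        return "CTB"  # the claim was fact-checked before the tweet was posted
    return "TBC" if note_at <= check_at else "TCB"


def speedup(tweet_at: datetime, note_at: datetime, check_at: datetime) -> float:
    """Ratio of the fact-check delay to the note delay (meaningful for TBC)."""
    note_delay = (note_at - tweet_at).total_seconds()
    check_delay = (check_at - tweet_at).total_seconds()
    if note_delay <= 0:
        raise ValueError("note delay must be positive")
    return check_delay / note_delay


# Illustrative timestamps only: note after 12 hours, fact-check after 5 days.
tweet = datetime(2021, 1, 1, 0, 0)
note = datetime(2021, 1, 1, 12, 0)
check = datetime(2021, 1, 6, 0, 0)

print(ordering(tweet, note, check), speedup(tweet, note, check))
```

With these example values the note delay is one tenth of the fact-check delay, which is what a 10x speed-up means under the assumed definition.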

11 of 21

What sources of information are used to support a fact-checking decision in Birdwatch and how reliable are they?

	BirdWatch	ClaimReview
# Domain Names	2014	73
Examples	FoxNews, Breitbart	PolitiFact, CDC

To assess the quality of web sources, we rely on an external tool (NewsGuard)
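
Counting distinct domain names, as in the table above, only needs the host part of each cited URL. The following sketch uses the two URLs that appear in the example slides; the helper names are hypothetical, and stripping a leading `www.` is an assumed normalization. The NewsGuard lookup itself is not shown because its interface is not described here.

```python
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def distinct_domains(urls):
    """Set of non-empty domains cited across a collection of URLs."""
    return {d for d in map(domain_of, urls) if d}


cited = [
    "https://www.politifact.com/factchecks/2021/jan/07/donald-trump/"
    "trump-clings-fantasy-landslide-victory-egging-supp/",
    "https://leadstories.com/hoax-alert/2020/11/"
    "fact-check-donald-j-trump-on-twitter-I-won-by-a-lot.html",
]
print(sorted(distinct_domains(cited)))
```

Each resulting domain could then be looked up in a source-quality rating to obtain a per-domain score.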

12 of 21

Are crowd workers able to reliably assess the veracity of a tweet? (1/3)

External Agreement:

The majority of ClaimReview labels match the Birdwatch ones.

Reasons for mismatches are covered on the next slides.

13 of 21

Are crowd workers able to reliably assess the veracity of a tweet? (2/3)

Among the 209 notes labeled as credible by the ClaimReview fact-checks and misinformed by the Birdwatch participants, the most common cause is a text with multiple claims, i.e., multiple facts are reported in a tweet and the fact-checked claims differ (ID #3).
In other cases, tweets are mistakenly labelled as misinformed, e.g., because a joke is taken seriously by a Twitter user (ID #5).
Finally, assuming correct ClaimReview labels, we believe in some cases the mismatch is due to biased Birdwatch users. For the tweets labeled as not credible by ClaimReview fact-checks and not misleading by Birdwatch notes, we observe cases where a Birdwatch note is the negated version of the ClaimReview fact-check (ID #2), thus producing opposite labels.
Labels also mismatch in some cases even though the Birdwatch user provides evidence from a link that has a high journalistic score (0.875).

14 of 21

Are crowd workers and computational methods able to reliably assess the veracity of a tweet? (3/3)

Method	Matched Claims
ClaimBuster	118/2208
E-BART	369/2208
BirdWatch	1492/2208

ClaimBuster: the first-ever end-to-end fact-checking system (VLDB 2017)

E-BART: Jointly Predicting and Explaining Truthfulness (TTO 2021)
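
The counts in the table are easier to compare as shares of the 2208 matched tweets. The short sketch below only redoes that arithmetic; the dictionary and function names are chosen for the example.

```python
TOTAL_MATCHED_TWEETS = 2208

# Counts copied from the "Matched Claims" column of the table.
matched_claims = {"ClaimBuster": 118, "E-BART": 369, "BirdWatch": 1492}


def match_rate(count: int, total: int = TOTAL_MATCHED_TWEETS) -> float:
    """Share of matched tweets for which the method's label matches."""
    return count / total


for method, count in matched_claims.items():
    print(f"{method}: {count}/{TOTAL_MATCHED_TWEETS} = {match_rate(count):.1%}")
# ClaimBuster: 118/2208 = 5.3%
# E-BART: 369/2208 = 16.7%
# BirdWatch: 1492/2208 = 67.6%
```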

15 of 21

Key Takeaways

Correlation in claim selection decisions

Crowd is effective in identifying tweets with pre-debunked misleading claims

Small set of high-quality sources for experts, unlike Birdwatch participants

Birdwatch users show high enough levels of agreement to reach decisions in the vast majority of cases.

Birdwatch users and ClaimReview experts show correlation in claim selection decisions w.r.t. major news and events, but with important differences due to the circulation of claims that have already been debunked by experts.
The crowd also seems effective in identifying tweets with misleading claims even before they get fact-checked by an expert. Both popular and non-popular tweets get verified by Birdwatch users. Computing the check-worthiness of a tweet with current off-the-shelf APIs does not lead to effective results.
Expert fact-checkers rely on a relatively small set of high-quality sources to verify claims, while Birdwatch participants provide a variety of sources that seem to be neglected by fact-checkers. While most of these sources are evaluated as credible (by journalists) and useful (by the Birdwatch crowd), malicious users might game the algorithm and effectively label notes as unhelpful according to their ideology.
The Birdwatch crowd focuses mostly on misleading tweets and shows high agreement with expert fact-checkers in terms of classification label. Computational methods have room for improvement in automatically verifying tweets.
Birdwatch is deployed only in the US.

16 of 21

Thanks!

Any questions?

You can find me at

@MhmdSaeedms
saeedm@eurecom.fr

Some BirdWatch Notes

Pineapple does not belong on pizza.
I love sushi
Hello bird!
Love this program!!

https://github.com/MhmdSaiid/

BirdWatch

17 of 21

Back-up

18 of 21

How are check-worthy claims selected by Birdwatch users?

The ClaimBuster API provides a check-worthiness score for a claim (between 0 and 1)

We run the API on BirdWatch tweets and ClaimReview fact-checks

Low median score of 0.4

Detecting Check-worthy Factual Claims by ClaimBuster (KDD 2017)

19 of 21

What sources of information are used to support a fact-checking decision in Birdwatch and how reliable are they? (3/3)

	BirdWatch	ClaimReview
Median	1.0	1.0
Minimum Value	0.495	0.875

20 of 21

Are crowd workers able to reliably assess the veracity of a tweet? (1/4)

Internal Agreement:

Standard metrics fail due to the high sparsity of the data and the large number of missing values.

We use the variance as a metric for agreement.

We ask whether Birdwatch participants provide accurate judgements. We first compare agreement (i) among themselves and then (ii) with ClaimReview expert fact-checkers. We then analyze different scoring functions for note aggregation, and finally report results for computational methods.
The standard metrics considered are Krippendorff’s alpha and Fleiss’ kappa.
Lower variance means stronger agreement; zero variance means that all Birdwatch participants agree on the classification label.
We see that most tweets have two notes, and the majority of users perfectly agree on the final classification label. The same applies to tweets with more notes, where most of the notes agree on the final label; conflicts occur on some tweets, but only a small subset shows full disagreement. A topic analysis of tweets shows that 48.3% of tweets with full disagreement are related to either politics or COVID-19.
We use the participants’ classification labels to see whether the tweet is classified as misinformed or not.
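
Variance as an agreement measure can be illustrated on binary labels. The sketch below assumes labels are encoded as 1 (misinformed) and 0 (not misleading) and uses the population variance; whether the original analysis used the population or the sample variance is not stated, so both choices are assumptions, as are the function names and the toy data.

```python
from collections import defaultdict
from statistics import pvariance


def agreement_by_tweet(notes):
    """Map each tweet id to the variance of its notes' binary labels.

    `notes` is an iterable of (tweet_id, label) pairs with label in {0, 1}.
    A variance of 0 means every note agrees; 0.25 is an even split.
    """
    labels = defaultdict(list)
    for tweet_id, label in notes:
        labels[tweet_id].append(label)
    return {tweet_id: pvariance(values) for tweet_id, values in labels.items()}


# Toy data: tweet "a" has two agreeing notes, tweet "b" has two conflicting ones.
toy_notes = [("a", 1), ("a", 1), ("b", 1), ("b", 0)]
print(agreement_by_tweet(toy_notes))
```

Unlike Krippendorff’s alpha or Fleiss’ kappa, this per-tweet measure needs no rater-by-item matrix, so missing ratings do not have to be imputed.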

21 of 21

Is their assessment always considered helpful by others?

Note Helpfulness Score

A note helpfulness score is computed for each note.

533/2208 pass the threshold, with 333 notes labeling the tweets in agreement with ClaimReview fact-checks.

About 95% of notes label the tweets as misleading, thus indicating that Birdwatch users tend to rate misleading tweets more than non-misleading ones, in agreement with previous work.
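
A helpfulness filter of this kind can be sketched as follows. The score definition (fraction of ratings marked helpful), the threshold, and the minimum number of ratings are all assumed values for illustration; the slides do not state the actual formula or cut-off. The last lines only redo the arithmetic on the reported counts.

```python
def helpfulness_score(ratings):
    """Fraction of ratings that mark the note helpful; None if unrated."""
    if not ratings:
        return None
    return sum(1 for helpful in ratings if helpful) / len(ratings)


def passes_threshold(ratings, threshold=0.8, min_ratings=3):
    """Assumed rule: enough ratings and a high enough helpful fraction."""
    score = helpfulness_score(ratings)
    return score is not None and len(ratings) >= min_ratings and score >= threshold


print(passes_threshold([True, True, True, True, True]))  # five helpful ratings
print(passes_threshold([True, False, False]))             # mostly unhelpful
print(passes_threshold([True, True]))                     # too few ratings

# Arithmetic on the reported counts.
print(f"{533 / 2208:.1%} pass the threshold")               # 24.1%
print(f"{333 / 533:.1%} of those agree with ClaimReview")   # 62.5%
```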