Technically difficult to auto-tag without a human element
Business value
Objective
Specify proxy factors that can identify mainstream, fake, conspiracy, and satire sources using multiclass logistic regression
Assumptions
High-integrity source classifications
Proxies for news types exist
A real-world model can be built on a skewed data set
Linear model behavior
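The objective and the linear-model assumption can be sketched with scikit-learn's multinomial logistic regression. The data here is a random stand-in, not the project's data; it only shows the model form.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in data: 200 posts x 5 features, 4 classes
# (mainstream=0, fake=1, conspiracy=2, satire=3).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 4, size=200)

# Multinomial (softmax) logistic regression: a linear score per class,
# normalized into class probabilities.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

probs = clf.predict_proba(X[:3])  # one probability per class; rows sum to 1
print(probs.shape)
```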
4 of 20
Data Sources
Training period:
Spans 73 days before and after the U.S. presidential election
Source (129 Facebook Pages Total):
Posts from 13 mainstream sources identified in a 2014 Pew Research study
Posts from 41 fake, 38 conspiracy, and 37 satire sources labeled by Dr. Melissa Zimdars of Merrimack College (many cross-checked against output from Daniel Sieradski’s B.S. Detector tool)
5 of 20
Predictors
505 selected 1- and 2-word n-grams from the post message
Each engagement type’s proportion of the total engagement count (comments, shares, likes, loves, wows, hahas, sads, and angrys)
Time elements: hour, weekday, and timing relative to the election
Media attachment type
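A sketch of how these predictors could be assembled into a feature matrix. The column names and values are hypothetical, not the project's actual schema; the point is the engagement-proportion and one-hot encoding steps.

```python
import pandas as pd

# Hypothetical post-level data (assumed column names, not the real schema).
posts = pd.DataFrame({
    "likes":  [120, 10, 5],
    "shares": [30, 2, 0],
    "angrys": [50, 0, 5],
    "hour":   [9, 22, 15],
    "media":  ["link", "photo", "video"],
})

reactions = ["likes", "shares", "angrys"]
total = posts[reactions].sum(axis=1)

# Engagement-type proportions of the total engagement count.
props = posts[reactions].div(total, axis=0).add_suffix("_prop")

# One-hot encode the categorical media-attachment type.
features = pd.concat(
    [props, pd.get_dummies(posts[["media"]]), posts["hour"]], axis=1
)
print(features.columns.tolist())
```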
6 of 20
Big Picture Data Statistics
129 Facebook pages
Over 270 thousand posts
Over 9 million engagement counts
Only 8% of posts have message text
7 of 20
Number of Sites Unrelated to Post Volume
8 of 20
Disproportionate Posting and Engagement Level among Sources
Precision / Recall by Type:
  Mainstream: 0.6 / 0.7
  Fake:       0.5 / 0.8
  Conspiracy: 0.7 / 0.08
  Satire:     0.5 / 0.2
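Per-class precision and recall figures like those above can be computed with scikit-learn. The labels here are illustrative only (M/F/C/S stand for mainstream, fake, conspiracy, satire), not the project's predictions.

```python
from sklearn.metrics import precision_score, recall_score

# Illustrative true and predicted labels, not real results.
y_true = ["M", "M", "F", "C", "S", "F", "C", "M"]
y_pred = ["M", "F", "F", "M", "S", "F", "M", "M"]

labels = ["M", "F", "C", "S"]
prec = precision_score(y_true, y_pred, labels=labels, average=None,
                       zero_division=0)
rec = recall_score(y_true, y_pred, labels=labels, average=None,
                   zero_division=0)
for lab, p, r in zip(labels, prec, rec):
    print(f"{lab}: precision={p:.2f} recall={r:.2f}")
```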
13 of 20
Engagement Check
14 of 20
Page-Level Engagement Check
15 of 20
Media Type Check
16 of 20
Take-Aways
Some success in relying on source lists compiled by Dr. Melissa Zimdars to predict the type of news in a Facebook post: the model’s accuracy is 16% above baseline, but recall for conspiracy is embarrassingly low.
Based on EDA, some prominent coefficients make sense, while others do not.
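Checking whether prominent coefficients "make sense" can be done by ranking each class's fitted coefficients. This is a synthetic sketch; the feature names are placeholders, and class 2 is constructed so that one feature clearly dominates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
# Synthetic rule: class 2 is driven by feature 0, class 1 by feature 1.
y = np.where(X[:, 0] > 0.8, 2, np.where(X[:, 1] > 0.8, 1, 0))
feature_names = ["ngram_a", "ngram_b", "shares_prop",
                 "hour", "photo", "video"]  # placeholder names

clf = LogisticRegression(max_iter=1000).fit(X, y)

# For each class, the largest positive coefficients are the predictors
# most associated with that class -- the kind of sanity check done via EDA.
for cls, row in zip(clf.classes_, clf.coef_):
    top = np.argsort(row)[::-1][:2]
    print(cls, [feature_names[i] for i in top])
```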
17 of 20
Future Work
Calibrate and verify the model using more posts, messages, and news sources. Do the proxies hold up?
Vary the number and types of words from the message text used in the model. Possibly include n-grams of link names.
Examine the content of the pre-categorized posts in the dataset and, if necessary, correct the label.
Evaluate other feature-engineering options and compare their performance against Lasso.
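For reference, the Lasso baseline mentioned above corresponds to L1-penalized logistic regression, which zeroes out irrelevant coefficients and so selects features inside the model. This is a synthetic sketch under assumed data, not the project's actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 20))
# Only the first 3 of 20 features matter in this synthetic setup.
logits = X[:, 0] - X[:, 1] + 2 * X[:, 2]
y = (logits + rng.normal(scale=0.5, size=400) > 0).astype(int)

# An L1 penalty ("Lasso") drives irrelevant coefficients to exactly zero,
# performing feature selection as part of the fit.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(lasso.coef_[0])
print("features kept:", kept)
```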