1 of 20

Classifying the News

of Our Times

Luke Armbruster

2 of 20

3 of 20

So What’s the Problem?

Why Fake News?

  • Fake news spreading ≠ informed decision-making
  • Technically difficult to auto-tag without a human element
  • Business value

Objective

  • Specify proxy factors that can identify mainstream, fake, conspiracy, and satire sources using multiclass logistic regression

Assumptions

  • High-integrity source classifications
  • Proxies for news types exist
  • A real-world model can be built from a skewed data set
  • Linear model behavior

4 of 20

Data Sources

  • Training period:
    • Spans 73 days before and after the 2016 U.S. presidential election
  • Sources (129 Facebook Pages total):
    • Posts from 13 mainstream sources identified in a 2014 Pew Research study
    • Posts from 41 fake, 38 conspiracy, and 37 satire sources labeled by Dr. Melissa Zimdars of Merrimack College (many cross-checked against output from Daniel Sieradski's B.S. Detector tool)

5 of 20

Predictors

  • 505 selected 1- and 2-word grams from the post message text
  • Each engagement type's proportion of the total engagement count (comments, shares, likes, loves, wows, hahas, sads, and angrys)
  • Time elements: hour of day, day of week, and timing relative to the election
  • Media attachment type (see the feature sketch below)
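A minimal sketch of how the predictors above could be assembled, assuming a pandas DataFrame posts with hypothetical column names (comments, shares, ..., created_time, media_type, message); the deck does not show the actual pipeline.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

ENGAGEMENTS = ["comments", "shares", "likes", "loves",
               "wows", "hahas", "sads", "angrys"]

def build_features(posts: pd.DataFrame) -> pd.DataFrame:
    X = pd.DataFrame(index=posts.index)

    # Each engagement type's proportion of the post's total engagement count.
    total = posts[ENGAGEMENTS].sum(axis=1)
    for col in ENGAGEMENTS:
        X[col + "_prop"] = posts[col].div(total).fillna(0)

    # Time elements: hour of day, day of week, and a before/after-election flag.
    ts = pd.to_datetime(posts["created_time"])
    X["hour"] = ts.dt.hour
    X["weekday"] = ts.dt.dayofweek
    X["post_election"] = (ts >= "2016-11-08").astype(int)

    # Media attachment type as one-hot indicator columns.
    X = X.join(pd.get_dummies(posts["media_type"], prefix="media"))

    # 505 selected 1- and 2-word grams from the post message text.
    vec = CountVectorizer(ngram_range=(1, 2), max_features=505)
    grams = vec.fit_transform(posts["message"].fillna(""))
    return X.join(pd.DataFrame(grams.toarray(), index=posts.index,
                               columns=vec.get_feature_names_out()))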

6 of 20

Big Picture Data Statistics

  • 129 Facebook pages

  • Over 270 thousand posts

  • Over 9 million engagement counts

  • Only 8% of posts have message text

7 of 20

# of Sites Unrelated to Post Volume

8 of 20

Disproportionate Posting and Engagement Level among Sources

9 of 20

Model: LogisticRegression (L1 penalty, one-vs-rest, C = 1)
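A sketch of this configuration in scikit-learn, assuming the shorthand above means an L1 (Lasso) penalty, a one-vs-rest multiclass scheme, and C = 1; the variables X and y and the split settings are illustrative, continuing the feature sketch on the Predictors slide.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X = build_features(posts); y = class labels (0-3) for each post's source page.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# liblinear supports the L1 penalty and fits one binary model per class (OVR).
model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
model.fit(X_train, y_train)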

10 of 20

Test Set Results

Score: 57%

Baseline: 41%

Difference: +16%
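The deck does not define the baseline; assuming it is the accuracy of always predicting the majority class, it could be computed like this (continuing the earlier sketch):

from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("model score:   ", model.score(X_test, y_test))     # reported as 57%
print("baseline score:", baseline.score(X_test, y_test))  # reported as 41%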

11 of 20

Test Set Results

Key: mainstream (0), fake (1), conspiracy (2), satire (3)
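If the figure on this slide is a confusion matrix, a sketch of how it could be generated with scikit-learn using the label key above (model, X_test, and y_test continue the earlier sketch):

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

labels = ["mainstream", "fake", "conspiracy", "satire"]
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2, 3])
ConfusionMatrixDisplay(cm, display_labels=labels).plot()
plt.show()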

12 of 20

Greatest Coefficients

Classes: Mainstream, Fake*, Conspiracy*, Satire

Top coefficient terms across the four classes ("(-)" marks a negative coefficient):

  • Like
  • Sad (-)
  • Comment (-)
  • Like
  • Sad
  • Stop cheering
  • Tour tickets
  • Haha
  • Comment
  • Follow deplorable
  • Angry (-)
  • Neverhillary (-)
  • Share expose
  • Love (-)
  • Follow american
  • Video type (-)
  • Neverhillary (-)

*Relatively small coefficients

Precision by Type: 0.6 (M), 0.5 (F), 0.7 (C), 0.5 (S)
Recall by Type: 0.7 (M), 0.8 (F), 0.08 (C), 0.2 (S)
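Per-class precision and recall like the figures above can be reproduced with scikit-learn's classification_report (sketch, continuing the earlier code):

from sklearn.metrics import classification_report

print(classification_report(
    y_test, y_pred, labels=[0, 1, 2, 3],
    target_names=["mainstream", "fake", "conspiracy", "satire"]))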

13 of 20

Engagement Check

Precision by Type: 0.6 (M), 0.5 (F), 0.7 (C), 0.5 (S)
Recall by Type: 0.7 (M), 0.8 (F), 0.08 (C), 0.2 (S)

14 of 20

Page-Level Engagement Check

15 of 20

Media Type Check

16 of 20

Take-Aways

  • Some success in relying on the source lists compiled by Dr. Melissa Zimdars to predict the type of news in a Facebook post: the model scores 16 percentage points above the baseline accuracy, but recall for the conspiracy class is embarrassingly low.
  • Based on EDA, some prominent coefficients make sense, while others do not.

17 of 20

Future Work

  • Calibrate and verify the model using more posts, messages, and news sources. Do the proxies hold?
  • Vary the number and types of words from the message text used in the model. Possibly include n-grams of link names.
  • Examine the content of the pre-categorized posts in the dataset and, if necessary, correct the labels.
  • Evaluate other feature engineering options to compare against the performance of Lasso (L1) regularization.
  • Use grid search (e.g., GridSearchCV) to tune the C parameter (see the sketch below).
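A sketch of that grid search, reusing the earlier model settings; the grid of C values shown is hypothetical.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},  # hypothetical grid of C values
    scoring="accuracy",
    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)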

18 of 20

A News Week

19 of 20

A 24-Hour News Cycle

20 of 20

Volume with Reference to the Election