Technically difficult to auto-tag without a human element
Business value
Objective
Specify proxy factors that can identify mainstream, fake, conspiracy, and satire sources using multiclass logistic regression
Assumptions
High-integrity source classifications
Proxies for news types exist
A real-world model can be built on a skewed data set
Linear model behavior
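The objective and the linear-model assumption can be sketched with scikit-learn's multinomial logistic regression. The data here is a random stand-in, not the project's data; it only shows the model form.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in data: 200 posts x 5 features, 4 classes
# (mainstream=0, fake=1, conspiracy=2, satire=3).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 4, size=200)

# Multinomial (softmax) logistic regression: a linear score per class,
# normalized into class probabilities.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

probs = clf.predict_proba(X[:3])  # one probability per class; rows sum to 1
print(probs.shape)
```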
4 of 20
Data Sources
Training period:
Spans 73 days before and after the U.S. presidential election
Source (129 Facebook Pages Total):
Posts from 13 mainstream sources identified in a 2014 Pew Research study
Posts from 41 fake, 38 conspiracy, and 37 satire sources labeled by Dr. Melissa Zimdars of Merrimack College (many cross-checked against output from Daniel Sieradski’s B.S. Detector tool)
5 of 20
Predictors
505 selected 1- and 2-word n-grams from the post message
Each engagement type’s proportion of the total engagement count (comments, shares, likes, loves, wows, hahas, sads, and angrys)
Time elements: hour, weekday, and timing relative to the election
Media attachment type
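A sketch of how these predictors could be assembled into a feature matrix. The column names and values are hypothetical, not the project's actual schema; the point is the engagement-proportion and one-hot encoding steps.

```python
import pandas as pd

# Hypothetical post-level data (assumed column names, not the real schema).
posts = pd.DataFrame({
    "likes":  [120, 10, 5],
    "shares": [30, 2, 0],
    "angrys": [50, 0, 5],
    "hour":   [9, 22, 15],
    "media":  ["link", "photo", "video"],
})

reactions = ["likes", "shares", "angrys"]
total = posts[reactions].sum(axis=1)

# Engagement-type proportions of the total engagement count.
props = posts[reactions].div(total, axis=0).add_suffix("_prop")

# One-hot encode the categorical media-attachment type.
features = pd.concat(
    [props, pd.get_dummies(posts[["media"]]), posts["hour"]], axis=1
)
print(features.columns.tolist())
```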
6 of 20
Big Picture Data Statistics
129 Facebook pages
Over 270 thousand posts
Over 9 million engagement counts
Only 8% of posts have message text
7 of 20
Number of Sites Unrelated to Post Volume
8 of 20
Disproportionate Posting and Engagement Level among Sources
Precision / Recall by Type:
  Mainstream: 0.6 / 0.7
  Fake:       0.5 / 0.8
  Conspiracy: 0.7 / 0.08
  Satire:     0.5 / 0.2
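Per-class precision and recall figures like those above can be computed with scikit-learn. The labels here are illustrative only (M/F/C/S stand for mainstream, fake, conspiracy, satire), not the project's predictions.

```python
from sklearn.metrics import precision_score, recall_score

# Illustrative true and predicted labels, not real results.
y_true = ["M", "M", "F", "C", "S", "F", "C", "M"]
y_pred = ["M", "F", "F", "M", "S", "F", "M", "M"]

labels = ["M", "F", "C", "S"]
prec = precision_score(y_true, y_pred, labels=labels, average=None,
                       zero_division=0)
rec = recall_score(y_true, y_pred, labels=labels, average=None,
                   zero_division=0)
for lab, p, r in zip(labels, prec, rec):
    print(f"{lab}: precision={p:.2f} recall={r:.2f}")
```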
13 of 20
Engagement Check
14 of 20
Page-Level Engagement Check
15 of 20
Media Type Check
16 of 20
Take-Aways
Some success in relying on source lists compiled by Dr. Melissa Zimdars to predict the type of news in a Facebook post: the model’s accuracy is 16% above baseline, but recall for conspiracy is embarrassingly low.
Based on EDA, some prominent coefficients make sense, while others do not.
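Checking whether prominent coefficients "make sense" can be done by ranking each class's fitted coefficients. This is a synthetic sketch; the feature names are placeholders, and class 2 is constructed so that one feature clearly dominates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
# Synthetic rule: class 2 is driven by feature 0, class 1 by feature 1.
y = np.where(X[:, 0] > 0.8, 2, np.where(X[:, 1] > 0.8, 1, 0))
feature_names = ["ngram_a", "ngram_b", "shares_prop",
                 "hour", "photo", "video"]  # placeholder names

clf = LogisticRegression(max_iter=1000).fit(X, y)

# For each class, the largest positive coefficients are the predictors
# most associated with that class -- the kind of sanity check done via EDA.
for cls, row in zip(clf.classes_, clf.coef_):
    top = np.argsort(row)[::-1][:2]
    print(cls, [feature_names[i] for i in top])
```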
17 of 20
Future Work
Calibrate and verify the model using more posts, messages, and news sources. Do the proxies hold up?
Vary the number and types of words from the message text used in the model. Possibly include n-grams of link names.
Examine the content of the pre-categorized posts in the dataset and, if necessary, correct the label.
Evaluate other feature-engineering options and compare their performance against Lasso.
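For reference, the Lasso baseline mentioned above corresponds to L1-penalized logistic regression, which zeroes out irrelevant coefficients and so selects features inside the model. This is a synthetic sketch under assumed data, not the project's actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 20))
# Only the first 3 of 20 features matter in this synthetic setup.
logits = X[:, 0] - X[:, 1] + 2 * X[:, 2]
y = (logits + rng.normal(scale=0.5, size=400) > 0).astype(int)

# An L1 penalty ("Lasso") drives irrelevant coefficients to exactly zero,
# performing feature selection as part of the fit.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(lasso.coef_[0])
print("features kept:", kept)
```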