1 of 15

On the Coherence of Fake News Articles

  • Iknoor Singh, University of Sheffield, UK
  • Deepak P, Queen’s University Belfast, UK
  • Anoop K, University of Calicut, India


Contact: deepaksp@acm.org

8th International Workshop on News Recommendation and Analytics (INRA 2020)

ECML PKDD 2020 Conference, September 2020

2 of 15


[Slide graphic: the rise of "fake news" through 2016–2017; fake news = intentionally false content]

3 of 15

Data-driven Fake News Detection

  • Three kinds of features:
    • Content
    • Network
    • Propagation
  • Most work on political fake news on Twitter relies heavily on network and propagation features
  • However, for long-form political articles, as well as most fake news in the medical and science domains, network and propagation features are sparse
    • Content is abundant and is also where the fakeness lies


4 of 15

Identifying Fake News in Text Heavy Domains (1/2)

  • Conventional Supervised Approach:
    • Develop labelled datasets for the domains
    • Train a classifier
    • Deploy it
  • Alternative Approach:
    • Investigate the character of fake news vis-à-vis legitimate news
    • Identify high-level, persistent meta-characterizations of those differences
    • Develop heuristics that computationally characterize such differences
    • Computationally test whether these lead to statistically significant differences
    • Several of these high-level meta-heuristics could together yield a fake news detection framework


5 of 15

Identifying Fake News in Text Heavy Domains (2/2)

  • Advantages of the Alternative Approach:
    • A more domain-informed understanding of fake news; we know what features we are using
    • We can understand and critique the various ‘dimensions of differences’
    • Spurious correlations won’t be employed
      • A data-driven approach could exploit a correlation such as p(Fake | UserID = X) to flag fake news, but this denies X the opportunity to evolve and change over time: X’s content would continue to be labelled fake.
    • More generalizable to changing characteristics of fake news
  • Disadvantages of the Alternative Approach:
    • May be hard to match the accuracy of purely data-driven approaches


6 of 15

Our Aim: Is Lexical Coherence a Dimension of Difference between Fake and Legitimate News?

  • Are fake news articles less coherent?
  • Address the above question computationally.
  • What is coherence?
    • How well do the parts of the article fit together?
    • How interconnected are the different parts of the article?
    • How well does the article read as a unified whole?
  • Related to the notion of cohesion used in linguistics; we use the term coherence since this is more familiar to the computing community.
  • Cohesion is considered an important feature in assessing text quality (references in paper).


7 of 15

Objective

  • In statistical terms:
    • H1: Fake News is less coherent.
    • H0: Fake and Real News don’t differ in coherence.
  • Our Computational Building Blocks for Coherence Assessment:
    • Text Embeddings
    • Explicit Semantic Analysis (ESA)
    • Entity Linking
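To decide between H0 and H1, the per-article coherence scores of the two classes can be compared for statistical significance. One simple option (not necessarily the test used in the paper) is a one-sided permutation test; the scores below are made-up illustrations, not the paper’s data:

```python
import random

def permutation_test(fake_scores, real_scores, n_perm=2000, seed=0):
    """One-sided permutation test for H1: mean(fake) < mean(real).

    Returns the p-value: the fraction of random label shufflings whose
    mean difference (real - fake) is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = sum(real_scores) / len(real_scores) - sum(fake_scores) / len(fake_scores)
    pooled = fake_scores + real_scores
    n_fake = len(fake_scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        fake, real = pooled[:n_fake], pooled[n_fake:]
        diff = sum(real) / len(real) - sum(fake) / len(fake)
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Illustrative coherence scores (invented for this sketch):
fake = [0.41, 0.45, 0.39, 0.43, 0.40, 0.44, 0.42, 0.38]
real = [0.52, 0.55, 0.50, 0.57, 0.53, 0.51, 0.56, 0.54]
print(permutation_test(fake, real))  # small p -> reject H0
```

A small p-value means label shufflings almost never reproduce the observed gap, so the coherence difference between the classes is unlikely to be chance.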


8 of 15

Text Embedding Coherence Assessment

  • Each article is a set of sentences
  • Each sentence can be represented as a vector formed by the average of pre-trained word embedding vectors of component words
  • Coherence of article: Average of pairwise similarities between sentence vectors
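A minimal sketch of this computation, using a toy, made-up word-vector table in place of real pre-trained embeddings such as GloVe or word2vec:

```python
import math

# Toy stand-ins for pre-trained word embeddings; the words and values
# are illustrative only, not real embedding weights.
WORD_VECS = {
    "markets": [0.9, 0.1, 0.0],
    "stocks":  [0.8, 0.2, 0.1],
    "fell":    [0.7, 0.0, 0.3],
    "sharply": [0.6, 0.1, 0.2],
    "aliens":  [0.0, 0.9, 0.1],
    "landed":  [0.1, 0.8, 0.0],
}

def sentence_vector(sentence):
    """Average the embeddings of the sentence's in-vocabulary words."""
    vecs = [WORD_VECS[w] for w in sentence.lower().split() if w in WORD_VECS]
    if not vecs:
        return [0.0] * 3
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0

def coherence(article):
    """Average pairwise cosine similarity between sentence vectors."""
    vecs = [sentence_vector(s) for s in article]
    pairs = [cosine(vecs[i], vecs[j])
             for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(pairs) / len(pairs) if pairs else 0.0

on_topic = ["markets fell sharply", "stocks fell"]
off_topic = ["markets fell sharply", "aliens landed"]
print(coherence(on_topic) > coherence(off_topic))
```

An article whose sentences stay on one topic scores higher than one whose sentences diverge, which is exactly the signal the hypothesis probes.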


9 of 15

ESA Coherence Assessment

  • Each article is a set of sentences
  • Each sentence can be represented as a vector formed by the average of the ESA (Wikipedia concept-space) vectors of component words
  • Coherence of article: Average of pairwise similarities between sentence vectors
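The ESA variant can be sketched the same way, with a word's vector being its TF-IDF-style weight in each Wikipedia concept article. The three-concept "Wikipedia" below is an illustrative stand-in for the real encyclopedia:

```python
import math
from collections import Counter

# Toy concept corpus standing in for Wikipedia; real ESA uses the
# full encyclopedia, so these concepts and texts are illustrative only.
CONCEPTS = {
    "Economy": "markets stocks trade markets prices",
    "Space":   "aliens rocket orbit landed space",
    "Health":  "vaccine doctor health clinic",
}

def esa_word_vector(word):
    """ESA vector: one TF-IDF-style weight per concept article."""
    tfs = [Counter(text.split())[word] for text in CONCEPTS.values()]
    df = sum(1 for tf in tfs if tf > 0)
    idf = math.log(len(CONCEPTS) / df) if df else 0.0
    return [tf * idf for tf in tfs]

def esa_sentence_vector(sentence):
    """Average the ESA vectors of the sentence's words."""
    vecs = [esa_word_vector(w) for w in sentence.lower().split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0

s1 = esa_sentence_vector("markets fell")
s2 = esa_sentence_vector("stocks trade")
s3 = esa_sentence_vector("aliens landed")
print(cosine(s1, s2), cosine(s1, s3))
```

Two sentences grounded in the same concept (Economy) align; a sentence grounded in a different concept (Space) is orthogonal, so the article-level average drops.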


10 of 15

Entity Linking Coherence Assessment

  • Each article is a set of entities it contains
  • Each entity can be represented as a vector formed by the Wikipedia2Vec technique
  • Coherence of article: Average of pairwise similarities between entity vectors
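A sketch of the entity variant. In practice the entity vectors would come from a pretrained Wikipedia2Vec model (e.g. `Wikipedia2Vec.load(model_file).get_entity_vector(name)`); the vectors below are made-up stand-ins:

```python
import math

# Stand-in entity vectors; in real use these would be looked up from a
# pretrained Wikipedia2Vec model, not hand-written constants.
ENTITY_VECS = {
    "White House":   [0.9, 0.1],
    "United States": [0.8, 0.2],
    "Congress":      [0.85, 0.15],
    "Bigfoot":       [0.1, 0.9],
}

def entity_coherence(entities):
    """Average pairwise cosine similarity between the article's entity vectors."""
    vecs = [ENTITY_VECS[e] for e in entities]
    sims = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            a, b = vecs[i], vecs[j]
            denom = (math.sqrt(sum(x * x for x in a))
                     * math.sqrt(sum(y * y for y in b)))
            sims.append(sum(x * y for x, y in zip(a, b)) / denom)
    return sum(sims) / len(sims) if sims else 0.0

print(entity_coherence(["White House", "United States", "Congress"]))
```

An article whose entities live in one region of Wikipedia's entity space (all political, say) scores higher than one that mixes unrelated entities.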


11 of 15

Datasets

  • ISOT Fake News
    • 10k+ articles
    • 10-15 sentences per article
  • Health and Well Being Dataset
    • 1k articles; we curated the dataset
    • 25-30 sentences per article


12 of 15

Results


Largest difference observed for Word Embedding coherence scores

13 of 15

General Trend: Unimodal distributions, with different peaks for fake and legitimate news


14 of 15

Questions

  • Why are fake news articles less coherent?
    • Inexperienced media houses?
    • Multiple sub-stories?
    • Mixing of emotional and factual narratives?
    • Would technologies like GPT-3 generate more coherent fake news?


15 of 15

Contact: deepaksp@acm.org


Book Chapters:

Deepak P, “On Unsupervised Methods for Fake News Detection”

Deepak P, “Ethical Considerations in Data-driven Fake News Detection”

in Data Science for Fake News: Surveys and Perspectives

Upcoming Springer Book (Fall 2020)

Email me if you would like to get a version as soon as ready.