1 of 14

SCATE: Shared Cross Attention Transformer Encoders for Multimodal Fake News Detection

TANMAY SACHAN, NIKHIL PINNAPARAJU, MANISH GUPTA, VASUDEVA VERMA

2 of 14

What is fake news? Why is it important?

  • 1. Fake news consists of either the alteration of news facts or the creation of new ones, in an attempt to sensationalize or mislead the population.
  • 2. With the rise of social media platforms, the spread of news is fast and irreversible.
  • 3. The spread of fake news has already impacted society in ways that were unimaginable a decade ago, from stock market fluctuations to aggravating political divides in entire countries.

3 of 14

Detection of fake news

  • 1. There are multiple websites (e.g., Snopes, PolitiFact) that are regularly updated with articles identified as fake by multiple sources.
  • 2. These websites generally rely on humans to correctly identify fake news articles.
  • 3. While accurate, these websites are too slow to keep up with the vast amount of news produced around the world every day.
  • 4. A lot of fake news slips through undetected.

4 of 14

Multimodal fake news

  • 1. News articles are often accompanied by some supporting images.
  • 2. Multimodal fake news consists of doctoring the text or the image (or both) to create fake news.
  • 3. Our work focuses on the detection of Multimodal fake news articles.

Examples (article text; the accompanying images are shown on the slide):

  • “Sharks seen roaming in New Jersey streets and metro stations. #Sandy”
  • “Brilliant telling photo: ‘I don’t believe in global warming’.”

5 of 14

Datasets

  • In our work we make use of datasets from two sources, Twitter and Weibo.
  • 1. Twitter
    • The Twitter MediaEval dataset consists of 14,514 posts with 480 unique images.
  • 2. Weibo
    • For Weibo, we use two datasets, WeiboA and WeiboB, which differ in the time period over which they were collected.
    • WeiboA consists of 7,955 posts with 7,955 unique images.
    • WeiboB consists of 10,084 posts with 9,525 unique images.
  • The length of the textual data for these datasets is on the order of a sentence (about 200 characters).

6 of 14

Baselines

  • In our work, we consider four baseline models.
  • 1. Confirming the efficacy of multimodality in fake news detection
    • We consider a text-only and an image-only baseline.
    • Text-only – Consists of word vectors passed through a Bi-LSTM followed by a classifier.
    • Image-only – Consists of a VGG-19 encoder followed by a classifier.
  • 2. Comparing our model scores
    • We compare our model against two recent state-of-the-art models:
    • CARMN [1] and SpotFake [2]
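The text-only baseline above can be sketched as follows. This is a minimal illustration, not the authors' code; the embedding and hidden dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TextOnlyBaseline(nn.Module):
    """Sketch of the text-only baseline: word vectors -> Bi-LSTM -> classifier.
    emb_dim and hidden are illustrative assumptions."""
    def __init__(self, emb_dim=300, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 1)  # 2x for the two directions

    def forward(self, word_vectors):            # (batch, seq_len, emb_dim)
        _, (h, _) = self.lstm(word_vectors)     # h: (2, batch, hidden)
        pooled = torch.cat([h[0], h[1]], dim=-1)  # final state of each direction
        return self.classifier(pooled).squeeze(-1)  # one logit per post

baseline = TextOnlyBaseline()
logits = baseline(torch.randn(4, 20, 300))  # shape (4,)
```

The image-only baseline has the same shape: a VGG-19 encoder in place of the Bi-LSTM, followed by the same classifier head.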

[1] C. Song, N. Ning, Y. Zhang, and B. Wu, “A multimodal fake news detection model based on crossmodal attention residual and multichannel convolutional neural networks,” Information Processing & Management, vol. 58, no. 1, p. 102437, 2021.

[2] S. Singhal, R. R. Shah, T. Chakraborty, P. Kumaraguru, and S. Satoh, “SpotFake: A multi-modal framework for fake news detection,” in BigMM. IEEE, 2019, pp. 39–47.

7 of 14

Our model

  • Our model, SCATE, consists of a text encoder based on BERT (Chinese BERT for the Weibo data) and an image encoder based on VGG-19.
  • The encoded representations are passed through dense layers that linearly transform them into identical dimensions.
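The projection step can be sketched as below. The feature dimensions (768 for BERT, 4096 for a VGG-19 fc layer, 256 shared) are assumptions for illustration, and random tensors stand in for the real encoder outputs.

```python
import torch
import torch.nn as nn

TEXT_DIM, IMAGE_DIM, SHARED_DIM = 768, 4096, 256  # assumed dimensions

class ModalityProjector(nn.Module):
    """Linearly transforms each modality's encoding into a common dimension."""
    def __init__(self):
        super().__init__()
        self.text_proj = nn.Linear(TEXT_DIM, SHARED_DIM)
        self.image_proj = nn.Linear(IMAGE_DIM, SHARED_DIM)

    def forward(self, text_feat, image_feat):
        return self.text_proj(text_feat), self.image_proj(image_feat)

proj = ModalityProjector()
# Random features stand in for BERT / VGG-19 outputs here.
t, v = proj(torch.randn(8, TEXT_DIM), torch.randn(8, IMAGE_DIM))
# t and v both have shape (8, 256): identical dimensions, ready for fusion.
```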

8 of 14

Our model

  • The new vectors pass through attention scaling layers, which scale the vectors relative to each other using scaled dot product attention.
  • A shared feedforward layer learns shared representation of the modalities.
  • The transformer block can be stacked multiple times, further transforming the vectors as the model learns better representations.
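One such block might look like the sketch below. This is an interpretation of the description, not the authors' exact architecture: each modality has its own attention scaling layer (as the ablation later favors), the score is a post-level scaled dot product, and the feedforward weights are shared across modalities.

```python
import math
import torch
import torch.nn as nn

D = 256  # shared dimension after projection (an assumption)

class AttentionScalingBlock(nn.Module):
    """Sketch of one SCATE-style block: each modality vector is scaled by a
    post-level scaled-dot-product score against the other modality, then
    passed through a feedforward layer shared across modalities."""
    def __init__(self, d=D):
        super().__init__()
        self.scale = math.sqrt(d)
        self.text_query = nn.Linear(d, d)   # separate scaling layer per modality
        self.image_query = nn.Linear(d, d)
        self.shared_ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, t, v):
        # Post-level attention: one scalar score per (text, image) pair.
        t_score = torch.sigmoid((self.text_query(t) * v).sum(-1, keepdim=True) / self.scale)
        v_score = torch.sigmoid((self.image_query(v) * t).sum(-1, keepdim=True) / self.scale)
        # The shared feedforward learns a joint representation of the modalities.
        return self.shared_ffn(t * t_score), self.shared_ffn(v * v_score)

block = AttentionScalingBlock()
t, v = block(torch.randn(8, D), torch.randn(8, D))
# Stacking: t, v = block(t, v), repeated as many times as desired.
```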

9 of 14

Our model

  • After the encoder block, the transformed vectors are concatenated with the original projected vectors to preserve some of the initial information.
  • We then perform compact bilinear pooling (CBP) of the modalities, which yields a much higher-dimensional vector while preserving much of the dependence between the two modalities.
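Compact bilinear pooling is commonly implemented via Count Sketch and the FFT (Gao et al., 2016); a sketch of that standard construction is below. The output dimensionality and seed are illustrative assumptions, not the paper's settings.

```python
import torch

def compact_bilinear_pooling(x, y, out_dim=8192, seed=0):
    """Approximates the outer product of x and y in a compact out_dim-sized
    vector, via Count Sketch + FFT (circular convolution of the sketches)."""
    d = x.shape[-1]

    def count_sketch(v, offset):
        gen = torch.Generator().manual_seed(seed + offset)
        h = torch.randint(0, out_dim, (d,), generator=gen)              # hash buckets
        s = (torch.randint(0, 2, (d,), generator=gen) * 2 - 1).float()  # random signs
        out = torch.zeros(v.shape[0], out_dim)
        out.index_add_(1, h, v * s)  # scatter signed entries into buckets
        return out

    fx = torch.fft.rfft(count_sketch(x, 0), dim=-1)
    fy = torch.fft.rfft(count_sketch(y, 1), dim=-1)
    # Element-wise product in the frequency domain equals circular
    # convolution of the two sketches in the original domain.
    return torch.fft.irfft(fx * fy, n=out_dim, dim=-1)

fused = compact_bilinear_pooling(torch.randn(4, 512), torch.randn(4, 512))
# fused has shape (4, 8192): a much higher-dimensional fused representation.
```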

10 of 14

Our model

  • Post CBP, we have a final dot-product self-scaling layer: an additional dense layer that learns to scale the vector based on its own elements (similar to self-attention in transformers).
  • Finally, we have a standard classifier with 2 dense layers.
  • The model is trained using a binary cross entropy loss.
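The final stage could be sketched as below. This is an assumption-laden illustration: the fused dimension and hidden size are invented, and the numerically stable logits variant of binary cross-entropy is used.

```python
import torch
import torch.nn as nn

FUSED_DIM = 8192  # CBP output size (an assumption)

class SelfScalingClassifier(nn.Module):
    """Sketch of the final stage: a dense layer produces a per-element gate
    from the fused vector itself (akin to self-attention), followed by a
    two-dense-layer classifier."""
    def __init__(self, d=FUSED_DIM, hidden=256):
        super().__init__()
        self.gate = nn.Linear(d, d)  # learns to scale the vector by its own elements
        self.classifier = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, fused):
        scaled = fused * torch.sigmoid(self.gate(fused))
        return self.classifier(scaled).squeeze(-1)  # one fake/real logit per post

model = SelfScalingClassifier()
logits = model(torch.randn(4, FUSED_DIM))
# Binary cross-entropy loss (logits form) against 0/1 fake-news labels.
loss = nn.BCEWithLogitsLoss()(logits, torch.tensor([0., 1., 1., 0.]))
```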

11 of 14

Results and analysis

Comparing our model’s scores on the test sets of the three datasets, we see a significant improvement over all baselines.

(Result tables for Twitter, WeiboA, and WeiboB are shown on the slide.)

12 of 14

Observations

  • 1. The increase in accuracy and F1-scores over CARMN can likely be attributed to the fact that the attention we compute is at the post level rather than the token level.
    • Since we are dealing with a classification problem with a binary cross-entropy loss, training a model at the token level with such a loss is much harder, as the backpropagation signal is weak.
  • 2. The increase over SpotFake can be attributed to the fact that our model understands intermodal dependence much better due to the sharing of modalities within our transformer blocks.

13 of 14

Ablation analysis

  • 1. Through rigorous ablation studies of our model, we find that the transformer encoders and the dot-product scaling layer are necessary to obtain the results we report.
  • 2. We also experimented with a single attention scaling layer shared across modalities instead of separate ones, and saw a drop in scores. We attribute this to the low dimensionality of the scaling layers and to their output being a scalar (used to scale the vectors); to minimize this information loss, separate attention scaling layers per modality yield a higher score.
  • The results are included in the paper.

14 of 14

Thank you!