1 of 28

Towards Proactively Forecasting Sentence-Specific Information Popularity within Online News Documents

Sayar Ghosh Roy 1, Anshul Padhi 1, Risubh Jain 1, Manish Gupta 1, 2, Vasudeva Varma 1

International Institute of Information Technology, Hyderabad 1

Microsoft, India 2

Proceedings of the 33rd ACM Conference on Hypertext and Social Media (HT '22)

June 28 – July 1, 2022, Barcelona, Spain

2 of 28

Introduction

  • Past studies on popularity prediction have focused exclusively on document-level labeling
  • We introduce the task of proactively forecasting popularities of sentences within online news documents solely utilizing their natural language content
  • We curate InfoPop, the first dataset with popularity labels for over 1.7M sentences from over 50K online news documents
  • We propose a novel transfer learning approach involving sentence salience prediction as an auxiliary task
  • We present a non-trivial takeaway: though popularity and salience are different concepts, transfer learning from salience prediction enhances popularity forecasting

3 of 28

Related Work on Document-level Popularity Prediction

Two types of popularity prediction problems based on choice of popularity surrogate:

  • Popularity based on Internet browsing: Considers the #received page load requests over sufficient time (pageviews or hit count) as information popularity surrogate [1, 2, 7]
  • Social Media Popularity: Estimates prospective engagement of a piece of content put up on a particular social media platform utilizing user-behavior based markers like #comments [4, 5], #shares [3, 6], etc. as surrogates of social-media popularity

For informative documents (including online news), a preferred surrogate of popularity has been pageview hits [8, 10, 11]

Intuitively, pageviews capture the generic browsing trends of the population, not limited to social media actions

4 of 28

Sentence-Specific Information Popularity

  • Document popularity based on pageviews captures the amount of notice a document receives on the Internet [2, 7]
    • Consequence of the everyday Internet browsing actions of world population
  • With increased Internet penetration, average #queries encountered daily by commercial search engines has exceeded the billion mark [source]
    • Google Trends ~ shows popular topics of interest segregated by region & timespan
      • Collection of worldwide queries to mark information of universal interest
  • Key Principle: Within local context of a news document D, if information piece I1 is more queried-after by global population than piece I2, then: popularity(I1) > popularity(I2)

5 of 28

Assigning Sentence-Specific Popularity Labels

News document D with sentences [s1, s2, ..., sN]

  • Comparing each sentence si to every search query encountered by a commercial search engine would be computationally infeasible
  • Only a negligibly small % of all encountered queries would positively contribute to sentences' popularities within D
    • Queries for which D could be deemed relevant
  • Filter incoming queries to consider the sublist Q = [q1, q2, ..., q|Q|] for which D was deemed significantly relevant
    • i.e., D was shown within the top 10 search results when the search engine encountered a query q ∈ Q

6 of 28

Task Description

Inspired by research on document popularity prediction [1, 12]

  • We frame sentence-specific popularity forecasting as a regression task
  • Not a simple binary classification of documents’ sentences into popular or not

Some document-popularity forecasting approaches rely on post-publication signals like pageview hits in the first half-hour after publication [1, 9], but

  • We forecast sentence popularity scores proactively (before a document's publication)
    • We utilize incoming search queries only to define sentence-specific popularity and construct InfoPop

7 of 28

Task Formulation

Query-insensitive relative normalized scoring of sentences

  • Given a document as input
  • Assign normalized score in range [0, 1] to every sentence
  • Indicating their intra-document relative information popularity
  • Without utilizing any external signals

Figure 1: Task outline. Looking solely at document text as input, forecast prospective sentence-specific popularity scores

8 of 28

InfoPop Dataset

  • Collected and preprocessed news documents from 26 reputed online news websites
  • Created a set of over 50K cleaned news documents

Figure 2: InfoPop: news sources (on x-axis) with their corresponding #documents (on y-axis)

9 of 28

InfoPop Dataset

  • Accessed incoming queries from Microsoft Bing
  • Mapped each document to the global assemblage of queries that deemed it significantly relevant
  • Using definition of sentence-specific information popularity, assigned popularity scores to over 1.7M sentences
  • Used cosine-similarity between corresponding TF-IDF vectors as measure of query-sentence similarity
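
A minimal sketch of this labeling step, assuming scikit-learn's TfidfVectorizer. Summing similarities over all relevant queries and min-max normalizing per document are illustrative assumptions here, not the exact InfoPop recipe:

```python
# Sketch: assign per-sentence popularity labels from a document's
# relevant queries via TF-IDF cosine similarity. Aggregation (sum over
# queries) and per-document min-max normalization are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def popularity_labels(sentences, queries):
    vec = TfidfVectorizer()
    vec.fit(sentences + queries)      # shared vocabulary for both sides
    S = vec.transform(sentences)      # |sentences| x vocab
    Q = vec.transform(queries)        # |queries| x vocab
    sims = cosine_similarity(S, Q)    # query-sentence similarity matrix
    raw = sims.sum(axis=1)            # aggregate similarity per sentence
    lo, hi = raw.min(), raw.max()     # min-max normalize to [0, 1]
    return [float((r - lo) / (hi - lo)) if hi > lo else 0.0 for r in raw]
```

Sentences lexically close to many relevant queries end up near 1, unrelated sentences near 0.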

Figure 3: InfoPop: Distribution of #sentences per document

10 of 28

Proposed Approach

  • Observation: sentence salience and sentence popularity, though quite different, have certain similarities
    • Salience is well studied in context of text summarization
      • Enjoys availability of sizeable amounts of news domain data
  • Hypothesis: Transfer Learning from auxiliary task of salience prediction would boost neural models' popularity forecasting ability
  • STILTs-based (Supplementary Training on Intermediate Labeled-data Tasks) [16] Transfer Learning (TL) setup using a constructed task of sentence salience prediction
    • Pre-train a neural architecture on a supervised TL subtask
    • Fine-tune the model for popularity forecasting
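
A schematic of the two-stage STILTs recipe, with a 1-D linear regressor standing in for the neural architecture (purely illustrative; BERTReg/BaseReg are the actual models):

```python
# Sketch of STILTs-style transfer learning: the SAME model is first
# trained on the auxiliary salience task, then fine-tuned on popularity
# forecasting, both under an MSE objective. The scalar-weight "model"
# is a toy stand-in for a neural sentence sequence regressor.

def train(w, xs, ys, lr=0.01, epochs=200):
    """Gradient descent on MSE for a 1-D linear model y = w * x."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

def stilts(salience_data, popularity_data):
    w = 0.0                           # "randomly" initialized weights
    w = train(w, *salience_data)      # stage 1: auxiliary salience task
    w = train(w, *popularity_data)    # stage 2: fine-tune on popularity
    return w
```

The point of the recipe is that stage 2 starts from salience-informed weights rather than from scratch.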

11 of 28

Salience Prediction versus Text Summarization

  • In document summarization, a sentence is salient if it contains information related to primary document semantics & can be included in its summary [13]
    • Given article plus gold summary, salience of a sentence = ROUGE overlap between sentence and summary
  • In a typical extractive setting, binary summary-inclusion labels are computed for sentences greedily to maximize ROUGE overlap between complete oracle and true summary
    • Implicit minimization of information redundancy
    • If a lexically similar sentence was previously included in oracle, a salient sentence might receive a summary-inclusion label of 0 [14]

We only capture salience & expect very similar sentences to have similar labels
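
For contrast, the typical greedy oracle construction described above can be sketched as follows (`rouge` is any candidate-vs-summary overlap scorer, a stand-in here; this is the setup we avoid, not our labeling):

```python
# Sketch of greedy extractive oracle construction: sentences are added
# one at a time, each round keeping the sentence that most improves
# ROUGE overlap of the oracle with the true summary; stop when no
# sentence improves the score.

def greedy_oracle(sentences, summary, rouge):
    oracle_idx, best = [], 0.0
    while True:
        gains = [(rouge(" ".join(sentences[j] for j in oracle_idx) + " " + s,
                        summary), i)
                 for i, s in enumerate(sentences) if i not in oracle_idx]
        if not gains:
            break
        score, i = max(gains)
        if score <= best:             # no improvement -> stop
            break
        best = score
        oracle_idx.append(i)
    return [1 if i in oracle_idx else 0 for i in range(len(sentences))]
```

With duplicate sentences, only one copy receives a summary-inclusion label of 1; the other is skipped despite being equally salient, which is exactly the redundancy effect noted above.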

12 of 28

Auxiliary Subtasks

  • Utilized publicly available CNN-DailyMail news summarization dataset (splits from [15])
  • Packaged auxiliary task as a sentence sequence regression problem
  • Train architectures adapting STILTs TL
  • Perform empirical cross-task evaluation
  • Compute 3 weakly supervised salience scores for sentences based on ROUGE 1, 2, and L overlaps with document summary
  • Create 3 auxiliary subtasks (due to the 3 labeling schemes), tag as S1, S2, and SL
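
A simplified version of this weak labeling, with a plain n-gram F1 standing in for a full ROUGE implementation:

```python
# Sketch of weak salience labeling: score each sentence by its ROUGE-n
# overlap with the reference summary. Simplified n-gram F1 stands in
# for proper ROUGE; one labeling run per n gives subtasks S1 and S2.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(sentence, summary, n):
    s = ngrams(sentence.lower().split(), n)
    r = ngrams(summary.lower().split(), n)
    overlap = sum((s & r).values())        # multiset n-gram intersection
    if not overlap:
        return 0.0
    prec, rec = overlap / sum(s.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

def salience_labels(sentences, summary, n=1):
    return [rouge_n_f1(sent, summary, n) for sent in sentences]
```

Unlike the greedy oracle, lexically similar sentences here receive similar labels.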

Table 1: Selected sentences (in order) from a document in Sentence Salience Prediction dataset with 3 types of salience labels based on ROUGE 1, ROUGE 2, and ROUGE L

13 of 28

Neural Architectures

  • Sentence-specific information popularity forecasting is framed as a sequence regression task, where a sentence's score is relative to the complete article
    • Effective neural models require global context
  • Design neural sentence sequence regression architectures
    • BaseReg: Rudimentary neural baseline using CNNs and RNNs
    • BERTReg: BERT-based scoring of documents’ sentences
  • MSE (Mean Squared Error) loss between true and inferred sentence scores for training
  • Both BERTReg and BaseReg handle arbitrarily large documents by employing a sliding window mechanism over a document's sentences with a preset stride
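
The windowing logic can be sketched as follows (`score_window`, the window size, and the stride are placeholders for the model and its hyperparameters):

```python
# Sketch of the sliding-window mechanism: score overlapping windows of
# sentences with a fixed stride, then average predictions for sentences
# covered by more than one window. `score_window` stands in for the
# neural sentence sequence regressor.

def sliding_window_scores(sentences, score_window, window=4, stride=2):
    n = len(sentences)
    totals, counts = [0.0] * n, [0] * n
    start = 0
    while True:
        end = min(start + window, n)
        for i, s in zip(range(start, end), score_window(sentences[start:end])):
            totals[i] += s            # accumulate overlapping predictions
            counts[i] += 1
        if end == n:                  # last window reached document end
            break
        start += stride
    return [t / c for t, c in zip(totals, counts)]
```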

14 of 28

Neural Architectures

Figure 4: BaseReg: RNN for Sentence Sequence Regression

Figure 5: BERTReg: BERT for Sentence Sequence Regression

15 of 28

Evaluation Metrics

  • Top-k overlap: |Ak ∩ Pk| / k, where Ak and Pk are the sets of actual and predicted top-k highest scored sentences, respectively
  • Regression errors: Mean Squared Error (MSE) & Mean Absolute Error (MAE) between arrays of actual and predicted sentence labels
  • Rank Correlation Metrics: Spearman's rank correlation (ρ) and Kendall's Tau (𝜏). ρ, 𝜏 ∈ [-1,1]
  • nDCG: Captures the normalized gain or usefulness of a sentence based on both its position in the inferred rank list and its actual score. nDCG ∈ [0, 1]
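
Illustrative implementations of two of these metrics, top-k overlap and rank-list nDCG over the actual sentence scores (a sketch, not the exact evaluation code):

```python
# Sketch of two evaluation metrics: top-k overlap |A_k ∩ P_k| / k, and
# nDCG of the predicted sentence ranking against the actual scores.
import math

def top_k_overlap(actual, predicted, k):
    top = lambda scores: set(
        sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k])
    return len(top(actual) & top(predicted)) / k

def ndcg(actual, predicted):
    # rank sentences by predicted score, gain = actual score
    order = sorted(range(len(predicted)), key=predicted.__getitem__, reverse=True)
    dcg = sum(actual[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = sum(s / math.log2(rank + 2)
                for rank, s in enumerate(sorted(actual, reverse=True)))
    return dcg / ideal if ideal else 0.0
```

A perfect ranking gives nDCG = 1; any misordering of high-scored sentences lowers it.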

16 of 28

Sentence-Specific Popularity Forecasting Results

Table 2: Sentence Popularity Forecasting Results

17 of 28

Sentence-Specific Popularity Forecasting Results

  • BERTReg ~ best architecture for the task
  • Performance upgrades upon employment of the Transfer Learning setup
    • BERTReg with TL = SL boosts average nDCG by over 1% over vanilla BERTReg and by over 7% over the best unsupervised baseline
    • t-test: BERTReg with TL = SL significantly outperforms vanilla BERTReg on nDCG at significance level p < 0.01
  • BaseReg ~ ineffective for popularity forecasting, but TL setup is a positive addition

18 of 28

Performance Enhancement due to Transfer Learning

  • TL from Salience Prediction improves Popularity Forecasting

Attribute the performance enhancement to 2 factors

  • Datasets used for both tasks are sourced from online news
    • TL allows model to witness more domain-specific data [17]
  • Though popularity forecasting differs from salience prediction, they have certain similarities like
    • Penalizing sentences that are not lexically dense
    • Recognizing that certain sentences do not carry any notable information

19 of 28

Performance of Approaches on Auxiliary Subtasks

Table 3: Performance of various approaches on auxiliary transfer learning subtasks (S1, S2, SL)

20 of 28

Task Comparison

  • Sentence ranking baselines ~ more capable of capturing salience than forecasting information popularity
  • Salience of initial sentences in news articles is typically higher than their popularity
    • Position baseline ~ scores for salience prediction are much greater than for popularity forecasting
  • Predicting sentence salience is less complicated than forecasting information popularity
    • Supervised salience prediction models achieve much better results on ρ, 𝜏, and nDCG compared to popularity forecasting models with same architecture
    • BaseReg, which performed poorly on popularity forecasting, achieves respectable scores for salience prediction

21 of 28

Popularity and Salience

Salient sentences capture summary-inclusion worthy ideas that are central to core semantics of an article [18]

  • An information piece might deviate significantly from an article's primary topic yet still be popular

Consider a sentence from a particular article (with ID 34499) within InfoPop: “Weinsheimer has spent 27 years at DOJ, where he tried homicide and public corruption cases.”

  • Not salient enough for inclusion in article summary as it is barely related to the article's core topic (Scott Schools' resignation)
  • But, contains one of the most popular information pieces in the document

22 of 28

Popularity and Salience

Table 4: Selected sentences from a document in InfoPop with their true and forecasted popularities and predicted salience. Popularity forecasts are from our best performing model on nDCG (BERTReg with TL = SL). Salience predictions are based on BERTReg trained on S1.

[TPL: True Popularity Label, FPL: Forecasted Popularity Label, PSL: Predicted Salience Label, TPR: True Popularity Rank, FPR: Forecasted Popularity Rank, PSR: Predicted Salience Rank]

23 of 28

Empirical Cross-task Evaluation

  • Values across metrics fall below those achieved by some unsupervised baselines
  • This further experimentally demonstrates the distinction between information popularity and salience

Table 5: Cross-task evaluation − performance of BERTReg trained for popularity forecasting (PF) evaluated on salience prediction and vice-versa

24 of 28

Conclusions

  • We introduced the task of proactively forecasting sentence-specific information popularity
  • Contributed InfoPop, a dataset containing 51,770 news articles from 26 news websites with over 1.7 million sentences labeled with normalized popularity scores
  • Experimented with several baselines, demonstrated efficacy of our STILTs-based Transfer Learning approach involving an auxiliary supervised salience prediction task
  • Best models achieved nDCG values over 0.8 for sentence popularity forecasting
  • Interesting takeaway: though popularity forecasting and salience prediction are very different problems, transferring learnings from salience prediction enhances a model's popularity forecasting proficiency

25 of 28

Thank You

26 of 28

References

[1] Yaser Keneshloo, Shuguang Wang, E. Han, and Naren Ramakrishnan. 2016. Predicting the Popularity of News Articles. In SDM.

[2] Sotiris Lamprinidis, Daniel Hardt, and Dirk Hovy. 2018. Predicting News Headline Popularity with Syntactic and Semantic Knowledge Using Multi-Task Learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 659–664. https://doi.org/10.18653/v1/D18-1068

[3] Nuno Moniz and Luís Torgo. 2018. Multi-Source Social Feedback of Online News Feeds. CoRR abs/1801.07055 (2018).

[4] Georgios Rizos, Symeon Papadopoulos, and Yiannis Kompatsiaris. 2016. Predicting News Popularity by Mining Online Discussions. In Proceedings of the 25th International Conference Companion on World Wide Web (Montréal, Québec, Canada) (WWW ’16 Companion). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 737–742. https://doi.org/10.1145/2872518.2890096

[5] Alexandru Tatar, Panayotis Antoniadis, Marcelo Dias de Amorim, and Serge Fdida. 2012. Ranking News Articles Based on Popularity Prediction. In 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 106–110. https://doi.org/10.1109/ASONAM.2012.28

[6] Md. Taufeeq Uddin, Muhammed Jamshed Alam Patwary, Tanveer Ahsan, and Mohammed Shamsul Alam. 2016. Predicting the popularity of online news from content metadata. In 2016 International Conference on Innovations in Science, Engineering and Technology (ICISET). 1–5. https://doi.org/10.1109/ICISET.2016.7856498

27 of 28

References

[7] Anton Voronov, Yao Shen, and Pritom Kumar Mondal. 2019. Forecasting Popularity of News Article by Title Analyzing with BN-LSTM Network. In Proceedings of the 2019 International Conference on Data Mining and Machine Learning (Hong Kong, Hong Kong) (ICDMML 2019). Association for Computing Machinery, New York, NY, USA, 19–27. https://doi.org/10.1145/3335656.3335679

[8] Anthony Chen, Pallavi Gudipati, Shayne Longpre, Xiao Ling, and Sameer Singh. 2021. Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 4472–4485. https://doi.org/10.18653/v1/2021.acl-long.345

[9] Mohamed Ahmed, Stella Spagna, Felipe Huici, and Saverio Niccolini. 2013. A Peek into the Future: Predicting the Evolution of Popularity in User Generated Content. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (Rome, Italy) (WSDM ’13). Association for Computing Machinery, New York, NY, USA, 607–616. https://doi.org/10.1145/2433396.2433473

[10] Alexander Pugachev, Anton Voronov, and Ilya Makarov. 2020. Prediction of News Popularity via Keywords Extraction and Trends Tracking. Recent Trends in Analysis of Images, Social Networks and Texts 1357 (2020), 37 – 51.

[11] Yun-Zhu Song, Hong-Han Shuai, Sung-Lin Yeh, Yi-Lun Wu, Lun-Wei Ku, and Wen-Chih Peng. 2020. Attractive or Faithful? Popularity-Reinforced Learning for Inspired Headline Generation. Proceedings of the AAAI Conference on Artificial Intelligence 34, 05 (Apr. 2020), 8910–8917. https://doi.org/10.1609/aaai.v34i05.6421

28 of 28

References

[12] Shivashankar Subramanian, Timothy Baldwin, and Trevor Cohn. 2018. Content-based Popularity Prediction of Online Petitions Using a Deep Regression Model. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Melbourne, Australia, 182–188. https://doi.org/10.18653/v1/P18-2030

[13] Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2016. SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents. arXiv preprint arXiv:1611.04230 (2016).

[14] Ruipeng Jia, Yanan Cao, Haichao Shi, Fang Fang, Yanbing Liu, and Jianlong Tan. 2020. DistilSum: Distilling the Knowledge for Extractive Summarization. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (Virtual Event, Ireland) (CIKM ’20). Association for Computing Machinery, New York, NY, USA, 2069–2072. https://doi.org/10.1145/3340531.3412078

[15] Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching Machines to Read and Comprehend. In NIPS. 1693–1701. http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend

[16] Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks. ArXiv abs/1811.01088 (2018).

[17] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. arXiv:2004.10964 [cs.CL]

[18] Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Extractive Summarization as Text Matching. In Proceedings of the 58th Annual Meeting of the ACL. ACL, Online, 6197–6208. https://doi.org/10.18653/v1/2020.acl-main.552