1 of 28

Towards Proactively Forecasting Sentence-Specific Information Popularity within Online News Documents

Sayar Ghosh Roy 1, Anshul Padhi 1, Risubh Jain 1, Manish Gupta 1, 2, Vasudeva Varma 1

International Institute of Information Technology, Hyderabad 1

Microsoft, India 2

Proceedings of the 33rd ACM Conference on Hypertext and Social Media (HT '22)

June 28 – July 1, 2022, Barcelona, Spain

2 of 28

Introduction

  • Past studies on popularity prediction have focused exclusively on document-level labeling
  • We introduce the task of proactively forecasting popularities of sentences within online news documents solely utilizing their natural language content
  • We curate InfoPop, the first dataset with popularity labels for over 1.7M sentences from over 50K online news documents
  • We propose a novel transfer learning approach involving sentence salience prediction as an auxiliary task
  • We present a non-trivial takeaway: though popularity and salience are different concepts, transfer learning from salience prediction enhances popularity forecasting

3 of 28

Related Work on Document-level Popularity Prediction

Two types of popularity prediction problems based on choice of popularity surrogate:

  • Popularity based on Internet browsing: Considers the #received page load requests over sufficient time (pageviews or hit count) as information popularity surrogate [1, 2, 7]
  • Social Media Popularity: Estimates prospective engagement of a piece of content put up on a particular social media platform utilizing user-behavior based markers like #comments [4, 5], #shares [3, 6], etc. as surrogates of social-media popularity

For informative documents (including online news), a preferred surrogate of popularity has been pageview hits [8, 10, 11]

Intuitively, pageviews capture the generic browsing trends of the population, not limited to social media actions

4 of 28

Sentence-Specific Information Popularity

  • Document popularity based on pageviews captures the amount of notice a document receives on the Internet [2, 7]
    • Consequence of the everyday Internet browsing actions of world population
  • With increased Internet penetration, average #queries encountered daily by commercial search engines has exceeded the billion mark [source]
    • Google Trends ~ shows popular topics of interest segregated by region & timespan
      • Collection of worldwide queries to mark information of universal interest
  • Key Principle: Within local context of a news document D, if information piece I1 is more queried-after by global population than piece I2, then: popularity(I1) > popularity(I2)

5 of 28

Assigning Sentence-Specific Popularity Labels

News document D with sentences [s1, s2, ..., sN]

  • Comparing each sentence si to every search query encountered by a commercial search engine would be computationally infeasible
  • Only a negligibly small % of all encountered queries would positively contribute to sentences' popularities within D
    • Queries for which D could be deemed relevant
  • Filter incoming queries to consider the sublist Q = [q1, q2, ..., q|Q|] for which D was deemed significantly relevant
    • i.e., D was shown within the top 10 search results when the search engine encountered a query q ∈ Q

6 of 28

Task Description

Inspired by research on document popularity prediction [1, 12]

  • We frame sentence-specific popularity forecasting as a regression task
  • Not a simple binary classification of documents’ sentences into popular or not

Some document-popularity forecasting approaches rely on post-publication signals like pageview hits in the first half-hour after publication [1, 9], but

  • We forecast sentence popularity scores proactively (before a document's publication)
    • We utilize incoming search queries only to define sentence-specific popularity and construct InfoPop

7 of 28

Task Formulation

Query-insensitive relative normalized scoring of sentences

  • Given a document as input
  • Assign normalized score in range [0, 1] to every sentence
  • Indicating their intra-document relative information popularity
  • Without utilizing any external signals

Figure 1: Task outline. Looking solely at document text as input, forecast prospective sentence-specific popularity scores

8 of 28

InfoPop Dataset

  • Collected and preprocessed news documents from 26 reputed online news websites
  • Created a set of over 50K cleaned news documents

Figure 2: InfoPop: news sources (on x-axis) with their corresponding #documents (on y-axis)

9 of 28

InfoPop Dataset

  • Accessed incoming queries from Microsoft Bing
  • Mapped each document to the global assemblage of queries that deemed it significantly relevant
  • Using definition of sentence-specific information popularity, assigned popularity scores to over 1.7M sentences
  • Used cosine-similarity between corresponding TF-IDF vectors as measure of query-sentence similarity
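
A minimal sketch of this labeling step, assuming scikit-learn's TfidfVectorizer. Summing similarities over all relevant queries and min-max normalizing per document are illustrative assumptions here, not the exact InfoPop recipe:

```python
# Sketch: assign per-sentence popularity labels from a document's
# relevant queries via TF-IDF cosine similarity. Aggregation (sum over
# queries) and per-document min-max normalization are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def popularity_labels(sentences, queries):
    vec = TfidfVectorizer()
    vec.fit(sentences + queries)      # shared vocabulary for both sides
    S = vec.transform(sentences)      # |sentences| x vocab
    Q = vec.transform(queries)        # |queries| x vocab
    sims = cosine_similarity(S, Q)    # query-sentence similarity matrix
    raw = sims.sum(axis=1)            # aggregate similarity per sentence
    lo, hi = raw.min(), raw.max()     # min-max normalize to [0, 1]
    return [float((r - lo) / (hi - lo)) if hi > lo else 0.0 for r in raw]
```

Sentences lexically close to many relevant queries end up near 1, unrelated sentences near 0.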

Figure 3: InfoPop: Distribution of #sentences per document

10 of 28

Proposed Approach

  • Observation: sentence salience and sentence popularity, though quite different, have certain similarities
    • Salience is well studied in context of text summarization
      • Enjoys availability of sizeable amounts of news domain data
  • Hypothesis: Transfer Learning from auxiliary task of salience prediction would boost neural models' popularity forecasting ability
  • STILTs-based (Supplementary Training on Intermediate Labeled-data Tasks) [16] Transfer Learning (TL) setup using a constructed task of sentence salience prediction
    • Pre-train a neural architecture on a supervised TL subtask
    • Fine-tune the model for popularity forecasting
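
A schematic of the two-stage STILTs recipe, with a 1-D linear regressor standing in for the neural architecture (purely illustrative; BERTReg/BaseReg are the actual models):

```python
# Sketch of STILTs-style transfer learning: the SAME model is first
# trained on the auxiliary salience task, then fine-tuned on popularity
# forecasting, both under an MSE objective. The scalar-weight "model"
# is a toy stand-in for a neural sentence sequence regressor.

def train(w, xs, ys, lr=0.01, epochs=200):
    """Gradient descent on MSE for a 1-D linear model y = w * x."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

def stilts(salience_data, popularity_data):
    w = 0.0                           # "randomly" initialized weights
    w = train(w, *salience_data)      # stage 1: auxiliary salience task
    w = train(w, *popularity_data)    # stage 2: fine-tune on popularity
    return w
```

The point of the recipe is that stage 2 starts from salience-informed weights rather than from scratch.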

11 of 28

Salience Prediction versus Text Summarization

  • In document summarization, a sentence is salient if it contains information related to primary document semantics & can be included in its summary [13]
    • Given article plus gold summary, salience of a sentence = ROUGE overlap between sentence and summary
  • In a typical extractive setting, binary summary-inclusion labels are computed for sentences greedily to maximize ROUGE overlap between complete oracle and true summary
    • Implicit minimization of information redundancy
    • If a lexically similar sentence was previously included in oracle, a salient sentence might receive a summary-inclusion label of 0 [14]

We only capture salience & expect very similar sentences to have similar labels
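
For contrast, the typical greedy oracle construction described above can be sketched as follows (`rouge` is any candidate-vs-summary overlap scorer, a stand-in here; this is the setup we avoid, not our labeling):

```python
# Sketch of greedy extractive oracle construction: sentences are added
# one at a time, each round keeping the sentence that most improves
# ROUGE overlap of the oracle with the true summary; stop when no
# sentence improves the score.

def greedy_oracle(sentences, summary, rouge):
    oracle_idx, best = [], 0.0
    while True:
        gains = [(rouge(" ".join(sentences[j] for j in oracle_idx) + " " + s,
                        summary), i)
                 for i, s in enumerate(sentences) if i not in oracle_idx]
        if not gains:
            break
        score, i = max(gains)
        if score <= best:             # no improvement -> stop
            break
        best = score
        oracle_idx.append(i)
    return [1 if i in oracle_idx else 0 for i in range(len(sentences))]
```

With duplicate sentences, only one copy receives a summary-inclusion label of 1; the other is skipped despite being equally salient, which is exactly the redundancy effect noted above.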

12 of 28

Auxiliary Subtasks

  • Utilized publicly available CNN-DailyMail news summarization dataset (splits from [15])
  • Packaged auxiliary task as a sentence sequence regression problem
  • Train architectures adapting STILTs TL
  • Perform empirical cross-task evaluation
  • Compute 3 weakly supervised salience scores for sentences based on ROUGE 1, 2, and L overlaps with document summary
  • Create 3 auxiliary subtasks (due to the 3 labeling schemes), tag as S1, S2, and SL
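
A simplified version of this weak labeling, with a plain n-gram F1 standing in for a full ROUGE implementation:

```python
# Sketch of weak salience labeling: score each sentence by its ROUGE-n
# overlap with the reference summary. Simplified n-gram F1 stands in
# for proper ROUGE; one labeling run per n gives subtasks S1 and S2.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(sentence, summary, n):
    s = ngrams(sentence.lower().split(), n)
    r = ngrams(summary.lower().split(), n)
    overlap = sum((s & r).values())        # multiset n-gram intersection
    if not overlap:
        return 0.0
    prec, rec = overlap / sum(s.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

def salience_labels(sentences, summary, n=1):
    return [rouge_n_f1(sent, summary, n) for sent in sentences]
```

Unlike the greedy oracle, lexically similar sentences here receive similar labels.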

Table 1: Selected sentences (in order) from a document in Sentence Salience Prediction dataset with 3 types of salience labels based on ROUGE 1, ROUGE 2, and ROUGE L

13 of 28

Neural Architectures

  • Sentence-specific information popularity forecasting is framed as a sequence regression task, where a sentence's score is relative to the complete article
    • Effective neural models require global context
  • Design neural sentence sequence regression architectures
    • BaseReg: Rudimentary neural baseline using CNNs and RNNs
    • BERTReg: BERT-based scoring of documents’ sentences
  • MSE (Mean Squared Error) loss between true and inferred sentence scores for training
  • Both BERTReg and BaseReg handle arbitrarily large documents by employing a sliding window mechanism over a document's sentences with a preset stride
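
The windowing logic can be sketched as follows (`score_window`, the window size, and the stride are placeholders for the model and its hyperparameters):

```python
# Sketch of the sliding-window mechanism: score overlapping windows of
# sentences with a fixed stride, then average predictions for sentences
# covered by more than one window. `score_window` stands in for the
# neural sentence sequence regressor.

def sliding_window_scores(sentences, score_window, window=4, stride=2):
    n = len(sentences)
    totals, counts = [0.0] * n, [0] * n
    start = 0
    while True:
        end = min(start + window, n)
        for i, s in zip(range(start, end), score_window(sentences[start:end])):
            totals[i] += s            # accumulate overlapping predictions
            counts[i] += 1
        if end == n:                  # last window reached document end
            break
        start += stride
    return [t / c for t, c in zip(totals, counts)]
```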

14 of 28

Neural Architectures

Figure 4: BaseReg: RNN for Sentence Sequence Regression

Figure 5: BERTReg: BERT for Sentence Sequence Regression

15 of 28

Evaluation Metrics

  • Top-k overlap: |Ak ∩ Pk| / k, where Ak and Pk are the sets of actual and predicted top-k highest scored sentences, respectively
  • Regression errors: Mean Squared Error (MSE) & Mean Absolute Error (MAE) between arrays of actual and predicted sentence labels
  • Rank Correlation Metrics: Spearman's rank correlation (ρ) and Kendall's Tau (𝜏). ρ, 𝜏 ∈ [-1,1]
  • nDCG: Captures the normalized gain or usefulness of a sentence based on both its position in the inferred rank list and its actual score. nDCG ∈ [0, 1]
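
Illustrative implementations of two of these metrics, top-k overlap and rank-list nDCG over the actual sentence scores (a sketch, not the exact evaluation code):

```python
# Sketch of two evaluation metrics: top-k overlap |A_k ∩ P_k| / k, and
# nDCG of the predicted sentence ranking against the actual scores.
import math

def top_k_overlap(actual, predicted, k):
    top = lambda scores: set(
        sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k])
    return len(top(actual) & top(predicted)) / k

def ndcg(actual, predicted):
    # rank sentences by predicted score, gain = actual score
    order = sorted(range(len(predicted)), key=predicted.__getitem__, reverse=True)
    dcg = sum(actual[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = sum(s / math.log2(rank + 2)
                for rank, s in enumerate(sorted(actual, reverse=True)))
    return dcg / ideal if ideal else 0.0
```

A perfect ranking gives nDCG = 1; any misordering of high-scored sentences lowers it.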

16 of 28

Sentence-Specific Popularity Forecasting Results

Table 2: Sentence Popularity Forecasting Results

17 of 28

Sentence-Specific Popularity Forecasting Results

  • BERTReg ~ best architecture for the task
  • Performance upgrades upon employment of the Transfer Learning setup
    • BERTReg with TL = SL boosts average nDCG by over 1% over vanilla BERTReg and by over 7% over the best unsupervised baseline
    • t-test: BERTReg with TL = SL significantly outperforms vanilla BERTReg on nDCG at significance level p < 0.01
  • BaseReg ~ ineffective for popularity forecasting, but TL setup is a positive addition

18 of 28

Performance Enhancement due to Transfer Learning

  • TL from Salience Prediction improves Popularity Forecasting

Attribute the performance enhancement to 2 factors

  • Datasets used for both tasks are sourced from online news
    • TL allows model to witness more domain-specific data [17]
  • Though popularity forecasting differs from salience prediction, they have certain similarities like
    • Penalizing sentences that are not lexically dense
    • Recognizing that certain sentences do not carry any notable information

19 of 28

Performance of Approaches on Auxiliary Subtasks

Table 3: Performance of various approaches on auxiliary transfer learning subtasks (S1, S2, SL)

20 of 28

Task Comparison

  • Sentence ranking baselines ~ more capable of capturing salience than forecasting information popularity
  • Salience of initial sentences in news articles is typically higher than their popularity
    • Position baseline ~ scores for salience prediction are much greater than for popularity forecasting
  • Predicting sentence salience is less complicated than forecasting information popularity
    • Supervised salience prediction models achieve much better results on ρ, 𝜏, and nDCG compared to popularity forecasting models with same architecture
    • BaseReg, which performed poorly on popularity forecasting, achieves respectable scores for salience prediction

21 of 28

Popularity and Salience

Salient sentences capture summary-inclusion worthy ideas that are central to core semantics of an article [18]

  • An information piece might deviate significantly from an article's primary topic yet still be popular

Consider a sentence from a particular article (with ID 34499) within InfoPop: “Weinsheimer has spent 27 years at DOJ, where he tried homicide and public corruption cases.”

  • Not salient enough for inclusion in article summary as it is barely related to the article's core topic (Scott Schools' resignation)
  • But, contains one of the most popular information pieces in the document

22 of 28

Popularity and Salience

Table 4: Selected sentences from a document in InfoPop with their true and forecasted popularities and predicted salience. Popularity forecasts are from our best performing model on nDCG (BERTReg with TL = SL). Salience predictions are based on BERTReg trained on S1.

[TPL: True Popularity Label, FPL: Forecasted Popularity Label, PSL: Predicted Salience Label, TPR: True Popularity Rank, FPR: Forecasted Popularity Rank, PSR: Predicted Salience Rank]

23 of 28

Empirical Cross-task Evaluation

  • Values across metrics fall below those achieved by some unsupervised baselines
  • This further experimentally demonstrates the distinction between information popularity and salience

Table 5: Cross-task evaluation − performance of BERTReg trained for popularity forecasting (PF) evaluated on salience prediction and vice-versa

24 of 28

Conclusions

  • We introduced the task of proactively forecasting sentence-specific information popularity
  • Contributed InfoPop, a dataset containing 51,770 news articles from 26 news websites with over 1.7 million sentences labeled with normalized popularity scores
  • Experimented with several baselines, demonstrated efficacy of our STILTs-based Transfer Learning approach involving an auxiliary supervised salience prediction task
  • Best models achieved nDCG values over 0.8 for sentence popularity forecasting
  • Interesting takeaway: though popularity forecasting and salience prediction are very different problems, transferring learnings from salience prediction enhances a model's popularity forecasting proficiency

25 of 28

Thank You

26 of 28

References

[1] Yaser Keneshloo, Shuguang Wang, E. Han, and Naren Ramakrishnan. 2016. Predicting the Popularity of News Articles. In SDM.

[2] Sotiris Lamprinidis, Daniel Hardt, and Dirk Hovy. 2018. Predicting News Headline Popularity with Syntactic and Semantic Knowledge Using Multi-Task Learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 659–664. https://doi.org/10.18653/v1/D18-1068

[3] Nuno Moniz and Luís Torgo. 2018. Multi-Source Social Feedback of Online News Feeds. CoRR abs/1801.07055 (2018).

[4] Georgios Rizos, Symeon Papadopoulos, and Yiannis Kompatsiaris. 2016. Predicting News Popularity by Mining Online Discussions. In Proceedings of the 25th International Conference Companion on World Wide Web (Montréal, Québec, Canada) (WWW ’16 Companion). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 737–742. https://doi.org/10.1145/2872518.2890096

[5] Alexandru Tatar, Panayotis Antoniadis, Marcelo Dias de Amorim, and Serge Fdida. 2012. Ranking News Articles Based on Popularity Prediction. In 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 106–110. https://doi.org/10.1109/ASONAM.2012.28

[6] Md. Taufeeq Uddin, Muhammed Jamshed Alam Patwary, Tanveer Ahsan, and Mohammed Shamsul Alam. 2016. Predicting the popularity of online news from content metadata. In 2016 International Conference on Innovations in Science, Engineering and Technology (ICISET). 1–5. https://doi.org/10.1109/ICISET.2016.7856498

27 of 28

References

[7] Anton Voronov, Yao Shen, and Pritom Kumar Mondal. 2019. Forecasting Popularity of News Article by Title Analyzing with BN-LSTM Network. In Proceedings of the 2019 International Conference on Data Mining and Machine Learning (Hong Kong, Hong Kong) (ICDMML 2019). Association for Computing Machinery, New York, NY, USA, 19–27. https://doi.org/10.1145/3335656.3335679

[8] Anthony Chen, Pallavi Gudipati, Shayne Longpre, Xiao Ling, and Sameer Singh. 2021. Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 4472–4485. https://doi.org/10.18653/v1/2021.acl-long.345

[9] Mohamed Ahmed, Stella Spagna, Felipe Huici, and Saverio Niccolini. 2013. A Peek into the Future: Predicting the Evolution of Popularity in User Generated Content. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (Rome, Italy) (WSDM ’13). Association for Computing Machinery, New York, NY, USA, 607–616. https://doi.org/10.1145/2433396.2433473

[10] Alexander Pugachev, Anton Voronov, and Ilya Makarov. 2020. Prediction of News Popularity via Keywords Extraction and Trends Tracking. Recent Trends in Analysis of Images, Social Networks and Texts 1357 (2020), 37 – 51.

[11] Yun-Zhu Song, Hong-Han Shuai, Sung-Lin Yeh, Yi-Lun Wu, Lun-Wei Ku, and Wen-Chih Peng. 2020. Attractive or Faithful? Popularity-Reinforced Learning for Inspired Headline Generation. Proceedings of the AAAI Conference on Artificial Intelligence 34, 05 (Apr. 2020), 8910–8917. https://doi.org/10.1609/aaai.v34i05.6421

28 of 28

References

[12] Shivashankar Subramanian, Timothy Baldwin, and Trevor Cohn. 2018. Content-based Popularity Prediction of Online Petitions Using a Deep Regression Model. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Melbourne, Australia, 182–188. https://doi.org/10.18653/v1/P18-2030

[13] Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2016. SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents. arXiv preprint arXiv:1611.04230 (2016).

[14] Ruipeng Jia, Yanan Cao, Haichao Shi, Fang Fang, Yanbing Liu, and Jianlong Tan. 2020. DistilSum: Distilling the Knowledge for Extractive Summarization. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (Virtual Event, Ireland) (CIKM ’20). Association for Computing Machinery, New York, NY, USA, 2069–2072. https://doi.org/10.1145/3340531.3412078

[15] Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching Machines to Read and Comprehend. In NIPS. 1693–1701. http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend

[16] Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks. ArXiv abs/1811.01088 (2018).

[17] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. arXiv:2004.10964 [cs.CL]

[18] Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Extractive Summarization as Text Matching. In Proceedings of the 58th Annual Meeting of the ACL. ACL, Online, 6197–6208. https://doi.org/10.18653/v1/2020.acl-main.552