1 of 24

Query-Document Topic Mismatch Detection

Sahil Chelaramani, Ankush Chatterjee, Sonam Damani, Kedhar Nath Narahari, Meghana Joshi, Manish Gupta and Puneet Agrawal

gmanish@microsoft.com

2 of 24

Why is query-document topic mismatch detection important?

  • Web ranking signals
    • query-document relevance match
    • document authority
    • document freshness
    • document credibility
    • query-document topic match
    • location match for location sensitive queries
    • ...
  • Documents that are topically similar to the query are preferable.

3 of 24

Why does topic mismatch happen?

  • Because entities, intents or other qualifiers from the query are dropped, substituted, shuffled or added in the document.

[Figure legend] VU=Very Unsatisfactory, U=Unsatisfactory, N=Neutral, S=Satisfactory, VS=Very Satisfactory

4 of 24

Topic mismatch detection (TMD) problem

  • Given: query, document
  • Predict:
    • Very Satisfactory: the document is almost completely and almost exclusively about/related to an interpretation of the query; there is no topic mismatch.
    • Satisfactory: the document is mostly about/related to an interpretation of the query. It is not completely about the query: it has some content about other topics beyond the query, or it lacks some content about the query.
    • Neutral: the document is somewhat about/related to an interpretation of the query. It still has some content about the query, but much of its content is about other topics, or it lacks sufficient content about the query.
    • Unsatisfactory: the document is only superficially about/related to an interpretation of the query. It may contain some related keywords or thin content about the query, but it is hard to see a useful connection; it is not, however, completely unrelated.
    • Very Unsatisfactory: the document is off-topic and not about/related to any possible interpretation of the query, resulting in a complete topic mismatch.

5 of 24

Naïve first-cut solutions

  • Using the query-URL clickthrough graph
    • But click signals are very sparse.
  • Run topic modeling on both the query and the document, and build models that flag topic mismatch from the discrepancy between the two topic distributions (a sketch follows below).
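A minimal sketch of the topic-distribution discrepancy idea, assuming the query and document distributions live in a shared topic space; the toy vectors and the Jensen-Shannon choice are illustrative, not the paper's exact method:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def topic_mismatch_score(query_topics: np.ndarray, doc_topics: np.ndarray) -> float:
    """Discrepancy between a query's and a document's topic distributions.
    Both inputs are probability vectors over the same topics; a higher
    score suggests a stronger topic mismatch."""
    # Jensen-Shannon distance is symmetric and bounded in [0, 1].
    return float(jensenshannon(query_topics, doc_topics))

# Toy distributions over 5 shared topics.
query = np.array([0.70, 0.20, 0.05, 0.03, 0.02])
on_topic_doc = np.array([0.65, 0.25, 0.05, 0.03, 0.02])
off_topic_doc = np.array([0.02, 0.03, 0.05, 0.20, 0.70])
print(topic_mismatch_score(query, on_topic_doc))   # small -> likely on topic
print(topic_mismatch_score(query, off_topic_doc))  # large -> likely mismatch
```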

6 of 24

Deep learning based semantic methods

  • BiLSTMs
  • finetuning the pretrained BERT on a ranking corpus
  • further finetuning for topic mismatch prediction
  • novel modification (task-specific input smoothing) for BERT
  • Richer document representation: document metadata (like title, URL or snippet) and signals from the entire content (like topics or key-phrases).

7 of 24

BiLSTMs

  • BiLSTM Baseline
    • four bidirectional LSTMs with attention
    • No weight sharing across BiLSTMs
    • standard categorical cross entropy loss
  • BERT+TMD
    • query and document (title, URL, snippet, key-phrases) tokens are concatenated into a single sequence separated by [SEP] tokens
    • the [CLS] token's representation is fed to a dense layer and then to the output layer
    • finetune the pretrained model on TMD-task labeled data (a sketch follows below)
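A minimal sketch of the BERT+TMD input layout using Hugging Face transformers; the query, field values and key-phrases below are made up, and BertForSequenceClassification's pooled-[CLS]-plus-linear head stands in for the dense-then-output layers described above:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# 5 output classes: VeryUnsat, Unsat, Neutral, Sat, VerySat.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)

query = "iphone5 price"
# Document fields joined with [SEP], mirroring the slide's input layout
# (all field values here are hypothetical).
doc_fields = " [SEP] ".join([
    "iPhone 5 - Technical Specifications",          # title
    "support.apple.com/kb/sp655",                   # URL
    "iPhone 5 ... 4-inch Retina display ...",       # snippet
    "iphone 5, retina display, apple smartphone",   # comma-delimited key-phrases
])

# The tokenizer adds [CLS] at the start and a [SEP] between the two segments.
inputs = tokenizer(query, doc_fields, truncation=True, max_length=512,
                   return_tensors="pt")
logits = model(**inputs).logits        # [CLS] -> dense layer -> 5-way output
print(logits.softmax(dim=-1))          # class probabilities (before finetuning)
```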

8 of 24

Better pre-trained BERT model?

  • another closely related task: ranking.
    • classifying every query-document pair into one of M = {Perfect, Excellent, Good, Fair, Bad}
    • separate labeled dataset of ∼631K (query, document) pairs

    • query-document ranking has a strong positive correlation with topic-mismatch prediction
    • two-stage BERT finetuning
      • BERT+Ranking: first finetune BERT for the ranking task (on manually annotated ∼5.3M (query, document) pairs)
      • BERT+R+TMD: Then, finetune for the TMD task.

[Table] Correlation between the ranking task and the topic mismatch prediction task (in percentages).
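A schematic of the two-stage finetuning, assuming Hugging Face transformers; `ranking_batches` and `tmd_batches` are placeholder data loaders (left commented out), and swapping only the classification head between stages is one reasonable reading of the recipe:

```python
import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification

def finetune(model, batches, epochs=1, lr=2e-5):
    """Generic finetuning loop; `batches` yields dicts of model inputs
    including a 'labels' key, so model(**batch).loss is cross-entropy."""
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in batches:
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Stage 1 (BERT+Ranking): 5 ranking classes {Perfect, Excellent, Good, Fair, Bad}.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)
# finetune(model, ranking_batches)   # ~5.3M manually annotated (query, document) pairs

# Stage 2 (BERT+R+TMD): reinitialize only the classification head for the
# 5 TMD classes {VU, U, N, S, VS}; the encoder keeps what it learned from ranking.
model.classifier = torch.nn.Linear(model.config.hidden_size, 5)
# finetune(model, tmd_batches)       # TMD-labeled (query, document) pairs
```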

9 of 24

Point-wise input vs smoothed input

  • Each token in our (query, document) representation can be thought of as a point in H-dimensional space.
  • For better generalization to semantically related variants of these words, one would rather use a smoothed representation for each point.
  • This smoothing represents each point as a weighted sum of its own representation and the representations of its semantically related variants (a sketch follows below).
  • Consider the query “iphone5 price”.
    • “iphone6” can be considered a variant of “iphone5”.
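The paper trains its smoothed embeddings in a task-specific way; the sketch below only illustrates the weighted-summation idea, with similarity-based softmax weights as an assumed weighting scheme:

```python
import torch
import torch.nn.functional as F

def smooth_embedding(token_emb: torch.Tensor,
                     variant_embs: torch.Tensor,
                     temperature: float = 1.0) -> torch.Tensor:
    """Smooth a token's embedding toward its semantically related variants.

    token_emb:    (H,) embedding of the original token, e.g. "iphone5".
    variant_embs: (K, H) embeddings of variants, e.g. "iphone6", "iphone7".
    Weights come from similarity to the original token, so closer variants
    contribute more; the original token itself gets the largest weight."""
    candidates = torch.cat([token_emb.unsqueeze(0), variant_embs], dim=0)  # (K+1, H)
    sims = candidates @ token_emb / temperature   # similarity to the original
    weights = F.softmax(sims, dim=0)              # (K+1,), sums to 1
    return weights @ candidates                   # weighted sum -> (H,)

# Toy example with H=4: "iphone5" smoothed with two variants.
iphone5 = torch.tensor([1.0, 0.0, 0.5, 0.2])
variants = torch.stack([
    torch.tensor([0.9, 0.1, 0.5, 0.2]),   # "iphone6"
    torch.tensor([0.8, 0.2, 0.4, 0.3]),   # "iphone7"
])
print(smooth_embedding(iphone5, variants))
```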

10 of 24

BERT with Smoothed Input (SI)

[Figure: BERT with Smoothed Input architecture]

11 of 24

BERT with Key-phrases (KP) and Topic Distribution

  • BERT's quadratic self-attention complexity makes it unamenable to long documents.
  • Hence, key-phrases and a topic distribution are obtained from the document body.
  • Key-phrases
    • obtained from a KP extraction library
    • remove KPs that are subsumed by other KPs
    • remove KPs that are redundant with respect to the document URL, title or snippet
    • individual KPs for the document are delimited by commas
  • Topic modeling
    • LDA (Latent Dirichlet Allocation)
    • Settings: (1) train on concatenated query + document text versus train separately for queries and for documents; (2) vary the number of topics.
    • Best setting: train separately, with 10 topics for queries and 40 topics for documents (a sketch follows below).
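A sketch of the best LDA setting with gensim: separate models for queries (10 topics) and documents (40 topics). The two-example corpora here are toys; real training would use the full query and document collections:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def train_lda(texts, num_topics):
    """Train an LDA model on tokenized texts; returns (dictionary, model)."""
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=5)
    return dictionary, lda

# Separate models, as in the best setting: 10 topics for queries, 40 for documents.
query_texts = [q.lower().split() for q in
               ["iphone5 price", "apple smartphone cost"]]
doc_texts = [d.lower().split() for d in
             ["iphone 5 technical specifications retina display",
              "compare smartphone prices apple samsung"]]
q_dict, q_lda = train_lda(query_texts, num_topics=10)
d_dict, d_lda = train_lda(doc_texts, num_topics=40)

# Per-item topic distributions are then fed to the classifier as extra features.
bow = q_dict.doc2bow("iphone5 price".lower().split())
print(q_lda.get_document_topics(bow, minimum_probability=0.0))
```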

12 of 24

BERT with Key-phrases (KP) and Topic Distribution

[Figure: model architecture combining query, title, URL, snippet, key-phrase and topic-distribution inputs]

13 of 24

TMD Dataset

  • 490K queries sampled from Bing’s query log from Aug 2019 to Jan 2020
  • top five URLs for each query were judged
  • ∼2.43M query-document pairs.

14 of 24

Traditional Baselines

  • Clicks and Impressions (CI) based method
    • For every (query, document) pair, we compute three sets of features:
      • Frequency features: QC=total clicks for query, QI=total impressions for query, DC=total clicks for document, DI=total impressions for document.
      • Normalized features: QNC=p(document clicked|query), QNI=p(document shown|query), DNC=p(query|document clicked), DNI=p(query|document shown).
      • CTR=click-through rate.
    • On these features we train a 2-layer MLP (with ReLU non-linearity); a sketch follows below.
  • Topic models based method

  • CI: providing normalized features with CTR helps.
  • Topic modeling is better than CI.
  • 10 topics for queries and 40 for documents works best.
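A minimal PyTorch sketch of the CI baseline; the count values and the exact probability estimates are assumptions, and the hidden width is arbitrary:

```python
import torch
import torch.nn as nn

class CIClassifier(nn.Module):
    """2-layer MLP over clicks-and-impressions features."""
    def __init__(self, in_dim=9, hidden=32, num_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)

def ci_features(qc, qi, dc, di, pair_clicks, pair_impr):
    """Frequency, normalized, and CTR features for one (query, document) pair.
    Counts here are hypothetical; real values come from the search log."""
    eps = 1e-9
    return torch.tensor([
        qc, qi, dc, di,                    # frequency: QC, QI, DC, DI
        pair_clicks / (qc + eps),          # QNC ~ p(document clicked | query)
        pair_impr / (qi + eps),            # QNI ~ p(document shown | query)
        pair_clicks / (dc + eps),          # DNC ~ p(query | document clicked)
        pair_impr / (di + eps),            # DNI ~ p(query | document shown)
        pair_clicks / (pair_impr + eps),   # CTR
    ], dtype=torch.float32)

x = ci_features(qc=120, qi=900, dc=300, di=2500, pair_clicks=40, pair_impr=200)
logits = CIClassifier()(x.unsqueeze(0))
print(logits.softmax(dim=-1))              # 5-class probabilities (untrained)
```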

15 of 24

Overall Main Results

  • LDA topics lead to significant gains; KP typically leads only to minor gains that are not statistically significant, while CI features do not lead to any gains.
  • BiLSTM and BERT based methods significantly outperform the traditional clicks-and-impressions and topic-model based methods.
  • BERT models perform better than BiLSTMs.
  • BERT+Ranking performs worse than BERT+TMD.

16 of 24

Overall Main Results

  • BERT finetuned first for ranking and then for topic mismatch performs better than BERT+TMD.
  • Adding LDA topics to the BERT+Ranking+TMD+QTSU model improves the results (rows 6 and 9).
  • BERT+SI > BERT (rows 6 and 8).
  • BERT+SI+R+TMD with QTSU+LDATD+KP (row 13) leads to the best results; adding CI features does not help (row 14).

17 of 24

Overall Main Results

  • BERT-Large models (rows 15-18) lead to better results than BERT-Base (rows 4-14).
  • The best results are obtained using the BERT-Large (24-layer) model finetuned on both the ranking and TMD tasks, using QTSU+KP+LDA topics as features (row 18).

  • Note: the BERT-Large model needs almost double the RAM and inference latency of the BERT-Base model in row 13.

18 of 24

Error analysis

  • The system works best for the extreme VerySat and VeryUnsat classes, and worst for the Sat class.
  • The model misclassifies many instances of other classes as Sat, leading to poor recall for those classes.
  • The best precision is for the Sat class, while the best recall is for the VerySat class.
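Per-class precision and recall (and the weighted F1 reported in the take-aways) can be read off sklearn's classification report; the gold and predicted labels below are made up for illustration:

```python
from sklearn.metrics import classification_report

# Hypothetical gold and predicted labels for the 5 TMD classes.
labels = ["VU", "U", "N", "S", "VS"]
y_true = ["VS", "S", "N", "U", "VU", "S", "N", "VS"]
y_pred = ["VS", "S", "S", "S", "VU", "S", "N", "VS"]

# Per-class precision/recall plus accuracy and weighted-average F1.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```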

19 of 24

[Figure] Examples of query-URL instances correctly (top) / incorrectly (bottom) predicted by our model.

20 of 24

Related Work

  • Topic Analysis for Queries
    • Topic shift detection in user sessions
      • Useful for session segmentation [9, 14]
    • Query intent detection
      • Good survey: Brenes et al. [3]
      • Deep learning methods [11]
    • Query reformulation
    • For TMD, understanding query intent is essential but not sufficient.
  • Document Representation
    • bag of words and Term Frequency-Inverse Document Frequency (TF-IDF)
    • Topic models [13] like Latent Dirichlet Allocation (LDA) [2]
    • average of word embeddings obtained using word2vec [19].
    • distributed embedding models like Doc2Vec [15], Universal Sentence Encoder [5] and InferSent [7]
    • hierarchical neural network models [16, 22]
    • BERT [8]

21 of 24

Take-aways

  • Proposed query-document TMD problem.
  • Investigated the effectiveness of standard BERT model using comprehensive query and document representations.
    • Queries: query words and topic distribution using LDA
    • Documents: URL, title, snippet, key-phrases and topic distribution using LDA.
  • Proposed a novel task-specific way to train smoothed embeddings.
  • The best BERT model is finetuned for both the ranking and TMD tasks.
  • AUC of 0.75, accuracy of ∼48% and a weighted F1 of ∼44% for the 5-class prediction task.

22 of 24

References

1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)

2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. JMLR 3(Jan), 993–1022 (2003)

3. Brenes, D.J., Gayo-Avello, D., Pérez-González, K.: Survey and evaluation of query intent detection methods. In: Workshop on Web Search Click Data. pp. 1–7 (2009)

4. Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to rank: from pairwise approach to listwise approach. In: Proceedings of the 24th international conference on Machine learning. pp. 129–136 (2007)

5. Cer, D., Yang, Y., Kong, S.y., Hua, N., Limtiaco, N., John, R.S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al.: Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018)

6. Chapelle, O., Chang, Y.: Yahoo! learning to rank challenge overview. In: Proceedings of the learning to rank challenge. pp. 1–24 (2011)

7. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., Bordes, A.: Supervised learning of universal sentence representations from natural language inference data. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 670–680 (2017)

8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

9. Gayo-Avello, D.: A survey on session detection methods in query logs and a proposal for future evaluation. Information Sciences 179(12), 1822–1843 (2009)

10. Gupta, M., Agrawal, P.: Compression of deep learning models for text: A survey. arXiv preprint arXiv:2008.05221 (2020)

11. Hashemi, H.B., Asiaee, A., Kraft, R.: Query intent detection using convolutional neural networks. In: Intl. Conf. on Web Search and Data Mining, Workshop on Query Understanding (2016)

23 of 24

References

12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)

13. Jelodar, H., Wang, Y., Yuan, C., Feng, X., Jiang, X., Li, Y., Zhao, L.: Latent dirichlet allocation (lda) and topic modeling: models, applications, a survey. Multimedia Tools and Applications 78(11), 15169–15211 (2019)

14. Jones, R., Klinkner, K.L.: Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. In: CIKM. pp. 699–708 (2008)

15. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International conference on machine learning. pp. 1188–1196 (2014)

16. Li, J., Luong, M.T., Jurafsky, D.: A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057 (2015)

17. Li, X., de Rijke, M.: Do topic shift and query reformulation patterns correlate in academic search? In: ECIR. pp. 146–159. Springer (2017)

18. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

19. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. pp. 3111–3119 (2013)

20. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NIPS. pp. 5998–6008 (2017)

21. Xiong, L., Hu, C., Xiong, C., Campos, D., Overwijk, A.: Open domain web keyphrase extraction beyond language modeling. In: EMNLP-IJCNLP. pp. 5178–5187 (2019)

22. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention networks for document classification. In: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies. pp. 1480–1489 (2016)

24 of 24

Thanks!