1 of 26

Leveraging Site Search Logs to Identify Missing Content on Enterprise Webpages

Harsh Jhamtani, Rishiraj Saha Roy, Niyati Chhaya, Eric Nyberg

2 of 26

MOTIVATION


3 of 26

Motivation: Site Search

  • I want to navigate to the page about ‘X’. Navigating through menus and options is a pain, so let me use the site search box.

  • I expected information about ‘X’ on this page, but it is not there, so let me use the site search box.


4 of 26

Leveraging Site Search Logs to Identify Missing Content Issues and Suggest Rectifications/Updates for Corresponding Webpages

5 of 26

Data set

(query, referral webpage) pairs extracted from site search logs

Public example of such data: https://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/

6 of 26

Naive Methods

Sort (query, referral webpage) tuples by query count

For a given webpage, sort its tuples by query count

However, such analysis can be misleading.

7 of 26

Naive Methods


Navigation vs. Missing Content

Query                  Percentage count
Illustrator download   5.0%
Flash download         6.9%
Illustrator pricing    5.9%

For http://www.adobe.com/products/illustrator.html there was no link to download or pricing information, yet ‘Flash download’ was still the most popular query.

8 of 26

SOLUTION


9 of 26

Assumptions

Navigation: the distribution of queries is independent of the referral webpage,

i.e., P(Q = q | W = w) = P(Q = q)

We use this assumption to disambiguate navigation-intent queries from missing-content queries.

10 of 26

Pearson Residual

Query counts were found to follow Poisson distributions.

Following from the previous assumption, the expected count of query i from webpage j is

  E[i][j] = (Σ_j C[i][j]) · (Σ_i C[i][j]) / N

Standardized Pearson residual:

  e[i][j] = (C[i][j] − E[i][j]) / sqrt(E[i][j] · (1 − p_i) · (1 − p_j))

  • C[i][j]: count of query i from webpage j
  • N: total count over all (query, webpage) pairs
  • p_i, p_j: row (query) and column (webpage) proportions of the total count
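The residual computation can be sketched in a few lines of NumPy; the counts and the threshold δ below are toy values for illustration, not the paper's data:

```python
import numpy as np

# Toy (query x webpage) count matrix C: rows are queries, columns are referral pages.
C = np.array([[40.0,  5.0],
              [10.0, 50.0],
              [25.0, 20.0]])

N = C.sum()
row = C.sum(axis=1, keepdims=True)   # per-query totals
col = C.sum(axis=0, keepdims=True)   # per-webpage totals

# Expected counts under the independence (pure navigation) assumption
E = row @ col / N
p_i, p_j = row / N, col / N

# Standardized Pearson residuals
e = (C - E) / np.sqrt(E * (1 - p_i) * (1 - p_j))

delta = 2.0                           # illustrative threshold only
significant = np.argwhere(e > delta)  # candidate missing-content tuples
```

Tuples whose residual exceeds δ are the candidates flagged as significant in Phase 1.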

11 of 26

Solution: Phase 1

Claim: significant tuples represent missing content, not navigation intent

12 of 26

Solution: Phase 1

The higher the Pearson residual, the more severe the problem

⇒ Residuals can be used to prioritize missing-content issues

13 of 26

With access to the website content we can do more:

  • page_se_score: how well content for the query is covered on the referral page itself
  • best_match_score: how well content for the query is covered by the best-matching page on the site

14 of 26

Solution: Phase 2

Significant (q*, w*) tuples are classified into three types of issues:

  • Missing content on page: content relevant to q* is absent on w* but present on other page(s) in W
  • Missing content on site: content relevant to q* is not present anywhere in W
  • Unsatisfactorily present content: content relevant to q* is present on w* but does not properly satisfy user requirements
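A minimal sketch of how the two scores could drive this three-way classification; the exact decision rule and the example thresholds are assumptions for illustration, not the paper's tuned procedure:

```python
def classify_issue(page_se_score, best_match_score, alpha, beta):
    """Three-way classification of a significant (q*, w*) tuple.

    Assumed rule: beta thresholds whether any page in W covers q*;
    alpha thresholds whether the referral page w* itself covers q*.
    """
    if best_match_score < beta:
        return "missing content on site"        # no page in W covers q*
    if page_se_score < alpha:
        return "missing content on page"        # another page in W covers q*
    return "unsatisfactorily present content"   # w* covers q*, yet users keep searching
```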

15 of 26

Solution: Rectification of Issues

Rectifying issues:

  • Missing content on page:
    • Leverage click-through data to infer which search result link satisfied the user
    • Add content from, or a link to, the inferred webpage(s)
  • Missing content on site:
    • Suggest topics for authors to write about
  • Unsatisfactorily present content:
    • Leverage click-through data as described above

16 of 26

EXPERIMENTS AND INSIGHTS


17 of 26

Parameter Tuning


  • α and β

Recall that α and β can be interpreted as governing whether content corresponding to a query is present in a given text, so both take the same value.

    • We obtained 1000 binary relevance-judged (w, q) pairs (human-judged; 500 relevant, 500 non-relevant), where rel(w, q) = 1 denotes “true” significance corresponding to missing content, and rel(w, q) = 0 otherwise.

    • The optimal α* (and equivalently β*) under the MaxPCC criterion [4] is

      α* = arg max_α (|A0| + |A1|) / |A|, where

      A0 = {(w, q) | rel(w, q) = 0, page_se_score(w, q) < α},
      A1 = {(w, q) | rel(w, q) = 1, page_se_score(w, q) ≥ α}, and
      |A| = 1000 is the total number of judged pairs.
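The MaxPCC threshold search can be sketched as follows; the scores here are synthetic stand-ins for page_se_score values, not the paper's judged data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic judged pairs: rel = 1 pairs tend to score higher than rel = 0 pairs.
rel = np.array([0] * 500 + [1] * 500)
score = np.where(rel == 1,
                 rng.normal(0.7, 0.15, 1000),
                 rng.normal(0.3, 0.15, 1000))

def maxpcc(scores, labels):
    """Return the threshold alpha maximizing (|A0| + |A1|) / |A|."""
    best_alpha, best_acc = None, -1.0
    for alpha in np.unique(scores):
        a0 = np.sum((labels == 0) & (scores < alpha))   # correctly below threshold
        a1 = np.sum((labels == 1) & (scores >= alpha))  # correctly at/above threshold
        acc = (a0 + a1) / len(labels)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc

alpha_star, acc = maxpcc(score, rel)
```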

18 of 26

Parameter Tuning

Recall that a tuple is considered significant if its Pearson residual e_ij > δ.

  • The choice of δ was guided by the distribution of Pearson residuals:

    • Due to possible noise and randomness in the data, we should be skeptical of small positive residuals that signify only slightly higher-than-expected counts.
    • Positive residuals were found to follow an exponential distribution with rate λ = 0.0139 (log likelihood of the fit, normalized by the number of values: −5.28). We set δ to the mean of the distribution, 1/λ = 71.94, since positive values below the mean may be the result of noise.
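This δ selection can be sketched as below; the residuals are synthetic draws (with a scale near the paper's fitted mean) rather than the actual residual values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic positive residuals; in practice these come from the residual matrix e.
residuals = rng.exponential(scale=72.0, size=5000)

# Maximum-likelihood fit of an exponential distribution: rate = 1 / sample mean.
rate = 1.0 / residuals.mean()

# Log likelihood of the fit, normalized by the number of values
loglik = np.mean(np.log(rate) - rate * residuals)

delta = 1.0 / rate                        # threshold = fitted distribution's mean
significant = residuals[residuals > delta]
```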

19 of 26

Parameter Tuning

19

  • Query clustering:

    • Chinese whispers (scalable)
    • Edges(0/1)- based on jaccard similarity with threshold 0.7
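The clustering step can be sketched as follows; the whitespace-token term sets and the toy queries are illustrative assumptions:

```python
import random
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    return len(a & b) / len(a | b)

def chinese_whispers(queries, threshold=0.7, iters=10, seed=0):
    """Cluster queries on a graph whose edges connect pairs with Jaccard >= threshold."""
    rng = random.Random(seed)
    terms = [set(q.split()) for q in queries]
    n = len(queries)
    nbrs = {i: [] for i in range(n)}
    for i, j in combinations(range(n), 2):
        if jaccard(terms[i], terms[j]) >= threshold:
            nbrs[i].append(j)
            nbrs[j].append(i)
    labels = list(range(n))                 # each node starts in its own cluster
    for _ in range(iters):
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            if nbrs[i]:
                counts = {}
                for j in nbrs[i]:
                    counts[labels[j]] = counts.get(labels[j], 0) + 1
                labels[i] = max(counts, key=counts.get)  # adopt dominant neighbour label
    return labels

queries = ["illustrator download", "download illustrator",
           "flash download", "illustrator pricing"]
labels = chinese_whispers(queries)
```

Here the two near-duplicate "illustrator download" queries share an edge (Jaccard 1.0) and end up in one cluster, while the other queries stay apart.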

20 of 26

Results: Classification


Distribution of classes for the data set under tuned parameters

21 of 26

Results: Examples


Examples from Adobe data set

22 of 26

Prior Work


  • Predicting ‘difficult’ web queries [1]

  • Adding links for SEO [3]

  • Optimizing site structure for navigation [2]

[1] Yom-Tov, E., Fine, S., Carmel, D., Darlow, A.: Learning to estimate query difficulty: including applications to missing content detection and distributed information retrieval. In: SIGIR ’05 (2005)

[2] Lin, W.l., Liu, Y.z.: A novel website structure optimization model for more effective web navigation. In: WKDD ’08 (2008)

[3] Cui, M., Hu, S.: Search engine optimization research for website promotion. In: ICM ’11 (2011)

23 of 26

CONCLUSION


24 of 26

Conclusion

  • We formulate a practical research problem that, to the best of our knowledge, has not been formulated before; this is an important contribution of our work.

  • Our method is lightweight and builds on query logs, which are often readily available.

25 of 26

Limitations

  • A thorough evaluation is lacking. Note that evaluation is hard for the proposed problem: a ‘real’ evaluation can only happen in a deployed scenario, where measures such as the number of queries issued can be compared before and after rectifying an identified missing-content issue.

  • Modern websites contain a lot of image content, which the presented analysis does not consider.

26 of 26

THANKS!

Any questions?
