Leveraging Site Search Logs to Identify Missing Content on Enterprise Webpages
Harsh Jhamtani, Rishiraj Saha Roy, Niyati Chhaya, Eric Nyberg
MOTIVATION
Motivation: Site Search
Leveraging Site Search Logs to Identify Missing Content Issues and Suggest Rectifications/Updates for Corresponding Webpages
Data set
query | webpage |
https://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/
Naive Methods
Sort (query, referral) tuples by query count
Sort tuples of given webpage by query counts
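A minimal sketch of the two naive rankings, assuming the log is available as (query, referral webpage) pairs. The sample rows and the function names top_tuples and top_queries_for_page are hypothetical and not taken from the data set.

```python
from collections import Counter

# Hypothetical sample log: one (query, referral webpage) pair per search.
log = [
    ("flash download", "/products/illustrator.html"),
    ("flash download", "/products/illustrator.html"),
    ("illustrator pricing", "/products/illustrator.html"),
    ("flash download", "/products/photoshop.html"),
    ("photoshop trial", "/products/photoshop.html"),
]

def top_tuples(log):
    """Naive method 1: sort (query, referral) tuples by count, highest first."""
    counts = Counter(log)
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def top_queries_for_page(log, webpage):
    """Naive method 2: sort the queries issued from one webpage by count."""
    counts = Counter(q for q, w in log if w == webpage)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

page_ranking = top_queries_for_page(log, "/products/illustrator.html")
```

Both rankings only look at raw counts, so a query that is popular across the whole site will rank highly on every page it is issued from.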
However, such analysis can be misleading
Navigation vs. Missing Content
Query | Percentage count |
Illustrator download | 5% |
Flash download | 6.9% |
Illustrator pricing | 5.9% |
For http://www.adobe.com/products/illustrator.html, there was no link to download or pricing information, yet "Flash download" was still the most popular query.
SOLUTION
Assumptions
Navigation: the distribution of queries is independent of the referral webpage
i.e., P(Q = q | w) = P(Q = q)
This assumption is used to distinguish navigational queries from missing-content queries.
Pearson Residual
Queries were found to follow Poisson distributions
Following from previous assumptions, expected count of query q from webpage w
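Under the independence assumption, the expected count takes the usual contingency-table form. The notation below is assumed here for illustration and may differ from the slide's own symbols: n_{w·} is the total number of queries issued from webpage w, n_{·q} is the total count of query q across all webpages, and n is the total number of logged searches.

```latex
E_{wq} \;=\; n_{w\cdot}\, P(Q = q) \;\approx\; \frac{n_{w\cdot}\, n_{\cdot q}}{n}
```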
Standardized Pearson Residual
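As a rough illustration, the sketch below computes the textbook standardized Pearson residual for a webpage-by-query count table, (observed - expected) / sqrt(expected * (1 - row share) * (1 - column share)). This is the standard contingency-table definition and is assumed, not confirmed, to match the formula on the slide. The function name and the sample counts are hypothetical.

```python
import math

def standardized_residuals(counts):
    """counts[w][q] = observed number of times query q was issued from webpage w.

    Returns {(w, q): standardized Pearson residual} under the independence model.
    """
    row_tot = {w: sum(qs.values()) for w, qs in counts.items()}
    col_tot = {}
    for qs in counts.values():
        for q, c in qs.items():
            col_tot[q] = col_tot.get(q, 0) + c
    n = sum(row_tot.values())

    resid = {}
    for w in counts:
        for q in col_tot:
            observed = counts[w].get(q, 0)
            # Expected count if the query distribution did not depend on w.
            expected = row_tot[w] * col_tot[q] / n
            var = expected * (1 - row_tot[w] / n) * (1 - col_tot[q] / n)
            if var > 0:  # skip degenerate cells (e.g. a single webpage)
                resid[(w, q)] = (observed - expected) / math.sqrt(var)
    return resid

# Hypothetical counts for two webpages and two queries.
sample = {
    "illustrator": {"flash download": 30, "illustrator pricing": 20},
    "photoshop": {"flash download": 30, "illustrator pricing": 0},
}
residuals = standardized_residuals(sample)
```

In this toy table, "illustrator pricing" is over-represented on the illustrator page relative to the site-wide rate, so its residual is positive, while "flash download" gets a negative residual on the same page.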
Solution: Phase 1
Phase 1
Claim: Significant tuples represent missing content and not navigation intention
The higher the Pearson residual, the more severe the problem
=> The residual can be used to prioritize missing content issues
With access to the website content, more can be done
page_se_score
best_match_score
Solution: Phase 2
Phase 2
Solution: Rectification of Issues
Rectifying issues:
EXPERIMENTS AND INSIGHTS
Parameter Tuning
Recall that α and β can be interpreted as governing whether content corresponding to a query is present in a given text, so both take the same value.
- A_0 = {(w, q) | rel(w, q) = 0, page_se_score(w, q) < α},
- A_1 = {(w, q) | rel(w, q) = 1, page_se_score(w, q) ≥ α}, and
- |A| = |A_0| + |A_1| = 1000.
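One way to read these definitions is as an agreement count between the relevance labels and the thresholded score. The sketch below assumes, as a guess about the procedure rather than a description of it, that α is chosen to maximize |A_0| + |A_1| over a labeled sample. The labeled pairs, the candidate grid, and the function names are hypothetical.

```python
def agreement(labeled, alpha):
    """Return |A_0| + |A_1| for threshold alpha.

    labeled: list of (rel, page_se_score) pairs with rel in {0, 1}.
    """
    a0 = sum(1 for rel, score in labeled if rel == 0 and score < alpha)
    a1 = sum(1 for rel, score in labeled if rel == 1 and score >= alpha)
    return a0 + a1

def tune_alpha(labeled, candidates):
    """Pick the candidate with the highest agreement (smallest one on ties)."""
    return max(sorted(candidates), key=lambda a: agreement(labeled, a))

# Hypothetical labeled sample: (rel, page_se_score).
labeled = [(0, 0.1), (0, 0.3), (1, 0.4), (1, 0.8), (0, 0.6)]
best_alpha = tune_alpha(labeled, [0.2, 0.35, 0.5, 0.7])
```

Since β is stated to take the same value as α, the tuned threshold would be reused for it under this reading.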
Recall that if the Pearson residual e_ij > δ, the corresponding tuple is considered significant.
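A small sketch of applying the threshold, assuming residuals are already available as a mapping from (webpage, query) tuples to e_ij. Sorting the surviving tuples by residual reflects the earlier point that larger residuals indicate more severe issues. The residual values, the value of δ, and the function name are hypothetical.

```python
def significant_tuples(residuals, delta):
    """Keep tuples whose residual exceeds delta, most severe first."""
    kept = [(t, e) for t, e in residuals.items() if e > delta]
    return sorted(kept, key=lambda te: -te[1])

# Hypothetical residuals e_ij keyed by (webpage, query).
example_residuals = {
    ("w1", "q1"): 4.0,
    ("w1", "q2"): -4.0,
    ("w2", "q3"): 7.5,
    ("w2", "q1"): 1.2,
}
flagged = significant_tuples(example_residuals, delta=2.0)
```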
Results: Classification
Distribution of classes for the data set under tuned parameters
Results: Examples
Examples from Adobe data set
Prior Work
[1] Yom-Tov, E., Fine, S., Carmel, D., Darlow, A.: Learning to estimate query difficulty including applications to missing content detection and distributed information retrieval. In: SIGIR ’05 (2005)
[2] Lin, W.l., Liu, Y.z.: A novel website structure optimization model for more effective web navigation. In: WKDD ’08 (2008)
[3] Cui, M., Hu, S.: Search engine optimization research for website promotion. In: ICM ’11 (2011)
CONCLUSION
Conclusion
Limitations
THANKS!
Any questions?