1 of 28

Expert Users:

A Hybrid Approach to Clickstream Analytics

Elizabeth Haubert

OpenSource Connections

April 10, 2018

2 of 28

Outline

  • What do we need for a relevance test set?
  • What can we build from our (incomplete) logs?
  • How can we efficiently augment that data with explicit user feedback?

3 of 28

Test what?

[Diagram: UI, API, and Data layers, connected by QUERIES and RESULTS.]

4 of 28

Implicit Feedback

Query Features

  • Click Position
  • # Clicks
  • Query Length

Session Features

  • # Queries Per Session
  • # No-click queries
  • Session Time
  • # Reformulations

User Features

  • # Clicks Per User
  • # Queries Per User
  • User Dwell Time
  • URLs visited
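All of these features can be derived from raw click logs. A minimal sketch below, assuming a hypothetical log format of (user_id, session_id, query, clicked_positions, dwell_secs) events; the field names and data are illustrative, not from the talk.

```python
# Minimal sketch: deriving a few implicit-feedback features from a click log.
# The log schema below is an assumed example, not the talk's actual format.
from collections import defaultdict

log = [
    {"user_id": "u1", "session_id": "s1", "query": "cat videos",
     "clicked_positions": [1, 3], "dwell_secs": 42},
    {"user_id": "u1", "session_id": "s1", "query": "funny cat videos",
     "clicked_positions": [], "dwell_secs": 5},
    {"user_id": "u2", "session_id": "s2", "query": "mouser",
     "clicked_positions": [2], "dwell_secs": 30},
]

query_features = []                  # per-query: length, # clicks, first click position
session_queries = defaultdict(int)   # # queries per session
session_no_click = defaultdict(int)  # # no-click queries per session
user_clicks = defaultdict(int)       # # clicks per user
user_dwell = defaultdict(int)        # total dwell time per user

for event in log:
    query_features.append({
        "query_length": len(event["query"].split()),
        "n_clicks": len(event["clicked_positions"]),
        "first_click_position": min(event["clicked_positions"], default=None),
    })
    session_queries[event["session_id"]] += 1
    if not event["clicked_positions"]:
        session_no_click[event["session_id"]] += 1
    user_clicks[event["user_id"]] += len(event["clicked_positions"])
    user_dwell[event["user_id"]] += event["dwell_secs"]

print(query_features[0], dict(session_queries), dict(user_clicks))
```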

5 of 28

Laboratory Benchmarks

The Cranfield (TREC) Model

  • Documents
  • Topics
  • Judgements
  • Evaluation Score

The Philosophy of Information Retrieval Evaluation (2002) Ellen Voorhees.

https://www.nist.gov/publications/philosophy-information-retrieval-evaluation

6 of 28

Evaluation Metrics

Without Judgements

  • Set difference

With Judgements

  • Precision
  • Recall
  • MRR: Mean Reciprocal Rank
  • ERR: Expected Reciprocal Rank
  • DCG: Discounted Cumulative Gain
  • NDCG: Normalized DCG
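For reference, a small sketch of how the judgement-based metrics are computed from one query's ranked list of graded judgements. The reciprocal-rank and log2-discounted DCG formulas are the standard ones; the example grades are made up.

```python
import math

def reciprocal_rank(grades):
    """1 / rank of the first relevant result (grade > 0), else 0."""
    for rank, grade in enumerate(grades, start=1):
        if grade > 0:
            return 1.0 / rank
    return 0.0

def dcg(grades, k=10):
    """Discounted Cumulative Gain with the common 2^grade - 1 gain."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades[:k], start=1))

def ndcg(grades, k=10):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

# Grades for the top results of one query, in ranked order (made-up example).
grades = [0, 2, 1, 0, 3]
print(reciprocal_rank(grades), ndcg(grades, k=5))
```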

7 of 28

Laboratory to Practice

  • Documents
  • Topics
  • Judgements
  • Evaluation Score

How Many?

1,000,000 docs / 50 Topics

= 20,000 docs / Topic

Rank: top 100 / 20,000

= 0.5% of each topic's documents ever get ranked

8 of 28

Queries Instead of Topics

Dataset                  Queries   Docs      Judgement Scale   Year
LETOR 3 (TREC-GOV)       575       568 k     2                 2008
LETOR 3 (TREC-OHSUMED)   106       16 k      3                 2008
LETOR 4                  2,476     85 k      3                 2009
Yahoo!                   36,251    883 k     5                 2010
Microsoft                31,531    3,771 k   5                 2010

LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval (2007)

https://pdfs.semanticscholar.org/dbcd/79bd7edcdcbb5912a50796fc3c2746729eb5.pdf

9 of 28

Laboratory to Practice

  • Documents
  • Topics - Search Catalog
  • Judgements
  • Evaluation Score

10 of 28

Laboratory to Practice

  • Documents
  • Topics - Search Catalog
  • Judgements
  • Evaluation Score
  • Subject matter
  • Transactional / Navigational / Informational
  • Shatford-Panofsky Framework

11 of 28

A Note on Sampling

  • Query Frequency
  • Query Categories
  • Query Response Times
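One way to act on these concerns is to stratify the query sample by frequency band (head, torso, tail) rather than sampling uniformly, so the test set is not dominated by head queries. The sketch below is illustrative; the band cut-offs and query strings are assumptions for the example.

```python
# Illustrative sketch: stratified sampling of test queries by frequency band.
# Band thresholds and sample sizes are arbitrary assumptions.
import random
from collections import Counter

logged_queries = ["cat videos"] * 50 + ["mouser"] * 5 + ["puss in boots"] * 1

counts = Counter(logged_queries)
bands = {"head": [], "torso": [], "tail": []}
for query, freq in counts.items():
    if freq >= 20:
        bands["head"].append(query)
    elif freq >= 5:
        bands["torso"].append(query)
    else:
        bands["tail"].append(query)

# Sample (up to) the same number of distinct queries from each band.
sample = [q for qs in bands.values() for q in random.sample(qs, min(2, len(qs)))]
print(sample)
```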

12 of 28

Laboratory to Practice

  • Documents
  • Topics
  • Judgements
  • Evaluation Score

13 of 28

TREC Interactive Track History

  • TREC-4:
    • Assessors picked “relevant” documents on topic
    • Testers picked “relevant” documents on topic
    • If they didn’t use the same query, comparing the results didn’t make much sense.
  • TREC-5:
    • Find docs which describe some ‘aspect’ of the topic
    • Assessors compiled a list of which docs corresponded to which aspects
    • Compared system recall and precision
  • TREC-6:
    • Everyone used the same topics and baseline. Success!

14 of 28

Human Judgements

When you must:

  • Define the queries
  • Play up empathy: write a story.
  • Give guidelines, if they are known.
  • Keep the judgement scale small.

15 of 28

Collecting Judgements By Survey

Were these results helpful?

Please rate this document:

16 of 28

Inferring Judgements

  • Transactional:
    • Presented, Clicked, In-Cart, Purchased
  • Not transactional:
    • More clicks = mostly better
    • Position Bias
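For transactional search, the event ladder above maps naturally onto a graded judgement. A minimal sketch, assuming hypothetical event names and grade values; the exact scale would be chosen per application, and correcting clicks for position bias (e.g. with a click model) is a separate step not shown here.

```python
# Minimal sketch: mapping transactional events onto a graded judgement.
# Event names and grade values are illustrative assumptions.
EVENT_GRADE = {"presented": 0, "clicked": 1, "in_cart": 2, "purchased": 3}

def infer_grade(events):
    """Grade a (query, document) pair by the strongest signal observed."""
    return max((EVENT_GRADE[e] for e in events if e in EVENT_GRADE), default=0)

print(infer_grade(["presented", "clicked", "in_cart"]))  # -> 2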

17 of 28

Query Chaining

Doc1: { "Title": "Caring for cats",
        "Body":  "Feed cats. Take videos." }

Doc2: { "Title": "Why CAT videos are funny",
        "Body":  "Because they are goofy." }

Doc3: { "Title": "Mouser videos",
        "Body":  "A mouser caught a rat." }

Doc4: { "Title": "Puss in boots",
        "Body":  "Story about a cat." }

[Diagram: Click 1 and Click 2 mark the documents clicked across the chained queries.]

18 of 28

Query Chaining

(Same four documents as the previous slide.)

[Diagram: Click 1 and Click 2 annotated on the chain, with per-document scores of -2, 0, -1, and +1 assigned from the clicks.]
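One way to turn a query chain like this into judgements is to reward documents clicked anywhere in the chain and penalize documents that were shown above a click but skipped. The sketch below implements that heuristic with assumed weights of +1 and -1; these are not the slide's exact values.

```python
# Illustrative query-chain scoring heuristic: clicked docs score positively,
# docs shown above a click but skipped score negatively. Weights are assumed.
def score_chain(result_lists, clicked):
    """result_lists: ranked doc ids per query in the chain; clicked: set of doc ids."""
    scores = {}
    for results in result_lists:
        last_click_rank = max((i for i, d in enumerate(results) if d in clicked),
                              default=-1)
        for rank, doc in enumerate(results):
            if doc in clicked:
                scores[doc] = scores.get(doc, 0) + 1   # reward the click
            elif rank < last_click_rank:
                scores[doc] = scores.get(doc, 0) - 1   # skipped above a click
    return scores

chain = [["Doc1", "Doc2", "Doc3", "Doc4"], ["Doc4", "Doc2", "Doc3"]]
print(score_chain(chain, clicked={"Doc2", "Doc4"}))
```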

19 of 28

Implicit Feedback

Query Features

  • Click Position
  • # Clicks
  • Query Length
  • Pages
  • Dwell Time

Session Features

  • # Queries Per Session
  • # No-click queries
  • Session Time
  • # Reformulations

User Features

  • # Clicks Per User
  • # Queries Per User
  • User Dwell Time
  • Similarity to other users
  • URLs visited

20 of 28

Sanity Check

  • Calibrate rankings to testers
  • Calibrate testers to rankings
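One concrete form of this sanity check is to measure how well the ordering implied by click-derived scores agrees with explicit tester judgements, for example with a pairwise rank-agreement score. The sketch below uses made-up scores; the document ids and values are assumptions.

```python
# Sketch of a calibration check: does the implicit (click-derived) ordering
# agree with explicit tester judgements? Scores below are made-up examples.
from itertools import combinations

implicit = {"Doc1": -1, "Doc2": 2, "Doc3": -1, "Doc4": 2}   # click-derived
explicit = {"Doc1": 0, "Doc2": 3, "Doc3": 1, "Doc4": 2}     # tester grades

def pairwise_agreement(a, b):
    """Fraction of document pairs ordered the same way by both score sets."""
    docs = list(a.keys() & b.keys())
    agree = total = 0
    for d1, d2 in combinations(docs, 2):
        diff_a, diff_b = a[d1] - a[d2], b[d1] - b[d2]
        if diff_a == 0 or diff_b == 0:
            continue  # skip ties
        total += 1
        agree += (diff_a > 0) == (diff_b > 0)
    return agree / total if total else 0.0

print(pairwise_agreement(implicit, explicit))
```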

26 of 28

Sparsity

  • Data Augmentation
    • Collect ratings from calibrated users
  • Reduced sample size
    • Evaluate ratings from a smaller sample

27 of 28

Summary

  • Retrieval evaluation metrics need <query, document, judgement> tuples

  • Build judgement models from implicit feedback
  • Augment data with controlled explicit feedback
  • It is time for a new retrieval model

28 of 28

Image Credits