1 of 28

Expert Users:

A Hybrid Approach to Clickstream Analytics

Elizabeth Haubert

OpenSource Connections

April 10, 2018

2 of 28

Outline

  • What do we need for a relevance test set?
  • What can we build from our (incomplete) logs?
  • How can we efficiently augment that data with explicit user feedback?

3 of 28

Test what?

[Diagram: UI, API, and Data layers, connected by QUERIES and RESULTS.]

4 of 28

Implicit Feedback

Query Features

  • Click Position
  • # Clicks
  • Query Length

Session Features

  • # Queries Per Session
  • # No-click queries
  • Session Time
  • # Reformulations

User Features

  • # Clicks Per User
  • # Queries Per User
  • User Dwell Time
  • URLs visited
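All of these features can be derived from raw click logs. A minimal sketch below, assuming a hypothetical log format of (user_id, session_id, query, clicked_positions, dwell_secs) events; the field names and data are illustrative, not from the talk.

```python
# Minimal sketch: deriving a few implicit-feedback features from a click log.
# The log schema below is an assumed example, not the talk's actual format.
from collections import defaultdict

log = [
    {"user_id": "u1", "session_id": "s1", "query": "cat videos",
     "clicked_positions": [1, 3], "dwell_secs": 42},
    {"user_id": "u1", "session_id": "s1", "query": "funny cat videos",
     "clicked_positions": [], "dwell_secs": 5},
    {"user_id": "u2", "session_id": "s2", "query": "mouser",
     "clicked_positions": [2], "dwell_secs": 30},
]

query_features = []                  # per-query: length, # clicks, first click position
session_queries = defaultdict(int)   # # queries per session
session_no_click = defaultdict(int)  # # no-click queries per session
user_clicks = defaultdict(int)       # # clicks per user
user_dwell = defaultdict(int)        # total dwell time per user

for event in log:
    query_features.append({
        "query_length": len(event["query"].split()),
        "n_clicks": len(event["clicked_positions"]),
        "first_click_position": min(event["clicked_positions"], default=None),
    })
    session_queries[event["session_id"]] += 1
    if not event["clicked_positions"]:
        session_no_click[event["session_id"]] += 1
    user_clicks[event["user_id"]] += len(event["clicked_positions"])
    user_dwell[event["user_id"]] += event["dwell_secs"]

print(query_features[0], dict(session_queries), dict(user_clicks))
```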

5 of 28

Laboratory Benchmarks

The Cranfield (TREC) Model

  • Documents
  • Topics
  • Judgements
  • Evaluation Score

The Philosophy of Information Retrieval Evaluation (2002) Ellen Voorhees.

https://www.nist.gov/publications/philosophy-information-retrieval-evaluation

6 of 28

Evaluation Metrics

Without Judgements

  • Set difference

With Judgements

  • Precision
  • Recall
  • MRR: Mean Reciprocal Rank
  • ERR: Expected Reciprocal Rank
  • DCG: Discounted Cumulative Gain
  • NDCG: Normalized DCG
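For reference, a small sketch of how the judgement-based metrics are computed from one query's ranked list of graded judgements. The reciprocal-rank and log2-discounted DCG formulas are the standard ones; the example grades are made up.

```python
import math

def reciprocal_rank(grades):
    """1 / rank of the first relevant result (grade > 0), else 0."""
    for rank, grade in enumerate(grades, start=1):
        if grade > 0:
            return 1.0 / rank
    return 0.0

def dcg(grades, k=10):
    """Discounted Cumulative Gain with the common 2^grade - 1 gain."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades[:k], start=1))

def ndcg(grades, k=10):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

# Grades for the top results of one query, in ranked order (made-up example).
grades = [0, 2, 1, 0, 3]
print(reciprocal_rank(grades), ndcg(grades, k=5))
```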

7 of 28

Laboratory to Practice

  • Documents
  • Topics
  • Judgements
  • Evaluation Score

How Many?

1,000,000 docs / 50 Topics

= 20,000 docs / Topic

Rank: top 100 / 20,000

= 0.5% of each topic's documents ever get ranked

8 of 28

Queries Instead of Topics

Dataset                  Queries   Docs      Judgement Scale   Year
LETOR 3 (TREC-GOV)       575       568 k     2                 2008
LETOR 3 (TREC-OHSUMED)   106       16 k      3                 2008
LETOR 4                  2,476     85 k      3                 2009
Yahoo!                   36,251    883 k     5                 2010
Microsoft                31,531    3,771 k   5                 2010

LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval (2007)

https://pdfs.semanticscholar.org/dbcd/79bd7edcdcbb5912a50796fc3c2746729eb5.pdf

9 of 28

Laboratory to Practice

  • Documents
  • Topics - Search Catalog
  • Judgements
  • Evaluation Score

10 of 28

Laboratory to Practice

  • Documents
  • Topics - Search Catalog
  • Judgements
  • Evaluation Score
  • Subject matter
  • Transactional / Navigational / Informational
  • Shatford-Panofsky Framework

11 of 28

A Note on Sampling

  • Query Frequency
  • Query Categories
  • Query Response Times
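One way to act on these concerns is to stratify the query sample by frequency band (head, torso, tail) rather than sampling uniformly, so the test set is not dominated by head queries. The sketch below is illustrative; the band cut-offs and query strings are assumptions for the example.

```python
# Illustrative sketch: stratified sampling of test queries by frequency band.
# Band thresholds and sample sizes are arbitrary assumptions.
import random
from collections import Counter

logged_queries = ["cat videos"] * 50 + ["mouser"] * 5 + ["puss in boots"] * 1

counts = Counter(logged_queries)
bands = {"head": [], "torso": [], "tail": []}
for query, freq in counts.items():
    if freq >= 20:
        bands["head"].append(query)
    elif freq >= 5:
        bands["torso"].append(query)
    else:
        bands["tail"].append(query)

# Sample (up to) the same number of distinct queries from each band.
sample = [q for qs in bands.values() for q in random.sample(qs, min(2, len(qs)))]
print(sample)
```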

12 of 28

Laboratory to Practice

  • Documents
  • Topics
  • Judgements
  • Evaluation Score

13 of 28

TREC Interactive Track History

  • TREC-4:
    • Assessors picked “relevant” documents on topic
    • Testers picked “relevant” documents on topic
    • If they didn’t use the same query, comparing the results didn’t make much sense.
  • TREC-5:
    • Find docs which describe some ‘aspect’ of the topic
    • Assessors compiled a list of which docs corresponded to which aspects
    • Compared system recall and precision
  • TREC-6:
    • Everyone used the same topics and baseline. Success!

14 of 28

Human Judgements

When you must:

  • Define the queries
  • Play up empathy: write a story.
  • Give guidelines, if they are known.
  • Keep the judgement scale small.

15 of 28

Collecting Judgements By Survey

Were these results helpful?

Please rate this document:

16 of 28

Inferring Judgements

  • Transactional:
    • Presented, Clicked, In-Cart, Purchased
  • Not transactional:
    • More clicks = mostly better
    • Position Bias
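For transactional search, the event ladder above maps naturally onto a graded judgement. A minimal sketch, assuming hypothetical event names and grade values; the exact scale would be chosen per application, and correcting clicks for position bias (e.g. with a click model) is a separate step not shown here.

```python
# Minimal sketch: mapping transactional events onto a graded judgement.
# Event names and grade values are illustrative assumptions.
EVENT_GRADE = {"presented": 0, "clicked": 1, "in_cart": 2, "purchased": 3}

def infer_grade(events):
    """Grade a (query, document) pair by the strongest signal observed."""
    return max((EVENT_GRADE[e] for e in events if e in EVENT_GRADE), default=0)

print(infer_grade(["presented", "clicked", "in_cart"]))  # -> 2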

17 of 28

Query Chaining

Doc1: { "Title": "Caring for cats",
        "Body":  "Feed cats. Take videos." }

Doc2: { "Title": "Why CAT videos are funny",
        "Body":  "Because they are goofy." }

Doc3: { "Title": "Mouser videos",
        "Body":  "A mouser caught a rat." }

Doc4: { "Title": "Puss in boots",
        "Body":  "Story about a cat." }

[Diagram: Click 1 and Click 2 mark the documents clicked across the chained queries.]

18 of 28

Query Chaining

(Same four documents as the previous slide.)

[Diagram: Click 1 and Click 2 annotated on the chain, with per-document scores of -2, 0, -1, and +1 assigned from the clicks.]
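One way to turn a query chain like this into judgements is to reward documents clicked anywhere in the chain and penalize documents that were shown above a click but skipped. The sketch below implements that heuristic with assumed weights of +1 and -1; these are not the slide's exact values.

```python
# Illustrative query-chain scoring heuristic: clicked docs score positively,
# docs shown above a click but skipped score negatively. Weights are assumed.
def score_chain(result_lists, clicked):
    """result_lists: ranked doc ids per query in the chain; clicked: set of doc ids."""
    scores = {}
    for results in result_lists:
        last_click_rank = max((i for i, d in enumerate(results) if d in clicked),
                              default=-1)
        for rank, doc in enumerate(results):
            if doc in clicked:
                scores[doc] = scores.get(doc, 0) + 1   # reward the click
            elif rank < last_click_rank:
                scores[doc] = scores.get(doc, 0) - 1   # skipped above a click
    return scores

chain = [["Doc1", "Doc2", "Doc3", "Doc4"], ["Doc4", "Doc2", "Doc3"]]
print(score_chain(chain, clicked={"Doc2", "Doc4"}))
```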

19 of 28

Implicit Feedback

Query Features

  • Click Position
  • # Clicks
  • Query Length
  • Pages
  • Dwell Time

Session Features

  • # Queries Per Session
  • # No-click queries
  • Session Time
  • # Reformulations

User Features

  • # Clicks Per User
  • # Queries Per User
  • User Dwell Time
  • Similarity to other users
  • URLs visited

20 of 28

Sanity Check

  • Calibrate rankings to testers
  • Calibrate testers to rankings
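One concrete form of this sanity check is to measure how well the ordering implied by click-derived scores agrees with explicit tester judgements, for example with a pairwise rank-agreement score. The sketch below uses made-up scores; the document ids and values are assumptions.

```python
# Sketch of a calibration check: does the implicit (click-derived) ordering
# agree with explicit tester judgements? Scores below are made-up examples.
from itertools import combinations

implicit = {"Doc1": -1, "Doc2": 2, "Doc3": -1, "Doc4": 2}   # click-derived
explicit = {"Doc1": 0, "Doc2": 3, "Doc3": 1, "Doc4": 2}     # tester grades

def pairwise_agreement(a, b):
    """Fraction of document pairs ordered the same way by both score sets."""
    docs = list(a.keys() & b.keys())
    agree = total = 0
    for d1, d2 in combinations(docs, 2):
        diff_a, diff_b = a[d1] - a[d2], b[d1] - b[d2]
        if diff_a == 0 or diff_b == 0:
            continue  # skip ties
        total += 1
        agree += (diff_a > 0) == (diff_b > 0)
    return agree / total if total else 0.0

print(pairwise_agreement(implicit, explicit))
```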

26 of 28

Sparsity

  • Data Augmentation
    • Collect ratings from calibrated users
  • Reduced sample size
    • Evaluate ratings from a smaller sample

27 of 28

Summary

  • Retrieval evaluation metrics need <query, document, judgement> tuples

  • Build judgement models from implicit feedback
  • Augment data with controlled explicit feedback
  • It is time for a new retrieval model

28 of 28

Image Credits