1 of 65


CS 589 Lecture 3

Monday 6:30-9:00

Kidde 228

All zoom links in Canvas

IR Evaluation

(Pseudo)-relevance feedback

photo: https://www.scubedstudios.com/information-retrieval/

2 of 65

Review of Lecture 2: Probabilistic Ranking Principle

  • Given the query q, all documents should be ranked by their probability of relevance
    • RSJ model:
      • Doesn’t leverage the TF information
      • Relies on relevance judgments
    • BM25 model
      • Estimate the probability using the 2-Poisson model
      • Obtain a parameter-free model by approximating the 2-Poisson model


3 of 65

Review of Lecture 2: Estimating

Problems with this estimation? Not enough data.

Bayes' rule

4 of 65

Review of Lecture 2: Approximating the 2-Poisson model


slide credit: Stanford CS276 Information retrieval

5 of 65

Review of Lecture 2: LM-based retrieval model

  • Given the query q, rank all documents by the probability of generating q from d

Dirichlet smoothing: $p(w \mid d) = \frac{c(w, d) + \mu\, p(w \mid C)}{|d| + \mu}$
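For review, a minimal sketch of query-likelihood scoring with Dirichlet smoothing; the toy corpus, the whitespace-style tokenization, and μ = 2000 are illustrative assumptions, not part of the slides:

```python
import math
from collections import Counter

def dirichlet_score(query_terms, doc_terms, collection_tf, collection_len, mu=2000.0):
    """Log query likelihood log p(q|d) under a Dirichlet-smoothed document language model."""
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_wc = collection_tf.get(w, 0) / collection_len         # collection model p(w|C)
        p_wd = (doc_tf.get(w, 0) + mu * p_wc) / (doc_len + mu)  # smoothed p(w|d)
        if p_wd > 0:                                            # skip out-of-vocabulary words
            score += math.log(p_wd)
    return score

# toy usage: rank two documents for the query "cs 589 stevens"
docs = [["cs", "589", "stevens", "information", "retrieval"],
        ["stevens", "campus", "map"]]
collection_tf = Counter(w for d in docs for w in d)
collection_len = sum(collection_tf.values())
query = ["cs", "589", "stevens"]
ranking = sorted(range(len(docs)),
                 key=lambda i: dirichlet_score(query, docs[i], collection_tf, collection_len),
                 reverse=True)
print(ranking)  # doc 0 should rank above doc 1
```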

6 of 65

Pop quiz (LM-based retrieval model, attendance)

  • Compute the feedback retrieval model: https://drive.google.com/file/d/1APqRRT41fLUcNtQv3IJ3WbEa7NXhAHK-/view?usp=sharing


7 of 65

Lecture 3

  • Basic evaluation metrics for an IR system
    • Precision/recall
    • MAP, MRR, NDCG

  • The Cranfield evaluation methodology
    • Pooling strategy

  • Online evaluation, A/B test

  • Relevance feedback


8 of 65

Information Retrieval Techniques

[Figure: overview of an IR system answering the query "cs 589 stevens" over roughly 50 billion web pages: inverted index (Lecture 4), retrieval models (Lectures 1, 2, 5), evaluation (Lecture 3)]

9 of 65

Information retrieval evaluation

  • After learning CS589, you graduate and join a company selling bikes

Build an e-commerce search engine for bikes!

10 of 65

Information retrieval evaluation

  • After learning CS589, you graduate and join a company selling bikes

Why use deep learning?

Deep learning!

11 of 65

Information retrieval evaluation

  • How do we know:
    • Which search algorithm should we use?
    • How can we increase user traffic next quarter?
    • How can we improve revenue?

12 of 65

Metrics used by Industry

  • The search results are correct → Precision, recall, etc.

  • Return results fast → Latency

  • Users come back → Retention rate

  • Revenue growth → Revenue

13 of 65

Precision, recall

[Figure: top results for the query "cs 589 stevens"; 2 of the top 3 results are relevant, so Precision@3 = 2/3]

14 of 65

Rank-based measurements

  • Binary relevance (1 = click, 0 = not click)
    • Precision@K
    • Mean average precision (MAP)
    • Mean reciprocal rank (MRR)

  • Multiple levels of relevance (2: click > 10s, 1: click < 10s, 0: not click)
    • Normalized discounted cumulative gain (NDCG)


15 of 65

Precision/recall of retrieved documents

  • Precision: out of all retrieved docs, how many are relevant

  • Recall: out of all relevant docs, how many are retrieved (a sketch follows below)
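A minimal sketch of these two set-based definitions; the document ids and the relevance set are illustrative:

```python
def precision(retrieved, relevant):
    """Out of all retrieved docs, how many are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Out of all relevant docs, how many are retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

# Precision@3 example from the earlier slide: 2 of the top 3 results are relevant
assert precision(["d1", "d2", "d3"], ["d1", "d3", "d7", "d9"]) == 2 / 3
```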


16 of 65

Prec vs Recall: which one is more important?


"TSLA stock price"

17 of 65

Prec vs Recall: which one is more important?


"sneaker"

18 of 65

Prec vs Recall: which one is more important?


1988

19 of 65

Precision-recall curve

Rank k   Relevant? (+)   Precision@k   Recall@k
1        +               1/1           1/4
2        +               2/2           2/4
3                        2/3           2/4
5        +               3/5           3/4
8        +               4/8           4/4

[Figure: precision-recall curve, precision on the y-axis vs. recall on the x-axis; as recall increases, precision usually decreases (not always)]

Slides from UIUC CS598

20 of 65

Average precision @n

  • Consider the rank position of each relevant and retrieved doc
    • K1, K2, … KR

  • Compute Precision@K for K = K1, K2, … KR

  • Average precision:

$\text{AP@}n = \frac{1}{R}\sum_{i:\,K_i \le n}\text{Precision@}K_i$

where $n$ is the number of retrieved documents, the $K_i$ are the rank positions of the relevant documents among the top $n$, and $R$ is the total number of relevant documents, not the number of relevant-and-retrieved documents (why?).

Rank k   Relevant? (+)   Precision@k   Recall@k
1        +               1/1           1/4
2        +               2/2           2/4
3                        2/3           2/4
5        +               3/5           3/4
8        +               4/8           4/4
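A minimal sketch of AP (and of MAP as its mean over queries), using the denominator convention above; the doc ids are illustrative and the example reproduces the ranking from the table:

```python
def average_precision(ranking, relevant):
    """AP: average of Precision@K over the ranks K of the relevant docs retrieved,
    divided by the total number of relevant documents (not just those retrieved)."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevants):
    """MAP: mean of AP over all queries."""
    return sum(average_precision(r, rel) for r, rel in zip(rankings, relevants)) / len(rankings)

# Relevant docs at ranks 1, 2, 5, 8, with 4 relevant documents in total
ranking = ["r1", "r2", "n1", "n2", "r3", "n3", "n4", "r4"]
print(average_precision(ranking, {"r1", "r2", "r3", "r4"}))  # (1/1 + 2/2 + 3/5 + 4/8) / 4 = 0.775
```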

21 of 65

Mean Average Precision @n

Slides from Stanford CS276

$\text{MAP} = \frac{1}{|Q|}\sum_{q \in Q}\text{AP}(q)$, where each per-query AP divides by that query's number of relevant documents.

(Example: suppose there are 5 relevant documents for both query 1 and query 2.)

22 of 65

Mean reciprocal rank

  • Users only need 1 relevant document

Slides from UVA CS4780

RR = 1.0 / (1.0 + rank_1), where rank_1 is the 0-indexed position of the first relevant document (equivalently, 1/rank with 1-indexed ranks).
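A minimal sketch of RR/MRR consistent with the 0-indexed convention above; the doc ids are illustrative:

```python
def reciprocal_rank(ranking, relevant):
    """RR = 1 / rank of the first relevant document (1-indexed);
    with the slide's 0-indexed rank_1 this is 1.0 / (1.0 + rank_1)."""
    for i, doc in enumerate(ranking):          # i starts from 0
        if doc in relevant:
            return 1.0 / (1.0 + i)
    return 0.0

def mean_reciprocal_rank(rankings, relevants):
    return sum(reciprocal_rank(r, rel) for r, rel in zip(rankings, relevants)) / len(rankings)

print(reciprocal_rank(["n1", "n2", "r1", "n3"], {"r1"}))  # first relevant at index 2 -> 1/3
```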

23 of 65

Beyond binary relevance? Click > 10s?

  • Discounted cumulative gain (DCG)
    • Popular measure for evaluating web search and related tasks

  • Information gain-based evaluation (economics)
    • For each relevant document, the user has gained some information
    • The higher the relevance, the higher gain
    • The gain is discounted when the relevant document appears in a lower position

[Illustration: the gain accumulates as the user moves down the ranking but is discounted by position, e.g. gain += 1/log(2) for a relevant document at position 2 versus gain += 4/log(100) at position 100]

24 of 65

Discounted cumulative gain (DCG)

Two example rankings with graded relevance labels (2 = click > 10s, 1 = click < 10s, 0 = not clicked):

Ranking 1:  2 0 1 2 2 1 0 0 0 2
Ranking 2:  0 2 0 0 1 2 2 0 1 2

$\text{DCG@}p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i + 1)}$  (position $i$ starts from 1)

For binary relevance, using the gain $2^{rel_i} - 1$ instead of $rel_i$ gives the same result.

25 of 65

Discounted cumulative gain (DCG)

Ranking 1:  2 0 1 2 2 1 0 0 0 2
Ranking 2:  0 2 0 0 1 2 2 0 1 2

(2 = click > 10s, 1 = click < 10s, 0 = not clicked; position p starts from 1)

26 of 65

Why normalizing DCG?

  • If we do not normalize DCG, the performance will be biased towards systems that perform well on queries with larger DCG scales

              "clothing"    "TV"          average
System A      DCG = 3.36    DCG = 1.26    2.31
System B      DCG = 5.79    DCG = 1.09    3.44

Averaging raw DCG is biased towards B: B looks better on average only because the "clothing" query has a much larger DCG scale.

(Graded relevance lists shown on the slide: 2 0 1 2 2 1 0 0 0 2 and 0 2 0 0 1 2 2 0 1 2)

27 of 65

Normalized Discounted cumulative gain (nDCG)

$\text{nDCG@}p = \frac{\text{DCG@}p}{\text{IDCG@}p}$, where IDCG@p is the DCG@p of the ideal (relevance-sorted) ranking.

Ranking 1:  2 0 1 2 2 1 0 0 0 2
Ranking 2:  0 2 0 0 1 2 2 0 1 2

28 of 65

Normalized Discounted cumulative gain (nDCG)

Ranking 1:  2 0 1 2 2 1 0 0 0 2
Ranking 2:  0 2 0 0 1 2 2 0 1 2

What if the number of relevant items is less than k? Do the nDCG and IDCG change?
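A minimal sketch of DCG/nDCG using the linear-gain, log2(i + 1)-discount variant that appears to match the slides; evaluated at k = 4 it reproduces the 3.36 and 1.26 DCG values used on the "Why normalizing DCG" slide:

```python
import math

def dcg(rels, k=None):
    """DCG@k with linear gain rel_i and discount log2(i + 1), position i starting at 1.
    (The 2**rel - 1 gain variant gives the same value for binary relevance.)"""
    rels = rels[:k] if k else rels
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))

def ndcg(rels, k=None):
    """nDCG@k = DCG@k / IDCG@k; IDCG is the DCG of the ideally re-sorted ranking."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

r1 = [2, 0, 1, 2, 2, 1, 0, 0, 0, 2]   # 2 = click > 10s, 1 = click < 10s, 0 = not clicked
r2 = [0, 2, 0, 0, 1, 2, 2, 0, 1, 2]
print(round(dcg(r1, 4), 2), round(dcg(r2, 4), 2))   # 3.36 and 1.26
print(ndcg(r1, 4), ndcg(r2, 4))
```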

29 of 65

Relevance evaluation methodology

  • Offline evaluation:
    • Evaluation based on annotators’ annotation (explicit)
      • TREC conference
      • Cranfield experiments
      • Pooling
    • Evaluation based on user click through logs (implicit)

  • Online evaluation
    • A/B testing


30 of 65

Lecture 3

  • Basic evaluation metrics for an IR system
    • Precision/recall
    • MAP, MRR, NDCG

  • The Cranfield evaluation methodology
    • Pooling strategy

  • Online evaluation, A/B test

  • Relevance feedback


What if we don’t have labels?

31 of 65

Text REtrieval Conference (TREC)

  • Since 1992, hosted by NIST

  • Relevance judgments are based on human annotations
    • The relevance judgment goes beyond keyword matching

  • Different tracks for TREC
    • Web
    • Question answering
    • Microblog


32 of 65

Text REtrieval Conference (TREC)


33 of 65

The Cranfield experiment (1958)

  • Imagine you need to help users search for literature in a digital library; how would you design such a system?

[Illustration: a digital library with subject categories such as computer science, artificial intelligence, and bioinformatics]

system 1 (Boolean retrieval): query = "subject = AI & subject = bioinformatics"

34 of 65

The Cranfield experiment (1958)

  • Imagine you need to help users search for literature in a digital library; how would you design such a system?

system 2: indexing documents by lists of words

query = "artificial intelligence"

Bag-of-words representation (document-term matrix):

         intelligence  book  the  cat  artificial  dog  business
Doc1     0             1     3    1    0           1    0
Doc2     1             0     0    0    0           0    1
query    1             0     1    0    1           0    0

35 of 65

The Cranfield experiment (1958)

  • Basic ingredients
    • A corpus of documents (1.4k paper abstracts)
    • A set of 225 queries
    • Binary relevance judgment for each (q, d) pair
    • Reuse the relevance judgments for each (q, d) pair


36 of 65

Scalability problem in human annotation

  • Even a Cranfield-scale collection yields 225 x 1.4k = 315k (query, document) pairs; TREC collections are far larger

  • How can we annotate so many pairs?

  • Pooling strategy (a sketch follows below)
    • For each of the K systems, run it and take its top 100 results
    • Annotate only the union of all these top-ranked documents
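A minimal sketch of the pooling step; the ranking lists and the pool depth of 100 are illustrative:

```python
def build_pool(system_rankings, depth=100):
    """Pooling: take the union of the top-`depth` results from each of the K systems;
    only documents in this pool are sent to human annotators."""
    pool = set()
    for ranking in system_rankings:   # one ranked list of doc ids per system
        pool.update(ranking[:depth])
    return pool

# documents outside the pool are treated as non-relevant when computing metrics
pool = build_pool([["d1", "d2", "d3"], ["d2", "d4", "d5"]], depth=2)
print(pool)   # {'d1', 'd2', 'd4'}
```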

37 of 65

TREC pooling strategy


https://devblogs.microsoft.com/ise/efficient-ground-truth-generation-search-evaluation/

38 of 65

Lecture 3

  • Basic evaluation metrics for an IR system
    • Precision/recall
    • MAP, MRR, NDCG

  • The Cranfield evaluation methodology
    • Pooling strategy

  • Online evaluation, A/B test

  • Relevance feedback


39 of 65

Evaluation based on user click through logs

  • TREC-style relevance judgments
    • Explicit relevance judgments
    • Difficult to scale to large collections
    • Judgments become outdated over time

  • Relevance judgments from user clicks
    • Implicit relevance judgments
    • Collected effortlessly and at a very large scale

[Example: a Nokia phone for the query "best phone": relevance = 1 in 2012, relevance = 0 in 2023]

40 of 65

Evaluation based on user click through logs

  • Click logs for “CIKM”


the most relevant document

slides from Stanford CS276

41 of 65

Evaluation based on user click through logs

  • System logs the users engagement behaviors:
    • Time stamp
    • Session id
    • Query id, query content
    • Items viewed by the user (in sequential order)
    • Whether each item has been clicked by the user
    • User’s demographic information, search/click history, location, device
    • Dwell time, browsing time for each document
    • Eye tracking information
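A hypothetical example of what one logged record might look like; every field name and value here is made up for illustration:

```python
click_log_entry = {
    "timestamp": "2023-09-18T19:02:11Z",
    "session_id": "s-48213",
    "query_id": "q-901",
    "query": "best phone",
    "viewed_items": ["d17", "d03", "d42"],          # in the order shown to the user
    "clicked": {"d17": True, "d03": False, "d42": True},
    "dwell_time_sec": {"d17": 35.0, "d42": 4.2},    # browsing time per clicked document
    "device": "mobile",
    "location": "NJ, US",
}
```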


42 of 65

Evaluation based on user click through logs

  • Click logs are stored in large tables
  • Use SQL to extract a subset of the query logs (a sketch follows below)
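A minimal, self-contained sketch of such an extraction using SQLite from Python; the table name, schema, and rows are hypothetical:

```python
import sqlite3

# Hypothetical click-log table; schema and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE click_logs
                (timestamp TEXT, session_id TEXT, query TEXT,
                 doc_id TEXT, position INTEGER, clicked INTEGER)""")
conn.executemany("INSERT INTO click_logs VALUES (?, ?, ?, ?, ?, ?)", [
    ("2023-09-18T19:02:11Z", "s-1", "cs 589 stevens", "d17", 1, 1),
    ("2023-09-18T19:02:30Z", "s-1", "cs 589 stevens", "d03", 2, 0),
    ("2023-09-18T19:05:02Z", "s-2", "best phone",     "d42", 1, 1),
])

# Extract the subset of the log for one query
rows = conn.execute("""SELECT session_id, doc_id, position, clicked
                       FROM click_logs
                       WHERE query = ?
                       ORDER BY session_id, position""",
                    ("cs 589 stevens",)).fetchall()
print(rows)
conn.close()
```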


43 of 65

Online evaluation methodology

  • Assumption made by offline evaluation
    • After reranking, relevance judgment stays the same

  • Relevance judgment is dynamic, subject to user bias
    • Bias based on positions
    • Preference shifting over time, location
    • Decoy effects


44 of 65

Position bias [Craswell 08]

  • Position bias
    • Higher positions receive more attention
    • The same item gets fewer clicks when shown at a lower position

click

not click

An experimental comparison of click position-bias models

45 of 65

Position bias [Craswell 08]

  • Which model captures the position bias?
    • Baseline hypothesis: no position bias
    • Mixture hypothesis: relevance plus a constant position bias
    • Cascade model: the user traverses the ranking linearly, and documents below a clicked result are not examined (a simulation sketch follows below)
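A minimal simulation sketch of the cascade hypothesis; the per-position relevance probabilities are made-up numbers:

```python
import random

def cascade_click(relevance_probs, rng=random):
    """Cascade model: the user scans top-down and clicks the first result
    judged relevant; documents below the click are never examined."""
    for position, p in enumerate(relevance_probs, start=1):
        if rng.random() < p:        # the user finds this result relevant
            return position         # click here and leave
    return None                     # no click on this result page

# toy usage: relevance probabilities for a 5-result page
random.seed(0)
clicks = [cascade_click([0.2, 0.5, 0.1, 0.4, 0.3]) for _ in range(10000)]
print({pos: clicks.count(pos) for pos in (1, 2, 3, 4, 5, None)})
```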


An experimental comparison of click position-bias models

46 of 65

Position bias [Craswell 08]

  • Controlled experiment:
    • Show document A and B at position m and m+1
    • Flip the two documents
    • Four outcomes: A clicked or skipped, B clicked or skipped

  • Test the three hypotheses by comparing their predicted click probabilities with the observed click probabilities:

An experimental comparison of click position-bias models

47 of 65

Position bias [Craswell 08]

  • Results of the controlled experiment:
    • At higher ranks, the baseline model fits better
    • At lower ranks, the cascade model fits best

An experimental comparison of click position-bias models

(intuition: users carefully review the top results; further down, they find the relevant one and leave)

48 of 65

Decoy effects

[Illustration of the decoy effect: two phones, $400 / 20G and $500 / 30G, shown with and without a decoy option at $550 / 20G; the click probabilities on the slide (0.3, 0.4, 0.5, 0.5) show that adding the dominated decoy shifts clicks toward the $500 / 30G phone]

49 of 65

Online evaluation methodology

  • Evaluation by actually deploying the system and observing user responses
    • A/B testing: split user traffic to compare system A and system B

50 of 65

Statistical significance testing

  • How sure can you be that an observed difference doesn’t simply result from the particular queries you chose?

Slides from UIUC CS598

Experiment 1

Query      1     2     3     4     5     6     7     Average
System A   0.20  0.21  0.22  0.19  0.17  0.20  0.21  0.20
System B   0.40  0.41  0.42  0.39  0.37  0.40  0.41  0.40

Experiment 2

Query      1     2     3     4     5     6     7     Average
System A   0.02  0.39  0.16  0.58  0.04  0.09  0.12  0.20
System B   0.76  0.07  0.37  0.21  0.02  0.91  0.46  0.40

p-value = 0.015625 (Experiment 1)

51 of 65

Statistical significance testing

Slides from UIUC CS598

Query      1      2      3      4      5      6      7      Average
System A   0.02   0.39   0.16   0.58   0.04   0.09   0.12   0.20
System B   0.76   0.07   0.37   0.21   0.02   0.91   0.46   0.40
Sign Test  +      -      +      -      -      +      -      p = 1.0
Wilcoxon   +0.74  -0.32  +0.21  -0.37  -0.02  +0.82  -0.38  p = 0.9375

[Figure: null distribution of the Wilcoxon test statistic, centered at 0, with the 95% region of outcomes marked]

Sign test: statsmodels.stats.descriptivestats.sign_test

Wilcoxon: scipy.stats.wilcoxon
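A minimal sketch applying those two library functions to the Experiment 2 scores above; the exact p-values printed may differ slightly from the slide depending on test options:

```python
from scipy.stats import wilcoxon
from statsmodels.stats.descriptivestats import sign_test

# Per-query scores from Experiment 2
system_a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
system_b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]
diffs = [b - a for a, b in zip(system_a, system_b)]

m_stat, p_sign = sign_test(diffs, mu0=0)            # sign test on the B - A differences
w_stat, p_wilcoxon = wilcoxon(system_b, system_a)   # Wilcoxon signed-rank test

print(p_sign, p_wilcoxon)   # with only 7 queries, neither test is significant
```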

52 of 65

Dynamically Adjust the Impression Count

  • Problems with plain A/B testing:
    • Splitting traffic 50% to system A and 50% to system B will hurt user retention if one system is clearly worse
    • Solution: multi-armed bandit

Source: https://stats.stackexchange.com/questions/326449/multi-armed-bandit-problem

Upper confidence bound (UCB1): pick the arm $a$ maximizing $\bar{x}_a + \sqrt{\frac{2\ln t}{n_a}}$, where $\bar{x}_a$ is the arm's average reward so far, $n_a$ is how many times it has been played, and $t$ is the total number of plays.
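A minimal UCB1 sketch for splitting impressions between two systems; the click-through rates and the simulated rewards are illustrative assumptions:

```python
import math
import random

def ucb1(true_ctrs, rounds=10000, rng=random):
    """UCB1: route each impression to the arm with the highest
    average reward + sqrt(2 ln t / n) confidence bonus."""
    n_arms = len(true_ctrs)
    counts = [0] * n_arms            # impressions per arm
    rewards = [0.0] * n_arms         # summed clicks per arm
    for t in range(1, rounds + 1):
        if t <= n_arms:              # play every arm once first
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda a: rewards[a] / counts[a]
                                    + math.sqrt(2 * math.log(t) / counts[a]))
        counts[arm] += 1
        rewards[arm] += 1.0 if rng.random() < true_ctrs[arm] else 0.0   # simulated click
    return counts

random.seed(0)
print(ucb1([0.10, 0.13]))   # most traffic should flow to the better system
```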

53 of 65

Multi-armed bandit


Source: https://stats.stackexchange.com/questions/326449/multi-armed-bandit-problem

54 of 65

Interleaving

[Figure: the rankings from systems A and B are interleaved, duplicates are removed, and the merged list is shown to the user; clicks are credited to the system that contributed each clicked result, e.g. A clicks = 3, B clicks = 1]
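A minimal sketch of a simplified, strictly alternating team-draft-style interleaving with click attribution; real team-draft interleaving randomizes which system picks first in each round:

```python
def team_draft_interleave(ranking_a, ranking_b):
    """A and B alternately contribute their highest-ranked document not yet shown
    (duplicates removed); each shown document remembers which system contributed it.
    Stops when either list runs out of new documents on its turn."""
    interleaved, owner = [], {}
    ia, ib = iter(ranking_a), iter(ranking_b)
    turn_a = True
    while True:
        source, label = (ia, "A") if turn_a else (ib, "B")
        doc = next((d for d in source if d not in owner), None)
        if doc is None:
            break
        interleaved.append(doc)
        owner[doc] = label
        turn_a = not turn_a
    return interleaved, owner

def credit_clicks(clicked_docs, owner):
    """Attribute each click to the system that contributed the clicked document."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in owner:
            wins[owner[doc]] += 1
    return wins

shown, owner = team_draft_interleave(["d1", "d2", "d3", "d4"], ["d2", "d5", "d1", "d6"])
print(credit_clicks(["d1", "d3", "d4", "d5"], owner))
```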

55 of 65

Online evaluation methodology

  • Bing has an existing ranking algorithm A
    • We want to test whether a new algorithm B is better than A
      • Strategy 1: run A for one month, then run B for the next month
      • Strategy 2: run A on 50% of the traffic and B on the other 50%

  • Disadvantage of Strategies 1 and 2:
    • If B turns out to be worse, it hurts the experience of every user exposed to B

  • In practice: run B on 5% of the traffic and A on the remaining 95%

56 of 65

Retrieval feedback in session search

query = "best phone"

Does the user prefer lower-priced phones or high-end phones? Larger storage or a better camera?

[Illustration: two sessions over results such as "$400, 20G, Nokia", "$500, 30G, Nokia", and "$600, 40G, iPhone"; the observed clicks differ between session 1 and session 2]

    • The new query should match more relevant documents and fewer non-relevant documents

57 of 65

(Pseudo)relevance feedback language model

[Diagram: retrieve the top documents, get a document model, and infer the feedback model with the EM algorithm; the collection component of the mixture induces sparsity]

Model-based feedback in the language modeling approach to information retrieval
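A minimal sketch of the two-component mixture EM behind model-based feedback, under the assumption that each word in the feedback documents is generated either by a feedback topic model or by the collection model; the toy documents, the collection probabilities, and λ = 0.5 are illustrative:

```python
from collections import Counter

def feedback_model(feedback_docs, collection_lm, lam=0.5, iters=30):
    """EM for the two-component mixture: a word occurrence comes from the feedback
    topic model p(w|theta_F) with probability lam, or from the collection model
    p(w|C) with probability 1 - lam. Returns the estimated p(w|theta_F)."""
    counts = Counter(w for d in feedback_docs for w in d)
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}   # uniform initialization
    for _ in range(iters):
        # E-step: probability that each word occurrence came from theta_F
        z = {w: lam * theta[w] / (lam * theta[w] + (1 - lam) * collection_lm.get(w, 1e-9))
             for w in vocab}
        # M-step: re-estimate theta_F from the expected counts
        norm = sum(counts[w] * z[w] for w in vocab)
        theta = {w: counts[w] * z[w] / norm for w in vocab}
    return theta

# toy usage: common words like "the" get explained by the collection model,
# so theta_F concentrates on the discriminative feedback words
docs = [["fish", "tank", "water", "filter", "fish"],
        ["fish", "tank", "price", "the", "the"]]
coll = {"fish": 0.01, "tank": 0.01, "water": 0.01, "filter": 0.005, "price": 0.02, "the": 0.3}
print(feedback_model(docs, coll))
```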

58 of 65

Review of Lecture 1: Vector-Space model

Vector space model: rank documents by the similarity (e.g., cosine) between the query and document vectors, $\text{sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert\vec{q}\rVert\,\lVert\vec{d}\rVert}$

TF-IDF weighting: $w_{t,d} = \text{tf}_{t,d} \times \log\frac{N}{\text{df}_t}$

59 of 65

Rocchio feedback

  • Feedback for the vector-space model (a sketch follows after the formula below)

  • Rocchio's practical issues
    • Large vocabularies (only consider the most important terms)
    • Robust and effective
    • Requires relevance feedback
    • Rocchio demo: https://tinyurl.com/4bkw7cj2

Rocchio update: $\vec{q}_{new} = \alpha\,\vec{q} + \beta\,\frac{1}{|D_r|}\sum_{\vec{d} \in D_r}\vec{d} \;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d} \in D_{nr}}\vec{d}$

where $D_r$ is the set of relevant docs, $D_{nr}$ the set of non-relevant docs, and typically $\beta \gg \gamma$.
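A minimal sketch of the Rocchio update; α = 1.0, β = 0.75, γ = 0.15 are commonly used illustrative values, and the toy vectors are made up:

```python
import numpy as np

def rocchio(query_vec, rel_doc_vecs, nonrel_doc_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio feedback: move the query vector toward the centroid of the relevant
    documents and (with a much smaller weight, beta >> gamma) away from the
    centroid of the non-relevant documents."""
    q_new = alpha * np.asarray(query_vec, dtype=float)
    if len(rel_doc_vecs):
        q_new = q_new + beta * np.mean(rel_doc_vecs, axis=0)
    if len(nonrel_doc_vecs):
        q_new = q_new - gamma * np.mean(nonrel_doc_vecs, axis=0)
    return np.maximum(q_new, 0.0)   # negative term weights are usually clipped to zero

q = np.array([1.0, 0.0, 1.0, 0.0])
print(rocchio(q, [np.array([0.5, 0.2, 0.9, 0.0])], [np.array([0.0, 0.8, 0.0, 0.3])]))
```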

60 of 65

Pseudo-relevance feedback

  • What if we do not have relevance judgments?
    • Use the top retrieved documents as “pseudo relevance documents”

  • Why does pseudo-relevance feedback work?


query = “fish tank”

61 of 65

Relevance feedback in RSJ model

(Robertson & Sparck Jones 76)

RSJ term weight: $w_i = \log\frac{p_i\,(1 - q_i)}{q_i\,(1 - p_i)}$, where $p_i$ is the probability that the word appears in a relevant document and $q_i$ the probability that it appears in a non-relevant document; relevance feedback re-estimates $p_i$ and $q_i$ from the judged documents.

62 of 65

Query expansion


63 of 65

Query reformulation

  • Query expansion/reformulation techniques
    • Using manually created synonyms
    • Using automatically derived thesaurus
    • Using query log mining


manual thesaurus

automatic thesaurus

64 of 65

Open Challenge (Final Project)

  • Matching relevant hate speech sentences based on a description of the hate speech:


https://transparency.fb.com/policies/community-standards/

[XXX group people] are dirty worms. They are filthy and inhuman!

65 of 65

Summary

  • Know how to compute Prec/recall, MAP, NDCG, MRR
    • Try implementing them on your own for HW1 and reproduce the results

  • Know how the Cranfield experimental methodology and pooling work

  • Know how the feedback retrieval model works

