1 of 65


CS 589 Lecture 3

Monday 6:30-9:00

Kidde 228

All zoom links in Canvas

IR Evaluation

(Pseudo)-relevance feedback

photo: https://www.scubedstudios.com/information-retrieval/

2 of 65

Review of Lecture 2: Probabilistic Ranking Principle

  • Given the query q, all documents should be ranked by their probability of relevance
    • RSJ model:
      • Doesn’t leverage the TF information
      • Relies on relevance judgments
    • BM25 model
      • Estimate the probability using the 2-Poisson model
      • Obtain a parameter-free model by approximating the 2-Poisson model


3 of 65

Review of Lecture 2: Estimating

Problems with this estimation? Not enough data.

Bayes' rule

4 of 65

Review of Lecture 2: Approximating the 2-Poisson model


slide credit: Stanford CS276 Information retrieval

5 of 65

Review of Lecture 2: LM-based retrieval model

  • Given the query q, rank all documents by the probability of generating q from d

Dirichlet smoothing: $p(w \mid d) = \frac{c(w, d) + \mu\, p(w \mid C)}{|d| + \mu}$
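For review, a minimal sketch of query-likelihood scoring with Dirichlet smoothing; the toy corpus, the whitespace-style tokenization, and μ = 2000 are illustrative assumptions, not part of the slides:

```python
import math
from collections import Counter

def dirichlet_score(query_terms, doc_terms, collection_tf, collection_len, mu=2000.0):
    """Log query likelihood log p(q|d) under a Dirichlet-smoothed document language model."""
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_wc = collection_tf.get(w, 0) / collection_len         # collection model p(w|C)
        p_wd = (doc_tf.get(w, 0) + mu * p_wc) / (doc_len + mu)  # smoothed p(w|d)
        if p_wd > 0:                                            # skip out-of-vocabulary words
            score += math.log(p_wd)
    return score

# toy usage: rank two documents for the query "cs 589 stevens"
docs = [["cs", "589", "stevens", "information", "retrieval"],
        ["stevens", "campus", "map"]]
collection_tf = Counter(w for d in docs for w in d)
collection_len = sum(collection_tf.values())
query = ["cs", "589", "stevens"]
ranking = sorted(range(len(docs)),
                 key=lambda i: dirichlet_score(query, docs[i], collection_tf, collection_len),
                 reverse=True)
print(ranking)  # doc 0 should rank above doc 1
```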

6 of 65

Pop quiz (LM-based retrieval model, attendance)

  • Compute the feedback retrieval model: https://drive.google.com/file/d/1APqRRT41fLUcNtQv3IJ3WbEa7NXhAHK-/view?usp=sharing


7 of 65

Lecture 3

  • Basic evaluation metrics for an IR system
    • Precision/recall
    • MAP, MRR, NDCG

  • The Cranfield evaluation methodology
    • Pooling strategy

  • Online evaluation, A/B test

  • Relevance feedback


8 of 65

Information Retrieval Techniques

[Figure: overview of an IR system answering the query "cs 589 stevens" over roughly 50 billion web pages: inverted index (Lecture 4), retrieval models (Lectures 1, 2, 5), evaluation (Lecture 3)]

9 of 65

Information retrieval evaluation

  • After learning CS589, you graduate and join a company selling bikes

Build an e-commerce search engine for bikes!

10 of 65

Information retrieval evaluation

  • After learning CS589, you graduate and join a company selling bikes

Why use deep learning?

Deep learning!

11 of 65

Information retrieval evaluation

  • How do we know:
    • Which search algorithm should we use?
    • How can we increase user traffic next quarter?
    • How can we improve revenue?

12 of 65

Metrics used by Industry

  • The search results are correct → Precision, recall, etc.

  • Return results fast → Latency

  • Users come back → Retention rate

  • Revenue growth → Revenue

13 of 65

Precision, recall

[Figure: top results for the query "cs 589 stevens"; 2 of the top 3 results are relevant, so Precision@3 = 2/3]

14 of 65

Rank-based measurements

  • Binary relevance (1 = click, 0 = not click)
    • Precision@K
    • Mean average precision (MAP)
    • Mean reciprocal rank (MRR)

  • Multiple levels of relevance (2: click > 10s, 1: click < 10s, 0: not click)
    • Normalized discounted cumulative gain (NDCG)


15 of 65

Precision/recall of retrieved documents

  • Precision: out of all retrieved docs, how many are relevant

  • Recall: out of all relevant docs, how many are retrieved (a sketch follows below)
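A minimal sketch of these two set-based definitions; the document ids and the relevance set are illustrative:

```python
def precision(retrieved, relevant):
    """Out of all retrieved docs, how many are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Out of all relevant docs, how many are retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

# Precision@3 example from the earlier slide: 2 of the top 3 results are relevant
assert precision(["d1", "d2", "d3"], ["d1", "d3", "d7", "d9"]) == 2 / 3
```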


16 of 65

Prec vs Recall: which one is more important?


"TSLA stock price"

17 of 65

Prec vs Recall: which one is more important?


"sneaker"

18 of 65

Prec vs Recall: which one is more important?


1988

19 of 65

Precision-recall curve

Rank k   Relevant? (+)   Precision@k   Recall@k
1        +               1/1           1/4
2        +               2/2           2/4
3                        2/3           2/4
5        +               3/5           3/4
8        +               4/8           4/4

[Figure: precision-recall curve, precision on the y-axis vs. recall on the x-axis; as recall increases, precision usually decreases (not always)]

Slides from UIUC CS598

20 of 65

Average precision @n

  • Consider the rank position of each relevant and retrieved doc
    • K1, K2, … KR

  • Compute Precision@K for K = K1, K2, … KR

  • Average precision:

$\text{AP@}n = \frac{1}{R}\sum_{i:\,K_i \le n}\text{Precision@}K_i$

where $n$ is the number of retrieved documents, the $K_i$ are the rank positions of the relevant documents among the top $n$, and $R$ is the total number of relevant documents, not the number of relevant-and-retrieved documents (why?).

Rank k   Relevant? (+)   Precision@k   Recall@k
1        +               1/1           1/4
2        +               2/2           2/4
3                        2/3           2/4
5        +               3/5           3/4
8        +               4/8           4/4
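A minimal sketch of AP (and of MAP as its mean over queries), using the denominator convention above; the doc ids are illustrative and the example reproduces the ranking from the table:

```python
def average_precision(ranking, relevant):
    """AP: average of Precision@K over the ranks K of the relevant docs retrieved,
    divided by the total number of relevant documents (not just those retrieved)."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevants):
    """MAP: mean of AP over all queries."""
    return sum(average_precision(r, rel) for r, rel in zip(rankings, relevants)) / len(rankings)

# Relevant docs at ranks 1, 2, 5, 8, with 4 relevant documents in total
ranking = ["r1", "r2", "n1", "n2", "r3", "n3", "n4", "r4"]
print(average_precision(ranking, {"r1", "r2", "r3", "r4"}))  # (1/1 + 2/2 + 3/5 + 4/8) / 4 = 0.775
```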

21 of 65

Mean Average Precision @n

Slides from Stanford CS276

$\text{MAP} = \frac{1}{|Q|}\sum_{q \in Q}\text{AP}(q)$, where each per-query AP divides by that query's number of relevant documents.

(Example: suppose there are 5 relevant documents for both query 1 and query 2.)

22 of 65

Mean reciprocal rank

  • Users only need 1 relevant document

Slides from UVA CS4780

RR = 1.0 / (1.0 + rank_1), where rank_1 is the 0-indexed position of the first relevant document (equivalently, 1/rank with 1-indexed ranks).
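A minimal sketch of RR/MRR consistent with the 0-indexed convention above; the doc ids are illustrative:

```python
def reciprocal_rank(ranking, relevant):
    """RR = 1 / rank of the first relevant document (1-indexed);
    with the slide's 0-indexed rank_1 this is 1.0 / (1.0 + rank_1)."""
    for i, doc in enumerate(ranking):          # i starts from 0
        if doc in relevant:
            return 1.0 / (1.0 + i)
    return 0.0

def mean_reciprocal_rank(rankings, relevants):
    return sum(reciprocal_rank(r, rel) for r, rel in zip(rankings, relevants)) / len(rankings)

print(reciprocal_rank(["n1", "n2", "r1", "n3"], {"r1"}))  # first relevant at index 2 -> 1/3
```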

23 of 65

Beyond binary relevance? Click > 10s?

  • Discounted cumulative gain (DCG)
    • Popular measure for evaluating web search and related tasks

  • Information gain-based evaluation (economics)
    • For each relevant document, the user has gained some information
    • The higher the relevance, the higher gain
    • The gain is discounted when the relevant document appears in a lower position

[Illustration: the gain accumulates as the user moves down the ranking but is discounted by position, e.g. gain += 1/log(2) for a relevant document at position 2 versus gain += 4/log(100) at position 100]

24 of 65

Discounted cumulative gain (DCG)

Two example rankings with graded relevance labels (2 = click > 10s, 1 = click < 10s, 0 = not clicked):

Ranking 1:  2 0 1 2 2 1 0 0 0 2
Ranking 2:  0 2 0 0 1 2 2 0 1 2

$\text{DCG@}p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i + 1)}$  (position $i$ starts from 1)

For binary relevance, using the gain $2^{rel_i} - 1$ instead of $rel_i$ gives the same result.

25 of 65

Discounted cumulative gain (DCG)

Ranking 1:  2 0 1 2 2 1 0 0 0 2
Ranking 2:  0 2 0 0 1 2 2 0 1 2

(2 = click > 10s, 1 = click < 10s, 0 = not clicked; position p starts from 1)

26 of 65

Why normalizing DCG?

  • If we do not normalize DCG, the performance will be biased towards systems that perform well on queries with larger DCG scales

              "clothing"    "TV"          average
System A      DCG = 3.36    DCG = 1.26    2.31
System B      DCG = 5.79    DCG = 1.09    3.44

Averaging raw DCG is biased towards B: B looks better on average only because the "clothing" query has a much larger DCG scale.

(Graded relevance lists shown on the slide: 2 0 1 2 2 1 0 0 0 2 and 0 2 0 0 1 2 2 0 1 2)

27 of 65

Normalized Discounted cumulative gain (nDCG)

$\text{nDCG@}p = \frac{\text{DCG@}p}{\text{IDCG@}p}$, where IDCG@p is the DCG@p of the ideal (relevance-sorted) ranking.

Ranking 1:  2 0 1 2 2 1 0 0 0 2
Ranking 2:  0 2 0 0 1 2 2 0 1 2

28 of 65

Normalized Discounted cumulative gain (nDCG)

Ranking 1:  2 0 1 2 2 1 0 0 0 2
Ranking 2:  0 2 0 0 1 2 2 0 1 2

What if the number of relevant items is less than k? Do the nDCG and IDCG change?
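A minimal sketch of DCG/nDCG using the linear-gain, log2(i + 1)-discount variant that appears to match the slides; evaluated at k = 4 it reproduces the 3.36 and 1.26 DCG values used on the "Why normalizing DCG" slide:

```python
import math

def dcg(rels, k=None):
    """DCG@k with linear gain rel_i and discount log2(i + 1), position i starting at 1.
    (The 2**rel - 1 gain variant gives the same value for binary relevance.)"""
    rels = rels[:k] if k else rels
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))

def ndcg(rels, k=None):
    """nDCG@k = DCG@k / IDCG@k; IDCG is the DCG of the ideally re-sorted ranking."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

r1 = [2, 0, 1, 2, 2, 1, 0, 0, 0, 2]   # 2 = click > 10s, 1 = click < 10s, 0 = not clicked
r2 = [0, 2, 0, 0, 1, 2, 2, 0, 1, 2]
print(round(dcg(r1, 4), 2), round(dcg(r2, 4), 2))   # 3.36 and 1.26
print(ndcg(r1, 4), ndcg(r2, 4))
```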

29 of 65

Relevance evaluation methodology

  • Offline evaluation:
    • Evaluation based on annotators’ annotation (explicit)
      • TREC conference
      • Cranfield experiments
      • Pooling
    • Evaluation based on user click through logs (implicit)

  • Online evaluation
    • A/B testing


30 of 65

Lecture 3

  • Basic evaluation metrics for an IR system
    • Precision/recall
    • MAP, MRR, NDCG

  • The Cranfield evaluation methodology
    • Pooling strategy

  • Online evaluation, A/B test

  • Relevance feedback


What if we don’t have labels?

31 of 65

Text REtrieval Conference (TREC)

  • Since 1992, hosted by NIST

  • Relevance judgments are based on human annotations
    • The relevance judgment goes beyond keyword matching

  • Different tracks for TREC
    • Web
    • Question answering
    • Microblog


32 of 65

Text REtrieval Conference (TREC)


33 of 65

The Cranfield experiment (1958)

  • Imagine you need to help users search for literature in a digital library; how would you design such a system?

[Illustration: a digital library with subject categories such as computer science, artificial intelligence, and bioinformatics]

system 1 (Boolean retrieval): query = "subject = AI & subject = bioinformatics"

34 of 65

The Cranfield experiment (1958)

  • Imagine you need to help users search for literature in a digital library; how would you design such a system?

system 2: indexing documents by lists of words

query = "artificial intelligence"

Bag-of-words representation (document-term matrix):

         intelligence  book  the  cat  artificial  dog  business
Doc1     0             1     3    1    0           1    0
Doc2     1             0     0    0    0           0    1
query    1             0     1    0    1           0    0

35 of 65

The Cranfield experiment (1958)

  • Basic ingredients
    • A corpus of documents (1.4k paper abstracts)
    • A set of 225 queries
    • Binary relevance judgment for each (q, d) pair
    • Reuse the relevance judgments for each (q, d) pair


36 of 65

Scalability problem in human annotation

  • Even a Cranfield-scale collection yields 225 x 1.4k = 315k (query, document) pairs; TREC collections are far larger

  • How can we annotate so many pairs?

  • Pooling strategy (a sketch follows below)
    • For each of the K systems, run it and take its top 100 results
    • Annotate only the union of all these top-ranked documents
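A minimal sketch of the pooling step; the ranking lists and the pool depth of 100 are illustrative:

```python
def build_pool(system_rankings, depth=100):
    """Pooling: take the union of the top-`depth` results from each of the K systems;
    only documents in this pool are sent to human annotators."""
    pool = set()
    for ranking in system_rankings:   # one ranked list of doc ids per system
        pool.update(ranking[:depth])
    return pool

# documents outside the pool are treated as non-relevant when computing metrics
pool = build_pool([["d1", "d2", "d3"], ["d2", "d4", "d5"]], depth=2)
print(pool)   # {'d1', 'd2', 'd4'}
```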

37 of 65

TREC pooling strategy


https://devblogs.microsoft.com/ise/efficient-ground-truth-generation-search-evaluation/

38 of 65

Lecture 3

  • Basic evaluation metrics for an IR system
    • Precision/recall
    • MAP, MRR, NDCG

  • The Cranfield evaluation methodology
    • Pooling strategy

  • Online evaluation, A/B test

  • Relevance feedback


39 of 65

Evaluation based on user click through logs

  • TREC-style relevance judgments
    • Explicit relevance judgments
    • Difficult to scale to large collections
    • Judgments become outdated over time

  • Relevance judgments from user clicks
    • Implicit relevance judgments
    • Collected effortlessly and at a very large scale

[Example: a Nokia phone for the query "best phone": relevance = 1 in 2012, relevance = 0 in 2023]

40 of 65

Evaluation based on user click through logs

  • Click logs for “CIKM”


the most relevant document

slides from Stanford CS276

41 of 65

Evaluation based on user click through logs

  • System logs the users engagement behaviors:
    • Time stamp
    • Session id
    • Query id, query content
    • Items viewed by the user (in sequential order)
    • Whether each item has been clicked by the user
    • User’s demographic information, search/click history, location, device
    • Dwell time, browsing time for each document
    • Eye tracking information
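A hypothetical example of what one logged record might look like; every field name and value here is made up for illustration:

```python
click_log_entry = {
    "timestamp": "2023-09-18T19:02:11Z",
    "session_id": "s-48213",
    "query_id": "q-901",
    "query": "best phone",
    "viewed_items": ["d17", "d03", "d42"],          # in the order shown to the user
    "clicked": {"d17": True, "d03": False, "d42": True},
    "dwell_time_sec": {"d17": 35.0, "d42": 4.2},    # browsing time per clicked document
    "device": "mobile",
    "location": "NJ, US",
}
```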


42 of 65

Evaluation based on user click through logs

  • Click logs are stored in large tables
  • Use SQL to extract a subset of the query logs (a sketch follows below)
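A minimal, self-contained sketch of such an extraction using SQLite from Python; the table name, schema, and rows are hypothetical:

```python
import sqlite3

# Hypothetical click-log table; schema and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE click_logs
                (timestamp TEXT, session_id TEXT, query TEXT,
                 doc_id TEXT, position INTEGER, clicked INTEGER)""")
conn.executemany("INSERT INTO click_logs VALUES (?, ?, ?, ?, ?, ?)", [
    ("2023-09-18T19:02:11Z", "s-1", "cs 589 stevens", "d17", 1, 1),
    ("2023-09-18T19:02:30Z", "s-1", "cs 589 stevens", "d03", 2, 0),
    ("2023-09-18T19:05:02Z", "s-2", "best phone",     "d42", 1, 1),
])

# Extract the subset of the log for one query
rows = conn.execute("""SELECT session_id, doc_id, position, clicked
                       FROM click_logs
                       WHERE query = ?
                       ORDER BY session_id, position""",
                    ("cs 589 stevens",)).fetchall()
print(rows)
conn.close()
```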


43 of 65

Online evaluation methodology

  • Assumption made by offline evaluation
    • After reranking, relevance judgment stays the same

  • Relevance judgment is dynamic, subject to user bias
    • Bias based on positions
    • Preference shifting over time, location
    • Decoy effects


44 of 65

Position bias [Craswell 08]

  • Position bias
    • Higher positions receive more attention
    • The same item gets fewer clicks when shown at a lower position

click

not click

An experimental comparison of click position-bias models

45 of 65

Position bias [Craswell 08]

  • Which model captures the position bias?
    • Baseline hypothesis: no position bias
    • Mixture hypothesis: relevance plus a constant position bias
    • Cascade model: the user traverses the ranking linearly, and documents below a clicked result are not examined (a simulation sketch follows below)
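A minimal simulation sketch of the cascade hypothesis; the per-position relevance probabilities are made-up numbers:

```python
import random

def cascade_click(relevance_probs, rng=random):
    """Cascade model: the user scans top-down and clicks the first result
    judged relevant; documents below the click are never examined."""
    for position, p in enumerate(relevance_probs, start=1):
        if rng.random() < p:        # the user finds this result relevant
            return position         # click here and leave
    return None                     # no click on this result page

# toy usage: relevance probabilities for a 5-result page
random.seed(0)
clicks = [cascade_click([0.2, 0.5, 0.1, 0.4, 0.3]) for _ in range(10000)]
print({pos: clicks.count(pos) for pos in (1, 2, 3, 4, 5, None)})
```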


An experimental comparison of click position-bias models

46 of 65

Position bias [Craswell 08]

  • Controlled experiment:
    • Show document A and B at position m and m+1
    • Flip the two documents
    • Four outcomes: A clicked or skipped, B clicked or skipped

  • Test the three hypotheses by comparing their predicted click probabilities with the observed click probabilities:

An experimental comparison of click position-bias models

47 of 65

Position bias [Craswell 08]

  • Results of the controlled experiment:
    • At higher ranks, the baseline model fits better
    • At lower ranks, the cascade model fits best

An experimental comparison of click position-bias models

(intuition: users carefully review the top results; further down, they find the relevant one and leave)

48 of 65

Decoy effects

[Illustration of the decoy effect: two phones, $400 / 20G and $500 / 30G, shown with and without a decoy option at $550 / 20G; the click probabilities on the slide (0.3, 0.4, 0.5, 0.5) show that adding the dominated decoy shifts clicks toward the $500 / 30G phone]

49 of 65

Online evaluation methodology

  • Evaluation by actually deploying the system and observing user responses
    • A/B testing: split user traffic to compare system A and system B

50 of 65

Statistical significance testing

  • How sure can you be that an observed difference doesn’t simply result from the particular queries you chose?

Slides from UIUC CS598

Experiment 1

Query      1     2     3     4     5     6     7     Average
System A   0.20  0.21  0.22  0.19  0.17  0.20  0.21  0.20
System B   0.40  0.41  0.42  0.39  0.37  0.40  0.41  0.40

Experiment 2

Query      1     2     3     4     5     6     7     Average
System A   0.02  0.39  0.16  0.58  0.04  0.09  0.12  0.20
System B   0.76  0.07  0.37  0.21  0.02  0.91  0.46  0.40

p-value = 0.015625 (Experiment 1)

51 of 65

Statistical significance testing

Slides from UIUC CS598

Query      1      2      3      4      5      6      7      Average
System A   0.02   0.39   0.16   0.58   0.04   0.09   0.12   0.20
System B   0.76   0.07   0.37   0.21   0.02   0.91   0.46   0.40
Sign Test  +      -      +      -      -      +      -      p = 1.0
Wilcoxon   +0.74  -0.32  +0.21  -0.37  -0.02  +0.82  -0.38  p = 0.9375

[Figure: null distribution of the Wilcoxon test statistic, centered at 0, with the 95% region of outcomes marked]

Sign test: statsmodels.stats.descriptivestats.sign_test

Wilcoxon: scipy.stats.wilcoxon
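A minimal sketch applying those two library functions to the Experiment 2 scores above; the exact p-values printed may differ slightly from the slide depending on test options:

```python
from scipy.stats import wilcoxon
from statsmodels.stats.descriptivestats import sign_test

# Per-query scores from Experiment 2
system_a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
system_b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]
diffs = [b - a for a, b in zip(system_a, system_b)]

m_stat, p_sign = sign_test(diffs, mu0=0)            # sign test on the B - A differences
w_stat, p_wilcoxon = wilcoxon(system_b, system_a)   # Wilcoxon signed-rank test

print(p_sign, p_wilcoxon)   # with only 7 queries, neither test is significant
```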

52 of 65

Dynamically Adjust the Impression Count

  • Problems with plain A/B testing:
    • Splitting traffic 50% to system A and 50% to system B will hurt user retention if one system is clearly worse
    • Solution: multi-armed bandit

Source: https://stats.stackexchange.com/questions/326449/multi-armed-bandit-problem

Upper confidence bound (UCB1): pick the arm $a$ maximizing $\bar{x}_a + \sqrt{\frac{2\ln t}{n_a}}$, where $\bar{x}_a$ is the arm's average reward so far, $n_a$ is how many times it has been played, and $t$ is the total number of plays.
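A minimal UCB1 sketch for splitting impressions between two systems; the click-through rates and the simulated rewards are illustrative assumptions:

```python
import math
import random

def ucb1(true_ctrs, rounds=10000, rng=random):
    """UCB1: route each impression to the arm with the highest
    average reward + sqrt(2 ln t / n) confidence bonus."""
    n_arms = len(true_ctrs)
    counts = [0] * n_arms            # impressions per arm
    rewards = [0.0] * n_arms         # summed clicks per arm
    for t in range(1, rounds + 1):
        if t <= n_arms:              # play every arm once first
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda a: rewards[a] / counts[a]
                                    + math.sqrt(2 * math.log(t) / counts[a]))
        counts[arm] += 1
        rewards[arm] += 1.0 if rng.random() < true_ctrs[arm] else 0.0   # simulated click
    return counts

random.seed(0)
print(ucb1([0.10, 0.13]))   # most traffic should flow to the better system
```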

53 of 65

Multi-armed bandit


Source: https://stats.stackexchange.com/questions/326449/multi-armed-bandit-problem

54 of 65

Interleaving

[Figure: the rankings from systems A and B are interleaved, duplicates are removed, and the merged list is shown to the user; clicks are credited to the system that contributed each clicked result, e.g. A clicks = 3, B clicks = 1]
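A minimal sketch of a simplified, strictly alternating team-draft-style interleaving with click attribution; real team-draft interleaving randomizes which system picks first in each round:

```python
def team_draft_interleave(ranking_a, ranking_b):
    """A and B alternately contribute their highest-ranked document not yet shown
    (duplicates removed); each shown document remembers which system contributed it.
    Stops when either list runs out of new documents on its turn."""
    interleaved, owner = [], {}
    ia, ib = iter(ranking_a), iter(ranking_b)
    turn_a = True
    while True:
        source, label = (ia, "A") if turn_a else (ib, "B")
        doc = next((d for d in source if d not in owner), None)
        if doc is None:
            break
        interleaved.append(doc)
        owner[doc] = label
        turn_a = not turn_a
    return interleaved, owner

def credit_clicks(clicked_docs, owner):
    """Attribute each click to the system that contributed the clicked document."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in owner:
            wins[owner[doc]] += 1
    return wins

shown, owner = team_draft_interleave(["d1", "d2", "d3", "d4"], ["d2", "d5", "d1", "d6"])
print(credit_clicks(["d1", "d3", "d4", "d5"], owner))
```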

55 of 65

Online evaluation methodology

  • Bing has an existing ranking algorithm A
    • We want to test whether a new algorithm B is better than A
      • Strategy 1: run A for one month, then run B for the next month
      • Strategy 2: run A on 50% of the traffic and B on the other 50%

  • Disadvantage of Strategies 1 and 2:
    • If B turns out to be worse, it hurts the experience of every user exposed to B

  • In practice: run B on 5% of the traffic and A on the remaining 95%

56 of 65

Retrieval feedback in session search

query = "best phone"

Does the user prefer lower-priced phones or high-end phones? Larger storage or a better camera?

[Illustration: two sessions over results such as "$400, 20G, Nokia", "$500, 30G, Nokia", and "$600, 40G, iPhone"; the observed clicks differ between session 1 and session 2]

    • The new query should match more relevant documents and fewer non-relevant documents

57 of 65

(Pseudo)relevance feedback language model

[Diagram: retrieve the top documents, get a document model, and infer the feedback model with the EM algorithm; the collection component of the mixture induces sparsity]

Model-based feedback in the language modeling approach to information retrieval
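A minimal sketch of the two-component mixture EM behind model-based feedback, under the assumption that each word in the feedback documents is generated either by a feedback topic model or by the collection model; the toy documents, the collection probabilities, and λ = 0.5 are illustrative:

```python
from collections import Counter

def feedback_model(feedback_docs, collection_lm, lam=0.5, iters=30):
    """EM for the two-component mixture: a word occurrence comes from the feedback
    topic model p(w|theta_F) with probability lam, or from the collection model
    p(w|C) with probability 1 - lam. Returns the estimated p(w|theta_F)."""
    counts = Counter(w for d in feedback_docs for w in d)
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}   # uniform initialization
    for _ in range(iters):
        # E-step: probability that each word occurrence came from theta_F
        z = {w: lam * theta[w] / (lam * theta[w] + (1 - lam) * collection_lm.get(w, 1e-9))
             for w in vocab}
        # M-step: re-estimate theta_F from the expected counts
        norm = sum(counts[w] * z[w] for w in vocab)
        theta = {w: counts[w] * z[w] / norm for w in vocab}
    return theta

# toy usage: common words like "the" get explained by the collection model,
# so theta_F concentrates on the discriminative feedback words
docs = [["fish", "tank", "water", "filter", "fish"],
        ["fish", "tank", "price", "the", "the"]]
coll = {"fish": 0.01, "tank": 0.01, "water": 0.01, "filter": 0.005, "price": 0.02, "the": 0.3}
print(feedback_model(docs, coll))
```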

58 of 65

Review of Lecture 1: Vector-Space model

Vector space model: rank documents by the similarity (e.g., cosine) between the query and document vectors, $\text{sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert\vec{q}\rVert\,\lVert\vec{d}\rVert}$

TF-IDF weighting: $w_{t,d} = \text{tf}_{t,d} \times \log\frac{N}{\text{df}_t}$

59 of 65

Rocchio feedback

  • Feedback for the vector-space model (a sketch follows after the formula below)

  • Rocchio's practical issues
    • Large vocabularies (only consider the most important terms)
    • Robust and effective
    • Requires relevance feedback
    • Rocchio demo: https://tinyurl.com/4bkw7cj2

Rocchio update: $\vec{q}_{new} = \alpha\,\vec{q} + \beta\,\frac{1}{|D_r|}\sum_{\vec{d} \in D_r}\vec{d} \;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d} \in D_{nr}}\vec{d}$

where $D_r$ is the set of relevant docs, $D_{nr}$ the set of non-relevant docs, and typically $\beta \gg \gamma$.
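A minimal sketch of the Rocchio update; α = 1.0, β = 0.75, γ = 0.15 are commonly used illustrative values, and the toy vectors are made up:

```python
import numpy as np

def rocchio(query_vec, rel_doc_vecs, nonrel_doc_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio feedback: move the query vector toward the centroid of the relevant
    documents and (with a much smaller weight, beta >> gamma) away from the
    centroid of the non-relevant documents."""
    q_new = alpha * np.asarray(query_vec, dtype=float)
    if len(rel_doc_vecs):
        q_new = q_new + beta * np.mean(rel_doc_vecs, axis=0)
    if len(nonrel_doc_vecs):
        q_new = q_new - gamma * np.mean(nonrel_doc_vecs, axis=0)
    return np.maximum(q_new, 0.0)   # negative term weights are usually clipped to zero

q = np.array([1.0, 0.0, 1.0, 0.0])
print(rocchio(q, [np.array([0.5, 0.2, 0.9, 0.0])], [np.array([0.0, 0.8, 0.0, 0.3])]))
```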

60 of 65

Pseudo-relevance feedback

  • What if we do not have relevance judgments?
    • Use the top retrieved documents as “pseudo relevance documents”

  • Why does pseudo-relevance feedback work?


query = “fish tank”

61 of 65

Relevance feedback in RSJ model

(Robertson & Sparck Jones 76)

RSJ term weight: $w_i = \log\frac{p_i\,(1 - q_i)}{q_i\,(1 - p_i)}$, where $p_i$ is the probability that the word appears in a relevant document and $q_i$ the probability that it appears in a non-relevant document; relevance feedback re-estimates $p_i$ and $q_i$ from the judged documents.

62 of 65

Query expansion


63 of 65

Query reformulation

  • Query expansion/reformulation techniques
    • Using manually created synonyms
    • Using automatically derived thesaurus
    • Using query log mining


manual thesaurus

automatic thesaurus

64 of 65

Open Challenge (Final Project)

  • Matching relevant hate speech sentences based on a description of the hate speech:


https://transparency.fb.com/policies/community-standards/

[XXX group people] are dirty worms. They are filthy and inhuman!

65 of 65

Summary

  • Know how to compute Prec/recall, MAP, NDCG, MRR
    • Try implementing them on your own for HW1 and reproduce the results

  • Know how the Cranfield experimental methodology and pooling work

  • Know how the feedback retrieval model works

