CS 589 Lecture 3
Monday 6:30-9:00
Kidde 228
All Zoom links are in Canvas
IR Evaluation
(Pseudo)-relevance feedback
photo: https://www.scubedstudios.com/information-retrieval/
Review of Lecture 2: Probabilistic Ranking Principle
Review of Lecture 2: Estimating the relevance probability P(R=1 | Q, D)
Problems with this estimation? Not enough data: most query-document pairs are never observed. Bayes' rule lets us rewrite the relevance probability in terms of generative models that are easier to estimate.
Review of Lecture 2: Approximating the 2-Poisson model
slide credit: Stanford CS276 Information Retrieval
Review of Lecture 2: LM-based retrieval model
Dirichlet smoothing: p(w|d) = (c(w, d) + mu * p(w|C)) / (|d| + mu)
Pop quiz (LM-based retrieval model, attendance)
Lecture 3
Information Retrieval Techniques
retrieval models (Lectures 1, 2, 5)
inverted index (Lecture 4): searching ~50 billion pages on the web efficiently
evaluation (Lecture 3)
example query: "cs 589 stevens"
Information retrieval evaluation
Example task: build an e-commerce search engine for bikes!
Why use deep learning? "Deep learning!" by itself is not an answer: we need evaluation to tell whether it actually improves search.
Metrics used by Industry
Precision, recall
Example: for the query "cs 589 stevens", 2 of the top 3 results are relevant, so Precision@3 = 2/3.
Rank-based measurements
Precision/recall of retrieved documents
Precision = |relevant AND retrieved| / |retrieved|
Recall = |relevant AND retrieved| / |relevant|
Prec vs Recall: which one is more important?
It depends on the information need; compare the queries "TSLA stock price", "sneaker", and "1988".
Precision-recall curve
Ranked results (+ = relevant, - = non-relevant): + + - - + - - + - -, with 4 relevant documents in total.

Rank | Result | Precision@k | Recall@k
1    | +      | 1/1         | 1/4
2    | +      | 2/2         | 2/4
3    | -      | 2/3         | 2/4
5    | +      | 3/5         | 3/4
8    | +      | 4/8         | 4/4

[Figure: precision (y-axis) plotted against recall (x-axis) at each rank]
As recall increases, precision usually decreases (not always).
Slides from UIUC CS598
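A minimal sketch (Python) of how the table above is computed, assuming binary relevance and 4 relevant documents in the collection:

    # Precision@k and Recall@k for a ranked list with binary relevance.
    # rels[i] = 1 if the document at rank i+1 is relevant, else 0.
    rels = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # the + + - - + - - + - - list above
    total_relevant = 4                      # relevant docs in the whole collection

    def precision_at(k, rels):
        return sum(rels[:k]) / k

    def recall_at(k, rels, total_relevant):
        return sum(rels[:k]) / total_relevant

    for k in (1, 2, 3, 5, 8):
        print(k, precision_at(k, rels), recall_at(k, rels, total_relevant))
    # matches the table: 1/1 & 1/4, 2/2 & 2/4, 2/3 & 2/4, 3/5 & 3/4, 4/8 & 4/4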
Average precision @n
AP@n = (sum of Precision@k at every rank k <= n where a relevant document appears) / (# relevant documents), where n is the # retrieved documents considered.
Why divide by # relevant documents, not # rel&retrieved documents? So that relevant documents the system fails to retrieve still count against the score.
For the ranked list above (+ + - - + - - + - -): AP = (1/1 + 2/2 + 3/5 + 4/8) / 4 = 0.775.
Mean Average Precision @n
MAP@n = the mean of AP@n over all queries; each query's AP is normalized by that query's # relevant documents. Example: suppose there are 5 relevant documents for both query 1 and query 2; each AP is divided by 5, and the two APs are then averaged.
Slides from Stanford CS276
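A sketch of AP (and MAP) under the convention above; the second query's relevance list is a hypothetical example, not from the slide:

    # Average precision: mean of Precision@k over the ranks k where a relevant
    # document appears, normalized by the TOTAL number of relevant documents.
    def average_precision(rels, total_relevant, n=None):
        rels = rels[:n] if n else rels
        hits, prec_sum = 0, 0.0
        for k, r in enumerate(rels, start=1):
            if r:
                hits += 1
                prec_sum += hits / k
        return prec_sum / total_relevant  # unretrieved relevant docs add 0

    rels = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
    print(average_precision(rels, total_relevant=4))  # (1/1 + 2/2 + 3/5 + 4/8)/4 = 0.775

    # MAP: mean of the per-query APs (the slide assumes 5 relevant docs per query).
    q2 = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0]   # hypothetical second query
    print((average_precision(rels, 5) + average_precision(q2, 5)) / 2)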
Mean reciprocal rank
RR = 1.0 / (1.0 + rank), where rank is the position of the first relevant result and starts from 0; MRR is the mean of RR over all queries.
Slides from UVA CS4780
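A sketch matching the slide's 0-indexed convention (RR = 1.0 / (1.0 + rank)); the toy relevance lists are illustrative:

    # Reciprocal rank of the first relevant result; rank starts from 0.
    def reciprocal_rank(rels):
        for rank, r in enumerate(rels):
            if r:
                return 1.0 / (1.0 + rank)
        return 0.0  # no relevant result retrieved

    # MRR: mean of RR over queries.
    queries = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
    print(sum(reciprocal_rank(q) for q in queries) / len(queries))  # (1/2 + 1 + 0)/3 = 0.5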
Beyond binary relevance? Click > 10s?
Idea: grade relevance (e.g., grade 2 for a click with dwell time > 10 s, grade 1 for a click < 10 s, grade 0 for no click) and discount each result's gain by its rank position: roughly gain / log(position), so a result at position 2 contributes on the order of gain / log(2), while one at position 100 contributes only gain / log(100).
Discounted cumulative gain (DCG)
DCG_p = sum over i = 1..p of (2^rel_i - 1) / log2(i + 1), where position i starts from 1.
Grades from clicks: 2 = click > 10 s, 1 = click < 10 s, 0 = not click.
Example grade lists for two ranked lists:
list 1: 2 0 1 2 2 1 0 0 0 2
list 2: 0 2 0 0 1 2 2 0 1 2
For binary relevance, using 2^rel - 1 or rel itself gives the same result (2^0 - 1 = 0, 2^1 - 1 = 1).
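A sketch of DCG on the two grade lists above, assuming the common (2^rel - 1) / log2(i + 1) formulation from the formula:

    import math

    # DCG@p with graded relevance (2 = click > 10 s, 1 = click < 10 s, 0 = no click).
    def dcg(grades, p=None):
        grades = grades[:p] if p else grades
        return sum((2 ** rel - 1) / math.log2(i + 1)
                   for i, rel in enumerate(grades, start=1))  # i starts from 1

    list_1 = [2, 0, 1, 2, 2, 1, 0, 0, 0, 2]
    list_2 = [0, 2, 0, 0, 1, 2, 2, 0, 1, 2]
    print(dcg(list_1), dcg(list_2))
    # For binary grades, 2^rel - 1 equals rel, so both gain choices agree.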
Why normalizing DCG?
Two queries ("clothing", "TV") evaluated on systems A and B:

Query      | System A   | System B
"clothing" | DCG = 3.36 | DCG = 5.79
"TV"       | DCG = 1.26 | DCG = 1.09
average    | avg = 2.31 | avg = 3.44

Averaging raw DCG is biased towards B: the "clothing" query simply has more high-grade results available, so its large DCG values dominate the average.
Normalized Discounted cumulative gain (nDCG)
nDCG_p = DCG_p / IDCG_p, where IDCG_p is the DCG of the ideal ranking: the same grades sorted from highest to lowest. nDCG lies in [0, 1] and is comparable across queries.
list 1: 2 0 1 2 2 1 0 0 0 2  ->  ideal: 2 2 2 2 1 1 0 0 0 0
list 2: 0 2 0 0 1 2 2 0 1 2  ->  ideal: 2 2 2 2 1 1 0 0 0 0
Normalized Discounted cumulative gain (nDCG)
Pop question: what if # relevant items < k? Does the nDCG/IDCG change?
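A sketch of the normalization on the two grade lists above; note that once the positive grades run out, the ideal list's tail of zeros adds no gain:

    import math

    def dcg(grades):
        return sum((2 ** rel - 1) / math.log2(i + 1)
                   for i, rel in enumerate(grades, start=1))

    # nDCG@k = DCG@k / IDCG@k, where IDCG is the DCG of the ideal ordering:
    # the same grades sorted from highest to lowest.
    def ndcg(grades, k):
        idcg = dcg(sorted(grades, reverse=True)[:k])
        return dcg(grades[:k]) / idcg if idcg > 0 else 0.0

    print(ndcg([2, 0, 1, 2, 2, 1, 0, 0, 0, 2], k=10))
    print(ndcg([0, 2, 0, 0, 1, 2, 2, 0, 1, 2], k=10))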
Relevance evaluation methodology
Lecture 3
What if we don’t have labels?
Text REtrieval Conference (TREC)
TREC provides shared test collections: documents, topics (queries), and human relevance judgments.
The Cranfield experiment (1958)
System 1: the Boolean retrieval system, indexing documents by manually assigned subject labels (computer science, artificial intelligence, bioinformatics, ...).
query = "subject = AI & subject = bioinformatics"
The Cranfield experiment (1958)
System 2: indexing documents by the lists of words they contain (bag-of-words representation).
query = "artificial intelligence"
Document-term matrix:

      | intelligence | book | the | cat | artificial | dog | business
Doc1  | 0 | 1 | 3 | 1 | 0 | 1 | 0
Doc2  | 1 | 0 | 0 | 0 | 0 | 0 | 1
query | 1 | 0 | 1 | 0 | 1 | 0 | 0
The Cranfield experiment (1958)
Scalability problem in human annotation
With web-scale collections, annotators cannot judge every query-document pair.
TREC pooling strategy
Pool the top-k results from every participating system, remove duplicates, and have assessors judge only the pooled documents; everything outside the pool is treated as non-relevant.
https://devblogs.microsoft.com/ise/efficient-ground-truth-generation-search-evaluation/
Lecture 3
Evaluation based on user click-through logs
Relevance drifts over time: for query = "best phone", a Nokia page had relevance = 1 in 2012 but relevance = 0 in 2023. Editorial labels go stale; click logs provide fresh signals.
Evaluation based on user click-through logs
Naive assumption: the clicked result is the most relevant document.
slides from Stanford CS276
Online evaluation methodology
Position bias [Craswell 08]
Users click higher-ranked results more often, regardless of relevance, so raw click counts overstate the quality of top positions.
[Figure: click vs. not-click rates by rank position]
One explanation (the cascade model): the user carefully reviews results from the top, finds the relevant one, and leaves; lower results are never examined.
Paper: An experimental comparison of click position-bias models.
Decoy effects
Clicks are relative preferences, not absolute relevance judgments. Example: phone A ($400, 20G) vs. phone B ($500, 30G) draws one pattern of click probabilities; adding a decoy C ($550, 20G), an option clearly worse than A, shifts the click probabilities on A and B (the slide's figures: 0.3 and 0.4 vs. 0.5 and 0.5), even though A and B themselves did not change.
Online evaluation methodology
Statistical significance testing
Two systems can have identical average scores yet very different per-query behavior:

Experiment 1
Query    | 1    | 2    | 3    | 4    | 5    | 6    | 7    | Average
System A | 0.20 | 0.21 | 0.22 | 0.19 | 0.17 | 0.20 | 0.21 | 0.20
System B | 0.40 | 0.41 | 0.42 | 0.39 | 0.37 | 0.40 | 0.41 | 0.40
B beats A on every single query: p-value = 0.015625 (significant).

Experiment 2
Query    | 1    | 2    | 3    | 4    | 5    | 6    | 7    | Average
System A | 0.02 | 0.39 | 0.16 | 0.58 | 0.04 | 0.09 | 0.12 | 0.20
System B | 0.76 | 0.07 | 0.37 | 0.21 | 0.02 | 0.91 | 0.46 | 0.40
Same averages (0.20 vs 0.40), but the per-query results are inconsistent.

Slides from UIUC CS598
Statistical significance testing
Testing Experiment 2 two ways:

Query     | 1     | 2     | 3     | 4     | 5     | 6     | 7     |
System A  | 0.02  | 0.39  | 0.16  | 0.58  | 0.04  | 0.09  | 0.12  | avg 0.20
System B  | 0.76  | 0.07  | 0.37  | 0.21  | 0.02  | 0.91  | 0.46  | avg 0.40
Sign Test | +     | -     | +     | -     | -     | +     | -     | p = 1.0
Wilcoxon  | +0.74 | -0.32 | +0.21 | -0.37 | -0.02 | +0.82 | -0.38 | p = 0.9375

The sign test uses only the sign of each per-query difference B - A. The Wilcoxon signed-rank test also uses the magnitudes: rank the |differences|, sum the signed ranks, and check whether the sum falls outside the central 95% of outcomes of the null distribution (centered at 0). Neither test rejects the null here, so B's higher average is not significant.
Sign test: statsmodels.stats.descriptivestats.sign_test
Wilcoxon: scipy.stats.wilcoxon
Slides from UIUC CS598
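A sketch running both tests on the Experiment 2 scores with the two libraries named above; because the slide's hand-worked query-7 difference differs slightly from the table values, the Wilcoxon p-value printed here may not match 0.9375 exactly:

    from scipy.stats import wilcoxon
    from statsmodels.stats.descriptivestats import sign_test

    # Per-query scores from Experiment 2.
    system_a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
    system_b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]

    diffs = [b - a for a, b in zip(system_a, system_b)]
    m_stat, p_sign = sign_test(diffs, mu0=0)            # sign test on B - A
    w_stat, p_wilcoxon = wilcoxon(system_b, system_a)   # Wilcoxon signed-rank
    print(p_sign, p_wilcoxon)  # both far above 0.05: not significant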
Dynamically Adjust the Impression Count
Instead of splitting traffic evenly, adaptively give more impressions to the variant that currently looks better, while still exploring the others.
Upper confidence bound: UCB_j = xbar_j + sqrt(2 ln n / n_j), where xbar_j is arm j's observed mean reward, n_j its impression count, and n the total count; show the arm with the highest UCB.
Source: https://stats.stackexchange.com/questions/326449/multi-armed-bandit-problem
Multi-armed bandit
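A minimal UCB1 simulation for deciding which ranking variant ("arm") gets the next impression; the per-variant click rates are hypothetical:

    import math, random

    true_ctr = [0.10, 0.12, 0.08]   # hypothetical click rate of each variant
    clicks = [0.0, 0.0, 0.0]
    pulls = [0, 0, 0]

    for n in range(1, 10001):
        if 0 in pulls:              # show every variant once first
            j = pulls.index(0)
        else:                       # highest upper confidence bound wins
            j = max(range(3), key=lambda a: clicks[a] / pulls[a]
                    + math.sqrt(2 * math.log(n) / pulls[a]))
        pulls[j] += 1
        clicks[j] += random.random() < true_ctr[j]   # simulated impression

    print(pulls)   # most impressions flow to the best variant over time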
Interleaving
Merge the rankings of systems A and B into a single result list (removing duplicates), show it to real users, and credit each click to the system that contributed the clicked result. Example outcome: A clicks = 3, B clicks = 1, so users prefer A.
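A simplified team-draft-style interleaving sketch (one common variant; the document ids are hypothetical): the two rankings take turns contributing their next not-yet-shown result, with a coin flip whenever both have contributed equally often:

    import random

    def team_draft_interleave(ranking_a, ranking_b, k=10):
        iters = {"A": iter(ranking_a), "B": iter(ranking_b)}
        picks = {"A": 0, "B": 0}
        shown, team_of, seen = [], {}, set()
        while len(shown) < k:
            side = (min(picks, key=picks.get) if picks["A"] != picks["B"]
                    else random.choice(["A", "B"]))
            doc = next((d for d in iters[side] if d not in seen), None)
            if doc is None:                      # this ranking is exhausted
                side = "B" if side == "A" else "A"
                doc = next((d for d in iters[side] if d not in seen), None)
                if doc is None:
                    break
            seen.add(doc)
            shown.append(doc)
            team_of[doc] = side                  # remember who contributed it
            picks[side] += 1
        return shown, team_of

    shown, team_of = team_draft_interleave(["d1", "d2", "d3"], ["d3", "d4", "d5"])
    # Credit each observed click to team_of[clicked_doc]; e.g. A clicks = 3,
    # B clicks = 1 means users prefer system A.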
Online evaluation methodology
Retrieval feedback in session search
query = "best phone": does the user prefer lower-priced phones or high-end phones? Larger storage or a better camera?
Candidate results: $400, 20G, Nokia | $500, 30G, Nokia | $600, 40G, iPhone
The clicks observed in session 1 vs. session 2 reveal different preferences, which can be fed back to re-rank results within the session.
(Pseudo-)relevance feedback language model
Motivation: queries are short, so the query model is sparse. Pipeline: retrieve the top documents, get a document (feedback) model from them, and infer it with the EM algorithm, treating each feedback document as a mixture of an unknown topic model and the collection background model; the topic model is then interpolated with the original query model.
Paper: Model-based feedback in the language modeling approach to information retrieval.
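A sketch of the EM inference under that two-component mixture: every word occurrence in the feedback documents comes either from the unknown topic model theta_F or from the collection background model, with a fixed mixing weight lam (function and variable names are mine, not the paper's):

    # word_counts: word -> count in the top retrieved (feedback) documents
    # background:  word -> p(w|C), assumed > 0 for every word in word_counts
    def feedback_model(word_counts, background, lam=0.5, iters=50):
        theta = {w: 1.0 / len(word_counts) for w in word_counts}  # uniform init
        for _ in range(iters):
            # E-step: probability each occurrence of w came from the topic model
            t = {w: (1 - lam) * theta[w] /
                    ((1 - lam) * theta[w] + lam * background[w])
                 for w in word_counts}
            # M-step: re-estimate theta_F from the topic-attributed counts
            norm = sum(word_counts[w] * t[w] for w in word_counts)
            theta = {w: word_counts[w] * t[w] / norm for w in word_counts}
        return theta

    counts = {"fish": 10, "tank": 8, "the": 30, "water": 6}
    bg = {"fish": 0.001, "tank": 0.001, "the": 0.10, "water": 0.002}
    print(feedback_model(counts, bg))  # "the" gets explained by the background

The inferred theta_F is then interpolated with the original query model before re-scoring.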
Review of Lecture 1: Vector-Space model
Vector space model: score(q, d) = cosine(q, d) = (q . d) / (|q| |d|)
TF-IDF weighting: w(t, d) = tf(t, d) * log(N / df(t))
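A toy sketch of TF-IDF weighting plus cosine scoring (the term counts and document frequencies below are made up for illustration):

    import math

    # TF-IDF weight: tf(t, d) * log(N / df(t))
    def tfidf_vector(tf, df, n_docs):
        return [tf_t * math.log(n_docs / df_t) for tf_t, df_t in zip(tf, df)]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    df = [10, 200, 1000, 50]                  # doc frequency per term, N = 1000
    q = tfidf_vector([1, 0, 1, 1], df, 1000)  # a term in every doc gets weight 0
    d = tfidf_vector([3, 1, 5, 0], df, 1000)
    print(cosine(q, d))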
Rocchio feedback
q_new = alpha * q + (beta / |D_r|) * sum of d over D_r - (gamma / |D_nr|) * sum of d over D_nr
where D_r are the relevant docs and D_nr the non-relevant docs; in practice beta >> gamma, since positive feedback is more reliable.
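A sketch of the update on TF-IDF vectors, with beta >> gamma as above (alpha = 1.0, beta = 0.75, gamma = 0.15 are illustrative defaults, not from the slide):

    import numpy as np

    # Move the query toward the centroid of relevant docs and away from
    # the centroid of non-relevant docs.
    def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
        q_new = (alpha * query
                 + beta * np.mean(rel_docs, axis=0)
                 - gamma * np.mean(nonrel_docs, axis=0))
        return np.maximum(q_new, 0.0)   # negative term weights are usually clipped

    q = np.array([1.0, 0.0, 1.0, 0.0])
    rel = np.array([[1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0]])
    nonrel = np.array([[0.0, 0.0, 0.0, 1.0]])
    print(rocchio(q, rel, nonrel))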
Pseudo-relevance feedback
No user labels needed: assume the top-k retrieved documents are relevant and run the feedback update automatically. Example: query = "fish tank".
Relevance feedback in RSJ model (Robertson & Sparck Jones 76)
Weight each query term i by w_i = log [ p_i (1 - u_i) / (u_i (1 - p_i)) ], where
p_i = probability for the word to appear in a relevant doc
u_i = probability for the word to appear in a non-relevant doc
Judged documents from user feedback are used to estimate p_i and u_i.
Query expansion
Query reformulation
Expand or rewrite the query using a manual thesaurus (human-curated synonym lists) or an automatic thesaurus (e.g., built from corpus co-occurrence statistics).
Open Challenge (Final Project)
Example of content that violates community standards (https://transparency.fb.com/policies/community-standards/):
"[XXX group people] are dirty worms. They are filthy and inhuman!"
Summary