1 of 68


CS 589 Lecture 2

All zoom links in Canvas

Probability ranking principle

Probabilistic retrieval models

photo: https://www.scubedstudios.com/information-retrieval/

2 of 68


Instructor: Susan Liu

CA: Ahmad Ibraheem

OH: Tue 3:30-5:30

Discord by default

aibrahee@stevens.edu

OH: Thu 1-3 to be added here

Discord by default, in person on HW due weeks

3 of 68

A Brief History of IR

300 BC: Callimachus: the first library catalog

1950s: punch cards, searching at 600 cards/min

1958: Cranfield evaluation methodology; word-based indexing

1960s: building IR systems on computers; relevance feedback

1970s: TF-IDF; probability ranking principle

1980s: latent semantic indexing

1990 - now: TREC; learning to rank; web search; supporting natural language queries

4 of 68

TF-IDF demo

  • https://colab.research.google.com/drive/1ZC4F9L7lukbw_tfnvb5oVevZTppHTYlP?authuser=1


5 of 68

Review of Lecture 1: Vector-Space model


Vector space model:

TF-IDF weighting:

How to choose N?
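As a refresher, a minimal sketch of the TF-IDF weight in Python, assuming the sub-linear (1 + log tf) TF and log(N/df) IDF variant (the exact variant on the slide may differ); here N is the number of documents in the collection:

```python
import math

def tfidf(tf, df, N):
    """Sub-linear TF times IDF; tf = term count in the document,
    df = number of documents containing the term, N = collection size."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(N / df)

# At equal tf, a rare term (df = 10) outweighs a common one (df = 900):
rare = tfidf(tf=3, df=10, N=1000)
common = tfidf(tf=3, df=900, N=1000)
```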

6 of 68

Document length normalization

query = "news about presidential campaign"

Example 1: d1 (100 words) contains "news of presidential campaign", "presidential candidate", and several occurrences of "campaign"; d2 (5,000 words) mentions "news" and "presidential" only a few times, scattered through mostly unrelated text. True relevance: d1 > d2.

Example 2: d1 is a long document that begins "news of presidential campaign ... presidential candidate" and continues with much unrelated text; d2 is a short document: "presidential campaign news ... presidential candidate". True relevance: d2 > d1.

Normalizing scores by the full document length |d| over-penalizes long documents, while applying no normalization under-penalizes them. We need a normalization factor between 1 and |d|.

7 of 68

Document length pivoting

(Plot: the normalization factor as a function of document length dl; it equals 1 at dl = avgdl, is below 1 for shorter documents and above 1 for longer ones.)
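The pivoted normalization idea can be written as a concrete factor; a common form (an assumption here, the slide's exact formula may differ) interpolates between 1 and the relative document length:

```latex
\mathrm{norm}(d) = (1 - b) + b \cdot \frac{dl}{avgdl}, \qquad b \in [0, 1]
```

At dl = avgdl the factor is exactly 1; b controls how strongly length is penalized (b = 0 gives no normalization, b = 1 gives full normalization by relative length).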

8 of 68

Document length pivoting


9 of 68

Pop quiz (Canvas)


Develop your solution from here: https://tinyurl.com/422tdd7n

10 of 68

Answer

  • Recall the VS model:

  • score(q, doc1) = 1/(sqrt(2)·sqrt(2)) = 0.5000, score(q, doc2) = 1/(sqrt(2)·sqrt(4)) = 0.3536, score(q, doc3) = 2/(sqrt(2)·sqrt(9)) = 0.4714
  • Therefore the answer is C: doc1 > doc3 > doc2


  • q = “covid 19”
  • doc1 = “covid patient”
  • doc2 = “19 99 car wash”
  • doc3 = “19 street covid testing facility is reopen next week”
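The scores above can be checked mechanically; a small sketch of the vector-space cosine score over raw term-frequency vectors:

```python
import math
from collections import Counter

def cosine(q_text, d_text):
    """Cosine similarity between raw term-frequency vectors."""
    qv, dv = Counter(q_text.split()), Counter(d_text.split())
    dot = sum(qv[w] * dv[w] for w in qv)
    nq = math.sqrt(sum(v * v for v in qv.values()))
    nd = math.sqrt(sum(v * v for v in dv.values()))
    return dot / (nq * nd)

q = "covid 19"
scores = {
    "doc1": cosine(q, "covid patient"),                                        # 0.5000
    "doc2": cosine(q, "19 99 car wash"),                                       # 0.3536
    "doc3": cosine(q, "19 street covid testing facility is reopen next week"), # 0.4714
}
```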

11 of 68

Lecture 2

  • Review of fundamentals in statistics
    • Random variables, Bayes' rule, maximum likelihood estimation

  • Probability ranking principle

  • Probabilistic retrieval models
    • Model 1: Robertson & Sparck Jones model (RSJ model)
    • Model 2: BM25 model
    • Model 3: Language model-based retrieval model


12 of 68

Random variables

  • Random variables: biased coin


sequence = 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0

Bernoulli distribution:

Parameter:

Observation

sequence

13 of 68

Random variables

  • Random variables: biased coin


sequence = 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0

Bernoulli distribution:

Observation

seq

Maximum likelihood estimation

14 of 68

Maximum likelihood estimation
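For the biased-coin sequence on the previous slides, the maximum likelihood estimate is just the empirical frequency; a small sketch, using the slide's coding (0 = head) and alpha = P(head):

```python
# MLE for a Bernoulli parameter is the empirical frequency:
# maximizing alpha^h * (1 - alpha)^(n - h) gives alpha = h / n.
# Slide coding: 0 = head, 1 = tail, alpha = P(head).
sequence = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
n_heads = sequence.count(0)          # h = 7
alpha_mle = n_heads / len(sequence)  # 7/12
```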


15 of 68

Bayes' rule


Chain rule:

Bayes’ rule:

posterior

likelihood

prior

joint distribution
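Written out with the labels above (standard forms; A and B are generic events):

```latex
\text{Chain rule:}\quad p(A, B) = p(A \mid B)\, p(B)

\text{Bayes' rule:}\quad
\underbrace{p(A \mid B)}_{\text{posterior}}
= \frac{\overbrace{p(B \mid A)}^{\text{likelihood}}\;\overbrace{p(A)}^{\text{prior}}}{p(B)}
```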

16 of 68

Random variables in information retrieval


Notation: in future slides, q denotes the query, d denotes the document, and rel denotes the relevance judgment

Application: Coin toss — Observation: 0 = head, 1 = tail; Model: alpha = probability of head

Application: Information retrieval — Observation: 0 = non-relevant, 1 = relevant; Model: p(rel = 1 | q, d) = probability of relevance given q, d

17 of 68

Random variables in information retrieval


query = “artificial intelligence”

q = artificial, intelligence → word ids: 56, 100

d = artificial, intelligence, machine, intelligence, information, retrieval → word ids: 56, 100, 120, 100, 50, 221

Observation: rel = 0, 1, 0, 0, 0, 0, 0, 0

18 of 68

Probability ranking principle

  • Assume documents are labelled with 0/1 labels (i.e., the relevance judgment is either 0 or 1); given query q, documents should be ranked by their probabilities of relevance (van Rijsbergen 1979):

  • Theorem. The PRP is optimal, in the sense that it minimizes the expected loss (Ripley 1996)


PRP: rank documents by

19 of 68

Estimating p(rel = 1 | q, d)


20 of 68

Scarcity of relevance observations


q:CS589

d:Susan Liu

q:sports

q:CS589

What is p(rel = 1 | q, d) for a pair we have never observed?

21 of 68

Estimating p(rel = 1 | q, d)


Problems with this estimation?

  1. not enough data
  2. cannot adapt to new queries

Bayes' rule

22 of 68

Estimating p(rel = 1 | q, d)


Using odds to rank: helps cancel terms to simplify the calculation (next slide)

Property.

23 of 68

Estimating p(rel = 1 | q, d)


i.i.d. assumption

wi = 0 or 1

Denote

(helps cancel out terms)

24 of 68

wi = 0 or 1

query = “the artificial intelligence”

Bag-of-words representation (term counts):

        artificial  book  intelligence  cat  the  dog  business
Doc1         0        1        3         1    0    1      0
Doc2         1        0        0         0    0    0      1
query        1        0        1         0    1    0      0

25 of 68

Model 1: RSJ model


Given q and d, rank d by (Robertson & Sparck Jones 76):

Probability for a word to appear in a relevant doc for q

RSJ model solves the data scarcity problem:
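For reference, a standard statement of the RSJ ranking formula (the symbols p_i and u_i are naming assumptions here): with binary word incidence, rank d by

```latex
score(q, d) = \sum_{w_i \in q \cap d} \log \frac{p_i \,(1 - u_i)}{u_i \,(1 - p_i)},
\qquad
p_i = p(w_i = 1 \mid rel = 1, q), \quad u_i = p(w_i = 1 \mid rel = 0, q)
```

p_i is the probability that word w_i appears in a relevant document and u_i in a non-relevant one; the relevance judgments supply the estimates of both.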

26 of 68

RSJ model: Summary

  • Use Bayes' rule to solve the data scarcity problem in PRP

  • Relies on binary occurrence, does not leverage TF
    • RSJ model was designed for retrieving short text and abstract!

  • Requires relevance judgment
    • A version without relevance judgments: [Croft & Harper 79]

  • Performance is not as good as tuned vector-space model


How to improve RSJ?

27 of 68

Desiderata of retrieval models

  • Recall the desiderata of a retrieval model:

    • The importance of TF is sub-linear

    • Penalize terms with large document frequency using IDF

    • Pivoted length normalization


How to improve RSJ based on these desiderata?

28 of 68

Informative words in long documents are diluted

  • Rule 3 (tf): tf’s importance should be sublinear


  • q = “artificial intelligence”
  • d1 = “Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism”
  • d2 = “Artificial intelligence was founded as an academic discipline in 1955, artificial intelligence”

d2 is not twice as relevant as d1

29 of 68

Model 2: Okapi/BM25

  • Introduced in 1994
    • SOTA non-learning retrieval model

  • Still frequently used today


30 of 68

Okapi/BM25: Formulation


Source: https://en.wikipedia.org/wiki/Okapi_BM25

Main idea of BM25: reweighting TF using tf / (tf + k1) (a saturation function)
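A compact sketch of the Wikipedia form of BM25 (the defaults k1 = 1.2 and b = 0.75 are conventional choices, and the toy corpus is illustrative):

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """BM25: IDF times saturated TF, with pivoted document length normalization."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for w in query:
        df = sum(1 for d in corpus if w in d)            # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # non-negative IDF variant
        denom = tf[w] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[w] * (k1 + 1) / denom
    return score

corpus = [
    "artificial intelligence was founded as an academic discipline".split(),
    "the cat sat on the mat".split(),
    "machine learning is a subfield of artificial intelligence".split(),
]
q = "artificial intelligence".split()
```

Note how the saturation denominator keeps each term's contribution bounded: doubling tf far above k1 barely changes the score.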

31 of 68

Okapi/BM25: Extending RSJ by Modeling TF

  • Estimate probability using eliteness

    • eliteness is defined w.r.t. a pair (w, d)
    • a word can be
      • elite to a d
      • non-elite to a d


eliteness

32 of 68

Okapi/BM25: Extending RSJ by Modeling TF

  • Estimate probability using eliteness

    • Hidden variable between each document-term pair
    • A document-term pair is elite if the document is about the concept denoted by the term, e.g., for “artificial intelligence”, an elite word is “learning”
    • Eliteness is binary
    • Term occurrence depends on eliteness


eliteness

33 of 68

Example: “president”/“presidential” occurs in both documents below, but only the first document is about presidential elections:

Title: President Election — “...news of presidential campaign... presidential candidate... presidential campaign news...”

Title: SIT Ranking — “...president of SIT...” occurs once amid unrelated text

34 of 68

Okapi/BM25: Extending RSJ by Modeling TF


Extend the RSJ model with an eliteness score.

35 of 68

Okapi/BM25: Modeling TF using Mixture Models

  • Model the term frequency distribution as a mixture of two Poisson distributions:

Why doesn't a single Poisson model work?
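The two-Poisson mixture can be written as follows (a standard form; the symbols π, λ_e, and λ_ē are naming assumptions): a term is elite for the document with probability π, and its frequency is drawn from the Poisson with the larger rate:

```latex
p(tf = t \mid d) = \pi \, \frac{\lambda_e^{t} e^{-\lambda_e}}{t!}
  + (1 - \pi) \, \frac{\lambda_{\bar{e}}^{t} e^{-\lambda_{\bar{e}}}}{t!},
\qquad \lambda_e > \lambda_{\bar{e}}
```

A single Poisson cannot capture the observed shape: term counts concentrate near zero for non-elite documents but have a heavier tail for elite ones, so the mixture is needed.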

36 of 68

Okapi/BM25


Extend the RSJ model: model eliteness with the 2-Poisson model.

37 of 68

Okapi/BM25

  • We do not know the eliteness of each (word, document) pair

  • MLE for the Poisson parameters? Difficult to estimate, since eliteness is a hidden variable

  • Solution: design a parameter-free function that simulates the behavior of the 2-Poisson model


38 of 68

Approximating the 2-Poisson model


slide credit: Stanford CS276 Information retrieval

39 of 68

Analysis of BM25 formulation


IDF

Pivoted document length normalization

40 of 68

BM25: Summary

  • Model TF using the 2-Poisson model

  • 2-Poisson model: Hard to estimate parameter
    • Approximate the model using saturation function

  • State-of-the-art performance among non-learning models
  • Disadvantage: limited support for relevance feedback


41 of 68

BM25: Multi-field retrieval


title

question

42 of 68

BM25F

  • Each variable is estimated as the weighted sum of its per-field values

alpha_f: hyperparameters (per-field weights)
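The field combination can be sketched as follows (a standard BM25F form; the tilde notation is an assumption): per-field term frequencies are merged into one pseudo-frequency before the usual BM25 saturation is applied:

```latex
\widetilde{tf}(w, d) = \sum_{f} \alpha_f \cdot tf(w, d_f)
```

Saturating after combining fields (rather than scoring each field with BM25 separately and summing) is the key design choice: it prevents a term that is frequent in one field from being rewarded twice.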

43 of 68

Language model-based retrieval: A more flexible framework


RSJ:

LM-based Retrieval model:

44 of 68

Model 3: Language model-based retrieval

  • A language model-based retrieval method [Ponte and Croft, 1998]

  • Bernoulli -> multinomial


corpus unigram LM
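A sketch of query-likelihood ranking under a unigram language model; Jelinek-Mercer interpolation with the corpus LM (an assumption here; smoothing is covered later in the lecture) avoids zero probabilities:

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, corpus_tf, corpus_len, lam=0.5):
    """Rank d by log p(q | theta_d) under a unigram LM, with Jelinek-Mercer
    smoothing against the corpus model p(w|C) to avoid zero probabilities."""
    tf, dl = Counter(doc), len(doc)
    ll = 0.0
    for w in query:
        p_d = tf[w] / dl                 # document MLE
        p_c = corpus_tf[w] / corpus_len  # corpus unigram LM
        ll += math.log(lam * p_d + (1 - lam) * p_c)
    return ll

doc1 = "artificial intelligence machine intelligence information retrieval".split()
doc2 = "the cat sat on the mat".split()
corpus_tf = Counter(doc1 + doc2)
corpus_len = len(doc1) + len(doc2)
q = "artificial intelligence".split()
```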

45 of 68

Model 3: Language model-based retrieval


...news of presidential campaign.....

...presidential candidate...

"presidential campaign news"

46 of 68

Language model-based retrieval



Derivation: https://www.overleaf.com/read/jbztxtnbzwcx

How to estimate p_seen and p(w_i|C)?

47 of 68

Different senses of ‘model’ [Ponte and Croft, 98]

  • First sense (high level): an abstraction of the retrieval task itself

  • Second sense (mid level): modeling the distribution, e.g., 2-Poisson model

  • Third sense (low level): which statistical language model is used in the retrieval formula

47

Decouple: retrieval model + other problems (e.g., indexing)

48 of 68

Statistical language model

  • A probability distribution over word sequences
    • p(“Today is Wednesday”) ≈ 0.001
    • p(“Today Wednesday is”) ≈ 0.0000000000001
    • p(“The eigenvalue is positive”) ≈ 0.00001
  • Unigram language model
    • Generate text by generating each word INDEPENDENTLY
    • Thus, p(w1 w2 ... wn) = p(w1)p(w2)…p(wn)
    • Parameters: {p(ti)}, with p(t1)+…+p(tN) = 1 (N is the vocabulary size)
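A toy unigram LM makes the independence assumption concrete (the probabilities below are made-up). Note that, unlike the order-sensitive sequence probabilities quoted above, a unigram model assigns the same probability to any reordering of the words; distinguishing word order needs a higher-order model:

```python
import math

# Toy unigram LM over a 5-word vocabulary (made-up probabilities summing to 1).
lm = {"today": 0.1, "is": 0.3, "wednesday": 0.05, "the": 0.4, "positive": 0.15}

def seq_log_prob(words, lm):
    """Under a unigram LM, p(w1 ... wn) = p(w1) * ... * p(wn);
    summing logs avoids floating-point underflow on long sequences."""
    return sum(math.log(lm[w]) for w in words)

a = seq_log_prob(["today", "is", "wednesday"], lm)
b = seq_log_prob(["today", "wednesday", "is"], lm)  # same value: order is ignored
```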


49 of 68

Estimating p(w|d)

  • Estimate p(w|d) by maximum likelihood estimation

  • Disadvantage: if a word does not appear in d, its probability will be 0

  • Solution: language model smoothing:


(plug in Eq. 1)

Dirichlet smoothing

corpus unigram LM

50 of 68

Estimating p(w|d)

  • Dirichlet smoothing

  • Jelinek-Mercer smoothing
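Both smoothing methods can be sketched in a few lines (mu = 2000 is a conventional Dirichlet prior scale and lam = 0.5 an illustrative interpolation weight; p_C maps each word to its corpus probability p(w|C)):

```python
from collections import Counter

def p_dirichlet(w, doc, p_C, mu=2000):
    """Dirichlet smoothing: p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    tf = Counter(doc)
    return (tf[w] + mu * p_C.get(w, 0.0)) / (len(doc) + mu)

def p_jelinek_mercer(w, doc, p_C, lam=0.5):
    """Jelinek-Mercer smoothing: interpolate the document MLE with p(w|C)."""
    tf = Counter(doc)
    return lam * tf[w] / len(doc) + (1 - lam) * p_C.get(w, 0.0)

doc = "a b a".split()
p_C = {"a": 0.5, "b": 0.25, "c": 0.25}  # toy corpus LM
p_unseen = p_dirichlet("c", doc, p_C)   # > 0 although "c" is not in doc
p_a = p_jelinek_mercer("a", doc, p_C)   # 0.5 * 2/3 + 0.5 * 0.5 = 7/12
```

A design difference worth noting: Dirichlet's effective interpolation weight depends on document length (longer documents trust their own counts more), while Jelinek-Mercer uses a fixed weight.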


51 of 68

Language model-based retrieval: Summary

  • Advantages:
    • Defines a general framework; a more accurate language model can further improve the retrieval model, e.g., via relevance feedback

  • Disadvantages:
    • The assumed equivalence between query and document is unrealistic
    • Only the unigram language model has been studied


52 of 68

Feedback language model [Zhai and Lafferty 01]


Model-based feedback in the language modeling approach to IR, Zhai & Lafferty, 2001

53 of 68

Feedback language model [Zhai and Lafferty 01]


(Diagram: retrieve documents → get the document model → infer the feedback model with the EM algorithm; sparsity motivates the feedback model.)


54 of 68

Evaluation on smoothing methods [Zhai & Lafferty 02]


Two-stage language models for information retrieval

55 of 68

Evaluation on feedback LM [Zhai & Lafferty 01b]



56 of 68

Comparison between BM25 and LM [Bennett et al. 2008]


However, BM25 outperforms LM in other cases

A Comparative Study of Probabilistic and Language Models for Information Retrieval

57 of 68

Multi-field retrieval

  • HW1: BM25 outperforms TF-IDF in every field & combined


58 of 68

Equivalence to KL-divergence retrieval model

  • Kullback-Leibler divergence
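The rank equivalence can be sketched directly: the query-entropy part of KL(theta_q || theta_d) is constant across documents, so ranking by -KL is equivalent to ranking by the cross term below (the distributions here are toy dictionaries):

```python
import math

def neg_cross_kl(p_q, p_d):
    """sum_w p(w|q) * log p(w|d): rank-equivalent to -KL(theta_q || theta_d),
    since the query entropy term is the same for every document."""
    return sum(pw * math.log(p_d[w]) for w, pw in p_q.items() if pw > 0)

p_q = {"covid": 0.5, "19": 0.5}                # toy query model
p_d1 = {"covid": 0.4, "19": 0.4, "news": 0.2}  # smoothed document models
p_d2 = {"covid": 0.05, "19": 0.05, "news": 0.9}
```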


why not the opposite?


Notes on the KL-divergence retrieval formula and Dirichlet prior smoothing


59 of 68

Other smoothing methods

  • Additive smoothing

  • Good-Turing smoothing

  • Absolute discounting

  • Kneser-Ney smoothing
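Of these, additive smoothing has the simplest form (δ = 1 gives Laplace smoothing; |V| is the vocabulary size):

```latex
p_{\delta}(w \mid d) = \frac{c(w, d) + \delta}{|d| + \delta \, |V|}
```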


60 of 68

Tuning parameters in smoothing models [Zhai and Lafferty 02]

  • Tuning the parameter using the “leave-one-out” method

  • Estimating the parameter using Newton’s method (2nd derivative)


61 of 68

Tuning parameters in smoothing models [Zhai and Lafferty 02]

  • Tuning the parameter using MLE of the query probability

  • Expectation-Maximization algorithm:



62 of 68

Summary on Hyperparameter tuning

  • RSJ: no parameter

  • BM25: due to the two-Poisson formulation, parameters are difficult to estimate, so a parameter-free approximation is used instead

  • Language model
    • Leave-one-out
    • EM algorithm


63 of 68

Translation-based language model [Xue et al. 2008]

  • The retrieval model can benefit from incorporating knowledge in the formulation

  • Translation matrix:


Retrieval Models for Question and Answer Archives

64 of 68

Discussion on query length

  • What if the query is very long?

    • For example, the query is a paragraph or a document

    • The problem of retrieval is turned into a matching problem
      • i.e., semantic matching


65 of 68

Deep semantic matching [Pang et al. 2016]


(Diagram: MatchPyramid builds a matching matrix between query words w1 … wn and document words v1 … vm; each cell compares distributed word representations, e.g., word2vec.)

A Study of MatchPyramid Models on Ad-hoc Retrieval

66 of 68

Lecture 2: Summary

  • Probabilistic ranking principle

  • Probabilistic retrieval models
    • Model 1: Robertson & Sparck Jones model (RSJ model)
    • Model 2: BM25 model
    • Model 3: Language model-based retrieval model


67 of 68

Homework 1

  • Homework 1 is released in Canvas

  • Implementing TF-IDF and BM25 on the LinkSO dataset: https://github.com/guanqun-yang/cs589assignment1
    • Each question has 3 fields: title, body, and answer, e.g., java.zip

68 of 68

Homework 1

  • HW1: Use only title of qid1 to retrieve the title of qid2, using both BM25 and TF-IDF

  • Evaluation:
    • Your implementation will be evaluated by 3 metrics: MRR, NDCG@5, NDCG@10
    • Your score depends on how close your results are to the ground truth

  • Plagiarism is prohibited