1 of 68


CS 589 Lecture 2

All zoom links in Canvas

Probability ranking principle

Probabilistic retrieval models

photo: https://www.scubedstudios.com/information-retrieval/

2 of 68


Instructor: Susan Liu

CA: Ahmad Ibraheem

OH: Tue 3:30-5:30

Discord by default

aibrahee@stevens.edu

OH: Thu 1-3 to be added here

Discord by default, in person on HW due weeks

3 of 68

A Brief History of IR

300 BC: Callimachus: the first library catalog

1950s: punch cards, searching at 600 cards/min

1958: Cranfield evaluation methodology; word-based indexing

1960s: building IR systems on computers; relevance feedback

1970s: TF-IDF; probability ranking principle

1980s: latent semantic indexing

1990 - now: TREC; learning to rank; web search; supporting natural language queries

4 of 68

TF-IDF demo

  • https://colab.research.google.com/drive/1ZC4F9L7lukbw_tfnvb5oVevZTppHTYlP?authuser=1


5 of 68

Review of Lecture 1: Vector-Space model


Vector space model:

TF-IDF weighting:

How to choose N?
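As a refresher, a minimal sketch of the TF-IDF weight in Python, assuming the sub-linear (1 + log tf) TF and log(N/df) IDF variant (the exact variant on the slide may differ); here N is the number of documents in the collection:

```python
import math

def tfidf(tf, df, N):
    """Sub-linear TF times IDF; tf = term count in the document,
    df = number of documents containing the term, N = collection size."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(N / df)

# At equal tf, a rare term (df = 10) outweighs a common one (df = 900):
rare = tfidf(tf=3, df=10, N=1000)
common = tfidf(tf=3, df=900, N=1000)
```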

6 of 68

Document length normalization

query = "news about presidential campaign"

Example 1: d1 (100 words) contains "news of presidential campaign", "presidential candidate", and several occurrences of "campaign"; d2 (5,000 words) mentions "news" and "presidential" only a few times, scattered through mostly unrelated text. True relevance: d1 > d2.

Example 2: d1 is a long document that begins "news of presidential campaign ... presidential candidate" and continues with much unrelated text; d2 is a short document: "presidential campaign news ... presidential candidate". True relevance: d2 > d1.

Normalizing scores by the full document length |d| over-penalizes long documents, while applying no normalization under-penalizes them. We need a normalization factor between 1 and |d|.

7 of 68

Document length pivoting

(Plot: the normalization factor as a function of document length dl; it equals 1 at dl = avgdl, is below 1 for shorter documents and above 1 for longer ones.)
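The pivoted normalization idea can be written as a concrete factor; a common form (an assumption here, the slide's exact formula may differ) interpolates between 1 and the relative document length:

```latex
\mathrm{norm}(d) = (1 - b) + b \cdot \frac{dl}{avgdl}, \qquad b \in [0, 1]
```

At dl = avgdl the factor is exactly 1; b controls how strongly length is penalized (b = 0 gives no normalization, b = 1 gives full normalization by relative length).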

8 of 68

Document length pivoting


9 of 68

Pop quiz (Canvas)


Develop your solution from here: https://tinyurl.com/422tdd7n

10 of 68

Answer

  • Recall the VS model:

  • score(q, doc1) = 1/(sqrt(2)·sqrt(2)) = 0.5000, score(q, doc2) = 1/(sqrt(2)·sqrt(4)) = 0.3536, score(q, doc3) = 2/(sqrt(2)·sqrt(9)) = 0.4714
  • Therefore the answer is C: doc1 > doc3 > doc2


  • q = “covid 19”
  • doc1 = “covid patient”
  • doc2 = “19 99 car wash”
  • doc3 = “19 street covid testing facility is reopen next week”
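The scores above can be checked mechanically; a small sketch of the vector-space cosine score over raw term-frequency vectors:

```python
import math
from collections import Counter

def cosine(q_text, d_text):
    """Cosine similarity between raw term-frequency vectors."""
    qv, dv = Counter(q_text.split()), Counter(d_text.split())
    dot = sum(qv[w] * dv[w] for w in qv)
    nq = math.sqrt(sum(v * v for v in qv.values()))
    nd = math.sqrt(sum(v * v for v in dv.values()))
    return dot / (nq * nd)

q = "covid 19"
scores = {
    "doc1": cosine(q, "covid patient"),                                        # 0.5000
    "doc2": cosine(q, "19 99 car wash"),                                       # 0.3536
    "doc3": cosine(q, "19 street covid testing facility is reopen next week"), # 0.4714
}
```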

11 of 68

Lecture 2

  • Review of fundamentals in statistics
    • Random variables, Bayes' rule, maximum likelihood estimation

  • Probability ranking principle

  • Probabilistic retrieval models
    • Model 1: Robertson & Sparck Jones model (RSJ model)
    • Model 2: BM25 model
    • Model 3: Language model-based retrieval model


12 of 68

Random variables

  • Random variables: biased coin


sequence = 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0

Bernoulli distribution:

Parameter:

Observation

sequence

13 of 68

Random variables

  • Random variables: biased coin


sequence = 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0

Bernoulli distribution:

Observation

seq

Maximum likelihood estimation

14 of 68

Maximum likelihood estimation
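For the biased-coin sequence on the previous slides, the maximum likelihood estimate is just the empirical frequency; a small sketch, using the slide's coding (0 = head) and alpha = P(head):

```python
# MLE for a Bernoulli parameter is the empirical frequency:
# maximizing alpha^h * (1 - alpha)^(n - h) gives alpha = h / n.
# Slide coding: 0 = head, 1 = tail, alpha = P(head).
sequence = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
n_heads = sequence.count(0)          # h = 7
alpha_mle = n_heads / len(sequence)  # 7/12
```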


15 of 68

Bayes' rule


Chain rule:

Bayes’ rule:

posterior

likelihood

prior

joint distribution
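Written out with the labels above (standard forms; A and B are generic events):

```latex
\text{Chain rule:}\quad p(A, B) = p(A \mid B)\, p(B)

\text{Bayes' rule:}\quad
\underbrace{p(A \mid B)}_{\text{posterior}}
= \frac{\overbrace{p(B \mid A)}^{\text{likelihood}}\;\overbrace{p(A)}^{\text{prior}}}{p(B)}
```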

16 of 68

Random variables in information retrieval


Notation: in future slides, q denotes the query, d denotes the document, and rel denotes the relevance judgment

Application: Coin toss — Observation: 0 = head, 1 = tail; Model: alpha = probability of head

Application: Information retrieval — Observation: 0 = non-relevant, 1 = relevant; Model: p(rel = 1 | q, d) = probability of relevance given q, d

17 of 68

Random variables in information retrieval


query = “artificial intelligence”

q = artificial, intelligence → word ids: 56, 100

d = artificial, intelligence, machine, intelligence, information, retrieval → word ids: 56, 100, 120, 100, 50, 221

Observation: rel = 0, 1, 0, 0, 0, 0, 0, 0

18 of 68

Probability ranking principle

  • Assume documents are labelled with 0/1 labels (i.e., the relevance judgment is either 0 or 1); given query q, documents should be ranked by their probabilities of relevance (van Rijsbergen 1979):

  • Theorem. The PRP is optimal, in the sense that it minimizes the expected loss (Ripley 1996)


PRP: rank documents by

19 of 68

Estimating p(rel = 1 | q, d)


20 of 68

Scarcity of relevance observations


q:CS589

d:Susan Liu

q:sports

q:CS589

What is p(rel = 1 | q, d) for a pair we have never observed?

21 of 68

Estimating p(rel = 1 | q, d)


Problems with this estimation?

  1. not enough data
  2. cannot adapt to new queries

Bayes' rule

22 of 68

Estimating p(rel = 1 | q, d)


Using odds to rank: helps cancel terms to simplify the calculation (next slide)

Property.

23 of 68

Estimating p(rel = 1 | q, d)


i.i.d. assumption

wi = 0 or 1

Denote

(helps cancel out terms)

24 of 68

wi = 0 or 1

query = “the artificial intelligence”

Bag-of-words representation (term counts):

        artificial  book  intelligence  cat  the  dog  business
Doc1         0        1        3         1    0    1      0
Doc2         1        0        0         0    0    0      1
query        1        0        1         0    1    0      0

25 of 68

Model 1: RSJ model


Given q and d, rank d by (Robertson & Sparck Jones 76):

Probability for a word to appear in a relevant doc for q

RSJ model solves the data scarcity problem:
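For reference, a standard statement of the RSJ ranking formula (the symbols p_i and u_i are naming assumptions here): with binary word incidence, rank d by

```latex
score(q, d) = \sum_{w_i \in q \cap d} \log \frac{p_i \,(1 - u_i)}{u_i \,(1 - p_i)},
\qquad
p_i = p(w_i = 1 \mid rel = 1, q), \quad u_i = p(w_i = 1 \mid rel = 0, q)
```

p_i is the probability that word w_i appears in a relevant document and u_i in a non-relevant one; the relevance judgments supply the estimates of both.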

26 of 68

RSJ model: Summary

  • Use Bayes' rule to solve the data scarcity problem in PRP

  • Relies on binary occurrence, does not leverage TF
    • RSJ model was designed for retrieving short text and abstract!

  • Requires relevance judgment
    • A version without relevance judgments: [Croft & Harper 79]

  • Performance is not as good as tuned vector-space model


How to improve RSJ?

27 of 68

Desiderata of retrieval models

  • Recall the desiderata of a retrieval model:

    • The importance of TF is sub-linear

    • Penalize terms with large document frequency using IDF

    • Pivoted length normalization


How to improve RSJ based on these desiderata?

28 of 68

Informative words in long documents are diluted

  • Rule 3 (tf): tf’s importance should be sublinear


  • q = “artificial intelligence”
  • d1 = “Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism”
  • d2 = “Artificial intelligence was founded as an academic discipline in 1955, artificial intelligence”

d2 is not twice as relevant as d1

29 of 68

Model 2: Okapi/BM25

  • Introduced in 1994
    • SOTA non-learning retrieval model

  • Still frequently used today


30 of 68

Okapi/BM25: Formulation


Source: https://en.wikipedia.org/wiki/Okapi_BM25

Main idea of BM25: reweighting TF using tf / (tf + k1) (a saturation function)
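A compact sketch of the Wikipedia form of BM25 (the defaults k1 = 1.2 and b = 0.75 are conventional choices, and the toy corpus is illustrative):

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """BM25: IDF times saturated TF, with pivoted document length normalization."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for w in query:
        df = sum(1 for d in corpus if w in d)            # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # non-negative IDF variant
        denom = tf[w] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[w] * (k1 + 1) / denom
    return score

corpus = [
    "artificial intelligence was founded as an academic discipline".split(),
    "the cat sat on the mat".split(),
    "machine learning is a subfield of artificial intelligence".split(),
]
q = "artificial intelligence".split()
```

Note how the saturation denominator keeps each term's contribution bounded: doubling tf far above k1 barely changes the score.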

31 of 68

Okapi/BM25: Extending RSJ by Modeling TF

  • Estimate probability using eliteness

    • eliteness is defined w.r.t. a pair (w, d)
    • a word can be
      • elite to a d
      • non-elite to a d


eliteness

32 of 68

Okapi/BM25: Extending RSJ by Modeling TF

  • Estimate probability using eliteness

    • Hidden variable between each document-term pair
    • A document-term pair is elite if the document is about the concept denoted by the term, e.g., for “artificial intelligence”, an elite word is “learning”
    • Eliteness is binary
    • Term occurrence depends on eliteness


eliteness

33 of 68

Example: “president”/“presidential” occurs in both documents below, but only the first document is about presidential elections:

Title: President Election — “...news of presidential campaign... presidential candidate... presidential campaign news...”

Title: SIT Ranking — “...president of SIT...” occurs once amid unrelated text

34 of 68

Okapi/BM25: Extending RSJ by Modeling TF


Extend the RSJ model with an eliteness score.

35 of 68

Okapi/BM25: Modeling TF using Mixture Models

  • Model the term frequency distribution as a mixture of two Poisson distributions:

Why doesn't a single Poisson model work?
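The two-Poisson mixture can be written as follows (a standard form; the symbols π, λ_e, and λ_ē are naming assumptions): a term is elite for the document with probability π, and its frequency is drawn from the Poisson with the larger rate:

```latex
p(tf = t \mid d) = \pi \, \frac{\lambda_e^{t} e^{-\lambda_e}}{t!}
  + (1 - \pi) \, \frac{\lambda_{\bar{e}}^{t} e^{-\lambda_{\bar{e}}}}{t!},
\qquad \lambda_e > \lambda_{\bar{e}}
```

A single Poisson cannot capture the observed shape: term counts concentrate near zero for non-elite documents but have a heavier tail for elite ones, so the mixture is needed.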

36 of 68

Okapi/BM25


Extend the RSJ model: model eliteness with the 2-Poisson model.

37 of 68

Okapi/BM25

  • We do not know the eliteness of each (word, document) pair

  • MLE for the Poisson parameters? Difficult to estimate, since eliteness is a hidden variable

  • Solution: design a parameter-free function that simulates the behavior of the 2-Poisson model


38 of 68

Approximating the 2-Poisson model


slide credit: Stanford CS276 Information retrieval

39 of 68

Analysis of BM25 formulation


IDF

Pivoted document length normalization

40 of 68

BM25: Summary

  • Model TF using the 2-Poisson model

  • 2-Poisson model: Hard to estimate parameter
    • Approximate the model using saturation function

  • State-of-the-art performance among non-learning models
  • Disadvantage: limited support for relevance feedback


41 of 68

BM25: Multi-field retrieval


title

question

42 of 68

BM25F

  • Each variable is estimated as the weighted sum of its per-field values

alpha_f: hyperparameters (per-field weights)
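The field combination can be sketched as follows (a standard BM25F form; the tilde notation is an assumption): per-field term frequencies are merged into one pseudo-frequency before the usual BM25 saturation is applied:

```latex
\widetilde{tf}(w, d) = \sum_{f} \alpha_f \cdot tf(w, d_f)
```

Saturating after combining fields (rather than scoring each field with BM25 separately and summing) is the key design choice: it prevents a term that is frequent in one field from being rewarded twice.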

43 of 68

Language model-based retrieval: A more flexible framework


RSJ:

LM-based Retrieval model:

44 of 68

Model 3: Language model-based retrieval

  • A language model-based retrieval method [Ponte and Croft, 1998]

  • Bernoulli -> multinomial


corpus unigram LM
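A sketch of query-likelihood ranking under a unigram language model; Jelinek-Mercer interpolation with the corpus LM (an assumption here; smoothing is covered later in the lecture) avoids zero probabilities:

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, corpus_tf, corpus_len, lam=0.5):
    """Rank d by log p(q | theta_d) under a unigram LM, with Jelinek-Mercer
    smoothing against the corpus model p(w|C) to avoid zero probabilities."""
    tf, dl = Counter(doc), len(doc)
    ll = 0.0
    for w in query:
        p_d = tf[w] / dl                 # document MLE
        p_c = corpus_tf[w] / corpus_len  # corpus unigram LM
        ll += math.log(lam * p_d + (1 - lam) * p_c)
    return ll

doc1 = "artificial intelligence machine intelligence information retrieval".split()
doc2 = "the cat sat on the mat".split()
corpus_tf = Counter(doc1 + doc2)
corpus_len = len(doc1) + len(doc2)
q = "artificial intelligence".split()
```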

45 of 68

Model 3: Language model-based retrieval


...news of presidential campaign.....

...presidential candidate...

"presidential campaign news"

46 of 68

Language model-based retrieval



Derivation: https://www.overleaf.com/read/jbztxtnbzwcx

How to estimate p_seen and p(w_i|C)?

47 of 68

Different senses of ‘model’ [Ponte and Croft, 98]

  • First sense (high level): an abstraction of the retrieval task itself

  • Second sense (mid level): modeling the distribution, e.g., 2-Poisson model

  • Third sense (low level): which statistical language model is used in the retrieval formula

47

Decouple: retrieval model + other problems (e.g., indexing)

48 of 68

Statistical language model

  • A probability distribution over word sequences
    • p(“Today is Wednesday”) ≈ 0.001
    • p(“Today Wednesday is”) ≈ 0.0000000000001
    • p(“The eigenvalue is positive”) ≈ 0.00001
  • Unigram language model
    • Generate text by generating each word INDEPENDENTLY
    • Thus, p(w1 w2 ... wn) = p(w1)p(w2)…p(wn)
    • Parameters: {p(ti)}, with p(t1)+…+p(tN) = 1 (N is the vocabulary size)
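A toy unigram LM makes the independence assumption concrete (the probabilities below are made-up). Note that, unlike the order-sensitive sequence probabilities quoted above, a unigram model assigns the same probability to any reordering of the words; distinguishing word order needs a higher-order model:

```python
import math

# Toy unigram LM over a 5-word vocabulary (made-up probabilities summing to 1).
lm = {"today": 0.1, "is": 0.3, "wednesday": 0.05, "the": 0.4, "positive": 0.15}

def seq_log_prob(words, lm):
    """Under a unigram LM, p(w1 ... wn) = p(w1) * ... * p(wn);
    summing logs avoids floating-point underflow on long sequences."""
    return sum(math.log(lm[w]) for w in words)

a = seq_log_prob(["today", "is", "wednesday"], lm)
b = seq_log_prob(["today", "wednesday", "is"], lm)  # same value: order is ignored
```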


49 of 68

Estimating p(w|d)

  • Estimate p(w|d) by maximum likelihood estimation

  • Disadvantage: if a word does not appear in d, its probability will be 0

  • Solution: language model smoothing:


(plug in Eq. 1)

Dirichlet smoothing

corpus unigram LM

50 of 68

Estimating p(w|d)

  • Dirichlet smoothing

  • Jelinek-Mercer smoothing
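Both smoothing methods can be sketched in a few lines (mu = 2000 is a conventional Dirichlet prior scale and lam = 0.5 an illustrative interpolation weight; p_C maps each word to its corpus probability p(w|C)):

```python
from collections import Counter

def p_dirichlet(w, doc, p_C, mu=2000):
    """Dirichlet smoothing: p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    tf = Counter(doc)
    return (tf[w] + mu * p_C.get(w, 0.0)) / (len(doc) + mu)

def p_jelinek_mercer(w, doc, p_C, lam=0.5):
    """Jelinek-Mercer smoothing: interpolate the document MLE with p(w|C)."""
    tf = Counter(doc)
    return lam * tf[w] / len(doc) + (1 - lam) * p_C.get(w, 0.0)

doc = "a b a".split()
p_C = {"a": 0.5, "b": 0.25, "c": 0.25}  # toy corpus LM
p_unseen = p_dirichlet("c", doc, p_C)   # > 0 although "c" is not in doc
p_a = p_jelinek_mercer("a", doc, p_C)   # 0.5 * 2/3 + 0.5 * 0.5 = 7/12
```

A design difference worth noting: Dirichlet's effective interpolation weight depends on document length (longer documents trust their own counts more), while Jelinek-Mercer uses a fixed weight.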


51 of 68

Language model-based retrieval: Summary

  • Advantages:
    • Defines a general framework; a more accurate language model can further improve the retrieval model, e.g., via relevance feedback

  • Disadvantages:
    • The assumed equivalence between query and document is unrealistic
    • Only the unigram language model has been studied


52 of 68

Feedback language model [Zhai and Lafferty 01]


Model-based feedback in the language modeling approach to IR, Zhai & Lafferty, 2001

53 of 68

Feedback language model [Zhai and Lafferty 01]


(Diagram: retrieve documents → get the document model → infer the feedback model with the EM algorithm; sparsity motivates the feedback model.)


54 of 68

Evaluation on smoothing methods [Zhai & Lafferty 02]


Two-stage language models for information retrieval

55 of 68

Evaluation on feedback LM [Zhai & Lafferty 01b]



56 of 68

Comparison between BM25 and LM [Bennett et al. 2008]


However, BM25 outperforms LM in other cases

A Comparative Study of Probabilistic and Language Models for Information Retrieval

57 of 68

Multi-field retrieval

  • HW1: BM25 outperforms TF-IDF in every field & combined


58 of 68

Equivalence to KL-divergence retrieval model

  • Kullback-Leibler divergence
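The rank equivalence can be sketched directly: the query-entropy part of KL(theta_q || theta_d) is constant across documents, so ranking by -KL is equivalent to ranking by the cross term below (the distributions here are toy dictionaries):

```python
import math

def neg_cross_kl(p_q, p_d):
    """sum_w p(w|q) * log p(w|d): rank-equivalent to -KL(theta_q || theta_d),
    since the query entropy term is the same for every document."""
    return sum(pw * math.log(p_d[w]) for w, pw in p_q.items() if pw > 0)

p_q = {"covid": 0.5, "19": 0.5}                # toy query model
p_d1 = {"covid": 0.4, "19": 0.4, "news": 0.2}  # smoothed document models
p_d2 = {"covid": 0.05, "19": 0.05, "news": 0.9}
```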


why not the opposite?


Notes on the KL-divergence retrieval formula and Dirichlet prior smoothing


59 of 68

Other smoothing methods

  • Additive smoothing

  • Good-Turing smoothing

  • Absolute discounting

  • Kneser-Ney smoothing
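Of these, additive smoothing has the simplest form (δ = 1 gives Laplace smoothing; |V| is the vocabulary size):

```latex
p_{\delta}(w \mid d) = \frac{c(w, d) + \delta}{|d| + \delta \, |V|}
```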


60 of 68

Tuning parameters in smoothing models [Zhai and Lafferty 02]

  • Tuning the parameter using the “leave-one-out” method

  • Estimating the parameter using Newton’s method (2nd derivative)


61 of 68

Tuning parameters in smoothing models [Zhai and Lafferty 02]

  • Tuning the parameter using MLE of the query probability

  • Expectation-Maximization algorithm:



62 of 68

Summary on Hyperparameter tuning

  • RSJ: no parameter

  • BM25: due to the two-Poisson formulation, parameters are difficult to estimate, so a parameter-free approximation is used instead

  • Language model
    • Leave-one-out
    • EM algorithm


63 of 68

Translation-based language model [Xue et al. 2008]

  • The retrieval model can benefit from incorporating knowledge in the formulation

  • Translation matrix:


Retrieval Models for Question and Answer Archives

64 of 68

Discussion on query length

  • What if the query is very long?

    • For example, the query is a paragraph or a document

    • The problem of retrieval is turned into a matching problem
      • i.e., semantic matching


65 of 68

Deep semantic matching [Pang et al. 2016]


(Diagram: MatchPyramid builds a matching matrix between query words w1 … wn and document words v1 … vm; each cell compares distributed word representations, e.g., word2vec.)

A Study of MatchPyramid Models on Ad-hoc Retrieval

66 of 68

Lecture 2: Summary

  • Probabilistic ranking principle

  • Probabilistic retrieval models
    • Model 1: Robertson & Sparck Jones model (RSJ model)
    • Model 2: BM25 model
    • Model 3: Language model-based retrieval model


67 of 68

Homework 1

  • Homework 1 is released in Canvas

  • Implementing TF-IDF and BM25 on the LinkSO dataset: https://github.com/guanqun-yang/cs589assignment1
    • Each question has 3 fields: title, body, and answer, e.g., java.zip

68 of 68

Homework 1

  • HW1: Use only title of qid1 to retrieve the title of qid2, using both BM25 and TF-IDF

  • Evaluation:
    • Your implementation will be evaluated by 3 metrics: MRR, NDCG@5, NDCG@10
    • Your score depends on how close your results are to the ground truth

  • Plagiarism is prohibited