1 of 61


CS 589 Lecture 5

Monday 6:30-9:00

Kidde 228

All zoom links in Canvas

Slides adapted from Stanford CS276

Learning to rank

photo: https://www.scubedstudios.com/information-retrieval/

2 of 61

Lecture 4 Inverted Index Recap

  • Motivation: addressing the scalability challenge
    • The term-document matrix is mostly zeros; the inverted index avoids storing them

  • Query time improvement:
    • Linear time merge algorithm
    • Reducing time cost by skipping
    • Process query in order of increasing frequency

  • Index time improvement:
    • Compressing the posting pointer table, 28N -> 11N -> 9.75N
    • Compressing the posting list: storing gaps, prefix encoding
    • Speeding up dictionary search with prefix tree


3 of 61

Pop quiz (Inverted Index)

1. Which of the following are prefix encodings, i.e., encodings for which there does NOT exist a sequence that can be decoded in more than one way?

A. {a: 1, b: 11, c: 0}

B. {a: 0, b: 10, c: 110, d: 111}

C. {a: 0, b: 100, c: 101, d: 11}

D. {a: 0, b: 100, c: 10, d: 11}


4 of 61

Pop quiz (Inverted Index)

2. When we use the two-pointer merge algorithm (with skip pointers) to intersect the following posting lists, select all documents that will be skipped. Can you implement the two-pointer algorithm in Python? (A minimal sketch follows the figure below.)

[Figure: two posting lists, List A and List B, drawn with skip pointers; the docIDs shown include 2, 11, 12, 13, 14, 18, 27, 30, 40, 47, 48, 51, 52, 53, 56, 59, 60, 70, 80, and 100.]
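A minimal sketch of the linear-time two-pointer intersection from Lecture 4, without skip pointers. The example lists are placeholders, since the exact lists in the quiz figure are not fully recoverable here.

```python
def intersect(list_a, list_b):
    """Two-pointer merge (intersection) of two sorted posting lists.
    Runs in O(|A| + |B|): advance the pointer sitting on the smaller docID."""
    i, j, answer = 0, 0, []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            answer.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1          # docID in A is smaller, advance A's pointer
        else:
            j += 1          # docID in B is smaller, advance B's pointer
    return answer

# Hypothetical example lists (not necessarily the lists in the figure)
print(intersect([2, 14, 18, 27, 30, 40, 47, 51],
                [11, 12, 13, 14, 48, 51, 56, 80]))   # -> [14, 51]
```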

5 of 61

A Brief History of IR

  • 300 BC: Callimachus compiles the first library catalog
  • 1950s: punch cards, searching at 600 cards/min
  • 1958: Cranfield evaluation methodology; word-based indexing
  • 1960s: building IR systems on computers; relevance feedback
  • 1970s: TF-IDF; probability ranking principle
  • 1980s: latent semantic indexing; TREC; learning to rank
  • 1990 - now: web search; supporting natural language queries

6 of 61

Lecture 5 Overview

  • Learning to rank: using machine learning for ranking

  • Gradient boosted regression tree

  • RankNet

  • Deep learning to rank


7 of 61

Review of Lectures 1-3: Parameter estimation

[Example: query = “best phone”; candidate documents ($400, 20G, Nokia), ($500, 30G, Nokia), ($600, 40G, iPhone); observed clicks recorded over session 1 and session 2.]

    • The new query should match more relevant documents and fewer non-relevant documents

8 of 61

Relevance feedback with Rocchio's model

$\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$

where $D_r$ is the set of relevant docs, $D_{nr}$ the set of non-relevant docs, and typically $\beta \gg \gamma$.

9 of 61

Background: machine learning for diabetes prediction

Input features: X

Output label: y ∈ {1, -1}

10 of 61

Background: linear regression

  • The relationship between the input features and the output label is modeled using a linear predictor

  • Disadvantage: cannot model non-linear relationships

[Figure: diabetes vs. non-diabetes examples plotted against BMI and blood pressure, with a fitted linear model.]

11 of 61

Background: gradient descent (optimization method)

  • We have a cost function we want to minimize
  • There is a long list of optimization algorithms (for both convex and non-convex objectives); the most popular is gradient descent (a first-order method)
  • At each iteration, GD computes the gradient of the cost function, then takes a small step in the direction of the negative gradient (see the sketch below)
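A minimal sketch of the gradient descent update; the quadratic cost below is a made-up example, not from the slides.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, n_iters=100):
    """At each iteration, compute the gradient at the current point and
    take a small step (lr) in the negative gradient direction."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - lr * grad(x)
    return x

# Hypothetical example: minimize f(x) = (x1 - 3)^2 + (x2 + 1)^2,
# whose gradient is 2 * (x - [3, -1]).
grad_f = lambda x: 2 * (x - np.array([3.0, -1.0]))
print(gradient_descent(grad_f, x0=[0.0, 0.0]))   # converges toward [3, -1]
```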


12 of 61

Using machine learning to help improve IR

  • Training features:
    • Retrieval model score: cosine similarity, IDF, BM25, proximity, pivoted document length normalization, PageRank, …
    • User demographic information: User age, gender
    • Location
    • User search history

  • Training labels
    • User clicks
    • Other user activities (e.g., purchase records, satisfied (SAT) clicks)


13 of 61

Machine learning for IR came in late

  • This “good idea” has been actively researched – and actively deployed by major web search engines – over the past 10 years, e.g., XGBoost [2016]
    • Modern supervised learning has been around for 30 years, e.g., SVM [1995], gradient boosting [2003]

  • Why not earlier?


    • Limited training data (Google was popular after 2000)
    • Poor machine learning techniques
    • Not enough features for ML to show value
    • It was possible to tune parameters by hand (and people did)

14 of 61

Learning to rank benefits from web search

  • Modern (web) systems use a great number of features:
    • Log frequency of query word in anchor text?
    • Query word in color on page?
    • # of images on page?
    • # of (out) links on page?
    • PageRank of page?
    • URL length?
    • URL contains “~”?
    • Page edit recency?
    • Page loading speed
  • The New York Times (2008-06-03) quoted Amit Singhal as saying Google was using over 200 such features (“signals”) – so it’s sure to be over 500 today. ☺


15 of 61

Learning to rank as binary classification

  • For each (q, d, rel) triple:
    • Output label y = rel (binary)
    • Input feature x = (α, ω):
      • α is the cosine similarity score(q, d) of the vector space model
      • ω is the minimum query window size in the document
    • Training: apply a supervised classifier to x and y on the training data
    • Prediction: use the predicted label for retrieval (1 = relevant, 0 = non-relevant); a minimal sketch follows below
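A minimal sketch using scikit-learn's LogisticRegression as the binary classifier over the two features (α, ω). The feature values and labels below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each row is (cosine score alpha, query window size omega),
# each label is 1 (relevant) or 0 (non-relevant).
X = np.array([[0.040, 2], [0.035, 3], [0.030, 2],
              [0.010, 5], [0.008, 4], [0.005, 6]])
y = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression()      # learns a linear decision surface in (alpha, omega) space
clf.fit(X, y)

# Predict relevance of a new (query, document) pair from its two features;
# a high-cosine, small-window pair is expected to be classified as relevant (1).
print(clf.predict([[0.032, 2]]))
```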


16 of 61

Learning to rank as binary classification

[Figure: training examples plotted with term proximity ω on the x-axis (values 2-5) and cosine score α on the y-axis (up to about 0.05); R marks relevant and N marks non-relevant examples, separated by a linear decision surface.]

17 of 61

Learning to rank as regression

  • For each (q, d, r) triple:
    • Output label y = relevance judgment (multi-level)
    • Input feature x = (α, ω):
      • α is the cosine similarity of the vector space model
      • ω is the minimum query window size in the document
    • Training: apply linear regression to x and y
    • Ranking: use the predicted score for ranking (3 = highly relevant, 2 = relevant, 1 = not relevant); a minimal sketch follows below
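A minimal sketch of the regression variant: fit a linear model to graded labels and rank candidate documents by their predicted score. The feature vectors and labels are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: features (alpha, omega), graded relevance labels 1-3
X = np.array([[0.045, 2], [0.030, 3], [0.020, 4], [0.010, 5]])
y = np.array([3, 2, 2, 1])

reg = LinearRegression().fit(X, y)

# Rank candidate documents for a new query by their predicted relevance score
candidates = np.array([[0.025, 3], [0.040, 2], [0.005, 6]])
scores = reg.predict(candidates)
ranking = np.argsort(-scores)            # candidate indices, best first
print(list(zip(ranking, scores[ranking])))
```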

[Table: example (q, d) feature vectors with graded relevance labels from 1 to 3.]

18 of 61

Learning to rank as regression

[Figure: relevance level plotted over cosine score α and term proximity ω, with the fitted linear model shown as a plane.]

19 of 61

Challenge in relevance judgment

  • Exact relevance judgments are challenging
    • Exact judgments are incomparable across sessions
    • The criterion for relevance judgment may drift over time
  • Relative judgments are easier
    • i.e., for docIDs 2094 and 3191, only use the “weak supervision” that 2094 is more relevant than 3191 (instead of asserting that 2094’s relevance is exactly 3)


20 of 61

Exact judgment vs Relative judgment


Exact judgment

Relative judgment

21 of 61

Cross Entropy Loss

  • Minimizing the cross entropy loss: $C = -\big[\, y \log p + (1 - y) \log(1 - p) \,\big]$
  • y: label, p: predicted probability that y = 1


22 of 61

Pairwise learning to rank on relative label

  • For each query, the probability that document $d_i$ is more relevant than $d_j$ ($d_i > d_j$) is modeled from the score difference (as in RankNet): $P_{ij} = \frac{1}{1 + e^{-\sigma(s_i - s_j)}}$

  • For each query, do we maximize or minimize the cross entropy loss? (A minimal sketch of this pairwise loss follows below.)
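A minimal sketch of the pairwise probability and the cross entropy loss that is minimized during training; the scores and the shape parameter sigma are illustrative assumptions.

```python
import numpy as np

def pairwise_prob(s_i, s_j, sigma=1.0):
    """P(d_i more relevant than d_j), a logistic function of the score difference."""
    return 1.0 / (1.0 + np.exp(-sigma * (s_i - s_j)))

def pairwise_cross_entropy(s_i, s_j, target=1.0, sigma=1.0):
    """Cross entropy between the target probability (1.0 when d_i > d_j) and the
    model probability P_ij; this is the quantity we minimize."""
    p = pairwise_prob(s_i, s_j, sigma)
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

# Small loss because the model already scores d_i above d_j
print(pairwise_cross_entropy(s_i=2.0, s_j=0.5))   # about 0.20
```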


23 of 61

Yahoo! learning to rank challenge

Chapelle and Chang [2011]

  • Yahoo! Webscope dataset: 36,251 queries, 883k documents, 700 features, 5 ranking levels
    • Ratings: Perfect (navigational), Excellent, Good, Fair, Bad
    • Real web data
    • set-1: 473,134 feature vectors; 519 features; 19,944 queries
    • set-2: 34,815 feature vectors; 596 features; 1,266 queries
  • Winner (Burges et al. 2011) was linear combo of 12 models:
    • 8 Tree Ensembles (LambdaMART)
    • 2 LambdaRank Neural Nets
    • 2 Logistic regression models


24 of 61

Yahoo! learning to rank challenge

  • Goal was to validate learning to rank methods on a large, “real” web search problem
    • Previous work was mainly driven by LETOR datasets
      • Great as first public learning-to-rank data
      • Small: 10s of features, 100s of queries, 10k’s of docs
  • Only feature vectors released
    • Not URLs, queries, nor feature descriptions
      • Wanting to keep privacy and proprietary info safe
    • But included web graph features, click features, page freshness and page classification features as well as text match features


25 of 61

Regression tree

  • Regression tree: predicting a value rather than a class

[Figure: side-by-side examples of a decision tree (class labels at the leaves) and a regression tree (numeric values at the leaves).]

26 of 61

Regression tree

  • Tree construction: recursively split nodes to minimize the MSE
    • Choose the split feature and value that minimize the variance (standard deviation) of the target values in each child induced by the split (normally just a binary split, for easy search); see the sketch below

  • Termination: cutoff on the SD, the number of examples, or the tree depth
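A minimal sketch of the naive split search: try every (feature, threshold) pair and score the split by the total squared error of the two children. This is the quadratic-time search referred to on the next slide; the example data are made up.

```python
import numpy as np

def best_split(X, y):
    """Naive search for the best binary split of a node: O(n^2) per feature,
    since each of the ~n candidate thresholds is scored in O(n)."""
    best = (None, None, np.inf)                  # (feature, threshold, sse)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] < t], y[X[:, f] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (f, t, sse)
    return best

# Hypothetical one-feature example: targets jump from ~2 to ~6 around x = 5
X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y = np.array([2.0, 2.1, 1.9, 6.0, 6.2, 5.8])
print(best_split(X, y))    # best split is x < 6.0
```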


27 of 61

Training a regression tree

What is the algorithm for finding the splitting criterion? A naive search tries every feature and every candidate threshold and evaluates each resulting split, so the time complexity is quadratic.

[Figure: a regression tree built step by step, with splits such as x1 < v1, x1 > v3, and x2 < v4, and leaf predictions such as 1.444 and 3.625.]

Step 1: as in linear regression, search over the different input dimensions (features) for the best split.

28 of 61

Towards gradient boosting

  • Linear regression finds the parameter that minimizes the square loss: $\hat{\theta} = \arg\min_{\theta} \sum_i (y_i - \theta^{\top} x_i)^2$

  • Gradient boosting finds the function that minimizes the expected square loss: $\hat{F} = \arg\min_{F} \; \mathbb{E}\big[(y - F(x))^2\big]$


29 of 61

Towards gradient boosting

  • Regression tree minimizes the square loss by splitting each node recursively

  • Gradient boosting minimizes the square loss by defining an additive function which iteratively reduces the residual (why?)


30 of 61

Gradient boosting: estimating the parameter

  • Function parameters are iteratively fit to the training data:
    • Set F0(x) = initial guess (or zero)
    • For each m = 1, 2, …, M:
      • Compute the residuals ri = yi − Fm−1(xi) (the negative gradient of the square loss)
      • Fit a small regression tree fm to the residuals
      • Update Fm(x) = Fm−1(x) + fm(x)

  • You successively estimate and add a new tree to the sum (see the sketch below)
  • You never go back to revisit past decisions
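A minimal sketch of the boosting loop for the square loss, using scikit-learn's DecisionTreeRegressor as the small tree. The data, tree depth, and learning rate are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbrt(X, y, n_trees=3, max_depth=1, lr=1.0):
    """Gradient boosting for square loss: F0 = mean, then each new tree is fit
    to the current residuals (the negative gradient) and added to the sum."""
    f0 = y.mean()
    trees, pred = [], np.full_like(y, f0, dtype=float)
    for _ in range(n_trees):
        residual = y - pred                       # gradient of square loss = residual
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += lr * tree.predict(X)              # add the tree; never revisit past trees
        trees.append(tree)
    return f0, trees

def predict_gbrt(f0, trees, X, lr=1.0):
    return f0 + lr * sum(t.predict(X) for t in trees)

# Hypothetical data in the spirit of the worked example on the following slides
X = np.array([[1, 1], [2, 5], [6, 2], [7, 6]])
y = np.array([2.0, 5.0, 6.0, 6.0])
f0, trees = fit_gbrt(X, y)
print(predict_gbrt(f0, trees, X))   # predictions approach the training targets
```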


31 of 61

Gradient boosting: estimating the parameter

  • For least squares, the negative gradient equals the residual
  • Each added function fm is a small regression tree

32 of 61

  • In the first iteration, predict a constant value that minimizes the loss: f0(x) = the mean of the training targets

[Worked example: a table of training targets (values including 6, 5, and 2); the constant prediction is f0(x) = 3.588, the mean of all targets.]

33 of 61

  • After the first iteration, replace each target with its residual (subtracting f0(x) = 3.588; e.g., 6 becomes 6 − 3.588 = 2.41), then train a regression tree f1 on the residuals to minimize the loss

[Worked example: residuals after subtracting 3.588 (e.g., 6 → 2.41, 5 → 1.41, 2 → −1.58); the fitted tree f1(x) splits on x1 < 5.5 with leaf values −1.81 and 2.037.]

34 of 61

  • After the second iteration, update each value to its new residual (subtracting f0(x) + f1(x)), then train a regression tree f2 to minimize the loss

[Worked example: residuals after subtracting f0(x) + f1(x) (values such as 0.37, 0.22, −0.62, and −0.77); the fitted tree f2(x) splits on x2 < 3.5 with leaf values −0.427 and 0.299.]

35 of 61

Multiple additive regression tree [Friedman 1999]

Prediction = f0(x) + f1(x) + f2(x) + …, where f0(x) = 3.588, f1(x) splits on x1 < 5.5 (leaf values −1.81 and 2.037), and f2(x) splits on x2 < 3.5 (leaf values −0.427 and 0.299).

36 of 61

Multiple additive regression tree [Friedman 1999]

[Figure: the same additive model drawn as a chain of trees; at prediction time each tree is traversed (yes/no at each split) and the selected leaf values are summed.]

37 of 61

RankNet [Burges et al. 2010]

  • Minimizes the cross entropy loss: $C = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log(1 - P_{ij})$, where $\bar{P}_{ij}$ is the target probability obtained from the labels

  • Plugging in the probability $P_{ij} = \frac{1}{1 + e^{-\sigma(s_i - s_j)}}$ gives rise to: $C = \tfrac{1}{2}(1 - S_{ij})\,\sigma(s_i - s_j) + \log\big(1 + e^{-\sigma(s_i - s_j)}\big)$, with $S_{ij} \in \{-1, 0, +1\}$

38 of 61

RankNet [Burges et al. 2010]

  • Take the gradient of C w.r.t. the model weight $w_k$: $\frac{\partial C}{\partial w_k} = \frac{\partial C}{\partial s_i}\frac{\partial s_i}{\partial w_k} + \frac{\partial C}{\partial s_j}\frac{\partial s_j}{\partial w_k} = \lambda_{ij}\Big(\frac{\partial s_i}{\partial w_k} - \frac{\partial s_j}{\partial w_k}\Big)$, where $\lambda_{ij} \equiv \frac{\partial C}{\partial s_i} = \sigma\Big(\tfrac{1}{2}(1 - S_{ij}) - \frac{1}{1 + e^{\sigma(s_i - s_j)}}\Big)$

39 of 61

RankNet [Burges et al. 2010]

  • Pairwise loss -> pointwise loss: accumulating the pairwise gradients per document gives $\lambda_i = \sum_{j:(i,j)\in I} \lambda_{ij} - \sum_{j:(j,i)\in I} \lambda_{ij}$, where $I$ is the set of pairs $(i, j)$ with $d_i$ more relevant than $d_j$

  • Therefore, $\lambda_i$ is sort of a “gradient” w.r.t. document $d_i$ under query q (why under query q? because pairs are only formed between documents retrieved for the same query); a minimal sketch follows below
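A minimal sketch of aggregating the pairwise lambdas into per-document lambdas for one query, following the formulation above with sigma = 1; here lambda_i is the derivative of the cost w.r.t. the score s_i, so a negative value means the document should be pushed up (other presentations flip the sign). The scores and labels are made up.

```python
import numpy as np

def ranknet_lambdas(scores, labels, sigma=1.0):
    """Per-document lambdas for one query. For every pair where document i is more
    relevant than document j, lambda_ij = -sigma / (1 + exp(sigma * (s_i - s_j)));
    it is added to lambda_i and subtracted from lambda_j."""
    n = len(scores)
    lambdas = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:                       # pair with d_i more relevant
                lam = -sigma / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
                lambdas[i] += lam                           # negative: push d_i up
                lambdas[j] -= lam                           # positive: push d_j down
    return lambdas

# Hypothetical query with 3 documents: doc 2 is relevant but currently scored lowest,
# so it receives a large negative lambda (a strong push upward).
print(ranknet_lambdas(scores=np.array([1.5, 0.8, 0.2]), labels=np.array([0, 0, 1])))
```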

40 of 61

RankNet [Burges et al. 2010]

  • Interpreting $\lambda_i$: (a) is the perfect ranking, (b) is a ranking with 10 pairwise errors, (c) is a ranking with 8 pairwise errors. Each blue arrow represents the $\lambda_i$ for the document vector $x_i$

  • Annotation: the $\lambda$ for document 1 sums over the negative examples ranked above it, pushing document 1 higher

41 of 61

RankNet [Burges et al. 2010]

  • RankNet minimizes the cross entropy loss; alternatively, we can directly optimize the evaluation metrics, e.g., MAP [2007], NDCG [2010], and MRR
  • Modern web search tends to emphasize precision at higher rank positions


Which metric should we use to emphasize performance at higher positions? (A minimal NDCG sketch follows below.)
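A minimal sketch of NDCG@k, the graded metric that discounts gains by the log of the rank and therefore emphasizes the top positions. The example relevance labels are made up.

```python
import numpy as np

def dcg_at_k(rels, k):
    """DCG@k with graded relevance: gain (2^rel - 1) discounted by log2(rank + 1)."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, len(rels) + 2))
    return np.sum((2 ** rels - 1) / discounts)

def ndcg_at_k(rels, k):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranked list with graded labels (3 = highly relevant, 0 = not relevant)
print(ndcg_at_k([3, 2, 0, 1], k=4))   # < 1.0 because the ideal order is [3, 2, 1, 0]
```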


42 of 61

From RankNet to LambdaRank

  • RankNet assigns the same weight to all pairwise errors
  • LambdaRank: scale $\lambda_{ij}$ (the desired change in scores for the document pair $d_i$ and $d_j$) by the change in NDCG obtained by swapping the two documents: $\lambda_{ij} \leftarrow \lambda_{ij} \cdot |\Delta \mathrm{NDCG}_{ij}|$

  • Burges et al. “prove” (partly theory, partly empirically) that this change is sufficient for the model to optimize NDCG


43 of 61

From RankNet to LambdaRank

  • LambdaRank models gradients
  • MART can be trained with gradients (“gradient boosting”)
  • Combine both to get LambdaMART
    • MART with specified gradients and optimization step


44 of 61

LambdaMART [Burges et al. 2010]

  • The lambdas act as the “gradients” in RankNet
  • In MART, using these specific lambdas as the gradients, we get:
    • LambdaMART = LambdaRank + MART (gradient boosting)

Gradient boosting: use the lambdas as the pseudo-residuals that each new tree is fit to

45 of 61

Learning to rank categorization

  • Pointwise learning to rank: linear regression, SVM
  • Pairwise learning to rank: RankNet, LambdaMART
  • Listwise learning to rank (optimizing the metric score w.r.t. a ranked list): NDCGBoost


Year | Algorithm  | Category  | Description
1992 | Linear reg | Pointwise | Staged logistic regression
1999 | MART       | Pairwise  | Multiple additive regression tree
2005 | RankNet    | Pairwise  | –
2007 | SVMmap     | Listwise  | Optimizing MAP using SVM
2010 | NDCGBoost  | Listwise  | Optimizing NDCG using boosting
2016 | XGBoost    | Pairwise  | Supports various eval metrics

46 of 61

Listwise learning to rank for MAP [Yue et al. 2007]

  • Listwise learning to rank minimizes a loss defined with respect to the whole ranked list (e.g., based on MAP or NDCG)

  • SVM MAP [Yue et al. 2007]

[Equation: the SVMmap objective minimizes a regularized hinge loss that upper-bounds the MAP loss.]

47 of 61

Listwise learning to rank for NDCG [Valizadegan et al. 2009]

  • Optimizing the NDCG measure:

  • Directly optimizing the above function is infeasible, so they derived an approximation

  • Then they construct the ranking function as an ensemble, following an approach similar to boosting


48 of 61

Listwise learning to rank for NDCG [Valizadegan et al. 2009]


49 of 61

Summary

  • The idea of using machine learning for ranking has been around for a long time

  • In the last two decades, learning to rank has become a hot topic due to advances in machine learning, the availability of massive amounts of user logs, and the growth in computational power

  • The mainstream learning to rank algorithms focus on tree ensemble methods

  • Machine-learned ranking over many features now easily beats traditional hand-designed ranking functions by optimizing the evaluation metrics directly


50 of 61

Resources for learning to rank

  • Ranklib: https://sourceforge.net/p/lemur/wiki/RankLib/
  • XGBoost: https://github.com/dmlc/xgboost
  • Yahoo! Learning to rank challenge dataset
  • C. J. C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft TR 2010.
  • O. Chapelle and Y. Chang. Yahoo! Learning to Rank Challenge Overview. JMLR Proceedings 2011


51 of 61

Deep learning based retrieval

  • DSSM: Deep Structured Semantic Model (2014)
  • MatchZoo: An open-source library for retrieval (2019)
  • Dense passage retrieval (2019)


52 of 61

Deep semantic matching [Pang et al. 2016]

[Figure: a MatchPyramid-style matching matrix between query terms w1 … wn and document terms v1 … vm; each cell compares the distributed representations of the two words (word2vec).]

A Study of MatchPyramid Models on Ad-hoc Retrieval

53 of 61

aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model


54 of 61

Dense passage retrieval

  • Step 1: map each query and each document into a fixed-size embedding vector

Example: the query “cs 589 stevens” and the document “CS 589 Text Mining and Information Retrieval” are each mapped to an embedding vector.

source: https://medium.com/@aikho/deep-learning-in-information-retrieval-part-ii-dense-retrieval-1f9fecb47de9

55 of 61

Contrastive learning

  • Step 2: minimize the distance to positive (relevant) documents and maximize the distance to negative documents (a minimal sketch follows below)
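A minimal numpy sketch of a DPR-style contrastive objective: the loss is the negative log-softmax of the positive passage's similarity against the negatives (in-batch or BM25 hard negatives). The embeddings below are made-up placeholders.

```python
import numpy as np

def contrastive_loss(q_emb, pos_emb, neg_embs):
    """Negative log-softmax of the positive passage's dot-product similarity:
    minimizing this pulls the query toward its positive passage and pushes it
    away from the negative passages."""
    sims = np.array([q_emb @ pos_emb] + [q_emb @ n for n in neg_embs])
    sims = sims - sims.max()                  # subtract max for numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])                  # index 0 is the positive passage

# Hypothetical 4-d embeddings for a query, one positive passage, and two negatives
q    = np.array([0.9, 0.1, 0.0, 0.2])
pos  = np.array([0.8, 0.2, 0.1, 0.1])
negs = [np.array([0.0, 0.9, 0.3, 0.0]), np.array([0.1, 0.0, 0.9, 0.4])]
print(contrastive_loss(q, pos, negs))   # below the uniform baseline log(3) since the positive is closest
```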


source: https://medium.com/@aikho/deep-learning-in-information-retrieval-part-ii-dense-retrieval-1f9fecb47de9

56 of 61

Contrastive learning

  • Selection of positive passages:
    • The highest-ranked passage from BM25 that contains the answer

  • Adding hard negatives from BM25 helps


57 of 61

Make Deep IR Model Explainable (ongoing work)

  • TF-IDF is explainable; however, BERT-based models are not


58 of 61

Make Deep IR Model Explainable (ongoing work)


59 of 61

  • LIME: Local Interpretable Model-agnostic Explanations [Ribeiro et al. 2016]


60 of 61

Lecture 5 Summary

  • Learning to rank: using machine learning for ranking
    • feature engineering
    • casting the problem as classification / regression

  • Gradient boosted regression tree
    • Regression tree vs gradient boosting

  • RankNet
    • How the pairwise loss is reduced to per-document “gradients” (lambdas); know the derivation

  • Deep learning to rank
    • DRMM/aNMM, DPR, explainability


61 of 61

Improving sentence embedding quality

  • The key to retrieval is to improve sentence embedding quality
  • Massive text embedding benchmark (MTEB)
