1 of 61


CS 589 Lecture 5

Monday 6:30-9:00

Kidde 228

All zoom links in Canvas

Slides adapted from Stanford CS276

Learning to rank

photo: https://www.scubedstudios.com/information-retrieval/

2 of 61

Lecture 4 Inverted Index Recap

  • Motivation: addressing the scalability challenge
    • The term-document matrix is mostly zeros; the inverted index avoids storing them

  • Query time improvement:
    • Linear time merge algorithm
    • Reducing time cost by skipping
    • Process query in order of increasing frequency

  • Index time improvement:
    • Compressing the posting pointer table, 28N -> 11N -> 9.75N
    • Compressing the posting list: storing gaps, prefix encoding
    • Speeding up dictionary search with prefix tree


3 of 61

Pop quiz (Inverted Index)

1. Which of the following are prefix encodings, i.e., encodings for which there does NOT exist a sequence that can be decoded in more than one way?

A. {a: 1, b: 11, c: 0}

B. {a: 0, b: 10, c: 110, d: 111}

C. {a: 0, b: 100, c: 101, d: 11}

D. {a: 0, b: 100, c: 10, d: 11}


4 of 61

Pop quiz (Inverted Index)

2. When we use the two-pointer merge algorithm (with skip pointers) to intersect the following posting lists, select all documents that will be skipped. Can you implement the two-pointer algorithm in Python? (A minimal sketch follows the figure below.)

[Figure: two posting lists, List A and List B, drawn with skip pointers; the docIDs shown include 2, 11, 12, 13, 14, 18, 27, 30, 40, 47, 48, 51, 52, 53, 56, 59, 60, 70, 80, and 100.]
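A minimal sketch of the linear-time two-pointer intersection from Lecture 4, without skip pointers. The example lists are placeholders, since the exact lists in the quiz figure are not fully recoverable here.

```python
def intersect(list_a, list_b):
    """Two-pointer merge (intersection) of two sorted posting lists.
    Runs in O(|A| + |B|): advance the pointer sitting on the smaller docID."""
    i, j, answer = 0, 0, []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            answer.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1          # docID in A is smaller, advance A's pointer
        else:
            j += 1          # docID in B is smaller, advance B's pointer
    return answer

# Hypothetical example lists (not necessarily the lists in the figure)
print(intersect([2, 14, 18, 27, 30, 40, 47, 51],
                [11, 12, 13, 14, 48, 51, 56, 80]))   # -> [14, 51]
```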

5 of 61

A Brief History of IR

  • 300 BC: Callimachus compiles the first library catalog
  • 1950s: punch cards, searching at 600 cards/min
  • 1958: Cranfield evaluation methodology; word-based indexing
  • 1960s: building IR systems on computers; relevance feedback
  • 1970s: TF-IDF; probability ranking principle
  • 1980s: latent semantic indexing; TREC; learning to rank
  • 1990 - now: web search; supporting natural language queries

6 of 61

Lecture 5 Overview

  • Learning to rank: using machine learning for ranking

  • Gradient boosted regression tree

  • RankNet

  • Deep learning to rank


7 of 61

Review of Lectures 1-3: Parameter estimation

[Example: query = “best phone”; candidate documents ($400, 20G, Nokia), ($500, 30G, Nokia), ($600, 40G, iPhone); observed clicks recorded over session 1 and session 2.]

    • The new query should match more relevant documents and fewer non-relevant documents

8 of 61

Relevance feedback with Rocchio's model

$\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$

where $D_r$ is the set of relevant docs, $D_{nr}$ the set of non-relevant docs, and typically $\beta \gg \gamma$.

9 of 61

Background: machine learning for diabetes prediction

Input features: X

Output label: y ∈ {1, -1}

10 of 61

Background: linear regression

  • The relationship between the input features and the output label is modeled using a linear predictor

  • Disadvantage: cannot model non-linear relationships

[Figure: diabetes vs. non-diabetes examples plotted against BMI and blood pressure, with a fitted linear model.]

11 of 61

Background: gradient descent (optimization method)

  • We have a cost function we want to minimize
  • There is a long list of optimization algorithms (for both convex and non-convex objectives); the most popular is gradient descent (a first-order method)
  • At each iteration, GD computes the gradient of the cost function, then takes a small step in the direction of the negative gradient (see the sketch below)
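A minimal sketch of the gradient descent update; the quadratic cost below is a made-up example, not from the slides.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, n_iters=100):
    """At each iteration, compute the gradient at the current point and
    take a small step (lr) in the negative gradient direction."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - lr * grad(x)
    return x

# Hypothetical example: minimize f(x) = (x1 - 3)^2 + (x2 + 1)^2,
# whose gradient is 2 * (x - [3, -1]).
grad_f = lambda x: 2 * (x - np.array([3.0, -1.0]))
print(gradient_descent(grad_f, x0=[0.0, 0.0]))   # converges toward [3, -1]
```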


12 of 61

Using machine learning to help improve IR

  • Training features:
    • Retrieval model score: cosine similarity, IDF, BM25, proximity, pivoted document length normalization, PageRank, …
    • User demographic information: User age, gender
    • Location
    • User search history

  • Training labels
    • User clicks
    • Other user activities (e.g., purchase records, satisfied (SAT) clicks)


13 of 61

Machine learning for IR came in late

  • This “good idea” has been actively researched – and actively deployed by major web search engines – over the past 10 years, e.g., XGBoost [2016]
    • Modern supervised learning has been around for 30 years, e.g., SVM [1995], gradient boosting [2003]

  • Why not earlier?


    • Limited training data (Google was popular after 2000)
    • Poor machine learning techniques
    • Not enough features for ML to show value
    • It was possible to tune parameters by hand (and people did)

14 of 61

Learning to rank benefits from web search

  • Modern (web) systems use a great number of features:
    • Log frequency of query word in anchor text?
    • Query word in color on page?
    • # of images on page?
    • # of (out) links on page?
    • PageRank of page?
    • URL length?
    • URL contains “~”?
    • Page edit recency?
    • Page loading speed
  • The New York Times (2008-06-03) quoted Amit Singhal as saying Google was using over 200 such features (“signals”) – so it’s sure to be over 500 today. ☺


15 of 61

Learning to rank as binary classification

  • For each (q, d, rel) triple:
    • Output label y = rel (binary)
    • Input feature x = (α, ω):
      • α is the cosine similarity score(q, d) of the vector space model
      • ω is the minimum query window size in the document
    • Training: apply a supervised classifier to x and y on the training data
    • Prediction: use the predicted label for retrieval (1 = relevant, 0 = non-relevant); a minimal sketch follows below
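A minimal sketch using scikit-learn's LogisticRegression as the binary classifier over the two features (α, ω). The feature values and labels below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each row is (cosine score alpha, query window size omega),
# each label is 1 (relevant) or 0 (non-relevant).
X = np.array([[0.040, 2], [0.035, 3], [0.030, 2],
              [0.010, 5], [0.008, 4], [0.005, 6]])
y = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression()      # learns a linear decision surface in (alpha, omega) space
clf.fit(X, y)

# Predict relevance of a new (query, document) pair from its two features;
# a high-cosine, small-window pair is expected to be classified as relevant (1).
print(clf.predict([[0.032, 2]]))
```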


16 of 61

Learning to rank as binary classification

[Figure: training examples plotted with term proximity ω on the x-axis (values 2-5) and cosine score α on the y-axis (up to about 0.05); R marks relevant and N marks non-relevant examples, separated by a linear decision surface.]

17 of 61

Learning to rank as regression

  • For each (q, d, r) triple:
    • Output label y = relevance judgment (multi-level)
    • Input feature x = (α, ω):
      • α is the cosine similarity of the vector space model
      • ω is the minimum query window size in the document
    • Training: apply linear regression to x and y
    • Ranking: use the predicted score for ranking (3 = highly relevant, 2 = relevant, 1 = not relevant); a minimal sketch follows below
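A minimal sketch of the regression variant: fit a linear model to graded labels and rank candidate documents by their predicted score. The feature vectors and labels are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: features (alpha, omega), graded relevance labels 1-3
X = np.array([[0.045, 2], [0.030, 3], [0.020, 4], [0.010, 5]])
y = np.array([3, 2, 2, 1])

reg = LinearRegression().fit(X, y)

# Rank candidate documents for a new query by their predicted relevance score
candidates = np.array([[0.025, 3], [0.040, 2], [0.005, 6]])
scores = reg.predict(candidates)
ranking = np.argsort(-scores)            # candidate indices, best first
print(list(zip(ranking, scores[ranking])))
```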

[Table: example (q, d) feature vectors with graded relevance labels from 1 to 3.]

18 of 61

Learning to rank as regression

[Figure: relevance level plotted over cosine score α and term proximity ω, with the fitted linear model shown as a plane.]

19 of 61

Challenge in relevance judgment

  • Exact relevance judgments are challenging
    • Exact judgments are incomparable across sessions
    • The criterion for relevance judgment may drift over time
  • Relative judgments are easier
    • i.e., for docIDs 2094 and 3191, only use the “weak supervision” that 2094 is more relevant than 3191 (instead of asserting that 2094’s relevance is exactly 3)


20 of 61

Exact judgment vs Relative judgment


Exact judgment

Relative judgment

21 of 61

Cross Entropy Loss

  • Minimizing the cross entropy loss: $C = -\big[\, y \log p + (1 - y) \log(1 - p) \,\big]$
  • y: label, p: predicted probability that y = 1


22 of 61

Pairwise learning to rank on relative label

  • For each query, the probability that document $d_i$ is more relevant than $d_j$ ($d_i > d_j$) is modeled from the score difference (as in RankNet): $P_{ij} = \frac{1}{1 + e^{-\sigma(s_i - s_j)}}$

  • For each query, do we maximize or minimize the cross entropy loss? (A minimal sketch of this pairwise loss follows below.)
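A minimal sketch of the pairwise probability and the cross entropy loss that is minimized during training; the scores and the shape parameter sigma are illustrative assumptions.

```python
import numpy as np

def pairwise_prob(s_i, s_j, sigma=1.0):
    """P(d_i more relevant than d_j), a logistic function of the score difference."""
    return 1.0 / (1.0 + np.exp(-sigma * (s_i - s_j)))

def pairwise_cross_entropy(s_i, s_j, target=1.0, sigma=1.0):
    """Cross entropy between the target probability (1.0 when d_i > d_j) and the
    model probability P_ij; this is the quantity we minimize."""
    p = pairwise_prob(s_i, s_j, sigma)
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

# Small loss because the model already scores d_i above d_j
print(pairwise_cross_entropy(s_i=2.0, s_j=0.5))   # about 0.20
```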


23 of 61

Yahoo! learning to rank challenge

Chapelle and Chang [2011]

  • Yahoo! Webscope dataset: 36,251 queries, 883k documents, 700 features, 5 ranking levels
    • Ratings: Perfect (navigational), Excellent, Good, Fair, Bad
    • Real web data
    • set-1: 473,134 feature vectors; 519 features; 19,944 queries
    • set-2: 34,815 feature vectors; 596 features; 1,266 queries
  • Winner (Burges et al. 2011) was linear combo of 12 models:
    • 8 Tree Ensembles (LambdaMART)
    • 2 LambdaRank Neural Nets
    • 2 Logistic regression models


24 of 61

Yahoo! learning to rank challenge

  • Goal was to validate learning to rank methods on a large, “real” web search problem
    • Previous work was mainly driven by LETOR datasets
      • Great as first public learning-to-rank data
      • Small: 10s of features, 100s of queries, 10k’s of docs
  • Only feature vectors released
    • Not URLs, queries, nor feature descriptions
      • Wanting to keep privacy and proprietary info safe
    • But included web graph features, click features, page freshness and page classification features as well as text match features


25 of 61

Regression tree

  • Regression tree: predicting a value rather than a class

[Figure: side-by-side examples of a decision tree (class labels at the leaves) and a regression tree (numeric values at the leaves).]

26 of 61

Regression tree

  • Tree construction: recursively split nodes to minimize the MSE
    • Choose the split feature and value that minimize the variance (standard deviation) of the target values in each child induced by the split (normally just a binary split, for easy search); see the sketch below

  • Termination: cutoff on the SD, the number of examples, or the tree depth
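A minimal sketch of the naive split search: try every (feature, threshold) pair and score the split by the total squared error of the two children. This is the quadratic-time search referred to on the next slide; the example data are made up.

```python
import numpy as np

def best_split(X, y):
    """Naive search for the best binary split of a node: O(n^2) per feature,
    since each of the ~n candidate thresholds is scored in O(n)."""
    best = (None, None, np.inf)                  # (feature, threshold, sse)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] < t], y[X[:, f] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (f, t, sse)
    return best

# Hypothetical one-feature example: targets jump from ~2 to ~6 around x = 5
X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y = np.array([2.0, 2.1, 1.9, 6.0, 6.2, 5.8])
print(best_split(X, y))    # best split is x < 6.0
```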


27 of 61

Training a regression tree

What is the algorithm for finding the splitting criterion? A naive search tries every feature and every candidate threshold and evaluates each resulting split, so the time complexity is quadratic.

[Figure: a regression tree built step by step, with splits such as x1 < v1, x1 > v3, and x2 < v4, and leaf predictions such as 1.444 and 3.625.]

Step 1: as in linear regression, search over the different input dimensions (features) for the best split.

28 of 61

Towards gradient boosting

  • Linear regression finds the parameter that minimizes the square loss: $\hat{\theta} = \arg\min_{\theta} \sum_i (y_i - \theta^{\top} x_i)^2$

  • Gradient boosting finds the function that minimizes the expected square loss: $\hat{F} = \arg\min_{F} \; \mathbb{E}\big[(y - F(x))^2\big]$


29 of 61

Towards gradient boosting

  • Regression tree minimizes the square loss by splitting each node recursively

  • Gradient boosting minimizes the square loss by defining an additive function which iteratively reduces the residual (why?)


30 of 61

Gradient boosting: estimating the parameter

  • Function parameters are iteratively fit to the training data:
    • Set F0(x) = initial guess (or zero)
    • For each m = 1, 2, …, M:
      • Compute the residuals ri = yi − Fm−1(xi) (the negative gradient of the square loss)
      • Fit a small regression tree fm to the residuals
      • Update Fm(x) = Fm−1(x) + fm(x)

  • You successively estimate and add a new tree to the sum (see the sketch below)
  • You never go back to revisit past decisions
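A minimal sketch of the boosting loop for the square loss, using scikit-learn's DecisionTreeRegressor as the small tree. The data, tree depth, and learning rate are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbrt(X, y, n_trees=3, max_depth=1, lr=1.0):
    """Gradient boosting for square loss: F0 = mean, then each new tree is fit
    to the current residuals (the negative gradient) and added to the sum."""
    f0 = y.mean()
    trees, pred = [], np.full_like(y, f0, dtype=float)
    for _ in range(n_trees):
        residual = y - pred                       # gradient of square loss = residual
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += lr * tree.predict(X)              # add the tree; never revisit past trees
        trees.append(tree)
    return f0, trees

def predict_gbrt(f0, trees, X, lr=1.0):
    return f0 + lr * sum(t.predict(X) for t in trees)

# Hypothetical data in the spirit of the worked example on the following slides
X = np.array([[1, 1], [2, 5], [6, 2], [7, 6]])
y = np.array([2.0, 5.0, 6.0, 6.0])
f0, trees = fit_gbrt(X, y)
print(predict_gbrt(f0, trees, X))   # predictions approach the training targets
```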


31 of 61

Gradient boosting: estimating the parameter

  • For least squares, the negative gradient equals the residual
  • Each added function fm is a small regression tree

32 of 61

  • In the first iteration, predict a constant value that minimizes the loss: f0(x) = the mean of the training targets

[Worked example: a table of training targets (values including 6, 5, and 2); the constant prediction is f0(x) = 3.588, the mean of all targets.]

33 of 61

  • After the first iteration, replace each target with its residual (subtracting f0(x) = 3.588; e.g., 6 becomes 6 − 3.588 = 2.41), then train a regression tree f1 on the residuals to minimize the loss

[Worked example: residuals after subtracting 3.588 (e.g., 6 → 2.41, 5 → 1.41, 2 → −1.58); the fitted tree f1(x) splits on x1 < 5.5 with leaf values −1.81 and 2.037.]

34 of 61

  • After the second iteration, update each value to its new residual (subtracting f0(x) + f1(x)), then train a regression tree f2 to minimize the loss

[Worked example: residuals after subtracting f0(x) + f1(x) (values such as 0.37, 0.22, −0.62, and −0.77); the fitted tree f2(x) splits on x2 < 3.5 with leaf values −0.427 and 0.299.]

35 of 61

Multiple additive regression tree [Friedman 1999]

Prediction = f0(x) + f1(x) + f2(x) + …, where f0(x) = 3.588, f1(x) splits on x1 < 5.5 (leaf values −1.81 and 2.037), and f2(x) splits on x2 < 3.5 (leaf values −0.427 and 0.299).

36 of 61

Multiple additive regression tree [Friedman 1999]

[Figure: the same additive model drawn as a chain of trees; at prediction time each tree is traversed (yes/no at each split) and the selected leaf values are summed.]

37 of 61

RankNet [Burges et al. 2010]

  • Minimizes the cross entropy loss: $C = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log(1 - P_{ij})$, where $\bar{P}_{ij}$ is the target probability obtained from the labels

  • Plugging in the probability $P_{ij} = \frac{1}{1 + e^{-\sigma(s_i - s_j)}}$ gives rise to: $C = \tfrac{1}{2}(1 - S_{ij})\,\sigma(s_i - s_j) + \log\big(1 + e^{-\sigma(s_i - s_j)}\big)$, with $S_{ij} \in \{-1, 0, +1\}$

38 of 61

RankNet [Burges et al. 2010]

  • Take the gradient of C w.r.t. the model weight $w_k$: $\frac{\partial C}{\partial w_k} = \frac{\partial C}{\partial s_i}\frac{\partial s_i}{\partial w_k} + \frac{\partial C}{\partial s_j}\frac{\partial s_j}{\partial w_k} = \lambda_{ij}\Big(\frac{\partial s_i}{\partial w_k} - \frac{\partial s_j}{\partial w_k}\Big)$, where $\lambda_{ij} \equiv \frac{\partial C}{\partial s_i} = \sigma\Big(\tfrac{1}{2}(1 - S_{ij}) - \frac{1}{1 + e^{\sigma(s_i - s_j)}}\Big)$

39 of 61

RankNet [Burges et al. 2010]

  • Pairwise loss -> pointwise loss: accumulating the pairwise gradients per document gives $\lambda_i = \sum_{j:(i,j)\in I} \lambda_{ij} - \sum_{j:(j,i)\in I} \lambda_{ij}$, where $I$ is the set of pairs $(i, j)$ with $d_i$ more relevant than $d_j$

  • Therefore, $\lambda_i$ is sort of a “gradient” w.r.t. document $d_i$ under query q (why under query q? because pairs are only formed between documents retrieved for the same query); a minimal sketch follows below
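A minimal sketch of aggregating the pairwise lambdas into per-document lambdas for one query, following the formulation above with sigma = 1; here lambda_i is the derivative of the cost w.r.t. the score s_i, so a negative value means the document should be pushed up (other presentations flip the sign). The scores and labels are made up.

```python
import numpy as np

def ranknet_lambdas(scores, labels, sigma=1.0):
    """Per-document lambdas for one query. For every pair where document i is more
    relevant than document j, lambda_ij = -sigma / (1 + exp(sigma * (s_i - s_j)));
    it is added to lambda_i and subtracted from lambda_j."""
    n = len(scores)
    lambdas = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:                       # pair with d_i more relevant
                lam = -sigma / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
                lambdas[i] += lam                           # negative: push d_i up
                lambdas[j] -= lam                           # positive: push d_j down
    return lambdas

# Hypothetical query with 3 documents: doc 2 is relevant but currently scored lowest,
# so it receives a large negative lambda (a strong push upward).
print(ranknet_lambdas(scores=np.array([1.5, 0.8, 0.2]), labels=np.array([0, 0, 1])))
```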

40 of 61

RankNet [Burges et al. 2010]

  • Interpreting $\lambda_i$: (a) is the perfect ranking, (b) is a ranking with 10 pairwise errors, (c) is a ranking with 8 pairwise errors. Each blue arrow represents the $\lambda_i$ for the document vector $x_i$

  • Annotation: the $\lambda$ for document 1 sums over the negative examples ranked above it, pushing document 1 higher

41 of 61

RankNet [Burges et al. 2010]

  • RankNet minimizes the cross entropy loss; alternatively, we can directly optimize the evaluation metrics, e.g., MAP [2007], NDCG [2010], and MRR
  • Modern web search tends to emphasize precision at higher rank positions


Which metric should we use to emphasize performance at higher positions? (A minimal NDCG sketch follows below.)
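A minimal sketch of NDCG@k, the graded metric that discounts gains by the log of the rank and therefore emphasizes the top positions. The example relevance labels are made up.

```python
import numpy as np

def dcg_at_k(rels, k):
    """DCG@k with graded relevance: gain (2^rel - 1) discounted by log2(rank + 1)."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, len(rels) + 2))
    return np.sum((2 ** rels - 1) / discounts)

def ndcg_at_k(rels, k):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranked list with graded labels (3 = highly relevant, 0 = not relevant)
print(ndcg_at_k([3, 2, 0, 1], k=4))   # < 1.0 because the ideal order is [3, 2, 1, 0]
```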


42 of 61

From RankNet to LambdaRank

  • RankNet assigns the same weight to all pairwise errors
  • LambdaRank: scale $\lambda_{ij}$ (the desired change in scores for the document pair $d_i$ and $d_j$) by the change in NDCG obtained by swapping the two documents: $\lambda_{ij} \leftarrow \lambda_{ij} \cdot |\Delta \mathrm{NDCG}_{ij}|$

  • Burges et al. “prove” (partly theory, partly empirically) that this change is sufficient for the model to optimize NDCG


43 of 61

From RankNet to LambdaRank

  • LambdaRank models gradients
  • MART can be trained with gradients (“gradient boosting”)
  • Combine both to get LambdaMART
    • MART with specified gradients and optimization step


44 of 61

LambdaMART [Burges et al. 2010]

  • The lambdas act as the “gradients” in RankNet
  • In MART, using these specific lambdas as the gradients, we get:
    • LambdaMART = LambdaRank + MART (gradient boosting)

Gradient boosting: use the lambdas as the pseudo-residuals that each new tree is fit to

45 of 61

Learning to rank categorization

  • Pointwise learning to rank: linear regression, SVM
  • Pairwise learning to rank: RankNet, LambdaMART
  • Listwise learning to rank (optimizing the metric score w.r.t. a ranked list): NDCGBoost


Year | Algorithm  | Category  | Description
1992 | Linear reg | Pointwise | Staged logistic regression
1999 | MART       | Pairwise  | Multiple additive regression tree
2005 | RankNet    | Pairwise  | –
2007 | SVMmap     | Listwise  | Optimizing MAP using SVM
2010 | NDCGBoost  | Listwise  | Optimizing NDCG using boosting
2016 | XGBoost    | Pairwise  | Supports various eval metrics

46 of 61

Listwise learning to rank for MAP [Yue et al. 2007]

  • Listwise learning to rank minimizes a loss defined with respect to the whole ranked list (e.g., based on MAP or NDCG)

  • SVM MAP [Yue et al. 2007]

[Equation: the SVMmap objective minimizes a regularized hinge loss that upper-bounds the MAP loss.]

47 of 61

Listwise learning to rank for NDCG [Valizadegan et al. 2009]

  • Optimizing the NDCG measure:

  • Directly optimizing the above function is infeasible, so they derived an approximation

  • Then they construct the ranking function as an ensemble, following an approach similar to boosting


48 of 61

Listwise learning to rank for NDCG [Valizadegan et al. 2009]


49 of 61

Summary

  • The idea of using machine learning for ranking has been around for a long time

  • In the last two decades, learning to rank has become a hot topic due to advances in machine learning, the availability of massive amounts of user logs, and the growth in computational power

  • The mainstream learning to rank algorithms focus on tree ensemble methods

  • Machine-learned ranking over many features now easily beats traditional hand-designed ranking functions by optimizing the evaluation metrics directly


50 of 61

Resources for learning to rank

  • Ranklib: https://sourceforge.net/p/lemur/wiki/RankLib/
  • XGBoost: https://github.com/dmlc/xgboost
  • Yahoo! Learning to rank challenge dataset
  • C. J. C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft TR 2010.
  • O. Chapelle and Y. Chang. Yahoo! Learning to Rank Challenge Overview. JMLR Proceedings 2011


51 of 61

Deep learning based retrieval

  • DSSM: Deep Structured Semantic Model (2014)
  • MatchZoo: An open-source library for retrieval (2019)
  • Dense passage retrieval (2019)


52 of 61

Deep semantic matching [Pang et al. 2016]

[Figure: a MatchPyramid-style matching matrix between query terms w1 … wn and document terms v1 … vm; each cell compares the distributed representations of the two words (word2vec).]

A Study of MatchPyramid Models on Ad-hoc Retrieval

53 of 61

aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model


54 of 61

Dense passage retrieval

  • Step 1: map each query and each document into a fixed-size embedding vector

Example: the query “cs 589 stevens” and the document “CS 589 Text Mining and Information Retrieval” are each mapped to an embedding vector.

source: https://medium.com/@aikho/deep-learning-in-information-retrieval-part-ii-dense-retrieval-1f9fecb47de9

55 of 61

Contrastive learning

  • Step 2: minimize the distance to positive (relevant) documents and maximize the distance to negative documents (a minimal sketch follows below)
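A minimal numpy sketch of a DPR-style contrastive objective: the loss is the negative log-softmax of the positive passage's similarity against the negatives (in-batch or BM25 hard negatives). The embeddings below are made-up placeholders.

```python
import numpy as np

def contrastive_loss(q_emb, pos_emb, neg_embs):
    """Negative log-softmax of the positive passage's dot-product similarity:
    minimizing this pulls the query toward its positive passage and pushes it
    away from the negative passages."""
    sims = np.array([q_emb @ pos_emb] + [q_emb @ n for n in neg_embs])
    sims = sims - sims.max()                  # subtract max for numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])                  # index 0 is the positive passage

# Hypothetical 4-d embeddings for a query, one positive passage, and two negatives
q    = np.array([0.9, 0.1, 0.0, 0.2])
pos  = np.array([0.8, 0.2, 0.1, 0.1])
negs = [np.array([0.0, 0.9, 0.3, 0.0]), np.array([0.1, 0.0, 0.9, 0.4])]
print(contrastive_loss(q, pos, negs))   # below the uniform baseline log(3) since the positive is closest
```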


source: https://medium.com/@aikho/deep-learning-in-information-retrieval-part-ii-dense-retrieval-1f9fecb47de9

56 of 61

Contrastive learning

  • Selection of positive passages:
    • The highest-ranked passage from BM25 that contains the answer

  • Adding hard negatives from BM25 helps


57 of 61

Make Deep IR Model Explainable (ongoing work)

  • TF-IDF is explainable; however, BERT-based models are not


58 of 61

Make Deep IR Model Explainable (ongoing work)


59 of 61

  • LIME: Local Interpretable Model-agnostic Explanations [Ribeiro et al. 2016]


60 of 61

Lecture 5 Summary

  • Learning to rank: using machine learning for ranking
    • feature engineering
    • casting the problem as classification / regression

  • Gradient boosted regression tree
    • Regression tree vs gradient boosting

  • RankNet
    • How the pairwise loss is reduced to per-document “gradients” (lambdas); know the derivation

  • Deep learning to rank
    • DRMM/aNMM, DPR, explainability


61 of 61

Improving sentence embedding quality

  • The key to retrieval is to improve sentence embedding quality
  • Massive text embedding benchmark (MTEB)
