2 of 172

Pretrained Transformers for Text Ranking: BERT and Beyond

Andrew Yates, Rodrigo Nogueira, and Jimmy Lin

2

3 of 172

Based on the survey: Pretrained Transformers for Text Ranking: BERT and Beyond

by Jimmy Lin, Rodrigo Nogueira, and Andrew Yates https://arxiv.org/abs/2010.06467

3

Tutorial organization:

  • Recorded tutorial
  • Live sessions: hands-on component and Q&A

4 of 172

Outline

  • Part 1: Background (text ranking, IR, ML)
  • Part 2: Ranking with relevance classification
  • Part 3: Ranking with dense representations
  • Part 4: Conclusion & future directions

4

5 of 172

Text Ranking

Text ranking problems

Transformers

5

6 of 172

Definition

Given: a piece of text (keywords, question, news article, …)

Rank: other pieces of text

(passages, documents, queries, …)

Ordered by: their similarity

6

e.g., Web search

7 of 172

Focus: Ad hoc Retrieval

Given: query q and a collection of texts

Return: a ranked list of k texts d1 … dk

Maximizing: a metric of interest

7

[Figure: the query "black bear attacks" is issued against the collection; the system returns a ranked list (1., 2., 3., ...), which is scored with the metric of interest, e.g., 0.66]

8 of 172

Other Problems: Question Answering

Approach:

  • Rank passages
  • Rank answer spans

8

Source: SQuAD

9 of 172

Other Problems: Community Question Answering

New question: What is the longest airline flight?

9

Source: Quora

10 of 172

Other Problems: Text Recommendation

10

Source: Science News

11 of 172

Focus: Content-based Similarity

Agreement between query and a piece of text

11

12 of 172

Transformers

12

13 of 172

Pretrained Transformers

13

Initialized via pretraining

14 of 172

IR Background

Unsupervised ranking methods

Metrics

Test collections

14

15 of 172

Unsupervised Ranking Methods

15

[Figure: the ranking score of an input text is a sum of per-term scores over the query terms]

16 of 172

Unsupervised Ranking Methods

Inverse Document Frequency (more discriminative): frequent terms such as "the", "and", "but" receive low IDF; rarer terms such as "data", "ranking", "SIGIR" receive high IDF.

Term Frequency: how often a term occurs in the text.

16

17 of 172

Unsupervised Ranking Methods

17

Sparse representations

18 of 172

Unsupervised Ranking Methods

18

BM25(q, d) = Σ over terms t in q of IDF(t) · tf(t, d) · (k1 + 1) / ( tf(t, d) + k1 · (1 − b + b · |d| / avgdl) )

IDF component: IDF(t)

TF component: tf(t, d)

term saturation: controlled by k1

length normalization: controlled by b and |d| / avgdl
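To make the components concrete, here is a minimal BM25 scorer, a sketch in Python; the toy corpus, the whitespace tokenization, and the parameter values k1 = 0.9, b = 0.4 are illustrative assumptions, not part of the slides.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=0.9, b=0.4):
    """Minimal BM25: sum over query terms of IDF * saturated, length-normalized TF."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        # IDF component: rarer terms are more discriminative
        idf = math.log(1 + (num_docs - doc_freqs[term] + 0.5) / (doc_freqs[term] + 0.5))
        # TF component with term saturation (k1) and length normalization (b)
        norm = k1 * (1 - b + b * len(doc_terms) / avg_doc_len)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score

# Toy usage
docs = [["black", "bear", "attacks", "reported"], ["brown", "bear", "habitat"]]
df = Counter(t for d in docs for t in set(d))
avg_len = sum(len(d) for d in docs) / len(docs)
for d in docs:
    print(bm25_score(["black", "bear", "attacks"], d, df, len(docs), avg_len))
```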

19 of 172

Vocabulary Mismatch

[Figure: the same information need can be phrased as "document retrieval", "text ranking", "document ranking", or "Web search", all built from related terms such as query, search, ranking, retrieval, ...]

19

Enrich query or document representations → move beyond exact matching

  • Unsupervised: pseudo-relevance feedback
  • Later: document expansion with a neural method


20 of 172

Benchmarking: Relevance Judgments

Is this document relevant to the query?

20

(Subjective)

21 of 172

Evaluation Metrics: Precision & Recall

21

Precision = (# relevant docs returned) / (# results returned)

Recall = (# relevant docs returned) / (# relevant docs in collection)

22 of 172

Evaluation Metrics: Average Precision

22

AP = (1 / # relevant docs in collection) · Σ (precision at each rank where a relevant doc is returned)

23 of 172

Evaluation Metrics: Reciprocal Rank

23

RR = 1 / (rank of the first relevant doc)
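A minimal sketch of these metrics in Python, assuming a ranked list of doc ids and a set of relevant doc ids; the names and toy values are illustrative.

```python
def precision_recall(ranked, relevant, k):
    retrieved = ranked[:k]
    hits = sum(1 for d in retrieved if d in relevant)
    return hits / len(retrieved), hits / len(relevant)

def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i          # precision when a relevant doc is returned
    return total / len(relevant)       # normalized by # relevant docs in collection

def reciprocal_rank(ranked, relevant):
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i             # 1 / rank of the first relevant doc
    return 0.0

ranked = ["d3", "d1", "d7"]
relevant = {"d1", "d4"}
print(precision_recall(ranked, relevant, k=3))   # (0.333..., 0.5)
print(average_precision(ranked, relevant))       # 0.25
print(reciprocal_rank(ranked, relevant))         # 0.5
```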

24 of 172

Graded Relevance Judgments

How relevant is this document to the query?

24

[Figure: a graded scale from "related" up to "highly relevant" (more relevant)]

25 of 172

Evaluation Metrics: nDCG

25

DCG = Σ over ranks i = 1, 2, 3, ... of gain(rel_i) / log2(i + 1); the gain asks "how relevant?", the decreasing discount asks "how early?"

nDCG = DCG / ideal DCG, where the ideal DCG is computed from the list sorted by relevance
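A minimal nDCG sketch in Python; the gain 2^rel − 1 and the toy judgments are illustrative assumptions, and for simplicity the ideal DCG is computed from the same list (in practice it uses all judged documents).

```python
import math

def dcg(gains):
    # "how relevant?" in the numerator, "how early?" in the log discount
    return sum((2 ** g - 1) / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains, k=None):
    gains = gains[:k] if k else gains
    ideal = sorted(gains, reverse=True)   # ideal DCG: same gains sorted by relevance
    return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# graded judgments of the returned list, e.g. 0 = not relevant, 2 = highly relevant
print(ndcg([2, 0, 1], k=3))
```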

26 of 172

Test Collections

26

[Figure: a test collection consists of queries, documents, and relevance judgments]

27 of 172

Test Collections

TREC Robust04

  • News
  • Keywords
  • Natural language

27

ID: 336

Title: black bear attacks

Description: A relevant document would discuss the frequency of vicious black bear attacks worldwide and the possible causes for this savage behavior.

Narrative: It has been reported that food or cosmetics sometimes attract hungry black bears, causing them to viciously attack humans. Relevant documents would include the aforementioned causes as well as speculation preferably from the scientific community as to other possible causes of vicious attacks by black bears. A relevant document would also detail steps taken or new methods devised by wildlife officials to control and/or modify the savageness of the black bear.

28 of 172

Test Collections

MS MARCO

  • Web
  • Questions
  • Passage collection
  • Document collection
  • Sparse judgments

28

TREC Deep Learning

  • Same passages / docs
  • New (dense) judgments

ID: 130510

Text: definition declaratory judgment

ID: 1131069

Text: how many sons robert kraft has

ID: 1131069

Text: when did rock n roll begin?

ID: 1103153

Text: who is thomas m cooley

29 of 172

Documents and Mean / Median Lengths

29

30 of 172

Queries, Query Lengths, and Judgments

30

31 of 172

A Simple Search Engine

31

e.g., BM25

32 of 172

Machine Learning Background

Learning to rank

Deep learning for ranking

BERT

32

33 of 172

Machine Learning Background

Learning to rank

Deep learning for ranking

BERT

33

34 of 172

A Simple Search Engine

34

This section: the inverted index maps each term (Key) to its postings list (Value):

"chair" → [text #83, text #743, ...]
"store" → [text #1003, text #50, ...]
...

35 of 172

Learning to Rank (> 1990)

  • Supervised machine learning techniques
  • Typically based on hand-crafted features:
    • Content (e.g., term frequencies, document lengths)
    • Meta-data (e.g., PageRank scores)
  • RankNet (Burges et al., 2005): a neural net
    • Differs from deep learning models in that it still requires hand-crafted features
  • Gained popularity with user click data (Burges, 2010)

35

36 of 172

Learning to Rank - Types of Losses

query q, texts d1, d2, d3, a ranker fθ, and a loss L:

36

Pointwise:

L(fθ, q, d1, d2, d3) = L(fθ, q, d1) + L(fθ, q, d2) + L(fθ, q, d3)

Pairwise:

L(fθ, q, d1, d2, d3) = L(fθ, q, d1, d2) + L(fθ, q, d1, d3) + L(fθ, q, d2, d3)

Listwise:

L(fθ, q, d1, d2, d3) = L(fθ, q, d1, d2, d3)
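A minimal sketch of how a pointwise loss and a pairwise (RankNet-style) loss differ, assuming the ranker's scores and binary relevance labels are already available; the toy values are illustrative.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def pointwise_loss(scores, labels):
    # binary cross-entropy per (q, d) pair, summed over documents
    return -sum(y * math.log(sigmoid(s)) + (1 - y) * math.log(1 - sigmoid(s))
                for s, y in zip(scores, labels))

def pairwise_loss(scores, labels):
    # cross-entropy on P(d_i > d_j) for every pair where d_i is more relevant than d_j
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                loss += -math.log(sigmoid(scores[i] - scores[j]))
    return loss

scores = [2.0, 0.5, -1.0]   # f_theta(q, d1), f_theta(q, d2), f_theta(q, d3)
labels = [1, 1, 0]          # relevance of d1, d2, d3
print(pointwise_loss(scores, labels), pairwise_loss(scores, labels))
```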

37 of 172

Learning to Rank - Types of Losses

An example:

Rel(d1) > Rel(d2) > Rel(d3), fπœƒ = a neural network that outputs a probability, L = cross-entropy

37

[Figure: Pointwise — the neural net scores each (q, di) independently, producing pi. Pairwise — the neural net scores pairs (q, di, dj), producing pi,j, with p1,2 > p2,1 when d1 is more relevant than d2. Listwise — the neural net scores whole orderings, with p1,2,3 > p3,2,1 for the correct ordering.]

38 of 172

Machine Learning Background

Learning to rank

Deep learning for ranking

BERT

38

39 of 172

Neural Ranking Models (> 2016)

39

We will revisit these architectures in the Dense Retrieval section

Representation-based

Interaction-based

40 of 172

Popular Neural Ranking Models

40

41 of 172

Machine LearningΒ Background

Learning toΒ rank

Deep learning for ranking

BERT

41

42 of 172

Progress in Information Retrieval - Robust04

42

Yilmaz et al., 2019; MacAvaney et al. 2019

Li et al., 2020;

Nogueira et al., 2020

Some of them are zero-shot!

43 of 172

MS MARCO Passage Ranking Leaderboard in December 2018

43

44 of 172

MS MARCO Passage Ranking Leaderboard in January 2019

44

~8 points!

45 of 172

MS MARCO Passage Ranking Leaderboard in June 2021

45

46 of 172

Adoption by Commercial Search Engines

46

We’re making a significant improvement to how we understand queries, representing the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search.

Starting from April of this year (2019), we used large transformer models to deliver the largest quality improvements to our Bing customers in the past year.

MS Bing

Google Search

47 of 172

What is BERT?

47

Self-supervised: ∞ training data

Supervised: (few) labeled examples

Devlin, Chang, Lee, Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

48 of 172

BERT's Pretraining Ingredients

48

Transformer (encoder-only)

with lots of parameters

+

Lots of texts

+

Lots of Compute

49 of 172

49

[Figure: string → sequence of vectors. The input string "The bank makes loans to clients." is tokenized into [CLS] The bank makes loans to clients . [SEP]; each position's input is the sum of its Token Embedding, Segment Embedding (EA), and Position Embedding (P0, P1, ...); BERT maps these inputs E[CLS], E1, ..., E[SEP] to contextualized output vectors T[CLS], T1, ..., T[SEP].]

50 of 172

50

Pretraining - Masked Language Modeling

[Figure: random masking turns the input into "The bank makes loans [MASK] clients ."; BERT's output vector at the masked position is projected by a D × |V| matrix and passed through a softmax over the vocabulary.

Loss = -log(P("to" | masked input))]

51 of 172

Pretraining

51

Input: a document

"The Mongol invasion of Europe in the 13th century was the conquest of much of Europe by the Mongol Empire."

Random masking:

"The Mongol invasion of Europe in the 13th [MASK] was the conquest of much of [MASK] by the Mongol Empire."

[Figure: BERT produces a vector for each masked position; a linear layer + softmax over the vocabulary yields token probabilities, e.g., century 0.94, year 0.02, car 0.00 for the first mask and Europe 0.97, land 0.01, is 0.00 for the second.]
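A hedged sketch of masked-token prediction with a pretrained BERT, assuming the Hugging Face transformers and torch libraries are available; the checkpoint "bert-base-uncased" and the example sentence are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "The bank makes loans [MASK] clients ."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # [1, seq_len, vocab_size]

# locate the masked position and inspect the softmax over the vocabulary
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
probs = logits[0, mask_pos].softmax(dim=-1)
top = probs.topk(3)
print([(tokenizer.decode([int(i)]), round(float(p), 3))
       for p, i in zip(top.values, top.indices)])
```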

52 of 172

52

Finetuning

53 of 172

BERT

for Relevance Classification

(aka monoBERT)

53

54 of 172

monoBERT:

BERT reranker

54

We want:

si = P(Relevant = 1 | q, di)

Input: query q and text di, packed into a single sequence.

si = softmax(T[CLS] W + b)1, where W (D × 2) and b project the [CLS] output vector onto two classes: Non-relevant and Relevant.
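A hedged sketch of monoBERT-style scoring with a sequence-classification head, assuming transformers and torch are available; the checkpoint name is a placeholder for a monoBERT-style fine-tuned reranker, and the query/passage strings are toy examples.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "a-finetuned-monobert-checkpoint"   # hypothetical; substitute a real reranker checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

query = "black bear attacks"
passage = "Black bears rarely attack humans, but food can attract them ..."
# builds the [CLS] query [SEP] passage [SEP] input
inputs = tokenizer(query, passage, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # [1, 2]: (non-relevant, relevant)

s_i = logits.softmax(dim=-1)[0, 1].item()      # s_i = P(Relevant = 1 | q, d_i)
print(s_i)
```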

55 of 172

Training monoBERT

Loss: cross-entropy over the relevant / non-relevant labels; positive examples come from human judgments, negative examples are sampled from BM25 candidates.

55

56 of 172

Once monoBERT is trained...

56

[Figure: retrieve-then-rerank pipeline. BM25 over an inverted index of d1 ... d5 retrieves candidates R0 for query q; monoBERT scores each (q, di) pair (si) and reorders the candidates into the reranked list R1, e.g., d3, d2, d5.]

57 of 172

TREC 2019 - Deep Learning Track - Passage

57

               nDCG@10   MAP     Recall@1k
BM25           0.506     0.377   0.739
 + monoBERT    0.738     0.506   0.739
BM25 + RM3     0.518     0.427   0.788
 + monoBERT    0.742     0.529   0.788

58 of 172

TREC 2019 - Deep Learning Track - Passage

58

+5 points in Recall@1k → +2 points in MAP

Why?

Hypothesis: Mismatch between training and inference lists

               nDCG@10   MAP     Recall@1k
BM25           0.506     0.377   0.739
 + monoBERT    0.738     0.506   0.739
BM25 + RM3     0.518     0.427   0.788
 + monoBERT    0.742     0.529   0.788

59 of 172

How useful is the BM25 signal?

59

60 of 172

Recap: Pre-BERT vs. monoBERT

60

61 of 172

Q&A

5 minutes

61

62 of 172

Break

Resume at 9:00 PDT

62

63 of 172

Part 2: Ranking with Relevance Classification

63

64 of 172

BERT’s Limitations

64

Cannot input entire documents

  • what do we input?
  • and how do we label it?

Needs a separate embedding for every possible position

  • restricted to indices 0-511


65 of 172

BERT’s Limitations

65

computationally expensive layers

  • e.g., 110+ million learned weights


(later: Beyond BERT & Dense Representations)

Multi-stage ranking pipeline

  • Identify candidate documents
  • Rerank

66 of 172

From Passages to Documents

66

67 of 172

Handling Length Limitation: Training

Chunk documents

Transfer labels (approximation)

67

68 of 172

Handling Length Limitation: Inference

Aggregate Evidence

68

69 of 172

Approach #1: Score Aggregation

69

[Figure: each passage of the document is scored independently (s1, s2, s3); the passage scores are aggregated into a single Document Score.]

  1. Over passage scores. Dai, Callan. Deeper Text Understanding for IR with Contextual Neural Language Modeling. SIGIR 2019.
  2. Over sentence scores. Yilmaz, Yang, Zhang, Lin. Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval. EMNLP '19.

70 of 172

Over Passage Scores: BERT-MaxP, FirstP, SumP

70

[Figure: passage scores s1, s2, s3 are aggregated into the Document Score by taking the max (BERT-MaxP), the first (FirstP), or the sum (SumP).]

Dai, Callan. Deeper Text Understanding for IR with Contextual Neural Language Modeling. SIGIR 2019.
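A minimal sketch of the three score-aggregation strategies in Python; the passage scores are toy values.

```python
def aggregate(passage_scores, how="max"):
    """BERT-MaxP / FirstP / SumP: one document score from its passages' scores."""
    if how == "max":
        return max(passage_scores)
    if how == "first":
        return passage_scores[0]
    if how == "sum":
        return sum(passage_scores)
    raise ValueError(how)

scores = [0.12, 0.87, 0.35]   # s1, s2, s3 from the reranker over three passages
print(aggregate(scores, "max"), aggregate(scores, "first"), aggregate(scores, "sum"))
```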

71 of 172

Over Passage Scores: Results

71

Dai, Callan. Deeper Text Understanding for IR with Contextual Neural Language Modeling. SIGIR 2019.

72 of 172

Over Sentence Scores: Birch

Document score = interpolation of the first-stage retrieval score with the top sentence scores

72

Yilmaz, Yang, Zhang, Lin. Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval. EMNLP '19.

  • Trained on sentence-level relevance judgments (e.g., tweets)
  • Interpolation weights are tuned on the target dataset

73 of 172

Over Sentence Scores: Results

73

Yilmaz, Yang, Zhang, Lin. Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval. EMNLP '19.

74 of 172

Approach #2: Representation Aggregation

74

  • Over term embeddings. MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019.
  • Over passage representations. Li, Yates, MacAvaney, He, Sun. PARADE: Passage Representation Aggregation for Document Reranking. arXiv 2020.

passage representations

75 of 172

Over Term Embeddings: CEDR

75

Similarity matrix built from contextualized embeddings (passages concatenated), fed into an interaction-based pre-BERT model (PACRR, KNRM)

MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019.

76 of 172

Over Term Embeddings: CEDR

76

Relevant Document

Nonrelevant

MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019.

77 of 172

Over Term Embeddings: Results

77

MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019.

78 of 172

Over Passage Representations: PARADE

78

Li, Yates, MacAvaney, He, Sun. PARADE: Passage Representation Aggregation for Document Reranking. arXiv 2020.

Aggregation of passages’ CLS embeddings

Aggregation approaches (in increasing complexity; see the sketch after this list):

  • Average feature value
  • Max feature value
  • Attn-weighted average
  • Two Transformer layers
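A minimal numpy sketch of the simpler aggregation variants over the passages' CLS embeddings; the random vectors and the random attention query vector are illustrative placeholders for learned components.

```python
import numpy as np

def aggregate_cls(cls_vectors, how="avg"):
    """cls_vectors: [num_passages, dim] array of per-passage [CLS] embeddings."""
    if how == "avg":
        return cls_vectors.mean(axis=0)
    if how == "max":
        return cls_vectors.max(axis=0)          # element-wise max feature value
    if how == "attn":
        # attention-weighted average with a query vector w (random here, learned in practice)
        w = np.random.randn(cls_vectors.shape[1])
        a = np.exp(cls_vectors @ w)
        a /= a.sum()
        return (a[:, None] * cls_vectors).sum(axis=0)
    raise ValueError(how)

doc_repr = aggregate_cls(np.random.randn(4, 768), how="attn")
print(doc_repr.shape)   # (768,) -> fed to a final scoring layer
```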

79 of 172

Over Passage Representations: Results

79

Li, Yates, MacAvaney, He, Sun. PARADE: Passage Representation Aggregation for Document Reranking. arXiv 2020.

80 of 172

Enlarge Passage Representations: Longformer, QDS

80

Jiang, Xiong, Lee, Wang. Long Document Ranking with Query-Directed Sparse Transformer. Findings of EMNLP 2020.

Longformer: sparse attention

QDS-Transformer: specialize to IR

Beltagy, Peters, Cohan. Longformer: The Long-Document Transformer. arXiv 2020.

81 of 172

Multi-stage rerankers

why multi-stage?

duoBERT

Cascade Transformers

81

82 of 172

Multi-stage rerankers

why multi-stage?

duoBERT

Cascade Transformers

82

83 of 172

From Single to Multiple Rerankers

83

84 of 172

Why Multi-stage?

  • Trade-off between effectiveness (quality of the ranked lists) and efficiency (retrieval latency)

84

[Figure: effectiveness vs. efficiency trade-off; passing fewer candidates to later stages increases efficiency]

85 of 172

Multi-stage Rerankers

why multi-stage?

duoBERT

Cascade Transformers

85

86 of 172

Multi-stage with duoBERT

86

[Figure: multi-stage pipeline. BM25 over an inverted index of d1 ... d5 retrieves R0 for query q; monoBERT scores each (q, di) pointwise (si) to produce R1; duoBERT scores pairs of candidates (q, di, dj) → pi,j and aggregates them to produce the final ranking R2.]

Nogueira, Yang, Cho, Lin. Multi-stage Document Ranking with BERT. arXiv 2019.

87 of 172

87

duoBERT's Input Format

[Figure: the query q1 ... qm, text di, and text dj are packed into one sequence: [CLS] query [SEP] text di [SEP] text dj [SEP]; each position's input is the sum of its token, segment (EA / EB / EC), and position embeddings; the [CLS] output vector T[CLS] is projected (D × 2) to produce pi,j.]

88 of 172

Training duoBERT

88

Input: [CLS] query q [SEP] text di [SEP] text dj [SEP]

duoBERT answers: is doc di more relevant than doc dj to the query q?

Output: pi,j = p(di > dj | q)

Loss: cross-entropy on the pairwise probabilities pi,j

89 of 172

Inference with duoBERT

89

p1,2 = p(d1>d2 | q)

d1

q

d2

duo

BERT

d3

d2

d1

R1

monoBERT

d1

d2

R2

d3

Pairwise aggregation:

s1 = p1,2 + p1,3

s2 = p2,1 + p2,3

s3 = p3,1 + p3,2

d1

d2

d1

d3

d2

d1

d2

d3

d3

d1

d3

d2

Text Pairs
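A minimal sketch of the pairwise aggregation step in Python, assuming the pairwise probabilities p[i][j] = p(di > dj | q) have already been produced by duoBERT; the toy values are illustrative.

```python
def aggregate_pairwise(p):
    """p[i][j] = p(d_i > d_j | q) for i != j; returns s_i = sum over j of p[i][j]."""
    n = len(p)
    return [sum(p[i][j] for j in range(n) if j != i) for i in range(n)]

# toy pairwise probabilities for d1, d2, d3 (indices 0, 1, 2)
p = [[None, 0.9, 0.8],
     [0.1, None, 0.6],
     [0.2, 0.4, None]]
scores = aggregate_pairwise(p)                            # [1.7, 0.7, 0.6]
reranked = sorted(range(len(scores)), key=lambda i: -scores[i])
print(scores, reranked)                                   # d1 > d2 > d3
```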

90 of 172

monoBERT vs. duoBERT

90

91 of 172

Multi-stage Rerankers

why multi-stage?

duoBERT

Cascade Transformers

91

92 of 172

Cascade Transformers (Soldaini and Moschitti, 2020)

92

Key Observation: a subset of the Transformer layers can be seen as a reranker

[Figure: monoBERT layers 1-4 score all candidate inputs [CLS] q [SEP] di [SEP] through a |D| × 2 head → s1, s2, s3, s4; the lowest-scoring candidates are dropped; layers 5-8 rescore the survivors → s1, s2, s4; layers 9-12 rescore the remainder → s1, s4.]

93 of 172

Cascade Transformers

93

[Figure: the same cascade drawn as a pipeline: R0 → monoBERT layers 1-4 → R1 → layers 5-8 → R2 → layers 9-12 → R3, with fewer candidates surviving at each stage.]

Key Observation: Subsets of monoBERT layers can be used as rerankers

For efficiency, activations from the previous reranker are passed to the next one

Soldaini, Moschitti. The Cascade Transformer: an Application for Efficient Answer Sentence Selection. ACL 2020.

94 of 172

Cascade Transformers

94

For efficiency, all candidate texts are packed in the same GPU batch

95 of 172

Results

  • Works well for answer selection → short sequences
  • How to pack 100 documents into a GPU batch?

95

𝛼: proportion of candidates to be discarded at each stage

96 of 172

Takeaways of Multi-stage Rerankers

Advantage:

    • more tuning knobs → more flexibility in the effectiveness/efficiency tradeoff space

Disadvantage:

  • more tuning knobs → more complexity

We are only starting to explore the design space of multi-stage reranking pipelines with Transformers

96

97 of 172

Document Preprocessing Techniques

Query vs document expansion

doc2query

DeepCT

DeepImpact

97

98 of 172

Document Preprocessing Techniques

Query vs document expansion

doc2query

DeepCT

DeepImpact

98

99 of 172

A Simple Search Engine

99

This section

100 of 172

Query Reformulation (aka query expansion)

query: "Where can I buy a small car?"

100

Query Reformulator

Search Engine

Better Retrieved Docs

reformulated query: "compact car sales store"

101 of 172

Query reformulation as a translation task

Query Reformulator: translates from the Query Language into the Document Language. Hard: the input (a query) has little information.

101

Document Translator: translates from the Document Language into the Query Language. Easier: the input (a document) has a lot of information.

102 of 172

Document Preprocessing Techniques

Query vs document expansion

doc2query

DeepCT

DeepImpact

102

103 of 172

doc2query

103

A seq2seq Transformer translates a Document into a Query.

Supervised training: pairs of <query, relevant document>

Source: Vaswani et al., 2017

Nogueira, Yang, Lin, Cho. Document expansion by query prediction. 2019.

104 of 172

doc2query

104

Input: Document — "Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily..."

Output: Predicted Query — "does cinnamon lower blood sugar?"

Expanded Document: the predicted queries are concatenated with the original document, and the result is indexed.

At query time, the search engine serves the user's query (e.g., "foods and supplements to lower blood sugar") against the expanded index → better retrieved docs.

In practice: 5-40 queries are sampled with top-k or nucleus sampling
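A hedged sketch of the expansion step with a seq2seq model and top-k sampling, assuming transformers and torch are available; the checkpoint name is an assumption (the doc2query authors released T5-based models on Hugging Face, but any doc2query-style checkpoint can be substituted), and the sampling settings are illustrative.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

name = "castorini/doc2query-t5-base-msmarco"   # assumed checkpoint; substitute as needed
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

doc = ("Researchers are finding that cinnamon reduces blood sugar levels "
       "naturally when taken daily...")
inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    out = model.generate(**inputs, max_length=64, do_sample=True, top_k=10,
                         num_return_sequences=5)          # sample 5 queries

queries = [tokenizer.decode(q, skip_special_tokens=True) for q in out]
expanded_doc = doc + " " + " ".join(queries)              # concatenate before indexing
print(expanded_doc)
```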

105 of 172

Results

105

zero-shot: doc2query was trained only on MS MARCO

                        MARCO Passage   TREC-DL 19   TREC-COVID   Robust04
                        (MRR@10)        (nDCG@10)    (nDCG@20)    (nDCG@20)
BM25                    0.184           0.506        0.659        0.428
 + doc2query            0.277           0.642        0.6375       0.446
BM25 + RM3              0.156           0.518        -            0.450
 + doc2query            0.214           0.655        -            0.466
BM25 + monoBERT/T5*     0.365           0.738        0.7785*      -
 + doc2query            0.379           0.754        0.7939       -

106 of 172

What is more important: copied or new words?

106

Predicted queries are "better" than documents

107 of 172

Examples

107

Excluding stop-words: 69% of the predicted-query words are copied from the document, 31% are new.

108 of 172

Document Preprocessing Techniques

Query vs document expansion

doc2query

DeepCT

DeepImpact

108

109 of 172

DeepCT

109

Text d: "The Geocentric Theory was proposed by the greeks under the guidance..."

Relevant query q: "who proposed the geocentric theory"

Target Scores (one per term of d): 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

DeepCT (BERT) predicts one score per term through a D × 1 regression head:

Predicted Scores: 0.2 0.5 0.2 0.1 0.4 0.1 0.0 0.6 0.2 0.4 0.3

Dai, Callan. Context-aware sentence/passage term importance estimation for first stage retrieval. 2019.

110 of 172

Once DeepCT is trained...

110

Text d: "Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily..."

DeepCT (BERT) predicts one score per term (D × 1 head):

Predicted Scores: 0.1 0.0 0.2 0.1 0.9 0.8 0.9 1.0 0.4 0.2 0.0 0.1 0.4

The scores are scaled into pseudo Term Frequencies: 10 0 20 10 90 80 90 100 40 20 0 10 40

New document for the Index (each term repeated according to its pseudo frequency): "Researchers Researchers … finding finding … that that … cinnamon cinnamon ... reduces ... "
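A minimal sketch of the indexing step: predicted term-importance scores become pseudo term frequencies by repeating each term; the scores are hard-coded from the slide and the scale factor is an illustrative assumption.

```python
def deepct_pseudo_doc(terms, scores, scale=100):
    """Repeat each term according to its predicted importance (pseudo term frequency)."""
    return " ".join(" ".join([t] * round(s * scale)) for t, s in zip(terms, scores))

terms = "Researchers are finding that cinnamon reduces blood sugar".split()
scores = [0.1, 0.0, 0.2, 0.1, 0.9, 0.8, 0.9, 1.0]   # predicted scores (as on the slide)
print(deepct_pseudo_doc(terms, scores, scale=10))    # smaller scale, just for display
```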

111 of 172

Results on MS MARCO Passage Dev Set

111

Model          MRR@10   R@1000   BERT Inferences per doc
BM25           0.184    0.853    -
 + doc2query   0.229    0.907    1
 + doc2query   0.277    0.944    40
DeepCT         0.243    0.913    1

112 of 172

Document Preprocessing Techniques

Query vs document expansion

doc2query

DeepCT

DeepImpact

112

113 of 172

DeepImpact: combining doc2query with DeepCT

113

Input Document: "Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily..."

Output: Predicted Query — "does cinnamon lower blood sugar?" — concatenated with the document to form the Expanded Document.

A Term Scorer (~DeepCT) assigns an impact to each term of the expanded document, e.g., {Researchers: 31, are: 0, finding: 4, that: 1, ...}, and the impacts are stored in the Index.

At query time, the search engine serves the user's query (e.g., "foods and supplements to lower blood sugar") against this index → better retrieved docs.

Mallia, Khattab, Tonellotto, and Suel. Learning Passage Impacts for Inverted Indexes. 2021

114 of 172

Results on MS MARCO Passage Dev Set

114

Model             MRR@10   R@1000   Latency (ms/query)
BM25              0.184    0.853    13
DeepCT            0.243    0.913    11
doc2query         0.278    0.947    12
DeepImpact        0.326    0.948    58
BM25 + monoBERT   0.355    0.853    10,700 (GPU)

115 of 172

How to Expand Long Documents?

115

[Figure: the document is split into windows of N sentences; doc2query / DeepCT / DeepImpact is run on each window to produce queries or terms, which are concatenated (+) with the document.]

116 of 172

Takeaways of Document Expansion

Advantages:

    • Documents have more context than queries → an easier prediction task
    • Documents can be processed offline and in parallel
    • Runs on CPU at query time

Disadvantages:

    • Have to iterate over the entire collection
    • Not as effective as rerankers (yet)

116

117 of 172

Beyond BERT

Improving effectiveness or efficiency

117

118 of 172

Improving Effectiveness: Model Variants

BERT Variants:

    • Improve training → replaced-token prediction rather than mask prediction (ELECTRA)
    • Modify training → tune hyper-parameters, remove the NSP task (RoBERTa)
    • Modify architecture → share weights across Transformer layers (ALBERT)

118

119 of 172

Improving Effectiveness: Model Variants

119

[Figure: Original text: "The bank makes loans to clients."

BERT — random masking: "The bank makes loans [MASK] clients ." → predict the masked token.

ELECTRA — term replacement: "The store makes loans and clients ." → detect which tokens were replaced.]

120 of 172

Improving Effectiveness: BERT Variants

120

Zhang, Yates, Lin. Comparing Score Aggregation Approaches for Pretrained Neural Language Models. ECIR 2021.

121 of 172

Improving Effectiveness: T5

Sequence2Sequence Model (T5): larger, improved pre-training

  • Input: query: [q] document: [d] relevant:

121

Training: the target output is the token "true" or "false"

Inference: score = P(token="true" | q, d), read off the softmax over the first decoded token

Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li, Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR 2020.
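A hedged sketch of monoT5-style scoring, assuming transformers and torch are available; the checkpoint name is an assumption (the authors released monoT5 models on Hugging Face, but any monoT5-style model can be substituted), and the query/document strings are toy examples.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

name = "castorini/monot5-base-msmarco"   # assumed checkpoint; substitute as needed
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

query = "black bear attacks"
doc = "Black bears rarely attack humans, but food can attract them ..."
text = f"query: {query} document: {doc} relevant:"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

true_id = tokenizer.encode("true")[0]    # id of the "true" token
false_id = tokenizer.encode("false")[0]  # id of the "false" token
decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
with torch.no_grad():
    logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, 0]

# score = P(first decoded token = "true" | q, d), softmax restricted to {false, true}
score = logits[[false_id, true_id]].softmax(dim=-1)[1].item()
print(score)
```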

122 of 172

Improving Effectiveness: T5

122

Nogueira, Jiang, Pradeep, Lin. Document Ranking with a Pretrained Sequence-to-Sequence Model. Findings of EMNLP 2020.

(desc.)

123 of 172

Improving Efficiency: Distillation & Architectures

Distillation:

    • Train a smaller model (β€œstudent”) to mimic a larger model (β€œteacher”)
    • Can be performed before fine-tuning, after, or both (best)

Distillation loss: cross-entropy between the teacher's and the student's logits

123

124 of 172

Improving Efficiency: Distillation

124

Li, Yates, MacAvaney, He, Sun. PARADE: Passage Representation Aggregation for Document Reranking. arXiv 2020.

(titles)

125 of 172

Improving Efficiency: Distillation & Architectures

Distillation:

    • Train a smaller model (β€œstudent”) to mimic a larger model (β€œteacher”)
    • Can be performed before fine-tuning, after, or both (best)

Smaller architectures:

    • TK, TKL, CK β†’ skip pretraining and reduce the number of Transformer layers
    • Take GloVe embeddings as input

125

126 of 172

Improving Efficiency: TK Architectures

126

Small Transformer stack

Similarity matrix & KNRM

Hofstatter, Zlabinger, Hanbury. Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking. ECAI 2020.

127 of 172

Improving Efficiency: TK Architectures

127

Hofstatter, Zlabinger, Hanbury. Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking. ECAI 2020.

128 of 172

Domain-specific Applications

128

129 of 172

TREC-COVID (April-September 2020)

Task: Find scientific articles relevant to questions such as:

    • Are patients taking Angiotensin-converting enzyme inhibitors (ACE) at increased risk for COVID-19?

129

MacAvaney, Cohan, Goharian. SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Literature Search. EMNLP 2020.

130 of 172

TREC 2020 - Precision Medicine Track

Task: Find relevant scientific articles in PubMed for questions such as:

Is Dabrafenib effective for the melanoma disease in patients with gene BRAF (V600E) mutation?

BERT-based models work fine with small tweaks:

130

Model          nDCG@30
median         0.2857
BM25           0.3081
 + monoT5rct   0.4193
damoespb1      0.4255

Roberts, Demner-Fushman, Voorhees, Bedrick, Hersh, Overview of the TREC 2020 Precision Medicine Track. TREC 2020.

Zero-shot: fine-tuned only on MS MARCO!

131 of 172

TREC 2020 - Health Misinformation

131

Task: Find relevant documents for questions such as:

Can ibuprofen worsen COVID-19?

Metric penalizes documents that contain incorrect information

[Figure: BM25 + monoT5 + LabelT5 performs better than BM25 + monoT5 on this metric.]

132 of 172

Legal Domain

COLIEE 2020, Task 1:

Find relevant cases in a corpus that support the decision of a given case.

132

Team JNLP (Nguyen et al. 2020)

Method            F1
Pre-BERT          0.5148
BERT              0.6379
Pre-BERT + BERT   0.6682

Team TLIR (Shao et al. 2020)

Method        F1
BM25          0.5287
BM25 + BERT   0.6397

133 of 172

Takeaways

  • Zero-shot BERT-based models are starting to show better effectiveness than classical retrieval
  • Simple domain adaptation techniques work well with Transformer architectures (e.g., SLEDGE-Z)

133

134 of 172

Q&A

10 minutes

134

135 of 172

Break

Resume at 10:40 PDT

135

136 of 172

Part 3: Ranking with Dense Representations

136

137 of 172

Sparse Representations

137

Task: Estimate the relevance of text d to a query q:

q = "fix my air conditioner"

d = "... AC repair ..."

Advantages: 1) Fast to retrieve candidates from an inverted index because q is usually short. 2) Fast to compute because q ∩ d is usually small.

Disadvantage: Terms need to match exactly

138 of 172

Dense Representations

138

q = "fix my air conditioner"

d = "... AC repair ..."

0.8

-1.2

...

2.4

-0.3

Encoder

Continuous dense vectors ℝD

0.5

0.0

...

2.6

-0.7

Encoder

πœ™ is a similarity function (e.g., inner product or cosine similarity)

πœ™(πœ‚(q), πœ‚(d)) β†’ ideally measures how relevant q and d are to each other

πœ‚(q)

πœ‚(d)

139 of 172

Nearest Neighbor Search

139

140 of 172

Task: find the top k most relevant texts to a query

140

[Figure: brute-force search computes φ(η(q), η(d1)), ..., φ(η(q), η(d|C|)) for the query against every text in the collection and keeps the Top k.]

Brute-force search:

We often need to search over many (e.g., billions of) texts

    • Brute-force won't scale

141 of 172

Approximate Nearest Neighbor Search

  • Exchange accuracy for speed
  • e.g., k-means:

141

  • In practice, ANN implementations are more complicated
  • We assume a fast dense retrieval library is available (e.g., Faiss, Annoy, ScaNN); a Faiss sketch follows the figure below

[Figure: the query vector η(q) is first compared to the k-means Centroids (φ(η(q), η(c1)), φ(η(q), η(c2)), φ(η(q), η(c3))); only the documents in the closest cluster(s) are then scored (φ(η(q), η(d1)) … φ(η(q), η(dm))).]

142 of 172

Pre-BERT Dense Representations

142

143 of 172

Types of Encoders: Cross-encoder

143

144 of 172

Types of Encoders: Bi-encoder

144

πœ‚(q)

πœ‚(d)

πœ™(πœ‚(q), πœ‚(d))

s

πœ™(πœ‚(q), πœ‚(d))

πœ‚(q)

πœ‚(d)

145 of 172

Pre-BERT: Dual Embedding Space Model

[Figure: the query term embeddings are compared against the document centroid with a similarity function.]

145

Leverages term embeddings from pretrained word2vec

Mitra, Nalisnick, Craswell, Caurana. A Dual Embedding Space Model for Document Ranking. arXiv & WWW 2016.

146 of 172

Similarity Functions: Distance vs. Comparison

Comparison

Distance

146

147 of 172

Pre-BERT: Deep Structured Semantic Model

[Figure: term hashing → feedforward network → representation; the query and document representations are compared with cosine similarity.]

147

Huang, He, Gao, Deng, Acero, Heck. Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. CIKM 2013.

148 of 172

Distance-based Transformer Representations

148

149 of 172

Distance-based Representations

Key characteristic

Simple similarity function → inner (dot) product, cosine similarity, …

Compatible with ANN search

Johnson, Douze, JΓ©gou. Billion-scale similarity search with GPUs. arXiv 2017.

149

150 of 172

Distance-based: SentenceBERT

150

Text representation:

  • CLS

Text representation:

  • CLS
  • Mean
  • Max

Reimers, Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

151 of 172

Distance-based: SentenceBERT

151

Classification

Regression

Reimers, Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

152 of 172

SentenceBERT Result Highlights

Zero-shot:

  • Mean: GloVe > BERT
  • BERT Mean > CLS

CLS is very poor

Fine-tuned:

  • Mean > CLS and Max
  • Close to cross-encoder

CLS is slightly worse

The classification training objective is essential

152

Reimers, Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

153 of 172

Dense Passage Retrieval (DPR)

Highlights:

  • QA reader & retriever
  • Selecting negative examples: random, BM25, or in-batch

153

               Similarity Function (φ)   Representation
SentenceBERT   Cosine similarity         CLS, Mean, or Max
DPR            Inner product             CLS

[Figure: in-batch negatives: within a batch, Query-1's positive passage serves as a negative for Query-2 and vice versa.]

Karpukhin, Oğuz, Min, Lewis, Wu, Edunov, Chen, Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.

154 of 172

ANCE

Approximate Nearest Neighbor Negative Contrastive Estimation:

  • Choose negatives that are hard for the current model

  • Refresh them regularly from model checkpoints

154

               Similarity Function (φ)   Representation
SentenceBERT   Cosine similarity         CLS, Mean, or Max
DPR            Inner product             CLS
ANCE           Inner product             CLS

L. Xiong, C. Xiong, Li, Tang, Liu, Bennett, Ahmed, Overwikj. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. ICLR 2021.

155 of 172

Selecting Negative Examples: Results

156 of 172

CLEAR

Highlights:

  • Prepend QRY or DOC
  • Exact matching component
  • Negatives: matching errors
  • Dynamic hinge loss margin

156

               Similarity Function (φ)   Representation
SentenceBERT   Cosine similarity         CLS, Mean, or Max
DPR            Inner product             CLS
ANCE           Inner product             CLS
CLEAR          Inner product + BM25      Mean

Gao, Dai, Chen, Fan, van Durme, Callan. Complementing Lexical Retrieval with Semantic Residual Embedding. arXiv 2020; ECIR 2021.

157 of 172

RepBERT

Exact match signal:

  • Reduces MRR (-2%)
  • Improves recall (+1%)

157

               Similarity Function (φ)   Representation
SentenceBERT   Cosine similarity         CLS, Mean, or Max
DPR            Inner product             CLS
ANCE           Inner product             CLS
CLEAR          Inner product + BM25      Mean
RepBERT        Inner product             Mean

Zhan, Mao, Liu, Zhang, Ma. RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. arXiv 2020.

158 of 172

EPIC

Highlights:

  • Vocabulary vector
  • Term importance weights

158

               Similarity Function (φ)   Representation
SentenceBERT   Cosine similarity         CLS, Mean, or Max
DPR            Inner product             CLS
ANCE           Inner product             CLS
CLEAR          Inner product + BM25      Mean
RepBERT        Inner product             Mean
EPIC           Inner product             |V|-dimension vector

MacAvaney, Nardini, Perego, Tonellotto, Goharian, Frieder. Expansion via Prediction of Importance with Contextualization. SIGIR 2020.

159 of 172

Distance-based: Results

159

160 of 172

Comparison-based Transformer Representations

160

161 of 172

Comparison-based: Poly-encoders

161

[Figure: a spectrum over the number of interacting representations, from two (Query, Document) up to N = |Query| + |Document|, with computational cost growing accordingly.]

Humeau, Shuster, Lachaux, Weston. Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. ICLR 2020.

162 of 172

Comparison-based: Poly-encoders

M context codes

162

Compatible with ANN?

  • Unclear
  • Data-dependent

Humeau, Shuster, Lachaux, Weston. Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. ICLR 2020.

163 of 172

Comparison-based: ColBERT

MaxSim operator

163

Khattab, Zaharia. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.

164 of 172

Comparison-based: ColBERT

MaxSim: max pooling over the similarity matrix (one maximum per query term), followed by a sum over query terms

164

Khattab, Zaharia. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.
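A minimal numpy sketch of the MaxSim late-interaction score; the random token embeddings and their dimensions are illustrative placeholders for ColBERT's learned representations.

```python
import numpy as np

def maxsim_score(query_embs, doc_embs):
    """For each query term embedding, take the max similarity over document term
    embeddings, then sum over query terms."""
    sim = query_embs @ doc_embs.T              # similarity matrix [|q| x |d|]
    return float(sim.max(axis=1).sum())        # max-pool per query term, then sum

q = np.random.randn(4, 128)                    # 4 query token embeddings
d = np.random.randn(60, 128)                   # 60 document token embeddings
print(maxsim_score(q, d))
```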

165 of 172

Comparison-based: ColBERT

165

Compatible with ANN?

  • Unclear
  • Data-dependent
  • 70x faster than BERT-large

Khattab, Zaharia. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.

166 of 172

Comparison-based: ColBERT

166

167 of 172

Q&A

5 minutes

167

168 of 172

Break

Resume at 11:35 PDT

168

169 of 172

Conclusions and Future Directions

169

170 of 172

Conclusions

  • Pretrained Transformers have shown significant improvements on various IR benchmarks
  • Reproduced and adopted by many in academia and industry
  • No doubt we are in the age of BERT and Transformers

170

171 of 172

Open Questions

Transformers for ranking: apply (T5), adapt (Parade), or redesign (TK/CK)?

What is the future of:

  • zero-shot learning, transfer learning, domain adaptation, and task-specific fine-tuning?
  • dense vs. sparse retrieval?
  • multi-stage ranking architectures?

What can IR bring to transformers?

e.g., REALM (Guu et al., 2020) → text retrieval incorporated into pretraining

171

172 of 172

Pretrained Transformers for Text Ranking: BERT and Beyond

Thanks!

by Jimmy Lin, Rodrigo Nogueira, and Andrew Yates https://arxiv.org/abs/2010.06467

172

Learn more in the survey (& upcoming book):