2 of 172

Pretrained Transformers for Text Ranking: BERT and Beyond

Andrew Yates, Rodrigo Nogueira, and Jimmy Lin

2

3 of 172

Based on the survey: Pretrained Transformers for Text Ranking: BERT and Beyond

by Jimmy Lin, Rodrigo Nogueira, and Andrew Yates https://arxiv.org/abs/2010.06467

3

Tutorial organization:

  • Recorded tutorial
  • Live sessions: hands-on component and Q&A

4 of 172

Outline

  • Part 1: Background (text ranking, IR, ML)
  • Part 2: Ranking with relevance classification
  • Part 3: Ranking with dense representations
  • Part 4: Conclusion & future directions

4

5 of 172

Text Ranking

Text ranking problems

Transformers

5

6 of 172

Definition

Given: a piece of text (keywords, question, news article, …)

Rank: other pieces of text

(passages, documents, queries, …)

Ordered by: their similarity

6

e.g., Web search

7 of 172

Focus: Ad hoc Retrieval

Given: query q and a collection of texts

Return: a ranked list of k texts d1 … dk

Maximizing: a metric of interest

7

[Figure: the query "black bear attacks" is issued against the collection; the system returns a ranked list (1., 2., 3., ...), which is scored with the metric of interest, e.g., 0.66]

8 of 172

Other Problems: Question Answering

Approach:

  • Rank passages
  • Rank answer spans

8

Source: SQuAD

9 of 172

Other Problems: Community Question Answering

New question: What is the longest airline flight?

9

Source: Quora

10 of 172

Other Problems: Text Recommendation

10

Source: Science News

11 of 172

Focus: Content-based Similarity

Agreement between query and a piece of text

11

12 of 172

Transformers

12

13 of 172

Pretrained Transformers

13

Initialized via pretraining

14 of 172

IR Background

Unsupervised ranking methods

Metrics

Test collections

14

15 of 172

Unsupervised Ranking Methods

15

[Figure: the ranking score of an input text is a sum of per-term scores over the query terms]

16 of 172

Unsupervised Ranking Methods

Inverse Document Frequency (more discriminative): frequent terms such as "the", "and", "but" receive low IDF; rarer terms such as "data", "ranking", "SIGIR" receive high IDF.

Term Frequency: how often a term occurs in the text.

16

17 of 172

Unsupervised Ranking Methods

17

Sparse representations

18 of 172

Unsupervised Ranking Methods

18

BM25(q, d) = Σ over terms t in q of IDF(t) · tf(t, d) · (k1 + 1) / ( tf(t, d) + k1 · (1 − b + b · |d| / avgdl) )

IDF component: IDF(t)

TF component: tf(t, d)

term saturation: controlled by k1

length normalization: controlled by b and |d| / avgdl
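To make the components concrete, here is a minimal BM25 scorer, a sketch in Python; the toy corpus, the whitespace tokenization, and the parameter values k1 = 0.9, b = 0.4 are illustrative assumptions, not part of the slides.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=0.9, b=0.4):
    """Minimal BM25: sum over query terms of IDF * saturated, length-normalized TF."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        # IDF component: rarer terms are more discriminative
        idf = math.log(1 + (num_docs - doc_freqs[term] + 0.5) / (doc_freqs[term] + 0.5))
        # TF component with term saturation (k1) and length normalization (b)
        norm = k1 * (1 - b + b * len(doc_terms) / avg_doc_len)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score

# Toy usage
docs = [["black", "bear", "attacks", "reported"], ["brown", "bear", "habitat"]]
df = Counter(t for d in docs for t in set(d))
avg_len = sum(len(d) for d in docs) / len(docs)
for d in docs:
    print(bm25_score(["black", "bear", "attacks"], d, df, len(docs), avg_len))
```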

19 of 172

Vocabulary Mismatch

[Figure: the same information need can be phrased as "document retrieval", "text ranking", "document ranking", or "Web search", all built from related terms such as query, search, ranking, retrieval, ...]

19

Enrich query or document representations → move beyond exact matching

  • Unsupervised: pseudo-relevance feedback
  • Later: document expansion with a neural method


20 of 172

Benchmarking: Relevance Judgments

Is this document relevant to the query?

20

(Subjective)

21 of 172

Evaluation Metrics: Precision & Recall

21

Precision = (# relevant docs returned) / (# results returned)

Recall = (# relevant docs returned) / (# relevant docs in collection)

22 of 172

Evaluation Metrics: Average Precision

22

AP = (1 / # relevant docs in collection) · Σ (precision at each rank where a relevant doc is returned)

23 of 172

Evaluation Metrics: Reciprocal Rank

23

RR = 1 / (rank of the first relevant doc)
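A minimal sketch of these metrics in Python, assuming a ranked list of doc ids and a set of relevant doc ids; the names and toy values are illustrative.

```python
def precision_recall(ranked, relevant, k):
    retrieved = ranked[:k]
    hits = sum(1 for d in retrieved if d in relevant)
    return hits / len(retrieved), hits / len(relevant)

def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i          # precision when a relevant doc is returned
    return total / len(relevant)       # normalized by # relevant docs in collection

def reciprocal_rank(ranked, relevant):
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i             # 1 / rank of the first relevant doc
    return 0.0

ranked = ["d3", "d1", "d7"]
relevant = {"d1", "d4"}
print(precision_recall(ranked, relevant, k=3))   # (0.333..., 0.5)
print(average_precision(ranked, relevant))       # 0.25
print(reciprocal_rank(ranked, relevant))         # 0.5
```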

24 of 172

Graded Relevance Judgments

How relevant is this document to the query?

24

[Figure: a graded scale from "related" up to "highly relevant" (more relevant)]

25 of 172

Evaluation Metrics: nDCG

25

DCG = Σ over ranks i = 1, 2, 3, ... of gain(rel_i) / log2(i + 1); the gain asks "how relevant?", the decreasing discount asks "how early?"

nDCG = DCG / ideal DCG, where the ideal DCG is computed from the list sorted by relevance
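A minimal nDCG sketch in Python; the gain 2^rel − 1 and the toy judgments are illustrative assumptions, and for simplicity the ideal DCG is computed from the same list (in practice it uses all judged documents).

```python
import math

def dcg(gains):
    # "how relevant?" in the numerator, "how early?" in the log discount
    return sum((2 ** g - 1) / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains, k=None):
    gains = gains[:k] if k else gains
    ideal = sorted(gains, reverse=True)   # ideal DCG: same gains sorted by relevance
    return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# graded judgments of the returned list, e.g. 0 = not relevant, 2 = highly relevant
print(ndcg([2, 0, 1], k=3))
```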

26 of 172

Test Collections

26

[Figure: a test collection consists of queries, documents, and relevance judgments]

27 of 172

Test Collections

TREC Robust04

  • News
  • Keywords
  • Natural language

27

ID: 336

Title: black bear attacks

Description: A relevant document would discuss the frequency of vicious black bear attacks worldwide and the possible causes for this savage behavior.

Narrative: It has been reported that food or cosmetics sometimes attract hungry black bears, causing them to viciously attack humans. Relevant documents would include the aforementioned causes as well as speculation preferably from the scientific community as to other possible causes of vicious attacks by black bears. A relevant document would also detail steps taken or new methods devised by wildlife officials to control and/or modify the savageness of the black bear.

28 of 172

Test Collections

MS MARCO

  • Web
  • Questions
  • Passage collection
  • Document collection
  • Sparse judgments

28

TREC Deep Learning

  • Same passages / docs
  • New (dense) judgments

ID: 130510

Text: definition declaratory judgment

ID: 1131069

Text: how many sons robert kraft has

ID: 1131069

Text: when did rock n roll begin?

ID: 1103153

Text: who is thomas m cooley

29 of 172

Documents and Mean / Median Lengths

29

30 of 172

Queries, Query Lengths, and Judgments

30

31 of 172

A Simple Search Engine

31

e.g., BM25

32 of 172

Machine Learning Background

Learning to rank

Deep learning for ranking

BERT

32

33 of 172

Machine Learning Background

Learning to rank

Deep learning for ranking

BERT

33

34 of 172

A Simple Search Engine

34

This section: the inverted index maps each term (Key) to its postings list (Value):

"chair" → [text #83, text #743, ...]
"store" → [text #1003, text #50, ...]
...

35 of 172

Learning to Rank (> 1990)

  • Supervised machine learning techniques
  • Typically based on hand-crafted features:
    • Content (e.g., term frequencies, document lengths)
    • Meta-data (e.g., PageRank scores)
  • RankNet (Burges et al., 2005): a neural net
    • Differs from deep learning models in that it still requires hand-crafted features
  • Gained popularity with user click data (Burges, 2010)

35

36 of 172

Learning to Rank - Types of Losses

query q, texts d1, d2, d3, a ranker fθ, and a loss L:

36

Pointwise:

L(fθ, q, d1, d2, d3) = L(fθ, q, d1) + L(fθ, q, d2) + L(fθ, q, d3)

Pairwise:

L(fθ, q, d1, d2, d3) = L(fθ, q, d1, d2) + L(fθ, q, d1, d3) + L(fθ, q, d2, d3)

Listwise:

L(fθ, q, d1, d2, d3) = L(fθ, q, d1, d2, d3)
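A minimal sketch of how a pointwise loss and a pairwise (RankNet-style) loss differ, assuming the ranker's scores and binary relevance labels are already available; the toy values are illustrative.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def pointwise_loss(scores, labels):
    # binary cross-entropy per (q, d) pair, summed over documents
    return -sum(y * math.log(sigmoid(s)) + (1 - y) * math.log(1 - sigmoid(s))
                for s, y in zip(scores, labels))

def pairwise_loss(scores, labels):
    # cross-entropy on P(d_i > d_j) for every pair where d_i is more relevant than d_j
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                loss += -math.log(sigmoid(scores[i] - scores[j]))
    return loss

scores = [2.0, 0.5, -1.0]   # f_theta(q, d1), f_theta(q, d2), f_theta(q, d3)
labels = [1, 1, 0]          # relevance of d1, d2, d3
print(pointwise_loss(scores, labels), pairwise_loss(scores, labels))
```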

37 of 172

Learning to Rank - Types of Losses

An example:

Rel(d1) > Rel(d2) > Rel(d3), fπœƒ = a neural network that outputs a probability, L = cross-entropy

37

[Figure: Pointwise — the neural net scores each (q, di) independently, producing pi. Pairwise — the neural net scores pairs (q, di, dj), producing pi,j, with p1,2 > p2,1 when d1 is more relevant than d2. Listwise — the neural net scores whole orderings, with p1,2,3 > p3,2,1 for the correct ordering.]

38 of 172

Machine Learning Background

Learning to rank

Deep learning for ranking

BERT

38

39 of 172

Neural Ranking Models (> 2016)

39

We will revisit these architectures in the Dense Retrieval section

Representation-based

Interaction-based

40 of 172

Popular Neural Ranking Models

40

41 of 172

Machine LearningΒ Background

Learning toΒ rank

Deep learning for ranking

BERT

41

42 of 172

Progress in Information Retrieval - Robust04

42

Yilmaz et al., 2019; MacAvaney et al. 2019

Li et al., 2020;

Nogueira et al., 2020

Some of them are zero-shot!

43 of 172

MS MARCO Passage Ranking Leaderboard in December 2018

43

44 of 172

MS MARCO Passage Ranking Leaderboard in January 2019

44

~8 points!

45 of 172

MS MARCO Passage Ranking Leaderboard in June 2021

45

46 of 172

Adoption by Commercial Search Engines

46

We’re making a significant improvement to how we understand queries, representing the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search.

Starting from April of this year (2019), we used large transformer models to deliver the largest quality improvements to our Bing customers in the past year.

MS Bing

Google Search

47 of 172

What is BERT?

47

Self-supervised: ∞ training data

Supervised: (few) labeled examples

Devlin, Chang, Lee, Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

48 of 172

BERT's Pretraining Ingredients

48

Transformer (encoder-only)

with lots of parameters

+

Lots of texts

+

Lots of Compute

49 of 172

49

[Figure: string → sequence of vectors. The input string "The bank makes loans to clients." is tokenized into [CLS] The bank makes loans to clients . [SEP]; each position's input is the sum of its Token Embedding, Segment Embedding (EA), and Position Embedding (P0, P1, ...); BERT maps these inputs E[CLS], E1, ..., E[SEP] to contextualized output vectors T[CLS], T1, ..., T[SEP].]

50 of 172

50

Pretraining - Masked Language Modeling

[Figure: random masking turns the input into "The bank makes loans [MASK] clients ."; BERT's output vector at the masked position is projected by a D × |V| matrix and passed through a softmax over the vocabulary.

Loss = -log(P("to" | masked input))]

51 of 172

Pretraining

51

Input: a document

"The Mongol invasion of Europe in the 13th century was the conquest of much of Europe by the Mongol Empire."

Random masking:

"The Mongol invasion of Europe in the 13th [MASK] was the conquest of much of [MASK] by the Mongol Empire."

[Figure: BERT produces a vector for each masked position; a linear layer + softmax over the vocabulary yields token probabilities, e.g., century 0.94, year 0.02, car 0.00 for the first mask and Europe 0.97, land 0.01, is 0.00 for the second.]
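A hedged sketch of masked-token prediction with a pretrained BERT, assuming the Hugging Face transformers and torch libraries are available; the checkpoint "bert-base-uncased" and the example sentence are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "The bank makes loans [MASK] clients ."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # [1, seq_len, vocab_size]

# locate the masked position and inspect the softmax over the vocabulary
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
probs = logits[0, mask_pos].softmax(dim=-1)
top = probs.topk(3)
print([(tokenizer.decode([int(i)]), round(float(p), 3))
       for p, i in zip(top.values, top.indices)])
```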

52 of 172

52

Finetuning

53 of 172

BERT

for Relevance Classification

(aka monoBERT)

53

54 of 172

monoBERT:

BERT reranker

54

We want:

si = P(Relevant = 1 | q, di)

Input: query q and text di, packed into a single sequence.

si = softmax(T[CLS] W + b)1, where W (D × 2) and b project the [CLS] output vector onto two classes: Non-relevant and Relevant.
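A hedged sketch of monoBERT-style scoring with a sequence-classification head, assuming transformers and torch are available; the checkpoint name is a placeholder for a monoBERT-style fine-tuned reranker, and the query/passage strings are toy examples.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "a-finetuned-monobert-checkpoint"   # hypothetical; substitute a real reranker checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

query = "black bear attacks"
passage = "Black bears rarely attack humans, but food can attract them ..."
# builds the [CLS] query [SEP] passage [SEP] input
inputs = tokenizer(query, passage, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # [1, 2]: (non-relevant, relevant)

s_i = logits.softmax(dim=-1)[0, 1].item()      # s_i = P(Relevant = 1 | q, d_i)
print(s_i)
```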

55 of 172

Training monoBERT

Loss: cross-entropy over the relevant / non-relevant labels; positive examples come from human judgments, negative examples are sampled from BM25 candidates.

55

56 of 172

Once monoBERT is trained...

56

[Figure: retrieve-then-rerank pipeline. BM25 over an inverted index of d1 ... d5 retrieves candidates R0 for query q; monoBERT scores each (q, di) pair (si) and reorders the candidates into the reranked list R1, e.g., d3, d2, d5.]

57 of 172

TREC 2019 - Deep Learning Track - Passage

57

               nDCG@10   MAP     Recall@1k
BM25           0.506     0.377   0.739
 + monoBERT    0.738     0.506   0.739
BM25 + RM3     0.518     0.427   0.788
 + monoBERT    0.742     0.529   0.788

58 of 172

TREC 2019 - Deep Learning Track - Passage

58

+5 points in Recall@1k → +2 points in MAP

Why?

Hypothesis: Mismatch between training and inference lists

               nDCG@10   MAP     Recall@1k
BM25           0.506     0.377   0.739
 + monoBERT    0.738     0.506   0.739
BM25 + RM3     0.518     0.427   0.788
 + monoBERT    0.742     0.529   0.788

59 of 172

How useful is the BM25 signal?

59

60 of 172

Recap: Pre-BERT vs. monoBERT

60

61 of 172

Q&A

5 minutes

61

62 of 172

Break

Resume at 9:00 PDT

62

63 of 172

Part 2: Ranking with Relevance Classification

63

64 of 172

BERT’s Limitations

64

Cannot input entire documents

  • what do we input?
  • and how do we label it?

Needs a separate embedding for every possible position

  • restricted to indices 0-511


65 of 172

BERT’s Limitations

65

computationally expensive layers

  • e.g., 110+ million learned weights


(later: Beyond BERT & Dense Representations)

Multi-stage ranking pipeline

  • Identify candidate documents
  • Rerank

66 of 172

From Passages to Documents

66

67 of 172

Handling Length Limitation: Training

Chunk documents

Transfer labels (approximation)

67

68 of 172

Handling Length Limitation: Inference

Aggregate Evidence

68

69 of 172

Approach #1: Score Aggregation

69

[Figure: each passage of the document is scored independently (s1, s2, s3); the passage scores are aggregated into a single Document Score.]

  1. Over passage scores. Dai, Callan. Deeper Text Understanding for IR with Contextual Neural Language Modeling. SIGIR 2019.
  2. Over sentence scores. Yilmaz, Yang, Zhang, Lin. Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval. EMNLP '19.

70 of 172

Over Passage Scores: BERT-MaxP, FirstP, SumP

70

[Figure: passage scores s1, s2, s3 are aggregated into the Document Score by taking the max (BERT-MaxP), the first (FirstP), or the sum (SumP).]

Dai, Callan. Deeper Text Understanding for IR with Contextual Neural Language Modeling. SIGIR 2019.
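A minimal sketch of the three score-aggregation strategies in Python; the passage scores are toy values.

```python
def aggregate(passage_scores, how="max"):
    """BERT-MaxP / FirstP / SumP: one document score from its passages' scores."""
    if how == "max":
        return max(passage_scores)
    if how == "first":
        return passage_scores[0]
    if how == "sum":
        return sum(passage_scores)
    raise ValueError(how)

scores = [0.12, 0.87, 0.35]   # s1, s2, s3 from the reranker over three passages
print(aggregate(scores, "max"), aggregate(scores, "first"), aggregate(scores, "sum"))
```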

71 of 172

Over Passage Scores: Results

71

Dai, Callan. Deeper Text Understanding for IR with Contextual Neural Language Modeling. SIGIR 2019.

72 of 172

Over Sentence Scores: Birch

Document score = interpolation of the first-stage retrieval score with the top sentence scores

72

Yilmaz, Yang, Zhang, Lin. Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval. EMNLP '19.

  • Trained on sentence-level relevance judgments (e.g., tweets)
  • Interpolation weights are tuned on the target dataset

73 of 172

Over Sentence Scores: Results

73

Yilmaz, Yang, Zhang, Lin. Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval. EMNLP '19.

74 of 172

Approach #2: Representation Aggregation

74

  • Over term embeddings. MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019.
  • Over passage representations. Li, Yates, MacAvaney, He, Sun. PARADE: Passage Representation Aggregation for Document Reranking. arXiv 2020.

passage representations

75 of 172

Over Term Embeddings: CEDR

75

Similarity matrix built from contextualized embeddings (passages concatenated), fed into an interaction-based pre-BERT model (PACRR, KNRM)

MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019.

76 of 172

Over Term Embeddings: CEDR

76

Relevant Document

Nonrelevant

MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019.

77 of 172

Over Term Embeddings: Results

77

MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019.

78 of 172

Over Passage Representations: PARADE

78

Li, Yates, MacAvaney, He, Sun. PARADE: Passage Representation Aggregation for Document Reranking. arXiv 2020.

Aggregation of passages’ CLS embeddings

Aggregation approaches (in increasing complexity; see the sketch after this list):

  • Average feature value
  • Max feature value
  • Attn-weighted average
  • Two Transformer layers
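A minimal numpy sketch of the simpler aggregation variants over the passages' CLS embeddings; the random vectors and the random attention query vector are illustrative placeholders for learned components.

```python
import numpy as np

def aggregate_cls(cls_vectors, how="avg"):
    """cls_vectors: [num_passages, dim] array of per-passage [CLS] embeddings."""
    if how == "avg":
        return cls_vectors.mean(axis=0)
    if how == "max":
        return cls_vectors.max(axis=0)          # element-wise max feature value
    if how == "attn":
        # attention-weighted average with a query vector w (random here, learned in practice)
        w = np.random.randn(cls_vectors.shape[1])
        a = np.exp(cls_vectors @ w)
        a /= a.sum()
        return (a[:, None] * cls_vectors).sum(axis=0)
    raise ValueError(how)

doc_repr = aggregate_cls(np.random.randn(4, 768), how="attn")
print(doc_repr.shape)   # (768,) -> fed to a final scoring layer
```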

79 of 172

Over Passage Representations: Results

79

Li, Yates, MacAvaney, He, Sun. PARADE: Passage Representation Aggregation for Document Reranking. arXiv 2020.

80 of 172

Enlarge Passage Representations: Longformer, QDS

80

Jiang, Xiong, Lee, Wang. Long Document Ranking with Query-Directed Sparse Transformer. Findings of EMNLP 2020.

Longformer: sparse attention

QDS-Transformer: specialize to IR

Beltagy, Peters, Cohan. Longformer: The Long-Document Transformer. arXiv 2020.

81 of 172

Multi-stage rerankers

why multi-stage?

duoBERT

Cascade Transformers

81

82 of 172

Multi-stage rerankers

why multi-stage?

duoBERT

Cascade Transformers

82

83 of 172

From Single to Multiple Rerankers

83

84 of 172

Why Multi-stage?

  • Trade-off between effectiveness (quality of the ranked lists) and efficiency (retrieval latency)

84

[Figure: effectiveness vs. efficiency trade-off; passing fewer candidates to later stages increases efficiency]

85 of 172

Multi-stage Rerankers

why multi-stage?

duoBERT

Cascade Transformers

85

86 of 172

Multi-stage with duoBERT

86

[Figure: multi-stage pipeline. BM25 over an inverted index of d1 ... d5 retrieves R0 for query q; monoBERT scores each (q, di) pointwise (si) to produce R1; duoBERT scores pairs of candidates (q, di, dj) → pi,j and aggregates them to produce the final ranking R2.]

Nogueira, Yang, Cho, Lin. Multi-stage Document Ranking with BERT. arXiv 2019.

87 of 172

87

duoBERT's Input Format

[Figure: the query q1 ... qm, text di, and text dj are packed into one sequence: [CLS] query [SEP] text di [SEP] text dj [SEP]; each position's input is the sum of its token, segment (EA / EB / EC), and position embeddings; the [CLS] output vector T[CLS] is projected (D × 2) to produce pi,j.]

88 of 172

Training duoBERT

88

Input: [CLS] query q [SEP] text di [SEP] text dj [SEP]

duoBERT answers: is doc di more relevant than doc dj to the query q?

Output: pi,j = p(di > dj | q)

Loss: cross-entropy on the pairwise probabilities pi,j

89 of 172

Inference with duoBERT

89

p1,2 = p(d1>d2 | q)

d1

q

d2

duo

BERT

d3

d2

d1

R1

monoBERT

d1

d2

R2

d3

Pairwise aggregation:

s1 = p1,2 + p1,3

s2 = p2,1 + p2,3

s3 = p3,1 + p3,2

d1

d2

d1

d3

d2

d1

d2

d3

d3

d1

d3

d2

Text Pairs
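A minimal sketch of the pairwise aggregation step in Python, assuming the pairwise probabilities p[i][j] = p(di > dj | q) have already been produced by duoBERT; the toy values are illustrative.

```python
def aggregate_pairwise(p):
    """p[i][j] = p(d_i > d_j | q) for i != j; returns s_i = sum over j of p[i][j]."""
    n = len(p)
    return [sum(p[i][j] for j in range(n) if j != i) for i in range(n)]

# toy pairwise probabilities for d1, d2, d3 (indices 0, 1, 2)
p = [[None, 0.9, 0.8],
     [0.1, None, 0.6],
     [0.2, 0.4, None]]
scores = aggregate_pairwise(p)                            # [1.7, 0.7, 0.6]
reranked = sorted(range(len(scores)), key=lambda i: -scores[i])
print(scores, reranked)                                   # d1 > d2 > d3
```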

90 of 172

monoBERT vs. duoBERT

90

91 of 172

Multi-stage Rerankers

why multi-stage?

duoBERT

Cascade Transformers

91

92 of 172

Cascade Transformers (Soldaini and Moschitti, 2020)

92

Key Observation: a subset of the Transformer layers can be seen as a reranker

[Figure: monoBERT layers 1-4 score all candidate inputs [CLS] q [SEP] di [SEP] through a |D| × 2 head → s1, s2, s3, s4; the lowest-scoring candidates are dropped; layers 5-8 rescore the survivors → s1, s2, s4; layers 9-12 rescore the remainder → s1, s4.]

93 of 172

Cascade Transformers

93

[Figure: the same cascade drawn as a pipeline: R0 → monoBERT layers 1-4 → R1 → layers 5-8 → R2 → layers 9-12 → R3, with fewer candidates surviving at each stage.]

Key Observation: Subsets of monoBERT layers can be used as rerankers

For efficiency, activations from the previous reranker are passed to the next one

Soldaini, Moschitti. The Cascade Transformer: an Application for Efficient Answer Sentence Selection. ACL 2020.

94 of 172

Cascade Transformers

94

For efficiency, all candidate texts are packed in the same GPU batch

95 of 172

Results

  • Works well for answer selection → short sequences
  • How to pack 100 documents into a GPU batch?

95

𝛼: proportion of candidates to be discarded at each stage

96 of 172

Takeaways of Multi-stage Rerankers

Advantage:

    • more tuning knobs → more flexibility in the effectiveness/efficiency tradeoff space

Disadvantage:

  • more tuning knobs → more complexity

We are only starting to explore the design space of multi-stage reranking pipelines with Transformers

96

97 of 172

Document Preprocessing Techniques

Query vs document expansion

doc2query

DeepCT

DeepImpact

97

98 of 172

Document Preprocessing Techniques

Query vs document expansion

doc2query

DeepCT

DeepImpact

98

99 of 172

A Simple Search Engine

99

This section

100 of 172

Query Reformulation (aka query expansion)

query: "Where can I buy a small car?"

100

Query Reformulator

Search Engine

Better Retrieved Docs

reformulated query: "compact car sales store"

101 of 172

Query reformulation as a translation task

Query Reformulator: translates from the Query Language into the Document Language. Hard: the input (a query) has little information.

101

Document Translator: translates from the Document Language into the Query Language. Easier: the input (a document) has a lot of information.

102 of 172

Document Preprocessing Techniques

Query vs document expansion

doc2query

DeepCT

DeepImpact

102

103 of 172

doc2query

103

A seq2seq Transformer translates a Document into a Query.

Supervised training: pairs of <query, relevant document>

Source: Vaswani et al., 2017

Nogueira, Yang, Lin, Cho. Document expansion by query prediction. 2019.

104 of 172

doc2query

104

Input: Document — "Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily..."

Output: Predicted Query — "does cinnamon lower blood sugar?"

Expanded Document: the predicted queries are concatenated with the original document, and the result is indexed.

At query time, the search engine serves the user's query (e.g., "foods and supplements to lower blood sugar") against the expanded index → better retrieved docs.

In practice: 5-40 queries are sampled with top-k or nucleus sampling
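A hedged sketch of the expansion step with a seq2seq model and top-k sampling, assuming transformers and torch are available; the checkpoint name is an assumption (the doc2query authors released T5-based models on Hugging Face, but any doc2query-style checkpoint can be substituted), and the sampling settings are illustrative.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

name = "castorini/doc2query-t5-base-msmarco"   # assumed checkpoint; substitute as needed
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

doc = ("Researchers are finding that cinnamon reduces blood sugar levels "
       "naturally when taken daily...")
inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    out = model.generate(**inputs, max_length=64, do_sample=True, top_k=10,
                         num_return_sequences=5)          # sample 5 queries

queries = [tokenizer.decode(q, skip_special_tokens=True) for q in out]
expanded_doc = doc + " " + " ".join(queries)              # concatenate before indexing
print(expanded_doc)
```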

105 of 172

Results

105

zero-shot: doc2query was trained only on MS MARCO

                        MARCO Passage   TREC-DL 19   TREC-COVID   Robust04
                        (MRR@10)        (nDCG@10)    (nDCG@20)    (nDCG@20)
BM25                    0.184           0.506        0.659        0.428
 + doc2query            0.277           0.642        0.6375       0.446
BM25 + RM3              0.156           0.518        -            0.450
 + doc2query            0.214           0.655        -            0.466
BM25 + monoBERT/T5*     0.365           0.738        0.7785*      -
 + doc2query            0.379           0.754        0.7939       -

106 of 172

What is more important: copied or new words?

106

Predicted queries are "better" than documents

107 of 172

Examples

107

Excluding stop-words: 69% of the predicted-query words are copied from the document, 31% are new.

108 of 172

Document Preprocessing Techniques

Query vs document expansion

doc2query

DeepCT

DeepImpact

108

109 of 172

DeepCT

109

Text d: "The Geocentric Theory was proposed by the greeks under the guidance..."

Relevant query q: "who proposed the geocentric theory"

Target Scores (one per term of d): 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

DeepCT (BERT) predicts one score per term through a D × 1 regression head:

Predicted Scores: 0.2 0.5 0.2 0.1 0.4 0.1 0.0 0.6 0.2 0.4 0.3

Dai, Callan. Context-aware sentence/passage term importance estimation for first stage retrieval. 2019.

110 of 172

Once DeepCT is trained...

110

Text d: "Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily..."

DeepCT (BERT) predicts one score per term (D × 1 head):

Predicted Scores: 0.1 0.0 0.2 0.1 0.9 0.8 0.9 1.0 0.4 0.2 0.0 0.1 0.4

The scores are scaled into pseudo Term Frequencies: 10 0 20 10 90 80 90 100 40 20 0 10 40

New document for the Index (each term repeated according to its pseudo frequency): "Researchers Researchers … finding finding … that that … cinnamon cinnamon ... reduces ... "
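A minimal sketch of the indexing step: predicted term-importance scores become pseudo term frequencies by repeating each term; the scores are hard-coded from the slide and the scale factor is an illustrative assumption.

```python
def deepct_pseudo_doc(terms, scores, scale=100):
    """Repeat each term according to its predicted importance (pseudo term frequency)."""
    return " ".join(" ".join([t] * round(s * scale)) for t, s in zip(terms, scores))

terms = "Researchers are finding that cinnamon reduces blood sugar".split()
scores = [0.1, 0.0, 0.2, 0.1, 0.9, 0.8, 0.9, 1.0]   # predicted scores (as on the slide)
print(deepct_pseudo_doc(terms, scores, scale=10))    # smaller scale, just for display
```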

111 of 172

Results on MS MARCO Passage Dev Set

111

Model          MRR@10   R@1000   BERT Inferences per doc
BM25           0.184    0.853    -
 + doc2query   0.229    0.907    1
 + doc2query   0.277    0.944    40
DeepCT         0.243    0.913    1

112 of 172

Document Preprocessing Techniques

Query vs document expansion

doc2query

DeepCT

DeepImpact

112

113 of 172

DeepImpact: combining doc2query with DeepCT

113

Input Document: "Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily..."

Output: Predicted Query — "does cinnamon lower blood sugar?" — concatenated with the document to form the Expanded Document.

A Term Scorer (~DeepCT) assigns an impact to each term of the expanded document, e.g., {Researchers: 31, are: 0, finding: 4, that: 1, ...}, and the impacts are stored in the Index.

At query time, the search engine serves the user's query (e.g., "foods and supplements to lower blood sugar") against this index → better retrieved docs.

Mallia, Khattab, Tonellotto, and Suel. Learning Passage Impacts for Inverted Indexes. 2021

114 of 172

Results on MS MARCO Passage Dev Set

114

Model             MRR@10   R@1000   Latency (ms/query)
BM25              0.184    0.853    13
DeepCT            0.243    0.913    11
doc2query         0.278    0.947    12
DeepImpact        0.326    0.948    58
BM25 + monoBERT   0.355    0.853    10,700 (GPU)

115 of 172

How to Expand Long Documents?

115

[Figure: the document is split into windows of N sentences; doc2query / DeepCT / DeepImpact is run on each window to produce queries or terms, which are concatenated (+) with the document.]

116 of 172

Takeaways of Document Expansion

Advantages:

    • Documents have more context than queries → an easier prediction task
    • Documents can be processed offline and in parallel
    • Runs on CPU at query time

Disadvantages:

    • Have to iterate over the entire collection
    • Not as effective as rerankers (yet)

116

117 of 172

Beyond BERT

Improving effectiveness or efficiency

117

118 of 172

Improving Effectiveness: Model Variants

BERT Variants:

    • Improve training → replaced-token prediction rather than mask prediction (ELECTRA)
    • Modify training → tune hyper-parameters, remove the NSP task (RoBERTa)
    • Modify architecture → share weights across Transformer layers (ALBERT)

118

119 of 172

Improving Effectiveness: Model Variants

119

[Figure: Original text: "The bank makes loans to clients."

BERT — random masking: "The bank makes loans [MASK] clients ." → predict the masked token.

ELECTRA — term replacement: "The store makes loans and clients ." → detect which tokens were replaced.]

120 of 172

Improving Effectiveness: BERT Variants

120

Zhang, Yates, Lin. Comparing Score Aggregation Approaches for Pretrained Neural Language Models. ECIR 2021.

121 of 172

Improving Effectiveness: T5

Sequence2Sequence Model (T5): larger, improved pre-training

  • Input: query: [q] document: [d] relevant:

121

Training: the target output is the token "true" or "false"

Inference: score = P(token="true" | q, d), read off the softmax over the first decoded token

Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li, Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR 2020.
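A hedged sketch of monoT5-style scoring, assuming transformers and torch are available; the checkpoint name is an assumption (the authors released monoT5 models on Hugging Face, but any monoT5-style model can be substituted), and the query/document strings are toy examples.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

name = "castorini/monot5-base-msmarco"   # assumed checkpoint; substitute as needed
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

query = "black bear attacks"
doc = "Black bears rarely attack humans, but food can attract them ..."
text = f"query: {query} document: {doc} relevant:"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

true_id = tokenizer.encode("true")[0]    # id of the "true" token
false_id = tokenizer.encode("false")[0]  # id of the "false" token
decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
with torch.no_grad():
    logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, 0]

# score = P(first decoded token = "true" | q, d), softmax restricted to {false, true}
score = logits[[false_id, true_id]].softmax(dim=-1)[1].item()
print(score)
```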

122 of 172

Improving Effectiveness: T5

122

Nogueira, Jiang, Pradeep, Lin. Document Ranking with a Pretrained Sequence-to-Sequence Model. Findings of EMNLP 2020.

(desc.)

123 of 172

Improving Efficiency: Distillation & Architectures

Distillation:

    • Train a smaller model (β€œstudent”) to mimic a larger model (β€œteacher”)
    • Can be performed before fine-tuning, after, or both (best)

Distillation loss: cross-entropy between the teacher's and the student's logits

123

124 of 172

Improving Efficiency: Distillation

124

Li, Yates, MacAvaney, He, Sun. PARADE: Passage Representation Aggregation for Document Reranking. arXiv 2020.

(titles)

125 of 172

Improving Efficiency: Distillation & Architectures

Distillation:

    • Train a smaller model (β€œstudent”) to mimic a larger model (β€œteacher”)
    • Can be performed before fine-tuning, after, or both (best)

Smaller architectures:

    • TK, TKL, CK β†’ skip pretraining and reduce the number of Transformer layers
    • Take GloVe embeddings as input

125

126 of 172

Improving Efficiency: TK Architectures

126

Small Transformer stack

Similarity matrix & KNRM

Hofstatter, Zlabinger, Hanbury. Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking. ECAI 2020.

127 of 172

Improving Efficiency: TK Architectures

127

Hofstatter, Zlabinger, Hanbury. Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking. ECAI 2020.

128 of 172

Domain-specific Applications

128

129 of 172

TREC-COVID (April-September 2020)

Task: Find scientific articles relevant to questions such as:

    • Are patients taking Angiotensin-converting enzyme inhibitors (ACE) at increased risk for COVID-19?

129

MacAvaney, Cohan, Goharian. SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Literature Search. EMNLP 2020.

130 of 172

TREC 2020 - Precision Medicine Track

Task: Find relevant scientific articles in PubMed for questions such as:

Is Dabrafenib effective for the melanoma disease in patients with gene BRAF (V600E) mutation?

BERT-based models work fine with small tweaks:

130

Model          nDCG@30
median         0.2857
BM25           0.3081
 + monoT5rct   0.4193
damoespb1      0.4255

Roberts, Demner-Fushman, Voorhees, Bedrick, Hersh, Overview of the TREC 2020 Precision Medicine Track. TREC 2020.

Zero-shot: fine-tuned only on MS MARCO!

131 of 172

TREC 2020 - Health Misinformation

131

Task: Find relevant documents for questions such as:

Can ibuprofen worsen COVID-19?

Metric penalizes documents that contain incorrect information

[Figure: BM25 + monoT5 + LabelT5 performs better than BM25 + monoT5 on this metric.]

132 of 172

Legal Domain

COLIEE 2020, Task 1:

Find relevant cases in a corpus that support the decision of a given case.

132

Team JNLP (Nguyen et al. 2020)

Method            F1
Pre-BERT          0.5148
BERT              0.6379
Pre-BERT + BERT   0.6682

Team TLIR (Shao et al. 2020)

Method        F1
BM25          0.5287
BM25 + BERT   0.6397

133 of 172

Takeaways

  • Zero-shot BERT-based models are starting to show better effectiveness than classical retrieval
  • Simple domain adaptation techniques work well with Transformer architectures (e.g., SLEDGE-Z)

133

134 of 172

Q&A

10 minutes

134

135 of 172

Break

Resume at 10:40 PDT

135

136 of 172

Part 3: Ranking with Dense Representations

136

137 of 172

Sparse Representations

137

Task: Estimate the relevance of text d to a query q:

q = "fix my air conditioner"

d = "... AC repair ..."

Advantages: 1) Fast to retrieve candidates from an inverted index because q is usually short. 2) Fast to compute because q ∩ d is usually small.

Disadvantage: Terms need to match exactly

138 of 172

Dense Representations

138

q = "fix my air conditioner"

d = "... AC repair ..."

0.8

-1.2

...

2.4

-0.3

Encoder

Continuous dense vectors ℝD

0.5

0.0

...

2.6

-0.7

Encoder

πœ™ is a similarity function (e.g., inner product or cosine similarity)

πœ™(πœ‚(q), πœ‚(d)) β†’ ideally measures how relevant q and d are to each other

πœ‚(q)

πœ‚(d)

139 of 172

Nearest Neighbor Search

139

140 of 172

Task: find the top k most relevant texts to a query

140

[Figure: brute-force search computes φ(η(q), η(d1)), ..., φ(η(q), η(d|C|)) for the query against every text in the collection and keeps the Top k.]

Brute-force search:

We often need to search over many (e.g., billions of) texts

    • Brute-force won't scale

141 of 172

Approximate Nearest Neighbor Search

  • Exchange accuracy for speed
  • e.g., k-means:

141

  • In practice, ANN implementations are more complicated
  • We assume a fast dense retrieval library is available (e.g., Faiss, Annoy, ScaNN); a Faiss sketch follows the figure below

[Figure: the query vector η(q) is first compared to the k-means Centroids (φ(η(q), η(c1)), φ(η(q), η(c2)), φ(η(q), η(c3))); only the documents in the closest cluster(s) are then scored (φ(η(q), η(d1)) … φ(η(q), η(dm))).]

142 of 172

Pre-BERT Dense Representations

142

143 of 172

Types of Encoders: Cross-encoder

143

144 of 172

Types of Encoders: Bi-encoder

144

πœ‚(q)

πœ‚(d)

πœ™(πœ‚(q), πœ‚(d))

s

πœ™(πœ‚(q), πœ‚(d))

πœ‚(q)

πœ‚(d)

145 of 172

Pre-BERT: Dual Embedding Space Model

[Figure: the query term embeddings are compared against the document centroid with a similarity function.]

145

Leverages term embeddings from pretrained word2vec

Mitra, Nalisnick, Craswell, Caurana. A Dual Embedding Space Model for Document Ranking. arXiv & WWW 2016.

146 of 172

Similarity Functions: Distance vs. Comparison

Comparison

Distance

146

147 of 172

Pre-BERT: Deep Structured Semantic Model

[Figure: term hashing → feedforward network → representation; the query and document representations are compared with cosine similarity.]

147

Huang, He, Gao, Deng, Acero, Heck. Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. CIKM 2013.

148 of 172

Distance-based Transformer Representations

148

149 of 172

Distance-based Representations

Key characteristic

Simple similarity function → inner (dot) product, cosine similarity, …

Compatible with ANN search

Johnson, Douze, JΓ©gou. Billion-scale similarity search with GPUs. arXiv 2017.

149

150 of 172

Distance-based: SentenceBERT

150

Text representation:

  • CLS

Text representation:

  • CLS
  • Mean
  • Max

Reimers, Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

151 of 172

Distance-based: SentenceBERT

151

Classification

Regression

Reimers, Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

152 of 172

SentenceBERT Result Highlights

Zero-shot:

  • Mean: GloVe > BERT
  • BERT Mean > CLS

CLS is very poor

Fine-tuned:

  • Mean > CLS and Max
  • Close to cross-encoder

CLS is slightly worse

The classification training objective is essential

152

Reimers, Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

153 of 172

Dense Passage Retrieval (DPR)

Highlights:

  • QA reader & retriever
  • Selecting negative examples: random, BM25, or in-batch

153

               Similarity Function (φ)   Representation
SentenceBERT   Cosine similarity         CLS, Mean, or Max
DPR            Inner product             CLS

[Figure: in-batch negatives: within a batch, Query-1's positive passage serves as a negative for Query-2 and vice versa.]

Karpukhin, Oğuz, Min, Lewis, Wu, Edunov, Chen, Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.

154 of 172

ANCE

Approximate Nearest Neighbor Negative Contrastive Estimation:

  • Choose negatives that are hard for the current model

  • Refresh them regularly from model checkpoints

154

               Similarity Function (φ)   Representation
SentenceBERT   Cosine similarity         CLS, Mean, or Max
DPR            Inner product             CLS
ANCE           Inner product             CLS

L. Xiong, C. Xiong, Li, Tang, Liu, Bennett, Ahmed, Overwikj. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. ICLR 2021.

155 of 172

Selecting Negative Examples: Results

156 of 172

CLEAR

Highlights:

  • Prepend QRY or DOC
  • Exact matching component
  • Negatives: matching errors
  • Dynamic hinge loss margin

156

               Similarity Function (φ)   Representation
SentenceBERT   Cosine similarity         CLS, Mean, or Max
DPR            Inner product             CLS
ANCE           Inner product             CLS
CLEAR          Inner product + BM25      Mean

Gao, Dai, Chen, Fan, van Durme, Callan. Complementing Lexical Retrieval with Semantic Residual Embedding. arXiv 2020; ECIR 2021.

157 of 172

RepBERT

Exact match signal:

  • Reduces MRR (-2%)
  • Improves recall (+1%)

157

               Similarity Function (φ)   Representation
SentenceBERT   Cosine similarity         CLS, Mean, or Max
DPR            Inner product             CLS
ANCE           Inner product             CLS
CLEAR          Inner product + BM25      Mean
RepBERT        Inner product             Mean

Zhan, Mao, Liu, Zhang, Ma. RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. arXiv 2020.

158 of 172

EPIC

Highlights:

  • Vocabulary vector
  • Term importance weights

158

               Similarity Function (φ)   Representation
SentenceBERT   Cosine similarity         CLS, Mean, or Max
DPR            Inner product             CLS
ANCE           Inner product             CLS
CLEAR          Inner product + BM25      Mean
RepBERT        Inner product             Mean
EPIC           Inner product             |V|-dimension vector

MacAvaney, Nardini, Perego, Tonellotto, Goharian, Frieder. Expansion via Prediction of Importance with Contextualization. SIGIR 2020.

159 of 172

Distance-based: Results

159

160 of 172

Comparison-based Transformer Representations

160

161 of 172

Comparison-based: Poly-encoders

161

[Figure: a spectrum over the number of interacting representations, from two (Query, Document) up to N = |Query| + |Document|, with computational cost growing accordingly.]

Humeau, Shuster, Lachaux, Weston. Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. ICLR 2020.

162 of 172

Comparison-based: Poly-encoders

M context codes

162

Compatible with ANN?

  • Unclear
  • Data-dependent

Humeau, Shuster, Lachaux, Weston. Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. ICLR 2020.

163 of 172

Comparison-based: ColBERT

MaxSim operator

163

Khattab, Zaharia. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.

164 of 172

Comparison-based: ColBERT

MaxSim: max pooling over the similarity matrix (one maximum per query term), followed by a sum over query terms

164

Khattab, Zaharia. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.
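A minimal numpy sketch of the MaxSim late-interaction score; the random token embeddings and their dimensions are illustrative placeholders for ColBERT's learned representations.

```python
import numpy as np

def maxsim_score(query_embs, doc_embs):
    """For each query term embedding, take the max similarity over document term
    embeddings, then sum over query terms."""
    sim = query_embs @ doc_embs.T              # similarity matrix [|q| x |d|]
    return float(sim.max(axis=1).sum())        # max-pool per query term, then sum

q = np.random.randn(4, 128)                    # 4 query token embeddings
d = np.random.randn(60, 128)                   # 60 document token embeddings
print(maxsim_score(q, d))
```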

165 of 172

Comparison-based: ColBERT

165

Compatible with ANN?

  • Unclear
  • Data-dependent
  • 70x faster than BERT-large

Khattab, Zaharia. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.

166 of 172

Comparison-based: ColBERT

166

167 of 172

Q&A

5 minutes

167

168 of 172

Break

Resume at 11:35 PDT

168

169 of 172

Conclusions and Future Directions

169

170 of 172

Conclusions

  • Pretrained Transformers have shown significant improvements on various IR benchmarks
  • Reproduced and adopted by many in academia and industry
  • No doubt we are in the age of BERT and Transformers

170

171 of 172

Open Questions

Transformers for ranking: apply (T5), adapt (Parade), or redesign (TK/CK)?

What is the future of:

  • zero-shot learning, transfer learning, domain adaptation, and task-specific fine-tuning?
  • dense vs. sparse retrieval?
  • multi-stage ranking architectures?

What can IR bring to transformers?

e.g., REALM (Guu et al., 2020) → text retrieval incorporated into pretraining

171

172 of 172

Pretrained Transformers for Text Ranking: BERT and Beyond

Thanks!

by Jimmy Lin, Rodrigo Nogueira, and Andrew Yates https://arxiv.org/abs/2010.06467

172

Learn more in the survey (& upcoming book):