1
Pretrained Transformers for Text Ranking: BERT and Beyond
Andrew Yates, Rodrigo Nogueira, and Jimmy Lin
2
Pretrained Transformers for Text Ranking: BERT and Beyond
by Jimmy Lin, Rodrigo Nogueira, and Andrew Yates https://arxiv.org/abs/2010.06467
3
Based on the survey:
Tutorial organization:
Outline
4
Text Ranking
Text ranking problems
Transformers
5
Definition
Given: a piece of text (keywords, question, news article, ...)
Rank: other pieces of text
(passages, documents, queries, ...)
Ordered by: their similarity
6
Focus: Ad hoc Retrieval
e.g., Web search
Given: query q, collection of texts
Return: a ranked list of k texts d1 ... dk
Maximizing: a metric of interest
7
[Figure: the query "black bear attacks" is issued against a collection; the system returns a ranked list (1., 2., 3., ...) that is evaluated with a metric, e.g., 0.66]
Other Problems: Question Answering
Approach:
8
Source: SQuAD
Other Problems: Community Question Answering
New question: What is the longest airline flight?
9
Source: Quora
Other Problems: Text Recommendation
10
Source: Science News
Focus: Content-based Similarity
Agreement between query and a piece of text
11
Transformers
12
Pretrained Transformers
13
Initialized via pretraining
IR Background
Unsupervised ranking methods
Metrics
Test collections
14
Unsupervised Ranking Methods
15
term score
input
Unsupervised Ranking Methods
Inverse Document Frequency
(more discriminative)
Term Frequency
16
Low IDF (common terms): the, and, but, ...
High IDF (more discriminative terms): data, ranking, SIGIR, ...
Unsupervised Ranking Methods
17
Sparse representations
Unsupervised Ranking Methods
18
BM25(q, d) = sum over t in q of IDF(t) * tf(t, d) * (k1 + 1) / ( tf(t, d) + k1 * (1 - b + b * |d| / avgdl) )
IDF component x TF component, with term saturation (k1) and length normalization (b)
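To make these components concrete, here is a minimal BM25 scoring sketch in Python; the parameter defaults (k1, b) and the exact IDF variant follow one common formulation, and real implementations differ in details:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avgdl, k1=0.9, b=0.4):
    """Score one document for a query.
    doc_freqs[t] = number of documents in the collection containing term t (IDF component)."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue  # exact-match model: unmatched query terms contribute nothing
        idf = math.log(1 + (num_docs - doc_freqs[t] + 0.5) / (doc_freqs[t] + 0.5))
        norm = 1 - b + b * len(doc_terms) / avgdl               # length normalization
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * norm)   # TF with term saturation
    return score

print(bm25_score(["black", "bear", "attacks"],
                 "a black bear attacked a hiker".split(),
                 doc_freqs={"black": 100, "bear": 40, "attacks": 60},
                 num_docs=10_000, avgdl=300))
```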
Vocabulary Mismatch
e.g., the same topic described as document retrieval, text ranking, document ranking, Web search, ... using terms like query, search, ranking, retrieval, ...
19
Enrich query or document representations → move beyond exact matching
Expand
Benchmarking: Relevance Judgments
Is this document relevant to the query?
20
(Subjective)
Evaluation Metrics: Precision & Recall
21
Precision = (# relevant docs returned) / (# results returned)
Recall = (# relevant docs returned) / (# relevant docs in collection)
Evaluation Metrics: Average Precision
22
AP = (1 / # relevant docs in collection) * sum of (precision at the rank of each relevant doc returned)
Evaluation Metrics: Reciprocal Rank
23
RR = 1 / (rank of the first relevant doc)
Graded Relevance Judgments
How relevant is this document to the query?
24
Graded labels, e.g., from "related" to "highly relevant" (more relevant)
Evaluation Metrics: nDCG
25
nDCG@k = DCG@k / ideal DCG@k (ideal = ranking sorted by relevance)
DCG@k = sum over i = 1..k of gain(rel_i) / log2(i + 1), combining "how relevant?" (gain) with "how early?" (discount decreasing with rank i = 1, 2, 3, ...)
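A minimal Python sketch of the metrics above (AP, RR, nDCG); the nDCG version uses one common formulation (linear gain, log2 discount) and computes the ideal DCG from the returned gains only, so it is illustrative rather than definitive:

```python
import math

def average_precision(ranked_rel, num_relevant):
    """ranked_rel: binary relevance of the returned docs, in rank order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / rank        # precision at each rank where a relevant doc appears
    return total / num_relevant

def reciprocal_rank(ranked_rel):
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            return 1.0 / rank           # rank of the first relevant doc
    return 0.0

def ndcg(ranked_gains, k=10):
    """ranked_gains: graded relevance of the returned docs, in rank order."""
    dcg = sum(g / math.log2(i + 1) for i, g in enumerate(ranked_gains[:k], start=1))
    ideal = sorted(ranked_gains, reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

print(average_precision([1, 0, 1], num_relevant=3))   # (1/1 + 2/3) / 3
print(reciprocal_rank([0, 1, 0]))                     # 0.5
print(ndcg([2, 0, 1]))
```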
Test Collections
26
queries, documents, and relevance judgments
Test Collections
TREC Robust04
27
ID: 336
Title: black bear attacks
Description: A relevant document would discuss the frequency of vicious black bear attacks worldwide and the possible causes for this savage behavior.
Narrative: It has been reported that food or cosmetics sometimes attract hungry black bears, causing them to viciously attack humans. Relevant documents would include the aforementioned causes as well as speculation preferably from the scientific community as to other possible causes of vicious attacks by black bears. A relevant document would also detail steps taken or new methods devised by wildlife officials to control and/or modify the savageness of the black bear.
Test Collections
MS MARCO
28
TREC Deep Learning
ID: 130510
Text: definition declaratory judgment
ID: 1131069
Text: how many sons robert kraft has
ID: 1131069
Text: when did rock n roll begin?
ID: 1103153
Text: who is thomas m cooley
Documents and Mean / Median Lengths
29
Queries, Query Lengths, and Judgments
30
A Simple Search Engine
31
e.g., BM25
Machine Learning Background
Learning to rank
Deep learning for ranking
BERT
32
Machine Learning Background
Learning to rank
Deep learning for ranking
BERT
33
A Simple Search Engine
34
This section
Key | Value |
"chair" | [text #83, text #743, ...] |
"store" | [text #1003, text #50, ...] |
... | ... |
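A toy sketch of the inverted index behind such a simple search engine; the text ids and contents are made up for illustration:

```python
from collections import defaultdict

texts = {83: "wooden chair for the office",
         743: "chair and table set",
         1003: "grocery store near me",
         50: "store hours and locations"}

index = defaultdict(list)                 # term -> postings list of text ids
for text_id, text in texts.items():
    for term in set(text.lower().split()):
        index[term].append(text_id)

def candidates(query):
    """Return ids of texts containing at least one query term."""
    ids = set()
    for term in query.lower().split():
        ids.update(index.get(term, []))
    return ids

print(candidates("chair store"))          # {83, 743, 1003, 50}
```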
Learning to Rank (> 1990)
35
Learning to Rank - Types of Losses
Given a query q, texts d1, d2, d3, a ranker fθ, and a loss L:
36
Pointwise:
L(fθ, q, d1, d2, d3) = L(fθ, q, d1) + L(fθ, q, d2) + L(fθ, q, d3)
Pairwise:
L(fθ, q, d1, d2, d3) = L(fθ, q, d1, d2) + L(fθ, q, d1, d3) + L(fθ, q, d2, d3)
Listwise:
L(fθ, q, d1, d2, d3) = L(fθ, q, d1, d2, d3)
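A small PyTorch sketch of the three loss families for one query with three candidate texts; the scores and labels are illustrative, and the pairwise and listwise forms shown are RankNet- and ListNet-style choices, not the only options:

```python
import torch
import torch.nn.functional as F

# Scores f_theta(q, d) for d1, d2, d3; here d1 is the only relevant document.
scores = torch.tensor([2.1, 0.3, -1.0])
labels = torch.tensor([1.0, 0.0, 0.0])

# Pointwise: one term per (q, di), e.g., binary cross-entropy against the label.
pointwise = F.binary_cross_entropy_with_logits(scores, labels)

# Pairwise (RankNet-style): one term per pair, pushing the more relevant doc's score higher.
pairs = [(0, 1), (0, 2), (1, 2)]          # (more relevant index, less relevant index)
pairwise = torch.stack([-F.logsigmoid(scores[i] - scores[j]) for i, j in pairs]).mean()

# Listwise (ListNet-style): cross-entropy between softmax over all scores and the target list.
target = torch.tensor([1.0, 0.0, 0.0])
listwise = -(target * F.log_softmax(scores, dim=0)).sum()

print(pointwise.item(), pairwise.item(), listwise.item())
```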
Learning to Rank - Types of Losses
An example:
Rel(d1) > Rel(d2) > Rel(d3), fθ = a neural network that outputs a probability, L = cross-entropy
37
[Figure: pointwise - a neural net scores each (q, di) independently (p1, p2); pairwise - it scores ordered pairs and training enforces p1,2 > p2,1; listwise - it scores whole permutations, enforcing p1,2,3 > p3,2,1]
Machine Learning Background
Learning to rank
Deep learning for ranking
BERT
38
Neural Ranking Models (> 2016)
39
We will revisit these architectures in the Dense Retrieval section
Representation-based
Interaction-based
Popular Neural Ranking Models
40
Machine Learning Background
Learning to rank
Deep learning for ranking
BERT
41
Progress in Information Retrieval - Robust04
42
Source: Yang et al., (2019)
Yilmaz et al., 2019; MacAvaney et al. 2019
Li et al., 2020;
Nogueira et al., 2020
Some of them are
zero-shot!
MS MARCO Passage Ranking Leaderboard in December 2018
43
MS MARCO Passage Ranking Leaderboard in January 2019
44
~8 points!
MS MARCO Passage Ranking Leaderboard in June 2021
45
Adoption by Commercial Search Engines
46
Weβre making a significant improvement to how we understand queries, representing the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search.
Starting from April of this year (2019), we used large transformer models to deliver the largest quality improvements to our Bing customers in the past year.
MS Bing
Google Search
What is BERT?
47
Self-supervised: effectively unlimited (unlabeled) training data
Supervised: (few) labeled examples
Devlin, Chang, Lee, Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
BERT's Pretraining Ingredients
48
Transformer (encoder-only)
with lots of parameters
+
Lots of texts
+
Lots of Compute
49
[Figure: BERT input representation - the string "The bank makes loans to clients." is tokenized into [CLS] The bank makes loans to clients . [SEP]; each token's input embedding is the sum of its token embedding, segment embedding (EA/EB), and position embedding (P0, P1, ...), and BERT maps the input embeddings E[CLS], E1, ..., E[SEP] to contextualized output vectors T[CLS], T1, ..., T[SEP]]
string → sequence of vectors
50
Pretraining - Masked Language Modeling
Random masking: The bank makes loans [MASK] clients .
[Figure: BERT consumes the masked sequence (token + segment + position embeddings); a D × |V| projection and softmax over the output vector at the masked position predicts the missing token]
Loss = -log (P("to" | masked input))
Pretraining
51
Token Probability
Input: a document
Original: The Mongol invasion of Europe in the 13th century was the conquest of much of Europe by the Mongol Empire.
Random masking: The Mongol invasion of Europe in the 13th [MASK] was the conquest of much of [MASK] by the Mongol Empire.
BERT produces a contextualized vector (vec i, vec j, ...) at each masked position; a linear layer + softmax turns each into token probabilities, e.g.:
[MASK] #1: century 0.94, year 0.02, car 0.00, ...
[MASK] #2: Europe 0.97, land 0.01, is 0.00, ...
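A minimal sketch of masked language modeling in action, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint:

```python
from transformers import pipeline

# Fill-mask pipeline: predict the token hidden behind [MASK].
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The bank makes loans [MASK] clients.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))   # "to" should rank highly
```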
52
Finetuning BERT for Relevance Classification (aka monoBERT)
53
monoBERT:
BERT reranker
54
We want: si = P(Relevant = 1 | q, di)
Input: [CLS] query q [SEP] text di [SEP]
A D × 2 linear layer + softmax over the [CLS] output vector gives P(Non-relevant) and P(Relevant):
si = softmax(T[CLS]W + b)1
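A minimal monoBERT-style scoring sketch with Hugging Face transformers; bert-base-uncased is only a stand-in here and would still need fine-tuning on relevance data (e.g., MS MARCO) before its scores are meaningful:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-uncased"   # stand-in; a real monoBERT is fine-tuned so label 1 = "Relevant"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2).eval()

def monobert_score(query, passage):
    # Builds the [CLS] query [SEP] passage [SEP] input and returns P(Relevant | q, d).
    inputs = tok(query, passage, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits                  # shape (1, 2): D x 2 head over [CLS]
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(monobert_score("black bear attacks", "Black bears rarely attack humans, but ..."))
```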
Training monoBERT
Loss: cross-entropy
55
Candidate texts: retrieved with BM25
Relevance labels: from humans
Once monoBERT is trained...
56
[Figure: retrieve-then-rerank pipeline - for query q, BM25 over an inverted index of d1 ... d5 produces the candidate list R0; monoBERT scores each (q, di) to get si and outputs the reranked list R1 (e.g., d3, d2, d5, d1, d4)]
TREC 2019 - Deep Learning Track - Passage
57
Model | nDCG@10 | MAP | Recall@1k |
BM25 | 0.506 | 0.377 | 0.739 |
+ monoBERT | 0.738 | 0.506 | 0.739 |
BM25 + RM3 | 0.518 | 0.427 | 0.788 |
+ monoBERT | 0.742 | 0.529 | 0.788 |
TREC 2019 - Deep Learning Track - Passage
58
5 points in Recall@1k → ~2 points in MAP
Why?
Hypothesis: Mismatch between training and inference lists
Model | nDCG@10 | MAP | Recall@1k |
BM25 | 0.506 | 0.377 | 0.739 |
+ monoBERT | 0.738 | 0.506 | 0.739 |
BM25 + RM3 | 0.518 | 0.427 | 0.788 |
+ monoBERT | 0.742 | 0.529 | 0.788 |
How useful is the BM25 signal?
59
Recap: Pre-BERT vs. monoBERT
60
Q&A
5 minutes
61
Break
Resume at 9:00 PDT
62
Part 2: Ranking with Relevance Classification
63
BERT's Limitations
64
Cannot input entire documents
need separate embedding for every possible position
[Figure: BERT input diagram - a separate learned position embedding (P0, P1, ...) is needed for every input position, so inputs longer than the pretrained maximum length cannot be encoded directly]
BERT's Limitations
65
computationally expensive layers
[Figure: BERT encoder diagram - the stack of self-attention layers over all input tokens is computationally expensive]
(later: Beyond BERT & Dense Representations)
Multi-stage ranking pipeline
From Passages to Documents
66
Handling Length Limitation: Training
Chunk documents
Transfer labels (approximation)
67
Handling Length Limitation: Inference
Aggregate Evidence
68
Approach #1: Score Aggregation
69
[Figure: each passage of a document is scored independently (s1, s2, s3, ...); the passage scores are aggregated into a document score]
Over Passage Scores: BERT-MaxP, FirstP, SumP
70
[Figure: passage scores s1, s2, s3, ... are aggregated into a document score]
Dai, Callan. Deeper Text Understanding for IR with Contextual Neural Language Modeling. SIGIR 2019.
Take max, first, or sum
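A minimal sketch of the three aggregation strategies over per-passage scores; the scores themselves would come from a monoBERT-style model:

```python
def doc_score_from_passages(passage_scores, strategy="max"):
    """passage_scores: monoBERT-style scores for the document's passages, in order."""
    if strategy == "max":       # BERT-MaxP
        return max(passage_scores)
    if strategy == "first":     # BERT-FirstP
        return passage_scores[0]
    if strategy == "sum":       # BERT-SumP
        return sum(passage_scores)
    raise ValueError(strategy)

print(doc_score_from_passages([0.2, 0.9, 0.4], "max"))    # 0.9
```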
Over Passage Scores: Results
71
Dai, Callan. Deeper Text Understanding for IR with Contextual Neural Language Modeling. SIGIR 2019.
Over Sentence Scores: Birch
Document score = interpolation of the first-stage retrieval score with the top BERT sentence scores
72
Yilmaz, Yang, Zhang, Lin. Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval. EMNLP '19.
Over Sentence Scores: Results
73
Yilmaz, Yang, Zhang, Lin. Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval. EMNLP '19.
Approach #2: Representation Aggregation
74
passage representations
Over Term Embeddings: CEDR
75
similarity matrix using contextualized embeddings (concatenate passages)
interaction-based pre-BERT model (PACRR, KNRM)
MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019.
Over Term Embeddings: CEDR
76
Relevant Document
Nonrelevant
MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019.
Over Term Embeddings: Results
77
MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019.
Over Passage Representations: PARADE
78
Li, Yates, MacAvaney, He, Sun. PARADE: Passage Representation Aggregation for Document Reranking. arXiv 2020.
Aggregation of passages' CLS embeddings
Aggregation approaches (increasing complexity):
Over Passage Representations: Results
79
Li, Yates, MacAvaney, He, Sun. PARADE: Passage Representation Aggregation for Document Reranking. arXiv 2020.
Enlarge Passage Representations: Longformer, QDS
80
Jiang, Xiong, Lee, Wang. Long Document Ranking with Query-Directed Sparse Transformer. Findings of EMNLP 2020.
Longformer: sparse attention
QDS-Transformer: specialize to IR
Beltagy, Peters, Cohan. Longformer: The Long-Document Transformer. arXiv 2020.
Multi-stage rerankers
why multi-stage?
duoBERT
Cascade Transformers
81
Multi-stage rerankers
why multi-stage?
duoBERT
Cascade Transformers
82
From Single to Multiple Rerankers
83
Why Multi-stage?
84
efficiency
effectiveness
Fewer Candidates
Multi-stage Rerankers
why multi-stage?
duoBERT
Cascade Transformers
85
Multi-stage with duoBERT
86
[Figure: three-stage pipeline - BM25 over an inverted index of d1 ... d5 produces R0 for query q; monoBERT scores each (q, di) to get si and produces R1 (e.g., d3, d2, d5); duoBERT scores document pairs (q, di, dj) to get pi,j and produces the final list R2 (e.g., d2, d3, d5)]
Nogueira, Yang, Cho, Lin. Multi-Stage Document Ranking with BERT. 2019.
87
duoBERT
[Figure: duoBERT input - [CLS] query q1 ... qm [SEP] text di [SEP] text dj [SEP]; token, segment (EA/EB/EC), and position embeddings are summed, and a D × 2 head over T[CLS] produces pi,j]
duoBERT's Input Format
Training duoBERT
88
Input: [CLS] Query q [SEP] text di [SEP] text dj [SEP]
duoBERT: Is doc di more relevant than doc dj to the query q?
pi,j = p(di > dj | q)
Loss:
Inference with duoBERT
89
[Figure: monoBERT's list R1 (d1, d2, d3) is expanded into all ordered text pairs; duoBERT scores each pair, e.g., p1,2 = p(d1 > d2 | q), and the aggregated scores produce the final list R2]
Pairwise aggregation:
s1 = p1,2 + p1,3
s2 = p2,1 + p2,3
s3 = p3,1 + p3,2
Text pairs: (d1, d2), (d1, d3), (d2, d1), (d2, d3), (d3, d1), (d3, d2)
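A small sketch of the SUM pairwise aggregation above (the pi,j values are made up for illustration):

```python
def aggregate_pairwise(p):
    """p[i][j] = duoBERT's estimate that d_i is more relevant than d_j (i != j).
    Returns s_i = sum over j != i of p[i][j], matching s1 = p1,2 + p1,3 above."""
    n = len(p)
    return [sum(p[i][j] for j in range(n) if j != i) for i in range(n)]

p = [[None, 0.9, 0.8],
     [0.1, None, 0.6],
     [0.2, 0.4, None]]
print(aggregate_pairwise(p))   # [1.7, 0.7, 0.6] -> final order d1, d2, d3
```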
monoBERT vs. duoBERT
90
Multi-stage Rerankers
why multi-stage?
duoBERT
Cascade Transformers
91
Cascade Transformers (Soldaini and Moschitti, 2020)
92
Key Observation: Subsets of the Transformer layers can be seen as a reranker
[Figure: monoBERT layers 1-4 score all candidates ([CLS] q [SEP] di [SEP]) with a |D| × 2 head (s1, s2, s3, s4); survivors pass to layers 5-8 (s1, s2, s4), then layers 9-12 (s1, s4), with low-scoring candidates discarded at each stage]
Cascade Transformers
93
[Figure: cascade pipeline - first-stage retrieval gives R0 (d1 ... d5); monoBERT layers 1-4 produce R1 (d3, d2, d5, d4); layers 5-8 produce R2 (d3, d2, d4); layers 9-12 produce R3 (d2, d4)]
Key Observation: Subsets of monoBERT layers can be used as rerankers
For efficiency, activations from the previous reranker are passed to the next one
Soldaini, Moschitti. The Cascade Transformer: an Application for Efficient Answer Sentence Selection. ACL 2020.
Cascade Transformers
94
For efficiency, all candidate texts are packed in the same GPU batch
Results
95
α: proportion of candidates to be discarded at each stage
Takeaways of Multi-stage Rerankers
Advantage:
Disadvantage:
We are only beginning to explore the design space of multi-stage reranking pipelines with Transformers
96
Document Preprocessing Techniques
Query vs document expansion
doc2query
DeepCT
DeepImpact
97
Document Preprocessing Techniques
Query vs document expansion
doc2query
DeepCT
DeepImpact
98
A Simple Search Engine
99
This section
Query Reformulation (aka query expansion)
query: "Where can I buy a small car?"
100
Query Reformulator
Search Engine
Better Retrieved Docs
reformulated query: "compact car sales store"
Query reformulation as a translation task
Hard: Input has little information
101
Query Reformulator: Query Language → Document Language
Easier: Input has a lot of information
Document Translator: Document Language → Query Language
Document Preprocessing Techniques
Query vs document expansion
doc2query
DeepCT
DeepImpact
102
doc2query
103
seq2seq Transformer
Document
Query
Supervised training:
pairs of <query, relevant document>
Source: Vaswani et al., 2017
Nogueira, Yang, Lin, Cho. Document expansion by query prediction. 2019.
doc2query
104
Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily...
does cinnamon lower blood sugar?
does cinnamon lower blood sugar?
Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily...
Input: Document
Output: Predicted Query
Expanded Document:
Index
doc2query
Search Engine
+
foods and supplements to lower blood sugar
User's Query
Better Retrieved Docs
Concatenate
In practice: 5-40 queries are sampled with top-k or nucleus sampling
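A sketch of the expansion step, assuming the Hugging Face transformers library; the castorini/doc2query-t5-base-msmarco checkpoint name is an assumption, and any seq2seq model trained on <query, relevant document> pairs would play the same role:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "castorini/doc2query-t5-base-msmarco"   # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name).eval()

doc = ("Researchers are finding that cinnamon reduces blood sugar levels "
       "naturally when taken daily.")
inputs = tok(doc, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, max_length=64, do_sample=True, top_k=10,
                         num_return_sequences=5)                 # sample several queries
queries = [tok.decode(o, skip_special_tokens=True) for o in outputs]

expanded_doc = doc + " " + " ".join(queries)   # concatenate, then index with BM25 as usual
print(expanded_doc)
```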
Results
105
zero-shot: doc2query was trained only on MS MARCO
Model | MARCO Passage (MRR@10) | TREC-DL 19 (nDCG@10) | TREC-COVID (nDCG@20) | Robust04 (nDCG@20) |
BM25 | 0.184 | 0.506 | 0.659 | 0.428 |
+ doc2query | 0.277 | 0.642 | 0.6375 | 0.446 |
BM25 + RM3 | 0.156 | 0.518 | - | 0.450 |
+ doc2query | 0.214 | 0.655 | - | 0.466 |
BM25 + monoBERT/T5* | 0.365 | 0.738 | 0.7785* | - |
+ doc2query | 0.379 | 0.754 | 0.7939 | - |
What is more important? copied or new words?
106
Predicted queries are "better" than documents
Examples
107
Excluding stop-words: 69% copied, 31% new
Document Preprocessing Techniques
Query vs document expansion
doc2query
DeepCT
DeepImpact
108
DeepCT
109
DeepCT (BERT)
The Geocentric Theory was proposed by the greeks under the guidance...
Text d:
Relevant query q: "who proposed the geocentric theory"
0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Target Scores
D × 1 regression head (one predicted score per term)
0.2 0.5 0.2 0.1 0.4 0.1 0.0 0.6 0.2 0.4 0.3
Predicted Scores
Dai, Callan. Context-aware sentence/passage term importance estimation for first stage retrieval. 2019.
Once DeepCT is trained...
110
Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily...
DeepCT (BERT)
Text d:
10 0 20 10 90 80 90 100 40 20 0 10 40
Term Frequencies:
Index
New document: "Researchers Researchers ... finding finding ... that that ... cinnamon cinnamon ... reduces ... "
D × 1 regression head (one predicted score per term)
0.1 0.0 0.2 0.1 0.9 0.8 0.9 1.0 0.4 0.2 0.0 0.1 0.4
Predicted Scores
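A sketch of the indexing trick: DeepCT-style importance scores are scaled into pseudo term frequencies by repeating terms, so an ordinary inverted index and BM25 pick up the learned weights (the scale factor of 100 is an illustrative choice):

```python
def expand_with_term_weights(terms, scores, scale=100):
    """terms: the document's terms; scores: DeepCT-style importance scores in [0, 1].
    Each term is repeated in proportion to its predicted importance, so a plain
    inverted index / BM25 sees the learned weights as term frequencies."""
    expanded = []
    for term, score in zip(terms, scores):
        count = round(score * scale)
        if count > 0:
            expanded.extend([term] * count)
    return " ".join(expanded)

terms = "Researchers are finding that cinnamon reduces blood sugar".split()
scores = [0.1, 0.0, 0.2, 0.1, 0.9, 0.8, 0.9, 1.0]
print(expand_with_term_weights(terms, scores)[:80])
```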
Results on MS MARCO Passage Dev Set
111
Model | MRR@10 | R@1000 | BERT Inferences per doc |
BM25 | 0.184 | 0.853 | - |
+ doc2query | 0.229 | 0.907 | 1 |
+ doc2query | 0.277 | 0.944 | 40 |
DeepCT | 0.243 | 0.913 | 1 |
Document Preprocessing Techniques
Query vs document expansion
doc2query
DeepCT
DeepImpact
112
DeepImpact: combining doc2query with DeepCT
113
Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily...
does cinnamon lower blood sugar?
does cinnamon lower blood sugar?
Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily...
Input Document
Output: Predicted Query
Expanded Document:
Index
doc2query
Search Engine
+
foods and supplements to lower blood sugar
User's Query
Better Retrieved Docs
Concatenate
Term Scorer (~DeepCT)
{Researchers: 31, are: 0, finding: 4, that: 1, ...}
Mallia, Khattab, Tonellotto, and Suel. Learning Passage Impacts for Inverted Indexes. 2021
Results on MS MARCO Passage Dev Set
114
Model | MRR@10 | R@1000 | Latency (ms/query) |
BM25 | 0.184 | 0.853 | 13 |
DeepCT | 0.243 | 0.913 | 11 |
doc2query | 0.278 | 0.947 | 12 |
DeepImpact | 0.326 | 0.948 | 58 |
BM25 + monoBERT | 0.355 | 0.853 | (GPU) 10,700 |
How to Expand Long Documents?
115
[Figure: split the document into windows of N sentences; run doc2query / DeepCT / DeepImpact on each window and append the predicted queries or terms to the document]
Takeaways of Document Expansion
Advantages:
Disadvantages:
116
Beyond BERT
Improving effectiveness or efficiency
117
Improving Effectiveness: Model Variants
BERT Variants:
118
Improving Effectiveness: Model Variants
119
[Figure: BERT vs. ELECTRA pretraining - BERT randomly masks the original text "The bank makes loans to clients ." into "The bank makes loans [MASK] clients ." and predicts the masked tokens; ELECTRA instead replaces terms ("The store makes loans and clients .") and learns to detect which tokens were replaced]
Improving Effectiveness: BERT Variants
120
Zhang, Yates, Lin. Comparing Score Aggregation Approaches for Pretrained Neural Language Models. ECIR 2021.
Improving Effectiveness: T5
Sequence2Sequence Model (T5): larger, improved pre-training
121
Training: the target output token is "true" or "false"
Inference: score = P(token="true" | q, d)
Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li, Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR 2020.
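A sketch of monoT5-style scoring with Hugging Face transformers; the castorini/monot5-base-msmarco checkpoint name and the exact "Query: ... Document: ... Relevant:" prompt are assumptions based on the paper's setup:

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

name = "castorini/monot5-base-msmarco"   # assumed checkpoint name
tok = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name).eval()

TRUE_ID = tok("true", add_special_tokens=False).input_ids[0]
FALSE_ID = tok("false", add_special_tokens=False).input_ids[0]

def monot5_score(query, doc):
    """Return P(token = "true" | q, d) from the first decoded position."""
    enc = tok(f"Query: {query} Document: {doc} Relevant:",
              return_tensors="pt", truncation=True, max_length=512)
    start = torch.full((1, 1), model.config.decoder_start_token_id)
    with torch.no_grad():
        logits = model(**enc, decoder_input_ids=start).logits[0, 0]
    probs = torch.softmax(logits[[FALSE_ID, TRUE_ID]], dim=0)
    return probs[1].item()

print(monot5_score("does cinnamon lower blood sugar?",
                   "Researchers are finding that cinnamon reduces blood sugar levels."))
```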
Improving Effectiveness: T5
122
Nogueira, Jiang, Pradeep, Lin. Document Ranking with a Pretrained Sequence-to-Sequence Model. Findings of EMNLP 2020.
(desc.)
Improving Efficiency: Distillation & Architectures
Distillation: train a smaller student model to mimic a larger teacher
Cross-entropy loss between teacher & student logits
123
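A minimal sketch of that loss: cross-entropy between the teacher's temperature-softened output distribution and the student's (the temperature and scaling are common choices, not the only ones):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy between the teacher's softened distribution and the student's;
    # the T*T factor keeps gradient magnitudes comparable across temperatures.
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    student_logp = F.log_softmax(student_logits / T, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean() * (T * T)

student = torch.randn(8, 2)   # relevant / non-relevant logits from the small model
teacher = torch.randn(8, 2)   # logits from the large fine-tuned reranker
print(distillation_loss(student, teacher))
```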
Improving Efficiency: Distillation
124
Li, Yates, MacAvaney, He, Sun. PARADE: Passage Representation Aggregation for Document Reranking. arXiv 2020.
(titles)
Improving Efficiency: Distillation & Architectures
Distillation:
Smaller architectures:
125
Improving Efficiency: TK Architectures
126
Small Transformer stack
Similarity matrix & KNRM
Hofstatter, Zlabinger, Hanbury. Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking. ECAI 2020.
Improving Efficiency: TK Architectures
127
Hofstatter, Zlabinger, Hanbury. Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking. ECAI 2020.
Domain-specific Applications
128
TREC-COVID (April-September 2020)
Task: Find scientific articles relevant to questions such as:
129
MacAvaney, Cohan, Goharian. SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Literature Search. EMNLP 2020.
TREC 2020 - Precision Medicine Track
Task: Find relevant scientific articles in PubMed for questions such as:
Is Dabrafenib effective for the melanoma disease in patients with gene BRAF (V600E) mutation?
BERT-based models work fine with small tweaks:
130
Model | nDCG@30 |
median | 0.2857 |
BM25 | 0.3081 |
+ monoT5rct | 0.4193 |
damoespb1 | 0.4255 |
Roberts, Demner-Fushman, Voorhees, Bedrick, Hersh, Overview of the TREC 2020 Precision Medicine Track. TREC 2020.
Zero-shot:
Finetuned only on MS MARCO!
TREC 2020 - Health Misinformation
131
Better
BM25 + monoT5 + LabelT5
Task: Find documents relevant to questions such as:
Can ibuprofen worsen COVID-19?
Metric penalizes documents that contain incorrect information
BM25 + monoT5
Legal Domain
COLIEE 2020, Task 1:
Find relevant cases in a corpus that support the decision of a given case.
132
Team JNLP (Nguyen et al. 2020)
Method | F1 |
Pre-BERT | 0.5148 |
BERT | 0.6379 |
Pre-BERT + BERT | 0.6682 |
Team TLIR (Shao et al. 2020)
Method | F1 |
BM25 | 0.5287 |
BM25 + BERT | 0.6397 |
Takeaways
133
Q&A
10 minutes
134
Break
Resume at 10:40 PDT
135
Part 3: Ranking with Dense Representations
136
Sparse Representations
137
Advantages: 1) Fast to retrieve candidates from an inverted index because q is usually short. 2) Fast to compute because q ∩ d is usually small
Disadvantage: Terms need to match exactly
Task: Estimate the relevance of text d to a query q:
q = "fix my air conditioner"
d = "... AC repair ..."
Dense Representations
138
q = "fix my air conditioner"
d = "... AC repair ..."
An encoder η maps q and d to continuous dense vectors in ℝ^D:
η(q) = [0.8, -1.2, ..., 2.4, -0.3]
η(d) = [0.5, 0.0, ..., 2.6, -0.7]
φ is a similarity function (e.g., inner product or cosine similarity)
φ(η(q), η(d)) ideally measures how relevant q and d are to each other
Nearest Neighbor Search
139
Task: find the top k most relevant texts to a query
140
Brute-force search:
[Figure: compute φ(η(q), η(d1)), ..., φ(η(q), η(d|C|)) between the query and every text in the collection, then keep the top k]
We often need to search many texts (e.g., billions)
Approximate Nearest Neighbor Search
141
[Figure: approximate nearest neighbor search - first compare η(q) against cluster centroids (φ(η(q), η(c1)), φ(η(q), η(c2)), φ(η(q), η(c3))), then compute φ(η(q), η(d1)), ..., φ(η(q), η(dm)) only for the texts in the closest cluster(s)]
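A sketch of exact and approximate nearest neighbor search, assuming the faiss library is installed; random vectors stand in for the encoder outputs η(d) and η(q):

```python
import numpy as np
import faiss  # assumes the faiss-cpu (or faiss-gpu) package is installed

dim = 128
doc_vecs = np.random.rand(10_000, dim).astype("float32")   # stand-ins for encoded texts
query_vec = np.random.rand(1, dim).astype("float32")       # stand-in for the encoded query

# Exact (brute-force) inner-product search over all vectors.
flat = faiss.IndexFlatIP(dim)
flat.add(doc_vecs)
scores, ids = flat.search(query_vec, 10)

# Approximate search: cluster vectors around centroids, probe only a few clusters.
quantizer = faiss.IndexFlatIP(dim)
ivf = faiss.IndexIVFFlat(quantizer, dim, 100, faiss.METRIC_INNER_PRODUCT)
ivf.train(doc_vecs)
ivf.add(doc_vecs)
ivf.nprobe = 8                              # number of clusters to visit per query
scores, ids = ivf.search(query_vec, 10)
print(ids[0])
```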
Pre-BERT Dense Representations
142
Types of Encoders: Cross-encoder
143
Types of Encoders: Bi-encoder
144
[Figure: cross-encoder - one model jointly consumes q and d and outputs a score s; bi-encoder - q and d are encoded separately into η(q) and η(d) and compared with φ(η(q), η(d))]
Pre-BERT: Dual Embedding Space Model
Document
Centroid
Query Term Embeddings
Similarity
Function
145
Leverage term embeddings from pretrained word2vec
Mitra, Nalisnick, Craswell, Caurana. A Dual Embedding Space Model for Document Ranking. arXiv & WWW 2016.
Similarity Functions: Distance vs. Comparison
Comparison
Distance
146
Pre-BERT: Deep Structured Semantic Model
Term Hashing
Feedforward
Network
Representation
Cosine similarity
147
Huang, He, Gao, Deng, Acero, Heck. Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. CIKM 2013.
Distance-based Transformer Representations
148
Distance-based Representations
Key characteristic
Simple similarity function: inner (dot) product, cosine similarity, ...
Compatible with ANN search
Johnson, Douze, JΓ©gou. Billion-scale similarity search with GPUs. arXiv 2017.
149
Distance-based: SentenceBERT
150
Text representation:
Text representation:
Reimers, Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
Distance-based: SentenceBERT
151
Classification
Regression
Reimers, Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
SentenceBERT Result Highlights
Zero-shot:
CLS is very poor
Fine-tuned:
CLS is slightly worse
Classification:
is essential
152
Reimers, Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
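A small distance-based bi-encoder sketch using the sentence-transformers library; the all-MiniLM-L6-v2 checkpoint name is a stand-in, and any SentenceBERT-style model plays the same role:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed stand-in checkpoint
query_vec = model.encode("fix my air conditioner", convert_to_tensor=True)
doc_vecs = model.encode(["... AC repair ...", "how to bake sourdough bread"],
                        convert_to_tensor=True)
print(util.cos_sim(query_vec, doc_vecs))    # cosine similarity to each candidate text
```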
Dense Passage Retrieval (DPR)
Highlights:
153
| Similarity Function (π) | Representation |
SentenceBERT | Cosine similarity | CLS, Mean, or Max |
DPR | Inner product | CLS |
[Figure: in-batch negatives - Query-1 and Query-2 share a batch; each query's positive passage serves as a negative for the other]
Karpukhin, OΔuz, Min, Lewis, Wu, Edunov, Chen, Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.
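DPR trains its bi-encoder with in-batch negatives: within a batch, every other query's positive passage serves as a negative. A minimal PyTorch sketch of that loss (the encoders themselves are omitted):

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_vecs, p_vecs):
    """q_vecs, p_vecs: [batch, dim]; p_vecs[i] is the positive passage for q_vecs[i].
    Every other passage in the batch acts as a negative for that query."""
    scores = q_vecs @ p_vecs.T                    # inner-product similarity matrix
    labels = torch.arange(scores.size(0))         # the diagonal holds the positives
    return F.cross_entropy(scores, labels)

q = torch.randn(16, 768)   # stand-ins for query encoder outputs (CLS vectors)
p = torch.randn(16, 768)   # stand-ins for passage encoder outputs
print(in_batch_negative_loss(q, p))
```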
ANCE
Negative Contrastive Estimation:
154
| Similarity Function (π) | Representation |
SentenceBERT | Cosine similarity | CLS, Mean, or Max |
DPR | Inner product | CLS |
ANCE | Inner product | CLS |
L. Xiong, C. Xiong, Li, Tang, Liu, Bennett, Ahmed, Overwikj. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. ICLR 2021.
Selecting Negative Examples: Results
CLEAR
Highlights:
156
| Similarity Function (π) | Representation |
SentenceBERT | Cosine similarity | CLS, Mean, or Max |
DPR | Inner product | CLS |
ANCE | Inner product | CLS |
CLEAR | Inner product + BM25 | Mean |
Gao, Dai, Chen, Fan, van Durme, Callan. Complementing Lexical Retrieval with Semantic Residual Embedding. arXiv 2020; ECIR 2021.
RepBERT
Exact match signal:
157
| Similarity Function (π) | Representation |
SentenceBERT | Cosine similarity | CLS, Mean, or Max |
DPR | Inner product | CLS |
ANCE | Inner product | CLS |
CLEAR | Inner product + BM25 | Mean |
RepBERT | Inner product | Mean |
Zhan, Mao, Liu, Zhang, Ma. RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. arXiv 2020.
EPIC
Highlights:
158
| Similarity Function (π) | Representation |
SentenceBERT | Cosine similarity | CLS, Mean, or Max |
DPR | Inner product | CLS |
ANCE | Inner product | CLS |
CLEAR | Inner product + BM25 | Mean |
RepBERT | Inner product | Mean |
EPIC | Inner product | |V|-dimension vector |
MacAvaney, Nardini, Perego, Tonellotto, Goharian, Frieder. Expansion via Prediction of Importance with Contextualization. SIGIR 2020.
Distance-based: Results
159
Comparison-based Transformer Representations
160
Comparison-based: Poly-encoders
Two
(Query, Document)
N
(|Query| + |Document|)
161
Number of interacting representations
Computational Cost
Humeau, Shuster, Lachaux, Weston. Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. ICLR 2020.
Comparison-based: Poly-encoders
M context codes
162
Compatible with ANN?
Humeau, Shuster, Lachaux, Weston. Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. ICLR 2020.
Comparison-based: ColBERT
MaxSim operator
163
Khattab, Zaharia. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.
Comparison-based: ColBERT
MaxSim:
Sim-mat max pooling
(along query dimension)
164
Max Pool
Sum
Khattab, Zaharia. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.
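A minimal PyTorch sketch of the MaxSim scoring step: for each query token embedding, take the maximum similarity over the document token embeddings, then sum over query tokens (embeddings are assumed to be L2-normalized, as in ColBERT):

```python
import torch
import torch.nn.functional as F

def maxsim_score(q_emb, d_emb):
    """q_emb: [query_tokens, dim], d_emb: [doc_tokens, dim], both L2-normalized."""
    sim = q_emb @ d_emb.T              # token-by-token similarity matrix
    return sim.max(dim=1).values.sum() # max over doc tokens, summed over query tokens

q = F.normalize(torch.randn(8, 128), dim=-1)     # 8 query token embeddings
d = F.normalize(torch.randn(120, 128), dim=-1)   # 120 document token embeddings
print(maxsim_score(q, d))
```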
Comparison-based: ColBERT
165
Compatible with ANN?
Khattab, Zaharia. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.
Comparison-based: ColBERT
166
Q&A
5 minutes
167
Break
Resume at 11:35 PDT
168
Conclusions and Future Directions
169
Conclusions
170
Open Questions
Transformers for ranking: apply (T5), adapt (PARADE), or redesign (TK/CK)?
What is the future of:
What can IR bring to transformers?
E.g.: REALM (Guu et al., 2020) → text retrieval into pretraining
171
Pretrained Transformers for Text Ranking: BERT and Beyond
Thanks!
by Jimmy Lin, Rodrigo Nogueira, and Andrew Yates https://arxiv.org/abs/2010.06467
172
Learn more in the survey (& upcoming book):