1 of 34

Expand, Highlight, Generate:

RL-driven Document Generation for Passage Reranking

Arian Askari, Mohammad Aliannejadi, Chuan Meng, Evangelos Kanoulas, Suzan Verberne

2 of 34

Arian Askari

  • Bio:
    • Final-year Ph.D. candidate at Leiden University.
    • My research centers on large language models, with an emphasis on their role in information retrieval.
    • Previously, my focus was on developing effective Transformer-based retrievers for both domain-specific and web search.
    • I am working with Suzan Verberne, Mohammad Aliannejadi, and Evangelos Kanoulas.
    • Currently, my passion lies in pushing the boundaries of information retrieval through the capabilities of LLMs.


3 of 34

Generative LLMs are not immune to mistakes or hallucinations

  • Example of a hallucination by ChatGPT
  • The article above does not exist in the real world; it was hallucinated by ChatGPT.


4 of 34

Generative LLMs are not immune to mistakes or hallucinations

  • There are existing articles that share word overlap with the title of the hallucinated article.


5 of 34

Search engines are important because:

They do not hallucinate!

  • Search engines find the most relevant documents
  • Main component of search engines: retrieval models (rankers)
  • But search engines need to be optimized/trained on a relatively large amount of training data


6 of 34

Employing a document generator

  • Our process for employing a document generator in the retrieval task
  • We use BLOOM-560M as the main LLM in our experiments


The pipeline:

  1. A set of queries
  2. Generating a synthetic document per query with the document generator
  3. Filtering the pairs of query and synthetic document that are unlikely to be relevant
  4. Training a MonoT5 re-ranker on the filtered data
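A minimal end-to-end sketch of this pipeline in Python; generate_document and score_relevance are hypothetical placeholder names (not our actual implementation), and a plain score threshold stands in for the consistency filtering described later:

def build_training_data(queries, generate_document, score_relevance, threshold=0.5):
    """Sketch of the DocGen data pipeline; all callables are placeholders."""
    # Step 2: one synthetic document per query, produced by the LLM.
    pairs = [(q, generate_document(q)) for q in queries]
    # Step 3: drop pairs that are unlikely to be relevant.
    filtered = [(q, doc) for q, doc in pairs if score_relevance(q, doc) >= threshold]
    # Step 4: the surviving pairs become positive examples for the MonoT5 re-ranker.
    return filtered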

7 of 34

Example of noisy text generated by LLaMA v1

Example 4:

Query: what goes on the bottom of shepherd's pie

Relevant document: \begin{itemize}

\item 50g butter

\begin {itemize}\item melted

\end{ulis}t

\n\item flour

\uline{15g} \

\linebreak

\lipsum[1]

\hfill

\vspace{2pt}

\textbf{

\textit{ }

\textbullet

\par

\noindent

\bigskip

\newline

\hline

*

**\end {enumerate}*\n *\end* \\\\\\ \\

&&&\n\\\\ \ \\ \ \ \\ \\ & \ & & \\ & &\n \\\\ &&&& \& \&\&\\& &\\\n \ &&&&\&\

&&& \\& &\\ \\&&&&

8 of 34

We propose two document generators: DocGen and DocGen-RL

  • DocGen is a few-shot-based model that uses only three examples for each step of the pipeline.

  • Query expansion. Our preliminary experiments reveal that query expansion and highlighting improve the quality of the generated documents.

9 of 34

We propose two document generators: DocGen and DocGen-RL

[1] Orion Weller, Marc Marone, Nathaniel Weir, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. 2023. "According to..." prompting language models improves quoting from pre-training data. arXiv preprint arXiv:2305.13252.

  • DocGen is a few-shot-based model that uses only three examples for each step of the pipeline.

  • Query expansion. Our preliminary experiments reveal that query expansion and highlighting improve the quality of the generated documents.
  • Query highlighting. We highlight important words of the query using square brackets to steer the LLM to pay more attention to those words, inspired by [1].

10 of 34

We propose two document generators: DocGen and DocGen-RL

[1] Orion Weller, Marc Marone, Nathaniel Weir, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. 2023. "According to..." prompting language models improves quoting from pre-training data. arXiv preprint arXiv:2305.13252.

  • DocGen is a few-shot-based model that uses only three examples for each step of the pipeline.

  • Query expansion. Our preliminary experiments reveal that query expansion and highlighting improve the quality of the generated documents.
  • Query highlighting. We highlight important words of the query using square brackets to steer the LLM to pay more attention to those words, inspired by [1].
  • Document generation. We generate likely-relevant documents given the expanded and highlighted query.

11 of 34

Query expansion prompt

Example 1:

Query: Is a little caffeine ok during pregnancy?

Query Expanded: What is the recommended amount of caffeine intake during pregnancy, and are there any potential risks associated with consuming small amounts of caffeine while pregnant?

Example 2:

Query: What fruit is native to Australia?

Query Expanded: Which fruit is exclusive to Australia and provide some additional details about it?

Example 3:

Query: How large is the canadian military?

Query Expanded: What is the size of the Canadian military and what is the number of active personnel and reserve members?

Example 4:

Query: {query_text}

Query Expanded:

12 of 34

Query highlighting prompt

Example 1:

Query: What is the recommended amount of caffeine intake during pregnancy, and are there any potential risks associated with consuming small amounts of caffeine while pregnant?

Query Highlighted: What is the recommended amount of [caffeine] intake during [pregnancy], and are there any potential risks associated with consuming small amounts of [caffeine] while [pregnant]?

Example 2:

Query: Which fruit is exclusive to Australia and provide some additional details about it?

Query Highlighted: Which [fruit] is exclusive to [Australia] and provide some additional details about it?

Example 3:

Query: What is the size of the Canadian military and what is the number of active personnel and reserve members?

Query Highlighted: What is the size of the [Canadian military] and what is the number of active personnel and reserve members?

Example 4:

Query: {query_text}

Query Highlighted:
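The bracketed spans are easy to recover programmatically; a one-line Python sketch (illustrative, not part of our pipeline):

import re

def highlighted_terms(highlighted_query):
    """Return the bracketed spans, e.g. ['caffeine', 'pregnancy'] for Example 1."""
    return re.findall(r"\[([^\]]+)\]", highlighted_query)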

13 of 34

Document generation prompt

Example1:

Query: What is the recommended amount of [caffeine] intake during [pregnancy], and are there any potential risks associated with consuming small amounts of [caffeine] while [pregnant]?

Relevant Document: We don't know a lot about the effects of caffeine during pregnancy on you and your baby. So it's best to limit the amount you get each day. If you are pregnant, limit caffeine to 200 milligrams each day. This is about the amount in 1½ 8-ounce cups of coffee or one 12-ounce cup of coffee.

Example 2:

Query: Which [fruit] is exclusive to [Australia] and provide some additional details about it?

Relevant Document: Passiflora herbertiana. A rare passion fruit native to Australia. Fruits are green-skinned, white fleshed, with an unknown edible rating. Some sources list the fruit as edible, sweet and tasty, while others list the fruits as being bitter and inedible.

Example 3:

Query: What is the size of the [Canadian military] and what is the number of active personnel and reserve members?

Relevant Document: The Canadian Armed Forces. 1 The first large-scale Canadian peacekeeping mission started in Egypt on November 24, 1956. 2 There are approximately 65,000 Regular Force and 25,000 reservist members in the Canadian military. 3 In Canada, August 9 is designated as National Peacekeepers' Day.

Example 4:

Query: {query_text}

Relevant Document:
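All three prompts above are used in the same way. A minimal sketch with the HuggingFace transformers library and BLOOM-560M; the run_step helper and the greedy decoding settings are assumptions, not our exact setup:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

def run_step(few_shot_prompt: str, query_text: str) -> str:
    """Fill the {query_text} slot and let the LLM complete Example 4."""
    prompt = few_shot_prompt.format(query_text=query_text)
    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding for simplicity; sampling settings may differ in practice.
    output = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    return text[len(prompt):].strip()  # keep only the completion

# The same mechanism drives all three steps:
# expanded    = run_step(EXPANSION_PROMPT, query)
# highlighted = run_step(HIGHLIGHTING_PROMPT, expanded)
# document    = run_step(GENERATION_PROMPT, highlighted)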

14 of 34

MonoT5

  • We use MonoT5 as it has been widely used by the existing baselines on data augmentation for IR
  • MonoT5 casts re-ranking as generating the token "true" or "false" for a query–document pair
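A sketch of MonoT5-style scoring in Python. The "Query: … Document: … Relevant:" input format and the castorini/monot5-base-msmarco checkpoint follow the public MonoT5 releases; treat the details as illustrative:

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")

def relevance_score(query: str, document: str) -> float:
    """Relevance = probability of 'true' vs 'false' as the first decoded token."""
    text = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # Feed the decoder start token and read the logits for the next token.
    start = torch.full((1, 1), model.config.decoder_start_token_id)
    logits = model(**inputs, decoder_input_ids=start).logits[0, -1]
    true_id = tokenizer.encode("true")[0]    # first sub-token of "true"
    false_id = tokenizer.encode("false")[0]  # first sub-token of "false"
    probs = torch.softmax(logits[[true_id, false_id]], dim=0)
    return probs[0].item()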

15 of 34

Training a consistency filtering model

  • Goal: keep high-quality data
  • We generate a dataset of synthetic documents for the queries using the pipeline above.
  • We use this data to train an initial retriever, which we call MonoT5-CF.
  • Given a query q, we use MonoT5-CF to predict the most relevant passages for q.
  • We keep a query–document pair in the final dataset only when the top-1 document returned by the retriever is the document synthetically generated for query q.
  • We train DocGen on this filtered data (a minimal sketch follows below)
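A minimal sketch of the consistency filter, reusing the relevance_score helper from the MonoT5 sketch; candidate_pool (mapping each query to the passages MonoT5-CF ranks against the synthetic one) is a hypothetical structure:

def consistency_filter(pairs, candidate_pool, relevance_score):
    """Keep (query, synthetic_doc) only if the synthetic doc ranks first."""
    kept = []
    for query, synthetic_doc in pairs:
        candidates = candidate_pool[query] + [synthetic_doc]
        # Re-rank all candidates and check that the synthetic document is top-1.
        top1 = max(candidates, key=lambda doc: relevance_score(query, doc))
        if top1 == synthetic_doc:
            kept.append((query, synthetic_doc))
    return kept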

16 of 34

RL-training

  • Motivated by the challenges we encounter during query highlighting
    • and since the nature of the task depends on the document generation module
    • we propose an RL training procedure that optimizes highlighting
  • During RL training, we optimize the query highlighting process with the goal of generating higher-quality documents that are more relevant to the highlighted query.
  • How? (see the sketch after this list)
    • We pass the highlighted query to the LLM with few-shot examples and generate a document via few-shot learning, and
    • we use the relevance predicted by DocGen as the reward for the LLM that highlights the query.
      • We use proximal policy optimization (PPO) as the policy gradient method.
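A heavily simplified sketch of this loop with the trl library's PPOTrainer (pre-0.12 API); the model choice, reward wiring, and hyperparameters are assumptions, and run_step / relevance_score come from the earlier sketches:

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
policy = AutoModelForCausalLMWithValueHead.from_pretrained("bigscience/bloom-560m")
ppo_trainer = PPOTrainer(PPOConfig(batch_size=1), policy, tokenizer=tokenizer)

for expanded_query in expanded_queries:  # expanded_queries: placeholder iterable
    prompt = HIGHLIGHTING_PROMPT.format(query_text=expanded_query)
    query_tensor = tokenizer(prompt, return_tensors="pt").input_ids[0]
    # Policy action: produce the highlighted query.
    response_tensor = ppo_trainer.generate(
        query_tensor, return_prompt=False, max_new_tokens=64
    ).squeeze(0)
    highlighted = tokenizer.decode(response_tensor, skip_special_tokens=True)
    # Few-shot document generation with the frozen LLM, then reward the
    # highlighter with the relevance predicted for the (query, document) pair.
    document = run_step(GENERATION_PROMPT, highlighted)
    reward = torch.tensor(relevance_score(highlighted, document))
    ppo_trainer.step([query_tensor], [response_tensor], [reward])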

17 of 34

Results

18 of 34

Baselines: InPars, GenRead, Q2D

  • InPars generates a query given a passage
  • GenRead focuses on knowledge-intensive tasks, but can be considered a baseline because it generates a document given a query
  • Q2D focuses on query expansion, but since it also does so by generating a document, we consider it a baseline


Retriever             Data Augmentor                     NQ      MS MARCO   DL'20
                                                         nDCG    MRR        MAP    nDCG

First stage
BM25                  —                                  .329    .187       .286   .480

Rerankers
MonoT5                InPars (Bonifacio et al., 2022)    .335    .259       .360   .576
MonoT5                InPars (replicated)                .337    .223       .357   .569
MonoT5                GenRead (replicated)               .368    .230       .354   .570
MonoT5                Q2D (replicated)                   .309    .158       .252   .437

Rerankers w/ DocGen
MonoT5                DocGen (Ours)                      .467    .275       .398   .580
MonoT5                DocGen-RL (Ours)                   .517    .332       .421   .618

Human annotation
MonoT5                —                                  .567    .381       .491   .714

19 of 34

Results on the NQ, MS MARCO, DL'20, HotpotQA, and FEVER datasets in terms of their official metrics

  • DocGen and DocGen-RL outperform the existing baselines by a large margin. However, they still fall short of training on human-annotated data.


20 of 34

  • However, they still fall short of training on human-annotated data.


21 of 34

  • A summary of the results


22 of 34

We perform two different analyses on DocGen and DocGen-RL:

  • Ablation study on DocGen
  • RL-training analysis on DocGen-RL

We use nDCG@10 for evaluation.


23 of 34

Ablation study on DocGen

  • Highlighting is more important than expanding, while both steps contribute to the effectiveness of the model.


                                           NQ-test (nDCG@10)
DocGen w/o expanding                       .370
DocGen w/o highlighting                    .363
DocGen w/o expanding and highlighting      .351
DocGen                                     .467

24 of 34

RL-training analysis on DocGen-RL:

  • RL on highlighting achieves the highest improvement for DocGen.


                                                   NQ-test (nDCG@10)
DocGen + only RL on highlighting (= DocGen-RL)     .517
DocGen + only RL on expanding                      .473
DocGen + only RL on doc generation                 .448
DocGen                                             .467

25 of 34

RL-training analysis on DocGen-RL:

  • RL on highlighting achieves the highest improvement for DocGen.
  • RL on expanding queries slightly improves effectiveness.


26 of 34

RL-training analysis on DocGen-RL:

  • RL on highlighting achieves the highest improvement for DocGen.
  • RL on expanding queries slightly improves effectiveness.
  • RL on document generation decreases effectiveness.
    • This could be because generating a full document is a harder task, which makes it harder for the LLM to learn via RL.


27 of 34

Analyzing the gap between synthetic and real data

  • The documents generated by DocGen and DocGen-RL are closer to the human data than those of the other data augmentation methods.
  • For query generation with InPars, we observed that LLMs tend to select the important words from the document as the query words, which is dissimilar from human queries.

28 of 34

Scaling analysis: impact of scaling on DocGen, evaluated on NQ-test in terms of nDCG@10

  • We achieve a significant improvement by
    • increasing the scale of BLOOM
    • increasing the number of MonoT5 parameters


                                  NQ-test (nDCG@10)
BLOOM-560M and T5-base (220M)     .467
BLOOM-3B                          .482
T5-large (770M)                   .495

29 of 34

Analysis of the highlighting character


30 of 34

Limitations:

  1. Other aspects of evaluation have not been investigated in this paper, specifically:
    • the effect of biased information in the generated documents on biases in document ranking.
  2. Another problem is that the factuality of the LLM output cannot be guaranteed.
    • However, factually incorrect information in the generated data (a result of LLM hallucination) is unlikely to be harmful in the retrieval context, because at inference time a retrieval model can only retrieve information that actually exists in the document collection.
  3. We do not systematically study or quantify the effect of hallucinated data on the performance of the ranker.


31 of 34

Takeaways

  • LLMs have high potential for generating training data for neural retrieval models.
  • Generating synthetic documents produces a dataset that is similar to human-generated data.
  • Our analysis indicates that scaling up enhances effectiveness within our setup.
  • There are also works that focus on generating queries and relevance assessments.
  • Our work can be seen as a complementary approach to query generation methods.

Scan the QR Code to check out the dataset


Thank you!

32 of 34

Appendix

33 of 34

Further training the model already trained on MS MARCO with our synthetic data


34 of 34

Further training the model already trained on MS MARCO with our synthetic data
