1 of 34

Expand, Highlight, Generate:

RL-driven Document Generation for Passage Reranking

Arian Askari, Mohammad Aliannejadi, Chuan Meng, Evangelos Kanoulas, Suzan Verberne

2 of 34

Arian Askari

  • Bio:
    • Final-year Ph.D. candidate at Leiden University.
    • My research centers on large language models, with an emphasis on their role in information retrieval.
    • Previously, my focus was on developing effective Transformer-based retrievers for both domain-specific and web search.
    • I am working with Suzan Verberne, Mohammad Aliannejadi, and Evangelos Kanoulas.
    • Currently, my passion lies in pushing the boundaries of information retrieval through the capabilities of LLMs.


3 of 34

Generative LLMs are not immune to mistakes or hallucinations

  • Example of a hallucination by ChatGPT
  • The article above does not exist in the real world; it was hallucinated by ChatGPT.


4 of 34

Generative LLMs are not immune to mistakes or hallucinations

  • There are existing articles that share word overlap with the title of the hallucinated article.


5 of 34

Search engines are important because:

They do not hallucinate!

  • Search engines find the most relevant documents
  • Main component of search engines: retrieval models (rankers)
  • But search engines need to be optimized/trained on a relatively large amount of training data


6 of 34

Employing a document generator

  • Our process for employing a document generator in the retrieval task
  • We use BLOOM-560M as the main LLM in our experiments


The pipeline:

  1. A set of queries
  2. Generating a synthetic document per query with the document generator
  3. Filtering the pairs of query and synthetic document that are unlikely to be relevant
  4. Training a MonoT5 re-ranker on the filtered data
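A minimal end-to-end sketch of this pipeline in Python; generate_document and score_relevance are hypothetical placeholder names (not our actual implementation), and a plain score threshold stands in for the consistency filtering described later:

def build_training_data(queries, generate_document, score_relevance, threshold=0.5):
    """Sketch of the DocGen data pipeline; all callables are placeholders."""
    # Step 2: one synthetic document per query, produced by the LLM.
    pairs = [(q, generate_document(q)) for q in queries]
    # Step 3: drop pairs that are unlikely to be relevant.
    filtered = [(q, doc) for q, doc in pairs if score_relevance(q, doc) >= threshold]
    # Step 4: the surviving pairs become positive examples for the MonoT5 re-ranker.
    return filtered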

7 of 34

Example of noisy text generated by LLaMA v1

Example 4:

Query: what goes on the bottom of shepherd's pie

Relevant document: \begin{itemize}

\item 50g butter

\begin {itemize}\item melted

\end{ulis}t

\n\item flour

\uline{15g} \

\linebreak

\lipsum[1]

\hfill

\vspace{2pt}

\textbf{

\textit{ }

\textbullet

\par

\noindent

\bigskip

\newline

\hline

*

**\end {enumerate}*\n *\end* \\\\\\ \\

&&&\n\\\\ \ \\ \ \ \\ \\ & \ & & \\ & &\n \\\\ &&&& \& \&\&\\& &\\\n \ &&&&\&\

&&& \\& &\\ \\&&&&

8 of 34

We propose two document generators: DocGen and DocGen-RL

  • DocGen is a few-shot-based model that uses only three examples for each step of the pipeline.

  • Query expansion. Our preliminary experiments reveal that query expansion and highlighting improve the quality of the generated documents.

9 of 34

We propose two document generators: DocGen and DocGen-RL

[1] Orion Weller, Marc Marone, Nathaniel Weir, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. 2023. "According to..." prompting language models improves quoting from pre-training data. arXiv preprint arXiv:2305.13252.

  • DocGen is a few-shot-based model that uses only three examples for each step of the pipeline.

  • Query expansion. Our preliminary experiments reveal that query expansion and highlighting improve the quality of the generated documents.
  • Query highlighting. We highlight important words of the query using square brackets to steer the LLM to pay more attention to those words, inspired by [1].

10 of 34

We propose two document generators: DocGen and DocGen-RL

[1] Orion Weller, Marc Marone, Nathaniel Weir, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. 2023. "According to..." prompting language models improves quoting from pre-training data. arXiv preprint arXiv:2305.13252.

  • DocGen is a few-shot-based model that uses only three examples for each step of the pipeline.

  • Query expansion. Our preliminary experiments reveal that query expansion and highlighting improve the quality of the generated documents.
  • Query highlighting. We highlight important words of the query using square brackets to steer the LLM to pay more attention to those words, inspired by [1].
  • Document generation. We generate likely-relevant documents given the expanded and highlighted query.

11 of 34

Query expansion prompt

Example 1:

Query: Is a little caffeine ok during pregnancy?

Query Expanded: What is the recommended amount of caffeine intake during pregnancy, and are there any potential risks associated with consuming small amounts of caffeine while pregnant?

Example 2:

Query: What fruit is native to Australia?

Query Expanded: Which fruit is exclusive to Australia and provide some additional details about it?

Example 3:

Query: How large is the canadian military?

Query Expanded: What is the size of the Canadian military and what is the number of active personnel and reserve members?

Example 4:

Query: {query_text}

Query Expanded:

12 of 34

Query highlighting prompt

Example 1:

Query: What is the recommended amount of caffeine intake during pregnancy, and are there any potential risks associated with consuming small amounts of caffeine while pregnant?

Query Highlighted: What is the recommended amount of [caffeine] intake during [pregnancy], and are there any potential risks associated with consuming small amounts of [caffeine] while [pregnant]?

Example 2:

Query: Which fruit is exclusive to Australia and provide some additional details about it?

Query Highlighted: Which [fruit] is exclusive to [Australia] and provide some additional details about it?

Example 3:

Query: What is the size of the Canadian military and what is the number of active personnel and reserve members?

Query Highlighted: What is the size of the [Canadian military] and what is the number of active personnel and reserve members?

Example 4:

Query: {query_text}

Query Highlighted:
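The bracketed spans are easy to recover programmatically; a one-line Python sketch (illustrative, not part of our pipeline):

import re

def highlighted_terms(highlighted_query):
    """Return the bracketed spans, e.g. ['caffeine', 'pregnancy'] for Example 1."""
    return re.findall(r"\[([^\]]+)\]", highlighted_query)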

13 of 34

Document generation prompt

Example1:

Query: What is the recommended amount of [caffeine] intake during [pregnancy], and are there any potential risks associated with consuming small amounts of [caffeine] while [pregnant]?

Relevant Document: We don't know a lot about the effects of caffeine during pregnancy on you and your baby. So it's best to limit the amount you get each day. If you are pregnant, limit caffeine to 200 milligrams each day. This is about the amount in 1½ 8-ounce cups of coffee or one 12-ounce cup of coffee.

Example 2:

Query: Which [fruit] is exclusive to [Australia] and provide some additional details about it?

Relevant Document: Passiflora herbertiana. A rare passion fruit native to Australia. Fruits are green-skinned, white fleshed, with an unknown edible rating. Some sources list the fruit as edible, sweet and tasty, while others list the fruits as being bitter and inedible.

Example 3:

Query: What is the size of the [Canadian military] and what is the number of active personnel and reserve members?

Relevant Document: The Canadian Armed Forces. 1 The first large-scale Canadian peacekeeping mission started in Egypt on November 24, 1956. 2 There are approximately 65,000 Regular Force and 25,000 reservist members in the Canadian military. 3 In Canada, August 9 is designated as National Peacekeepers' Day.

Example 4:

Query: {query_text}

Relevant Document:
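All three prompts above are used in the same way. A minimal sketch with the HuggingFace transformers library and BLOOM-560M; the run_step helper and the greedy decoding settings are assumptions, not our exact setup:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

def run_step(few_shot_prompt: str, query_text: str) -> str:
    """Fill the {query_text} slot and let the LLM complete Example 4."""
    prompt = few_shot_prompt.format(query_text=query_text)
    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding for simplicity; sampling settings may differ in practice.
    output = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    return text[len(prompt):].strip()  # keep only the completion

# The same mechanism drives all three steps:
# expanded    = run_step(EXPANSION_PROMPT, query)
# highlighted = run_step(HIGHLIGHTING_PROMPT, expanded)
# document    = run_step(GENERATION_PROMPT, highlighted)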

14 of 34

MonoT5

  • We use MonoT5 as it has been widely used by the existing baselines on data augmentation for IR
  • MonoT5 casts re-ranking as generating the token "true" or "false" for a query–document pair
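A sketch of MonoT5-style scoring in Python. The "Query: … Document: … Relevant:" input format and the castorini/monot5-base-msmarco checkpoint follow the public MonoT5 releases; treat the details as illustrative:

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")

def relevance_score(query: str, document: str) -> float:
    """Relevance = probability of 'true' vs 'false' as the first decoded token."""
    text = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # Feed the decoder start token and read the logits for the next token.
    start = torch.full((1, 1), model.config.decoder_start_token_id)
    logits = model(**inputs, decoder_input_ids=start).logits[0, -1]
    true_id = tokenizer.encode("true")[0]    # first sub-token of "true"
    false_id = tokenizer.encode("false")[0]  # first sub-token of "false"
    probs = torch.softmax(logits[[true_id, false_id]], dim=0)
    return probs[0].item()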

15 of 34

Training a consistency filtering model

  • Goal: keep high-quality data
  • We generate a dataset of synthetic documents for the queries using the pipeline above.
  • We use this data to train an initial retriever, which we call MonoT5-CF.
  • Given a query q, we use MonoT5-CF to predict the most relevant passages for q.
  • We keep a query–document pair in the final dataset only when the top-1 document returned by the retriever is the document synthetically generated for query q.
  • We train DocGen on this filtered data (a minimal sketch follows below)
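A minimal sketch of the consistency filter, reusing the relevance_score helper from the MonoT5 sketch; candidate_pool (mapping each query to the passages MonoT5-CF ranks against the synthetic one) is a hypothetical structure:

def consistency_filter(pairs, candidate_pool, relevance_score):
    """Keep (query, synthetic_doc) only if the synthetic doc ranks first."""
    kept = []
    for query, synthetic_doc in pairs:
        candidates = candidate_pool[query] + [synthetic_doc]
        # Re-rank all candidates and check that the synthetic document is top-1.
        top1 = max(candidates, key=lambda doc: relevance_score(query, doc))
        if top1 == synthetic_doc:
            kept.append((query, synthetic_doc))
    return kept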

16 of 34

RL-training

  • Motivated by the challenges we encounter during query highlighting
    • and since the nature of the task depends on the document generation module
    • we propose an RL training procedure that optimizes highlighting
  • During RL training, we optimize the query highlighting process with the goal of generating higher-quality documents that are more relevant to the highlighted query.
  • How? (see the sketch after this list)
    • We pass the highlighted query to the LLM with few-shot examples and generate a document via few-shot learning, and
    • we use the relevance predicted by DocGen as the reward for the LLM that highlights the query.
      • We use proximal policy optimization (PPO) as the policy gradient method.
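A heavily simplified sketch of this loop with the trl library's PPOTrainer (pre-0.12 API); the model choice, reward wiring, and hyperparameters are assumptions, and run_step / relevance_score come from the earlier sketches:

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
policy = AutoModelForCausalLMWithValueHead.from_pretrained("bigscience/bloom-560m")
ppo_trainer = PPOTrainer(PPOConfig(batch_size=1), policy, tokenizer=tokenizer)

for expanded_query in expanded_queries:  # expanded_queries: placeholder iterable
    prompt = HIGHLIGHTING_PROMPT.format(query_text=expanded_query)
    query_tensor = tokenizer(prompt, return_tensors="pt").input_ids[0]
    # Policy action: produce the highlighted query.
    response_tensor = ppo_trainer.generate(
        query_tensor, return_prompt=False, max_new_tokens=64
    ).squeeze(0)
    highlighted = tokenizer.decode(response_tensor, skip_special_tokens=True)
    # Few-shot document generation with the frozen LLM, then reward the
    # highlighter with the relevance predicted for the (query, document) pair.
    document = run_step(GENERATION_PROMPT, highlighted)
    reward = torch.tensor(relevance_score(highlighted, document))
    ppo_trainer.step([query_tensor], [response_tensor], [reward])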

17 of 34

Results

18 of 34

Baselines: InPars, GenRead, Q2D

  • InPars generates a query given a passage
  • GenRead focuses on knowledge-intensive tasks, but can be considered a baseline because it generates a document given a query
  • Q2D focuses on query expansion, but since it also does so by generating a document, we consider it a baseline


Retriever             Data Augmentor                     NQ      MS MARCO   DL'20
                                                         nDCG    MRR        MAP    nDCG

First stage
BM25                  —                                  .329    .187       .286   .480

Rerankers
MonoT5                InPars (Bonifacio et al., 2022)    .335    .259       .360   .576
MonoT5                InPars (replicated)                .337    .223       .357   .569
MonoT5                GenRead (replicated)               .368    .230       .354   .570
MonoT5                Q2D (replicated)                   .309    .158       .252   .437

Rerankers w/ DocGen
MonoT5                DocGen (Ours)                      .467    .275       .398   .580
MonoT5                DocGen-RL (Ours)                   .517    .332       .421   .618

Human annotation
MonoT5                —                                  .567    .381       .491   .714

19 of 34

Results on the NQ, MS MARCO, DL'20, HotpotQA, and FEVER datasets in terms of their official metrics

  • DocGen and DocGen-RL outperform the existing baselines by a large margin. However, they still fall short of training on human-annotated data.


20 of 34

  • However, they still fall short of training on human-annotated data.


21 of 34

  • A summary of the results


22 of 34

We perform two different analyses on DocGen and DocGen-RL:

  • Ablation study on DocGen
  • RL-training analysis on DocGen-RL

We use nDCG@10 for evaluation.


23 of 34

Ablation study on DocGen

  • Highlighting is more important than expanding, while both steps contribute to the effectiveness of the model.


                                           NQ-test (nDCG@10)
DocGen w/o expanding                       .370
DocGen w/o highlighting                    .363
DocGen w/o expanding and highlighting      .351
DocGen                                     .467

24 of 34

RL-training analysis on DocGen-RL:

  • RL on highlighting achieves the highest improvement for DocGen.


                                                   NQ-test (nDCG@10)
DocGen + only RL on highlighting (= DocGen-RL)     .517
DocGen + only RL on expanding                      .473
DocGen + only RL on doc generation                 .448
DocGen                                             .467

25 of 34

RL-training analysis on DocGen-RL:

  • RL on highlighting achieves the highest improvement for DocGen.
  • RL on expanding queries slightly improves effectiveness.


26 of 34

RL-training analysis on DocGen-RL:

  • RL on highlighting achieves the highest improvement for DocGen.
  • RL on expanding queries slightly improves effectiveness.
  • RL on document generation decreases effectiveness.
    • This could be because generating a full document is a harder task, which makes it harder for the LLM to learn via RL.


27 of 34

Analyzing the gap between synthetic and real data

  • The documents generated by DocGen and DocGen-RL are closer to the human data than those of the other data augmentation methods.
  • For query generation with InPars, we observed that LLMs tend to select the important words from the document as the query words, which is dissimilar from human queries.

28 of 34

Scaling analysis: impact of scaling on DocGen, evaluated on NQ-test in terms of nDCG@10

  • We achieve a significant improvement by
    • increasing the scale of BLOOM
    • increasing the number of MonoT5 parameters


                                  NQ-test (nDCG@10)
BLOOM-560M and T5-base (220M)     .467
BLOOM-3B                          .482
T5-large (770M)                   .495

29 of 34

Analysis of the highlighting character


30 of 34

Limitations:

  1. Other aspects of evaluation have not been investigated in this paper, specifically:
    • the effect of biased information in the generated documents on biases in document ranking.
  2. Another problem is that the factuality of the LLM output cannot be guaranteed.
    • However, factually incorrect information in the generated data (a result of LLM hallucination) is unlikely to be harmful in the retrieval context, because at inference time a retrieval model can only retrieve information that actually exists in the document collection.
  3. We do not systematically study or quantify the effect of hallucinated data on the performance of the ranker.


31 of 34

Takeaways

  • LLMs have high potential for generating training data for neural retrieval models.
  • Generating synthetic documents produces a dataset that is similar to human-generated data.
  • Our analysis indicates that scaling up enhances effectiveness within our setup.
  • There are also works that focus on generating queries and relevance assessments.
  • Our work can be seen as a complementary approach to query generation methods.

Scan the QR Code to check out the dataset


Thank you!

32 of 34

Appendix

33 of 34

Further training the model already trained on MS MARCO with our synthetic data


34 of 34

Further training the model already trained on MS MARCO with our synthetic data
