1 of 29

CS458 Natural Language Processing

Lecture 16

RAG

Krishnendu Ghosh

Department of Computer Science & Engineering

Indian Institute of Information Technology Dharwad

2 of 29

Information Retrieval with Dense Vectors

3 of 29

Information Retrieval with Dense Vectors

BERT stands for Bidirectional Encoder Representations from Transformers. It's a machine learning model developed by Google for natural language processing (NLP).

4 of 29

BERT

Present both the query and the document to a single encoder, allowing the transformer self-attention to see all the tokens of both the query and the document, and thus building a representation that is sensitive to the meanings of both query and document. Then a linear layer can be put on top of the [CLS] token to predict a similarity score for the query/document tuple:

z = BERT(q;[SEP]; d)[CLS]

score(q,d) = softmax(U(z))
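
As an illustration, here is a minimal sketch of this cross-encoder scoring with the Hugging Face transformers library. The bert-base-uncased checkpoint is only a placeholder (its classification head is randomly initialized); in practice one would load a checkpoint fine-tuned for query/document relevance.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: the 2-class head here is untrained, so real use
# requires a model fine-tuned for query/document relevance.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

query = "who wrote the book The Origin of Species"
doc = "On the Origin of Species was written by Charles Darwin and published in 1859."

# The tokenizer builds [CLS] query [SEP] document [SEP]; the linear head U
# sits on top of the [CLS] representation z.
inputs = tokenizer(query, doc, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # U(z), shape (1, 2)
score = torch.softmax(logits, dim=-1)[0, 1]      # probability of the "relevant" class
print(float(score))
```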

5 of 29

BERT

Use a single encoder to jointly encode the query and document, and fine-tune it to produce a relevance score with a linear layer over the [CLS] token. This is too compute-expensive to use except for rescoring.

6 of 29

BERT

Usually the retrieval step is not done on an entire document. Instead documents are broken up into smaller passages, such as non-overlapping fixed-length chunks of say 100 tokens, and the retriever encodes and retrieves these passages rather than entire documents. The query and document have to be made to fit in the BERT 512-token window, for example by truncating the query to 64 tokens and truncating the document if necessary so that it, the query, [CLS], and [SEP] fit in 512 tokens.
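
A rough sketch of this chunking and window budgeting with a BERT tokenizer follows; the function names (chunk_passages, fit_to_window) and the exact budgeting are illustrative choices, not a fixed recipe.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_passages(document, passage_len=100):
    # Split a document into non-overlapping fixed-length token chunks.
    ids = tokenizer(document, add_special_tokens=False)["input_ids"]
    return [tokenizer.decode(ids[i:i + passage_len])
            for i in range(0, len(ids), passage_len)]

def fit_to_window(query, passage, max_len=512, max_query=64):
    # Truncate the query to max_query tokens, then truncate the passage so that
    # [CLS] + query + [SEP] + passage + [SEP] fits in max_len tokens.
    q_ids = tokenizer(query, add_special_tokens=False)["input_ids"][:max_query]
    budget = max_len - len(q_ids) - 3
    p_ids = tokenizer(passage, add_special_tokens=False)["input_ids"][:budget]
    return tokenizer.build_inputs_with_special_tokens(q_ids, p_ids)
```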

7 of 29

BERT

Use separate encoders for the query and the document, and use the dot product between their [CLS] token outputs as the score. This is less compute-expensive, but not as accurate.

8 of 29

BERT

At the other end of the computational spectrum is a much more efficient architecture, the bi-encoder. In this architecture we use two separate encoder models, one to encode the query and one to encode the document, which lets us encode every document in the collection just once, ahead of time.

zq = BERTQ(q)[CLS]

zd = BERTD(d)[CLS]

score(q,d) = zq · zd
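
A minimal sketch of this bi-encoder scoring rule follows. For illustration both encoders are initialized from the same public bert-base-uncased checkpoint; a real retriever would use separately fine-tuned BERT_Q and BERT_D models.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_q = AutoModel.from_pretrained("bert-base-uncased")   # stands in for BERT_Q
bert_d = AutoModel.from_pretrained("bert-base-uncased")   # stands in for BERT_D

def cls_vector(encoder, text):
    # Encode the text and return the final hidden state of the [CLS] token.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0]

z_q = cls_vector(bert_q, "who wrote the book The Origin of Species")
z_d = cls_vector(bert_d, "On the Origin of Species was written by Charles Darwin.")
score = (z_q * z_d).sum(dim=-1)    # score(q, d) = z_q . z_d
print(float(score))
```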

9 of 29

ColBERT

"ColBERT" stands for "Contextualized Late Interaction over BERT", referring to a machine learning model that utilizes a unique method of interacting with the BERT model to achieve efficient and contextually deep information retrieval by representing text at the token level with individual vector embeddings.

10 of 29

ColBERT

This method separately encodes the query and document, but rather than encoding the entire query or document into one vector, it separately encodes each of them into contextual representations for each token. These BERT representations of each document word can be pre-stored for efficiency. The relevance score between a query q and a document d is a sum of maximum similarity (MaxSim) operators between tokens in q and tokens in d. Essentially, for each token in q, ColBERT finds the most contextually similar token in d, and then sums up these similarities. A relevant document will have tokens that are contextually very similar to the query.
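
A small sketch of the MaxSim scoring rule follows, with random unit vectors standing in for the pre-stored per-token BERT representations; only the scoring computation itself is illustrated.

```python
import torch

def maxsim_score(q_vecs, d_vecs):
    # q_vecs: (n_q, dim) query token vectors; d_vecs: (n_d, dim) document token
    # vectors, both L2-normalized so the dot product is cosine similarity.
    sim = q_vecs @ d_vecs.T              # (n_q, n_d) token-to-token similarities
    return sim.max(dim=1).values.sum()   # best doc token per query token, then sum

# Random unit vectors as stand-ins for pre-stored BERT token representations.
q_vecs = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
d_vecs = torch.nn.functional.normalize(torch.randn(100, 128), dim=-1)
print(float(maxsim_score(q_vecs, d_vecs)))
```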

13 of 29

RAG

14 of 29

RAG: Retrieval Augmented Generation

Retrieval-augmented generation is a technique used in natural language processing that combines the power of both retrieval-based models and generative models to enhance the quality and relevance of generated text.

To understand retrieval-augmented generation, let’s break it down into its two main components:

  • retrieval models
  • generative models

15 of 29

RAG: Retrieval Augmented Generation

In the first stage of the two-stage retrieve-and-read model, we retrieve relevant passages from a text collection, for example using the dense retrievers of the previous section. In the second, reader stage, we generate the answer via retrieval-augmented generation.
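
A minimal sketch of this two-stage pipeline follows. The embed() function and the tiny passage list are placeholders (a real system would use a dense encoder and a large collection), and the read stage is shown only up to building the augmented prompt that a generator LLM would then complete.

```python
import numpy as np

passages = [
    "On the Origin of Species was published by Charles Darwin in 1859.",
    "The Eiffel Tower was completed in 1889.",
]

def embed(text):
    # Placeholder encoder: a real system would use a dense retriever here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

passage_vecs = np.stack([embed(p) for p in passages])

def retrieve(query, k=1):
    # Stage 1 (retrieve): dot-product search over precomputed passage vectors.
    scores = passage_vecs @ embed(query)
    return [passages[i] for i in np.argsort(-scores)[:k]]

def build_prompt(query):
    # Stage 2 (read): condition the generator on the retrieved passages.
    context = "\n".join(retrieve(query))
    return f"Context: {context}\nQ: {query}\nA:"

print(build_prompt("Who wrote the book 'The Origin of Species'?"))
```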

16 of 29

RAG

Retrieval models

These models are designed to retrieve relevant information from a given set of documents or a knowledge base. They typically use information retrieval or semantic search techniques to identify the most relevant pieces of information for a given query. Retrieval-based models excel at finding accurate and specific information, but lack the ability to generate creative or novel content.

17 of 29

RAG

Generative models

Generative models, on the other hand, are designed to generate new content based on a given prompt or context. These LLMs use a large amount of training data to learn the patterns and structures of natural language. Generative models can generate creative and coherent text, but they may struggle with factual accuracy or relevance to a specific context.

18 of 29

QA using RAG

19 of 29

QA using RAG

Recall that in simple conditional generation, we can cast the task of question answering as word prediction by giving a language model a question and a token like A: suggesting that an answer should come next:

Q: Who wrote the book "The Origin of Species"? A:

Then we generate autoregressively conditioned on this text.
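
For example, here is a minimal sketch of this conditional generation with the transformers library; gpt2 is used only as a small public stand-in for a capable LLM (it will not answer reliably, but the mechanics are the same).

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = 'Q: Who wrote the book "The Origin of Species"? A:'
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy autoregressive decoding conditioned on the Q:/A: prompt.
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```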

20 of 29

QA using RAG

More formally, recall that simple autoregressive language modeling computes the probability of a string from the previous tokens:

p(x1, ..., xn) = ∏i=1..n p(xi | x<i)

And simple conditional generation for question answering adds a prompt like Q:, followed by a query q, and A:, all concatenated:

p(x1, ..., xn) = ∏i=1..n p(xi | [Q:] ; q ; [A:] ; x<i)
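
A minimal sketch of evaluating this conditional factorization for a candidate answer follows, again with gpt2 as a small stand-in: the answer's log-probability is the sum of per-token log-probabilities, each conditioned on the Q:/A: prompt and the preceding answer tokens. The specific prompt and answer strings are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = 'Q: Who wrote the book "The Origin of Species"? A:'
answer = " Charles Darwin"

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
answer_ids = tokenizer(answer, return_tensors="pt").input_ids
ids = torch.cat([prompt_ids, answer_ids], dim=1)

with torch.no_grad():
    log_probs = torch.log_softmax(model(ids).logits, dim=-1)

# The token at position i is predicted from position i-1, so sum the
# log-probabilities of the answer tokens given everything before them.
total = sum(float(log_probs[0, pos - 1, ids[0, pos]])
            for pos in range(prompt_ids.shape[1], ids.shape[1]))
print("log p(answer | prompt) =", total)
```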

22 of 29

QA Datasets

The MS MARCO (Microsoft Machine Reading Comprehension) collection of datasets is built from natural questions: it includes 1 million real, anonymized English questions from Microsoft Bing query logs, each with a human-generated answer, together with 9 million passages (Bajaj et al., 2016). It can be used to test both retrieval ranking and question answering.

23 of 29

QA Datasets

The DuReader dataset is a Chinese QA resource based on search engine queries and community QA (He et al., 2018).

The TyDi QA dataset contains 204K question-answer pairs from 11 typologically diverse languages, including Arabic, Bengali, Kiswahili, Russian, and Thai (Clark et al., 2020). In the TyDi QA task, a system is given a question and the passages from a Wikipedia article and must (a) select the passage containing the answer (or NULL if no passage contains the answer), and (b) mark the minimal answer span (or NULL).

24 of 29

QA Datasets

On the knowledge-probing side are datasets like MMLU (Massive Multitask Language Understanding), a commonly used dataset of 15,908 knowledge and reasoning questions in 57 areas, including medicine, mathematics, computer science, law, and others. MMLU questions are sourced from exams for humans, such as the US Graduate Record Exam, the Medical Licensing Examination, and Advanced Placement exams.

25 of 29

Evaluating Question Answering

26 of 29

Evaluating Question Answering

Three techniques are commonly employed to evaluate question-answering systems; the choice depends on the type of question and the QA setting. For multiple-choice questions, as in MMLU, we report exact match:

Exact match: the percentage of predicted answers that match the gold answer exactly.
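
A minimal sketch of exact-match scoring follows; the normalization steps (lowercasing, stripping punctuation and articles) follow common QA evaluation practice but are an assumption here, not part of any one benchmark's definition.

```python
import re
import string

def normalize(text):
    # Lowercase, strip punctuation and articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(predictions, golds):
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)

print(exact_match(["Charles Darwin"], ["charles darwin"]))   # 100.0
```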

27 of 29

Evaluating Question Answering

For questions with free-text answers, like Natural Questions, we commonly evaluate with a token-level F1 score that roughly measures the partial string overlap between the predicted answer and the reference answer:

F1 score: the average token overlap between predicted and gold answers. Treat the prediction and the gold answer as bags of tokens, compute F1 for each question, and return the average F1 over all questions.
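
A minimal sketch of this bag-of-tokens F1 follows; the simple whitespace tokenization and lowercasing are illustrative choices rather than a standard.

```python
from collections import Counter

def token_f1(prediction, gold):
    # Treat both strings as bags of (lowercased, whitespace-split) tokens.
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def average_f1(predictions, golds):
    return sum(token_f1(p, g) for p, g in zip(predictions, golds)) / len(golds)

print(token_f1("Charles Robert Darwin", "Charles Darwin"))   # 0.8
```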

28 of 29

Evaluating Question Answering

Finally, in some situations QA systems return multiple ranked answers. In such cases we evaluate using mean reciprocal rank, or MRR (Voorhees, 1999). MRR is designed for systems that return a short ranked list of answers or passages for each test-set question, which we can compare against the (human-labeled) correct answer. Each question is scored by the reciprocal of the rank of the first correct answer (1 if the top answer is correct, 1/2 if the second answer is the first correct one, and so on, or 0 if no returned answer is correct), and MRR is the average of these scores over all questions.
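
A minimal sketch of MRR follows; exact string match against the gold answer stands in here for whatever answer-matching criterion a benchmark actually uses.

```python
def mean_reciprocal_rank(ranked_answers, gold_answers):
    # Each question contributes 1/rank of its first correct answer, or 0 if
    # no returned answer is correct; MRR is the mean over all questions.
    total = 0.0
    for ranked, gold in zip(ranked_answers, gold_answers):
        for rank, answer in enumerate(ranked, start=1):
            if answer == gold:
                total += 1.0 / rank
                break
    return total / len(gold_answers)

ranked = [["1858", "1859", "1860"], ["Darwin", "Wallace"]]
gold = ["1859", "Darwin"]
print(mean_reciprocal_rank(ranked, gold))   # (1/2 + 1/1) / 2 = 0.75
```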

29 of 29

Thank You