CS458 Natural Language Processing
Lecture 16
RAG
Krishnendu Ghosh
Department of Computer Science & Engineering
Indian Institute of Information Technology Dharwad
Information Retrieval with Dense Vectors
Information Retrieval with Dense Vectors
BERT stands for Bidirectional Encoder Representations from Transformers. It's a machine learning model developed by Google for natural language processing (NLP).
BERT
Present both the query and the document to a single encoder, allowing the transformer self-attention to see all the tokens of both the query and the document, and thus building a representation that is sensitive to the meanings of both query and document. Then a linear layer can be put on top of the [CLS] token to predict a similarity score for the query/document tuple:
z = BERT(q;[SEP]; d)[CLS]
score(q,d) = softmax(U(z))
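A minimal sketch of this single-encoder (cross-encoder) scoring, using the Hugging Face transformers library. The choice of checkpoint and the untrained linear scoring head are illustrative assumptions, not the lecture's exact setup; in practice U would be finetuned on relevance-labeled query/document pairs.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
# U in score(q,d) = softmax(U(z)); left untrained here for illustration
scorer = torch.nn.Linear(encoder.config.hidden_size, 2)

def score(query: str, document: str) -> float:
    # Tokenize query and document as one sequence: [CLS] query [SEP] document [SEP]
    inputs = tokenizer(query, document, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        z = encoder(**inputs).last_hidden_state[:, 0]      # [CLS] representation
        probs = torch.softmax(scorer(z), dim=-1)
    return probs[0, 1].item()                              # probability of "relevant"

print(score("who wrote The Origin of Species",
            "On the Origin of Species was written by Charles Darwin."))
```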
BERT
Use a single encoder to jointly encode query and document, and finetune it to produce a relevance score with a linear layer over the [CLS] token. This is too compute-expensive to use except for rescoring.
BERT
Usually the retrieval step is not done on an entire document. Instead documents are broken up into smaller passages, such as non-overlapping fixed-length chunks of say 100 tokens, and the retriever encodes and retrieves these passages rather than entire documents. The query and document have to be made to fit in the BERT 512-token window, for example by truncating the query to 64 tokens and truncating the document if necessary so that it, the query, [CLS], and [SEP] fit in 512 tokens.
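A minimal sketch of the passage splitting and truncation just described. The sizes follow the slide (100-token passages, 64-token query budget, 512-token BERT window); the whitespace tokenizer is a simplifying assumption standing in for a real subword tokenizer.

```python
def chunk_passages(document: str, chunk_len: int = 100) -> list[list[str]]:
    tokens = document.split()  # stand-in for a real subword tokenizer
    return [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]

def fit_to_window(query: str, passage: list[str],
                  window: int = 512, max_query: int = 64) -> list[str]:
    q_tokens = query.split()[:max_query]       # truncate the query to 64 tokens
    budget = window - len(q_tokens) - 3        # reserve room for [CLS] and two [SEP]s
    return ["[CLS]"] + q_tokens + ["[SEP]"] + passage[:budget] + ["[SEP]"]

doc = "On the Origin of Species " * 200
passages = chunk_passages(doc)
print(len(passages), len(fit_to_window("who wrote the origin of species", passages[0])))
```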
BERT
Use separate encoders for query and document, and use the dot product between CLS token outputs for the query and document as the score. This is less compute-expensive, but not as accurate.
BERT
At the other end of the computational spectrum is a much more efficient architecture, the bi-encoder. In this architecture we can encode the documents in the collection only one time by using two separate encoder models, one to encode the query and one to encode the document.
zq = BERTQ(q)[CLS]
zd = BERTD(d)[CLS]
score(q,d) = zq · zd
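A minimal sketch of the bi-encoder scoring in the equations above, again with Hugging Face transformers. Using the same pretrained checkpoint for BERT_Q and BERT_D is an illustrative simplification; in practice the two encoders are finetuned for retrieval.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_q = AutoModel.from_pretrained("bert-base-uncased")
bert_d = AutoModel.from_pretrained("bert-base-uncased")

def cls_vector(model, text: str) -> torch.Tensor:
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[0, 0]   # [CLS] output vector

# Documents are encoded once, offline; only the query is encoded at query time.
docs = ["Charles Darwin wrote On the Origin of Species.",
        "The Eiffel Tower is in Paris."]
doc_vecs = torch.stack([cls_vector(bert_d, d) for d in docs])

zq = cls_vector(bert_q, "who wrote the origin of species")
scores = doc_vecs @ zq            # score(q, d) = zq · zd for every document
print(scores.argmax().item())     # index of the highest-scoring document
```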
ColBERT
"ColBERT" stands for "Contextualized Late Interaction over BERT", referring to a machine learning model that utilizes a unique method of interacting with the BERT model to achieve efficient and contextually deep information retrieval by representing text at the token level with individual vector embeddings.
ColBERT
This method separately encodes the query and document, but rather than encoding the entire query or document into one vector, it separately encodes each of them into contextual representations for each token. These BERT representations of each document word can be pre-stored for efficiency. The relevance score between a query q and a document d is a sum of maximum similarity (MaxSim) operators between tokens in q and tokens in d. Essentially, for each token in q, ColBERT finds the most contextually similar token in d, and then sums up these similarities. A relevant document will have tokens that are contextually very similar to the query.
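A minimal sketch of the MaxSim scoring just described, on toy token embeddings. Real ColBERT uses projected and normalized BERT token vectors, so the random matrices here are purely illustrative.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # query_vecs: (num_query_tokens, dim), doc_vecs: (num_doc_tokens, dim)
    sims = query_vecs @ doc_vecs.T        # similarity of every query token to every doc token
    return float(sims.max(axis=1).sum())  # best-matching doc token per query token, summed

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 128))     # 5 query token vectors
d = rng.normal(size=(80, 128))    # 80 document token vectors (pre-stored in a real system)
print(maxsim_score(q, d))
```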
RAG
RAG: Retrieval Augmented Generation
Retrieval-augmented generation is a technique used in natural language processing that combines the power of both retrieval-based models and generative models to enhance the quality and relevance of generated text.
To understand retrieval-augmented generation, let’s break it down into its two main components:
RAG: Retrieval Augmented Generation
In the first stage of the 2-stage retrieve and read model we retrieve relevant passages from a text collection, for example using the dense retrievers of the previous section. In the second reader stage, we generate the answer via retrieval-augmented generation.
RAG
Retrieval models
These models are designed to retrieve relevant information from a given set of documents or a knowledge base. They typically use information retrieval or semantic search techniques to identify the most relevant pieces of information for a given query. Retrieval-based models excel at finding accurate and specific information, but they lack the ability to generate creative or novel content.
RAG
Generative models
Generative models, on the other hand, are designed to generate new content based on a given prompt or context. These large language models (LLMs) learn the patterns and structures of natural language from large amounts of training data. Generative models can produce creative and coherent text, but they may struggle with factual accuracy or relevance to a specific context.
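A minimal sketch tying the two components together into a retrieve-and-read pipeline: a dot-product retriever over pre-encoded passages, followed by a prompt to a generative model. `embed` and `generate` are hypothetical stand-ins for a real dense encoder and a real LLM, not functions from the lecture.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical encoder: a real system would use BERT_Q / BERT_D as above.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=128)

def generate(prompt: str) -> str:
    # Hypothetical generator: a real system would call an LLM here.
    return "(LLM answer conditioned on: " + prompt[:60] + "...)"

passages = ["Charles Darwin wrote On the Origin of Species in 1859.",
            "The Amazon is the largest rainforest on Earth."]
passage_vecs = np.stack([embed(p) for p in passages])   # encoded once, offline

def rag_answer(question: str, k: int = 1) -> str:
    scores = passage_vecs @ embed(question)              # stage 1: retrieve
    top = [passages[i] for i in np.argsort(-scores)[:k]]
    prompt = "\n".join(top) + f"\nQ: {question}\nA:"     # stage 2: read / generate
    return generate(prompt)

print(rag_answer("Who wrote The Origin of Species?"))
```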
QA using RAG
QA using RAG
Recall that in simple conditional generation, we can cast the task of question answering as word prediction by giving a language model a question and a token like A: suggesting that an answer should come next:
Q: Who wrote the book "The Origin of Species"? A:
Then we generate autoregressively conditioned on this text.
QA using RAG
More formally, recall that simple autoregressive language modeling computes the probability of a string from the previous tokens:
p(x1, ..., xn) = ∏i p(xi | x1, ..., xi-1)
And simple conditional generation for question answering adds a prompt like Q:, followed by a query q, and A:, all concatenated:
p(x1, ..., xn) = ∏i p(xi | Q: ; q ; A: ; x1, ..., xi-1)
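A minimal sketch of conditional generation for QA as formalized above, using GPT-2 via Hugging Face transformers as a stand-in language model; the choice of model and decoding settings are assumptions for illustration only.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = 'Q: Who wrote the book "The Origin of Species"? A:'
inputs = tokenizer(prompt, return_tensors="pt")

# Generate autoregressively, conditioning each new token on Q:, q, A:, and the tokens so far.
output = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```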
QA using RAG
QA Datasets
A similar natural-question resource is the MS MARCO (Microsoft Machine Reading Comprehension) collection of datasets, which includes 1 million real, anonymized English questions from Microsoft Bing query logs, each with a human-generated answer, along with 9 million passages (Bajaj et al., 2016); it can be used both to test retrieval ranking and question answering.
QA Datasets
The DuReader dataset is a Chinese QA resource based on search engine queries and community QA (He et al., 2018).
The TyDi QA dataset contains 204K question-answer pairs from 11 typologically diverse languages, including Arabic, Bengali, Kiswahili, Russian, and Thai (Clark et al., 2020). In the TyDi QA task, a system is given a question and the passages from a Wikipedia article and must (a) select the passage containing the answer (or NULL if no passage contains the answer), and (b) mark the minimal answer span (or NULL).
QA Datasets
On the probing side are datasets like MMLU (Massive Multitask Language Understanding), a commonly used dataset of 15,908 knowledge and reasoning questions in 57 areas, including medicine, mathematics, computer science, law, and others. MMLU questions are sourced from various exams written for humans, such as the US Graduate Record Exam, the Medical Licensing Examination, and Advanced Placement exams.
Evaluating Question Answering
Evaluating Question Answering
Three techniques are commonly employed to evaluate question-answering systems, with the choice depending on the type of question and the QA setting. For multiple-choice questions like those in MMLU, we report exact match:
Exact match: The % of predicted answers that match the gold answer exactly.
Evaluating Question Answering
For questions with free-text answers, like Natural Questions, we commonly evaluate with token F1 score, which roughly measures the partial string overlap between the predicted answer and the reference answer:
F1 score: The average token overlap between predicted and gold answers. Treat the prediction and gold as a bag of tokens, and compute F1 for each question, then return the average F1 over all questions.
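A minimal sketch of exact match and token F1 as defined above, for one question. The lowercasing-and-whitespace normalization is a simplifying assumption (SQuAD-style scoring also strips punctuation and articles).

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Charles Darwin", "charles darwin"))                         # True
print(round(token_f1("the naturalist Charles Darwin", "Charles Darwin"), 3))   # 0.667
```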
Evaluating Question Answering
Finally, in some situations QA systems give multiple ranked answers. In such cases we evaluate using mean reciprocal rank, or MRR (Voorhees, 1999). MRR is designed for systems that return a short ranked list of answers or passages for each test set question, which we can compare against the (human-labeled) correct answer.
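A minimal sketch of MRR over a test set, following the definition above: each question contributes 1/rank of its first correct answer (0 if no returned answer is correct), averaged over all questions.

```python
def mrr(ranked_answers: list[list[str]], gold_answers: list[str]) -> float:
    total = 0.0
    for ranked, gold in zip(ranked_answers, gold_answers):
        for rank, answer in enumerate(ranked, start=1):
            if answer == gold:
                total += 1.0 / rank
                break
    return total / len(gold_answers)

# Question 1: correct answer at rank 1; question 2: correct answer at rank 3.
print(mrr([["Darwin", "Wallace"], ["1861", "1872", "1859"]],
          ["Darwin", "1859"]))    # (1 + 1/3) / 2 ≈ 0.667
```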
Thank You