ONCE UPON A TIME In The Cozy Afternoon at Masaryk University
Question Answering and Beyond
Prologue:
An Unknown Visitor
Whoami
Web, bio, more info: https://mfajcik.github.io/
Chapter 1:
Introduction
Information Need
?
Information Need
? Information Need
Information Need
? Information Need
When is X's birthday?
I suffer from Y every winter. How can I prevent it?
Where to buy skiing equipment?
Information Need
? Information Need
When is X's birthday?
I suffer from Y every winter. How can I prevent it?
Where to buy skiing equipment?
What is the information need?
Interact with the world
To Know
Information Need
? Information Need
When is X's birthday?
I suffer from Y every winter. How can I prevent it?
Where to buy skiing equipment?
Interact To Know
Retrieve/Record Knowledge
socializing
Information Need
? Information Need
When is X's birthday?
I suffer from Y every winter. How can I prevent it?
Where to buy skiing equipment?
Interact To Know
Retrieve/Record Knowledge
socializing
language
Information Need
Traditional Information Retrieval Today
Research Desiderata
Chapter 2:
Information Retrieval
Information Retrieval (IR)
Nguyen, Tri, et al. "MS MARCO: A human generated machine reading comprehension dataset." CoCo@NIPS. 2016.
Query (often a list of keywords)
Task: distinguish between relevant/irrelevant documents
The term “Information Retrieval” in the literature. Example from MS MARCO (Nguyen et al., 2016)
Information Retrieval (IR)
The term “Information Retrieval” in the literature. Example from MS MARCO (Nguyen et al., 2016)
Nguyen, Tri, et al. "MS MARCO: A human generated machine reading comprehension dataset." CoCo@NIPS. 2016.
Query (often a list of keywords)
Task: distinguish between relevant/irrelevant documents
The labels can be non-binary (relevance scores)
Is Information Retrieval Document Retrieval?
Question Answering (QA)
Question (in natural language)
Task: provide Answer
Provided document(s)
Question Answering (QA)
Who is Jožko Mrkvička?
A fictional character in colloquial Slovak, whose name is used to denote an ordinary average citizen
What answer is expected?
What is question asking about?
Extractive QA
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP.
Exact Match measures the percentage of predictions that match at least one of the ground truth answers exactly
Extractive QA
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP.
(macro)F1 measures the average overlap between the prediction and ground truth answer.
Prediction and ground truth are treated as bags of tokens and their F1 is computed.
Usually a maximum F1 over all of the ground truth answers for a given question is taken, and the result is an average over all of the questions.
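To make these metrics concrete, below is a minimal Python sketch of SQuAD-style scoring. The normalization (lowercasing, stripping punctuation and articles) only approximates the official evaluation script, and the function names are illustrative.

import re
import string
from collections import Counter

def normalize(text):
    # Lowercase, drop punctuation and articles, collapse whitespace (approximation of the SQuAD script).
    text = "".join(c for c in text.lower() if c not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    # 1.0 if the prediction matches at least one ground-truth answer exactly, else 0.0.
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))

def token_f1(prediction, gold):
    # Treat prediction and ground truth as bags of tokens and compute their F1.
    pred_toks, gold_toks = normalize(prediction).split(), normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def f1(prediction, gold_answers):
    # Maximum F1 over all ground-truth answers for the given question.
    return max(token_f1(prediction, g) for g in gold_answers)

Dataset-level EM and F1 are then averages of these per-question scores.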
When Document Retrieval meets QA
Open-domain QA
Brief Business Motivation
QA vs Fact-Checking
Figure inspired by Elior Sulem, Jamaal Hay, and Dan Roth. 2022. Yes, no or IDK: The challenge of unanswerable yes/no questions. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1075–1085, Seattle, United States. Association for Computational Linguistics.
Yes/No Question (Y/N), Closed-domain Extractive QA (CD), A fact to be verified (FACT)
Chapter 3:
Introduction to BM25
Retrieval
Corpus
Very large: millions/billions of documents
Retrieval
Ranking
Retrieval via TF-IDF
Schütze, Hinrich, Christopher D. Manning, and Prabhakar Raghavan. Introduction to information retrieval. Vol. 39. Cambridge: Cambridge University Press, 2008.
Standard TF-IDF works reasonably well for retrieval!
Retrieval via TF-IDF
For a query $Q := q_1 q_2 \ldots q_n$ and a document $D := w_1 w_2 \ldots w_m$, we compute the score from overlapping terms as follows:
same quantity
Retrieval via TF-IDF
How to implement?
Bonus: Check out tf-idf implementation in DrQA
For a query $Q := q_1 q_2 \ldots q_n$ and a document $D := w_1 w_2 \ldots w_m$, we compute the score from overlapping terms as follows:
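A minimal sketch of such a scorer follows (this is not the DrQA implementation, which hashes unigrams and bigrams into a sparse matrix; the class name and the exact idf variant are assumptions). Each term occurring in both query and document contributes tf(t, D) · idf(t), with e.g. idf(t) = log(N / df(t)).

import math
from collections import Counter

class TfIdfRetriever:
    def __init__(self, corpus):
        # corpus: list of tokenized documents (lists of terms)
        self.corpus = corpus
        self.tfs = [Counter(doc) for doc in corpus]            # term frequencies per document
        df = Counter()
        for doc in corpus:
            df.update(set(doc))                                # document frequencies
        n_docs = len(corpus)
        self.idf = {t: math.log(n_docs / df[t]) for t in df}   # one common idf variant

    def score(self, query, doc_id):
        # Sum tf * idf over terms occurring in both the query and the document.
        tf = self.tfs[doc_id]
        return sum(tf[t] * self.idf.get(t, 0.0) for t in set(query) if t in tf)

    def search(self, query, k=5):
        scores = [(self.score(query, i), i) for i in range(len(self.corpus))]
        return sorted(scores, reverse=True)[:k]

At the scale of millions of documents, the same scoring is done with an inverted index or sparse matrix products rather than a per-document loop.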
Building BM25 Retrieval
this is term frequency in document D
saturation parameter
Building BM25 Retrieval
Some authors have more to say: they may write a single document containing or covering more ground. An extreme version would have the author writing two or more documents and concatenating them.
Hypothesis A
Hypothesis B
My beagle dog is a great beagle. Beagle is great.
~I own a beagle.
Building BM25 Retrieval
current document’s length
average document length in corpus
soft constraint to cover both hypotheses
Building BM25 Retrieval
BM25 Formula
Building BM25 Retrieval
$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{\mathrm{tf}(q_i, D)\,(k + 1)}{\mathrm{tf}(q_i, D) + k \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$
Robertson, Stephen, and Hugo Zaragoza. "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval 3.4 (2009): 333-389.
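A minimal sketch of BM25 scoring following the formula above, with typical defaults k ≈ 1.2 and b ≈ 0.75. The idf variant and all names are assumptions for illustration; a real system precomputes document frequencies and uses an inverted index.

import math
from collections import Counter

def bm25_score(query, doc, corpus, k=1.2, b=0.75):
    # query, doc: lists of tokens; corpus: list of tokenized documents (for idf and avgdl)
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs     # average document length in the corpus
    tf = Counter(doc)                                # term frequencies in the current document
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)
        if df == 0 or term not in tf:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))    # a common BM25 idf variant
        norm = tf[term] + k * (1 - b + b * len(doc) / avgdl)    # saturation + length normalization
        score += idf * tf[term] * (k + 1) / norm
    return score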
Chapter 4:
Question Answering
Selective QA
Evaluation: Standard multiclass classification metrics (Accuracy, F1, MCC)
Extractive QA
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP.
(macro)F1 measures the average overlap between the prediction and ground truth answer.
Prediction and ground truth are treated as bags of tokens and their F1 is computed.
Usually a maximum F1 over all of the ground truth answers for a given question is taken, and the result is an average over all of the questions.
Abstractive QA
Task: answer the question based on the story
Evaluation via Traditional NLG metrics
BLEU-4, ROUGE-L, Meteor
A simple extractive QA system A
Given a question Q and a document D,
find the answer span <a_start, a_end>.
Estimate parameters via maximum likelihood estimation.
A simple extractive QA system: Decoding
Given a question Q and a document D,
find the answer span <a_start, a_end>.
Estimate parameters via maximum likelihood estimation.
A simple extractive QA system B
Given a question Q and a document D,
find the answer span <a_start, a_end>.
Estimate parameters via maximum likelihood estimation.
img source: Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL-HLT. 2019.
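Below is a minimal PyTorch sketch of this BERT-style variant: one linear layer produces start/end logits per token, training maximizes the likelihood of the gold span under independent start/end softmaxes, and decoding picks the best valid span. Module and variable names are illustrative, and real systems also mask question tokens and padding.

import torch
import torch.nn as nn

class SpanHead(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)   # one start and one end logit per token

    def forward(self, token_reps):
        # token_reps: [batch, seq_len, hidden] from a BERT-like encoder over "question [SEP] document"
        start_logits, end_logits = self.qa_outputs(token_reps).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)    # each [batch, seq_len]

def span_nll(start_logits, end_logits, start_gold, end_gold):
    # Maximum likelihood estimation of the gold span under independent start/end distributions.
    ce = nn.CrossEntropyLoss()
    return ce(start_logits, start_gold) + ce(end_logits, end_gold)

def decode_span(start_logits, end_logits, max_len=30):
    # Single example ([seq_len] logits, batch dim omitted): argmax of log P(start) + log P(end)
    # over spans with start <= end < start + max_len.
    scores = start_logits.log_softmax(-1)[:, None] + end_logits.log_softmax(-1)[None, :]
    n = scores.size(0)
    ones = torch.ones(n, n)
    valid = torch.triu(ones).bool() & ~torch.triu(ones, diagonal=max_len).bool()
    best = scores.masked_fill(~valid, float("-inf")).argmax().item()
    return divmod(best, n)   # (a_start, a_end)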
The Objective
Do we need to assume independence?
Assumption on Independence (Xiong et al., 2017; Seo et al.,2017; Chen et al., 2017; Yu et al., 2018; Devlin et al., 2019; Cheng et al., 2020; inter alia)
No, we can compute the joint objective with similar complexity directly, and it “works better” (Fajcik et al., 2021)
Fajcik, Martin, Josef Jon, and Pavel Smrz. "Rethinking the Objectives of Extractive Question Answering." In Proceedings of the 3rd Workshop on Machine Reading for Question Answering. 2021.
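For intuition, here is a sketch of one way to drop the independence assumption at similar cost, roughly in the spirit of the compound objective of Fajcik et al. (2021) (their exact parametrization differs): score every (start, end) pair and take a single softmax over all valid spans.

import torch

def joint_span_nll(start_logits, end_logits, start_gold, end_gold):
    # start_logits, end_logits: [seq_len] for one example (batch dim omitted for clarity).
    scores = start_logits[:, None] + end_logits[None, :]    # score for every (start, end) pair
    valid = torch.triu(torch.ones_like(scores)).bool()      # keep only spans with start <= end
    log_probs = scores.masked_fill(~valid, float("-inf")).flatten().log_softmax(-1).view_as(scores)
    return -log_probs[start_gold, end_gold]                 # NLL of the gold span under the joint softmax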
Open-domain QA
MOTIVATION #1: Research-wise
Almost any NLP task can be framed as question answering!
Open-domain QA
MOTIVATION #2: Information retrieval in everyday life
“Academics and industry researchers need to achieve the intellectual ‘escape velocity’ necessary to revolutionize search. They must invest much more in bold strategies that can achieve natural-language searching and answering, rather than providing the electronic equivalent.”
“Moving up the information food chain requires a search engine that can interpret a user's question, extract facts from all the information on the web, and select �an appropriate answer.”
Keyword searching
Etzioni, Oren. "Search needs a shake-up." Nature 476.7358 (2011): 25-26.
Example of traditional approach
Retriever
Extractive reader
Example of traditional approach: Reader
BM25 Negative Document P1-
Document P1+
BM25 Negative Document P2-
Document Pn
Jožko Mrkvička is a fictional character in colloquial Slovak (but also journalistic style), whose name is used to denote the average citizen, or as an implicit name in examples of the textbook type. It is not associated with any negative or positive qualities (such as the English John Bull), nor is it derived from any truly existing character, nor is it the object of fabulations to give it a semblance of historical authenticity (such as the Czech Jára Cimrman) … writer Mária Ďuríčková (1919) for the main character in her book Jožko Mrkvička Spáč (1972). [1]
From retriever
Extractive reader
Document Pn
Jožko Mrkvička is a fictional character in colloquial Slovak (but also journalistic style), whose name is used to denote the average citizen, or as an implicit name in examples of the textbook type. It is not associated with any negative or positive qualities (such as the English John Bull), nor is it derived from any truly existing character, nor is it the object of fabulations to give it a semblance of historical authenticity (such as the Czech Jára Cimrman) … writer Mária Ďuríčková (1919) for the main character in her book Jožko Mrkvička Spáč (1972). [1]
Maximum Marginal Likelihood
Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.
This is a so-called “latent variable model” with latent variable v_i. Remember GMM!
0 if z_i is not from Z, 1 otherwise
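Written out (a standard formulation, cf. Clark and Gardner, 2018), with Z the set of candidate spans whose text matches a gold answer and the indicator above selecting them, the per-sample loss marginalizes over all matching spans:

$$\mathcal{L}_{\mathrm{MML}} = -\log \sum_{i} \mathbb{1}\left[z_i \in Z\right] \, P\left(z_i \mid Q, P_1, \dots, P_n\right)$$

Only the total probability of matching spans is rewarded; how the model distributes it among them is left free, which is exactly what makes the chosen span a latent variable.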
MML in Open-domain QA
Model
Question
1st passage
Question
2nd passage
Question
n-th passage
. . .
Model
Model
Passage representations
Passage representations
Passage representations
d
seq_len
Softmax
Linear
Linear
Linear
Loss for 1 sample
Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.
Important for cross-passage answer score calibration!
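A minimal sketch of this “shared normalization” (Clark and Gardner, 2018): start/end logits from all passages of one question go through a single softmax, so span scores are directly comparable across passages, and the MML loss then marginalizes over every gold-matching span. Shapes and names are assumptions for illustration.

import torch

def shared_norm_log_probs(start_logits, end_logits):
    # start_logits, end_logits: [n_passages, seq_len] for ONE question.
    n, L = start_logits.shape
    # One softmax over all (passage, position) pairs instead of a per-passage softmax.
    log_p_start = start_logits.flatten().log_softmax(-1).view(n, L)
    log_p_end = end_logits.flatten().log_softmax(-1).view(n, L)
    return log_p_start, log_p_end

def mml_loss(log_p_start, log_p_end, gold_spans):
    # gold_spans: list of (passage_idx, start, end) spans whose text matches a gold answer.
    span_log_probs = torch.stack([log_p_start[p, s] + log_p_end[p, e] for p, s, e in gold_spans])
    return -torch.logsumexp(span_log_probs, dim=0)   # marginalize over all matching spans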
Do we need to use extractive models?
T5
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.
Idea #1: “Concatenate, pass, profit”
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.
Concatenate!
Question + Passage 1 + Passage 2 + Passage 3 …
[Answer]
Drawbacks?
1. Memory complexity
2. Decoding: if we decode without restrictions, the model might generate something not present in the text.
Idea #2: Processing passages jointly: Fusion-in-Decoder
Izacard, Gautier, and Édouard Grave. "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering." Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.
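Schematically, Fusion-in-Decoder works as sketched below. The encode and decode_with_cross_attention functions are hypothetical stand-ins for a T5-like encoder and decoder (this is not the authors' code); the key point is that encoding stays independent per passage (cost linear in the number of passages), while the decoder attends over all passages jointly.

import torch

def fusion_in_decoder(encode, decode_with_cross_attention, question, passages):
    # 1) Encode each (question, passage) pair independently.
    per_passage = [encode(f"question: {question} context: {p}") for p in passages]   # each [seq_len, d]
    # 2) Fuse: concatenate the encoder outputs along the sequence axis.
    fused = torch.cat(per_passage, dim=0)                                            # [n_passages * seq_len, d]
    # 3) The decoder generates the answer while cross-attending over ALL passages at once.
    return decode_with_cross_attention(fused)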
Fusing the extractive and generative approaches
Fusing the extractive and generative approaches
Fajcik, M., Docekal, M., Ondrej, K. and Smrz, P., 2021. Pruning the index contents for memory efficient open-domain QA. arXiv preprint arXiv:2102.10697.
Chapter 5:
‘23 Trends
Is QA “solved” by LLMs such as ChatGPT/GPT-4?
There is no definite answer, but we can do what every good scientist should. Hypothesize…
Warning:
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung, W. and Do, Q.V., 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.
Is QA “solved” by Retrieval-Augmented LLMs?
There is no definite answer, but we can do what every good scientist should. Hypothesize…
Is QA “solved” by Evidence-grounded LLMs?
There is no definite answer, but we can do what every good scientist should. Hypothesize…
Query: What are the pros and cons of the top 3 selling pet vacuums?
Dmitri Brereton, “Bing AI Can’t Be Trusted”, https://dkb.blog/p/bing-ai-cant-be-trusted
Is QA “solved” by Evidence-grounded LLMs?
Query: What are the pros and cons of the top 3 selling pet vacuums?
“This is all completely made up information.
Bing AI was kind enough to give us its sources, so we can go to the hgtv article and check for ourselves.
The cited article says nothing about limited suction power or noise. In fact, the top amazon review for this product talks about how quiet it is.
The article also says nothing about the “short cord length of 16 feet” because it doesn’t have a cord. It’s a portable handheld vacuum.”
Dmitri Brereton, “Bing AI Can’t Be Trusted”, https://dkb.blog/p/bing-ai-cant-be-trusted
Is QA “solved” by Evidence-grounded LLMs?
Liu, Nelson F., Tianyi Zhang, and Percy Liang. "Evaluating Verifiability in Generative Search Engines." arXiv preprint arXiv:2304.09848 (2023).
Evidence-grounded models still suffer from hallucination and bad evidence grounding.
Data from very recent preprint (19 Apr)
Citation recall is the proportion of verification-worthy statements that are fully supported by their associated citations.
Citation precision is the proportion of generated citations that support their associated statements.
“We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations: on average, a mere 51.5% of generated sentences are fully supported by citations (citation recall) and only 74.5% of citations support their associated sentence (citation precision). We believe that these results are concerningly low for systems that may serve as a primary tool for information seeking users, especially given their facade of trustworthiness.”
Is QA “solved” by Evidence-grounded LLMs?
There is no definite answer, but we can do what every good scientist should. Hypothesize…
Figure source: Shakarian, Paulo, et al. "An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP)." arXiv preprint arXiv:2302.13814 (2023).
Epilogue: Takeaways
Takeaways: QA
Medved, Marek, and Ales Horák. "SQAD: Simple Question Answering Database." RASLAN. 2014.
Takeaways: Document Retrieval
More Recent Directions, Literature, etc.
Neural Document Retrieval
Contriever/mContriever — Dense retrieval pretrained without supervision (also multilingual, but no Czech), sometimes closely matching supervised approaches and generalizing well.
LaBSE — Symmetric embeddings for textual similarity (not query-document!) over 109 languages, trained both in a supervised way (parallel sentences) and an unsupervised way.
ColBERTv2 — SOTA multi-vector learned dense retrieval model, with interesting quantization of residual vectors.
SPLADEv2 — SOTA learned sparse retrieval model.
JPR — Diverse retrieval for multi-answer questions.
Baleen — Multi-hop retrieval for multi-hop questions.
Open-Domain Question Answering
ATLAS — An evidence-grounded LLM (11B) pre-trained without supervision.
REATT — A joint retriever-reader model for both retrieval and LM.
DENSEPHRASES — All potential short answers on Wikipedia are encoded into a gigantic index; the answer is retrieved directly (no reader part!).
Open-Domain Fact-Checking
Claim-Dissector — Our new work on interpretable evidence-grounded fact-checking.
General Model Pretraining
MetaICL — A model pre-trained for learning to learn from context (so-called in-context learning).
LLAMA — A recently released large language model that beats GPT-3/Megatron despite being an order of magnitude smaller.
No links included, IR it out! ☺