1 of 84

ONCE UPON A TIME In The...

2 of 84

ONCE UPON A TIME In The Cozy Afternoon at Masaryk University

3 of 84

ONCE UPON A TIME In The Cozy Afternoon at Masaryk University

Question Answering and Beyond

4 of 84

Prologue:

5 of 84

Prologue:

An Unknown Visitor

6 of 84

Whoami

  1. Senior PhD student from BUT-FIT, supervised by prof. Smrž.
  2. A person fond of question answering, fact checking and basically any open-domain retrieval problem :-).

Web, bio, more info: https://mfajcik.github.io/

7 of 84

Chapter 1:

Introduction

8 of 84

Information Need

?

9 of 84

Information Need

? Information Need

10 of 84

Information Need

? Information Need

When is X's birthday?

I suffer from Y every winter. How to prevent it?

Where to buy skiing equipment?

11 of 84

Information Need

? Information Need

When is X's birthday?

I suffer from Y every winter. How to prevent it?

Where to buy skiing equipment?

What is the information need?

Interact with the world / To Know

12 of 84

Information Need

? Information Need

When is X's birthday?

I suffer from Y every winter. How to prevent it?

Where to buy skiing equipment?

Interact To Know

Retrieve/Record Knowledge

socializing

13 of 84

Information Need

? Information Need

When is X's birthday?

I suffer from Y every winter. How to prevent it?

Where to buy skiing equipment?

Interact To Know

Retrieve/Record Knowledge

socializing

language

14 of 84

Information Need

Traditional Information Retrieval Today

Research Desiderata

  • Provide an answer, if the question requires a factoid answer
  • Provide a summary, if the question requires a summary
  • Provide search results, if the question requires a listing
  • Solve logic problems, if the question requires problem solving
  • Questions are often ambiguous; disambiguate via interaction
  • Make models understand natural language, rather than making humans learn the model's language

15 of 84

Chapter 2:

Information Retrieval

16 of 84

Information Retrieval (IR)

Nguyen, Tri, et al. "MS MARCO: A human generated machine reading comprehension dataset." CoCo@ NIPS. 2016.

Query (often a list of keywords)

Task: distinguish between relevant/irrelevant documents

Term “Information Retrieval” in literature. Example from MSMarco (Nguyen et al. 2016)

17 of 84

Information Retrieval (IR)

Term “Information Retrieval” in literature. Example from MSMarco (Nguyen et al. 2016)

Nguyen, Tri, et al. "MS MARCO: A human generated machine reading comprehension dataset." CoCo@ NIPS. 2016.

Query (often a list of keywords)

Task: distinguish between relevant/irrelevant documents

The labels can be non-binary (relevance scores)

18 of 84

Is Information Retrieval Document Retrieval?

  • Let's brainstorm: how else can we retrieve information?

19 of 84

Question Answering (QA)

  • A set of problems related to drawing conclusions from data (example from MSMarco)

Question (in natural language)

Task: provide Answer

Provided document(s)

20 of 84

Question Answering (QA)

Who is Jožko Mrkvička?

A fictional character in colloquial Slovak, whose name is used to denote an ordinary average citizen

What answer is expected?

  • factoid?
  • open-ended?
  • chit-chat?
  • CODE?

  • no-answer?
  • respond with clarifying question to ambiguous question?

What is question asking about?

  • facts?
  • open-ended?
  • chit-chat?
  • math?
  • multi-answer/multihop question?

21 of 84

Extractive QA

Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016, January). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP.

Exact Match measures the percentage of predictions that match at least one of the ground truth answers exactly

22 of 84

Extractive QA

Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016, January). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP.

(macro)F1 measures the average overlap between the prediction and ground truth answer.

Prediction and ground truth are treated as bags of tokens and their F1 is computed.

Usually a maximum F1 over all of the ground truth answers for a given question is taken, and the result is an average over all of the questions.
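
A minimal Python sketch of how these two metrics are typically computed (simplified: the official SQuAD evaluation script additionally strips punctuation and articles during normalization; the example answers below are illustrative, not from the slides):

from collections import Counter

def normalize(text):
    # Simplified normalization: lowercase + whitespace tokenization.
    # (The official SQuAD script also removes punctuation and articles.)
    return text.lower().split()

def exact_match(prediction, gold_answers):
    # 1.0 if the prediction matches at least one ground-truth answer exactly.
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))

def f1(prediction, gold_answers):
    def f1_single(pred, gold):
        pred_toks, gold_toks = normalize(pred), normalize(gold)
        common = Counter(pred_toks) & Counter(gold_toks)   # bag-of-tokens overlap
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
        return 2 * precision * recall / (precision + recall)
    # Maximum over all ground-truth answers for this question;
    # the reported score is then averaged over all questions.
    return max(f1_single(prediction, g) for g in gold_answers)

print(exact_match("Denver Broncos", ["Denver Broncos", "Broncos"]))  # 1.0
print(round(f1("the Denver Broncos", ["Denver Broncos"]), 3))        # 0.8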

24 of 84

When Document Retrieval meets QA

25 of 84

Open-domain QA
Brief Business Motivation


26 of 84

QA vs Fact-Checking

Figure inspired by Elior Sulem, Jamaal Hay, and Dan Roth. 2022. Yes, no or IDK: The challenge of unanswerable yes/no questions. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1075–1085, Seattle, United States. Association for Computational Linguistics.

Yes/No Question (Y/N), Closed-domain Extractive QA (CD), A fact to be verified (FACT)

27 of 84

Chapter 3:

Introduction to BM25

28 of 84

Retrieval

Corpus

Very large: millions/billions of documents

Retrieval

Ranking

29 of 84

Retrieval via TF-IDF

Schütze, Hinrich, Christopher D. Manning, and Prabhakar Raghavan. Introduction to information retrieval. Vol. 39. Cambridge: Cambridge University Press, 2008.

Standard TF-IDF works reasonably well for retrieval!

30 of 84

Retrieval via TF-IDF

For a query Q := q1 q2 ... qn and a document D := w1 w2 ... wm, we compute the score from the overlapping terms as follows:

same quantity
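
The scoring formula itself was a figure on the original slide; a standard reconstruction of the tf-idf retrieval score (a sketch, the slide's exact notation may differ) is:

\[ \mathrm{score}(Q, D) \;=\; \sum_{t \in Q \cap D} \mathrm{tf}(t, D) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) \;=\; \log \frac{N}{\mathrm{df}(t)}, \]

where N is the number of documents in the corpus and df(t) is the number of documents containing term t.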

31 of 84

Retrieval via TF-IDF

How to implement?

Bonus: Check out tf-idf implementation in DrQA

For a query Q := q1 q2 ... qn and a document D := w1 w2 ... wm, we compute the score from the overlapping terms as follows:
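
A minimal sketch of one possible implementation using a sparse term-document matrix (illustrative only, not DrQA's actual code; the documents and query below are made up):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "BM25 is a ranking function used by search engines",
    "TF-IDF weights terms by their frequency and rarity",
    "Dense passage retrieval encodes text with neural networks",
]

# Build a sparse (n_docs, n_terms) matrix of tf-idf weights.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)

def retrieve(query, k=2):
    # Score documents by the dot product between query and document tf-idf vectors.
    query_vec = vectorizer.transform([query])
    scores = (doc_matrix @ query_vec.T).toarray().ravel()
    top = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in top]

print(retrieve("tf-idf ranking"))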

32 of 84

Building BM25 Retrieval

  1. [Query term importance in the document] Pick a function which increases monotonically with tf but rises slowly and, unlike raw tf, asymptotically approaches (saturates at) some value.

this is term frequency in document D

saturation parameter
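
The curve plotted on the slide can be reconstructed as the standard BM25 saturation function (a sketch):

\[ f(\mathrm{tf}) \;=\; \frac{\mathrm{tf}}{\mathrm{tf} + k}, \qquad k > 0, \]

which grows monotonically with tf but saturates at 1; a larger saturation parameter k slows the saturation, i.e. it controls how much extra credit repeated occurrences of a query term earn.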

33 of 84

Building BM25 Retrieval

  1. [Query term importance in the document] Pick a function which increases monotonically with tf but rises slowly and, unlike raw tf, asymptotically approaches (saturates at) some value.
  2. [Overall term importance] For every term w, pick a weight Ww expressing the term's overall importance (e.g., the old-school Ww = IDFw).

34 of 84

Building BM25 Retrieval

  1. [Query term importance in the document] Pick a function which increases monotonically with tf but rises slowly and, unlike raw tf, asymptotically approaches (saturates at) some value.
  2. [Overall term importance] For every term w, pick a weight Ww expressing the term's overall importance (e.g., the old-school Ww = IDFw).
  3. [Fix long-document bias] Alleviate the long-document bias present in certain collections by penalizing overly long documents.

  • Some authors are simply more verbose than others, using more words to say the same thing.
  • This creates a bias in our model: long documents which say the same thing are preferred over short documents, as they achieve higher tfs on average.
  • An obvious solution is to divide tfs by the document length.

Some authors have more to say: they may write a single document containing or covering more ground. An extreme version would have the author writing two or more documents and concatenating them.

Hypothesis A

Hypothesis B

“My beagle dog is a great beagle. Beagle is great.” ~ “I own a beagle.”

35 of 84

Building BM25 Retrieval

  1. [Query term importance in the document] Pick a function which increases monotonically with tf but rises slowly and, unlike raw tf, asymptotically approaches (saturates at) some value.
  2. [Overall term importance] For every term w, pick a weight Ww expressing the term's overall importance (e.g., the old-school Ww = IDFw).
  3. [Fix long-document bias] Alleviate the long-document bias present in certain collections by penalizing overly long documents.

current document’s length

average document length in corpus

soft constraint to cover both hypotheses
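
The annotated formula is the soft length normalizer; a standard reconstruction is:

\[ B \;=\; (1 - b) \;+\; b \cdot \frac{dl}{avgdl}, \qquad 0 \le b \le 1, \]

where dl is the current document's length and avgdl the average document length in the corpus; the term frequency is then divided by B. Setting b = 1 fully normalizes by relative length (the verbosity hypothesis), b = 0 switches normalization off (the scope hypothesis), and intermediate values give the soft constraint covering both.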

36 of 84

Building BM25 Retrieval

  1. [Query term importance in the document] Pick a function which increases monotonically with tf but rises slowly and, unlike raw tf, asymptotically approaches (saturates at) some value.
  2. [Overall term importance] For every term w, pick a weight Ww expressing the term's overall importance (e.g., the old-school Ww = IDFw).
  3. [Fix long-document bias] Alleviate the long-document bias present in certain collections by penalizing overly long documents.

BM-25 Formula
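
The formula itself was a figure; assembling steps 1-3 with the symbols from the previous slides gives the standard BM25 scoring function (a reconstruction):

\[ \mathrm{score}(Q, D) \;=\; \sum_{w \in Q} W_w \cdot \frac{(k + 1)\,\mathrm{tf}_{w,D}}{\mathrm{tf}_{w,D} + k \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}, \qquad W_w = \mathrm{IDF}_w, \]

where the (k + 1) factor in the numerator only rescales the score so that a single occurrence of a term in an average-length document contributes exactly Ww.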

37 of 84

Building BM25 Retrieval

  1. [Query term importance in the document] Pick a function which increases monotonically with tf but rises slowly and, unlike raw tf, asymptotically approaches (saturates at) some value.
  2. [Overall term importance] For every term w, pick a weight Ww expressing the term's overall importance (e.g., the old-school Ww = IDFw).
  3. [Fix long-document bias] Alleviate the long-document bias present in certain collections by penalizing overly long documents.
  4. Robertson & Zaragoza (2009) recommend the hyperparameter settings 0.5 < b < 0.8 and 1.2 < k < 2.


Robertson, Stephen, and Hugo Zaragoza. "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval 3.4 (2009): 333-389.
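
A minimal, self-contained sketch of the scoring function above with hyperparameters in the recommended range (illustrative only; production systems score through an inverted index, e.g. Lucene/Anserini, rather than looping over every document):

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl, k=1.5, b=0.75):
    # k: term-frequency saturation, b: length-normalization strength.
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for w in query_terms:
        if w not in tf:
            continue
        # A common smoothed IDF variant used with BM25.
        idf = math.log(1 + (n_docs - doc_freqs[w] + 0.5) / (doc_freqs[w] + 0.5))
        denom = tf[w] + k * (1 - b + b * dl / avgdl)
        score += idf * (k + 1) * tf[w] / denom
    return score

# Toy corpus and query (made up for illustration).
docs = [d.lower().split() for d in [
    "beagle dogs are great dogs",
    "i own a beagle",
    "skiing equipment shops open in winter",
]]
doc_freqs = Counter(w for d in docs for w in set(d))
avgdl = sum(len(d) for d in docs) / len(docs)
query = "great beagle".split()
print([round(bm25_score(query, d, doc_freqs, len(docs), avgdl), 3) for d in docs])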

38 of 84

Chapter 4:

Question Answering

39 of 84

Selective QA

Evaluation: Standard multiclass classification metrics (Accuracy, F1, MCC)

40 of 84

Extractive QA

Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016, January). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP.

(macro)F1 measures the average overlap between the prediction and ground truth answer.

Prediction and ground truth are treated as bags of tokens and their F1 is computed.

Usually a maximum F1 over all of the ground truth answers for a given question is taken, and the result is an average over all of the questions.

41 of 84

Abstractive QA

Task: Answer question from the story

Evaluation via Traditional NLG metrics

BLEU-4, ROUGE-L, Meteor

42 of 84

A simple extractive QA system A

Given question Q and document D

find the answer span <a_start, a_end>

Estimate parameters via maximum likelihood estimation

43 of 84

A simple extractive QA system: Decoding

Given question Q and document D

find the answer span <a_start, a_end>

Estimate parameters via maximum likelihood estimation
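
A minimal sketch of the usual decoding rule, assuming the model has already produced per-token start and end scores (the names below are illustrative, not from the slides): pick the span <a_start, a_end> with the highest combined score, subject to a_start <= a_end and a maximum span length.

import numpy as np

def decode_span(start_scores, end_scores, max_len=30):
    # Return (a_start, a_end) maximizing start_scores[s] + end_scores[e]
    # subject to s <= e < s + max_len (scores in log space, independence assumed).
    best, best_span = -np.inf, (0, 0)
    for s in range(len(start_scores)):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = start_scores[s] + end_scores[e]
            if score > best:
                best, best_span = score, (s, e)
    return best_span

# Toy usage with random scores for a 12-token passage.
rng = np.random.default_rng(0)
print(decode_span(rng.normal(size=12), rng.normal(size=12)))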

44 of 84

A simple extractive QA system B

Given question Q and document D

find the answer span <a_start, a_end>

Estimate parameters via maximum likelihood estimation

img source: Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL-HLT. 2019.

45 of 84

The Objective

  • Cross-entropy objective for extractive question answering
    • given question q
    • passage (or a set of passages) D
    • answer represented by start/end positions a_s/a_e

Do we need to assume independence?

Assumption on Independence (Xiong et al., 2017; Seo et al., 2017; Chen et al., 2017; Yu et al., 2018; Devlin et al., 2019; Cheng et al., 2020; inter alia)

No, we can compute the joint objective directly with similar complexity, and it “works better” (Fajcik et al., 2021)

Fajcik, Martin, Josef Jon, and Pavel Smrz. "Rethinking the Objectives of Extractive Question Answering." In Proceedings of the 3rd Workshop on Machine Reading for Question Answering. 2021.
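
Sketched in formulas (my notation, not necessarily the papers'): with the independence assumption the span probability factorizes into two per-token softmaxes, whereas the joint objective normalizes over span pairs directly:

\[ P_{\mathrm{indep}}(a_s, a_e \mid q, D) \;=\; P(a_s \mid q, D)\, P(a_e \mid q, D), \]
\[ P_{\mathrm{joint}}(a_s, a_e \mid q, D) \;=\; \frac{\exp s(a_s, a_e)}{\sum_{i \le j} \exp s(i, j)}, \]

where s(i, j) is a score assigned to the candidate span running from position i to position j.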

46 of 84

Open-domain QA
MOTIVATION #1: Research-wise

  1. Dense neural passage retrieval has “just” started to work (Lee et al., 2019; Guu et al., 2020; Karpukhin et al., 2020; Khattab et al., 2020; Izacard et al., 2020)
  2. Open-domain QA is easy to annotate; all you need is questions and answers.
  3. Closed-domain QA in some cases already works “very well”. Human performance has been surpassed on SQuAD v1.1 and SQuAD v2.0 (Rajpurkar et al., 2016, 2018) and CoQA (Reddy et al., 2018)

Almost any NLP task can be framed as question answering!

47 of 84

Open-domain QA
MOTIVATION #2: Information retrieval in everyday life

  • Search needs a shake-up (Etzioni, 2011)

“Academics and industry researchers need to achieve the intellectual ‘escape velocity’ necessary to revolutionize search. They must invest much more in bold strategies that can achieve natural-language searching and answering, rather than providing the electronic equivalent.”

“Moving up the information food chain requires a search engine that can interpret a user's question, extract facts from all the information on the web, and select �an appropriate answer.”

Keyword searching

Etzioni, Oren. "Search needs a shake-up." Nature 476.7358 (2011): 25-26.

48 of 84

Example of traditional approach

Retriever

Extractive reader

49 of 84

Example of traditional approach: Reader

BM25 Negative Document P1-

Document P1+

BM25 Negative Document P2-

Document Pn

Jožko Mrkvička is a fictional character in colloquial Slovak (but also journalistic style), whose name is used to denote the average citizen, or as an implicit name in examples of the textbook type. It is not associated with any negative or positive qualities (such as the English John Bull), nor is it derived from any truly existing character, nor is it the object of fabulations to give it a semblance of historical authenticity (such as the Czech Jára Cimrman) … writer Mária Ďuríčková (1919) for the main character in her book Jožko Mrkvička Spáč (1972). [1]

From retriever

Extractive reader

Document Pn

Jožko Mrkvička is a fictional character in colloquial Slovak (but also journalistic style), whose name is used to denote the average citizen, or as an implicit name in examples of the textbook type. It is not associated with any negative or positive qualities (such as the English John Bull), nor is it derived from any truly existing character, nor is it the object of fabulations to give it a semblance of historical authenticity (such as the Czech Jára Cimrman) … writer Mária Ďuríčková (1919) for the main character in her book Jožko Mrkvička Spáč (1972). [1]

  • In the current literature, each document is usually processed separately via a language representation model (e.g., BERT).

50 of 84

Maximum Marginal Likelihood

  • In open-domain QA, we often do not know which answer span is correct and which is not

Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.

51 of 84

Maximum Marginal Likelihood

  • In open-domain QA, we often do not know which answer span is correct and which is not
  • Solution? Marginalize over all spans with the correct surface form and let the model decide
  • Formally:
  • in the fully supervised setting, we are given an input x and an answer span z = (a_s, a_e); our NLL objective for one sample is -log P(z | x)

  • img source: Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. A Discrete Hard EM Approach for Weakly Supervised Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2851–2864, Hong Kong, China. Association for Computational Linguistics.

52 of 84

Maximum Marginal Likelihood

  • In open-domain QA, we often do not know which answer span is correct and which is not
  • Solution? Marginalize over all spans with the correct surface form and let the model decide
  • Formally:
  • in the fully supervised setting, we are given an input x and an answer span z = (a_s, a_e); our NLL objective for one sample is -log P(z | x). In the weakly supervised setting, we are given an input x and, for the answer string match y, the set of spans Z = {z1, z2, ..., zn}, some of which are correct and some of which are not.
  • Note that Z is a subset of Ztot, the set of all spans in the document(s); y is the answer string match

  • img source: Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. A Discrete Hard EM Approach for Weakly Supervised Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2851–2864, Hong Kong, China. Association for Computational Linguistics.

53 of 84

Maximum Marginal Likelihood

  • In open-domain QA, we often do not know which answer span is correct and which is not
  • Solution? Marginalize over all spans with the correct surface form and let the model decide
  • Formally:
  • in the fully supervised setting, we are given an input x and an answer span z = (a_s, a_e); our NLL objective for one sample is -log P(z | x). In the weakly supervised setting, we are given an input x and, for the answer string match y, the set of spans Z = {z1, z2, ..., zn}, some of which are correct and some of which are not.
  • Note that Z is a subset of Ztot, the set of all spans in the document(s); y is the answer string match

0 if zi is not from Z, 1 otherwise

  • img source: Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. A Discrete Hard EM Approach for Weakly Supervised Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2851–2864, Hong Kong, China. Association for Computational Linguistics.

55 of 84

Maximum Marginal Likelihood

  • In open-domain QA, we often do not know which answer span is correct and which is not
  • Solution? Marginalize over all spans with the correct surface form and let the model decide
  • Formally:
  • in the fully supervised setting, we are given an input x and an answer span z = (a_s, a_e); our NLL objective for one sample is -log P(z | x). In the weakly supervised setting, we are given an input x and, for the answer string match y, the set of spans Z = {z1, z2, ..., zn}, some of which are correct and some of which are not.
  • Note that Z is a subset of Ztot, the set of all spans in the document(s); y is the answer string match

This is a so-called “latent variable model” with latent variable vi. Remember GMM!

0 if zi is not from Z, 1 otherwise
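
Reconstructed from the annotations above (the exact formulas were figures on the slides), with the indicator v_i = 1 if z_i is in Z and v_i = 0 otherwise:

\[ \mathcal{L}_{\mathrm{supervised}} \;=\; -\log P(z \mid x) \quad \text{for the gold span } z, \]
\[ \mathcal{L}_{\mathrm{MML}} \;=\; -\log \sum_{z_i \in Z_{tot}} v_i \, P(z_i \mid x) \;=\; -\log \sum_{z \in Z} P(z \mid x), \]

and the Hard-EM alternative of Min et al. (2019) replaces the sum with a maximum: -log max_{z in Z} P(z | x).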

  • img source: Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. A Discrete Hard EM Approach for Weakly Supervised Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2851–2864, Hong Kong, China. Association for Computational Linguistics.

56 of 84

MML in Open-domain QA

Model

Question

1st passage

Question

2nd passage

Question

n-th passage

. . .

Model

Model

Passage representations

Passage representations

Passage representations

d

seq_len

Softmax

Linear

Linear

Linear

Loss for 1 sample

Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.

57 of 84

MML in Open-domain QA

Model

Question

1st passage

Question

2nd passage

Question

n-th passage

. . .

Model

Model

Passage representations

Passage representations

Passage representations

d

seq_len

Softmax

Linear

Linear

Linear

Loss for 1 sample

Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.

Important for cross-passage answer score calibration!
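
A minimal PyTorch sketch of the shared-normalization trick the figure describes (tensor shapes and names are illustrative, not the exact implementation of Clark & Gardner, and start/end positions are treated independently for brevity): the start and end logits of all passages for one question are flattened and softmax-normalized jointly, then marginalized (MML) over every position that matches the gold answer string.

import torch
import torch.nn.functional as F

def mml_loss(start_logits, end_logits, start_mask, end_mask):
    # start_logits / end_logits: (n_passages, seq_len) for ONE question.
    # start_mask / end_mask: 1.0 at every position that matches the gold answer string.
    # The softmax runs over ALL passages jointly (cross-passage calibration),
    # and we then marginalize (MML) over all matching positions.
    log_p_start = F.log_softmax(start_logits.reshape(-1), dim=0)
    log_p_end = F.log_softmax(end_logits.reshape(-1), dim=0)
    ll_start = torch.logsumexp(log_p_start[start_mask.reshape(-1).bool()], dim=0)
    ll_end = torch.logsumexp(log_p_end[end_mask.reshape(-1).bool()], dim=0)
    return -(ll_start + ll_end)

# Toy usage: 3 passages of 8 tokens; the answer string matches two spans.
torch.manual_seed(0)
start_logits, end_logits = torch.randn(3, 8), torch.randn(3, 8)
start_mask, end_mask = torch.zeros(3, 8), torch.zeros(3, 8)
start_mask[0, 2] = start_mask[2, 5] = 1.0
end_mask[0, 3] = end_mask[2, 6] = 1.0
print(float(mml_loss(start_logits, end_logits, start_mask, end_mask)))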

58 of 84

Do we need to use extractive models?

  • IDEA: generate the answer with a language model

59 of 84

T5

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.

  • Seq-2-seq, encoder-decoder (unlike BERT)
  • subword language units
  • trained on a denoising objective and ~25 supervised tasks
  • 750 GB of CommonCrawl data

60 of 84

Idea #1: „Concatenate, pass, profit“

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.

Concatenate!

Question + Passage 1 + Passage 2 + Passage 3 …

[Answer]

Drawbacks?

61 of 84

Idea #1: „Concatenate, pass, profit“

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.

Concatenate!

Question + Passage 1 + Passage 2 + Passage 3 …

[Answer]

1. Memory complexity

2. Decoding: if we decode without restrictions, the model might generate something not present in the text

62 of 84

Idea #2: Processing passages jointly: Fusion-in-Decoder

  • Do we need to read every passage independently?
  • No, we can actually allow inter-passage interaction learning!
  • Example: Fusion-in-Decoder (FiD): encode every passage separately, decode jointly (see the sketch below)
  • The trick works well with pre-trained models (T5)!
  • Can process very long inputs (e.g., 100 passages (context size) x 200 tokens (passage length) = 20,000 tokens)
  • Optimize the target answer via the standard language modeling loss (cross-entropy)

Izacard, Gautier, and Édouard Grave. "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering." Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.
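
A conceptual PyTorch sketch of the FiD trick (illustrative shapes and randomly initialized layers, not the actual T5-based implementation; teacher-forcing masks are omitted for brevity): each (question + passage) pair is encoded independently, the encoder outputs are concatenated along the sequence axis, and a single decoder cross-attends over the fused sequence while being trained with the standard cross-entropy LM loss.

import torch
import torch.nn as nn

d_model, n_passages, seq_len, tgt_len, vocab = 64, 4, 32, 8, 1000

embed = nn.Embedding(vocab, d_model)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
lm_head = nn.Linear(d_model, vocab)

# One question paired with n_passages retrieved passages (random token ids here;
# in FiD, the question is prepended to every passage).
passages = torch.randint(0, vocab, (n_passages, seq_len))
answer = torch.randint(0, vocab, (1, tgt_len))

# 1) Encode each passage independently: attention cost grows linearly in n_passages.
enc_out = encoder(embed(passages))                         # (n_passages, seq_len, d_model)

# 2) Fuse in the decoder: concatenate encoder outputs along the sequence axis.
fused = enc_out.reshape(1, n_passages * seq_len, d_model)

# 3) Decode the answer jointly, cross-attending over all passages at once.
dec_out = decoder(embed(answer), memory=fused)             # (1, tgt_len, d_model)
logits = lm_head(dec_out)

# Standard LM cross-entropy against the target answer tokens.
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), answer.reshape(-1))
print(float(loss))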

63 of 84

Fusing the extractive and generative approaches

  • Our past work: Rank twice, reaD twice (R2-D2)
  • https://r2d2.fit.vutbr.cz/
  • Some demo details:
    • The search is done in the “popular” 8% of Wikipedia
    • Only factoid answers, up to 6 words
    • Wikipedia from Dec 2018 is used
  • Martin Fajcik, Martin Docekal, Karel Ondrej, and Pavel Smrz. 2021. R2-D2: A Modular Baseline for Open-Domain Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 854–870, Punta Cana, Dominican Republic. Association for Computational Linguistics.

64 of 84

Fusing the extractive and generative approaches

  • Why is the search done in the “popular” 8% of Wikipedia?
    • We’ve shown we can remove 92% of the index for the two most popular open-domain QA datasets, NaturalQuestions and TriviaQA, while losing only up to 3% absolute performance on the test set.
    • How? We trained a classifier which, given a passage, tries to say a priori (without seeing any question) whether the passage is relevant or not.
    • Could the same “pruning” mechanism be implicitly present in modern supervised neural retrieval approaches?
      • Wait for the release of my PhD thesis ☺

Fajcik, M., Docekal, M., Ondrej, K. and Smrz, P., 2021. Pruning the index contents for memory efficient open-domain QA. arXiv preprint arXiv:2102.10697.

65 of 84

Chapter 5:

‘23 Trends

66 of 84

Is QA “solved” by LLMs such as ChatGPT/GPT-4?

There is no definite answer, but we can do what every good scientist should. Hypothesize…

Warning:

  • The subsequent slides are subjective and draw takeaways from simple case-study observations.
  • These observations are not (yet) fully quantified with scientific evidence.

67 of 84

Is QA “solved” by LLMs such as ChatGPT/GPT-4?

There is no definite answer, but we can do what every good scientist should. Hypothesize…

  1. Yes because…
    1. Large LLMs have extensive factual knowledge.
    2. LLMs can present answers excellently!

68 of 84

Is QA “solved” by LLMs such as ChatGPT/GPT-4?

There is no definite answer, but we can do what every good scientist should. Hypothesize…

  1. Maybe because…
    1. Large LLMs can lie excellently. This kind of problem is called “hallucination”.

69 of 84

Is QA “solved” by LLMs such as ChatGPT/GPT-4?

There is no definite answer, but we can do what every good scientist should. Hypothesize…

  1. No because
    1. Large LLMs cannot explain themselves.

70 of 84

Is QA “solved” by LLMs such as ChatGPT/GPT-4?

There is no definite answer, but we can do what every good scientist should. Hypothesize…

  1. No because
    1. LLMs are competitive, but do not outperform task-specific models.

Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung, W. and Do, Q.V., 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.

71 of 84

Is QA “solved” by Retrieval-Augmented LLMs?

There is no definite answer, but we can do what every good scientist should. Hypothesize…

  1. Yes because
    1. all responses with factoid answers are grounded.

72 of 84

Is QA “solved” by Evidence-grounded LLMs?

There is no definite answer, but we can do what every good scientist should. Hypothesize…

  1. Maybe because
    1. Evidence-grounded models still suffer from hallucination.

Query: What are the pros and cons of the top 3 selling pet vacuums?

Dmitri Brereton, “Bing AI Can’t Be Trusted”, https://dkb.blog/p/bing-ai-cant-be-trusted

73 of 84

Is QA “solved” by Evidence-grounded LLMs?

Query: What are the pros and cons of the top 3 selling pet vacuums?

“This is all completely made up information.

Bing AI was kind enough to give us its sources, so we can go to the hgtv article and check for ourselves.

The cited article says nothing about limited suction power or noise. In fact, the top amazon review for this product talks about how quiet it is.

The article also says nothing about the “short cord length of 16 feet” because it doesn’t have a cord. It’s a portable handheld vacuum.”

Dmitri Brereton, “Bing AI Can’t Be Trusted”, https://dkb.blog/p/bing-ai-cant-be-trusted

74 of 84

Is QA “solved” by Evidence-grounded LLMs?

Liu, Nelson F., Tianyi Zhang, and Percy Liang. "Evaluating Verifiability in Generative Search Engines." arXiv preprint arXiv:2304.09848 (2023).

Evidence-grounded models still suffer from hallucination and bad evidence grounding.

Data from very recent preprint (19 Apr)

Citation recall is the proportion of verification-worthy statements that are fully supported by their associated citations.

Citation precision is the proportion of generated citations that support their associated statements.

“We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations: on average, a mere 51.5% of generated sentences are fully supported by citations (citation recall) and only 74.5% of citations support their associated sentence (citation precision). We believe that these results are concerningly low for systems that may serve as a primary tool for information seeking users, especially given their facade of trustworthiness.”

75 of 84

Is QA “solved” by Evidence-grounded LLMs?

There is no definite answer, but we can do what every good scientist should. Hypothesize…

  1. Maybe because
    1. Evidence-grounded models still suffer from hallucination.
    2. LLMs still cannot solve logic well.

Figure source: Shakarian, Paulo, et al. "An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP)." arXiv preprint arXiv:2302.13814 (2023).

77 of 84

Epilogue: Takeaways

78 of 84

Takeaways: QA

  • Question Answering, Document Retrieval, Fact-Checking, Entity Disambiguation, Multimodal Retrieval: all of this is information retrieval.
  • Closed-domain QA works well, especially on popular topics (sport, history, TV shows). Bio/scientific domains, math, and technical jargon are still largely unsolved.
  • Extractive QA can be tackled with answer start/end probability estimation
  • Open-domain QA needs to deal with multi-passage processing, using methods such as MML and cross-passage normalization.
  • Be sure to check out the Czech QA dataset from MU: SQAD (Medveď and Horák, 2014).

79 of 84

Takeaways: QA

  • Question Answering, Document Retrieval, Fact-Checking, Entity Disambiguation, Multimodal Retrieval: all of this is information retrieval.
  • Closed-domain QA works well, especially on popular topics (sport, history, TV shows). Bio/scientific domains, math, and technical jargon are still largely unsolved.
  • Extractive QA can be tackled with answer start/end probability estimation
  • Open-domain QA needs to deal with multi-passage processing, using methods such as MML and cross-passage normalization.
  • Be sure to check out the Czech QA dataset from MU: SQAD (Medveď and Horák, 2014).

Medved, Marek, and Ales Horák. "SQAD: Simple Question Answering Database." RASLAN. 2014.

83 of 84

Takeaways: Document Retrieval

  • BM25 is a “fairly popular” baseline from “classic IR” in production-ready systems today. With standard BM25, one has two hyperparameters to control:
    • (a) term saturation
    • (b) long-document bias

84 of 84

More Recent Directions, Literature, etc.

Neural Document Retrieval

Contriever/mContriever — Unsupervisedly pretrained dense retrieval (also multilingual, but no Czech), sometimes closely matching supervised approaches and generalizing well.

LaBSE — Symmetric embeddings for textual similarity (!not query-document) over 109 languages, trained in both a supervised way (parallel sentences) and an unsupervised way.

ColBERTv2 — SOTA multi-vector learned dense retrieval model, with interesting quantization of residual vectors.

SPLADEv2 — SOTA learned sparse retrieval model.

JPR — Diverse retrieval for multi-answer questions.

Baleen — Multi-hop retrieval for multi-hop questions.

Open-Domain Question Answering

ATLAS — Unsupervisedly pre-trained evidence-grounded LLM (11B).

REATT — A joint retrieval-reader model for both retrieval and LM.

DENSEPHRASES — All potential short answers on Wikipedia are encoded into a gigantic index; the answer is retrieved directly (no reader part!).

Open-Domain Fact-Checking

Claim-Dissector — Our new work on interpretable evidence-grounded fact-checking.

General Model Pretraining

MetaICL — A model pre-trained for learning to learn from context (so-called in-context learning).

LLAMA — Recently released large language model that beats GPT-3/Megatron despite being an order of magnitude smaller.

No links included, IR it out! ☺