1 of 66

Reading Comprehension

Human Language Technologies

Giuseppe Attardi

Università di Pisa

2 of 66

Question Answering: a Taxonomy

  • What information source does a system build on?
    • A text passage, all Web documents, knowledge bases, tables, images..
  • Question type
    • Factoid vs non-factoid, open-domain vs closed-domain, simple vs compositional, ..
  • Answer type
    • A short segment of text, a paragraph, a list, yes/no, ...

3 of 66

Practical applications


Slide by Danqi Chen

4 of 66

Practical applications


Slide by Danqi Chen

5 of 66

IBM Watson beat Jeopardy! champions


Slide by Danqi Chen

6 of 66

IBM Watson beat Jeopardy! champions


Image credit: J & M, edition 3

(1) Question processing, (2) Candidate answer generation, (3) Candidate answer scoring, and

(4) Confidence merging and ranking.

7 of 66

Question answering in deep learning era


Image credit: (Lee et al., 2019)

Most state-of-the-art question answering systems are built on top of end-to-end training and pre-trained language models (e.g., BERT)!

Slide by Danqi Chen

8 of 66

Differences between QA tasks

  • Open Domain Question Answering
    • Relies on external memory consisting of large corpora of documents, which may be preprocessed to build IR systems (inverted index, ranking)
  • Answer Selection
    • Concentrates on choosing on the fly the most likely passage containing the answer from a small given list
  • Reading Comprehension
    • Find the answer within a given paragraph (no preprocessing, no memory)
  • Dialog Systems
    • Must provide answers to questions and also keep memory of dialog context, remembering previous statements.
    • May rely on OpenDomain QA for answering general questions.
  • Inferential QA
    • Provide an answer that requires inference from knowledge sources

9 of 66

Brief History of Reading Comprehension

  • Early NLP work attempted reading comprehension
    • Schank, Abelson, Lehnert et al. c. 1977 – “Yale A.I. Project”
  • Revived by Lynette Hirschman in 1999:
    • Could NLP systems answer human reading comprehension questions for 3rd to 6th graders? Simple methods attempted.
  • Revived again by Chris Burges in 2013 with MCTest
    • Again answering questions over simple story texts
  • Floodgates opened in 2015/16 with the production of large datasets which allow training supervised neural systems
    • Hermann et al. (NIPS 2015) DeepMind CNN/DM dataset
    • Rajpurkar et al. (EMNLP 2016) SQuAD
    • MS MARCO, TriviaQA, RACE, NewsQA, NarrativeQA, …

Slide by C. Manning

10 of 66

Machine Comprehension (Burges 2013)

“A machine comprehends a passage of text if, for any question regarding that text that can be answered correctly by a majority of native speakers, that machine can provide a string which those speakers would agree both answers that question, and does not contain information irrelevant to that question.”

11 of 66

MCTest Reading Comprehension

Alyssa got to the beach after a long trip. She's from Charlotte.

She travelled from Atlanta. She's now in Miami.

She went to Miami to visit some friends.

But she wanted some time to herself at the beach, so she went there first.

After going swimming and laying out, she went to her friend Ellen's house.

Ellen greeted Alyssa and they both had some lemonade to drink. Alyssa called her friends Kristin and Rachel to meet at Ellen's house…….

Why did Alyssa go to Miami? To visit some friends

Passage (P)

Question (Q)

Answer (A)

Slide by C. Manning


12 of 66

Why do we care about this problem?

  • Useful for many practical applications
  • Reading comprehension is an important testbed for evaluating how well computer systems understand human language
    • Wendy Lehnert 1977: “Since questions can be devised to query any aspect of text comprehension, the ability to answer questions is the strongest possible demonstration of understanding.”
  • Many other NLP tasks can be reduced to a reading comprehension problem:


Information extraction

(Barack Obama, educated_at, ?)

Question: Where did Barack Obama graduate from?

Passage: Obama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago.

(Levy et al., 2017)

Semantic role labeling

(He et al., 2015)
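As an illustration (not from the slides), the reduction can be as simple as a question template per relation fed to a reading-comprehension model. The sketch below uses the transformers question-answering pipeline; the template dictionary is a hypothetical example and the library (with a default SQuAD-fine-tuned model) is assumed to be installed.

from transformers import pipeline

qa = pipeline("question-answering")      # defaults to a SQuAD-fine-tuned model

# Hypothetical question template per KB relation (Levy et al., 2017 style reduction)
templates = {"educated_at": "Where did {} graduate from?"}

passage = ("Obama was born in Honolulu, Hawaii. After graduating from Columbia "
           "University in 1983, he worked as a community organizer in Chicago.")
question = templates["educated_at"].format("Barack Obama")
print(qa(question=question, context=passage))   # expected answer span: "Columbia University"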

Slide by Danqi Chen

13 of 66

Stanford Question Answering Dataset (SQuAD)

(Rajpurkar et al., 2016): SQuAD: 100,000+ Questions for Machine Comprehension

Slide by C. Manning

  • 100k annotated (passage, question, answer) triples

Large-scale supervised datasets are a key ingredient for training effective neural models for reading comprehension!

  • Passages are selected from English Wikipedia, usually 100~150 words. Questions are crowd-sourced.
  • Each answer is a short span of text in the passage.

This is a limitation: not all questions can be answered in this way!

  • SQuAD is a popular reading comprehension dataset; it is “almost solved” today and the state-of-the-art exceeds the estimated human performance.
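For concreteness, here is a minimal sketch of what a SQuAD example looks like and how it can be loaded with the Hugging Face datasets library (assuming it is installed; field names follow the public SQuAD 1.1 release).

from datasets import load_dataset

squad = load_dataset("squad")            # SQuAD 1.1: 'train' and 'validation' splits
example = squad["train"][0]

# Each example is a (passage, question, answer) triple; the answer is a span,
# given as text plus the character offset where it starts in the context.
print(example["context"][:80], "...")
print(example["question"])
print(example["answers"])                # {'text': [...], 'answer_start': [...]}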

14 of 66

SQuAD 1.1 Evaluation

  • Each question has 3 gold answers collected from different human annotators
  • Systems are scored with two metrics: Exact Match (EM, 0/1 per question) and macro-averaged token-level F1
  • For each question the score is the maximum over the gold answers, then scores are averaged over all questions; punctuation and articles (a, an, the) are ignored
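A minimal sketch of these metrics, following the logic of the official evaluation script (simplified, not the official code):

import re
import string
from collections import Counter

def normalize(s):
    """Lowercase, strip punctuation, articles and extra whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1(prediction, gold):
    pred_toks, gold_toks = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# Take the max over the gold answers, then average over questions.
golds = ["Columbia University", "Columbia"]
print(max(f1("Columbia University in 1983", g) for g in golds))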

Slide by C. Manning

15 of 66

SQuAD Leaderboard (2019/02)

Slide by C. Manning

16 of 66

Other QA Datasets

  • TriviaQA: Questions and answers written by trivia enthusiasts, paired with independently collected web paragraphs that contain the answer and seem to discuss the question, but with no human verification that the paragraph actually supports the answer
  • Natural Questions: Questions drawn from frequently asked Google search queries. Answers come from Wikipedia paragraphs; the answer can be a substring, yes, no, or NOT_PRESENT. Verified by human annotation.
  • HotpotQA: Questions constructed to be answered from the whole of Wikipedia, which require combining information from two pages to answer a multi-step query.
    Q: Which novel by the author of “Armada” will be adapted as a feature film by Steven Spielberg? A: Ready Player One

17 of 66

Neural models for reading comprehension


  • A family of LSTM-based models with attention (2016-2018)

Attentive Reader (Hermann et al., 2015), Stanford Attentive Reader (Chen et al., 2016), Match-LSTM (Wang et al., 2017), BiDAF (Seo et al., 2017), Dynamic Coattention Network (Xiong et al., 2017), DrQA (Chen et al., 2017), R-Net (Wang et al., 2017), ReasoNet (Shen et al., 2017), ...

  • Fine-tuning BERT-like models for reading comprehension (2019+)

Problem formulation: the input is a passage C = (c_1, ..., c_N) and a question Q = (q_1, ..., q_M), with N ≈ 100 and M ≈ 15; the output is a span 1 ≤ start ≤ end ≤ N, i.e. the answer is a span in the passage.

How can we build a model to solve SQuAD?

Slide by Danqi Chen

18 of 66

Stanford Attentive Reader

[Chen, Bolton, & Manning 2016]
[Chen, Fisch, Weston & Bordes 2017] https://arxiv.org/pdf/1704.00051.pdf

DrQA [Chen 2018]

  • Demonstrated a minimal, highly successful architecture for reading comprehension and question answering

Slide by C. Manning

19 of 66

The Stanford Attentive Reader

[Figure: the model takes a Passage (P) and a Question (Q), e.g. “Which team won Super Bowl 50?”, as input and produces an Answer (A) as output, extracted as a span of the passage.]

Slide by C. Manning

20 of 66

The Stanford Attentive Reader

[Figure: the question “Who did Genghis Khan unite before he began conquering the rest of Eurasia?” and the passage P are each encoded with bidirectional LSTMs. Each passage token is represented by its GloVe embedding plus exact-match and aligned-embedding features.]

Slide by C. Manning

21 of 66

The Stanford Attentive Reader

The question is summarized into a single vector q (a weighted sum of its BiLSTM states). For each passage position i with BiLSTM representation p_i, two bilinear attentions over the passage predict the answer span:

P_start(i) ∝ exp(p_i^T W_s q)    (predict start token at position i)
P_end(i) ∝ exp(p_i^T W_e q)    (predict end token at position i)

Slide by C. Manning
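A minimal PyTorch-style sketch of the bilinear start/end prediction described above; tensor shapes and class names are illustrative, not the authors' code.

import torch
import torch.nn as nn

class AttentiveReaderOutput(nn.Module):
    """Bilinear attention over passage states to predict the answer span (sketch)."""
    def __init__(self, hidden):
        super().__init__()
        self.W_start = nn.Linear(hidden, hidden, bias=False)
        self.W_end = nn.Linear(hidden, hidden, bias=False)

    def forward(self, P, q):
        # P: (N, hidden) passage BiLSTM states, q: (hidden,) question summary vector
        start_scores = P @ self.W_start(q)        # (N,): p_i^T W_s q for each position
        end_scores = P @ self.W_end(q)            # (N,): p_i^T W_e q
        return start_scores.softmax(dim=0), end_scores.softmax(dim=0)

# The predicted answer span is (argmax start, argmax end), usually constrained so that end >= start.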

 

 

 

 

22 of 66

SQuAD 1.1 Results (single model, c. Feb 2017)

System                                       F1
Logistic regression                          51.0
Fine-Grained Gating (Carnegie Mellon U)      73.3
Match-LSTM (Singapore Management U)          73.7
DCN (Salesforce)                             75.9
BiDAF (UW & Allen Institute)                 77.3
Multi-Perspective Matching (IBM)             78.7
ReasoNet (MSR Redmond)                       79.4
DrQA (Chen et al. 2017)                      79.4
r-net (MSR Asia) [Wang et al., ACL 2017]     79.7
Google Brain / CMU (Feb 2018)                88.0
Human performance                            91.2

Slide by C. Manning

23 of 66

Slide by C. Manning

24 of 66

Recap: seq2seq model with attention


  • Instead of source and target sentences, we have two sequences: passage and question (lengths are different)

  • We need to model which words in the passage are most relevant to which question words

Attention is the key ingredient here, similar to attention across sentences in machine translation

  • We don’t need an autoregressive decoder to generate the target sentence word-by-word. Instead, we need two classifiers to predict the start and end positions of the answer!

Slide by Danqi Chen

25 of 66

BiDAF: the Bidirectional Attention Flow model

(Seo et al., 2017): Bidirectional Attention Flow for Machine Comprehension

26 of 66

BiDAF: Encoding

  • Use a concatenation of word embedding (GloVe) and character embedding (CNNs over character embeddings) for each word in context and query.

e(c_i) = f([GloVe(c_i); charEmb(c_i)])
e(q_i) = f([GloVe(q_i); charEmb(q_i)])

where f is a highway network (details omitted here).

  • Then, use two bidirectional LSTMs to produce contextual embeddings for context and query.
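A small sketch of this encoding step under the usual BiDAF assumptions: GloVe and char-CNN vectors are concatenated and passed through a highway layer f (a minimal illustrative implementation, not the original code).

import torch
import torch.nn as nn

class Highway(nn.Module):
    """f in the slide: a gated mix of a nonlinear transform and the identity."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))           # how much to transform vs. copy
        return t * torch.relu(self.transform(x)) + (1 - t) * x

# e(c_i) = Highway([GloVe(c_i); charCNN(c_i)]): concatenate the two embeddings, then apply f
glove_dim, char_dim = 300, 100
f = Highway(glove_dim + char_dim)
e_ci = f(torch.cat([torch.randn(glove_dim), torch.randn(char_dim)]))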

Slide by Danqi Chen

 

 

27 of 66

BiDAF: Attention


Context-to-query attention: For each context word, choose the most relevant words from the query words.

(Slides adapted from Minjoon Seo)

28 of 66

BiDAF: Attention


Query-to-context attention: choose the context words that are most relevant to one of the query words.

(Slides adapted from Minjoon Seo)

29 of 66

BiDAF: Attention

The attention is computed from a similarity score for every pair (c_i, q_j):

S_ij = w_sim^T [c_i; q_j; c_i ⊙ q_j]

Context-to-query attention: α_ij = softmax_j(S_ij),  a_i = Σ_j α_ij q_j

Query-to-context attention: β_i = softmax_i(max_j S_ij),  b = Σ_i β_i c_i

The output of the attention layer for each context position is g_i = [c_i; a_i; c_i ⊙ a_i; c_i ⊙ b].

Slide by Danqi Chen
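A sketch of the attention layer just described, for a single (context, question) pair; shapes assume 2h-dimensional BiLSTM states and the variable names are mine, not the paper's.

import torch

def bidaf_attention(C, Q, w_sim):
    # C: (N, 2h) context states, Q: (M, 2h) question states, w_sim: (6h,)
    N, d = C.shape
    M, _ = Q.shape
    # pairwise features [c_i; q_j; c_i * q_j] -> similarity matrix S of shape (N, M)
    Ce = C.unsqueeze(1).expand(N, M, d)
    Qe = Q.unsqueeze(0).expand(N, M, d)
    S = torch.cat([Ce, Qe, Ce * Qe], dim=-1) @ w_sim
    # context-to-query: for each context word, attend over the question words
    alpha = torch.softmax(S, dim=1)                            # (N, M)
    A = alpha @ Q                                              # (N, 2h)
    # query-to-context: attend over context words using the max similarity per word
    beta = torch.softmax(S.max(dim=1).values, dim=0)           # (N,)
    b = beta @ C                                               # (2h,)
    # final representation g_i = [c_i; a_i; c_i * a_i; c_i * b]
    return torch.cat([C, A, C * A, C * b.unsqueeze(0)], dim=-1)   # (N, 8h)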

30 of 66

BiDAF: Modeling and output layers

Modeling layer: pass the g_i through two further layers of bidirectional LSTMs to obtain m_i.

Output layer: two classifiers over passage positions predict the start and end of the answer:

p_start = softmax(w_start^T [g_i; m_i]),  p_end = softmax(w_end^T [g_i; m'_i])

where m'_i is obtained by passing the m_i through another bidirectional LSTM.

Slide by Danqi Chen

 

31 of 66

BiDAF: Performance on SQuAD

This model achieved 77.3 F1 on SQuAD v1.1.

  • Without context-to-query attention: 67.7 F1
  • Without query-to-context attention: 73.7 F1
  • Without character embeddings: 75.4 F1

(Seo et al., 2017): Bidirectional Attention Flow for Machine Comprehension

32 of 66

Attention visualization


Slide by Danqi Chen

33 of 66

LSTM-based vs BERT models


Image credit: (Seo et al., 2017)

Image credit: J & M, edition 3

Slide by Danqi Chen

34 of 66

BERT for reading comprehension


  • BERT is a deep bidirectional Transformer encoder pre-trained on large amounts of text (Wikipedia + BooksCorpus)
  • BERT is pre-trained on two training objectives:
    • Masked language model (MLM)
    • Next sentence prediction (NSP)
    • BERT-base has 12 layers and 110M parameters, BERT-large has 24 layers and 340M parameters

35 of 66

BERT for reading comprehension


Question = Segment A

Passage = Segment B

Answer = predicting two endpoints (start and end) in segment B:

p_start(i) = softmax_i(w_start · h_i),  p_end(i) = softmax_i(w_end · h_i)

where H = [h_1, h_2, ..., h_N] are the hidden vectors of the paragraph tokens returned by BERT.

Image credit: https://mccormickml.com/

Slide by Danqi Chen
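In practice this takes only a few lines with the transformers library (assuming it is installed); the sketch below uses a publicly available SQuAD-fine-tuned checkpoint rather than fine-tuning from scratch.

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

name = "distilbert-base-uncased-distilled-squad"   # any SQuAD-fine-tuned checkpoint works
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "Which team won Super Bowl 50?"
passage = ("Super Bowl 50 was an American football game ... "
           "The Denver Broncos defeated the Carolina Panthers 24-10.")

inputs = tokenizer(question, passage, return_tensors="pt")   # question = segment A, passage = segment B
with torch.no_grad():
    out = model(**inputs)
start = out.start_logits.argmax()                            # per-token start scores
end = out.end_logits.argmax()                                # per-token end scores
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))   # predicted answer span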

 

36 of 66

BERT for reading comprehension

System               F1      EM
Human performance    91.2*   82.3*
BiDAF                77.3    67.7
BERT-base            88.5    80.8
BERT-large           90.9    84.1
XLNet                94.5    89.0
RoBERTa              94.6    88.9
ALBERT               94.8    89.3

(dev set, except for human performance)

Slide by Danqi Chen

37 of 66

Comparisons between BiDAF and BERT models


  • The BERT model has many more parameters (110M or 340M) than BiDAF (~2.5M).
  • BiDAF is built on top of several bidirectional LSTMs while BERT is built on top of Transformers.
  • BERT is pre-trained while BiDAF is only built on top of GloVe (and all the remaining parameters need to be learned from the supervision datasets).

Pre-training is clearly a game changer, but it is expensive…

Slide by Danqi Chen

38 of 66

Comparisons between BiDAF and BERT models

Are they really fundamentally different? Probably not.

  • BiDAF and other models aim to model the interactions between question and passage.
  • BERT uses self-attention between the concatenation of question and passage = attention(P, P) + attention(P, Q) + attention(Q, P) + attention(Q, Q)
  • (Clark and Gardner, 2018) shows that adding a self-attention layer for the passage attention(P, P) to BiDAF also improves performance.


Slide by Danqi Chen

39 of 66

Recent, more advanced architectures

Question answering work in 2016–2018 employed progressively more complex architectures with a multitude of variants of attention – often yielding good task gains

Slide by C. Manning

40 of 66

SpanBERT: Better pre-training

  • Two key ideas: (i) mask contiguous spans of words rather than individual random tokens; (ii) a span boundary objective (SBO): predict the tokens inside a masked span using only the representations of its boundary tokens

(Joshi & Chen et al., 2020): SpanBERT: Improving Pre-training by Representing and Predicting Spans

41 of 66

SpanBERT Performance

42 of 66

SQuAD leaderboard (2019/02)

Slide by C. Manning

43 of 66

SQuAD 2.0

  • A defect of SQuAD 1.0 is that all questions have an answer in the paragraph
  • Systems (implicitly) rank candidates and choose the best one
  • You don’t have to judge whether a span answers the question
  • In SQuAD 2.0, 1/3 of the training questions have no answer, and about 1/2 of the dev/test questions have no answer
    • For NoAnswer examples, NoAnswer receives a score of 1, and any other response gets 0, for both exact match and F1
  • Simplest system approach to SQuAD 2.0:
    • Have a threshold score for whether a span answers a question (see the sketch after this list)
  • Or you could have a second component that confirms answering
    • Like Natural Language Inference (NLI) or “Answer validation”
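A minimal sketch of the thresholding idea (my simplification of the standard BERT-style approach, where index 0 is the [CLS] token and its start+end score serves as the no-answer score):

def pick_answer(start_logits, end_logits, tokens, no_answer_threshold=0.0):
    # score of predicting "no answer": start and end both point at [CLS] (index 0)
    null_score = start_logits[0] + end_logits[0]
    # best non-null span (brute force over spans of bounded length)
    best_score, best_span = float("-inf"), (0, 0)
    for s in range(1, len(tokens)):
        for e in range(s, min(s + 30, len(tokens))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best_span = score, (s, e)
    if null_score - best_score > no_answer_threshold:
        return ""                                   # predict NoAnswer
    return " ".join(tokens[best_span[0]:best_span[1] + 1])

The threshold is typically tuned on the dev set to trade off answering too eagerly against abstaining too often.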

Slide by C. Manning

44 of 66

SQuAD 2.0 No Answer Example

When did Genghis Khan kill Great Khan?

Gold Answers: <No Answer>
Prediction: 1234 [from Microsoft nlnet]

Slide by C. Manning

45 of 66

Still basic NLU errors

What dynasty came before the Yuan?

Gold Answers: (1) Song dynasty (2) Mongol Empire

(3) the Song dynasty

Prediction: Ming dynasty [BERT (single model) (Google AI)]

Slide by C. Manning

46 of 66

SQuAD Limitations

  • SQuAD has a number of other key limitations too:
    • Only span-based answers (no yes/no, counting, implicit why)
    • Questions were constructed looking at the passages
      • Not genuine information needs
      • Generally greater lexical and syntactic matching between questions and answer span than you get IRL
    • Barely any multi-fact/sentence inference beyond coreference
  • Nevertheless, it is a well-targeted, well-structured, clean dataset
    • It has been the most used and competed on QA dataset
    • It has also been a useful starting point for building systems in industry (though in-domain data always really helps!)

Slide by C. Manning

47 of 66

Leaderboard 2021/5

[Leaderboard screenshot: top-scoring Single and Ensemble submissions]

48 of 66

TensorFlow 2.0 Question Answering

49 of 66

TensorFlow 2.0 Question Answering

  • Task: select span from a given Wikipedia article answering a question
  • Two types of questions:
    • Long answer
    • Short answer
  • Visualize examples
  • Data
    • Each training example contains a Wikipedia article, a related question, candidate long answers, and the correct long and short answer(s) for the sample, if any exist.
  • Evaluation
    • micro F1
    • Predicted long and short answers must match exactly the token indices of one of the ground truth labels

50 of 66

Leaderboard

https://www.kaggle.com/c/tensorflow2-question-answering/leaderboard

51 of 66

A BERT Baseline for the Natural Questions

https://arxiv.org/abs/1901.08634

Input representation:
  • split documents into sequences of fewer than 512 wordpieces
  • [CLS] tokenized question [SEP] tokens from the document [SEP]

Target answer type:
  • one of short, yes, no, long, no-answer

Downsampling:
  • remove sequences without an answer to balance the dataset

Model: each training instance is a tuple (c, s, e, t)

  • c is a context of 512 wordpiece ids
  • s, e ∈ {0, 1, . . . , 511} are inclusive indices pointing to the start and end of the target answer span
  • t ∈ {0, 1, 2, 3, 4} is the annotated answer type, corresponding to the labels “short”, “long”, “yes”, “no”, and “no-answer”.
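A rough sketch of the prediction heads this baseline implies: per-token start/end logits plus an answer-type classifier on the [CLS] vector (hypothetical layer names, not the paper's code).

import torch
import torch.nn as nn

class NQBaselineHead(nn.Module):
    """Span + answer-type heads on top of a BERT-style encoder (sketch)."""
    def __init__(self, hidden_size=768, num_answer_types=5):
        super().__init__()
        self.span = nn.Linear(hidden_size, 2)                         # start/end logits per token
        self.answer_type = nn.Linear(hidden_size, num_answer_types)   # predicted from [CLS]

    def forward(self, sequence_output):
        # sequence_output: (batch, 512, hidden_size) from the encoder
        start_logits, end_logits = self.span(sequence_output).split(1, dim=-1)
        type_logits = self.answer_type(sequence_output[:, 0])         # [CLS] vector
        return start_logits.squeeze(-1), end_logits.squeeze(-1), type_logits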

52 of 66

  • The model is trained to maximize the log-likelihood log p(s, e, t | c) = log p_start(s | c) + log p_end(e | c) + log p_type(t | c)
  • No answer: for training instances that contain no answer, the target start and end both point at the [CLS] token and the answer type is no-answer

53 of 66

Winning Submission

  • Guanshuo Xu: https://www.kaggle.com/c/tensorflow2-question-answering/discussion/127551
  • Architecture same as baseline
    • hard negative sampling to increase the difficulty of the candidate-level training
    • added html tags for unseen tokens
    • The loss update of the span prediction branch was simply ignored if no short answer span exists during training

54 of 66

How did it win?

My best ensemble only achieved 0.66 public LB (0.69 private) performance using the optimized thresholds. At that time I had already lost most of my hope to win. In my last 2-3 submissions, I arbitrarily played with the thresholds. One of the submissions scored 0.71 (both public and private LB), and I chose it and won the competition. Unbelievable.

55 of 66

Code

  • Notebook of 2nd place:

https://www.kaggle.com/seesee/submit-full

56 of 66

Fujitsu AI NLP Challenge 2018

Winner of $20,000 prize:

A. Gravina, F. Rossetto, S. Severini and G. Attardi. 2018. Cross-Attentive Convolutional Neural Networks. Workshop NLP4AI.

57 of 66

Question Answer Selection: SelQA dataset

  • Question Answer selection on the SelQA dataset
  • Special dataset of questions requiring more than simple IR

How much cholesterol is there in an ounce of bacon?

  • One rasher of cooked streaky bacon contains 5.4g of fat, and 4.4g of protein. → 0
  • Four pieces of bacon can also contain up to 800mg of sodium, which is roughly equivalent to 1.92g of salt. → 0
  • The fat and protein content varies depending on the cut and cooking method. 68% of bacon’s calories come from fat, almost half of which are saturated. → 0
  • Each ounce of bacon contains 30mg of cholesterol. → 1

58 of 66

Model – General Architecture

  • Stacked cross-attentive convolutional layers
  • Combine the layer-wise representations
  • Add word-matching features
  • Logistic regression at the end

The convolutional model and the logistic regression were trained separately.

59 of 66

Cross Attentive Convolution

  • Light Attentive Convolution in both directions
  • Both Question and Answer have awareness of each other
  • From each layer we extract a representation

60 of 66

Experiments - Datasets

  • WikiQA: fewer examples and designed for different tasks (e.g. answer triggering)
  • SelQA: a newer dataset with more examples, created through crowd-sourcing and designed for the answer selection task

61 of 66

Experiments - Results

  • For the WikiQA experiments we applied stronger regularization, since the dataset is smaller (8k vs 66k question-answer pairs)

  • For the SelQA experiments we were able to achieve much more consistent results

[Results tables: WikiQA and SelQA]

62 of 66

Experiments – Error Analysis

  • For the SelQA dataset, analysis of performance according to topic of question

  • Best scoring subjects: History, Science and Country
  • Worst scoring subjects: Art, TV and Music

63 of 66

Is Reading Comprehension Solved?

  • We have already surpassed human performance on SQuAD. Does it mean that reading comprehension is already solved? Of course not!
  • The current systems still perform poorly on adversarial examples or examples from out-of-domain distributions

64 of 66

Is Reading Comprehension Solved?

Systems trained on one dataset can’t generalize to other datasets:

65 of 66

Is Reading Comprehension Solved?

BERT-large model trained on SQuAD

(Ribeiro et al., 2020): Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

66 of 66

Is Reading Comprehension Solved?

BERT-large model trained on SQuAD

(Ribeiro et al., 2020): Beyond Accuracy: Behavioral Testing of NLP Models with CheckList