Reading Comprehension
Human Language Technologies
Giuseppe Attardi
Università di Pisa
Question Answering: a Taxonomy
Practical applications
Slide by Danqi Chen
IBM Watson beat Jeopardy! champions
Slide by Danqi Chen
IBM Watson beat Jeopardy! champions
Image credit: J & M, edition 3
(1) Question processing, (2) Candidate answer generation, (3) Candidate answer scoring, and
(4) Confidence merging and ranking.
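The four stages lend themselves to a pipeline sketch; everything below is a hypothetical toy illustration, not IBM's actual components:

```python
# Toy sketch of the four-stage pipeline above; all names and bodies
# are hypothetical placeholders, not IBM Watson's actual components.
def process_question(question):
    # (1) Question processing: here just lowercase + split.
    return question.lower().split()

def generate_candidates(terms, corpus):
    # (2) Candidate answer generation: retrieve documents sharing a term.
    return [doc for doc in corpus if any(t in doc.lower() for t in terms)]

def score(candidate, terms):
    # (3) Candidate answer scoring: count term overlaps.
    return sum(candidate.lower().count(t) for t in terms)

def answer(question, corpus):
    # (4) Confidence merging and ranking: keep the top-scoring candidate.
    terms = process_question(question)
    candidates = generate_candidates(terms, corpus)
    return max(candidates, key=lambda c: score(c, terms), default=None)
```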
Question answering in deep learning era
Image credit: (Lee et al., 2019)
Most state-of-the-art question answering systems are built on top of end-to-end training and pre-trained language models (e.g., BERT)!
Slide by Danqi Chen
Differences between QA tasks
Brief History of Reading Comprehension
Slide by C. Manning
Machine Comprehension (Burges 2013)
“A machine comprehends a passage of text if, for any question regarding that text that can be answered correctly by a majority of native speakers, that machine can provide a string which those speakers would agree both answers that question, and does not contain information irrelevant to that question.”
MCTest Reading Comprehension
Alyssa got to the beach after a long trip. She's from Charlotte.
She travelled from Atlanta. She's now in Miami.
She went to Miami to visit some friends.
But she wanted some time to herself at the beach, so she went there first.
After going swimming and laying out, she went to her friend Ellen's house.
Ellen greeted Alyssa and they both had some lemonade to drink. Alyssa called her friends Kristin and Rachel to meet at Ellen's house…
Why did Alyssa go to Miami? To visit some friends
Passage (P)
Question (Q)
Answer (A)
Slide by C. Manning
P + Q → A
Why do we care about this problem?
Information extraction
(Barack Obama, educated_at, ?)
Question: Where did Barack Obama graduate from?
Passage: Obama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago.
(Levy et al., 2017)
Semantic role labeling
(He et al., 2015)
Slide by Danqi Chen
Stanford Question Answering Dataset (SQuAD)
(Rajpurkar et al., 2016): SQuAD: 100,000+ Questions for Machine Comprehension
Slide by C. Manning
Large-scale supervised datasets are a key ingredient for training effective neural models for reading comprehension!
This is a limitation: not all questions can be answered in this way!
SQuAD 1.1 Evaluation
Slide by C. Manning
SQuAD Leaderboard (2019/02)
Slide by C. Manning
Other QA Datasets
Neural models for reading comprehension
Attentive Reader (Hermann et al., 2015), Stanford Attentive Reader (Chen et al., 2016), Match-LSTM (Wang et al., 2017), BiDAF (Seo et al., 2017), Dynamic Coattention Network (Xiong et al., 2017), DrQA (Chen et al., 2017), R-Net (Wang et al., 2017), ReasoNet (Shen et al., 2017), …
Problem setup: passage of length N ≈ 100, question of length M ≈ 15; the answer is a span in the passage.
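Stated as a prediction problem, the setup above is:

$$\text{given } P = (p_1, \dots, p_N) \text{ and } Q = (q_1, \dots, q_M), \text{ predict } 1 \le \text{start} \le \text{end} \le N$$

so that the answer is the span $(p_{\text{start}}, \dots, p_{\text{end}})$.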
How can we build a model to solve SQuAD?
Slide by Danqi Chen
Stanford Attentive Reader
[Chen, Bolton & Manning 2016], [Chen, Fisch, Weston & Bordes 2017] https://arxiv.org/pdf/1704.00051.pdf
DrQA [Chen 2018]
Slide by C. Manning
The Stanford Attentive Reader
[Figure: model overview. Input: Passage (P) and Question (Q), e.g. “Which team won Super Bowl 50?”. Output: Answer (A).]
Slide by C. Manning
The Stanford Attentive Reader
[Figure: bidirectional LSTMs encode the passage P and the question Q, e.g. “Who did Genghis Khan unite before he began conquering the rest of Eurasia?”]
Slide by C. Manning
GloVe Embeddings
Exact match
Aligned embedding
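The “aligned embedding” is DrQA's soft alignment feature (Chen et al., 2017); a sketch of that formula:

$$f_{\text{align}}(p_i) = \sum_j a_{i,j}\,\mathbf{E}(q_j), \qquad a_{i,j} = \frac{\exp\!\big(\alpha(\mathbf{E}(p_i)) \cdot \alpha(\mathbf{E}(q_j))\big)}{\sum_{j'} \exp\!\big(\alpha(\mathbf{E}(p_i)) \cdot \alpha(\mathbf{E}(q_{j'}))\big)}$$

where $\mathbf{E}$ are word embeddings and $\alpha(\cdot)$ is a single dense layer with ReLU.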
The Stanford Attentive Reader
[Figure: attention between the question vector q and each passage position i is used to predict the start token at position i and the end token at position i.]
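The start and end predictions are bilinear attentions between each passage representation $\mathbf{p}_i$ and the question vector $\mathbf{q}$ (Chen et al., 2016; 2017):

$$P_{\text{start}}(i) \propto \exp\!\big(\mathbf{p}_i^\top \mathbf{W}_s\, \mathbf{q}\big), \qquad P_{\text{end}}(i) \propto \exp\!\big(\mathbf{p}_i^\top \mathbf{W}_e\, \mathbf{q}\big)$$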
Slide by C. Manning
SQuAD 1.1 Results (single model, c. Feb 2017)
System | F1
Logistic regression | 51.0
Fine-Grained Gating (Carnegie Mellon U) | 73.3
Match-LSTM (Singapore Management U) | 73.7
DCN (Salesforce) | 75.9
BiDAF (UW & Allen Institute) | 77.3
Multi-Perspective Matching (IBM) | 78.7
ReasoNet (MSR Redmond) | 79.4
DrQA (Chen et al. 2017) | 79.4
R-Net (MSR Asia) [Wang et al., ACL 2017] | 79.7
Google Brain / CMU (Feb 2018) | 88.0
Human performance | 91.2
Slide by C. Manning
Recap: seq2seq model with attention
Attention is the key ingredient here, similar to attention across sentences in machine translation
Slide by Danqi Chen
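As a reminder, with encoder states $\mathbf{h}_i$ and decoder state $\mathbf{s}_t$, one common (dot-product) formulation of that attention is:

$$e_i^t = \mathbf{s}_t^\top \mathbf{h}_i, \qquad \boldsymbol{\alpha}^t = \mathrm{softmax}(\mathbf{e}^t), \qquad \mathbf{a}_t = \sum_i \alpha_i^t\, \mathbf{h}_i$$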
BiDAF: the Bidirectional Attention Flow model
(Seo et al., 2017): Bidirectional Attention Flow for Machine Comprehension
BiDAF: Encoding
e(ci) = f([GloVe(ci); charEmb(ci)])
e(qi) = f([GloVe(qi); charEmb(qi)])
f: highway networks, omitted here
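For reference, a standard highway layer (Srivastava et al., 2015) has this form; BiDAF's exact variant may differ in the choice of nonlinearity:

$$f(\mathbf{x}) = \mathbf{g} \odot \mathrm{ReLU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + (1 - \mathbf{g}) \odot \mathbf{x}, \qquad \mathbf{g} = \sigma(\mathbf{W}_2 \mathbf{x} + \mathbf{b}_2)$$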
Slide by Danqi Chen
BiDAF: Attention
Context-to-query attention: For each context word, choose the most relevant words from the query words.
(Slides adapted from Minjoon Seo)
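Both attention directions in BiDAF derive from one similarity matrix between context word $\mathbf{c}_i$ and query word $\mathbf{q}_j$ (Seo et al., 2017); context-to-query attention is then a softmax over each row:

$$S_{ij} = \mathbf{w}_{\text{sim}}^\top [\mathbf{c}_i; \mathbf{q}_j; \mathbf{c}_i \odot \mathbf{q}_j], \qquad \alpha_{ij} = \mathrm{softmax}_j(S_{i,:}), \qquad \mathbf{a}_i = \sum_j \alpha_{ij}\, \mathbf{q}_j$$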
BiDAF: Attention
Query-to-context attention: choose the context words that are most relevant to one of the query words.
(Slides adapted from Minjoon Seo)
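Query-to-context attention instead takes the maximum of each row of $S$ and produces a single attended context vector:

$$\beta_i = \mathrm{softmax}_i\big(\max_j S_{ij}\big), \qquad \mathbf{b} = \sum_i \beta_i\, \mathbf{c}_i$$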
BiDAF: Attention
Slide by Danqi Chen
BiDAF: Modeling and output layers
Slide by Danqi Chen
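In the paper's notation, the attention outputs are combined per position, run through the modeling BiLSTMs, and projected to span probabilities; a sketch:

$$\mathbf{g}_i = [\mathbf{c}_i; \mathbf{a}_i; \mathbf{c}_i \odot \mathbf{a}_i; \mathbf{c}_i \odot \mathbf{b}], \qquad p_i^{\text{start}} = \mathrm{softmax}_i\big(\mathbf{w}_{\text{start}}^\top [\mathbf{g}_i; \mathbf{m}_i]\big), \qquad p_i^{\text{end}} = \mathrm{softmax}_i\big(\mathbf{w}_{\text{end}}^\top [\mathbf{g}_i; \mathbf{m}'_i]\big)$$

where $\mathbf{m}_i$ are the modeling-layer outputs and $\mathbf{m}'_i$ come from one further BiLSTM pass used for the end prediction.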
BiDAF: Performance on SQuAD
This model achieved 77.3 F1 on SQuAD v1.1.
(Seo et al., 2017): Bidirectional Attention Flow for Machine Comprehension
Attention visualization
Slide by Danqi Chen
LSTM-based vs BERT models
Image credit: (Seo et al, 2017)
Image credit: J & M, edition 3
Slide by Danqi Chen
BERT for reading comprehension
BERT for reading comprehension
Question = Segment A
Passage = Segment B
Answer = predicting two endpoints in segment B
Image credit: https://mccormickml.com/
where H = [h1, h2, ..., hN] are the hidden vectors of the paragraph, returned by BERT
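The span endpoints are then scored with two learned vectors over these hidden states (the standard BERT QA head):

$$p_{\text{start}}(i) = \mathrm{softmax}_i(\mathbf{w}_{\text{start}} \cdot \mathbf{h}_i), \qquad p_{\text{end}}(i) = \mathrm{softmax}_i(\mathbf{w}_{\text{end}} \cdot \mathbf{h}_i)$$

trained with the loss $\mathcal{L} = -\log p_{\text{start}}(s^*) - \log p_{\text{end}}(e^*)$ on the gold endpoints $(s^*, e^*)$.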
Slide by Danqi Chen
BERT for reading comprehension
System | F1 | EM
Human performance | 91.2* | 82.3*
BiDAF | 77.3 | 67.7
BERT-base | 88.5 | 80.8
BERT-large | 90.9 | 84.1
XLNet | 94.5 | 89.0
RoBERTa | 94.6 | 88.9
ALBERT | 94.8 | 89.3
(dev set, except for human performance)
Slide by Danqi Chen
Comparisons between BiDAF and BERT models
Pre-training is clearly a game changer, but it is expensive…
Slide by Danqi Chen
Comparisons between BiDAF and BERT models
Are they really fundamentally different? Probably not.
[Figure: both models compute attention between question and passage]
Slide by Danqi Chen
Recent, more advanced architectures
Question answering work in 2016–2018 employed progressively more complex architectures with a multitude of variants of attention – often yielding good task gains
Slide by C. Manning
SpanBERT: Better pre-training
(Joshi & Chen et al., 2020): SpanBERT: Improving Pre-training by Representing and Predicting Spans
SpanBERT Performance
SQuAD leaderboard (2019/02)
Slide by C. Manning
SQuAD 2.0
Slide by C. Manning
SQuAD 2.0 No Answer Example
When did Genghis Khan kill Great Khan?
Gold Answers: <No Answer>
Prediction: 1234 [from Microsoft nlnet]
Slide by C. Manning
Still basic NLU errors
What dynasty came before the Yuan?
Gold Answers: (1) Song dynasty (2) Mongol Empire (3) the Song dynasty
Prediction: Ming dynasty [BERT (single model) (Google AI)]
Slide by C. Manning
SQuAD Limitations
Slide by C. Manning
Leaderboard 2021/5
[Tables: single-model and ensemble leaderboards]
TensorFlow 2.0 Question Answering
Leaderboard
https://www.kaggle.com/c/tensorflow2-question-answering/leaderboard
A BERT Baseline for the Natural Questions
https://arxiv.org/abs/1901.08634
Input representation:
split documents into sequences of fewer than 512 tokens
[CLS] tokenized question [SEP] tokens from the document [SEP]
target answer type: short, yes, no, long, no-answer
Downsampling
remove sequences without an answer to balance the dataset (see the sketch below)
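A minimal sketch of the document windowing, assuming pre-tokenized strings; the stride and length values are illustrative defaults, not necessarily the paper's:

```python
# Split a long document into overlapping windows that fit BERT's
# 512-token limit, each prefixed with the question (sketch only).
def make_windows(question_tokens, doc_tokens, max_len=512, stride=128):
    budget = max_len - len(question_tokens) - 3  # room for [CLS] and two [SEP]s
    windows = []
    for start in range(0, len(doc_tokens), stride):
        chunk = doc_tokens[start:start + budget]
        windows.append(["[CLS]"] + question_tokens + ["[SEP]"] + chunk + ["[SEP]"])
        if start + budget >= len(doc_tokens):
            break  # the rest of the document is already covered
    return windows
```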
Model
training instances (c, s, e, t): context c, answer span start s and end e, answer type t
No answer: the target span points at the [CLS] token
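The paper factorizes the prediction over start, end and answer type; a sketch of that objective:

$$p(s, e, t \mid c) = p(s \mid c)\; p(e \mid c)\; p(t \mid c)$$

where each factor is a softmax over scores computed from BERT's output representations of the window $c$, and $t$ ranges over the five answer types listed above.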
Winning Submission
How did it win?
My best ensemble only achieved 0.66 public LB (0.69 private) performance using the optimized thresholds. At that time I had already lost most of my hope to win. In my last 2-3 submissions, I arbitrarily played with the thresholds. One of the submissions scored 0.71 (both public and private LB), and I chose it and won the competition. Unbelievable.
Code
https://www.kaggle.com/seesee/submit-full
Fujitsu AI NLP Challenge 2018
Winner of $20,000 prize:
A. Gravina, F. Rossetto, S. Severini and G. Attardi. 2018. Cross-Attentive Convolutional Neural Networks. Workshop NLP4AI.
Question Answer Selection: SelQA dataset
Question: How much cholesterol is there in an ounce of bacon?
Candidate sentence | Label
One rasher of cooked streaky bacon contains 5.4g of fat, and 4.4g of protein. | 0
Four pieces of bacon can also contain up to 800mg of sodium, which is roughly equivalent to 1.92g of salt. | 0
The fat and protein content varies depending on the cut and cooking method. 68% of bacon's calories come from fat, almost half of which are saturated. | 0
Each ounce of bacon contains 30mg of cholesterol. | 1
Model – General Architecture
The convolutional model and the logistic regression are trained separately.
Cross Attentive Convolution
Experiments - Datasets
Experiments - Results
WikiQA
SelQA
Experiments – Error Analysis
Is Reading Comprehension Solved?
Is Reading Comprehension Solved?
Systems trained on one dataset can’t generalize to other datasets:
Is Reading Comprehension Solved?
BERT-large model trained on SQuAD
(Ribeiro et al., 2020): Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Is Reading Comprehension Solved?
BERT-large model trained on SQuAD
(Ribeiro et al., 2020): Beyond Accuracy: Behavioral Testing of NLP Models with CheckList