Each entry below lists: Dataset, Description, Umbrella Term(s), Data example, Homepage / paper, Notes (Pro/Con), Metric, and Size train / dev / test (in examples).
CoLA
Description: The Corpus of Linguistic Acceptability. Binary classification: single sentences that are either grammatical or ungrammatical.
Umbrella Term(s): Acceptability
Data example:
  "They made him angry." = 1 = acceptable
  "This building is than that one." = 0 = unacceptable
Homepage / paper: https://nyu-mll.github.io/CoLA/
Metric: Matthews correlation coefficient
Size train / dev / test: 10k / 1k / 1.1k
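CoLA is scored with Matthews correlation rather than plain accuracy because the labels are skewed toward acceptable sentences. A minimal sketch of computing the metric with scikit-learn; the label and prediction lists are made-up placeholders, not real CoLA data:

# Minimal sketch: scoring CoLA-style binary predictions with Matthews correlation.
# The gold/pred lists below are illustrative placeholders.
from sklearn.metrics import matthews_corrcoef

gold = [1, 0, 1, 1, 0, 1]   # 1 = acceptable, 0 = unacceptable
pred = [1, 0, 1, 0, 0, 1]   # hypothetical model outputs

print(matthews_corrcoef(gold, pred))  # value in [-1, 1]; 1 = perfect, 0 = chance level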
SST-2
Description: The Stanford Sentiment Treebank. Binary classification: phrases culled from movie reviews, scored on their positivity/negativity. The underlying phrases range from positive through neutral to negative (see the scores in the examples); GLUE uses the binary positive/negative labels.
Umbrella Term(s): Sentiment
Data example:
  "the movie as a whole is cheap junk and an insult to their death-defying efforts" = .11111
  "the movie is funny , smart , visually inventive , and most of all , alive ." = .93056
  a phrase with both = .5
Homepage / paper: https://nlp.stanford.edu/sentiment/index.html
Metric: Accuracy
Size train / dev / test: 67k / 872 / 1.8k
MRPC
Description: The Microsoft Research Paraphrase Corpus. Given a pair of sentences, classify them as paraphrases or not. From the GLUE paper: a corpus of sentence pairs automatically extracted from online news sources, with human annotations of whether the sentences in the pair are semantically equivalent.
Umbrella Term(s): Paraphrase
Data example:
  "A New Castle County woman has become the first Delaware patient to contract the West Nile virus this year , the state 's Department of Health reported ." / "A 62-year-old West Babylon man has contracted the West Nile virus , the first human case in Suffolk County this year , according to the county health department ." = 0 = not paraphrases
  "Yesterday , Taiwan reported 35 new infections , bringing the total number of cases to 418 ." / "The island reported another 35 probable cases yesterday , taking its total to 418 ." = 1 = paraphrases
Homepage / paper: https://www.microsoft.com/en-us/download/details.aspx?id=52398
Notes: The negative examples seek to foil purely statistical approaches. Sometimes it is unclear even to a human that a negative example should be negative:
  "History will remember the University of Washington 's Dr. Belding Scribner as the man who has saved more than a million lives by making long-term kidney dialysis possible ." / "Dr. Belding Scribner , inventor of a device that made long-term kidney dialysis possible and has saved more than a million lives , has died in Seattle at age 82 ."
  "Gillespie sent a letter to CBS President Leslie Moonves asking for a historical review or a disclaimer ." / "Republican National Committee Chairman Ed Gillespie issued a letter Friday to CBS Television President Leslie Moonves ."
Metric: Accuracy / F1
Size train / dev / test: 4k / N/A / 1.7k
STS-B
Description: The Semantic Textual Similarity Benchmark (Cer et al., 2017) is based on the datasets from a series of annual challenges for the task of determining the similarity of a pair of sentences on a continuous scale from 0 to 5. The STS-Benchmark release used in GLUE draws from news headlines, video and image captions, and natural language inference data, scored by human annotators.
Umbrella Term(s): Sentence Similarity
Data example:
  "Elephants are walking down a trail." / "A herd of elephants are walking along a trail." = 4.6 = very similar
  "Thai police use tear gas against protesters" / "Brazil leader breaks silence about protests" = 1.0 = not similar
  "A man is making a bed." / "A woman is playing a guitar." = 0.0 = not similar
  "A man is playing a guitar and singing on a stage." / "A man is playing the electric guitar." = 3.8 = somewhat similar
  "The woman is holding the hands of the man." / "The woman is checking the eyes of the man." = 2.4 = somewhat similar
Homepage / paper: https://www.aclweb.org/anthology/S17-2001
Metric: Pearson / Spearman correlation
Size train / dev / test: 7k / 1.5k / 1.4k
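STS-B is evaluated by correlating the model's predicted similarity scores with the gold scores. A minimal sketch using SciPy; the score lists are made-up placeholders, not real STS-B data:

# Minimal sketch: correlating predicted similarity scores with gold STS-B scores.
# The gold/pred lists below are illustrative placeholders.
from scipy.stats import pearsonr, spearmanr

gold = [4.6, 1.0, 0.0, 3.8, 2.4]   # human similarity scores on the 0-5 scale
pred = [4.2, 0.8, 0.5, 3.5, 2.9]   # hypothetical model outputs

pearson_r, _ = pearsonr(gold, pred)    # linear correlation
spearman_r, _ = spearmanr(gold, pred)  # rank correlation
print(pearson_r, spearman_r)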
QQP
Description: The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. Given two questions, the task is to determine whether they are semantically equivalent.
Umbrella Term(s): Paraphrase
Data example:
  "How can I be a good geologist?" / "What should I do to be a great geologist?" = 1 = similar
  "How can I increase the speed of my internet connection while using a VPN?" / "How can Internet speed be increased by hacking through DNS?" = 0 = not similar
Homepage / paper: https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
Metric: Accuracy / F1
Size train / dev / test: 400k / N/A / 391k
MNLI-m (matched)
Description: The matched section of MultiNLI; see MNLI-mm below for the full description. The matched dev/test sets are drawn from the same sources as the training set.
Umbrella Term(s): Natural Language Inference
Metric: Accuracy
Size train / dev / test: 393k / 20k / 20k
MNLI-mm (mismatched)
Description: The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis, contradicts the hypothesis, or neither (neutral). The premise sentences are gathered from a diverse set of sources, including transcribed speech, popular fiction, and government reports. The test set is broken into two sections: matched, which is drawn from the same sources as the training set, and mismatched, which uses different sources and thus requires domain transfer. The GLUE authors use the standard test set, for which they obtained labels privately from the MNLI authors, and evaluate on both sections.
Umbrella Term(s): Natural Language Inference
Data example:
  "Jon walked back to the town to the smithy." / "Jon traveled back to his hometown." = 1 = neutral
  "Tourist Information offices can be very helpful." / "Tourist Information offices are never of any help." = 2 = contradiction
  "I'm confused." / "Not all of it is very clear to me." = 0 = entailed
Homepage / paper: https://www.nyu.edu/projects/bowman/multinli/
Metric: Accuracy
QNLI
Description: The Stanford Question Answering Dataset (Rajpurkar et al., 2016; SQuAD) is a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator). GLUE automatically converts the original SQuAD dataset into a sentence-pair classification task by forming a pair between the question and each sentence in the corresponding context. The task is to determine whether the context sentence contains the answer to the question. Pairs with low lexical overlap between the question and the context sentence are filtered out. Specifically, the conversion selects all pairs in which the most similar sentence to the question was not the answer sentence, as well as an equal number of cases in which the correct sentence was the most similar to the question, but another distracting sentence was a close second. A sketch of the basic pairing step appears after this entry.
Umbrella Term(s): Question Answering / Natural Language Inference
Data example:
  "How was the Everton FC's crest redesign received by fans?" / "The redesign was poorly received by supporters, with a poll on an Everton fan site registering a 91% negative response to the crest." = 0 = answerable
  "In what year did Robert Louis Stevenson die?" / "Mission work in Samoa had begun in late 1830 by John Williams, of the London Missionary Society arriving in Sapapali'i from The Cook Islands and Tahiti." = 1 = not answerable
  "What is essential for the mating of the elements that create radio waves?" / "Antennas are required by any radio receiver or transmitter to couple its electrical connection to the electromagnetic field." = 0 = answerable
Homepage / paper: QNLI was created for GLUE; SQuAD: https://rajpurkar.github.io/SQuAD-explorer/
Metric: Accuracy
Size train / dev / test: 108k / 11k / 11k
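A minimal sketch of the pairing step described above: split a SQuAD context into sentences and label each question/sentence pair by whether it contains the answer span. This is a simplified illustration with a naive sentence splitter and an invented (question, context, answer) triple; it omits GLUE's lexical-overlap filtering.

# Minimal sketch: turning a SQuAD-style (question, context, answer) triple into
# QNLI-style question/sentence pairs. Sample data is invented for illustration;
# GLUE's lexical-overlap filtering step is not implemented here.
import re

def squad_to_pairs(question, context, answer_text):
    pairs = []
    # naive sentence split; a proper sentence tokenizer would be used in practice
    for sentence in re.split(r"(?<=[.!?])\s+", context):
        label = 0 if answer_text in sentence else 1   # 0 = contains answer, 1 = does not
        pairs.append((question, sentence, label))
    return pairs

for q, s, label in squad_to_pairs(
    question="In what year was the bridge completed?",
    context="Construction began in 1933. The bridge was completed in 1937. It remains in use today.",
    answer_text="1937",
):
    print(label, "|", q, "|", s)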
RTE
Description: The Recognizing Textual Entailment (RTE) datasets come from a series of annual challenges for the task of textual entailment, also known as NLI. GLUE combines the data from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009). Each example in these datasets consists of a premise sentence and a hypothesis sentence, gathered from various online news sources. The task is to predict whether the premise entails the hypothesis. All the data is converted to a two-class split (entailment or not entailment, where neutral and contradiction are collapsed into not entailment for challenges with three classes) for consistency.
Umbrella Term(s): Natural Language Inference
Data example:
  "In 2003, Yunus brought the microcredit revolution to the streets of Bangladesh to support more than 50,000 beggars, whom the Grameen Bank respectfully calls Struggling Members." / "Yunus supported more than 50,000 Struggling Members." = 0 = entailed
  "In 1997, Tyson bit off part of one of Evander Holyfield's ears in their rematch that led to Tyson's disqualification and suspension. In 2002, at a melee at a news conference in New York, Tyson bit champion Lennox Lewis' leg." / "Tyson bit off part of one of Evander Holyfield's ears in 2002." = 1 = not entailed
  "The marriage is planned to take place in the Kiev Monastery of the Caves, whose father superior, Bishop Pavel, is Yulia Timoshenko's spiritual father." / "Yulia Timoshenko is the daughter of Bishop Pavel." = 1 = not entailed
  "With its headquarters in Madrid, Spain, WTO is an inter-governmental body entrusted by the United Nations to promote and develop tourism." / "The WTO headquarters is located in Madrid, Spain." = 0 = entailed
Homepage / paper: https://aclweb.org/aclwiki/Textual_Entailment_Resource_Pool
Metric: Accuracy
Size train / dev / test: 2.7k / N/A / 3k
WNLI
Description: Based on the Winograd Schema Challenge, where the system must read a sentence with a pronoun and decide the referent from a list of choices; the examples are constructed to foil simple statistical methods. From the GLUE paper: the task (a slight relaxation of the original Winograd Schema Challenge) is to predict whether the sentence with the pronoun substituted is entailed by the original sentence.
Umbrella Term(s): Coreference / Natural Language Inference
Data example:
  Original Winograd schema: "The trophy didn’t fit into the suitcase because it was too [large/small]." Question: What was too [large/small]? Answer: the trophy / the suitcase
  "Lily spoke to Donna, breaking her concentration." / "Lily spoke to Donna, breaking Lily's concentration." = 0
  "I put the cake away in the refrigerator. It has a lot of leftovers in it." / "The refrigerator has a lot of leftovers in it." = 1
  "Susan knows all about Ann's personal problems because she is nosy." / "Ann is nosy." = 0
Homepage / paper: WNLI was introduced in the GLUE paper; the Winograd Schema Challenge is described at https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html
Notes: Difficult, as it requires commonsense understanding that may not necessarily be derivable from the text alone.
Metric: Accuracy
Size train / dev / test: 706 / N/A / 146
Download Script:
https://raw.githubusercontent.com/nyu-mll/jiant/master/scripts/download_glue_data.py
(Also available via the TensorFlow Datasets library, tfds.)
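A minimal sketch of loading one GLUE task through tfds (shown here for CoLA; the other tasks use config names such as "glue/sst2", "glue/mrpc", and so on):

# Minimal sketch: loading a GLUE task (CoLA) via the TensorFlow Datasets library.
import tensorflow_datasets as tfds

train_ds, val_ds = tfds.load("glue/cola", split=["train", "validation"])

for example in train_ds.take(2):
    # each CoLA example is a dict with 'sentence', 'label', and 'idx' features
    print(example["sentence"].numpy().decode(), int(example["label"]))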
Download Links:
"CoLA":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FCoLA.zip?alt=media&token=46d5e637-3411-4188-bc44-5809b5bfb5f4', "SST":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8', "MRPC":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2Fmrpc_dev_ids.tsv?alt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc', "QQP":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FQQP.zip?alt=media&token=700c6acf-160d-4d89-81d1-de4191d02cb5', "STS":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSTS-B.zip?alt=media&token=bddb94a7-8706-4e0d-a694-1109e12273b5', "MNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FMNLI.zip?alt=media&token=50329ea1-e339-40e2-809c-10c40afff3ce', "SNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSNLI.zip?alt=media&token=4afcfbb2-ff0c-4b2d-a09a-dbf07926f4df', "QNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FQNLI.zip?alt=media&token=c24cad61-f2df-4f04-9ab6-aa576fa829d0', "RTE":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FRTE.zip?alt=media&token=5efa7e85-a0bb-4f19-8ea2-9e1840f077fb', "WNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FWNLI.zip?alt=media&token=068ad0a0-ded7-4bd7-99a5-5e00222e0faf',