Each entry below lists: Dataset, Description, Umbrella Term(s), Data example, Homepage / paper, Notes (Pro/Con), Metric, and Size train / dev / test (in examples).
CoLA
Description: The Corpus of Linguistic Acceptability. Binary classification: single sentences that are either grammatical or ungrammatical.
Umbrella Term(s): Acceptability
Data example:
  "They made him angry." = 1 = acceptable
  "This building is than that one." = 0 = unacceptable
Homepage / paper: https://nyu-mll.github.io/CoLA/
Metric: Matthews correlation coefficient
Size train / dev / test: 10k / 1k / 1.1k
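CoLA is scored with Matthews correlation rather than plain accuracy because the labels are skewed toward acceptable sentences. A minimal sketch of computing the metric with scikit-learn; the label and prediction lists are made-up placeholders, not real CoLA data:

# Minimal sketch: scoring CoLA-style binary predictions with Matthews correlation.
# The gold/pred lists below are illustrative placeholders.
from sklearn.metrics import matthews_corrcoef

gold = [1, 0, 1, 1, 0, 1]   # 1 = acceptable, 0 = unacceptable
pred = [1, 0, 1, 0, 0, 1]   # hypothetical model outputs

print(matthews_corrcoef(gold, pred))  # value in [-1, 1]; 1 = perfect, 0 = chance level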
SST-2
Description: The Stanford Sentiment Treebank. Binary classification: phrases culled from movie reviews, scored on their positivity/negativity. The underlying phrases range from positive through neutral to negative (see the scores in the examples); GLUE uses the binary positive/negative labels.
Umbrella Term(s): Sentiment
Data example:
  "the movie as a whole is cheap junk and an insult to their death-defying efforts" = .11111
  "the movie is funny , smart , visually inventive , and most of all , alive ." = .93056
  a phrase with both = .5
Homepage / paper: https://nlp.stanford.edu/sentiment/index.html
Metric: Accuracy
Size train / dev / test: 67k / 872 / 1.8k
MRPC
Description: The Microsoft Research Paraphrase Corpus. Given a pair of sentences, classify them as paraphrases or not. From the GLUE paper: a corpus of sentence pairs automatically extracted from online news sources, with human annotations of whether the sentences in the pair are semantically equivalent.
Umbrella Term(s): Paraphrase
Data example:
  "A New Castle County woman has become the first Delaware patient to contract the West Nile virus this year , the state 's Department of Health reported ." / "A 62-year-old West Babylon man has contracted the West Nile virus , the first human case in Suffolk County this year , according to the county health department ." = 0 = not paraphrases
  "Yesterday , Taiwan reported 35 new infections , bringing the total number of cases to 418 ." / "The island reported another 35 probable cases yesterday , taking its total to 418 ." = 1 = paraphrases
Homepage / paper: https://www.microsoft.com/en-us/download/details.aspx?id=52398
Notes: The negative examples seek to foil purely statistical approaches. Sometimes it is unclear even to a human that a negative example should be negative:
  "History will remember the University of Washington 's Dr. Belding Scribner as the man who has saved more than a million lives by making long-term kidney dialysis possible ." / "Dr. Belding Scribner , inventor of a device that made long-term kidney dialysis possible and has saved more than a million lives , has died in Seattle at age 82 ."
  "Gillespie sent a letter to CBS President Leslie Moonves asking for a historical review or a disclaimer ." / "Republican National Committee Chairman Ed Gillespie issued a letter Friday to CBS Television President Leslie Moonves ."
Metric: Accuracy / F1
Size train / dev / test: 4k / N/A / 1.7k
STS-B
Description: The Semantic Textual Similarity Benchmark (Cer et al., 2017) is based on the datasets from a series of annual challenges for the task of determining the similarity of a pair of sentences on a continuous scale from 0 to 5. The STS-Benchmark release used in GLUE draws from news headlines, video and image captions, and natural language inference data, scored by human annotators.
Umbrella Term(s): Sentence Similarity
Data example:
  "Elephants are walking down a trail." / "A herd of elephants are walking along a trail." = 4.6 = very similar
  "Thai police use tear gas against protesters" / "Brazil leader breaks silence about protests" = 1.0 = not similar
  "A man is making a bed." / "A woman is playing a guitar." = 0.0 = not similar
  "A man is playing a guitar and singing on a stage." / "A man is playing the electric guitar." = 3.8 = somewhat similar
  "The woman is holding the hands of the man." / "The woman is checking the eyes of the man." = 2.4 = somewhat similar
Homepage / paper: https://www.aclweb.org/anthology/S17-2001
Metric: Pearson / Spearman correlation
Size train / dev / test: 7k / 1.5k / 1.4k
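STS-B is evaluated by correlating the model's predicted similarity scores with the gold scores. A minimal sketch using SciPy; the score lists are made-up placeholders, not real STS-B data:

# Minimal sketch: correlating predicted similarity scores with gold STS-B scores.
# The gold/pred lists below are illustrative placeholders.
from scipy.stats import pearsonr, spearmanr

gold = [4.6, 1.0, 0.0, 3.8, 2.4]   # human similarity scores on the 0-5 scale
pred = [4.2, 0.8, 0.5, 3.5, 2.9]   # hypothetical model outputs

pearson_r, _ = pearsonr(gold, pred)    # linear correlation
spearman_r, _ = spearmanr(gold, pred)  # rank correlation
print(pearson_r, spearman_r)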
QQP
Description: The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. Given two questions, the task is to determine whether they are semantically equivalent.
Umbrella Term(s): Paraphrase
Data example:
  "How can I be a good geologist?" / "What should I do to be a great geologist?" = 1 = similar
  "How can I increase the speed of my internet connection while using a VPN?" / "How can Internet speed be increased by hacking through DNS?" = 0 = not similar
Homepage / paper: https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
Metric: Accuracy / F1
Size train / dev / test: 400k / N/A / 391k
MNLI-m (matched)
Description: The matched section of MultiNLI; see MNLI-mm below for the full description. The matched dev/test sets are drawn from the same sources as the training set.
Umbrella Term(s): Natural Language Inference
Metric: Accuracy
Size train / dev / test: 393k / 20k / 20k
MNLI-mm (mismatched)
Description: The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis, contradicts the hypothesis, or neither (neutral). The premise sentences are gathered from a diverse set of sources, including transcribed speech, popular fiction, and government reports. The test set is broken into two sections: matched, which is drawn from the same sources as the training set, and mismatched, which uses different sources and thus requires domain transfer. The GLUE authors use the standard test set, for which they obtained labels privately from the MNLI authors, and evaluate on both sections.
Umbrella Term(s): Natural Language Inference
Data example:
  "Jon walked back to the town to the smithy." / "Jon traveled back to his hometown." = 1 = neutral
  "Tourist Information offices can be very helpful." / "Tourist Information offices are never of any help." = 2 = contradiction
  "I'm confused." / "Not all of it is very clear to me." = 0 = entailed
Homepage / paper: https://www.nyu.edu/projects/bowman/multinli/
Metric: Accuracy
QNLI
Description: The Stanford Question Answering Dataset (Rajpurkar et al., 2016; SQuAD) is a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator). GLUE automatically converts the original SQuAD dataset into a sentence-pair classification task by forming a pair between the question and each sentence in the corresponding context. The task is to determine whether the context sentence contains the answer to the question. Pairs with low lexical overlap between the question and the context sentence are filtered out. Specifically, the conversion selects all pairs in which the most similar sentence to the question was not the answer sentence, as well as an equal number of cases in which the correct sentence was the most similar to the question, but another distracting sentence was a close second. A sketch of the basic pairing step appears after this entry.
Umbrella Term(s): Question Answering / Natural Language Inference
Data example:
  "How was the Everton FC's crest redesign received by fans?" / "The redesign was poorly received by supporters, with a poll on an Everton fan site registering a 91% negative response to the crest." = 0 = answerable
  "In what year did Robert Louis Stevenson die?" / "Mission work in Samoa had begun in late 1830 by John Williams, of the London Missionary Society arriving in Sapapali'i from The Cook Islands and Tahiti." = 1 = not answerable
  "What is essential for the mating of the elements that create radio waves?" / "Antennas are required by any radio receiver or transmitter to couple its electrical connection to the electromagnetic field." = 0 = answerable
Homepage / paper: QNLI was created for GLUE; SQuAD: https://rajpurkar.github.io/SQuAD-explorer/
Metric: Accuracy
Size train / dev / test: 108k / 11k / 11k
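A minimal sketch of the pairing step described above: split a SQuAD context into sentences and label each question/sentence pair by whether it contains the answer span. This is a simplified illustration with a naive sentence splitter and an invented (question, context, answer) triple; it omits GLUE's lexical-overlap filtering.

# Minimal sketch: turning a SQuAD-style (question, context, answer) triple into
# QNLI-style question/sentence pairs. Sample data is invented for illustration;
# GLUE's lexical-overlap filtering step is not implemented here.
import re

def squad_to_pairs(question, context, answer_text):
    pairs = []
    # naive sentence split; a proper sentence tokenizer would be used in practice
    for sentence in re.split(r"(?<=[.!?])\s+", context):
        label = 0 if answer_text in sentence else 1   # 0 = contains answer, 1 = does not
        pairs.append((question, sentence, label))
    return pairs

for q, s, label in squad_to_pairs(
    question="In what year was the bridge completed?",
    context="Construction began in 1933. The bridge was completed in 1937. It remains in use today.",
    answer_text="1937",
):
    print(label, "|", q, "|", s)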
RTE
Description: The Recognizing Textual Entailment (RTE) datasets come from a series of annual challenges for the task of textual entailment, also known as NLI. GLUE combines the data from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009). Each example in these datasets consists of a premise sentence and a hypothesis sentence, gathered from various online news sources. The task is to predict whether the premise entails the hypothesis. All the data is converted to a two-class split (entailment or not entailment, where neutral and contradiction are collapsed into not entailment for challenges with three classes) for consistency.
Umbrella Term(s): Natural Language Inference
Data example:
  "In 2003, Yunus brought the microcredit revolution to the streets of Bangladesh to support more than 50,000 beggars, whom the Grameen Bank respectfully calls Struggling Members." / "Yunus supported more than 50,000 Struggling Members." = 0 = entailed
  "In 1997, Tyson bit off part of one of Evander Holyfield's ears in their rematch that led to Tyson's disqualification and suspension. In 2002, at a melee at a news conference in New York, Tyson bit champion Lennox Lewis' leg." / "Tyson bit off part of one of Evander Holyfield's ears in 2002." = 1 = not entailed
  "The marriage is planned to take place in the Kiev Monastery of the Caves, whose father superior, Bishop Pavel, is Yulia Timoshenko's spiritual father." / "Yulia Timoshenko is the daughter of Bishop Pavel." = 1 = not entailed
  "With its headquarters in Madrid, Spain, WTO is an inter-governmental body entrusted by the United Nations to promote and develop tourism." / "The WTO headquarters is located in Madrid, Spain." = 0 = entailed
Homepage / paper: https://aclweb.org/aclwiki/Textual_Entailment_Resource_Pool
Metric: Accuracy
Size train / dev / test: 2.7k / N/A / 3k
WNLI
Description: Based on the Winograd Schema Challenge, where the system must read a sentence with a pronoun and decide the referent from a list of choices; the examples are constructed to foil simple statistical methods. From the GLUE paper: the task (a slight relaxation of the original Winograd Schema Challenge) is to predict whether the sentence with the pronoun substituted is entailed by the original sentence.
Umbrella Term(s): Coreference / Natural Language Inference
Data example:
  Original Winograd schema: "The trophy didn’t fit into the suitcase because it was too [large/small]." Question: What was too [large/small]? Answer: the trophy / the suitcase
  "Lily spoke to Donna, breaking her concentration." / "Lily spoke to Donna, breaking Lily's concentration." = 0
  "I put the cake away in the refrigerator. It has a lot of leftovers in it." / "The refrigerator has a lot of leftovers in it." = 1
  "Susan knows all about Ann's personal problems because she is nosy." / "Ann is nosy." = 0
Homepage / paper: WNLI was introduced in the GLUE paper; the Winograd Schema Challenge is described at https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html
Notes: Difficult, as it requires commonsense understanding that may not necessarily be derivable from the text alone.
Metric: Accuracy
Size train / dev / test: 706 / N/A / 146
Download Script:
https://raw.githubusercontent.com/nyu-mll/jiant/master/scripts/download_glue_data.py
(Also available via the TensorFlow Datasets library, tfds.)
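A minimal sketch of loading one GLUE task through tfds (shown here for CoLA; the other tasks use config names such as "glue/sst2", "glue/mrpc", and so on):

# Minimal sketch: loading a GLUE task (CoLA) via the TensorFlow Datasets library.
import tensorflow_datasets as tfds

train_ds, val_ds = tfds.load("glue/cola", split=["train", "validation"])

for example in train_ds.take(2):
    # each CoLA example is a dict with 'sentence', 'label', and 'idx' features
    print(example["sentence"].numpy().decode(), int(example["label"]))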
Download Links:
"CoLA":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FCoLA.zip?alt=media&token=46d5e637-3411-4188-bc44-5809b5bfb5f4', "SST":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8', "MRPC":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2Fmrpc_dev_ids.tsv?alt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc', "QQP":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FQQP.zip?alt=media&token=700c6acf-160d-4d89-81d1-de4191d02cb5', "STS":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSTS-B.zip?alt=media&token=bddb94a7-8706-4e0d-a694-1109e12273b5', "MNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FMNLI.zip?alt=media&token=50329ea1-e339-40e2-809c-10c40afff3ce', "SNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSNLI.zip?alt=media&token=4afcfbb2-ff0c-4b2d-a09a-dbf07926f4df', "QNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FQNLI.zip?alt=media&token=c24cad61-f2df-4f04-9ab6-aa576fa829d0', "RTE":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FRTE.zip?alt=media&token=5efa7e85-a0bb-4f19-8ea2-9e1840f077fb', "WNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FWNLI.zip?alt=media&token=068ad0a0-ded7-4bd7-99a5-5e00222e0faf',