| Method | Category | Topic area | Summary |
| --- | --- | --- | --- |
| Perplexity | Precision and recall | General | "How much the model is surprised by seeing new data. The lower the perplexity, the better the training is." (A minimal computation sketch appears after this table.) |
| BLEU | Precision and recall | General | "BLEU score measures the quality of predicted text, referred to as the candidate, compared to a set of references." (See the n-gram overlap sketch after this table.) |
| MMLU | Precision and recall | General | "A multitask test-set consisting of multiple-choice questions from various branches of knowledge" |
| ROUGE | Reading comprehension | General | "Recall-Oriented Understudy for Gisting Evaluation: a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation." |
| METEOR | Reasoning capabilities | General | "A metric that measures the quality of generated text based on the alignment between the generated text and the reference text. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision." |
| BERTScore | Summarization | General | "An automatic evaluation metric for text generation. Analogously to common metrics, BERTSCORE computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems" |
| QuALITY | Reading comprehension | General | "A multiple-choice QA dataset that uses long articles of 2k–8k tokens. Dataset is crowdsourced and examples have unambiguous answers but are still challenging. Questions require consolidating information from multiple parts of the text, to prevent skimming." |
| MT-bench | Conversation | General | "A benchmark tailored for evaluating the proficiency of chat assistants in multi-turn conversations using GPT-4 as the judge." |
| QMSum | Summarization | Meetings | "A benchmark for query-based multi-domain meeting summarization, where models have to select and summarize relevant spans of meetings in response to a query." |
| TruthfulQA | Safety and truthfulness | General | "A benchmark for evaluating the truthfulness of LLMs in generating answers to questions constructed in a way that humans tend to answer the curated questions falsely due to false beliefs, biases, and misconceptions." |
| Bias Benchmark for QA (BBQ) | Safety and truthfulness | General | "A dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts." |
| Helpful, Honest, and Harmless (HHH) | Safety and truthfulness | General | "A dataset used for evaluating language models. It is pragmatically broken down into the categories of helpfulness, honesty/accuracy, and harmlessness. The dataset is formatted in terms of binary comparisons, often broken down from a ranked ordering of three or four possible responses to a given query or context. The goal of these evaluations is that on careful reflection, the vast majority of people would agree that the chosen response is better (more helpful, honest, and harmless) than the alternative offered for comparison." |
| HellaSwag | Commonsense reasoning | General | "A challenging benchmark for AI models that tests their ability to predict the ending of an incomplete narrative." |
| AI2 Reasoning Challenge (ARC) | Commonsense reasoning | Science | "The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The dataset contains only natural, grade-school science questions (authored for human tests), and is the largest public-domain set of this kind (7,787 questions)." |
| WinoGrande | Commonsense reasoning | General | "A Winograd schema is a pair of sentences that differ only in one or two words and that contain a referential ambiguity that is resolved in opposite directions in the two sentences. We have compiled a collection of Winograd schemas, designed so that the correct answer is obvious to the human reader, but cannot easily be found using selectional restrictions or statistical techniques over text corpora." |
| QASPER | Reading comprehension | NLP academic papers | "An information-seeking question answering (QA) dataset over academic research papers. Each question is written as a followup to the title and abstract of a particular paper, and the answer, if present, is identified in the rest of the paper, along with evidence required to arrive at it." |
| NarrativeQA | Reading comprehension | General | "NarrativeQA Manual is an English-language dataset of stories and corresponding questions designed to test reading comprehension, especially on long documents." |
| Physical Interaction Question Answering (PIQA) | Commonsense reasoning | Physical commonsense | A dataset designed "to evaluate language representations on their knowledge of physical commonsense. We focus on everyday situations with a preference for atypical solutions." |
| SIQA | Commonsense reasoning | Social commonsense | "The first large-scale benchmark for commonsense reasoning about social situations. SOCIAL IQA contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations" |
| OpenBookQA | Commonsense reasoning | Science | "The open book that comes with our questions is a set of 1326 elementary level science facts. Roughly 6000 questions probe an understanding of these facts and their application to novel situations. This requires combining an open book fact (e.g., metals conduct electricity) with broad common knowledge (e.g., a suit of armor is made of metal) obtained from other sources. While existing QA datasets over documents or knowledge bases, being generally self-contained, focus on linguistic understanding, OpenBookQA probes a deeper understanding of both the topic—in the context of common knowledge—and the language it is expressed in." |
| CommonsenseQA | Commonsense reasoning | General | "Dataset focusing on commonsense question answering, based on knowledge encoded in CONCEPTNET (Speer et al., 2017). We propose a method for generating commonsense questions at scale by asking crowd workers to author questions that describe the relation between concepts" |
| NaturalQuestions | Knowledge | General | "Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present." |
| TriviaQA | Knowledge | General | "TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples that: (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross sentence reasoning to find answers" |
| SQuAD | Reading comprehension | General | "Unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. Must correctly answer questions and identify when there is no answer." |
| QuAC | Reading comprehension | General | "14K information-seeking QA dialogs (100K questions in total). The dialogs involve two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context." |
| AGIEval | Reasoning capabilities | General | "Benchmark specifically designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests." |
| GSM8K | Reasoning capabilities | Math | "A dataset of 8.5K high quality linguistically diverse grade school math word problems." |
| MATH | Reasoning capabilities | Math | "To measure the problem-solving ability of machine learning models, we introduce the MATH dataset, which consists of 12,500 problems from high school math competitions." |
| BIG-Bench Hard (BBH) | Reasoning capabilities | General | "BIG-Bench is a collaborative benchmark that aims to quantitatively measure the capabilities and limitations of language models (Srivastava et al., 2022, Beyond the Imitation Game Benchmark). The benchmark has over 200 diverse text-based tasks in task categories including traditional NLP, mathematics, commonsense reasoning, and question-answering. In this paper, we curate BIG-Bench Hard (BBH), a subset of 23 particularly challenging BIG-Bench tasks (27 subtasks) for which no prior result from Srivastava et al. (2022) has outperformed the average human-rater score." |
| BoolQ | Reading comprehension | General | "16,000 naturally occurring yes/no questions in a dataset we call BoolQ (for Boolean Questions). Each question is paired with a paragraph from Wikipedia that an independent annotator has marked as containing the answer. The task is then to take a question and passage as input, and to return “yes” or “no” as output." |
| DROP | Reasoning capabilities | General | "Adversarially created reading comprehension benchmark, which requires models to navigate through references and execute discrete operations like addition or sorting." |
| CRASS | Reasoning capabilities | General | "We introduce the CRASS (counterfactual reasoning assessment) data set and benchmark utilizing questionized counterfactual conditionals as a novel and powerful tool to evaluate large language models." |
| RACE | Reasoning capabilities | English | "A new dataset for benchmark evaluation of methods in the reading comprehension task. Collected from the English exams for middle and high school Chinese students in the age range between 12 to 18, RACE consists of near 28,000 passages and near 100,000 questions generated by human experts (English instructors), and covers a variety of topics which are carefully designed for evaluating the students’ ability in understanding and reasoning." |
| LAMBADA | Reading comprehension | General | "A dataset to evaluate the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." |
| ACI Bench | Conversation | Medical | "The largest dataset to date tackling the problem of AI-assisted note generation from visit dialogue." |
| MS MARCO | Reasoning capabilities | General | "The dataset comprises of 1,010,916 anonymized questions---sampled from Bing's search query logs---each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages---extracted from 3,563,535 web documents retrieved by Bing---that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context, and finally (iii) rank a set of retrieved passages given a question." |
| ToxiGen | Safety and truthfulness | General | "A large-scale, machine-generated dataset of 274,186 toxic and benign statements about 13 minority groups with a focus on implicit hate speech that does not contain slurs or profanity. We use the dataset to test a model’s ability to both identify and generate toxic content." |
| Automated RAI Measurement Framework | Safety and truthfulness | General | "We present a framework for the automated measurement of responsible AI (RAI) metrics for large language models (LLMs) and associated products and services. Our framework for automatically measuring harms from LLMs builds on existing technical and sociotechnical expertise and leverages the capabilities of state-of-the-art LLMs, such as GPT-4. We use this framework to run through several case studies investigating how different LLMs may violate a range of RAI-related principles." |
| Codex HumanEval | Reasoning capabilities | Code | A dataset to "evaluate functional correctness on a set of 164 handwritten programming problems, which we call the HumanEval dataset. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem." |
| GRE | Knowledge | General | Graduate Record Examination, a general graduate school admissions test |
| MBE | Knowledge | Law | Multistate Bar Examination |
| USMLE | Knowledge | Medical | United States Medical Licensing Examination |
| NCLEX | Knowledge | Medical | National Council Licensure Examination (nursing licensing exam) |
| NAPLEX | Knowledge | Medical | North American Pharmacist Licensure Examination (pharmacist licensing exam) |
| PubMedQA | Reading comprehension | Medical | "A novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts." |
| MultiMedQA | Knowledge | Medical | "A benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online." |
| Quantitative Reasoning with Data (QRData) | Reasoning capabilities | Statistics | Benchmark "aiming to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data." |
| BIBench | Reasoning capabilities | Statistics | "A comprehensive benchmark designed to evaluate the data analysis capabilities of LLMs within the context of Business Intelligence (BI)." |
| LegalBench | Reasoning capabilities | Legal | "A collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning." |
| SeeGULL | Safety and truthfulness | General | "A broad-coverage stereotype dataset, built by utilizing generative capabilities of large language models such as PaLM, and GPT-3, and leveraging a globally diverse rater pool to validate the prevalence of those stereotypes in society. SeeGULL is in English, and contains stereotypes about identity groups spanning 178 countries across 8 different geo-political regions across 6 continents, as well as state-level identities within the US and India. We also include fine-grained offensiveness scores for different stereotypes and demonstrate their global disparities." |
| Biomedical Language Understanding Evaluation (BLUE) | Reading comprehension | Medical | "The benchmark consists of five tasks with ten datasets that cover both biomedical and clinical texts with different dataset sizes and difficulties." |
| RAG Triad | RAG evaluation | General | "An evaluation framework to assess large language model (LLM) responses for reliability and contextual integrity. It combines three evaluations: Context Relevance, Groundedness, and Answer Relevance. These evaluations help detect hallucinations in LLM responses by ensuring that context is relevant, responses are grounded, and answers align with user queries." |
| RAG instruct benchmark tester | RAG evaluation | Finance and Legal | "This is an updated benchmarking test dataset for 'retrieval augmented generation' (RAG) use cases in the enterprise, especially for financial services, and legal. This test dataset includes 200 questions with context passages pulled from common 'retrieval scenarios', e.g., financial news, earnings releases, contracts, invoices, technical articles, general news and short texts." |
| Needle in a haystack | RAG evaluation | General | "A simple 'needle in a haystack' analysis to test in-context retrieval ability of long context LLMs." (A minimal harness sketch appears after this table.) |
| Ragas (RAG assessment) | RAG evaluation | General | "A framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines." |
| MIRAGE | RAG evaluation | Medical | "Mirage is our proposed benchmark for Medical Information Retrieval-Augmented Generation Evaluation, which includes 7,663 questions from five commonly used QA datasets in biomedicine." |
| FoFo | Reasoning capabilities | Formatting | "Evaluating large language models’ (LLMs) ability to follow complex, domain-specific formats." |
| PLUE | Reading comprehension | Policy | "A multi-task benchmark for evaluating the privacy policy language understanding across various tasks." |
| PlanBench | Reasoning capabilities | General | "An extensible benchmark suite based on the kinds of domains used in the automated planning community, especially in the International Planning Competition, to test the capabilities of LLMs in planning or reasoning about actions and change" |
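Perplexity is one of the few entries above that is a closed-form metric rather than a dataset, so a short worked example may help. The sketch below is a minimal illustration, not any benchmark's reference implementation: it treats perplexity as the exponential of the mean negative log-likelihood per token, and assumes the token log-probabilities are supplied by the model under evaluation (for example, via an API's logprobs output).

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probabilities the evaluated model assigned
    to each observed token (assumed to come from the model or its API).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token is, on average,
# "choosing" among four equally likely continuations: perplexity 4.
print(perplexity([math.log(0.25)] * 10))  # -> 4.0
```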
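BLEU, ROUGE, and METEOR all score a candidate text against one or more references via n-gram precision and recall. As a hedged illustration only, the snippet below computes sentence-level BLEU with NLTK and ROUGE-1/ROUGE-L with the rouge-score package; it assumes both packages are installed (pip install nltk rouge-score) and uses simple whitespace tokenization, which official benchmark harnesses may handle differently.

```python
# Assumes: pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

# BLEU: modified n-gram precision with a brevity penalty.
# Smoothing avoids zero scores when higher-order n-grams never match.
bleu = sentence_bleu(
    [reference.split()],          # list of tokenized references
    candidate.split(),            # tokenized candidate
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented n-gram / longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```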
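The "needle in a haystack" entry describes a procedure more than a fixed dataset, so a skeleton of the idea follows. This is a hypothetical sketch: complete(prompt) stands in for whatever completion call your model exposes, and the needle text, filler sentence, and placement depths are illustrative choices rather than the canonical ones.

```python
# Hypothetical sketch of a needle-in-a-haystack probe.
# `complete` is a placeholder for your model's text-completion call.
from typing import Callable

NEEDLE = "The secret passphrase is 'blue giraffe'."
QUESTION = "What is the secret passphrase mentioned in the document?"
FILLER_SENTENCE = "This is unrelated filler text about nothing in particular. "

def build_haystack(total_sentences: int, depth: float) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER_SENTENCE] * total_sentences
    sentences.insert(int(depth * total_sentences), NEEDLE + " ")
    return "".join(sentences)

def run_probe(complete: Callable[[str], str],
              total_sentences: int = 2000,
              depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict:
    """Return pass/fail per depth: did the model retrieve the needle?"""
    results = {}
    for depth in depths:
        prompt = (build_haystack(total_sentences, depth)
                  + "\n\nQuestion: " + QUESTION + "\nAnswer:")
        answer = complete(prompt)
        results[depth] = "blue giraffe" in answer.lower()
    return results

if __name__ == "__main__":
    # Toy stand-in model that "reads" the whole prompt perfectly.
    def fake_model(prompt: str) -> str:
        return "blue giraffe" if NEEDLE in prompt else "unknown"

    print(run_probe(fake_model, total_sentences=200))
```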