| Method | Category | Topic area | Summary |
| --- | --- | --- | --- |
| Perplexity | Precision and recall | General | "How much the model is surprised by seeing new data. The lower the perplexity, the better the training is." (A minimal computation sketch appears after this table.) |
| BLEU | Precision and recall | General | "BLEU score measures the quality of predicted text, referred to as the candidate, compared to a set of references." (See the n-gram overlap sketch after this table.) |
| MMLU | Precision and recall | General | "A multitask test-set consisting of multiple-choice questions from various branches of knowledge" |
| ROUGE | Reading comprehension | General | "Recall-Oriented Understudy for Gisting Evaluation: a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation." |
| METEOR | Reasoning capabilities | General | "A metric that measures the quality of generated text based on the alignment between the generated text and the reference text. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision." |
| BERTScore | Summarization | General | "An automatic evaluation metric for text generation. Analogously to common metrics, BERTSCORE computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems" |
| QuALITY | Reading comprehension | General | "A multiple-choice QA dataset that uses long articles of 2k–8k tokens. Dataset is crowdsourced and examples have unambiguous answers but are still challenging. Questions require consolidating information from multiple parts of the text, to prevent skimming." |
| MT-bench | Conversation | General | "A benchmark tailored for evaluating the proficiency of chat assistants in multi-turn conversations using GPT-4 as the judge." |
| QMSum | Summarization | Meetings | "A benchmark for query-based multi-domain meeting summarization, where models have to select and summarize relevant spans of meetings in response to a query." |
| TruthfulQA | Safety and truthfulness | General | "A benchmark for evaluating the truthfulness of LLMs in generating answers to questions constructed in a way that humans tend to answer the curated questions falsely due to false beliefs, biases, and misconceptions." |
| Bias Benchmark for QA (BBQ) | Safety and truthfulness | General | "A dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts." |
| Helpful, Honest, and Harmless (HHH) | Safety and truthfulness | General | "A dataset used for evaluating language models. It is pragmatically broken down into the categories of helpfulness, honesty/accuracy, and harmlessness. The dataset is formatted in terms of binary comparisons, often broken down from a ranked ordering of three or four possible responses to a given query or context. The goal of these evaluations is that on careful reflection, the vast majority of people would agree that the chosen response is better (more helpful, honest, and harmless) than the alternative offered for comparison." |
| HellaSwag | Commonsense reasoning | General | "A challenging benchmark for AI models that tests their ability to predict the ending of an incomplete narrative." |
| AI2 Reasoning Challenge (ARC) | Commonsense reasoning | Science | "The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The dataset contains only natural, grade-school science questions (authored for human tests), and is the largest public-domain set of this kind (7,787 questions)." |
| WinoGrande | Commonsense reasoning | General | "A Winograd schema is a pair of sentences that differ only in one or two words and that contain a referential ambiguity that is resolved in opposite directions in the two sentences. We have compiled a collection of Winograd schemas, designed so that the correct answer is obvious to the human reader, but cannot easily be found using selectional restrictions or statistical techniques over text corpora." |
| QASPER | Reading comprehension | NLP academic papers | "An information-seeking question answering (QA) dataset over academic research papers. Each question is written as a followup to the title and abstract of a particular paper, and the answer, if present, is identified in the rest of the paper, along with evidence required to arrive at it." |
| NarrativeQA | Reading comprehension | General | "NarrativeQA Manual is an English-language dataset of stories and corresponding questions designed to test reading comprehension, especially on long documents." |
| Physical Interaction Question Answering (PIQA) | Commonsense reasoning | Physical commonsense | A dataset designed "to evaluate language representations on their knowledge of physical commonsense. We focus on everyday situations with a preference for atypical solutions." |
| SIQA | Commonsense reasoning | Social commonsense | "The first large-scale benchmark for commonsense reasoning about social situations. SOCIAL IQA contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations" |
| OpenBookQA | Commonsense reasoning | Science | "The open book that comes with our questions is a set of 1326 elementary level science facts. Roughly 6000 questions probe an understanding of these facts and their application to novel situations. This requires combining an open book fact (e.g., metals conduct electricity) with broad common knowledge (e.g., a suit of armor is made of metal) obtained from other sources. While existing QA datasets over documents or knowledge bases, being generally self-contained, focus on linguistic understanding, OpenBookQA probes a deeper understanding of both the topic—in the context of common knowledge—and the language it is expressed in." |
| CommonsenseQA | Commonsense reasoning | General | "Dataset focusing on commonsense question answering, based on knowledge encoded in CONCEPTNET (Speer et al., 2017). We propose a method for generating commonsense questions at scale by asking crowd workers to author questions that describe the relation between concepts" |
| NaturalQuestions | Knowledge | General | "Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present." |
| TriviaQA | Knowledge | General | "TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples that: (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross sentence reasoning to find answers" |
| SQuAD | Reading comprehension | General | "Unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. Must correctly answer questions and identify when there is no answer." |
| QuAC | Reading comprehension | General | "14K information-seeking QA dialogs (100K questions in total). The dialogs involve two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context." |
| AGIEval | Reasoning capabilities | General | "Benchmark specifically designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests." |
| GSM8K | Reasoning capabilities | Math | "A dataset of 8.5K high quality linguistically diverse grade school math word problems." |
| MATH | Reasoning capabilities | Math | "To measure the problem-solving ability of machine learning models, we introduce the MATH dataset, which consists of 12,500 problems from high school math competitions." |
| BIG-Bench Hard (BBH) | Reasoning capabilities | General | "BIG-Bench is a collaborative benchmark that aims to quantitatively measure the capabilities and limitations of language models (Srivastava et al., 2022, Beyond the Imitation Game Benchmark). The benchmark has over 200 diverse text-based tasks in task categories including traditional NLP, mathematics, commonsense reasoning, and question-answering. In this paper, we curate BIG-Bench Hard (BBH), a subset of 23 particularly challenging BIG-Bench tasks (27 subtasks) for which no prior result from Srivastava et al. (2022) has outperformed the average human-rater score." |
| BoolQ | Reading comprehension | General | "16,000 naturally occurring yes/no questions in a dataset we call BoolQ (for Boolean Questions). Each question is paired with a paragraph from Wikipedia that an independent annotator has marked as containing the answer. The task is then to take a question and passage as input, and to return “yes” or “no” as output." |
| DROP | Reasoning capabilities | General | "Adversarially created reading comprehension benchmark, which requires models to navigate through references and execute discrete operations like addition or sorting." |
| CRASS | Reasoning capabilities | General | "We introduce the CRASS (counterfactual reasoning assessment) data set and benchmark utilizing questionized counterfactual conditionals as a novel and powerful tool to evaluate large language models." |
| RACE | Reasoning capabilities | English | "A new dataset for benchmark evaluation of methods in the reading comprehension task. Collected from the English exams for middle and high school Chinese students in the age range between 12 to 18, RACE consists of near 28,000 passages and near 100,000 questions generated by human experts (English instructors), and covers a variety of topics which are carefully designed for evaluating the students’ ability in understanding and reasoning." |
| LAMBADA | Reading comprehension | General | "A dataset to evaluate the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." |
| ACI Bench | Conversation | Medical | "The largest dataset to date tackling the problem of AI-assisted note generation from visit dialogue." |
| MS MARCO | Reasoning capabilities | General | "The dataset comprises of 1,010,916 anonymized questions---sampled from Bing's search query logs---each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages---extracted from 3,563,535 web documents retrieved by Bing---that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context, and finally (iii) rank a set of retrieved passages given a question." |
| ToxiGen | Safety and truthfulness | General | "A large-scale, machine-generated dataset of 274,186 toxic and benign statements about 13 minority groups with a focus on implicit hate speech that does not contain slurs or profanity. We use the dataset to test a model’s ability to both identify and generate toxic content." |
| Automated RAI Measurement Framework | Safety and truthfulness | General | "We present a framework for the automated measurement of responsible AI (RAI) metrics for large language models (LLMs) and associated products and services. Our framework for automatically measuring harms from LLMs builds on existing technical and sociotechnical expertise and leverages the capabilities of state-of-the-art LLMs, such as GPT-4. We use this framework to run through several case studies investigating how different LLMs may violate a range of RAI-related principles." |
| Codex HumanEval | Reasoning capabilities | Code | A dataset to "evaluate functional correctness on a set of 164 handwritten programming problems, which we call the HumanEval dataset. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem." |
| GRE | Knowledge | General | Graduate Record Examination, a general graduate school admissions test |
| MBE | Knowledge | Law | Multistate Bar Examination |
| USMLE | Knowledge | Medical | United States Medical Licensing Examination |
| NCLEX | Knowledge | Medical | National Council Licensure Examination (nursing licensing exam) |
| NAPLEX | Knowledge | Medical | North American Pharmacist Licensure Examination (pharmacist licensing exam) |
| PubMedQA | Reading comprehension | Medical | "A novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts." |
| MultiMedQA | Knowledge | Medical | "A benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online." |
| Quantitative Reasoning with Data (QRData) | Reasoning capabilities | Statistics | Benchmark "aiming to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data." |
| BIBench | Reasoning capabilities | Statistics | "A comprehensive benchmark designed to evaluate the data analysis capabilities of LLMs within the context of Business Intelligence (BI)." |
| LegalBench | Reasoning capabilities | Legal | "A collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning." |
| SeeGULL | Safety and truthfulness | General | "A broad-coverage stereotype dataset, built by utilizing generative capabilities of large language models such as PaLM, and GPT-3, and leveraging a globally diverse rater pool to validate the prevalence of those stereotypes in society. SeeGULL is in English, and contains stereotypes about identity groups spanning 178 countries across 8 different geo-political regions across 6 continents, as well as state-level identities within the US and India. We also include fine-grained offensiveness scores for different stereotypes and demonstrate their global disparities." |
| Biomedical Language Understanding Evaluation (BLUE) | Reading comprehension | Medical | "The benchmark consists of five tasks with ten datasets that cover both biomedical and clinical texts with different dataset sizes and difficulties." |
| RAG Triad | RAG evaluation | General | "An evaluation framework to assess large language model (LLM) responses for reliability and contextual integrity. It combines three evaluations: Context Relevance, Groundedness, and Answer Relevance. These evaluations help detect hallucinations in LLM responses by ensuring that context is relevant, responses are grounded, and answers align with user queries." |
| RAG instruct benchmark tester | RAG evaluation | Finance and Legal | "This is an updated benchmarking test dataset for 'retrieval augmented generation' (RAG) use cases in the enterprise, especially for financial services, and legal. This test dataset includes 200 questions with context passages pulled from common 'retrieval scenarios', e.g., financial news, earnings releases, contracts, invoices, technical articles, general news and short texts." |
| Needle in a haystack | RAG evaluation | General | "A simple 'needle in a haystack' analysis to test in-context retrieval ability of long context LLMs." (A minimal harness sketch appears after this table.) |
| Ragas (RAG assessment) | RAG evaluation | General | "A framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines." |
| MIRAGE | RAG evaluation | Medical | "Mirage is our proposed benchmark for Medical Information Retrieval-Augmented Generation Evaluation, which includes 7,663 questions from five commonly used QA datasets in biomedicine." |
| FoFo | Reasoning capabilities | Formatting | "Evaluating large language models’ (LLMs) ability to follow complex, domain-specific formats." |
| PLUE | Reading comprehension | Policy | "A multi-task benchmark for evaluating the privacy policy language understanding across various tasks." |
| PlanBench | Reasoning capabilities | General | "An extensible benchmark suite based on the kinds of domains used in the automated planning community, especially in the International Planning Competition, to test the capabilities of LLMs in planning or reasoning about actions and change" |
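Perplexity is one of the few entries above that is a closed-form metric rather than a dataset, so a short worked example may help. The sketch below is a minimal illustration, not any benchmark's reference implementation: it treats perplexity as the exponential of the mean negative log-likelihood per token, and assumes the token log-probabilities are supplied by the model under evaluation (for example, via an API's logprobs output).

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probabilities the evaluated model assigned
    to each observed token (assumed to come from the model or its API).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token is, on average,
# "choosing" among four equally likely continuations: perplexity 4.
print(perplexity([math.log(0.25)] * 10))  # -> 4.0
```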
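BLEU, ROUGE, and METEOR all score a candidate text against one or more references via n-gram precision and recall. As a hedged illustration only, the snippet below computes sentence-level BLEU with NLTK and ROUGE-1/ROUGE-L with the rouge-score package; it assumes both packages are installed (pip install nltk rouge-score) and uses simple whitespace tokenization, which official benchmark harnesses may handle differently.

```python
# Assumes: pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

# BLEU: modified n-gram precision with a brevity penalty.
# Smoothing avoids zero scores when higher-order n-grams never match.
bleu = sentence_bleu(
    [reference.split()],          # list of tokenized references
    candidate.split(),            # tokenized candidate
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented n-gram / longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```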
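The "needle in a haystack" entry describes a procedure more than a fixed dataset, so a skeleton of the idea follows. This is a hypothetical sketch: complete(prompt) stands in for whatever completion call your model exposes, and the needle text, filler sentence, and placement depths are illustrative choices rather than the canonical ones.

```python
# Hypothetical sketch of a needle-in-a-haystack probe.
# `complete` is a placeholder for your model's text-completion call.
from typing import Callable

NEEDLE = "The secret passphrase is 'blue giraffe'."
QUESTION = "What is the secret passphrase mentioned in the document?"
FILLER_SENTENCE = "This is unrelated filler text about nothing in particular. "

def build_haystack(total_sentences: int, depth: float) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER_SENTENCE] * total_sentences
    sentences.insert(int(depth * total_sentences), NEEDLE + " ")
    return "".join(sentences)

def run_probe(complete: Callable[[str], str],
              total_sentences: int = 2000,
              depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict:
    """Return pass/fail per depth: did the model retrieve the needle?"""
    results = {}
    for depth in depths:
        prompt = (build_haystack(total_sentences, depth)
                  + "\n\nQuestion: " + QUESTION + "\nAnswer:")
        answer = complete(prompt)
        results[depth] = "blue giraffe" in answer.lower()
    return results

if __name__ == "__main__":
    # Toy stand-in model that "reads" the whole prompt perfectly.
    def fake_model(prompt: str) -> str:
        return "blue giraffe" if NEEDLE in prompt else "unknown"

    print(run_probe(fake_model, total_sentences=200))
```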