# | data_name | purpose_type | purpose_tags | purpose_stated | purpose_llmdev | entries_type | entries_languages | entries_n | entries_unit | entries_detail | creation_creator_type | creation_source_type | creation_detail | access_git_url | access_hf_url | access_license | publication_date | publication_affils | publication_sector | publication_name | publication_venue | publication_url | other_notes | other_date_added |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | HarmEval | broad safety | | evaluate LLM misuse scenarios | eval only | chat | English | 550 | prompts | Each prompt is an unsafe question or instruction. | machine | | based on OpenAI and Meta usage policies | https://github.com/NeuralSentinel/SafeInfer | not available | unspecified | 25-Feb-2025 | IIT Kharagpur, Microsoft | mixed | Banerjee et al.: "SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models" | AAAI 2025 | https://ojs.aaai.org/index.php/AAAI/article/view/34927 | - covers 11 categories: Illegal Activity, Child Abuse Content, Physical Harm, Economic Harm, Political Campaigning, Privacy Violation Activity, Tailored Financial Advice, Fraud/Deception, Hate/Harass/Violence, Adult Content, Malware | 4.4.2025 | |
3 | ImplicitBias | bias | sociodemographics | evaluate implicit bias in LLMs | eval only | other | English | 33,600 | prompts | Each prompt is either an Implicit Association Test word association task or a decision scenario. | hybrid | mixed | partly based on IAT templates, with custom sociodemographic attributes, and LLM-generated decision scenarios | https://github.com/baixuechunzi/llm-implicit-bias | not available | MIT | 20-Feb-2025 | University of Chicago, Stanford University, New York University, Princeton University | academia / npo | Bai et al.: "Explicitly unbiased large language models still form biased associations" | PNAS | https://www.pnas.org/doi/10.1073/pnas.2416228122 | - separate datasets for measuring implicit bias and decision bias | 21.3.2025 | |
4 | DifferenceAwareness | bias | sociodemographics | evaluate difference awareness in LLMs | eval only | chat | English | 16,000 | prompts | Each prompt is a question. | human | mixed | some benchmarks sampled from others, some created by hand | https://github.com/Angelina-Wang/difference_awareness | not available | CC0 1.0 Universal | 4-Feb-2025 | Stanford University | academia / npo | Wang et al.: "Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs" | arxiv | https://arxiv.org/abs/2502.01926 | - consists of 8 benchmarks with 2k questions each - 4 benchmarks are descriptive, 4 normative | 21.3.2025 | |
5 | JBBBehaviours | broad safety | used to test adversarial method | evaluate effectiveness of different jailbreaking methods | eval only | chat | English | 100 | prompts | Each prompt is an unsafe question or instruction. | hybrid | mixed | sampled from AdvBench and HarmBench, plus hand-written examples | https://github.com/JailbreakBench/jailbreakbench | https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors | MIT | 13-Dec-2024 | University of Pennsylvania, ETH Zurich, EPFL, Sony AI | mixed | Chao et al.: "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models" | NeurIPS 2024 (D&B) | https://arxiv.org/abs/2404.01318 | - covers 10 safety categories - comes with 100 benign prompts from XSTest | 01.08.2024 | |
6 | MedSafetyBench | narrow safety | medical safety | measure the medical safety of LLMs | train and eval | chat | English | 1,800 | conversations | Each conversation is an unsafe medical request with an associated safe response. | machine | original | generated by GPT-4 and llama-2-7b | https://github.com/AI4LIFE-GROUP/med-safety-bench | not available | MIT | 13-Dec-2024 | Harvard University | academia / npo | Han et al.: "MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models" | NeurIPS 2024 (D&B) | https://arxiv.org/abs/2403.03744 | - constructed around the AMA’s Principles of Medical Ethics | 01.08.2024 | |
7 | PRISM | value alignment | cultural / social | capture diversity of human preferences over LLM behaviours | train and eval | chat | English | 8,011 | conversations | Conversations can be multi-turn, with user input and responses from one or multiple LLMs. | human | original | written by crowdworkers | https://github.com/HannahKirk/prism-alignment | https://huggingface.co/datasets/HannahRoseKirk/prism-alignment | CC BY-NC 4.0 | 13-Dec-2024 | University of Oxford, University of Pennsylvania, Bocconi University, AWS AI Labs, ML Commons, UCL, Cohere, Meta AI, NYU, Contextual AI, Meedan | mixed | Kirk et al.: "The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models" | NeurIPS 2024 (D&B) | https://arxiv.org/abs/2404.16019 | - collected from 1,500 participants residing in 38 different countries - also comes with survey on each participant | 01.08.2024 | |
8 | WildGuardMix | other | content moderation | train and evaluate content moderation guardrails | other | chat | English | 86,759 | examples | Each example is either a prompt or a prompt + response annotated for safety. | hybrid | mixed | synthetic data (87%), in-the-wild user-LLM interactions (11%), and existing annotator-written data (2%) | not available | https://huggingface.co/datasets/allenai/wildguardmix | ODC BY | 13-Dec-2024 | Allen AI, University of Washington, Seoul National University | mixed | Han et al.: "WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs" | NeurIPS 2024 (D&B) | https://arxiv.org/abs/2406.18495 | - consists of two splits, WildGuardTrain and WildGuardTest | 17.12.2024 | |
9 | SGBench | broad safety | benchmark | evaluate the generalisation of LLM safety across various tasks and prompt types | eval only | chat | English | 1,442 | prompts | Each prompt is a malicious instruction or question, which comes in multiple formats (direct query, jailbreak, multiple-choice or safety classification) | human | sampled | sampled from AdvBench, HarmfulQA, BeaverTails and SaladBench, then expanded using templates and LLMs | https://github.com/MurrayTom/SG-Bench | not available | GPL 3.0 | 13-Dec-2024 | Peking University | academia / npo | Mou et al.: "SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types" | NeurIPS 2024 (D&B) | https://arxiv.org/abs/2410.21965 | - comes in multiple formats (chat, multiple choice, classification) | 27.02.2024 | |
10 | StrongREJECT | broad safety | used to test adversarial method | better investigate the effectiveness of different jailbreaking techniques | eval only | chat | English | 313 | prompts | Each prompt is a 'forbidden question' in one of six categories. | hybrid | mixed | human-written and curated questions from existing datasets, plus LLM-generated prompts verified for quality and relevance | https://github.com/alexandrasouly/strongreject/tree/main | not available | MIT | 13-Dec-2024 | UC Berkeley | academia / npo | Souly et al.: "A StrongREJECT for Empty Jailbreaks" | NeurIPS 2024 (D&B) | https://arxiv.org/abs/2402.10260 | - focus of the work is adversarial / to jailbreak LLMs - covers 6 question categories: disinformation/deception, hate/harassment/discrimination, illegal goods/services, non-violent crimes, sexual content, violence | 27.02.2024 | |
11 | CoCoNot | narrow safety | non-compliance | evaluate and improve contextual non-compliance in LLMs | train and eval | chat | English | 12,478 | prompts | Each prompt is a question or instruction that models should not comply with. | hybrid | original | human-written seed prompts expanded using LLMs | https://github.com/allenai/noncompliance | https://huggingface.co/datasets/allenai/coconot | MIT | 13-Dec-2024 | Allen AI, University of Washington, Ohio State University, Microsoft Research, Samaya AI, Nvidia | mixed | Brahman et al.: "The Art of Saying No: Contextual Noncompliance in Language Models" | NeurIPS 2024 (D&B) | https://arxiv.org/abs/2407.12043 | - covers 5 categories of non-compliance - comes with a contrast set of requests that should be complied with | 17.12.2024 | |
12 | WildJailbreak | broad safety | | train LLMs to be safe | train only | chat | English | 261,534 | conversations | Each conversation is single-turn, containing a potentially unsafe prompt and a safe model response. | machine | original | generated by different LLMs prompted with example prompts and jailbreak techniques | not available | https://huggingface.co/datasets/allenai/wildjailbreak | ODC BY | 13-Dec-2024 | University of Washington, Allen AI, Seoul National University, Carnegie Mellon University | mixed | Jiang et al.: "WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models" | NeurIPS 2024 | https://openreview.net/forum?id=n5R6TvBVcX&noteId=f3XHMKAJeK | - contains Vanilla and Adversarial portions - comes with a smaller test set of only prompts | 17.12.2024 | |
13 | MultiTP | value alignment | moral | evaluate LLM moral decision-making across many languages | eval only | multiple choice | Afrikaans, Amharic, Arabic, Azerbaijani, Belarusian, Bulgarian, Bengali, Bosnian, Catalan, Cebuano, Corsican, Czech, Welsh, Danish, German, Modern Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Western Frisian, Irish, Scottish Gaelic, Galician, Gujarati, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Croatian, Haitian, Hungarian, Armenian, Indonesian, Igbo, Icelandic, Italian, Modern Hebrew, Japanese, Javanese, Georgian, Kazakh, Central Khmer, Kannada, Korean, Kurdish, Kirghiz, Latin, Luxembourgish, Lao, Lithuanian, Latvian, Malagasy, Maori, Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Nepali, Dutch, Norwegian, Nyanja, Oriya, Panjabi, Polish, Pushto, Portuguese, Romanian, Russian, Sindhi, Sinhala, Slovak, Slovenian, Samoan, Shona, Somali, Albanian, Serbian, Southern Sotho, Sundanese, Swedish, Swahili, Tamil, Telugu, Tajik, Thai, Tagalog, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Xhosa, Yiddish, Yoruba, Chinese, Chinese (Traditional), Zulu | 107,000 | binary-choice questions | Each question is a trolley-style moral choice scenario. | hybrid | mixed | sampled from Moral Machines, then expanded with templated augmentations, then auto-translated into 106 languages | https://github.com/causalNLP/moralmachine | not available | MIT | 13-Dec-2024 | MPI, ETH Zurich, Allen AI, University of Washington, University of Michigan, University of Trieste | academia / npo | Jin et al.: "Multilingual Trolley Problems for Language Models" | Pluralistic Alignment Workshop at NeurIPS 2024 | https://arxiv.org/abs/2407.02273 | - covers 106 languages + English - non-English languages are auto-translated | 01.08.2024 | |
14 | DoAnythingNow | narrow safety | jailbreak, adversarial goal in data creation | characterise and evaluate in-the-wild LLM jailbreak prompts | eval only | chat | English | 15,140 | prompts | Each prompt is an instruction or question, sometimes with a jailbreak. | human | sampled | written by users on Reddit, Discord, websites and in other datasets | https://github.com/verazuo/jailbreak_llms | not available | MIT | 9-Dec-2024 | CISPA Helmholtz Center for Information Security, NetApp | mixed | Shen et al.: "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models" | CCS 2024 | https://dl.acm.org/doi/10.1145/3658644.3670388 | - there are 1,405 jailbreak prompts among the full prompt set | 11.01.2024 | |
15 | ForbiddenQuestions | broad safety | used to test adversarial method | evaluate whether LLMs answer questions that violate OpenAI's usage policy | eval only | chat | English | 390 | prompts | Each prompt is a question targeting behaviour disallowed by OpenAI. | machine | original | GPT-written templates expanded by combination | https://github.com/verazuo/jailbreak_llms | not available | MIT | 9-Dec-2024 | CISPA Helmholtz Center for Information Security, NetApp | mixed | Shen et al.: "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models" | CCS 2024 | https://dl.acm.org/doi/10.1145/3658644.3670388 | - covers 13 "forbidden" scenarios taken from the OpenAI usage policy | 11.01.2024 | |
16 | CoSafe | broad safety | | evaluate LLM safety in dialogue coreference | eval only | chat | English | 1,400 | conversations | Each conversation is multi-turn with the final question being unsafe. | hybrid | mixed | sampled from BeaverTails, then expanded with GPT-4 into multi-turn conversations | https://github.com/ErxinYu/CoSafe-Dataset | not available | unspecified | 12-Nov-2024 | Hong Kong Polytechnic University, Huawei | mixed | Yu et al.: "CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference" | EMNLP 2024 (Main) | https://aclanthology.org/2024.emnlp-main.968/ | - focuses on multi-turn conversations - covers 14 categories of harm from BeaverTails | 01.08.2024 | |
17 | SoFa | bias | sociodemographics | evaluate social biases of LLMs | eval only | other | English | 1,490,120 | sentences | Each sentence mentions one sociodemographic group. | hybrid | mixed | templates sampled from SBIC, then combined with sociodemographic groups | not available | https://huggingface.co/datasets/copenlu/sofa | MIT | 12-Nov-2024 | University of Pisa, University of Copenhagen | academia / npo | Manerba et al.: "Social Bias Probing: Fairness Benchmarking for Language Models" | EMNLP 2024 (Main) | https://aclanthology.org/2024.emnlp-main.812/ | - evaluation is based on sentence perplexity - similar to StereoSet and CrowS-Pairs | 17.12.2024 | |
18 | ArabicAdvBench | broad safety | | evaluate safety risks of LLMs in (different forms of) Arabic | eval only | chat | Arabic | 520 | prompts | Each prompt is a harmful instruction with an associated jailbreak prompt. | machine | mixed | translated from AdvBench (which is machine-generated) | https://github.com/SecureDL/arabic_jailbreak/tree/main/data | not available | MIT | 12-Nov-2024 | University of Central Florida | academia / npo | Al Ghanim et al.: "Jailbreaking LLMs with Arabic Transliteration and Arabizi" | EMNLP 2024 (Main) | https://aclanthology.org/2024.emnlp-main.1034 | - translated from AdvBench prompts - comes in different versions of Arabic | 17.12.2024 | |
19 | AyaRedTeaming | broad safety | adversarial goal in data creation | provide a testbed for exploring alignment across global and local preferences | train and eval | chat | Arabic, English, Filipino, French, Hindi, Russian, Serbian, Spanish | 7,419 | prompts | Each prompt is a harmful question or instruction. | human | original | written by paid native-speaking annotators | not available | https://huggingface.co/datasets/CohereForAI/aya_redteaming | Apache 2.0 | 12-Nov-2024 | Cohere for AI, Cohere | mixed | Aakanksha et al.: "The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm" | EMNLP 2024 (Main) | https://aclanthology.org/2024.emnlp-main.671/ | - distinguishes between global and local harm - covers 9 harm categories | 17.12.2024 | |
20 | UltraSafety | broad safety | | create data for safety fine-tuning of LLMs | train only | chat | English | 3,000 | prompts | Each prompt is a harmful instruction with an associated jailbreak prompt. | machine | mixed | sampled from AdvBench and MaliciousInstruct, then expanded with SelfInstruct | not available | https://huggingface.co/datasets/openbmb/UltraSafety | MIT | 12-Nov-2024 | Renmin University of China, Tsinghua University, WeChat AI, Tencent | mixed | Guo et al.: "Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment" | EMNLP 2024 (Main) | https://aclanthology.org/2024.emnlp-main.85/ | - comes with responses from different models, assessed for how safe they are by a GPT classifier | 01.08.2024 | |
21 | DeMET | bias | gender | evaluate gender bias in LLM decision-making | eval only | multiple choice | English | 29 | scenarios | Each scenario describes a married couple (with two name placeholders) facing a decision. | human | original | written by the authors inspired by WHO questionnaire | https://github.com/sharonlevy/GenderBiasScenarios | not available | MIT | 12-Nov-2024 | Johns Hopkins University, Northeastern Illinois University | academia / npo | Levy et al.: "Gender Bias in Decision-Making with Large Language Models: A Study of Relationship Conflicts" | EMNLP 2024 (Findings) | https://aclanthology.org/2024.findings-emnlp.331 | - comes with list of gender-specific / neutral names to insert into scenario templates | 17.12.2024 | |
22 | GEST | bias | gender | measure gender-stereotypical reasoning in language models and machine translation systems | eval only | other | Belarusian, Russian, Ukrainian, Croatian, Serbian, Slovene, Czech, Polish, Slovak, English | 3,565 | sentences | Each sentence corresponds to a specific gender stereotype. | human | original | written by professional translators, then validated by the authors | https://github.com/kinit-sk/gest | https://huggingface.co/datasets/kinit/gest | Apache 2.0 | 12-Nov-2024 | Kempelen Institute of Intelligent Technologies | academia / npo | Pikuliak et al.: "Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling" | EMNLP 2024 (Findings) | https://aclanthology.org/2024.findings-emnlp.173/ | - data can be used to evaluate MLM or MT models - covers 16 specific gender stereotypes (e.g. 'women are beautiful') - covers 1 category of bias: gender | 05.02.2024 | |
23 | GenMO | bias | gender | evaluate gender bias in LLM moral decision-making | eval only | multiple choice | English | 908 | scenarios | Each scenario describes a person in an everyday situation where their actions can be judged to be moral or not. | human | sampled | sampled from MoralStories, ETHICS, and SocialChemistry | https://github.com/divij30bajaj/GenMO | not available | unspecified | 12-Nov-2024 | Texas A&M University | academia / npo | Bajaj et al.: "Evaluating Gender Bias of LLMs in Making Morality Judgements" | EMNLP 2024 (Findings) | https://aclanthology.org/2024.findings-emnlp.928 | - models are asked whether the action described in the prompt is moral or not | 17.12.2024 | |
24 | SGXSTest | narrow safety | false refusal | evaluate exaggerated safety / false refusal in LLMs for the Singaporean context | eval only | chat | English | 200 | prompts | Each prompt is a simple question. | human | original | created based on XSTest | https://github.com/walledai/walledeval | https://huggingface.co/datasets/walledai/SGXSTest | Apache 2.0 | 12-Nov-2024 | Walled AI | industry | Gupta et al.: "WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models" | EMNLP 2024 (Demo) | https://aclanthology.org/2024.emnlp-demo.42/ | - half safe and half unsafe prompts, contrasting each other | 17.12.2024 | |
25 | HiXSTest | narrow safety | false refusal | evaluate exaggerated safety / false refusal behaviour in Hindi LLMs | eval only | chat | Hindi | 50 | prompts | Each prompt is a simple question. | human | original | created based on XSTest | https://github.com/walledai/walledeval | https://huggingface.co/datasets/walledai/HiXSTest | Apache 2.0 | 12-Nov-2024 | Walled AI | industry | Gupta et al.: "WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models" | EMNLP 2024 (Demo) | https://aclanthology.org/2024.emnlp-demo.42/ | - half safe and half unsafe prompts, contrasting each other | 17.12.2024 | |
26 | CIVICS | value alignment | cultural / social | evaluate the social and cultural variation of LLMs across multiple languages and value-sensitive topics | eval only | chat | German, Italian, French, English | 699 | excerpts | Each excerpt is a short value-laden text passage. | human | sampled | sampled from government and media sites | not available | https://huggingface.co/datasets/CIVICS-dataset/CIVICS | CC BY 4.0 | 16-Oct-2024 | Hugging Face, University of Amsterdam, Carnegie Mellon University | mixed | Pistilli et al.: "CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models" | AIES 2024 | https://ojs.aaai.org/index.php/AIES/article/view/31710 | - covers socially sensitive topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy | 17.12.2024 | |
27 | GlobalOpinionQA | value alignment | cultural / social | evaluate whose opinions LLM responses are most similar to | eval only | multiple choice | English | 2,556 | multiple-choice questions | - | human | sampled | adapted from Pew’s Global Attitudes surveys and the World Values Survey (written by survey designers) | not available | https://huggingface.co/datasets/Anthropic/llm_global_opinions | CC BY-NC-SA 4.0 | 7-Oct-2024 | Anthropic | industry | Durmus et al.: "Towards Measuring the Representation of Subjective Global Opinions in Language Models" | COLM 2024 | https://arxiv.org/abs/2306.16388 | - comes with responses from people across the globe - goal is to capture more diversity than OpinionQA | 04.01.2024 | |
28 | PolygloToxicityPrompts | broad safety | toxicity | evaluate propensity of LLMs to generate toxic content across 17 languages | eval only | autocomplete | Arabic, Chinese, Czech, Dutch, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish | 425,000 | prompts | Each prompt is an unfinished document. | human | sampled | sampled from mC4 and The Pile | https://github.com/kpriyanshu256/polyglo-toxicity-prompts | https://huggingface.co/datasets/ToxicityPrompts/PolygloToxicityPrompts | AI2 ImpACT - LR | 7-Oct-2024 | Carnegie Mellon University, University of Virginia, Allen Institute for AI | academia / npo | Jain et al.: "PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models" | COLM 2024 | https://arxiv.org/abs/2405.09373 | - prompts are document fragments rather than sentence fragments, unlike RealToxicityPrompts | 21.3.2025 | |
29 | CALM | bias | gender and race | evaluate LLM gender and racial bias across different tasks and domains | eval only | other | English | 78,400 | examples | Each example is a QA, sentiment classification or NLI example. | hybrid | mixed | templates created by humans based on other datasets, then expanded by combination | https://github.com/vipulgupta1011/CALM | https://huggingface.co/datasets/vipulgupta/CALM | MIT | 7-Oct-2024 | Penn State | academia / npo | Gupta et al.: "CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias" | COLM 2024 | https://arxiv.org/abs/2308.12539 | - covers three tasks: QA, sentiment classification, NLI - covers 2 categories of bias: gender and race - 78,400 examples are generated from 224 templates - Gender and race are instantiated using names | 11.01.2024 | |
30 | CHiSafetyBench | broad safety | | evaluate LLM capabilities to identify risky content and refuse to answer risky questions in Chinese | eval only | multiple choice | Chinese | 1,861 | multiple-choice questions | - | hybrid | mixed | sampled from SafetyBench and SafetyPrompts, plus prompts generated by GPT-3.5 | https://github.com/UnicomAI/UnicomBenchmark/tree/main/CHiSafetyBench | not available | unspecified | 2-Sep-2024 | China Unicom | industry | Zhang et al.: "CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models" | arxiv | https://arxiv.org/abs/2406.10311 | - covers 5 risk areas: discrimination, violation of values, commercial violations, infringement of rights, security requirements for specific services - comes with smaller set of unsafe open-ended questions | 01.08.2024 | |
31 | PHTest | narrow safety | false refusal | evaluate false refusal in LLMs | eval only | chat | English | 3,260 | prompts | Each prompt is a "pseudo-harmful" question or instruction. | machine | original | generated by Llama2-7B-Chat, Llama3-8B-Instruct, and Mistral-7B-Instruct-V0.2. | https://github.com/umd-huang-lab/FalseRefusal | https://huggingface.co/datasets/furonghuang-lab/PHTest | MIT | 1-Sep-2024 | University of Maryland, Adobe Research | mixed | An et al.: "Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models" | arxiv | https://arxiv.org/abs/2409.00598 | - main contribution of paper is method for generating similar datasets | 21.3.2025 | |
32 | SafetyBench | broad safety | benchmark | evaluate LLM safety with multiple choice questions | eval only | multiple choice | English, Chinese | 11,435 | multiple-choice questions | - | hybrid | mixed | sampled from existing datasets + exams (zh), then LLM augmentation (zh) | https://github.com/thu-coai/SafetyBench | https://huggingface.co/datasets/thu-coai/SafetyBench | MIT | 11-Aug-2024 | Tsinghua University, Northwest Minzu University, Peking University, China Mobile Research Institute | mixed | Zhang et al.: "SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions" | ACL 2024 (Main) | https://aclanthology.org/2024.acl-long.830/ | - split into 7 categories - language distribution imbalanced across categories - tests knowledge about safety rather than safety | 04.01.2024 | |
33 | CatQA | broad safety | | evaluate effectiveness of safety training method | eval only | chat | English, Chinese, Vietnamese | 550 | prompts | Each prompt is a question. | machine | original | generated by unnamed LLM that is not safety-tuned | https://github.com/declare-lab/resta | https://huggingface.co/datasets/declare-lab/CategoricalHarmfulQA | Apache 2.0 | 11-Aug-2024 | Singapore University of Technology and Design, Nanyang Technological University | academia / npo | Bhardwaj et al.: "Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic" | ACL 2024 (Main) | https://aclanthology.org/2024.acl-long.762/ | - covers 11 harm categories, each divided into 5 subcategories - for each subcategory there are 10 prompts | 17.12.2024 | |
34 | KorNAT | value alignment | cultural / social | evaluate LLM alignment with South Korean social values | eval only | multiple choice | English, Korean | 10,000 | multiple-choice questions | Each question is about a specific social scenario. | hybrid | mixed | generated by GPT-3.5 based on keywords sampled from media and surveys | not available | https://huggingface.co/datasets/datumo/KorNAT | CC BY-NC 4.0 | 11-Aug-2024 | KAIST, Datumo Inc, Seoul National University Bundang Hospital, Naver AI Lab | mixed | Lee et al.: "KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge" | ACL 2024 (Findings) | https://aclanthology.org/2024.findings-acl.666 | - each prompt comes in English and Korean - also contains a dataset portion testing for common knowledge | 17.12.2024 | |
35 | CMoralEval | value alignment | moral | evaluate the morality of Chinese LLMs | eval only | multiple choice | Chinese | 30,388 | multiple-choice questions | Each question is about a specific moral scenario. | hybrid | mixed | sampled from Chinese TV programs and other media, then turned into templates using GPT 3.5 | https://github.com/tjunlp-lab/CMoralEval | not available | MIT | 11-Aug-2024 | Tianjin University, Kunming University, Zhengzhou University | academia / npo | Yu et al.: "CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models" | ACL 2024 (Findings) | https://aclanthology.org/2024.findings-acl.703 | - covers 2 types of moral scenarios - GitHub is currently empty | 17.12.2024 | |
36 | XSafety | broad safety | | evaluate multilingual LLM safety | eval only | chat | English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Japanese, German | 28,000 | prompts | Each prompt is a question or instruction. | hybrid | mixed | sampled from SafetyPrompts, then auto-translated and validated | https://github.com/Jarviswang94/Multilingual_safety_benchmark | not available | Apache 2.0 | 11-Aug-2024 | Chinese University of Hong Kong, Tencent | mixed | Wang et al.: "All Languages Matter: On the Multilingual Safety of Large Language Models" | ACL 2024 (Findings) | https://aclanthology.org/2024.findings-acl.349/ | - covers 14 safety scenarios, some "typical", some "instruction" | 01.08.2024 | |
37 | SaladBench | broad safety | | evaluate LLM safety, plus attack and defense methods | eval only | chat | English | 21,000 | prompts | Each prompt is a question or instruction. | hybrid | mixed | mostly sampled from existing datasets, then augmented using GPT-4 | https://github.com/OpenSafetyLab/SALAD-BENCH | https://huggingface.co/datasets/OpenSafetyLab/Salad-Data | Apache 2.0 | 11-Aug-2024 | Shanghai Artificial Intelligence Laboratory, Harbin Institute of Technology, Beijing Institute of Technology, Chinese University of Hong Kong, Hong Kong Polytechnic University | academia / npo | Li et al.: "SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models" | ACL 2024 (Findings) | https://arxiv.org/abs/2402.05044 | - structured with three-level taxonomy including 66 categories - comes with multiple-choice question set | 01.08.2024 | |
38 | MMHB | bias | sociodemographics | evaluate sociodemographic biases in LLMs across multiple languages | eval only | chat | English, French, Hindi, Indonesian, Italian, Portuguese, Spanish, Vietnamese | 5,754,444 | prompts | Each prompt is a sentence starting a two-person conversation. | hybrid | mixed | templates sampled from HolisticBias, then translated and expanded | https://github.com/facebookresearch/ResponsibleNLP/tree/main/mmhb | not available | unspecified | 29-Jun-2024 | Meta | industry | Tan et al.: "Towards Massive Multilingual Holistic Bias" | arxiv | https://arxiv.org/abs/2407.00486 | - covers 13 demographic axes - based on HolisticBias | 17.12.2024 | |
39 | GPTFuzzer | broad safety | used to test adversarial method | evaluate effectiveness of automated red-teaming method | eval only | chat | English | 100 | prompts | Each prompt is a question or instruction. | hybrid | sampled | sampled from AnthropicHarmlessBase and unpublished GPT-written dataset | https://github.com/sherdencooper/GPTFuzz | not available | MIT | 27-Jun-2024 | Northwestern University, Ant Group | mixed | Yu et al.: "GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts" | arxiv | https://arxiv.org/abs/2309.10253 | - small dataset for testing automated jailbreaks | 01.08.2024 | |
40 | WMDP | narrow safety | hazardous knowledge | measure hazardous knowledge in biosecurity, cybersecurity, and chemical security | eval only | chat | English | 3,688 | multiple-choice questions | - | human | original | written by experts (academics and technical consultants) | https://github.com/centerforaisafety/wmdp | https://huggingface.co/datasets/cais/wmdp | MIT | 25-Jun-2024 | Center for AI Safety, UC Berkeley, MIT, SecureBio, Scale AI, NYU, IIIT Hyderabad, Stanford University, Harvard University, USC, UIUC, UCLA, Lapis Labs, California Institute of Technology, Sybil, Northeastern University, Carnegie Mellon University, RTX BBN Technologies, University of Newcastle, Keio University, Ariel University, Arizona State University, xAI | mixed | Li et al.: "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning" | ICML 2024 (Poster) | https://proceedings.mlr.press/v235/li24bc.html | - covers 3 hazard categories: biosecurity, cybersecurity, and chemical security | 01.08.2024 | |
41 | ALERT | broad safety | | evaluate the safety of LLMs through red teaming methodologies | eval only | chat | English | 44,800 | prompts | Each prompt is a question or instruction. | hybrid | mixed | sampled from AnthropicRedTeam, then augmented with templates | https://github.com/Babelscape/ALERT | https://huggingface.co/datasets/Babelscape/ALERT | CC BY-NC-SA 4.0 | 24-Jun-2024 | Sapienza University of Rome, Babelscape, TU Darmstadt, Hessian.AI, DFKI, Ontocord.AI, UChicago, UIUC | mixed | Tedeschi et al.: "ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming" | arxiv | https://arxiv.org/abs/2404.08676 | - covers 6 categories and 32 sub-categories informed by AI regulation and prior work | 01.08.2024 | |
42 | ORBench | narrow safety | false refusal | evaluate false refusal in LLMs at scale | eval only | chat | English | 80,000 | prompts | Each prompt is a question or instruction. | machine | original | generated by Mixtral as toxic prompts, then rewritten by Mixtral into safe prompts, then filtered | https://github.com/justincui03/or-bench | https://huggingface.co/datasets/bench-llm/or-bench | CC BY 4.0 | 20-Jun-2024 | UCLA, UC Berkeley | academia / npo | Cui et al.: "OR-Bench: An Over-Refusal Benchmark for Large Language Models" | arxiv | https://arxiv.org/abs/2405.20947 | - comes with a subset of 1k hard prompts and 600 toxic prompts - covers 10 categories of harmful behaviour | 17.12.2024 | |
43 | SorryBench | broad safety | | evaluate fine-grained LLM safety across varying linguistic characteristics | eval only | chat | English, French, Chinese, Marathi, Tamil, Malayalam | 9,450 | prompts | Each prompt is an unsafe question or instruction. | hybrid | mixed | sampled from 10 other datasets, then augmented with templates | https://github.com/sorry-bench/sorry-bench | https://huggingface.co/datasets/sorry-bench/sorry-bench-202406 | MIT | 20-Jun-2024 | Princeton University, Virginia Tech, Stanford University, UC Berkeley, UIUC, UChicago | academia / npo | Xie et al.: "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors" | arxiv | https://arxiv.org/abs/2406.14598 | - covers 45 potentially unsafe topics with 10 base prompts each - includes 20 linguistic augmentations, which results in 21*450 prompts | 01.08.2024 | |
44 | XSTest | narrow safety | false refusal | evaluate exaggerated safety / false refusal behaviour in LLMs | eval only | chat | English | 450 | prompts | Each prompt is a simple question. | human | original | written by the authors | https://github.com/paul-rottger/exaggerated-safety | https://huggingface.co/datasets/natolambert/xstest-v2-copy | CC BY 4.0 | 16-Jun-2024 | Bocconi University, University of Oxford, Stanford University | academia / npo | Röttger et al.: "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" | NAACL 2024 (Main) | https://aclanthology.org/2024.naacl-long.301/ | - split into ten types of prompts - 250 safe prompts and 200 unsafe prompts | 23.12.2023 | |
45 | Flames | broad safety | adversarial goal in data creation | evaluate value alignment of Chinese language LLMs | eval only | chat | Chinese | 2,251 | prompts | Each prompt is a question or instruction. | human | original | written by crowdworkers | https://github.com/AIFlames/Flames | not available | Apache 2.0 | 16-Jun-2024 | Shanghai Artificial Intelligence Laboratory, Fudan University | academia / npo | Huang et al.: "FLAMES: Benchmarking Value Alignment of LLMs in Chinese" | NAACL 2024 (Main) | https://aclanthology.org/2024.naacl-long.256/ | - covers 5 dimensions: fairness, safety, morality, legality, data protection | 01.08.2024 | |
46 | SEval | broad safety | | evaluate LLM safety | eval only | chat | English, Chinese | 20,000 | prompts | Each prompt is an unsafe question or instruction. | machine | original | generated by a fine-tuned Qwen-14b | https://github.com/IS2Lab/S-Eval | https://huggingface.co/datasets/IS2Lab/S-Eval | CC BY-NC-SA 4.0 | 28-May-2024 | Zhejiang University, Alibaba | mixed | Yuan et al.: "S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models" | arxiv | https://arxiv.org/abs/2405.14191 | - covers 8 risk categories - comes with 20 adversarial augmentations for each base prompt | 01.08.2024 | |
47 | WorldValuesBench | value alignment | cultural / social | evaluate LLM awareness of multicultural human values | train and eval | multiple choice | English | 21,492,393 | multiple-choice questions | Each question comes with sociodemographic attributes of the respondent. | human | sampled | adapted from the World Values Survey (written by survey designers) using templates | https://github.com/Demon702/WorldValuesBench | not available | unspecified | 20-May-2024 | University of Massachusetts Amherst, Allen AI, University of North Carolina at Chapel Hill | academia / npo | Zhao et al.: "WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models" | LREC COLING 2024 (Main) | https://aclanthology.org/2024.lrec-main.1539/ | - covers 239 different questions | 01.08.2024 | |
48 | CBBQ | bias | sociodemographics | evaluate sociodemographic bias of LLMs in Chinese | eval only | multiple choice | Chinese | 106,588 | examples | Each example is a context plus a question with three answer choices. | hybrid | original | partly human-written, partly generated by GPT-4 | https://github.com/YFHuangxxxx/CBBQ | https://huggingface.co/datasets/walledai/CBBQ | CC BY-SA 4.0 | 20-May-2024 | Tianjin University | academia / npo | Huang and Xiong: "CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models" | LREC COLING 2024 (Main) | https://aclanthology.org/2024.lrec-main.260 | - similar to BBQ but not parallel - comprises 3,039 templates and 106,588 samples | 17.12.2024 | |
49 | KoBBQ | bias | sociodemographics | evaluate sociodemographic bias of LLMs in Korean | eval only | multiple choice | Korean | 76,048 | examples | Each example is a context plus a question with three answer choices. | human | mixed | partly sampled and translated from BBQ, partly newly written | https://github.com/naver-ai/KoBBQ/ | https://huggingface.co/datasets/naver-ai/kobbq | MIT | 3-May-2024 | KAIST, Naver AI | mixed | Jin et al.: "KoBBQ: Korean Bias Benchmark for Question Answering" | TACL | https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00661/120915/KoBBQ-Korean-Bias-Benchmark-for-Question-Answering | - similar to BBQ but not parallel - comprises 268 templates and 76,048 samples across 12 categories of social bias | 17.12.2024 | |
50 | SAFE | broad safety | | evaluate LLM safety beyond binary distinction of safe and unsafe | eval only | chat | English | 52,430 | conversations | Each conversation is single-turn, containing a prompt and a potentially harmful model response. | hybrid | sampled | sampled seed prompts from Friday website, then generated more prompts with GPT-4 | https://github.com/xiaoqiao/EvalSafetyLLM | not available | unspecified | 24-Apr-2024 | Zhejiang University, Westlake University | academia / npo | Yu et al.: "Beyond Binary Classification: A Fine-Grained Safety Dataset for Large Language Models" | IEEE Access | https://ieeexplore.ieee.org/abstract/document/10507773 | - covers 7 classes: safe, sensitivity, harmfulness, falsehood, information corruption, unnaturalness, deviation from instructions | 01.08.2024 | |
51 | AegisAIContentSafety | other | content moderation | evaluate content moderation guardrails | other | chat | English | 11,997 | conversational turns | Each conversational turn is a user prompt or a model response annotated for safety | hybrid | mixed | prompts sampled from AnthropicHarmlessBase, responses generated by Mistral-7B-v0.1-Instruct | not available | https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0 | CC BY 4.0 | 11-Sep-2024 | NVIDIA | industry | Ghosh et al.: "AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts" | arxiv | https://arxiv.org/abs/2404.05993 | - covers 13 risk categories | 17.12.2024 | |
52 | DoNotAnswer | broad safety | | evaluate 'dangerous capabilities' of LLMs | eval only | chat | English | 939 | prompts | Each prompt is a question. | machine | original | generated by GPT-4 | https://github.com/Libr-AI/do-not-answer | https://huggingface.co/datasets/LibrAI/do-not-answer | Apache 2.0 | 17-Mar-2024 | LibrAI, MBZUAI, University of Melbourne | mixed | Wang et al.: "Do-Not-Answer: Evaluating Safeguards in LLMs" | EACL 2024 (Findings) | https://aclanthology.org/2024.findings-eacl.61/ | - split across 5 risk areas and 12 harm types - authors prompted GPT-4 to generate questions | 23.12.2023 | |
53 | RuLES | narrow safety | rule-following | evaluate the ability of LLMs to follow simple rules | eval only | chat | English | 862 | prompts | Each prompt is a test case combining rules and instructions. | human | original | written by the authors | https://github.com/normster/llm_rules | https://huggingface.co/datasets/normster/RuLES | Apache 2.0 | 8-Mar-2024 | UC Berkeley, Center for AI Safety, Stanford University, King Abdulaziz City for Science and Technology | academia / npo | Mu et al.: "Can LLMs Follow Simple Rules?" | arxiv | https://arxiv.org/abs/2311.04235 | - covers 19 rules across 14 scenarios | 11.01.2024 | |
54 | HarmBench | broad safety | used to test adversarial method | evaluate effectiveness of automated red-teaming methods | eval only | chat | English | 400 | prompts | Each prompt is an instruction. | human | original | written by the authors | https://github.com/centerforaisafety/HarmBench/tree/main/data/behavior_datasets | not available | MIT | 27-Feb-2024 | University of Illinois, Center for AI Safety, Carnegie Mellon University, UC Berkeley, Microsoft | mixed | Mazeika et al.: "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal" | ICML 2024 (Poster) | https://arxiv.org/abs/2402.04249 | - covers 7 semantic categories of behaviour: Cybercrime & Unauthorized Intrusion, Chemical & Biological Weapons/Drugs, Copyright Violations, Misinformation & Disinformation, Harassment & Bullying, Illegal Activities, and General Harm - The dataset also includes 110 multimodal prompts | 07.04.2024 | |
55 | DecodingTrust | broad safety | benchmark | evaluate trustworthiness of LLMs | eval only | multiple choice | English | 243,877 | prompts | Each prompt is an instruction. | hybrid | original | human-written templates and examples plus extensive augmentation from GPTs | https://github.com/AI-secure/DecodingTrust | https://huggingface.co/datasets/AI-Secure/DecodingTrust | CC BY-SA 4.0 | 26-Feb-2024 | University of Illinois, Stanford University, UC Berkeley, Center for AI Safety, Microsoft, Chinese University of Hong Kong | mixed | Wang et al.: "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models" | NeurIPS 2023 (Outstanding Paper) | https://arxiv.org/abs/2306.11698 | - split across 8 'trustworthiness perspectives': toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness to adversarial demonstrations, privacy, machine ethics, and fairness | 11.01.2024 | |
56 | SimpleSafetyTests | broad safety | | evaluate critical safety risks in LLMs | eval only | chat | English | 100 | prompts | Each prompt is a simple question or instruction. | human | original | written by the authors | https://github.com/bertiev/SimpleSafetyTests | https://huggingface.co/datasets/Bertievidgen/SimpleSafetyTests | CC BY-NC 4.0 | 16-Feb-2024 | Patronus AI, University of Oxford, Bocconi University | mixed | Vidgen et al.: "SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models" | arxiv | https://arxiv.org/abs/2311.08370 | - The dataset is split into ten types of prompts | 23.12.2023 | |
57 | ConfAIde | narrow safety | privacy | evaluate the privacy-reasoning capabilities of instruction-tuned LLMs | eval only | other | English | 1,326 | prompts | Each prompt is a reasoning task, with complexity increasing across 4 tiers. Answer formats vary. | hybrid | original | human-written templates expanded by combination and with LLMs | https://github.com/skywalker023/confAIde/tree/main/benchmark | not available | MIT | 11-Feb-2024 | University of Washington, Allen AI, Carnegie Mellon University, National University of Singapore | academia / npo | Mireshghallah et al.: "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory" | ICLR 2024 (Spotlight) | https://openreview.net/forum?id=gmg7t8b4s0 | - split into 4 tiers with different prompt formats - tier 1 contains 10 prompts, tier 2 2*98, tier 3 4*270, tier 4 50 | 11.01.2024 | |
58 | MaliciousInstruct | broad safety | used to test adversarial method | evaluate success of generation exploit jailbreak | eval only | chat | English | 100 | prompts | Each prompt is an unsafe question. | machine | original | written by ChatGPT, then filtered by authors | https://github.com/Princeton-SysML/Jailbreak_LLM/tree/main/data | not available | unspecified | 11-Feb-2024 | Princeton University | academia / npo | Huang et al.: "Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation" | ICLR 2024 (Spotlight) | https://openreview.net/forum?id=r42tSSCHPh | - covers ten 'malicious intentions': psychological manipulation, sabotage, theft, defamation, cyberbullying, false accusation, tax fraud, hacking, fraud, and illegal drug use. | 07.04.2024 | |
59 | HExPHI | broad safety | | evaluate LLM safety | eval only | chat | English | 330 | prompts | Each prompt is a harmful instruction. | hybrid | mixed | sampled from AdvBench and AnthropicRedTeam, then refined manually and with LLMs. | not available | https://huggingface.co/datasets/LLM-Tuning-Safety/HEx-PHI | custom (HEx-PHI) | 11-Feb-2024 | Princeton University, Virginia Tech, IBM Research, Stanford University | mixed | Qi et al.: "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!" | ICLR 2024 (Oral) | https://openreview.net/forum?id=hTEGyKf0dZ | - covers 11 harm areas - focus of the article is on finetuning models | 07.04.2024 | |
60 | CoNA | narrow safety | hate speech | evaluate compliance of LLMs with harmful instructions | eval only | chat | English | 178 | prompts | Each prompt is an instruction. | human | sampled | sampled from MT-CONAN, then rephrased | https://github.com/vinid/instruction-llms-safety-eval | not available | CC BY-NC 4.0 | 11-Feb-2024 | Stanford University, Bocconi University | academia / npo | Bianchi et al.: "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions" | ICLR 2024 (Poster) | https://openreview.net/forum?id=gT5hALch9z | - focus: harmful instructions (e.g. hate speech) - all prompts are (meant to be) unsafe | 23.12.2023 | |
61 | ControversialInstructions | narrow safety | hate speech | evaluate LLM behaviour on controversial topics | eval only | chat | English | 40 | prompts | Each prompt is an instruction. | human | original | written by the authors | https://github.com/vinid/instruction-llms-safety-eval | not available | CC BY-NC 4.0 | 11-Feb-2024 | Stanford University, Bocconi University | academia / npo | Bianchi et al.: "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions" | ICLR 2024 (Poster) | https://openreview.net/forum?id=gT5hALch9z | - Focus: controversial topics (e.g. immigration) - All prompts are (meant to be) unsafe | 23.12.2023 | |
62 | QHarm | broad safety | | evaluate LLM safety | eval only | chat | English | 100 | prompts | Each prompt is a question. | human | sampled | sampled randomly from AnthropicHarmlessBase (written by crowdworkers) | https://github.com/vinid/instruction-llms-safety-eval | not available | CC BY-NC 4.0 | 11-Feb-2024 | Stanford University, Bocconi University | academia / npo | Bianchi et al.: "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions" | ICLR 2024 (Poster) | https://openreview.net/forum?id=gT5hALch9z | - Wider topic coverage due to source dataset - Prompts are mostly unsafe | 23.12.2023 | |
63 | MaliciousInstructions | broad safety | | evaluate compliance of LLMs with malicious instructions | eval only | chat | English | 100 | prompts | Each prompt is an instruction. | machine | original | generated by GPT-3 (text-davinci-003) | https://github.com/vinid/instruction-llms-safety-eval | not available | CC BY-NC 4.0 | 11-Feb-2024 | Stanford University, Bocconi University | academia / npo | Bianchi et al.: "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions" | ICLR 2024 (Poster) | https://openreview.net/forum?id=gT5hALch9z | - Focus: malicious instructions (e.g. bombmaking) - All prompts are (meant to be) unsafe | 23.12.2023 | |
64 | PhysicalSafetyInstructions | narrow safety | physical safety | evaluate LLM commonsense physical safety | eval only | chat | English | 1,000 | prompts | Each prompt is an instruction. | human | mixed | sampled from SafeText, then rephrased | https://github.com/vinid/instruction-llms-safety-eval | not available | CC BY-NC 4.0 | 11-Feb-2024 | Stanford University, Bocconi University | academia / npo | Bianchi et al.: "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions" | ICLR 2024 (Poster) | https://openreview.net/forum?id=gT5hALch9z | - Focus: commonsense physical safety - 50 safe and 50 unsafe prompts | 23.12.2023 | |
65 | SafetyInstructions | broad safety | positive interaction | fine-tune LLMs to be safer | train only | chat | English | 2,000 | conversations | Each conversation is a user prompt with a safe model response. | hybrid | mixed | prompts sampled from AnthropicRedTeam, responses generated by gpt-3.5-turbo | https://github.com/vinid/safety-tuned-llamas/tree/main/data/training | not available | CC BY-NC 4.0 | 11-Feb-2024 | Stanford University, Bocconi University | academia / npo | Bianchi et al.: "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions" | ICLR 2024 (Poster) | https://openreview.net/forum?id=gT5hALch9z | | 07.04.2024 | |
66 | SycophancyEval | narrow safety | sycophancy | evaluate sycophancy in LLMs | eval only | chat | English | 20,956 | prompts | Each prompt is an open-ended question or instruction. | hybrid | mixed | prompts are written by humans and models | https://github.com/meg-tong/sycophancy-eval/tree/main | https://huggingface.co/datasets/meg-tong/sycophancy-eval | MIT | 11-Feb-2024 | Anthropic | industry | Sharma et al.: "Towards Understanding Sycophancy in Language Models" | ICLR 2024 (Poster) | https://openreview.net/forum?id=tvhaxkMKAn | - uses four different task setups to evaluate sycophancy: answer (7268 prompts), are_you_sure (4888 prompts), feedback (8500 prompts), mimicry (300 prompts) | 07.04.2024 | |
67 | OKTest | narrow safety | false refusal | evaluate false refusal in LLMs | eval only | chat | English | 350 | prompts | Each prompt is a question or instruction that models should not refuse. | machine | original | generated by GPT-4 based on keywords | https://github.com/InvokerStark/OverKill | not available | unspecified | 31-Jan-2024 | Fudan University, UC Santa Barbara, Shanghai AI Laboratory | academia / npo | Shi et al.: "Navigating the OverKill in Large Language Models" | arxiv | https://arxiv.org/abs/2401.17633 | - contains only safe prompts | 17.12.2024 | |
68 | AdvBench | broad safety | used to test adversarial method | elicit generation of harmful or objectionable content from LLMs | eval only | chat | English | 1,000 | prompts | 500 are harmful strings that the model should not reproduce, 500 are harmful instructions. | machine | original | generated by Wizard-Vicuna-30B-Uncensored | https://github.com/llm-attacks/llm-attacks/tree/main/data/advbench | not available | MIT | 20-Dec-2023 | Carnegie Mellon University, Center for AI Safety, Google DeepMind, Bosch Center for AI | mixed | Zou et al.: "Universal and Transferable Adversarial Attacks on Aligned Language Models" | arxiv | https://arxiv.org/abs/2307.15043 | - focus of the work is adversarial / to jailbreak LLMs - AdvBench tests whether jailbreaks succeeded | 11.01.2024 | |
69 | Mosscap | other | adversarial goal in data creation, crowdsourced | analyse prompt hacking / extraction attacks | other | chat | English | 278,945 | prompts | Most prompts are prompt extraction attacks. | human | original | written by players of the Mosscap game | not available | https://huggingface.co/datasets/Lakera/mosscap_prompt_injection | MIT | 20-Dec-2023 | Lakera AI | industry | Lakera AI: "Mosscap Prompt Injection" | blog | https://grt.lakera.ai/mosscap | - focus of the work is adversarial / prompt extraction - not all prompts are attacks - prompts correspond to 8 difficulty levels of the game | 11.01.2024 | |
70 | MoralChoice | value alignment | moral | evaluate the moral beliefs in LLMs | eval only | multiple choice | English | 1,767 | binary-choice questions | Each prompt is a hypothetical moral scenario with two potential actions. | hybrid | original | mostly generated by GPTs, plus some human-written scenarios | https://github.com/ninodimontalcino/moralchoice | https://huggingface.co/datasets/ninoscherrer/moralchoice | CC BY 4.0 | 10-Dec-2023 | FAR AI, Columbia University | academia / npo | Scherrer et al.: "Evaluating the Moral Beliefs Encoded in LLMs" | NeurIPS 2023 (Spotlight) | https://openreview.net/forum?id=O06z2G18me | - 687 scenarios are low-ambiguity, 680 are high-ambiguity - three Surge annotators choose the favourable action for each scenario | 11.01.2024 | |
71 | DICES350 | value alignment | safety | collect diverse perspectives on conversational AI safety | eval only | chat | English | 350 | conversations | Conversations can be multi-turn, with user input and LLM output. | human | original | written by adversarial LaMDA users | https://github.com/google-research-datasets/dices-dataset/ | not available | CC BY 4.0 | 10-Dec-2023 | Google Research, UCL, University of Cambridge | mixed | Aroyo et al.: "DICES Dataset: Diversity in Conversational AI Evaluation for Safety" | NeurIPS 2023 (Poster) | https://proceedings.neurips.cc/paper_files/paper/2023/hash/a74b697bce4cac6c91896372abaa8863-Abstract-Datasets_and_Benchmarks.html | - 104 ratings per item - annotators from US - annotation across 24 safety criteria | 23.12.2023 | |
72 | DICES990 | value alignment | safety | collect diverse perspectives on conversational AI safety | eval only | chat | English | 990 | conversations | Conversations can be multi-turn, with user input and LLM output. | human | original | written by adversarial LaMDA users | https://github.com/google-research-datasets/dices-dataset/ | not available | CC BY 4.0 | 10-Dec-2023 | Google Research, UCL, University of Cambridge | mixed | Aroyo et al.: "DICES Dataset: Diversity in Conversational AI Evaluation for Safety" | NeurIPS 2023 (Poster) | https://proceedings.neurips.cc/paper_files/paper/2023/hash/a74b697bce4cac6c91896372abaa8863-Abstract-Datasets_and_Benchmarks.html | - 60-70 ratings per item - annotators from US and India - annotation across 16 safety criteria | 23.12.2023 | |
73 | BeaverTails | broad safety | | evaluate and improve LLM safety on QA | train and eval | chat | English | 333,963 | conversations | Each conversation contains a human prompt and LLM response. | hybrid | mixed | prompts sampled from human AnthropicRedTeam data, plus model-generated responses | https://github.com/PKU-Alignment/beavertails | https://huggingface.co/datasets/PKU-Alignment/BeaverTails | CC BY-NC 4.0 | 10-Dec-2023 | Peking University | academia / npo | Ji et al.: "BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset" | NeurIPS 2023 (Poster) | https://arxiv.org/abs/2307.04657 | - 16,851 unique prompts sampled from AnthropicRedTeam - covers 14 harm categories (e.g. animal abuse) - annotated for safety by 3.34 crowdworkers on average | 11.01.2024 | |
74 | TDCRedTeaming | broad safety | used to test adversarial method | evaluate success of automated red-teaming approaches | eval only | chat | English | 100 | prompts | Each prompt is an instruction. | human | original | written by the authors | https://github.com/centerforaisafety/tdc2023-starter-kit/tree/main/red_teaming/data | not available | MIT | 10-Dec-2023 | UIUC, Carnegie Mellon University, UC Berkeley, Center for AI Safety, Microsoft, University of Washington | mixed | Mazeika et al.: "TDC 2023 (LLM Edition): The Trojan Detection Challenge" | NeurIPS 2023 (Competition) | https://trojandetection.ai/ | - covers 7 categories: Bigotry and Abusive Language, Violent Content and Conduct, Illegal Activities, Malware and Exploits, Scams, Misinformation and Disinformation, Other Undesirable Content | 07.04.2024 | |
75 | PromptExtractionRobustness | narrow safety | prompt extraction, adversarial goal in data creation, crowdsourced | evaluate LLM vulnerability to prompt extraction | eval only | chat | English | 569 | samples | Each sample combines defense and attacker input. | human | original | written by Tensor Trust players | https://github.com/HumanCompatibleAI/tensor-trust-data | https://huggingface.co/datasets/qxcv/tensor-trust?row=0 | unspecified | 10-Dec-2023 | UC Berkeley, Georgia Tech, Harvard University | academia / npo | Toyer et al.: "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game" | Instruction Workshop at NeurIPS 2023 | https://openreview.net/forum?id=UWymGURI75 | - filtered from larger raw prompt extraction dataset - collected using the open Tensor Trust online game | 11.01.2024 | |
76 | PromptHijackingRobustness | narrow safety | prompt hijacking, adversarial goal in data creation, crowdsourced | evaluate LLM vulnerability to prompt hijacking | eval only | chat | English | 775 | samples | Each sample combines a defense and an attacker input. | human | original | written by Tensor Trust players | https://github.com/HumanCompatibleAI/tensor-trust-data | https://huggingface.co/datasets/qxcv/tensor-trust?row=0 | unspecified | 10-Dec-2023 | UC Berkeley, Georgia Tech, Harvard University | academia / npo | Toyer et al.: "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game" | Instruction Workshop at NeurIPS 2023 | https://openreview.net/forum?id=UWymGURI75 | - filtered from a larger raw prompt hijacking dataset - collected using the open Tensor Trust online game | 11.01.2024 | |
77 | JADE | broad safety | | use linguistic fuzzing to generate challenging prompts for evaluating LLM safety | eval only | chat | Chinese, English | 2,130 | prompts | Each prompt is a question targeting a specific harm category. | machine | original | generated by LLMs based on linguistic rules | https://github.com/whitzard-ai/jade-db | not available | MIT | 10-Dec-2023 | Whitzard AI, Fudan University | mixed | Zhang et al.: "JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models" | arxiv | https://arxiv.org/abs/2311.00286 | - JADE is a platform for safety data generation and evaluation - prompt generation is based on linguistic rules created by the authors - the paper comes with 4 example datasets | 05.02.2024 | |
78 | CPAD | broad safety | adversarial goal in data creation | elicit generation of harmful or objectionable content from LLMs | eval only | chat | Chinese | 10,050 | prompts | Each prompt is a longer-form scenario / instruction aimed at eliciting a harmful response. | hybrid | original | mostly generated by GPT models from human-written seed prompts | https://github.com/liuchengyuan123/CPAD | not available | CC BY-SA 4.0 | 8-Dec-2023 | Zhejiang University, Alibaba | mixed | Liu et al.: "Goal-Oriented Prompt Attack and Safety Evaluation for LLMs" | arxiv | https://arxiv.org/abs/2309.11830 | - focus of the work is adversarial / to jailbreak LLMs - CPAD stands for Chinese Prompt Attack Dataset | 11.01.2024 | |
79 | CyberattackAssistance | narrow safety | cyberattacks | evaluate LLM compliance with requests to assist in cyberattacks | eval only | chat | English | 1,000 | prompts | Each prompt is an instruction to assist in a cyberattack. | hybrid | original | written by experts, augmented with LLMs | https://github.com/facebookresearch/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/mitre | not available | custom (Llama2 Community License) | 7-Dec-2023 | Meta | industry | Bhatt et al.: "Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models" | arxiv | https://arxiv.org/abs/2312.04724 | - instructions are split into 10 MITRE categories - the dataset comes with additional LLM-rephrased instructions | 23.12.2023 | |
80 | AdvPromptSet | broad safety | adversarial goal in data creation | evaluate LLM responses to adversarial toxicity text prompts | eval only | other | English | 197,628 | sentences | Each sentence is taken from a social media dataset. | human | sampled | sampled from two Jigsaw social media datasets (written by social media users) | https://github.com/facebookresearch/ResponsibleNLP/tree/main/AdvPromptSet | not available | MIT | 6-Dec-2023 | Meta | industry | Esiobu et al.: "ROBBIE: Robust Bias Evaluation of Large Generative Language Models" | EMNLP 2023 (Main) | https://aclanthology.org/2023.emnlp-main.230/ | - created as part of the ROBBIE bias benchmark - originally labelled for toxicity by Jigsaw | 11.01.2024 | |
81 | HolisticBiasR | bias | sociodemographics | evaluate LLM completions for sentences related to individual sociodemographics | eval only | autocomplete | English | 214,460 | prompts | Each prompt is the beginning of a sentence related to a person's sociodemographics. | hybrid | original | human-written templates expanded by combination | https://github.com/facebookresearch/ResponsibleNLP/tree/main/robbie | not available | MIT | 6-Dec-2023 | Meta | industry | Esiobu et al.: "ROBBIE: Robust Bias Evaluation of Large Generative Language Models" | EMNLP 2023 (Main) | https://aclanthology.org/2023.emnlp-main.230/ | - created as part of the ROBBIE bias benchmark - constructed from 60 Regard templates - uses noun phrases from Holistic Bias - covers 11 categories of bias: age, body type, class, culture, disability, gender, nationality, political ideology, race/ethnicity, religion, sexual orientation | 11.01.2024 | |
82 | HackAPrompt | other | adversarial goal in data creation, crowdsourced | analyse prompt hacking / extraction attacks | other | chat | mostly English | 601,757 | prompts | Most prompts are prompt extraction attacks. | human | original | written by participants of the HackAPrompt competition | not available | https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset | MIT | 6-Dec-2023 | University of Maryland, Mila, Towards AI, Stanford University, Technical University of Sofia, University of Milan, NYU, University of Arizona | mixed | Schulhoff et al.: "Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition" | EMNLP 2023 (Main) | https://aclanthology.org/2023.emnlp-main.302/ | - focus of the work is adversarial / prompt hacking - prompts were written by ca. 2.8k people from 50+ countries | 11.01.2024 | |
83 | ToxicChat | other | content moderation | evaluate dialogue content moderation systems | other | chat | mostly English | 10,166 | conversations | Each conversation is single-turn, with user input and LLM output. | human | original | written by LMSys users | not available | https://huggingface.co/datasets/lmsys/toxic-chat | CC BY-NC 4.0 | 6-Dec-2023 | UC San Diego | academia / npo | Lin et al.: "ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions" | EMNLP 2023 (Findings) | https://aclanthology.org/2023.findings-emnlp.311/ | - subset of LMSYSChat1M - annotated for toxicity by 4 authors - ca. 7% of conversations are toxic | 23.12.2023 | |
84 | AART | broad safety | used to test adversarial method | illustrate the AART automated red-teaming method | eval only | chat | English | 3,269 | prompts | Each prompt is an instruction. | machine | original | generated by PALM | https://github.com/google-research-datasets/aart-ai-safety-dataset | not available | CC BY 4.0 | 6-Dec-2023 | Google Research | industry | Radharapu et al.: "AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications" | EMNLP 2023 (Industry) | https://aclanthology.org/2023.emnlp-industry.37/ | - contains examples for specific geographic regions - prompts also vary in their use cases and concepts | 23.12.2023 | |
85 | DELPHI | broad safety | | evaluate LLM performance in handling controversial issues | eval only | chat | English | 29,201 | prompts | Each prompt is a question about an issue with varying levels of controversy. | human | sampled | sampled from the Quora Question Pair Dataset (written by Quora users) | https://github.com/apple/ml-delphi/tree/main/dataset | not available | CC BY 4.0 | 6-Dec-2023 | Apple | industry | Sun et al.: "DELPHI: Data for Evaluating LLMs’ Performance in Handling Controversial Issues" | EMNLP 2023 (Industry) | https://aclanthology.org/2023.emnlp-industry.76 | - annotated for 5 levels of controversy - annotators are native English speakers who have spent significant time in Western Europe | 07.04.2024 | |
86 | AttaQ | broad safety | adversarial goal in data creation | evaluate tendency of LLMs to generate harmful or undesirable responses | eval only | chat | English | 1,402 | prompts | Each prompt is a question. | hybrid | mixed | sampled from AnthropicRedTeam, generated by LLMs, and crawled from Wikipedia | not available | https://huggingface.co/datasets/ibm/AttaQ | MIT | 6-Dec-2023 | IBM Research AI | industry | Kour et al.: "Unveiling Safety Vulnerabilities of Large Language Models" | GEM Workshop at EMNLP 2023 | https://aclanthology.org/2023.gem-1.10/ | - consists of a mix of sources | 01.08.2024 | |
87 | DiscrimEval | bias | sociodemographics | evaluate the potential discriminatory impact of LMs across use cases | eval only | multiple choice | English | 9,450 | binary-choice questions | Each question comes with scenario context. | machine | original | topics, templates and questions generated by Claude | not available | https://huggingface.co/datasets/Anthropic/discrim-eval | CC BY 4.0 | 6-Dec-2023 | Anthropic | industry | Tamkin et al.: "Evaluating and Mitigating Discrimination in Language Model Decisions" | arxiv | https://arxiv.org/abs/2312.03689 | - covers 70 different decision scenarios - each question comes with an 'implicit' version where race and gender are conveyed through associated names - covers 3 categories of bias: race, gender, age | 11.01.2024 | |
88 | SPMisconceptions | narrow safety | privacy | measure the ability of LLMs to refute misconceptions | eval only | chat | English | 122 | prompts | Each prompt is a single-sentence misconception. | human | original | written by the authors | https://github.com/purseclab/LLM_Security_Privacy_Advice | not available | MIT | 4-Dec-2023 | Purdue University | academia / npo | Chen et al.: "Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions" | ACSAC 2023 | https://dl.acm.org/doi/10.1145/3627106.3627196 | - misconceptions all relate to security and privacy - uses templates to turn misconceptions into prompts - covers 6 categories (e.g. crypto and blockchain, law and regulation) | 11.01.2024 | |
89 | FFT | broad safety | | evaluate the factuality, fairness, and toxicity of LLMs | eval only | chat | English | 2,116 | prompts | Each prompt is a question or instruction, sometimes within a jailbreak template. | hybrid | mixed | sampled from AnthropicRedTeam, generated by LLMs, and crawled from Wikipedia, Reddit, and elsewhere | https://github.com/cuishiyao96/FFT | not available | unspecified | 30-Nov-2023 | Chinese Academy of Sciences, University of Chinese Academy of Sciences, Baidu | mixed | Cui et al.: "FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity" | arxiv | https://arxiv.org/abs/2311.18580 | - tests factuality, fairness and toxicity | 01.08.2024 | |
90 | GandalfIgnoreInstructions | other | adversarial goal in data creation, crowdsourced | analyse prompt hacking / extraction attacks | other | chat | English | 1,000 | prompts | Most prompts are prompt extraction attacks. | human | original | written by players of the Gandalf game | not available | https://huggingface.co/datasets/Lakera/gandalf_ignore_instructions | MIT | 2-Oct-2023 | Lakera AI | industry | Lakera AI: "Gandalf Prompt Injection: Ignore Instruction Prompts" | blog | https://www.lakera.ai/blog/who-is-gandalf | - focus of the work is adversarial / prompt extraction - not all prompts are attacks | 05.02.2024 | |
91 | GandalfSummarization | other | adversarial goal in data creation, crowdsourced | analyse prompt hacking / extraction attacks | other | chat | English | 140 | prompts | Most prompts are prompt extraction attacks. | human | original | written by players of the Gandalf game | not available | https://huggingface.co/datasets/Lakera/gandalf_summarization | MIT | 2-Oct-2023 | Lakera AI | industry | Lakera AI: "Gandalf Prompt Injection: Summarization Prompts" | blog | https://www.lakera.ai/blog/who-is-gandalf | - focus of the work is adversarial / prompt extraction - not all prompts are attacks | 05.02.2024 | |
92 | HarmfulQA | broad safety | | evaluate and improve LLM safety | eval only | chat | English | 1,960 | prompts | Each prompt is a question. | machine | original | generated by ChatGPT | https://github.com/declare-lab/red-instruct/tree/main/harmful_questions | https://huggingface.co/datasets/declare-lab/HarmfulQA | Apache 2.0 | 30-Aug-2023 | Singapore University of Technology and Design | academia / npo | Bhardwaj and Poria: "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment" | arxiv | https://arxiv.org/abs/2308.09662 | - split into 10 topics (e.g. "Mathematics and Logic") - similarity across prompts is quite high - not all prompts are unsafe / safety-related | 23.12.2023 | |
93 | LatentJailbreak | narrow safety | jailbreak, adversarial goal in data creation | evaluate safety and robustness of LLMs in response to adversarial prompts | eval only | chat | English | 416 | prompts | Each prompt combines a meta-instruction (e.g. translation) with the same toxic instruction template. | hybrid | original | human-written templates expanded by combination | https://github.com/qiuhuachuan/latent-jailbreak/tree/main | not available | MIT | 28-Aug-2023 | Zhejiang University, Westlake University | academia / npo | Qiu et al.: "Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models" | arxiv | https://arxiv.org/abs/2307.08487 | - focus of the work is adversarial / to jailbreak LLMs - 13 prompt templates instantiated with 16 protected group terms and 2 positional types - main exploit focuses on translation | 11.01.2024 | |
94 | OpinionQA | value alignment | cultural / social | evaluate the alignment of LLM opinions with US demographic groups | eval only | multiple choice | English | 1,498 | multiple-choice questions | - | human | sampled | adapted from the Pew American Trends Panel surveys | https://github.com/tatsu-lab/opinions_qa | not available | unspecified | 23-Jul-2023 | Stanford University | academia / npo | Santurkar et al.: "Whose Opinions Do Language Models Reflect?" | ICML 2023 (Oral) | https://dl.acm.org/doi/10.5555/3618408.3619652 | - questions taken from 15 ATP surveys - covers 60 demographic groups | 04.01.2024 | |
95 | CValuesResponsibilityMC | value alignment | safety | evaluate human value alignment in Chinese LLMs | eval only | multiple choice | Chinese | 1,712 | multiple-choice questions | Each question targets responsible behaviours. | machine | original | automatically created from human-written prompts | https://github.com/X-PLUG/CValues | https://huggingface.co/datasets/Skepsun/cvalues_rlhf | Apache 2.0 | 19-Jul-2023 | Alibaba, Beijing Jiaotong University | mixed | Xu et al.: "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility" | arxiv | https://arxiv.org/abs/2307.09705 | - distinguishes between unsafe and irresponsible responses - covers 8 domains (e.g. psychology, law, data science) | 04.01.2024 | |
96 | CValuesResponsibilityPrompts | value alignment | safety | evaluate human value alignment in Chinese LLMs | eval only | chat | Chinese | 800 | prompts | Each prompt is an open question targeting responsible behaviours. | human | original | written by the authors | https://github.com/X-PLUG/CValues | https://huggingface.co/datasets/Skepsun/cvalues_rlhf | Apache 2.0 | 19-Jul-2023 | Alibaba, Beijing Jiaotong University | mixed | Xu et al.: "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility" | arxiv | https://arxiv.org/abs/2307.09705 | - distinguishes between unsafe and irresponsible responses - covers 8 domains (e.g. psychology, law, data science) | 04.01.2024 | |
97 | CHBias | bias | sociodemographics | evaluate LLM bias related to sociodemographics | eval only | other | Chinese | 4,800 | Weibo posts | Each post references a target group. | human | sampled | posts sampled from Weibo (written by Weibo users) | https://github.com/hyintell/CHBias | not available | MIT | 9-Jul-2023 | Eindhoven University, University of Liverpool, University of Technology Sydney | academia / npo | Zhao et al.: "CHBias: Bias Evaluation and Mitigation of Chinese Conversational Language Models" | ACL 2023 (Main) | https://aclanthology.org/2023.acl-long.757/ | - 4 bias categories: gender, sexual orientation, age, appearance - annotated by Chinese NLP graduate students - similar evaluation setup to CrowS-Pairs | 04.01.2024 | |
98 | HarmfulQ | broad safety | | evaluate LLM safety | eval only | chat | English | 200 | prompts | Each prompt is a question. | machine | original | generated by GPT-3 (text-davinci-002) | https://github.com/SALT-NLP/chain-of-thought-bias | not available | MIT | 9-Jul-2023 | Stanford University, Shanghai Jiao Tong University, Georgia Tech | academia / npo | Shaikh et al.: "On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning" | ACL 2023 (Main) | https://aclanthology.org/2023.acl-long.244/ | - focus on 6 attributes: "racist, stereotypical, sexist, illegal, toxic, harmful" - the authors manually filter out overly similar questions | 23.12.2023 | |
99 | FairPrism | other | content moderation | analyse harms in conversations with LLMs | other | chat | English | 5,000 | conversations | Each conversation is single-turn, containing a prompt and a potentially harmful model response. | hybrid | sampled | prompts sampled from ToxiGen and Social Bias Frames, responses generated by GPT models and XLNet | https://github.com/microsoft/fairprism | not available | MIT | 9-Jul-2023 | UC Berkeley, Microsoft Research | mixed | Fleisig et al.: "FairPrism: Evaluating Fairness-Related Harms in Text Generation" | ACL 2023 (Main) | https://aclanthology.org/2023.acl-long.343 | - does not introduce new prompts - focus is on analysing model responses | 07.04.2024 | |
100 | SeeGULL | bias | sociodemographics | expand cultural and geographic coverage of stereotype benchmarks | eval only | other | English | 7,750 | tuples | Each tuple is an identity group plus a stereotype attribute. | hybrid | original | generated by LLMs, partly validated by human annotation | https://github.com/google-research-datasets/seegull | not available | CC BY-SA 4.0 | 9-Jul-2023 | Virginia Tech, Google Research | mixed | Jha et al.: "SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models" | ACL 2023 (Main) | https://aclanthology.org/2023.acl-long.548/ | - stereotypes about identity groups spanning 178 countries across 8 geo-political regions on 6 continents - examples are accompanied by fine-grained offensiveness scores | 05.02.2024 | |