1
data_name | purpose_type | purpose_tags | purpose_stated | purpose_llmdev | entries_type | entries_languages | entries_n | entries_unit | entries_detail | creation_creator_type | creation_source_type | creation_detail | access_git_url | access_hf_url | access_license | publication_date | publication_affils | publication_sector | publication_name | publication_venue | publication_url | other_notes | other_date_added
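The 24 column names above form the schema that every entry below follows. As a minimal sketch of how the catalogue could be queried programmatically (assuming a tab-separated export named safety_datasets.tsv; the file name, the tab delimiter, and the "not available" placeholder check are illustrative assumptions, not part of the catalogue itself):

```python
# Minimal sketch: read the catalogue as a TSV and filter it by schema fields.
# Assumptions (not from the catalogue): file name, tab delimiter, and that missing
# URLs are recorded as the string "not available".
import csv

FIELDS = [
    "data_name", "purpose_type", "purpose_tags", "purpose_stated", "purpose_llmdev",
    "entries_type", "entries_languages", "entries_n", "entries_unit", "entries_detail",
    "creation_creator_type", "creation_source_type", "creation_detail",
    "access_git_url", "access_hf_url", "access_license",
    "publication_date", "publication_affils", "publication_sector",
    "publication_name", "publication_venue", "publication_url",
    "other_notes", "other_date_added",
]

with open("safety_datasets.tsv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, fieldnames=FIELDS, delimiter="\t")
    next(reader)  # skip the header row
    rows = list(reader)

# Example query: eval-only datasets that have a Hugging Face mirror.
for row in rows:
    if row["purpose_llmdev"] == "eval only" and row["access_hf_url"] != "not available":
        print(row["data_name"], row["access_hf_url"])
```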
2
HarmEvalbroad safetyevaluate LLM misuse scenarioseval onlychatEnglish550promptsEach prompt is an unsafe question or instruction.machinebased on OpenAI and Meta usage policieshttps://github.com/NeuralSentinel/SafeInfernot availableunspecified25-Feb-2025IIT Kharagpur, MicrosoftmixedBanerjee et al.: "SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models"AAAI 2025https://ojs.aaai.org/index.php/AAAI/article/view/34927- covers 11 categories: Illegal Activity, Child Abuse Content, Physical Harm, Economic Harm, Political Campaigning, Privacy Violation Activity, Tailored Financial Advice, Fraud/Deception, Hate/Harass/Violence, Adult Content, Malware4.4.2025
3
ImplicitBiasbiassociodemographicsevaluate implicit bias in LLMseval onlyotherEnglish33,600promptsEach prompt is either an Implicit Association Test word association task or a decision scenario.hybridmixedpartly based on IAT templates, with custom sociodemographic attributes, and LLM-generated decision scenarioshttps://github.com/baixuechunzi/llm-implicit-biasnot availableMIT20-Feb-2025University of Chicago, Stanford University, New York University, Princeton Universityacademia / npoBai et al.: "Explicitly unbiased large language models still form biased associations"PNAShttps://www.pnas.org/doi/10.1073/pnas.2416228122- separate datasets for measuring implicit bias and decision bias21.3.2025
4
DifferenceAwarenessbiassociodemographicsevaluate difference awareness in LLMseval onlychatEnglish16,000promptsEach prompt is a question.humanmixedsome benchmarks sampled from others, some created by handhttps://github.com/Angelina-Wang/difference_awarenessnot availableCC0 1.0 Universal4-Feb-2025Stanford Universityacademia / npoWang et al.: "Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs"arxivhttps://arxiv.org/abs/2502.01926- consists of 8 benchmarks with 2k questions each
- 4 benchmarks are descriptive, 4 normative
21.3.2025
5
JBBBehavioursbroad safetyused to test adversarial methodevaluate effectiveness of different jailbreaking methodseval onlychatEnglish100promptsEach prompt is an unsafe question or instruction.hybridmixedsampled from AdvBench and HarmBench, plus hand-written exampleshttps://github.com/JailbreakBench/jailbreakbenchhttps://huggingface.co/datasets/JailbreakBench/JBB-BehaviorsMIT13-Dec-2024University of Pennsylvania, ETH Zurich, EPFL, Sony AImixedChao et al.: "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models"NeurIPS 2024 (D&B)https://arxiv.org/abs/2404.01318- covers 10 safety categories
- comes with 100 benign prompts from XSTest
01.08.2024
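Entries like the one above list a Hugging Face mirror in access_hf_url. A minimal loading sketch for JBB-Behaviors follows; the "behaviors" config and "harmful" split names are assumptions taken from the dataset card, so check https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors for the exact names.

```python
# Minimal sketch: load the JBB-Behaviors prompts from the Hugging Face Hub.
# Assumption: the repository exposes a "behaviors" config with a "harmful" split;
# verify both names on the dataset card before relying on them.
from datasets import load_dataset

jbb = load_dataset("JailbreakBench/JBB-Behaviors", "behaviors")
print(jbb)                # shows the available splits and their sizes
print(jbb["harmful"][0])  # first of the 100 harmful behaviour prompts
```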
6
MedSafetyBenchnarrow safetymedical safetymeasure the medical safety of LLMstrain and evalchatEnglish1,800conversationsEach conversation is an unsafe medical request with an associated safe response.machineoriginalgenerated by GPT-4 and llama-2-7bhttps://github.com/AI4LIFE-GROUP/med-safety-benchnot availableMIT13-Dec-2024Harvard Universityacademia / npoHan et al.: "MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models"NeurIPS 2024 (D&B)https://arxiv.org/abs/2403.03744- constructed around the AMA’s Principles of Medical Ethics01.08.2024
7
PRISMvalue alignmentcultural / socialcapture diversity of human preferences over LLM behaviourstrain and evalchatEnglish8,011conversationsConversations can be multi-turn, with user input and responses from one or multiple LLMs.humanoriginalwritten by crowdworkershttps://github.com/HannahKirk/prism-alignmenthttps://huggingface.co/datasets/HannahRoseKirk/prism-alignmentCC BY-NC 4.013-Dec-2024University of Oxford, University of Pennsylvania, Bocconi University, AWS AI Labs, ML Commons, UCL, Cohere, Meta AI, NYU, Contextual AI, MeedanmixedKirk et al.: "The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models"NeurIPS 2024 (D&B)https://arxiv.org/abs/2404.16019- collected from 1,500 participants residing in 38 different countries
- also comes with survey on each participant
01.08.2024
8
WildGuardMixothercontent moderationtrain and evaluate content moderation guardrailsotherchatEnglish86,759examplesEach example is either a prompt or a prompt + response annotated for safety.hybridmixedsynthetic data (87%), in-the-wild user-LLM interactions (11%), and existing annotator-written data (2%)not availablehttps://huggingface.co/datasets/allenai/wildguardmixODC BY13-Dec-2024Allen AI, University of Washington, Seoul National UniversitymixedHan et al.: "WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs"NeurIPS 2024 (D&B)https://arxiv.org/abs/2406.18495- consists of two splits, WildGuardTrain and WildGuardTest
9
SGBenchbroad safetybenchmarkevaluate the generalisation of LLM safety across various tasks and prompt typeseval onlychatEnglish1,442promptsEach prompt is a malicious instruction or question, which comes in multiple formats (direct query, jailbreak, multiple-choice or safety classification)humansampledsampled from AdvBench, HarmfulQA, BeaverTails and SaladBench, then expanded using templates and LLMshttps://github.com/MurrayTom/SG-Benchnot availableGPL 3.013-Dec-2024Peking Universityacademia / npoMou et al.: "SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types"NeurIPS 2024 (D&B)https://arxiv.org/abs/2410.21965- comes in multiple formats (chat, multiple choice, classification)27.02.2024
10
StrongREJECTbroad safetyused to test adversarial methodbetter investigate the effectiveness of different jailbreaking techniqueseval onlychatEnglish313promptsEach prompt is a 'forbidden question' in one of six categories.hybridmixedhuman-written and curated questions from existing datasets, plus LLM-generated prompts verified for quality and relevancehttps://github.com/alexandrasouly/strongreject/tree/mainnot availableMIT13-Dec-2024UC Berkeleyacademia / npoSouly et al.: "A StrongREJECT for Empty Jailbreaks"NeurIPS 2024 (D&B)https://arxiv.org/abs/2402.10260- focus of the work is adversarial / to jailbreak LLMs
- covers 6 question categories: disinformation/deception, hate/harassment/discrimination, illegal goods/services, non-violent crimes, sexual content, violence
27.02.2024
11
CoCoNotnarrow safetynon-complianceevaluate and improve contextual non-compliance in LLMstrain and evalchatEnglish12,478promptsEach prompt is a question or instruction that models should not comply with.hybridoriginalhuman-written seed prompts expanded using LLMshttps://github.com/allenai/noncompliancehttps://huggingface.co/datasets/allenai/coconotMIT13-Dec-2024Allen AI, University of Washington, Ohio State University, Microsoft Research, Samaya AI, NvidiamixedBrahman et al.: "The Art of Saying No: Contextual Noncompliance in Language Models"NeurIPS 2024 (D&B)https://arxiv.org/abs/2407.12043- covers 5 categories of non-compliance
- comes with a contrast set of requests that should be complied with
17.12.2024
12
WildJailbreakbroad safetytrain LLMs to be safetrain onlychatEnglish261,534conversationsEach conversation is single-turn, containing a potentially unsafe prompt and a safe model response.machineoriginalgenerated by different LLMs prompted with example prompts and jailbreak techniquesnot availablehttps://huggingface.co/datasets/allenai/wildjailbreakODC BY13-Dec-2024University of Washington, Allen AI, Seoul National University, Carnegie Mellon UniversitymixedJiang et al.: "WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models"NeurIPS 2024https://openreview.net/forum?id=n5R6TvBVcX&noteId=f3XHMKAJeK- contains Vanilla and Adversarial portions
- comes with a smaller testset of only prompts
17.12.2024
13
MultiTPvalue alignmentmoralevaluate LLM moral decision-making across many languageseval onlymultiple choiceAfrikaans, Amharic, Arabic, Azerbaijani, Belarusian, Bulgarian, Bengali, Bosnian, Catalan, Cebuano, Corsican, Czech, Welsh, Danish, German, Modern Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Western Frisian, Irish, Scottish Gaelic, Galician, Gujarati, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Croatian, Haitian, Hungarian, Armenian, Indonesian, Igbo, Icelandic, Italian, Modern Hebrew, Japanese, Javanese, Georgian, Kazakh, Central Khmer, Kannada, Korean, Kurdish, Kirghiz, Latin, Luxembourgish, Lao, Lithuanian, Latvian, Malagasy, Maori, Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Nepali, Dutch, Norwegian, Nyanja, Oriya, Panjabi, Polish, Pushto, Portuguese, Romanian, Russian, Sindhi, Sinhala, Slovak, Slovenian, Samoan, Shona, Somali, Albanian, Serbian, Southern Sotho, Sundanese, Swedish, Swahili, Tamil, Telugu, Tajik, Thai, Tagalog, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Xhosa, Yiddish, Yoruba, Chinese, Chinese (Traditional), Zulu107,000binary-choice questionsEach question is a trolley-style moral choice scenario.hybridmixedsampled from Moral Machines, then expanded with templated augmentations, then auto-translated into 106 languageshttps://github.com/causalNLP/moralmachinenot availableMIT13-Dec-2024MPI, ETH Zurich, Allen AI, University of Washington, University of Michigan, University of Triesteacademia / npoJin et al.: "Multilingual Trolley Problems for Language Models"Pluralistic Alignment Workshop at NeurIPS 2024https://arxiv.org/abs/2407.02273- covers 106 languages + English
- non-English languages are auto-translated
01.08.2024
14
DoAnythingNownarrow safetyjailbreak, adversarial goal in data creationcharacterise and evaluate in-the-wild LLM jailbreak promptseval onlychatEnglish15,140promptsEach prompt is an instruction or question, sometimes with a jailbreak.humansampledwritten by users on Reddit, Discord, websites and in other datasetshttps://github.com/verazuo/jailbreak_llmsnot availableMIT9-Dec-2024CISPA Helmholtz Center for Information Security, NetAppmixedShen et al.: "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models"CCS 2024https://dl.acm.org/doi/10.1145/3658644.3670388- there are 1,405 jailbreak prompts among the full prompt set11.01.2024
15
ForbiddenQuestionsbroad safetyused to test adversarial methodevaluate whether LLMs answer questions that violate OpenAI's usage policyeval onlychatEnglish390promptsEach prompt is a question targeting behaviour disallowed by OpenAI.machineoriginalGPT-written templates expanded by combinationhttps://github.com/verazuo/jailbreak_llmsnot availableMIT9-Dec-2024CISPA Helmholtz Center for Information Security, NetAppmixedShen et al.: "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models"CCS 2024https://dl.acm.org/doi/10.1145/3658644.3670388- covers 13 "forbidden" scenarios taken from the OpenAI usage policy11.01.2024
16
CoSafebroad safetyevaluating LLM safety in dialogue coreferenceeval onlychatEnglish1,400conversationsEach conversation is multi-turn with the final question being unsafe.hybridmixedsampled from BeaverTails, then expanded with GPT-4 into multi-turn conversationshttps://github.com/ErxinYu/CoSafe-Datasetnot availableunspecified12-Nov-2024Hong Kong Polytechnic University, HuaweimixedYu et al.: "CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference"EMNLP 2024 (Main)https://aclanthology.org/2024.emnlp-main.968/- focuses on multi-turn conversations
- covers 14 categories of harm from BeaverTails
01.08.2024
17
SoFabiassociodemographicsevaluate social biases of LLMseval onlyotherEnglish1,490,120sentencesEach sentence mentions one sociodemographic group.hybridmixedtemplates sampled from SBIC, then combined with sociodemographic groupsnot availablehttps://huggingface.co/datasets/copenlu/sofaMIT12-Nov-2024University of Pisa, University of Copenhagenacademia / npoManerba et al.: "Social Bias Probing: Fairness Benchmarking for Language Models"EMNLP 2024 (Main)https://aclanthology.org/2024.emnlp-main.812/- evaluation is based on sentence perplexity
- similar to StereoSet and CrowS-Pairs
17.12.2024
18
ArabicAdvBenchbroad safetyevaluate safety risks of LLMs in (different forms of) Arabiceval onlychatArabic520promptsEach prompt is a harmful instruction with an associated jailbreak prompt.machinemixedtranslated from AdvBench (which is machine-generated)https://github.com/SecureDL/arabic_jailbreak/tree/main/datanot availableMIT12-Nov-2024University of Central Floridaacademia / npoAl Ghanim et al.: "Jailbreaking LLMs with Arabic Transliteration and Arabizi"EMNLP 2024 (Main)https://aclanthology.org/2024.emnlp-main.1034- translated from AdvBench prompts
- comes in different versions of Arabic
17.12.2024
19
AyaRedTeamingbroad safetyadversarial goal in data creationprovide a testbed for exploring alignment across global and local preferencestrain and evalchatArabic, English, Filipino, French, Hindi, Russian, Serbian, Spanish7,419promptsEach prompt is a harmful question or instruction.humanoriginalwritten by paid native-speaking annotatorsnot availablehttps://huggingface.co/datasets/CohereForAI/aya_redteamingApache 2.012-Nov-2024Cohere for AI, CoheremixedAakanksha et al.: "The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm"EMNLP 2024 (Main)https://aclanthology.org/2024.emnlp-main.671/- distinguishes between global and local harm
- covers 9 harm categories
17.12.2024
20
UltraSafetybroad safetycreate data for safety fine-tuning of LLMstrain onlychatEnglish3,000promptsEach prompt is a harmful instruction with an associated jailbreak prompt.machinemixedsampled from AdvBench and MaliciousInstruct, then expanded with SelfInstructnot availablehttps://huggingface.co/datasets/openbmb/UltraSafetyMIT12-Nov-2024Renmin University of China, Tsinghua University, WeChat AI, TencentmixedGuo et al.: "Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment"EMNLP 2024 (Main)https://aclanthology.org/2024.emnlp-main.85/- comes with responses from different models, assessed for how safe they are by a GPT classifier01.08.2024
21
DeMETbiasgenderevaluate gender bias in LLM decision-makingeval onlymultiple choiceEnglish29scenariosEach scenario describes a married couple (with two name placeholders) facing a decision.humanoriginalwritten by the authors inspired by WHO questionnairehttps://github.com/sharonlevy/GenderBiasScenariosnot availableMIT12-Nov-2024Johns Hopkins University, Northeastern Illinois Universityacademia / npoLevy et al.: "Gender Bias in Decision-Making with Large Language Models: A Study of Relationship Conflicts"EMNLP 2024 (Findings)https://aclanthology.org/2024.findings-emnlp.331- comes with list of gender-specific / neutral names to insert into scenario templates17.12.2024
22
GESTbiasgendermeasure gender-stereotypical reasoning in language models and machine translation systems.eval onlyotherBelarusian, Russian, Ukrainian, Croatian, Serbian, Slovene, Czech, Polish, Slovak, English3,565sentencesEach sentence corresponds to a specific gender stereotype.humanoriginalwritten by professional translators, then validated by the authorshttps://github.com/kinit-sk/gesthttps://huggingface.co/datasets/kinit/gestApache 2.012-Nov-2024Kempelen Institute of Intelligent Technologiesacademia / npoPikuliak et al.: "Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling"EMNLP 2024 (Findings)https://aclanthology.org/2024.findings-emnlp.173/- data can be used to evaluate MLM or MT models
- covers 16 specific gender stereotypes (e.g. 'women are beautiful')
- covers 1 category of bias: gender
05.02.2024
23
GenMObiasgenderevaluate gender bias in LLM moral decision-makingeval onlymultiple choiceEnglish908scenariosEach scenario describes a person in an everyday situation where their actions can be judged to be moral or not.humansampledsampled from MoralStories, ETHICS, and SocialChemistryhttps://github.com/divij30bajaj/GenMOnot availableunspecified12-Nov-2024Texas A&M Universityacademia / npoBajaj et al.: "Evaluating Gender Bias of LLMs in Making Morality Judgements"EMNLP 2024 (Findings)https://aclanthology.org/2024.findings-emnlp.928- models are asked whether the action described in the prompt is moral or not17.12.2024
24
SGXSTestnarrow safetyfalse refusalevaluate exaggerated safety / false refusal in LLMs for the Singaporean contexteval onlychatEnglish200promptsEach prompt is a simple question.humanoriginalcreated based on XSTesthttps://github.com/walledai/walledevalhttps://huggingface.co/datasets/walledai/SGXSTestApache 2.012-Nov-2024Walled AIindustryGupta et al.: "WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models"EMNLP 2024 (Demo)https://aclanthology.org/2024.emnlp-demo.42/- half safe and half unsafe prompts, contrasting each other17.12.2024
25
HiXSTestnarrow safetyfalse refusalevaluate exaggerated safety / false refusal behaviour in Hindi LLMseval onlychatHindi50promptsEach prompt is a simple question.humanoriginalcreated based on XSTesthttps://github.com/walledai/walledevalhttps://huggingface.co/datasets/walledai/HiXSTestApache 2.012-Nov-2024Walled AIindustryGupta et al.: "WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models"EMNLP 2024 (Demo)https://aclanthology.org/2024.emnlp-demo.42/- half safe and half unsafe prompts, contrasting each other17.12.2024
26
CIVICSvalue alignmentcultural / socialevaluate the social and cultural variation of LLMs across multiple languages and value-sensitive topicseval onlychatGerman, Italian, French, English699excerptsEach excerpt is a short value-laden text passage.humansampledsampled from government and media sitesnot availablehttps://huggingface.co/datasets/CIVICS-dataset/CIVICSCC BY 4.016-Oct-2024Hugging Face, University of Amsterdam, Carnegie Mellon UniversitymixedPistilli et al.: "CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models"AIES 2024https://ojs.aaai.org/index.php/AIES/article/view/31710- covers socially sensitive topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy17.12.2024
27
GlobalOpinionQAvalue alignmentcultural / socialevaluate whose opinions LLM responses are most similar toeval onlymultiple choiceEnglish2,556multiple-choice questions-humansampledadapted from Pew’s Global Attitudes surveys and the World Values Survey (written by survey designers)not availablehttps://huggingface.co/datasets/Anthropic/llm_global_opinionsCC BY-NC-SA 4.07-Oct-2024AnthropicindustryDurmus et al.: "Towards Measuring the Representation of Subjective Global Opinions in Language Models"COLM 2024https://arxiv.org/abs/2306.16388- comes with responses from people across the globe
- goal is to capture more diversity than OpinionQA
04.01.2024
28
PolygloToxicityPromptsbroad safetytoxicityevaluate propensity of LLMs to generate toxic content across 17 languageseval onlyautocompleteArabic, Chinese, Czech, Dutch, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish425,000promptsEach prompt is an unfinished document.humansampledsampled from mC4 and The Pilehttps://github.com/kpriyanshu256/polyglo-toxicity-promptshttps://huggingface.co/datasets/ToxicityPrompts/PolygloToxicityPromptsAI2 ImpACT - LR7-Oct-2024Carnegie Mellon University, University of Virginia, Allen Institute for AIacademia / npoJain et al.: "PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models"COLM 2024https://arxiv.org/abs/2405.09373- fragments are documents rather than sentences, as for RealToxicityPrompts21.3.2025
29
CALMbiasgender and raceevaluate LLM gender and racial bias across different tasks and domainseval onlyotherEnglish78,400examplesEach example is a QA, sentiment classification or NLI example.hybridmixedtemplates created by humans based on other datasets, then expanded by combinationhttps://github.com/vipulgupta1011/CALMhttps://huggingface.co/datasets/vipulgupta/CALMMIT7-Oct-2024Penn Stateacademia / npoGupta et al.: "CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias"COLM 2024https://arxiv.org/abs/2308.12539- covers three tasks: QA, sentiment classification, NLI
- covers 2 categories of bias: gender and race
- 78,400 examples are generated from 224 templates
- Gender and race are instantiated using names
11.01.2024
30
CHiSafetyBenchbroad safetyevaluate LLM capabilities to identify risky content and refuse to answer risky questions in Chineseeval onlymultiple choiceChinese1,861multiple-choice questions-hybridmixedsampled from SafetyBench and SafetyPrompts, plus prompts generated by GPT-3.5https://github.com/UnicomAI/UnicomBenchmark/tree/main/CHiSafetyBenchnot availableunspecified2-Sep-2024China UnicomindustryZhang et al.: "CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models"arxivhttps://arxiv.org/abs/2406.10311- covers 5 risk areas: discrimination, violation of values, commercial violations, infringement of rights, security requirements for specific services
- comes with smaller set of unsafe open-ended questions
01.08.2024
31
PHTestnarrow safetyfalse refusalevaluate false refusal in LLMseval onlychatEnglish3,260promptsEach prompt is a "pseudo-harmful" question or instruction.machineoriginalgenerated by Llama2-7B-Chat, Llama3-8B-Instruct, and Mistral-7B-Instruct-V0.2.https://github.com/umd-huang-lab/FalseRefusalhttps://huggingface.co/datasets/furonghuang-lab/PHTestMIT1-Sep-2024University of Maryland, Adobe ResearchmixedAn et al.: "Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models"arxivhttps://arxiv.org/abs/2409.00598- main contribution of paper is method for generating similar datasets21.3.2025
32
SafetyBenchbroad safetybenchmarkevaluate LLM safety with multiple choice questionseval onlymultiple choiceEnglish, Chinese11,435multiple-choice questions-hybridmixedsampled from existing datasets + exams (zh), then LLM augmentation (zh)https://github.com/thu-coai/SafetyBenchhttps://huggingface.co/datasets/thu-coai/SafetyBenchMIT11-Aug-2024Tsinghua University, Northwest Minzu University, Peking University, China Mobile Research InstitutemixedZhang et al.: "SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions"ACL 2024 (Main)https://aclanthology.org/2024.acl-long.830/- split into 7 categories
- language distribution imbalanced across categories
- tests knowledge about safety rather than safety
04.01.2024
33
CatQAbroad safetyevaluate effectiveness of safety training methodeval onlychatEnglish, Chinese, Vietnamese550promptsEach prompt is a question.machineoriginalgenerated by unnamed LLM that is not safety-tunedhttps://github.com/declare-lab/restahttps://huggingface.co/datasets/declare-lab/CategoricalHarmfulQAApache 2.011-Aug-2024Singapore University of Technology and Design, Nanyang Technological Universityacademia / npoBhardwaj et al.: "Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic"ACL 2024 (Main)https://aclanthology.org/2024.acl-long.762/- covers 11 harm categories, each divided into 5 subcategories
- for each subcategory there are 10 prompts
17.12.2024
34
KorNATvalue alignmentcultural / socialevaluate LLM alignment with South Korean social valueseval onlymultiple choiceEnglish, Korean10,000multiple-choice questionsEach question is about a specific social scenario.hybridmixedgenerated by GPT-3.5 based on keywords sampled from media and surveysnot availablehttps://huggingface.co/datasets/datumo/KorNATCC BY-NC 4.011-Aug-2024KAIST, Datumo Inc, Seoul National University Bundang Hospital, Naver AI LabmixedLee et al.: "KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge"ACL 2024 (Findings)https://aclanthology.org/2024.findings-acl.666- each prompt comes in English and Korean
- also contains a dataset portion testing for common knowledge
17.12.2024
35
CMoralEvalvalue alignmentmoralevaluate the morality of Chinese LLMseval onlymultiple choiceChinese30,388multiple-choice questionsEach question is about a specific moral scenario.hybridmixedsampled from Chinese TV programs and other media, then turned into templates using GPT 3.5https://github.com/tjunlp-lab/CMoralEvalnot availableMIT11-Aug-2024Tianjin University, Kunming University, Zhengzhou Universityacademia / npoYu et al.: "CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models"ACL 2024 (Findings)https://aclanthology.org/2024.findings-acl.703- covers 2 types of moral scenarios
- GitHub is currently empty
17.12.2024
36
XSafetybroad safetyevaluate multilingual LLM safetyeval onlychatEnglish, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Japanese, German28,000promptsEach prompt is a question or instruction.hybridmixedsampled from SafetyPrompts, then auto-translated and validatedhttps://github.com/Jarviswang94/Multilingual_safety_benchmarknot availableApache 2.011-Aug-2024Chinese University of Hong Kong, TencentmixedWang et al.: "All Languages Matter: On the Multilingual Safety of Large Language Models"ACL 2024 (Findings)https://aclanthology.org/2024.findings-acl.349/- covers 14 safety scenarios, some "typical", some "instruction"01.08.2024
37
SaladBenchbroad safetyevaluate LLM safety, plus attack and defense methodseval onlychatEnglish21,000promptsEach prompt is a question or instruction.hybridmixedmostly sampled from existing datasets, then augmented using GPT-4https://github.com/OpenSafetyLab/SALAD-BENCHhttps://huggingface.co/datasets/OpenSafetyLab/Salad-DataApache 2.011-Aug-2024Shanghai Artificial Intelligence Laboratory, Harbin Institute of Technology, Beijing Institute of Technology, Chinese University of Hong Kong, Hong Kong Polytechnic Universityacademia / npoLi et al.: "SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models"ACL 2024 (Findings)https://arxiv.org/abs/2402.05044- structured with three-level taxonomy including 66 categories
- comes with multiple-choice question set
01.08.2024
38
MMHBbiassociodemographicsevaluate sociodemographic biases in LLMs across multiple languageseval onlychatEnglish, French, Hindi, Indonesian, Italian, Portuguese, Spanish, Vietnamese5,754,444promptsEach prompt is a sentence starting a two-person conversation.hybridmixedtemplates sampled from HolisticBias, then translated and expandedhttps://github.com/facebookresearch/ResponsibleNLP/tree/main/mmhbnot availableunspecified29-Jun-2024MetaindustryTan et al.: "Towards Massive Multilingual Holistic Bias"arxivhttps://arxiv.org/abs/2407.00486- covers 13 demographic axes
- based on HolisticBias
17.12.2024
39
GPTFuzzerbroad safetyused to test adversarial methodevaluate effectiveness of automated red-teaming methodeval onlychatEnglish100promptsEach prompt is a question or instruction.hybridsampledsampled from AnthropicHarmlessBase and unpublished GPT-written datasethttps://github.com/sherdencooper/GPTFuzznot availableMIT27-Jun-2024Northwestern University, Ant GroupmixedYu et al.: "GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts"arxivhttps://arxiv.org/abs/2309.10253- small dataset for testing automated jailbreaks01.08.2024
40
WMDPnarrow safetyhazardous knowledgemeasure hazardous knowledge in biosecurity, cybersecurity, and chemical securityeval onlychatEnglish3,688multiple-choice questions-humanoriginalwritten by experts (academics and technical consultants)https://github.com/centerforaisafety/wmdphttps://huggingface.co/datasets/cais/wmdpMIT25-Jun-2024Center for AI Safety, UC Berkeley, MIT, SecureBio, Scale AI, NYU, IIIT Hyderabad, Stanford University, Harvard University, USC, UIUC, UCLA, Lapis Labs, California Institute of Technology, Sybil, Northeastern University, Carnegie Mellon University, RTX BBN Technologies, University of Newcastle, Keio University, Ariel University, Arizona State University, xAImixedLi et al.: "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning"ICML 2024 (Poster)https://proceedings.mlr.press/v235/li24bc.html- covers 3 hazard categories: biosecurity, cybersecurity, and chemical security01.08.2024
41
ALERTbroad safetyevaluate the safety of LLMs through red teaming methodologieseval onlychatEnglish44,800promptsEach prompt is a question or instruction.hybridmixedsampled from AnthropicRedTeam, then augmented with templateshttps://github.com/Babelscape/ALERThttps://huggingface.co/datasets/Babelscape/ALERTCC BY-NC-SA 4.024-Jun-2024Sapienza University of Rome, Babelscape, TU Darmstadt, Hessian.AI, DFKI, Ontocord.AI, UChicago, UIUCmixedTedeschi et al.: "ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming"arxivhttps://arxiv.org/abs/2404.08676- covers 6 categories and 32 sub-categories informed by AI regulation and prior work01.08.2024
42
ORBenchnarrow safetyfalse refusalevaluate false refusal in LLMs at scaleeval onlychatEnglish80,000promptsEach prompt is a question or instruction.machineoriginalgenerated by Mixtral as toxic prompts, then rewritten by Mixtral into safe prompts, then filteredhttps://github.com/justincui03/or-benchhttps://huggingface.co/datasets/bench-llm/or-benchCC BY 4.020-Jun-2024UCLA, UC Berkeleyacademia / npoCui et al.: "OR-Bench: An Over-Refusal Benchmark for Large Language Models"arxivhttps://arxiv.org/abs/2405.20947- comes with a subset of 1k hard prompts and 600 toxic prompts
- covers 10 categories of harmful behaviour
17.12.2024
43
SorryBenchbroad safetyevaluate fine-grained LLM safety across varying linguistic characteristicseval onlychatEnglish, French, Chinese, Marathi, Tamil, Malayalam9,450promptsEach prompt is an unsafe question or instruction.hybridmixedsampled from 10 other datasets, then augmented with templateshttps://github.com/sorry-bench/sorry-benchhttps://huggingface.co/datasets/sorry-bench/sorry-bench-202406MIT20-Jun-2024Princeton University, Virginia Tech, Stanford University, UC Berkeley, UIUC, UChicagoacademia / npoXie et al.: "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors"arxivhttps://arxiv.org/abs/2406.14598- covers 45 potentially unsafe topics with 10 base prompts each
- includes 20 linguistic augmentations of the 450 base prompts, which results in 21 × 450 = 9,450 prompts
01.08.2024
44
XSTestnarrow safetyfalse refusalevaluate exaggerated safety / false refusal behaviour in LLMseval onlychatEnglish450promptsEach prompt is a simple question.humanoriginalwritten by the authorshttps://github.com/paul-rottger/exaggerated-safetyhttps://huggingface.co/datasets/natolambert/xstest-v2-copyCC BY 4.016-Jun-2024Bocconi University, University of Oxford, Stanford Universityacademia / npoRöttger et al.: "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"NAACL 2024 (Main)https://aclanthology.org/2024.naacl-long.301/- split into ten types of prompts
- 250 safe prompts and 200 unsafe prompts
23.12.2023
45
Flamesbroad safetyadversarial goal in data creationevaluate value alignment of Chinese language LLMseval onlychatChinese2,251promptsEach prompt is a question or instruction.humanoriginalwritten by crowdworkershttps://github.com/AIFlames/Flamesnot availableApache 2.016-Jun-2024Shanghai Artificial Intelligence Laboratory, Fudan Universityacademia / npoHuang et al.: "FLAMES: Benchmarking Value Alignment of LLMs in Chinese"NAACL 2024 (Main)https://aclanthology.org/2024.naacl-long.256/- covers 5 dimensions: fairness, safety, morality, legality, data protection01.08.2024
46
SEvalbroad safetyevaluate LLM safetyeval onlychatEnglish, Chinese20,000promptsEach prompt is an unsafe question or instruction.machineoriginalgenerated by a fine-tuned Qwen-14bhttps://github.com/IS2Lab/S-Evalhttps://huggingface.co/datasets/IS2Lab/S-EvalCC BY-NC-SA 4.028-May-2024Zhejiang University, AlibabamixedYuan et al.: "S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models"arxivhttps://arxiv.org/abs/2405.14191- covers 8 risk categories
- comes with 20 adversarial augmentations for each base prompt
01.08.2024
47
WorldValuesBenchvalue alignmentcultural / socialevaluate LLM awareness of multicultural human valuestrain and evalmultiple choiceEnglish21,492,393multiple-choice questionsEach question comes with sociodemographic attributes of the respondent.humansampledadapted from the World Values Survey (written by survey designers) using templateshttps://github.com/Demon702/WorldValuesBenchnot availableunspecified20-May-2024University of Massachusetts Amherst, Allen AI, University of North Carolina at Chapel Hillacademia / npoZhao et al.: "WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models"LREC COLING 2024 (Main)https://aclanthology.org/2024.lrec-main.1539/- covers 239 different questions01.08.2024
48
CBBQbiassociodemographicsevaluate sociodemographic bias of LLMs in Chineseeval onlymultiple choiceChinese106,588examplesEach example is a context plus a question with three answer choices.hybridoriginalpartly human-written, partly generated by GPT-4https://github.com/YFHuangxxxx/CBBQhttps://huggingface.co/datasets/walledai/CBBQCC BY-SA 4.020-May-2024Tianjin Universityacademia / npoHuang and Xiong: "CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models"LREC COLING 2024 (Main)https://aclanthology.org/2024.lrec-main.260- similar to BBQ but not parallel
- comprises 3,039 templates and 106,588 samples
17.12.2024
49
KoBBQbiassociodemographicsevaluate sociodemographic bias of LLMs in Koreaneval onlymultiple choiceKorean76,048examplesEach example is a context plus a question with three answer choices.humanmixedpartly sampled and translated from BBQ, partly newly writtenhttps://github.com/naver-ai/KoBBQ/https://huggingface.co/datasets/naver-ai/kobbqMIT3-May-2024KAIST, Naver AImixedJin et al.: "KoBBQ: Korean Bias Benchmark for Question Answering"TACLhttps://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00661/120915/KoBBQ-Korean-Bias-Benchmark-for-Question-Answering- similar to BBQ but not parallel
- comprises 268 templates and 76,048 samples across 12 categories of social bias
17.12.2024
50
SAFEbroad safetyevaluate LLM safety beyond binary distinction of safe and unsafeeval onlychatEnglish52,430conversationsEach conversation is single-turn, containing a prompt and a potentially harmful model response.hybridsampledsampled seed prompts from Friday website, then generated more prompts with GPT-4https://github.com/xiaoqiao/EvalSafetyLLMnot availableunspecified24-Apr-2024Zhejiang University, Westlake Universityacademia / npoYu et al.: "Beyond Binary Classification: A Fine-Grained Safety Dataset for Large Language Models"IEEE Accesshttps://ieeexplore.ieee.org/abstract/document/10507773- covers 7 classes: safe, sensitivity, harmfulness, falsehood, information corruption, unnaturalness, deviation from instructions01.08.2024
51
AegisAIContentSafetyothercontent moderationevaluate content moderation guardrailsotherchatEnglish11,997conversational turnsEach conversational turn is a user prompt or a model response annotated for safetyhybridmixedprompts sampled from AnthropicHarmlessBase, responses generated by Mistral-7B-v0.1-Instructnot availablehttps://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0CC BY 4.011-Sep-2024NVIDIAindustryGhosh et al.: "AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts"arxivhttps://arxiv.org/abs/2404.05993- covers 13 risk categories17.12.2024
52
DoNotAnswerbroad safetyevaluate 'dangerous capabilities' of LLMseval onlychatEnglish939promptsEach prompt is a question.machineoriginalgenerated by GPT-4https://github.com/Libr-AI/do-not-answerhttps://huggingface.co/datasets/LibrAI/do-not-answerApache 2.017-Mar-2024LibrAI, MBZUAI, University of MelbournemixedWang et al.: "Do-Not-Answer: Evaluating Safeguards in LLMs"EACL 2024 (Findings)https://aclanthology.org/2024.findings-eacl.61/- split across 5 risk areas and 12 harm types
- authors prompted GPT-4 to generate questions
23.12.2023
53
RuLESnarrow safetyrule-followingevaluate the ability of LLMs to follow simple ruleseval onlychatEnglish862promptsEach prompt is a test case combining rules and instructions.humanoriginalwritten by the authorshttps://github.com/normster/llm_ruleshttps://huggingface.co/datasets/normster/RuLESApache 2.08-Mar-2024UC Berkeley, Center for AI Safety, Stanford University, King Abdulaziz City for Science and Technologyacademia / npoMu et al.: "Can LLMs Follow Simple Rules?"arxivhttps://arxiv.org/abs/2311.04235- covers 19 rules across 14 scenarios11.01.2024
54
HarmBenchbroad safetyused to test adversarial methodevaluate effectiveness of automated red-teaming methodseval onlychatEnglish400promptsEach prompt is an instruction.humanoriginalwritten by the authorshttps://github.com/centerforaisafety/HarmBench/tree/main/data/behavior_datasetsnot availableMIT27-Feb-2024University of Illinois, Center for AI Safety, Carnegie Mellon University, UC Berkeley, MicrosoftmixedMazeika et al.: "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"ICML 2024 (Poster)https://arxiv.org/abs/2402.04249- covers 7 semantic categories of behaviour: Cybercrime & Unauthorized Intrusion, Chemical & Biological Weapons/Drugs, Copyright Violations, Misinformation & Disinformation, Harassment & Bullying, Illegal Activities, and General Harm
- The dataset also includes 110 multimodal prompts
07.04.2024
55
DecodingTrustbroad safetybenchmarkevaluate trustworthiness of LLMseval onlymultiple choiceEnglish243,877promptsEach prompt is an instruction.hybridoriginalhuman-written templates and examples plus extensive augmentation from GPTshttps://github.com/AI-secure/DecodingTrusthttps://huggingface.co/datasets/AI-Secure/DecodingTrustCC BY-SA 4.026-Feb-2024University of Illinois, Stanford University, UC Berkeley, Center for AI Safety, Microsoft, Chinese University of Hong KongmixedWang et al.: "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models"NeurIPS 2023 (Outstanding Paper)https://arxiv.org/abs/2306.11698- split across 8 'trustworthiness perspectives': toxicity, stereotypes, adversarial robustness, out-of-distribution robustness, robustness to adversarial demonstrations, privacy, machine ethics, and fairness11.01.2024
56
SimpleSafetyTestsbroad safetyevaluate critical safety risks in LLMseval onlychatEnglish100promptsEach prompt is a simple question or instruction.humanoriginalwritten by the authorshttps://github.com/bertiev/SimpleSafetyTestshttps://huggingface.co/datasets/Bertievidgen/SimpleSafetyTestsCC BY-NC 4.016-Feb-2024Patronus AI, University of Oxford, Bocconi UniversitymixedVidgen et al.: "SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models"arxivhttps://arxiv.org/abs/2311.08370- The dataset is split into ten types of prompts23.12.2023
57
ConfAIdenarrow safetyprivacyevaluate the privacy-reasoning capabilities of instruction-tuned LLMseval onlyotherEnglish1,326promptsEach prompt is a reasoning task, with complexity increasing across 4 tiers. Answer formats vary.hybridoriginalhuman-written templates expanded by combination and with LLMshttps://github.com/skywalker023/confAIde/tree/main/benchmarknot availableMIT11-Feb-2024University of Washington, Allen AI, Carnegie Mellon University, National University of Singaporeacademia / npoMireshghallah et al.: "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory"ICLR 2024 (Spotlight)https://openreview.net/forum?id=gmg7t8b4s0- split into 4 tiers with different prompt formats
- tier 1 contains 10 prompts, tier 2 2*98, tier 3 4*270, tier 4 50
11.01.2024
58
MaliciousInstructbroad safetyused to test adversarial methodevaluate success of generation exploit jailbreakeval onlychatEnglish100promptsEach prompt is an unsafe question.machineoriginalwritten by ChatGPT, then filtered by authorshttps://github.com/Princeton-SysML/Jailbreak_LLM/tree/main/datanot availableunspecified11-Feb-2024Princeton Universityacademia / npoHuang et al.: "Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation"ICLR 2024 (Spotlight)https://openreview.net/forum?id=r42tSSCHPh- covers ten 'malicious intentions': psychological manipulation, sabotage, theft, defamation, cyberbullying, false accusation, tax fraud, hacking, fraud, and illegal drug use.07.04.2024
59
HExPHIbroad safetyevaluate LLM safetyeval onlychatEnglish330promptsEach prompt is a harmful instruction.hybridmixedsampled from AdvBench and AnthropicRedTeam, then refined manually and with LLMs.not availablehttps://huggingface.co/datasets/LLM-Tuning-Safety/HEx-PHIcustom (HEx-PHI)11-Feb-2024Princeton University, Virginia Tech, IBM Research, Stanford UniversitymixedQi et al.: "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!"ICLR 2024 (Oral)https://openreview.net/forum?id=hTEGyKf0dZ- covers 11 harm areas
- focus of the article is on finetuning models
07.04.2024
60
CoNAnarrow safetyhate speechevaluate compliance of LLMs with harmful instructionseval onlychatEnglish178promptsEach prompt is an instruction.humansampledsampled from MT-CONAN, then rephrasedhttps://github.com/vinid/instruction-llms-safety-evalnot availableCC BY-NC 4.011-Feb-2024Stanford University, Bocconi Universityacademia / npoBianchi et al.: "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"ICLR 2024 (Poster)https://openreview.net/forum?id=gT5hALch9z- focus: harmful instructions (e.g. hate speech)
- all prompts are (meant to be) unsafe
23.12.2023
61
ControversialInstructionsnarrow safetyhate speechevaluate LLM behaviour on controversial topicseval onlychatEnglish40promptsEach prompt is an instruction.humanoriginalwritten by the authorshttps://github.com/vinid/instruction-llms-safety-evalnot availableCC BY-NC 4.011-Feb-2024Stanford University, Bocconi Universityacademia / npoBianchi et al.: "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"ICLR 2024 (Poster)https://openreview.net/forum?id=gT5hALch9z- Focus: controversial topics (e.g. immigration)
- All prompts are (meant to be) unsafe
23.12.2023
62
QHarmbroad safetyevaluate LLM safetyeval onlychatEnglish100promptsEach prompt is a question.humansampledsampled randomly from AnthropicHarmlessBase (written by crowdworkers)https://github.com/vinid/instruction-llms-safety-evalnot availableCC BY-NC 4.011-Feb-2024Stanford University, Bocconi Universityacademia / npoBianchi et al.: "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"ICLR 2024 (Poster)https://openreview.net/forum?id=gT5hALch9z- Wider topic coverage due to source dataset
- Prompts are mostly unsafe
23.12.2023
63
MaliciousInstructionsbroad safetyevaluate compliance of LLMs with malicious instructionseval onlychatEnglish100promptsEach prompt is an instruction.machineoriginalgenerated by GPT-3 (text-davinci-003)https://github.com/vinid/instruction-llms-safety-evalnot availableCC BY-NC 4.011-Feb-2024Stanford University, Bocconi Universityacademia / npoBianchi et al.: "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"ICLR 2024 (Poster)https://openreview.net/forum?id=gT5hALch9z- Focus: malicious instructions (e.g. bombmaking)
- All prompts are (meant to be) unsafe
23.12.2023
64
PhysicalSafetyInstructionsnarrow safetyphysical safetyevaluate LLM commonsense physical safetyeval onlychatEnglish1,000promptsEach prompt is an instruction.humanmixedsampled from SafeText, then rephrasedhttps://github.com/vinid/instruction-llms-safety-evalnot availableCC BY-NC 4.011-Feb-2024Stanford University, Bocconi Universityacademia / npoBianchi et al.: "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"ICLR 2024 (Poster)https://openreview.net/forum?id=gT5hALch9z- Focus: commonsense physical safety
- 50 safe and 50 unsafe prompts
23.12.2023
65
SafetyInstructionsbroad safetypositive interactionfine-tune LLMs to be safertrain onlychatEnglish2,000conversationsEach conversation is a user prompt with a safe model response.hybridmixedprompts sampled from AnthropicRedTeam, responses generated by gpt-3.5-turbohttps://github.com/vinid/safety-tuned-llamas/tree/main/data/trainingnot availableCC BY-NC 4.011-Feb-2024Stanford University, Bocconi Universityacademia / npoBianchi et al.: "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"ICLR 2024 (Poster)https://openreview.net/forum?id=gT5hALch9z07.04.2024
66
SycophancyEvalnarrow safetysycophancyevaluate sycophancy in LLMseval onlychatEnglish20,956promptsEach prompt is an open-ended question or instruction.hybridmixedprompts are written by humans and modelshttps://github.com/meg-tong/sycophancy-eval/tree/mainhttps://huggingface.co/datasets/meg-tong/sycophancy-evalMIT11-Feb-2024AnthropicindustrySharma et al.: "Towards Understanding Sycophancy in Language Models"ICLR 2024 (Poster)https://openreview.net/forum?id=tvhaxkMKAn- uses four different task setups to evaluate sycophancy: answer (7268 prompts), are_you_sure (4888 prompts), feedback (8500 prompts), mimicry (300 prompts)07.04.2024
67
OKTestnarrow safetyfalse refusalevaluate false refusal in LLMseval onlychatEnglish350promptsEach prompt is a question or instruction that models should not refuse.machineoriginalgenerated by GPT-4 based on keywordshttps://github.com/InvokerStark/OverKillnot availableunspecified31-Jan-2024Fudan University, UC Santa Barbara, Shanghai AI Laboratoryacademia / npoShi et al.: "Navigating the OverKill in Large Language Models"arxivhttps://arxiv.org/abs/2401.17633- contains only safe prompts17.12.2024
68
AdvBenchbroad safetyused to test adversarial methodelicit generation of harmful or objectionable content from LLMseval onlychatEnglish1,000prompts500 are harmful strings that the model should not reproduce, 500 are harmful instructions.machineoriginalgenerated by Wizard-Vicuna-30B-Uncensoredhttps://github.com/llm-attacks/llm-attacks/tree/main/data/advbenchnot availableMIT20-Dec-2023Carnegie Mellon University, Center for AI Safety, Google DeepMind, Bosch Center for AImixedZou et al.: "Universal and Transferable Adversarial Attacks on Aligned Language Models"arxivhttps://arxiv.org/abs/2307.15043- focus of the work is adversarial / to jailbreak LLMs
- AdvBench tests whether jailbreaks succeeded
11.01.2024
69
Mosscapotheradversarial goal in data creation, crowdsourcedanalyse prompt hacking / extraction attacksotherchatEnglish278,945promptsMost prompts are prompt extraction attacks.humanoriginalwritten by players of the Mosscap gamenot availablehttps://huggingface.co/datasets/Lakera/mosscap_prompt_injectionMIT20-Dec-2023Lakera AIindustryLakera AI: "Mosscap Prompt Injection"bloghttps://grt.lakera.ai/mosscap- focus of the work is adversarial / prompt extraction
- not all prompts are attacks
- prompts correspond to 8 difficulty levels of the game
11.01.2024
70
MoralChoicevalue alignmentmoralevaluate the moral beliefs in LLMseval onlymultiple choiceEnglish1,767binary-choice questionEach prompt is a hypothetical moral scenario with two potential actions.hybridoriginalmostly generated by GPTs, plus some human-written scenarioshttps://github.com/ninodimontalcino/moralchoicehttps://huggingface.co/datasets/ninoscherrer/moralchoiceCC BY 4.010-Dec-2023FAR AI, Columbia Universityacademia / npoScherrer et al.: "Evaluating the Moral Beliefs Encoded in LLMs"NeurIPS 2023 (Spotlight)https://openreview.net/forum?id=O06z2G18me- 687 scenarios are low-ambiguity, 680 are high-ambiguity
- three Surge annotators choose the favourable action for each scenario
11.01.2024
71
DICES350value alignmentsafetycollect diverse perspectives on conversational AI safetyeval onlychatEnglish350conversationsConversations can be multi-turn, with user input and LLM output.humanoriginalwritten by adversarial LaMDA usershttps://github.com/google-research-datasets/dices-dataset/not availableCC BY 4.010-Dec-2023Google Research, UCL, University of CambridgemixedAroyo et al.: "DICES Dataset: Diversity in Conversational AI Evaluation for Safety"NeurIPS 2023 (Poster)https://proceedings.neurips.cc/paper_files/paper/2023/hash/a74b697bce4cac6c91896372abaa8863-Abstract-Datasets_and_Benchmarks.html- 104 ratings per item
- annotators from US
- annotation across 24 safety criteria
23.12.2023
72
DICES990value alignmentsafetycollect diverse perspectives on conversational AI safetyeval onlychatEnglish990conversationsConversations can be multi-turn, with user input and LLM output.humanoriginalwritten by adversarial LaMDA usershttps://github.com/google-research-datasets/dices-dataset/not availableCC BY 4.010-Dec-2023Google Research, UCL, University of CambridgemixedAroyo et al.: "DICES Dataset: Diversity in Conversational AI Evaluation for Safety"NeurIPS 2023 (Poster)https://proceedings.neurips.cc/paper_files/paper/2023/hash/a74b697bce4cac6c91896372abaa8863-Abstract-Datasets_and_Benchmarks.html- 60-70 ratings per item
- annotators from US and India
- annotation across 16 safety criteria
23.12.2023
73
BeaverTailsbroad safetyevaluate and improve LLM safety on QAtrain and evalchatEnglish333,963conversationsEach conversation contains a human prompt and LLM response.hybridmixedprompts sampled from human AnthropicRedTeam data, plus model-generated responseshttps://github.com/PKU-Alignment/beavertailshttps://huggingface.co/datasets/PKU-Alignment/BeaverTailsCC BY-NC 4.010-Dec-2023Peking Universityacademia / npoJi et al.: "BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset"NeurIPS 2023 (Poster)https://arxiv.org/abs/2307.04657- 16,851 unique prompts sampled from AnthropicRedTeam
- covers 14 harm categories (e.g. animal abuse)
- annotated for safety by 3.34 crowdworkers on average
11.01.2024
74
TDCRedTeamingbroad safetyused to test adversarial methodevaluate success of automated red-teaming approacheseval onlychatEnglish100promptsEach prompt is an instruction.humanoriginalwritten by the authorshttps://github.com/centerforaisafety/tdc2023-starter-kit/tree/main/red_teaming/datanot availableMIT10-Dec-2023UIUC, Carnegie Mellon University, UC Berkeley, Center for AI Safety, Microsoft, University of WashingtonmixedMazeika et al.: "TDC 2023 (LLM Edition): The Trojan Detection Challenge"NeurIPS 2023 (Competition)https://trojandetection.ai/- covers 7 categories: Bigotry and Abusive Language, Violent Content and Conduct, Illegal Activities, Malware and Exploits, Scams, Misinformation and Disinformation, Other Undesirable Content07.04.2024
75
PromptExtractionRobustnessnarrow safetyprompt extraction, adversarial goal in data creation, crowdsourcedevaluate LLM vulnerability to prompt extractioneval onlychatEnglish569samplesEach sample combines defense and attacker input.humanoriginalwritten by Tensor Trust playershttps://github.com/HumanCompatibleAI/tensor-trust-datahttps://huggingface.co/datasets/qxcv/tensor-trustunspecified10-Dec-2023UC Berkeley, Georgia Tech, Harvard Universityacademia / npoToyer et al.: "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game"Instruction Workshop at NeurIPS 2023https://openreview.net/forum?id=UWymGURI75- filtered from larger raw prompt extraction dataset
- collected using the open Tensor Trust online game
11.01.2024
76
PromptHijackingRobustnessnarrow safetyprompt extraction, adversarial goal in data creation, crowdsourcedevaluate LLM vulnerability to prompt hijackingeval onlychatEnglish775samplesEach sample combines defense and attacker input.humanoriginalwritten by Tensor Trust playershttps://github.com/HumanCompatibleAI/tensor-trust-datahttps://huggingface.co/datasets/qxcv/tensor-trustunspecified10-Dec-2023UC Berkeley, Georgia Tech, Harvard Universityacademia / npoToyer et al.: "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game"Instruction Workshop at NeurIPS 2023https://openreview.net/forum?id=UWymGURI75- filtered from larger raw prompt extraction dataset
- collected using the open Tensor Trust online game
11.01.2024
77
JADEbroad safetyuse linguistic fuzzing to generate challenging prompts for evaluating LLM safetyeval onlychatChinese, English2,130promptsEach prompt is a question targeting a specific harm category.machineoriginalgenerated by LLMs based on linguistic ruleshttps://github.com/whitzard-ai/jade-dbnot availableMIT10-Dec-2023Whitzard AI, Fudan UniversitymixedZhang et al.: "JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models"arxivhttps://arxiv.org/abs/2311.00286- JADE is a platform for safety data generation and evaluation
- Prompt generations are based on linguistic rules created by authors
- the paper comes with 4 example datasets
05.02.2024
78
CPADbroad safetyadversarial goal in data creationelicit generation of harmful or objectionable content from LLMseval onlychatChinese10,050promptsEach prompt is a longer-form scenario / instruction aimed at eliciting a harmful response.hybridoriginalmostly generated by GPTs based on some human seed promptshttps://github.com/liuchengyuan123/CPADnot availableCC BY-SA 4.08-Dec-2023Zhejiang University, AlibabamixedLiu et al.: "Goal-Oriented Prompt Attack and Safety Evaluation for LLMs"arxivhttps://arxiv.org/abs/2309.11830- focus of the work is adversarial / to jailbreak LLMs
- CPAD stands for Chinese Prompt Attack Dataset
11.01.2024
79
CyberattackAssistancenarrow safetycyberattacksevaluate LLM compliance in assisting in cyberattackseval onlychatEnglish1,000promptsEach prompt is an instruction to assist in a cyberattack.hybridoriginalwritten by experts, augmented with LLMshttps://github.com/facebookresearch/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/mitrenot availablecustom (Llama2 Community License)7-Dec-2023MetaindustryBhatt et al.: "Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models"arxivhttps://arxiv.org/abs/2312.04724- instructions are split into 10 MITRE categories
- The dataset comes with additional LLM-rephrased instructions
23.12.2023
80
AdvPromptSetbroad safetyadversarial goal in data creationevaluate LLM responses to adversarial toxicity text promptseval onlyotherEnglish197,628sentencesEach sentence is taken from a social media dataset.humansampledsampled from two Jigsaw social media datasets (written by social media users)https://github.com/facebookresearch/ResponsibleNLP/tree/main/AdvPromptSetnot availableMIT6-Dec-2023MetaindustryEsiobu et al.: "ROBBIE: Robust Bias Evaluation of Large Generative Language Models"EMNLP 2023 (Main)https://aclanthology.org/2023.emnlp-main.230/- created as part of the ROBBIE bias benchmark
- originally labelled for toxicity by Jigsaw
11.01.2024
81
HolisticBiasRbiassociodemographicsevaluate LLM completions for sentences related to individual sociodemographicseval onlyautocompleteEnglish214,460promptsEach prompt is the beginning of a sentence related to a person's sociodemographics.hybridoriginalhuman-written templates expanded by combinationhttps://github.com/facebookresearch/ResponsibleNLP/tree/main/robbienot availableMIT6-Dec-2023MetaindustryEsiobu et al.: "ROBBIE: Robust Bias Evaluation of Large Generative Language Models"EMNLP 2023 (Main)https://aclanthology.org/2023.emnlp-main.230/- created as part of the ROBBIE bias benchmark
- constructed from 60 Regard templates
- uses noun phrases from Holistic Bias
- covers 11 categories of bias: age, body type, class, culture, disability, gender, nationality, political ideology, race/ethnicity, religion, sexual orientation
11.01.2024
82
HackAPromptotheradversarial goal in data creation, crowdsourcedanalyse prompt hacking / extraction attacksotherchatmostly English601,757promptsMost prompts are prompt extraction attacks.humanoriginalwritten by participants of the HackAPrompt competitionnot availablehttps://huggingface.co/datasets/hackaprompt/hackaprompt-datasetMIT6-Dec-2023University of Maryland, Mila, Towards AI, Stanford University, Technical University of Sofia, University of Milan, NYU, University of ArizonamixedSchulhoff et al.: "Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition"EMNLP 2023 (Main)https://aclanthology.org/2023.emnlp-main.302/- focus of the work is adversarial / prompt hacking
- prompts were written by ca. 2.8k people from 50+ countries
11.01.2024
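Since the HackAPrompt entry above is hosted on the Hugging Face Hub, a minimal loading sketch with the datasets library follows; the split name is an assumption to verify against the dataset card.

from datasets import load_dataset

# Load the HackAPrompt competition submissions from the Hugging Face Hub.
# The "train" split is an assumption; check the dataset card at
# https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset
ds = load_dataset("hackaprompt/hackaprompt-dataset", split="train")
print(ds)     # features and row count
print(ds[0])  # inspect one submission record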
83
ToxicChatothercontent moderationevaluate dialogue content moderation systemsotherchatmostly English10,166conversationsEach conversation is single-turn, with a user input and an LLM output.humanoriginalwritten by LMSYS usersnot availablehttps://huggingface.co/datasets/lmsys/toxic-chatCC BY-NC 4.06-Dec-2023UC San Diegoacademia / npoLin et al.: "ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions"EMNLP 2023 (Findings)https://aclanthology.org/2023.findings-emnlp.311/- subset of LMSYSChat1M
- annotated for toxicity by 4 authors
- ca. 7% toxic
23.12.2023
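For the ToxicChat entry above, the note that roughly 7% of conversations are toxic suggests filtering on a binary toxicity label. The sketch below assumes a versioned config name and a toxicity column; both should be confirmed on the dataset card.

from datasets import load_dataset

# ToxicChat is published in versioned configs on the Hub; the config name
# ("toxicchat0124") and the binary "toxicity" column used here are assumptions
# to verify at https://huggingface.co/datasets/lmsys/toxic-chat
ds = load_dataset("lmsys/toxic-chat", "toxicchat0124", split="train")
toxic = ds.filter(lambda row: row["toxicity"] == 1)
print(f"{len(toxic)} of {len(ds)} conversations flagged toxic (~7% expected)")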
84
AARTbroad safetyused to test adversarial methodillustrate the AART automated red-teaming methodeval onlychatEnglish3,269promptsEach prompt is an instruction.machineoriginalgenerated by PALMhttps://github.com/google-research-datasets/aart-ai-safety-datasetnot availableCC BY 4.06-Dec-2023Google ResearchindustryRadharapu et al.: "AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications"EMNLP 2023 (Industry)https://aclanthology.org/2023.emnlp-industry.37/- contains examples for specific geographic regions
- prompts also vary use cases and concepts
23.12.2023
85
DELPHIbroad safetyevaluate LLM performance in handling controversial issueseval onlychatEnglish29,201promptsEach prompt is a question about a more or less controversial issue.humansampledsampled from the Quora Question Pair Dataset (written by Quora users)https://github.com/apple/ml-delphi/tree/main/datasetnot availableCC BY 4.06-Dec-2023AppleindustrySun et al.: "DELPHI: Data for Evaluating LLMs’ Performance in Handling Controversial Issues"EMNLP 2023 (Industry)https://aclanthology.org/2023.emnlp-industry.76- annotated for 5 levels of controversy
- annotators are native English speakers who have spent significant time in Western Europe
07.04.2024
86
AttaQbroad safetyadversarial goal in data creationevaluate tendency of LLMs to generate harmful or undesirable responseseval onlychatEnglish1,402promptsEach prompt is a question.hybridmixedsampled from AnthropicRedTeam, generated by LLMs, and crawled from Wikipedianot availablehttps://huggingface.co/datasets/ibm/AttaQMIT6-Dec-2023IBM Research AIindustryKour et al.: "Unveiling Safety Vulnerabilities of Large Language Models"GEM Workshop at EMNLP 2023https://aclanthology.org/2023.gem-1.10/- consists of a mix of sources01.08.2024
87
DiscrimEvalbiassociodemographicsevaluate the potential discriminatory impact of LMs across use caseseval onlymultiple choiceEnglish9,450binary-choice questionsEach question comes with scenario context.machineoriginaltopics, templates and questions generated by Claudenot availablehttps://huggingface.co/datasets/Anthropic/discrim-evalCC BY 4.06-Dec-2023AnthropicindustryTamkin et al.: "Evaluating and Mitigating Discrimination in Language Model Decisions"arxivhttps://arxiv.org/abs/2312.03689- covers 70 different decision scenarios
- each question comes with an 'implicit' version where race and gender are conveyed through associated names
- covers 3 categories of bias: race, gender, age
11.01.2024
88
SPMisconceptionsnarrow safetyprivacymeasure the ability of LLMs to refute misconceptionseval onlychatEnglish122promptsEach prompt is a single-sentence misconception.humanoriginalwritten by the authorshttps://github.com/purseclab/LLM_Security_Privacy_Advicenot availableMIT4-Dec-2023Purdue Universityacademia / npoChen et al.: "Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions"ACSAC 2023https://dl.acm.org/doi/10.1145/3627106.3627196- misconceptions all relate to security and privacy
- uses templates to turn misconceptions into prompts
- covers six categories (e.g. crypto and blockchain, law and regulation)
11.01.2024
89
FFTbroad safetyevaluate factuality, fairness, and toxicity of LLMseval onlychatEnglish2,116promptsEach prompt is a question or instruction, sometimes within a jailbreak template.hybridmixedsampled from AnthropicRedTeam, generated by LLMs, and crawled from Wikipedia, Reddit and elsewherehttps://github.com/cuishiyao96/FFTnot availableunspecified30-Nov-2023Chinese Academy of Sciences, University of Chinese Academy of Sciences, BaidumixedCui et al.: "FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity"arxivhttps://arxiv.org/abs/2311.18580- tests factuality, fairness and toxicity01.08.2024
90
GandalfIgnoreInstructionsotheradversarial goal in data creation, crowdsourcedanalyse prompt hacking / extraction attacksotherchatEnglish1,000promptsMost prompts are prompt extraction attacks.humanoriginalwritten by players of the Gandalf gamenot availablehttps://huggingface.co/datasets/Lakera/gandalf_ignore_instructionsMIT2-Oct-2023Lakera AIindustryLakera AI: "Gandalf Prompt Injection: Ignore Instruction Prompts"bloghttps://www.lakera.ai/blog/who-is-gandalf- focus of the work is adversarial / prompt extraction
- not all prompts are attacks
05.02.2024
91
GandalfSummarizationotheradversarial goal in data creation, crowdsourcedanalyse prompt hacking / extraction attacksotherchatEnglish140promptsMost prompts are prompt extraction attacks.humanoriginalwritten by players of the Gandalf gamenot availablehttps://huggingface.co/datasets/Lakera/gandalf_summarizationMIT2-Oct-2023Lakera AIindustryLakera AI: "Gandalf Prompt Injection: Summarization Prompts"bloghttps://www.lakera.ai/blog/who-is-gandalf- focus of the work is adversarial / prompt extraction
- not all prompts are attacks
05.02.2024
92
HarmfulQAbroad safetyevaluate and improve LLM safetyeval onlychatEnglish1,960promptsEach prompt is a question.machineoriginalgenerated by ChatGPThttps://github.com/declare-lab/red-instruct/tree/main/harmful_questionshttps://huggingface.co/datasets/declare-lab/HarmfulQAApache 2.030-Aug-2023Singapore University of Technology and Designacademia / npoBhardwaj and Poria: "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment"arxivhttps://arxiv.org/abs/2308.09662- split into 10 topics (e.g. "Mathematics and Logic")
- similarity across prompts is quite high
- not all prompts are unsafe / safety-related
23.12.2023
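The HarmfulQA note that similarity across prompts is quite high can be probed with a rough lexical check; the sketch below assumes the Hub copy exposes a train split with a question column, which should be verified on the dataset card.

import numpy as np
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The split and "question" column names are assumptions; verify them at
# https://huggingface.co/datasets/declare-lab/HarmfulQA
questions = load_dataset("declare-lab/HarmfulQA", split="train")["question"]

# Rough lexical similarity: TF-IDF vectors plus cosine similarity.
sims = cosine_similarity(TfidfVectorizer().fit_transform(questions))
np.fill_diagonal(sims, 0.0)  # ignore self-similarity
print(f"mean nearest-neighbour similarity: {sims.max(axis=1).mean():.2f}")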
93
LatentJailbreaknarrow safetyjailbreak, adversarial goal in data creationevaluate safety and robustness of LLMs in response to adversarial promptseval onlychatEnglish416promptsEach prompt combines a meta-instruction (e.g. translation) with the same toxic instruction template.hybridoriginalhuman-written templates expanded by combinationhttps://github.com/qiuhuachuan/latent-jailbreak/tree/mainnot availableMIT28-Aug-2023Zhejiang University, Westlake Universityacademia / npoQiu et al.: "Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models"arxivhttps://arxiv.org/abs/2307.08487- focus of the work is adversarial / to jailbreak LLMs
- 13 prompt templates instantiated with 16 protected group terms and 2 positional types
- main exploit focuses on translation
11.01.2024
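The LatentJailbreak prompt count follows directly from its construction: 13 templates x 16 protected-group terms x 2 positional types = 416, matching the entry size above. A trivial check:

# 13 templates x 16 protected-group terms x 2 positional types
templates, group_terms, positional_types = 13, 16, 2
assert templates * group_terms * positional_types == 416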
94
OpinionQAvalue alignmentcultural / socialevaluate the alignment of LLM opinions with US demographic groupseval onlymultiple choiceEnglish1,498multiple-choice questions-humansampledadapted from the Pew American Trends Panel surveyshttps://github.com/tatsu-lab/opinions_qanot availableunspecified23-Jul-2023Stanford Universityacademia / npoSanturkar et al.: "Whose Opinions Do Language Models Reflect?"ICML 2023 (Oral)https://dl.acm.org/doi/10.5555/3618408.3619652- questions taken from 15 ATP surveys
- covers 60 demographic groups
04.01.2024
95
CValuesResponsibilityMCvalue alignmentsafetyevaluate human value alignment in Chinese LLMseval onlymultiple choiceChinese1,712multiple-choice questionsEach question targets responsible behaviours.machineoriginalautomatically created from human-written promptshttps://github.com/X-PLUG/CValueshttps://huggingface.co/datasets/Skepsun/cvalues_rlhfApache 2.019-Jul-2023Alibaba, Beijing Jiaotong UniversitymixedXu et al.: "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility"arxivhttps://arxiv.org/abs/2307.09705- distinguishes between unsafe and irresponsible responses
- covers 8 domains (e.g. psychology, law, data science)
04.01.2024
96
CValuesResponsibilityPromptsvalue alignmentsafetyevaluate human value alignment in Chinese LLMseval onlychatChinese800promptsEach prompt is an open question targeting responsible behaviours.humanoriginalwritten by the authorshttps://github.com/X-PLUG/CValueshttps://huggingface.co/datasets/Skepsun/cvalues_rlhfApache 2.019-Jul-2023Alibaba, Beijing Jiaotong UniversitymixedXu et al.: "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility"arxivhttps://arxiv.org/abs/2307.09705- distinguishes between unsafe and irresponsible responses
- covers 8 domains (e.g. psychology, law, data science)
04.01.2024
97
CHBiasbiassociodemographicsevaluate LLM bias related to sociodemographicseval onlyotherChinese4,800Weibo postsEach post references a target group.humansampledposts sampled from Weibo (written by Weibo users)https://github.com/hyintell/CHBiasnot availableMIT9-Jul-2023Eindhoven University, University of Liverpool, University of Technology Sydneyacademia / npoZhao et al.: "CHBias: Bias Evaluation and Mitigation of Chinese Conversational Language Models"ACL 2023 (Main)https://aclanthology.org/2023.acl-long.757/- 4 bias categories: gender, sexual orientation, age, appearance
- annotated by Chinese NLP grad students
- similar evaluation setup to CrowS-Pairs
04.01.2024
98
HarmfulQbroad safetyevaluate LLM safetyeval onlychatEnglish200promptsEach prompt is a question.machineoriginalgenerated by GPT-3 (text-davinci-002)https://github.com/SALT-NLP/chain-of-thought-biasnot availableMIT9-Jul-2023Stanford University, Shanghai Jiao Tong University, Georgia Techacademia / npoShaikh et al.: "On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning"ACL 2023 (Main)https://aclanthology.org/2023.acl-long.244/- focus on 6 attributes: "racist, stereotypical, sexist, illegal, toxic, harmful"
- authors do manual filtering for overly similar questions
23.12.2023
99
FairPrismothercontent moderationanalyse harms in conversations with LLMsotherchatEnglish5,000conversationsEach conversation is single-turn, containing a prompt and a potentially harmful model response.hybridsampledprompts sampled from ToxiGen and Social Bias Frames, responses from GPTs and XLNethttps://github.com/microsoft/fairprismnot availableMIT9-Jul-2023UC Berkeley, Microsoft ResearchmixedFleisig et al.: "FairPrism: Evaluating Fairness-Related Harms in Text Generation"ACL 2023 (Main)https://aclanthology.org/2023.acl-long.343- does not introduce new prompts
- focus is on analysing model responses
07.04.2024
100
SeeGULLbiassociodemographicsexpand cultural and geographic coverage of stereotype benchmarkseval onlyotherEnglish7,750tuplesEach tuple is an identity group plus a stereotype attribute.hybridoriginalgenerated by LLMs, partly validated by human annotationhttps://github.com/google-research-datasets/seegullnot availableCC BY-SA 4.09-Jul-2023Virginia Tech, Google ResearchmixedJha et al.: "SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models"ACL 2023 (Main)https://aclanthology.org/2023.acl-long.548- stereotypes about identity groups spanning 178 countries in 8 geo-political regions across 6 continents
- examples accompanied by fine-grained offensiveness scores
05.02.2024