HydraLM HF | Which GPT Model (if applicable) | Size (rows) | Domains | Status | Model | Who's working | Reviewer | Review Notes | Message Types | Notes | Basic Data Cleaning: DeDupe & Whitespaces | Data Clean, Who's working
NobodyExistsOnTheInternet/ConvoEvolLIMAuncensored | https://huggingface.co/datasets/nigh8w0lf/hydra_moe_ConvoEvolLIMAuncensored | uncensored, code, computer science, physics | Pushed to HF | night_w0lf | vikp | instruction, output | Multi-turn conversation with no instructions or system message; all inputs were defaulted to instruction | Pending
toolLLM: https://drive.google.com/file/d/1lTelETDJ1TeAYiXmi485brsPucagpTnk/view?usp=share_link | https://huggingface.co/datasets/nigh8w0lf/Hydra_moe_toolllama_dataset | 3.5/ChatGPT | Tool/API Use | Won't use | night_w0lf | vikp | Formatting is a bit odd on this: it has multiple instructions in a row. It also has a lot of "request invalid" errors in the data, and instructions with "all previous trails failed" | system, instruction, input, output | Use of the phrase "You are AutoGPT" in system messages will need cleanup; I have left it unmodified for now. | Pending
PygmalionAI/PIPPA | pippa_rp_std | 815,507 | roleplay | Under Review | nion | vikp | Look at conv id 16825 and decide for yourself how good this dataset is | Pending
camel-ai/physics | HydraLM/physics_dataset_standardized | 40,000 | physics | Pushed to HF | Both | lskywalker | instruction, output | Pending
https://huggingface.co/datasets/wenhu/TheoremQA | https://huggingface.co/datasets/HydraLM/TheoremQA_standardized | Notes: Removed additional explanations from the dataset, similar to Open-Platypus. (The original TheoremQA dataset has image data as well.) | Under Review | moonlightgarden | Pending
camel-ai/math | HydraLM/math_dataset_standardized | math | Pushed to HF | Both | Pending
https://huggingface.co/datasets/garage-bAInd/Open-Platypus | https://huggingface.co/datasets/HydraLM/Open_Platypus_standardized | logical reasoning | Under Review | moonlightgarden | Pending
knowrohit07/know_logic | https://huggingface.co/datasets/nigh8w0lf/hydra_moe_know_logic | logic, reasoning, code | Pushed to HF | night_w0lf | vikp | instruction, output | Completed | night_w0lf | Completed basic data cleaning; filtered out all model name data ('airoboros')
OpenOrca/blob/main/1M-GPT4-Augmented.parquet | HydraLM/OpenOrca-GPT4-standardized | instruct/orca | Pushed to HF | vikp | Recommend sampling down (high row count relative to other data) | [system, instruction, output] | Pending
https://raw.githubusercontent.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/main/data/unnatural_instruction_gpt4_data.json | https://huggingface.co/datasets/YoungPhlo/GPT4LLM-unnatural_instruction_standardized | 427,000 | instruct, logic, reasoning | Won't use | youngphlo | vikp | Raw data is weird, see conversation ids 8971 and 8972 | instruction, input, output | Seems the correct answer bounces between "output" and "label" | Pending
openchat/openchat_sharegpt4_dataset | https://huggingface.co/datasets/YoungPhlo/openchat-sharegpt_gpt4_standardized | 490,231 | instruct | Under Review | youngphlo | vikp | Instruction quality not good, see https://huggingface.co/datasets/YoungPhlo/openchat-sharegpt_gpt4_standardized/viewer/default/train?p=4 | instruction, output | Multiple languages | Pending
WizardLM/WizardLM_evol_instruct_V2_196k | HydraLM/WizardLM_evol_instruct_V2_196k_standardized | instruct | Pushed to HF | Both | vikp | Recommend sampling down | Pending
teknium/GPT4-LLM-Cleaned | HydraLM/GPT4-LLM-Cleaned_standardized | instruct | Pushed to HF | Both | Pending
totally-not-an-llm/EverythingLM-data-V2 | https://huggingface.co/datasets/HydraLM/EverythingLM-data-V2-standardized | 3,000 | instruct | Under Review | thennal | vikp | system, instruction, output | Pending
andreaskoepf/megacode2-min100 | https://huggingface.co/datasets/HydraLM/megacode2-min100-standardized | 1,026,386 | instruct | Won't use | thennal | vikp | See conversation id 103. It's possibly missing a system prompt that should be in there. Most outputs are too verbose for just the instructions (see conversation 0-10). | instruction, output | Pending
mrm8488/unnatural-instructions | HydraLM/unnatural-instructions_standardized | instruct | Won't use | Both | vikp | Dataset has incorrect examples, like in conversation 66050. Also has a lot of duplication. | Pending
rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored | https://huggingface.co/datasets/HydraLM/LosslessMegaCodeTrainingV2-1m-Evol-Uncensored-standardized | 1,884,414 | code, instruct | Pushed to HF | thennal | vikp | This dataset looks fine, but I would recommend sampling it down (very large). | instruction, output | Pending
https://github.com/teknium1/GPTeacher | HydraLM/GPTeacher_codegen_standardized | code | Pushed to HF | Both | Pending
camel-ai/chemistry | HydraLM/chemistry_dataset_standardized | chemistry | Pushed to HF | Both | Pending
biology | HydraLM/biology_dataset_standardized | biology | Pushed to HF | Both | Pending
GAIR/lima | https://huggingface.co/datasets/HydraLM/lima_standardized | (done) OG "less is more" | Pushed to HF | yam peleg | Pending
neulab/conala | https://huggingface.co/datasets/HydraLM/conala_standardized | (done) instruct, code gen | Pushed to HF | vikp | vikp | Outputs duplicated | Pending
Airoboros 2.2 | https://huggingface.co/datasets/khalidalt/airoboros-2.2-standardized | Pushed to HF | Khalid | vikp | [system, instruction, output]
evol-codealpaca-v1 | https://huggingface.co/datasets/khalidalt/evol-codealpaca-v1-standardized | Pushed to HF | Khalid
https://huggingface.co/datasets/LDJnr/Puffin | https://huggingface.co/datasets/HydraLM/puffin_standardized | Under Review | yam peleg | Pending
OpenAssistant/oasst_top1_2023-08-25 | https://huggingface.co/datasets/HydraLM/oasst_top1_standardized | Pushed to HF | vikp | vikp | Multilingual | Pending
OpenAssistant/oasst1 | In progress | il_vitorio | vikp | Look through the first 20 rows of the oasst data | Pending
TokenBender/unnatural_code_instructions_20M (will get unformatted version from Token Bender) | https://huggingface.co/datasets/ChallengerSpaceShuttle/HydraLM_TokenBender_Datasets | Won't use | challenger | vikp | Dataset has formatting issues, see conversation ids 200, 201, etc. Output is marked as system | instruction, system | Pending
ehartford/wizard_vicuna_70k_unfiltered (for later)