| Dataset | Status | GH Issue | Location of Processed Dataset | Code Location | Owner | License | Notes |
| summary <-> fulltext | Done | | https://huggingface.co/datasets/MyloBishop/reverse_summarization_dataset | | Jordine/Andreas/Christoph/Mylo | | |
| - /r/ChangeMyView | Need to run and convert | https://github.com/LAION-AI/Open-Assistant/issues/727 | https://huggingface.co/datasets/kjl3080/OA_CMV_Arguments | https://github.com/LAION-AI/Open-Assistant/blob/main/notebooks/data-augmentation/changemyview-builder/data_processor.ipynb | KayJay | Apache | Only 100 entries are on Hugging Face. Someone with a more powerful CUDA GPU should augment more data using the builder in the repo. |
| soda-dialog | Need to convert and upload | | https://drive.google.com/file/d/1TOGQfr419n8wpzJpYLLw4nB3tSKD8zXV/view?usp=sharing | | Huu | | |
| joke explanation | Need to convert and upload | | https://drive.google.com/file/d/1-laFweqp676HxEzse1k63owo5h3BcbYi/view?usp=sharing | | Huu/Theblackcat | | |
| kilt background contextual tasks | In progress | | | | Lintang | | |
| OpenBugger | In progress | https://github.com/LAION-AI/Open-Assistant/issues/317 | | https://github.com/furlat/OpenBugger | Iriden | | 04.02: Big updates in the next few days |
| - code explainer | Need a new assignee | | | | Graverman | | Code is done; we just need compute to run it. |
| math | In progress | | | | hecko | | |
| - sympy | In progress | https://github.com/LAION-AI/Open-Assistant/issues/776 | | | hecko/King | | |
| debate conversation dialog dataset | In progress | | | | | | |
| - Hippocorpus | Need to run, convert and upload | https://github.com/LAION-AI/Open-Assistant/issues/728 | | https://github.com/LAION-AI/Open-Assistant/blob/main/notebooks/data-augmentation/hippocorpus/hippocorpus.ipynb | MightyAlex/Taylor | O-UDA v1.0 | |
| medical (note generation) | Need to run, convert and upload | https://github.com/LAION-AI/Open-Assistant/issues/721 | | https://github.com/LAION-AI/Open-Assistant/tree/main/openassistant/datasets/mt_note_generation | YP | | |
| ChatGPT prompts with our model responses | In progress | | | | Huu & ?? | | Code is done; we just need compute to run it. |
| Story Generation From Summary | In progress | | | | Huu/Finitearth/Christoph | | Code is done; we just need compute to run it. |
| Essay writer | In progress | | | | Graverman & Finitearth | | Code is done; we just need compute to run it. |
| Wikihow Q&A | In progress | | Local machine | | kenhktsui & b_mc2 | | Collection done; cleaning in progress, then augmentation. Currently a local file only. |
| Wikihow | In progress | | | | kenhktsui & b_mc2 | | Notebook started, working on augmentation process. https://colab.research.google.com/drive/1oOI1XGqCtKTScrNjUhNsZz3CIkGsDBS7?usp=drive_open |
| Reddit | In progress | | | | Proteusiq/SriPrarabdha | | |
| Twitter | In progress | https://github.com/LAION-AI/Open-Assistant/issues/126 | | | Jmete | | archive.org data collection scripts merged to main. Filtering is required to get useful entries; waiting for 143 model for the filtering. Working on a separate method to scrape Twitter threads for increased quality. Notebook done, and I have already scraped 46,650 threads. I need help running these through another model to create Q&A pairs. |
| stackexchange | In progress | https://github.com/LAION-AI/Open-Assistant/issues/191 | | https://github.com/LAION-AI/Open-Assistant/blob/main/notebooks/data-augmentation/stackexchange-builder/stackexchange-builder.ipynb | bmc2/flying_madman | | Initial notebook merged; it can ingest different stackexchange data dumps and convert them to OA format. https://github.com/LAION-AI/Open-Assistant/pull/355 |
| UnifiedQA->Instructions | Need to run, convert and upload | https://github.com/LAION-AI/Open-Assistant/issues/712 | | https://github.com/LAION-AI/Open-Assistant/blob/main/notebooks/data-augmentation/unified-qa/unified-qa.ipynb | vale95ntino | | Notebook added to the GH repository; the data can be downloaded from there. |
| Writing Prompts | In progress | | https://huggingface.co/datasets/fabraz/writingPromptAug | | fabraz | | |
| Github-Unit test | In progress | | | | bitplane/theol-git | | |
| Github Issues | Need a new assignee | https://github.com/LAION-AI/Open-Assistant/issues/319 | https://drive.google.com/file/d/1NOAgf2bhwQYnnjjrpiWWV_5Adl9xwzUJ/view | | doroshroman | | |
| Youtube Transcript -> Instruction Dialog | In progress | | | | Shtoner/totuta/marianna13 | | |
| Microsoft CodeT Datasets (HumanEval, APPS, CodeContests, MBPP) | In review | | https://huggingface.co/datasets/OllieStanley/humaneval-mbpp-codegen-qa, https://huggingface.co/datasets/OllieStanley/humaneval-mbpp-testgen-qa | https://github.com/LAION-AI/Open-Assistant/pull/1895 | Ollie | MIT | https://github.com/microsoft/CodeT/tree/main/CodeT |
| Microsoft DIVERSE Datasets | Need a new assignee | https://github.com/LAION-AI/Open-Assistant/issues/746 | | https://github.com/LAION-AI/Open-Assistant/blob/main/notebooks/diverse/diverse.ipynb | Ollie/Thierryderuyttere | MIT | https://github.com/microsoft/CodeT/tree/main/DIVERSE |
| 2D table reasoning | In progress | | | | hecko | | |
| Wikidata knowledge graph | Need to run, convert and upload | https://github.com/LAION-AI/Open-Assistant/pull/1075 | | https://github.com/LAION-AI/Open-Assistant/pull/1075 | sedthh | | |
| Project Gutenberg Books | Done | https://github.com/LAION-AI/Open-Assistant/issues/1110 | https://huggingface.co/datasets/sedthh/gutenberg_english and https://huggingface.co/datasets/sedthh/gutenberg_multilang | | sedthh | | |
| Wikilingua | In progress | https://github.com/LAION-AI/Open-Assistant/issues/892 | | | YP | | |
| - python add libraries | Need to convert and upload | | https://www.kaggle.com/code/christopherdking/notebook0548e4d207/data | | King (based on "checked python" project) | Apache | |
| python code dataset - need complete list | In progress | | | | | | |
| p3/xp3 dataset | Done | | | | Lintang/Markcheeky | | |
| rallio's QA data | Need to convert and upload | | https://github.com/Rallio67/language-model-agents | | Rallio | | |
| Anthropic dataset helpful/harmless (not redteam) | Done | | | | | | |
| - checked python | Done | | | | Huu/Rallio/King | | |
| Character description instruction from video game wikipedia | Need to convert and upload | | Local machine | | Rallio | | |
| Khan Academy | In progress | https://github.com/LAION-AI/Open-Assistant/issues/184 | | | | | |
| WikiHow Lists | Need to convert | | https://huggingface.co/datasets/b-mc2/wikihow_lists | | b_mc2 | CC BY-NC-SA 3.0 | Wikihow titles to ordered list summaries, unordered list ingredients, items needed for X |
| Story generation | Need to convert | | https://huggingface.co/datasets/qwedsacf/story-generation | | Vechtomov#1142 | | |
| Competition Math | Need to convert | | https://huggingface.co/datasets/qwedsacf/competition_math | | Vechtomov#1142 | | |
| Yahoo QA | Need to convert and upload | | https://huggingface.co/datasets/yahoo_answers_qa | https://github.com/LAION-AI/Open-Assistant/pull/1984 | Shadowner | | |
| Essay writing | Need to convert | | https://huggingface.co/datasets/ChristophSchuhmann/essays-with-instructions | | Christoph | | |
| Q&A | Need to convert | | https://huggingface.co/datasets/marianna13/random_dataset | | marianna13 | | |
| Recipes | Done | https://github.com/LAION-AI/Open-Assistant/issues/1031 | https://huggingface.co/datasets/dctanner/oa_recipes | https://github.com/LAION-AI/Open-Assistant/pull/1836 | dctanner | | |
| Radiology QA | In progress | https://github.com/LAION-AI/Open-Assistant/issues/1044 | | | Luab | | |
| Poetry instructions | Need a review | https://github.com/LAION-AI/Open-Assistant/issues/731 | https://huggingface.co/datasets/isaacrehg/poetry-instructions, https://huggingface.co/datasets/isaacrehg/poetry-summary, https://huggingface.co/datasets/isaacrehg/poetry-detailed-analysis | https://github.com/LAION-AI/Open-Assistant/pull/1009 | IsaacRe | | |
| Cornell Movies Dialogs | Need to convert | https://github.com/LAION-AI/Open-Assistant/issues/1163 | https://huggingface.co/datasets/shahules786/OA-cornell-movies-dialog | https://github.com/LAION-AI/Open-Assistant/pull/1319 | iamikka (gh: shahules786) | | |
| xp3 Code instructions | In progress | https://github.com/LAION-AI/Open-Assistant/issues/1266 | | | epicx (gh: bennman) | | |
| Free ChatGPT prompts generated answers | In progress | https://github.com/LAION-AI/Open-Assistant/issues/1379 | | | iamikka (gh: shahules786) | | Are we sure we can use these? |
| Cocktail recipes | Need to convert | https://github.com/LAION-AI/Open-Assistant/issues/1286 | https://huggingface.co/datasets/brianarbuckle/cocktail_recipes | | BrianArbuckle | | |
| Korean datasets | In progress | https://github.com/LAION-AI/Open-Assistant/issues/1157 | | | jeonsworld | | |
| Hungarian literature datasets (?) | In progress | https://github.com/LAION-AI/Open-Assistant/issues/1872 | | | sedthh | | |
| Factoid Q/A pairs with difficulty ratings from Wikipedia articles | Done | https://github.com/LAION-AI/Open-Assistant/issues/1873 | | | sedthh | GFDL and CC BY-SA 3.0 | http://www.cs.cmu.edu/~ark/QA-data/ |
| Multilingual subtitle datasets (?) | PR | https://github.com/LAION-AI/Open-Assistant/issues/1875 | | | sedthh | Need to cite "P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)" | https://opus.nlpl.eu/OpenSubtitles-v2018.php |
| ForeverDreaming transcripts | In progress | | | | sedthh | | |
| Ubuntu dialogue Corpus | Done | https://github.com/LAION-AI/Open-Assistant/issues/1874 | | | sedthh | | https://www.kaggle.com/datasets/rtatman/ubuntu-dialogue-corpus |
| Zhihu QA | Done | https://github.com/LAION-AI/Open-Assistant/issues/1459 | | | wangrui6/MLMonkATGY | | |
| Biomedical Dialog | Need to convert | https://github.com/LAION-AI/Open-Assistant/issues/1029 | https://huggingface.co/datasets/ericyu3/openassistant_inpainted_dialogs_5k_biomedical | https://github.com/LAION-AI/Open-Assistant/pull/1092 | uyhcire | | |