OpenAssistant
Data Collection MVP
State: Jan 4 (feedback welcome)
github.com/LAION-AI/Open-Assistant
Goal: Crowd-Sourced Training Data Collection
What data are we looking for?
We are looking for task-fulfillment interactions. This means there is one participant, called the prompter, who has a task in mind. The other participant is the assistant. The assistant's goal is to fulfill this task. The two take turns in the conversation.
Note: This is not the same as a regular conversation. The goal is always to fulfill a given task. The prompter may, however, follow up in subsequent messages to clarify their instructions, expand on the original task, or even change their mind.
Features
Deliverables
Refer to the roadmap to see where this fits into the project
We collect "conversation trees"
See our data structures for more information
[Diagram: a conversation tree with one Prompt (L0) at the root, branching into two Assistant Responses (L1), four Prompter Responses (L2), and five Assistant Responses (L3).]
Metadata:
…
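As a rough illustration of the tree shape described above, the following Python sketch models a conversation tree in which roles alternate by depth. The class and method names (`Message`, `reply`, `thread`) are hypothetical and are not taken from the project's actual data structures. The metadata field is left as a free-form dictionary because the slide does not specify its contents.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison avoids recursing through parent/children
class Message:
    """One node of a conversation tree (hypothetical structure)."""

    text: str
    parent: Optional[Message] = field(default=None, repr=False)
    children: list[Message] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)  # contents not specified in the source

    @property
    def depth(self) -> int:
        # The initial prompt sits at depth 0 (L0).
        return 0 if self.parent is None else self.parent.depth + 1

    @property
    def role(self) -> str:
        # Assumption: even depths are prompter turns, odd depths are assistant turns.
        return "prompter" if self.depth % 2 == 0 else "assistant"

    def reply(self, text: str) -> Message:
        """Attach a new child message and return it."""
        child = Message(text=text, parent=self)
        self.children.append(child)
        return child

    def thread(self) -> list[Message]:
        """Return the path from the initial prompt down to this message."""
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return path[::-1]


# Example using the sample texts from the task slides.
root = Message("Teach me Chess")
answer = root.reply("Sure, Chess is an ancient game…")
follow_up_a = answer.reply("What's the best move?")
follow_up_b = answer.reply("How do I cheat?")
```

Each path from the root to a leaf (`thread()`) is one linear conversation, while sibling nodes are alternative continuations of the same history.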
Main Task 1: Reply to conversation
Teach me Chess
Sure, Chess is an ancient game…
How do I cheat?
I am a large…*
Suggest the next entry of a conversation
* this is a joke
Main Task 2: Label a text
Teach me Chess
Sure, Chess is an ancient game…
What's the best move?
Does this text contain:
Is this text:
Main Task 3: Rank replies
Teach me Chess
Sure, Chess is an ancient game…
What's the best move?
What is the game state?
Order all replies by quality
e4
It depends, …
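One way a submitted ranking could be represented is as a best-to-worst permutation of reply indices. The sketch below validates such a ranking and expands it into (better, worse) pairs. The function name `ranking_to_pairs` and the pairwise expansion are illustrative assumptions; the slides only state that replies are ordered by quality.

```python
from __future__ import annotations

from itertools import combinations


def ranking_to_pairs(replies: list[str], ranking: list[int]) -> list[tuple[str, str]]:
    """Expand a best-to-worst ranking of reply indices into (better, worse) pairs."""
    if sorted(ranking) != list(range(len(replies))):
        raise ValueError("ranking must list every reply index exactly once")
    ordered = [replies[i] for i in ranking]
    # combinations() keeps input order, so the first element of each pair ranks higher.
    return list(combinations(ordered, 2))


replies = ["e4", "It depends, …"]
# Arbitrary example ranking: the user placed reply 1 above reply 0.
pairs = ranking_to_pairs(replies, ranking=[1, 0])
```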
Architecture
Central Backend
- collects data
- hands out new tasks
- keeps track of leaderboards
- distributes news
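The backend responsibilities listed above can be sketched as a small in-memory class. This is only an assumption-laden illustration: the names `Task`, `Backend`, `add_task`, `next_task`, `submit`, and `leaderboard` are hypothetical, scoring one point per submission is an assumption, and news distribution is left out.

```python
from __future__ import annotations

from collections import Counter, deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: int
    kind: str  # one of the three main tasks: "reply", "label", "rank"
    payload: str


class Backend:
    """In-memory stand-in for the central backend (hypothetical API)."""

    def __init__(self) -> None:
        self._pending = deque()   # tasks waiting to be handed out
        self._submissions = []    # collected data: (task_id, user, result)
        self._scores = Counter()  # per-user counts for the leaderboard
        self._next_id = 1

    def add_task(self, kind: str, payload: str) -> Task:
        task = Task(self._next_id, kind, payload)
        self._next_id += 1
        self._pending.append(task)
        return task

    def next_task(self) -> Optional[Task]:
        """Hand out the oldest pending task, or None if there is none."""
        return self._pending.popleft() if self._pending else None

    def submit(self, user: str, task: Task, result: str) -> None:
        """Store a user's result and credit the user (assumed: one point each)."""
        self._submissions.append((task.task_id, user, result))
        self._scores[user] += 1

    def leaderboard(self, top: int = 10) -> list[tuple[str, int]]:
        return self._scores.most_common(top)


backend = Backend()
backend.add_task("reply", "Teach me Chess")
first = backend.next_task()
backend.submit("alice", first, "Sure, Chess is an ancient game…")
```

In this sketch, both the Discord bot and the website would be thin clients that call `next_task` and `submit` on the same backend.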
Discord-Bot(s)
- lets users work on tasks via DMs
- posts news / updates / new tasks to public channels
Website
- lets users work on tasks (same as the Discord-Bot)
- displays leaderboards
- has admin interface
- provides dataset explorer
- looks good on mobile
[Diagram: users, Backend, Discord-Bot, Website]
Documentation
Other things
This deck covers only the data collection MVP, where we crowd-source data. In parallel, we are doing lots of work on ML training, collecting and creating instruction datasets, data safety, and much more!
Have a look at our roadmap
How can I help?
ETA of crowd-sourcing: January 15
(amazing work is being done, full-steam :) )