1 of 17

OpenAssistant

Data Collection MVP

State Jan 4 (feedback welcome)

github.com/LAION-AI/Open-Assistant

2 of 17

Goal: Crowd-Sourced Training Data Collection

3 of 17

What data are we looking for?

We are looking for task-fulfillment interactions. One participant, the prompter, has a task in mind. The other participant, the assistant, aims to fulfill that task. The two take turns in the conversation.

Note: This is not the same as a regular conversation. The goal is always to fulfill a given task. The prompter may follow up in subsequent messages to clarify their instructions, expand on the original task, or even change their mind.

4 of 17

Features

  • A scalable and open system that allows anyone to participate in providing human demonstrations
    • Propose prompts (tasks for the assistant)
    • Act as the assistant
    • Act as the prompter and follow up (clarify, change your mind, etc.)
  • On top of demonstrations, we also need rankings and labels
    • Ranking: Given N assistant answers, which one is best?
    • Text Labels: Does a text contain violence, profanity, sarcasm, …? Is the prompt useful? Is the answer helpful?
  • You are not collecting alone
    • Other humans work directly on the data you provide
    • A global newsfeed
    • Public leaderboards
    • Prizes for top contributors

5 of 17

Deliverables

  • Datasets to train
    • Supervised Fine-Tuning
    • Reward Models
    • Text Classifiers

Refer to the roadmap to see where this fits into the project

6 of 17

We collect "conversation trees"

See our data structures for more information

[Figure: a conversation tree. The root is the Prompt (L0). Below it come Assistant Responses (L1), then Prompter Responses (L2), then Assistant Responses (L3). A message can receive several alternative replies, so the tree branches at every level. The figure also includes a "Metadata:" label.]
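To make the tree idea concrete, here is a minimal sketch of how such a structure could look in Python. It is not the project's actual data model (see the data structures document for that). The names `Message`, `add_reply`, and `threads` are hypothetical, and the second assistant reply is made up for illustration. The sketch only assumes what the slide shows: roles alternate by level and a message can have several replies.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Message:
    """One node of a conversation tree (hypothetical structure)."""
    role: str  # "prompter" or "assistant"
    text: str
    replies: List["Message"] = field(default_factory=list)

    def add_reply(self, text: str) -> "Message":
        # Roles alternate by level: prompter -> assistant -> prompter ...
        child_role = "assistant" if self.role == "prompter" else "prompter"
        child = Message(child_role, text)
        self.replies.append(child)
        return child


def threads(node: Message, prefix: Tuple = ()) -> Iterator[List[Tuple[str, str]]]:
    """Yield every root-to-leaf path as a list of (role, text) pairs."""
    path = prefix + ((node.role, node.text),)
    if not node.replies:
        yield list(path)
    for child in node.replies:
        yield from threads(child, path)


# Example tree: one prompt (L0), two assistant replies (L1), one follow-up (L2).
root = Message("prompter", "Teach me Chess")
first = root.add_reply("Sure, Chess is an ancient game…")
second = root.add_reply("Chess is played on an 8x8 board.")  # illustrative text
first.add_reply("What's the best move?")

all_threads = list(threads(root))
for thread in all_threads:
    print(" -> ".join(role for role, _ in thread))
```

Each root-to-leaf path is one linear conversation, which is one plausible way to turn a tree into demonstration data. Sibling replies under the same parent are the natural candidates for ranking.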

7 of 17

Main Task 1: Reply to conversation

Teach me Chess

Sure, Chess is an ancient game…

How do I cheat?

I am a large…*

Suggest the next entry of a conversation

* this is a joke

8 of 17

Main Task 2: Label a text

Teach me Chess

Sure, Chess is an ancient game…

What's the best move?

Does this text contain:

  • Profanity?
  • Sarcasm?
  • Violence?

Is this text:

  • Helpful?
  • Too short?
  • Too vague?
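As a rough sketch of what could happen with these answers: several people may label the same text, and their yes/no answers can be combined per label. The label names below are taken from this slide. The function `aggregate_labels`, the dictionary format, and the example votes are assumptions made for illustration, not the backend's real format.

```python
from collections import defaultdict

# Label names from the slide; the snake_case spelling is an assumption.
LABELS = ("profanity", "sarcasm", "violence", "helpful", "too_short", "too_vague")


def aggregate_labels(submissions):
    """Return the fraction of 'yes' answers per label.

    `submissions` is a list of dicts mapping label name -> bool.
    A labeler may skip labels, so each label has its own vote count.
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for submission in submissions:
        for name, value in submission.items():
            if name not in LABELS:
                raise ValueError(f"unknown label: {name}")
            total[name] += 1
            yes[name] += bool(value)
    return {name: yes[name] / total[name] for name in total}


# Three hypothetical labelers looking at the same assistant reply.
votes = [
    {"helpful": True, "too_short": False},
    {"helpful": True, "too_short": True},
    {"helpful": False},
]
summary = aggregate_labels(votes)
print(summary)
```

Keeping per-label fractions rather than a single majority vote leaves the thresholding decision to whoever trains the text classifiers later.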

9 of 17

Main Task 3: Rank replies

Teach me Chess

Sure, Chess is an ancient game…

What's the best move?

What is the game state?

Order all replies by quality

e4

It depends, …
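One common way to use a ranking for reward model training is to expand it into pairwise comparisons (preferred reply vs. rejected reply). The sketch below shows that expansion only. Whether the project uses rankings this way is an assumption, `ranking_to_pairs` is a hypothetical helper, and the order of the three example replies is invented for illustration.

```python
from itertools import combinations


def ranking_to_pairs(ranked_replies):
    """Expand a ranking (best reply first) into (preferred, rejected) pairs.

    combinations() keeps the input order, so the first element of each
    pair is always ranked higher than the second.
    """
    return list(combinations(ranked_replies, 2))


# Hypothetical ranking of the replies on this slide, best first.
ranking = ["What is the game state?", "It depends, …", "e4"]
pairs = ranking_to_pairs(ranking)
for preferred, rejected in pairs:
    print(f"{preferred!r} > {rejected!r}")
```

A ranking of N replies yields N·(N−1)/2 pairs, so a single ranking task with three replies already gives three comparisons.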

10 of 17

Architecture

Central Backend
  • collects data
  • hands out new tasks
  • keeps track of leaderboards
  • distributes news

Discord-Bot(s)
  • lets users work on tasks via DMs
  • posts news / updates / new tasks to public channels

Website
  • lets users work on tasks (same as Discord bot)
  • displays leaderboards
  • has admin interface
  • provides dataset explorer
  • looks good on mobile

Users (work on tasks through the Discord bot(s) and the website)
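The split of responsibilities above can be illustrated with a tiny in-memory stand-in for the central backend. This is a sketch under assumptions, not the real service: the class `Backend`, its methods, the payload shapes, and the "one point per completed task" scoring are all hypothetical. The point is only that both frontends (Discord bot and website) could talk to the same small interface: request a task, submit a result, read the leaderboard.

```python
import itertools
from collections import Counter, deque
from dataclasses import dataclass


@dataclass
class Task:
    id: int
    kind: str      # "reply", "label", or "rank"
    payload: dict  # whatever the frontend needs to show the task


class Backend:
    """In-memory stand-in for the central backend (hypothetical API)."""

    def __init__(self, work_items):
        self._pending = deque(work_items)  # (kind, payload) not yet handed out
        self._open = {}                    # task id -> (user, Task) awaiting a result
        self._ids = itertools.count(1)
        self.collected = []                # (user, kind, result) tuples
        self.scores = Counter()            # user -> completed tasks

    def request_task(self, user):
        """Hand out the next task, or None if nothing is left."""
        if not self._pending:
            return None
        kind, payload = self._pending.popleft()
        task = Task(next(self._ids), kind, payload)
        self._open[task.id] = (user, task)
        return task

    def submit(self, user, task_id, result):
        """Store the result of a task that was handed out to this user."""
        owner, task = self._open[task_id]  # KeyError if unknown or already done
        if owner != user:
            raise PermissionError("task was handed out to another user")
        del self._open[task_id]
        self.collected.append((user, task.kind, result))
        self.scores[user] += 1

    def leaderboard(self, top=10):
        return self.scores.most_common(top)


# A Discord bot or the website would make the same three kinds of calls.
backend = Backend([
    ("reply", {"conversation": ["Teach me Chess"]}),
    ("label", {"text": "Sure, Chess is an ancient game…"}),
    ("rank", {"replies": ["e4", "It depends, …"]}),
])
t1 = backend.request_task("alice")
backend.submit("alice", t1.id, "Sure, Chess is an ancient game…")
t2 = backend.request_task("bob")
backend.submit("bob", t2.id, {"helpful": True})
t3 = backend.request_task("alice")
backend.submit("alice", t3.id, [1, 0])  # indices of replies, best first
print(backend.leaderboard())
```

Because the frontends hold no state of their own in this sketch, adding another client would not require changes to the backend.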

11 of 17

Backend

  • What's here
    • Full task lifecycle
    • Random sampling of tasks
    • Text labels collection
  • What's missing
    • Smart sampling of tasks
    • A separate task type for text labels collection
    • User reward assignment computation
    • Full endpoints for leaderboard & news
    • Admin Endpoints
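For illustration, random sampling of tasks can be as small as the sketch below. The function `sample_task` and the optional `weights` argument are hypothetical. The weights are only a placeholder hook for where a smarter policy could plug in later; this slide does not say what "smart sampling" will look like, so nothing here should be read as that design.

```python
import random


def sample_task(candidates, rng, weights=None):
    """Pick one candidate task.

    With weights=None every candidate is equally likely (plain random
    sampling). Passing weights biases the choice toward some candidates.
    """
    if not candidates:
        raise ValueError("no candidate tasks available")
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so runs are repeatable
candidates = ["reply", "label", "rank"]

uniform_pick = sample_task(candidates, rng)
# Hypothetical policy: only ranking tasks are needed right now.
forced_pick = sample_task(candidates, rng, weights=[0, 0, 1])
print(uniform_pick, forced_pick)
```

Passing the random generator in explicitly keeps the sampler easy to test, since a seeded `random.Random` gives repeatable picks.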

12 of 17

Discord-Bot

  • What's here
    • Full task lifecycle
    • Work in DMs, updates in public channel
    • Text labels collection via reactions (or modals?)
  • What's missing
    • Workflow feels a bit cluttered
    • Display of leaderboards & news

13 of 17

Website

  • What's here
    • Full task lifecycle
    • Text labels collection
    • Login via email or Discord
    • Explore data
    • Project information (about, join us, privacy policy, etc.)
  • What's missing
    • Admin interface
    • Optimized UX for the tasks

14 of 17

Documentation

  • What's here
    • Prompting guide (partial)
  • What's missing
    • Better prompting guide
    • Examples & more detailed instructions

15 of 17

Other things

This deck covers only the data collection MVP, where we crowd-source data. In parallel, we are doing lots of work on ML training, collecting & creating instruction datasets, data safety, and much more!

Have a look at our roadmap

16 of 17

How can I help?

  • Grab an issue on GitHub (or create one). Don't be shy :)
  • Write documentation or tests
    • This also helps you learn about the codebase
  • Spin up the dev setup and try things
    • Report UI bugs and suggest workflow improvements (or submit a PR)

17 of 17

ETA of crowd-sourcing: January 15

(amazing work is being done, full-steam :) )