Eval Instructions

Directions: You will evaluate the responses that two different models produced for the same prompt, using the conversation history as context.

Attempter Instructions

Follow the below steps while working on a task:

Step 1: Read the conversation history and the latest prompt carefully. Check whether the task should be rejected, paying careful attention to Sensitive Content.

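Step 2: If the task is NOT rejectable, rate each response on a scale of 1 to 5 for each rating dimension. The Reviewer Instructions below name these as Accuracy, Instruction Following, and Overall Quality, plus the additional dimensions introduced as of March 14, 2024.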

Step 3: Determine the preference rank between the two responses. Base this ranking on the overall response scores and on how useful each response is, given the prompt and conversation history as context.

Step 4: Write a justification explaining which response is better overall.

Reviewer Instructions

Auditors review previously completed tasks and mark each one as high-quality (accept) or low-quality (send back to queue, SBQ). If a task is high-quality except for very slight errors (e.g. typos), those errors can be fixed, but substantive fixes are not expected. The goal is simply to identify which tasks are high-quality and customer-ready, not to fix the subpar ones.

Steps:

  1. Access tasks via queue as usual
  2. If the task was not rejected:
     1. Read conversation history, latest prompt, and two AI-generated responses
     2. Read the attempter’s response ratings, preference ranking, and justification
        1. Each response is rated for Accuracy, Instruction Following, and Overall Quality
           1. As of March 14, 2024, additional rating dimensions are being introduced: Relevance, Depth, Grammar/Presentation, and Verbosity
     3. If everything looks good → approve the task (green button)
     4. If there are significant errors or problems with the task → reject it (red button)
     5. If the task is high-quality other than slight errors like typos → make any quick fixes and approve with changes (blue button)
     6. Don’t spend much time fixing tasks – if it can’t be fixed quickly and easily, SBQ it
  3. If the task was rejected:
     1. Review the attempter’s reason for rejection (e.g. Tier 1 sensitive content)
     2. If the attempter’s verdict was correct → approve the task (green button)
     3. If the attempter’s verdict was not correct → SBQ it (red button)
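
The reviewer steps above amount to a small decision procedure. The sketch below is only an illustration of that flow, not part of any tooling: the names `Action` and `review_action` and all parameter names are hypothetical, and it assumes that "reject" and "SBQ" refer to the same red-button action.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    APPROVE = "approve (green button)"
    APPROVE_WITH_CHANGES = "approve with changes (blue button)"
    SBQ = "send back to queue (red button)"


def review_action(
    attempter_rejected: bool,
    rejection_correct: Optional[bool] = None,
    significant_errors: bool = False,
    slight_errors: bool = False,
    quickly_fixable: bool = True,
) -> Action:
    """Map a reviewer's findings to one of the three buttons (hypothetical helper)."""
    if attempter_rejected:
        # Rejected tasks: only the attempter's verdict is checked.
        if rejection_correct is None:
            raise ValueError("rejection_correct is required for rejected tasks")
        return Action.APPROVE if rejection_correct else Action.SBQ

    # Non-rejected tasks: ratings, ranking, and justification were reviewed.
    if significant_errors:
        return Action.SBQ
    if slight_errors:
        # Fix only what is quick and easy; otherwise send the task back.
        return Action.APPROVE_WITH_CHANGES if quickly_fixable else Action.SBQ
    return Action.APPROVE
```

For example, `review_action(attempter_rejected=False, slight_errors=True)` returns `Action.APPROVE_WITH_CHANGES`, while the same call with `quickly_fixable=False` returns `Action.SBQ`.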


Remember: