1 of 15

Two new RFPs on the real-world impacts of LLMs

2 of 15

Our new RFPs

  • RFP 1: We’re funding benchmarks for the capabilities of LLM agents.
  • RFP 2: We’re also funding other projects that improve our collective understanding of LLMs’ real-world impacts.

3 of 15

Motivation (pt 1)

  • There’s widespread disagreement about whether the impressive capabilities of LLMs will translate into real-world impact.
  • Some people think scaled-up LLMs/LLM agents will be capable of automating most remote jobs within a few years; others think LLMs are a parlor trick.

4 of 15

Motivation (pt 2)

  • Much regulatory (and self-regulatory) decision-making rests on capability evaluations.
    • But what capabilities/risks should be measured?
    • With what methodologies?
    • The quality and quantity of proxies for real-world impact must be improved.

5 of 15

RFP 1: LLM Agent Benchmarks

  • A suite of tasks.
  • An environment, specifying an interface for an agent.
  • A scoring procedure.
  • E.g.: AgentBench, MLAgentBench, GAIA (a minimal interface sketch follows).
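For concreteness, a minimal Python sketch of those three components (the names Task, Environment, score, and run_benchmark are illustrative assumptions, not taken from AgentBench, MLAgentBench, or GAIA):

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Task:
    """One task in the suite: instructions plus whatever the grader needs."""
    task_id: str
    instructions: str
    reference_answer: str  # consumed by the scoring procedure


class Environment(Protocol):
    """The interface the agent interacts with: observations in, actions out."""
    def reset(self, task: Task) -> str: ...                 # initial observation
    def step(self, action: str) -> tuple[str, bool]: ...    # (observation, done)


def score(task: Task, final_answer: str) -> float:
    """Scoring procedure. Exact match here; real benchmarks are richer."""
    return 1.0 if final_answer.strip() == task.reference_answer.strip() else 0.0


def run_benchmark(agent: Callable[[str], str], env: Environment, tasks: list[Task]) -> float:
    """Run the agent on every task and report the mean score."""
    results = []
    for task in tasks:
        obs, done, last_action = env.reset(task), False, ""
        while not done:
            last_action = agent(obs)           # agent maps observation -> action
            obs, done = env.step(last_action)
        results.append(score(task, last_action))  # treat the last action as the answer
    return sum(results) / len(results)
```

In practice the environment might wrap a browser, shell, or API sandbox, and the scoring procedure is usually much richer than exact match.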

6 of 15

Our priorities: The Three “C”s

  • Construct validity: The benchmark measures the skill it claims to measure.
    • Everyone should agree that a high score implies a real-world capability.
    • E.g.: WebShop measures sim-to-real transfer (a toy validity check is sketched below).
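One hedged way to probe construct validity empirically is to compare how a set of agents scores on the simulated benchmark against how they perform on the real task it stands in for, in the spirit of WebShop’s sim-to-real evaluation. A toy sketch with entirely made-up numbers:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical, made-up numbers: each entry is one model/agent, scored on the
# simulated benchmark and on the real-world task it is meant to stand in for.
sim_scores  = [0.21, 0.35, 0.48, 0.62, 0.71]
real_scores = [0.18, 0.30, 0.45, 0.55, 0.69]

# High correlation (and a small sim-to-real gap) is evidence that a high
# benchmark score really does track the real-world capability.
gap = sum(s - r for s, r in zip(sim_scores, real_scores)) / len(sim_scores)
print(f"sim-to-real correlation: {correlation(sim_scores, real_scores):.2f}")
print(f"mean sim-to-real gap:    {gap:+.2f}")
```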

7 of 15

Our priorities: The Three “C”s

  • Consequential effects: The hardest tasks on the benchmark are a big deal.
    • Hardest tasks should take human experts at least multiple hours to complete (if possible at all).
    • E.g. Kinniment et al.’s tasks are hard (and could be even harder!)

    • Suleyman suggests the “make a million dollars” benchmark.

8 of 15

Our priorities: The Three “C”s

  • Continuous difficulties: Aim to avoid dramatic jumps in benchmark scores.
    • Assigning partial credit is one way to accomplish this (see the sketch after this list).
    • E.g. GAIA sorts its tasks into three difficulty levels.
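A minimal sketch of one way to implement partial credit, assuming each task can be decomposed into independently checkable milestones (the Milestone structure and the example checks are hypothetical, not taken from GAIA):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Milestone:
    """One checkable step toward completing a task."""
    description: str
    check: Callable[[str], bool]  # inspects the agent's transcript
    weight: float = 1.0


def partial_credit(transcript: str, milestones: list[Milestone]) -> float:
    """Fraction of milestone weight achieved, in [0, 1].

    Compared with all-or-nothing grading, this produces smoother score
    curves as models improve, i.e. more continuous difficulty.
    """
    total = sum(m.weight for m in milestones)
    earned = sum(m.weight for m in milestones if m.check(transcript))
    return earned / total if total else 0.0


# Hypothetical usage for a web-research task:
milestones = [
    Milestone("found the relevant paper", lambda t: "arxiv.org" in t),
    Milestone("extracted the key figure", lambda t: "42.7%" in t, weight=2.0),
]
print(partial_credit("...visited arxiv.org/abs/xxxx.xxxxx; the answer is 42.7%", milestones))  # 1.0
```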

9 of 15

  • Easy/cheap to conduct.
    • It’s OK to require a human in the loop/environment.
  • Low-variance behavior/scoring (a variance-estimation sketch follows this list).
    • Real-world action is messy and contingent.
  • Objective grading.
    • Whether LLMs have surpassed call center workers depends on customer preferences.
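However these tradeoffs are resolved, it helps to quantify how noisy an agent’s score actually is. A minimal sketch, assuming a hypothetical run_once callable that executes one full pass over the benchmark:

```python
import random
import statistics
from typing import Callable


def score_with_variance(run_once: Callable[[], float], n_runs: int = 10) -> tuple[float, float]:
    """Repeat a noisy evaluation and report the mean score and its std dev.

    `run_once` stands in for running the agent through the benchmark a single
    time (a hypothetical callable, not a real API).
    """
    scores = [run_once() for _ in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical usage with a noisy dummy evaluation:
mean, std = score_with_variance(lambda: random.gauss(0.6, 0.05))
print(f"score: {mean:.2f} ± {std:.2f} over 10 runs")
```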

10 of 15

Some benchmark ideas

  • Hacking tasks, ranging from SQL injection to discovering new zero-day attacks.
  • Medicine tasks, ranging from existing medical benchmarks to multi-day diagnoses.
  • Mathematics tasks, from MATH to novel, journal-quality results (one way to encode such difficulty ladders is sketched below).
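A hedged sketch of how these domains could be encoded as explicit difficulty ladders, so that a single suite spans easy tasks up to consequential ones; the tier names, hours, and structure are all illustrative assumptions, not a proposed task list:

```python
from dataclasses import dataclass


@dataclass
class TierSpec:
    """One rung of a domain's difficulty ladder."""
    name: str
    expert_hours: float  # rough human-expert time to complete (illustrative)
    description: str


# Illustrative ladders for the three domains above; all numbers are made up.
LADDERS = {
    "security": [
        TierSpec("known-vulnerability", 0.5, "e.g. SQL injection on a test app"),
        TierSpec("novel-discovery", 40.0, "e.g. find a new zero-day"),
    ],
    "medicine": [
        TierSpec("benchmark-qa", 0.1, "existing medical QA benchmarks"),
        TierSpec("longitudinal-diagnosis", 20.0, "multi-day diagnostic workups"),
    ],
    "mathematics": [
        TierSpec("competition", 1.0, "MATH-style problems"),
        TierSpec("research", 100.0, "novel, journal-quality results"),
    ],
}

for domain, tiers in LADDERS.items():
    hardest = max(tiers, key=lambda t: t.expert_hours)
    print(f"{domain}: hardest tier ~{hardest.expert_hours}h of expert time")
```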

11 of 15

Why prioritize LLM agent benchmarks?

  • It’s important to measure LLMs’ skill at pursuing goals creatively and autonomously, not just acting as tools for humans.
    • We expect autonomous agents to be influential in the future.
    • This sort of capability underlies most stories about AIs causing catastrophes.
    • It’s especially neglected among existing evaluations.

12 of 15

RFP 2: Everything else

13 of 15

Out of scope for both RFPs

  • Projects to construct more capable LLM agents or tools.
  • Misalignment/jailbreak-vulnerability tests and demonstrations.
  • (Most) toy settings.
    • But there’s a fine line between a simulation and a game; you can apply and make your pitch for the three “C”s.

14 of 15

What about the capabilities externalities?

  • In some cases, risks can be mitigated by not releasing the benchmarks.
  • My personal view is that, for most benchmarks, the benefits from 1) consensus on capabilities and 2) better governance outweigh the risk of accelerating capabilities.
  • See Ajeya Cotra’s survey of 30 AI safety researchers for more discussion.

15 of 15

  • The applications for both RFPs are available here.
  • Grant sizes: $0.3M–$3M (!)
  • No deadline.
  • Model access (“non-standard OpenAI APIs”).
  • Tell your friends!