1 of 15

Two new RFPs on the real-world impacts of LLMs

2 of 15

Our new RFPs

  • RFP 1: We’re funding benchmarks for the capabilities of LLM agents.
  • RFP 2: We’re also funding other projects that improve our collective understanding of LLMs’ real-world impacts.

3 of 15

Motivation (pt 1)

  • There’s widespread disagreement about whether the impressive capabilities of LLMs will translate into real-world impact.
  • Some people think scaled-up LLMs/LLM agents will be capable of automating most remote jobs within a few years; others think LLMs are a parlor trick.

4 of 15

Motivation (pt 2)

  • Much regulatory (and self-regulatory) decision-making rests on capability evaluations.
    • But what capabilities/risks should be measured?
    • With what methodologies?
    • The quality and quantity of proxies for real-world impact must be improved.

5 of 15

RFP 1: LLM Agent Benchmarks

  • A suite of tasks.
  • An environment, specifying an interface for an agent.
  • A scoring procedure.
  • E.g.: AgentBench, MLAgentBench, GAIA (a minimal interface sketch follows).
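For concreteness, a minimal Python sketch of those three components (the names Task, Environment, score, and run_benchmark are illustrative assumptions, not taken from AgentBench, MLAgentBench, or GAIA):

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Task:
    """One task in the suite: instructions plus whatever the grader needs."""
    task_id: str
    instructions: str
    reference_answer: str  # consumed by the scoring procedure


class Environment(Protocol):
    """The interface the agent interacts with: observations in, actions out."""
    def reset(self, task: Task) -> str: ...                 # initial observation
    def step(self, action: str) -> tuple[str, bool]: ...    # (observation, done)


def score(task: Task, final_answer: str) -> float:
    """Scoring procedure. Exact match here; real benchmarks are richer."""
    return 1.0 if final_answer.strip() == task.reference_answer.strip() else 0.0


def run_benchmark(agent: Callable[[str], str], env: Environment, tasks: list[Task]) -> float:
    """Run the agent on every task and report the mean score."""
    results = []
    for task in tasks:
        obs, done, last_action = env.reset(task), False, ""
        while not done:
            last_action = agent(obs)           # agent maps observation -> action
            obs, done = env.step(last_action)
        results.append(score(task, last_action))  # treat the last action as the answer
    return sum(results) / len(results)
```

In practice the environment might wrap a browser, shell, or API sandbox, and the scoring procedure is usually much richer than exact match.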

6 of 15

Our priorities: The Three “C”s

  • Construct validity: The benchmark measures the skill it claims to measure.
    • Everyone should agree that a high score implies a real-world capability.
    • E.g.: WebShop measures sim-to-real transfer (a toy validity check is sketched below).
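One hedged way to probe construct validity empirically is to compare how a set of agents scores on the simulated benchmark against how they perform on the real task it stands in for, in the spirit of WebShop’s sim-to-real evaluation. A toy sketch with entirely made-up numbers:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical, made-up numbers: each entry is one model/agent, scored on the
# simulated benchmark and on the real-world task it is meant to stand in for.
sim_scores  = [0.21, 0.35, 0.48, 0.62, 0.71]
real_scores = [0.18, 0.30, 0.45, 0.55, 0.69]

# High correlation (and a small sim-to-real gap) is evidence that a high
# benchmark score really does track the real-world capability.
gap = sum(s - r for s, r in zip(sim_scores, real_scores)) / len(sim_scores)
print(f"sim-to-real correlation: {correlation(sim_scores, real_scores):.2f}")
print(f"mean sim-to-real gap:    {gap:+.2f}")
```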

7 of 15

Our priorities: The Three “C”s

  • Consequential effects: The hardest tasks on the benchmark are a big deal.
    • Hardest tasks should take human experts at least multiple hours to complete (if possible at all).
    • E.g. Kinniment et al.’s tasks are hard (and could be even harder!)

    • Suleyman suggests the “make a million dollars” benchmark.

8 of 15

Our priorities: The Three “C”s

  • Continuous difficulties: Aim to avoid dramatic jumps in benchmark scores.
    • Assigning partial credit is one way to accomplish this (see the sketch after this list).
    • E.g. GAIA sorts its tasks into three difficulty levels.
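A minimal sketch of one way to implement partial credit, assuming each task can be decomposed into independently checkable milestones (the Milestone structure and the example checks are hypothetical, not taken from GAIA):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Milestone:
    """One checkable step toward completing a task."""
    description: str
    check: Callable[[str], bool]  # inspects the agent's transcript
    weight: float = 1.0


def partial_credit(transcript: str, milestones: list[Milestone]) -> float:
    """Fraction of milestone weight achieved, in [0, 1].

    Compared with all-or-nothing grading, this produces smoother score
    curves as models improve, i.e. more continuous difficulty.
    """
    total = sum(m.weight for m in milestones)
    earned = sum(m.weight for m in milestones if m.check(transcript))
    return earned / total if total else 0.0


# Hypothetical usage for a web-research task:
milestones = [
    Milestone("found the relevant paper", lambda t: "arxiv.org" in t),
    Milestone("extracted the key figure", lambda t: "42.7%" in t, weight=2.0),
]
print(partial_credit("...visited arxiv.org/abs/xxxx.xxxxx; the answer is 42.7%", milestones))  # 1.0
```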

9 of 15

  • Easy/cheap to conduct.
    • It’s OK to require a human in the loop/environment.
  • Low-variance behavior/scoring (a variance-estimation sketch follows this list).
    • Real-world action is messy and contingent.
  • Objective grading.
    • Whether LLMs have surpassed call center workers depends on customer preferences.
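However these tradeoffs are resolved, it helps to quantify how noisy an agent’s score actually is. A minimal sketch, assuming a hypothetical run_once callable that executes one full pass over the benchmark:

```python
import random
import statistics
from typing import Callable


def score_with_variance(run_once: Callable[[], float], n_runs: int = 10) -> tuple[float, float]:
    """Repeat a noisy evaluation and report the mean score and its std dev.

    `run_once` stands in for running the agent through the benchmark a single
    time (a hypothetical callable, not a real API).
    """
    scores = [run_once() for _ in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical usage with a noisy dummy evaluation:
mean, std = score_with_variance(lambda: random.gauss(0.6, 0.05))
print(f"score: {mean:.2f} ± {std:.2f} over 10 runs")
```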

10 of 15

Some benchmark ideas

  • Hacking tasks, ranging from SQL injection to discovering new zero-day attacks.
  • Medicine tasks, ranging from existing medical benchmarks to multi-day diagnoses.
  • Mathematics tasks, from MATH to novel, journal-quality results (one way to encode such difficulty ladders is sketched below).
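A hedged sketch of how these domains could be encoded as explicit difficulty ladders, so that a single suite spans easy tasks up to consequential ones; the tier names, hours, and structure are all illustrative assumptions, not a proposed task list:

```python
from dataclasses import dataclass


@dataclass
class TierSpec:
    """One rung of a domain's difficulty ladder."""
    name: str
    expert_hours: float  # rough human-expert time to complete (illustrative)
    description: str


# Illustrative ladders for the three domains above; all numbers are made up.
LADDERS = {
    "security": [
        TierSpec("known-vulnerability", 0.5, "e.g. SQL injection on a test app"),
        TierSpec("novel-discovery", 40.0, "e.g. find a new zero-day"),
    ],
    "medicine": [
        TierSpec("benchmark-qa", 0.1, "existing medical QA benchmarks"),
        TierSpec("longitudinal-diagnosis", 20.0, "multi-day diagnostic workups"),
    ],
    "mathematics": [
        TierSpec("competition", 1.0, "MATH-style problems"),
        TierSpec("research", 100.0, "novel, journal-quality results"),
    ],
}

for domain, tiers in LADDERS.items():
    hardest = max(tiers, key=lambda t: t.expert_hours)
    print(f"{domain}: hardest tier ~{hardest.expert_hours}h of expert time")
```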

11 of 15

Why prioritize LLM agent benchmarks?

  • It’s important to measure LLMs’ skill at pursuing goals creatively and autonomously, not just acting as tools for humans.
    • We expect autonomous agents to be influential in the future.
    • This sort of capability underlies most stories about AIs causing catastrophes.
    • It’s especially neglected among existing evaluations.

12 of 15

RFP 2: Everything else

13 of 15

Out of scope for both RFPs

  • Projects to construct more capable LLM agents or tools.
  • Misalignment/jailbreak-vulnerability tests and demonstrations.
  • (Most) toy settings.
    • But there’s a fine line between a simulation and a game; you can apply and make your pitch for the three “C”s.

14 of 15

What about the capabilities externalities?

  • In some cases, risks can be mitigated by not releasing the benchmarks.
  • My personal view is that, for most benchmarks, the benefits from 1) consensus on capabilities and 2) better governance outweigh the risk of accelerating capabilities.
  • See Ajeya Cotra’s survey of 30 AI safety researchers for more discussion.

15 of 15

  • The applications for both RFPs are available here.
  • Grant sizes: $0.3M–$3M (!)
  • No deadline.
  • Model access (“non-standard OpenAI APIs”).
  • Tell your friends!