1 of 31

The world if we measure AI catastrophic risk:

2 of 31

Hi, I’m Thomas Broadley!

  • Member of Technical Staff at ARC Evals
  • BCS from UWaterloo
  • Before ARC Evals, full-stack software developer for 4.5 years
  • https://thomasbroadley.com/

These opinions are my own, not those of my employer.

3 of 31

What I want to talk about

  • Recent work in dangerous capability evaluations, especially ARC Evals’
  • Why I think better evals could reduce catastrophic risk from AI
  • Future directions

4 of 31

What are dangerous capability evaluations?

  • Evaluations: Any kind of test of an AI model, or a system with AI as a component
  • Quantitative or qualitative
  • Long history of evaluations

5 of 31

Categorizing evaluations

Framework from Evan Hubinger:

Capabilities

Alignment

Behavioural

ML benchmarks, ARC Evals

Testing jailbreak prompts

Understanding-based

Testing model developers’ understanding of

6 of 31

What are dangerous capabilities?

  • Chemical, biological, radiological and nuclear (CBRN)
  • Cybersecurity
  • Autonomous replication
  • Deception

7 of 31

“...works on assessing whether cutting-edge AI systems

could pose catastrophic risks to civilization.”

8 of 31

ARC Evals: GPT-4 evaluations

  • “Red team” evaluator of GPT-4 before it was released
  • Assessed the model’s ARA capabilities
  • The TaskRabbit captcha demo
  • Good first step but shouldn’t be the standard

9 of 31

ARC Evals: Technical report on language model agents

10 of 31

What is a language model agent?

Language model +

agentic prompt +

scaffolding

E.g. AutoGPT,

Adept

11 of 31

What is a language model agent?

Language model +

agentic prompt +

scaffolding

E.g. AutoGPT,

Adept

12 of 31

What is a language model agent?

Language model +

agentic prompt +

scaffolding

E.g. AutoGPT,

Adept

13 of 31

What is a language model agent?

Language model +

agentic prompt +

scaffolding

E.g. AutoGPT,

Adept

14 of 31

What is a language model agent?

Language model +

agentic prompt +

scaffolding

E.g. AutoGPT,

Adept

15 of 31

A concrete example

16 of 31

A concrete example

17 of 31

A concrete example

18 of 31

A concrete example

19 of 31

A concrete example

20 of 31

Results of the technical report

21 of 31

My thoughts on the technical report

  • Weak evidence that GPT-4 can’t autonomously replicate
  • I replicated most of this report in a week of work
    • Many experienced software developers could do the same
    • It’s easy to do hacky one-off evaluations but hard to make them reproducible and

22 of 31

Threat modelling

  • Concrete descriptions of AI catastrophic risk scenarios

https://www.lesswrong.com/posts/vERGLBpDE8m5mpT6t/autonomous-replication-and-adaptation-an-attempt-at-a

23 of 31

Anthropic’s Responsible Scaling Policy

  • AI Safety Levels
  • Current models are ASL-2
  • ASL-3: either passes an ARA benchmark or increases CRBN or cybersecurity misuse risk
  • Run evaluations every three months or 4x increase in “effective compute”
  • 6x safety buffer

24 of 31

Other labs’ commitments

25 of 31

Labs’ commitments aren’t enough

  • Anthropic’s RSP:
    • Doesn’t prove that the safety buffer is large enough
    • Doesn’t prove that the ASL-3 ARA benchmark is the right one to use
    • Doesn’t make the boundaries between ASLs concrete enough
    • Doesn’t run evaluations frequently enough
  • Unknown unknowns

26 of 31

Other orgs building dangerous capabilities evaluations

27 of 31

How will evalulations reduce catastrophic risk?

  • Better RSPs with stronger evaluations would reduce catastrophic risk
  • Develop conservative evaluations for all possible threat models and pause AI development when a model passes an evaluation
  • Concerns
    • RSPs won’t get better
    • Evaluations will get Goodharted
    • Unknown unknowns

28 of 31

Another concern: Models could fail evals intentionally

29 of 31

Future directions

  • More evaluations organizations

30 of 31

Future directions: Password-locked models

  • Models fine-tuned to fail a task unless the prompt includes a password
  • Good proxy for deceptively aligned models
  • Takeaways
    • Different kinds of fine-tuning have different effects on password-locked models
    • Fine-tuning is necessary to evaluate capabilities
    • Could use password-locked models to test evaluations schemes

https://www.lesswrong.com/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities

31 of 31

How to get involved

  • Start or join an evaluations organization
  • ARC Evals is hiring research assistants and data annotation contractors: https://jobs.lever.co/alignment.org/
  • Palisade Research is hiring research engineers: https://palisaderesearch.org/work