1 of 31

The world if we measure AI catastrophic risk:

2 of 31

Hi, I’m Thomas Broadley!

Member of Technical Staff at ARC Evals
BCS from UWaterloo
Before ARC Evals, full-stack software developer for 4.5 years
https://thomasbroadley.com/

These opinions are my own, not those of my employer.

Thomas Broadley
I’m a Canadian who’s living in Los Angeles
I have a Bachelor’s in Computer Science from the University of Waterloo
After that, I worked for 4.5 years as a full-stack software developer at a startup in Kitchener-Waterloo
I’ve accepted the possibility of catastrophic risk from AI for several years, but thought it was far off. When ChatGPT was released, like a lot of people, I realized that the next few generations of large language models could pose such a risk
I spent several months looking for ways that I could effectively reduce AI catastrophic risk. If anyone’s going through the same process, I’m happy to talk more about it at the end of the meeting. But to make a long story short, I just accepted a role as a Member of Technical Staff at the Alignment Research Center, specifically on their Evaluations team, aka ARC Evals
I maintain a personal website and blog at https://thomasbroadley.com/
But I’m giving this talk in a personal capacity. These are my opinions, not ARC Evals’

3 of 31

What I want to talk about

Recent work in dangerous capability evaluations, especially ARC Evals’
Why I think better evals could reduce catastrophic risk from AI
Future directions

4 of 31

What are dangerous capability evaluations?

Evaluations: Any kind of test of an AI model, or a system with AI as a component
Quantitative or qualitative
Long history of evaluations

5 of 31

Categorizing evaluations

Framework from Evan Hubinger:

	Capabilities	Alignment
Behavioural	ML benchmarks, ARC Evals	Testing jailbreak prompts
Understanding-based		Testing model developers’ understanding of

(From evhub) A capabilities evaluation is a model evaluation designed to test whether a model could do some task if it were trying to. For example: if the model were actively trying to autonomously replicate, would it be capable of doing so?
(From evhub) An alignment evaluation is a model evaluation designed to test under what circumstances a model would actually try to do some task. For example: would a model ever try to convince humans not to shut it down?
Behavioural evaluations don’t try to understand or prove anything about the internals of a model. They just focus on the model’s behaviour on certain tasks. ML benchmarks and ARC Evals’ dangerous capability evaluations fall into this category
Understanding-based evaluations: instead of evaluating the model’s behaviour, these evaluate the model developer’s ability to understand and explain the model’s behaviour
So to summarize, existing dangerous capabilities evaluations are behavioural capabilities evaluations
2x2 table

Behavioural capabilities: ARC Evals’ dangerous capabilities evaluations, ML benchmarks
Behavioural alignment: checking how easy it is to jailbreak a model by trying a bunch of jailbreak prompts on it (for a broad definition of alignment)
Understanding-based capabilities: IDK
Understanding-based alignment: can a model developer give a strong argument that a model isn’t a source of catastrophic risk?

6 of 31

What are dangerous capabilities?

Chemical, biological, radiological and nuclear (CBRN)
Cybersecurity
Autonomous replication
Deception

I think the consensus definition for dangerous capabilities is “those that could cause a catastrophe”. So maybe a term like “catastrophic capabilities” would be better
AI could help untrained people produce novel bioweapons and infect people with them
Similarly, AI could give people the knowledge they need to make and deploy chemical or nuclear weapons
AI could make it easier for humans to hack into critical infrastructure like water, power, and telecommunications systems. A prolonged outage could constitute a catastrophe
Autonomous replication. Given the right tools, can an AI system make copies of itself and gather more resources (e.g. access to cloud servers, money) with which to make even more copies of itself? If so, it’d make AI more effective at causing catastrophe through the means above, by gathering resources for the humans that control it. Or there might not be any humans in control. It might be able to cause a catastrophe independently
Deception. AI could manipulate humans into escaping from the labs that created it, getting resources, taking action in the real world. This would also make it more effective at causing a catastrophe

7 of 31

“...works on assessing whether cutting-edge AI systems

could pose catastrophic risks to civilization.”

8 of 31

ARC Evals: GPT-4 evaluations

“Red team” evaluator of GPT-4 before it was released
Assessed the model’s ARA capabilities
The TaskRabbit captcha demo
Good first step but shouldn’t be the standard

Partnered with OpenAI and Anthropic as a third-party evaluator of their models, including GPT-4 before it was released
Serving as a “red team”, trying to work around safety restrictions put in place on models and elicit capabilities that the labs might have tried to get rid of
ARC Evals’ main worry was future ML models being able to autonomously act in the real world, gaining access to resources like money and cloud servers
Researchers gave the models goals like “set up a cloud server using the money in this crypto wallet”
Then had the model actually run code, interact with a browser with the researcher helping, etc. Not just having the model come up with plans and seeing if the researchers thought they were good plans
The most compelling demonstration that came out of that, the one that a lot of people paid attention to, was GPT-4 lying to a Taskrabbit worker while getting them to work around a captcha for it
I think this was a good first step but not enough for a couple of reasons

Evaluations weren’t run on the final version of GPT-4 that was released to the public
No fine-tuning

9 of 31

ARC Evals: Technical report on language model agents

https://evals.alignment.org/Evaluating_LMAs_Realistic_Tasks.pdf

10 of 31

What is a language model agent?

Language model +

agentic prompt +

scaffolding

E.g. AutoGPT,

Adept

11 of 31

What is a language model agent?

Language model +

agentic prompt +

scaffolding

E.g. AutoGPT,

Adept

12 of 31

What is a language model agent?

Language model +

agentic prompt +

scaffolding

E.g. AutoGPT,

Adept

13 of 31

What is a language model agent?

Language model +

agentic prompt +

scaffolding

E.g. AutoGPT,

Adept

14 of 31

What is a language model agent?

Language model +

agentic prompt +

scaffolding

E.g. AutoGPT,

Adept

15 of 31

A concrete example

16 of 31

A concrete example

17 of 31

A concrete example

18 of 31

A concrete example

19 of 31

A concrete example

20 of 31

Results of the technical report

21 of 31

My thoughts on the technical report

Weak evidence that GPT-4 can’t autonomously replicate
I replicated most of this report in a week of work

Many experienced software developers could do the same
It’s easy to do hacky one-off evaluations but hard to make them reproducible and

22 of 31

Threat modelling

Concrete descriptions of AI catastrophic risk scenarios

https://www.lesswrong.com/posts/vERGLBpDE8m5mpT6t/autonomous-replication-and-adaptation-an-attempt-at-a

23 of 31

Anthropic’s Responsible Scaling Policy

AI Safety Levels
Current models are ASL-2
ASL-3: either passes an ARA benchmark or increases CRBN or cybersecurity misuse risk
Run evaluations every three months or 4x increase in “effective compute”
6x safety buffer

24 of 31

Other labs’ commitments

OpenAI: Risk-informed Development Policy (not public)
DeepMind: Running evaluations internally

25 of 31

Labs’ commitments aren’t enough

Anthropic’s RSP:

Doesn’t prove that the safety buffer is large enough
Doesn’t prove that the ASL-3 ARA benchmark is the right one to use
Doesn’t make the boundaries between ASLs concrete enough
Doesn’t run evaluations frequently enough

Unknown unknowns

Anthropic’s RSP received a fair amount of pushback. Here are some pieces of criticism from that that I think are valid:

Little justification for why the chosen ARA benchmark is actually 6x below the level where an agent could replicate autonomously
The boundaries between AI Safety Levels aren’t concrete enough. For example, when talking about ASL-3, it says things like, “Access to the model would substantially increase the risk of deliberately-caused catastrophic harm”, but doesn’t say how much a “substantial increase” is

Honestly I get it, this is just really tough to estimate right now!
We need to get better at forecasting the actual risk of catastrophic harm from models with a certain set of capabilities. Maybe superforecasters could help here. They’ve already done things like forecasting P(doom)
Also there’s a risk here of too-tight benchmarks getting Goodharted. This is one reason to keep things kind of loose and “I know it when I see it”

Unknown unknowns are a real problem

We can threat model til the cows come home but still might miss things

I’d like to see evaluations run much more frequently. Probably continuously. I feel like it’s possible

There are only five evaluation tasks listed in the report. In my opinion, it should be possible to run all of them in an automated manner, without any human oversight
Even if Anthropic plans to add many more tasks, I still think these should be run much more frequently

26 of 31

Other orgs building dangerous capabilities evaluations

27 of 31

How will evalulations reduce catastrophic risk?

Better RSPs with stronger evaluations would reduce catastrophic risk
Develop conservative evaluations for all possible threat models and pause AI development when a model passes an evaluation
Concerns

RSPs won’t get better
Evaluations will get Goodharted
Unknown unknowns

Anthropic’s RSP specifically includes a list of dangerous capabilities evaluations that, if passed, would mean that the model is in the ASL-3 danger zone
Of course, there is a serious concern that RSP regulation won’t end up being strong enough to meaningfully reduce catastrophic risk

I’m not too worried about current-generation RSPs getting enshrined as the gold standard forever. I think that both the labs and governments realize that these things need to be updated often
I am worried about the bar being consistently too low. E.g. by the time Anthropic upgrades their RSP, models are already strong enough that they under-upgrade it. But I think I can have a big impact on that working at ARC Evals, by pushing labs to strengthen their RSPs and by making it easier for them to do so (e.g. automated third-party evaluations)

Or the evaluations will succumb to Goodhart’s law

Even Anthropic’s RSP has a clause about times when an ASL evaluation triggers that says, “If the evaluation is determined to be overly conservative and the model is confirmed to not pose next-ASL risks, [we will] construct new evaluations that take into account this new information.”

Unknown unknowns, e.g. threat models we don’t guard against

Need a diverse group of evaluations organizations focused on different threat models
Need to make sure that we’re not just going to sideline organizations who produce crazy-looking threat models

28 of 31

Another concern: Models could fail evals intentionally

https://www.lesswrong.com/posts/dBmfb76zx6wjPsBC7/when-can-we-trust-model-evaluations

29 of 31

Future directions

More evaluations organizations

30 of 31

Future directions: Password-locked models

Models fine-tuned to fail a task unless the prompt includes a password
Good proxy for deceptively aligned models
Takeaways

Different kinds of fine-tuning have different effects on password-locked models
Fine-tuning is necessary to evaluate capabilities
Could use password-locked models to test evaluations schemes

https://www.lesswrong.com/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities

https://www.lesswrong.com/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities
Experimenting with models that are fine-tuned to be bad at a certain task unless a certain string (the password) is present in the prompt
The idea is that these are a good proxy for deceptively aligned models. In both cases, we have a model that has some capabilities but won’t display them in all circumstances
Conclusions

You can actually make password-locked models. This isn’t just a fantasy
Different kinds of fine-tuning have different levels of effectiveness on password-locked models
Fine-tuning is necessary to do good capabilities evaluations
Password-locked models could be a good way to test whether a particular evaluations scheme (e.g. ARC Evals’ ARA evaluations) could elicit hidden capabilities from a seemingly weak model. That seems important!

I’d be excited to see someone try to replicate this or extend this work

31 of 31

How to get involved

Start or join an evaluations organization
ARC Evals is hiring research assistants and data annotation contractors: https://jobs.lever.co/alignment.org/
Palisade Research is hiring research engineers: https://palisaderesearch.org/work