The world if we measure AI catastrophic risk:
Hi, I’m Thomas Broadley!
These opinions are my own, not those of my employer.
What I want to talk about
What are dangerous capability evaluations?
Categorizing evaluations
Framework from Evan Hubinger:
| Capabilities | Alignment |
Behavioural | ML benchmarks, ARC Evals | Testing jailbreak prompts |
Understanding-based | | Testing model developers’ understanding of |
What are dangerous capabilities?
“...works on assessing whether cutting-edge AI systems
could pose catastrophic risks to civilization.”
ARC Evals: GPT-4 evaluations
ARC Evals: Technical report on language model agents
What is a language model agent?
Language model +
agentic prompt +
scaffolding
E.g. AutoGPT,
Adept
What is a language model agent?
Language model +
agentic prompt +
scaffolding
E.g. AutoGPT,
Adept
What is a language model agent?
Language model +
agentic prompt +
scaffolding
E.g. AutoGPT,
Adept
What is a language model agent?
Language model +
agentic prompt +
scaffolding
E.g. AutoGPT,
Adept
What is a language model agent?
Language model +
agentic prompt +
scaffolding
E.g. AutoGPT,
Adept
A concrete example
A concrete example
A concrete example
A concrete example
A concrete example
Results of the technical report
My thoughts on the technical report
Threat modelling
Anthropic’s Responsible Scaling Policy
Other labs’ commitments
Labs’ commitments aren’t enough
Other orgs building dangerous capabilities evaluations
How will evalulations reduce catastrophic risk?
Another concern: Models could fail evals intentionally
Future directions
Future directions: Password-locked models
How to get involved