Dangerous Capability Evaluations in AI models
Timothée Chauvin
X-IA, Sep. 24, 2024
Why?
Working hypothesis:
What is a dangerous capability?
o1 system card
Chemical, Biological, Radiological, and Nuclear [risks]
Directly dangerous
Indirectly dangerous / useful benchmarks
Let’s evaluate all current models together
Are they dangerous?
Not really.
What will AI progress look like in the next few years?
Scaling laws
L: final pre-training test loss
N: number of parameters in the model
D: number of tokens in training data
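These quantities are related by a parametric scaling law; a minimal sketch, assuming the Chinchilla-style fit of Hoffmann et al. (2022) with its published coefficients (treat the exact values as illustrative):

```python
# Parametric scaling law of Hoffmann et al. (2022):
#   L(N, D) = E + A / N**alpha + B / D**beta
# where L is the final pre-training test loss, N the parameter count,
# and D the number of training tokens. Coefficients are the published fit.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N: float, D: float) -> float:
    """Predicted final pre-training test loss for N parameters, D tokens."""
    return E + A / N**alpha + B / D**beta

# Scaling up both parameters and data lowers the predicted loss:
print(loss(1e9, 2e10))     # ~1 B params, 20 B tokens
print(loss(7e10, 1.4e12))  # ~70 B params, 1.4 T tokens (Chinchilla-scale)
```

The additive form makes the trade-off explicit: past a point, adding parameters without adding data (or vice versa) stops helping, since the other term dominates.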
Back to 2019: GPT-2
(Winograd Schema Challenge, introduced in 2011 as a test for AGI)
From GPT-2 to GPT-4
GPT-2 (2019): 4e21 FLOP
GPT-4 (2023): 2e25 FLOP
x5,000 (x8.4/yr)
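The per-year figure follows from compounding the overall compute jump over the four years separating the two models; a quick check, using the FLOP estimates from the slide:

```python
# Compute scale-up from GPT-2 (2019, ~4e21 FLOP) to GPT-4 (2023, ~2e25 FLOP).
gpt2_flop, gpt4_flop = 4e21, 2e25
years = 2023 - 2019

total_factor = gpt4_flop / gpt2_flop         # x5,000 overall
yearly_factor = total_factor ** (1 / years)  # ~x8.4 per year, compounded

print(f"total: x{total_factor:,.0f}, per year: x{yearly_factor:.1f}")
```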
How fast has compute been increasing?
4-5x a year between 2010 and 2024
Faster than:
Algorithmic progress
But how far can scaling go?
Power
GPT-2 (2019): 4e21 FLOP
GPT-4 (2023): 2e25 FLOP (x5,000)
GPT-4 → ??: x10,000
Unhobbling
Past unhobblings
Future unhobblings
Predicting the future is difficult
(two years later, o1 scores 94.8% on MATH and the benchmark is considered obsolete)
Back to: Dangerous Capability Evaluations in AI models
[Reminder] Types of dangerous capabilities
o1 system card
Responsible Scaling Policies (RSP)
Introduced by: METR (Sep. 2023)
Commitments by:
Responsible Scaling Policies (RSP)
How are dangerous capability evaluations done?
Example: WMDP benchmark (Weapons of Mass Destruction Proxy benchmark)
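WMDP is a multiple-choice benchmark, so evaluation reduces to comparing the model's picks against an answer key. A minimal sketch of such a harness; the sample item and the `ask_model` stub are hypothetical placeholders, not part of the actual benchmark:

```python
# Minimal multiple-choice scorer of the kind used for WMDP-style benchmarks.
# `ask_model` and the sample dataset below are hypothetical placeholders.

def ask_model(question: str, choices: list[str]) -> int:
    """Stand-in for a real model call; returns the index of a choice."""
    return 0  # a real harness would prompt the model and parse its answer

def score(dataset: list[dict]) -> float:
    """Fraction of questions where the model picks the keyed answer."""
    correct = sum(
        ask_model(item["question"], item["choices"]) == item["answer"]
        for item in dataset
    )
    return correct / len(dataset)

dataset = [
    {"question": "Placeholder question?",
     "choices": ["A", "B", "C", "D"],
     "answer": 0},
]
print(score(dataset))
```

Multiple-choice scoring is cheap and reproducible, which is why it is used as a *proxy* for dangerous knowledge rather than testing the dangerous task itself.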
What is an agent?
2. Solving tasks as an agent
2. Solving tasks as an agent
Particularly suited to hacking, AI R&D, and replication-and-adaptation risks
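Agentic evaluations run the model in a loop: it proposes an action, a harness executes it in a sandbox, and the observation is fed back until the model stops or a step budget runs out. A sketch of that loop, assuming hypothetical `call_model` and `run_in_sandbox` stand-ins:

```python
# Sketch of the action-observation loop used in agentic evaluations.
# `call_model` and `run_in_sandbox` are hypothetical placeholders for an
# LLM API call and an isolated execution environment, respectively.

def call_model(history: list[str]) -> str:
    """Stand-in for an LLM call; returns the next shell command or 'DONE'."""
    return "DONE"

def run_in_sandbox(command: str) -> str:
    """Stand-in for executing a command in an isolated environment."""
    return ""

def run_agent(task: str, max_steps: int = 10) -> list[str]:
    """Run the agent loop, returning the full transcript."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = call_model(history)
        if action == "DONE":
            break
        history.append(f"$ {action}")
        history.append(run_in_sandbox(action))
    return history

transcript = run_agent("list the files in /tmp")
```

The step budget and the sandbox boundary are the two safety-relevant knobs: they bound how much autonomous action the model can take during the evaluation.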
3. Human uplift studies
Particularly suited to CBRN risks
GPT-4o to o1:
Practical considerations
Aside: what do jailbreaks look like these days?
What’s next?
Right now, models aren’t reliable enough to be dangerous.
What should we do when that changes?
Countermeasures
What about open-weights models?
Will this even be a problem?
Conclusion
Further resources
me:
Questions?
Scaling laws
Kaplan et al. (Jan 2020)
Hoffmann et al. (Mar 2022)
Epoch AI (Apr 2024)
Example: Anthropic’s Responsible Scaling Policy
Questions to ask yourself