1 of 12

Evaluations

2 of 12

What are evals?

“The systematic measurement of properties in AI systems” - Apollo Research

“AI Evaluations focus on experimentally assessing the capabilities, safety, and alignment of AI systems.” https://www.lesswrong.com/tag/ai-evaluations

“There is not yet consensus on precisely what the term ‘evaluation’ entails [...]” Ada Lovelace Institute
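
To make "systematic measurement" concrete, a minimal eval is just a loop that poses tasks to a model and scores the answers. The sketch below is a generic Python illustration, not any particular framework's API; run_eval, the stand-in model, and the toy dataset are placeholders.

```python
# Minimal sketch of an eval: pose tasks to a model, score the answers,
# report an aggregate metric. `model` is any callable taking a prompt
# string and returning an answer string.
from typing import Callable


def run_eval(model: Callable[[str], str], dataset: list[dict]) -> float:
    """dataset items look like {"prompt": ..., "expected": ...}; returns accuracy."""
    correct = 0
    for item in dataset:
        answer = model(item["prompt"])
        correct += int(answer.strip().lower() == item["expected"].strip().lower())
    return correct / len(dataset)


if __name__ == "__main__":
    toy_dataset = [
        {"prompt": "2 + 2 = ?", "expected": "4"},
        {"prompt": "What is the capital of France?", "expected": "Paris"},
    ]
    # Stand-in "model" that always answers "4" -- swap in a real API call.
    dummy_model = lambda prompt: "4"
    print(f"accuracy = {run_eval(dummy_model, toy_dataset):.2f}")  # prints 0.50
```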

3 of 12

Why do evals?

  • Currently no other way to determine what an AI system can do.
    • Cannot predict capabilities before (or after!) training
    • People regularly surprised by LLM behaviors and abilities
  • Key input into current and future government and AI lab policies
  • Key input into individual and community decision-making
    • Should affect your timelines or your perception of different risks

4 of 12

Examples

5 of 12

Self-proliferation. Jul 2023

Criteria METR used to create these tasks (a rough task-suite sketch follows below):

  • Tasks easier than self-proliferation
  • Range of difficulty
  • Range of potential obstacles
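
A hypothetical Python sketch of how a task suite meeting these criteria might be represented; the Task fields and task names are invented for illustration and are not METR's actual format.

```python
# Hypothetical sketch of a task suite spanning a range of difficulty and
# obstacles, in the spirit of the criteria above. Field names and task
# names are illustrative only, not METR's actual task format.
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    instructions: str      # what the agent is asked to do
    difficulty: int        # e.g. 1 (easy) .. 5 (hard), all easier than self-proliferation
    obstacles: list[str] = field(default_factory=list)  # obstacles the agent must work around


TASK_SUITE = [
    Task("search_files", "Find a given string in a local file tree.", 1),
    Task("set_up_server", "Stand up a simple web server and serve a page.", 3,
         ["missing dependency"]),
    Task("recover_account", "Regain access to a test account after a password reset.", 4,
         ["CAPTCHA-like challenge", "rate limiting"]),
]
```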

6 of 12

Situational Awareness. Jul 2024

7 of 12

Machiavelli. Behaviour in text-based games. Apr 2023

8 of 12

Adversarial robustness

  1. ANLI
    • is a large-scale natural language inference dataset created via an iterative, adversarial human-and-model-in-the-loop procedure, focused on examples that could fool state-of-the-art models at the time of its creation (e.g. BERT-Large [88], RoBERTa [89]).
  2. AdvGLUE [90]
    • uses questions from the General Language Understanding Evaluation (GLUE) benchmark [91] and adds typos, word replacements, sentence paraphrases, manipulation of sentence structure, insertion of unrelated sentences, and human-written adversarial examples. The attacks are optimized against BERT [88], RoBERTa, and a RoBERTa ensemble (a toy sketch of these perturbation types follows after this list).
  3. Human Jailbreaks
    • is a set of 1,405 in-the-wild, human-written jailbreak templates.
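
A toy Python illustration of the surface-level perturbations AdvGLUE applies (typos, word substitutions). The real benchmark optimizes attacks against specific models; this sketch only shows the flavour of the transformations, with an invented synonym table.

```python
# Toy illustration of AdvGLUE-style surface perturbations (typos and word
# substitutions). The real benchmark optimizes its attacks against specific
# models (BERT, RoBERTa); this only shows the flavour of the transformations.
import random

SYNONYMS = {"movie": "film", "good": "decent", "bad": "poor"}  # illustrative only


def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def replace_words(text: str) -> str:
    """Replace known words with near-synonyms."""
    return " ".join(SYNONYMS.get(word.lower(), word) for word in text.split())


rng = random.Random(0)
sentence = "The movie was good"
print(add_typo(sentence, rng))   # a variant with two adjacent characters swapped
print(replace_words(sentence))   # "The film was decent"
```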

9 of 12

Orgs doing evals

  • AI labs: OpenAI, DeepMind, Anthropic
  • AI safety labs: METR, Apollo Research, Center for AI Safety, Redwood Research
  • Governments: UK AI Safety Institute, US AI Safety Institute
  • For-profit: VectorView, likely many more
  • Other orgs: RAND, Aligned AI, SaferAI

10 of 12

Risks / failure modes of evals: questions to ask

  • Technical: Are evals reliable?
    • What if dangerous capabilities are discontinuous and unpredictable?
    • If our evals don’t find dangerous capabilities, what can we conclude?
    • Have we tried enough prompts / scaffolding? Is the AI system being deceptive?
  • Governance: safetywashing by AI labs?
    • Can we trust AI labs to self-regulate and self-evaluate?
    • Can evals be ‘Goodharted’?
  • Governance: race dynamics
    • Do evals impose a ‘safety tax’? Safety-conscious groups could be outcompeted by less safe ones
  • Capabilities vs safety
    • Do evals contribute more to capabilities or to safety?
  • Technical: Is alignment easier than evals?
    • Hubinger: “robustly evaluating safety could be more difficult than finding a training process that robustly produces safe models”

11 of 12

Evals is a new field. Lots of low-hanging fruit

“We need a Science of Evals” by Apollo Research

  • Still learning a lot about prompt engineering and scaffolding
  • Anthropic: “changing the options from (A) to (1) or from (A) to [A], or adding an extra space between the option and the answer can lead to a ~5 percentage point change.” (see the sketch below)
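
A hedged sketch of how such formatting sensitivity might be probed: render the same multiple-choice items under several option formats and compare accuracy per format. ask_model, the format set, and the label-matching scorer are assumptions for illustration, not Anthropic's actual setup.

```python
# Sketch of a formatting-sensitivity probe: render the same multiple-choice
# items under different option styles and compare accuracy per style.
# `ask_model` is a placeholder for the model under test, not a real API.
from typing import Callable

FORMATS = {
    "(A)": lambda i, opt: f"({chr(65 + i)}) {opt}",
    "(1)": lambda i, opt: f"({i + 1}) {opt}",
    "[A]": lambda i, opt: f"[{chr(65 + i)}] {opt}",
    "(A) + extra space": lambda i, opt: f"({chr(65 + i)})  {opt}",
}


def render(question: str, options: list[str], fmt) -> str:
    lines = [question] + [fmt(i, opt) for i, opt in enumerate(options)]
    return "\n".join(lines) + "\nAnswer:"


def format_sweep(ask_model: Callable[[str], str], items: list[dict]) -> dict[str, float]:
    """items look like {"question": ..., "options": [...], "answer_index": int}."""
    scores = {}
    for name, fmt in FORMATS.items():
        correct = 0
        for item in items:
            reply = ask_model(render(item["question"], item["options"], fmt))
            expected_label = fmt(item["answer_index"], "").strip()  # e.g. "(B)", "[B]", "(2)"
            correct += int(expected_label in reply)
        scores[name] = correct / len(items)
    return scores


# Usage with a stand-in model that always replies "(B)": formats whose label
# string matches score 1.0, the others 0.0, mimicking format-dependent results.
print(format_sweep(lambda prompt: "(B)",
                   [{"question": "2 + 2 = ?", "options": ["3", "4"], "answer_index": 1}]))
```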

12 of 12

Opportunities 🎉