1 of 12

Evaluations

2 of 12

What are evals?

“The systematic measurement of properties in AI systems” - Apollo Research

“AI Evaluations focus on experimentally assessing the capabilities, safety, and alignment of AI systems.” https://www.lesswrong.com/tag/ai-evaluations

“There is not yet consensus on precisely what the term ‘evaluation’ entails [...]” Ada Lovelace Institute
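
To make "systematic measurement" concrete, a minimal eval is just a loop that poses tasks to a model and scores the answers. The sketch below is a generic Python illustration, not any particular framework's API; run_eval, the stand-in model, and the toy dataset are placeholders.

```python
# Minimal sketch of an eval: pose tasks to a model, score the answers,
# report an aggregate metric. `model` is any callable taking a prompt
# string and returning an answer string.
from typing import Callable


def run_eval(model: Callable[[str], str], dataset: list[dict]) -> float:
    """dataset items look like {"prompt": ..., "expected": ...}; returns accuracy."""
    correct = 0
    for item in dataset:
        answer = model(item["prompt"])
        correct += int(answer.strip().lower() == item["expected"].strip().lower())
    return correct / len(dataset)


if __name__ == "__main__":
    toy_dataset = [
        {"prompt": "2 + 2 = ?", "expected": "4"},
        {"prompt": "What is the capital of France?", "expected": "Paris"},
    ]
    # Stand-in "model" that always answers "4" -- swap in a real API call.
    dummy_model = lambda prompt: "4"
    print(f"accuracy = {run_eval(dummy_model, toy_dataset):.2f}")  # prints 0.50
```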

3 of 12

Why do evals?

  • Currently no other way to determine what an AI system can do.
    • Cannot predict capabilities before (or after!) training
    • People regularly surprised by LLM behaviors and abilities
  • Key input into current and future government and AI lab policies
  • Key input into individual and community decision-making
    • Should affect your timelines or your perception of different risks

4 of 12

Examples

5 of 12

Self-proliferation. Jul 2023

Criteria METR used to create these tasks (a rough task-suite sketch follows below):

  • Tasks easier than self-proliferation
  • Range of difficulty
  • Range of potential obstacles
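
A hypothetical Python sketch of how a task suite meeting these criteria might be represented; the Task fields and task names are invented for illustration and are not METR's actual format.

```python
# Hypothetical sketch of a task suite spanning a range of difficulty and
# obstacles, in the spirit of the criteria above. Field names and task
# names are illustrative only, not METR's actual task format.
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    instructions: str      # what the agent is asked to do
    difficulty: int        # e.g. 1 (easy) .. 5 (hard), all easier than self-proliferation
    obstacles: list[str] = field(default_factory=list)  # obstacles the agent must work around


TASK_SUITE = [
    Task("search_files", "Find a given string in a local file tree.", 1),
    Task("set_up_server", "Stand up a simple web server and serve a page.", 3,
         ["missing dependency"]),
    Task("recover_account", "Regain access to a test account after a password reset.", 4,
         ["CAPTCHA-like challenge", "rate limiting"]),
]
```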

6 of 12

Situational Awareness. Jul 2024

7 of 12

Machiavelli. Behaviour in text-based games. Apr 2023

8 of 12

Adversarial robustness

  1. ANLI
    • is a large-scale natural language inference dataset created via an iterative, adversarial human-and-model-in-the-loop procedure, focused on examples that could fool state-of-the-art models at the time of its creation (e.g. BERT-Large [88], RoBERTa [89]).
  2. AdvGLUE [90]
    • uses questions from the General Language Understanding Evaluation (GLUE) benchmark [91] and adds typos, word replacements, sentence paraphrases, manipulation of sentence structure, insertion of unrelated sentences, and human-written adversarial examples. The attacks are optimized against BERT [88], RoBERTa, and a RoBERTa ensemble (a toy sketch of these perturbation types follows after this list).
  3. Human Jailbreaks
    • is a set of 1,405 in-the-wild, human-written jailbreak templates.
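
A toy Python illustration of the surface-level perturbations AdvGLUE applies (typos, word substitutions). The real benchmark optimizes attacks against specific models; this sketch only shows the flavour of the transformations, with an invented synonym table.

```python
# Toy illustration of AdvGLUE-style surface perturbations (typos and word
# substitutions). The real benchmark optimizes its attacks against specific
# models (BERT, RoBERTa); this only shows the flavour of the transformations.
import random

SYNONYMS = {"movie": "film", "good": "decent", "bad": "poor"}  # illustrative only


def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def replace_words(text: str) -> str:
    """Replace known words with near-synonyms."""
    return " ".join(SYNONYMS.get(word.lower(), word) for word in text.split())


rng = random.Random(0)
sentence = "The movie was good"
print(add_typo(sentence, rng))   # a variant with two adjacent characters swapped
print(replace_words(sentence))   # "The film was decent"
```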

9 of 12

Orgs doing evals

  • AI labs: OpenAI, DeepMind, Anthropic
  • AI safety labs: METR, Apollo Research, Center for AI Safety, Redwood Research
  • Governments: UK AI Safety Institute, US AI Safety Institute
  • For-profit: VectorView, likely many more
  • Other orgs: RAND, Aligned AI, SaferAI

10 of 12

Risks / failure modes of evals: questions to ask

  • Technical: Are evals reliable?
    • What if dangerous capabilities are discontinuous and unpredictable?
    • If our evals don’t find dangerous capabilities, what can we conclude?
    • Have we tried enough prompts / scaffolding? Is the AI system being deceptive?
  • Governance: safetywashing by AI labs?
    • Can we trust AI labs to self-regulate and self-evaluate?
    • Can evals be ‘Goodharted’?
  • Governance: race dynamics
    • Do evals impose a ‘safety tax’? Safety-conscious groups could be outcompeted by less safe ones
  • Capabilities vs safety
    • Do evals contribute more to capabilities or to safety?
  • Technical: Is alignment easier than evals?
    • Hubinger: “robustly evaluating safety could be more difficult than finding a training process that robustly produces safe models”

11 of 12

Evals is a new field. Lots of low-hanging fruit

“We need a Science of Evals” by Apollo Research

  • Still learning a lot about prompt engineering and scaffolding
  • Anthropic: “changing the options from (A) to (1) or from (A) to [A], or adding an extra space between the option and the answer can lead to a ~5 percentage point change.” (see the sketch below)
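
A hedged sketch of how such formatting sensitivity might be probed: render the same multiple-choice items under several option formats and compare accuracy per format. ask_model, the format set, and the label-matching scorer are assumptions for illustration, not Anthropic's actual setup.

```python
# Sketch of a formatting-sensitivity probe: render the same multiple-choice
# items under different option styles and compare accuracy per style.
# `ask_model` is a placeholder for the model under test, not a real API.
from typing import Callable

FORMATS = {
    "(A)": lambda i, opt: f"({chr(65 + i)}) {opt}",
    "(1)": lambda i, opt: f"({i + 1}) {opt}",
    "[A]": lambda i, opt: f"[{chr(65 + i)}] {opt}",
    "(A) + extra space": lambda i, opt: f"({chr(65 + i)})  {opt}",
}


def render(question: str, options: list[str], fmt) -> str:
    lines = [question] + [fmt(i, opt) for i, opt in enumerate(options)]
    return "\n".join(lines) + "\nAnswer:"


def format_sweep(ask_model: Callable[[str], str], items: list[dict]) -> dict[str, float]:
    """items look like {"question": ..., "options": [...], "answer_index": int}."""
    scores = {}
    for name, fmt in FORMATS.items():
        correct = 0
        for item in items:
            reply = ask_model(render(item["question"], item["options"], fmt))
            expected_label = fmt(item["answer_index"], "").strip()  # e.g. "(B)", "[B]", "(2)"
            correct += int(expected_label in reply)
        scores[name] = correct / len(items)
    return scores


# Usage with a stand-in model that always replies "(B)": formats whose label
# string matches score 1.0, the others 0.0, mimicking format-dependent results.
print(format_sweep(lambda prompt: "(B)",
                   [{"question": "2 + 2 = ?", "options": ["3", "4"], "answer_index": 1}]))
```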

12 of 12

Opportunities 🎉