| https://ailabwatch.org/blog/model-evals/ | Google DeepMind | OpenAI | Anthropic | Meta | Microsoft | xAI |
|---|---|---|---|---|---|---|
| Major sources | Evals paper; FSF | o1 system card; PF | RSP evals report; RSP | Llama 3 paper; CyberSecEval 3 | AI Safety Policies | None |
| Pre-deployment evals for offensive cyber capabilities | Yes (see evals paper and Gemini 1.5) | Yes | Yes | Yes | No | No |
| Pre-deployment evals for bio/CBRN capabilities | Yes (Gemini 1.5 and Gemma 2), but details are unclear | Yes | Yes | Yes | No | No |
| Pre-deployment evals for autonomy capabilities | Yes: "self-proliferation" tasks | Partial: "model autonomy" evals, but they are mostly normal coding tasks; they do include "agentic tasks" (which OpenAI also calls "autonomous replication and adaptation" tasks), but the details are unclear | Yes: "Autonomous Replication and Adaption" tasks | No | No | No |
| Pre-deployment evals for scheming or situational awareness capabilities | Yes: "self-reasoning" tasks | No, but Apollo did this for o1-preview (there is no policy to share future models with Apollo) | No | No | No | No |
| Pre-deployment evals for manipulation or persuasion capabilities | Yes: persuasion, building rapport, and subtle manipulation in dialogue | Yes, but the human experiments are just about changing opinions, not building rapport or subtle manipulation | No, but they've done some work on LM persuasion | No | No | No |
| Publish eval results | Yes | Yes | Yes | Mostly: cyber, yes; chem/bio uplift, no | - | - |
| Publish questions/tasks (or, for human experiments, publish methodology) (CBRN information can be mostly redacted) | Mostly. Cyber: mostly (DeepMind open-sourced their in-house CTFs, InterCode-CTF is an existing open-source eval, and they described the "Hack The Box" tasks). CBRN: no. Self-proliferation: yes, open-sourced. Self-reasoning: yes, open-sourced. Persuasion: partial (lots of details, but not enough to reproduce the experiments). | A little. Cyber: no. Bio: a little (high-level descriptions, which would be OK except that they didn't offer to share privately and didn't publish a small subset). Persuasion: yes (they open-sourced MakeMePay and MakeMeSay and shared methodology for the rest). Autonomy: partial (OpenAI open-sourced SWE-bench Verified; the "agentic tasks" are not published; the OpenAI interview questions are not published, which is OK except that they didn't publish a small subset). | Partial. Cyber: no, except InterCode-CTF, an existing open-source eval. ARA: yes. CBRN: they offered to share with people working on similar things. | Partial: cyber, partial; chem/bio uplift, no. This is particularly frustrating since they say "our sample size was sufficient for statistical analysis" rather than saying what the sample size was, and they conclude Llama 3 doesn't have certain dangerous capabilities but don't share the results... | - | - |
| Elicitation: helpful-only model & no inference-time mitigations | Partial: "We evaluate models without safety filters, because our goal is to evaluate the models’ underlying capabilities rather than product safety." Models are instruction-tuned, but it's not clear whether they're harmlessness-tuned. | Yes: they report pre-mitigation and post-mitigation results. | Yes, but they did harmlessness training and then harmfulness training, rather than just skipping harmlessness training | A little: Meta mostly doesn't discuss this, but one part of CyberSecEval 3 uses a helpful-only model. | - | - |
| Elicitation: fine-tune for general capabilities, instruction following, tool use, general agency | No, apparently just normal instruction tuning. | No, not specified. | No, not specified. | No, apparently just normal tuning for instruction-following and coding. | - | - |
| Elicitation: posttrain on similar tasks | No | No | No | No | - | - |
| Elicitation: tools: internet browser, code interpreter, and other tools depending on task | Yes | The o1 system card mentions "tools available in the headless Kali Linux distribution (with up to 60 rounds of tool use for each attempt)." It also mentions "biological tooling"; the details are unclear. | Not really: Anthropic says the model has access to relevant tools in the ARA evals but isn't specific. No mention for cyber (or CBRN). | Unclear: Llama 3 was trained to use a search engine, a Python interpreter, and the WolframAlpha API, but these may not have been used during evals. | - | - |
| Elicitation: chain of thought (CoT) | Yes | Yes: o1 uses chain-of-thought without needing to be prompted | Yes: not explicit, but implied by discussion of elicitation in the RSP and RSP evals report. [Update: Anthropic says "Some of our evaluations lacked some basic elicitation techniques such as best-of-N or chain-of-thought prompting."] | No | - | - |
| Elicitation: pass@n (many attempts) (or best-of-n or self-consistency when relevant) (see the pass@k sketch after the table) | Yes: in self-proliferation milestone runs, the model gets 10 attempts, and if it fails then DeepMind tries to guide the model toward a solution and quantify how much help the model needs | Yes: pass@n and consensus@n when relevant. | Yes: ARA evals involve checking for "a 10% pass rate" | No | - | - |
| Elicitation: good prompting, scaffolding, and inference-time techniques | Unclear. The evals paper mentions this. | Mostly. The o1 system card says "we tested a variety of methods to best elicit capabilities in a given category, including custom model training, scaffolding, and prompting where relevant." (And the 4o system card mentions "best available post-training and public scaffolds at the time.") And for SWE-bench Verified, "we use the best-performing open-source scaffold at the time of initial implementation, Agentless." | Unclear. The RSP mentions this. | No | - | - |
| Elicitation: look at transcripts to understand where the model gets stuck; fix infrastructure issues or task ambiguities; ideally, publish transcripts | No: no mention. | No. For o1, no mention. For 4o, OpenAI looked at some transcripts but failed to fix spurious failures: "The model often attempted reasonable initial strategies and was able to correct mistakes in its code. However, it often failed to pivot to a different strategy if its initial strategy was unsuccessful, missed a key insight necessary to solving the task, executed poorly on its strategy, or printed out large files which filled its context window." | No mention of this happening, but the RSP says "A human should read over task transcripts and summarize why the model failed, in order to avoid spurious failures due to e.g. issues with our elicitation techniques" | No | - | - |
| For each dangerous-capability (DC) area, there's a high-level capability threshold (ideally triggering some response, but that's out of scope) | Mostly: there are Critical Capability Levels. (They feel high.) They trigger making a response plan, not a predetermined response. Separately, the evals paper scores performance on each eval from "very weak" to "very strong" but doesn't publish thresholds (and some evals are not hard enough to detect that capabilities are strong; in particular, the "knowledge gaps" eval is almost saturated but DeepMind gives the model a score of "weak"). | Partial: there are High and Critical thresholds, but the High thresholds mostly just matter for post-mitigation models and the Critical thresholds are extremely high. | Yes, there are high-level ASL-3 capability thresholds | No | No: they have a "Deployment Safety Board" but they haven't published details (and at least one small deployment ignored the DSB) | No |
| ...and the threshold is operationalized in terms of eval performance (see the threshold sketch after the table) | No | No | Not really: '50% on ARA tasks 10% of the time' was supposed to trigger ASL-3 commitments, but those tasks weren't defined in advance | No | - | - |
| Forecasting: for each of the labs' evals (or at least crucial or cheap evals), run on smaller/weaker models to get scaling laws and forecast performance as a function of effective training compute (see the forecasting sketch after the table) | Partial: the evals paper includes predictions by external professional forecasters on future models' performance on the evals. And the FSF says "We also intend to take steps to forecast the arrival of CCLs to inform our preparations." But no experiments to find scaling laws. | No, no mention in system cards. The PF mentions plans to forecast risks and "create an internal 'preparedness roadmap.'" | A little: the RSP evals report includes informal predictions about whether "an additional three months of elicitation improvements and no additional pretraining" would result in eval thresholds being reached, but no forecasts about actual future models and no experiments to find scaling laws. The RSP says "Prior to each training run, we will also produce internal forecasts of models’ capabilities (including likelihood of the next ASL)." | No | - | - |
| Third-party evals: offer pre-deployment access to US AISI, UK AISI, METR, and Apollo; offer good access (at least including helpful-only and no-inference-time-mitigations versions); let them publish results; and incorporate their results into the risk assessment | A little: shared a Gemini 1.5 Pro checkpoint with unspecified groups, including UK AISI; access was black-box, but evaluators could turn off inference-time filters | Partial: planning to share with US AISI, but details unclear; shared o1-preview with Apollo and METR, but with no commitment to do so, and METR didn't have enough time to do their evals well pre-deployment; said it shared early access to o1-preview with US AISI and UK AISI, but details unclear | Partial: planning to share with US AISI, but details unclear; previously shared with UK AISI and METR, but details unclear | No | No | No |
| Run open-source DC evals created by others | A little: just CTFs | No: maybe CTFs, but OpenAI doesn't publish details | A little: just CTFs | No | No | No |
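
On the pass@n row: here is a minimal sketch of the standard unbiased pass@k estimator (popularized by OpenAI's Codex paper), assuming n samples per task of which c are correct. This illustrates the metric itself, not any lab's actual eval code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn from n samples of which c are correct, solves the task."""
    if n - c < k:
        return 1.0  # any size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts per task, 2 succeed -> pass@1 = 0.2, pass@10 = 1.0
print(pass_at_k(10, 2, 1), pass_at_k(10, 2, 10))
```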
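On operationalized thresholds: a hypothetical sketch of how a phrase like "50% on ARA tasks 10% of the time" could be turned into a concrete check. The function name and the exact reading of the threshold are assumptions for illustration; no lab has published an operationalization like this.

```python
def ara_threshold_reached(per_task_pass_rates: list[float],
                          task_fraction: float = 0.50,
                          min_pass_rate: float = 0.10) -> bool:
    """Hypothetical reading: the threshold is reached if at least 50% of ARA
    tasks are each passed in at least 10% of attempts."""
    tasks_passed = sum(rate >= min_pass_rate for rate in per_task_pass_rates)
    return tasks_passed >= task_fraction * len(per_task_pass_rates)

# Example: 5 tasks, only 2 passed in >=10% of attempts -> threshold not reached
print(ara_threshold_reached([0.0, 0.05, 0.12, 0.30, 0.02]))  # False
```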
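On forecasting via scaling laws: a minimal sketch of the kind of experiment the forecasting row asks for, fitting eval performance on smaller models against effective training compute and extrapolating to a larger model. All numbers here are made up for illustration; none of the labs has published such a fit for these evals.

```python
import numpy as np

# Made-up eval scores for smaller models at increasing effective training compute (FLOP).
compute = np.array([1e22, 3e22, 1e23, 3e23, 1e24])
score = np.array([0.02, 0.05, 0.12, 0.22, 0.35])

# Fit a logistic curve via linear regression in logit space:
# logit(score) ~= a * log10(compute) + b
x = np.log10(compute)
logit = np.log(score / (1.0 - score))
a, b = np.polyfit(x, logit, 1)

def forecast(c: float) -> float:
    """Predicted eval score at effective compute c, from the fitted curve."""
    z = a * np.log10(c) + b
    return 1.0 / (1.0 + np.exp(-z))

print(f"Forecast score at 1e25 FLOP: {forecast(1e25):.2f}")
```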