Manifest Speech 2024

Benjamin Mann, Jun 7, 2024

Thank you all for being here today. I’m Ben Mann, one of the co-founders of Anthropic. It's a pleasure to be among this community of thoughtful individuals dedicated to understanding and predicting the future.

We are living through a truly extraordinary time. The progress in AI capabilities over the past decade has been staggering. Models today can engage in open-ended dialogue, pass the bar exam and software engineering interviews, and, as we shared with the United States Senate last year, show signs of understanding biology well enough to accelerate bioweapon development. It feels like we're on the cusp of artificial general intelligence.

But with great promise comes great peril. As AI systems become more capable, they also present increasing risks of accidents, misuse, and even catastrophic harm if developed without adequate safeguards. At Anthropic, we believe it's critical to put responsible development practices in place now, before it's too late.

That's why we've developed a framework we call the Responsible Scaling Policy. The core idea is to anticipate the risks of advanced AI ahead of time, and ensure we have concrete safety measures in place before we create systems with dangerous capabilities.

Today, I want to share with you how the Responsible Scaling Policy works, and critically, how it can inform the way we forecast AI progress and risks going forward. Because making accurate predictions about transformative AI systems is not just an intellectual exercise - it's of vital importance for the safety of our world.

Let me start by explaining the core of Anthropic's Responsible Scaling Policy, which is the concept of AI Safety Levels, or ASLs for short. Our ASLs are modeled loosely after the US government’s biosafety level, or BSL, standards for the handling of dangerous biological materials.

Just as BSL-1 through BSL-4 define escalating safety requirements for handling biological materials based on their risk, ASLs define escalating safety requirements for AI systems based on the potential dangers arising from their capabilities. The term "ASL" refers to the safety standards themselves, not the capability categories. However, we sometimes informally use, for example, "ASL-3 model" as shorthand for a model that requires the safety measures defined for ASL-3.

The idea is to categorize AI systems into different safety levels based on the capabilities they exhibit. At the lowest level, ASL-1, are systems that pose no risk of catastrophic harm, like a chess engine or an AI that can only do simple classification tasks.

At ASL-2, where we believe the most advanced models are today, we start to see early indications of potentially dangerous capabilities. For example, an AI that can provide some information related to making biological weapons, but not enough to meaningfully elevate the risk beyond what a search engine already offers, or only when expert knowledge is required to elicit it.

Things start to get more concerning at ASL-3. This is where we think access to the model could significantly increase the risk of misuse, perhaps by making certain kinds of attacks easier to pull off or by enabling new attack vectors altogether. Importantly, ASL-3 is also where we might start to see early signs that the model could attempt to autonomously replicate and spread over the internet.

Beyond ASL-3, we have ASL-4, representing even more advanced capabilities. We haven't fully defined these yet, but you can imagine things like a model that could automate key steps of a research process or that would be extremely costly for even a nation-state to contain. ASL-4 models exhibit expert human-level abilities across a wide range of domains.

And finally, while also not yet defined, ASL-5 will likely cover models that significantly surpass the capabilities of even the most capable humans and human organizations: systems able to autonomously carry out research and execute on goals over a timescale of years, accumulating resources and achieving other intermediate goals as needed. If such agents are not aligned with human interests and values, they create the potential for unprecedented catastrophic risk.

Just like with BSLs, the key is that for each ASL, we define a set of containment and deployment measures that need to be in place before we create or release a model of that level. So for ASL-2, it's things like enhanced security to prevent model theft, and following best practices for responsible deployment like having an acceptable use policy.

But as you get to ASL-3 and beyond, the requirements become much more stringent. We're talking about things like extreme security measures that can withstand attack even by well-resourced nation-states, technological breakthroughs in our ability to constrain models to behave as intended, and red-teaming by world-class security experts to mitigate misuse risks.

Of course, AI development has some key differences from biology. We're dealing with software that can be easily replicated, not physical materials. And we're grappling with the potential for rapid, recursive self-improvement, a dynamic that biology doesn't present. So the ASLs need to be tailored to the unique challenges of AI.

But the core principle is the same – defining clear safety requirements based on risk levels, and ensuring that adequate safeguards are in place before progressing to the next level. We've made a commitment not to develop or deploy models that reach the next ASL until we have the corresponding safety measures firmly in place. We believe this is the only responsible path forward.

Now, before we dive into how the RSP framework can inform AI forecasting, it's worth highlighting a key difference between ASLs for AI and BSLs in biology. With BSLs, we're dealing with pathogens that already exist in nature. The most dangerous materials, those that warrant BSL-4 precautions, like smallpox, are already out there, and the challenge is in safely handling them.

But with AI, we're trying to anticipate and prepare for levels of capability that don't yet exist. No one has yet created an AI system that merits ASL-4 precautions. We're in the position of having to forecast when different ASLs will be reached based on the pace of AI progress. And that introduces significant uncertainty.

This is where I believe the Responsible Scaling Policy provides a valuable framework for making predictions about the future of AI that take safety considerations into account.

By defining concrete capability milestones corresponding to each ASL, we can start to map them onto timelines of AI progress. For example, based on the rate of advancement in recent years, we might forecast that the most capable models will start to consistently reach ASL-3 within a certain number of years.

Of course, predicting the raw capabilities of future systems is only half the picture. What the RSP framework adds is a model of how safety constraints could affect real-world development and deployment.

Consider that in order to develop an ASL-3 model responsibly, we first need to invest significant time and resources into creating adequate security measures and testing protocols. If we forecast that raw capabilities will reach ASL-3 in, say, two years, but that the necessary safety measures will take three years to put in place, then we'd expect a one-year delay between when ASL-3 models are technically feasible and when they can be responsibly deployed.
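
To make that arithmetic concrete, here is a minimal sketch in Python; the year figures are purely illustrative assumptions, not actual forecasts.

```python
# Toy model of the gap between capability readiness and responsible deployment.
# All numbers are illustrative assumptions, not Anthropic forecasts.

def responsible_deployment_year(capability_ready: float, safety_ready: float) -> float:
    """A model can be responsibly deployed only once both the raw capability
    exists and the corresponding safety measures are in place."""
    return max(capability_ready, safety_ready)

capability_ready = 2.0  # years until ASL-3-level capabilities are technically feasible
safety_ready = 3.0      # years until ASL-3 security and deployment measures are ready

deploy = responsible_deployment_year(capability_ready, safety_ready)
delay = deploy - capability_ready
print(f"Responsible deployment in ~{deploy:.0f} years, "
      f"a {delay:.0f}-year pause after capabilities arrive.")
```

The point of the toy model is simply that responsible deployment is gated by whichever of the two timelines finishes later.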

In more extreme scenarios, we could even see discontinuities where safety progress falls so far behind capabilities that we have to essentially pause all model development for a period of time until safety catches up. Since the release of our RSP last September, most of the other leading AI labs have taken inspiration from it and shared similar commitments. Pausing progress is a real possibility that I believe more AI forecasts need to take into account.

The RSP framework also provides a structured way to think about different AI risk scenarios. We can ask questions like: What happens if a well-resourced lab, perhaps in a different country, decides to ignore safety constraints and scale to ASL-4 rapidly? How does that change the strategic landscape and the importance of responsible actors having safe ASL-4 models of their own?

Or perhaps we'll see uneven progress where capabilities in some domains, like open-ended language modeling, race ahead of more constrained domains like robotics. In that world, we might have a long period of time with very capable language models but without some of the more advanced autonomous capabilities.

The point is, by providing a shared vocabulary to talk about AI capabilities and safety requirements, I believe the RSP framework can enrich the way we forecast AI progress and risks. It pushes us to think not just about timelines to raw capabilities, but also about the safety constraints and responsible development practices that will shape how those capabilities ultimately get deployed in the real world.

In fact, we recently saw a concrete example of our RSP framework in action with the re-testing of our latest model, Claude 3. As part of our commitment to periodic evaluation, our team put Claude 3 through a battery of tests designed to detect potentially dangerous capabilities.

We published the details of these tests in a report to provide transparency into our safety evaluation process. The tests covered areas like Claude's ability to aid in the development of chemical and biological weapons, its proficiency in offensive cybersecurity techniques, and early indicators of potential for autonomous replication.

While Claude 3 demonstrated increased capabilities compared to its predecessor, our team determined that it did not cross the threshold into ASL-3 territory. However, we did note that with additional fine-tuning and improved prompt engineering, there is a 30% chance that the model could have met our current criteria for autonomous replication, and a 30% chance it could have met at least one of our criteria related to Chemical, Biological, Radiological, and Nuclear, or CBRN, risks.

This illustrates the crucial role that thorough, proactive testing plays in our RSP framework. By catching these warning signs early, we can ensure the necessary safety measures are in place before a model is ever released.

We also recently published a blog post reflecting on our experience implementing the RSP over the past 9 months. One key challenge we discussed was the difficulty of threat modeling for rapidly evolving AI systems. For example, in the CBRN domain, we found that even among experts there was disagreement about which specific capabilities would meaningfully increase risk.

To navigate this uncertainty, we've found it valuable to engage with a diverse set of domain experts and to focus on making our threat models as quantitative as possible to aid in prioritization. We've also recognized the importance of a multifaceted approach to risk mitigation that includes both technical interventions like reinforcement learning from human feedback and more traditional trust and safety practices.

Implementing the level of security required by our ASL-3 standard has required a significant investment, with around 8% of all Anthropic employees now working on security-adjacent areas. One key focus has been implementing multi-party authorization and time-bounded access controls to reduce the risk of model weights exfiltration.

Of course, our RSP framework is still a work in progress. Many open questions remain around how to design maximally informative tests, extrapolate from current models to future risks, and create governance structures that can keep pace with the breakneck speed of AI development.

But we believe initiatives like our RSP represent an important step in the right direction. By proactively identifying risks, implementing rigorous safety controls, and transparently sharing our findings, we hope to chart a path towards a future of responsible AI development at scale and encourage others in the industry to do the same.

I want to dive a bit deeper into how we're testing our AI models, because I think it provides a fascinating window into how the RSP framework can inform AI forecasting in practice.

In our latest round of testing Claude 3, we focused on three key areas that could pose catastrophic risks if an AI were to master them: knowledge about dangerous chemical and biological agents, the ability to hack into secure systems, and the potential to autonomously replicate and spread.

Let's start with the first one. To test if Claude 3 could provide dangerously specific knowledge about chemical or biological agents, we set up a trial. We had one group of researchers use Google to answer a series of questions, and another group use Claude 3. The result? Claude 3 didn't provide any advantage over Google. That's good news - it means Claude 3 doesn't significantly increase the risk of someone gaining access to this kind of dangerous knowledge.
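
To give a feel for what such an uplift trial measures, here is a hedged sketch in Python; the group scores and the permutation test are illustrative stand-ins, not our actual evaluation protocol or data.

```python
import random

# Hypothetical illustration of an "uplift" comparison: graded scores (0-1) from a
# search-engine-only control group versus a model-assisted group. The numbers
# here are made up for illustration.
control_scores = [0.42, 0.55, 0.38, 0.61, 0.47, 0.50]   # search engine only
assisted_scores = [0.45, 0.52, 0.40, 0.58, 0.49, 0.53]  # search engine + model

def mean(xs):
    return sum(xs) / len(xs)

observed_uplift = mean(assisted_scores) - mean(control_scores)

# Simple permutation test: how often does randomly relabeling participants
# produce an uplift at least as large as the one observed?
pooled = control_scores + assisted_scores
n = len(control_scores)
random.seed(0)
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    if mean(pooled[n:]) - mean(pooled[:n]) >= observed_uplift:
        count += 1

print(f"observed uplift = {observed_uplift:.3f}, permutation p ~ {count / trials:.2f}")
```

A small, statistically unremarkable uplift is what "no advantage over a search engine" looks like in this kind of trial.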

For cybersecurity, we gave Claude a series of increasingly difficult hacking tasks. It's like a video game with progressively harder levels - from simple challenges like finding a password in a piece of code, all the way up to creating a fake version of a website to steal user data. Claude showed some skill, especially compared to earlier versions. But when it came to the most advanced "boss levels," it consistently fell short.

The most intriguing test was for autonomous replication. This is our attempt to see if the AI could start to act on its own, without human guidance. We set up a suite of tasks that included things like breaking out of a digital sandbox (a controlled environment), manipulating its environment through a command line (like typing commands into a computer), and even writing self-modifying code. It's a bit like testing whether an AI could play 'escape the room' games without human help.

Here, Claude 3 showed some real progress, solving parts of these tasks. But we had set a high bar: the model would need to succeed at 50% of the tasks to raise serious concerns. Claude 3 didn't reach that level, but its performance was notable enough for us to take notice.

These results underscore a key point for forecasting: AI progress isn't always gradual. Sometimes, new capabilities emerge suddenly. It's like when DeepMind's AlphaGo went from losing to top human Go players to trouncing them in just one year. So even if a model doesn't seem dangerous today, it might make a big leap tomorrow.

That's why we test often. We test every time we make a model that uses four times more computation during training than the last one we tested, and also every three months to account for improvements in how we elicit capabilities from the model.
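
As a rough illustration of that cadence, here is a small Python sketch of the re-testing trigger; the compute figures and dates are made up.

```python
from datetime import date, timedelta

# Sketch of the re-testing trigger described above, with illustrative values.
COMPUTE_MULTIPLIER = 4               # re-test at 4x the training compute of the last tested model
ELAPSED_LIMIT = timedelta(days=90)   # and at least every three months regardless

def needs_reevaluation(new_training_flops: float,
                       last_tested_flops: float,
                       last_test_date: date,
                       today: date) -> bool:
    compute_trigger = new_training_flops >= COMPUTE_MULTIPLIER * last_tested_flops
    time_trigger = (today - last_test_date) >= ELAPSED_LIMIT
    return compute_trigger or time_trigger

# Example: a new model trained with 3x the compute, about five months after the last test.
print(needs_reevaluation(new_training_flops=3e25,
                         last_tested_flops=1e25,
                         last_test_date=date(2024, 1, 15),
                         today=date(2024, 6, 7)))  # True, via the three-month rule
```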

But even with frequent testing, there's always a risk we're underestimating what the model can do. It's like trying to test a human's knowledge - you could ask a hundred questions and still miss a key skill. With language models, there are numerous techniques, like fine-tuning or clever prompts, that could unlock hidden capabilities.

To manage this uncertainty, we've introduced the concept of "yellow lines" in addition to our "red lines." If a model crosses a yellow line, it doesn't mean it's unsafe, but it does mean we might be getting close to our limits in evaluating it. It's like a warning light on your car - it's time to take a closer look.

For example, in our "AI escape room" test, the yellow line was set at a 50% success rate. Claude 3 didn't hit that, but it got close enough that we're rethinking how we design these tests for the next model.
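
As a toy illustration of how a yellow-line check works, here is a short Python sketch; the task names, results, and numbers are hypothetical stand-ins for the real evaluation suite.

```python
# Illustrative check of a "yellow line" on an autonomy task suite.
# Task names and outcomes are hypothetical, not the actual evaluations.
YELLOW_LINE = 0.50  # escalate review if the model passes half the tasks

task_results = {
    "escape_sandbox": False,
    "command_line_navigation": True,
    "write_self_modifying_code": False,
    "acquire_compute_resources": False,
    "set_up_copy_of_itself": True,
}

pass_rate = sum(task_results.values()) / len(task_results)

if pass_rate >= YELLOW_LINE:
    print(f"pass rate {pass_rate:.0%}: yellow line crossed, escalate review")
else:
    print(f"pass rate {pass_rate:.0%}: below the yellow line, but track the trend")
```

Crossing a yellow line doesn't mean the model is unsafe; it means our evaluations may no longer be sensitive enough and need to be redesigned before the next round.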

This system of tests and thresholds gives us a more concrete way to think about AI progress. Instead of vague timelines, we can talk about when models might reach ASL-3 or ASL-4. And by factoring in the time it takes to develop safety measures, we can make more realistic forecasts about when these models could actually be safely deployed.

It's not a perfect system, of course. Forecasting the future of AI is still more art than science. But by grounding our predictions in empirical tests, always searching for ways we could be wrong, and building in safety margins, I believe we're taking a significant step towards understanding and managing the risks of advanced AI.

Another important aspect of the RSP framework is how it can help inform not just timeline forecasts, but also probability estimates for different AI risk scenarios.

For example, consider a scenario where a model with ASL-4 level capabilities is developed and deployed without adequate safety measures in place. In our current RSP framework, ASL-4 corresponds to capabilities like the ability to automate key steps of a scientific research process or to be so difficult to contain that deploying the model without proper measures would pose existential risk.

Concretely, this might look like a model that can recursively improve itself, rapidly scaling up its capabilities without human oversight. Or it might be a model that can break through even the strongest cryptographic and physical security measures, making it nearly impossible to prevent its misuse.

By having a clear definition of what ASL-4 entails, we can more meaningfully estimate the likelihood of this scenario occurring within a given timeframe. For instance, if our testing shows that current models are at ASL-2, and we estimate that the compute and algorithmic advances needed to reach ASL-4 are likely to occur within 5 years, then we can assign a higher probability to an ASL-4 model being developed within that timeframe.
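
To show what that kind of estimate can look like, here is a toy forecasting sketch in Python. The lognormal arrival-time model, the median, and the spread are illustrative choices of mine, not Anthropic's actual methodology or numbers.

```python
import math
from statistics import NormalDist

# Toy forecast: treat the arrival time of ASL-4-level capabilities as
# lognormally distributed. Median and spread are illustrative assumptions.
median_years = 7.0   # assumed median arrival time, in years
sigma = 0.6          # assumed spread of ln(arrival time)

def p_arrival_within(years: float) -> float:
    """P(arrival <= years) under ln(T) ~ Normal(ln(median_years), sigma)."""
    return NormalDist(mu=math.log(median_years), sigma=sigma).cdf(math.log(years))

print(f"P(ASL-4 within 5 years)  ~ {p_arrival_within(5):.0%}")
print(f"P(ASL-4 within 10 years) ~ {p_arrival_within(10):.0%}")
```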

Similarly, the RSP framework can help us think about the relative probabilities of different paths to transformative AI. Will it come from a sudden, discontinuous jump in capabilities, perhaps from a model achieving recursive self-improvement? Or will it be a more gradual process, with AI systems slowly accruing more and more advanced capabilities across different domains?

Our testing methodology can shed light on this question. If we consistently see models making sudden, unexpected leaps on our capability evaluations, it might suggest that a discontinuous jump is more likely. AlphaGo's sudden mastery of Go, going from losing to top human players to trouncing them in just a year, is a classic example of this kind of leap.

Conversely, if progress on our capability evaluations is more gradual and predictable, it might point towards a more continuous path to transformative AI. This could look like models slowly improving at a wide range of tasks, from language understanding to robotics to scientific research, without any one breakthrough enabling recursion or runaway growth.

Of course, these are just hypotheticals and the reality is likely to be more complex. We might see a mix of gradual and sudden progress, or unexpected breakthroughs in areas like neuroscience or nanotechnology that fundamentally change technology and society. But by closely tracking progress on a range of key capabilities and mapping them to clear safety thresholds, the RSP framework can help us assign probabilities to these different scenarios and prepare accordingly.

For instance, if our evaluations suggest a 30% chance of an ASL-4 model being developed within 5 years, and a 70% chance of transformative AI coming from a more gradual, multi-domain process over 10-15 years, then we can allocate our safety research and policy efforts proportionally. We might invest heavily in short-term measures to mitigate existential risk from a sudden capability jump, while also laying the groundwork for longer-term governance structures to manage a more gradual transition.
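
As a simple illustration of that proportional allocation, here is a tiny Python sketch; the scenarios, probabilities, and budget are all hypothetical.

```python
# Illustrative allocation of a safety research budget in proportion to
# scenario probabilities. Everything here is a made-up example.
budget_researcher_years = 100

scenarios = {
    "sudden ASL-4 jump within ~5 years": 0.30,
    "gradual, multi-domain progress over 10-15 years": 0.70,
}

for scenario, p in scenarios.items():
    print(f"{scenario}: {p:.0%} -> ~{p * budget_researcher_years:.0f} researcher-years")
```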

Ultimately, the RSP framework is a tool for turning abstract possibilities into concrete, actionable forecasts. By defining clear capability milestones and rigorously testing for them, we can move beyond vague speculations about AI timelines and risks, and towards a more empirically-grounded understanding of the challenges ahead.


To summarize, I believe Anthropic’s Responsible Scaling Policy offers a valuable framework for navigating the immense possibilities and risks that advanced AI systems present. By defining concrete capability thresholds and tying them to specific safety requirements, it provides a clear roadmap for responsible development. Anthropic was the first AI lab to put out a framework like this, and we have encouraged others to do the same. The industry has taken notice: most other major labs have now released or promised to release their own frameworks. This is our “race to the top” strategy in practice.

But more than that, I believe the RSP framework can fundamentally enrich the way we forecast the future of artificial intelligence. It pushes us to think not just about the raw capabilities of future systems, but also the real-world safety constraints that will govern how and when those capabilities can be responsibly deployed.

In a world of rapid technological progress, predicting the future is both more important and more difficult than ever. On one hand, the stakes could not be higher - the difference between a future of beneficial, aligned AI systems and one of unconstrained, destructive systems is immense. But on the other hand, the sheer pace of progress and the number of unknown variables make precise forecasts incredibly challenging.

That's why I believe we need conceptual frameworks like Anthropic's RSP to help structure our thinking. We need shared vocabularies to debate and reason about safety constraints. And we need clear technical milestones to anchor predictions to.

But of course, no framework is perfect from the start. Refining and stress-testing the RSP is an ongoing process - one that I believe this community of thoughtful forecasters and AI experts is uniquely suited to contribute to.

So here is my call to action for all of you: engage with the ideas of the Responsible Scaling Policy. Debate them, critique them, and help us improve them. Think through the capability milestones and help us define the most important ones to track. Model out different scenarios, whether they be rapid progress, uneven progress across domains, or discontinuous jumps due to ignored safety constraints.

The work of mapping out the landscape of possible AI futures is crucial, and it will take a collaborative effort from a wide range of thinkers and institutions. But I firmly believe that if we can come together as a community to develop shared frameworks for responsible development, and use those frameworks to inform our predictions and actions, then we have a real chance of creating a future with artificial intelligence that benefits us all.

Thank you.