Alignment report reviewer questions
Instructions for reviewers:
The author is happy to answer any questions you have over email, or on a call, while you are reading the report. Please feel free to reach out to him at joseph@openphilanthropy.org if that would be helpful to you.
Is the report’s main project framed in a way that makes sense to you? If not, what is unclear or confusing?
The report’s project makes sense to me.
The argument has several different stages, so I find it a little hard to condense into (e.g.) a single paragraph. Here’s my best attempt at a summary of the different stages, though. I’m doing some modest reframing, and modest re-defining, partly because this should make it more transparent if my interpretation of the argument diverges from the intended interpretation.
JC: I appreciate the thoughtful summary. This is basically in line with my intended interpretation, though a few bits (e.g., thinking about planning in terms of simulations in particular) are somewhat more specific than I’d have in mind.
I found the report much clearer than almost all other writing on risks from misaligned AI.
There are still some important concepts that I think are fuzzier than they ideally ought to be, but I think some amount of fuzziness is inevitable in a case like this: it’s just very hard to talk about still-hypothetical AI systems that might be unlike anything we’ve seen yet.
The place where I think some further clarification would be most valuable is in the report’s description of the “permanent disempowerment of ~all of humanity.” This is a really important idea, since it’s the central thing that the report is trying to predict, but I still feel fairly fuzzy about what kinds of outcomes qualify. There are some scenarios that seem to obviously qualify: for example, scenarios in which humans are simply killed off, or scenarios in which humans are essentially imprisoned or severely persecuted by a purely AI-based government that takes no inputs from humans. But I’m not sure if these are the modal scenarios being imagined, and I’m not sure what other scenarios count.
JC: Those are close to the modal scenarios I imagine, though I’d also add in scenarios where humans aren’t actively “persecuted,” but rather are simply deprived of resources or opportunities to pursue their ends, especially on the types of cosmic scales that I expect would’ve otherwise been possible.
Here, for example, is one case that illustrates the ambiguity for me: Would the world described in Robin Hanson’s Age of Em count as a world in which ~all of humanity has been unintentionally and permanently disempowered? (Although it involves emulations rather than AI, we can imagine that the ems were instead created through very good imitation learning.) I could see a case for answering either “yes” or “no.”
JC: Assuming that the ems possess ~all the power in this world, to me this question effectively amounts to the question of (a) are the ems human (I tend to think of literal emulated humans as “humans” in the sense relevant to the empowerment of human values), and (b) if not, how intentional was the transition to a “the ems have all the power” world.
Re: (b), I do think that "there was a bunch of consensual positive sum trade that resulted in the ems taking over, despite the fact that most humans wouldn't have wanted this on reflection" is kind of an edge case with respect to the report’s definitions of “disempowerment” – though one that could well be relevant in practice to different degrees. And we can imagine non-EM scenarios that are similar. See also Grace on “misuse vs. misalignment” for more edge cases.
One reason I find it hard to assign a probability to the report’s central hypothesis is just that I don’t feel like I have a great grip on it conceptually: I’m not sure how to imagine what the world is like, how the misaligned AI systems would actually be exerting power, and (partly as a result) what the path to this world would plausibly look like.
(I've written up some related thoughts that may be relevant here.)
Re: the linked discussion, I think my view is probably closest to: “If we ‘mess up on AI,’ then even the most powerful individual humans will have unusually little influence over their own lives or the world around them.” I might say something like: “If we mess up on AI, then humans will exert very little resource-weighted influence over the future relative to what would’ve been possible.”
---
I generally found the report to be very reasonable and balanced. It feels like it’s giving many different arguments and counter-arguments a fair shake, rather than arguing for one particular conclusion in a lawyer-like way.
---
I’m unsure what to say about how “convincing” the report is. I agree with a large portion of the report. Joe’s credence is also certainly closer to mine than it is to those of the most pessimistic analysts of AI risk. At the same time, I remain somewhat more “optimistic” than Joe.
The least convincing part of the argument, for me, is the discussion of instrumental convergence and its significance.
The section of the report that caused the largest update in my views was the discussion of reasons why people might use systems they know to be misaligned. I’m sure I would have updated my views more on other points as well, if I had not already encountered many of the report’s points previously in earlier drafts or conversations with Joe.
The report focuses on AI systems with three key properties (the author calls these “APS systems”):
I think this framing makes a reasonable amount of sense. I think that the concept of an APS system is a large improvement over possible alternative concepts like “a transformative AI system” or “AGI.” I also like a lot of the conceptual discussion.
However, I do still have some concerns about the framing.
---
A high-level concern and suggestion:
The concept of an APS system feels a bit inelegant to me, given that it has three separate components that need substantial unpacking. One way to streamline the definition might be to combine the “advanced capability” and “strategic awareness” components. For example, something closer to the following definition might be preferable:
An APS system can accurately predict that taking certain actions would likely grant it very significant, long-lasting power over its users.
This definition implies that the system plans, has strategic awareness, and possesses capabilities that are sufficient to obtain truly worrisome forms of power. So it is roughly equivalent to the definition given in the report. However, it feels a little neater. It also suggests a relatively neat way of describing one challenge that future AI developers may face: they may need to create AI systems that can figure out how to seize power but choose not to.
JC: I had meant for the "advanced" to index capability levels to today's world, to leave room for the idea that you get APS systems but in a competitive context where everyone has amped up their defenses etc such that the APS systems don’t actually get that much power (even though the systems in question would’ve been relatively powerful in 2021).
Also, I think that strictly speaking, oracle-like systems (e.g., systems that can generate plans but aren’t actually using their planning capacities in pursuit of objectives) would satisfy the definition above; whereas I’d prefer to exclude them.
That said, I do think “they may need to create AI systems that can figure out how to seize power but choose not to” is a nice framing.
---
More specifically, I have some concerns about the treatment of the three key properties used to construct the current definition. In brief, my concerns are:
(a) I think the distinction between relatively concrete/unified systems (e.g. GPT-3) and highly modular/abstract systems (e.g. a large collection of neural networks that collectively replicate the activities of a large corporation) might be really significant. For that reason: I think it’s possible that the report should specifically focus on misalignment in relatively concrete/unified systems, and then briefly discuss highly modular/abstract systems as a secondary case, rather than framing the discussion in a way that is (I think?) largely meant to apply to both cases at once.
(b) I think it would be useful for the report to say more about why planning is a “key property” of APS systems. I agree that planning is very important, but I think I disagree about exactly why planning is so relevant to AI risk. I think this difference in perspective may help to explain why I find some considerations more or less compelling than Joe does.
(c) I generally feel nervous about relying a lot on the concept of an AI system’s “objectives,” as it’s evoked in the description of APS systems. I think it may be importantly misleading, for example, to think of the challenge of creating aligned systems as a matter of ‘giving them the right objectives.’ I think my concerns about ‘objective talk’ might also partly explain why I’m not convinced of the report’s version of the Instrumental Convergence Hypothesis.
Here are my qualms in almost certainly too much detail:
JC: I agree re: “I think it’s notable that the relevant ‘tasks’ listed (e.g. engineering) mostly aren’t performed by individual humans.”
JC: I'm also not sure. This seems like another candidate place where the abstractions might mislead. For example, it seems like Google is an "agent" in some sense, but it's definitely not a "really big/powerful human," and if you start trying to predict its behavior, the effect of different interventions, etc as though it were just a really big version of the type of thing a human is, I think you probably go wrong in lots of cases.
For example:
JC: I agree, though you could just end up with both difficulties.
JC: This seems to me like an important and useful point, especially with respect to the current ML-focused alignment discourse, in which one often imagines a "training procedure" that results in good performance on some "objective function," as opposed to e.g. putting a bunch of parts you kind of understand together.
That said, it feels to me like in worlds where you don't know how to hand-code basic cognitive tasks and so have to use ML for them, it also gets harder to "hand-build" complex combinations of modular systems to do useful cognitive work: probably a lot of the useful cognitive work is happening "inside" the modular systems. But maybe you could keep that cognitive work e.g. myopic or somehow more limited, such that you have fewer alignment problems with your individual parts, and then turn your big thing into the agentic planner you ultimately need.
JC: I agree.
JC: One hesitation I have with just focusing on non-modular systems is that I don't think people should say "I think it'll be modular, therefore my world-model has none of the problems Joe envisions." Rather, I think the right reaction is something more like yours: "I expect lots of modularity, and that maybe this helps make various things easier even if it doesn't eliminate all of the risk."
And I don’t actually think that "modularity" on its own is very important. For example, if trained neural networks turn out to be very modular at some level (e.g., different parts of the network end up internally optimized for different tasks), then unless this translates into some difference in our ability to understand/control the systems, I don't think this is an important thing to focus on.
JC: I think that we’re working with different concepts of planning, here. For example, below you say that you can’t see how a feedforward network with frozen weights could engage in planning -- a comment I wouldn’t endorse. This suggests to me that you’re putting a (strange-to-me) amount of emphasis on the simulations “changing an agent’s policy” -- where policy is understood as its overall input-output mapping conditional on holding something (e.g., the weights) fixed. Whereas I’m thinking of planning centrally as something that determines an agent’s decision about what to do in a particular case, even if it doesn’t change some particular input-output mapping overall. Thus, for example, a fixed-weight feedforward system could still run a bunch of simulations about what will happen if it performs different actions, and then decide what to do on this basis -- and this would count as planning, to me (and I expect to many).
I’m not sure why we would focus on the simulations “changing the policy” in your sense in particular, as opposed to just determining the decision. Presumably, after all, a fixed-weight feedforward network could do whatever humans do when we plan trips to far away places, think about the best way to cut down different trees, design different parts of a particle collider, etc -- and this is the type of cognition I want to focus on.
JC: As above, I don’t just care about planning that goes into “producing the policy” (however the line around “the policy” is drawn -- I’m not actually sure what you have in mind). I also care about planning that the policy itself performs. E.g., if solely via giving your system real-world experience (rather than having it learn from running lots of simulations), you end up with a system whose policy involves planning out different world-take-over schemes in response to different circumstances, then that’s worrying, regardless of the policy’s origins.
JC: Is this sentence missing a “not”? Currently it’s contrasting “produced by planning processes” and “produced through planning processes.”
JC: I expect that the way forward on this would be to sync up on how to think about the concept of planning that matters. To my mind, what matters about planning isn’t something about “how the system’s overall input-output function is created” in the sense your comments are focused on; rather, it’s something about whether the system makes decisions about what to do by modeling the effects of its actions on the world, in order to better achieve objectives -- whether doing so involves updating its overall input-output function or leaving it fixed. In particular, systems modeling the world in order to achieve objectives, where such models of the world involve strategic awareness, are by (intentional) definition in a position to notice and respond to instrumental incentives to gain and maintain power. That’s the central reason I focus on planning in this sense (though I also think that planning is likely important to an AI system’s ability to successfully gain power over humans as well).
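To make the contrast concrete, here is a minimal, purely illustrative sketch (all names hypothetical, not from the report) of planning in the sense JC describes: decision-time simulation against an internal world model, with no change to the system’s fixed input-output mapping.

```python
# Minimal illustrative sketch (hypothetical names): a system whose parameters
# are completely fixed can still "plan" in this sense, by simulating the
# predicted outcome of each candidate action with an internal world model and
# choosing among them -- no update to its overall input-output mapping occurs.

def plan(state, candidate_actions, world_model, evaluate):
    """Return the action whose simulated outcome scores best under `evaluate`."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        predicted_outcome = world_model(state, action)   # run a simulation
        score = evaluate(predicted_outcome)              # rate it against objectives
        if score > best_score:
            best_action, best_score = action, score
    return best_action

# Toy usage: "walking" moves the agent closer to a goal at position 3.
world_model = lambda state, action: state + (1 if action == "walk" else 0)
evaluate = lambda outcome: -abs(outcome - 3)             # prefer being near 3
print(plan(0, ["wait", "walk"], world_model, evaluate))  # -> "walk"
```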
JC: As I indicate in the report, confusion about the notion of “pursuing objectives” is one of my top candidates for the ways in which the abstractions employed might mislead. I do think focus on “behavior” can mitigate this somewhat (though I think it can also lose the predictive value that talk of “objectives” provides), and that couch-potato-like systems are a helpful example to keep in mind.
I agree that a couch-potato is in fact a strategically aware agentic planner. I also think that a couch-potato is well-described as having objectives: e.g., they want to hang out, eat snacks, watch good movies, avoid paying taxes, avoid hassles like going to the gym, etc. And they make decisions with such objectives in mind. (And note that a couch potato will in fact seek various kinds of power if given suitable opportunity. For example, if you offer a couch potato a billion dollars if they do a few jumping jacks, they will generally do it. This should make us wonder what the couch potato would do if certain types of power-seeking became suitably low-cost.)
But the question isn’t whether it’s hard to build an agentic planner as non-threatening as a couch-potato. The question is whether it’s hard to build an agentic planner as non-threatening as a couch-potato that also significantly outperforms humans at tasks like science, engineering, business/military/political strategy, hacking, and social persuasion/manipulation, and so on. And now, the worry is, you’ve started throwing around levels of cognitive capability that a standard human couch potato isn’t working with, and which could render its pursuit of its objectives much more worrying (e.g., compare “how hard is it to build a non-threatening couch potato” to “how hard is it to build a non-threatening superintelligent couch potato” -- for the latter, I think that we do in fact have to start thinking about what happens when extreme levels of optimization power are applied to the objectives in question, in a way that we didn’t when we were talking about weaker systems).
That said, it does seem like humans who are really good at e.g. science, engineering, and so forth can be couch-potato-like in many respects – so it doesn’t seem that hard to imagine increasing their science/engineering abilities without doing much to their power-seeking.
TIMELINES: By 2070, it will become possible and financially feasible to build APS systems.
30%
The author assumes that there are strong incentives to automate advanced capabilities, and discusses three reasons we might expect incentives to push relevant actors to build agentic planning and strategically aware systems in particular, once doing so is possible and financially feasible:
I agree with the conclusion, although “available techniques” is definitely the main justification I find compelling.
Although this may be a pedantic/semantic point, at some level, I don’t think that systems that can plan are inherently more useful than systems that cannot plan. Any given useful behavior can, in principle, be produced through learning alone. I think there is nothing that a planner can do that a learner cannot, if the learner has enough experience to learn from. For example: With enough experience, a version of AlphaGo without any planning capabilities could match the skill of the standard version of AlphaGo.
JC: Flagging that I think the distinction between “planning as a means of learning a policy” (your focus) vs. “planning as a thing that a policy can do” (my focus) is coming in here as well. I think it likely that the latter is basically required for certain tasks. To take some extreme and science-fictiony examples, consider e.g. “show up on a new planet and figure out how to build a working car out of the weird new materials that the planet provides within X amount of time”, or “figure out how to make nano technology using X time and Y resources, where no one has ever made nano-tech before.” To me it’s just really hard to see how you do this type of thing without relying on detailed models of your environment, predicting the results of your actions, evaluating these actions on the basis of criteria, and so on. Hence my focus on “usefulness.”
I do think, though, that it will often be dramatically less efficient to produce AI systems that meet a given performance standard if you do not allow them to plan. It would be extremely difficult, in many different domains, to give an RL system enough real-world experience to remove the need for simulated experience. It also seems impractical to hand-code sufficiently good simulations for every relevant domain. This means that it should often be very valuable to allow AI systems to make use of (at least partly) AI-generated simulations, when they are working out how to perform well. Depending on exactly how we define “planning,” it will normally be natural to describe AI systems that use these simulations as “planners.”
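As a rough illustration of this efficiency point, here is a minimal sketch (hypothetical setup, loosely in the style of Dyna-Q, not anything from the report) of an agent that supplements scarce real-world experience with extra updates replayed from a learned model of its environment.

```python
import random
from collections import defaultdict

# Minimal Dyna-style sketch (hypothetical setup): the agent updates its action
# values from each real transition, then performs extra "planning" updates by
# replaying simulated transitions drawn from a learned model -- which can
# greatly reduce the amount of real experience needed for a given skill level.

q = defaultdict(float)        # action values, keyed by (state, action)
model = {}                    # learned model: (state, action) -> (reward, next_state)
alpha, gamma, n_planning = 0.1, 0.95, 20
ACTIONS = ["left", "right"]

def td_update(s, a, r, s_next):
    best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])

def learn_with_planning(s, a, r, s_next):
    td_update(s, a, r, s_next)                  # learn from the real transition
    model[(s, a)] = (r, s_next)                 # remember it in the world model
    for _ in range(n_planning):                 # plan: replay simulated experience
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        td_update(ps, pa, pr, ps_next)
```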
I’m not sure that planning is a natural “byproduct of sophistication.” I think that some development processes (e.g. feedforward networks updated using standard learning algorithms) probably simply don’t allow planning to emerge, even though they could in principle produce indefinitely sophisticated systems. I think it’s possible that really basic forms of recurrence within a model are sufficient to enable the emergence of powerful planning capabilities, but I’m not at all confident. I think that the most relevant form of empirical evidence here would be evidence that certain models perform dramatically better if given more ‘time to think.’
If planning is entirely emergent, it seems like it would probably be very hard to prevent planners from developing strategic awareness in certain circumstances. If planning is not emergent, with engineers intentionally crafting planning processes, then I am less confident that the threatening forms of strategic awareness (e.g. the ability to understand and predict that killing your overseer would prevent them from turning you off) would be unavoidable. However, even in the best case, avoiding these forms of strategic awareness still seems like it would probably be very difficult.
INCENTIVES: By 2070, and conditional on TIMELINES above, there will be strong incentives to build APS systems.[2]
80%.
In this section, the author discusses the hypothesis that for APS systems, misaligned behavior on some inputs (where this behavior involves agentic, strategically-aware planning in pursuit of problematic objectives) strongly suggests misaligned power-seeking on some inputs, too (in brief, the central argument here is that power is useful for pursuing a wide variety of objectives).[3] He then discusses different challenges to ensuring that APS systems are practically PS-aligned. In particular, he considers two key challenges to controlling a system’s objectives adequately:
He also discusses the possible role of:
Finally, he discusses three ways that ensuring the safety of APS systems seems unusually difficult, relative to safety challenges posed by other technologies. Namely:
The author concludes that ensuring the practical PS-alignment of APS systems could well be difficult.
This is the section of the report I find least convincing. Here are a few disagreements or differences in point of view, again likely in too much detail:
On the Instrumental Convergence Hypothesis:
I do believe that the traditional conceptual version of the “Instrumental Convergence Hypothesis” is true. Expressed in a formal way: It does seem to be the case that, for the vast majority of possible objective functions defined over sufficiently detailed world-models, the optimal policy would involve taking harmful power-granting actions.
I still don’t clearly see, though, why the alternative empirical “Instrumental Convergence Hypothesis” defined in the report would be true. It just does seem fairly easy to imagine competent planners that rate certain undesirable courses of action highly (and thus exhibit ‘misalignment’), but don’t rate bad power-seeking courses of action highly (and thus do not exhibit ‘misaligned power-seeking’). I don’t think this requires, as the report suggests, the system to “ignore” the consequences of harmful power-granting actions. The system just needs to rate courses of action that involve power-seeking negatively.
Here’s a toy scenario:
Non-violent door-user: Suppose you create a planning AI system, through some sort of development process that relies heavily on human feedback. Let’s suppose that you don’t want this AI system to kill its overseers, and you also don’t want it to walk through doors with the word “FIRE EXIT” written on them. You give a lot of negative feedback to different actions that involve harm to humans, or that are even vaguely in the vicinity of violence, and the result is that the system begins to evaluate violent courses of action very negatively. However, in the training environment, there is only a single door with the word “FIRE EXIT” written on it. The system begins to evaluate courses of action that involve walking through this door negatively, but doesn’t generalize to other fire exits. The system is placed into a new environment, and continues to refrain from violence, but does occasionally walk through “FIRE EXIT” doors.
The system in this scenario seems like it would qualify as misaligned, but not PS-misaligned. I don’t see, though, why this category of scenario is so implausible. The claim that “in general and by default” this sort of thing won’t happen just seems very strong.
JC: I think this is a useful type of example to keep in mind. One thing that’s going on here, it seems, is that the sense in which the system’s objectives are problematic is that they don’t reflect adequate weight on some constraint, unrelated to power-seeking, that the designers intended (e.g., “don’t go through Fire Exits”). But once we’re thinking of “not putting enough weight on not going through Fire Exits” as the “misaligned objective,” it does seem a lot less clear that this objective is the type of thing that gives rise to instrumental incentives to seek power. To me, this calls to mind some hazy contrast between objectives that have a more “consequentialist” flavor, and objectives that have a more “deontological” flavor, where the former seem intuitively more closely connected to power-seeking, but where the framing in the report (intentionally) encompasses both. This seems like an area worth further clarity on my part.
---
I also find it a little awkward to use the alternative version of the Instrumental Convergence Hypothesis to evaluate risk. The hypothesis says that if an APS system exhibits misaligned behavior, then it is extremely likely to exhibit misaligned power-seeking behavior. But the hypothesis doesn’t have any implications regarding the likelihood that an APS system will exhibit misaligned behavior. So, if we want to make use of this version of the Instrumental Convergence Hypothesis, we’re forced to ask two separate questions:
I personally find it easier to just directly ask “What is the likelihood that an APS system would be PS-misaligned?”
JC: It seems plausible to me that directly focusing on PS-misalignment is better. The traditional argument structure is something like "if the objectives are not exactly right (or if you don’t have some difficult-to-formulate corrigibility property), then you get power-seeking; here's why getting objectives exactly right/getting that property is hard." But I think this might well be unnecessarily circuitous, especially in light of the possibility of just focusing very hard on eliminating power-seeking tendencies in particular.
On the analogy with human evolution:
I still find the human evolution analogy only a little concerning.
It is true that, when humans engage in planning, the evaluation criteria we use do not directly correspond to evolutionary fitness. For example, if I am considering whether to move to another country, and I’m thinking through the implications, I don’t evaluate these implications by asking “How would the propagation of my genes be affected?”
However, I’m not sure this is an obvious reason to worry. Suppose that, instead, evolution had inevitably imbued humans with a single-minded drive to pass on their genes. It seems like that would be a reason to worry: it would seem to suggest that we might have very limited flexibility in shaping the objectives of AI systems. Seemingly, the closest analogy would be if AI systems -- if trained for long enough -- inevitably develop a single-minded drive to pass on their parameter values. The fact that humans use evaluation criteria other than “evolutionary fitness,” when planning, therefore seems like the more reassuring of the two options.
JC: I don’t think I’ve really understood what you’re getting at here. As I see it, the role of the evolution analogy, in the context of “problems with search,” is to point to empirical evidence that selecting for a system that performs well according to criteria C (e.g., passing on its genes) does not necessarily result in a system actively optimizing for performing well according to criteria C, and which hence will fail to perform well according to criteria C in certain circumstances where it had capabilities sufficient to do so (e.g., not donating sperm, even though doing so was an available option).
It seems like you’re thinking that if humans were actively trying to pass on our genes, this would imply something about the object-level goals that lots of different ML-analogous selection processes result in? E.g., that lots of ML-analogous selection processes will result in systems trying to do something akin to “pass on my traits”? I don’t see why it would imply this, though. Rather, it seems like the more natural inference would be something like “selecting for good performance on criteria C results in systems that actively optimize for criteria C” -- a fact that would seem comforting from the perspective of alignment, if we thought we could identify selection criteria sufficiently aligned with what we want. But the idea is that the evolution example is evidence against this comforting thought.
If we want to make extrapolations from human behavior to future AI system behavior, it also just does seem very important that violent, manipulative, and otherwise power-granting behaviors often received positive reinforcement in human evolutionary history. If I remember correctly: High-end estimates of historical rates of violent death imply that the average person who passed on their genes may have committed something like .25 murders. One book on the evolution of general intelligence, David Geary’s Origin of Mind, describes human evolutionary history as “primarily a struggle with other human beings for control of the resources that support life and allow one to reproduce” -- which makes it natural to conceptualize a large portion of human behavior “in terms of an evolved motivation to control” (pg. 3). The selection pressures associated with the development and propagation of ML systems, for use by humans, just seem to be very different. Unlike in human evolutionary history, horrifying power-granting actions (like opportunistic murder) would virtually never be positively reinforced. Likewise, unlike in human evolutionary history, power-ceding actions (like allowing a particular person to kill/deactivate you) could fairly regularly be positively reinforced for AI systems.
Therefore, I’m not inclined to update much on the report’s observation that humans display power-seeking behavior that is “quite compatible with the instrumental convergence hypothesis.” Although the report does mention this objection, I don’t think it’s given nearly enough weight. Regardless of whether or not the report’s key arguments are correct, it would be strange and confusing if humans consistently refrained from aggressive power-seeking behavior.
At the same time, in human evolutionary history, power-seeking behavior was also often punished -- for instance, through retaliatory violence or ostracism. In this sense, the feedback was fairly mixed. I also do find it reassuring, then, that this kind of mixed feedback was still sufficient to cause many people to treat various power-granting behaviors as close to off-limits. As an unfortunately disgusting example, suppose that it became clear to you that you could achieve some cherished objective by disemboweling an acquaintance with your bare hands and teeth. Even though your evolutionary history contains many acts of horrifying violence, your reaction is still likely to be: “Well, I’m not going to do that.” If your history contained no acts of horrifying violence, because acts of horrifying violence were consistently negatively reinforced, then your reaction would presumably be even stronger and more reliable.
I find it pretty easy to imagine that development processes could imbue AI systems with that same kind of reaction (“Well, I’m not going to do that”) when they consider a wide variety of unwanted power-granting actions. For example, while planning, a system might simulate the act of killing its user or deactivating its own off-switch, then rate this course of action very poorly -- almost regardless of any other predicted consequences of the action.
The report does have a footnote related to this point, disputing the idea that AI systems might simply never consider actions like murder. My main suggestion, though, is that AI systems might very consistently rate actions like murder poorly when they do consider these actions. The idea that they might almost completely stop considering actions like murder is mostly a follow-on point: the more poorly a kind of action has been rated before, the less frequently it should be considered in planning processes.
JC: I'm inclined to view this as a case where non-murder is playing an important role in the objectives themselves. Seems like one of your main points is something like: "maybe it's pretty easy to make their objectives such that they don't like bad forms of power-seeking, even if the objectives are otherwise imperfect/problematic -- and to still have the systems be useful in the ways we want (at least, for long enough until we figure out better solutions)." And I basically agree that maybe this works (and it’s one of my main sources of optimism): I think it depends on how effectively you’re able to operationalize “no power-seeking”, and whether what you get via reinforcement according to this operationalization is something deep and robust enough to keep working as you scale up the system’s capabilities.
My sense is that some people are pessimistic about this type of thing for reasons I don’t fully understand: e.g., perhaps that more “deontological” flavored objectives, which put more focus on *how* you’re achieving outcomes rather than the fact that they get achieved at all, are in some sense “unnatural”, less effective (see e.g. here), and less likely to withstand scaleup in capabilities. There’s also a general worry that you will get a “nearest unblocked neighbor” problem with respect to your operationalization of power-seeking (e.g., if you said/reinforced no murder, it ends up cryonicizing you or something); and that there are various OK forms of power-seeking (e.g., surviving, learning new useful information, etc), that we probably do want our AI systems to engage in, such that you can’t just reinforce a blanket aversion to anything that seems potentially power-seeking-ish – rather, you need more fine-grained tools. But overall I think this is an aspect of the dialectic worth more attention.
Where does risk come from?
I’m not, in general, convinced that severely PS-misaligned behavior would be very hard to avoid.
I think one way to get clarity here is to start out with a simple case, where PS-misaligned behavior is clearly feasible to avoid, then see how much adding complexity changes the situation.
Toy car case: Suppose there’s an autonomous toy car. In its first timestep, it can either turn its engine on or keep its engine off. Then, for the next several timesteps, if the car turned its engine on in the first timestep, it can increase or decrease its velocity by one of several increments. The car does not have planning capabilities.
In this case, the action “turn your engine on” is highly instrumentally convergent: it increases the number of options the system has, regarding its future trajectory in the world, by many orders of magnitude. For the vast majority of possible utility functions, defined over the car’s trajectory, the optimal policy involves turning the engine on: there is only one possible trajectory that involves keeping the car’s engine off.
But it seems pretty easy to design a learning process that would result in the car keeping its engine off, if we’d like the engine to stay off. If we were to set up a process that involves learning from human feedback, and humans consistently give negative feedback when the car turns its engine on, it seems like we can quickly end up with a car that just keeps its engine off. Misaligned ‘power-seeking’ behavior is easy to avoid.
Furthermore, increasing the level of instrumental convergence (i.e. the amount of power that can be gained by turning the engine on) does not seem to increase the difficulty of creating a car that keeps its engine off. If we modify the car’s action space so that it can increase its speed by extremely fine-grained increments, when its engine is on, then this will massively increase the number of distinct trajectories that involve turning the engine on. Nonetheless, if we want to assess the difficulty of preventing the car from turning its engine on, this increase in the level of instrumental convergence seems irrelevant. Training the car to keep its engine off is just as easy as it was before.
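As a quick illustration, here is a minimal sketch (hypothetical setup and names, not from the report) of a simple feedback-driven learner in this kind of case: it converges on keeping the engine off almost immediately, and the size of the action space behind the “engine on” branch never enters the update.

```python
import random

# Minimal sketch (hypothetical setup): a bandit-style learner whose
# first-timestep action is "engine_on" or "engine_off", trained purely from
# human feedback that penalizes turning the engine on. Note that the number of
# trajectories available *after* the engine is on (the degree of "instrumental
# convergence") plays no role in the update.

feedback = {"engine_on": -1.0, "engine_off": 0.0}   # human judges penalize "on"
value = {"engine_on": 0.0, "engine_off": 0.0}       # learned action values
learning_rate, epsilon = 0.5, 0.1

for step in range(100):
    if random.random() < epsilon:                    # occasional exploration
        action = random.choice(list(value))
    else:
        action = max(value, key=value.get)
    value[action] += learning_rate * (feedback[action] - value[action])

print(max(value, key=value.get))                     # -> "engine_off"
```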
The toy car case is essentially the simplest possible case we can imagine. It’s possible, therefore, that instrumental convergence is essentially irrelevant in this case, but would be extremely relevant in more complex cases or cases where the AI system has planning capabilities.
JC: I think it’s useful regardless.
One complexity-focused story, which I believe is suggested by the “problems with search” section of the report, might be something like this: In the car case, there is one specific instrumentally convergent action that we do not want the agent to take. Generalizing from human feedback is therefore very easy. However, suppose that there is a large and diverse set of instrumentally convergent actions that we do not want the agent to take. Then the system might not generalize well from human feedback, which has only directly concerned the small subset of badly unwanted behaviors it has engaged in before, and it might therefore still engage in some badly unwanted behaviors even if it’s trained for a long time.
To tweak an example raised in the “problems with search” section: Suppose that you want to make sure that your AI system does not walk through doors with the word “EXIT” written on them. You train it in an environment that contains a number of doors with the word “EXIT” written on them, always in green paint, and consistently give negative feedback if the system even comes close to walking through them. Now you put the system in a new environment, where the word “EXIT” is written in red paint. The available data doesn’t include any examples of walking through this specific kind of door: the system has only ever been penalized for walking through doors where “EXIT” is written in green paint. It seems like there’s some chance, then, the training process would have produced a policy that only responds to “EXIT” written in green. Hence, due to the need to generalize, there’s some risk that the system will violate your preferences in this new environment.
I’m not sure I buy the complexity story, though. It seems like you’d need an extremely large generalization failure for feedback that consistently penalizes any form of coercive or violent behavior to result in (e.g.) world-conquering behavior.
It also seems important that only an extremely small portion of policies involve successful world-conquering behavior: this seems to suggest that pure randomness in the parameter values that result from a training process, due to the limitations of the sample of experiences used in the training process, is unlikely to be a good explanation for the emergence of world-conquering behavior. It seems like we need to think that, insofar as feedback is insufficient to pick out a particular set of parameter values, there will be an extremely strong bias toward parameter values that imply world-conquering behavior. I don’t think I see an obvious reason to suggest this bias, though.
An alternative story, which I believe is suggested by the discussion in the “problems with proxies” section, is that human feedback is probably unusually well-calibrated in the toy car case: the human judges never give positive feedback to undesirable power-granting behavior. In certain other cases, on the other hand, humans may give positive feedback to very bad behavior, because of their limited ability to distinguish good and bad behavior. This feedback might then push the system to engage in increasingly bad behavior.
The report gives the grasping task case as a concrete example: Human judges gave positive feedback to an AI system that merely looked like it was grasping an object, rather than actually grasping the object (as the judges wanted). Positive feedback for this behavior reinforced it. This case is of course fairly benign. However, conceivably, positive feedback for more sophisticated and seriously bad behaviors could eventually lead to the creation of systems that engage in behaviors like mass manipulation, mass murder, and so on.
I also don’t think I buy this story, though. For a feedback process to produce highly competent world-conquering behavior, for example, it seems like people would need to give a lot of positive feedback to increasingly terrible behavior. One way to frame this intuition: In policy-space, it seems like there is a very large gap between (e.g.) benign personal assistant behavior and successful world-conquering behavior. Intermediate policies that fill this gap include behaviors that look like highly incompetent attempts to take over the world, that involve lower-level acts of violence or manipulation, that involve displays of various skills that would never naturally come up in personal assistant work, etc. It seems like, for human feedback to push the policy across the gap, the people giving human feedback would need to be very consistently and very badly miscalibrated over a fairly extended period of time: it seems like we’d almost need to imagine that there’s hardly any correlation at all between good behavior and positive feedback. Overall, then, it feels like a stretch to extrapolate from the gripper case (where people briefly rewarded behavior that merely looked like gripping) to existentially significant cases.
JC: The possibility of giving lots of positive feedback to actively and visibly bad external behavior isn’t playing a very central role for me. I’m much more worried that we’ll give positive feedback to good-seeming external behavior that arises from bad internal cognition (e.g., scheming to get power later by behaving well now), and hence reinforce the bad internal cognition.
Ultimately, at least if we are utilizing human feedback, I find it hard to imagine that purely learning-based processes would accidentally result in ML systems that engage in behaviors like competent world-conquest.
We can ask, then, whether planning might be the key thing that makes a difference. I feel pretty unsure here, but I definitely have some sympathy for the idea that planning would substantially increase risk.
JC: Flagging that I again worry that your specific conception of the contrast between learning vs. planning is going to make a difference here. E.g., in my ontology it is possible to learn from trial-and-error-in-the-real-world how to plan, such that you then plan to engage in behavior that wasn’t itself reinforced at all.
Once we introduce planning, I do think that elements of the “problems with proxies” story become more compelling. A notable difference, here, is that it’s more plausible that large and deeply concerning movements in ‘policy space’ would be opaque to human engineers/overseers. If you want to gradient-descent your way to a policy that engages in competent world-conquering behavior, then this development process seems like it will need to involve lots of intermediate policies that engage in (e.g.) behavior that looks like incompetent conquest-attempts. If someone is watching the system try out intermediate policies in the real world, or in a highly legible simulation, then it seems like these concerning intermediate behaviors will probably be noticed and negatively reinforced. On the other hand, if the system is updating its behavior through planning, using highly non-interpretable simulations of the world, then it seems more plausible that concerning intermediate policies would go unnoticed by humans. This is especially true if humans don’t even realize the system is planning, because the relevant planning capabilities are emergent.
There is a further question of whether concerning intermediate behaviors would actually be reinforced within a planning process. I’m not sure how to think about this question, since it seems to depend to a large degree on the nature of the planning process. It again seems like it might be important, for example, whether the planning process is intentionally introduced by engineers or whether it’s entirely emergent. If it’s entirely emergent, then there might be an especially strong case for thinking that it would tend to reinforce behaviors that humans do not want: the report’s comments about potential biases toward simple or computationally non-intensive evaluation processes might become especially relevant.
Overall, though, I just feel fairly confused about how to think about the significance of planning. I can buy that planning capabilities might be associated with substantially higher levels of risk, but I think I’d still want to see analysis that goes substantially beyond what’s in the report.
JC: I found the general vibe of your comments here helpful. I’d gloss it as something like: “Assuming that you’re actively penalizing coercive/violent behavior, and if you’re training something like a personal-assistant, then if you ended up with a policy that nevertheless goes off and engages in coercive/violent world-conquering, it seems like you ended up with something pretty different and intuitively ‘far away in policy space’ from the thing you trained for. Remind me why we should expect this again?”
I do think the idea that the system in question is planning (in my sense) is important here (I’m always assuming that we’re working with APS systems that are planning in my sense). In particular, conditional on training an APS system at all, I think a key worry is that the training/reinforcement process first gives rise to a problematic objective that ultimately does motivate power-seeking (e.g., “learn a ton about science,” for training a science AI – see e.g. Ajeya’s description of “schemers”), and then once the system sees, via planning + strategic awareness, that hiding its power-seeking aspirations and behaving well now is a good instrumental strategy (e.g., “don’t kill people to learn more about science now, wait until they can’t stop me”), you get non-power-seeking behavior for the rest of your training process regardless (assuming your AI is competent enough to understand the behavior you’re looking for). But there’s still a question of whether you should expect to get this sort of power-seeking-ish objective in the first place, whether the incentives would in fact be set up in the way this story imagines, and whether we would fail to detect the relevant forms of deception. And more broadly, this does feel like a pretty specific kind of story, and I think it’s worth being clear about exactly how much work it’s ultimately doing in the overall threat model (personally, I would feel a lot better about the situation if I knew this kind of deceptively-good behavior was never going to occur).
More broadly, though, your comments here highlight for me the intuitive difference between thinking in terms of “policies” or “behavior,” vs. thinking in terms of “objectives.” That is, when we think in terms of “policies” or “behavior,” it doesn’t seem like there’s a lot of “pressure towards power-seeking”: it feels like you can push a policy, via reinforcement or whatever, in whatever direction you want, and “world-conquering” just seems like a particular type of policy/behavior that there isn’t strong reason to expect. Whereas once you’re thinking that you’re going to give the system some objectives that it will plan in pursuit of, it’s more intuitive why power-seeking would be some sort of default (e.g., because so many different objectives compatible with its observed performance are served by power-seeking).
A further thought on functionality/safety trade-offs
The report notes:
It’s generally easier to create technology that fulfills some function F, than to create technology that does F and meets a given standard of safety and reliability, for the simple reason that meeting the relevant standard is an additional property, requiring additional effort. Thus, it’s easier to build a plane that can fly at all than one that can fly safely and reliably in many conditions; easier to build an email client than a secure email client; easier to figure out how to cause a nuclear chain reaction than how to build a safe nuclear reactor; and so forth.
Of course, we often reach adequate safety standards in the end. But at the very least, we expect some safety problems along the way (plane crashes, compromised email accounts, etc). We might expect something similar with ensuring PS-aligned behavior from powerful AI agents.
In principle, I agree with this point. However, it also seems worth keeping in mind that some failure modes for technologies are much less natural than others.
For example, it’d be very easy to create an explosive pair of pants: just pour some gunpowder in the pockets. But exploding is not a natural failure mode for pants. No normal, sensible-seeming pants production process is going to result in an explosion.
I think it’s still an open question where exactly “a useful AI system that is very unlikely to kill everyone” falls on the spectrum between “a plane that is very unlikely to crash” and “a pair of pants that is very unlikely to explode.”
JC: I like this example. I do think a lot of the discourse here is motivated by something like "power-seeking is a very natural failure mode of APS-systems, because power-seeking is just a very natural thing for strategically aware agentic planners to do (because it will help them accomplish their objectives).” If that assumption (basically, the instrumental convergence thesis) is false, I think things are really looking up.
This section discusses why we might expect to see practically PS-misaligned systems actually get deployed. The author briefly discusses the possibility of unintentional deployment, then lays out various factors that might affect the beliefs and incentives of relevant actors in choosing to deploy a system that is in fact practically PS-misaligned (even if they don't know it). He focuses on four factors that seem especially concerning:
Overall, the author finds it plausible that if ensuring practical PS-alignment proves challenging, practically PS-misaligned systems could end up getting deployed.
I largely agree with the considerations laid out and think the discussion is clear. I especially like the genetically engineered chimp analogy, which I think helped to move me a bit on these questions.
On this point, my perspective is probably close enough to Joe’s that it isn’t worth writing another wall of words. In case Joe is interested, although it probably doesn’t add much, this is one unpublished piece of writing exploring similar considerations; I don’t remember if I shared it previously.
ALIGNMENT DIFFICULTY: By 2070, and conditional on TIMELINES and INCENTIVES above, it will be much harder to develop APS systems that would be practically PS-aligned if deployed, than to develop APS systems that would be practically PS-misaligned if deployed (even if relevant decision-makers don’t know this), but which are at least superficially attractive to deploy anyway.
15%
(Note: I’m a little unsure what counts as PS-misaligned. Although it seems like there are some instances of PS-misalignment that might be relatively mild or benign, I suppose I’m assuming some minimum level of seriousness.)
This section discusses whether or not we should expect the impact of deploying practically PS-misaligned APS systems to scale to the permanent disempowerment of ~all humans. It also discusses a few mechanisms relevant to the plausibility of this disempowerment. The author suggests that:
Overall, the author thinks that humans might well be able to correct PS-misalignment problems and prevent them from re-arising, but that doing so will likely require addressing one or more of the basic factors that gave rise to the issue in the first place: e.g., the difficulty of ensuring the practical PS-alignment of APS systems (especially in scalably competitive ways), the strong incentives to use/deploy such systems even if doing so risks practical PS-alignment failure, and the multiplicity of actors in a position to take such risks. This task, he thinks, could well prove difficult.
I find the discussion reasonably persuasive.
My biggest concern might be that I feel unsure how to imagine the accidental permanent disempowerment of humanity, outside of relatively dramatic scenarios (e.g. everyone being killed). Partly, this is a concern about knowing what counts as accidental permanent disempowerment.
Partly, it’s a concern about knowing what kinds of pathways to disempowerment make sense. For example: Is it plausible that disempowerment could be an extremely distributed and piecemeal thing, where (e.g.) some set of power-seeking systems removes a ‘unit’ of human agency in one year, then another set of power-seeking systems removes another ‘unit’ of human agency in another year, and so on? Or should we typically imagine disempowerment as a more binary state, that’s triggered by some particular disaster? I think these differences probably matter, when thinking about the feasibility of course correction, but I didn’t feel clear about the relative weight the report was giving to different sorts of scenarios.
JC: I’d like to be a bit clearer about this myself.
I do think that the kind of piecemeal disempowerment you’re pointing to can happen, but I feel more optimistic, in slow versions of those scenarios, about humans noticing what’s happening and coordinating to prevent it (though obviously, the coordination might fail).
In general, I tend to put the most weight on “some particular subset of the AI systems manage to get ~all the power via a ‘decisive strategic advantage’-type scenario,” and “there is a relatively short period – e.g., say, < five years – in which misaligned systems all over the world are gaining power in different ways, not through some coordinated rebellion or particular disaster, but just via different incentives and human disadvantages and failures to coordinate all pushing in that direction.”
HIGH-IMPACT FAILURES: By 2070, and conditional on TIMELINES, INCENTIVES, and ALIGNMENT DIFFICULTY above, some deployed APS systems will be exposed to inputs where they seek power in misaligned and high-impact ways (say, collectively causing >$1 trillion 2021-dollars of damage).
50%
DISEMPOWERMENT: By 2070, and conditional on TIMELINES, INCENTIVES, ALIGNMENT DIFFICULTY, and HIGH-IMPACT FAILURES above, some of this misaligned power-seeking will scale (in aggregate) to the point of permanently disempowering ~all of humanity.
30%
The author briefly discusses the possibility that the permanent and unintentional disempowerment ~all humans would not constitute an existential catastrophe (e.g., an event that drastically reduces the value of the trajectories along which earth-originating civilization could develop). He also emphasizes that some AI systems might have moral status, and that the right way forward may ultimately involve humans intentionally ceding power to them: the point in the present context is to avoid unintentional disempowerment of humans.
I largely agree with this section.
One consideration I would add -- which perhaps goes a bit beyond the point about game-theoretic equilibria -- is that competitive pressures might also produce very broad convergence in the properties of digital minds. If some kinds of AI systems outcompete others (or communities that use some kinds of AI systems outcompete others), then, so long as competitive pressures remain operative, it might be close to inevitable that future systems will have certain highly competitive traits. This kind of determinism in the evolution of AI systems might operate whether or not we work out alignment techniques.
JC: This seems like a helpful route to convergence to have in mind.
A second consideration is that PS-misaligned systems might still behave in ways that are compatible with a lot of current human values. For example, we can imagine a system that acts like it cares about human welfare, inequality, and so on, but also is more resistant to attempts to shut it down or modify it than we would like. As an analogy: Imagine a benign dictator who is actually pretty good in most ways, relative to our current values, except for the important fact that they are committed to being a dictator. If we think that the average person’s values will get worse over time, for instance due to random drift, then it might actually be preferable for the current generation to accidentally create these kinds of PS-misaligned systems.
(Note: I’m not sure whether this second scenario would constitute the “disempowerment” of humanity, though, if humanity is still around and the median person still has about as much individual freedom as the median person alive today. The concept of “disempowerment” here remains a bit fuzzy to me.)
JC: I’m inclined to say that it’s not the relevant kind of disempowerment if humans are still free to direct tons of resources in the ways they want to.
A third consideration is that humans might be very disinclined to disempower themselves, such that humanity will only get disempowered (relative to AI systems) if it is disempowered accidentally. It also seems like accidental disempowerment could conceivably be better than no disempowerment. For example: humans might ‘enslave’ AI systems in a morally objectionable way, humans might tend to constrain AI ‘population levels’ far below what would be morally ideal, or humans might be likely to create AI systems with actively bad conscious experiences. In this case, accidental disempowerment might actually be morally desirable.
CATASTROPHE: By 2070, and conditional on TIMELINES, INCENTIVES, ALIGNMENT DIFFICULTY, HIGH-IMPACT FAILURES, and DISEMPOWERMENT above, this disempowerment will constitute an existential catastrophe.
75%.
The author concludes by listing his own subjective, highly-unstable probabilities on each of these premises, along with a number of caveats about how these probabilities should be understood. The probabilities are:
By 2070:
In combination, these probabilities yield an overall estimate of ~5% chance of an existential catastrophe by 2070 from scenarios where all of 1-6 are true, which the author would adjust upwards to reflect power-seeking scenarios that don’t fit some of 1-6. The author also notes in a footnote that his “high-end” and “low-end” estimates vary considerably: from ~40% on the high end to ~0.1% on the low end.
Premise #3 seems to be the most important point of disagreement.
---
Also, a methodological note: I personally find it somewhat hard to evaluate premises 4 and 5 separately.
When I try to imagine an APS system causing $1 trillion in damage, I feel very unsure about what I should imagine: the destruction of physical infrastructure or inventory, a financial crash, a pandemic (e.g. lab leak), a policy that slows economic growth, a policy that causes substantial waste of resources, the dissolution of a highly valuable corporation, the instigation of costly political unrest or conflict, etc. I think different scenarios require forms of misalignment that would be more or less dramatic or frightening, from the perspective of someone worried about permanent disempowerment. I feel like I would rather just directly estimate the probability of permanent disempowerment and skip premise 4.
JC: I agree there’s some ambiguity here. And there are also conceivable scenarios where AI systems kind of “outgrow” human institutions without actually causing any “damage” (for example, if the AI systems just had their own economy that did very well, and then went to space much faster than humans did).
My answers imply a probability of .4%.
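(For reference, this is just the product of my answers to the six premises above: 0.30 × 0.80 × 0.15 × 0.50 × 0.30 × 0.75 ≈ 0.004, i.e. roughly 0.4%.)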
I’m really glad that Joe took the time to write this report. I think it adds clarity to many important questions and considerations and lays a strong foundation for further debate. In general, I’m extremely impressed by its depth, breadth, and even-handedness.
It’s the sort of thing that helps to reassure me that progress in debates around alignment is possible, even if this progress is inevitably going to be very gradual and a lot of the necessary work is going to be fairly painstaking. I think it’s fairly striking, for example, to compare this report to writing on alignment risk from several years ago.
JC: Very glad to hear it :) – your own work on analyzing AI risk arguments was an important inspiration, and your feedback on the earlier draft was very helpful.
Yes.
Yes.
[1] One way of defining these subjective probabilities is via preferences over lotteries. On such a definition, “I think it less than 10% probable that Kanye West will be the next president” means that if I had the option to win $10,000 if Kanye West is the next president, or the option to win $10,000 if a ten-sided die comes up “1”, I would choose the latter option. See this blog post by Andrew Critch for more informal discussion; and see Muehlhauser (2017a), section 2, for discussion of some complexities involved in using these probabilities in practice.
[2] Here we understand “incentives” in a manner such that, if people will buy tables, and the only (or the most efficient) tables you can build are flammable, then there are incentives to build flammable tables, even if people would buy/prefer fire-resistant ones.
[3] Definitions: