Alignment report reviewer questions
Instructions for reviewers:
The author is happy to answer any questions you have over email, or on a call, while you are reading the report. Please feel free to reach out to him at joseph@openphilanthropy.org if that would be helpful to you.
Is the report’s main project framed in a way that makes sense to you? If not, what is unclear or confusing?
The report’s project makes sense to me.
The argument has several different stages, so I find it a little hard to condense into (e.g.) a single paragraph. Here’s my best attempt at a summary of the different stages, though. I’m doing some modest reframing, and modest re-defining, partly because this should make it more transparent if my interpretation of the argument diverges from the intended interpretation.
JC: I appreciate the thoughtful summary. This is basically in line with my intended interpretation, though a few bits (e.g., thinking about planning in terms of simulations in particular) are somewhat more specific than I’d have in mind.
I found the report much clearer than almost all other writing on risks from misaligned AI.
There are still some important concepts that I think are fuzzier than they ideally ought to be, but I think some amount of fuzziness is inevitable in a case like this: it’s just very hard to talk about still-hypothetical AI systems that might be unlike anything we’ve seen yet.
The place where I think some further clarification would be most valuable is in the report’s description of the “permanent disempowerment of ~all of humanity.” This is a really important idea, since it’s the central thing that the report is trying to predict, but I still feel fairly fuzzy about what kinds of outcomes qualify. There are some scenarios that seem to obviously qualify: for example, scenarios in which humans are simply killed off, or scenarios in which humans are essentially imprisoned or severely persecuted by a purely AI-based government that takes no inputs from humans. But I’m not sure if these are the modal scenarios being imagined, and I’m not sure what other scenarios count.
JC: Those are close to the modal scenarios I imagine, though I’d also add in scenarios where humans aren’t actively “persecuted,” but rather are simply deprived of resources or opportunities to pursue their ends, especially on the types of cosmic scales that I expect would’ve otherwise been possible.
Here, for example, is one case that illustrates the ambiguity for me: Would the world described in Robin Hanson’s Age of Em count as a world in which ~all of humanity has been unintentionally and permanently disempowered? (Although it involves emulations rather than AI, we can imagine that the ems were instead created through very good imitation learning.) I could see a case for answering either “yes” or “no.”
JC: Assuming that the ems possess ~all the power in this world, to me this question effectively amounts to the question of (a) are the ems human (I tend to think of literal emulated humans as “humans” in the sense relevant to the empowerment of human values), and (b) if not, how intentional was the transition to a “the ems have all the power” world.
Re: (b), I do think that "there was a bunch of consensual positive sum trade that resulted in the ems taking over, despite the fact that most humans wouldn't have wanted this on reflection" is kind of an edge case with respect to the report’s definitions of “disempowerment” – though one that could well be relevant in practice to different degrees. And we can imagine non-EM scenarios that are similar. See also Grace on “misuse vs. misalignment” for more edge cases.
One reason I find it hard to assign a probability to the report’s central hypothesis is just that I don’t feel like I have a great grip on it conceptually: I’m not sure how to imagine what the world is like, how the misaligned AI systems would actually be exerting power, and (partly as a result) what the path to this world would plausibly look like.
(I've written up some related thoughts that may be relevant here.)
Re: the linked discussion, I think my view is probably closest to: “If we ‘mess up on AI,’ then even the most powerful individual humans will have unusually little influence over their own lives or the world around them.” I might say something like: “If we mess up on AI, then humans will exert very little resource-weighted influence over the future relative to what would’ve been possible.”
---
I generally found the report to be very reasonable and balanced. It feels like it’s giving many different arguments and counter-arguments a fair shake, rather than arguing for one particular conclusion in a lawyer-like way.
---
I’m unsure what to say about how “convincing” the report is. I agree with a large portion of the report. Joe’s credence is also certainly closer to mine than it is to those of the most pessimistic analysts of AI risk. At the same time, I remain somewhat more “optimistic” than Joe.
The least convincing part of the argument, for me, is the discussion of instrumental convergence and its significance.
The section of the report that caused the largest update in my views was the discussion of reasons why people might use systems they know to be misaligned. I’m sure I would have updated my views more on other points as well, if I had not already encountered many of the report’s points previously in earlier drafts or conversations with Joe.
The report focuses on AI systems with three key properties (the author calls these “APS systems”):
I think this framing makes a reasonable amount of sense. I think that the concept of an APS system is a large improvement over possible alternative concepts like “a transformative AI system” or “AGI.” I also like a lot of the conceptual discussion.
However, I do still have some concerns about the framing.
---
A high-level concern and suggestion:
The concept of an APS system feels a bit inelegant to me, given that it has three separate components that need substantial unpacking. One way to streamline the definition might be to combine the “advanced capability” and “strategic awareness” components. For example, something closer to the following definition might be preferable:
An APS system can accurately predict that taking certain actions would likely grant it very significant, long-lasting power over its users.
This definition implies that the system plans, has strategic awareness, and possesses capabilities that are sufficient to obtain truly worrisome forms of power. So it is roughly equivalent to the definition given in the report. However, it feels a little neater. It also suggests a relatively neat way of describing one challenge that future AI developers may face: they may need to create AI systems that can figure out how to seize power but choose not to.
JC: I had meant for the "advanced" to index capability levels to today's world, to leave room for the idea that you get APS systems but in a competitive context where everyone has amped up their defenses etc such that the APS systems don’t actually get that much power (even though the systems in question would’ve been relatively powerful in 2021).
Also, I think that strictly speaking, oracle-like systems (e.g., systems that can generate plans but aren’t actually using their planning capacities in pursuit of objectives) would satisfy the definition above; whereas I’d prefer to exclude them.
That said, I do think “they may need to create AI systems that can figure out how to seize power but choose not to” is a nice framing.
---
More specifically, I have some concerns about the treatment of the three key properties used to construct the current definition. In brief, my concerns are:
(a) I think the distinction between relatively concrete/unified systems (e.g. GPT-3) and highly modular/abstract systems (e.g. a large collection of neural networks that collectively replicate the activities of a large corporation) might be really significant. For that reason: I think it’s possible that the report should specifically focus on misalignment in relatively concrete/unified systems, and then briefly discuss highly modular/abstract systems as a secondary case, rather than framing the discussion in a way that is (I think?) largely meant to apply to both cases at once.
(b) I think it would be useful for the report to say more about why planning is a “key property” of APS systems. I agree that planning is very important, but I think I disagree about exactly why planning is so relevant to AI risk. I think this difference in perspective may help to explain why I find some considerations more or less compelling than Joe does.
(c) I generally feel nervous about relying a lot on the concept of an AI system’s “objectives,” as it’s evoked in the description of APS systems. I think it may be importantly misleading, for example, to think of the challenge of creating aligned systems as a matter of ‘giving them the right objectives.’ I think my concerns about ‘objective talk’ might also partly explain why I’m not convinced of the report’s version of the Instrumental Convergence Hypothesis.
Here are my qualms in almost certainly too much detail:
JC: I agree re: “I think it’s notable that the relevant ‘tasks’ listed (e.g. engineering) mostly aren’t performed by individual humans.”
JC: I'm also not sure. This seems like another candidate place where the abstractions might mislead. For example, it seems like Google is an "agent" in some sense, but it's definitely not a "really big/powerful human," and if you start trying to predict its behavior, the effect of different interventions, etc as though it were just a really big version of the type of thing a human is, I think you probably go wrong in lots of cases.
For example:
JC: I agree, though you could just end up with both difficulties.
JC: This seems to me like an important and useful point, especially with respect to the current ML-focused alignment discourse, in which one often imagines a "training procedure" that results in good performance on some "objective function," as opposed to e.g. putting a bunch of parts you kind of understand together.
That said, it feels to me like in worlds where you don't know how to hand-code basic cognitive tasks and so have to use ML for them, it also gets harder to "hand-build" complex combinations of modular systems to do useful cognitive work: probably a lot of the useful cognitive work is happening "inside" the modular systems. But maybe you could keep that cognitive work e.g. myopic or somehow more limited, such that you have fewer alignment problems with your individual parts, and then turn your big thing into the agentic planner you ultimately need.
JC: I agree.
JC: One hesitation I have with just focusing on non-modular systems is that I don't think people should say "I think it'll be modular, therefore my world-model has none of the problems Joe envisions." Rather, I think the right reaction is something more like yours: "I expect lots of modularity, and that maybe this helps make various things easier even if it doesn't eliminate all of the risk."
And I don’t actually think that "modularity" on its own is very important. For example, if trained neural networks turn out to be very modular at some level (e.g., different parts of the network end up internally optimized for different tasks), then unless this translates into some difference in our ability to understand/control the systems, I don't think this is an important thing to focus on.
JC: I think that we’re working with different concepts of planning, here. For example, below you say that you can’t see how a feedforward network with frozen weights could engage in planning -- a comment I wouldn’t endorse. This suggests to me that you’re putting a (strange-to-me) amount of emphasis on the simulations “changing an agent’s policy” -- where policy is understood as its overall input-output mapping conditional on holding something (e.g., the weights) fixed. Whereas I’m thinking of planning centrally as something that determines an agent’s decision about what to do in a particular case, even if it doesn’t change some particular input-output mapping overall. Thus, for example, a fixed-weight feedforward system could still run a bunch of simulations about what will happen if it performs different actions, and then decide what to do on this basis -- and this would count as planning, to me (and I expect to many).
I’m not sure why we would focus on the simulations “changing the policy” in your sense in particular, as opposed to just determining the decision. Presumably, after all, a fixed-weight feedforward network could do whatever humans do when we plan trips to far away places, think about the best way to cut down different trees, design different parts of a particle collider, etc -- and this is the type of cognition I want to focus on.
JC: As above, I don’t just care about planning that goes into “producing the policy” (however the line around “the policy” is drawn -- I’m not actually sure what you have in mind). I also care about planning that the policy itself performs. E.g., if solely via giving your system real-world experience (rather than having it learn from running lots of simulations), you end up with a system whose policy involves planning out different world-take-over schemes in response to different circumstances, then that’s worrying, regardless of the policy’s origins.
JC: Is this sentence missing a “not”? Currently it’s contrasting “produced by planning processes” and “produced through planning processes.”
JC: I expect that the way forward on this would be to sync up on how to think about the concept of planning that matters. To my mind, what matters about planning isn’t something about “how the system’s overall input-output function is created” in the sense your comments are focused on; rather, it’s something about whether the system makes decisions about what to do by modeling the effects of its actions on the world, in order to better achieve objectives -- whether doing so involves updating its overall input-output function or leaving it fixed. In particular, systems modeling the world in order to achieve objectives, where such models of the world involve strategic awareness, are by (intentional) definition in a position to notice and respond to instrumental incentives to gain and maintain power. That’s the central reason I focus on planning in this sense (though I also think that planning is likely important to an AI system’s ability to successfully gain power over humans as well).
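To make the contrast concrete, here is a minimal, purely illustrative sketch (all names hypothetical, not from the report) of planning in the sense JC describes: decision-time simulation against an internal world model, with no change to the system’s fixed input-output mapping.

```python
# Minimal illustrative sketch (hypothetical names): a system whose parameters
# are completely fixed can still "plan" in this sense, by simulating the
# predicted outcome of each candidate action with an internal world model and
# choosing among them -- no update to its overall input-output mapping occurs.

def plan(state, candidate_actions, world_model, evaluate):
    """Return the action whose simulated outcome scores best under `evaluate`."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        predicted_outcome = world_model(state, action)   # run a simulation
        score = evaluate(predicted_outcome)              # rate it against objectives
        if score > best_score:
            best_action, best_score = action, score
    return best_action

# Toy usage: "walking" moves the agent closer to a goal at position 3.
world_model = lambda state, action: state + (1 if action == "walk" else 0)
evaluate = lambda outcome: -abs(outcome - 3)             # prefer being near 3
print(plan(0, ["wait", "walk"], world_model, evaluate))  # -> "walk"
```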
JC: As I indicate in the report, confusion about the notion of “pursuing objectives” is one of my top candidates for the ways in which the abstractions employed might mislead. I do think focus on “behavior” can mitigate this somewhat (though I think it can also lose the predictive value that talk of “objectives” provides), and that couch-potato-like systems are a helpful example to keep in mind.
I agree that a couch-potato is in fact a strategically aware agentic planner. I also think that a couch-potato is well-described as having objectives: e.g., they want to hang out, eat snacks, watch good movies, avoid paying taxes, avoid hassles like going to the gym, etc. And they make decisions with such objectives in mind. (And note that a couch potato will in fact seek various kinds of power if given suitable opportunity. For example, if you offer a couch potato a billion dollars if they do a few jumping jacks, they will generally do it. This should make us wonder what the couch potato would do if certain types of power-seeking became suitably low-cost.)
But the question isn’t whether it’s hard to build an agentic planner as non-threatening as a couch-potato. The question is whether it’s hard to build an agentic planner as non-threatening as a couch-potato that also significantly outperforms humans at tasks like science, engineering, business/military/political strategy, hacking, and social persuasion/manipulation, and so on. And now, the worry is, you’ve started throwing around levels of cognitive capability that a standard human couch potato isn’t working with, and which could render its pursuit of its objectives much more worrying (e.g., compare “how hard is it to build a non-threatening couch potato” to “how hard is it to build a non-threatening superintelligent couch potato” -- for the latter, I think that we do in fact have to start thinking about what happens when extreme levels of optimization power are applied to the objectives in question, in a way that we didn’t when we were talking about weaker systems).
That said, it does seem like humans who are really good at e.g. science, engineering, and so forth can be couch-potato-like in many respects – so it doesn’t seem that hard to imagine increasing their science/engineering abilities without doing much to their power-seeking.
TIMELINES: By 2070, it will become possible and financially feasible to build APS systems.
30%
The author assumes that there are strong incentives to automate advanced capabilities, and discusses three reasons we might expect incentives to push relevant actors to build agentic planning and strategically aware systems in particular, once doing so is possible and financially feasible:
I agree with the conclusion, although “available techniques” is definitely the main justification I find compelling.
Although this may be a pedantic/semantic point, at some level, I don’t think that systems that can plan are inherently more useful than systems that cannot plan. Any given useful behavior can, in principle, be produced through learning alone. I think there is nothing that a planner can do that a learner cannot, if the learner has enough experience to learn from. For example: With enough experience, a version of AlphaGo without any planning capabilities could match the skill of the standard version of AlphaGo.
JC: Flagging that I think the distinction between “planning as a means of learning a policy” (your focus) vs. “planning as a thing that a policy can do” (my focus) is coming in here as well. I think it likely that the latter is basically required for certain tasks. To take some extreme and science-fictiony examples, consider e.g. “show up on a new planet and figure out how to build a working car out of the weird new materials that the planet provides within X amount of time”, or “figure out how to make nano technology using X time and Y resources, where no one has ever made nano-tech before.” To me it’s just really hard to see how you do this type of thing without relying on detailed models of your environment, predicting the results of your actions, evaluating these actions on the basis of criteria, and so on. Hence my focus on “usefulness.”
I do think, though, that it will often be dramatically less efficient to produce AI systems that meet a given performance standard if you do not allow them to plan. It would be extremely difficult, in many different domains, to give an RL system enough real-world experience to remove the need for simulated experience. It also seems impractical to hand-code sufficiently good simulations for every relevant domain. This means that it should often be very valuable to allow AI systems to make use of (at least partly) AI-generated simulations, when they are working out how to perform well. Depending on exactly how we define “planning,” it will normally be natural to describe AI systems that use these simulations as “planners.”
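As a rough illustration of this efficiency point, here is a minimal sketch (hypothetical setup, loosely in the style of Dyna-Q, not anything from the report) of an agent that supplements scarce real-world experience with extra updates replayed from a learned model of its environment.

```python
import random
from collections import defaultdict

# Minimal Dyna-style sketch (hypothetical setup): the agent updates its action
# values from each real transition, then performs extra "planning" updates by
# replaying simulated transitions drawn from a learned model -- which can
# greatly reduce the amount of real experience needed for a given skill level.

q = defaultdict(float)        # action values, keyed by (state, action)
model = {}                    # learned model: (state, action) -> (reward, next_state)
alpha, gamma, n_planning = 0.1, 0.95, 20
ACTIONS = ["left", "right"]

def td_update(s, a, r, s_next):
    best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])

def learn_with_planning(s, a, r, s_next):
    td_update(s, a, r, s_next)                  # learn from the real transition
    model[(s, a)] = (r, s_next)                 # remember it in the world model
    for _ in range(n_planning):                 # plan: replay simulated experience
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        td_update(ps, pa, pr, ps_next)
```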
I’m not sure that planning is a natural “byproduct of sophistication.” I think that some development processes (e.g. feedforward networks updated using standard learning algorithms) probably simply don’t allow planning to emerge, even though they could in principle produce indefinitely sophisticated systems. I think it’s possible that really basic forms of recurrence within a model are sufficient to enable the emergence of powerful planning capabilities, but I’m not at all confident. I think that the most relevant form of empirical evidence here would be evidence that certain models perform dramatically better if given more ‘time to think.’
If planning is entirely emergent, it seems like it would probably be very hard to prevent planners from developing strategic awareness in certain circumstances. If planning is not emergent, with engineers intentionally crafting planning processes, then I am less confident that the threatening forms of strategic awareness (e.g. the ability to understand and predict that killing your overseer would prevent them from turning you off) would be unavoidable. However, even in the best case, avoiding these forms of strategic awareness still seems like it would probably be very difficult.
INCENTIVES: By 2070, and conditional on TIMELINES above, there will be strong incentives to build APS systems.[2]
80%.
In this section, the author discusses the hypothesis that for APS systems, misaligned behavior on some inputs (where this behavior involves agentic, strategically-aware planning in pursuit of problematic objectives) strongly suggests misaligned power-seeking on some inputs, too (in brief, the central argument here is that power is useful for pursuing a wide variety of objectives).[3] He then discusses different challenges to ensuring that APS systems are practically PS-aligned. In particular, he considers two key challenges to controlling a system’s objectives adequately:
He also discusses the possible role of:
Finally, he discusses three ways that ensuring the safety of APS systems seems unusually difficult, relative to safety challenges posed by other technologies. Namely:
The author concludes that ensuring the practical PS-alignment of APS systems could well be difficult.
This is the section of the report I find least convincing. Here are a few disagreements or differences in point of view, again likely in too much detail:
On the Instrumental Convergence Hypothesis:
I do believe that the traditional conceptual version of the “Instrumental Convergence Hypothesis” is true. Expressed in a formal way: It does seem to be the case that, for the vast majority of possible objective functions defined over sufficiently detailed world-models, the optimal policy would involve taking harmful power-granting actions.
I still don’t clearly see, though, why the alternative empirical “Instrumental Convergence Hypothesis” defined in the report would be true. It just does seem fairly easy to imagine competent planners that rate certain undesirable courses of action highly (and thus exhibit ‘misalignment’), but don’t rate bad power-seeking courses of action highly (and thus do not exhibit ‘misaligned power-seeking’). I don’t think this requires, as the report suggests, the system to “ignore” the consequences of harmful power-granting actions. The system just needs to rate courses of action that involve power-seeking negatively.
Here’s a toy scenario:
Non-violent door-user: Suppose you create a planning AI system, through some sort of development process that relies heavily on human feedback. Let’s suppose that you don’t want this AI system to kill its overseers, and you also don’t want it to walk through doors with the word “FIRE EXIT” written on them. You give a lot of negative feedback to different actions that involve harm to humans, or that are even vaguely in the vicinity of violence, and the result is that the system begins to evaluate violent courses of action very negatively. However, in the training environment, there is only a single door with the word “FIRE EXIT” written on it. The system begins to evaluate courses of action that involve walking through this door negatively, but doesn’t generalize to other fire exits. The system is placed into a new environment, and continues to refrain from violence, but does occasionally walk through “FIRE EXIT” doors.
The system in this scenario seems like it would qualify as misaligned, but not PS-misaligned. I don’t see, though, why this category of scenario is so implausible. The claim that “in general and by default” this sort of thing won’t happen just seems very strong.
JC: I think this is a useful type of example to keep in mind. One thing that’s going on here, it seems, is that the sense in which the system’s objectives are problematic is that they don’t reflect adequate weight on some constraint, unrelated to power-seeking, that the designers intended (e.g., “don’t go through Fire Exits”). But once we’re thinking of “not putting enough weight on not going through Fire Exits” as the “misaligned objective,” it does seem a lot less clear that this objective is the type of thing that gives rise to instrumental incentives to seek power. To me, this calls to mind some hazy contrast between objectives that have a more “consequentialist” flavor, and objectives that have a more “deontological” flavor, where the former seem intuitively more closely connected to power-seeking, but where the framing in the report (intentionally) encompasses both. This seems like an area worth further clarity on my part.
---
I also find it a little awkward to use the alternative version of the Instrumental Convergence Hypothesis to evaluate risk. The hypothesis says that if an APS system exhibits misaligned behavior, then it is extremely likely to exhibit misaligned power-seeking behavior. But the hypothesis doesn’t have any implications regarding the likelihood that an APS system will exhibit misaligned behavior. So, if we want to make use of this version of the Instrumental Convergence Hypothesis, we’re forced to ask two separate questions:
I personally find it easier to just directly ask “What is the likelihood that an APS system would be PS-misaligned?”
JC: It seems plausible to me that directly focusing on PS-misalignment is better. The traditional argument structure is something like "if the objectives are not exactly right (or if you don’t have some difficult-to-formulate corrigibility property), then you get power-seeking; here's why getting objectives exactly right/getting that property is hard." But I think this might well be unnecessarily circuitous, especially in light of the possibility of just focusing very hard on eliminating power-seeking tendencies in particular.
On the analogy with human evolution:
I still find the human evolution analogy only a little concerning.
It is true that, when humans engage in planning, the evaluation criteria we use do not directly correspond to evolutionary fitness. For example, if I am considering whether to move to another country, and I’m thinking through the implications, I don’t evaluate these implications by asking “How would the propagation of my genes be affected?”
However, I’m not sure this is an obvious reason to worry. Suppose that, instead, evolution had inevitably imbued humans with a single-minded drive to pass on their genes. It seems like that would be a reason to worry: it would seem to suggest that we might have very limited flexibility in shaping the objectives of AI systems. Seemingly, the closest analogy would be if AI systems -- if trained for long enough -- inevitably develop a single-minded drive to pass on their parameter values. The fact that humans use evaluation criteria other than “evolutionary fitness,” when planning, therefore seems like the more reassuring of the two options.
JC: I don’t think I’ve really understood what you’re getting at here. As I see it, the role of the evolution analogy, in the context of “problems with search,” is to point to empirical evidence that selecting for a system that performs well according to criteria C (e.g., passing on its genes) does not necessarily result in a system actively optimizing for performing well according to criteria C, and which hence will fail to perform well according to criteria C in certain circumstances where it had capabilities sufficient to do so (e.g., not donating sperm, even though doing so was an available option).
It seems like you’re thinking that if humans were actively trying to pass on our genes, this would imply something about the object-level goals that lots of different ML-analogous selection processes result in? E.g., that lots of ML-analogous selection processes will result in systems trying to do something akin to “pass on my traits”? I don’t see why it would imply this, though. Rather, it seems like the more natural inference would be something like “selecting for good performance on criteria C results in systems that actively optimize for criteria C” -- a fact that would seem comforting from the perspective of alignment, if we thought we could identify selection criteria sufficiently aligned with what we want. But the idea is that the evolution example is evidence against this comforting thought.
If we want to make extrapolations from human behavior to future AI system behavior, it also just does seem very important that violent, manipulative, and otherwise power-granting behaviors often received positive reinforcement in human evolutionary history. If I remember correctly: High-end estimates of historical rates of violent death imply that the average person who passed on their genes may have committed something like .25 murders. One book on the evolution of general intelligence, David Geary’s Origin of Mind, describes human evolutionary history as “primarily a struggle with other human beings for control of the resources that support life and allow one to reproduce” -- which makes it natural to conceptualize a large portion of human behavior “in terms of an evolved motivation to control” (pg. 3). The selection pressures associated with the development and propagation of ML systems, for use by humans, just seem to be very different. Unlike in human evolutionary history, horrifying power-granting actions (like opportunistic murder) would virtually never be positively reinforced. Likewise, unlike in human evolutionary history, power-ceding actions (like allowing a particular person to kill/deactivate you) could fairly regularly be positively reinforced for AI systems.
Therefore, I’m not inclined to update much on the report’s observation that humans display power-seeking behavior that is “quite compatible with the instrumental convergence hypothesis.” Although the report does mention this objection, I don’t think it’s given nearly enough weight. Regardless of whether or not the report’s key arguments are correct, it would be strange and confusing if humans consistently refrained from aggressive power-seeking behavior.
At the same time, in human evolutionary history, power-seeking behavior was also often punished -- for instance, through retaliatory violence or ostracism. In this sense, the feedback was fairly mixed. I also do find it reassuring, then, that this kind of mixed feedback was still sufficient to cause many people to treat various power-granting behaviors as close to off-limits. As an unfortunately disgusting example, suppose that it became clear to you that you could achieve some cherished objective by disemboweling an acquaintance with your bare hands and teeth. Even though your evolutionary history contains many acts of horrifying violence, your reaction is still likely to be: “Well, I’m not going to do that.” If your history contained no acts of horrifying violence, because acts of horrifying violence were consistently negatively reinforced, then your reaction would presumably be even stronger and more reliable.
I find it pretty easy to imagine that development processes could imbue AI systems with that same kind of reaction (“Well, I’m not going to do that”) when they consider a wide variety of unwanted power-granting actions. For example, while planning, a system might simulate the act of killing its user or deactivating its own off-switch, then rate this course of action very poorly -- almost regardless of any other predicted consequences of the action.
The report does have a footnote related to this point, disputing the idea that AI systems might simply never consider actions like murder. My main suggestion, though, is that AI systems might very consistently rate actions like murder poorly when they do consider these actions. The idea that they might almost completely stop considering actions like murder is mostly a follow-on point: the more poorly a kind of action has been rated before, the less frequently it should be considered in planning processes.
JC: I'm inclined to view this as a case where non-murder is playing an important role in the objectives themselves. Seems like one of your main points is something like: "maybe it's pretty easy to make their objectives such that they don't like bad forms of power-seeking, even if the objectives are otherwise imperfect/problematic -- and to still have the systems be useful in the ways we want (at least, for long enough until we figure out better solutions)." And I basically agree that maybe this works (and it’s one of my main sources of optimism): I think it depends on how effectively you’re able to operationalize “no power-seeking”, and whether what you get via reinforcement according to this operationalization is something deep and robust enough to keep working as you scale up the system’s capabilities.
My sense is that some people are pessimistic about this type of thing for reasons I don’t fully understand: e.g., perhaps that more “deontological” flavored objectives, which put more focus on *how* you’re achieving outcomes rather than the fact that they get achieved at all, are in some sense “unnatural”, less effective (see e.g. here), and less likely to withstand scaleup in capabilities. There’s also a general worry that you will get a “nearest unblocked neighbor” problem with respect to your operationalization of power-seeking (e.g., if you said/reinforced no murder, it ends up cryonicizing you or something); and that there are various OK forms of power-seeking (e.g., surviving, learning new useful information, etc), that we probably do want our AI systems to engage in, such that you can’t just reinforce a blanket aversion to anything that seems potentially power-seeking-ish – rather, you need more fine-grained tools. But overall I think this is an aspect of the dialectic worth more attention.
Where does risk come from?
I’m not, in general, convinced that severely PS-misaligned behavior would be very hard to avoid.
I think one way to get clarity here is to start out with a simple case, where PS-misaligned behavior is clearly feasible to avoid, then see how much adding complexity changes the situation.
Toy car case: Suppose there’s an autonomous toy car. In its first timestep, it can either turn its engine on or keep its engine off. Then, for the next several timesteps, if the car turned its engine on in the first timestep, it can increase or decrease its velocity by one of several increments. The car does not have planning capabilities.
In this case, the action “turn your engine on” is highly instrumentally convergent: it increases the number of options the system has, regarding its future trajectory in the world, by many orders of magnitude. For the vast majority of possible utility functions, defined over the car’s trajectory, the optimal policy involves turning the engine on: there is only one possible trajectory that involves keeping the car’s engine off.
But it seems pretty easy to design a learning process that would result in the car keeping its engine off, if we’d like the engine to stay off. If we were to set up a process that involves learning from human feedback, and humans consistently give negative feedback when the car turns its engine on, it seems like we can quickly end up with a car that just keeps its engine off. Misaligned ‘power-seeking’ behavior is easy to avoid.
Furthermore, increasing the level of instrumental convergence (i.e. the amount of power that can be gained by turning the engine on) does not seem to increase the difficulty of creating a car that keeps its engine off. If we modify the car’s action space so that it can increase its speed by extremely fine-grained increments, when its engine is on, then this will massively increase the number of distinct trajectories that involve turning the engine on. Nonetheless, if we want to assess the difficulty of preventing the car from turning its engine on, this increase in the level of instrumental convergence seems irrelevant. Training the car to keep its engine off is just as easy as it was before.
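As a quick illustration, here is a minimal sketch (hypothetical setup and names, not from the report) of a simple feedback-driven learner in this kind of case: it converges on keeping the engine off almost immediately, and the size of the action space behind the “engine on” branch never enters the update.

```python
import random

# Minimal sketch (hypothetical setup): a bandit-style learner whose
# first-timestep action is "engine_on" or "engine_off", trained purely from
# human feedback that penalizes turning the engine on. Note that the number of
# trajectories available *after* the engine is on (the degree of "instrumental
# convergence") plays no role in the update.

feedback = {"engine_on": -1.0, "engine_off": 0.0}   # human judges penalize "on"
value = {"engine_on": 0.0, "engine_off": 0.0}       # learned action values
learning_rate, epsilon = 0.5, 0.1

for step in range(100):
    if random.random() < epsilon:                    # occasional exploration
        action = random.choice(list(value))
    else:
        action = max(value, key=value.get)
    value[action] += learning_rate * (feedback[action] - value[action])

print(max(value, key=value.get))                     # -> "engine_off"
```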
The toy car case is essentially the simplest possible case we can imagine. It’s possible, therefore, that instrumental convergence is essentially irrelevant in this case, but would be extremely relevant in more complex cases or cases where the AI system has planning capabilities.
JC: I think it’s useful regardless.
One complexity-focused story, which I believe is suggested by the “problems with search” section of the report, might be something like this: In the car case, there is one specific instrumentally convergent action that we do not want the agent to take. Generalizing from human feedback is therefore very easy. However, suppose that there is a large and diverse set of instrumentally convergent actions that we do not want the agent to take. Then the system might not generalize well from human feedback, which has only directly concerned the small subset of badly unwanted behaviors it has engaged in before, and it might therefore still engage in some badly unwanted behaviors even if it’s trained for a long time.
To tweak an example raised in the “problems with search” section: Suppose that you want to make sure that your AI system does not walk through doors with the word “EXIT” written on them. You train it in an environment that contains a number of doors with the word “EXIT” written on them, always in green paint, and consistently give negative feedback if the system even comes close to walking through them. Now you put the system in a new environment, where the word “EXIT” is written in red paint. The available data doesn’t include any examples of walking through this specific kind of door: the system has only ever been penalized for walking through doors where “EXIT” is written in green paint. It seems like there’s some chance, then, the training process would have produced a policy that only responds to “EXIT” written in green. Hence, due to the need to generalize, there’s some risk that the system will violate your preferences in this new environment.
I’m not sure I buy the complexity story, though. It seems like you’d need an extremely large generalization failure for feedback that consistently penalizes any form of coercive or violent behavior to result in (e.g.) world-conquering behavior.
It also seems important that only an extremely small portion of policies involve successful world-conquering behavior: this seems to suggest that pure randomness in the parameter values that result from a training process, due to the limitations of the sample of experiences used in the training process, is unlikely to be a good explanation for the emergence of world-conquering behavior. It seems like we need to think that, insofar as feedback is insufficient to pick out a particular set of parameter values, there will be an extremely strong bias toward parameter values that imply world-conquering behavior. I don’t think I see an obvious reason to suggest this bias, though.
An alternative story, which I believe is suggested by the discussion in the “problems with proxies” section, is that human feedback is probably unusually well-calibrated in the toy car case: the human judges never give positive feedback to undesirable power-granting behavior. In certain other cases, on the other hand, humans may give positive feedback to very bad behavior, because of their limited ability to distinguish good and bad behavior. This feedback might then push the system to engage in increasingly bad behavior.
The report gives the grasping task case as a concrete example: Human judges gave positive feedback to an AI system that merely looked like it was grasping an object, rather than actually grasping the object (as the judges wanted). Positive feedback for this behavior reinforced it. This case is of course fairly benign. However, conceivably, positive feedback for more sophisticated and seriously bad behaviors could eventually lead to the creation of systems that engage in behaviors like mass manipulation, mass murder, and so on.
I also don’t think I buy this story, though. For a feedback process to produce highly competent world-conquering behavior, for example, it seems like people would need to give a lot of positive feedback to increasingly terrible behavior. One way to frame this intuition: In policy-space, it seems like there is a very large gap between (e.g.) benign personal assistant behavior and successful world-conquering behavior. Intermediate policies that fill this gap include behaviors that look like highly incompetent attempts to take over the world, that involve lower-level acts of violence or manipulation, that involve displays of various skills that would never naturally come up in personal assistant work, etc. It seems like, for human feedback to push the policy across the gap, the people giving human feedback would need to be very consistently and very badly miscalibrated over a fairly extended period of time: it seems like we’d almost need to imagine that there’s hardly any correlation at all between good behavior and positive feedback. Overall, then, it feels like a stretch to extrapolate from the gripper case (where people briefly rewarded behavior that merely looked like gripping) to existentially significant cases.
JC: The possibility of giving lots of positive feedback to actively and visibly bad external behavior isn’t playing a very central role for me. I’m much more worried that we’ll give positive feedback to good-seeming external behavior that arises from bad internal cognition (e.g., scheming to get power later by behaving well now), and hence reinforce the bad internal cognition.
Ultimately, at least if we are utilizing human feedback, I find it hard to imagine that purely learning-based processes would accidentally result in ML systems that engage in behaviors like competent world-conquest.
We can ask, then, whether planning might be the key thing that makes a difference. I feel pretty unsure here, but I definitely have some sympathy for the idea that planning would substantially increase risk.
JC: Flagging that I again worry that your specific conception of the contrast between learning vs. planning is going to make a difference here. E.g., in my ontology it is possible to learn from trial-and-error-in-the-real-world how to plan, such that you then plan to engage in behavior that wasn’t itself reinforced at all.
Once we introduce planning, I do think that elements of the “problems with proxies” story become more compelling. A notable difference, here, is that it’s more plausible that large and deeply concerning movements in ‘policy space’ would be opaque to human engineers/overseers. If you want to gradient-descent your way to a policy that engages in competent world-conquering behavior, then this development process seems like it will need to involve lots of intermediate policies that engage in (e.g.) behavior that looks like incompetent conquest-attempts. If someone is watching the system try out intermediate policies in the real world, or in a highly legible simulation, then it seems like these concerning intermediate behaviors will probably be noticed and negatively reinforced. On the other hand, if the system is updating its behavior through planning, using highly non-interpretable simulations of the world, then it seems more plausible that concerning intermediate policies would go unnoticed by humans. This is especially true if humans don’t even realize the system is planning, because the relevant planning capabilities are emergent.
There is a further question of whether concerning intermediate behaviors would actually be reinforced within a planning process. I’m not sure how to think about this question, since it seems to depend to a large degree on the nature of the planning process. It again seems like it might be important, for example, whether the planning process is intentionally introduced by engineers or whether it’s entirely emergent. If it’s entirely emergent, then there might be an especially strong case for thinking that it would tend to reinforce behaviors that humans do not want: the report’s comments about potential biases toward simple or computationally non-intensive evaluation processes might become especially relevant.
Overall, though, I just feel fairly confused about how to think about the significance of planning. I can buy that planning capabilities might be associated with substantially higher levels of risk, but I think I’d still want to see analysis that goes substantially beyond what’s in the report.
JC: I found the general vibe of your comments here helpful. I’d gloss it as something like: “Assuming that you’re actively penalizing coercive/violent behavior, and if you’re training something like a personal-assistant, then if you ended up with a policy that nevertheless goes off and engages in coercive/violent world-conquering, it seems like you ended up with something pretty different and intuitively ‘far away in policy space’ from the thing you trained for. Remind me why we should expect this again?”
I do think the idea that the system in question is planning (in my sense) is important here (I’m always assuming that we’re working with APS systems that are planning in my sense). In particular, conditional on training an APS system at all, I think a key worry is that the training/reinforcement process first gives rise to a problematic objective that ultimately does motivate power-seeking (e.g., “learn a ton about science,” for training a science AI – see e.g. Ajeya’s description of “schemers”), and then once the system sees, via planning + strategic awareness, that hiding its power-seeking aspirations and behaving well now is a good instrumental strategy (e.g., “don’t kill people to learn more about science now, wait until they can’t stop me”), you get non-power-seeking behavior for the rest of your training process regardless (assuming your AI is competent enough to understand the behavior you’re looking for). But there’s still a question of whether you should expect to get this sort of power-seeking-ish objective in the first place, whether the incentives would in fact be set up in the way this story imagines, and whether we would fail to detect the relevant forms of deception. And more broadly, this does feel like a pretty specific kind of story, and I think it’s worth being clear about exactly how much work it’s ultimately doing in the overall threat model (personally, I would feel a lot better about the situation if I knew this kind of deceptively-good behavior was never going to occur).
More broadly, though, your comments here highlight for me the intuitive difference between thinking in terms of “policies” or “behavior,” vs. thinking in terms of “objectives.” That is, when we think in terms of “policies” or “behavior,” it doesn’t seem like there’s a lot of “pressure towards power-seeking”: it feels like you can push a policy, via reinforcement or whatever, in whatever direction you want, and “world-conquering” just seems like a particular type of policy/behavior that there isn’t strong reason to expect. Whereas once you’re thinking that you’re going to give the system some objectives that it will plan in pursuit of, it’s more intuitive why power-seeking would be some sort of default (e.g., because so many different objectives compatible with its observed performance are served by power-seeking).
A further thought on functionality/safety trade-offs
The report notes:
It’s generally easier to create technology that fulfills some function F, than to create technology that does F and meets a given standard of safety and reliability, for the simple reason that meeting the relevant standard is an additional property, requiring additional effort. Thus, it’s easier to build a plane that can fly at all than one that can fly safely and reliably in many conditions; easier to build an email client than a secure email client; easier to figure out how to cause a nuclear chain reaction than how to build a safe nuclear reactor; and so forth.
Of course, we often reach adequate safety standards in the end. But at the very least, we expect some safety problems along the way (plane crashes, compromised email accounts, etc). We might expect something similar with ensuring PS-aligned behavior from powerful AI agents.
In principle, I agree with this point. However, it also seems worth keeping in mind that some failure modes for technologies are much less natural than others.
For example, it’d be very easy to create an explosive pair of pants: just pour some gunpowder in the pockets. But exploding is not a natural failure mode for pants. No normal, sensible-seeming pants production process is going to result in an explosion.
I think it’s still an open question where exactly “a useful AI system that is very unlikely to kill everyone” falls on the spectrum between “a plane that is very unlikely to crash” and “a pair of pants that is very unlikely to explode.”
JC: I like this example. I do think a lot of the discourse here is motivated by something like "power-seeking is a very natural failure mode of APS-systems, because power-seeking is just a very natural thing for strategically aware agentic planners to do (because it will help them accomplish their objectives).” If that assumption (basically, the instrumental convergence thesis) is false, I think things are really looking up.
This section discusses why we might expect to see practically PS-misaligned systems actually get deployed. The author briefly discusses the possibility of unintentional deployment, then lays out various factors that might affect the beliefs and incentives of relevant actors in choosing to deploy a system that is in fact practically PS-misaligned (even if they don't know it). He focuses on four factors that seem especially concerning:
Overall, the author finds it plausible that if ensuring practical PS-alignment proves challenging, practically PS-misaligned systems could end up getting deployed.
I largely agree with the considerations laid out and think the discussion is clear. I especially like the genetically engineered chimp analogy, which I think helped to move me a bit on these questions.
On this point, my perspective is probably close enough to Joe’s that it isn’t worth writing another wall of words. In case Joe is interested, although it probably doesn’t add much, this is one unpublished piece of writing exploring similar considerations; I don’t remember if I shared it previously.
ALIGNMENT DIFFICULTY: By 2070, and conditional on TIMELINES and INCENTIVES above, it will be much harder to develop APS systems that would be practically PS-aligned if deployed, than to develop APS systems that would be practically PS-misaligned if deployed (even if relevant decision-makers don’t know this), but which are at least superficially attractive to deploy anyway.
15%
(Note: I’m a little unsure what counts as PS-misaligned. Although it seems like there are some instances of PS-misalignment that might be relatively mild or benign, I suppose I’m assuming some minimum level of seriousness.)
This section discusses whether or not we should expect the impact of deploying practically PS-misaligned APS systems to scale to the permanent disempowerment of ~all humans. It also discusses a few mechanisms relevant to the plausibility of this disempowerment. The author suggests that:
Overall, the author thinks that humans might well be able to correct PS-misalignment problems and prevent them from re-arising, but that doing so will likely require addressing one or more of the basic factors that gave rise to the issue in the first place: e.g., the difficulty of ensuring the practical PS-alignment of APS systems (especially in scalably competitive ways), the strong incentives to use/deploy such systems even if doing so risks practical PS-alignment failure, and the multiplicity of actors in a position to take such risks. This task, he thinks, could well prove difficult.
I find the discussion reasonably persuasive.
My biggest concern might be that I feel unsure how to imagine the accidental permanent disempowerment of humanity, outside of relatively dramatic scenarios (e.g. everyone being killed). Partly, this is a concern about knowing what counts as accidental permanent disempowerment.
Partly, it’s a concern about knowing what kinds of pathways to disempowerment make sense. For example: Is it plausible that disempowerment could be an extremely distributed and piecemeal thing, where (e.g.) some set of power-seeking systems removes a ‘unit’ of human agency in one year, then another set of power-seeking systems removes another ‘unit’ of human agency in another year, and so on? Or should we typically imagine disempowerment as a more binary state, that’s triggered by some particular disaster? I think these differences probably matter, when thinking about the feasibility of course correction, but I didn’t feel clear about the relative weight the report was giving to different sorts of scenarios.
JC: I’d like to be a bit clearer about this myself.
I do think that the kind of piecemeal disempowerment you’re pointing to can happen, but I feel more optimistic, in slow versions of those scenarios, about humans noticing what’s happening and coordinating to prevent it (though obviously, the coordination might fail).
In general, I tend to put the most weight on “some particular subset of the AI systems manage to get ~all the power via a ‘decisive strategic advantage’-type scenario,” and “there is a relatively short period – e.g., say, < five years – in which misaligned systems all over the world are gaining power in different ways, not through some coordinated rebellion or particular disaster, but just via different incentives and human disadvantages and failures to coordinate all pushing in that direction.”
HIGH-IMPACT FAILURES: By 2070, and conditional on TIMELINES, INCENTIVES, and ALIGNMENT DIFFICULTY above, some deployed APS systems will be exposed to inputs where they seek power in misaligned and high-impact ways (say, collectively causing >$1 trillion 2021-dollars of damage).
50%
DISEMPOWERMENT: By 2070, and conditional on TIMELINES, INCENTIVES, ALIGNMENT DIFFICULTY, and HIGH-IMPACT FAILURES above, some of this misaligned power-seeking will scale (in aggregate) to the point of permanently disempowering ~all of humanity.
30%
The author briefly discusses the possibility that the permanent and unintentional disempowerment ~all humans would not constitute an existential catastrophe (e.g., an event that drastically reduces the value of the trajectories along which earth-originating civilization could develop). He also emphasizes that some AI systems might have moral status, and that the right way forward may ultimately involve humans intentionally ceding power to them: the point in the present context is to avoid unintentional disempowerment of humans.
I largely agree with this section.
One consideration I would add -- which perhaps goes a bit beyond the point about game-theoretic equilibria -- is that competitive pressures might also produce very broad convergence in the properties of digital minds. If some kinds of AI systems outcompete others (or communities that use some kinds of AI systems outcompete others), then, so long as competitive pressures remain operative, it might be close to inevitable that future systems will have certain highly competitive traits. This kind of determinism in the evolution of AI systems might operate whether or not we work out alignment techniques.
JC: This seems like a helpful route to convergence to have in mind.
A second consideration is that PS-misaligned systems might still behave in ways that are compatible with a lot of current human values. For example, we can imagine a system that acts like it cares about human welfare, inequality, and so on, but also is more resistant to attempts to shut it down or modify it than we would like. As an analogy: Imagine a benign dictator who is actually pretty good in most ways, relative to our current values, except for the important fact that they are committed to being a dictator. If we think that the average person’s values will get worse over time, for instance due to random drift, then it might actually be preferable for the current generation to accidentally create these kinds of PS-misaligned systems.
(Note: I’m not sure whether this second scenario would constitute the “disempowerment” of humanity, though, if humanity is still around and the median person still has about as much individual freedom as the median person alive today. The concept of “disempowerment” here remains a bit fuzzy to me.)
JC: I’m inclined to say that it’s not the relevant kind of disempowerment if humans are still free to direct tons of resources in the ways they want to.
A third consideration is that humans might be very disinclined to disempower themselves, such that humanity will only get disempowered (relative to AI systems) if it is disempowered accidentally. It also seems like accidental disempowerment could conceivably be better than no disempowerment. For example: humans might ‘enslave’ AI systems in a morally objectionable way, humans might tend to constrain AI ‘population levels’ far below what would be morally ideal, or humans might be likely to create AI systems with actively bad conscious experiences. In this case, accidental disempowerment might actually be morally desirable.
CATASTROPHE: By 2070, and conditional on TIMELINES, INCENTIVES, ALIGNMENT DIFFICULTY, HIGH-IMPACT FAILURES, and DISEMPOWERMENT above, this disempowerment will constitute an existential catastrophe.
75%.
The author concludes by listing his own subjective, highly-unstable probabilities on each of these premises, along with a number of caveats about how these probabilities should be understood. The probabilities are:
By 2070:
In combination, these probabilities yield an overall estimate of ~5% chance of an existential catastrophe by 2070 from scenarios where all of 1-6 are true, which the author would adjust upwards to reflect power-seeking scenarios that don’t fit some of 1-6. The author also notes in a footnote that his “high-end” and “low-end” estimates vary considerably: from ~40% on the high end to ~0.1% on the low end.
Premise #3 seems to be the most important point of disagreement.
---
Also, a methodological note: I personally find it somewhat hard to evaluate premises 4 and 5 separately.
When I try to imagine an APS system causing $1 trillion in damage, I feel very unsure about what I should imagine: the destruction of physical infrastructure or inventory, a financial crash, a pandemic (e.g. lab leak), a policy that slows economic growth, a policy that causes substantial waste of resources, the dissolution of a highly valuable corporation, the instigation of costly political unrest or conflict, etc. I think different scenarios require forms of misalignment that would be more or less dramatic or frightening, from the perspective of someone worried about permanent disempowerment. I feel like I would rather just directly estimate the probability of permanent disempowerment and skip premise 4.
JC: I agree there’s some ambiguity here. And there are also conceivable scenarios where AI systems kind of “outgrow” human institutions without actually causing any “damage” (for example, if the AI systems just had their own economy that did very well, and then went to space much faster than humans did).
My answers imply a probability of .4%.
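(For reference, this is just the product of my answers to the six premises above: 0.30 × 0.80 × 0.15 × 0.50 × 0.30 × 0.75 ≈ 0.004, i.e. roughly 0.4%.)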
I’m really glad that Joe took the time to write this report. I think it adds clarity to many important questions and considerations and lays a strong foundation for further debate. In general, I’m extremely impressed by its depth, breadth, and even-handedness.
It’s the sort of thing that helps to reassure me that progress in debates around alignment is possible, even if this progress is inevitably going to be very gradual and a lot of the necessary work is going to be fairly painstaking. I think it’s fairly striking, for example, to compare this report to writing on alignment risk from several years ago.
JC: Very glad to hear it :) – your own work on analyzing AI risk arguments was an important inspiration, and your feedback on the earlier draft was very helpful.
Yes.
Yes.
[1] One way of defining these subjective probabilities is via preferences over lotteries. On such a definition, “I think it less than 10% probable that Kanye West will be the next president” means that if I had the option to win $10,000 if Kanye West is the next president, or the option to win $10,000 if a ten-sided die comes up “1”, I would choose the latter option. See this blog post by Andrew Critch for more informal discussion; and see Muehlhauser (2017a), section 2, for discussion of some complexities involved in using these probabilities in practice.
[2] Here we understand “incentives” in a manner such that, if people will buy tables, and the only (or the most efficient) tables you can build are flammable, then there are incentives to build flammable tables, even if people would buy/prefer fire-resistant ones.
[3] Definitions: