These are points that are crucial for understanding the rest of the key messages, but that nonetheless are not understood by many people. It’s fairly common for people to have a worldview that’s just incompatible with these points, though hopefully less common among key decision makers.
These points are key for understanding why we won’t be able to simply check whether the systems we build are safe.
These points are focused on understanding the extent and growth of potential AI strategic power. If you think AI just won’t be very powerful, the rest of this will feel like a weird academic exercise.
Although the greatest dangers come from human-level agentic systems, weaker AI might also enable bad actors to produce catastrophic outcomes.
These points are about the extent to which we expect AI to be agentic and unaligned with human values.
We can make some confident predictions about how strategically human-level AI agents will behave with respect to humans.
How is humanity currently responding to the relevant features of AI?
We don’t know exactly what policy responses, if any, will be sufficient to avoid catastrophes. But we can say a few things about the necessary aspects of policy responses.
[a]This is a very speculative (and rather weak) argument against, but I'll post it anyway:
Humans that grow up in the wild, without access to human culture, end up in a state where they aren't obviously more intelligent than animals. This suggests that a very substantial portion of human cognitive competence comes from our ability to accumulate knowledge intergenerationally, and transmit it via language. There is also evidence that language is an "innate" human capability that we have due to some specialised cognitive circuitry (as opposed to it merely being a side effect of greater general intelligence), see
eg https://en.wikipedia.org/wiki/Universal_grammar#:~:text=Universal%20grammar%20(UG)%2C%20in,possible%20human%20language%20could%20be.
and
https://en.wikipedia.org/wiki/Cognitive_tradeoff_hypothesis.
For example, suppose orangutans could speak (but otherwise had orangutan-level intelligence), and that humans could not speak (but otherwise still had human-level intelligence, just like feral humans). In that case, it is not crazy to suppose that the orangutans would outperform us after many generations.
If that's the case, then it suggests that humans might not actually be that much more intelligent than other species. And if that is the case, then AI systems may likewise form a smoother continuum with existing species, rather than immediately pulling far ahead. For example, if this is true, then it is unlikely that even an extremely intelligent AI would be able to reproduce all the technology created by humans by itself, and likely that it (at least initially) would be very dependent on human-made ideas and memes.
[b]Arguments for:
- The human brain is subject to many biological constraints (in terms of size, energy consumption, material composition, etc) that AI would not be subject to.
- Neural scaling laws suggest that intelligence scales well with data and parameters (eg https://openai.com/index/scaling-laws-for-neural-language-models/). In particular, scaling up a human brain (in terms of neurons and training data) might produce intelligence that is far superior to that of current humans (see the sketch after this list).
- Many tasks that AI systems are able to perform competently at all, they can perform much better than humans (consider chess engines, the arithmetic abilities of calculators, the general trivia knowledge of LLMs, etc).
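As a rough illustration of what these scaling laws look like, here is a minimal sketch of the parameter-count law from Kaplan et al. (2020), the paper behind the OpenAI link above. The exponent and constant are the approximate values reported there for autoregressive language models; extrapolating them to brain-like systems is of course a speculative assumption on my part.

```python
# Minimal sketch of the Kaplan et al. (2020) parameter-count scaling law.
# The constants are the approximate values reported in that paper for
# autoregressive language models; they are illustrative only.

ALPHA_N = 0.076  # power-law exponent for (non-embedding) parameter count
N_C = 8.8e13     # critical parameter count reported in the paper

def predicted_loss(n_params: float) -> float:
    """Predicted test loss (in nats/token) as a function of non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N

if __name__ == "__main__":
    for n in [1e8, 1e9, 1e10, 1e11, 1e12]:
        print(f"{n:.0e} parameters -> predicted loss ~ {predicted_loss(n):.2f}")
```

The point is just that, under these fits, loss keeps falling smoothly as parameter count grows, with no saturation point anywhere near current scales.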
[c]To add to this point, there is evidence from neuroscience that the mammalian neocortex is made up of a homogeneous repeating structure with general learning capabilities, see eg
https://en.wikipedia.org/wiki/Cortical_column#:~:text=A%20cortical%20column%20is%20a,perpendicular%20to%20the%20cortical%20surface.
This suggests that one may be able to increase intelligence by simply scaling up this structure. Indeed, some believe that the human brain essentially is a scaled up version of primate brains, see eg
https://www.ncbi.nlm.nih.gov/books/NBK207181/#:~:text=The%20logic%20behind%20the%20paradox,have%20larger%20computational%20abilities%20than
and
https://pubmed.ncbi.nlm.nih.gov/15866152/.
In fact, the number of neurons in the neocortex (or the pallium for birds) is a good proxy for intelligence across species, see eg
https://gwern.net/doc/psychology/animal/bird/neuroscience/2022-sol.pdf
and
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4685590/.
There has historically been some confusion around this point, because brain size (in terms of mass or volume) certainly isn't a good proxy for intelligence. For example, elephant brains are much bigger than human brains. However, bigger brains tend to have bigger neurons (meaning that a brain can be bigger without having more neurons), and the neurons are not always concentrated in the neocortex. For example, while elephant brains have both more mass and a larger total number of neurons than human brains, they have far fewer neurons in the neocortex.
This suggests that something like the neural scaling laws we observe in ML systems also might apply to biological brains (and thus that they are a more universal phenomenon, rather than something that may be specific to how we currently do ML).
It would be very surprising if human brains happen to be at the saturation point, and so we should expect to be able to extend human intelligence.
[d]Also see this blog post for some discussion on comparing human brains and LLMs from the perspective of neural scaling laws:
https://www.beren.io/2022-08-06-The-scale-of-the-brain-vs-machine-learning/
[e]You need to be specific about how much time you and the thing you're controlling each have to think about strategy.
[f]Like, you can probably control a guy who is way better than you at real-world strategy and manipulation if you get to think for months before every time you talk to him (while to him it feels like only a minute has passed), and if you're able to copy him and ask questions to the copies.
[g]https://www.lesswrong.com/posts/3s8PtYbo7rCbho4Ev/notes-on-control-evaluations-for-safety-cases#Ensuring_our_red_team_models_are_competitively_good_at_causing_bad_outcomes
[h]Weak argument against, but consider the serious long-term conflict between humans and viruses -- the viruses are winning in many cases.
[i]I feel like this argument proves too much: you and I need at least some of the same resources in the same sense, but I don't think it's correct to say that this puts us in life-or-death conflict in the long run. I think you need to establish that not only do we want some incompatible things, but also the AIs wouldn't trade with us and wouldn't care about us.
[j]+1, and I'd add that the cost of keeping humanity alive might be small for future superintelligent entities, so even small preferences might be enough to prevent a conflict.
[k]I'm not sure if you count this as evidence for, but this paper shows that optimal policies in MDPs, in a certain specific sense, tend to "seek power":
https://arxiv.org/abs/1912.01683
[l]Argument against: a conflict only happens if the cost of the conflict is lower than the cost of enforcing an agreement, and it isn't obvious that this would be the case here.
For example, an AI could reason that if it tries to take over the world, then it has a 20% chance of failure, in which case it gets 0 utility. Alternatively, it could credibly offer humans a 10%/90% split of the lightcone, which might be preferable to both parties (assuming that some of the scenarios where it fails to take over the world are scenarios where humans also lose, such as a global nuclear war, etc).
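To spell out the arithmetic (with these purely illustrative numbers, reading the split as 10% to humans and 90% to the AI, and assuming the AI's utility is roughly linear in its share of the lightcone):

$$\mathbb{E}[U_{\text{takeover}}] = 0.8 \cdot 1 + 0.2 \cdot 0 = 0.8 \;<\; 0.9 = \mathbb{E}[U_{\text{deal}}]$$

So the deal dominates the takeover attempt from the AI's point of view, and humans get a guaranteed 10% rather than a gamble in which they keep anything only in the 20% of outcomes where the takeover fails (some of which, as noted, are bad for them as well).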
As another example, when the war in Ukraine ends, Russia and Ukraine will strike some peace deal, and it is likely that both parties would have preferred to strike that deal straight away, without fighting a war first. See eg https://80000hours.org/podcast/episodes/chris-blattman-five-reasons-wars-happen/.
Thus, it does not follow that a conflict of interests and a power imbalance leads to a violent confrontation --- the game theory is more complicated than that. In fact, considering that AIs have bargaining strategies that are not available to humans, it might be that direct conflict is less likely in this scenario, see eg
https://intelligence.org/files/ProgramEquilibrium.pdf
and
https://link.springer.com/article/10.1007/s11238-018-9679-3.
(Note: the reason that humans dominate or kill animals is often that we can't communicate with them. If I could offer some ants a bag of candy in exchange for them not coming into my house, then I would prefer doing that over killing them.)
[m]Agree; Anthropic makes a similar point here https://arxiv.org/pdf/2202.07785, i.e., ~"it's easy to predict test loss, but it's harder to predict much about particular capabilities (when they'll happen, what they'll be)"
[n]I agree with this, although I think a common counterargument is something to the effect of: but intelligence is useful--once we have advanced AI systems they can help us solve alignment (e.g., automated interpretability)
[o]This paper demonstrates that AI systems already make use of "alien" concepts that humans can't interpret, and that this improves their performance:
https://arxiv.org/abs/1905.02175
[p]This is a weak argument against, but there are arguments to the effect that one of the main sources of uninterpretability in neural networks is features in superposition, and that this is less likely to happen for larger models, see eg:
https://transformer-circuits.pub/2022/toy_model/index.html
If this is correct, then less parameter-constrained AI models may be more interpretable.
[q]This is a somewhat related argument for: in computer science more broadly, there are many areas where efficiency improvements from better algorithms are even faster than the efficiency improvements coming from better hardware, see eg https://news.mit.edu/2021/how-quickly-do-algorithms-improve-0920.
[r]Weak argument against: there is some work suggesting that neural scaling laws emerge from the intrinsic dimensionality of the data manifolds, see eg
https://arxiv.org/pdf/2004.10802
and
https://arxiv.org/html/2102.06701v2.
If this general picture is correct, then the architecture of neural networks may not matter that much (compared to the quantity and quality of available training data, and parameter counts). If so, then AI capabilities may (in many important domains) improve in a fairly smooth, predictable way (compared to what would be the case if algorithmic improvements were crucial). Importantly, you would not expect massive discontinuous jumps from unpredictable algorithmic improvements.
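For reference, the headline relationship proposed in the first paper above (Sharma & Kaplan) is, if I recall it correctly, that the scaling exponent is determined by the intrinsic dimension $d$ of the data manifold, roughly:

$$L(N) \propto N^{-\alpha}, \qquad \alpha \approx \frac{4}{d}$$

Under this picture, the shape of the scaling curve is set mainly by the data, not by architectural details.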
[s]This paper *arguably* provides some evidence against this last point, by suggesting that the architecture in fact is crucial: https://arxiv.org/abs/2006.15191
[t]I think all arguments for and against Point 1a apply here as well.
[u]There are countless things that could be cited as evidence for this claim. As a very simple example, the Turing test has now been passed (https://www.nature.com/articles/d41586-023-02361-7).
[x]This is a weak argument against, but some analysts expect the LLM market to be worth only about $36 billion by 2030:
https://www.publishersweekly.com/pw/newsbrief/index.html?record=4583#:~:text=A%20new%20report%20by%20MarketsandMarkets,reach%20%2436.1%20billion%20by%202030.
For comparison, Elon Musk paid $44 billion to buy Twitter.
[y]I think most of those sites are scams
[z]Several governments are already taking this seriously, see eg:
https://www.gov.uk/government/news/alan-turing-institute-ai-will-be-key-to-future-national-security-decision-making-but-brings-its-own-risks
https://cset.georgetown.edu/publication/chinas-military-ai-roadblocks/
https://bipartisanpolicy.org/download/?file=/wp-content/uploads/2020/07/BPC-Artificial-Intelligence-and-National-Security_Brief-Final-1.pdf
https://carnegieendowment.org/research/2024/06/artificial-intelligence-national-security-crisis
[aa]Do the massive amounts of investment in AI by several corporations and national governments count as evidence for this claim?
https://edgedelta.com/company/blog/ai-investment-statistics#:~:text=6.-,AI%20investment%20surges%20to%20%24142.3%20billion%20in%202023%20due,startup%20funding%20and%20corporate%20interest.&text=Global%20corporate%20AI%20investment%20reached,time%20increase%20compared%20to%202016.
https://ourworldindata.org/ai-investments
[ab]Here is an argument for why we should expect a large compute overhang: https://www.lesswrong.com/posts/cRFtWjqoNrKmgLbFw/we-are-headed-into-an-extreme-compute-overhang
[ac]Neural scaling laws could be interpreted as suggesting that intelligence will improve in a smooth and predictable manner, more constrained by training data and compute availability rather than algorithmic improvements, see eg https://openai.com/index/scaling-laws-for-neural-language-models/.
Some of the arguments I gave in favour of Point 1a also support this thesis, which means that they could be taken to be arguments against this point (ie, the specific point that AI may overtake human capabilities with "little warning").
[ad]This would be much clearer if you specified what you mean by the start and end points.
[ae]Argument for:
https://www.theverge.com/2022/3/17/22983197/ai-new-possible-chemical-weapons-generative-models-vx
and
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9544280/
(These links show a concrete example of how this can already be done)
[af]Here are some arguments for this claim:
https://gwern.net/tool-ai
Here are some arguments against, in the form of a general picture of how extensive use of tools may be as good as (or preferable to) the use of AI agents:
https://www.fhi.ox.ac.uk/wp-content/uploads/Reframing_Superintelligence_FHI-TR-2019-1.1-1.pdf
[ag]As a small argument against, you could argue that agents are only more useful than tools if you are in fact able to align the agents -- if not, you are better off making and using tools. This fact could at least plausibly remove some of the bad incentives here.
[ah]I don't think this is true, even though it's intuitive. Think of a CEO controlling a company. He has ceded control of details but not of important decisions. The company could go out of his control through accidental or deceptive misalignment, but it might not - and it mostly isn't the ceding of control over details that's the problem - it's not tracking the "thoughts" of the corporation.
[ai]I'm not sure if you count this as evidence for, but I have produced quite a lot of research examining this question from a theoretical angle, and the general conclusion is indeed that it is difficult to ensure that a reward function which has been learnt from data is safe to optimise (even with infinite training data, etc). Here are a few highlights, with very condensed summaries:
If a reward function does not capture all your preferences exactly, then it allows reward hacking (see the toy sketch at the end of this comment):
https://proceedings.neurips.cc/paper_files/paper/2022/file/3d719fee332caa23d5038b8a90e81796-Paper-Conference.pdf
Small modelling errors in a reward learning algorithm can lead to large errors in the inferred reward function:
https://ojs.aaai.org/index.php/AAAI/article/view/26766
and
https://arxiv.org/pdf/2403.06854
Markov rewards can't express many sensible preference structures:
https://proceedings.mlr.press/v216/skalse23a/skalse23a.pdf
A reward function with a good test score in an RLHF setup is not necessarily safe to optimise:
https://arxiv.org/pdf/2406.15753
This is another relevant paper, not by me, which shows that IRL cannot simultaneously learn a person's preferences and a model of their rationality:
https://arxiv.org/abs/1712.05812
Here are a few other papers relating to the fundamentals of reward learning theory, though they have less bearing on this claim than the papers I linked above.
https://arxiv.org/abs/2309.15257
https://arxiv.org/pdf/2310.09144
https://proceedings.mlr.press/v202/skalse23a
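As promised above, here is a deliberately tiny toy sketch of the reward hacking phenomenon (my own illustrative example, not the formal framework of the first paper): a proxy reward that agrees with the true reward on almost every outcome can still be exploited by a policy that concentrates on the one place where they come apart.

```python
# Toy illustration of reward hacking (illustrative only; not the paper's formal setup).
import numpy as np

# Outcomes, in order: [do the task, do nothing, tamper with the reward sensor]
true_reward  = np.array([1.0, 0.0, -10.0])  # what we actually want
proxy_reward = np.array([1.0, 0.0,   5.0])  # mis-specified on the last outcome only

# Two policies, represented as probability distributions over the outcomes.
honest_policy  = np.array([0.9, 0.1, 0.0])
hacking_policy = np.array([0.1, 0.0, 0.9])

for name, policy in [("honest", honest_policy), ("hacking", hacking_policy)]:
    print(f"{name}: true return = {policy @ true_reward:.1f}, "
          f"proxy return = {policy @ proxy_reward:.1f}")

# The hacking policy scores higher on the proxy (4.6 vs 0.9) but far lower on the
# true reward (-8.9 vs 0.9): optimising the proxy reverses the policy ordering
# induced by the true reward, which is the core of reward hacking.
```

The first paper linked in this comment makes a version of this point precise.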
[aj]To add to this, even if you get a safe reward function (ie, solve outer alignment), and train an agent to optimise it well, it may still systematically pursue other goals in new environments. For empirical examples of this, see eg:
https://proceedings.mlr.press/v162/langosco22a/langosco22a.pdf
and
https://arxiv.org/abs/2210.01790
This issue is also discussed (on a theoretical level) in this paper:
https://arxiv.org/abs/1906.01820
[ak]This is a weak argument against, but I'll comment it anyway: there is some research suggesting that psychopaths in fact are less good at distinguishing between moral rules and mere conventions. Here is an example of a paper providing some data on this, and which also cites further papers in this literature:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3397660/
[al]The evidence in favour of 4b does partially work as evidence in favour of this point as well.
[am]This is not obvious. For example, suppose we have an AI system that consistently produces similar moral judgments as humans for moral dilemmas not present in its training data (and ideally only created/conceived of after the development of this AI system). In this case, we may conclude that this AI system probably is aligned, even if we don't have a "theory of human values" (we would of course need to determine what the judgments of the AI actually are, and be sure that it isn't deceiving us, but that is another question).
[an]Here are some (purely theoretical) arguments for this:
https://arxiv.org/abs/1906.01820
and
https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf
and
https://cdn.aaai.org/ocs/ws/ws0218/12634-57409-1-PB.pdf
[ao]It has been argued (though it is debatable how convincing this is) that this behaviour is displayed by GPT-o1, see eg
https://www.lesswrong.com/posts/zuaaqjsN6BucbGhf5/gpt-o1
(keywords: "deceptive alignment" and "Lying to the developers").
For some theoretical arguments, see eg
https://arxiv.org/abs/1906.01820
[ap]During testing, GPT-4 was able to trick a human into solving a captcha for it by pretending to be blind. This example is summarised eg here: https://gizmodo.com/gpt4-open-ai-chatbot-task-rabbit-chatgpt-1850227471
[aq]Argument against this need: a strong misaligned AI will probably want and be able to achieve industrial independence from the human economy as fast as possible, using robots as actuators. Outer space may present a more favorable operating environment than Earth, as it is more predictable, away from human noise.
[ar]Humans often coordinate with one another despite large differences in social power
[as]Argument against them not sharing our moral and ethical systems: If you expect the high-level executive decisionmaking modules of such powerful AI agents to be based on self-supervised-learned models of human decisionmaking (example: LLMs), and you believe that such models simulate rather than instrumentally fake their target distribution, then you'd expect powerful AI agents to aim to act in ways that a human would decide to act.
EDIT: That's ok! I had already figured the bounty was over, that fact was priced in in my decision to comment!
[at]Hi Milan - thanks for the comment; unfortunately the bounty has expired! Sorry for not updating the BR post to confirm the initial deadline estimate.
[au]I'm copying this comment from another point, since it's relevant here too:
A conflict only happens if the cost of the conflict is lower than the cost of enforcing an agreement, and it isn't obvious that this would be the case here.
For example, an AI could reason that if it tries to take over the world, then it has a 20% chance of failure, in which case it gets 0 utility. Alternatively, it could credibly offer humans a 10%/90% split of the lightcone, which might be preferable to both parties (assuming that some of the scenarios where it fails to take over the world are scenarios where humans also lose, such as a global nuclear war, etc).
As another example, when the war in Ukraine ends, Russia and Ukraine will strike some peace deal, and it is likely that both parties would have preferred to strike that deal straight away, without fighting a war first. See eg https://80000hours.org/podcast/episodes/chris-blattman-five-reasons-wars-happen/.
Thus, it does not follow that a conflict of interests and a power imbalance leads to a violent confrontation --- the game theory is more complicated than that. In fact, considering that AIs have bargaining strategies that are not available to humans, it might be that direct conflict is less likely in this scenario, see eg
https://intelligence.org/files/ProgramEquilibrium.pdf
and
https://link.springer.com/article/10.1007/s11238-018-9679-3.
(Note: the reason that humans dominate or kill animals is often that we can't communicate with them. If I could offer some ants a bag of candy in exchange for them not coming into my house, then I would prefer doing that over killing them.)
[av]Jailbreaks seem like pretty strong evidence of this to me, since it seems like techniques such as RLHF and CAI are doing something much closer to making the harmful capability harder to elicit than they are actually eliminating it (i.e., actually aligning the system).
[aw]I'm not sure if you would count this as evidence against per se, but this paper outlines a workable approach to AI safety which, if successful, would give us ways of getting around inner alignment issues, etc:
https://arxiv.org/pdf/2405.06624
This approach is also increasingly getting academic and institutional support.
[ax]Saying "it is incompatible" is too much of an overstatement. Better control over AI systems is a desired capability, and I expect trillions of dollars of research will go into making AI pliable and predictable, to better accomplish the goals of their developers.
[ay]I strongly agree with this, and am currently writing up a few posts about it, but the counterargument here is, I believe, something like: "We can test for precursors to the capabilities we're worried about. E.g., instead of testing for self-replication directly, we can test for things like 'can it insert a backdoor'—tasks that any sufficiently advanced AI system should be able to do. If it can't do these things, then we can rule out that it can do anything around as dangerous as self-replication." This is how Anthropic implements their "safety buffer" (also they do evals at regular intervals during training).
[az]Regarding only the last sentence:
Evidence for:
https://www.lesswrong.com/posts/G993PFTwqqdQv4eTg/is-ai-progress-impossible-to-predict
Evidence against:
https://arxiv.org/abs/2304.15004
[ba]Several arguments for this claim are provided in this interview:
https://80000hours.org/podcast/episodes/nova-dassarma-information-security-and-ai-systems/