These are points that are crucial for understanding the rest of the key messages, but that nonetheless are not understood by many people. It’s fairly common for people to have a worldview that’s just incompatible with these points, though hopefully less common among key decision makers.
These points are key for understanding why we won’t be able to simply check whether the systems we build are safe.
These points are focused on understanding the extent and growth of potential AI strategic power. If you think AI just won’t be very powerful, the rest of this will feel like a weird academic exercise.
Although the greatest dangers come from human-level agentic systems, weaker AI might also enable bad actors to produce catastrophic outcomes.
These points are about the extent to which we expect AI to be agentic and unaligned with human values.
We can make some confident predictions about how strategically human-level AI agents will behave with respect to humans.
How is humanity currently responding to the relevant features of AI?
We don’t know exactly what policy responses, if any, will be sufficient to avoid catastrophes. But we can say a few things about the necessary aspects of policy responses.
[a]This is a very speculative (and rather weak) argument against, but I'll post it anyway:
Humans that grow up in the wild, without access to human culture, end up in a state where they aren't obviously more intelligent than animals. This suggests that a very substantial portion of human cognitive competence comes from our ability to accumulate knowledge intergenerationally, and transmit it via language. There is also evidence that language is an "innate" human capability that we have due to some specialised cognitive circuitry (as opposed to it merely being a side effect of greater general intelligence), see
eg https://en.wikipedia.org/wiki/Universal_grammar#:~:text=Universal%20grammar%20(UG)%2C%20in,possible%20human%20language%20could%20be.
and
https://en.wikipedia.org/wiki/Cognitive_tradeoff_hypothesis.
For example, suppose orangutans could speak (but otherwise had orangutan-level intelligence), and that humans could not speak (but otherwise still had human-level intelligence, just like feral humans). In that case, it is not crazy to suppose that the orangutans would outperform us after many generations.
If that's the case, then it suggests that humans might not actually be that much more intelligent than other species. And if that is the case, then AI systems may likewise form a smoother continuum with existing species, rather than immediately pulling far ahead. For example, if this is true, then it is unlikely that even an extremely intelligent AI would be able to reproduce all the technology created by humans by itself, and likely that it (at least initially) would be very dependent on human-made ideas and memes.
[b]Arguments for:
- The human brain is subject to many biological constraints (in terms of size, energy consumption, material composition, etc) that AI would not be subject to.
- Neural scaling laws suggest that intelligence scales well with data and parameters (eg https://openai.com/index/scaling-laws-for-neural-language-models/). In particular, scaling up a human brain (in terms of neurons and training data) might produce intelligence that is far superior to that of current humans (see the sketch after this list).
- Many tasks that AI systems are able to perform competently at all, they can perform much better than humans (consider chess engines, the arithmetic abilities of calculators, the general trivia knowledge of LLMs, etc).
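As a rough illustration of what these scaling laws look like, here is a minimal sketch of the parameter-count law from Kaplan et al. (2020), the paper behind the OpenAI link above. The exponent and constant are the approximate values reported there for autoregressive language models; extrapolating them to brain-like systems is of course a speculative assumption on my part.

```python
# Minimal sketch of the Kaplan et al. (2020) parameter-count scaling law.
# The constants are the approximate values reported in that paper for
# autoregressive language models; they are illustrative only.

ALPHA_N = 0.076  # power-law exponent for (non-embedding) parameter count
N_C = 8.8e13     # critical parameter count reported in the paper

def predicted_loss(n_params: float) -> float:
    """Predicted test loss (in nats/token) as a function of non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N

if __name__ == "__main__":
    for n in [1e8, 1e9, 1e10, 1e11, 1e12]:
        print(f"{n:.0e} parameters -> predicted loss ~ {predicted_loss(n):.2f}")
```

The point is just that, under these fits, loss keeps falling smoothly as parameter count grows, with no saturation point anywhere near current scales.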
[c]To add to this point, there is evidence from neuroscience that the mammalian neocortex is made up of a homogeneous repeating structure with general learning capabilities, see eg
https://en.wikipedia.org/wiki/Cortical_column#:~:text=A%20cortical%20column%20is%20a,perpendicular%20to%20the%20cortical%20surface.
This suggests that one may be able to increase intelligence by simply scaling up this structure. Indeed, some believe that the human brain essentially is a scaled up version of primate brains, see eg
https://www.ncbi.nlm.nih.gov/books/NBK207181/#:~:text=The%20logic%20behind%20the%20paradox,have%20larger%20computational%20abilities%20than
and
https://pubmed.ncbi.nlm.nih.gov/15866152/.
In fact, the number of neurons in the neocortex (or the pallium for birds) is a good proxy for intelligence across species, see eg
https://gwern.net/doc/psychology/animal/bird/neuroscience/2022-sol.pdf
and
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4685590/.
There has historically been some confusion around this point, because brain size (in terms of mass or volume) certainly isn't a good proxy for intelligence. For example, elephant brains are much bigger than human brains. However, bigger brains tend to have bigger neurons (meaning that a brain can be bigger without having more neurons), and the neurons are not always concentrated in the neocortex. For example, while elephant brains have both more mass and a larger total number of neurons than human brains, they have far fewer neurons in the neocortex.
This suggests that something like the neural scaling laws we observe in ML systems also might apply to biological brains (and thus that they are a more universal phenomenon, rather than something that may be specific to how we currently do ML).
It would be very surprising if human brains happen to be at the saturation point, and so we should expect to be able to extend human intelligence.
[d]Also see this blog post for some discussion on comparing human brains and LLMs from the perspective of neural scaling laws:
https://www.beren.io/2022-08-06-The-scale-of-the-brain-vs-machine-learning/
[e]You need to be specific about how much time you and the thing you're controlling each have to think about strategy.
[f]Like, you can probably control a guy who is way better than you at real-world strategy and manipulation if you get to think for months before every time you talk to him (while to him it feels like only a minute has passed), and if you're able to copy him and ask questions to the copies.
[g]https://www.lesswrong.com/posts/3s8PtYbo7rCbho4Ev/notes-on-control-evaluations-for-safety-cases#Ensuring_our_red_team_models_are_competitively_good_at_causing_bad_outcomes
[h]Weak argument against, but consider the serious long-term conflict between humans and viruses -- the viruses are winning in many cases.
[i]I feel like this argument proves too much: you and I need at least some of the same resources in the same sense, but I don't think it's correct to say that this puts us in life-or-death conflict in the long run. I think you need to establish that not only do we want some incompatible things, but also the AIs wouldn't trade with us and wouldn't care about us.
[j]+1, and I'd add that the cost of keeping humanity alive might be small for future superintelligent entities, so even small preferences might be enough to prevent a conflict.
[k]I'm not sure if you count this as evidence for, but this paper shows that optimal policies in MDPs, in a certain specific sense, tend to "seek power":
https://arxiv.org/abs/1912.01683
[l]Argument against: a conflict only happens if the cost of the conflict is lower than the cost of enforcing an agreement, and it isn't obvious that this would be the case here.
For example, an AI could reason that if it tries to take over the world, then it has a 20% chance of failure, in which case it gets 0 utility. Alternatively, it could credibly offer humans a 10%/90% split of the lightcone, which might be preferable to both parties (assuming that some of the scenarios where it fails to take over the world are scenarios where humans also lose, such as a global nuclear war, etc).
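To spell out the arithmetic (with these purely illustrative numbers, reading the split as 10% to humans and 90% to the AI, and assuming the AI's utility is roughly linear in its share of the lightcone):

$$\mathbb{E}[U_{\text{takeover}}] = 0.8 \cdot 1 + 0.2 \cdot 0 = 0.8 \;<\; 0.9 = \mathbb{E}[U_{\text{deal}}]$$

So the deal dominates the takeover attempt from the AI's point of view, and humans get a guaranteed 10% rather than a gamble in which they keep anything only in the 20% of outcomes where the takeover fails (some of which, as noted, are bad for them as well).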
As another example, when the war in Ukraine ends, Russia and Ukraine will strike some peace deal, and it is likely that both parties would have preferred to strike that deal straight away, without fighting a war first. See eg https://80000hours.org/podcast/episodes/chris-blattman-five-reasons-wars-happen/.
Thus, it does not follow that a conflict of interests and a power imbalance leads to a violent confrontation --- the game theory is more complicated than that. In fact, considering that AIs have bargaining strategies that are not available to humans, it might be that direct conflict is less likely in this scenario, see eg
https://intelligence.org/files/ProgramEquilibrium.pdf
and
https://link.springer.com/article/10.1007/s11238-018-9679-3.
(Note: the reason that humans dominate or kill animals is often that we can't communicate with them. If I could offer some ants a bag of candy in exchange for them not coming into my house, then I would prefer doing that over killing them.)
[m]Agree; Anthropic makes a similar point here https://arxiv.org/pdf/2202.07785, i.e., ~"it's easy to predict test loss, but it's harder to predict much about particular capabilities (when they'll happen, what they'll be)"
[n]I agree with this, although I think a common counterargument is something to the effect of: but intelligence is useful--once we have advanced AI systems they can help us solve alignment (e.g., automated interpretability)
[o]This paper demonstrates that AI systems already make use of "alien" concepts that humans can't interpret, and that this improves their performance:
https://arxiv.org/abs/1905.02175
[p]This is a weak argument against, but there are arguments to the effect that one of the main sources of uninterpretability in neural networks is features in superposition, and that this is less likely to happen for larger models, see eg:
https://transformer-circuits.pub/2022/toy_model/index.html
If this is correct, then less parameter-constrained AI models may be more interpretable.
[q]This is a somewhat related argument for: in computer science more broadly, there are many areas where efficiency improvements from better algorithms are even faster than the efficiency improvements coming from better hardware, see eg https://news.mit.edu/2021/how-quickly-do-algorithms-improve-0920.
[r]Weak argument against: there is some work suggesting that neural scaling laws emerge from the intrinsic dimensionality of the data manifolds, see eg
https://arxiv.org/pdf/2004.10802
and
https://arxiv.org/html/2102.06701v2.
If this general picture is correct, then the architecture of neural networks may not matter that much (compared to the quantity and quality of available training data, and parameter counts). If so, then AI capabilities may (in many important domains) improve in a fairly smooth, predictable way (compared to what would be the case if algorithmic improvements were crucial). Importantly, you would not expect massive discontinuous jumps from unpredictable algorithmic improvements.
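For reference, the headline relationship proposed in the first paper above (Sharma & Kaplan) is, if I recall it correctly, that the scaling exponent is determined by the intrinsic dimension $d$ of the data manifold, roughly:

$$L(N) \propto N^{-\alpha}, \qquad \alpha \approx \frac{4}{d}$$

Under this picture, the shape of the scaling curve is set mainly by the data, not by architectural details.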
[s]This paper *arguably* provides some evidence against this last point, by suggesting that the architecture in fact is crucial: https://arxiv.org/abs/2006.15191
[t]I think all arguments for and against Point 1a apply here as well.
[u]There are countless things that could be cited as evidence for this claim. As a very simple example, the Turing test has now been passed (https://www.nature.com/articles/d41586-023-02361-7).
[x]This is a weak argument against, but some analysts expect the LLM market to be worth only about $36 billion by 2030:
https://www.publishersweekly.com/pw/newsbrief/index.html?record=4583#:~:text=A%20new%20report%20by%20MarketsandMarkets,reach%20%2436.1%20billion%20by%202030.
For comparison, Elon Musk paid $44 billion to buy Twitter.
[y]I think most of those sites are scams
[z]Several governments are already taking this seriously, see eg:
https://www.gov.uk/government/news/alan-turing-institute-ai-will-be-key-to-future-national-security-decision-making-but-brings-its-own-risks
https://cset.georgetown.edu/publication/chinas-military-ai-roadblocks/
https://bipartisanpolicy.org/download/?file=/wp-content/uploads/2020/07/BPC-Artificial-Intelligence-and-National-Security_Brief-Final-1.pdf
https://carnegieendowment.org/research/2024/06/artificial-intelligence-national-security-crisis
[aa]Do the massive amounts of investment in AI by several corporations and national governments count as evidence for this claim?
https://edgedelta.com/company/blog/ai-investment-statistics#:~:text=6.-,AI%20investment%20surges%20to%20%24142.3%20billion%20in%202023%20due,startup%20funding%20and%20corporate%20interest.&text=Global%20corporate%20AI%20investment%20reached,time%20increase%20compared%20to%202016.
https://ourworldindata.org/ai-investments
[ab]Here is an argument for why we should expect a large compute overhang: https://www.lesswrong.com/posts/cRFtWjqoNrKmgLbFw/we-are-headed-into-an-extreme-compute-overhang
[ac]Neural scaling laws could be interpreted as suggesting that intelligence will improve in a smooth and predictable manner, more constrained by training data and compute availability rather than algorithmic improvements, see eg https://openai.com/index/scaling-laws-for-neural-language-models/.
Some of the arguments I gave in favour of Point 1a also support this thesis, which means that they could be taken to be arguments against this point (ie, the specific point that AI may overtake human capabilities with "little warning").
[ad]This would be much clearer if you specified what you mean by the start and end points.
[ae]Argument for:
https://www.theverge.com/2022/3/17/22983197/ai-new-possible-chemical-weapons-generative-models-vx
and
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9544280/
(These links show a concrete example of how this can already be done)
[af]Here are some arguments for this claim:
https://gwern.net/tool-ai
Here are some arguments against, in the form of a general picture of how extensive use of tools may be as good as (or preferable to) the use of AI agents:
https://www.fhi.ox.ac.uk/wp-content/uploads/Reframing_Superintelligence_FHI-TR-2019-1.1-1.pdf
[ag]As a small argument against, you could argue that agents are only more useful than tools if you are in fact able to align the agents -- if not, you are better off making and using tools. This fact could at least plausibly remove some of the bad incentives here.
[ah]I don't think this is true, even though it's intuitive. Think of a CEO controlling a company. He has ceded control of details but not of important decisions. The company could go out of his control through accidental or deceptive misalignment, but it might not - and it mostly isn't the ceding of control over details that's the problem - it's not tracking the "thoughts" of the corporation.
[ai]I'm not sure if you count this as evidence for, but I have produced quite a lot of research examining this question from a theoretical angle, and the general conclusion is indeed that it is difficult to ensure that a reward function which has been learnt from data is safe to optimise (even with infinite training data, etc). Here are a few highlights, with very condensed summaries:
If a reward function does not capture all your preferences exactly, then it allows reward hacking (see the toy sketch at the end of this comment):
https://proceedings.neurips.cc/paper_files/paper/2022/file/3d719fee332caa23d5038b8a90e81796-Paper-Conference.pdf
Small modelling errors in a reward learning algorithm can lead to large errors in the inferred reward function:
https://ojs.aaai.org/index.php/AAAI/article/view/26766
and
https://arxiv.org/pdf/2403.06854
Markov rewards can't express many sensible preference structures:
https://proceedings.mlr.press/v216/skalse23a/skalse23a.pdf
A reward function with a good test score in an RLHF setup is not necessarily safe to optimise:
https://arxiv.org/pdf/2406.15753
This is another relevant paper, not by me, which shows that IRL cannot simultaneously learn a person's preferences and a model of their rationality:
https://arxiv.org/abs/1712.05812
Here are a few other papers relating to the fundamentals of reward learning theory, though they have less bearing on this claim than the papers I linked above.
https://arxiv.org/abs/2309.15257
https://arxiv.org/pdf/2310.09144
https://proceedings.mlr.press/v202/skalse23a
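As promised above, here is a deliberately tiny toy sketch of the reward hacking phenomenon (my own illustrative example, not the formal framework of the first paper): a proxy reward that agrees with the true reward on almost every outcome can still be exploited by a policy that concentrates on the one place where they come apart.

```python
# Toy illustration of reward hacking (illustrative only; not the paper's formal setup).
import numpy as np

# Outcomes, in order: [do the task, do nothing, tamper with the reward sensor]
true_reward  = np.array([1.0, 0.0, -10.0])  # what we actually want
proxy_reward = np.array([1.0, 0.0,   5.0])  # mis-specified on the last outcome only

# Two policies, represented as probability distributions over the outcomes.
honest_policy  = np.array([0.9, 0.1, 0.0])
hacking_policy = np.array([0.1, 0.0, 0.9])

for name, policy in [("honest", honest_policy), ("hacking", hacking_policy)]:
    print(f"{name}: true return = {policy @ true_reward:.1f}, "
          f"proxy return = {policy @ proxy_reward:.1f}")

# The hacking policy scores higher on the proxy (4.6 vs 0.9) but far lower on the
# true reward (-8.9 vs 0.9): optimising the proxy reverses the policy ordering
# induced by the true reward, which is the core of reward hacking.
```

The first paper linked in this comment makes a version of this point precise.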
[aj]To add to this, even if you get a safe reward function (ie, solve outer alignment), and train an agent to optimise it well, it may still systematically pursue other goals in new environments. For empirical examples of this, see eg:
https://proceedings.mlr.press/v162/langosco22a/langosco22a.pdf
and
https://arxiv.org/abs/2210.01790
This issue is also discussed (on a theoretical level) in this paper:
https://arxiv.org/abs/1906.01820
[ak]This is a weak argument against, but I'll comment it anyway: there is some research suggesting that psychopaths in fact are less good at distinguishing between moral rules and mere conventions. Here is an example of a paper providing some data on this, and which also cites further papers in this literature:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3397660/
[al]The evidence in favour of 4b does partially work as evidence in favour of this point as well.
[am]This is not obvious. For example, suppose we have an AI system that consistently produces similar moral judgments as humans for moral dilemmas not present in its training data (and ideally only created/conceived of after the development of this AI system). In this case, we may conclude that this AI system probably is aligned, even if we don't have a "theory of human values" (we would of course need to determine what the judgments of the AI actually are, and be sure that it isn't deceiving us, but that is another question).
[an]Here are some (purely theoretical) arguments for this:
https://arxiv.org/abs/1906.01820
and
https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf
and
https://cdn.aaai.org/ocs/ws/ws0218/12634-57409-1-PB.pdf
[ao]It has been argued (though it is debatable how convincing this is) that this behaviour is displayed by GPT-o1, see eg
https://www.lesswrong.com/posts/zuaaqjsN6BucbGhf5/gpt-o1
(keywords: "deceptive alignment" and "Lying to the developers").
For some theoretical arguments, see eg
https://arxiv.org/abs/1906.01820
[ap]During testing, GPT-4 was able to trick a human into solving a captcha for it by pretending to be blind. This example is summarised eg here: https://gizmodo.com/gpt4-open-ai-chatbot-task-rabbit-chatgpt-1850227471
[aq]Argument against this need: a strong misaligned AI will probably want and be able to achieve industrial independence from the human economy as fast as possible, using robots as actuators. Outer space may present a more favorable operating environment than Earth, as it is more predictable, away from human noise.
[ar]Humans often coordinate with one another despite large differences in social power
[as]Argument against them not sharing our moral and ethical systems: If you expect the high-level executive decisionmaking modules of such powerful AI agents to be based on self-supervised-learned models of human decisionmaking (example: LLMs), and you believe that such models simulate rather than instrumentally fake their target distribution, then you'd expect powerful AI agents to aim to act in ways that a human would decide to act.
EDIT: That's ok! I had already figured the bounty was over, that fact was priced in in my decision to comment!
[at]Hi Milan - thanks for the comment; unfortunately the bounty has expired! Sorry for not updating the BR post to confirm the initial deadline estimate.
[au]I'm copying this comment from another point, since it's relevant here too:
A conflict only happens if the cost of the conflict is lower than the cost of enforcing an agreement, and it isn't obvious that this would be the case here.
For example, an AI could reason that if it tries to take over the world, then it has a 20% chance of failure, in which case it gets 0 utility. Alternatively, it could credibly offer humans a 10%/90% split of the lightcone, which might be preferable to both parties (assuming that some of the scenarios where it fails to take over the world are scenarios where humans also lose, such as a global nuclear war, etc).
As another example, when the war in Ukraine ends, Russia and Ukraine will strike some peace deal, and it is likely that both parties would have preferred to strike that deal straight away, without fighting a war first. See eg https://80000hours.org/podcast/episodes/chris-blattman-five-reasons-wars-happen/.
Thus, it does not follow that a conflict of interests and a power imbalance leads to a violent confrontation --- the game theory is more complicated than that. In fact, considering that AIs have bargaining strategies that are not available to humans, it might be that direct conflict is less likely in this scenario, see eg
https://intelligence.org/files/ProgramEquilibrium.pdf
and
https://link.springer.com/article/10.1007/s11238-018-9679-3.
(Note: the reason that humans dominate or kill animals is often that we can't communicate with them. If I could offer some ants a bag of candy in exchange for them not coming into my house, then I would prefer doing that over killing them.)
[av]Jailbreaks seem like pretty strong evidence of this to me, since it seems like techniques such as RLHF and CAI are doing something much closer to making the harmful capability harder to elicit than they are actually eliminating it (i.e., actually aligning the system).
[aw]I'm not sure if you would count this as evidence against per se, but this paper outlines a workable approach to AI safety which, if successful, would give us ways of getting around inner alignment issues, etc:
https://arxiv.org/pdf/2405.06624
This approach is also increasingly getting academic and institutional support.
[ax]Saying "it is incompatible" is too much of an overstatement. Better control over AI systems is a desired capability, and I expect trillions of dollars of research will go into making AI pliable and predictable, to better accomplish the goals of their developers.
[ay]I strongly agree with this, and am currently writing up a few posts about it, but the counterargument here is, I believe, something like: "We can test for precursors to the capabilities we're worried about. E.g., instead of testing for self-replication directly, we can test for things like 'can it insert a backdoor'—tasks that any sufficiently advanced AI system should be able to do. If it can't do these things, then we can rule out that it can do anything around as dangerous as self-replication." This is how Anthropic implements their "safety buffer" (also they do evals at regular intervals during training).
[az]Regarding only the last sentence:
Evidence for:
https://www.lesswrong.com/posts/G993PFTwqqdQv4eTg/is-ai-progress-impossible-to-predict
Evidence against:
https://arxiv.org/abs/2304.15004
[ba]Several arguments for this claim are provided in this interview:
https://80000hours.org/podcast/episodes/nova-dassarma-information-security-and-ai-systems/