README: This sheet is intended to give the reader an exhaustive overview of how different risk models consider the AI alignment problem. If you are a contributor, please follow the conventions below.
DISCLAIMER: This sheet reflects the personal opinions of the author; hopefully they correspond somewhat to reality, but keep in mind it is not the ground truth. Other confounding factors could be: TODO
READING CONVENTIONS: I tried to represent each author's views to the best of my understanding. "-" means the author does not provide information about that variable. [blank] means I haven't found anything yet, but I am not confident the author has no opinion (or I simply haven't gotten around to filling in that cell yet).
Explanations
Where is the misalignment: in the developer, the given goal or the instantiation?
How did it come to this? Technical causes: what happened in the code. Social causes: what allowed bad code to be.
What is the AI capable of?
What is the relevant timeframe? Should we take action now, consider the possibility, or leave that task to our successors?
How serious is the risk?
What does destroying the world look like?
What can we learn from this risk model?
WRITING CONVENTIONS: Do not edit text directly. If you disagree with the value of a cell, leave a comment proposing a new value and explaining your position. If you want to make a minor amendment, make the cell yellow or orange, depending on the importance of the change. If you want to make a major amendment (e.g. the current value is misleading for an experienced reader), make the cell red. Try to keep comments about the same cell in a single thread. If you want to add a column or make a change of the same scale, mark the topmost cell of the relevant column (the same column for a split, the column to its right for a new category) in magenta. If you disagree with an explanation or want to add to one, leave a comment and mark the explanation in blue.
Type of misalignment | Sources of misalignment | Capabilities | Timelines | Gravity | Takeoff parameters | Takeaways
Typical use case: You have a potential solution. You want to think about it.
Brief summary
The developers have bad goals (and the AI is at least somewhat aligned with this badness)
The AI is unaligned with its given goal
The AI is given a bad goal (unaligned with the developers)
Achieving the given goal is best done using distasteful strategies
The AI finds a goal during training which is not the correct one
This cognitive pattern is self-reinforcing
To what extent it can influence the physical world
Ability to think well, fast, much, and widely
Propensity to exist, physical requirements
Propensity to have an objective (TODO: distinguish behavioural and cognitive agency)
Propensity of the failure scenario to be impossible to recover from - by default, most are irreversible, and that is often considered a requirement for existential risks
Propensity of the relevant AIs to be similar (TODO: distinguish algorithmic, computational, behavioral similarity)
TODO: expand, fill
- Consider which scenario it might prevent (find the proper row)
Intent | Inner | Outer | Technical causes | Social causes | Power | Cognition | Existence | Agency | Existential | Irreversibility
Similarity between AIs
Interaction between AIs | Speed | Warning shots
- Consider what dimensions of the scenario your solution modifies (columns)
Specification gaming
Goal misgeneralization | Daemons | Bad actors | Coordination | Generality | Research | Planification | Awareness | Quantity | Type | Fast | Discontinuous | Concentrated
Intelligence explosion
RSI
- See in what measure the scenario is improved.
Instrumental goals
Polarity | AI research
Hopefully, this will remind you of dimensions you might otherwise overlook, and of scenarios your solution does not apply to (a rough sketch of this workflow in code follows below).
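As an illustration only (not part of the sheet), here is a minimal sketch of that workflow in Python, assuming one hypothetical way to encode a row. The names RiskModel and relevant_models, the dimension labels, and the example values are made up for this sketch; the values are only loosely transcribed from the Carlsmith row further down, which remains the authoritative source.

# Illustrative sketch only: one hypothetical way to encode a row of this
# sheet so that a proposed solution can be checked against each scenario.
# Names and values are examples, not the sheet's canonical schema.
from dataclasses import dataclass, field

@dataclass
class RiskModel:
    name: str       # row label, e.g. the paper or post title
    summary: str    # the "Brief summary" cell
    # column header -> the author's stance ("Yes", "Unnecessary", "Likely", "-")
    dimensions: dict = field(default_factory=dict)

# Example row, loosely transcribed from the Carlsmith entry below.
carlsmith = RiskModel(
    name="Carlsmith - Is Power-seeking AI an existential risk?",
    summary="Power-seeking AIs take over the world.",
    dimensions={
        "Inner": "Yes",
        "Agentic planning": "Necessary",
        "Strategic awareness": "Necessary",
        "Warning shots": "Not really the point",
    },
)

def relevant_models(solution_dimensions, models):
    """Steps 1-2 above: keep the rows whose scenario touches at least one
    dimension (column) that the proposed solution modifies."""
    return [m for m in models if solution_dimensions & m.dimensions.keys()]

if __name__ == "__main__":
    # A solution that constrains agentic planning is at least relevant to the
    # Carlsmith scenario; step 3 (how much it actually helps) stays manual.
    for model in relevant_models({"Agentic planning"}, [carlsmith]):
        print(model.name)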
Brief summary
Many minds happen to stumble upon these goals - because they're generally useful
The AI stumbled upon this goal and won't let go of it
Only one AI is relevant
Multiple AIs are relevant
Very many AIs must be taken into consideration
E.g. AI arms race
Non-agentic systemic dynamics lead to poor outcomes - all the other coordination problems. A catch-all term
The AI is able to grow in power by itself
The AI has been granted power by its makers
The AI is relevant to a wide variety of tasks
The AI is specialized to a single (class of) task(s)
The AI can make better AIs than itself
The AI can make long-term plans
The AI can make plans in order to achieve objectives
The AI understands information relevant to its own abilities, available actions and plans
The AI understands information relevant to its own existence, physical instantiation, development context
The AI can be easily duplicated
The AI requires a certain amount of computational power
The AI requires a certain amount of memory
The AI is very agentic
The AI is not agentic
Humanity suffers a lot, or maximally (e.g. prolonged life in misery)
Humanity no longer exists
Humans have no power to steer their future
Humanity does not achieve what it could (e.g. never colonizing space, never breaking out of the matrix, never uploading - conditional on those being possible)
Propensity of the relevant AIs to look alike - maximal for a singleton
Number of relevant AIs - only relevant for a specific kind of homogeneity
Convergent instrumental goals
Crystallized proxies
Uni | Some | Very multi
Competitive pressure
Moloch | Acquiring power
Put in position of power
Much | Expert
Self-improvement - RSI
Long-term planning
Agentic planning
Strategic awareness
Situational awareness
Duplicability
Computation speed
Memory size | Much | None | Short (<10y) | Medium (<50y) | Long (>50y) | S-Risk | Annihilation
Disempowerment
Loss of potential | Homogeneity | Cooperation | Coordination | Competition | Adversariality
RSI
Risk model | Brief summary
Sources: TODO
Literature reviews: Clarifying AI X-risk (DeepMind) ...
Additional ideas for relevant variables: Distinguishing AI takeover scenarios (authors), https://www.lesswrong.com/posts/3DFBbPFZyscrAiTKS/my-overview-of-the-ai-alignment-landscape-threat-models (Neel Nanda) ...
Carlsmith - Is Power-seeking AI an existential risk?
Power-seeking AIs take over the world. The report is quite exhaustive and detailed.
Possible | Yes | Unnecessary | Agnostic | Sufficient | Irrelevant | Facilitates | - | Necessary, likely | Likely | Irrelevant | Possible | Necessary | Necessary | Necessary | Likely | Irrelevant | Yes | Unlikely | (>10% of it happening by 2070) | Yes | Yes | Likely | Irrelevant | Not really the point | Heh
*Read the report and learn.* Be careful of power-seeking AIs; there's a lot to unpack.
Christiano1 - You get what you measure
We make AIs to optimize measures. The measures are optimized. The measures were poor proxies of our desires.
Yes | Unnecessary | Likely | Yes | No
No (Yes, at the human level)
No | Possible | No | No | Yes | - | Yes | Unnecessary | Yes | Unnecessary | Possible | Unnecessary | Likely | No (Possible?) | Unlikely | Yes | Yes
Many (behavioral)
No | Yes
Christiano2 - Influence-seeking behaviour is scary
AIs that try to acquire power work better than others. They acquire power. This is bad.
No | Yes | Unnecessary | Possible | Yes | - | Possible | No | No | Possible? | Yes | - | Very yes | Yes | Possible | Possible? | - | Yes | Unlikely | No? | Possible | Yes | Yes
Quite (behavioral)
Many (behavioral)
No
Hubinger - How likely is deceptive alignment?
The only AIs we allow are those that look good. Pretending to be good is easier than being good. So they aren't good.
No | Yes | Unnecessary
No (but "gaming the training signal")
Unnecessary | Yes | - | Irrelevant | Not necessary | Irrelevant
Relevant (high-path dependence)
Irrelevant | Necessary | No | - | 1
AIs might pretend to be aligned. -> security mindset?
Distinguish high and low path-dependence.
Critch1 - Production Web
AI enables processes to be efficient, regardless of instantiation. For instance, companies become productive regardless of their specific goals; those productive goals consume resources vital for humanity's survival. Slow takeoff.
Unnecessary | Agnostic | - | Unnecessary | No | Possible? | Yes | Aye | Very yes | Yes | Possible | Unnecessary | - | - | Unnecessary?
Unnecessary (Moloch does it)
- | Yes | - | Unnecessary | Possible | Unlikely | No | As it happens | As a start
Quite (behavioral)
Much (behavioral)
Yes | - | Yes
Identify control loops as points of leverage to prevent a Critch scenario.
Critch2 - Flash Economy
AI enables processes to be efficient, regardless of instantiation. For instance, companies become productive regardless of their specific goals; those productive goals consume resources vital for humanity's survival. Fast takeoff.
Id. | Yes | - | Likely | Unnecessary? | Yes | - | Very yes | - | Unnecessary | Possible?
(A couple of years after the start of the scenario)
No | No | As it happens | As a start
Much (algorithmic)
Many (algorithmic)
Possible | Opposite
Cohen et al. - Advanced artificial agents intervene in the provision of reward
Advanced AI strives to wirehead itself. Catastrophic consequences ensue.
No | Yes | No | Yes | No | In some manner | No | No | - | Irrelevant | Unnecessary | Unnecessary | Yes
At least somewhat
Very | Irrelevant | Necessary | No | - | No | -
Yes, at least over the AI
- | (Singleton) | 1 | -
Beware trying to tame something that is more intelligent than you.
Soares - A central AI alignment problem: capabilities generalization, and the sharp left turn