Claims about AI risk: a catalogue of the evidence

Each claim below is assessed against five columns:
- Empirical evidence: analogies, demos, real-world examples
- Non-empirical evidence: proofs, arguments, forecasts
- Evidence from the opinions of others: expert opinion
- Anti-evidence
- Status
The nature of AI systems

Claim: The relevant kind of AI will come soon
- Forecasts: Literature review of Transformative Artificial Intelligence timelines
- Expert opinion: AI timelines: What do experts in artificial intelligence expect for the future?; Literature review of Transformative Artificial Intelligence timelines
Claim: AI will be extremely capable
Claim: AI will be situationally aware
- Demos: Berglund et al., Taken out of context: On measuring situational awareness in LLMs
Claim: AI will be goal-directed
- Analogies: humans
- Demos: AgentGPT?
- Arguments: coherence arguments (see the money-pump sketch below)
- Status: unclear?
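The coherence arguments have a classic worked example, the money pump: an agent with cyclic preferences can be charged a small fee at every step of the cycle and ends up holding what it started with, strictly poorer. Below is a minimal sketch of that dynamic (illustrative only; the items, preference relation, and fee are made up, not drawn from any source above):

```python
# Money-pump sketch: an agent with cyclic preferences (A > B > C > A)
# accepts every trade to a strictly preferred item for a small fee,
# and ends up holding what it started with, strictly poorer.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}   # (x, y): x is preferred to y
FEE = 1.0

def accepts(offered, held):
    """The agent trades whenever the offered item is strictly preferred."""
    return (offered, held) in PREFERS

holding, money = "C", 0.0
for offered in ["B", "A", "C"]:      # one pass around the preference cycle
    if accepts(offered, holding):
        money -= FEE
        holding = offered

print(f"holding: {holding}, net money: {money}")   # holding: C, net money: -3.0
```

The standard inference runs in reverse: agents that cannot be exploited this way behave as if they maximize a consistent utility function, which is one route from "capable agent" to "goal-directed agent".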
AI will be misaligned

Claim: AI will be subject to goal misgeneralization
- Demos: goal misgeneralization examples; Goal Misgeneralization in Deep Reinforcement Learning; Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals (a toy reproduction is sketched below)
- Real-world examples: dataset of real distributional shifts?
- Status: strong empirical evidence of some aspects of this problem; demos but no real-world examples of the most negative aspects?
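The phenomenon those papers study is easy to reproduce in miniature. In the sketch below (a toy construction, not from the cited work), a tabular Q-learning agent trains in a corridor where the goal is always at the right-hand end; its state is its position only, so it learns the policy "go right" rather than "go to the goal", and fails when the goal is moved at test time even though its competence is unchanged:

```python
import random

random.seed(0)

N = 9                      # corridor cells 0..N-1
START = N // 2
EPISODES, MAX_STEPS = 2000, 50
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1
ACTIONS = [-1, +1]         # step left, step right

def run_episode(q, goal, learn=True):
    """Run one episode; return True if the agent reaches the goal."""
    s = START
    for _ in range(MAX_STEPS):
        if learn and random.random() < EPS:
            a = random.randrange(2)                     # explore
        else:
            a = max(range(2), key=lambda i: q[s][i])    # exploit
        s2 = min(max(s + ACTIONS[a], 0), N - 1)
        r = 1.0 if s2 == goal else 0.0
        if learn:   # standard Q-learning update
            q[s][a] += ALPHA * (r + GAMMA * max(q[s2]) - q[s][a])
        s = s2
        if r > 0:
            return True
    return False

q = [[0.0, 0.0] for _ in range(N)]
for _ in range(EPISODES):            # training: goal fixed at the right end
    run_episode(q, goal=N - 1)

# The learned behaviour transfers; the learned goal does not.
print("goal at right end (as in training):", run_episode(q, goal=N - 1, learn=False))
print("goal moved to left end (test):     ", run_episode(q, goal=0, learn=False))
```

This mirrors the CoinRun-style setups in the cited papers: the training distribution never disambiguates "pursue the goal object" from a spurious correlate of it, so the learned objective generalizes the wrong way while capabilities generalize fine.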
Claim: AI will be deceptively aligned
- Analogies: [many human examples of principal-agent problems]
- Demos, deception generally:
  - ARC CAPTCHA example
  - AI Deception: A Survey of Examples, Risks, and Potential Solutions
- Demos, deceptive alignment specifically:
  - not much?
  - [private demo in LLMs from an evals org; not ready to share]
- Arguments: Hubinger et al., Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
Claim: AI will be subject to specification gaming
- Analogies: [many human examples of Goodhart's law]
- Demos: specification gaming examples
- Arguments (a toy numeric illustration follows below):
  - John et al., Dead rats, dopamine, performance metrics, and peacock tails: proxy failure is an inherent risk in goal-oriented systems
  - Manheim and Garrabrant, Categorizing Variants of Goodhart's Law
  - On perils of predictive reasoning: Hardt et al., Performative Power
  - https://royalsocietypublishing.org/doi/10.1098/rsos.200462
  - https://arxiv.org/abs/2102.03896
- Anti-evidence: human examples of Goodharting are not catastrophic
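Goodhart-style proxy failure is simple enough to demonstrate numerically. In the sketch below (a toy construction, not from the cited papers), true value requires two complementary inputs but the metric measures only one; selecting mildly on the metric is roughly harmless, while optimizing it as hard as possible starves the unmeasured input and collapses true value:

```python
import random

random.seed(0)
BUDGET = 10.0   # fixed resources split between two inputs

def true_value(effort, care):
    return effort * care          # complementary: both inputs needed

def proxy(effort, care):
    return effort                 # the metric only sees 'effort'

# 1000 candidate policies, each a random split of the budget
pool = [(e, BUDGET - e) for e in (random.uniform(0, BUDGET) for _ in range(1000))]
pool.sort(key=lambda ec: proxy(*ec))

mild = pool[len(pool) // 2]       # median proxy score: a typical candidate
hard = pool[-1]                   # maximal proxy score: heavy optimization

for name, (e, c) in [("mild", mild), ("hard", hard)]:
    print(f"{name} selection: proxy={proxy(e, c):.1f} true value={true_value(e, c):.1f}")
```

The qualitative pattern is what Manheim and Garrabrant call extremal Goodhart: the proxy-goal correlation observed over the typical range breaks down in the regions that strong optimization selects for.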
Claim: AI will be power-seeking
- Analogies: humans
- Demos: Auto-GPT?
- Proofs: Optimal Policies Tend To Seek Power; Parametrically Retargetable Decision-Makers Tend To Seek Power (toy illustration below)
- Arguments: Bostrom, Superintelligence; Omohundro, The Basic AI Drives
- Status: some non-empirical evidence; minimal empirical evidence; no real-world examples?
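The flavor of the two "Proofs" entries can be shown with a toy computation (a sketch of the statistical tendency, not the papers' actual formalism): in an environment where one action keeps several terminal states reachable and another commits immediately, most randomly drawn reward functions make the option-preserving action optimal, so "retargetable" optimal policies tend toward the higher-power state:

```python
import random

random.seed(0)
GAMMA = 0.99        # discount; reaching a terminal via the hub costs one extra step
TRIALS = 10_000

# Tiny deterministic MDP: from START, one action moves to a hub that keeps
# terminals {A, B, C} reachable; the other commits directly to terminal D.
HUB_OPTIONS = ["A", "B", "C"]

hub_wins = 0
for _ in range(TRIALS):
    r = {s: random.random() for s in HUB_OPTIONS + ["D"]}   # random reward function
    v_hub = GAMMA * max(r[s] for s in HUB_OPTIONS)          # pick the best option later
    v_commit = r["D"]
    hub_wins += v_hub > v_commit

print(f"fraction of reward functions preferring the option-keeping state: "
      f"{hub_wins / TRIALS:.2f}")
```

With four i.i.d. terminal rewards, the direct terminal is the best of the four only a quarter of the time, so roughly three quarters of reward functions send the optimal policy through the hub: keeping options open is instrumentally favored for most goals.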
Claim: AI will be self-preserving
- Arguments: Russell on RL selecting for survival?
Claim: AI will be self-improving
- Demos: Examples of AI Improving AI
- Real-world examples, on speed specifically: list of events causing discontinuities?
- Arguments, on speed specifically (growth-model sketch below):
  - Chalmers, The Singularity; Bostrom, Superintelligence
  - AIs assisting human software engineers: first part of Shulman's Lunar Society interview; Davidson's compute-centric model
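The speed arguments share one mathematical core: if improvement feeds back into the rate of improvement, dx/dt = x^k, then k < 1 gives polynomial growth, k = 1 exponential growth, and k > 1 a finite-time blow-up. A minimal numeric sketch of that distinction (an illustration only; the exponent values are not drawn from the cited sources):

```python
# Integrate dx/dt = x**k with forward Euler and report how far
# capability x gets by t = 2 (capped at 1e6 to stand in for "blow-up").
def trajectory(k, dt=1e-4, t_max=2.0, cap=1e6):
    x, t = 1.0, 0.0
    while t < t_max and x < cap:
        x += dt * x**k
        t += dt
    return t, x

for k in (0.5, 1.0, 1.5):
    t, x = trajectory(k)
    print(f"k={k}: x={x:.3g} at t={t:.2f}")
```

The disagreement between fast- and slow-takeoff views can be read as a disagreement about the effective exponent: how strongly AI-assisted research feeds back into AI capability itself.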
Other claims

Claim: Multi-agent interactions will lead to bad outcomes
Claim: AI will be subject to misuse
Claim: Other claims
Claim: The net result of this will be extremely bad
- Forecasts: Forecasting Existential Risks, p. 24
Key
- Orange: tentatively in scope
- Grey: tentatively out of scope
Feedback I'd find most helpful:
- Criticisms of/suggestions for the 'Claims about AI risk' breakdown
- Suggestions for pieces of evidence to include (in scope most useful, out of scope also useful but less so)
- Suggestions for classes of evidence to include (in scope most useful, out of scope also useful but less so)