| Link | Author | Post Karma | Pingback Count | Non-author Pingbacks | Total Pingback Karma | Avg Pingback Karma |
|---|---|---|---|---|---|---|
| AGI Ruin: A List of Lethalities | Eliezer Yudkowsky | 870 | 159 | 157 | 12518 | 79 |
| Simulators | janus | 612 | 128 | 123 | 7723 | 60 |
| A central AI alignment problem: capabilities generalization, and the sharp left turn | So8res | 273 | 97 | 90 | 7728 | 80 |
| Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover | Ajeya Cotra | 367 | 84 | 84 | 5147 | 61 |
| MIRI announces new "Death With Dignity" strategy | Eliezer Yudkowsky | 334 | 74 | 73 | 8158 | 110 |
| Reward is not the optimization target | TurnTrout | 341 | 63 | 46 | 4517 | 72 |
| A Mechanistic Interpretability Analysis of Grokking | Neel Nanda | 367 | 49 | 42 | 3474 | 71 |
| How likely is deceptive alignment? | evhub | 101 | 48 | 35 | 2931 | 61 |
| How To Go From Interpretability To Alignment: Just Retarget The Search | johnswentworth | 167 | 46 | 43 | 3398 | 74 |
| Why Agent Foundations? An Overly Abstract Explanation | johnswentworth | 285 | 43 | 42 | 2754 | 64 |
| The shard theory of human values | Quintin Pope | 238 | 43 | 43 | 2867 | 67 |
| On how various plans miss the hard bits of the alignment challenge | So8res | 292 | 41 | 34 | 3312 | 81 |
| [Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering | Steven Byrnes | 79 | 37 | 9 | 3059 | 83 |
| Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 195 | 36 | 31 | 2291 | 64 |
| Mysteries of mode collapse | janus | 279 | 33 | 30 | 2866 | 87 |
| How might we align transformative AI if it’s developed very soon? | HoldenKarnofsky | 136 | 33 | 22 | 2375 | 72 |
| A transparency and interpretability tech tree | evhub | 148 | 32 | 25 | 2367 | 74 |
| [Intro to brain-like-AGI safety] 2. “Learning from scratch” in the brain | Steven Byrnes | 57 | 31 | 7 | 2755 | 89 |
| Shard Theory: An Overview | David Udell | 157 | 29 | 28 | 2043 | 70 |
| Externalized reasoning oversight: a research direction for language model alignment | tamera | 117 | 29 | 29 | 1812 | 62 |
| Notes on Resolve | David Gross | 9 | 28 | 0 | 514 | 18 |
| Brain Efficiency: Much More than You Wanted to Know | jacob_cannell | 201 | 28 | 23 | 1831 | 65 |
| How to Diversify Conceptual Alignment: the Model Behind Refine | adamShimi | 87 | 28 | 25 | 869 | 31 |
| Where I agree and disagree with Eliezer | paulfchristiano | 862 | 28 | 28 | 1860 | 66 |
| Notes on Rationality | David Gross | 16 | 27 | 0 | 469 | 17 |
| A Longlist of Theories of Impact for Interpretability | Neel Nanda | 124 | 27 | 26 | 2613 | 97 |
| [Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL | Steven Byrnes | 66 | 26 | 4 | 1484 | 57 |
| Supervise Process, not Outcomes | stuhlmueller | 132 | 26 | 23 | 2288 | 88 |
| [Intro to brain-like-AGI safety] 1. What's the problem & Why work on it now? | Steven Byrnes | 146 | 26 | 9 | 1431 | 55 |
| What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? | johnswentworth | 118 | 25 | 24 | 1012 | 40 |
| (My understanding of) What Everyone in Technical Alignment is Doing and Why | Thomas Larsen | 411 | 24 | 24 | 1554 | 65 |
| [Intro to brain-like-AGI safety] 13. Symbol grounding & human social instincts | Steven Byrnes | 67 | 24 | 7 | 1279 | 53 |
| A shot at the diamond-alignment problem | TurnTrout | 92 | 24 | 18 | 1872 | 78 |
| You Are Not Measuring What You Think You Are Measuring | johnswentworth | 350 | 22 | 18 | 1473 | 67 |
| Refine: An Incubator for Conceptual Alignment Research Bets | adamShimi | 143 | 22 | 21 | 1817 | 83 |
| Epistemological Vigilance for Alignment | adamShimi | 61 | 22 | 13 | 2032 | 92 |
| Prizes for ELK proposals | paulfchristiano | 143 | 21 | 20 | 1046 | 50 |
| A note about differential technological development | So8res | 185 | 21 | 15 | 2294 | 109 |
| Six Dimensions of Operational Adequacy in AGI Projects | Eliezer Yudkowsky | 298 | 21 | 21 | 1631 | 78 |
| Humans provide an untapped wealth of evidence about alignment | TurnTrout | 186 | 20 | 14 | 1671 | 84 |
| 200 Concrete Open Problems in Mechanistic Interpretability: Introduction | Neel Nanda | 98 | 20 | 10 | 515 | 26 |
| Call For Distillers | johnswentworth | 204 | 20 | 19 | 902 | 45 |
| Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 100 | 20 | 15 | 2360 | 118 |
| chinchilla's wild implications | nostalgebraist | 403 | 19 | 19 | 1175 | 62 |
| Abstractions as Redundant Information | johnswentworth | 64 | 19 | 12 | 1240 | 65 |
| Two-year update on my personal AI timelines | Ajeya Cotra | 287 | 19 | 19 | 1554 | 82 |
| A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 299 | 19 | 18 | 1360 | 72 |
| ELK prize results | paulfchristiano | 135 | 18 | 18 | 1259 | 70 |
| Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 115 | 18 | 17 | 983 | 55 |
| Godzilla Strategies | johnswentworth | 137 | 18 | 16 | 1597 | 89 |
| [Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning | Steven Byrnes | 52 | 18 | 2 | 883 | 49 |
| Worlds Where Iterative Design Fails | johnswentworth | 185 | 18 | 16 | 1146 | 64 |
| Let’s think about slowing down AI | KatjaGrace | 522 | 18 | 18 | 1297 | 72 |
| [Intro to brain-like-AGI safety] 4. The “short-term predictor” | Steven Byrnes | 64 | 17 | 2 | 914 | 54 |
| [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 126 | 17 | 16 | 784 | 46 |
| What an actually pessimistic containment strategy looks like | lc | 647 | 17 | 17 | 1192 | 70 |
| [Intro to brain-like-AGI safety] 15. Conclusion: Open problems, how to help, AMA | Steven Byrnes | 90 | 17 | 7 | 1506 | 89 |
| How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 240 | 17 | 17 | 1599 | 94 |
| [Intro to brain-like-AGI safety] 12. Two paths forward: “Controlled AGI” and “Social-instinct AGI” | Steven Byrnes | 42 | 16 | 4 | 1016 | 64 |
| Instead of technical research, more people should focus on buying time | Akash | 100 | 16 | 13 | 928 | 58 |
| Human values & biases are inaccessible to the genome | TurnTrout | 90 | 15 | 14 | 1474 | 98 |
| Conjecture: Internal Infohazard Policy | Connor Leahy | 132 | 15 | 13 | 1364 | 91 |
| Open Problems in AI X-Risk [PAIS #5] | Dan H | 59 | 15 | 12 | 1470 | 98 |
| Optimality is the tiger, and agents are its teeth | Veedrac | 288 | 15 | 14 | 1343 | 90 |
| why assume AGIs will optimize for fixed goals? | nostalgebraist | 138 | 15 | 14 | 1127 | 75 |
| Mechanistic anomaly detection and ELK | paulfchristiano | 133 | 14 | 13 | 802 | 57 |
| Notes on Caution | David Gross | 13 | 14 | 0 | 289 | 21 |
| Threat Model Literature Review | zac_kenton | 73 | 14 | 13 | 977 | 70 |
| Discovering Agents | zac_kenton | 71 | 14 | 14 | 1018 | 73 |
| [Intro to brain-like-AGI safety] 8. Takeaways from neuro 1/2: On AGI development | Steven Byrnes | 50 | 13 | 2 | 770 | 59 |
| [Link] A minimal viable product for alignment | janleike | 53 | 13 | 13 | 1208 | 93 |
| An Open Agency Architecture for Safe Transformative AI | davidad | 74 | 13 | 13 | 855 | 66 |
| «Boundaries», Part 3a: Defining boundaries as directed Markov blankets | Andrew_Critch | 86 | 13 | 12 | 501 | 39 |
| Niceness is unnatural | So8res | 121 | 13 | 9 | 1287 | 99 |
| RL with KL penalties is better seen as Bayesian inference | Tomek Korbak | 114 | 13 | 12 | 785 | 60 |
| The Plan - 2022 Update | johnswentworth | 235 | 13 | 13 | 706 | 54 |
| The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 192 | 12 | 10 | 632 | 53 |
| Latent Adversarial Training | Adam Jermyn | 40 | 12 | 12 | 938 | 78 |
| PreDCA: vanessa kosoy's alignment protocol | Tamsin Leake | 50 | 12 | 3 | 517 | 43 |
| Common misconceptions about OpenAI | Jacob_Hilton | 239 | 12 | 12 | 1052 | 88 |
| AI strategy nearcasting | HoldenKarnofsky | 79 | 12 | 7 | 738 | 62 |
| Conditioning Generative Models | Adam Jermyn | 24 | 12 | 9 | 1386 | 116 |
| What does it take to defend the world against out-of-control AGIs? | Steven Byrnes | 180 | 12 | 8 | 877 | 73 |
| Monitoring for deceptive alignment | evhub | 135 | 12 | 9 | 875 | 73 |
| “Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments | Andrew_Critch | 129 | 12 | 10 | 937 | 78 |
| Language models seem to be much better than humans at next-token prediction | Buck | 172 | 12 | 12 | 976 | 81 |
| Superintelligent AI is necessary for an amazing future, but far from sufficient | So8res | 132 | 12 | 7 | 1359 | 113 |
| Announcing the Alignment of Complex Systems Research Group | Jan_Kulveit | 91 | 12 | 8 | 1271 | 106 |
| Acceptability Verification: A Research Agenda | David Udell | 50 | 12 | 11 | 1206 | 101 |
| «Boundaries», Part 1: a key missing concept from utility theory | Andrew_Critch | 158 | 12 | 10 | 541 | 45 |
| Circumventing interpretability: How to defeat mind-readers | Lee Sharkey | 109 | 12 | 10 | 1071 | 89 |
| Gradient hacking: definitions and examples | Richard_Ngo | 38 | 12 | 9 | 1103 | 92 |
| Don't leave your fingerprints on the future | So8res | 109 | 12 | 8 | 914 | 76 |
| Productive Mistakes, Not Perfect Answers | adamShimi | 97 | 12 | 6 | 742 | 62 |
| Searching for Search | NicholasKees | 81 | 11 | 11 | 582 | 53 |
| [Intro to brain-like-AGI safety] 10. The alignment problem | Steven Byrnes | 48 | 11 | 1 | 726 | 66 |
| Nearcast-based "deployment problem" analysis | HoldenKarnofsky | 85 | 11 | 8 | 433 | 39 |
| Path dependence in ML inductive biases | Vivek Hebbar | 67 | 11 | 11 | 510 | 46 |
| [Intro to brain-like-AGI safety] 9. Takeaways from neuro 2/2: On AGI motivation | Steven Byrnes | 42 | 11 | 2 | 606 | 55 |
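
Note: the Avg Pingback Karma column appears to be Total Pingback Karma divided by Pingback Count, rounded to the nearest integer (for example, 12518 / 159 ≈ 78.7, shown as 79 for "AGI Ruin: A List of Lethalities"); this derivation is inferred from the data rather than stated in the source.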