Link | Author | Post Karma | Pingback Count | Non-author Pingbacks | Total Pingback Karma | Avg Pingback Karma
---|---|---|---|---|---|---
AGI Ruin: A List of Lethalities | Eliezer Yudkowsky | 870 | 159 | 157 | 12518 | 79
Simulators | janus | 612 | 128 | 123 | 7723 | 60
A central AI alignment problem: capabilities generalization, and the sharp left turn | So8res | 273 | 97 | 90 | 7728 | 80
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover | Ajeya Cotra | 367 | 84 | 84 | 5147 | 61
MIRI announces new "Death With Dignity" strategy | Eliezer Yudkowsky | 334 | 74 | 73 | 8158 | 110
Reward is not the optimization target | TurnTrout | 341 | 63 | 46 | 4517 | 72
A Mechanistic Interpretability Analysis of Grokking | Neel Nanda | 367 | 49 | 42 | 3474 | 71
How likely is deceptive alignment? | evhub | 101 | 48 | 35 | 2931 | 61
How To Go From Interpretability To Alignment: Just Retarget The Search | johnswentworth | 167 | 46 | 43 | 3398 | 74
Why Agent Foundations? An Overly Abstract Explanation | johnswentworth | 285 | 43 | 42 | 2754 | 64
The shard theory of human values | Quintin Pope | 238 | 43 | 43 | 2867 | 67
On how various plans miss the hard bits of the alignment challenge | So8res | 292 | 41 | 34 | 3312 | 81
[Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering | Steven Byrnes | 79 | 37 | 9 | 3059 | 83
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 195 | 36 | 31 | 2291 | 64
Mysteries of mode collapse | janus | 279 | 33 | 30 | 2866 | 87
How might we align transformative AI if it’s developed very soon? | HoldenKarnofsky | 136 | 33 | 22 | 2375 | 72
A transparency and interpretability tech tree | evhub | 148 | 32 | 25 | 2367 | 74
[Intro to brain-like-AGI safety] 2. “Learning from scratch” in the brain | Steven Byrnes | 57 | 31 | 7 | 2755 | 89
Shard Theory: An Overview | David Udell | 157 | 29 | 28 | 2043 | 70
Externalized reasoning oversight: a research direction for language model alignment | tamera | 117 | 29 | 29 | 1812 | 62
Notes on Resolve | David Gross | 9 | 28 | 0 | 514 | 18
Brain Efficiency: Much More than You Wanted to Know | jacob_cannell | 201 | 28 | 23 | 1831 | 65
How to Diversify Conceptual Alignment: the Model Behind Refine | adamShimi | 87 | 28 | 25 | 869 | 31
Where I agree and disagree with Eliezer | paulfchristiano | 862 | 28 | 28 | 1860 | 66
Notes on Rationality | David Gross | 16 | 27 | 0 | 469 | 17
A Longlist of Theories of Impact for Interpretability | Neel Nanda | 124 | 27 | 26 | 2613 | 97
[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL | Steven Byrnes | 66 | 26 | 4 | 1484 | 57
Supervise Process, not Outcomes | stuhlmueller | 132 | 26 | 23 | 2288 | 88
[Intro to brain-like-AGI safety] 1. What's the problem & Why work on it now? | Steven Byrnes | 146 | 26 | 9 | 1431 | 55
What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? | johnswentworth | 118 | 25 | 24 | 1012 | 40
(My understanding of) What Everyone in Technical Alignment is Doing and Why | Thomas Larsen | 411 | 24 | 24 | 1554 | 65
[Intro to brain-like-AGI safety] 13. Symbol grounding & human social instincts | Steven Byrnes | 67 | 24 | 7 | 1279 | 53
A shot at the diamond-alignment problem | TurnTrout | 92 | 24 | 18 | 1872 | 78
You Are Not Measuring What You Think You Are Measuring | johnswentworth | 350 | 22 | 18 | 1473 | 67
Refine: An Incubator for Conceptual Alignment Research Bets | adamShimi | 143 | 22 | 21 | 1817 | 83
Epistemological Vigilance for Alignment | adamShimi | 61 | 22 | 13 | 2032 | 92
Prizes for ELK proposals | paulfchristiano | 143 | 21 | 20 | 1046 | 50
A note about differential technological development | So8res | 185 | 21 | 15 | 2294 | 109
Six Dimensions of Operational Adequacy in AGI Projects | Eliezer Yudkowsky | 298 | 21 | 21 | 1631 | 78
Humans provide an untapped wealth of evidence about alignment | TurnTrout | 186 | 20 | 14 | 1671 | 84
200 Concrete Open Problems in Mechanistic Interpretability: Introduction | Neel Nanda | 98 | 20 | 10 | 515 | 26
Call For Distillers | johnswentworth | 204 | 20 | 19 | 902 | 45
Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 100 | 20 | 15 | 2360 | 118
chinchilla's wild implications | nostalgebraist | 403 | 19 | 19 | 1175 | 62
Abstractions as Redundant Information | johnswentworth | 64 | 19 | 12 | 1240 | 65
Two-year update on my personal AI timelines | Ajeya Cotra | 287 | 19 | 19 | 1554 | 82
A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 299 | 19 | 18 | 1360 | 72
ELK prize results | paulfchristiano | 135 | 18 | 18 | 1259 | 70
Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 115 | 18 | 17 | 983 | 55
Godzilla Strategies | johnswentworth | 137 | 18 | 16 | 1597 | 89
[Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning | Steven Byrnes | 52 | 18 | 2 | 883 | 49
Worlds Where Iterative Design Fails | johnswentworth | 185 | 18 | 16 | 1146 | 64
Let’s think about slowing down AI | KatjaGrace | 522 | 18 | 18 | 1297 | 72
[Intro to brain-like-AGI safety] 4. The “short-term predictor” | Steven Byrnes | 64 | 17 | 2 | 914 | 54
[Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 126 | 17 | 16 | 784 | 46
What an actually pessimistic containment strategy looks like | lc | 647 | 17 | 17 | 1192 | 70
[Intro to brain-like-AGI safety] 15. Conclusion: Open problems, how to help, AMA | Steven Byrnes | 90 | 17 | 7 | 1506 | 89
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 240 | 17 | 17 | 1599 | 94
[Intro to brain-like-AGI safety] 12. Two paths forward: “Controlled AGI” and “Social-instinct AGI” | Steven Byrnes | 42 | 16 | 4 | 1016 | 64
Instead of technical research, more people should focus on buying time | Akash | 100 | 16 | 13 | 928 | 58
Human values & biases are inaccessible to the genome | TurnTrout | 90 | 15 | 14 | 1474 | 98
Conjecture: Internal Infohazard Policy | Connor Leahy | 132 | 15 | 13 | 1364 | 91
Open Problems in AI X-Risk [PAIS #5] | Dan H | 59 | 15 | 12 | 1470 | 98
Optimality is the tiger, and agents are its teeth | Veedrac | 288 | 15 | 14 | 1343 | 90
why assume AGIs will optimize for fixed goals? | nostalgebraist | 138 | 15 | 14 | 1127 | 75
Mechanistic anomaly detection and ELK | paulfchristiano | 133 | 14 | 13 | 802 | 57
Notes on Caution | David Gross | 13 | 14 | 0 | 289 | 21
Threat Model Literature Review | zac_kenton | 73 | 14 | 13 | 977 | 70
Discovering Agents | zac_kenton | 71 | 14 | 14 | 1018 | 73
[Intro to brain-like-AGI safety] 8. Takeaways from neuro 1/2: On AGI development | Steven Byrnes | 50 | 13 | 2 | 770 | 59
[Link] A minimal viable product for alignment | janleike | 53 | 13 | 13 | 1208 | 93
An Open Agency Architecture for Safe Transformative AI | davidad | 74 | 13 | 13 | 855 | 66
«Boundaries», Part 3a: Defining boundaries as directed Markov blankets | Andrew_Critch | 86 | 13 | 12 | 501 | 39
Niceness is unnatural | So8res | 121 | 13 | 9 | 1287 | 99
RL with KL penalties is better seen as Bayesian inference | Tomek Korbak | 114 | 13 | 12 | 785 | 60
The Plan - 2022 Update | johnswentworth | 235 | 13 | 13 | 706 | 54
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 192 | 12 | 10 | 632 | 53
Latent Adversarial Training | Adam Jermyn | 40 | 12 | 12 | 938 | 78
PreDCA: vanessa kosoy's alignment protocol | Tamsin Leake | 50 | 12 | 3 | 517 | 43
Common misconceptions about OpenAI | Jacob_Hilton | 239 | 12 | 12 | 1052 | 88
AI strategy nearcasting | HoldenKarnofsky | 79 | 12 | 7 | 738 | 62
Conditioning Generative Models | Adam Jermyn | 24 | 12 | 9 | 1386 | 116
What does it take to defend the world against out-of-control AGIs? | Steven Byrnes | 180 | 12 | 8 | 877 | 73
Monitoring for deceptive alignment | evhub | 135 | 12 | 9 | 875 | 73
“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments | Andrew_Critch | 129 | 12 | 10 | 937 | 78
Language models seem to be much better than humans at next-token prediction | Buck | 172 | 12 | 12 | 976 | 81
Superintelligent AI is necessary for an amazing future, but far from sufficient | So8res | 132 | 12 | 7 | 1359 | 113
Announcing the Alignment of Complex Systems Research Group | Jan_Kulveit | 91 | 12 | 8 | 1271 | 106
Acceptability Verification: A Research Agenda | David Udell | 50 | 12 | 11 | 1206 | 101
«Boundaries», Part 1: a key missing concept from utility theory | Andrew_Critch | 158 | 12 | 10 | 541 | 45
Circumventing interpretability: How to defeat mind-readers | Lee Sharkey | 109 | 12 | 10 | 1071 | 89
Gradient hacking: definitions and examples | Richard_Ngo | 38 | 12 | 9 | 1103 | 92
Don't leave your fingerprints on the future | So8res | 109 | 12 | 8 | 914 | 76
Productive Mistakes, Not Perfect Answers | adamShimi | 97 | 12 | 6 | 742 | 62
Searching for Search | NicholasKees | 81 | 11 | 11 | 582 | 53
[Intro to brain-like-AGI safety] 10. The alignment problem | Steven Byrnes | 48 | 11 | 1 | 726 | 66
Nearcast-based "deployment problem" analysis | HoldenKarnofsky | 85 | 11 | 8 | 433 | 39
Path dependence in ML inductive biases | Vivek Hebbar | 67 | 11 | 11 | 510 | 46
[Intro to brain-like-AGI safety] 9. Takeaways from neuro 2/2: On AGI motivation | Steven Byrnes | 42 | 11 | 2 | 606 | 55
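The table does not define the Avg Pingback Karma column, but the listed values match Total Pingback Karma divided by Pingback Count, rounded to the nearest integer. Below is a minimal Python sketch of that check; the rounding rule is an inference from the numbers above, not something stated alongside the data.

```python
# Sketch: check the inferred relationship
#   Avg Pingback Karma ~= round(Total Pingback Karma / Pingback Count)
# using a few rows copied from the table. The rounding rule is an assumption.

sample_rows = [
    # (post, pingback_count, total_pingback_karma, listed_avg)
    ("AGI Ruin: A List of Lethalities", 159, 12518, 79),
    ("Simulators", 128, 7723, 60),
    ('MIRI announces new "Death With Dignity" strategy', 74, 8158, 110),
]

for post, count, total, listed_avg in sample_rows:
    derived = round(total / count)
    assert derived == listed_avg, (post, derived, listed_avg)
    print(f"{post}: {total} / {count} = {total / count:.1f} -> {derived} (listed: {listed_avg})")
```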