| Link | Author | Post Karma | Pingback Count | Non-author Pingbacks | Total Pingback Karma | Avg Pingback Karma |
|---|---|---|---|---|---|---|
| AGI Ruin: A List of Lethalities | Eliezer Yudkowsky | 870 | 159 | 157 | 12518 | 79 |
| Simulators | janus | 612 | 128 | 123 | 7723 | 60 |
| A central AI alignment problem: capabilities generalization, and the sharp left turn | So8res | 273 | 97 | 90 | 7728 | 80 |
| Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover | Ajeya Cotra | 367 | 84 | 84 | 5147 | 61 |
| MIRI announces new "Death With Dignity" strategy | Eliezer Yudkowsky | 334 | 74 | 73 | 8158 | 110 |
| Reward is not the optimization target | TurnTrout | 341 | 63 | 46 | 4517 | 72 |
| A Mechanistic Interpretability Analysis of Grokking | Neel Nanda | 367 | 49 | 42 | 3474 | 71 |
| How likely is deceptive alignment? | evhub | 101 | 48 | 35 | 2931 | 61 |
| How To Go From Interpretability To Alignment: Just Retarget The Search | johnswentworth | 167 | 46 | 43 | 3398 | 74 |
| Why Agent Foundations? An Overly Abstract Explanation | johnswentworth | 285 | 43 | 42 | 2754 | 64 |
| The shard theory of human values | Quintin Pope | 238 | 43 | 43 | 2867 | 67 |
| On how various plans miss the hard bits of the alignment challenge | So8res | 292 | 41 | 34 | 3312 | 81 |
| [Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering | Steven Byrnes | 79 | 37 | 9 | 3059 | 83 |
| Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 195 | 36 | 31 | 2291 | 64 |
| Mysteries of mode collapse | janus | 279 | 33 | 30 | 2866 | 87 |
| How might we align transformative AI if it’s developed very soon? | HoldenKarnofsky | 136 | 33 | 22 | 2375 | 72 |
| A transparency and interpretability tech tree | evhub | 148 | 32 | 25 | 2367 | 74 |
| [Intro to brain-like-AGI safety] 2. “Learning from scratch” in the brain | Steven Byrnes | 57 | 31 | 7 | 2755 | 89 |
| Shard Theory: An Overview | David Udell | 157 | 29 | 28 | 2043 | 70 |
| Externalized reasoning oversight: a research direction for language model alignment | tamera | 117 | 29 | 29 | 1812 | 62 |
| Notes on Resolve | David Gross | 9 | 28 | 0 | 514 | 18 |
| Brain Efficiency: Much More than You Wanted to Know | jacob_cannell | 201 | 28 | 23 | 1831 | 65 |
| How to Diversify Conceptual Alignment: the Model Behind Refine | adamShimi | 87 | 28 | 25 | 869 | 31 |
| Where I agree and disagree with Eliezer | paulfchristiano | 862 | 28 | 28 | 1860 | 66 |
| Notes on Rationality | David Gross | 16 | 27 | 0 | 469 | 17 |
| A Longlist of Theories of Impact for Interpretability | Neel Nanda | 124 | 27 | 26 | 2613 | 97 |
| [Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL | Steven Byrnes | 66 | 26 | 4 | 1484 | 57 |
| Supervise Process, not Outcomes | stuhlmueller | 132 | 26 | 23 | 2288 | 88 |
| [Intro to brain-like-AGI safety] 1. What's the problem & Why work on it now? | Steven Byrnes | 146 | 26 | 9 | 1431 | 55 |
| What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? | johnswentworth | 118 | 25 | 24 | 1012 | 40 |
| (My understanding of) What Everyone in Technical Alignment is Doing and Why | Thomas Larsen | 411 | 24 | 24 | 1554 | 65 |
| [Intro to brain-like-AGI safety] 13. Symbol grounding & human social instincts | Steven Byrnes | 67 | 24 | 7 | 1279 | 53 |
| A shot at the diamond-alignment problem | TurnTrout | 92 | 24 | 18 | 1872 | 78 |
| You Are Not Measuring What You Think You Are Measuring | johnswentworth | 350 | 22 | 18 | 1473 | 67 |
| Refine: An Incubator for Conceptual Alignment Research Bets | adamShimi | 143 | 22 | 21 | 1817 | 83 |
| Epistemological Vigilance for Alignment | adamShimi | 61 | 22 | 13 | 2032 | 92 |
| Prizes for ELK proposals | paulfchristiano | 143 | 21 | 20 | 1046 | 50 |
| A note about differential technological development | So8res | 185 | 21 | 15 | 2294 | 109 |
| Six Dimensions of Operational Adequacy in AGI Projects | Eliezer Yudkowsky | 298 | 21 | 21 | 1631 | 78 |
| Humans provide an untapped wealth of evidence about alignment | TurnTrout | 186 | 20 | 14 | 1671 | 84 |
| 200 Concrete Open Problems in Mechanistic Interpretability: Introduction | Neel Nanda | 98 | 20 | 10 | 515 | 26 |
| Call For Distillers | johnswentworth | 204 | 20 | 19 | 902 | 45 |
| Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 100 | 20 | 15 | 2360 | 118 |
| chinchilla's wild implications | nostalgebraist | 403 | 19 | 19 | 1175 | 62 |
| Abstractions as Redundant Information | johnswentworth | 64 | 19 | 12 | 1240 | 65 |
| Two-year update on my personal AI timelines | Ajeya Cotra | 287 | 19 | 19 | 1554 | 82 |
| A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 299 | 19 | 18 | 1360 | 72 |
| ELK prize results | paulfchristiano | 135 | 18 | 18 | 1259 | 70 |
| Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 115 | 18 | 17 | 983 | 55 |
| Godzilla Strategies | johnswentworth | 137 | 18 | 16 | 1597 | 89 |
| [Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning | Steven Byrnes | 52 | 18 | 2 | 883 | 49 |
| Worlds Where Iterative Design Fails | johnswentworth | 185 | 18 | 16 | 1146 | 64 |
| Let’s think about slowing down AI | KatjaGrace | 522 | 18 | 18 | 1297 | 72 |
| [Intro to brain-like-AGI safety] 4. The “short-term predictor” | Steven Byrnes | 64 | 17 | 2 | 914 | 54 |
| [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 126 | 17 | 16 | 784 | 46 |
| What an actually pessimistic containment strategy looks like | lc | 647 | 17 | 17 | 1192 | 70 |
| [Intro to brain-like-AGI safety] 15. Conclusion: Open problems, how to help, AMA | Steven Byrnes | 90 | 17 | 7 | 1506 | 89 |
| How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 240 | 17 | 17 | 1599 | 94 |
| [Intro to brain-like-AGI safety] 12. Two paths forward: “Controlled AGI” and “Social-instinct AGI” | Steven Byrnes | 42 | 16 | 4 | 1016 | 64 |
| Instead of technical research, more people should focus on buying time | Akash | 100 | 16 | 13 | 928 | 58 |
| Human values & biases are inaccessible to the genome | TurnTrout | 90 | 15 | 14 | 1474 | 98 |
| Conjecture: Internal Infohazard Policy | Connor Leahy | 132 | 15 | 13 | 1364 | 91 |
| Open Problems in AI X-Risk [PAIS #5] | Dan H | 59 | 15 | 12 | 1470 | 98 |
| Optimality is the tiger, and agents are its teeth | Veedrac | 288 | 15 | 14 | 1343 | 90 |
| why assume AGIs will optimize for fixed goals? | nostalgebraist | 138 | 15 | 14 | 1127 | 75 |
| Mechanistic anomaly detection and ELK | paulfchristiano | 133 | 14 | 13 | 802 | 57 |
| Notes on Caution | David Gross | 13 | 14 | 0 | 289 | 21 |
| Threat Model Literature Review | zac_kenton | 73 | 14 | 13 | 977 | 70 |
| Discovering Agents | zac_kenton | 71 | 14 | 14 | 1018 | 73 |
| [Intro to brain-like-AGI safety] 8. Takeaways from neuro 1/2: On AGI development | Steven Byrnes | 50 | 13 | 2 | 770 | 59 |
| [Link] A minimal viable product for alignment | janleike | 53 | 13 | 13 | 1208 | 93 |
| An Open Agency Architecture for Safe Transformative AI | davidad | 74 | 13 | 13 | 855 | 66 |
| «Boundaries», Part 3a: Defining boundaries as directed Markov blankets | Andrew_Critch | 86 | 13 | 12 | 501 | 39 |
| Niceness is unnatural | So8res | 121 | 13 | 9 | 1287 | 99 |
| RL with KL penalties is better seen as Bayesian inference | Tomek Korbak | 114 | 13 | 12 | 785 | 60 |
| The Plan - 2022 Update | johnswentworth | 235 | 13 | 13 | 706 | 54 |
| The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 192 | 12 | 10 | 632 | 53 |
| Latent Adversarial Training | Adam Jermyn | 40 | 12 | 12 | 938 | 78 |
| PreDCA: vanessa kosoy's alignment protocol | Tamsin Leake | 50 | 12 | 3 | 517 | 43 |
| Common misconceptions about OpenAI | Jacob_Hilton | 239 | 12 | 12 | 1052 | 88 |
| AI strategy nearcasting | HoldenKarnofsky | 79 | 12 | 7 | 738 | 62 |
| Conditioning Generative Models | Adam Jermyn | 24 | 12 | 9 | 1386 | 116 |
| What does it take to defend the world against out-of-control AGIs? | Steven Byrnes | 180 | 12 | 8 | 877 | 73 |
| Monitoring for deceptive alignment | evhub | 135 | 12 | 9 | 875 | 73 |
| “Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments | Andrew_Critch | 129 | 12 | 10 | 937 | 78 |
| Language models seem to be much better than humans at next-token prediction | Buck | 172 | 12 | 12 | 976 | 81 |
| Superintelligent AI is necessary for an amazing future, but far from sufficient | So8res | 132 | 12 | 7 | 1359 | 113 |
| Announcing the Alignment of Complex Systems Research Group | Jan_Kulveit | 91 | 12 | 8 | 1271 | 106 |
| Acceptability Verification: A Research Agenda | David Udell | 50 | 12 | 11 | 1206 | 101 |
| «Boundaries», Part 1: a key missing concept from utility theory | Andrew_Critch | 158 | 12 | 10 | 541 | 45 |
| Circumventing interpretability: How to defeat mind-readers | Lee Sharkey | 109 | 12 | 10 | 1071 | 89 |
| Gradient hacking: definitions and examples | Richard_Ngo | 38 | 12 | 9 | 1103 | 92 |
| Don't leave your fingerprints on the future | So8res | 109 | 12 | 8 | 914 | 76 |
| Productive Mistakes, Not Perfect Answers | adamShimi | 97 | 12 | 6 | 742 | 62 |
| Searching for Search | NicholasKees | 81 | 11 | 11 | 582 | 53 |
| [Intro to brain-like-AGI safety] 10. The alignment problem | Steven Byrnes | 48 | 11 | 1 | 726 | 66 |
| Nearcast-based "deployment problem" analysis | HoldenKarnofsky | 85 | 11 | 8 | 433 | 39 |
| Path dependence in ML inductive biases | Vivek Hebbar | 67 | 11 | 11 | 510 | 46 |
| [Intro to brain-like-AGI safety] 9. Takeaways from neuro 2/2: On AGI motivation | Steven Byrnes | 42 | 11 | 2 | 606 | 55 |
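
Note: the Avg Pingback Karma column appears to be Total Pingback Karma divided by Pingback Count, rounded to the nearest integer (for example, 12518 / 159 ≈ 78.7, shown as 79 for "AGI Ruin: A List of Lethalities"); this derivation is inferred from the data rather than stated in the source.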