Please add/comment with papers/posts:
Edition | Team | Participants | Format | Title | Link | Comments
AISC1 | personal post | Justin Shovelain, Michael Aird, others? | Blog post | Improving the future by influencing actors' benevolence, intelligence, and power | https://forum.effectivealtruism.org/posts/4oGYbvcy2SRHTWgWk/improving-the-future-by-influencing-actors-benevolence#fn-dCBkq5f8sD4CzHwn8-1 | Approximately 1/4 of the work on this post was done as part of AISC according to Justin Shovelain
AISC1 | Side effects in Grid World | Jessica Cooper, Karol Kubicki, Gavin Leech, Tom McGrath | Software | Preventing Side-effects in Gridworlds | https://www.gleech.org/grids/ | Noted in Krakovna's AIS resources. Cited in AIES paper.
AISC1 | Safe AF | James Bell, Linda Linsefors, Caspar Oesterheld, Joar Skalse | Paper | Reinforcement Learning in Newcomblike Environments | https://proceedings.neurips.cc/paper/2021/hash/b9ed18a301c9f3d183938c451fa183df-Abstract.html | Started at the first AISC at the beginning of '18 but published in Dec '20. It has since been accepted for a spotlight presentation at NeurIPS 2021.
AISC2 | Human Preference Types | Nandi, Sabrina, Erin | Blog post | Acknowledging Human Preference Types to Support Value Learning | https://www.alignmentforum.org/posts/mSPsyEwaymS74unND/acknowledging-human-preference-types-to-support-value |
AISC2 | Policymaking for AI Strategy | Brandon Perry, Risto Uuk | Paper | AI Governance and the Policymaking Process: Key Considerations for Reducing AI Risk | https://www.mdpi.com/2504-2289/3/2/26 | Cited by FHI research associate
AISC2 | Corrupt Reward MDPs | Tomasz Kisielewski, David Lindner, Jason Mancuso, Alok Singh | Software | Corrupt Reward MDPs | https://github.com/jvmancuso/safe-grid-agents |
AISC2 | Corrigibility | Vegard Blindheim, Anton Osika, Roland Pihlakas | Blog post | Exponentially diminishing returns and conjunctive goals: Mitigating Goodhart’s law with common sense. Towards corrigibility and interruptibility via the golden middle way. | https://medium.com/threelaws/diminishing-returns-and-conjunctive-goals-towards-corrigibility-and-interruptibility-2ec594fed75c |
AISC2 | Feature Visualization for Deep Reinforcement Learning | Zera Alexander, Andrew Schreiber, Fabian Steuer | Software | Feature Visualization for Deep Reinforcement Learning | https://github.com/andrewschreiber/agent | Got to third and final round of EA Grant (not accepted)
AISC2 | IRL Benchmark | Adria Garriga-Alonso, Anton Osika, Johannes Heidecke, Max Daniel, Sayan Sarkar | Software | IRL Benchmark | https://github.com/JohannesHeidecke/irl-benchmark |
AISC2 | Assumptions of Human Values | Jan Kulveit, Linda Linsefors, Alexey Turchin | Blog post | Multi-agent predictive minds and AI alignment | https://www.lesswrong.com/posts/3fkBWpE4f9nYbdf7E/multi-agent-minds-and-ai-alignment | Jan has written a blog post about his best-guess model of how human values and motivations work. It probably does not stem much from work at AISC2 directly, since he was busy with other things during the physical retreat. Mentioned in the post: "Part of this originated in the efforts of the “Hidden Assumptions” team on the 2nd AI safety camp, and my thoughts about how minds work are inspired by CFAR."
AISC2 | Value Learning in Games | Stanislav Böhm, Tomáš Gavenčiak, Torben Swoboda, Mikhail Yagudin | Blog post | Value learning in games | https://docs.google.com/document/d/1kxXk7KkFfJAqrk0kDjDJ6Tvz_FL04p34twLj19Tv_IQ/edit#heading=h.cy7im45es3q0 |
AISC2 | Corrupt Reward MDPs | Jason Mancuso, Tomasz Kisielewski, David Lindner, Alok Singh | Paper | Detecting Spiky Corruption in Markov Decision Processes | https://ceur-ws.org/Vol-2419/paper_28.pdf | Presented in session at AI Safety Workshop in IJCAI 2019
AISC3 | Modeling Cooperation | Jonas Müller, Miles Tidmarsh, Vasily Kuznetsov | Software | (implementation of their formal mathematical model) | www.modelingcooperation.com/model |
AISC3 | Debate | Vojta Kovarik, Anna Gajdova, David Lindner, Lukas Finnveden, Rajashree Agrawal | Blog post | AI Safety Debate and Its Applications | https://www.lesswrong.com/posts/5Kv2qNfRyXXihNrx2/ai-safety-debate-and-its-applications |
AISC3 | Embedded agents | Arushi, Davide, Sayan | Paper | Categorizing Wireheading in Partially Embedded Agents | https://arxiv.org/abs/1906.09136 | Presented poster at AI Safety Workshop in IJCAI 2019
AISC3 | RL Attention | Dmitry Nikulin, Sebastian Kosch, Fabian Steuer, Hoagy Cunningham | Blog post | Regularization and visualization of attention in reinforcement learning agents | https://attentionentropy.github.io/ |
AISC4 | Generalization in Reward Learning | Anton Makiievskyi, Liang Zhou, Max Chiswick | Blog post | Assessing Generalization in Reward Learning with Procedurally Generated Games | https://chisness.medium.com/assessing-generalization-in-reward-learning-intro-and-background-da6c99d9e48 |
AISC4 | Goal Directedness | Adam Shimi, Joe Collman, Michele Campolo, Sabrina Tang | Blog post | Focus: you are allowed to be bad at accomplishing your goals | https://www.lesswrong.com/s/DTnoFhDm7ZT2ecJMw/ | Note that Adam Shimi was already focused on goal-directedness work before applying to AISC4, and would probably have written a similar volume of posts in either case (in Remmelt's opinion). The last 5 posts were published (not necessarily fully finished) after the camp.
AISC4 | Goal Directedness | Adam Shimi, Joe Collman, Michele Campolo | Blog post | Understanding Goal Directedness (sequence) | https://www.alignmentforum.org/s/o58ZMNaovdztbLfvN | 3/4 of the goal directedness team, who continued researching together after the camp officially ended
AISC4 | Survey on AI X-Risk Scenarios | Sam Clarke, Alexis Carlier, Jonas Schuett | Blog post | Survey on AI existential risk scenarios | https://www.lesswrong.com/posts/WiXePTj7KeEycbiwK/survey-on-ai-existential-risk-scenarios | Also shared full results internally with researchers at FHI and elsewhere. They said they did not publish widely because of PR risks.
AISC4 | Human extracted preferences | Mislav Juric, Taylor Kulp-McDowall, Arun Raja, Riccardo Volpato, Nevan Wichers | Blog post | Extraction of human preferences 👨→🤖 | https://www.lesswrong.com/posts/PZYD5kBpeHWgE5jX4/extraction-of-human-preferences |
AISC5 | Objective Robustness Failures | Jack Koch, Lauro Langosco, Jacob Pfau, James Le (and Lee Sharkey) | Paper | Goal Misgeneralization in Deep Reinforcement Learning | https://proceedings.mlr.press/v162/langosco22a.html | Accepted for a short presentation at ICML. Accepted for a poster at the ICML UDL workshop. Cited by Dan Hendrycks et al. TODO: Add NeurIPS link if published. https://theturingprize.com/ wants to retrospectively award them a prize of $10k, to be given as a donation to a charity or fund of their choice (or a mix of charities/funds).
AISC5 | Objective Robustness Failures | Jack Koch, Lauro Langosco, Jacob Pfau, James Le (and Lee Sharkey) | Blog post | Empirical Observations of Objective Robustness Failures; Discussion: Objective Robustness and Inner Alignment Terminology | Post 1 (empirical failures) Post 2 (Terminology) | Two simultaneous posts. Summarised in AI-Alignment Newsletter by Rohin Shah.
AISC5 | Cooperativity & Common Pool Resources | Quinn Dougherty, Ben Greenberg, Ariel Kwiatkowski | Software | | https://github.com/RedTachyon/cpr_reputation/ |
AISC5 | Pessimistic Ask-For-Help Agents for Safe Exploration | Jamie Bernardi, David Reber, Magdalena Wache, Peter Barnett, Max Clarke | Software | | https://github.com/j-bernardi/pessimistic-agents |
AISC5 | Human extracted preferences | Mislav Juric, Taylor Kulp-McDowall, Arun Raja, Riccardo Volpato, Nevan Wichers | Software | | https://github.com/arunraja-hub/Preference_Extraction |
AISC5 | Objective Robustness Failures | Jack Koch, Lauro Langosco, Jacob Pfau, James Le (and Lee Sharkey) | Blog post | [Video actually]: 'We Were Right! Real Inner Misalignment' | https://www.youtube.com/watch?v=zkbPdEHEyEI&ab_channel=RobertMiles | The team was emailed by Rob Miles about possibly putting together a YouTube explanation of it
AISC5 | Cooperativity & Common Pool Resources | Quinn Dougherty, Ben Greenberg, Ariel Kwiatkowski | Blog post | AISC5 Retrospective: Mechanisms for Avoiding Tragedy of the Commons in Common Pool Resource Problems | https://www.lesswrong.com/posts/LBwpubeZSi3ottfjs/aisc5-retrospective-mechanisms-for-avoiding-tragedy-of-the |
AISC5 | Multi-Objective Decision-Making | Robert Klassert, Roland Pihlakas, Ben Smith | Blog post | A brief review of the reasons multi-objective RL could be important in AI Safety Research | https://www.lesswrong.com/posts/i5dLfi6m6FCexReK9/a-brief-review-of-the-reasons-multi-objective-rl-could-be | Briefly mentioned in a later post by Peter Vamplew: "We have provided a short list of recommended reading at the end of this post, and we refer the reader again to the post of Smith, Pihlakas and Klassert for an overview of work in this area." https://www.lesswrong.com/posts/eeEEgNeTepZb6F6NF
AISC6 | Multi-Objective Decision-Making | | Blog post | Sets of objectives for a multi-objective RL agent to optimize | https://www.lesswrong.com/posts/4mvdZXjwJHv9tSAWB/sets-of-objectives-for-a-multi-objective-rl-agent-to-1 |
AISC7 | Multi-Objective Decision-Making | | Paper | Using soft maximin for risk averse multi-objective decision-making | https://link.springer.com/article/10.1007/s10458-022-09586-2 | Published in the journal Autonomous Agents and Multi-Agent Systems.
AISC6 | (personal post) | Jan Kirchner | Blog post | Inferring utility functions from locally non-transitive preferences | https://www.lesswrong.com/posts/QZiGEDiobFz8ropA5/inferring-utility-functions-from-locally-non-transitive | "As part of the AI Safety Camp, I've been diving a bit deeper into the foundations of expected utility theory and preference learning. In this post, I am making explicit a connection between those two things that (I assume) many people already made implicitly. But I couldn't find a nice exposition of this argument so I wrote it up. Any feedback is of course highly welcome!"
AISC6 | Language Models as Tools for Alignment Research | Jan Kirchner, Jacques Thibodeau, Logan Smith (as external collaborator?), Kyle and Laria (mentors), Arush? | Blog post | A survey of tool use and workflows in alignment research | https://www.alignmentforum.org/posts/ebYiodG3MAEqskCDG/a-survey-of-tool-use-and-workflows-in-alignment-research-1 | Got a shout out from Jan Leike ('Other researchers have started working on this approach too.') in the post A minimal viable product for alignment (https://aligned.substack.com/p/alignment-mvp?s=r)
AISC6 | Constraints from Selection - Modularity subteam | Lucius Bushnaq, Avery Griffin, Callum McDougall | Blog post | Project Intro: Selection Theorems for Modularity | https://www.alignmentforum.org/posts/XKwKJCXgSKhSr9bZY/project-intro-selection-theorems-for-modularity |
AISC6 | Constraints from Selection - Modularity subteam | Lucius Bushnaq, Avery Griffin, Callum McDougall | Blog post | Theories of Modularity in the Biological Literature | https://www.alignmentforum.org/posts/JzTfKrgC7Lfz3zcwM/theories-of-modularity-in-the-biological-literature |
AISC6 | Semantic Side-Effect Minimisation | Fabian Schimpf, Lukas Fluri | Blog post | Open Problems in Negative Side Effect Minimization | https://www.alignmentforum.org/posts/pnAxcABq9GBDG5BNW/open-problems-in-negative-side-effect-minimization |
AISC6 | Impact of Memetics on Alignment | Harriet Farlow | Blog post | Machines vs Memes Part 1: AI Alignment and Memetics | https://www.lesswrong.com/posts/JLH6ido4qoBtYmnNR/machines-vs-memes-part-1 |
AISC6 | Impact of Memetics on Alignment | Nate Rush | Blog post | Machines vs. Memes 2: Memetically-Motivated Model Extensions | https://www.lesswrong.com/posts/gumkW3vy9mhjZriuc/machines-vs-memes-2-memetically-motivated-model-extensions |
AISC6 | Impact of Memetics on Alignment | Claudio Ceruti | Blog post | Machines vs Memes Part 3: Imitation and Memes | https://www.lesswrong.com/posts/nbDFj4ZS6WSDKtSk4/machines-vs-memes-part-3-imitation-and-memes |
AISC6 | Impact of Memetics on Alignment | Harriet Farlow & Claudio Ceruti | Paper | Memes in the Machine: Ideological Propagation in Large Language Models | https://ieeexplore.ieee.org/document/10894714 | Won best paper in 2024, says Harriet
AISC6 | Impact of Memetics on Alignment | Harriet Farlow & Claudio Ceruti | Paper | Beyond Words: Memetic Theory as a Lens to Expose Ideological Bias in Language Models | https://ieeexplore.ieee.org/document/10851522 |
AISC6 | (personal post) | Jan Czechowsky | Blog post | Steganography and the CycleGAN - alignment failure case study | https://www.lesswrong.com/posts/uutXLm2DRcCtFBZ2D/steganography-and-the-cyclegan-alignment-failure-case-study |
AISC6 | Constraints from Selection - Modularity subteam | Lucius Bushnaq, Avery Griffin, Callum McDougall | Blog post | Ten experiments in modularity, which we'd like you to run! | https://www.lesswrong.com/posts/99WtcMpsRqZcrocCd/ten-experiments-in-modularity-which-we-d-like-you-to-run |
AISC6 | Pipeline for Measuring Misalignment | Marius Hobbhahn, Eric Landgrebe, Beth Barnes (mentor) | Blog post | Reflection Mechanisms as an Alignment target: A survey | https://www.lesswrong.com/posts/XyBWkoaqfnuEyNWXi/reflection-mechanisms-as-an-alignment-target-a-survey-1 |
AISC6 | Pipeline for Measuring Misalignment | Marius Hobbhahn, Eric Landgrebe, Beth Barnes (mentor) | Paper | Reflection Mechanisms as an Alignment Target: A Survey | https://openreview.net/forum?id=4eMzKmZ6xW | Paper version that was accepted to the NeurIPS ML Safety workshop.
AISC6 | Constraints from Selection - Modularity subteam | Lucius Bushnaq, Avery Griffin, Callum McDougall | Blog post | What Is The True Name of Modularity? | https://www.lesswrong.com/posts/TTTHwLpcewGjQHWzh/what-is-the-true-name-of-modularity |
AISC6 | Table-Top Role-Playing Game | | Blog post | [Announcement:] AI takeover tabletop RPG: "The Treacherous Turn" | https://www.lesswrong.com/posts/b5EqwQZw7ww2K28Ki/ai-takeover-tabletop-rpg-the-treacherous-turn |
AISC6 | Language Models as Tools for Alignment Research | Jan Kirchner, Jacques Thibodeau, Logan Smith, "janus" | Blog post | Results from a survey on tool use and workflows in alignment research | https://www.lesswrong.com/posts/a2io2mcxTWS4mxodF/results-for-a-survey-of-tool-use-and-workflows-in-alignment |
AISC6 | Language Models as Tools for Alignment Research | Jan Kirchner, Jacques Thibodeau, Logan Smith, "janus" | Blog post | A descriptive, not prescriptive, overview of current AI Alignment Research | https://www.lesswrong.com/posts/FgjcHiWvADgsocE34/a-descriptive-not-prescriptive-overview-of-current-ai |
AISC7 | AGI Safety Impossibility Theorem | Forrest Landry | Blog post | [List of posts: see Comments for posts published around the October 2022 retreat; see Link for posts published afterwards.] | https://mflb.com/ai_alignment_1/title_reorg_psr.html | *Published around AISC8 retreat*
Narrative structure:
- AI Scope of Work (written day after camp); Meta-Narrative Sequence of AI Substrate Takeover
Explanation snippets:
- XKCD-style Comic Overview (just minor edits); Superintelligence Safety Q&A (just minor edits); Negative Arguments; Substrate-Dependent Needs; APS review.
Responses to anonymised questions/skeptical counterarguments:
- Super-ordinate Claims; SGD Selection; Optimisation Cycles; Alignment Drift; Math Expectations; Right Skepticism.
AISC8 | Uncontrollable Dynamics | Remmelt Ellen | Blog post | The Control Problem: Unsolved or Unsolvable? | https://www.lesswrong.com/posts/xp6n2MG5vQkPpFEBH/the-control-problem-unsolved-or-unsolvable |
AISC8 | Uncontrollable Dynamics | Roman Yen | Blog post | On the possibility of impossibility of AGI Long-Term Safety | https://www.lesswrong.com/posts/zuXtMKuQRGAhZMoKk/on-the-possibility-of-impossibility-of-agi-long-term-safety#fnmso3ekucj2b |
AISC8 | Failure Stories | Karl von Wendt, Sofia Bharadia, Peter Drotos, Artem Korotkov... mespa, mruwnik | Blog post | Agentic Mess | https://www.lesswrong.com/posts/LyJAFBuuEfd4kxgsw/agentic-mess-a-failure-story | Video version here.
AISC8 | Failure Stories | Karl von Wendt, Sofia Bharadia, Peter Drotos, Artem Korotkov... mespa, mruwnik | Blog post | Paths to Failure | https://www.lesswrong.com/posts/yv4xAnkEyWvpXNBte/paths-to-failure |
AISC8 | Failure Stories | Karl von Wendt, Sofia Bharadia, Peter Drotos, Artem Korotkov... mespa, mruwnik | Blog post | A Friendly Face | https://www.lesswrong.com/posts/iRFxvNeLbHNRCzA2S/a-friendly-face-another-failure-story |
AISC8 | Interpretable Architectures | Robert Kralisch, Anton Zheltoukhov, David Liu, Sohaib Imran | Blog post | An Investigation of the Frameworks of “Positive Attractors” and “Inherently Interpretable Architectures” | https://www.lesswrong.com/s/z7JTHHdapYdvgfPhM |
AISC8 | Team Cyborg- ... | Kanad Chakrabarti, Roman Leventov, Nicholas Kees Dupuis | Blog post | Philosophical Cyborg (Part 1) | https://www.lesswrong.com/posts/k93NEoXZq6CdXegdx/philosophical-cyborg-part-1 |
AISC8 | Team Cyborg- ... | Kanad Chakrabarti | Blog post | Philosophical Cyborg (Part 2)...or, The Good Successor | https://www.lesswrong.com/posts/ZZ57cBkpQ5hpAux9T/philosophical-cyborg-part-2-or-the-good-successor |
AISC8 | Behavioural Annotation | Nell Watson... | Paper | Draft towards paper | https://docs.google.com/document/d/186iPTOUtofEsL1qgXsH5qX1IBlq7fn2m/edit |
AISC8 | Soft Optimization | | Blog post | | https://www.lesswrong.com/posts/XXrGhqSNZjcG2nNiy/aisc-team-report-soft-optimization-bayes-and-goodhart |
AISC8 | Machine Learning For Scientific Discovery | | Blog post | [Sequence:] Machine Learning For Scientific Discovery | https://www.lesswrong.com/s/xoXeJZRCBEBnBoGbC |
AISC8 | Literature Review of the Neurological Basis of Human Values and Preferences | Mateusz Bagiński | Blog post | "Wanting" and "liking" | https://www.lesswrong.com/posts/opJxxfrN33xQx3eXu/wanting-and-liking |
AISC8 | Interdisciplinary Investigation of DebateGPT | Paul Bricman, Elfia Bezou-Vrakatseli, Thomas Feeney, and Yimeng Xie | Blog post | Truth | https://compphil.github.io/truth/ |
AISC8 | Understanding Search in Transformers | Michael I. Ivanitskiy, Alex F. Spies, Tilman Räuker, Guillaume Corlouer, Chris Mathwin, Lucia Quirke, Can Rager, Rusheb Shah, Dan Valentine, Cecilia Diniz Behn, Katsumi Inoue, Samy Wu Fung | Paper | Structured World Representations in Maze-Solving Transformers | https://arxiv.org/pdf/2312.02566.pdf |
AISC8 | Inducing Human-Like Biases in Moral Reasoning LLMs | Artyom Karpov, Austin Meek, Bogdan Ionut Cirstea, SCho | Blog post | Inducing Human-Like Biases in Moral Reasoning LLMs | https://www.lesswrong.com/posts/TDSTmePg9jfL6nfJH/a-taxonomy-of-ai-system-evaluations-wip |
AISC9 | | Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz | Paper | Immunization against harmful fine-tuning attacks | https://arxiv.org/abs/2402.16382 |
AISC9 | Congressional Messaging Campaigns | Tristan Williams, davekasten, jacob.turn, Felix De Simone, gergo | Blog post | Talking to Congress: Can constituents contacting their legislator influence policy? | https://forum.effectivealtruism.org/posts/5oStggnYLGzomhvvn/talking-to-congress-can-constituents-contacting-their |
AISC9 | SatisfIA | Vitalii Chyhirov, Simon Fischer, Benjamin Kolb, Martin Kunev, Ariel Kwiatkowski, Jeremy Rich. Lead: Jobst Heitzig (we were also joined by several interns at his lab and members of SPAR) | Blog post | Aspiration-based, non-maximizing AI agent designs | https://www.lesswrong.com/s/4TT69Yt5FDWijAWab |
AISC10 | SatisfIA | Vassil Tashev, sevdeawesome | Blog post | A Review of Weak to Strong Generalization | https://www.lesswrong.com/posts/ELbGqXiLbRe6zSkTu/a-review-of-weak-to-strong-generalization-ai-safety-camp |
AISC9 | SatisfIA | Simon Dima, Simon Fischer, Jobst Heitzig, Joss Oliver | Paper | Non-maximizing policies that fulfill multi-criterion aspirations in expectation | https://arxiv.org/abs/2408.04385 |
AISC9 | Out-of-context learning interpretability | Victor Levoso Fernandez (lead), Luan Fletcher, Leo Mckee-Reid, Andrei Cristea, Florian van der Steen, Nikita Menon, Kunvar Thaman | Software | aisc_oocl_experiments | https://github.com/fletchel/aisc_oocl_experiments |
AISC9 | High-Level Mechanistic Interpretability Activation Engineering Library 🔥 | Jamie Coombes, Ardy Haroen, Fergus Fettes, Lukas Linauer, Shaheen Ahmed-Chowdhury, Vy Hong | Software | obvslib | https://github.com/obvslib/obvs |
AISC9 | Ambitious Mechanistic Interpretability | Alice Rigg, Jacob Goldman-Wetzler, Karthik Murugadoss, Leonard Bereska, Lucas Hayne, Wolodymyr Krywonos, Michael Pearce, Kola Ayonrinde, Gonçalo Paulo | Blog post | [Various outputs by individual team members] |
ghost gradients implementation, by Jacob
Various Mamba interp things, by Gonçalo & others
AtP* implementation, by Kola
Reverse engineering MNIST, by Michael
Hierarchical feature clustering, by Alice
Clustering features by their topology, by Karthik
Mech interp survey paper, by Leonard
Computation in superposition extensions, by Lucas
AISC9 | Modelling Trajectories of Language Models | Nicky Pochinkov, Tetra Jones, Rashidur Rahman | Paper | Modularity In Transformers: Investigating Separability & Neuron Task Specialization | https://cloud.nicky.pro/s/A2srG3f8W9TLwrG | Under review as a conference paper at ICLR 2024
AISC9 | Modelling Trajectories of Language Models | Nicky Pochinkov, Ben Pasero, Skylar Shibayama | Paper | Investigating Neuron Ablation In Attention Heads: The Case For Peak Activation Centering | https://cloud.nicky.pro/s/cM7sFPQfBSsaikx | Under review as a conference paper at the SeT LLM workshop at ICLR 2024
AISC9 | MILD | Marcel Mir, Alex Champandard, Remmelt Ellen | Paper | MILD: Minimal Item-Level Documentation of Training Data | https://docs.google.com/document/d/1tP5j1sUf5JI6E700JpU8j_ZKP_zvAlsEFzrMv1PUJQI/edit | Draft doc; will be published later
AISC9 | Asymmetric control in LLMs | Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Simon Lerman | Paper | Immunization against harmful fine-tuning attacks | https://arxiv.org/abs/2402.16382 | Published in EMNLP 2024
AISC9 | Asymmetric control in LLMs | Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, David Atanasov, Robie Gonzales, Subhabrata Majumdar, Carsten Maple, Hassan Sajjad, Frank Rudzicz | Paper | Representation noising effectively prevents harmful fine-tuning on LLMs | https://arxiv.org/abs/2405.14577 | Published in NeurIPS 2024
AISC9 | Asymmetric control in LLMs | Domenic Rosati, Jan Wehner, David Atanasov | Blog post | Training-time domain authorization could be helpful for safety | https://www.lesswrong.com/posts/38avQYy782zXgNo9u/training-time-domain-authorization-could-be-helpful-for |
AISC9 | The promisingness of automated alignment | Bogdan Ionut Cirstea, AISC: Jaeson Booker, Leo Mckee-Reid, Marcel Mir, Severin Field, Milton Lin, Sai Joseph, Vassil Tashev, Yuan Yuan Sun; MARS: Alfie Lamerton, Tim Chan, Robayet Hossain; SPAR: Joyee Chen, Joe Emerson, Minh Nguyen, Yixiong Hao. | Blog post | A Review of Weak to Strong Generalization | https://www.lesswrong.com/posts/ELbGqXiLbRe6zSkTu/a-review-of-weak-to-strong-generalization-ai-safety-camp |
AISC9 | The promisingness of automated alignment | Bogdan Ionut Cirstea, AISC: Jaeson Booker, Leo Mckee-Reid, Marcel Mir, Severin Field, Milton Lin, Sai Joseph, Vassil Tashev, Yuan Yuan Sun; MARS: Alfie Lamerton, Tim Chan, Robayet Hossain; SPAR: Joyee Chen, Joe Emerson, Minh Nguyen, Yixiong Hao. | Blog post | Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks” | https://www.lesswrong.com/posts/Wd9vzwqcYuEokJYCH/paper-review-the-unreasonable-effectiveness-of-easy-training |
AISC9 | The promisingness of automated alignment | Bogdan Ionut Cirstea, AISC: Jaeson Booker, Leo Mckee-Reid, Marcel Mir, Severin Field, Milton Lin, Sai Joseph, Vassil Tashev, Yuan Yuan Sun; MARS: Alfie Lamerton, Tim Chan, Robayet Hossain; SPAR: Joyee Chen, Joe Emerson, Minh Nguyen, Yixiong Hao. | Blog post | A Review of In-Context Learning Hypotheses for Automated AI Alignment Research | https://www.lesswrong.com/posts/GPcwP8pgyPFPwvi2h/a-review-of-in-context-learning-hypotheses-for-automated-ai |
AISC9 | Towards realistic ODDs for foundation model based AI offerings | Igor Krawczuk, Paulius Skaisgiris, Scott Bursese, Arghya Sarkar and Tanvir Iqbal | Software | | https://github.com/genalgodds |
AISC9 | High-Level Mechanistic Interpretability Activation Engineering Library | Jamie Coombes, Ardy Haroen, Fergus Fettes, Lukas Linauer, Shaheen Ahmed-Chowdhury, Vy Hong | Software | | https://github.com/obvslib/obvs |
AISC9 | Out-of-context learning interpretability | Victor Levoso Fernandez, Luan Fletcher, Leo Mckee-Reid, Andrei Cristea, Florian van der Steen, Nikita Menon, Kunvar Thaman | Software | | |
AISC9 | Does sufficient optimization imply agent structure? | Tyler Tracy, Mateusz Bagiński, Einar Urdshals, Amaury Lorin, Jasmina Nasufi, Alfred Harwood, Alex Altair (RL) | Blog post | Towards a formalization of the agent structure problem | https://www.lesswrong.com/posts/oxsBpx9v3bgxraiPj/towards-a-formalization-of-the-agent-structure-problem |
AISC9 | Personal Fine-Tuning Implementations for AI Value Alignment | Minh Nguyen, Sarah Pan, Nell Watson | Blog post | [We intend to publish a paper on our experiments and observations.] | |
AISC9 | Self-Other Overlap | Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena, Cameron Berg | Blog post | Self-Other Overlap: A Neglected Approach to AI Alignment | https://www.lesswrong.com/posts/hzt9gHpNwA2oHtwKX/self-other-overlap-a-neglected-approach-to-ai-alignment | Team was run with support of multiple organisations.
AISC9 | Evaluating alignment evaluations | Maxime Riché, Harrison Gietz, Jaime Raldua Veuthey, Edoardo Pona | Blog post | A Taxonomy Of AI System Evaluations | https://www.lesswrong.com/posts/TDSTmePg9jfL6nfJH/a-taxonomy-of-ai-system-evaluations |
AISC9 | Evaluating alignment evaluations | Maxime Riché, Harrison Gietz, Jaime Raldua Veuthey, Edoardo Pona | Blog post | Thinking About Propensity Evaluations | https://www.lesswrong.com/posts/sWf8wj64AdDfMeTvf/thinking-about-what-are-propensity-evaluations-wip |
AISC9 | AI-Driven Economic Safety Nets | Tillman Schenk, Arturs Kanepajs, David Conrad | Paper | Navigating AI's Impact on Labor: Challenges, Scenarios, and Policy Pathways | https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5032882 |
AISC9 | Personal Fine-Tuning Implementations for AI Value Alignment | Eleanor (Nell) Watson, Minh Nguyen, Sarah Pan, Shujun Zhang | Paper | Choice Vectors: Streamlining Personal AI Alignment Through Binary Selection | https://www.mdpi.com/2414-4088/9/3/22 |
AISC10 | Evaluating LLM Safety in a Multilingual World | Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, Carsten Maple | Paper | Representation Engineering for Large-Language Models: Survey and Research Challenges | https://arxiv.org/abs/2502.17601 |
AISC10 | Formalize the Hashiness Model | Anders Sandberg, Thibaud Veron, Aybars Kocoglu | Other | Poster: Complexity, Control Theory, and AI Alignment | https://drive.google.com/file/d/18mY5uAWO79c_yvlad-UNl_TqEompXwhm/view | Poster presented at Control Conference 2025
AISC10 | Simulator Theory | | Blog post | LessWrong Sequence: Simulators vs Agents: Updating Risk Models | https://www.lesswrong.com/s/pwKrMXjYNK5LNeKCu |
AISC10 | Building the Pause Button | Joep Meindertsma, Farhan Shafiq, Raymond Koopmanschap, Ananthi Al Ramiah, Dominika Kunertova, Mitali Mittal, Ricardo Manhães Savii | Blog post | Building the Pause Button webpage | https://pauseai.info/building-the-pause-button |
AISC10 | Building the Pause Button | Ananthi Al Ramiah, Raymond Koopmanschap, Josh Thorsteinson, Sadruddin Khan, Jim Zhou, Shafira Noh, Joep Meindertsma, Farhan Shafiq | Paper | Toward a Global Regime for Compute Governance: Building the Pause Button | https://arxiv.org/abs/2506.20530 |
AISC10 | Growing PauseAI | Chris Gerrby, Sharon Mwaniki, Alyssa Chase-Vilchez, Manuela García Toro, Andrei-Octavian Dirla | Blog post | | See write-ups here. |