Please add/comment with papers/posts:
Edition | Team | Participants | Format | Title | Link | Comments
AISC1 | (personal post) | Justin Shovelain, Michael Aird, others? | Blog post | Improving the future by influencing actors' benevolence, intelligence, and power | https://forum.effectivealtruism.org/posts/4oGYbvcy2SRHTWgWk/improving-the-future-by-influencing-actors-benevolence#fn-dCBkq5f8sD4CzHwn8-1 | Approximately 1/4 of the work on this post was done as part of AISC, according to Justin Shovelain.
AISC1 | Side effects in Grid World | Jessica Cooper, Karol Kubicki, Gavin Leech, Tom McGrath | Software | Preventing Side-effects in Gridworlds | https://www.gleech.org/grids/ | Noted in Krakovna's AIS resources. Cited in an AIES paper.
AISC1 | Safe AF | James Bell, Linda Linsefors, Caspar Oesterheld, Joar Skalse | Paper | Reinforcement Learning in Newcomblike Environments | https://proceedings.neurips.cc/paper/2021/hash/b9ed18a301c9f3d183938c451fa183df-Abstract.html | Started at the first AISC in early 2018 but published in December 2020; since accepted for a spotlight presentation at NeurIPS 2021.
AISC2 | Human Preference Types | Nandi, Sabrina, Erin | Blog post | Acknowledging Human Preference Types to Support Value Learning | https://www.alignmentforum.org/posts/mSPsyEwaymS74unND/acknowledging-human-preference-types-to-support-value
AISC2 | Policymaking for AI Strategy | Brandon Perry, Risto Uuk | Paper | AI Governance and the Policymaking Process: Key Considerations for Reducing AI Risk | https://www.mdpi.com/2504-2289/3/2/26 | Cited by an FHI research associate.
AISC2 | Corrupt Reward MDPs | Tomasz Kisielewski, David Lindner, Jason Mancuso, Alok Singh | Software | Corrupt Reward MDPs | https://github.com/jvmancuso/safe-grid-agents
AISC2 | Corrigibility | Vegard Blindheim, Anton Osika, Roland Pihlakas | Blog post | Exponentially diminishing returns and conjunctive goals: Mitigating Goodhart's law with common sense. Towards corrigibility and interruptibility via the golden middle way. | https://medium.com/threelaws/diminishing-returns-and-conjunctive-goals-towards-corrigibility-and-interruptibility-2ec594fed75c
AISC2 | Feature Visualization for Deep Reinforcement Learning | Zera Alexander, Andrew Schreiber, Fabian Steuer | Software | Feature Visualization for Deep Reinforcement Learning | https://github.com/andrewschreiber/agent | Reached the third and final round of an EA Grant (not accepted).
AISC2 | IRL Benchmark | Adria Garriga-Alonso, Anton Osika, Johannes Heidecke, Max Daniel, Sayan Sarkar | Software | IRL Benchmark | https://github.com/JohannesHeidecke/irl-benchmark
AISC2 | Assumptions of Human Values | Jan Kulveit, Linda Linsefors, Alexey Turchin | Blog post | Multi-agent predictive minds and AI alignment | https://www.lesswrong.com/posts/3fkBWpE4f9nYbdf7E/multi-agent-minds-and-ai-alignment | Jan wrote a blog post about his best-guess model of how human values and motivations work. Probably not directly stemming much from work at AISC2, since he was busy with other things during the physical retreat. Mentioned in the post: "Part of this originated in the efforts of the 'Hidden Assumptions' team on the 2nd AI safety camp, and my thoughts about how minds work are inspired by CFAR."
AISC2 | Value Learning in Games | Stanislav Böhm, Tomáš Gavenčiak, Torben Swoboda, Mikhail Yagudin | Blog post | Value learning in games | https://docs.google.com/document/d/1kxXk7KkFfJAqrk0kDjDJ6Tvz_FL04p34twLj19Tv_IQ/edit#heading=h.cy7im45es3q0
AISC2 | Corrupt Reward MDPs | Jason Mancuso, Tomasz Kisielewski, David Lindner, Alok Singh | Paper | Detecting Spiky Corruption in Markov Decision Processes | https://ceur-ws.org/Vol-2419/paper_28.pdf | Presented in a session at the AI Safety Workshop at IJCAI 2019.
AISC3 | Modeling Cooperation | Jonas Müller, Miles Tidmarsh, Vasily Kuznetsov | Software | (implementation of their formal mathematical model) | www.modelingcooperation.com/model
AISC3 | Debate | Vojta Kovarik, Anna Gajdova, David Lindner, Lukas Finnveden, Rajashree Agrawal | Blog post | AI Safety Debate and Its Applications | https://www.lesswrong.com/posts/5Kv2qNfRyXXihNrx2/ai-safety-debate-and-its-applications
AISC3 | Embedded agents | Arushi, Davide, Sayan | Paper | Categorizing Wireheading in Partially Embedded Agents | https://arxiv.org/abs/1906.09136 | Presented a poster at the AI Safety Workshop at IJCAI 2019.
AISC3 | RL Attention | Dmitry Nikulin, Sebastian Kosch, Fabian Steuer, Hoagy Cunningham | Blog post | Regularization and visualization of attention in reinforcement learning agents | https://attentionentropy.github.io/
AISC4 | Generalization in Reward Learning | Anton Makiievskyi, Liang Zhou, Max Chiswick | Blog post | Assessing Generalization in Reward Learning with Procedurally Generated Games | https://chisness.medium.com/assessing-generalization-in-reward-learning-intro-and-background-da6c99d9e48
AISC4 | Goal Directedness | Adam Shimi, Joe Collman, Michele Campolo, Sabrina Tang | Blog post | Focus: you are allowed to be bad at accomplishing your goals | https://www.lesswrong.com/s/DTnoFhDm7ZT2ecJMw/ | Note that Adam Shimi was already focused on goal-directedness before applying to AISC4, and would probably have written a similar volume of posts in either case (in Remmelt's opinion). The last 5 posts were published (not necessarily fully finished) after the camp.
AISC4 | Goal Directedness | Adam Shimi, Joe Collman, Michele Campolo | Blog post | Understanding Goal Directedness (sequence) | https://www.alignmentforum.org/s/o58ZMNaovdztbLfvN | 3/4 of the goal-directedness team, who continued researching together after the camp officially ended.
AISC4 | Survey on AI X-Risk Scenarios | Sam Clarke, Alexis Carlier, Jonas Schuett | Blog post | Survey on AI existential risk scenarios | https://www.lesswrong.com/posts/WiXePTj7KeEycbiwK/survey-on-ai-existential-risk-scenarios | Also shared the full results internally with researchers at FHI and elsewhere; they said they didn't publish widely because of PR risks.
AISC4 | Human extracted preferences | Mislav Juric, Taylor Kulp-McDowall, Arun Raja, Riccardo Volpato, Nevan Wichers | Blog post | Extraction of human preferences 👨→🤖 | https://www.lesswrong.com/posts/PZYD5kBpeHWgE5jX4/extraction-of-human-preferences
AISC5 | Objective Robustness Failures | Jack Koch, Lauro Langosco, Jacob Pfau, James Le (and Lee Sharkey) | Paper | Goal Misgeneralization in Deep Reinforcement Learning | https://proceedings.mlr.press/v162/langosco22a.html | Accepted for a short presentation at ICML and for a poster at the ICML UDL workshop. Cited by Dan Hendrycks et al. TODO: Add NeurIPS link if published. https://theturingprize.com/ wants to retrospectively award them a $10k prize, to be given as a donation to a charity or fund of their choice (or a mix of charities/funds).
AISC5 | Objective Robustness Failures | Jack Koch, Lauro Langosco, Jacob Pfau, James Le (and Lee Sharkey) | Blog post | Empirical Observations of Objective Robustness Failures; Discussion: Objective Robustness and Inner Alignment Terminology | Post 1 (empirical failures); Post 2 (Terminology) | Two simultaneous posts. Summarised in the AI Alignment Newsletter by Rohin Shah.
AISC5 | Cooperativity & Common Pool Resources | Quinn Dougherty, Ben Greenberg, Ariel Kwiatkowski | Software |  | https://github.com/RedTachyon/cpr_reputation/
AISC5 | Pessimistic Ask-For-Help Agents for Safe Exploration | Jamie Bernardi, David Reber, Magdalena Wache, Peter Barnett, Max Clarke | Software |  | https://github.com/j-bernardi/pessimistic-agents
AISC5 | Human extracted preferences | Mislav Juric, Taylor Kulp-McDowall, Arun Raja, Riccardo Volpato, Nevan Wichers | Software |  | https://github.com/arunraja-hub/Preference_Extraction
AISC5 | Objective Robustness Failures | Jack Koch, Lauro Langosco, Jacob Pfau, James Le (and Lee Sharkey) | Blog post | [Video:] 'We Were Right! Real Inner Misalignment' | https://www.youtube.com/watch?v=zkbPdEHEyEI&ab_channel=RobertMiles | Rob Miles emailed them about possibly putting together a YouTube explanation of the work.
AISC5 | Cooperativity & Common Pool Resources | Quinn Dougherty, Ben Greenberg, Ariel Kwiatkowski | Blog post | AISC5 Retrospective: Mechanisms for Avoiding Tragedy of the Commons in Common Pool Resource Problems | https://www.lesswrong.com/posts/LBwpubeZSi3ottfjs/aisc5-retrospective-mechanisms-for-avoiding-tragedy-of-the
AISC5 | Multi-Objective Decision-Making | Robert Klassert, Roland Pihlakas, Ben Smith | Blog post | A brief review of the reasons multi-objective RL could be important in AI Safety Research | https://www.lesswrong.com/posts/i5dLfi6m6FCexReK9/a-brief-review-of-the-reasons-multi-objective-rl-could-be | Briefly mentioned in a later post by Peter Vamplew (https://www.lesswrong.com/posts/eeEEgNeTepZb6F6NF): "We have provided a short list of recommended reading at the end of this post, and we refer the reader again to the post of Smith, Pihlakas and Klassert for an overview of work in this area."
AISC6 | Multi-Objective Decision-Making |  | Blog post | Sets of objectives for a multi-objective RL agent to optimize | https://www.lesswrong.com/posts/4mvdZXjwJHv9tSAWB/sets-of-objectives-for-a-multi-objective-rl-agent-to-1
AISC7 | Multi-Objective Decision-Making |  | Paper | Using soft maximin for risk averse multi-objective decision-making | https://link.springer.com/article/10.1007/s10458-022-09586-2 | Published in the journal Autonomous Agents and Multi-Agent Systems.
AISC6 | (personal post) | Jan Kirchner | Blog post | Inferring utility functions from locally non-transitive preferences | https://www.lesswrong.com/posts/QZiGEDiobFz8ropA5/inferring-utility-functions-from-locally-non-transitive | "As part of the AI Safety Camp, I've been diving a bit deeper into the foundations of expected utility theory and preference learning. In this post, I am making explicit a connection between those two things that (I assume) many people already made implicitly. But I couldn't find a nice exposition of this argument so I wrote it up. Any feedback is of course highly welcome!"
AISC6 | Language Models as Tools for Alignment Research | Jan Kirchner, Jacques Thibodeau, Logan Smith (as external collaborator?), Kyle and Laria (mentors), Arush? | Blog post | A survey of tool use and workflows in alignment research | https://www.alignmentforum.org/posts/ebYiodG3MAEqskCDG/a-survey-of-tool-use-and-workflows-in-alignment-research-1 | Got a shout-out from Jan Leike ("Other researchers have started working on this approach too.") in the post "A minimal viable product for alignment" (https://aligned.substack.com/p/alignment-mvp?s=r).
AISC6 | Constraints from Selection - Modularity subteam | Lucius Bushnaq, Avery Griffin, Callum McDougall | Blog post | Project Intro: Selection Theorems for Modularity | https://www.alignmentforum.org/posts/XKwKJCXgSKhSr9bZY/project-intro-selection-theorems-for-modularity
AISC6 | Constraints from Selection - Modularity subteam | Lucius Bushnaq, Avery Griffin, Callum McDougall | Blog post | Theories of Modularity in the Biological Literature | https://www.alignmentforum.org/posts/JzTfKrgC7Lfz3zcwM/theories-of-modularity-in-the-biological-literature
AISC6 | Semantic Side-Effect Minimisation | Fabian Schimpf, Lukas Fluri | Blog post | Open Problems in Negative Side Effect Minimization | https://www.alignmentforum.org/posts/pnAxcABq9GBDG5BNW/open-problems-in-negative-side-effect-minimization
AISC6 | Impact of Memetics on Alignment | Harriet Farlow | Blog post | Machines vs Memes Part 1: AI Alignment and Memetics | https://www.lesswrong.com/posts/JLH6ido4qoBtYmnNR/machines-vs-memes-part-1
AISC6 | Impact of Memetics on Alignment | Nate Rush | Blog post | Machines vs. Memes 2: Memetically-Motivated Model Extensions | https://www.lesswrong.com/posts/gumkW3vy9mhjZriuc/machines-vs-memes-2-memetically-motivated-model-extensions
AISC6 | Impact of Memetics on Alignment | Claudio Ceruti | Blog post | Machines vs Memes Part 3: Imitation and Memes | https://www.lesswrong.com/posts/nbDFj4ZS6WSDKtSk4/machines-vs-memes-part-3-imitation-and-memes
AISC6 | (personal post) | Jan Czechowski | Blog post | Steganography and the CycleGAN - alignment failure case study | https://www.lesswrong.com/posts/uutXLm2DRcCtFBZ2D/steganography-and-the-cyclegan-alignment-failure-case-study
AISC6 | Constraints from Selection - Modularity subteam | Lucius Bushnaq, Avery Griffin, Callum McDougall | Blog post | Ten experiments in modularity, which we'd like you to run! | https://www.lesswrong.com/posts/99WtcMpsRqZcrocCd/ten-experiments-in-modularity-which-we-d-like-you-to-run
AISC6 | Pipeline for Measuring Misalignment | Marius Hobbhahn, Eric Landgrebe, Beth Barnes (mentor) | Blog post | Reflection Mechanisms as an Alignment target: A survey | https://www.lesswrong.com/posts/XyBWkoaqfnuEyNWXi/reflection-mechanisms-as-an-alignment-target-a-survey-1
AISC6 | Pipeline for Measuring Misalignment | Marius Hobbhahn, Eric Landgrebe, Beth Barnes (mentor) | Paper | Reflection Mechanisms as an Alignment Target: A Survey | https://openreview.net/forum?id=4eMzKmZ6xW | Paper version that was accepted to the NeurIPS ML Safety workshop.
AISC6 | Constraints from Selection - Modularity subteam | Lucius Bushnaq, Avery Griffin, Callum McDougall | Blog post | What Is The True Name of Modularity? | https://www.lesswrong.com/posts/TTTHwLpcewGjQHWzh/what-is-the-true-name-of-modularity
AISC6 | Table-Top Role-Playing Game |  | Blog post | [Announcement:] AI takeover tabletop RPG: "The Treacherous Turn" | https://www.lesswrong.com/posts/b5EqwQZw7ww2K28Ki/ai-takeover-tabletop-rpg-the-treacherous-turn
AISC6 | Language Models as Tools for Alignment Research | Jan Kirchner, Jacques Thibodeau, Logan Smith, "janus" | Blog post | Results from a survey on tool use and workflows in alignment research | https://www.lesswrong.com/posts/a2io2mcxTWS4mxodF/results-for-a-survey-of-tool-use-and-workflows-in-alignment
AISC6 | Language Models as Tools for Alignment Research | Jan Kirchner, Jacques Thibodeau, Logan Smith, "janus" | Blog post | A descriptive, not prescriptive, overview of current AI Alignment Research | https://www.lesswrong.com/posts/FgjcHiWvADgsocE34/a-descriptive-not-prescriptive-overview-of-current-ai
AISC7 | AGI Safety Impossibility Theorem | Forrest Landry | Blog post | [List of posts: see Comments for posts published around the October 2022 retreat; see Link for posts published afterwards.] | https://mflb.com/ai_alignment_1/title_reorg_psr.html | Published around the AISC8 retreat. Narrative structure: AI Scope of Work (written the day after camp); Meta-Narrative Sequence of AI Substrate Takeover. Explanation snippets: XKCD-style Comic Overview (just minor edits); Superintelligence Safety Q&A (just minor edits); Negative Arguments; Substrate-Dependent Needs; APS review. Responses to anonymised questions/skeptical counterarguments: Super-ordinate Claims; SGD Selection; Optimisation Cycles; Alignment Drift; Math Expectations; Right Skepticism.
AISC8 | Uncontrollable Dynamics | Remmelt Ellen | Blog post | The Control Problem: Unsolved or Unsolvable? | https://www.lesswrong.com/posts/xp6n2MG5vQkPpFEBH/the-control-problem-unsolved-or-unsolvable
AISC8 | Uncontrollable Dynamics | Roman Yen | Blog post | On the possibility of impossibility of AGI Long-Term Safety | https://www.lesswrong.com/posts/zuXtMKuQRGAhZMoKk/on-the-possibility-of-impossibility-of-agi-long-term-safety#fnmso3ekucj2b
AISC8 | Failure Stories | Karl von Wendt, Sofia Bharadia, Peter Drotos, Artem Korotkov... mespa, mruwnik | Blog post | Agentic Mess | https://www.lesswrong.com/posts/LyJAFBuuEfd4kxgsw/agentic-mess-a-failure-story | Video version here.
AISC8 | Failure Stories | Karl von Wendt, Sofia Bharadia, Peter Drotos, Artem Korotkov... mespa, mruwnik | Blog post | Paths to Failure | https://www.lesswrong.com/posts/yv4xAnkEyWvpXNBte/paths-to-failure
AISC8 | Failure Stories | Karl von Wendt, Sofia Bharadia, Peter Drotos, Artem Korotkov... mespa, mruwnik | Blog post | A Friendly Face | https://www.lesswrong.com/posts/iRFxvNeLbHNRCzA2S/a-friendly-face-another-failure-story
AISC8 | Interpretable Architectures | Robert Kralisch, Anton Zheltoukhov, David Liu, Sohaib Imran | Blog post | An Investigation of the Frameworks of "Positive Attractors" and "Inherently Interpretable Architectures" | https://www.lesswrong.com/s/z7JTHHdapYdvgfPhM
AISC8 | Team Cyborg - ... | Kanad Chakrabarti, Roman Leventov, Nicholas Kees Dupuis | Blog post | Philosophical Cyborg (Part 1) | https://www.lesswrong.com/posts/k93NEoXZq6CdXegdx/philosophical-cyborg-part-1
AISC8 | Team Cyborg - ... | Kanad Chakrabarti | Blog post | Philosophical Cyborg (Part 2)...or, The Good Successor | https://www.lesswrong.com/posts/ZZ57cBkpQ5hpAux9T/philosophical-cyborg-part-2-or-the-good-successor
AISC8 | Behavioural Annotation | Nell Watson... | Paper | Draft towards paper | https://docs.google.com/document/d/186iPTOUtofEsL1qgXsH5qX1IBlq7fn2m/edit
AISC8 | Soft Optimization |  | Blog post | AISC team report: Soft-optimization, Bayes and Goodhart | https://www.lesswrong.com/posts/XXrGhqSNZjcG2nNiy/aisc-team-report-soft-optimization-bayes-and-goodhart
AISC8 | Machine Learning For Scientific Discovery |  | Blog post | [Sequence:] Machine Learning For Scientific Discovery | https://www.lesswrong.com/s/xoXeJZRCBEBnBoGbC
AISC8 | Literature Review of the Neurological Basis of Human Values and Preferences | Mateusz Bagiński | Blog post | "Wanting" and "liking" | https://www.lesswrong.com/posts/opJxxfrN33xQx3eXu/wanting-and-liking
AISC8 | Interdisciplinary Investigation of DebateGPT | Paul Bricman, Elfia Bezou-Vrakatseli, Thomas Feeney, Yimeng Xie | Blog post | Truth | https://compphil.github.io/truth/
AISC8 | Understanding Search in Transformers | Michael I. Ivanitskiy, Alex F. Spies, Tilman Räuker, Guillaume Corlouer, Chris Mathwin, Lucia Quirke, Can Rager, Rusheb Shah, Dan Valentine, Cecilia Diniz Behn, Katsumi Inoue, Samy Wu Fung | Paper | Structured World Representations in Maze-Solving Transformers | https://arxiv.org/pdf/2312.02566.pdf
AISC9 |  | Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz | Paper | Immunization against harmful fine-tuning attacks | https://arxiv.org/abs/2402.16382
AISC9 | Congressional Messaging Campaigns | Tristan Williams, davekasten, jacob.turn, Felix De Simone, gergo | Blog post | Talking to Congress: Can constituents contacting their legislator influence policy? | https://forum.effectivealtruism.org/posts/5oStggnYLGzomhvvn/talking-to-congress-can-constituents-contacting-their
AISC9 | SatisfIA | Vitalii Chyhirov, Simon Fischer, Benjamin Kolb, Martin Kunev, Ariel Kwiatkowski, Jeremy Rich; lead: Jobst Heitzig (also joined by several interns at his lab and members of SPAR) | Blog post | Aspiration-based, non-maximizing AI agent designs | https://www.lesswrong.com/s/4TT69Yt5FDWijAWab
AISC9 | Out-of-context learning interpretability | Victor Levoso Fernandez (lead), Luan Fletcher, Leo Mckee-Reid, Andrei Cristea, Florian van der Steen, Nikita Menon, Kunvar Thaman | Software | aisc_oocl_experiments | https://github.com/fletchel/aisc_oocl_experiments
AISC9 | High-Level Mechanistic Interpretability Activation Engineering Library 🔥 | Jamie Coombes, Ardy Haroen, Fergus Fettes, Lukas Linauer, Shaheen Ahmed-Chowdhury, Vy Hong | Software | obvslib | https://github.com/obvslib/obvs
AISC9 | Ambitious Mechanistic Interpretability | Alice Rigg, Jacob Goldman-Wetzler, Karthik Murugadoss, Leonard Bereska, Lucas Hayne, Wolodymyr Krywonos, Michael Pearce, Kola Ayonrinde, Gonçalo Paulo | Blog post | [Various outputs by individual team members] |  | Ghost gradients implementation, by Jacob; various Mamba interp things, by Gonçalo & others; AtP* implementation, by Kola; reverse engineering MNIST, by Michael; hierarchical feature clustering, by Alice; clustering features by their topology, by Karthik; mech interp survey paper, by Leonard; computation-in-superposition extensions, by Lucas.
AISC9 | Modelling Trajectories of Language Models | Nicky Pochinkov, Tetra Jones, Rashidur Rahman | Paper | Modularity In Transformers: Investigating Separability & Neuron Task Specialization | https://cloud.nicky.pro/s/A2srG3f8W9TLwrG | Under review as a conference paper at ICLR 2024.
AISC9 | Modelling Trajectories of Language Models | Nicky Pochinkov, Ben Pasero, Skylar Shibayama | Paper | Investigating Neuron Ablation In Attention Heads: The Case For Peak Activation Centering | https://cloud.nicky.pro/s/cM7sFPQfBSsaikx | Under review as a conference paper at the SeT LLM workshop at ICLR 2024.
AISC9 | MILD | Marcel Mir, Alex Champandard, Remmelt Ellen | Paper | MILD: Minimal Item-Level Documentation of Training Data | https://docs.google.com/document/d/1tP5j1sUf5JI6E700JpU8j_ZKP_zvAlsEFzrMv1PUJQI/edit | Draft doc; will be published later.
AISC9 | Asymmetric control in LLMs | Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Simon Lerman | Paper | Immunization against harmful fine-tuning attacks | https://arxiv.org/abs/2402.16382
AISC9 | Asymmetric control in LLMs | Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, David Atanasov, Robie Gonzales, Subhabrata Majumdar, Carsten Maple, Hassan Sajjad, Frank Rudzicz | Paper | Representation noising effectively prevents harmful fine-tuning on LLMs | https://arxiv.org/abs/2405.14577
AISC9 | Asymmetric control in LLMs | Domenic Rosati, Jan Wehner, David Atanasov | Blog post | Training-time domain authorization could be helpful for safety | https://www.lesswrong.com/posts/38avQYy782zXgNo9u/training-time-domain-authorization-could-be-helpful-for
AISC9 | The promisingness of automated alignment | Bogdan Ionut Cirstea; AISC: Jaeson Booker, Leo Mckee-Reid, Marcel Mir, Severin Field, Milton Lin, Sai Joseph, Vassil Tashev, Yuan Yuan Sun; MARS: Alfie Lamerton, Tim Chan, Robayet Hossain; SPAR: Joyee Chen, Joe Emerson, Minh Nguyen, Yixiong Hao | Blog post | A Review of Weak to Strong Generalization | https://www.lesswrong.com/posts/ELbGqXiLbRe6zSkTu/a-review-of-weak-to-strong-generalization-ai-safety-camp
AISC9 | The promisingness of automated alignment | Bogdan Ionut Cirstea; AISC: Jaeson Booker, Leo Mckee-Reid, Marcel Mir, Severin Field, Milton Lin, Sai Joseph, Vassil Tashev, Yuan Yuan Sun; MARS: Alfie Lamerton, Tim Chan, Robayet Hossain; SPAR: Joyee Chen, Joe Emerson, Minh Nguyen, Yixiong Hao | Blog post | Paper review: "The Unreasonable Effectiveness of Easy Training Data for Hard Tasks" | https://www.lesswrong.com/posts/Wd9vzwqcYuEokJYCH/paper-review-the-unreasonable-effectiveness-of-easy-training
AISC9 | The promisingness of automated alignment | Bogdan Ionut Cirstea; AISC: Jaeson Booker, Leo Mckee-Reid, Marcel Mir, Severin Field, Milton Lin, Sai Joseph, Vassil Tashev, Yuan Yuan Sun; MARS: Alfie Lamerton, Tim Chan, Robayet Hossain; SPAR: Joyee Chen, Joe Emerson, Minh Nguyen, Yixiong Hao | Blog post | A Review of In-Context Learning Hypotheses for Automated AI Alignment Research | https://www.lesswrong.com/posts/GPcwP8pgyPFPwvi2h/a-review-of-in-context-learning-hypotheses-for-automated-ai
AISC9 | Towards realistic ODDs for foundation model based AI offerings | Igor Krawczuk, Paulius Skaisgiris, Scott Bursese, Arghya Sarkar, Tanvir Iqbal | Software |  | https://github.com/genalgodds
AISC9 | Does sufficient optimization imply agent structure? | Tyler Tracy, Mateusz Bagiński, Einar Urdshals, Amaury Lorin, Jasmina Nasufi, Alfred Harwood, Alex Altair (RL) | Blog post | Towards a formalization of the agent structure problem | https://www.lesswrong.com/posts/oxsBpx9v3bgxraiPj/towards-a-formalization-of-the-agent-structure-problem
AISC9 | Evaluating alignment evaluations |  | Blog post | [wrapping up drafts]
AISC9 | Exploring toy models of agents | Paul Colognese, Ben Sturgeon, Narmeen Oozer, Arun Jose | Blog post | [subscribe to Paul's LessWrong posts to be notified when we post the results of this project.]
AISC9 | Benchmarks for Stable Reflectivity | Jacques Thibodeau (lead), Kanad Chakrabarti, Youlian Simidjiyski, Thee Ho, Jiaming (George) Yu, Jannes Elstner | Blog post | [more should be published under @jacquesthibs or @ukc10014 on LessWrong]
AISC9 | Personal Fine-Tuning Implementations for AI Value Alignment | Minh Nguyen, Sarah Pan, Nell Watson | Blog post | [We intend to publish a paper on our experiments and observations.]
AISC9 | AI-Driven Economic Safety Nets | David Conrad, Rafael Andersson Lipcsey, Arturs Kanepajs, Tillman Schenk, Jacob Schaal | Blog post | [drafting]