Please add/comment with papers/posts:

Row | Edition | Team | Participants | Format | Title | Link | Comments
---|---|---|---|---|---|---|---
3 | AISC1 | personal post | Justin Shovelain, Michael Aird, others? | Blog post | Improving the future by influencing actors' benevolence, intelligence, and power | https://forum.effectivealtruism.org/posts/4oGYbvcy2SRHTWgWk/improving-the-future-by-influencing-actors-benevolence#fn-dCBkq5f8sD4CzHwn8-1 | Approximately 1/4 of the work on this post was done as part of AISC according to Justin Shovelain |
4 | AISC1 | Side effects in Grid World | Jessica Cooper, Karol Kubicki, Gavin Leech, Tom McGrath | Software | Preventing Side-effects in Gridworlds | https://www.gleech.org/grids/ | Noted in Krakovna's AIS resources. Cited in AIES paper. |
5 | AISC1 | Safe AF | James Bell, Linda Linsefors, Caspar Oesterheld, Joar Skalse | Paper | Reinforcement Learning in Newcomblike Environments | https://proceedings.neurips.cc/paper/2021/hash/b9ed18a301c9f3d183938c451fa183df-Abstract.html | Started at the first AISC in early 2018 and published in Dec 2020; accepted for a spotlight presentation at NeurIPS 2021 |
6 | AISC2 | Human Preference Types | Nandi, Sabrina, Erin | Blog post | Acknowledging Human Preference Types to Support Value Learning | https://www.alignmentforum.org/posts/mSPsyEwaymS74unND/acknowledging-human-preference-types-to-support-value | |
7 | AISC2 | Policymaking for AI Strategy | Brandon Perry, Risto Uuk | Paper | AI Governance and the Policymaking Process: Key Considerations for Reducing AI Risk | https://www.mdpi.com/2504-2289/3/2/26 | Cited by FHI research associate |
8 | AISC2 | Corrupt Reward MDPs | Tomasz Kisielewski, David Lindner, Jason Mancuso, Alok Singh | Software | Corrupt Reward MDPs | https://github.com/jvmancuso/safe-grid-agents | |
9 | AISC2 | Corrigibility | Vegard Blindheim, Anton Osika, Roland Pihlakas | Blog post | Exponentially diminishing returns and conjunctive goals: Mitigating Goodhart’s law with common sense. Towards corrigibility and interruptibility via the golden middle way. | https://medium.com/threelaws/diminishing-returns-and-conjunctive-goals-towards-corrigibility-and-interruptibility-2ec594fed75c | |
10 | AISC2 | Feature Visualization for Deep Reinforcement Learning | Zera Alexander, Andrew Schreiber, Fabian Steuer | Software | Feature Visualization for Deep Reinforcement Learning | https://github.com/andrewschreiber/agent | Got to third and final round of EA Grant (not accepted) |
11 | AISC2 | IRL Benchmark | Adria Garriga-Alonso, Anton Osika, Johannes Heidecke, Max Daniel, Sayan Sarkar | Software | IRL Benchmark | https://github.com/JohannesHeidecke/irl-benchmark | |
12 | AISC2 | Assumptions of Human Values | Jan Kulveit, Linda Linsefors, Alexey Turchin | Blog post | Multi-agent predictive minds and AI alignment | https://www.lesswrong.com/posts/3fkBWpE4f9nYbdf7E/multi-agent-minds-and-ai-alignment | Jan has written a blog post about his best-guess model of how human values and motivations work. Probably not directly stemming much from work at AISC2, since he was busy with other commitments during the physical retreat. Mentioned in the post: "Part of this originated in the efforts of the “Hidden Assumptions” team on the 2nd AI safety camp, and my thoughts about how minds work are inspired by CFAR." |
13 | AISC2 | Value Learning in Games | Stanislav Böhm, Tomáš Gavenčiak, Torben Swoboda, Mikhail Yagudin | Blog post | Value learning in games | https://docs.google.com/document/d/1kxXk7KkFfJAqrk0kDjDJ6Tvz_FL04p34twLj19Tv_IQ/edit#heading=h.cy7im45es3q0 | |
14 | AISC2 | Corrupt Reward MDPs | Jason Mancuso, Tomasz Kisielewski, David Lindner, Alok Singh | Paper | Detecting Spiky Corruption in Markov Decision Processes | https://ceur-ws.org/Vol-2419/paper_28.pdf | Presented in session at AI Safety Workshop in IJCAI 2019 |
15 | AISC3 | Modeling Cooperation | Jonas Müller, Miles Tidmarsh, Vasily Kuznetsov | Software | (implementation of their formal mathematical model) | www.modelingcooperation.com/model | |
16 | AISC3 | Debate | Vojta Kovarik, Anna Gajdova, David Lindner, Lukas Finnveden, Rajashree Agrawal | Blog post | AI Safety Debate and Its Applications | https://www.lesswrong.com/posts/5Kv2qNfRyXXihNrx2/ai-safety-debate-and-its-applications | |
17 | AISC3 | Embedded agents | Arushi, Davide, Sayan | Paper | Categorizing Wireheading in Partially Embedded Agents | https://arxiv.org/abs/1906.09136 | Presented poster at AI Safety Workshop in IJCAI 2019 |
18 | AISC3 | RL Attention | Dmitry Nikulin, Sebastian Kosch, Fabian Steuer, Hoagy Cunningham | Blog post | Regularization and visualization of attention in reinforcement learning agents | https://attentionentropy.github.io/ | |
19 | AISC4 | Generalization in Reward Learning | Anton Makiievskyi, Liang Zhou, Max Chiswick | Blog post | Assessing Generalization in Reward Learning with Procedurally Generated Games | https://chisness.medium.com/assessing-generalization-in-reward-learning-intro-and-background-da6c99d9e48 | |
20 | AISC4 | Goal Directedness | Adam Shimi, Joe Collman, Michele Campolo, Sabrina Tang | Blog post | Focus: you are allowed to be bad at accomplishing your goals | https://www.lesswrong.com/s/DTnoFhDm7ZT2ecJMw/ | Note that Adam Shimi was already focused on goal-directedness work before applying to AISC4, and would probably have written a similar volume of posts in either case (in Remmelt's opinion). The last five posts were published (not necessarily fully finished) after the camp. |
21 | AISC4 | Goal Directedness | Adam Shimi, Joe Collman, Michele Campolo | Blog post | Understanding Goal Directedness (sequence) | https://www.alignmentforum.org/s/o58ZMNaovdztbLfvN | Three of the four goal-directedness team members continued researching together after the camp officially ended |
22 | AISC4 | Survey on AI X-Risk Scenarios | Sam Clarke, Alexis Carlier, Jonas Schuett | Blog post | Survey on AI existential risk scenarios | https://www.lesswrong.com/posts/WiXePTj7KeEycbiwK/survey-on-ai-existential-risk-scenarios | Also shared full results internally with researchers at FHI and elsewhere. They said they didn't publish more widely because of PR risks. |
23 | AISC4 | Human extracted preferences | Mislav Juric, Taylor Kulp-McDowall, Arun Raja, Riccardo Volpato, Nevan Wichers | Blog post | Extraction of human preferences 👨→🤖 | https://www.lesswrong.com/posts/PZYD5kBpeHWgE5jX4/extraction-of-human-preferences | |
24 | AISC5 | Objective Robustness Failures | Jack Koch, Lauro Langosco, Jacob Pfau, James Le (and Lee Sharkey) | Paper | Goal Misgeneralization in Deep Reinforcement Learning | https://proceedings.mlr.press/v162/langosco22a.html | Accepted for a short presentation at ICML and for a poster at the ICML UDL workshop. Cited by Dan Hendrycks et al. TODO: Add NeurIPS link if published. https://theturingprize.com/ wants to retrospectively award them a $10k prize, to be given as a donation to a charity or fund of their choice (or a mix of charities/funds). |
25 | AISC5 | Objective Robustness Failures | Jack Koch, Lauro Langosco, Jacob Pfau, James Le (and Lee Sharkey) | Blog post | Empirical Observations of Objective Robustness Failures; Discussion: Objective Robustness and Inner Alignment Terminology | Post 1 (empirical failures); Post 2 (Terminology) | Two simultaneous posts. Summarised in the AI Alignment Newsletter by Rohin Shah. |
26 | AISC5 | Cooperativity & Common Pool Resources | Quinn Dougherty, Ben Greenberg, Ariel Kwiatkowski | Software | cpr_reputation | https://github.com/RedTachyon/cpr_reputation/ | |
27 | AISC5 | Pessimistic Ask-For-Help Agents for Safe Exploration | Jamie Bernardi, David Reber, Magdalena Wache, Peter Barnett, Max Clarke | Software | pessimistic-agents | https://github.com/j-bernardi/pessimistic-agents | |
28 | AISC5 | Human extracted preferences | Mislav Juric, Taylor Kulp-McDowall, Arun Raja, Riccardo Volpato, Nevan Wichers | Software | Preference_Extraction | https://github.com/arunraja-hub/Preference_Extraction | |
29 | AISC5 | Objective Robustness Failures | Jack Koch, Lauro Langosco, Jacob Pfau, James Le (and Lee Sharkey) | Blog post | [Video, actually]: 'We Were Right! Real Inner Misalignment' | https://www.youtube.com/watch?v=zkbPdEHEyEI&ab_channel=RobertMiles | Rob Miles emailed them about putting together a YouTube explanation of their work |
30 | AISC5 | Cooperativity & Common Pool Resources | Quinn Dougherty, Ben Greenberg, Ariel Kwiatkowski | Blog post | AISC5 Retrospective: Mechanisms for Avoiding Tragedy of the Commons in Common Pool Resource Problems | https://www.lesswrong.com/posts/LBwpubeZSi3ottfjs/aisc5-retrospective-mechanisms-for-avoiding-tragedy-of-the | |
31 | AISC5 | Multi-Objective Decision-Making | Robert Klassert, Roland Pihlakas, Ben Smith | Blog post | A brief review of the reasons multi-objective RL could be important in AI Safety Research | https://www.lesswrong.com/posts/i5dLfi6m6FCexReK9/a-brief-review-of-the-reasons-multi-objective-rl-could-be | Briefly mentioned in a later post by Peter Vamplew: "We have provided a short list of recommended reading at the end of this post, and we refer the reader again to the post of Smith, Pihlakas and Klassert for an overview of work in this area." https://www.lesswrong.com/posts/eeEEgNeTepZb6F6NF |
32 | AISC6 | Multi-Objective Decision-Making | | Blog post | Sets of objectives for a multi-objective RL agent to optimize | https://www.lesswrong.com/posts/4mvdZXjwJHv9tSAWB/sets-of-objectives-for-a-multi-objective-rl-agent-to-1 | |
33 | AISC7 | Multi-Objective Decision-Making | | Paper | Using soft maximin for risk averse multi-objective decision-making | https://link.springer.com/article/10.1007/s10458-022-09586-2 | Published in the journal Autonomous Agents and Multi-Agent Systems. |
34 | AISC6 | (personal post) | Jan Kirchner | Blog post | Inferring utility functions from locally non-transitive preferences | https://www.lesswrong.com/posts/QZiGEDiobFz8ropA5/inferring-utility-functions-from-locally-non-transitive | "As part of the AI Safety Camp, I've been diving a bit deeper into the foundations of expected utility theory and preference learning. In this post, I am making explicit a connection between those two things that (I assume) many people already made implicitly. But I couldn't find a nice exposition of this argument so I wrote it up. Any feedback is of course highly welcome!" |
35 | AISC6 | Language Models as Tools for Alignment Research | Jan Kirchner, Jacques Thibodeau, Logan Smith (as external collaborator?), Kyle and Laria (mentors), Arush? | Blog post | A survey of tool use and workflows in alignment research | https://www.alignmentforum.org/posts/ebYiodG3MAEqskCDG/a-survey-of-tool-use-and-workflows-in-alignment-research-1 | Got a shout out from Jan Leike ('Other researchers have started working on this approach too.') in the post A minimal viable product for alignment (https://aligned.substack.com/p/alignment-mvp?s=r) |
36 | AISC6 | Constraints from Selection - Modularity subteam | Lucius Bushnaq, Avery Griffin, Callum McDougall | Blog post | Project Intro: Selection Theorems for Modularity | https://www.alignmentforum.org/posts/XKwKJCXgSKhSr9bZY/project-intro-selection-theorems-for-modularity | |
37 | AISC6 | Constraints from Selection - Modularity subteam | Lucius Bushnaq, Avery Griffin, Callum McDougall | Blog post | Theories of Modularity in the Biological Literature | https://www.alignmentforum.org/posts/JzTfKrgC7Lfz3zcwM/theories-of-modularity-in-the-biological-literature | |
38 | AISC6 | Semantic Side-Effect Minimisation | Fabian Schimpf, Lukas Fluri | Blog post | Open Problems in Negative Side Effect Minimization | https://www.alignmentforum.org/posts/pnAxcABq9GBDG5BNW/open-problems-in-negative-side-effect-minimization | |
39 | AISC6 | Impact of Memetics on Alignment | Harriet Farlow | Blog post | Machines vs Memes Part 1: AI Alignment and Memetics | https://www.lesswrong.com/posts/JLH6ido4qoBtYmnNR/machines-vs-memes-part-1 | |
40 | AISC6 | Impact of Memetics on Alignment | Nate Rush | Blog post | Machines vs. Memes 2: Memetically-Motivated Model Extensions | https://www.lesswrong.com/posts/gumkW3vy9mhjZriuc/machines-vs-memes-2-memetically-motivated-model-extensions | |
41 | AISC6 | Impact of Memetics on Alignment | Claudio Ceruti | Blog post | Machines vs Memes Part 3: Imitation and Memes | https://www.lesswrong.com/posts/nbDFj4ZS6WSDKtSk4/machines-vs-memes-part-3-imitation-and-memes | |
42 | AISC6 | (personal post) | Jan Czechowsky | Blog post | Steganography and the CycleGAN - alignment failure case study | https://www.lesswrong.com/posts/uutXLm2DRcCtFBZ2D/steganography-and-the-cyclegan-alignment-failure-case-study | |
43 | AISC6 | Constraints from Selection - Modularity subteam | Lucius Bushnaq, Avery Griffin, Callum McDougall | Blog post | Ten experiments in modularity, which we'd like you to run! | https://www.lesswrong.com/posts/99WtcMpsRqZcrocCd/ten-experiments-in-modularity-which-we-d-like-you-to-run | |
44 | AISC6 | Pipeline for Measuring Misalignment | Marius Hobbhahn, Eric Landgrebe, Beth Barnes (mentor) | Blog post | Reflection Mechanisms as an Alignment target: A survey | https://www.lesswrong.com/posts/XyBWkoaqfnuEyNWXi/reflection-mechanisms-as-an-alignment-target-a-survey-1 | |
45 | AISC6 | Pipeline for Measuring Misalignment | Marius Hobbhahn, Eric Landgrebe, Beth Barnes (mentor) | Paper | Reflection Mechanisms as an Alignment Target: A Survey | https://openreview.net/forum?id=4eMzKmZ6xW | Paper version that was accepted to the NeurIPS ML Safety workshop. |
46 | AISC6 | Constraints from Selection - Modularity subteam | Lucius Bushnaq, Avery Griffin, Callum McDougall | Blog post | What Is The True Name of Modularity? | https://www.lesswrong.com/posts/TTTHwLpcewGjQHWzh/what-is-the-true-name-of-modularity | |
47 | AISC6 | Table-Top Role-Playing Game | | Blog post | [Announcement:] AI takeover tabletop RPG: "The Treacherous Turn" | https://www.lesswrong.com/posts/b5EqwQZw7ww2K28Ki/ai-takeover-tabletop-rpg-the-treacherous-turn | |
48 | AISC6 | Language Models as Tools for Alignment Research | Jan Kirchner, Jacques Thibodeau, Logan Smith, "janus" | Blog post | Results from a survey on tool use and workflows in alignment research | https://www.lesswrong.com/posts/a2io2mcxTWS4mxodF/results-for-a-survey-of-tool-use-and-workflows-in-alignment | |
49 | AISC6 | Language Models as Tools for Alignment Research | Jan Kirchner, Jacques Thibodeau, Logan Smith, "janus" | Blog post | A descriptive, not prescriptive, overview of current AI Alignment Research | https://www.lesswrong.com/posts/FgjcHiWvADgsocE34/a-descriptive-not-prescriptive-overview-of-current-ai | |
50 | AISC7 | AGI Safety Impossibility Theorem | Forrest Landry | Blog post | [List of posts: see Comments for posts published around the October 2022 retreat; see Link for posts published afterwards.] | https://mflb.com/ai_alignment_1/title_reorg_psr.html | *Published around AISC8 retreat.* Narrative structure: AI Scope of Work (written the day after camp); Meta-Narrative Sequence of AI Substrate Takeover. Explanation snippets: XKCD-style Comic Overview (just minor edits); Superintelligence Safety Q&A (just minor edits); Negative Arguments; Substrate-Dependent Needs; APS review. Responses to anonymised questions/skeptical counterarguments: Super-ordinate Claims; SGD Selection; Optimisation Cycles; Alignment Drift; Math Expectations; Right Skepticism. |
51 | AISC8 | Uncontrollable Dynamics | Remmelt Ellen | Blog post | The Control Problem: Unsolved or Unsolvable? | https://www.lesswrong.com/posts/xp6n2MG5vQkPpFEBH/the-control-problem-unsolved-or-unsolvable | |
52 | AISC8 | Uncontrollable Dynamics | Roman Yen | Blog post | On the possibility of impossibility of AGI Long-Term Safety | https://www.lesswrong.com/posts/zuXtMKuQRGAhZMoKk/on-the-possibility-of-impossibility-of-agi-long-term-safety#fnmso3ekucj2b | |
53 | AISC8 | Failure Stories | Karl von Wendt, Sofia Bharadia, Peter Drotos, Artem Korotkov... mespa, mruwnik | Blog post | Agentic Mess | https://www.lesswrong.com/posts/LyJAFBuuEfd4kxgsw/agentic-mess-a-failure-story | Video version here. |
54 | AISC8 | Failure Stories | Karl von Wendt, Sofia Bharadia, Peter Drotos, Artem Korotkov... mespa, mruwnik | Blog post | Paths to Failure | https://www.lesswrong.com/posts/yv4xAnkEyWvpXNBte/paths-to-failure | |
55 | AISC8 | Failure Stories | Karl von Wendt, Sofia Bharadia, Peter Drotos, Artem Korotkov... mespa, mruwnik | Blog post | A Friendly Face | https://www.lesswrong.com/posts/iRFxvNeLbHNRCzA2S/a-friendly-face-another-failure-story | |
56 | AISC8 | Interpretable Architectures | Robert Kralisch, Anton Zheltoukhov, David Liu, Sohaib Imran | Blog post | An Investigation of the Frameworks of “Positive Attractors” and “Inherently Interpretable Architectures” | https://www.lesswrong.com/s/z7JTHHdapYdvgfPhM | |
57 | AISC8 | Team Cyborg- ... | Kanad Chakrabarti, Roman Leventov, Nicholas Kees Dupuis | Blog post | Philosophical Cyborg (Part 1) | https://www.lesswrong.com/posts/k93NEoXZq6CdXegdx/philosophical-cyborg-part-1 | |
58 | AISC8 | Team Cyborg- ... | Kanad Chakrabarti | Blog post | Philosophical Cyborg (Part 2)...or, The Good Successor | https://www.lesswrong.com/posts/ZZ57cBkpQ5hpAux9T/philosophical-cyborg-part-2-or-the-good-successor | |
59 | AISC8 | Behavioural Annotation | Nell Watson... | Paper | Draft towards paper | https://docs.google.com/document/d/186iPTOUtofEsL1qgXsH5qX1IBlq7fn2m/edit | |
60 | AISC8 | Soft Optimization | | Blog post | AISC Team Report: Soft-Optimization, Bayes and Goodhart | https://www.lesswrong.com/posts/XXrGhqSNZjcG2nNiy/aisc-team-report-soft-optimization-bayes-and-goodhart | |
61 | AISC8 | Machine Learning For Scientific Discovery | | Blog post | [Sequence:] Machine Learning For Scientific Discovery | https://www.lesswrong.com/s/xoXeJZRCBEBnBoGbC | |
62 | AISC8 | Literature Review of the Neurological Basis of Human Values and Preferences | Mateusz Bagiński | Blog post | "Wanting" and "liking" | https://www.lesswrong.com/posts/opJxxfrN33xQx3eXu/wanting-and-liking | |
63 | AISC8 | Interdisciplinary Investigation of DebateGPT | Paul Bricman, Elfia Bezou-Vrakatseli, Thomas Feeney, and Yimeng Xie | Blog post | Truth | https://compphil.github.io/truth/ | |
64 | AISC8 | Understanding Search in Transformers | Michael I. Ivanitskiy, Alex F. Spies, Tilman Räuker, Guillaume Corlouer, Chris Mathwin, Lucia Quirke, Can Rager, Rusheb Shah, Dan Valentine, Cecilia Diniz Behn, Katsumi Inoue, Samy Wu Fung | Paper | Structured World Representations in Maze-Solving Transformers | https://arxiv.org/pdf/2312.02566.pdf | |
65 | AISC9 | Asymmetric control in LLMs | Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz | Paper | Immunization against harmful fine-tuning attacks | https://arxiv.org/abs/2402.16382 | |
66 | AISC9 | Congressional Messaging Campaigns | Tristan Williams, davekasten, jacob.turn, Felix De Simone, gergo | Blog post | Talking to Congress: Can constituents contacting their legislator influence policy? | https://forum.effectivealtruism.org/posts/5oStggnYLGzomhvvn/talking-to-congress-can-constituents-contacting-their | |
67 | AISC9 | SatisfIA | Vitalii Chyhirov, Simon Fischer, Benjamin Kolb, Martin Kunev, Ariel Kwiatkowski, Jeremy Rich. Lead: Jobst Heitzig (we were also joined by several interns at his lab and members of SPAR) | Blog post | Aspiration-based, non-maximizing AI agent designs | https://www.lesswrong.com/s/4TT69Yt5FDWijAWab | |
68 | AISC9 | Out-of-context learning interpretability | Victor Levoso Fernandez (lead), Luan Fletcher, Leo Mckee-Reid, Andrei Cristea, Florian van der Steen, Nikita Menon, Kunvar Thaman | Software | aisc_oocl_experiments | https://github.com/fletchel/aisc_oocl_experiments | |
69 | AISC9 | High-Level Mechanistic Interpretability Activation Engineering Library 🔥 | Jamie Coombes, Ardy Haroen, Fergus Fettes, Lukas Linauer, Shaheen Ahmed-Chowdhury, Vy Hong | Software | obvslib | https://github.com/obvslib/obvs | |
70 | AISC9 | Ambitious Mechanistic Interpretability | Alice Rigg, Jacob Goldman-Wetzler, Karthik Murugadoss, Leonard Bereska, Lucas Hayne, Wolodymyr Krywonos, Michael Pearce, Kola Ayonrinde, Gonçalo Paulo | Blog post | [Various outputs by individual team members] | Ghost gradients implementation (Jacob); various Mamba interp projects (Gonçalo & others); AtP* implementation (Kola); reverse engineering MNIST (Michael); hierarchical feature clustering (Alice); clustering features by their topology (Karthik); mech interp survey paper (Leonard); computation in superposition extensions (Lucas) | |
71 | AISC9 | Modelling Trajectories of Language Models | Nicky Pochinkov, Tetra Jones, Rashidur Rahman | Paper | Modularity In Transformers: Investigating Separability & Neuron Task Specialization | https://cloud.nicky.pro/s/A2srG3f8W9TLwrG | Under review as a conference paper at ICLR 2024 |
72 | AISC9 | Modelling Trajectories of Language Models | Nicky Pochinkov, Ben Pasero, Skylar Shibayama | Paper | Investigating Neuron Ablation In Attention Heads: The Case For Peak Activation Centering | https://cloud.nicky.pro/s/cM7sFPQfBSsaikx | Under review as a conference paper at the SeT LLM workshop at ICLR 2024 |
73 | AISC9 | MILD | Marcel Mir, Alex Champandard, Remmelt Ellen | Paper | MILD: Minimal Item-Level Documentation of Training Data | https://docs.google.com/document/d/1tP5j1sUf5JI6E700JpU8j_ZKP_zvAlsEFzrMv1PUJQI/edit | Draft doc; will be published later |
74 | AISC9 | Asymmetric control in LLMs | Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Simon Lerman | Paper | Immunization against harmful fine-tuning attacks | https://arxiv.org/abs/2402.16382 | |
75 | AISC9 | Asymmetric control in LLMs | Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, David Atanasov, Robie Gonzales, Subhabrata Majumdar, Carsten Maple, Hassan Sajjad, Frank Rudzicz | Paper | Representation noising effectively prevents harmful fine-tuning on LLMs | https://arxiv.org/abs/2405.14577 | |
76 | AISC9 | Asymmetric control in LLMs | Domenic Rosati, Jan Wehner, David Atanasov | Blog post | Training-time domain authorization could be helpful for safety | https://www.lesswrong.com/posts/38avQYy782zXgNo9u/training-time-domain-authorization-could-be-helpful-for | |
77 | AISC9 | The promisingness of automated alignment | Bogdan Ionut Cirstea, AISC: Jaeson Booker, Leo Mckee-Reid, Marcel Mir, Severin Field, Milton Lin, Sai Joseph, Vassil Tashev, Yuan Yuan Sun; MARS: Alfie Lamerton, Tim Chan, Robayet Hossain; SPAR: Joyee Chen, Joe Emerson, Minh Nguyen, Yixiong Hao. | Blog post | A Review of Weak to Strong Generalization | https://www.lesswrong.com/posts/ELbGqXiLbRe6zSkTu/a-review-of-weak-to-strong-generalization-ai-safety-camp | |
78 | AISC9 | The promisingness of automated alignment | Bogdan Ionut Cirstea, AISC: Jaeson Booker, Leo Mckee-Reid, Marcel Mir, Severin Field, Milton Lin, Sai Joseph, Vassil Tashev, Yuan Yuan Sun; MARS: Alfie Lamerton, Tim Chan, Robayet Hossain; SPAR: Joyee Chen, Joe Emerson, Minh Nguyen, Yixiong Hao. | Blog post | Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks” | https://www.lesswrong.com/posts/Wd9vzwqcYuEokJYCH/paper-review-the-unreasonable-effectiveness-of-easy-training | |
79 | AISC9 | The promisingness of automated alignment | Bogdan Ionut Cirstea, AISC: Jaeson Booker, Leo Mckee-Reid, Marcel Mir, Severin Field, Milton Lin, Sai Joseph, Vassil Tashev, Yuan Yuan Sun; MARS: Alfie Lamerton, Tim Chan, Robayet Hossain; SPAR: Joyee Chen, Joe Emerson, Minh Nguyen, Yixiong Hao. | Blog post | A Review of In-Context Learning Hypotheses for Automated AI Alignment Research | https://www.lesswrong.com/posts/GPcwP8pgyPFPwvi2h/a-review-of-in-context-learning-hypotheses-for-automated-ai | |
80 | AISC9 | Towards realistic ODDs for foundation model based AI offerings | Igor Krawczuk, Paulius Skaisgiris, Scott Bursese, Arghya Sarkar, Tanvir Iqbal | Software | genalgodds | https://github.com/genalgodds | |
81 | AISC9 | Does sufficient optimization imply agent structure? | Tyler Tracy, Mateusz Bagiński, Einar Urdshals, Amaury Lorin, Jasmina Nasufi, Alfred Harwood, Alex Altair (RL) | Blog post | Towards a formalization of the agent structure problem | https://www.lesswrong.com/posts/oxsBpx9v3bgxraiPj/towards-a-formalization-of-the-agent-structure-problem | |
82 | AISC9 | Evaluating alignment evaluations | | Blog post | [wrapping up drafts] | | |
83 | AISC9 | Exploring toy models of agents | Paul Colognese, Ben Sturgeon, Narmeen Oozer, Arun Jose | Blog post | [subscribe to Paul’s LessWrong posts to be notified when we post the results of this project.] | ||
84 | AISC9 | Benchmarks for Stable Reflectivity | Jacques Thibodeau (lead), Kanad Chakrabarti, Youlian Simidjiyski, Thee Ho, Jiaming (George) Yu, Jannes Elstner | Blog post | [more should be published under @jacquesthibs or @ukc10014 on LessWrong ] | ||
85 | AISC9 | Personal Fine-Tuning Implementations for AI Value Alignment | Minh Nguyen, Sarah Pan, Nell Watson | Blog post | [We intend to publish a paper on our experiments and observations.] | ||
86 | AISC9 | AI-Driven Economic Safety Nets | David Conrad, Rafael Andersson Lipcsey, Arturs Kanepajs, Tillman Schenk, Jacob Schaal | Blog post | [drafting] | ||