Published safety research from AI companies, 2023

| Company | Title | Date | URL | Category | Notes |
|---|---|---|---|---|---|
| Google DeepMind | Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level | 22-Dec-23 | https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall | Mechanistic interpretability | |
| Google DeepMind | Challenges with unsupervised LLM knowledge discovery | 15-Dec-23 | https://deepmind.google/research/publications/66937/ | Mechanistic interpretability | https://www.alignmentforum.org/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1 |
| Google DeepMind | Scalable AI Safety via Doubly-Efficient Debate | 23-Nov-23 | https://deepmind.google/research/publications/34920/ | Scalable oversight | |
| Google DeepMind | Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?" | 8-Nov-23 | https://arxiv.org/abs/2311.07587 | Robustness | |
| Google DeepMind | Tracr: Compiled Transformers as a Laboratory for Interpretability | 21-Sep-23 | https://deepmind.google/research/publications/22295/ | Mechanistic interpretability | https://github.com/google-deepmind/tracr |
| Google DeepMind | Explaining grokking through circuit efficiency | 5-Sep-23 | https://arxiv.org/abs/2309.02390 | Mechanistic interpretability | |
| Google DeepMind | RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback | 1-Sep-23 | https://arxiv.org/abs/2309.00267 | RLHF/etc. | |
| Google DeepMind | The Hydra Effect: Emergent Self-repair in Language Model Computations | 28-Jul-23 | https://arxiv.org/abs/2307.15771 | Mechanistic interpretability | |
| Google DeepMind | Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla | 18-Jul-23 | https://arxiv.org/abs/2307.09458 | Mechanistic interpretability | https://www.alignmentforum.org/posts/Av3frxNy3y3i2kpaa/does-circuit-analysis-interpretability-scale-evidence-from |
| Google DeepMind | Are aligned neural networks adversarially aligned? | 26-Jun-23 | https://arxiv.org/abs/2306.15447 | Robustness | |
| Google DeepMind | Model evaluation for extreme risks | 24-May-23 | https://arxiv.org/abs/2305.15324 | Evals and red-teaming | |
| Google DeepMind | Dissecting Recall of Factual Associations in Auto-Regressive Language Models | 28-Apr-23 | https://arxiv.org/abs/2304.14767 | Mechanistic interpretability | |
| Google DeepMind | Power-seeking can be probable and predictive for trained agents | 13-Apr-23 | https://arxiv.org/abs/2304.06528 | Theory | |
| Google DeepMind | The 2023 posts in the DeepMind Alignment Team on Threat Models and Plans sequence | | https://www.alignmentforum.org/s/4iEpGXbD3tQW5atab | Planning and threat modeling | |

Collaborations (non-exhaustive): MATS, especially Neel Nanda.

| Company | Title | Date | URL | Category | Notes |
|---|---|---|---|---|---|
| Anthropic | Anthropic Fall 2023 Debate Progress Update | 27-Nov-23 | https://www.alignmentforum.org/posts/QtqysYdJRenWFeWc4/anthropic-fall-2023-debate-progress-update | Scalable oversight | |
| Anthropic | Specific versus General Principles for Constitutional AI | 24-Oct-23 | https://www.anthropic.com/research/specific-versus-general-principles-for-constitutional-ai | RLHF/etc. | |
| Anthropic | Towards Understanding Sycophancy in Language Models | 24-Oct-23 | https://www.anthropic.com/research/towards-understanding-sycophancy-in-language-models | RLHF/etc. | https://www.alignmentforum.org/posts/g5rABd5qbp8B4g3DE/towards-understanding-sycophancy-in-language-models |
| Anthropic | Towards Monosemanticity: Decomposing Language Models With Dictionary Learning | 5-Oct-23 | https://www.anthropic.com/research/towards-monosemanticity-decomposing-language-models-with-dictionary-learning | Mechanistic interpretability | https://www.anthropic.com/research/decomposing-language-models-into-understandable-components |
| Anthropic | Studying Large Language Model Generalization with Influence Functions | 8-Aug-23 | https://www.anthropic.com/research/studying-large-language-model-generalization-with-influence-functions | Mechanistic interpretability | Blogpost-y version: https://www.anthropic.com/research/influence-functions |
| Anthropic | Towards understanding-based safety evaluations + When can we trust model evaluations? | 28-Jul-23 | | Evals and red-teaming | |
| Anthropic | Question Decomposition Improves the Faithfulness of Model-Generated Reasoning | 18-Jul-23 | https://www.anthropic.com/research/question-decomposition-improves-the-faithfulness-of-model-generated-reasoning | Legible reasoning | https://github.com/anthropics/DecompositionFaithfulnessPaper |
| Anthropic | Measuring Faithfulness in Chain-of-Thought Reasoning | 18-Jul-23 | https://www.anthropic.com/research/measuring-faithfulness-in-chain-of-thought-reasoning | Legible reasoning | |
| Anthropic | Privileged Bases in the Transformer Residual Stream | 16-Mar-23 | https://www.anthropic.com/research/privileged-bases-in-the-transformer-residual-stream | Mechanistic interpretability | |
| Anthropic | The Capacity for Moral Self-Correction in Large Language Models | 15-Feb-23 | https://www.anthropic.com/research/the-capacity-for-moral-self-correction-in-large-language-models | RLHF/etc. | |
| Anthropic | Superposition, Memorization, and Double Descent | 5-Jan-23 | https://www.anthropic.com/research/superposition-memorization-and-double-descent | Mechanistic interpretability | |
| Anthropic | Transformer Circuits blog, four informal research updates and notes (Distributed Representations: Composition & Superposition; Interpretability Dreams; Circuits Updates - May 2023; Circuits Updates - July 2023) | | | Mechanistic interpretability | |

Collaborations (non-exhaustive): MATS, especially Ethan + Evan + Sam. Anthropic also gets some credit for Attribution Patching: Activation Patching At Industrial Scale.

Out of scope: Core Views on AI Safety, Model Organisms of Misalignment.

| Company | Title | Date | URL | Category | Notes |
|---|---|---|---|---|---|
| OpenAI | Weak-to-strong generalization | 14-Dec-23 | https://openai.com/index/weak-to-strong-generalization/ | Scalable oversight | |
| OpenAI | Language models can explain neurons in language models | 9-May-23 | https://openai.com/research/language-models-can-explain-neurons-in-language-models | Mechanistic interpretability | |

Meta: No x-safety research, but see ROAST: Robustifying Language Models via Adversarial Perturbation with Selective Training, and Llama Guard.

Microsoft: No x-safety research. Hallucination work: Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning; Teaching Language Models to Hallucinate Less with Synthetic Tasks.

(Others: nothing)