Published safety research from AI companies, 2023

Company | Title | Date | URL | Category | Notes
Google | Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level | 22-Dec-23 | https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall | Mechanistic interpretability
Google | Challenges with unsupervised LLM knowledge discovery | 15-Dec-23 | https://deepmind.google/research/publications/66937/ | Mechanistic interpretability | Discussion: https://www.alignmentforum.org/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1
Google | Scalable AI Safety via Doubly-Efficient Debate | 23-Nov-23 | https://deepmind.google/research/publications/34920/ | Scalable oversight
Google | Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?" | 8-Nov-23 | https://arxiv.org/abs/2311.07587 | Robustness
Google | Tracr: Compiled Transformers as a Laboratory for Interpretability | 21-Sep-23 | https://deepmind.google/research/publications/22295/ | Mechanistic interpretability | Code: https://github.com/google-deepmind/tracr
Google | Explaining grokking through circuit efficiency | 5-Sep-23 | https://arxiv.org/abs/2309.02390 | Mechanistic interpretability
Google | RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback | 1-Sep-23 | https://arxiv.org/abs/2309.00267 | RLHF/etc.
Google | The Hydra Effect: Emergent Self-repair in Language Model Computations | 28-Jul-23 | https://arxiv.org/abs/2307.15771 | Mechanistic interpretability
Google | Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla | 18-Jul-23 | https://arxiv.org/abs/2307.09458 | Mechanistic interpretability | Alignment Forum post: https://www.alignmentforum.org/posts/Av3frxNy3y3i2kpaa/does-circuit-analysis-interpretability-scale-evidence-from
Google | Are aligned neural networks adversarially aligned? | 26-Jun-23 | https://arxiv.org/abs/2306.15447 | Robustness
Google | Model evaluation for extreme risks | 24-May-23 | https://arxiv.org/abs/2305.15324 | Evals and red-teaming
Google | Dissecting Recall of Factual Associations in Auto-Regressive Language Models | 28-Apr-23 | https://arxiv.org/abs/2304.14767 | Mechanistic interpretability
Google | Power-seeking can be probable and predictive for trained agents | 13-Apr-23 | https://arxiv.org/abs/2304.06528 | Theory
Google | The 2023 posts in the sequence DeepMind Alignment Team on Threat Models and Plans | https://www.alignmentforum.org/s/4iEpGXbD3tQW5atab | Planning and threat modeling
Collaborations (non-exhaustive): MATS, especially Neel Nanda.

Anthropic | Anthropic Fall 2023 Debate Progress Update | 27-Nov-23 | https://www.alignmentforum.org/posts/QtqysYdJRenWFeWc4/anthropic-fall-2023-debate-progress-update | Scalable oversight
Anthropic | Specific versus General Principles for Constitutional AI | 24-Oct-23 | https://www.anthropic.com/research/specific-versus-general-principles-for-constitutional-ai | RLHF/etc.
Anthropic | Towards Understanding Sycophancy in Language Models | 24-Oct-23 | https://www.anthropic.com/research/towards-understanding-sycophancy-in-language-models | RLHF/etc. | Alignment Forum post: https://www.alignmentforum.org/posts/g5rABd5qbp8B4g3DE/towards-understanding-sycophancy-in-language-models
Anthropic | Towards Monosemanticity: Decomposing Language Models With Dictionary Learning | 5-Oct-23 | https://www.anthropic.com/research/towards-monosemanticity-decomposing-language-models-with-dictionary-learning | Mechanistic interpretability | Blog post version: https://www.anthropic.com/research/decomposing-language-models-into-understandable-components
Anthropic | Studying Large Language Model Generalization with Influence Functions | 8-Aug-23 | https://www.anthropic.com/research/studying-large-language-model-generalization-with-influence-functions | Mechanistic interpretability | Blog post version: https://www.anthropic.com/research/influence-functions
Anthropic | Towards understanding-based safety evaluations + When can we trust model evaluations? | 28-Jul-23 | Evals and red-teaming
Anthropic | Question Decomposition Improves the Faithfulness of Model-Generated Reasoning | 18-Jul-23 | https://www.anthropic.com/research/question-decomposition-improves-the-faithfulness-of-model-generated-reasoning | Legible reasoning | Code: https://github.com/anthropics/DecompositionFaithfulnessPaper
Anthropic | Measuring Faithfulness in Chain-of-Thought Reasoning | 18-Jul-23 | https://www.anthropic.com/research/measuring-faithfulness-in-chain-of-thought-reasoning | Legible reasoning
Anthropic | Privileged Bases in the Transformer Residual Stream | 16-Mar-23 | https://www.anthropic.com/research/privileged-bases-in-the-transformer-residual-stream | Mechanistic interpretability
Anthropic | The Capacity for Moral Self-Correction in Large Language Models | 15-Feb-23 | https://www.anthropic.com/research/the-capacity-for-moral-self-correction-in-large-language-models | RLHF/etc.
Anthropic | Superposition, Memorization, and Double Descent | 5-Jan-23 | https://www.anthropic.com/research/superposition-memorization-and-double-descent | Mechanistic interpretability
Anthropic | Transformer Circuits blog: four informal research updates and notes (Distributed Representations: Composition & Superposition; Interpretability Dreams; Circuits Updates — May 2023; Circuits Updates — July 2023) | Mechanistic interpretability
Collaborations (non-exhaustive): MATS, especially Ethan + Evan + Sam. Anthropic also gets some credit for Attribution Patching: Activation Patching At Industrial Scale.
Out of scope: Core Views on AI Safety, Model Organisms of Misalignment

OpenAI | Weak-to-strong generalization | 14-Dec-23 | https://openai.com/index/weak-to-strong-generalization/ | Scalable oversight
OpenAI | Language models can explain neurons in language models | 9-May-23 | https://openai.com/research/language-models-can-explain-neurons-in-language-models | Mechanistic interpretability

Meta:
No x-safety research, but see ROAST: Robustifying Language Models via Adversarial Perturbation with Selective Training, and Llama Guard.

Microsoft:
No x-safety research. Hallucination work: Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning, and Teaching Language Models to Hallucinate Less with Synthetic Tasks.

(Others: nothing)