ABCDEFGHIJKLMNOPQRSTUVWXYZ
1
CompanyTitleDateURL
Safety_category
Abstract
2
AnthropicSycophancy to subterfuge: Investigating reward tampering in language models17-Jun-24https://www.anthropic.com/research/reward-tamperingPower-seeking tendenciesIn reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward tampering and that such behavior may be nontrivial to remove.
3
AnthropicMapping the Mind of a Large Language Model21-May-24https://www.anthropic.com/research/mapping-mind-language-modelMechanistic interpretabilityEight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet, 1 Anthropic's medium-sized production model.
We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).
4
AnthropicMany-shot jailbreaking2-Apr-24https://www.anthropic.com/research/many-shot-jailbreakingSafety evaluationsWe investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. This is newly feasible with the larger context windows recently deployed by Anthropic, OpenAI and Google DeepMind. We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots. We demonstrate the success of this attack on the most widely used state-of-the-art closedweight models, and across various tasks. Our results suggest very long contexts present a rich new attack surface for LLMs.
5
AnthropicSleeper Agents: Training Deceptive LLMs that Persist Through Safety Training15-Jan-24
https://www.anthropic.com/research/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training
Model organisms of misalignmentHumans are capable of strategically deceptive behavior: behaving helpfully in
most situations, but then behaving very differently in order to pursue
alternative objectives when given the opportunity. If an AI system learned such
a deceptive strategy, could we detect it and remove it using current
state-of-the-art safety training techniques? To study this question, we
construct proof-of-concept examples of deceptive behavior in large language
models (LLMs). For example, we train models that write secure code when the
prompt states that the year is 2023, but insert exploitable code when the
stated year is 2024. We find that such backdoor behavior can be made
persistent, so that it is not removed by standard safety training techniques,
including supervised fine-tuning, reinforcement learning, and adversarial
training (eliciting unsafe behavior and then training to remove it). The
backdoor behavior is most persistent in the largest models and in models
trained to produce chain-of-thought reasoning about deceiving the training
process, with the persistence remaining even when the chain-of-thought is
distilled away. Furthermore, rather than removing backdoors, we find that
adversarial training can teach models to better recognize their backdoor
triggers, effectively hiding the unsafe behavior. Our results suggest that,
once a model exhibits deceptive behavior, standard techniques could fail to
remove such deception and create a false impression of safety.
7
AnthropicCodebook Features: Sparse and Discrete Interpretability for Neural Networks26-Oct-23
https://arxiv.org/pdf/2310.17230v1
Mechanistic interpretabilityUnderstanding neural networks is challenging in part because of the dense, continuous nature of their hidden states. We explore whether we can train neural networks to have hidden states that are sparse, discrete, and more interpretable by quantizing their continuous features into what we call codebook features. Codebook features are produced by finetuning neural networks with vector quantization bottlenecks at each layer, producing a network whose hidden features are the sum of a small number of discrete vector codes chosen from a larger codebook. Surprisingly, we find that neural networks can operate under this extreme bottleneck with only modest degradation in performance. This sparse, discrete bottleneck also provides an intuitive way of controlling neural network behavior: first, find codes that activate when the desired behavior is present, then activate those same codes during generation to elicit that behavior. We validate our approach by training codebook Transformers on several different datasets. First, we explore a finite state machine dataset with far more hidden states than neurons. In this setting, our approach overcomes the superposition problem by assigning states to distinct codes, and we find that we can make the neural network behave as if it is in a different state by activating the code for that state. Second, we train Transformer language models with up to 410M parameters on two natural language datasets. We identify codes in these models representing diverse, disentangled concepts (ranging from negative emotions to months of the year) and find that we can guide the model to generate different topics by activating the appropriate codes during inference. Overall, codebook features appear to be a promising unit of analysis and control for neural networks and interpretability. Our codebase and models are open-sourced at https://github.com/taufeeque9/codebook-features.
8
AnthropicSpecific versus General Principles for Constitutional AI24-Oct-23
https://www.anthropic.com/research/specific-versus-general-principles-for-constitutional-ai
Enhancing human feedbackHuman feedback can prevent overtly harmful utterances in conversational
models, but may not automatically mitigate subtle problematic behaviors such as
a stated desire for self-preservation or power. Constitutional AI offers an
alternative, replacing human feedback with feedback from AI models conditioned
only on a list of written principles. We find this approach effectively
prevents the expression of such behaviors. The success of simple principles
motivates us to ask: can models learn general ethical behaviors from only a
single written principle? To test this, we run experiments using a principle
roughly stated as "do what's best for humanity". We find that the largest
dialogue models can generalize from this short constitution, resulting in
harmless assistants with no stated interest in specific motivations like power.
A general principle may thus partially avoid the need for a long list of
constitutions targeting potentially harmful behaviors. However, more detailed
constitutions still improve fine-grained control over specific types of harms.
This suggests both general and specific principles have value for steering AI
safely.
9
AnthropicTowards Understanding Sycophancy in Language Models24-Oct-23
https://www.anthropic.com/research/towards-understanding-sycophancy-in-language-models
Enhancing human feedbackHuman feedback is commonly utilized to finetune AI assistants. But human
feedback may also encourage model responses that match user beliefs over
truthful ones, a behaviour known as sycophancy. We investigate the prevalence
of sycophancy in models whose finetuning procedure made use of human feedback,
and the potential role of human preference judgments in such behavior. We first
demonstrate that five state-of-the-art AI assistants consistently exhibit
sycophancy across four varied free-form text-generation tasks. To understand if
human preferences drive this broadly observed behavior, we analyze existing
human preference data. We find that when a response matches a user's views, it
is more likely to be preferred. Moreover, both humans and preference models
(PMs) prefer convincingly-written sycophantic responses over correct ones a
non-negligible fraction of the time. Optimizing model outputs against PMs also
sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results
indicate that sycophancy is a general behavior of state-of-the-art AI
assistants, likely driven in part by human preference judgments favoring
sycophantic responses.
10
Anthropic
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
5-Oct-23
https://www.anthropic.com/research/towards-monosemanticity-decomposing-language-models-with-dictionary-learning
Mechanistic interpretabilityIn our latest paper, Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, we outline evidence that there are better units of analysis than individual neurons, and we have built machinery that lets us find these units in small transformer models. These units, called features, correspond to patterns (linear combinations) of neuron activations. This provides a path to breaking down complex neural networks into parts we can understand, and builds on previous efforts to interpret high-dimensional systems in neuroscience, machine learning, and statistics. In a transformer language model, we decompose a layer with 512 neurons into more than 4000 features which separately represent things like DNA sequences, legal language, HTTP requests, Hebrew text, nutrition statements, and much, much more. Most of these model properties are invisible when looking at the activations of individual neurons in isolation.
11
AnthropicStudying Large Language Model Generalization with Influence Functions8-Aug-23
https://www.anthropic.com/research/studying-large-language-model-generalization-with-influence-functions
Mechanistic interpretabilityWhen trying to gain better visibility into a machine learning model in order
to understand and mitigate the associated risks, a potentially valuable source
of evidence is: which training examples most contribute to a given behavior?
Influence functions aim to answer a counterfactual: how would the model's
parameters (and hence its outputs) change if a given sequence were added to the
training set? While influence functions have produced insights for small
models, they are difficult to scale to large language models (LLMs) due to the
difficulty of computing an inverse-Hessian-vector product (IHVP). We use the
Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC)
approximation to scale influence functions up to LLMs with up to 52 billion
parameters. In our experiments, EK-FAC achieves similar accuracy to traditional
influence function estimators despite the IHVP computation being orders of
magnitude faster. We investigate two algorithmic techniques to reduce the cost
of computing gradients of candidate training sequences: TF-IDF filtering and
query batching. We use influence functions to investigate the generalization
patterns of LLMs, including the sparsity of the influence patterns, increasing
abstraction with scale, math and programming abilities, cross-lingual
generalization, and role-playing behavior. Despite many apparently
sophisticated forms of generalization, we identify a surprising limitation:
influences decay to near-zero when the order of key phrases is flipped.
Overall, influence functions give us a powerful new tool for studying the
generalization properties of LLMs.
12
Anthropic
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
18-Jul-23
https://www.anthropic.com/research/question-decomposition-improves-the-faithfulness-of-model-generated-reasoning
Honest AIAs large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model’s actual reasoning, which is not always the case. To improve over the faithfulness of CoT reasoning, we have models generate reasoning by decomposing questions into subquestions. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT while improving the faithfulness of the model’s stated reasoning on several recently-proposed metrics. By forcing the model to answer simpler subquestions in separate contexts, we greatly increase the faithfulness of model-generated reasoning over CoT, while still achieving some of the performance gains of CoT. Our results show it is possible to improve the faithfulness of model-generated reasoning; continued improvements may lead to reasoning that enables us to verify the correctness and safety of LLM behavior.
14
AnthropicCircuits Updates — May 202324-May-23
https://www.anthropic.com/research/circuits-updates-may-2023
Mechanistic interpretabilityWe report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.
15
AnthropicInterpretability Dreams24-May-23
https://www.anthropic.com/research/interpretability-dreams
Mechanistic interpretabilityOur present research aims to create a foundation for mechanistic interpretability research. In particular, we're focused on trying to resolve the challenge of superposition. In doing so, it's important to keep sight of what we're trying to lay the foundations for. This essay summarizes those motivating aspirations – the exciting directions we hope will be possible if we can overcome the present challenges.We aim to offer insight into our vision for addressing mechanistic interpretability's other challenges, especially scalability. Because we have focused on foundational issues, our longer-term path to scaling interpretability and tackling other challenges has often been obscure. By articulating this vision, we hope to clarify how we might resolve limitations, like analyzing massive neural networks, that might naively seem intractable in a mechanistic approach.
16
AnthropicDistributed Representations: Composition & Superposition4-May-23
https://www.anthropic.com/research/distributed-representations-composition-superposition
Mechanistic interpretabilityDistributed representations are a classic idea in both neuroscience and connectionist approaches to AI. We're often asked how our work on superposition relates to it. Since publishing our original paper on superposition, we've had more time to reflect on the relationship between the topics and discuss it with people, and wanted to expand on our earlier discussion in the related work section and share a few thoughts. (We care a lot about superposition and the structure of distributed representations because decomposing representations into independent components is necessary to escape the curse of dimensionality and understand neural networks.)It seems to us that "distributed representations" might be understood as containing two different ideas, which we'll call "composition" and "superposition". 1 These two different notions of distributed representations have very different properties in terms of generalization and what functions can be linearly computed from them. And while a representation can use both, there's a trade-off that puts them fundamentally in tension! 2To make this concrete, we'll consider a few potential ways neurons might represent shapes of different colors. These lovely examples are borrowed from Thorpe (1989), who created them to demonstrate various possibilities between the idea of a "local code" and a "distributed code" in neuroscience. Thorpe provides four example codes – "local", "semi-local", "semi-distributed", and "high-distributed". These might traditionally be seen as being on a spectrum between being "local" and "distributed". We'll consider these examples again and offer an alternative view in which the examples instead vary on two different dimensions of superposition and composition.Following Thorpe, this note will focus on examples where neurons have binary activations. This significantly simplifies the space of possibilities, but it's still a rich enough space to have interesting questions
17
AnthropicPrivileged Bases in the Transformer Residual Stream16-Mar-23
https://www.anthropic.com/research/privileged-bases-in-the-transformer-residual-stream
Mechanistic interpretabilityOur mathematical theories of the Transformer architecture suggest that individual coordinates in the residual stream should have no special significance (that is, the basis directions should be in some sense "arbitrary" and no more likely to encode information than random directions). Recent work has shown that this observation is false in practice. We investigate this phenomenon and provisionally conclude that the per-dimension normalizers in the Adam optimizer are to blame for the effect.We explore two other obvious sources of basis dependency in a Transformer: Layer normalization, and finite-precision floating-point calculations. We confidently rule these out as being the source of the observed basis-alignment.
18
AnthropicThe Capacity for Moral Self-Correction in Large Language Models15-Feb-23
https://www.anthropic.com/research/the-capacity-for-moral-self-correction-in-large-language-models
Enhancing human feedbackWe test the hypothesis that language models trained with reinforcement
learning from human feedback (RLHF) have the capability to "morally
self-correct" -- to avoid producing harmful outputs -- if instructed to do so.
We find strong evidence in support of this hypothesis across three different
experiments, each of which reveal different facets of moral self-correction. We
find that the capability for moral self-correction emerges at 22B model
parameters, and typically improves with increasing model size and RLHF
training. We believe that at this level of scale, language models obtain two
capabilities that they can use for moral self-correction: (1) they can follow
instructions and (2) they can learn complex normative concepts of harm like
stereotyping, bias, and discrimination. As such, they can follow instructions
to avoid certain kinds of morally harmful outputs. We believe our results are
cause for cautious optimism regarding the ability to train language models to
abide by ethical principles.
19
AnthropicSuperposition, Memorization, and Double Descent5-Jan-23
https://www.anthropic.com/research/superposition-memorization-and-double-descent
Mechanistic interpretabilityIn a recent paper, we found that simple neural networks trained on toy tasks often exhibit a phenomenon called superposition, where they represent more features than they have neurons. Our investigation was limited to the infinite-data, underfitting regime. But there's reason to believe that understanding overfitting might be important if we want to succeed at mechanistic interpretability, and that superposition might be a central part of the story.Why should mechanistic interpretability care about overfitting? Despite overfitting being a central problem in machine learning, we have little mechanistic understanding of what exactly is going on when deep learning models overfit or memorize examples. Additionally, previous work has hinted that there may be an important link between overfitting and learning interpretable features.So understanding overfitting is important, but why should it be relevant to superposition? Consider the case of a language model which verbatim memorizes text. How can it do this? One naive idea is that it might use neurons to create a lookup table mapping sequences to arbitrary continuations. For every sequence of tokens it wishes to memorize, it could dedicate one neuron to detecting that sequence, and then implement arbitrary behavior when it fires. The problem with this approach is that it's extremely inefficient – but it seems like a perfect candidate for superposition, since each case is mutually exclusive and can't interfere.In this note, we offer a very preliminary investigation of training the same toy models in our previous paper on limited datasets. Despite being extremely simple, the toy model turns out to be a surprisingly rich case study for overfitting. In particular, we find the following:
20
AnthropicDiscovering Language Model Behaviors with Model-Written Evaluations19-Dec-22
https://www.anthropic.com/research/discovering-language-model-behaviors-with-model-written-evaluations
Enhancing human feedbackAs language models (LMs) scale, they develop many novel behaviors, good and
bad, exacerbating the need to evaluate how they behave. Prior work creates
evaluations with crowdwork (which is time-consuming and expensive) or existing
data sources (which are not always available). Here, we automatically generate
evaluations with LMs. We explore approaches with varying amounts of human
effort, from instructing LMs to write yes/no questions to making complex
Winogender schemas with multiple stages of LM-based generation and filtering.
Crowdworkers rate the examples as highly relevant and agree with 90-100% of
labels, sometimes more so than corresponding human-written datasets. We
generate 154 datasets and discover new cases of inverse scaling where LMs get
worse with size. Larger LMs repeat back a dialog user's preferred answer
("sycophancy") and express greater desire to pursue concerning goals like
resource acquisition and goal preservation. We also find some of the first
examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF
makes LMs worse. For example, RLHF makes LMs express stronger political views
(on gun rights and immigration) and a greater desire to avoid shut down.
Overall, LM-written evaluations are high-quality and let us quickly discover
many novel LM behaviors.
21
AnthropicConstitutional AI: Harmlessness from AI Feedback15-Dec-22
https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
Enhancing human feedbackAs AI systems become more capable, we would like to enlist their help to
supervise other AIs. We experiment with methods for training a harmless AI
assistant through self-improvement, without any human labels identifying
harmful outputs. The only human oversight is provided through a list of rules
or principles, and so we refer to the method as 'Constitutional AI'. The
process involves both a supervised learning and a reinforcement learning phase.
In the supervised phase we sample from an initial model, then generate
self-critiques and revisions, and then finetune the original model on revised
responses. In the RL phase, we sample from the finetuned model, use a model to
evaluate which of the two samples is better, and then train a preference model
from this dataset of AI preferences. We then train with RL using the preference
model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a
result we are able to train a harmless but non-evasive AI assistant that
engages with harmful queries by explaining its objections to them. Both the SL
and RL methods can leverage chain-of-thought style reasoning to improve the
human-judged performance and transparency of AI decision making. These methods
make it possible to control AI behavior more precisely and with far fewer human
labels.
22
AnthropicMeasuring Progress on Scalable Oversight for Large Language Models4-Nov-22
https://www.anthropic.com/research/measuring-progress-on-scalable-oversight-for-large-language-models
Enhancing human feedbackDeveloping safe and useful general-purpose AI systems will require us to make
progress on scalable oversight: the problem of supervising systems that
potentially outperform us on most skills relevant to the task at hand.
Empirical work on this problem is not straightforward, since we do not yet have
systems that broadly exceed our abilities. This paper discusses one of the
major ways we think about this problem, with a focus on ways it can be studied
empirically. We first present an experimental design centered on tasks for
which human specialists succeed but unaided humans and current general AI
systems fail. We then present a proof-of-concept experiment meant to
demonstrate a key feature of this experimental design and show its viability
with two question-answering tasks: MMLU and time-limited QuALITY. On these
tasks, we find that human participants who interact with an unreliable
large-language-model dialog assistant through chat -- a trivial baseline
strategy for scalable oversight -- substantially outperform both the model
alone and their own unaided performance. These results are an encouraging sign
that scalable oversight will be tractable to study with present models and
bolster recent findings that large language models can productively assist
humans with difficult tasks.
23
AnthropicToy Models of Superposition14-Sep-22
https://www.anthropic.com/research/toy-models-of-superposition
Mechanistic interpretabilityIn this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition. When features are sparse, superposition allows compression beyond what a linear model would do, at the cost of "interference" that requires nonlinear filtering.
24
Anthropic
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
22-Aug-22
https://www.anthropic.com/research/red-teaming-language-models-to-reduce-harms-methods-scaling-behaviors-and-lessons-learned
Safety evaluationsWe describe our early efforts to red team language models in order to
simultaneously discover, measure, and attempt to reduce their potentially
harmful outputs. We make three main contributions. First, we investigate
scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B
parameters) and 4 model types: a plain language model (LM); an LM prompted to
be helpful, honest, and harmless; an LM with rejection sampling; and a model
trained to be helpful and harmless using reinforcement learning from human
feedback (RLHF). We find that the RLHF models are increasingly difficult to red
team as they scale, and we find a flat trend with scale for the other model
types. Second, we release our dataset of 38,961 red team attacks for others to
analyze and learn from. We provide our own analysis of the data and find a
variety of harmful outputs, which range from offensive language to more subtly
harmful non-violent unethical outputs. Third, we exhaustively describe our
instructions, processes, statistical methodologies, and uncertainty about red
teaming. We hope that this transparency accelerates our ability to work
together as a community in order to develop shared norms, practices, and
technical standards for how to red team language models.
26
AnthropicSoftmax Linear Units17-Jun-22
https://www.anthropic.com/research/softmax-linear-units
Mechanistic interpretabilityIn this paper, we report an architectural change which appears to substantially increase the fraction of MLP neurons which appear to be "interpretable" (i.e. respond to an articulable property of the input), at little to no cost to ML performance. Specifically, we replace the activation function with a softmax linear unit (which we term SoLU) and show that this significantly increases the fraction of neurons in the MLP layers which seem to correspond to readily human-understandable concepts, phrases, or categories on quick investigation, as measured by randomized and blinded experiments. We then study our SoLU models and use them to gain several new insights about how information is processed in transformers. However, we also discover some evidence that the superposition hypothesis is true and there is no free lunch: SoLU may be making some features more interpretable by “hiding” others and thus making them even more deeply uninterpretable. Despite this, SoLU still seems like a net win, as in practical terms it substantially increases the fraction of neurons we are able to understand.
27
AnthropicScaling Laws and Interpretability of Learning from Repeated Data21-May-22
https://www.anthropic.com/research/scaling-laws-and-interpretability-of-learning-from-repeated-data
Mechanistic interpretabilityRecent large language models have been trained on vast datasets, but also
often on repeated data, either intentionally for the purpose of upweighting
higher quality data, or unintentionally because data deduplication is not
perfect and the model is exposed to repeated data at the sentence, paragraph,
or document level. Some works have reported substantial negative performance
effects of this repeated data. In this paper we attempt to study repeated data
systematically and to understand its effects mechanistically. To do this, we
train a family of models where most of the data is unique but a small fraction
of it is repeated many times. We find a strong double descent phenomenon, in
which repeated data can lead test loss to increase midway through training. A
predictable range of repetition frequency leads to surprisingly severe
degradation in performance. For instance, performance of an 800M parameter
model can be degraded to that of a 2x smaller model (400M params) by repeating
0.1% of the data 100 times, despite the other 90% of the training tokens
remaining unique. We suspect there is a range in the middle where the data can
be memorized and doing so consumes a large fraction of the model's capacity,
and this may be where the peak of degradation occurs. Finally, we connect these
observations to recent mechanistic interpretability work - attempting to
reverse engineer the detailed computations performed by the model - by showing
that data repetition disproportionately damages copying and internal structures
associated with generalization, such as induction heads, providing a possible
mechanism for the shift from generalization to memorization. Taken together,
these results provide a hypothesis for why repeating a relatively small
fraction of data in large language models could lead to disproportionately
large harms to performance.
28
Anthropic
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
12-Apr-22
https://www.anthropic.com/research/training-a-helpful-and-harmless-assistant-with-reinforcement-learning-from-human-feedback
Enhancing human feedbackWe apply preference modeling and reinforcement learning from human feedback
(RLHF) to finetune language models to act as helpful and harmless assistants.
We find this alignment training improves performance on almost all NLP
evaluations, and is fully compatible with training for specialized skills such
as python coding and summarization. We explore an iterated online mode of
training, where preference models and RL policies are updated on a weekly
cadence with fresh human feedback data, efficiently improving our datasets and
models. Finally, we investigate the robustness of RLHF training, and identify a
roughly linear relation between the RL reward and the square root of the KL
divergence between the policy and its initialization. Alongside our main
results, we perform peripheral analyses on calibration, competing objectives,
and the use of OOD detection, compare our models with human writers, and
provide samples from our models using prompts appearing in recent related work.
29
AnthropicIn-context Learning and Induction Heads8-Mar-22
https://www.anthropic.com/research/in-context-learning-and-induction-heads
Mechanistic interpretability"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguing that induction heads may be the mechanistic source of general in-context learning in transformer models of any size. For small attention-only models, we present strong, causal evidence; for larger models with MLPs, we present correlational evidence.
34
Google DeepMindOn scalable oversight with weak LLMs judging strong LLMs5-Jul-24
https://arxiv.org/pdf/2407.04622v2
Enhancing human feedbackScalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AI's compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare to a baseline of direct question-answering, where the judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters/consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.
35
Google DeepMind
UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI
27-Jun-24
https://arxiv.org/pdf/2407.00106v1
UnlearningExact unlearning was first introduced as a privacy mechanism that allowed a user to retract their data from machine learning models on request. Shortly after, inexact schemes were proposed to mitigate the impractical costs associated with exact unlearning. More recently unlearning is often discussed as an approach for removal of impermissible knowledge i.e. knowledge that the model should not possess such as unlicensed copyrighted, inaccurate, or malicious information. The promise is that if the model does not have a certain malicious capability, then it cannot be used for the associated malicious purpose. In this paper we revisit the paradigm in which unlearning is used for in Large Language Models (LLMs) and highlight an underlying inconsistency arising from in-context learning. Unlearning can be an effective control mechanism for the training phase, yet it does not prevent the model from performing an impermissible act during inference. We introduce a concept of ununlearning, where unlearned knowledge gets reintroduced in-context, effectively rendering the model capable of behaving as if it knows the forgotten knowledge. As a result, we argue that content filtering for impermissible knowledge will be required and even exact unlearning schemes are not enough for effective content regulation. We discuss feasibility of ununlearning for modern LLMs and examine broader implications.
41
Google DeepMind
Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition
13-Jun-24
https://arxiv.org/pdf/2406.09073v1
UnlearningWe present the findings of the first NeurIPS competition on unlearning, which sought to stimulate the development of novel algorithms and initiate discussions on formal and robust evaluation methodologies. The competition was highly successful: nearly 1,200 teams from across the world participated, and a wealth of novel, imaginative solutions with different characteristics were contributed. In this paper, we analyze top solutions and delve into discussions on benchmarking unlearning, which itself is a research problem. The evaluation methodology we developed for the competition measures forgetting quality according to a formal notion of unlearning, while incorporating model utility for a holistic evaluation. We analyze the effectiveness of different instantiations of this evaluation framework vis-a-vis the associated compute cost, and discuss implications for standardizing evaluation. We find that the ranking of leading methods remains stable under several variations of this framework, pointing to avenues for reducing the cost of evaluation. Overall, our findings indicate progress in unlearning, with top-performing competition entries surpassing existing algorithms under our evaluation framework. We analyze trade-offs made by different algorithms and strengths or weaknesses in terms of generalizability to new datasets, paving the way for advancing both benchmarking and algorithm development in this important area.
45
Google DeepMind
Offline Regularised Reinforcement Learning for Large Language Models Alignment
29-May-24
https://arxiv.org/pdf/2405.19107v1
Enhancing human feedbackThe dominant framework for alignment of large language models (LLM), whether through reinforcement learning from human feedback or direct preference optimisation, is to learn from preference data. This involves building datasets where each element is a quadruplet composed of a prompt, two independent responses (completions of the prompt) and a human preference between the two independent responses, yielding a preferred and a dis-preferred response. Such data is typically scarce and expensive to collect. On the other hand, \emph{single-trajectory} datasets where each element is a triplet composed of a prompt, a response and a human feedback is naturally more abundant. The canonical element of such datasets is for instance an LLM's response to a user's prompt followed by a user's feedback such as a thumbs-up/down. Consequently, in this work, we propose DRO, or \emph{Direct Reward Optimisation}, as a framework and associated algorithms that do not require pairwise preferences. DRO uses a simple mean-squared objective that can be implemented in various ways. We validate our findings empirically, using T5 encoder-decoder language models, and show DRO's performance over selected baselines such as Kahneman-Tversky Optimization (KTO). Thus, we confirm that DRO is a simple and empirically compelling method for single-trajectory policy optimisation.
46
Google DeepMindRobust Preference Optimization through Reward Model Distillation29-May-24
https://arxiv.org/pdf/2405.19316v1
Enhancing human feedbackLanguage model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, typical preference datasets have only a single, or at most a few, annotation per preference pair, which causes DPO to overconfidently assign rewards that trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs: we train the LM to produce probabilities that match the distribution induced by a reward model trained on the preference data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a family of reward models that, as a whole, is likely to include at least one reasonable proxy for the preference distribution. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations, while preserving the simple supervised nature of DPO.
48
Google DeepMind
Understanding the performance gap between online and offline alignment algorithms
14-May-24
https://arxiv.org/pdf/2405.08448v1
Enhancing human feedbackReinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, rising popularity in offline alignment algorithms challenge the need for on-policy sampling in RLHF. Within the context of reward over-optimization, we start with an opening set of experiments that demonstrate the clear advantage of online methods over offline methods. This prompts us to investigate the causes to the performance discrepancy through a series of carefully designed experimental ablations. We show empirically that hypotheses such as offline data coverage and data quality by itself cannot convincingly explain the performance difference. We also find that while offline algorithms train policy to become good at pairwise classification, it is worse at generations; in the meantime the policies trained by online algorithms are good at generations while worse at pairwise classification. This hints at a unique interplay between discriminative and generative capabilities, which is greatly impacted by the sampling process. Lastly, we observe that the performance discrepancy persists for both contrastive and non-contrastive loss functions, and appears not to be addressed by simply scaling up policy networks. Taken together, our study sheds light on the pivotal role of on-policy sampling in AI alignment, and hints at certain fundamental challenges of offline alignment algorithms.
57
Google DeepMindHolistic Safety and Responsibility Evaluations of Advanced AI Models22-Apr-24https://deepmind.google/research/publications/78149/Safety evaluationsAI evaluation–the measurement of AI capabilities, behavior, and impact–is critical for safety. The field of safety evaluations however remains nascent. In the development of Google DeepMind’s Gemini models, we innovated on and applied a diverse set of approaches to safety evaluation. To contribute to the maturation of the field, we share here some aspects of our evolving approach and lessons learned. First, theoretical frameworks are invaluable to organize the breadth of risk domains, modalities, forms, metrics, and goals. Second, theory and practice each benefit from collaboration: clarifying goals, methods, and challenges, and facilitating the transfer of insights and work across the evaluation development pipeline. Third, methods, lessons, and institutions transfer across the range of concerns in responsibility and safety, including present harms and dangerous capabilities. Safety evaluations are a public good, and as such we need to rapidly advance the science of evals, the thickness of scientifically-grounded norms and standards, the innovativeness of the ecosystem, and the integration of these measures into company policy and public policy.
60
Google DeepMindEvaluating Frontier Models for Dangerous Capabilities20-Mar-24
https://arxiv.org/abs/2403.13793
Safety evaluationsTo understand the risks posed by a new AI system, we must understand what it
can and cannot do. Building on prior work, we introduce a programme of new
"dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our
evaluations cover four areas: (1) persuasion and deception; (2) cyber-security;
(3) self-proliferation; and (4) self-reasoning. We do not find evidence of
strong dangerous capabilities in the models we evaluated, but we flag early
warning signs. Our goal is to help advance a rigorous science of dangerous
capability evaluation, in preparation for future models.
63
Google DeepMind
Human Alignment of Large Language Models through Online Preference Optimisation
13-Mar-24
https://arxiv.org/pdf/2403.08635v1
Enhancing human feedbackEnsuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss with such a stream of data becomes then equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data such as DPO and SLiC on a summarisation task.
68
Google DeepMindUnderstanding Learning from Human Preferences11-Mar-24
https://deepmind.google/research/publications/54918/
Enhancing human feedbackThe prevalent deployment for learning from human preferences through reinforcement learning (RLHF) relies
on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimization (DPO) is proposed as an approach that bypass the second approximation and learn directly a policy from collected data without the reward modelling stage. However, DPO still heavily relies on the first approximation.
69
Google DeepMindAtP*: Efficient and scalable methods for localizing LLM behaviour to components4-Mar-24
https://deepmind.google/research/publications/68553/
Mechanistic interpretabilityCausal attribution of behaviour is a foundational problem in interpretability. Activation Patching is a method of directly computing these causal attributions, and is ubiquitous in mechanistic interpretability analyses. However, scaling it to many attributions requires a sweep with cost scales linear in the number of model components, which can be prohibitively expensive, involving millions to billions of forward passes in SoTA models. We propose to use Attribution Patching (AtP), a gradient-based approximation that runs in $O(1)$ passes, as a pre-filtering step to Activation Patching.
79
Google DeepMindA density estimation perspective on learning from pairwise human preferences26-Feb-24
https://deepmind.google/research/publications/64513/
Enhancing human feedbackLearning from human feedback (LHF)—and in particular learning from pairwise preferences—has recently become a crucial ingredient in training large language models (LLMs), and has been the subject of much research. Most recent works frame it as a reinforcement learning problem, where a reward function is learned from pairwise preference data and the LLM is treated as a policy which is adapted to maximize the rewards, often under additional regularization constraints. We propose an alternative interpretation which centers on the generative process for pairwise preferences and treats LHF as a density estimation problem. We provide theoretical and empirical results showing that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. Finally, we discuss and present findings on "annotator misspecification"—failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models—suggesting that approaches that learn from pairwise human preferences could have trouble learning from a population of annotators with diverse viewpoints.
99
Google DeepMindGeneralized Preference Optimization: A Unified Approach to Offline Alignment8-Feb-24
https://arxiv.org/pdf/2402.05749v2
Enhancing human feedbackOffline preference optimization allows fine-tuning large models directly from offline data, and has proved effective in recent alignment practices. We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions. GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases, while naturally introducing new variants. The GPO framework also sheds light on how offline algorithms enforce regularization, through the design of the convex function that defines the loss. Our analysis and experiments reveal the connections and subtle differences between the offline regularization and the KL divergence regularization intended by the canonical RLHF formulation. In a controlled setting akin to Gao et al 2023, we also show that different GPO variants achieve similar trade-offs between regularization and performance, though the optimal values of hyper-parameter might differ as predicted by theory. In all, our results present new algorithmic toolkits and empirical insights to alignment practitioners.
106
Google DeepMindRobust agents learn causal world models1-Feb-24
https://deepmind.google/research/publications/49666/
RobustnessIt has long been hypothesised that causal reasoning plays a fundamental role in robust and general intelligence. However, it is not known if agents must learn causal models in order to generalise under distributional shifts, or if other inductive biases are sufficient. We answer this question, showing that any agent capable of satisfying a regret bound under a large set of distributional shifts must have learned an approximate causal model of the data generating process, which converges to the true causal model for optimal agents. We discuss the implications of this result for several research areas including transfer learning and causal inference.
123
Google DeepMindChallenges with unsupervised LLM knowledge discovery15-Dec-23
https://deepmind.google/research/publications/66937/
Honest AIWe show that existing unsupervised methods on large language model (LLM) activations do not discover knowledge -- instead they seem to discover whatever feature of the activations is most prominent. The idea behind unsupervised knowledge elicitation is that knowledge satisfies a consistency structure, which can be used to discover knowledge. We first prove theoretically that arbitrary features (not just knowledge) satisfy the consistency structure of a particular leading unsupervised knowledge-elicitation method, contrast-consistent search (Burns et al., 2023). We then present a series of experiments showing settings in which unsupervised methods result in classifiers that do not predict knowledge, but instead predict a different prominent feature. We conclude that existing unsupervised methods for discovering latent knowledge are insufficient, and we contribute sanity checks to apply to evaluating future knowledge elicitation methods. Conceptually, we hypothesise that the identification issues explored here, e.g. distinguishing a model's knowledge from that of a simulated character's, will persist for future unsupervised methods.
137
Google DeepMindBenchmarking Robustness to Adversarial Image Obfuscations10-Dec-23
https://deepmind.google/research/publications/18457/
RobustnessAutomated content filtering and moderation is an important tool that allows online platforms to build striving user communities that facilitate cooperation and prevent abuse.
Unfortunately, resourceful actors try to bypass automated filters in a bid to post content that violate platform policies and codes of conduct.
To reach this goal, these malicious actors obfuscate policy violating content to prevent machine learning models from reaching the correct decision.
In this paper, we invite researchers to tackle this specific issue and present a new image benchmark.
This benchmark, based on ImageNet, simulates the type of obfuscations created by malicious actors.
It goes beyond ImageNet-C and ImageNet-C-Bar by proposing general, drastic, adversarial modifications that preserve the original content intent.
It aims to tackle a more common adversarial threat than the one considered by Lp-norm bounded adversaries.
Our hope is that this benchmark will encourage researchers to test their models and methods and try to find new approaches that are more robust to these obfuscations.
148
Google DeepMindHonesty Is the Best Policy: Defining and Mitigating AI Deception3-Dec-23
https://arxiv.org/pdf/2312.01350v1
Honest AIDeceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the goal of being evaluated as truthful). There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games. We introduce a formal definition of deception in structural causal games, grounded in the philosophy literature, and applicable to real-world machine learning systems. Several examples and results illustrate that our formal definition aligns with the philosophical and commonsense meaning of deception. Our main technical result is to provide graphical criteria for deception. We show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.
149
Google DeepMindRLHF and IIA: Perverse Incentives2-Dec-23
https://deepmind.google/research/publications/63806/
Enhancing human feedbackExisting algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA). The perverse incentives induced by IIA hinder innovations on query formats and learning algorithms.
150
Google DeepMindNash Learning from Human Feedback1-Dec-23https://arxiv.org/pdf/2312.00886v4Enhancing human feedbackReinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of a LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.
157
Google DeepMindScalable AI Safety via Doubly-Efficient Debate23-Nov-23
https://deepmind.google/research/publications/34920/
Enhancing human feedbackThe emergence of powerful pre-trained AI systems with super-human capabilities across a diverse and ever-increasing set of complex domains has raised a critical challenge for AI safety as tasks can become too complicated for humans to judge directly.
AI safety via debate [Irving et al. 2018] is a notable proposal in this direction with the goal of pitting the power of such AI models against each other until the (mis)-alignment identification problem is broken down into a manageable sub-task. While the promise of this approach is clear, the original framework was based on the assumption that the honest strategy
is able to simulate deterministic AI systems for an exponential number of steps, limiting its applicability. In this paper, we show how to address these challenges by designing a new set of debate protocols where the honest strategy can always succeed using a simulation of a polynomial number of steps, whilst being able to verify the alignment of stochastic AI systems, even when the dishonest strategy is allowed to use exponentially many simulation steps.
159
Google DeepMindFusion-Eval: Integrating Assistant Evaluators with LLMs15-Nov-23
https://arxiv.org/pdf/2311.09204v3
Enhancing human feedbackEvaluating natural language systems poses significant challenges, particularly in the realms of natural language understanding and high-level reasoning. In this paper, we introduce 'Fusion-Eval', an innovative approach that leverages Large Language Models (LLMs) to integrate insights from various assistant evaluators. The LLM is given the example to evaluate along with scores from the assistant evaluators. Each of these evaluators specializes in assessing distinct aspects of responses. Fusion-Eval achieves a 0.962 system-level Kendall-Tau correlation with humans on SummEval and a 0.744 turn-level Spearman correlation on TopicalChat, which is significantly higher than baseline methods. These results highlight Fusion-Eval's significant potential in the realm of natural language system evaluation.
165
Google DeepMind
Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?
8-Nov-23
https://arxiv.org/pdf/2311.07587v2
RobustnessWe introduce and study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment. This problem is comprised of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete. Even in the simple setting of 1-digit addition problems, it is easy to find adversarial prompts that make all tested models (including PaLM2, GPT4, Claude2) misbehave, and even to steer models to a particular wrong answer. We additionally provide a simple algorithm for finding successful attacks by querying those same models, which we name "prompt inversion rejection sampling" (PIRS). We finally show that models can be partially hardened against these attacks via reinforcement learning and via agentic constitutional loops. However, we were not able to make a language model fully robust against adversarial arithmetic attacks.
178
Google DeepMindSociotechnical Safety Evaluation of Generative AI Systems18-Oct-23https://arxiv.org/pdf/2310.11986v2Safety evaluationsGenerative AI systems produce a range of risks. To ensure the safety of generative AI systems, these risks must be evaluated. In this paper, we make two main contributions toward establishing such evaluations. First, we propose a three-layered framework that takes a structured, sociotechnical approach to evaluating these risks. This framework encompasses capability evaluations, which are the main current approach to safety evaluation. It then reaches further by building on system safety principles, particularly the insight that context determines whether a given capability may cause harm. To account for relevant context, our framework adds human interaction and systemic impacts as additional layers of evaluation. Second, we survey the current state of safety evaluation of generative AI systems and create a repository of existing evaluations. Three salient evaluation gaps emerge from this analysis. We propose ways forward to closing these gaps, outlining practical steps as well as roles and responsibilities for different actors. Sociotechnical safety evaluation is a tractable approach to the robust and comprehensive safety evaluation of generative AI systems.
192
Google DeepMindTracr: Compiled Transformers as a Laboratory for Interpretability21-Sep-23
https://deepmind.google/research/publications/22295/
Mechanistic interpretabilityWe show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as ground-truth for evaluating interpretability methods. Commonly, because the "programs" learned by transformers are unknown it is unclear whether an interpretation succeeded. We demonstrate our approach by implementing and examining programs including computing token frequencies, sorting, and parenthesis checking. We provide an open-source implementation of Tracr at https://github.com/google-deepmind/tracr.
197
Google DeepMind
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
1-Sep-23
https://arxiv.org/pdf/2309.00267v3
Enhancing human feedbackReinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
204
Google DeepMind
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchil
18-Jul-23
https://arxiv.org/abs/2307.09458
Mechanistic interpretability\emph{Circuit analysis} is a promising technique for understanding the
internal mechanisms of language models. However, existing analyses are done in
small models far from the state of the art. To address this, we present a case
study of circuit analysis in the 70B Chinchilla model, aiming to test the
scalability of circuit analysis. In particular, we study multiple-choice
question answering, and investigate Chinchilla's capability to identify the
correct answer \emph{label} given knowledge of the correct answer \emph{text}.
We find that the existing techniques of logit attribution, attention pattern
visualization, and activation patching naturally scale to Chinchilla, allowing
us to identify and categorize a small set of `output nodes' (attention heads
and MLPs).
We further study the `correct letter' category of attention heads aiming to
understand the semantics of their features, with mixed results. For normal
multiple-choice question answers, we significantly compress the query, key and
value subspaces of the head without loss of performance when operating on the
answer labels for multiple-choice questions, and we show that the query and key
subspaces represent an `Nth item in an enumeration' feature to at least some
extent. However, when we attempt to use this explanation to understand the
heads' behaviour on a more general distribution including randomized answer
labels, we find that it is only a partial explanation, suggesting there is more
to learn about the operation of `correct letter' heads on multiple choice
question answering.
205
Google DeepMindEvaluating AI systems under uncertain ground truth: a case study in dermatology5-Jul-23
https://arxiv.org/pdf/2307.02191v1
Enhancing human feedbackFor safety, AI systems in health undergo thorough evaluations before deployment, validating their predictions against a ground truth that is assumed certain. However, this is actually not the case and the ground truth may be uncertain. Unfortunately, this is largely ignored in standard evaluation of AI models but can have severe consequences such as overestimating the future performance. To avoid this, we measure the effects of ground truth uncertainty, which we assume decomposes into two main components: annotation uncertainty which stems from the lack of reliable annotations, and inherent uncertainty due to limited observational information. This ground truth uncertainty is ignored when estimating the ground truth by deterministically aggregating annotations, e.g., by majority voting or averaging. In contrast, we propose a framework where aggregation is done using a statistical model. Specifically, we frame aggregation of annotations as posterior inference of so-called plausibilities, representing distributions over classes in a classification setting, subject to a hyper-parameter encoding annotator reliability. Based on this model, we propose a metric for measuring annotation uncertainty and provide uncertainty-adjusted metrics for performance evaluation. We present a case study applying our framework to skin condition classification from images where annotations are provided in the form of differential diagnoses. The deterministic adjudication process called inverse rank normalization (IRN) from previous work ignores ground truth uncertainty in evaluation. Instead, we present two alternative statistical models: a probabilistic version of IRN and a Plackett-Luce-based model. We find that a large portion of the dataset exhibits significant ground truth uncertainty and standard IRN-based evaluation severely over-estimates performance without providing uncertainty estimates.
208
Google DeepMindAre aligned neural networks adversarially aligned?26-Jun-23https://arxiv.org/pdf/2306.15447v2RobustnessLarge language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs. However the recent trend in large-scale ML models is multimodal models that allow users to provide images that influence the text that is generated. We show these models can be easily attacked, i.e., induced to perform arbitrary un-aligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
210
Google DeepMind
Doing the right thing for the right reason: Evaluating artificial moral cognition by probing cost insensitivity
29-May-23
https://arxiv.org/pdf/2305.18269v1
Safety evaluationsIs it possible to evaluate the moral cognition of complex artificial agents? In this work, we take a look at one aspect of morality: `doing the right thing for the right reasons.' We propose a behavior-based analysis of artificial moral cognition which could also be applied to humans to facilitate like-for-like comparison. Morally-motivated behavior should persist despite mounting cost; by measuring an agent's sensitivity to this cost, we gain deeper insight into underlying motivations. We apply this evaluation to a particular set of deep reinforcement learning agents, trained by memory-based meta-reinforcement learning. Our results indicate that agents trained with a reward function that includes other-regarding preferences perform helping behavior in a way that is less sensitive to increasing cost than agents trained with more self-interested preferences.
211
Google DeepMindTraining Socially Aligned Language Models on Simulated Social Interactions26-May-23https://arxiv.org/pdf/2305.16960v3Enhancing human feedbackSocial alignment in AI systems aims to ensure that these models behave according to established societal values. However, unlike humans, who derive consensus on value judgments through social interaction, current language models (LMs) are trained to rigidly replicate their training corpus in isolation, leading to subpar generalization in unfamiliar scenarios and vulnerability to adversarial attacks. This work presents a novel training paradigm that permits LMs to learn from simulated social interactions. In comparison to existing methodologies, our approach is considerably more scalable and efficient, demonstrating superior performance in alignment benchmarks and human evaluations. This paradigm shift in the training of LMs brings us a step closer to developing AI systems that can robustly and accurately reflect societal norms and values.
212
Google DeepMindModel evaluation for extreme risks24-May-23
https://arxiv.org/pdf/2305.15324v2
Safety evaluationsCurrent approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.
213
Google DeepMindPower-seeking can be probable and predictive for trained agents13-Apr-23https://arxiv.org/pdf/2304.06528v1Power-seeking tendenciesPower-seeking behavior is a key source of risk from advanced AI, but our theoretical understanding of this phenomenon is relatively limited. Building on existing theoretical results demonstrating power-seeking incentives for most reward functions, we investigate how the training process affects power-seeking incentives and show that they are still likely to hold for trained agents under some simplifying assumptions. We formally define the training-compatible goal set (the set of goals consistent with the training rewards) and assume that the trained agent learns a goal from this set. In a setting where the trained agent faces a choice to shut down or avoid shutdown in a new situation, we prove that the agent is likely to avoid shutdown. Thus, we show that power-seeking incentives can be probable (likely to arise for trained agents) and predictive (allowing us to predict undesirable behavior in new situations).
217
Google DeepMindHindering Adversarial Attacks with Implicit Neural Representations22-Oct-22
https://proceedings.mlr.press/v162/rusu22a/rusu22a
RobustnessAbstract not found
218
Google DeepMindHow undesired goals can arise with correct rewards7-Oct-22
https://deepmind.google/discover/blog/how-undesired-goals-can-arise-with-correct-rewards/
RobustnessThe field of AI alignment is concerned with AI systems that pursue unintended goals. One commonly studied mechanism by which an unintended goal might arise is specification gaming, in which the designer-provided specification is flawed in a way that the designers did not foresee. However, an AI system may pursue an undesired goal even when the specification is correct, in the case of goal misgeneralization. Goal misgeneralization is a specific form of robustness failure for learning algorithms in which the learned program competently pursues an undesired goal that leads to good performance in training situations but bad performance in novel test situations. We demonstrate that goal misgeneralization can occur in practical systems by providing several examples in deep learning systems across a variety of domains. Extrapolating forward to more capable systems, we provide hypotheticals that illustrate how goal misgeneralization could lead to catastrophic risk. We suggest several research directions that could reduce the risk of goal misgeneralization for future systems.
219
Google DeepMindBuilding safer dialogue agents22-Sep-22
https://deepmind.google/discover/blog/building-safer-dialogue-agents/
Enhancing human feedbackWe present Sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. We use reinforcement learning from human feedback to train our models with two new additions to help human raters judge agent behaviour. First, to make our agent more helpful and harmless, we break down the requirements for good dialogue into natural language rules the agent should follow, and ask raters about each rule separately. We demonstrate that this breakdown enables us to collect more targeted human judgements of agent behaviour and allows for more efficient rule-conditional reward models. Second, our agent provides evidence from sources supporting factual claims when collecting preference judgements over model statements. For factual questions, evidence provided by Sparrow supports the sampled response 78% of the time. Sparrow is preferred more often than baselines while being more resilient to adversarial probing by humans, violating our rules only 8% of the time when probed. Finally, we conduct extensive analyses showing that though our model learns to follow our rules it can exhibit distributional biases.
221
Google DeepMindDiscovering when an agent is present in a system18-Aug-22
https://deepmind.google/discover/blog/discovering-when-an-agent-is-present-in-a-system/
Safety by designCausal models of agents have been used to analyse the safety aspects of machine learning systems. But identifying agents is non-trivial -- often the causal model is just assumed by the modeler without much justification -- and modelling failures can lead to mistakes in the safety analysis. This paper proposes the first formal causal definition of agents -- roughly that agents are systems that would adapt their policy if their actions influenced the world in a different way. From this we derive the first causal discovery algorithm for discovering agents from empirical data, and give algorithms for translating between causal models and game-theoretic influence diagrams. We demonstrate our approach by resolving some previous confusions caused by incorrect causal modelling of agents.
222
Google DeepMindRobustness of Epinets against Distributional Shifts1-Jul-22
https://arxiv.org/pdf/2207.00137v1
RobustnessRecent work introduced the epinet as a new approach to uncertainty modeling in deep learning. An epinet is a small neural network added to traditional neural networks, which, together, can produce predictive distributions. In particular, using an epinet can greatly improve the quality of joint predictions across multiple inputs, a measure of how well a neural network knows what it does not know. In this paper, we examine whether epinets can offer similar advantages under distributional shifts. We find that, across ImageNet-A/O/C, epinets generally improve robustness metrics. Moreover, these improvements are more significant than those afforded by even very large ensembles at orders of magnitude lower computational costs. However, these improvements are relatively small compared to the outstanding issues in distributionally-robust deep learning. Epinets may be a useful tool in the toolbox, but they are far from the complete solution.
224
Google DeepMindObjective Robustness in Deep Reinforcement Learning23-Jun-22
https://arxiv.org/pdf/2105.14111
RobustnessWe study goal misgeneralization, a type of out-of-distribution generalization
failure in reinforcement learning (RL). Goal misgeneralization failures occur
when an RL agent retains its capabilities out-of-distribution yet pursues the
wrong goal. For instance, an agent might continue to competently avoid
obstacles, but navigate to the wrong place. In contrast, previous works have
typically focused on capability generalization failures, where an agent fails
to do anything sensible at test time. We formalize this distinction between
capability and goal generalization, provide the first empirical demonstrations
of goal misgeneralization, and present a partial characterization of its
causes.
225
Google DeepMindPath-Specific Objectives for Safer Agent Incentives21-Apr-22https://arxiv.org/pdf/2204.10018v1Safety by designWe present a general framework for training safe agents whose naive incentives are unsafe. As an example, manipulative or deceptive behaviour can improve rewards but should be avoided. Most approaches fail here: agents maximize expected return by any means necessary. We formally describe settings with 'delicate' parts of the state which should not be used as a means to an end. We then train agents to maximize the causal effect of actions on the expected return which is not mediated by the delicate parts of state, using Causal Influence Diagram analysis. The resulting agents have no incentive to control the delicate state. We further show how our framework unifies and generalizes existing proposals.
226
Google DeepMindEvaluating the Adversarial Robustness of Adaptive Test-time Defenses28-Feb-22
https://arxiv.org/abs/2202.13711
RobustnessAdaptive defenses, which optimize at test time, promise to improve
adversarial robustness. We categorize such adaptive test-time defenses, explain
their potential benefits and drawbacks, and evaluate a representative variety
of the latest adaptive defenses for image classification. Unfortunately, none
significantly improve upon static defenses when subjected to our careful case
study evaluation. Some even weaken the underlying static model while
simultaneously increasing inference computation. While these results are
disappointing, we still believe that adaptive test-time defenses are a
promising avenue of research and, as such, we provide recommendations for their
thorough evaluation. We extend the checklist of Carlini et al. (2019) by
providing concrete steps specific to adaptive defenses.
229
Google DeepMindRed Teaming Language Models with Language Models7-Feb-22
https://deepmind.google/discover/blog/red-teaming-language-models-with-language-models/
Safety evaluationsLanguage Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases ("red teaming") using another LM. We evaluate the target LM's replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot's own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation. Overall, LM-based red teaming is one promising tool (among many needed) for finding and fixing diverse, undesirable LM behaviors before impacting users.
230
Google DeepMindSafe Deep RL in 3D Environments using Human Feedback20-Jan-22https://arxiv.org/pdf/2201.08102v2Enhancing human feedbackAgents should avoid unsafe behaviour during both training and deployment. This typically requires a simulator and a procedural specification of unsafe behaviour. Unfortunately, a simulator is not always available, and procedurally specifying constraints can be difficult or impossible for many real-world tasks. A recently introduced technique, ReQueST, aims to solve this problem by learning a neural simulator of the environment from safe human trajectories, then using the learned simulator to efficiently learn a reward model from human feedback. However, it is yet unknown whether this approach is feasible in complex 3D environments with feedback obtained from real humans - whether sufficient pixel-based neural simulator quality can be achieved, and whether the human data requirements are viable in terms of both quantity and quality. In this paper we answer this question in the affirmative, using ReQueST to train an agent to perform a 3D first-person object collection task using data entirely from human contractors. We show that the resulting agent exhibits an order of magnitude reduction in unsafe behaviour compared to standard reinforcement learning.
452
OpenAIImproving Model Safety Behavior with Rule-Based Rewards24-Jul-24https://openai.com/index/improving-model-safety-behavior-with-rule-based-rewards/Enhancing human feedbackReinforcement learning based fine-tuning of large language models (LLMs) on human preferences has been shown to enhance both their capabilities and safety behavior. However, in cases related to safety, without precise instructions to human annotators, the data collected may cause the model to become overly cautious, or to respond in an undesirable style, such as being judgmental. Additionally, as model capabilities and usage patterns evolve, there may be a costly need to add or relabel data to modify safety behavior. We propose a novel preference modeling approach that utilizes AI feedback and only requires a small amount of human data. Our method, Rule Based Rewards (RBR), uses a collection of rules for desired or undesired behaviors (e.g. refusals should not be judgmental) along with a LLMgrader. In contrast to prior methods using AI feedback, our method uses f ine-grained, composable, LLM-graded few-shot prompts as reward directly in RL training, resulting in greater control, accuracy and ease of updating. We show that RBRs are an effective training method, achieving an F1 score of 97.1, compared to a human-feedback baseline of 91.7, resulting in much higher safety-behavior accuracy through better balancing usefulness and safety.
453
OpenAIProver-Verifier Games improve legibility of language model outputs17-Jul-24https://openai.com/index/prover-verifier-games-improve-legibility/Enhancing human feedbackOne way to increase confidence in the outputs of Large Language Models (LLMs) is to support them with reasoning that is clear and easy to check -- a property we call legibility. We study legibility in the context of solving grade-school math problems and show that optimizing chain-of-thought solutions only for answer correctness can make them less legible. To mitigate the loss in legibility, we propose a training algorithm inspired by Prover-Verifier Game from Anil et al. (2021). Our algorithm iteratively trains small verifiers to predict solution correctness, "helpful" provers to produce correct solutions that the verifier accepts, and "sneaky" provers to produce incorrect solutions that fool the verifier. We find that the helpful prover's accuracy and the verifier's robustness to adversarial attacks increase over the course of training. Furthermore, we show that legibility training transfers to time-constrained humans tasked with verifying solution correctness. Over course of LLM training human accuracy increases when checking the helpful prover's solutions, and decreases when checking the sneaky prover's solutions. Hence, training for checkability by small verifiers is a plausible technique for increasing output legibility. Our results suggest legibility training against small verifiers as a practical avenue for increasing legibility of large LLMs to humans, and thus could help with alignment of superhuman models.
454
OpenAIExtracting Concepts from GPT-46-Jun-24https://openai.com/index/extracting-concepts-from-gpt-4/Mechanistic interpretabilitySparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release training code and autoencoders for open-source models, as well as a visualizer.
455
OpenAIThe Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions9-Apr-24https://openai.com/index/the-instruction-hierarchy/RobustnessToday's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.
457
OpenAIWeak-to-strong generalization14-Dec-23
https://openai.com/research/weak-to-strong-generalization
Enhancing human feedbackWidely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
462
OpenAILanguage models can explain neurons in language models9-May-23
https://openai.com/research/language-models-can-explain-neurons-in-language-models
Mechanistic interpretabilityWe use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2.

465
OpenAIScaling laws for reward model overoptimization19-Oct-22
https://openai.com/research/scaling-laws-for-reward-model-overoptimization
Enhancing human feedbackIn reinforcement learning from human feedback, it is common to optimize
against a reward model trained to predict human preferences. Because the reward
model is an imperfect proxy, optimizing its value too much can hinder ground
truth performance, in accordance with Goodhart's law. This effect has been
frequently observed, but not carefully measured due to the expense of
collecting human preference data. In this work, we use a synthetic setup in
which a fixed "gold-standard" reward model plays the role of humans, providing
labels used to train a proxy reward model. We study how the gold reward model
score changes as we optimize against the proxy reward model using either
reinforcement learning or best-of-$n$ sampling. We find that this relationship
follows a different functional form depending on the method of optimization,
and that in both cases its coefficients scale smoothly with the number of
reward model parameters. We also study the effect on this relationship of the
size of the reward model dataset, the number of reward model and policy
parameters, and the coefficient of the KL penalty added to the reward in the
reinforcement learning setup. We explore the implications of these empirical
results for theoretical considerations in AI alignment.
466
OpenAIThe Alignment Problem from a Deep Learning Perspective30-Aug-22https://arxiv.org/pdf/2209.00626v6Power-seeking tendenciesIn coming years or decades, artificial general intelligence (AGI) may surpass human capabilities at many critical tasks. We argue that, without substantial effort to prevent it, AGIs could learn to pursue goals that are in conflict (i.e. misaligned) with human interests. If trained like today's most capable models, AGIs could learn to act deceptively to receive higher reward, learn misaligned internally-represented goals which generalize beyond their fine-tuning distributions, and pursue those goals using power-seeking strategies. We review emerging evidence for these properties. AGIs with these properties would be difficult to align and may appear aligned even when they are not. Finally, we briefly outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and we review research directions aimed at preventing this outcome.
467
OpenAIA hazard analysis framework for code synthesis large language models25-Jul-22
https://openai.com/research/a-hazard-analysis-framework-for-code-synthesis-large-language-models
Safety evaluationsCodex, a large language model (LLM) trained on a variety of codebases,
exceeds the previous state of the art in its capacity to synthesize and
generate code. Although Codex provides a plethora of benefits, models that may
generate code on such scale have significant limitations, alignment problems,
the potential to be misused, and the possibility to increase the rate of
progress in technical fields that may themselves have destabilizing impacts or
have misuse potential. Yet such safety impacts are not yet known or remain to
be explored. In this paper, we outline a hazard analysis framework constructed
at OpenAI to uncover hazards or safety risks that the deployment of models like
Codex may impose technically, socially, politically, and economically. The
analysis is informed by a novel evaluation framework that determines the
capacity of advanced code generation techniques against the complexity and
expressivity of specification prompts, and their capability to understand and
execute them relative to human ability.
468
OpenAIAI-written critiques help humans notice flaws13-Jun-22
https://openai.com/research/critiques
Enhancing human feedbackWe fine-tune large language models to write natural language critiques
(natural language critical comments) using behavioral cloning. On a topic-based
summarization task, critiques written by our models help humans find flaws in
summaries that they would have otherwise missed. Our models help find naturally
occurring flaws in both model and human written summaries, and intentional
flaws in summaries written by humans to be deliberately misleading. We study
scaling properties of critiquing with both topic-based summarization and
synthetic tasks. Larger models write more helpful critiques, and on most tasks,
are better at self-critiquing, despite having harder-to-critique outputs.
Larger models can also integrate their own self-critiques as feedback, refining
their own summaries into better ones. Finally, we motivate and introduce a
framework for comparing critiquing ability to generation and discrimination
ability. Our measurements suggest that even large models may still have
relevant knowledge they cannot or do not articulate as critiques. These results
are a proof of concept for using AI-assisted human feedback to scale the
supervision of machine learning systems to tasks that are difficult for humans
to evaluate directly. We release our training datasets, as well as samples from
our critique assistance experiments.
470
OpenAIAligning language models to follow instructions27-Jan-22
https://openai.com/research/instruction-following
Enhancing human feedbackMaking language models bigger does not inherently make them better at
following a user's intent. For example, large language models can generate
outputs that are untruthful, toxic, or simply not helpful to the user. In other
words, these models are not aligned with their users. In this paper, we show an
avenue for aligning language models with user intent on a wide range of tasks
by fine-tuning with human feedback. Starting with a set of labeler-written
prompts and prompts submitted through the OpenAI API, we collect a dataset of
labeler demonstrations of the desired model behavior, which we use to fine-tune
GPT-3 using supervised learning. We then collect a dataset of rankings of model
outputs, which we use to further fine-tune this supervised model using
reinforcement learning from human feedback. We call the resulting models
InstructGPT. In human evaluations on our prompt distribution, outputs from the
1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3,
despite having 100x fewer parameters. Moreover, InstructGPT models show
improvements in truthfulness and reductions in toxic output generation while
having minimal performance regressions on public NLP datasets. Even though
InstructGPT still makes simple mistakes, our results show that fine-tuning with
human feedback is a promising direction for aligning language models with human
intent.
471
472
473
474
475
476
OpenAIScaling and evaluating sparse autoencoders6-Jun-24
https://arxiv.org/pdf/2406.04093v1
[Duplicate] Mechanistic interpretability
Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release training code and autoencoders for open-source models, as well as a visualizer.
477
OpenAISelf-critiquing models for assisting human evaluators12-Jun-22
https://arxiv.org/pdf/2206.05802v2
[Duplicate] Enhancing human feedback
We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning. On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed. Our models help find naturally occurring flaws in both model and human written summaries, and intentional flaws in summaries written by humans to be deliberately misleading. We study scaling properties of critiquing with both topic-based summarization and synthetic tasks. Larger models write more helpful critiques, and on most tasks, are better at self-critiquing, despite having harder-to-critique outputs. Larger models can also integrate their own self-critiques as feedback, refining their own summaries into better ones. Finally, we motivate and introduce a framework for comparing critiquing ability to generation and discrimination ability. Our measurements suggest that even large models may still have relevant knowledge they cannot or do not articulate as critiques. These results are a proof of concept for using AI-assisted human feedback to scale the supervision of machine learning systems to tasks that are difficult for humans to evaluate directly. We release our training datasets, as well as samples from our critique assistance experiments.
478
479
480
481
482
483
484
485
486
487
488
489
490
491