Alignment Newsletter Database (public)
The version of the browser you are using is no longer supported. Please upgrade to a supported browser.Dismiss

View only
CategoryTitleAuthorsH/TEmailSummarizerSummaryMy opinionPrerequisitesRead more
Adversarial examplesHighlightMotivating the Rules of the Game for Adversarial Example ResearchJustin Gilmer, George E. Dahl et alAN #28Dan HIn this position paper, the authors argue that many of the threat models which motivate adversarial examples are unrealistic. They enumerate various previously proposed threat models, and then they show their limitations or detachment from reality. For example, it is common to assume that an adversary must create an imperceptible perturbation to an example, but often attackers can input whatever they please. In fact, in some settings an attacker can provide an input from the clean test set that is misclassified. Also, they argue that adversarial robustness defenses which degrade clean test set error are likely to make systems less secure since benign or nonadversarial inputs are vastly more common. They recommend that future papers motivated by adversarial examples take care to define the threat model realistically. In addition, they encourage researchers to establish “content-preserving” adversarial attacks (as opposed to “imperceptible” l_p attacks) and improve robustness to unseen input transformations.This is my favorite paper of the year as it handily counteracts much of the media coverage and research lab PR purporting ``doom'' from adversarial examples. While there are some scenarios in which imperceptible perturbations may be a motivation---consider user-generated privacy-creating perturbations to Facebook photos which stupefy face detection algorithms---much of the current adversarial robustness research optimizing small l_p ball robustness can be thought of as tackling a simplified subproblem before moving to a more realistic setting. Because of this paper, new tasks such as [Unrestricted Adversarial Examples]( ([AN #24]( take an appropriate step toward increasing realism without appearing to make the problem too hard.
Adversarial examplesHighlightIntroducing the Unrestricted Adversarial Examples ChallengeTom B. Brown et alAN #24There's a new adversarial examples contest, after the one from NIPS 2017. The goal of this contest is to figure out how to create a model that never confidently makes a mistake on a very simple task, even in the presence of a powerful adversary. This leads to many differences from the previous contest. The task is a lot simpler -- classifiers only need to distinguish between bicycles and birds, with an option of saying "ambiguous". Instead of using the L-infinity norm ball to define what an adversarial example is, attackers are allowed to supply any image whatsoever, as long as a team of human evaluators agrees unanimously on the classification of the image. The contest has no time bound, and will run until some defense survives for 90 days without being broken even once. A defense is not broken if it says "ambiguous" on an adversarial example. Any submitted defense will be published, which means that attackers can specialize their attacks to that specific model (i.e. it is white box).I really like this contest format, it seems like it's actually answering the question we care about, for a simple task. If I were designing a defense, the first thing I'd aim for would be to get a lot of training data, ideally from different distributions in the real world, but data augmentation techniques may also be necessary, especially for eg. images of a bicycle against an unrealistic textured background. The second thing would be to shrink the size of the model, to make it more likely that it generalizes better (in accordance with Occam's razor or the minimum description length principle). After that I'd think about the defenses proposed in the literature. I'm not sure how the verification-based approaches will work, since they are intrinsically tied to the L-infinity norm ball definition of adversarial examples, or something similar -- you can't include the human evaluators in your specification of what you want to verify.
Adversarial examplesHighlightAdversarial Attacks and Defences CompetitionAlexey Kurakin et alAN #1This is a report on a competition held at NIPS 2017 for the best adversarial attacks and defences. It includes a summary of the field and then shows the results from the competition.I'm not very familiar with the literature on adversarial examples and so I found this very useful as an overview of the field, especially since it talks about the advantages and disadvantages of different methods, which are hard to find by reading individual papers. The actual competition results are also quite interesting -- they find that the best attacks and defences are both quite successful on average, but have very bad worst-case performance (that is, the best defence is still very weak against at least one attack, and the best attack fails to attack at least one defence). Overall, this paints a bleak picture for defence, at least if the attacker has access to enough compute to actually try out different attack methods, and has a way of verifying whether the attacks succeed.
Adversarial examplesOn the Geometry of Adversarial ExamplesMarc Khoury et alAN #36This paper analyzes adversarial examples based off a key idea: even if the data of interest forms a low-dimensional manifold, as we often assume, the ϵ-tube _around_ the manifold is still high-dimensional, and so accuracy in an ϵ-ball around true data points will be hard to learn.

For a given L_p norm, we can define the optimal decision boundary to be the one that maximizes the margin from the true data manifold. If there exists some classifier that is adversarially robust, then the optimal decision boundary is as well. Their first result is that the optimal decision boundary can change dramatically if you change p. In particular, for concentric spheres, the optimal L_inf decision boundary provides an L_2 robustness guarantee √d times smaller than the optimal L_2 decision boundary, where d is the dimensionality of the input. This explains why a classifier that is adversarially trained on L_inf adversarial examples does so poorly on L_2 adversarial examples.

I'm not sure I understand the point of the next section, but I'll give it a try. They show that a nearest neighbors classifier can achieve perfect robustness if the underlying manifold is sampled sufficiently densely (requiring samples exponential in k, the dimensionality of the manifold). However, a learning algorithm with a particular property that they formalize would require exponentially more samples in at least some cases in order to have the same guarantee. I don't know why they chose the particular property they did -- my best guess is that the property is meant to represent what we get when we train a neural net on L_p adversarial examples. If so, then their theorem suggests that we would need exponentially more training points to achieve perfect robustness with adversarial training compared to a nearest neighbor classifier.

They next turn to the fact that the ϵ-tube around the manifold is d-dimensional instead of k-dimensional. If we consider ϵ-balls around the training set X, this covers a very small fraction of the ϵ-tube, approaching 0 as d becomes much larger than k, even if the training set X covers the k-dimensional manifold sufficiently well.

Another issue is that if we require adversarial robustness, then we severely restrict the number of possible decision boundaries, and so we may need significantly more expressive models to get one of these decision boundaries. In particular, since feedforward neural nets with Relu activations have "piecewise linear" decision boundaries (in quotes because I might be using the term incorrectly), it is hard for them to separate concentric spheres. Suppose that the spheres are separated by a distance d. Then for accuracy on the manifold, we only need the decision boundary to lie entirely in the shell of width d. However, for ϵ-tube adversarial robustness, the decision boundary must lie in a shell of width d - 2ϵ. They prove a lower bound on the number of linear regions for the decision boundary that grows as τ^(-d), where τ is the width of the shell, suggesting that adversarial robustness would require more parameters in the model.

Their experiments show that for simple learning problems (spheres and planes), adversarial examples tend to be in directions orthogonal to the manifold. In addition, if the true manifold has high codimension, then the learned model has poor robustness.
I think this paper has given me a significantly better understanding of how L_p norm balls work in high dimensions. I'm more fuzzy on how this applies to adversarial examples, in the sense of any confident misclassification by the model on an example that humans agree is obvious. Should we be giving up on L_p robustness since it forms a d-dimensional manifold, whereas we can only hope to learn the smaller k-dimensional manifold? Surely though a small enough perturbation shouldn't change anything? On the other hand, even humans have _some_ decision boundary, and the points near the decision boundary have some small perturbation which would change their classification (though possibly to "I don't know" rather than some other class).

There is a phenomenon where if you train on L_inf adversarial examples, the resulting classifier fails on L_2 adversarial examples, which has previously been described as "overfitting to L_inf". The authors interpret their first theorem as contradicting this statement, since the optimal decision boundaries are very different for L_inf and L_2. I don't see this as a contradiction. The L_p norms are simply a method of label propagation, which augments the set of data points for which we know labels. Ultimately, we want the classifier to reproduce the labels that we would assign to data points, and L_p propagation captures some of that. So, we can think of there as being many different ways that we can augment the set of training points until it matches human classification, and the L_p norm balls are such methods. Then an algorithm is more robust as it works with more of these augmentation methods. Simply doing L_inf training means that by default the learned model only works on one of the methods (L_inf norm balls) and not all of them as we wanted, and we can think of this as "overfitting" to the imperfect L_inf notion of adversarial robustness. The meaning of "overfitting" here is that the learned model is too optimized for L_inf, at the cost of other notions of robustness like L_2 -- and their theorem says basically the same thing, that optimizing for L_inf comes at the cost of L_2 robustness.
Adversarial examplesA Geometric Perspective on the Transferability of Adversarial DirectionsZachary Charles et alAN #34
Adversarial examplesTowards the first adversarially robust neural network model on MNISTLukas Schott, Jonas Rauber et alAN #27Dan HThis recent pre-print claims to make MNIST classifiers more adversarially robust to different L-p perturbations, while the previous paper only worked for L-infinity perturbations. The basic building block in their approach is a variational autoencoder, one for each MNIST class. Each variational autoencoder computes the likelihood of the input sample, and this information is used for classification. The authors also demonstrate that binarizing MNIST images can serve as strong defense against some perturbations. They evaluate against strong attacks and not just the fast gradient sign method.This paper has generated considerable excitement among my peers. Yet inference time with this approach is approximately 100,000 times that of normal inference (10^4 samples per VAE * 10 VAEs). Also unusual is that the L-infinity "latent descent attack" result is missing. It is not clear why training a single VAE does not work. Also, could results improve by adversarially training the VAEs? As with all defense papers, it is prudent to wait for third-party reimplementations and analysis, but the range of attacks they consider is certainly thorough.
Adversarial examplesTowards Deep Learning Models Resistant to Adversarial AttacksAleksander Madry et alAN #27Dan HMadry et al.'s paper is a seminal work which shows that some neural networks can attain more adversarial robustness with a well-designed adversarial training procedure. The key idea is to phrase the adversarial defense problem as minimizing the expected result of the adversarial attack problem, which is maximizing the loss on an input training point when the adversary is allowed to perturb the point anywhere within an L-infinity norm ball. They also start the gradient descent from a random point in the norm ball. Then, given this attack, to optimize the adversarial defense problem, we simply do adversarial training. When trained long enough, some networks will attain more adversarial robustness.It is notable that this paper has survived third-party security analysis, so this is a solid contribution. This contribution is limited by the fact that its improvements are limited to L-infinity adversarial perturbations on small images, as [follow-up work]( has shown.
Adversarial examplesMotivating the Rules of the Game for Adversarial Example ResearchJustin Gilmer, George E. Dahl et alDaniel FilanAN #19
Adversarial examplesA learning and masking approach to secure learningLinh Nguyen et alN/AOne way to view the problem of adversarial examples is that adversarial attacks map "good" clean data points that are classified correctly into a nearby "bad" space that is low probability and so is misclassified. This suggests that in order to attack a model, we can use a neural net to _learn_ a transformation from good data points to bad ones. The loss function is easy -- one term encourages similarity to the original data point, and the other term encourages the new data point to have a different class label. Then, for any new input data point, we can simply feed it through the neural net to get an adversarial example.

Similarly, in order to defend a model, we can learn a neural net transformation that maps bad data points to good ones. The loss function continues to encourage similarity between the data points, but now encourages that the new data point have the correct label. Note that we need to use some attack algorithm in order to generate the bad data points that are used to train the defending neural net.
Ultimately, in the defense proposed here, the information on how to be more robust comes from the "bad" data points that are used to train the neural net. It's not clear why this would outperform adversarial training, where we train the original classifier on the "bad" data points. In fact, if the best way to deal with adversarial examples is to transform them to regular examples, then we could simply use adversarial training with a more expressive neural net, and it could learn this transformation.
Adversarial examplesAdversarial Vulnerability of Neural Networks Increases With Input DimensionCarl-Johann Simon-Gabriel et alAN #36The key idea of this paper is that imperceptible adversarial vulnerability happens when small changes in the input lead to large changes in the output, suggesting that the gradient is large. They first recommend choosing ϵ_p to be proportional to d^(1/p). Intuitively, this is because larger values of p behave more like maxing instead of summing, and so using the same value of ϵ across values of p would lead to more points being considered for larger p. They show a link between adversarial robustness and regularization, which makes sense since both of these techniques aim for better generalization.

Their main point is that the norm of the gradient increases with the input dimension d. In particular, a typical initialization scheme will set the variance of the weights to be inversely proportional to d, which means the absolute value of each weight is inversely proportional to √d. For a single-layer neural net (that is, a perceptron), the gradient is exactly the weights. For L_inf adversarial robustness, the relevant norm for the gradient is the L_1 norm. This gives the sum of the d weights, which will be proportional to √d. For L_p adversarial robustness, the corresponding gradient is L_q with q larger than 1, which decreases the size of the gradient. However, this is exactly offset by the increase in the size of ϵ_p that they proposed. Thus, in this simple case the adversarial vulnerability increases with input dimension. They then prove theorems that show that this generalizes to other neural nets, including CNNS (albeit still only at initialization, not after training). They also perform experiments showing that their result also holds after training.
I suspect that there is some sort of connection between the explanation given in this paper and the explanation that there are many different perturbation directions in high-dimensional space which means that there are lots of potential adversarial examples, which increases the chance that you can find one. Their theoretical result comes primarily from the fact that weights are initialized with variance inversely proportional to d. We could eliminate this by having the variance be inversely proportional to d^2, in which case their result would say that adversarial vulnerability is constant with input dimension. However, in this case the variance of the activations would be inversely proportional to d, making it hard to learn. It seems like adversarial vulnerability should be the product of "number of directions", and "amount you can search in a direction", where the latter is related to the variance of the activations, making the connection to this paper.
Adversarial examplesRobustness via curvature regularization, and vice versaMoosavi-Dezfooli et alAN #35Dan HThis paper proposes a distinct way to increase adversarial perturbation robustness. They take an adversarial example generated with the FGSM, compute the gradient of the loss for the clean example and the gradient of the loss for the adversarial example, and they penalize this difference. Decreasing this penalty relates to decreasing the loss surface curvature. The technique works slightly worse than adversarial training.
Adversarial examplesIs Robustness [at] the Cost of Accuracy?Dong Su, Huan Zhang et alAN #32Dan HThis work shows that older architectures such as VGG exhibit more adversarial robustness than newer models such as ResNets. Here they take adversarial robustness to be the average adversarial perturbation size required to fool a network. They use this to show that architecture choice matters for adversarial robustness and that accuracy on the clean dataset is not necessarily predictive of adversarial robustness. A separate observation they make is that adversarial examples created with VGG transfers far better than those created with other architectures. All of these findings are for models without adversarial training.
Adversarial examplesRobustness May Be at Odds with AccuracyDimitris Tsipras, Shibani Santurkar, Logan Engstrom et alAN #32Dan HSince adversarial training can markedly reduce accuracy on clean images, one may ask whether there exists an inherent trade-off between adversarial robustness and accuracy on clean images. They use a simple model amenable to theoretical analysis, and for this model they demonstrate a trade-off. In the second half of the paper, they show adversarial training can improve feature visualization, which has been shown in several concurrent works.
Adversarial examplesAdversarial Examples Are a Natural Consequence of Test Error in NoiseAnonymousAN #32Dan HThis paper argues that there is a link between model accuracy on noisy images and model accuracy on adversarial images. They establish this empirically by showing that augmenting the dataset with random additive noise can improve adversarial robustness reliably. To establish this theoretically, they use the Gaussian Isoperimetric Inequality, which directly gives a relation between error rates on noisy images and the median adversarial perturbation size. Given that measuring test error on noisy images is easy, given that claims about adversarial robustness are almost always wrong, and given the relation between adversarial noise and random noise, they suggest that future defense research include experiments demonstrating enhanced robustness on nonadversarial, noisy images.
Adversarial examplesAdversarial Reprogramming of Sequence Classification Neural NetworksPaarth Neekhara et alAN #23
Adversarial examplesFortified Networks: Improving the Robustness of Deep Networks by Modeling the Manifold of Hidden RepresentationsAlex Lamb et alAN #2
Adversarial examplesAdversarial Vision ChallengeWieland Brendel et alAN #19There will be a competition on adversarial examples for vision at NIPS 2018.
Adversarial examplesEvaluating and Understanding the Robustness of Adversarial Logit PairingLogan Engstrom, Andrew Ilyas and Anish AthalyeAN #18
Adversarial examplesBenchmarking Neural Network Robustness to Common Corruptions and Surface VariationsDan Hendrycks et alAN #15See [Import AI](
Adversarial examplesAdversarial Reprogramming of Neural NetworksGamaleldin F. Elsayed et alAN #14
Adversarial examplesDefense Against the Dark Arts: An overview of adversarial example security research and future research directionsIan GoodfellowAN #11
Adversarial examplesSpatially Transformed Adversarial ExamplesChaowei Xiao et alAN #29Dan HMany adversarial attacks perturb pixel values, but the attack in this paper perturbs the pixel locations instead. This is accomplished with a smooth image deformation which has subtle effects for large images. For MNIST images, however, the attack is more obvious and not necessarily content-preserving (see Figure 2 of the paper).
Adversarial examplesCharacterizing Adversarial Examples Based on Spatial Consistency Information for Semantic SegmentationChaowei Xiao et alAN #29Dan HThis paper considers adversarial attacks on segmentation systems. They find that segmentation systems behave inconsistently on adversarial images, and they use this inconsistency to detect adversarial inputs. Specifically, they take overlapping crops of the image and segment each crop. For overlapping crops of an adversarial image, they find that the segmentation are more inconsistent. They defend against one adaptive attack.
Adversarial examplesAdversarial Logit PairingRecon #5
Adversarial examplesLearning to write programs that generate imagesRecon #5
Adversarial examplesIntrinsic Geometric Vulnerability of High-Dimensional Artificial IntelligenceLuca Bortolussi et alAN #36
Adversarial examplesOn Adversarial Examples for Character-Level Neural Machine TranslationJavid Ebrahimi et alAN #13
Adversarial examplesIdealised Bayesian Neural Networks Cannot Have Adversarial Examples: Theoretical and Empirical StudyYarin Gal et alAN #10
Agent foundationsRobust program equilibriumCaspar OesterheldAN #35
Agent foundationsBounded Oracle InductionDiffractorAN #35
Agent foundationsA Rationality Condition for CDT Is That It Equal EDT (Part 2)Abram DemskiAN #28
Agent foundationsEDT solves 5 and 10 with conditional oraclesjessicataAN #27
Agent foundationsA Rationality Condition for CDT Is That It Equal EDT (Part 1)Abram DemskiAN #27
Agent foundationsAsymptotic Decision Theory (Improved Writeup)DiffractorAN #26
Agent foundationsIn Logical Time, All Games are Iterated GamesAbram DemskiAN #25RichardThe key difference between causal and functional decision theory is that the latter supplements the normal notion of causation with "logical causation". The decision of agent A can logically cause the decision of agent B even if B made their decision before A did - for example, if B made their decision by simulating A. Logical time is an informal concept developed to help reason about which computations cause which other computations: logical causation only flows forward through logical time in the same way that normal causation only flows forward through normal time (although maybe logical time turns out to be loopy). For example, when B simulates A, B is placing themselves later in logical time than A. When I choose not to move my bishop in a game of chess because I've noticed it allows a sequence of moves which ends in me being checkmated, then I am logically later than that sequence of moves. One toy model of logical time is based on proof length - we can consider shorter proofs to be earlier in logical time than longer proofs. It's apparently surprisingly difficult to find a case where this fails badly.

In logical time, all games are iterated games. We can construct a series of simplified versions of each game where each player's thinking time is bounded. As thinking time increases, the games move later in logical time, and so we can treat them as a series of iterated games whose outcomes causally affect all longer versions. Iterated games are fundamentally different from single-shot games: the [folk theorem]( states that virtually any outcome is possible in iterated games.
I like logical time as an intuitive way of thinking about logical causation. However, the analogy between normal time and logical time seems to break down in some cases. For example, suppose we have two boolean functions F and G, such that F = not G. It seems like G is logically later than F - yet we could equally well have defined them such that G = not F, which leads to the opposite conclusion. As Abram notes, logical time is intended as an intuition pump not a well-defined theory - yet the possibility of loopiness makes me less confident in its usefulness. In general I am pessimistic about the prospects for finding a formal definition of logical causation, for reasons I described in [Realism about Rationality](, which Rohin summarised above.
Agent foundationsCounterfactuals and reflective oraclesNisanAN #23
Agent foundationsWhen wishful thinking worksAlex MennenAN #23Sometimes beliefs can be loopy, in that the probability of a belief being true depends on whether you believe it. For example, the probability that a placebo helps you may depend on whether you believe that a placebo helps you. In the situation where you know this, you can "wish" your beliefs to be the most useful possible beliefs. In the case where the "true probability" depends continuously on your beliefs, you can use a fixed point theorem to find a consistent set of probabilities. There may be many such fixed points, in which case you can choose the one that would lead to highest expected utility (such as choosing to believe in the placebo). One particular application of this would be to think of the propositions as "you will take action a_i". In this case, you act the way you believe you act, and then every probability distribution over the propositions is a fixed point, and so we just choose the probability distribution (i.e. stochastic policy) that maximized expected utility, as usual. This analysis can also be carried to Nash equilibria, where beliefs in what actions you take will affect the actions that the other player takes.
Agent foundationsComputational complexity of RL with trapsVadim KosoyAN #22A post asking about complexity theoretic results around RL, both with (unknown) deterministic and stochastic dynamics.
Agent foundationsUsing expected utility for Good(hart)Stuart ArmstrongAN #22If we include all of the uncertainty we have about human values into the utility function, then it seems possible to design an expected utility maximizer that doesn't fall prey to Goodhart's law. The post shows a simple example where there are many variables that may be of interest to humans, but we're not sure which ones. In this case, by incorporating this uncertainty into our proxy utility function, we can design an expected utility maximizer that has conservative behavior that makes sense.On the one hand, I'm sympathetic to this view -- for example, I see risk aversion as a heuristic leading to good expected utility maximization for bounded reasoners on large timescales. On the other hand, an EU maximizer still seems hard to align, because whatever utility function it gets, or distribution over utility functions, it will act as though that input is definitely true, which means that anything we fail to model will never make it into the utility function. If you could have some sort of "unresolvable" uncertainty, some reasoning (similar to the [problem of induction]( suggesting that you can never fully trust your own thoughts to be perfectly correct, that would make me more optimistic about an EU maximization based approach, but I don't think it can be done by just changing the utility function, or by adding a distribution over them.
Agent foundationsCorrigibility doesn't always have a good action to takeStuart ArmstrongAN #22Stuart has [previously argued]( that an AI could be put in situations where no matter what it does, it would affect the human's values. In this short post, he notes that if you then say that it is possible to have situations where the AI cannot act corrigibly, then other problems arise, such as how you can create a superintelligent corrigible AI that does anything at all (since any action that it takes would likely affect our values somehow).
Agent foundationsAgents and Devices: A Relative Definition of AgencyLaurent Orseau et alAN #22This paper considers the problem of modeling other behavior, either as an agent (trying to achieve some goal) or as a device (that reacts to its environment without any clear goal). They use Bayesian IRL to model behavior as coming from an agent optimizing a reward function, and design their own probability model to model the behavior as coming from a device. They then use Bayes rule to decide whether the behavior is better modeled as an agent or as a device. Since they have a uniform prior over agents and devices, this ends up choosing the one that better fits the data, as measured by log likelihood.

In their toy gridworld, agents are navigating towards particular locations in the gridworld, whereas devices are reacting to their local observation (the type of cell in the gridworld that they are currently facing, as well as the previous action they took). They create a few environments by hand which demonstrate that their method infers the intuitive answer given the behavior.
In their experiments, they have two different model classes with very different inductive biases, and their method correctly switches between the two classes depending on which inductive bias works better. One of these classes is the maximization of some reward function, and so we call that the agent class. However, they also talk about using the Solomonoff prior for devices -- in that case, even if we have something we would normally call an agent, if it is even slightly suboptimal, then with enough data the device explanation will win out.

I'm not entirely sure why they are studying this problem in particular -- one reason is explained in the next post, I'll write more about it in that section.
Agent foundationsCooperative OraclesDiffractorAN #22
Agent foundationsBottle Caps Aren't OptimisersDaniel FilanAN #22The previous paper detects optimizers by studying their behavior. However, if the goal is to detect an optimizer before deployment, we need to determine whether an algorithm is performing optimization by studying its source code, _without_ running it. One definition that people have come up with is that an optimizer is something such that the objective function attains higher values than it otherwise would have. However, the author thinks that this definition is insufficient. For example, this would allow us to say that a bottle cap is an optimizer for keeping water inside the bottle. Perhaps in this case we can say that there are simpler descriptions of bottle caps, so those should take precedence. But what about a liver? We could say that a liver is optimizing for its owner's bank balance, since in its absence the bank balance is not going to increase.Here, we want a definition of optimization because we're worried about an AI being deployed, optimizing for some metric in the environment, and then doing something unexpected that we don't like but nonetheless does increase the metric (falling prey to Goodhart's law). It seems better to me to talk about "optimizer" and "agent" as models of predicting behavior, not something that is an inherent property of the thing producing the behavior. Under that interpretation, we want to figure out whether the agent model with a particular utility function is a good model for an AI system, by looking at its internals (without running it). It seems particularly important to be able to use this model to predict the behavior in novel situations -- perhaps that's what is needed to make the definition of optimizer avoid the counterexamples in this post. (A bottle cap definitely isn't going to keep water in containers if it is simply lying on a table somewhere.)
Agent foundationsReducing collective rationality to individual optimization in common-payoff games using MCMCjessicataAN #21Given how hard multiagent cooperation is, it would be great if we could devise an algorithm such that each agent is only locally optimizing their own utility (without requiring that anyone else change their policy), that still achieves the globally optimal policy. This post considers the case where all players have the same utility function in an iterated game. In this case, we can define a process where at every timestep, one agent is randomly selected, and that agent changes their action in the game uniformly at random with probability that depends on how much utility was just achieved. This depends on a rationality parameter α -- the higher α is, the more likely it is for the player to stick with a high utility action.

This process allows you to reach every possible joint action from every other possible joint action with some non-zero probability, so in the limit of running this process forever, you will end up visiting every state infinitely often. However, by cranking up the value of α, we can ensure that in the limit we spend most of the time in the high-value states and rarely switch to anything lower, which lets us get arbitrarily close to the optimal deterministic policy (and so arbitrarily close to the optimal expected value).
I like this, it's an explicit construction that demonstrates how you can play with the explore-exploit tradeoff in multiagent settings. Note that when α is set to be very high (the condition in which we get near-optimal outcomes in the limit), there is very little exploration, and so it will take a long time before we actually find the optimal outcome in the first place. It seems like this would make it hard to use in practice, but perhaps we could replace the exploration with reasoning about the game and other agents in it? The author was planning to use reflective oracles to do something like this if I understand correctly.
Agent foundationsComputational efficiency reasons not to model VNM-rational preference relations with utility functionsAlex MennenAN #17Realistic agents don't use utility functions over world histories to make decisions, because it is computationally infeasible, and it's quite possible to make a good decision by only considering the local effects of the decision. For example, when deciding whether or not to eat a sandwich, we don't typically worry about the outcome of a local election in Siberia. For the same computational reasons, we wouldn't want to use a utility function to model other agents. Perhaps a utility function is useful for measuring the strength of an agent's preference, but even then it is really measuring the ratio of the strength of the agent's preference to the strength of the agent's preference over the two reference points used to determine the utility function.I agree that we certainly don't want to model other agents using full explicit expected utility calculations because it's computationally infeasible. However, as a first approximation it seems okay to model other agents as computationally bounded optimizers of some utility function. It seems like a bigger problem to me that any such model predicts that the agent will never change its preferences (since that would be bad according to the current utility function).
Agent foundationsStable Pointers to Value III: Recursive QuantilizationAbram DemskiAN #17We often try to solve alignment problems by going a level meta. For example, instead of providing feedback on what the utility function is, we might provide feedback on how to best learn what the utility function is. This seems to get more information about what safe behavior is. What if we iterate this process? For example, in the case of quantilizers with three levels of iteration, we would do a quantilized search over utility function generators, then do a quantilized search over the generated utility functions, and then do a quantilized search to actually take actions.The post mentions what seems like the most salient issue -- that it is really hard for humans to give feedback even a few meta levels up. How do you evaluate a thing that will create a distribution over utility functions? I might go further -- I'm not even sure there is good normative feedback on the meta level(s). There is feedback we can give on the meta level for any particular object-level instance, but it seems not at all obvious (to me) that this advice will generalize well to other object-level instances. On the other hand, it does seem to me that the higher up you are in meta-levels, the smaller the space of concepts and the easier it is to learn. So maybe my overall take is that it seems like we can't depend on humans to give meta-level feedback well, but if we can figure out how to either give better feedback or learn from noisy feedback, it would be easier to learn and likely generalize better.
Agent foundationsBuridan's ass in coordination gamesjessicataAN #16Suppose two agents have to coordinate to choose the same action, X or Y, where X gives utility 1 and Y gives utility u, for some u in [0, 2]. (If the agents fail to coordinate, they get zero utility.) If the agents communicate, decide on policies, then observe the value of u with some noise ϵ, and then execute their policies independently, there must be some u for which they lose out on significant utility. Intuitively, the proof is that at u = 0, you should say X, and at u = 2, you should say Y, and there is some intermediate value where you are indifferent between the two (equal probability of choosing X or Y), meaning that 50% of the time you will fail to coordinate. However, if you have a shared source of randomness (after observing the value of u), then you can correlate your decisions using the randomness in order to do much better.Cool result, and quite easy to understand. As usual I don't want to speculate on relevance to AI alignment because it's not my area.
Agent foundationsProbability is Real, and Value is ComplexAbram DemskiAN #16If you interpret events as vectors on a graph, with probability on the x-axis and probability * utility on the y-axis, then any rotation of the vectors preserves the preference relation, so that you will make the same decision. This means that from decisions, you cannot distinguish between rotations, which intuitively means that you can't tell if a decision was made because it had a low probability of high utility, or medium probability of medium utility, for example. As a result, beliefs and utilities are inextricably linked, and you can't just separate them. Key quote: "Viewing [probabilities and utilities] in this way makes it somewhat more natural to think that probabilities are more like "caring measure" expressing how much the agent cares about how things go in particular worlds, rather than subjective approximations of an objective "magical reality fluid" which determines what worlds are experienced."I am confused. If you want to read my probably-incoherent confused opinion on it, it's [here]( Utility: Representing Preference by Probability Measures
Agent foundationsBayesian Probability is for things that are Space-like Separated from YouScott GarrabrantAN #15When an agent has uncertainty about things that either influenced which algorithm the agent is running (the agent's "past") or about things that will be affected by the agent's actions (the agent's "future"), you may not want to use Bayesian probability. Key quote: "The problem is that the standard justifications of Bayesian probability are in a framework where the facts that you are uncertain about are not in any way affected by whether or not you believe them!" This is not the case for events in the agent's "past" or "future". So, you should only use Bayesian probability for everything else, which are "space-like separated" from you (in analogy with space-like separation in relativity).I don't know much about the justifications for Bayesianism. However, I would expect any justification to break down once you start to allow for sentences where the agent's degree of belief in the sentence affects its truth value, so the post makes sense given that intuition.
Agent foundationsComplete Class: Consequentialist FoundationsAbram DemskiAN #15An introduction to "complete class theorems", which can be used to motivate the use of probabilities and decision theory.This is cool, and I do want to learn more about complete class theorems. The post doesn't go into great detail on any of the theorems, but from what's there it seems like these theorems would be useful for figuring out what things we can argue from first principles (akin to the VNM theorem and dutch book arguments).
Agent foundationsWeak arguments against the universal prior being malignX4vierAN #11In an [earlier post](, Paul Christiano has argued that if you run Solomonoff induction and use its predictions for important decisions, most of your probability mass will be placed on universes with intelligent agents that make the right predictions so that their predictions will influence your decisions, and then use that influence to manipulate you into doing things that they value. This post makes a few arguments that this wouldn't actually happen, and Paul responds to the arguments in the comments.I still have only a fuzzy understanding of what's going on here, so I'm going to abstain from an opinion on this one.What does the universal prior actually look like?
Agent foundationsCounterfactual Mugging Poker GameScott GarrabrantAN #11This is a variant of counterfactual mugging, in which an agent doesn't take the action that is locally optimal, because that would provide information in the counterfactual world where one aspect of the environment was different that would lead to a large loss in that setting.This example is very understandable and very short -- I haven't summarized it because I don't think I can make it any shorter.
Agent foundationsPrisoners' Dilemma with Costs to ModelingScott GarrabrantAN #10Open source game theory looks at the behavior of agents that have access to each other's source code. A major result is that we can define an agent FairBot that will cooperate with itself in the prisoner's dilemma, yet can never be exploited. Later, we got PrudentBot, which still cooperates with FairBots, but will defect against CooperateBots (which always cooperate) since it can at no cost to itself. Given this, you would expect that if you evolved a population of such bots, you'd hopefully get an equilibrium in which everyone cooperates with each other, since they can do so robustly without falling prey to DefectBots (which always defect). However, being a FairBot or PrudentBot is costly -- you have to think hard about the opponent and prove things about them, it's a lot easier to rely on everyone else to punish the DefectBots and become a CooperateBot yourself. In this post, Scott analyzes the equilibria in the two person prisoner's dilemma with small costs to play bots that have to prove things. It turns out that in addition to the standard Defect-Defect equilibirum, there are two mixed strategy equilibria, including one that leads to generally cooperative behavior -- and if you evolve agents to play this game, they generally stay in the vicinity of this good equilibrium, for a range of initial conditions.This is an interesting result. I continue to be surprised at how robust this Lobian cooperative behavior seems to be -- while I used to think that humans could only cooperate with each other because of prosocial tendencies that meant that we were not fully selfish, I'm now leaning more towards the theory that we are simply very good at reading other people, which gives us insight into them, and leads to cooperative behavior in a manner similar to Lobian cooperation.[Robust Cooperation in the Prisoner's Dilemma]( and/or [Open-source game theory is weird](
Agent foundations2018 research plans and predictionsRob BensingerAN #1Scott and Nate from MIRI score their predictions for research output in 2017 and make predictions for research output in 2018.I don't know enough about MIRI to have any idea what the predictions mean, but I'd still recommend reading it if you're somewhat familiar with MIRI's technical agenda to get a bird's-eye view of what they have been focusing on for the last year.A basic understanding of MIRI's technical agenda (eg. what they mean by naturalized agents, decision theory, Vingean reflection, and so on).
Agent foundationsOracle Induction ProofsDiffractorAN #35
Agent foundationsDimensional regret without resetsVadim KosoyAN #33
Agent foundationsWhat are Universal Inductors, Again?DiffractorAN #32
Agent foundationsWhen EDT=CDT, ADT Does WellDiffractorAN #30
Agent foundationsRamsey and Joyce on deliberation and predictionYang Liu et alAN #27
Agent foundationsDoubts about UpdatelessnessAlex AppelAN #5
Agent foundationsComputing an exact quantilal policyVadim KosoyAN #3
Agent foundationsLogical Counterfactuals & the Cooperation GameChris LeongAN #20
Agent foundationsIdea: OpenAI Gym environments where the AI is a part of the environmentcrabmanAN #2
Agent foundationsResource-Limited Reflective OraclesAlex AppelAN #2
Agent foundationsNo Constant Distribution Can be a Logical InductorAlex AppelAN #2
Agent foundationsProbabilistic Tiling (Preliminary Attempt)DiffractorAN #19
Agent foundations[Logical Counterfactuals for Perfect Predictors]( and [A Short Note on UDT]( LeongAN #19
Agent foundationsCounterfactuals, thick and thinNisanAN #18There are many different ways to formalize counterfactuals (the post suggests three such ways). Often, for any given way of formalizing counterfactuals, there are many ways you could take a counterfactual, which give different answers. When considering the physical world, we have strong causal models that can tell us which one is the "correct" counterfactual. However, there is no such method for logical counterfactuals yet.I don't think I understood this post, so I'll abstain on an opinion.
Agent foundationsConceptual problems with utility functions, second attempt at explainingDacynAN #17Argues that there's a difference between object-level fairness (which sounds to me like fairness as a terminal value) and meta-level fairness (which sounds to me like instrumental fairness), and that this difference is not captured with single-player utility function maximization.I still think that the difference pointed out here is accounted for by traditional multiagent game theory, which has utility maximization for each player. For example, I would expect that in a repeated Ultimatum game, fairness would arise naturally, similarly to how tit-for-tat is a good strategy in an iterated prisoner's dilemma.
Conceptual problems with utility functions
Agent foundationsThe Evil Genie PuzzleChris LeongAN #17
Agent foundationsExorcizing the Speed Prior?Abram DemskiAN #17Intuitively, in order to find a solution to a hard problem, we could either do an uninformed brute force search, or encode some domain knowledge and then do an informed search. Roughly, we should expect each additional bit of information to cut the required search roughly in half. The speed prior trades off a bit of complexity against a doubling of running time, so we should expect the informed and uninformed searches to be equally likely in the speed prior. So, uninformed brute force searches that can find weird edge cases (aka daemons) are only equally likely, not more likely.As the post acknowledges, this is extremely handwavy and just gesturing at an intuition, so I'm not sure what to make of it yet. One counterconsideration is that a lot of intelligence that is not just search, that still is general across domains (see [this comment]( for examples).
Agent foundationsForecasting using incomplete modelsVadim KosoyAN #13
Agent foundationsLogical uncertainty and Mathematical uncertaintyAlex MennenAN #13
Agent foundationsLogical Inductor Tiling and Why it's HardAlex AppelAN #10
Agent foundationsA Possible Loophole for Self-Applicative Soundness?Alex AppelAN #10
Agent foundationsLogical Inductors Converge to Correlated Equilibria (Kinda)Alex AppelAN #10
Agent foundationsLogical Inductor LemmasAlex AppelAN #10
Agent foundationsTwo Notions of Best ResponseAlex AppelAN #10
Agent foundationsQuantilal control for finite MDPsVadim KosoyAN #1
Agent foundationsMusings on ExplorationAlex AppelAN #1Decision theories require some exploration in order to prevent the problem of spurious conterfactuals, where you condition on a zero-probability event. However, there are problems with exploration too, such as unsafe exploration (eg. launching a nuclear arsenal in an exploration step), and a sufficiently strong agent seems to have an incentive to self-modify to remove the exploration, because the exploration usually leads to suboptimal outcomes for the agent.I liked the linked [post]( that explains why conditioning on low-probability actions is not the same thing as a counterfactual, but I'm not knowledgeable enough to understand what's going on in this post, so I can't really say whether or not you should read it.
Agent foundationsDistributed CooperationRecon #5
Agent foundationsDecisions are not about changing the world, they are about learning what world you live inshminuxAN #18The post tries to reconcile decision theory (in which agents can "choose" actions) with the deterministic physical world (in which nothing can be "chosen"), using many examples from decision theory.
Agent foundationsAn Agent is a Worldline in Tegmark VkomponistoAN #15Tegmark IV consists of all possible consistent mathematical structures. Tegmark V is an extension that also considers "impossible possible worlds", such as the world where 1+1=3. Agents are reasoning at the level of Tegmark V, because counterfactuals are considering these impossible possible worlds.I'm not really sure what you gain by thinking of an agent this way.
AGI theoryAbstraction LearningFei Deng et alAN #25
AGI theorySteps toward super intelligence ([1](, [2](, [3](, [4]( BrooksAN #16Part 1 goes into four historical approaches to AI and their strengths and weaknesses. Part 2 talks about what sorts of things an AGI should be capable of doing, proposing two tasks to evaluate on to replace the Turing test (which simple not-generally-intelligent chatbots can pass). The tasks are an elder care worker (ECW) robot, that could assist the elderly and let them live their lives in their homes, and a services logistics planner (SLP), which should be able to design systems for logistics, such as the first-ever dialysis ward in a hospital. Part 3 talks about what sorts of things are hard now, but talks about rather high-level things such as reading a book and writing code. Part 4 has suggestions on what to work on right now, such as getting object recognition and manipulation capabilities of young children.Firstly, you may just want to skip it all because many parts drastically and insultingly misrepresent AI alignment concerns. But if you're okay with that, then part 2 is worth reading -- I really like the proposed tasks for AGI, they seem like good cases to think about. Part 1 doesn't actually talk about superintelligence so I would skip it. Part 3 was not news to me, and I suspect will not be news to readers of this newsletter (even if you aren't an AI researcher). I disagree with the intuition behind Part 4 as a method for getting superintelligent AI systems, but it does seem like the way we will make progress in the short term.
AGI theoryA Model for General IntelligencePaul YaworskyAN #32
AGI theoryFractal AI: A fragile theory of intelligenceRecon #5
AGI theoryBelievable PromisesDouglas_ReayAN #3
AGI theoryThe Foundations of Deep Learning with a Path Towards General IntelligenceEray ÖzkuralAN #13
AI strategy and policyHighlight80K podcast with Allan DafoeAllan Dafoe and Rob WiblinAN #7A long interview with Allan Dafoe about the field of AI policy, strategy and governance. It discusses challenges for AI policy that haven't arisen before (primarily because AI is a dual use technology), the rhetoric around arms races, and autonomous weapons as a means to enable authoritarian regimes, to give a small sampling. One particularly interesting tidbit (to me) was that Putin has said that Russia will give away its AI capabilities to the world, because an arms race would be dangerous.Overall this is a great introduction to the field, I'd probably recommend people interested in the area to read this before any of the more typical published papers. I do have one disagreement -- Allan claims that even if we stopped Moore's law, and stopped algorithmic scientific improvement in AI, there could be some extreme systematic risks that emerge from AI -- mass labor displacement, creating monopolies, mass surveillance and control (through robot repression), and strategic stability. I would be very surprised if current AI systems would be able to lead to mass labor displacement and/or control through robot repression. We are barely able to get machines to do anything in the real world right now -- _something_ has to improve quite drastically, and if it's neither compute nor algorithms, then I don't know what it would be. The other worries seem plausible from the technical viewpoint.
AI strategy and policyHighlightAI Governance: A Research AgendaAllan DafoeAN #22A comprehensive document about the research agenda at the Governance of AI Program. This is really long and covers a lot of ground so I'm not going to summarize it, but I highly recommend it, even if you intend to work primarily on technical work.
AI strategy and policyHighlightOpenAI CharterAN #2In their words, this is "a charter that describes the principles we use to execute on OpenAI’s mission".I'm very excited by this charter, it's a good sign suggesting that we can get the important actors to cooperate in building aligned AI, and in particular to avoid a competitive race. Key quote: "if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project".
AI strategy and policyHighlightAGI Strategy - List of ResourcesAN #11Exactly what it sounds like.
AI strategy and policyHighlightTechnology Roulette: Managing Loss of Control as Many Militaries Pursue Technological SuperiorityRichard DanzigAN #10US policy so far has been to pursue technological superiority in order to stay ahead of its adversaries and to prevent conflict through deterrence. This paper argues that policymakers should shift some attention to preparing for other risks, such as accidents, emergent effects, sabotage and proliferation (where other actors get and use the technology, without the same safety standards as the US). There were several interesting sections, but the one I was particularly interested in was the section arguing that keeping a human in the loop would not be sufficient. In military situations, decisions must often be made in time-sensitive, high-stress situations, and in such scenarios humans are not very good at making decisions. For example, if an AI system detects an incoming missile, it must autonomously aim and fire to prevent the missile from hitting its target -- there is not enough time for a human to be in the loop. The biggest issue though is that while a human may be part of the decision-making process, they are reliant on various machine readings and calculations in order to reach their decision, and so a human in the loop doesn't provide an independent check on the answer, and so is of limited utility. And as AI systems get better, humans will become less useful for checking the AI's decisions, making this a temporary solution at best.I found the paper to be quite compelling, especially the comments on the human-in-the-loop solution. This feels relevant to problems in technical AI alignment, though I'm not exactly sure how. One question that it suggests -- how can we learn human preferences, when the human answers may themselves depend on the AI's actions? Stuart Armstrong has [pointed out]( this problem as well.
AI strategy and policyMIRI 2018 Update: Our New Research DirectionsNate SoaresAN #34This post gives a high-level overview of the new research directions that MIRI is pursuing with the goal of deconfusion, a discussion of why deconfusion is so important to them, an explanation of why MIRI is now planning to leave research unpublished by default, and a case for software engineers to join their team.There aren't enough details on the technical research for me to say anything useful about it. I'm broadly in support of deconfusion but am either less optimistic on the tractability of deconfusion, or more optimistic on the possibility of success with our current notions (probably both). Keeping research unpublished-by-default seems reasonable to me given the MIRI viewpoint for the reasons they talk about, though I haven't thought about it much. See also [Import AI](
AI strategy and policyAI development incentive gradients are not uniformly terriblerkAN #33This post considers a model of AI development somewhat similar to the one in [Racing to the precipice]( paper. It notes that under this model, assuming perfect information, the utility curves for each player are _discontinuous_. Specifically, the models predict deterministically that the player that spent the most on something (typically AI capabilities) is the one that "wins" the race (i.e. builds AGI), and so there is a discontinuity at the point where the players are spending equal amounts of money. This results in players fighting as hard as possible to be on the right side of the discontinuity, which suggests that they will skimp on safety. However, in practice, there will be some uncertainty about which player wins, even if you know exactly how much each is spending, and this removes the discontinuity. The resulting model predicts more investment in safety, since buying expected utility through safety now looks better than increasing the probability of winning the race (whereas before, it was compared against changing from definitely losing the race to definitely winning the race).The model in [Racing to the precipice]( had the unintuitive conclusion that if teams have _more_ information (i.e. they know their own or other’s capabilities), then we become _less_ safe, which puzzled me for a while. Their explanation is that with maximal information, the top team takes as much risk as necessary in order to guarantee that they beat the second team, which can be quite a lot of risk if the two teams are close. While this is true, the explanation from this post is more satisfying -- since the model has a discontinuity that rewards taking on risk, anything that removes the discontinuity and makes it more continuous will likely improve the prospects for safety, such as not having full information. I claim that in reality these discontinuities mostly don't exist, since (1) we're uncertain about who will win and (2) we will probably have a multipolar scenario where even if you aren't first-to-market you can still capture a lot of value. This suggests that it likely isn't a problem for teams to have more information about each other on the margin.

That said, these models are still very simplistic, and I mainly try to derive qualitative conclusions from them that my intuition agrees with in hindsight.
Racing to the precipice: a model of artificial intelligence development
AI strategy and policyThe Vulnerable World HypothesisNick BostromAN #32RichardBostrom considers the possibility "that there is some level of technology at which civilization almost certainly gets destroyed unless quite extraordinary and historically unprecedented degrees of preventive policing and/or global governance are implemented." We were lucky, for example, that starting a nuclear chain reaction required difficult-to-obtain plutonium or uranium, instead of easily-available materials. In the latter case, our civilisation would probably have fallen apart, because it was (and still is) in the "semi-anarchic default condition": we have limited capacity for preventative policing or global governence, and people have a diverse range of motivations, many selfish and some destructive. Bostrom identifies four types of vulnerability which vary by how easily and widely the dangerous technology can be produced, how predictable its effects are, and how strong the incentives to use it are. He also idenitifies four possible ways of stabilising the situation: restrict technological development, influence people's motivations, establish effective preventative policing, and establish effective global governance. He argues that the latter two are more promising in this context, although they increase the risks of totalitarianism. Note that Bostrom doesn't take a strong stance on whether the vulnerable world hypothesis is true, although he claims that it's unjustifiable to have high credence in its falsity.This is an important paper which I hope will lead to much more analysis of these questions.
AI strategy and policyThe Future of SurveillanceBen GarfinkelAN #28While we often think of there being a privacy-security tradeoff and an accountability-security tradeoff with surveillance, advances in AI and cryptography can make advances on the Pareto frontier. For example, automated systems could surveil many people but only report a few suspicious cases to humans, or they could be used to redact sensitive information (eg. by blurring faces), both of which improve privacy and security significantly compared to the status quo. Similarly, automated ML systems can be applied consistently to every person, can enable collection of good statistics (eg. false positive rates), and are more interpretable than a human making a judgment call, all of which improve accountability.
AI strategy and policyMachine Learning and Artificial Intelligence: how do we make sure technology serves the open society?Recon #5A report from a workshop held in the UK in December 2017, where a diverse group of people ("omputer scientists, entrepreneurs, business leaders, politicians, journalists, researchers and a pastor") talked about the "impact of AI on societies, governments and the relations between states, and between states and companies". I'm glad to see more thought being put into AI policy, and especially what the effects of AI may be on world stability.As we might expect, the report raises more questions than answers. If you think a lot about this space, it is worth reading -- it proposes a classification of problems that's different from ones I've seen before -- but I really only expect it to be useful to people actively doing research in this space.
Main menu