CategoryHighlight?TitleAuthorsVenueH/TEmailSummarizerSummaryMy opinionPrerequisitesRead more
Adversarial examplesHighlightAdversarial Examples Are Not Bugs, They Are FeaturesAndrew Ilyas*, Shibani Santurkar*, Dimitris Tsipras*, Logan Engstrom*, Brandon Tran, Aleksander MadryarXivAN #62Rohin_Distill published a discussion of this paper. This highlights section will cover the full discussion; all of these summaries and opinions are meant to be read together._

Consider two possible explanations of adversarial examples. First, they could be caused because the model "hallucinates" a signal that is not useful for classification, and it becomes very sensitive to this feature. We could call these "bugs", since they don't generalize well. Second, they could be caused by features that _do_ generalize to the test set, but _can_ be modified by an adversarial perturbation. We could call these "non-robust features" (as opposed to "robust features", which can't be changed by an adversarial perturbation). The authors argue that at least some adversarial perturbations fall into the second category of being informative but sensitive features, based on two experiments.

If the "hallucination" explanation were true, the hallucinations would presumably be caused by the training process, the choice of architecture, the size of the dataset, **but not by the type of data**. So one thing to do would be to see if we can construct a dataset such that a model trained on that dataset is _already_ robust, without adversarial training. The authors do this in the first experiment. They take an adversarially trained robust classifier, and create images whose features (final-layer activations of the robust classifier) match the features of some unmodified input. The generated images only have robust features because the original classifier was robust, and in fact models trained on this dataset are automatically robust.

If the "non-robust features" explanation were true, then it should be possible for a model to learn on a dataset containing only non-robust features (which will look nonsensical to humans) and **still generalize to a normal-looking test set**. In the second experiment (henceforth WrongLabels), the authors construct such a dataset. Their hypothesis is that adversarial perturbations work by introducing non-robust features of the target class. So, to construct their dataset, they take an image x with original label y, adversarially perturb it towards some class y' to get image x', and then add (x', y') to their dataset (even though to a human x' looks like class y). They have two versions of this: in RandLabels, the target class y' is chosen randomly, whereas in DetLabels, y' is chosen to be y + 1. For both datasets, if you train a new model on the dataset, you get good performance **on the original test set**, showing that the "non-robust features" do generalize.
I buy this hypothesis. It explains why adversarial examples occur ("because they are useful to reduce loss"), and why they transfer across models ("because different models can learn the same non-robust features"). In fact, the paper shows that architectures that did worse in ExpWrongLabels (and so presumably are bad at learning non-robust features) are also the ones to which adversarial examples transfer the least. I'll leave the rest of my opinion to the opinions on the responses.
[Paper]( and [Author response](
Adversarial examplesHighlightResponse: Learning from Incorrectly Labeled DataEric WallaceDistillAN #62RohinThis response notes that all of the experiments are of the form: create a dataset D that is consistent with a model M; then, when you train a new model M' on D you get the same properties as M. Thus, we can interpret these experiments as showing that [model distillation]( can work even with data points that we would naively think of "incorrectly labeled". This is a more general phenomenon: we can take an MNIST model, select _only_ the examples for which the top prediction is incorrect (labeled with these incorrect top predictions), and train a new model on that -- and get nontrivial performance on the original test set, even though the new model has never seen a "correctly labeled" example.I definitely agree that these results can be thought of as a form of model distillation. I don't think this detracts from the main point of the paper: the _reason_ model distillation works even with incorrectly labeled data is probably because the data is labeled in such a way that it incentivizes the new model to pick out the same features that the old model was paying attention to.
Adversarial examplesHighlightResponse: Robust Feature LeakageGabriel GohDistillAN #62RohinThis response investigates whether the datasets in WrongLabels could have had robust features. Specifically, it checks whether a linear classifier over provably robust features trained on the WrongLabels dataset can get good accuracy on the _original_ test set. This shouldn't be possible since WrongLabels is meant to correlate only non-robust features with labels. It finds that you _can_ get some accuracy with RandLabels, but you don't get much accuracy with DetLabels.

The original authors can actually explain this: intuitively, you get accuracy with RandLabels because it's less harmful to choose labels randomly than to choose them explicitly incorrectly. With random labels on unmodified inputs, robust features should be completely uncorrelated with accuracy. However, with random labels _followed by an adversarial perturbation towards the label_, there can be some correlation, because the adversarial perturbation can add "a small amount" of the robust feature. However, in DetLabels, the labels are _wrong_, and so the robust features are _negatively correlated_ with the true label, and while this can be reduced by an adversarial perturbation, it can't be reversed (otherwise it wouldn't be robust).
The original authors' explanation of these results is quite compelling; it seems correct to me.
Adversarial examplesHighlightResponse: Adversarial Examples are Just Bugs, TooPreetum NakkiranDistillAN #62RohinThe main point of this response is that adversarial examples can be bugs too. In particular, if you construct adversarial examples that explicitly _don't_ transfer between models, and then run ExpWrongLabels with such adversarial perturbations, then the resulting model doesn't perform well on the original test set (and so it must not have learned non-robust features).

It also constructs a data distribution where **every useful feature _of the optimal classifer_ is guaranteed to be robust**, and shows that we can still get adversarial examples with a typical model, showing that it is not just non-robust features that cause adversarial examples.

In their response, the authors clarify that they didn't intend to claim that adversarial examples could not arise due to "bugs", just that "bugs" were not the only explanation. In particular, they say that their main thesis is “adversarial examples will not just go away as we fix bugs in our models”, which is consistent with the point in this response.
Amusingly, I think I'm more bullish on the original paper's claims than the authors themselves. It's certainly true that adversarial examples can arise from "bugs": if your model overfits to your data, then you should expect adversarial examples along the overfitted decision boundary. The dataset constructed in this response is a particularly clean example: the optimal classifier would have an accuracy of 90%, but the model is trained to accuracy 99.9%, which means it must be overfitting.

However, I claim that with large and varied datasets with neural nets, we are typically not in the regime where models overfit to the data, and the presence of "bugs" in the model will decrease. (You certainly _can_ get a neural net to be "buggy", e.g. by randomly labeling the data, but if you're using real data with a natural task then I don't expect it to happen to a significant degree.) Nonetheless, adversarial examples persist, because the features that models use are not the ones that humans use.

It's also worth noting that this experiment strongly supports the hypothesis that adversarial examples transfer because they are real features that generalize to the test set.
Adversarial examplesHighlightResponse: Adversarial Example Researchers Need to Expand What is Meant by ‘Robustness’Justin Gilmer, Dan HendrycksDistillAN #62RohinThis response argues that the results in the original paper are simply a consequence of a generally accepted principle: "models lack robustness to distribution shift because they latch onto superficial correlations in the data". This isn't just about L_p norm ball adversarial perturbations: for example, one [recent paper]( shows that if the model is only given access to high frequency features of images (which look uniformly grey to humans), it can still get above 50% accuracy. In fact, when we do adversarial training to become robust to L_p perturbations, then the model pays attention to different non-robust features and becomes more vulnerable to e.g. [low-frequency fog corruption]( The authors call for adversarial examples researchers to move beyond L_p perturbations and think about the many different ways models can be fragile, and to make them more robust to distributional shift.I strongly agree with the worldview behind this response, and especially the principle they identified. I didn't know this was a generally accepted principle, though of course I am not an expert on distributional robustness.

One thing to note is what is meant by "superficial correlation" here. I interpret it to mean a correlation that really does exist in the dataset, that really does generalize to the test set, but that _doesn't_ generalize out of distribution. A better term might be "fragile correlation". All of the experiments so far have been looking at within-distribution generalization (aka generalization to the test set), and are showing that non-robust features _do_ generalize within-distribution. By my understanding, this response is arguing that there are many such non-robust features that will generalize within-distribution but will not generalize under distributional shift, and we need to make our models robust to all of them, not just L_p adversarial perturbations.
Adversarial examplesHighlightResponse: Two Examples of Useful, Non-Robust FeaturesGabriel GohDistillAN #62RohinThis response studies linear features, since we can analytically compute their usefulness and robustness. It plots the singular vectors of the data as features, and finds that such features are either robust and useful, or non-robust and not useful. However, you can get useful, non-robust features by ensembling or contamination (see response for details).
Adversarial examplesHighlightResponse: Adversarially Robust Neural Style TransferReiichiro NakanoDistillAN #62RohinThe original paper showed that adversarial examples don't transfer well to VGG, and that VGG doesn't tend to learn similar non-robust features as a ResNet. Separately, VGG works particularly well for style transfer. Perhaps since VGG doesn't capture non-robust features as well, the results of style transfer look better to humans? This response and the author's response investigate this hypothesis in more detail and find that it seems broadly supported, but there are still finnicky details to be worked out.This is an intriguing empirical fact. However, I don't really buy the theoretical argument that style transfer works because it doesn't use non-robust features, since I would typically expect that a model that doesn't use L_p-fragile features would instead use features that are fragile or non-robust in some other way.
Adversarial examplesHighlightFeature Denoising for Improving Adversarial RobustnessCihang Xie, Yuxin Wu, Laurens van der Maaten, Alan Yuille, Kaiming HeICML 2018AN #49Dan HThis paper claims to obtain nontrivial adversarial robustness on ImageNet. Assuming an adversary can add perturbations of size 16/255 (l_infinity), previous adversarially trained classifiers could not obtain above 1% adversarial accuracy. Some groups have tried to break the model proposed in this paper, but so far it appears its robustness is close to what it claims, [around]( 40% adversarial accuracy. Vanilla adversarial training is how they obtain said adversarial robustness. There has only been one previous public attempt at applying (multistep) adversarial training to ImageNet, as those at universities simply do not have the GPUs necessary to perform adversarial training on 224x224 images. Unlike the previous attempt, this paper ostensibly uses better hyperparameters, possibly accounting for the discrepancy. If true, this result reminds us that hyperparameter tuning can be critical even in vision, and that improving adversarial robustness on large-scale images may not be possible outside industry for many years.
Adversarial examplesHighlightConstructing Unrestricted Adversarial Examples with Generative ModelsYang Song, Rui Shu, Nate Kushman, Stefano ErmonNeurIPS 2018AN #39RohinThis paper predates the [unrestricted adversarial examples challenge]( ([AN #24]( and shows how to generate such unrestricted adversarial examples using generative models. As a reminder, most adversarial examples research is focused on finding imperceptible perturbations to existing images that cause the model to make a mistake. In contrast, unrestricted adversarial examples allow you to find _any_ image that humans will reliably classify a particular way, where the model produces some other classification.

The key idea is simple -- train a GAN to generate images in the domain of interest, and then create adversarial examples by optimizing an image to simultaneously be "realistic" (as evaluated by the generator), while still being misclassified by the model under attack. The authors also introduce another term into the loss function that minimizes deviation from a randomly chosen noise vector -- this allows them to get diverse adversarial examples, rather than always converging to the same one.

They also consider a "noise-augmented" attack, where in effect they are running the normal attack they have, and then running a standard attack like FGSM or PGD afterwards. (They do these two things simultaneously, but I believe it's nearly equivalent.)

For evaluation, they generate adversarial examples with their method and check that humans on Mechanical Turk reliably classify the examples as a particular class. Unsurprisingly, their adversarial examples "break" all existing defenses, including the certified defenses, though to be clear existing defenses assume a different threat model where an adversarial example must be an imperceptible perturbation to one of a known set of images. You could imagine doing something similar by taking the imperceptible-perturbation attacks and raise the value of ϵ until it is perceptible -- but in this case the generated images are much less realistic.
This is the clear first thing to try with unrestricted adversarial examples, and it seems to work reasonably well. I'd love to see whether adversarial training with these sorts of adversarial examples works as a defense against both this attack and standard imperceptible-perturbation attacks. In addition, it would be interesting to see if humans could direct or control the search for unrestricted adversarial examples.
Adversarial examplesHighlightMotivating the Rules of the Game for Adversarial Example ResearchJustin Gilmer, Ryan P. Adams, Ian Goodfellow, David Andersen, George E. Dahl2018 IEEE/RSJ International Conference on Intelligent Robots and SystemsAN #28Dan HIn this position paper, the authors argue that many of the threat models which motivate adversarial examples are unrealistic. They enumerate various previously proposed threat models, and then they show their limitations or detachment from reality. For example, it is common to assume that an adversary must create an imperceptible perturbation to an example, but often attackers can input whatever they please. In fact, in some settings an attacker can provide an input from the clean test set that is misclassified. Also, they argue that adversarial robustness defenses which degrade clean test set error are likely to make systems less secure since benign or nonadversarial inputs are vastly more common. They recommend that future papers motivated by adversarial examples take care to define the threat model realistically. In addition, they encourage researchers to establish “content-preserving” adversarial attacks (as opposed to “imperceptible” l_p attacks) and improve robustness to unseen input transformations.This is my favorite paper of the year as it handily counteracts much of the media coverage and research lab PR purporting ``doom'' from adversarial examples. While there are some scenarios in which imperceptible perturbations may be a motivation---consider user-generated privacy-creating perturbations to Facebook photos which stupefy face detection algorithms---much of the current adversarial robustness research optimizing small l_p ball robustness can be thought of as tackling a simplified subproblem before moving to a more realistic setting. Because of this paper, new tasks such as [Unrestricted Adversarial Examples]( ([AN #24]( take an appropriate step toward increasing realism without appearing to make the problem too hard.
Adversarial examplesHighlightIntroducing the Unrestricted Adversarial Examples ChallengeTom B. Brown and Catherine Olsson, Reserach Engineers, Google Brain TeamGoogle AI BlogAN #24RohinThere's a new adversarial examples contest, after the one from NIPS 2017. The goal of this contest is to figure out how to create a model that never confidently makes a mistake on a very simple task, even in the presence of a powerful adversary. This leads to many differences from the previous contest. The task is a lot simpler -- classifiers only need to distinguish between bicycles and birds, with an option of saying "ambiguous". Instead of using the L-infinity norm ball to define what an adversarial example is, attackers are allowed to supply any image whatsoever, as long as a team of human evaluators agrees unanimously on the classification of the image. The contest has no time bound, and will run until some defense survives for 90 days without being broken even once. A defense is not broken if it says "ambiguous" on an adversarial example. Any submitted defense will be published, which means that attackers can specialize their attacks to that specific model (i.e. it is white box).I really like this contest format, it seems like it's actually answering the question we care about, for a simple task. If I were designing a defense, the first thing I'd aim for would be to get a lot of training data, ideally from different distributions in the real world, but data augmentation techniques may also be necessary, especially for eg. images of a bicycle against an unrealistic textured background. The second thing would be to shrink the size of the model, to make it more likely that it generalizes better (in accordance with Occam's razor or the minimum description length principle). After that I'd think about the defenses proposed in the literature. I'm not sure how the verification-based approaches will work, since they are intrinsically tied to the L-infinity norm ball definition of adversarial examples, or something similar -- you can't include the human evaluators in your specification of what you want to verify.
Adversarial examplesHighlightAdversarial Attacks and Defences CompetitionAlexey Kurakin, Ian Goodfellow, Samy Bengio, Yinpeng Dong, Fangzhou Liao, Ming Liang, Tianyu Pang, Jun Zhu, Xiaolin Hu, Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, Alan Yuille, Sangxia Huang, Yao Zhao, Yuzhe Zhao, Zhonglin Han, Junjiajia Long, Yerkebulan Berdibekov, Takuya Akiba, Seiya Tokui, Motoki AbeThe NIPS '17 Competition: Building Intelligent SystemsAN #1RohinThis is a report on a competition held at NIPS 2017 for the best adversarial attacks and defences. It includes a summary of the field and then shows the results from the competition.I'm not very familiar with the literature on adversarial examples and so I found this very useful as an overview of the field, especially since it talks about the advantages and disadvantages of different methods, which are hard to find by reading individual papers. The actual competition results are also quite interesting -- they find that the best attacks and defences are both quite successful on average, but have very bad worst-case performance (that is, the best defence is still very weak against at least one attack, and the best attack fails to attack at least one defence). Overall, this paints a bleak picture for defence, at least if the attacker has access to enough compute to actually try out different attack methods, and has a way of verifying whether the attacks succeed.
Adversarial examplesRobustness beyond Security: Representation LearningLogan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Aleksander MadryarXivAN #68CodyEarlier this year, a <@provocative paper@>(@Adversarial Examples Are Not Bugs, They Are Features@) out of MIT claimed that adversarial perturbations weren’t just spurious correlations, but were, at least in some cases, features that generalize to the test set. A subtler implied point of the paper was that robustness to adversarial examples wasn’t a matter of resolving the model’s misapprehensions, but rather one of removing the model’s sensitivity to features that would be too small for a human to perceive. If we do this via adversarial training, we get so-called “robust representations”. The same group has now put out another paper, asking the question: are robust representations also human-like representations?

To evaluate how human-like the representations are, they propose the following experiment: take a source image, and optimize it until its representations (penultimate layer activations) match those of some target image. If the representations are human-like, the result of this optimization should look (to humans) very similar to the target image. (They call this property “invertibility”.) Normal image classifiers fail miserably at this test: the image looks basically like the source image, making it a classic adversarial example. Robust models on the other hand pass the test, suggesting that robust representations usually are human-like. They provide further evidence by showing that you can run feature visualization without regularization and get meaningful results (existing methods result in noise if you don’t regularize).
I found this paper clear, well-written, and straightforward in its empirical examination of how the representations learned by standard and robust models differ. I also have a particular interest in this line of research, since I have thought for a while that we should be more clear about the fact that adversarially-susceptible models aren’t wrong in some absolute sense, but relative to human perception in particular.

**Rohin’s opinion:** I agree with Cody above, and have a few more thoughts.

Most of the evidence in this paper suggests that the learned representations are “human-like” in the sense that two images that have similar representations must also be perceptually similar (to humans). That is, by enforcing that “small change in pixels” implies “small change in representations”, you seem to get for free the converse: “small change in representations” implies “small change in pixels”. This wasn’t obvious to me: a priori, each feature could have corresponded to 2+ “clusters” of inputs.

The authors also seem to be making a claim that the representations are semantically similar to the ones humans use. I don’t find the evidence for this as compelling. For example, they claim that when putting the “stripes” feature on a picture of an animal, only the animal gets the stripes and not the background. However, when I tried it myself in the interactive visualization, it looked like a lot of the background was also getting stripes.

One typical regularization for [feature visualization]( is to jitter the image while optimizing it, which seems similar to selecting for robustness to imperceptible changes, so it makes sense that using robust features helps with feature visualization. That said, there are several other techniques for regularization, and the authors didn’t need any of them, which is very interesting. On the other hand, their visualizations don't look as good to me as those from other papers.
Paper: Adversarial Robustness as a Prior for Learned Representations
Adversarial examplesRobustness beyond Security: Computer Vision ApplicationsShibani Santurkar*, Dimitris Tsipras*, Brandon Tran*, Andrew Ilyas*, Logan Engstrom*, Aleksander MadryarXivAN #68RohinSince a robust model seems to have significantly more "human-like" features (see post above), it should be able to help with many of the tasks in computer vision. The authors demonstrate results on image generation, image-to-image translation, inpainting, superresolution and interactive image manipulation: all of which are done simply by optimizing the image to maximize the probability of a particular class label or the value of a particular learned feature.This provides more evidence of the utility of robust features, though all of the comments from the previous paper apply here as well. In particular, looking at the results, my non-expert guess is that they are probably not state-of-the-art (but it's still interesting that one simple method is able to do well on all of these tasks).
Paper: Image Synthesis with a Single (Robust) Classifier
Adversarial examplesNatural Adversarial ExamplesDan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, Dawn SongarXivAN #64Flo DornerThis paper introduces a new dataset to evaluate the worst-case performance of image classifiers. ImageNet-A consists of unmodified natural images that are consistently misclassified by popular neural-network architectures trained on ImageNet. Based on some concrete misclassifications, like a dragonfly on a yellow plastic shovel being classified as a banana, the authors hypothesize that current classifiers rely too much on color, texture and background cues. Neither classical adversarial training nor training on a version of ImageNet designed to reduce the reliance on texture helps a lot, but modifying the network architecture can increase the accuracy on ImageNet-A from around 5% to 15%.This seems to show that current methods and/or training sets for image classification are still far away from allowing for robust generalization, even in naturally occuring scenarios. While not too surprising, the results might convince those who have heavily discounted the evidence provided by classical adversarial examples due to the reliance on artificial perturbations.

**Rohin's opinion:** I'm particularly excited about this dataset because it seems like a significantly better way to evaluate new techniques for robustness: it's much closer to a "real world" test of the technique (as opposed to e.g. introducing an artificial perturbation that classifiers are expected to be robust to).
Adversarial examplesTesting Robustness Against Unforeseen AdversariesDaniel Kang, Yi Sun, Dan Hendrycks, Tom Brown, Jacob SteinhardtarXivAN #63CodyThis paper demonstrates that adversarially training on just one type or family of adversarial distortions fails to provide general robustness against different kinds of possible distortions. In particular, they show that adversarial training against L-p norm ball distortions transfer reasonably well to other L-p norm ball attacks, but provides little value, and can in fact reduce robustness, when evaluated on other families of attacks, such as adversarially-chosen Gabor noise, "snow" noise, or JPEG compression. In addition to proposing these new perturbation types beyond the typical L-p norm ball, the paper also provides a "calibration table" with epsilon sizes they judge to be comparable between attack types, by evaluating them according to how much they reduce accuracy on either a defended or undefended model. (Because attacks are so different in approach, a given numerical value of epsilon won't correspond to the same "strength" of attack across methods) I didn't personally find this paper hugely surprising, given the past pattern of whack-a-mole between attack and defense suggesting that defenses tend to be limited in their scope, and don't confer general robustness. That said, I appreciate how centrally the authors lay this lack of transfer as a problem, and the effort they put in to generating new attack types and calibrating them so they can be meaningfully compared to existing L-p norm ball ones.

**Rohin's opinion:** I see this paper as calling for adversarial examples researchers to stop focusing just on the L-p norm ball, in line with <@one of the responses@>(@Response: Adversarial Example Researchers Need to Expand What is Meant by ‘Robustness’@) to the last newsletter's highlight, <@Adversarial Examples Are Not Bugs, They Are Features@>.
Testing Robustness Against Unforeseen Adversaries
Adversarial examplesOn the Geometry of Adversarial ExamplesMarc Khoury, Dylan Hadfield-MenellProceedings of the Genetic and Evolutionary Computation Conference '18AN #36RohinThis paper analyzes adversarial examples based off a key idea: even if the data of interest forms a low-dimensional manifold, as we often assume, the ϵ-tube _around_ the manifold is still high-dimensional, and so accuracy in an ϵ-ball around true data points will be hard to learn.

For a given L_p norm, we can define the optimal decision boundary to be the one that maximizes the margin from the true data manifold. If there exists some classifier that is adversarially robust, then the optimal decision boundary is as well. Their first result is that the optimal decision boundary can change dramatically if you change p. In particular, for concentric spheres, the optimal L_inf decision boundary provides an L_2 robustness guarantee √d times smaller than the optimal L_2 decision boundary, where d is the dimensionality of the input. This explains why a classifier that is adversarially trained on L_inf adversarial examples does so poorly on L_2 adversarial examples.

I'm not sure I understand the point of the next section, but I'll give it a try. They show that a nearest neighbors classifier can achieve perfect robustness if the underlying manifold is sampled sufficiently densely (requiring samples exponential in k, the dimensionality of the manifold). However, a learning algorithm with a particular property that they formalize would require exponentially more samples in at least some cases in order to have the same guarantee. I don't know why they chose the particular property they did -- my best guess is that the property is meant to represent what we get when we train a neural net on L_p adversarial examples. If so, then their theorem suggests that we would need exponentially more training points to achieve perfect robustness with adversarial training compared to a nearest neighbor classifier.

They next turn to the fact that the ϵ-tube around the manifold is d-dimensional instead of k-dimensional. If we consider ϵ-balls around the training set X, this covers a very small fraction of the ϵ-tube, approaching 0 as d becomes much larger than k, even if the training set X covers the k-dimensional manifold sufficiently well.

Another issue is that if we require adversarial robustness, then we severely restrict the number of possible decision boundaries, and so we may need significantly more expressive models to get one of these decision boundaries. In particular, since feedforward neural nets with Relu activations have "piecewise linear" decision boundaries (in quotes because I might be using the term incorrectly), it is hard for them to separate concentric spheres. Suppose that the spheres are separated by a distance d. Then for accuracy on the manifold, we only need the decision boundary to lie entirely in the shell of width d. However, for ϵ-tube adversarial robustness, the decision boundary must lie in a shell of width d - 2ϵ. They prove a lower bound on the number of linear regions for the decision boundary that grows as τ^(-d), where τ is the width of the shell, suggesting that adversarial robustness would require more parameters in the model.

Their experiments show that for simple learning problems (spheres and planes), adversarial examples tend to be in directions orthogonal to the manifold. In addition, if the true manifold has high codimension, then the learned model has poor robustness.
I think this paper has given me a significantly better understanding of how L_p norm balls work in high dimensions. I'm more fuzzy on how this applies to adversarial examples, in the sense of any confident misclassification by the model on an example that humans agree is obvious. Should we be giving up on L_p robustness since it forms a d-dimensional manifold, whereas we can only hope to learn the smaller k-dimensional manifold? Surely though a small enough perturbation shouldn't change anything? On the other hand, even humans have _some_ decision boundary, and the points near the decision boundary have some small perturbation which would change their classification (though possibly to "I don't know" rather than some other class).

There is a phenomenon where if you train on L_inf adversarial examples, the resulting classifier fails on L_2 adversarial examples, which has previously been described as "overfitting to L_inf". The authors interpret their first theorem as contradicting this statement, since the optimal decision boundaries are very different for L_inf and L_2. I don't see this as a contradiction. The L_p norms are simply a method of label propagation, which augments the set of data points for which we know labels. Ultimately, we want the classifier to reproduce the labels that we would assign to data points, and L_p propagation captures some of that. So, we can think of there as being many different ways that we can augment the set of training points until it matches human classification, and the L_p norm balls are such methods. Then an algorithm is more robust as it works with more of these augmentation methods. Simply doing L_inf training means that by default the learned model only works on one of the methods (L_inf norm balls) and not all of them as we wanted, and we can think of this as "overfitting" to the imperfect L_inf notion of adversarial robustness. The meaning of "overfitting" here is that the learned model is too optimized for L_inf, at the cost of other notions of robustness like L_2 -- and their theorem says basically the same thing, that optimizing for L_inf comes at the cost of L_2 robustness.
Adversarial examplesA Geometric Perspective on the Transferability of Adversarial DirectionsZachary Charles, Harrison Rosenberg, Dimitris PapailiopoulosarXivAN #34
Adversarial examplesTowards the first adversarially robust neural network model on MNISTLukas Schott, Jonas Rauber, Matthias Bethge, Wieland BrendelarXivAN #27Dan HThis recent pre-print claims to make MNIST classifiers more adversarially robust to different L-p perturbations, while the previous paper only worked for L-infinity perturbations. The basic building block in their approach is a variational autoencoder, one for each MNIST class. Each variational autoencoder computes the likelihood of the input sample, and this information is used for classification. The authors also demonstrate that binarizing MNIST images can serve as strong defense against some perturbations. They evaluate against strong attacks and not just the fast gradient sign method.This paper has generated considerable excitement among my peers. Yet inference time with this approach is approximately 100,000 times that of normal inference (10^4 samples per VAE * 10 VAEs). Also unusual is that the L-infinity "latent descent attack" result is missing. It is not clear why training a single VAE does not work. Also, could results improve by adversarially training the VAEs? As with all defense papers, it is prudent to wait for third-party reimplementations and analysis, but the range of attacks they consider is certainly thorough.
Adversarial examplesTowards Deep Learning Models Resistant to Adversarial AttacksAleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian VladuICLRAN #27Dan HMadry et al.'s paper is a seminal work which shows that some neural networks can attain more adversarial robustness with a well-designed adversarial training procedure. The key idea is to phrase the adversarial defense problem as minimizing the expected result of the adversarial attack problem, which is maximizing the loss on an input training point when the adversary is allowed to perturb the point anywhere within an L-infinity norm ball. They also start the gradient descent from a random point in the norm ball. Then, given this attack, to optimize the adversarial defense problem, we simply do adversarial training. When trained long enough, some networks will attain more adversarial robustness.It is notable that this paper has survived third-party security analysis, so this is a solid contribution. This contribution is limited by the fact that its improvements are limited to L-infinity adversarial perturbations on small images, as [follow-up work]( has shown.
Adversarial examplesMotivating the Rules of the Game for Adversarial Example ResearchJustin Gilmer, Ryan P. Adams, Ian Goodfellow, David Andersen, George E. DahlarXivDaniel FilanAN #19
Adversarial examplesA learning and masking approach to secure learningLinh Nguyen, Sky Wang, Arunesh SinhaInternational Conference on Decision and Game Theory for Security 2018N/ARohinOne way to view the problem of adversarial examples is that adversarial attacks map "good" clean data points that are classified correctly into a nearby "bad" space that is low probability and so is misclassified. This suggests that in order to attack a model, we can use a neural net to _learn_ a transformation from good data points to bad ones. The loss function is easy -- one term encourages similarity to the original data point, and the other term encourages the new data point to have a different class label. Then, for any new input data point, we can simply feed it through the neural net to get an adversarial example.

Similarly, in order to defend a model, we can learn a neural net transformation that maps bad data points to good ones. The loss function continues to encourage similarity between the data points, but now encourages that the new data point have the correct label. Note that we need to use some attack algorithm in order to generate the bad data points that are used to train the defending neural net.
Ultimately, in the defense proposed here, the information on how to be more robust comes from the "bad" data points that are used to train the neural net. It's not clear why this would outperform adversarial training, where we train the original classifier on the "bad" data points. In fact, if the best way to deal with adversarial examples is to transform them to regular examples, then we could simply use adversarial training with a more expressive neural net, and it could learn this transformation.
Adversarial examplesAdversarial Policies: Attacking Deep Reinforcement LearningAdam Gleave, Michael Dennis, Neel Kant, Cody Wild, Sergey Levine, Stuart RussellarXivAN #70SudhanshuThis work demonstrates the existence of _adversarial policies_ of behaviour in high-dimensional, two-player zero-sum games. Specifically, they show that adversarially-trained agents ("Adv"), who can only affect a victim's observations of their (Adv's) states, can act in ways that confuse the victim into behaving suboptimally.

An adversarial policy is trained by reinforcement learning in a single-player paradigm where the victim is a black-box fixed policy that was previously trained via self-play to be robust to adversarial attacks. As a result, the adversarial policies learn to push the observations of the victim outside the training distribution, causing the victim to behave poorly. The adversarial policies do not actually behave intelligently, such as blocking or tackling the victim, but instead do unusual things like spasming in a manner that appears random to humans, curling into a ball or kneeling.

Further experiments showed that if the victim's observations of the adversary were removed, then the adversary was unable to learn such an adversarial policy. In addition, the victim's network activations were very different when playing against an adversarial policy relative to playing against a random or lifeless opponent. By comparing two similar games where the key difference was the number of adversary dimensions being observed, they showed that such policies were easier to learn in higher-dimensional games.
This work points to an important question about optimisation in high dimension continuous spaces: without guarantees on achieving solution optimality, how do we design performant systems that are robust to (irrelevant) off-distribution observations? By generating demonstrations that current methods are insufficient, it can inspire future work across areas like active learning, continual learning, fall-back policies, and exploration.

I had a tiny nit-pick: while the discussion is excellent, the paper doesn't cover whether this phenomenon has been observed before with discrete observation/action spaces, and why/why not, which I feel would be an important aspect to draw out. In a finite environment, the victim policy might have actually covered every possible situation, and thus be robust to such attacks; for continuous spaces, it is not clear to me whether we can _always_ find an adversarial attack.

In separate correspondence, author Adam Gleave notes that he considers these to be relatively low-dimensional -- even MNIST has way more dimensions -- so when comparing to regular adversarial examples work, it seems like multi-agent RL is harder to make robust than supervised learning.
Adversarial Policies website
Adversarial examplesE-LPIPS: Robust Perceptual Image Similarity via Random Transformation EnsemblesMarkus Kettunen et alarXivAN #66Dan HConvolutional neural networks are one of the best methods for assessing the perceptual similarity between images. This paper provides evidence that perceptual similarity metrics can be made adversarially robust. Out-of-the-box, network-based perceptual similarity metrics exhibit some adversarial robustness. While classifiers transform a long embedding vector to class scores, perceptual similarity measures compute distances between long and wide embedding tensors, possibly from multiple layers. Thus the attacker must alter far more neural network responses, which makes attacks on perceptual similarity measures harder for adversaries. This paper makes attacks even harder for the adversary by using a barrage of input image transformations and by using techniques such as dropout while computing the embeddings. This forces the adversarial perturbation to be substantially larger.
Adversarial examplesThe LogBarrier adversarial attack: making effective use of decision boundary informationChris Finlay, Aram-Alexandre Pooladian, Adam M. ObermanarXivAN #53Dan HRather than maximizing the loss of a model given a perturbation budget, this paper minimizes the perturbation size subject to the constraint that the model misclassify the example. This misclassification constraint is enforced by adding a logarithmic barrier to the objective, which they prevent from causing a loss explosion through through a few clever tricks. Their attack appears to be faster than the Carlini-Wagner attack.
[The code is here.](
Adversarial examplesQuantifying Perceptual Distortion of Adversarial ExamplesMatt Jordan, Naren Manoj, Surbhi Goel, Alexandros G. DimakisarXivAN #48Dan HThis paper takes a step toward more general adversarial threat models by combining adversarial additive perturbations small in an l_p sense with [spatially transformed adversarial examples](, among other other attacks. In this more general setting, they measure the size of perturbations by computing the [SSIM]( between clean and perturbed samples, which has limitations but is on the whole better than the l_2 distance. This work shows, along with other concurrent works, that perturbation robustness under some threat models does not yield robustness under other threat models. Therefore the view that l_p perturbation robustness must be achieved before considering other threat models is made more questionable. The paper also contributes a large code library for testing adversarial perturbation robustness.
Adversarial examplesOn the Sensitivity of Adversarial Robustness to Input Data DistributionsGavin Weiguang Ding, Kry Yik Chau Lui, Xiaomeng Jin, Luyu Wang, Ruitong HuangICLR 2019AN #48
Adversarial examplesTheoretically Principled Trade-off between Robustness and AccuracyHongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, Michael I. JordanarXivAN #44Dan HThis paper won the NeurIPS 2018 Adversarial Vision Challenge. For robustness on CIFAR-10 against l_infinity perturbations (epsilon = 8/255), it improves over the Madry et al. adversarial training baseline from 45.8% to 56.61%, making it [almost]( state-of-the-art. However, it does decrease clean set accuracy by a few percent, despite using a deeper network than Madry et al. Their technique has many similarities to Adversarial Logit Pairing, which is not cited, because they encourage the network to embed a clean example and an adversarial perturbation of a clean example similarly. I now describe Adversarial Logit Pairing. During training, ALP teaches the network to classify clean and adversarially perturbed points; added to that loss is an l_2 loss between the logit embeddings of clean examples and the logits of the corresponding adversarial examples. In contrast, in place of the l_2 loss from ALP, this paper uses the KL divergence from the softmax of the clean example to the softmax of an adversarial example. Yet the softmax distributions are given a high temperature, so this loss is not much different from an l_2 loss between logits. The other main change in this paper is that adversarial examples are generated by trying to maximize the aforementioned KL divergence between clean and adversarial pairs, not by trying to maximize the classification log loss as in ALP. This paper then shows that some further engineering to adversarial logit pairing can improve adversarial robustness on CIFAR-10.
Adversarial examplesAdversarial Vulnerability of Neural Networks Increases With Input DimensionCarl-Johann Simon-Gabriel, Yann Ollivier, Léon Bottou, Bernhard Schölkopf, David Lopez-PazarXivAN #36RohinThe key idea of this paper is that imperceptible adversarial vulnerability happens when small changes in the input lead to large changes in the output, suggesting that the gradient is large. They first recommend choosing ϵ_p to be proportional to d^(1/p). Intuitively, this is because larger values of p behave more like maxing instead of summing, and so using the same value of ϵ across values of p would lead to more points being considered for larger p. They show a link between adversarial robustness and regularization, which makes sense since both of these techniques aim for better generalization.

Their main point is that the norm of the gradient increases with the input dimension d. In particular, a typical initialization scheme will set the variance of the weights to be inversely proportional to d, which means the absolute value of each weight is inversely proportional to √d. For a single-layer neural net (that is, a perceptron), the gradient is exactly the weights. For L_inf adversarial robustness, the relevant norm for the gradient is the L_1 norm. This gives the sum of the d weights, which will be proportional to √d. For L_p adversarial robustness, the corresponding gradient is L_q with q larger than 1, which decreases the size of the gradient. However, this is exactly offset by the increase in the size of ϵ_p that they proposed. Thus, in this simple case the adversarial vulnerability increases with input dimension. They then prove theorems that show that this generalizes to other neural nets, including CNNS (albeit still only at initialization, not after training). They also perform experiments showing that their result also holds after training.
I suspect that there is some sort of connection between the explanation given in this paper and the explanation that there are many different perturbation directions in high-dimensional space which means that there are lots of potential adversarial examples, which increases the chance that you can find one. Their theoretical result comes primarily from the fact that weights are initialized with variance inversely proportional to d. We could eliminate this by having the variance be inversely proportional to d^2, in which case their result would say that adversarial vulnerability is constant with input dimension. However, in this case the variance of the activations would be inversely proportional to d, making it hard to learn. It seems like adversarial vulnerability should be the product of "number of directions", and "amount you can search in a direction", where the latter is related to the variance of the activations, making the connection to this paper.
Adversarial examplesRobustness via curvature regularization, and vice versaSeyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, Pascal FrossardarXivAN #35Dan HThis paper proposes a distinct way to increase adversarial perturbation robustness. They take an adversarial example generated with the FGSM, compute the gradient of the loss for the clean example and the gradient of the loss for the adversarial example, and they penalize this difference. Decreasing this penalty relates to decreasing the loss surface curvature. The technique works slightly worse than adversarial training.
Adversarial examplesIs Robustness [at] the Cost of Accuracy?Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, Yupeng GaoECCVAN #32Dan HThis work shows that older architectures such as VGG exhibit more adversarial robustness than newer models such as ResNets. Here they take adversarial robustness to be the average adversarial perturbation size required to fool a network. They use this to show that architecture choice matters for adversarial robustness and that accuracy on the clean dataset is not necessarily predictive of adversarial robustness. A separate observation they make is that adversarial examples created with VGG transfers far better than those created with other architectures. All of these findings are for models without adversarial training.
Adversarial examplesAdversarial Examples Are a Natural Consequence of Test Error in NoiseAnonymousOpenReviewAN #32Dan HThis paper argues that there is a link between model accuracy on noisy images and model accuracy on adversarial images. They establish this empirically by showing that augmenting the dataset with random additive noise can improve adversarial robustness reliably. To establish this theoretically, they use the Gaussian Isoperimetric Inequality, which directly gives a relation between error rates on noisy images and the median adversarial perturbation size. Given that measuring test error on noisy images is easy, given that claims about adversarial robustness are almost always wrong, and given the relation between adversarial noise and random noise, they suggest that future defense research include experiments demonstrating enhanced robustness on nonadversarial, noisy images.
Adversarial examplesRobustness May Be at Odds with AccuracyDimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, Aleksander MadryOpenReviewAN #32Dan HSince adversarial training can markedly reduce accuracy on clean images, one may ask whether there exists an inherent trade-off between adversarial robustness and accuracy on clean images. They use a simple model amenable to theoretical analysis, and for this model they demonstrate a trade-off. In the second half of the paper, they show adversarial training can improve feature visualization, which has been shown in several concurrent works.
Adversarial examplesAre adversarial examples inevitable?Ali Shafahi, W. Ronny Huang, Christoph Studer, Soheil Feizi, Tom GoldsteinInternational Conference on Learning Representations, 2019.AN #24
Adversarial examplesAdversarial Reprogramming of Sequence Classification Neural NetworksPaarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz KoushanfarAAAI-2019 Workshop on Engineering Dependable and Secure Machine Learning SystemsAN #23
Adversarial examplesFortified Networks: Improving the Robustness of Deep Networks by Modeling the Manifold of Hidden RepresentationsAlex Lamb, Jonathan Binas, Anirudh Goyal, Dmitriy Serdyuk, Sandeep Subramanian, Ioannis Mitliagkas, Yoshua BengioarXivAN #2
Adversarial examplesAdversarial Vision ChallengeWieland Brendel, Jonas Rauber, Alexey Kurakin, Nicolas Papernot, Veliqi, Marcel Salathé, Sharada P. Mohanty, Matthias BethgeNIPS 2018AN #19RohinThere will be a competition on adversarial examples for vision at NIPS 2018.
Adversarial examplesEvaluating and Understanding the Robustness of Adversarial Logit PairingLogan Engstrom, Andrew Ilyas, Anish AthalyearXivAN #18
Adversarial examplesBenchmarking Neural Network Robustness to Common Corruptions and Surface VariationsDan Hendrycks, Thomas G. DietterichICLR 2019AN #15RohinSee [Import AI](
Adversarial examplesAdversarial Reprogramming of Neural NetworksGamaleldin F. Elsayed, Ian Goodfellow, Jascha Sohl-DicksteinarXivAN #14
Adversarial examplesDefense Against the Dark Arts: An overview of adversarial example security research and future research directionsIan GoodfellowarXivAN #11
Adversarial examplesOn Evaluating Adversarial RobustnessNicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, Alexey KurakinNeurIPS SECML 2018AN #46
Adversarial examplesCharacterizing Adversarial Examples Based on Spatial Consistency Information for Semantic SegmentationChaowei Xiao, Ruizhi Deng, Bo Li, Fisher Yu, Mingyan Liu, and Dawn SongECCVAN #29Dan HThis paper considers adversarial attacks on segmentation systems. They find that segmentation systems behave inconsistently on adversarial images, and they use this inconsistency to detect adversarial inputs. Specifically, they take overlapping crops of the image and segment each crop. For overlapping crops of an adversarial image, they find that the segmentation are more inconsistent. They defend against one adaptive attack.
Adversarial examplesSpatially Transformed Adversarial ExamplesChaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, Dawn SongICLRAN #29Dan HMany adversarial attacks perturb pixel values, but the attack in this paper perturbs the pixel locations instead. This is accomplished with a smooth image deformation which has subtle effects for large images. For MNIST images, however, the attack is more obvious and not necessarily content-preserving (see Figure 2 of the paper).
Adversarial examplesAdversarial Logit PairingHarini Kannan, Alexey Kurakin, Ian GoodfellowarXivRecon #5
Adversarial examplesLearning to write programs that generate imagesS M Ali Eslami, Tejas Kulkarni, Oriol Vinyals, Yaroslav GaninDeepMind BlogRecon #5
Adversarial examplesIntrinsic Geometric Vulnerability of High-Dimensional Artificial IntelligenceLuca Bortolussi, Guido SanguinettiarXivAN #36
Adversarial examplesOn Adversarial Examples for Character-Level Neural Machine TranslationJavid Ebrahimi, Daniel Lowd, Dejing DouarXivAN #13
Adversarial examplesIdealised Bayesian Neural Networks Cannot Have Adversarial Examples: Theoretical and Empirical StudyYarin Gal, Lewis SmitharXivAN #10Rohin
Agent foundationsTheory of Ideal Agents, or of Existing Agents?John S WentworthAlignment ForumAN #66FloThere are at least two ways in which a theoretical understanding of agency can be useful: On one hand, such understanding can enable the **design** of an artificial agent with certain properties. On the other hand, it can be used to **describe** existing agents. While both perspectives are likely needed for successfully aligning AI, individual researchers face a tradeoff: either they focus their efforts on existence results concerning strong properties, which helps with design (e.g. most of <@MIRI's work on embedded agency@>(@Embedded Agents@)), or they work on proving weaker properties for a broad class of agents, which helps with description (e.g. [all logical inductors can be described as markets](, summarized next). The prioritization of design versus description is a likely crux in disagreements about the correct approach to developing a theory of agency.To facilitate productive discussions it seems important to disentangle disagreements about goals from disagreements about means whenever we can. I liked the clear presentation of this attempt to identify a common source of disagreements on the (sub)goal level.
Agent foundationsTroll BridgeAbram DemskiAlignment ForumAN #63RohinThis is a particularly clean exposition of the Troll Bridge problem in decision theory. In this problem, an agent is determining whether to cross a bridge guarded by a troll who will blow up the agent if its reasoning is inconsistent. It turns out that an agent with consistent reasoning can prove that if it crosses, it will be detected as inconsistent and blown up, and so it decides not to cross. This is rather strange reasoning about counterfactuals -- we'd expect perhaps that the agent is uncertain about whether its reasoning is consistent or not.
Agent foundationsSelection vs ControlAbram DemskiAlignment ForumAN #58RohinThe previous paper focuses on mesa optimizers that are explicitly searching across a space of possibilities for an option that performs well on some objective. This post argues that in addition to this "selection" model of optimization, there is a "control" model of optimization, where the model cannot evaluate all of the options separately (as in e.g. a heat-seeking missile, which can't try all of the possible paths to the target separately). However, these are not cleanly separated categories -- for example, a search process could have control-based optimization inside of it, in the form of heuristics that guide the search towards more likely regions of the search space.This is an important distinction, and I'm of the opinion that most of what we call "intelligence" is actually more like the "control" side of these two options.
Agent foundationsPavlov GeneralizesAbram DemskiAlignment ForumAN #52RohinIn the iterated prisoner's dilemma, the [Pavlov strategy]( is to start by cooperating, and then switch the action you take whenever the opponent defects. This can be generalized to arbitrary games. Roughly, an agent is "discontent" by default and chooses actions randomly. It can become "content" if it gets a high payoff, in which case it continues to choose whatever action it previously chose as long as the payoffs remain consistently high. This generalization achieves Pareto optimality in the limit, though with a very bad convergence rate. Basically, all of the agents start out discontent and do a lot of exploration, and as long as any one agent is discontent the payoffs will be inconsistent and all agents will tend to be discontent. Only when by chance all of the agents take actions that lead to all of them getting high payoffs do they all become content, at which point they keep choosing the same action and stay in the equilibrium.

Despite the bad convergence, the cool thing about the Pavlov generalization is that it only requires agents to notice when the results are good or bad for them. In contrast, typical strategies that aim to mimic Tit-for-Tat require the agent to reason about the beliefs and utility functions of other agents, which can be quite difficult to do. By just focusing on whether things are going well for themselves, Pavlov agents can get a lot of properties in environments with other agents that Tit-for-Tat strategies don't obviously get, such as exploiting agents that always cooperate. However, when thinking about <@logical time@>(@In Logical Time, All Games are Iterated Games@), it would seem that a Pavlov-esque strategy would have to make decisions based on a prediction about its own behavior, which is... not obviously doomed, but seems odd. Regardless, given the lack of work on Pavlov strategies, it's worth trying to generalize them further.
Agent foundationsCDT=EDT=UDTAbram DemskiAlignment ForumAN #42
Agent foundationsTwo senses of “optimizer”Joar SkalseAlignment ForumAN #63RohinThe first sense of "optimizer" is an optimization algorithm, that given some formally specified problem computes the solution to that problem, e.g. a SAT solver or linear program solver. The second sense is an algorithm that acts upon its environment to change it. Joar believes that people often conflate the two in AI safety.I agree that this is an important distinction to keep in mind. It seems to me that the distinction is whether the optimizer has knowledge about the environment: in canonical examples of the first kind of optimizer, it does not. If we somehow encoded the dynamics of the world as a SAT formula and asked a super-powerful SAT solver to solve for the actions that accomplish some goal, it would look like the second kind of optimizer.
Agent foundationsClarifying Logical CounterfactualsChris LeongLessWrongAN #43
Agent foundationsCDT Dutch BookAbram DemskiAlignment ForumAN #42
Agent foundationsAnthropic paradoxes transposed into Anthropic Decision TheoryStuart ArmstrongAlignment ForumAN #38
Agent foundationsAnthropic probabilities and cost functionsStuart ArmstrongAlignment ForumAN #38
Agent foundationsBounded Oracle InductionDiffractorAlignment ForumAN #35
Agent foundationsRobust program equilibriumCaspar OesterheldSpringerAN #35
Agent foundationsA Rationality Condition for CDT Is That It Equal EDT (Part 2)Abram DemskiAlignment ForumAN #28
Agent foundationsA Rationality Condition for CDT Is That It Equal EDT (Part 1)Abram DemskiAlignment ForumAN #27
Agent foundationsEDT solves 5 and 10 with conditional oraclesjessicataAlignment ForumAN #27
Agent foundationsAsymptotic Decision Theory (Improved Writeup)DiffractorAlignment ForumAN #26
Agent foundationsIn Logical Time, All Games are Iterated GamesAbram DemskiAlignment ForumAN #25RichardThe key difference between causal and functional decision theory is that the latter supplements the normal notion of causation with "logical causation". The decision of agent A can logically cause the decision of agent B even if B made their decision before A did - for example, if B made their decision by simulating A. Logical time is an informal concept developed to help reason about which computations cause which other computations: logical causation only flows forward through logical time in the same way that normal causation only flows forward through normal time (although maybe logical time turns out to be loopy). For example, when B simulates A, B is placing themselves later in logical time than A. When I choose not to move my bishop in a game of chess because I've noticed it allows a sequence of moves which ends in me being checkmated, then I am logically later than that sequence of moves. One toy model of logical time is based on proof length - we can consider shorter proofs to be earlier in logical time than longer proofs. It's apparently surprisingly difficult to find a case where this fails badly.

In logical time, all games are iterated games. We can construct a series of simplified versions of each game where each player's thinking time is bounded. As thinking time increases, the games move later in logical time, and so we can treat them as a series of iterated games whose outcomes causally affect all longer versions. Iterated games are fundamentally different from single-shot games: the [folk theorem]( states that virtually any outcome is possible in iterated games.
I like logical time as an intuitive way of thinking about logical causation. However, the analogy between normal time and logical time seems to break down in some cases. For example, suppose we have two boolean functions F and G, such that F = not G. It seems like G is logically later than F - yet we could equally well have defined them such that G = not F, which leads to the opposite conclusion. As Abram notes, logical time is intended as an intuition pump not a well-defined theory - yet the possibility of loopiness makes me less confident in its usefulness. In general I am pessimistic about the prospects for finding a formal definition of logical causation, for reasons I described in [Realism about Rationality](, which Rohin summarised above.
Agent foundationsCounterfactuals and reflective oraclesNisanAlignment ForumAN #23
Agent foundationsWhen wishful thinking worksAlex MennenAlignment ForumAN #23RohinSometimes beliefs can be loopy, in that the probability of a belief being true depends on whether you believe it. For example, the probability that a placebo helps you may depend on whether you believe that a placebo helps you. In the situation where you know this, you can "wish" your beliefs to be the most useful possible beliefs. In the case where the "true probability" depends continuously on your beliefs, you can use a fixed point theorem to find a consistent set of probabilities. There may be many such fixed points, in which case you can choose the one that would lead to highest expected utility (such as choosing to believe in the placebo). One particular application of this would be to think of the propositions as "you will take action a_i". In this case, you act the way you believe you act, and then every probability distribution over the propositions is a fixed point, and so we just choose the probability distribution (i.e. stochastic policy) that maximized expected utility, as usual. This analysis can also be carried to Nash equilibria, where beliefs in what actions you take will affect the actions that the other player takes.
Agent foundationsBottle Caps Aren't OptimisersDaniel FilanAlignment ForumAN #22RohinThe previous paper detects optimizers by studying their behavior. However, if the goal is to detect an optimizer before deployment, we need to determine whether an algorithm is performing optimization by studying its source code, _without_ running it. One definition that people have come up with is that an optimizer is something such that the objective function attains higher values than it otherwise would have. However, the author thinks that this definition is insufficient. For example, this would allow us to say that a bottle cap is an optimizer for keeping water inside the bottle. Perhaps in this case we can say that there are simpler descriptions of bottle caps, so those should take precedence. But what about a liver? We could say that a liver is optimizing for its owner's bank balance, since in its absence the bank balance is not going to increase.Here, we want a definition of optimization because we're worried about an AI being deployed, optimizing for some metric in the environment, and then doing something unexpected that we don't like but nonetheless does increase the metric (falling prey to Goodhart's law). It seems better to me to talk about "optimizer" and "agent" as models of predicting behavior, not something that is an inherent property of the thing producing the behavior. Under that interpretation, we want to figure out whether the agent model with a particular utility function is a good model for an AI system, by looking at its internals (without running it). It seems particularly important to be able to use this model to predict the behavior in novel situations -- perhaps that's what is needed to make the definition of optimizer avoid the counterexamples in this post. (A bottle cap definitely isn't going to keep water in containers if it is simply lying on a table somewhere.)
Agent foundationsComputational complexity of RL with trapsVadim KosoyAlignment ForumAN #22RohinA post asking about complexity theoretic results around RL, both with (unknown) deterministic and stochastic dynamics.
Agent foundationsCooperative OraclesDiffractorAlignment ForumAN #22
Agent foundationsCorrigibility doesn't always have a good action to takeStuart ArmstrongAlignment ForumAN #22RohinStuart has [previously argued]( that an AI could be put in situations where no matter what it does, it would affect the human's values. In this short post, he notes that if you then say that it is possible to have situations where the AI cannot act corrigibly, then other problems arise, such as how you can create a superintelligent corrigible AI that does anything at all (since any action that it takes would likely affect our values somehow).
Agent foundationsUsing expected utility for Good(hart)Stuart ArmstrongAlignment ForumAN #22RohinIf we include all of the uncertainty we have about human values into the utility function, then it seems possible to design an expected utility maximizer that doesn't fall prey to Goodhart's law. The post shows a simple example where there are many variables that may be of interest to humans, but we're not sure which ones. In this case, by incorporating this uncertainty into our proxy utility function, we can design an expected utility maximizer that has conservative behavior that makes sense.On the one hand, I'm sympathetic to this view -- for example, I see risk aversion as a heuristic leading to good expected utility maximization for bounded reasoners on large timescales. On the other hand, an EU maximizer still seems hard to align, because whatever utility function it gets, or distribution over utility functions, it will act as though that input is definitely true, which means that anything we fail to model will never make it into the utility function. If you could have some sort of "unresolvable" uncertainty, some reasoning (similar to the [problem of induction]( suggesting that you can never fully trust your own thoughts to be perfectly correct, that would make me more optimistic about an EU maximization based approach, but I don't think it can be done by just changing the utility function, or by adding a distribution over them.
Agent foundationsAgents and Devices: A Relative Definition of AgencyLaurent Orseau,
Simon McGregor McGill,
Shane Legg
arXivAN #22RohinThis paper considers the problem of modeling other behavior, either as an agent (trying to achieve some goal) or as a device (that reacts to its environment without any clear goal). They use Bayesian IRL to model behavior as coming from an agent optimizing a reward function, and design their own probability model to model the behavior as coming from a device. They then use Bayes rule to decide whether the behavior is better modeled as an agent or as a device. Since they have a uniform prior over agents and devices, this ends up choosing the one that better fits the data, as measured by log likelihood.

In their toy gridworld, agents are navigating towards particular locations in the gridworld, whereas devices are reacting to their local observation (the type of cell in the gridworld that they are currently facing, as well as the previous action they took). They create a few environments by hand which demonstrate that their method infers the intuitive answer given the behavior.
In their experiments, they have two different model classes with very different inductive biases, and their method correctly switches between the two classes depending on which inductive bias works better. One of these classes is the maximization of some reward function, and so we call that the agent class. However, they also talk about using the Solomonoff prior for devices -- in that case, even if we have something we would normally call an agent, if it is even slightly suboptimal, then with enough data the device explanation will win out.

I'm not entirely sure why they are studying this problem in particular -- one reason is explained in the next post, I'll write more about it in that section.
Agent foundationsReducing collective rationality to individual optimization in common-payoff games using MCMCjessicataAlignment ForumAN #21RohinGiven how hard multiagent cooperation is, it would be great if we could devise an algorithm such that each agent is only locally optimizing their own utility (without requiring that anyone else change their policy), that still achieves the globally optimal policy. This post considers the case where all players have the same utility function in an iterated game. In this case, we can define a process where at every timestep, one agent is randomly selected, and that agent changes their action in the game uniformly at random with probability that depends on how much utility was just achieved. This depends on a rationality parameter α -- the higher α is, the more likely it is for the player to stick with a high utility action.

This process allows you to reach every possible joint action from every other possible joint action with some non-zero probability, so in the limit of running this process forever, you will end up visiting every state infinitely often. However, by cranking up the value of α, we can ensure that in the limit we spend most of the time in the high-value states and rarely switch to anything lower, which lets us get arbitrarily close to the optimal deterministic policy (and so arbitrarily close to the optimal expected value).
I like this, it's an explicit construction that demonstrates how you can play with the explore-exploit tradeoff in multiagent settings. Note that when α is set to be very high (the condition in which we get near-optimal outcomes in the limit), there is very little exploration, and so it will take a long time before we actually find the optimal outcome in the first place. It seems like this would make it hard to use in practice, but perhaps we could replace the exploration with reasoning about the game and other agents in it? The author was planning to use reflective oracles to do something like this if I understand correctly.
Agent foundationsComputational efficiency reasons not to model VNM-rational preference relations with utility functionsAlex MennenLessWrongAN #17RohinRealistic agents don't use utility functions over world histories to make decisions, because it is computationally infeasible, and it's quite possible to make a good decision by only considering the local effects of the decision. For example, when deciding whether or not to eat a sandwich, we don't typically worry about the outcome of a local election in Siberia. For the same computational reasons, we wouldn't want to use a utility function to model other agents. Perhaps a utility function is useful for measuring the strength of an agent's preference, but even then it is really measuring the ratio of the strength of the agent's preference to the strength of the agent's preference over the two reference points used to determine the utility function.I agree that we certainly don't want to model other agents using full explicit expected utility calculations because it's computationally infeasible. However, as a first approximation it seems okay to model other agents as computationally bounded optimizers of some utility function. It seems like a bigger problem to me that any such model predicts that the agent will never change its preferences (since that would be bad according to the current utility function).
Agent foundationsStable Pointers to Value III: Recursive QuantilizationAbram DemskiLessWrongAN #17RohinWe often try to solve alignment problems by going a level meta. For example, instead of providing feedback on what the utility function is, we might provide feedback on how to best learn what the utility function is. This seems to get more information about what safe behavior is. What if we iterate this process? For example, in the case of quantilizers with three levels of iteration, we would do a quantilized search over utility function generators, then do a quantilized search over the generated utility functions, and then do a quantilized search to actually take actions.The post mentions what seems like the most salient issue -- that it is really hard for humans to give feedback even a few meta levels up. How do you evaluate a thing that will create a distribution over utility functions? I might go further -- I'm not even sure there is good normative feedback on the meta level(s). There is feedback we can give on the meta level for any particular object-level instance, but it seems not at all obvious (to me) that this advice will generalize well to other object-level instances. On the other hand, it does seem to me that the higher up you are in meta-levels, the smaller the space of concepts and the easier it is to learn. So maybe my overall take is that it seems like we can't depend on humans to give meta-level feedback well, but if we can figure out how to either give better feedback or learn from noisy feedback, it would be easier to learn and likely generalize better.
Agent foundationsBuridan's ass in coordination gamesjessicataLessWrongAN #16RohinSuppose two agents have to coordinate to choose the same action, X or Y, where X gives utility 1 and Y gives utility u, for some u in [0, 2]. (If the agents fail to coordinate, they get zero utility.) If the agents communicate, decide on policies, then observe the value of u with some noise ϵ, and then execute their policies independently, there must be some u for which they lose out on significant utility. Intuitively, the proof is that at u = 0, you should say X, and at u = 2, you should say Y, and there is some intermediate value where you are indifferent between the two (equal probability of choosing X or Y), meaning that 50% of the time you will fail to coordinate. However, if you have a shared source of randomness (after observing the value of u), then you can correlate your decisions using the randomness in order to do much better.Cool result, and quite easy to understand. As usual I don't want to speculate on relevance to AI alignment because it's not my area.
Agent foundationsProbability is Real, and Value is ComplexAbram DemskiLessWrongAN #16RohinIf you interpret events as vectors on a graph, with probability on the x-axis and probability * utility on the y-axis, then any rotation of the vectors preserves the preference relation, so that you will make the same decision. This means that from decisions, you cannot distinguish between rotations, which intuitively means that you can't tell if a decision was made because it had a low probability of high utility, or medium probability of medium utility, for example. As a result, beliefs and utilities are inextricably linked, and you can't just separate them. Key quote: "Viewing [probabilities and utilities] in this way makes it somewhat more natural to think that probabilities are more like "caring measure" expressing how much the agent cares about how things go in particular worlds, rather than subjective approximations of an objective "magical reality fluid" which determines what worlds are experienced."I am confused. If you want to read my probably-incoherent confused opinion on it, it's [here](
Bayesian Utility: Representing Preference by Probability Measures
Agent foundationsBayesian Probability is for things that are Space-like Separated from YouScott GarrabrantLessWrongAN #15RohinWhen an agent has uncertainty about things that either influenced which algorithm the agent is running (the agent's "past") or about things that will be affected by the agent's actions (the agent's "future"), you may not want to use Bayesian probability. Key quote: "The problem is that the standard justifications of Bayesian probability are in a framework where the facts that you are uncertain about are not in any way affected by whether or not you believe them!" This is not the case for events in the agent's "past" or "future". So, you should only use Bayesian probability for everything else, which are "space-like separated" from you (in analogy with space-like separation in relativity).I don't know much about the justifications for Bayesianism. However, I would expect any justification to break down once you start to allow for sentences where the agent's degree of belief in the sentence affects its truth value, so the post makes sense given that intuition.
Agent foundationsComplete Class: Consequentialist FoundationsAbram DemskiLessWrongAN #15RohinAn introduction to "complete class theorems", which can be used to motivate the use of probabilities and decision theory.This is cool, and I do want to learn more about complete class theorems. The post doesn't go into great detail on any of the theorems, but from what's there it seems like these theorems would be useful for figuring out what things we can argue from first principles (akin to the VNM theorem and dutch book arguments).
Agent foundationsCounterfactual Mugging Poker GameScott GarrabrantLessWrongAN #11RohinThis is a variant of counterfactual mugging, in which an agent doesn't take the action that is locally optimal, because that would provide information in the counterfactual world where one aspect of the environment was different that would lead to a large loss in that setting.This example is very understandable and very short -- I haven't summarized it because I don't think I can make it any shorter.
Agent foundationsWeak arguments against the universal prior being malignX4vierLessWrongAN #11RohinIn an [earlier post](, Paul Christiano has argued that if you run Solomonoff induction and use its predictions for important decisions, most of your probability mass will be placed on universes with intelligent agents that make the right predictions so that their predictions will influence your decisions, and then use that influence to manipulate you into doing things that they value. This post makes a few arguments that this wouldn't actually happen, and Paul responds to the arguments in the comments.I still have only a fuzzy understanding of what's going on here, so I'm going to abstain from an opinion on this one.
What does the universal prior actually look like?
Agent foundationsPrisoners' Dilemma with Costs to ModelingScott GarrabrantLessWrongAN #10RohinOpen source game theory looks at the behavior of agents that have access to each other's source code. A major result is that we can define an agent FairBot that will cooperate with itself in the prisoner's dilemma, yet can never be exploited. Later, we got PrudentBot, which still cooperates with FairBots, but will defect against CooperateBots (which always cooperate) since it can at no cost to itself. Given this, you would expect that if you evolved a population of such bots, you'd hopefully get an equilibrium in which everyone cooperates with each other, since they can do so robustly without falling prey to DefectBots (which always defect). However, being a FairBot or PrudentBot is costly -- you have to think hard about the opponent and prove things about them, it's a lot easier to rely on everyone else to punish the DefectBots and become a CooperateBot yourself. In this post, Scott analyzes the equilibria in the two person prisoner's dilemma with small costs to play bots that have to prove things. It turns out that in addition to the standard Defect-Defect equilibirum, there are two mixed strategy equilibria, including one that leads to generally cooperative behavior -- and if you evolve agents to play this game, they generally stay in the vicinity of this good equilibrium, for a range of initial conditions.This is an interesting result. I continue to be surprised at how robust this Lobian cooperative behavior seems to be -- while I used to think that humans could only cooperate with each other because of prosocial tendencies that meant that we were not fully selfish, I'm now leaning more towards the theory that we are simply very good at reading other people, which gives us insight into them, and leads to cooperative behavior in a manner similar to Lobian cooperation.
[Robust Cooperation in the Prisoner's Dilemma]( and/or [Open-source game theory is weird](
Agent foundations2018 research plans and predictionsRob BensingerMIRI BlogAN #1RohinScott and Nate from MIRI score their predictions for research output in 2017 and make predictions for research output in 2018.I don't know enough about MIRI to have any idea what the predictions mean, but I'd still recommend reading it if you're somewhat familiar with MIRI's technical agenda to get a bird's-eye view of what they have been focusing on for the last year.
A basic understanding of MIRI's technical agenda (eg. what they mean by naturalized agents, decision theory, Vingean reflection, and so on).
Agent foundationsMarkets are Universal for Logical InductionJohn S WentworthAlignment ForumAN #66RohinA logical inductor is a system that assigns probabilities to logical statements (such as "the millionth digit of pi is 3") over time, that satisfies the _logical induction criterion_: if we interpret the probabilities as prices of contracts that pay out $1 if the statement is true and $0 otherwise, then there does not exist a polynomial-time trader function with bounded money that can make unbounded returns over time. The [original paper]( shows that logical inductors exist. This post proves that for any possible logical inductor, there exists some market of traders that produces the same prices as the logical inductor over time.
Agent foundationsApproval-directed agency and the decision theory of Newcomb-like problemsCaspar OesterheldFRI WebsiteAN #52
Agent foundationsNo surjection onto function space for manifold XStuart ArmstrongAlignment ForumAN #41
Agent foundationsFailures of UDT-AIXI, Part 1: Improper RandomizingDiffractorAlignment ForumAN #40
Agent foundationsRobust program equilibriumCaspar OesterheldFRI WebsiteAN #39RohinIn a prisoner's dilemma where you have access to an opponent's source code, you can hope to achieve cooperation by looking at how the opponent would perform against you. Naively, you could simply simulate what the opponent would do given your source code, and use that to make your decision. However, if your opponent also tries to simulate you, this leads to an infinite loop. The key idea of this paper is to break the infinite loop by introducing a small probability of guaranteed cooperation (without simulating the opponent), so that eventually after many rounds of simulation the recursion "bottoms out" with guaranteed cooperation. They explore what happens when applying this idea to the equivalents of FairBot/Tit-for-Tat strategies when you are simulating the opponent.
Agent foundationsOracle Induction ProofsDiffractorAlignment ForumAN #35
Agent foundationsDimensional regret without resetsVadim KosoyAlignment ForumAN #33
Agent foundationsWhat are Universal Inductors, Again?DiffractorAlignment ForumAN #32
Agent foundationsWhen EDT=CDT, ADT Does WellDiffractorAlignment ForumAN #30
Agent foundationsRamsey and Joyce on deliberation and predictionYang Liu, Hew PriceCSER WebsiteAN #27
Agent foundationsDoubts about UpdatelessnessAlex AppelAgent FoundationsAN #5
Agent foundationsComputing an exact quantilal policyVadim KosoyAgent FoundationsAN #3
Agent foundationsLogical Counterfactuals & the Cooperation GameChris LeongAlignment ForumAN #20
Agent foundationsNo Constant Distribution Can be a Logical InductorAlex AppelAgent FoundationsAN #2