CategoryHighlight?TitleAuthorsVenueYearH/TEmailSummarizerSummaryMy opinionPrerequisitesRead more
Adversarial examplesHighlightAdversarial Examples Are Not Bugs, They Are FeaturesAndrew Ilyas*, Shibani Santurkar*, Dimitris Tsipras*, Logan Engstrom*, Brandon Tran, Aleksander MadryarXiv2019AN #62Rohin_Distill published a discussion of this paper. This highlights section will cover the full discussion; all of these summaries and opinions are meant to be read together._

Consider two possible explanations of adversarial examples. First, they could be caused because the model "hallucinates" a signal that is not useful for classification, and it becomes very sensitive to this feature. We could call these "bugs", since they don't generalize well. Second, they could be caused by features that _do_ generalize to the test set, but _can_ be modified by an adversarial perturbation. We could call these "non-robust features" (as opposed to "robust features", which can't be changed by an adversarial perturbation). The authors argue that at least some adversarial perturbations fall into the second category of being informative but sensitive features, based on two experiments.

If the "hallucination" explanation were true, the hallucinations would presumably be caused by the training process, the choice of architecture, the size of the dataset, **but not by the type of data**. So one thing to do would be to see if we can construct a dataset such that a model trained on that dataset is _already_ robust, without adversarial training. The authors do this in the first experiment. They take an adversarially trained robust classifier, and create images whose features (final-layer activations of the robust classifier) match the features of some unmodified input. The generated images only have robust features because the original classifier was robust, and in fact models trained on this dataset are automatically robust.

If the "non-robust features" explanation were true, then it should be possible for a model to learn on a dataset containing only non-robust features (which will look nonsensical to humans) and **still generalize to a normal-looking test set**. In the second experiment (henceforth WrongLabels), the authors construct such a dataset. Their hypothesis is that adversarial perturbations work by introducing non-robust features of the target class. So, to construct their dataset, they take an image x with original label y, adversarially perturb it towards some class y' to get image x', and then add (x', y') to their dataset (even though to a human x' looks like class y). They have two versions of this: in RandLabels, the target class y' is chosen randomly, whereas in DetLabels, y' is chosen to be y + 1. For both datasets, if you train a new model on the dataset, you get good performance **on the original test set**, showing that the "non-robust features" do generalize.
I buy this hypothesis. It explains why adversarial examples occur ("because they are useful to reduce loss"), and why they transfer across models ("because different models can learn the same non-robust features"). In fact, the paper shows that architectures that did worse in ExpWrongLabels (and so presumably are bad at learning non-robust features) are also the ones to which adversarial examples transfer the least. I'll leave the rest of my opinion to the opinions on the responses.
[Paper]( and [Author response](
Adversarial examplesHighlightResponse: Learning from Incorrectly Labeled DataEric WallaceDistill2019AN #62RohinThis response notes that all of the experiments are of the form: create a dataset D that is consistent with a model M; then, when you train a new model M' on D you get the same properties as M. Thus, we can interpret these experiments as showing that [model distillation]( can work even with data points that we would naively think of "incorrectly labeled". This is a more general phenomenon: we can take an MNIST model, select _only_ the examples for which the top prediction is incorrect (labeled with these incorrect top predictions), and train a new model on that -- and get nontrivial performance on the original test set, even though the new model has never seen a "correctly labeled" example.
I definitely agree that these results can be thought of as a form of model distillation. I don't think this detracts from the main point of the paper: the _reason_ model distillation works even with incorrectly labeled data is probably because the data is labeled in such a way that it incentivizes the new model to pick out the same features that the old model was paying attention to.
Adversarial examplesHighlightResponse: Robust Feature LeakageGabriel GohDistill2019AN #62RohinThis response investigates whether the datasets in WrongLabels could have had robust features. Specifically, it checks whether a linear classifier over provably robust features trained on the WrongLabels dataset can get good accuracy on the _original_ test set. This shouldn't be possible since WrongLabels is meant to correlate only non-robust features with labels. It finds that you _can_ get some accuracy with RandLabels, but you don't get much accuracy with DetLabels.

The original authors can actually explain this: intuitively, you get accuracy with RandLabels because it's less harmful to choose labels randomly than to choose them explicitly incorrectly. With random labels on unmodified inputs, robust features should be completely uncorrelated with accuracy. However, with random labels _followed by an adversarial perturbation towards the label_, there can be some correlation, because the adversarial perturbation can add "a small amount" of the robust feature. However, in DetLabels, the labels are _wrong_, and so the robust features are _negatively correlated_ with the true label, and while this can be reduced by an adversarial perturbation, it can't be reversed (otherwise it wouldn't be robust).
The original authors' explanation of these results is quite compelling; it seems correct to me.
Adversarial examplesHighlightResponse: Adversarial Examples are Just Bugs, TooPreetum NakkiranDistill2019AN #62RohinThe main point of this response is that adversarial examples can be bugs too. In particular, if you construct adversarial examples that explicitly _don't_ transfer between models, and then run ExpWrongLabels with such adversarial perturbations, then the resulting model doesn't perform well on the original test set (and so it must not have learned non-robust features).

It also constructs a data distribution where **every useful feature _of the optimal classifer_ is guaranteed to be robust**, and shows that we can still get adversarial examples with a typical model, showing that it is not just non-robust features that cause adversarial examples.

In their response, the authors clarify that they didn't intend to claim that adversarial examples could not arise due to "bugs", just that "bugs" were not the only explanation. In particular, they say that their main thesis is “adversarial examples will not just go away as we fix bugs in our models”, which is consistent with the point in this response.
Amusingly, I think I'm more bullish on the original paper's claims than the authors themselves. It's certainly true that adversarial examples can arise from "bugs": if your model overfits to your data, then you should expect adversarial examples along the overfitted decision boundary. The dataset constructed in this response is a particularly clean example: the optimal classifier would have an accuracy of 90%, but the model is trained to accuracy 99.9%, which means it must be overfitting.

However, I claim that with large and varied datasets with neural nets, we are typically not in the regime where models overfit to the data, and the presence of "bugs" in the model will decrease. (You certainly _can_ get a neural net to be "buggy", e.g. by randomly labeling the data, but if you're using real data with a natural task then I don't expect it to happen to a significant degree.) Nonetheless, adversarial examples persist, because the features that models use are not the ones that humans use.

It's also worth noting that this experiment strongly supports the hypothesis that adversarial examples transfer because they are real features that generalize to the test set.
Adversarial examplesHighlightResponse: Adversarial Example Researchers Need to Expand What is Meant by ‘Robustness’Justin Gilmer, Dan HendrycksDistill2019AN #62RohinThis response argues that the results in the original paper are simply a consequence of a generally accepted principle: "models lack robustness to distribution shift because they latch onto superficial correlations in the data". This isn't just about L_p norm ball adversarial perturbations: for example, one [recent paper]( shows that if the model is only given access to high frequency features of images (which look uniformly grey to humans), it can still get above 50% accuracy. In fact, when we do adversarial training to become robust to L_p perturbations, then the model pays attention to different non-robust features and becomes more vulnerable to e.g. [low-frequency fog corruption]( The authors call for adversarial examples researchers to move beyond L_p perturbations and think about the many different ways models can be fragile, and to make them more robust to distributional shift.
I strongly agree with the worldview behind this response, and especially the principle they identified. I didn't know this was a generally accepted principle, though of course I am not an expert on distributional robustness.

One thing to note is what is meant by "superficial correlation" here. I interpret it to mean a correlation that really does exist in the dataset, that really does generalize to the test set, but that _doesn't_ generalize out of distribution. A better term might be "fragile correlation". All of the experiments so far have been looking at within-distribution generalization (aka generalization to the test set), and are showing that non-robust features _do_ generalize within-distribution. By my understanding, this response is arguing that there are many such non-robust features that will generalize within-distribution but will not generalize under distributional shift, and we need to make our models robust to all of them, not just L_p adversarial perturbations.
Adversarial examplesHighlightResponse: Two Examples of Useful, Non-Robust FeaturesGabriel GohDistill2019AN #62RohinThis response studies linear features, since we can analytically compute their usefulness and robustness. It plots the singular vectors of the data as features, and finds that such features are either robust and useful, or non-robust and not useful. However, you can get useful, non-robust features by ensembling or contamination (see response for details).
Adversarial examplesHighlightResponse: Adversarially Robust Neural Style TransferReiichiro NakanoDistill2019AN #62RohinThe original paper showed that adversarial examples don't transfer well to VGG, and that VGG doesn't tend to learn similar non-robust features as a ResNet. Separately, VGG works particularly well for style transfer. Perhaps since VGG doesn't capture non-robust features as well, the results of style transfer look better to humans? This response and the author's response investigate this hypothesis in more detail and find that it seems broadly supported, but there are still finnicky details to be worked out.
This is an intriguing empirical fact. However, I don't really buy the theoretical argument that style transfer works because it doesn't use non-robust features, since I would typically expect that a model that doesn't use L_p-fragile features would instead use features that are fragile or non-robust in some other way.
Adversarial examplesHighlightFeature Denoising for Improving Adversarial RobustnessCihang Xie, Yuxin Wu, Laurens van der Maaten, Alan Yuille, Kaiming HeICML 20182018AN #49Dan HThis paper claims to obtain nontrivial adversarial robustness on ImageNet. Assuming an adversary can add perturbations of size 16/255 (l_infinity), previous adversarially trained classifiers could not obtain above 1% adversarial accuracy. Some groups have tried to break the model proposed in this paper, but so far it appears its robustness is close to what it claims, [around]( 40% adversarial accuracy. Vanilla adversarial training is how they obtain said adversarial robustness. There has only been one previous public attempt at applying (multistep) adversarial training to ImageNet, as those at universities simply do not have the GPUs necessary to perform adversarial training on 224x224 images. Unlike the previous attempt, this paper ostensibly uses better hyperparameters, possibly accounting for the discrepancy. If true, this result reminds us that hyperparameter tuning can be critical even in vision, and that improving adversarial robustness on large-scale images may not be possible outside industry for many years.
Adversarial examplesHighlightConstructing Unrestricted Adversarial Examples with Generative ModelsYang Song, Rui Shu, Nate Kushman, Stefano ErmonNeurIPS 20182018AN #39RohinThis paper predates the [unrestricted adversarial examples challenge]( ([AN #24]( and shows how to generate such unrestricted adversarial examples using generative models. As a reminder, most adversarial examples research is focused on finding imperceptible perturbations to existing images that cause the model to make a mistake. In contrast, unrestricted adversarial examples allow you to find _any_ image that humans will reliably classify a particular way, where the model produces some other classification.

The key idea is simple -- train a GAN to generate images in the domain of interest, and then create adversarial examples by optimizing an image to simultaneously be "realistic" (as evaluated by the generator), while still being misclassified by the model under attack. The authors also introduce another term into the loss function that minimizes deviation from a randomly chosen noise vector -- this allows them to get diverse adversarial examples, rather than always converging to the same one.

They also consider a "noise-augmented" attack, where in effect they are running the normal attack they have, and then running a standard attack like FGSM or PGD afterwards. (They do these two things simultaneously, but I believe it's nearly equivalent.)

For evaluation, they generate adversarial examples with their method and check that humans on Mechanical Turk reliably classify the examples as a particular class. Unsurprisingly, their adversarial examples "break" all existing defenses, including the certified defenses, though to be clear existing defenses assume a different threat model where an adversarial example must be an imperceptible perturbation to one of a known set of images. You could imagine doing something similar by taking the imperceptible-perturbation attacks and raise the value of ϵ until it is perceptible -- but in this case the generated images are much less realistic.
This is the clear first thing to try with unrestricted adversarial examples, and it seems to work reasonably well. I'd love to see whether adversarial training with these sorts of adversarial examples works as a defense against both this attack and standard imperceptible-perturbation attacks. In addition, it would be interesting to see if humans could direct or control the search for unrestricted adversarial examples.
Adversarial examplesHighlightMotivating the Rules of the Game for Adversarial Example ResearchJustin Gilmer, Ryan P. Adams, Ian Goodfellow, David Andersen, George E. Dahl2018 IEEE/RSJ International Conference on Intelligent Robots and Systems2018AN #28Dan HIn this position paper, the authors argue that many of the threat models which motivate adversarial examples are unrealistic. They enumerate various previously proposed threat models, and then they show their limitations or detachment from reality. For example, it is common to assume that an adversary must create an imperceptible perturbation to an example, but often attackers can input whatever they please. In fact, in some settings an attacker can provide an input from the clean test set that is misclassified. Also, they argue that adversarial robustness defenses which degrade clean test set error are likely to make systems less secure since benign or nonadversarial inputs are vastly more common. They recommend that future papers motivated by adversarial examples take care to define the threat model realistically. In addition, they encourage researchers to establish “content-preserving” adversarial attacks (as opposed to “imperceptible” l_p attacks) and improve robustness to unseen input transformations.
This is my favorite paper of the year as it handily counteracts much of the media coverage and research lab PR purporting ``doom'' from adversarial examples. While there are some scenarios in which imperceptible perturbations may be a motivation---consider user-generated privacy-creating perturbations to Facebook photos which stupefy face detection algorithms---much of the current adversarial robustness research optimizing small l_p ball robustness can be thought of as tackling a simplified subproblem before moving to a more realistic setting. Because of this paper, new tasks such as [Unrestricted Adversarial Examples]( ([AN #24]( take an appropriate step toward increasing realism without appearing to make the problem too hard.
Adversarial examplesHighlightIntroducing the Unrestricted Adversarial Examples ChallengeTom B. Brown and Catherine Olsson, Reserach Engineers, Google Brain TeamGoogle AI Blog2018AN #24RohinThere's a new adversarial examples contest, after the one from NIPS 2017. The goal of this contest is to figure out how to create a model that never confidently makes a mistake on a very simple task, even in the presence of a powerful adversary. This leads to many differences from the previous contest. The task is a lot simpler -- classifiers only need to distinguish between bicycles and birds, with an option of saying "ambiguous". Instead of using the L-infinity norm ball to define what an adversarial example is, attackers are allowed to supply any image whatsoever, as long as a team of human evaluators agrees unanimously on the classification of the image. The contest has no time bound, and will run until some defense survives for 90 days without being broken even once. A defense is not broken if it says "ambiguous" on an adversarial example. Any submitted defense will be published, which means that attackers can specialize their attacks to that specific model (i.e. it is white box).
I really like this contest format, it seems like it's actually answering the question we care about, for a simple task. If I were designing a defense, the first thing I'd aim for would be to get a lot of training data, ideally from different distributions in the real world, but data augmentation techniques may also be necessary, especially for eg. images of a bicycle against an unrealistic textured background. The second thing would be to shrink the size of the model, to make it more likely that it generalizes better (in accordance with Occam's razor or the minimum description length principle). After that I'd think about the defenses proposed in the literature. I'm not sure how the verification-based approaches will work, since they are intrinsically tied to the L-infinity norm ball definition of adversarial examples, or something similar -- you can't include the human evaluators in your specification of what you want to verify.
Adversarial examplesHighlightAdversarial Attacks and Defences CompetitionAlexey Kurakin, Ian Goodfellow, Samy Bengio, Yinpeng Dong, Fangzhou Liao, Ming Liang, Tianyu Pang, Jun Zhu, Xiaolin Hu, Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, Alan Yuille, Sangxia Huang, Yao Zhao, Yuzhe Zhao, Zhonglin Han, Junjiajia Long, Yerkebulan Berdibekov, Takuya Akiba, Seiya Tokui, Motoki AbeThe NIPS '17 Competition: Building Intelligent Systems2018AN #1RohinThis is a report on a competition held at NIPS 2017 for the best adversarial attacks and defences. It includes a summary of the field and then shows the results from the competition.
I'm not very familiar with the literature on adversarial examples and so I found this very useful as an overview of the field, especially since it talks about the advantages and disadvantages of different methods, which are hard to find by reading individual papers. The actual competition results are also quite interesting -- they find that the best attacks and defences are both quite successful on average, but have very bad worst-case performance (that is, the best defence is still very weak against at least one attack, and the best attack fails to attack at least one defence). Overall, this paints a bleak picture for defence, at least if the attacker has access to enough compute to actually try out different attack methods, and has a way of verifying whether the attacks succeed.
Adversarial examplesPhysically Realistic Attacks on Deep Reinforcement LearningAdam GleaveBAIR Blog2020AN #93RohinThis is a blog post for a previously summarized paper, <@Adversarial Policies: Attacking Deep Reinforcement Learning@>.
Adversarial examplesRobustness beyond Security: Representation LearningLogan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Aleksander MadryarXiv2019AN #68CodyEarlier this year, a <@provocative paper@>(@Adversarial Examples Are Not Bugs, They Are Features@) out of MIT claimed that adversarial perturbations weren’t just spurious correlations, but were, at least in some cases, features that generalize to the test set. A subtler implied point of the paper was that robustness to adversarial examples wasn’t a matter of resolving the model’s misapprehensions, but rather one of removing the model’s sensitivity to features that would be too small for a human to perceive. If we do this via adversarial training, we get so-called “robust representations”. The same group has now put out another paper, asking the question: are robust representations also human-like representations?

To evaluate how human-like the representations are, they propose the following experiment: take a source image, and optimize it until its representations (penultimate layer activations) match those of some target image. If the representations are human-like, the result of this optimization should look (to humans) very similar to the target image. (They call this property “invertibility”.) Normal image classifiers fail miserably at this test: the image looks basically like the source image, making it a classic adversarial example. Robust models on the other hand pass the test, suggesting that robust representations usually are human-like. They provide further evidence by showing that you can run feature visualization without regularization and get meaningful results (existing methods result in noise if you don’t regularize).
I found this paper clear, well-written, and straightforward in its empirical examination of how the representations learned by standard and robust models differ. I also have a particular interest in this line of research, since I have thought for a while that we should be more clear about the fact that adversarially-susceptible models aren’t wrong in some absolute sense, but relative to human perception in particular.

**Rohin’s opinion:** I agree with Cody above, and have a few more thoughts.

Most of the evidence in this paper suggests that the learned representations are “human-like” in the sense that two images that have similar representations must also be perceptually similar (to humans). That is, by enforcing that “small change in pixels” implies “small change in representations”, you seem to get for free the converse: “small change in representations” implies “small change in pixels”. This wasn’t obvious to me: a priori, each feature could have corresponded to 2+ “clusters” of inputs.

The authors also seem to be making a claim that the representations are semantically similar to the ones humans use. I don’t find the evidence for this as compelling. For example, they claim that when putting the “stripes” feature on a picture of an animal, only the animal gets the stripes and not the background. However, when I tried it myself in the interactive visualization, it looked like a lot of the background was also getting stripes.

One typical regularization for [feature visualization]( is to jitter the image while optimizing it, which seems similar to selecting for robustness to imperceptible changes, so it makes sense that using robust features helps with feature visualization. That said, there are several other techniques for regularization, and the authors didn’t need any of them, which is very interesting. On the other hand, their visualizations don't look as good to me as those from other papers.
Paper: Adversarial Robustness as a Prior for Learned Representations
Adversarial examplesRobustness beyond Security: Computer Vision ApplicationsShibani Santurkar*, Dimitris Tsipras*, Brandon Tran*, Andrew Ilyas*, Logan Engstrom*, Aleksander MadryarXiv2019AN #68RohinSince a robust model seems to have significantly more "human-like" features (see post above), it should be able to help with many of the tasks in computer vision. The authors demonstrate results on image generation, image-to-image translation, inpainting, superresolution and interactive image manipulation: all of which are done simply by optimizing the image to maximize the probability of a particular class label or the value of a particular learned feature.
This provides more evidence of the utility of robust features, though all of the comments from the previous paper apply here as well. In particular, looking at the results, my non-expert guess is that they are probably not state-of-the-art (but it's still interesting that one simple method is able to do well on all of these tasks).
Paper: Image Synthesis with a Single (Robust) Classifier
Adversarial examplesNatural Adversarial ExamplesDan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, Dawn SongarXiv2019AN #64Flo DornerThis paper introduces a new dataset to evaluate the worst-case performance of image classifiers. ImageNet-A consists of unmodified natural images that are consistently misclassified by popular neural-network architectures trained on ImageNet. Based on some concrete misclassifications, like a dragonfly on a yellow plastic shovel being classified as a banana, the authors hypothesize that current classifiers rely too much on color, texture and background cues. Neither classical adversarial training nor training on a version of ImageNet designed to reduce the reliance on texture helps a lot, but modifying the network architecture can increase the accuracy on ImageNet-A from around 5% to 15%.
This seems to show that current methods and/or training sets for image classification are still far away from allowing for robust generalization, even in naturally occuring scenarios. While not too surprising, the results might convince those who have heavily discounted the evidence provided by classical adversarial examples due to the reliance on artificial perturbations.

**Rohin's opinion:** I'm particularly excited about this dataset because it seems like a significantly better way to evaluate new techniques for robustness: it's much closer to a "real world" test of the technique (as opposed to e.g. introducing an artificial perturbation that classifiers are expected to be robust to).
Adversarial examplesTesting Robustness Against Unforeseen AdversariesDaniel Kang, Yi Sun, Dan Hendrycks, Tom Brown, Jacob SteinhardtarXiv2019AN #63CodyThis paper demonstrates that adversarially training on just one type or family of adversarial distortions fails to provide general robustness against different kinds of possible distortions. In particular, they show that adversarial training against L-p norm ball distortions transfer reasonably well to other L-p norm ball attacks, but provides little value, and can in fact reduce robustness, when evaluated on other families of attacks, such as adversarially-chosen Gabor noise, "snow" noise, or JPEG compression. In addition to proposing these new perturbation types beyond the typical L-p norm ball, the paper also provides a "calibration table" with epsilon sizes they judge to be comparable between attack types, by evaluating them according to how much they reduce accuracy on either a defended or undefended model. (Because attacks are so different in approach, a given numerical value of epsilon won't correspond to the same "strength" of attack across methods)
I didn't personally find this paper hugely surprising, given the past pattern of whack-a-mole between attack and defense suggesting that defenses tend to be limited in their scope, and don't confer general robustness. That said, I appreciate how centrally the authors lay this lack of transfer as a problem, and the effort they put in to generating new attack types and calibrating them so they can be meaningfully compared to existing L-p norm ball ones.

**Rohin's opinion:** I see this paper as calling for adversarial examples researchers to stop focusing just on the L-p norm ball, in line with <@one of the responses@>(@Response: Adversarial Example Researchers Need to Expand What is Meant by ‘Robustness’@) to the last newsletter's highlight, <@Adversarial Examples Are Not Bugs, They Are Features@>.
Testing Robustness Against Unforeseen Adversaries
Adversarial examplesOn the Geometry of Adversarial ExamplesMarc Khoury, Dylan Hadfield-MenellProceedings of the Genetic and Evolutionary Computation Conference '182018AN #36RohinThis paper analyzes adversarial examples based off a key idea: even if the data of interest forms a low-dimensional manifold, as we often assume, the ϵ-tube _around_ the manifold is still high-dimensional, and so accuracy in an ϵ-ball around true data points will be hard to learn.

For a given L_p norm, we can define the optimal decision boundary to be the one that maximizes the margin from the true data manifold. If there exists some classifier that is adversarially robust, then the optimal decision boundary is as well. Their first result is that the optimal decision boundary can change dramatically if you change p. In particular, for concentric spheres, the optimal L_inf decision boundary provides an L_2 robustness guarantee √d times smaller than the optimal L_2 decision boundary, where d is the dimensionality of the input. This explains why a classifier that is adversarially trained on L_inf adversarial examples does so poorly on L_2 adversarial examples.

I'm not sure I understand the point of the next section, but I'll give it a try. They show that a nearest neighbors classifier can achieve perfect robustness if the underlying manifold is sampled sufficiently densely (requiring samples exponential in k, the dimensionality of the manifold). However, a learning algorithm with a particular property that they formalize would require exponentially more samples in at least some cases in order to have the same guarantee. I don't know why they chose the particular property they did -- my best guess is that the property is meant to represent what we get when we train a neural net on L_p adversarial examples. If so, then their theorem suggests that we would need exponentially more training points to achieve perfect robustness with adversarial training compared to a nearest neighbor classifier.

They next turn to the fact that the ϵ-tube around the manifold is d-dimensional instead of k-dimensional. If we consider ϵ-balls around the training set X, this covers a very small fraction of the ϵ-tube, approaching 0 as d becomes much larger than k, even if the training set X covers the k-dimensional manifold sufficiently well.

Another issue is that if we require adversarial robustness, then we severely restrict the number of possible decision boundaries, and so we may need significantly more expressive models to get one of these decision boundaries. In particular, since feedforward neural nets with Relu activations have "piecewise linear" decision boundaries (in quotes because I might be using the term incorrectly), it is hard for them to separate concentric spheres. Suppose that the spheres are separated by a distance d. Then for accuracy on the manifold, we only need the decision boundary to lie entirely in the shell of width d. However, for ϵ-tube adversarial robustness, the decision boundary must lie in a shell of width d - 2ϵ. They prove a lower bound on the number of linear regions for the decision boundary that grows as τ^(-d), where τ is the width of the shell, suggesting that adversarial robustness would require more parameters in the model.

Their experiments show that for simple learning problems (spheres and planes), adversarial examples tend to be in directions orthogonal to the manifold. In addition, if the true manifold has high codimension, then the learned model has poor robustness.
I think this paper has given me a significantly better understanding of how L_p norm balls work in high dimensions. I'm more fuzzy on how this applies to adversarial examples, in the sense of any confident misclassification by the model on an example that humans agree is obvious. Should we be giving up on L_p robustness since it forms a d-dimensional manifold, whereas we can only hope to learn the smaller k-dimensional manifold? Surely though a small enough perturbation shouldn't change anything? On the other hand, even humans have _some_ decision boundary, and the points near the decision boundary have some small perturbation which would change their classification (though possibly to "I don't know" rather than some other class).

There is a phenomenon where if you train on L_inf adversarial examples, the resulting classifier fails on L_2 adversarial examples, which has previously been described as "overfitting to L_inf". The authors interpret their first theorem as contradicting this statement, since the optimal decision boundaries are very different for L_inf and L_2. I don't see this as a contradiction. The L_p norms are simply a method of label propagation, which augments the set of data points for which we know labels. Ultimately, we want the classifier to reproduce the labels that we would assign to data points, and L_p propagation captures some of that. So, we can think of there as being many different ways that we can augment the set of training points until it matches human classification, and the L_p norm balls are such methods. Then an algorithm is more robust as it works with more of these augmentation methods. Simply doing L_inf training means that by default the learned model only works on one of the methods (L_inf norm balls) and not all of them as we wanted, and we can think of this as "overfitting" to the imperfect L_inf notion of adversarial robustness. The meaning of "overfitting" here is that the learned model is too optimized for L_inf, at the cost of other notions of robustness like L_2 -- and their theorem says basically the same thing, that optimizing for L_inf comes at the cost of L_2 robustness.
Adversarial examplesA Geometric Perspective on the Transferability of Adversarial DirectionsZachary Charles, Harrison Rosenberg, Dimitris PapailiopoulosarXiv2018AN #34
Adversarial examplesTowards the first adversarially robust neural network model on MNISTLukas Schott, Jonas Rauber, Matthias Bethge, Wieland BrendelarXiv2018AN #27Dan HThis recent pre-print claims to make MNIST classifiers more adversarially robust to different L-p perturbations, while the previous paper only worked for L-infinity perturbations. The basic building block in their approach is a variational autoencoder, one for each MNIST class. Each variational autoencoder computes the likelihood of the input sample, and this information is used for classification. The authors also demonstrate that binarizing MNIST images can serve as strong defense against some perturbations. They evaluate against strong attacks and not just the fast gradient sign method.
This paper has generated considerable excitement among my peers. Yet inference time with this approach is approximately 100,000 times that of normal inference (10^4 samples per VAE * 10 VAEs). Also unusual is that the L-infinity "latent descent attack" result is missing. It is not clear why training a single VAE does not work. Also, could results improve by adversarially training the VAEs? As with all defense papers, it is prudent to wait for third-party reimplementations and analysis, but the range of attacks they consider is certainly thorough.
Adversarial examplesTowards Deep Learning Models Resistant to Adversarial AttacksAleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian VladuICLR2018AN #27Dan HMadry et al.'s paper is a seminal work which shows that some neural networks can attain more adversarial robustness with a well-designed adversarial training procedure. The key idea is to phrase the adversarial defense problem as minimizing the expected result of the adversarial attack problem, which is maximizing the loss on an input training point when the adversary is allowed to perturb the point anywhere within an L-infinity norm ball. They also start the gradient descent from a random point in the norm ball. Then, given this attack, to optimize the adversarial defense problem, we simply do adversarial training. When trained long enough, some networks will attain more adversarial robustness.
It is notable that this paper has survived third-party security analysis, so this is a solid contribution. This contribution is limited by the fact that its improvements are limited to L-infinity adversarial perturbations on small images, as [follow-up work]( has shown.
Adversarial examplesMotivating the Rules of the Game for Adversarial Example ResearchJustin Gilmer, Ryan P. Adams, Ian Goodfellow, David Andersen, George E. DahlarXiv2018Daniel FilanAN #19
Adversarial examplesPixels still beat text: Attacking the OpenAI CLIP model with text patches and adversarial pixel perturbationsStanislav FortAuthor's Website2021AN #142RohinTypographic adversarial examples demonstrate that [CLIP]( can be significantly affected by text in an image. How powerfully does text affect CLIP, and how does it compare to more traditional attack vectors like imperceptible pixel changes? This blog post seeks to find out, through some simple tests on CIFAR-10.

First, to see how much text can affect CLIP’s performance, we add a handwritten label to each of the test images that spells out the class (so a picture of a deer would have overlaid a picture of a handwritten sticker of the word “deer”). This boosts CLIP’s zero-shot performance on CIFAR-10 from 87.37% to literally 100% (not a single mistake), showing that text really is quite powerful in affecting CLIP’s behavior.

You might think that since text can boost performance so powerfully, CLIP would at least be more robust against pixel-level attacks when the sticker is present. However, this does _not_ seem to be true: even when there is a sticker with the true class, a pixel-level attack works quite well (and is still imperceptible).

This suggests that while the text is powerful, pixel-level changes are more powerful still. To test this, we can try adding another, new sticker (with the same label). It turns out that this _does_ successfully switch the label back to the original correct label. In general, you can keep iterating the text sticker attack and the pixel-change attack, and the attacks keep working, with CLIP’s classification being determined by whichever attack was performed most recently.

You might think that the model's ability to read text is fairly brittle, and that's what's being changed by pixel-level attacks, hence adding a fresh piece of text would switch it back. Unfortunately, it doesn't seem like anything quite that simple is going on. The author conducts several experiments where only the sticker can be adversarially perturbed, or everything but the sticker can be adversarially perturbed, or where the copy-pasted sticker is one that was previously adversarially perturbed; unfortunately the results don't seem to tell a clean story.
This is quite an interesting phenomenon, and I'm pretty curious to understand what's going on here. Maybe that's an interesting new challenge for people interested in Circuits-style interpretability? My pretty uneducated guess is that it seems difficult enough to actually stress our techniques, but not so difficult that we can't make any progress.
Adversarial examplesAdversarial examples for the OpenAI CLIP in its zero-shot classification regime and their semantic generalizationStanislav FortAuthor's Website2021AN #136Rohin[CLIP]( is a model that was trained on a vast soup of image-caption data, and as a result can perform zero-shot image classification (for example, it gets 87% accuracy on CIFAR-10 out of the box). Does it also have adversarial examples within the image classification regime? This post shows that the answer is yes, and in fact these adversarial examples are easy to find.

More interestingly though, these adversarial examples persist if you change the labels in a semantically meaningful way. For example, if you take an image X that is correctly classified as a cat and imperceptibly modify it to Y which is now classified as a dog, if you change the class names to “kitty” and “hound”, then the same X will now be classified as a kitty while the same Y will be classified as a hound. This even works (though not as well) for labels like “domesticated animal which barks and is best friend”. The author takes this as evidence that the adversarial image actually looks like the adversarial class to the neural net, rather than being a peculiar consequence of the specific label.
This seems like further validation of the broad view put forth in <@Adversarial Examples Are Not Bugs, They Are Features@>.
Adversarial examplesAXRP 1: Adversarial PoliciesDaniel Filan and Adam GleaveAXRP Podcast2020AN #130RohinThe first part of this podcast describes the <@adversarial policies paper@>(@Adversarial Policies: Attacking Deep Reinforcement Learning@); see the summary for details about that. (As a reminder, this is the work which trained an adversarial goalie, that by spasming in a random-looking manner, causes the kicker to completely fail to even kick the ball towards the goal.)

Let’s move on to the more speculative thoughts discussed in this podcast (and not in the paper). One interesting thing that the paper highlights is that the space of policies is very non-transitive: it is possible, perhaps even common, that policy A beats policy B, which beats policy C, which beats policy A. This is clear if you allow arbitrary policies -- for example, the policy “play well, unless you see your opponent make a particular gesture; if you see that gesture then automatically lose” will beat many policies, but can be beaten by a very weak policy that knows to make the particular gesture. You might have thought that in practice, the policies produced by deep RL would exclude these weird possibilities, and so could be ranked by some notion of “competence”, where more competent agents would usually beat less competent agents (implying transitivity). The results of this paper suggest that isn’t the case.

The conversation then shifts to the research community and how to choose what research to do. The motivation behind this work was to improve the evaluation of policies learned by deep RL: while the freedom from the lack of theoretical guarantees (as in control theory) has allowed RL to make progress on previously challenging problems, there hasn’t been a corresponding uptick in engineering-based guarantees, such as testing. The work has had a fairly positive reception in the AI community, though unfortunately it seems this is probably due in part to its flashy results. Other papers that Adam is equally excited about have not had as good a reception.
Adversarial examplesA learning and masking approach to secure learningLinh Nguyen, Sky Wang, Arunesh SinhaInternational Conference on Decision and Game Theory for Security 20182018N/ARohinOne way to view the problem of adversarial examples is that adversarial attacks map "good" clean data points that are classified correctly into a nearby "bad" space that is low probability and so is misclassified. This suggests that in order to attack a model, we can use a neural net to _learn_ a transformation from good data points to bad ones. The loss function is easy -- one term encourages similarity to the original data point, and the other term encourages the new data point to have a different class label. Then, for any new input data point, we can simply feed it through the neural net to get an adversarial example.

Similarly, in order to defend a model, we can learn a neural net transformation that maps bad data points to good ones. The loss function continues to encourage similarity between the data points, but now encourages that the new data point have the correct label. Note that we need to use some attack algorithm in order to generate the bad data points that are used to train the defending neural net.
Ultimately, in the defense proposed here, the information on how to be more robust comes from the "bad" data points that are used to train the neural net. It's not clear why this would outperform adversarial training, where we train the original classifier on the "bad" data points. In fact, if the best way to deal with adversarial examples is to transform them to regular examples, then we could simply use adversarial training with a more expressive neural net, and it could learn this transformation.
Adversarial examplesAdversarial Policies: Attacking Deep Reinforcement LearningAdam Gleave, Michael Dennis, Neel Kant, Cody Wild, Sergey Levine, Stuart RussellarXiv2019AN #70SudhanshuThis work demonstrates the existence of _adversarial policies_ of behaviour in high-dimensional, two-player zero-sum games. Specifically, they show that adversarially-trained agents ("Adv"), who can only affect a victim's observations of their (Adv's) states, can act in ways that confuse the victim into behaving suboptimally.

An adversarial policy is trained by reinforcement learning in a single-player paradigm where the victim is a black-box fixed policy that was previously trained via self-play to be robust to adversarial attacks. As a result, the adversarial policies learn to push the observations of the victim outside the training distribution, causing the victim to behave poorly. The adversarial policies do not actually behave intelligently, such as blocking or tackling the victim, but instead do unusual things like spasming in a manner that appears random to humans, curling into a ball or kneeling.

Further experiments showed that if the victim's observations of the adversary were removed, then the adversary was unable to learn such an adversarial policy. In addition, the victim's network activations were very different when playing against an adversarial policy relative to playing against a random or lifeless opponent. By comparing two similar games where the key difference was the number of adversary dimensions being observed, they showed that such policies were easier to learn in higher-dimensional games.
This work points to an important question about optimisation in high dimension continuous spaces: without guarantees on achieving solution optimality, how do we design performant systems that are robust to (irrelevant) off-distribution observations? By generating demonstrations that current methods are insufficient, it can inspire future work across areas like active learning, continual learning, fall-back policies, and exploration.

I had a tiny nit-pick: while the discussion is excellent, the paper doesn't cover whether this phenomenon has been observed before with discrete observation/action spaces, and why/why not, which I feel would be an important aspect to draw out. In a finite environment, the victim policy might have actually covered every possible situation, and thus be robust to such attacks; for continuous spaces, it is not clear to me whether we can _always_ find an adversarial attack.

In separate correspondence, author Adam Gleave notes that he considers these to be relatively low-dimensional -- even MNIST has way more dimensions -- so when comparing to regular adversarial examples work, it seems like multi-agent RL is harder to make robust than supervised learning.
Adversarial Policies website
Adversarial examplesE-LPIPS: Robust Perceptual Image Similarity via Random Transformation EnsemblesMarkus Kettunen et alarXiv2019AN #66Dan HConvolutional neural networks are one of the best methods for assessing the perceptual similarity between images. This paper provides evidence that perceptual similarity metrics can be made adversarially robust. Out-of-the-box, network-based perceptual similarity metrics exhibit some adversarial robustness. While classifiers transform a long embedding vector to class scores, perceptual similarity measures compute distances between long and wide embedding tensors, possibly from multiple layers. Thus the attacker must alter far more neural network responses, which makes attacks on perceptual similarity measures harder for adversaries. This paper makes attacks even harder for the adversary by using a barrage of input image transformations and by using techniques such as dropout while computing the embeddings. This forces the adversarial perturbation to be substantially larger.
Adversarial examplesThe LogBarrier adversarial attack: making effective use of decision boundary informationChris Finlay, Aram-Alexandre Pooladian, Adam M. ObermanarXiv2019AN #53Dan HRather than maximizing the loss of a model given a perturbation budget, this paper minimizes the perturbation size subject to the constraint that the model misclassify the example. This misclassification constraint is enforced by adding a logarithmic barrier to the objective, which they prevent from causing a loss explosion through through a few clever tricks. Their attack appears to be faster than the Carlini-Wagner attack.
[The code is here.](
Adversarial examplesQuantifying Perceptual Distortion of Adversarial ExamplesMatt Jordan, Naren Manoj, Surbhi Goel, Alexandros G. DimakisarXiv2019AN #48Dan HThis paper takes a step toward more general adversarial threat models by combining adversarial additive perturbations small in an l_p sense with [spatially transformed adversarial examples](, among other other attacks. In this more general setting, they measure the size of perturbations by computing the [SSIM]( between clean and perturbed samples, which has limitations but is on the whole better than the l_2 distance. This work shows, along with other concurrent works, that perturbation robustness under some threat models does not yield robustness under other threat models. Therefore the view that l_p perturbation robustness must be achieved before considering other threat models is made more questionable. The paper also contributes a large code library for testing adversarial perturbation robustness.
Adversarial examplesOn the Sensitivity of Adversarial Robustness to Input Data DistributionsGavin Weiguang Ding, Kry Yik Chau Lui, Xiaomeng Jin, Luyu Wang, Ruitong HuangICLR 20192019AN #48
Adversarial examplesTheoretically Principled Trade-off between Robustness and AccuracyHongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, Michael I. JordanarXiv2019AN #44Dan HThis paper won the NeurIPS 2018 Adversarial Vision Challenge. For robustness on CIFAR-10 against l_infinity perturbations (epsilon = 8/255), it improves over the Madry et al. adversarial training baseline from 45.8% to 56.61%, making it [almost]( state-of-the-art. However, it does decrease clean set accuracy by a few percent, despite using a deeper network than Madry et al. Their technique has many similarities to Adversarial Logit Pairing, which is not cited, because they encourage the network to embed a clean example and an adversarial perturbation of a clean example similarly. I now describe Adversarial Logit Pairing. During training, ALP teaches the network to classify clean and adversarially perturbed points; added to that loss is an l_2 loss between the logit embeddings of clean examples and the logits of the corresponding adversarial examples. In contrast, in place of the l_2 loss from ALP, this paper uses the KL divergence from the softmax of the clean example to the softmax of an adversarial example. Yet the softmax distributions are given a high temperature, so this loss is not much different from an l_2 loss between logits. The other main change in this paper is that adversarial examples are generated by trying to maximize the aforementioned KL divergence between clean and adversarial pairs, not by trying to maximize the classification log loss as in ALP. This paper then shows that some further engineering to adversarial logit pairing can improve adversarial robustness on CIFAR-10.
Adversarial examplesAdversarial Vulnerability of Neural Networks Increases With Input DimensionCarl-Johann Simon-Gabriel, Yann Ollivier, Léon Bottou, Bernhard Schölkopf, David Lopez-PazarXiv2018AN #36RohinThe key idea of this paper is that imperceptible adversarial vulnerability happens when small changes in the input lead to large changes in the output, suggesting that the gradient is large. They first recommend choosing ϵ_p to be proportional to d^(1/p). Intuitively, this is because larger values of p behave more like maxing instead of summing, and so using the same value of ϵ across values of p would lead to more points being considered for larger p. They show a link between adversarial robustness and regularization, which makes sense since both of these techniques aim for better generalization.

Their main point is that the norm of the gradient increases with the input dimension d. In particular, a typical initialization scheme will set the variance of the weights to be inversely proportional to d, which means the absolute value of each weight is inversely proportional to √d. For a single-layer neural net (that is, a perceptron), the gradient is exactly the weights. For L_inf adversarial robustness, the relevant norm for the gradient is the L_1 norm. This gives the sum of the d weights, which will be proportional to √d. For L_p adversarial robustness, the corresponding gradient is L_q with q larger than 1, which decreases the size of the gradient. However, this is exactly offset by the increase in the size of ϵ_p that they proposed. Thus, in this simple case the adversarial vulnerability increases with input dimension. They then prove theorems that show that this generalizes to other neural nets, including CNNS (albeit still only at initialization, not after training). They also perform experiments showing that their result also holds after training.
I suspect that there is some sort of connection between the explanation given in this paper and the explanation that there are many different perturbation directions in high-dimensional space which means that there are lots of potential adversarial examples, which increases the chance that you can find one. Their theoretical result comes primarily from the fact that weights are initialized with variance inversely proportional to d. We could eliminate this by having the variance be inversely proportional to d^2, in which case their result would say that adversarial vulnerability is constant with input dimension. However, in this case the variance of the activations would be inversely proportional to d, making it hard to learn. It seems like adversarial vulnerability should be the product of "number of directions", and "amount you can search in a direction", where the latter is related to the variance of the activations, making the connection to this paper.
Adversarial examplesRobustness via curvature regularization, and vice versaSeyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, Pascal FrossardarXiv2018AN #35Dan HThis paper proposes a distinct way to increase adversarial perturbation robustness. They take an adversarial example generated with the FGSM, compute the gradient of the loss for the clean example and the gradient of the loss for the adversarial example, and they penalize this difference. Decreasing this penalty relates to decreasing the loss surface curvature. The technique works slightly worse than adversarial training.
Adversarial examplesIs Robustness [at] the Cost of Accuracy?Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, Yupeng GaoECCV2018AN #32Dan HThis work shows that older architectures such as VGG exhibit more adversarial robustness than newer models such as ResNets. Here they take adversarial robustness to be the average adversarial perturbation size required to fool a network. They use this to show that architecture choice matters for adversarial robustness and that accuracy on the clean dataset is not necessarily predictive of adversarial robustness. A separate observation they make is that adversarial examples created with VGG transfers far better than those created with other architectures. All of these findings are for models without adversarial training.
Adversarial examplesAdversarial Examples Are a Natural Consequence of Test Error in NoiseNic Ford*, Justin Gilmer*, Nicolas Carlini, Dogus CubukarXiv2018AN #32Dan HThis paper argues that there is a link between model accuracy on noisy images and model accuracy on adversarial images. They establish this empirically by showing that augmenting the dataset with random additive noise can improve adversarial robustness reliably. To establish this theoretically, they use the Gaussian Isoperimetric Inequality, which directly gives a relation between error rates on noisy images and the median adversarial perturbation size. Given that measuring test error on noisy images is easy, given that claims about adversarial robustness are almost always wrong, and given the relation between adversarial noise and random noise, they suggest that future defense research include experiments demonstrating enhanced robustness on nonadversarial, noisy images.
Adversarial examplesRobustness May Be at Odds with AccuracyDimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, Aleksander MadryOpenReview2018AN #32Dan HSince adversarial training can markedly reduce accuracy on clean images, one may ask whether there exists an inherent trade-off between adversarial robustness and accuracy on clean images. They use a simple model amenable to theoretical analysis, and for this model they demonstrate a trade-off. In the second half of the paper, they show adversarial training can improve feature visualization, which has been shown in several concurrent works.
Adversarial examplesAre adversarial examples inevitable?Ali Shafahi, W. Ronny Huang, Christoph Studer, Soheil Feizi, Tom GoldsteinInternational Conference on Learning Representations, 2019.2018AN #24
Adversarial examplesAdversarial Reprogramming of Sequence Classification Neural NetworksPaarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz KoushanfarAAAI-2019 Workshop on Engineering Dependable and Secure Machine Learning Systems2018AN #23
Adversarial examplesFortified Networks: Improving the Robustness of Deep Networks by Modeling the Manifold of Hidden RepresentationsAlex Lamb, Jonathan Binas, Anirudh Goyal, Dmitriy Serdyuk, Sandeep Subramanian, Ioannis Mitliagkas, Yoshua BengioarXiv2018AN #2
Adversarial examplesAdversarial Vision ChallengeWieland Brendel, Jonas Rauber, Alexey Kurakin, Nicolas Papernot, Veliqi, Marcel Salathé, Sharada P. Mohanty, Matthias BethgeNIPS 20182018AN #19RohinThere will be a competition on adversarial examples for vision at NIPS 2018.
Adversarial examplesEvaluating and Understanding the Robustness of Adversarial Logit PairingLogan Engstrom, Andrew Ilyas, Anish AthalyearXiv2018AN #18
Adversarial examplesBenchmarking Neural Network Robustness to Common Corruptions and Surface VariationsDan Hendrycks, Thomas G. DietterichICLR 20192018AN #15RohinSee [Import AI](
Adversarial examplesAvoiding textual adversarial examplesNoa NabeshimaTwitter2021AN #143RohinLast week I speculated that CLIP might "know" that a textual adversarial example is a "picture of an apple with a piece of paper saying an iPod on it" and the zero-shot classification prompt is preventing it from demonstrating this knowledge. Gwern Branwen [commented]( to link me to this Twitter thread as well as this [YouTube video]( in which better prompt engineering significantly reduces these textual adversarial examples, demonstrating that CLIP does "know" that it's looking at an apple with a piece of paper on it.
Adversarial examplesAdversarial Reprogramming of Neural NetworksGamaleldin F. Elsayed, Ian Goodfellow, Jascha Sohl-DicksteinarXiv2018AN #14
Adversarial examplesAdversarial images for the primate brainLi Yuan, Will Xiao, Gabriel Kreiman, Francis E.H. Tay, Jiashi Feng, Margaret S. LivingstonearXiv2020XuanAN #138RohinIt turns out that you can create adversarial examples for monkeys! The task: classifying a given face as coming from a monkey vs. a human. The method is pretty simple: train a neural network to predict what monkeys would do, and then find adversarial examples for monkeys. These examples don’t transfer perfectly, but they transfer enough that it seems reasonable to call them adversarial examples. In fact, these adversarial examples also make humans make the wrong classification reasonably often (though not as often as with monkeys), when given about 1 second to classify (a fairly long amount of time). Still, it is clear that the monkeys and humans are much more behaviorally robust than the neural networks.
First, a nitpick: the adversarially modified images are pretty significantly modified, such that you now have to wonder whether we should say that the humans are getting the answer “wrong”, or that the image has been modified meaningfully enough that there is no longer a right answer (as is arguably the case with the infamous [cat-dog]( The authors do show that e.g. Gaussian noise of the same magnitude doesn't degrade human performance, which is a good sanity check, but doesn’t negate this point.

Nonetheless, I liked this paper -- it seems like good evidence that neural networks and biological brains are picking up on similar features. My preferred explanation is that these are the “natural” features for our environment, though other explanations are possible, e.g. perhaps brains and neural networks are sufficiently similar architectures that they do similar things. Note however that they do require a _grey-box_ approach, where they first train the neural network to predict the monkey's neuronal responses. When they instead use a neural network trained to classify human faces vs. monkey faces, the resulting adversarial images do not cause misclassifications in monkeys. So they do need to at least finetune the final layer for this to work, and thus there is at least some difference between the neural networks and monkey brains.
Adversarial examplesDefense Against the Dark Arts: An overview of adversarial example security research and future research directionsIan GoodfellowarXiv2018AN #11
Adversarial examplesOn Evaluating Adversarial RobustnessNicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, Alexey KurakinNeurIPS SECML 20182019AN #46
Adversarial examplesCharacterizing Adversarial Examples Based on Spatial Consistency Information for Semantic SegmentationChaowei Xiao, Ruizhi Deng, Bo Li, Fisher Yu, Mingyan Liu, and Dawn SongECCV2018AN #29Dan HThis paper considers adversarial attacks on segmentation systems. They find that segmentation systems behave inconsistently on adversarial images, and they use this inconsistency to detect adversarial inputs. Specifically, they take overlapping crops of the image and segment each crop. For overlapping crops of an adversarial image, they find that the segmentation are more inconsistent. They defend against one adaptive attack.
Adversarial examplesSpatially Transformed Adversarial ExamplesChaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, Dawn SongICLR2018AN #29Dan HMany adversarial attacks perturb pixel values, but the attack in this paper perturbs the pixel locations instead. This is accomplished with a smooth image deformation which has subtle effects for large images. For MNIST images, however, the attack is more obvious and not necessarily content-preserving (see Figure 2 of the paper).
Adversarial examplesAdversarial Logit PairingHarini Kannan, Alexey Kurakin, Ian GoodfellowarXivRecon #5
Adversarial examplesLearning to write programs that generate imagesS M Ali Eslami, Tejas Kulkarni, Oriol Vinyals, Yaroslav GaninDeepMind BlogRecon #5
Adversarial examplesIntrinsic Geometric Vulnerability of High-Dimensional Artificial IntelligenceLuca Bortolussi, Guido SanguinettiarXiv2018AN #36
Adversarial examplesOn Adversarial Examples for Character-Level Neural Machine TranslationJavid Ebrahimi, Daniel Lowd, Dejing DouarXiv2018AN #13
Adversarial examplesIdealised Bayesian Neural Networks Cannot Have Adversarial Examples: Theoretical and Empirical StudyYarin Gal, Lewis SmitharXiv2018AN #10Rohin
Agent foundationsHighlightFinite Factored Sets sequenceScott GarrabrantAlignment Forum2021AN #163RohinThis newsletter is a combined summary + opinion for the [Finite Factored Sets sequence]( by Scott Garrabrant. I (Rohin) have taken a lot more liberty than I usually do with the interpretation of the results; Scott may or may not agree with these interpretations.

## Motivation

One view on the importance of deep learning is that it allows you to automatically _learn_ the features that are relevant for some task of interest. Instead of having to handcraft features using domain knowledge, we simply point a neural net at an appropriate dataset and it figures out the right features. Arguably this is the _majority_ of what makes up intelligent cognition; in humans it seems very analogous to [System 1](,_Fast_and_Slow), which we use for most decisions and actions. We are also able to infer causal relations between the resulting features.

Unfortunately, [existing models]( of causal inference don’t model these learned features -- they instead assume that the features are already given to you. Finite Factored Sets (FFS) provide a theory which can talk directly about different possible ways to featurize the space of outcomes and still allows you to perform causal inference. This sequence develops this underlying theory and demonstrates a few examples of using finite factored sets to perform causal inference given only observational data.

Another application is to <@embedded agency@>(@Embedded Agents@): we would like to think of “agency” as a way to featurize the world into an “agent” feature and an “environment” feature, that together interact to determine the world. In <@Cartesian Frames@>, we worked with a function A × E → W, where pairs of (agent, environment) together determined the world. In the finite factored set regime, we’ll think of A and E as features, the space S = A × E as the set of possible feature vectors, and S → W as the mapping from feature vectors to actual world states.

## What is a finite factored set?

Generalizing this idea to apply more broadly, we will assume that there is a set of possible worlds Ω, a set S of arbitrary elements (which we will eventually interpret as feature vectors), and a function f : S → Ω that maps feature vectors to world states. Our goal is to have some notion of “features” of elements of S. Normally, when working with sets, we identify a feature value with the set of elements that have that value. For example, we can identify “red” as the set of all red objects, and in [some versions of mathematics](, we define “2” to be the class of all sets that have exactly two elements. So, we define a feature to be a _partition_ of S into subsets, where each subset corresponds to one of the possible feature values. We can also interpret a feature as a _question_ about items in S, and the values as possible _answers_ to that question; I’ll be using that terminology going forward.

A finite factored set is then given by (S, B), where B is a set of **factors** (questions), such that if you choose a particular answer to every question, that uniquely determines an element in S (and vice versa). We’ll put aside the set of possible worlds Ω; for now we’re just going to focus on the theory of these (S, B) pairs.

Let’s look at a contrived example. Consider S = {chai, caesar salad, lasagna, lava cake, sprite, strawberry sorbet}. Here are some possible questions for this S:

- **FoodType**: Possible answers are Drink = {chai, sprite}, Dessert = {lava cake, strawberry sorbet}, Savory = {caesar salad, lasagna}
- **Temperature**: Possible answers are Hot = {chai, lava cake, lasagna} and Cold = {sprite, strawberry sorbet, caesar salad}.
- **StartingLetter**: Possible answers are “C” = {chai, caesar salad}, “L” = {lasagna, lava cake}, and “S” = {sprite, strawberry sorbet}.
- **NumberOfWords**: Possible answers are “1” = {chai, lasagna, sprite} and “2” = {caesar salad, lava cake, strawberry sorbet}.

Given these questions, we could factor S into {FoodType, Temperature}, or {StartingLetter, NumberOfWords}. We _cannot_ factor it into, say, {StartingLetter, Temperature}, because if we set StartingLetter = L and Temperature = Hot, that does not uniquely determine an element in S (it could be either lava cake or lasagna).

Which of the two factorizations should we use? We’re not going to delve too deeply into this question, but you could imagine that if you were interested in questions like “does this need to be put in a glass” you might be more interested in the {FoodType, Temperature} factorization.

Just to appreciate the castle of abstractions we’ve built, here’s the finite factored set F with the factorization {FoodType, Temperature}:

F = ({chai, caesar salad, lasagna, lava cake, sprite, strawberry sorbet}, {{{chai, sprite}, {lava cake, strawberry sorbet}, {caesar salad, lasagna}}, {{chai, lava cake, lasagna}, {sprite, strawberry sorbet, caesar salad}}})

To keep it all straight, just remember: a **factorization** B is a set of **questions** (factors, partitions) each of which is a set of **possible answers** (parts), each of which is a set of elements in S.

## A brief interlude

Some objections you might have about stuff we’ve talked about so far:

**Q.** Why do we bother with the set S -- couldn’t we just have the set of questions B, and then talk about answer vectors of the form (a1, a2, … aN)?

**A.** You could in theory do this, as there is a bijection between S and the Cartesian product of the sets in B. However, the problem with this framing is that it is hard to talk about other derived features. For example, the question “what is the value of B1+B2” has no easy description in this framing. When we instead directly work with S, the B1+B2 question is just another partition of S, just like B1 or B2 individually.

**Q.** Why does f map S to Ω? Doesn’t this mean that a feature vector uniquely determines a world state, whereas it’s usually the opposite in machine learning?

**A.** This is true, but here the idea is that the set of features together captures _all_ the information within the setting we are considering. You could think of feature vectors in deep learning as only capturing an important subset of all of the features (which we’d have to do in practice since we only have bounded computation), and those features are not enough to determine world states.

## Orthogonality in Finite Factored Sets

We’re eventually going to use finite factored sets similarly to Pearlian causal models: to infer which questions (random variables) are conditionally independent of each other. However, our analysis will apply to arbitrary questions, unlike Pearlian models, which can only talk about independence between the predefined variables from which the causal model is built.

Just like Pearl, we will talk about _conditioning on evidence_: given evidence e, a subset of S, we can “observe” that we are within e. In the formal setup, this looks like erasing all elements that are not in e from all questions, answers, factors, etc.

You might think that "factors" are not analogous to nodes or random variables in a Pearlian model. However, this isn't right, since we’re going to assume that all of our factors are <em>independent</em> from each other, which is usually not the case in a Pearlian model. For example, you might have a Pearlian model with two binary variables, e.g. “Variable Rain causes Variable Wet Sidewalk”; these are obviously not independent. The corresponding finite factored set would have _three_ factors: “did it rain?”, “if it rained did the sidewalk get wet?” and “if it didn’t rain did the sidewalk get wet?” This way all three factors can be independent of each other. We will still be able to ask whether Wet Sidewalk is independent of Rain, since Wet Sidewalk is just another question about the set S -- it just isn’t one of the underlying factors anymore.

The point of this independence is to allow us to reason about _counterfactuals_: it should be possible to say “imagine the element s, except with underlying factor b2 changed to have value v”. As a result, our definitions will include clauses that say “and make sure we can still take counterfactuals”. For example, let’s talk about the “history” of a question X, which for now you can think of as the “factors relevant to X”. The _history_ of X given e is the smallest set of factors such that:

1) if you know the answers to these factors, then you can infer the answer to X, and
2) any factors that are _not_ in the history are independent of X. As suggested above, we can think of this as being about counterfactuals -- we’re saying that for any such factor, we can counterfactually change its answer and this will remain consistent with the evidence e.

(A technicality on the second point: we’ll never be able to counterfactually change a factor to a value that is never found in the evidence; this is fine and doesn’t prevent things from being independent.)

Time for an example! Consider the set S = {000, 001, 010, 011, 100, 101, 110, 111} and the factorization {X, Y, Z}, where X is the question “what is the first bit”, Y is the question “what is the second bit”, and Z is the question “what is the third bit”. Consider the question Q = “when interpreted as a binary number, is the number >= 2?” In this case, the history of Q given no evidence is {X, Y} because you can determine the answer to Q with the combination of X and Y. (You can still counterfact on anything, since there is no evidence to be inconsistent with.)

Let’s consider an example with evidence. Suppose we observe that all the bits are equal, that is, e = {000, 111}. Now, what is the history of X? If there wasn’t any evidence, the history would just be {X}; you only need to know X in order to determine the value of X. However, suppose we learned that X = 0, implying that our element is 000. We can’t counterfact on Y or Z, since that would produce 010 or 001, both of which are inconsistent with the evidence. So given this evidence, the history of X is actually {X, Y, Z}, i.e. the entire set of factors! If we’d only observed that the first two bits were equal, so e = {000, 001, 110, 111}, then we _could_ counterfact on Z and the history of X would be {X, Y}.

(Should you want more examples, here are two [relevant]( [posts](

Given this notion of “history”, it is easy to define orthogonality: X is orthogonal to Y given evidence e if the history of X given e has no overlap with the history of Y given e. Intuitively, this means that the factors relevant to X are completely separate from those relevant to Y, and so there cannot be any entanglement between X and Y. For a _question_ Z, we say that X is orthogonal to Y given Z if X is orthogonal to Y given z, for every possible answer z in Z.

Now that we have defined orthogonality, we can state the _Fundamental Theorem of Finite Factored Sets_. Given some questions X, Y, and Z about a finite factored set F, X is orthogonal to Y given Z if and only if in every probability distribution on F, X is conditionally independent of Y given Z, that is, P(X, Y | Z) = P(X | Z) * P(Y | Z).

(I haven’t told you how you put a probability distribution on F. It’s exactly what you would think -- you assign a probability to every possible answer in every factor, and then the probability of an individual element is defined to be the product of the probabilities of its answers across all the factors.)

(I also haven’t given you any intuition about why this theorem holds. Unfortunately I don’t have great intuition for this; the proof has multiple non-trivial steps, each of which I locally understand and have intuition for... but globally it’s just a sequence of non-trivial steps to me. Here’s an attempt, which isn’t very good: we specifically defined orthogonality to capture *all* the relevant information for a question, in particular by having that second condition requiring that we be able to counterfact on other factors, and so it intuitively makes sense that if the relevant information doesn’t overlap, then there can’t be a way for the probability distribution to have interactions between the variables.)

The fundamental theorem is in some sense a _justification_ for calling the property “orthogonality” -- if we determine just by studying the structure of the finite factored set that X is orthogonal to Y given Z, then we know that this implies conditional independence in the “true” probability distribution, whatever it ends up being. Pearlian models have a similar theorem, where the graphical property of d-separation implies conditional independence.

## Foundations of causality and time

You might be wondering why we have been calling the minimal set of relevant factors “history”. The core philosophical idea is that, if you have the right factorization, then “time” or “causality” can be thought of as flowing in the direction of larger histories. Specifically, we say that X is “before” Y if the history of X is a subset of the history of Y. (We then call it “history” because every factor in the history of X will be “before” X by this definition.)

One intuition pump for this is that in physics, if an event A causes an event B, then the past light cone of A is a subset of the past light cone of B, and A happens before B in every possible reference frame.

But perhaps the best argument for thinking of this as causality is that we can actually use this notion of “time” or “causality” to perform causal inference. Before I talk about that, let’s see what this looks like in Pearlian models.

Strictly speaking, in Pearlian models, the edges do not _have_ to correspond to causality: formally they only represent conditional independence assumptions on a probability distribution. However, consider the following Cool Fact: for some Pearlian models, if you have observational data that is generated from that model, you can recover the exact graphical structure of the generating model just by looking at the observational data. In this case, you really are inferring cause-and-effect relationships from observational data! (In the general case where the data is generated by an arbitrary model, you can recover a lot of the structure of the model but be uncertain about the direction of some of the edges, so you are still doing _some_ causal inference from observational data.)

We will do something similar: we’ll use our notion of “before” to perform causal inference given observational data.

## Temporal inference: the three dependent bits

You are given statistical (i.e. observational) data for three bits: X, Y and Z. You quickly notice that it is always the case that Z = X xor Y (which implies that X = Y xor Z, and Y = Z xor X). Clearly, there are only two independent bits here and the other bit is derived as the xor of the two independent bits. From the raw statistical data, can you tell which bits are the independent ones, and which one is the derived one, thus inferring which one was _caused_ by the other two? It turns out that you can!

Specifically, you want to look for which two bits are _orthogonal_ to each other, that is, you want to check whether we approximately have P(X, Y) = P(X) P(Y) (and similarly for other possible pairings). In the world where two of the bits were generated by a biased coin, you will find exactly one pair that is orthogonal in this way. (The case where the bits are generated by a fair coin is special; the argument won’t work there, but it’s in some sense “accidental” and happens because the probability of 0.5 is very special.)

Let’s suppose that the orthogonal pair was (X, Z). In this case, we can _prove_ that in _every_ finite factored set that models this situation, X and Z come “before” Y, i.e. their histories are strict subsets of Y’s history. Thus, we’ve inferred causality using only observational data! (And unlike with Pearlian models, we did this in a case where one “variable” was a deterministic function of two other “variables”, which is a type of situation that Pearlian models struggle to handle.)

## Future work

Remember that motivation section, a couple thousand words ago? We talked about how we can do causal inference with learned featurizations and apply it to embedded agency. Well, we actually haven’t done that yet, beyond a few examples of causal inference (as in the example above). There is a lot of future work to be done in applying it to the case that motivated it in the first place. The author wrote up potential future work [here](, which has categories for both causal inference and embedded agency, and also adds a third one: generalizing the theory to infinite sets. If you are interested in this framework, there are many avenues for pushing it forward.
Agent foundationsHighlightInfra-Bayesianism sequenceDiffractor and Vanessa KosoyAlignment Forum2020AN #143RohinI have finally understood this sequence enough to write a summary about it, thanks to [AXRP Episode 5]( Think of this as a combined summary + highlight of the sequence and the podcast episode.

The central problem of <@embedded agency@>(@Embedded Agents@) is that there is no clean separation between an agent and its environment: rather, the agent is _embedded_ in its environment, and so when reasoning about the environment it is reasoning about an entity that is “bigger” than it (and in particular, an entity that _contains_ it). We don’t have a good formalism that can account for this sort of reasoning. The standard Bayesian account requires the agent to have a space of precise hypotheses for the environment, but then the true hypothesis would also include a precise model of the agent itself, and it is usually not possible to have an agent contain a perfect model of itself.

A natural idea is to reduce the precision of hypotheses. Rather than requiring a hypothesis to assign a probability to every possible sequence of bits, we now allow the hypotheses to say “I have no clue about this aspect of this part of the environment, but I can assign probabilities to the rest of the environment”. The agent can then limit itself to hypotheses that don’t make predictions about the part of the environment that corresponds to the agent, but do make predictions about other parts of the environment.

Another way to think about it is that it allows you to start from the default of “I know nothing about the environment”, and then add in details that you do know to get an object that encodes the easily computable properties of the environment you can exploit, while not making any commitments about the rest of the environment.

Of course, so far this is just the idea of using [Knightian uncertainty]( The contribution of infra-Bayesianism is to show how to formally specify a decision procedure that uses Knightian uncertainty while still satisfying many properties we would like a decision procedure to satisfy. You can thus think of it as an extension of the standard Bayesian account of decision-making to the setting in which the agent cannot represent the true environment as a hypothesis over which it can reason.

Imagine that, instead of having a probability distribution over hypotheses, we instead have two “levels”: first are all the properties we have Knightian uncertainty over, and then are all the properties we can reason about. For example, imagine that the environment is an infinite sequence of bits and we want to say that all the even bits come from flips of a possibly biased coin, but we know nothing about the odd coin flips. Then, at the top level, we have a separate branch for each possible setting of the odd coin flips. At the second level, we have a separate branch for each possible bias of the coin. At the leaves, we have the hypothesis “the odd bits are as set by the top level, and the even bits are generated from coin flips with the bias set by the second level”.

(Yes, there are lots of infinite quantities in this example, so you couldn’t implement it the way I’m describing it here. An actual implementation would not represent the top level explicitly and would use computable functions to represent the bottom level. We’re not going to worry about this for now.)

If we were using orthodox Bayesianism, we would put a probability distribution over the top level, and a probability distribution over the bottom level. You could then multiply that out to get a single probability distribution over the hypotheses, which is why we don’t do this separation into two levels in orthodox Bayesianism. (Also, just to reiterate, the _whole point_ is that we can’t put a probability distribution at the top level, since that implies e.g. making precise predictions about an environment that is bigger than you are.)

Infra-Bayesianism says, “what if we just… don't put a probability distribution over the top level?” Instead, we have a set of probability distributions over hypotheses, and Knightian uncertainty over which distribution in this set is the right one. A common suggestion for Knightian uncertainty is to do _worst-case_ reasoning, so that’s what we’ll do at the top level. Lots of problems immediately crop up, but it turns out we can fix them.

First, let’s say your top level consists of two distributions over hypotheses, A and B. You then observe some evidence E, which A thought was 50% likely and B thought was 1% likely. Intuitively, you want to say that this makes A “more likely” relative to B than we previously thought. But how can you do this if you have Knightian uncertainty and are just planning to do worst-case reasoning over A and B? The solution here is to work with _unnormalized_ probability distributions at the second level. Then, in the case above, we can just scale the “probabilities” in both A and B by the likelihood assigned to E. We _don’t_ normalize A and B after doing this scaling.

But now what exactly do the numbers mean if we’re going to leave these distributions unnormalized? Regular probabilities only really make sense if they sum to 1. We can take a different view on what a “probability distribution” is -- instead of treating it as an object that tells you how _likely_ various hypotheses are, treat it as an object that tells you how much we _care_ about particular hypotheses. (See [related]( <@posts@>(@An Orthodox Case Against Utility Functions@).) So scaling down the “probability” of a hypothesis just means that we care less about what that hypothesis “wants” us to do.

This would be enough if we were going to take an average over A and B to make our final decision. However, our plan is to do worst-case reasoning at the top level. This interacts horribly with our current proposal: when we scale hypotheses in A by 0.5 on average, and hypotheses in B by 0.01 on average, the minimization at the top level is going to place _more_ weight on B, since B is now _more_ likely to be the worst case. Surely this is wrong?

What’s happening here is that B gets most of its expected utility in worlds where we observe different evidence, but the worst-case reasoning at the top level doesn’t take this into account. Before we update, since B assigned 1% to E, the expected utility of B is given by 0.99 * expected utility given not-E + 0.01 * expected utility given E. After the update, the second part remains but the first part disappears, which makes the worst-case reasoning wonky. So what we do is we keep track of the first part as well and make sure that our worst-case reasoning takes it into account.

This gives us **infradistributions**: sets of (m, b) pairs, where m is an unnormalized probability distribution and b corresponds to “the value we would have gotten if we had seen different evidence”. When we observe some evidence E, the hypotheses within m are scaled by the likelihood they assign to E, and b is updated to include the value we would have gotten in the world where we saw anything other than E. Note that it is important to specify the utility function for this to make sense, as otherwise it is not clear how to update b. To compute utilities for decision-making, we do worst-case reasoning over the (m, b) pairs, where we use standard expected values within each m. We can prove that this update rule satisfies _dynamic consistency_: if initially you believe “if I see X, then I want to do Y”, then after seeing X, you believe “I want to do Y”.

So what can we do with infradistributions? Our original motivation was to talk about embedded agency, so a natural place to start is with decision-theory problems in which the environment contains a perfect predictor of the agent, such as in Newcomb’s problem. Unfortunately, we can’t immediately write this down with infradistributions because we have no way of (easily) formally representing “the environment perfectly predicts my actions”. One trick we can use is to consider hypotheses in which the environment just spits out some action, without the constraint that it must match the agent’s action. We then modify the utility function to give infinite utility when the prediction is incorrect. Since we do worst-case reasoning, the agent will effectively act as though this situation is impossible. With this trick, infra-Bayesianism performs similarly to UDT on a variety of challenging decision problems.
This seems pretty cool, though I don’t understand it that well yet. While I don’t yet feel like I have a better philosophical understanding of embedded agency (or its subproblems), I do think this is significant progress along that path.

In particular, one thing that feels a bit odd to me is the choice of worst-case reasoning for the top level -- I don’t really see anything that _forces_ that to be the case. As far as I can tell, we could get all the same results by using best-case reasoning instead (assuming we modified the other aspects appropriately). The obvious justification for worst-case reasoning is that it is a form of risk aversion, but it doesn’t feel like that is really sufficient -- risk aversion in humans is pretty different from literal worst-case reasoning, and also none of the results in the post seem to depend on risk aversion.

I wonder whether the important thing is just that we don’t do expected value reasoning at the top level, and there are in fact a wide variety of other kinds of decision rules that we could use that could all work. If so, it seems interesting to characterize what makes some rules work while others don’t. I suspect that would be a more philosophically satisfying answer to “how should agents reason about environments that are bigger than them”.
AXRP Episode 5 - Infra-Bayesianism
Agent foundationsHighlightCartesian FramesScott GarrabrantAlignment Forum2020AN #127RohinThe <@embedded agency sequence@>(@Embedded Agents@) hammered in the fact that there is no clean, sharp dividing line between an agent and its environment. This sequence proposes an alternate formalism: Cartesian frames. Note this is a paradigm that helps us _think about agency_: you should not be expecting some novel result that, say, tells us how to look at a neural net and find agents within it.

The core idea is that rather than _assuming_ the existence of a Cartesian dividing line, we consider how such a dividing line could be _constructed_. For example, when we think of a sports team as an agent, the environment consists of the playing field and the other team; but we could also consider a specific player as an agent, in which case the environment consists of the rest of the players (on both teams) and the playing field. Each of these are valid ways of carving up what actually happens into an “agent” and an “environment”, they are _frames_ by which we can more easily understand what’s going on, hence the name “Cartesian frames”.

A Cartesian frame takes **choice** as fundamental: the agent is modeled as a set of options that it can freely choose between. This means that the formulation cannot be directly applied to deterministic physical laws. It instead models what agency looks like [“from the inside”]( _If_ you are modeling a part of the world as capable of making choices, _then_ a Cartesian frame is appropriate to use to understand the perspective of that choice-making entity.

Formally, a Cartesian frame consists of a set of agent options A, a set of environment options E, a set of possible worlds W, and an interaction function that, given an agent option and an environment option, specifies which world results. Intuitively, the agent can “choose” an agent option, the environment can “choose” an environment option, and together these produce some world. You might notice that we’re treating the agent and environment symmetrically; this is intentional, and means that we can define analogs of all of our agent notions for environments as well (though they may not have nice philosophical interpretations).

The full sequence uses a lot of category theory to define operations on these sorts of objects and show various properties of the objects and their operations. I will not be summarizing this here; instead, I will talk about their philosophical interpretations.

First, let’s look at an example of using a Cartesian frame on something that isn’t typically thought of as an agent: the atmosphere, within the broader climate system. The atmosphere can “choose” whether to trap sunlight or not. Meanwhile, in the environment, either the ice sheets could melt or they could not. If sunlight is trapped and the ice sheets melt, then the world is Hot. If exactly one of these is true, then the world is Neutral. Otherwise, the world is Cool.

(Yes, this seems very unnatural. That’s good! The atmosphere shouldn’t be modeled as an agent! I’m choosing this example because its unintuitive nature makes it more likely that you think about the underlying rule, rather than just the superficial example. I will return to more intuitive examples later.)


A _property_ of the world is something like “it is neutral or warmer”. An agent can _ensure_ a property if it has some option such that no matter what environment option is chosen, the property is true of the resulting world. The atmosphere could ensure the warmth property above by “choosing” to trap sunlight. Similarly the agent can _prevent_ a property if it can guarantee that the property will not hold, regardless of the environment option. For example, the atmosphere can prevent the property “it is hot”, by “choosing” not to trap sunlight. The agent can _control_ a property if it can both ensure and prevent it. In our example, there is no property that the atmosphere can control.

**Coarsening or refining worlds**

We often want to describe reality at different levels of abstraction. Sometimes we would like to talk about the behavior of various companies; at other times we might want to look at an individual employee. We can do this by having a function that maps low-level (refined) worlds to high-level (coarsened) worlds. In our example above, consider the possible worlds {YY, YN, NY, NN}, where the first letter of a world corresponds to whether sunlight was trapped (Yes or No), and the second corresponds to whether the ice sheets melted. The worlds {Hot, Neutral, Cool} that we had originally are a coarsened version of this, where we map YY to Hot, YN and NY to Neutral, and NN to Cool.


A major upside of Cartesian frames is that given the set of possible worlds that can occur, we can choose how to divide it up into an “agent” and an “environment”. Most of the interesting aspects of Cartesian frames are in the relationships between different ways of doing this division, for the same set of possible worlds.

First, we have interfaces. Given two different Cartesian frames <A, E, W> and <B, F, W> with the same set of worlds, an interface allows us to interpret the agent A as being used in place of the agent B. Specifically, if A would choose an option a, the interface maps this to one of B’s options b. This is then combined with the environment option f (from F) to produce a world w.

A valid interface also needs to be able to map the environment option f to e, and then combine it with the agent option a to get the world. This alternate way of computing the world must always give the same answer.

Since A can be used in place of B, all of A’s options must have equivalents in B. However, B could have options that A doesn’t. So the existence of this interface implies that A is “weaker” in a sense than B. (There are a bunch of caveats here.)

(Relevant terms in the sequence: _morphism_)

**Decomposing agents into teams of subagents**

The first kind of subagent we will consider is a subagent that can control “part of” the agent’s options. Consider for example a coordination game, where there are N players who each individually can choose whether or not to press a Big Red Button. There are only two possible worlds: either the button is pressed, or it is not pressed. For now, let’s assume there are two players, Alice and Bob.

One possible Cartesian frame is the frame for the entire team. In this case, the team has perfect control over the state of the button -- the agent options are either to press the button or not to press the button, and the environment does not have any options (or more accurately, it has a single “do nothing” option).

However, we can also decompose this into separate Alice and Bob _subagents_. What does a Cartesian frame for Alice look like? Well, Alice also has two options -- press the button, or don’t. However, Alice does not have perfect control over the result: from her perspective, Bob is part of the environment. As a result, for Alice, the environment also has two options -- press the button, or don’t. The button is pressed if Alice presses it _or_ if the environment presses it. (The Cartesian frame for Bob is identical, since he is in the same position that Alice is in.)

Note however that this decomposition isn’t perfect: given the Cartesian frames for Alice and Bob, you cannot uniquely recover the original Cartesian frame for the team. This is because both Alice and Bob’s frames say that the environment has some ability to press the button -- _we_ know that this is just from Alice and Bob themselves, but given just the frames we can’t be sure that there isn’t a third person Charlie who also might press the button. So, when we combine Alice and Bob back into the frame for a two-person team, we don’t know whether or not the environment should have the ability to press the button. This makes the mathematical definition of this kind of subagent a bit trickier though it still works out.

Another important note is that this is relative to how coarsely you model the world. We used a fairly coarse model in this example: only whether or not the button was pressed. If we instead used a finer model that tracked which subset of people pressed the button, then we _would_ be able to uniquely recover the team’s Cartesian frame from Alice and Bob’s individual frames.

(Relevant terms in the sequence: _multiplicative subagents, sub-tensors, tensors_)

**Externalizing and internalizing**

This decomposition isn’t just for teams of people: even a single “mind” can often be thought of as the interaction of various parts. For example, hierarchical decision-making can be thought of as the interaction between multiple agents at different levels of the hierarchy.

This decomposition can be done using _externalization_. Externalization allows you to take an existing Cartesian frame and some specific property of the world, and then construct a new Cartesian frame where that property of the world is controlled by the environment.

Concretely, let’s imagine a Cartesian frame for Alice that represents her decision on whether to cook a meal or eat out. If she chooses to cook a meal, then she must also decide which recipe to follow. If she chooses to eat out, she must decide which restaurant to eat out at.

We can externalize the high-level choice of whether Alice cooks a meal or eats out. This results in a Cartesian frame where the environment chooses whether Alice is cooking or eating out, and the agent must then choose a restaurant or recipe as appropriate. This is the Cartesian frame corresponding to the low-level policy that must pursue whatever subgoal is chosen by the high-level planning module (which is now part of the environment). The agent of this frame is a subagent of Alice.

The reverse operation is called internalization, where some property of the world is brought under the control of the agent. In the above example, if we take the Cartesian frame for the low-level policy, and then internalize the cooking / eating out choice, we get back the Cartesian frame for Alice as a unified whole.

Note that in general externalization and internalization are _not_ inverses of each other. As a simple example, if you externalize something that is already “in the environment” (e.g. whether it is raining, in a frame for Alice), that does nothing, but when you then internalize it, that thing is now assumed to be under the agent’s control (e.g. now the “agent” in the frame can control whether or not it is raining). We will return to this point when we talk about observability.

**Decomposing agents into disjunctions of subagents**

Our subagents so far have been “team-based”: the original agent could be thought of as a supervisor that got to control all of the subagents together. (The team agent in the button-pressing game could be thought of as controlling both Alice and Bob’s actions; in the cooking / eating out example Alice could be thought of as controlling both the high-level subgoal selection as well as the low-level policy that executes on the subgoals.)

The sequence also introduces another decomposition into subagents, where the superagent can be thought of as a supervisor that gets to choose _which_ of the subagents gets to control the overall behavior. Thus, the superagent can do anything that either of the subagents could do.

Let’s return to our cooking / eating out example. We previously saw that we could decompose Alice into a high-level subgoal-choosing subagent that chooses whether to cook or eat out, and a low-level subgoal-execution subagent that then chooses which recipe to make or which restaurant to go to. We can also decompose Alice as being the choice of two subagents: one that chooses which restaurant to go to, and one that chooses which recipe to make. The union of these subagents is an agent that first chooses whether to go to a restaurant or to make a recipe, and then uses the appropriate subagent to choose the restaurant or recipe: this is exactly a description of Alice.

(Relevant terms in the sequence: _additive subagents, sub-sums, sums_)

**Committing and assuming**

One way to think about the subagents of the previous example is that they are the result of Alice _committing_ to a particular subset of choices. If Alice commits to eating out (but doesn’t specify at what restaurant), then the resulting frame is equivalent to the restaurant-choosing subagent.

Similarly to committing, we can also talk about _assuming_. Just as commitments restrict the set of options available to the agent, assumptions restrict the set of options available to the environment.

Just as we can union two agents together to get an agent that gets to choose between two subagents, we can also union two environments together to get an environment that gets to choose between two subenvironments. (In this case the agent is more constrained: it must be able to handle the environment regardless of which way the environment chooses.)

(Relevant terms in the sequence: _product_)


The most interesting (to me) part of this sequence was the various equivalent definitions of what it means for something to be observable. The overall story is similar to the one in [Knowledge is Freedom]( an agent is said to “observe” a property P if it is capable of making different decisions based on whether P holds or not.

Thus we get our first definition of observability: **a property P of the world is _observable_ if, for any two agent options a and b, the agent also has an option that is equivalent to “if P then a else b”.**

Intuitively, this is meant to be similar to the notion of “inputs” to an agent. Intuitively, a neural net should be able to express arbitrary computations over its inputs, and so if we view the neural net as “choosing” what computation to do (by “choosing” what its parameters are), then the neural net can have its outputs (agent options) depend in arbitrary ways on the inputs. Thus, we say that the neural net “observes” its inputs, because what the neural net does can depend freely on the inputs.

Note that this is a very black-or-white criterion: we must be able to express _every_ conditional policy on the property for it to be observable; if even one such policy is not expressible then the property is not observable.

One way to think about this is that an observable property needs to be completely under the control of the environment, that is, the environment option should completely determine whether the resulting world satisfies the property or not -- nothing the agent does can matter (for this property). To see this, suppose that there was some environment option e that didn’t fully determine a property P, so that there are agent options a and b such that the world corresponding to (a, e) satisfies P but the one corresponding to (b, e) does not. Then our agent cannot implement the conditional policy “if P then b else a”, because it would lead to a self-referential contradiction (akin to “this sentence is false”) when the environment chooses e. Thus, P cannot be observable.

This is not equivalent to observability: it is possible for the environment to fully control P, while the agent is still unable to always condition on P. So we do need something extra. Nevertheless, this intuition suggests a few other ways of thinking about observability. The key idea is to identify a decomposition of the agent based on P that should only work if the environment has all the control, and then to identify a union step that puts the agent back together, that automatically adds in all of the policies that are conditional on P. I’ll describe these definitions here; the sequence proves that they are in fact equivalent to the original definition above.

First, recall that externalization and internalization are methods that allow us to “transfer” control of some property from the agent to the environment and vice versa. Thus, if all the control of P is in the environment, one would hope that internalization followed by externalization just transfers the control back and forth. In addition, when we externalize P, the externalization process will enforce that the agent can condition on P arbitrarily (this is how it is defined). This suggests the definition: **P is observable if and only if internalizing P followed by externalizing P gives us back the original frame.**

Second, if the environment has all of the control over P, then we should be able to decompose the agent into two parts: one that decides what to do when P is true, and one that decides what to do when P is false. We can achieve this using _assumptions_, that is, the first agent is the original agent under the assumption that P is true, and the second is under the assumption that P is false. Note that if the environment didn’t have perfect control over P, this would not work, as the environment options where P is not guaranteed to be true or false would simply be deleted, and could not be reconstructed from the two new agents.

We now need to specify how to put the agents back together, in a way that includes all the conditional policies on P. There are actually two variants in how we can do this:

In the first case, we combine the agents by unioning the environments, which lets the environment choose whether P is true or not. Given how this union is defined, the new agent is able to specify both what to do given the environment’s choice, _as well as_ what it would have done in the counterfactual case where the environment had decided P differently. This allows it to implement all conditional policies on P. So, **P is observable if and only if decomposing the frame using assumptions on P, and then unioning the environments of the resulting frames gives back the original frame.**

In the second case, after getting agents via assumption on P, you extend each agent so that in the case where its assumption is false, it is as though it takes a noop action. Intuitively, the resulting agent is an agent that is hobbled so that it has no power in worlds where P comes out differently than was assumed. These agents are then combined into a team. Intuitively, the team selects an option of the form “the first agent tries to do X (which only succeeds when P is true) and the second agent tries to do Y (which only succeeds when P is false)”. Like the previous decomposition, this specifies both what to do in whatever actual environment results, as well as what would have been done in the counterfactual world where the value of P was reversed. Thus, this way of combining the agents once again adds in all conditional policies on P. So, **P is observable if and only if decomposing the frame using assumptions on P, then hobbling the resulting frames in cases where their assumptions are false, and then putting the agents back in a team, is equivalent to the original frame.**


Cartesian frames do not have an intrinsic notion of time. However, we can still use them to model sequential processes, by having the agent options be _policies_ rather than actions, and having the worlds be histories or trajectories rather than states.

To say useful things about time, we need to broaden our notion of observables. So far I’ve been talking about whether you can observe binary properties P that are either true or false. In fact, all of the definitions can be easily generalized to n-ary properties P that can take on one of N values. We’ll be using this notion of observability here.

Consider a game of chess where Alice plays as white and Bob as black. Intuitively, when Alice is choosing her second move, she can observe Bob’s first move. However, the property “Bob’s first move” would not be observable in Alice’s Cartesian frame, because Alice’s _first_ move cannot depend on Bob’s first move (since Bob hasn’t made it yet), and so when deciding the first move we can’t implement policies that condition on what Bob’s first move is.

Really, we want some way to say “after Alice has made her first move, from the perspective of the rest of her decisions, Bob’s first move is observable”. But we know how to remove some control from the agent in order to get the perspective of “everything else” -- that’s externalization! In particular, in Alice’s frame, if we externalize the property “Alice’s first move”, then the property “Bob’s first move” _is_ observable in the new frame.

This suggests a way to define a sequence of frames that represent the passage of time: we define the Tth frame as “the original frame, but with the first T moves externalized”, or equivalently as “the T-1th frame, but with the Tth move externalized”. Each of these frames are subagents of the original frame, since we can think of the full agent (Alice) as the team of “the agent that plays the first T moves” and “the agent that plays the T+1th move and onwards”. As you might expect, as “time” progresses, the agent loses controllables and gains observables. For example, by move 3 Alice can no longer control her first two moves, but she can now observe Bob’s first two moves, relative to Alice at the beginning of the game.
I like this way of thinking about agency: we’ve been talking about “where to draw the line around the agent” for quite a while in AI safety, but there hasn’t been a nice formalization of this until now. In particular, it’s very nice that we can compare different ways of drawing the line around the agent, and make precise various concepts around this, such as “subagent”.

I’ve also previously liked the notion that “to observe P is to be able to change your decisions based on the value of P”, but I hadn’t really seen much discussion about it until now. This sequence makes some real progress on conceptual understanding of this perspective: in particular, the notion that observability requires “all the control to be in the environment” is not one I had until now. (Though I should note that this particular phrasing is mine, and I’m not sure the author would agree with the phrasing.)

One of my checks for the utility of foundational theory for a particular application is to see whether the key results can be explained without having to delve into esoteric mathematical notation. I think this sequence does very well on this metric -- for the most part I didn’t even read the proofs, yet I was able to reconstruct conceptual arguments for many of the theorems that are convincing to me. (They aren’t and shouldn’t be as convincing as the proofs themselves.) However, not all of the concepts score so well on this -- for example, the generic subagent definition was sufficiently unintuitive to me that I did not include it in this summary.
Agent foundationsHighlightThe ground of optimizationAlex FlintAlignment Forum2020AN #105RohinMany arguments about AI risk depend on the notion of “optimizing”, but so far it has eluded a good definition. One natural [approach]( is to say that an optimizer causes the world to have higher values according to some reasonable utility function, but this seems insufficient, as then a <@bottle cap would be an optimizer@>(@Bottle Caps Aren't Optimisers@) for keeping water in the bottle.

This post provides a new definition of optimization, by taking a page from <@Embedded Agents@> and analyzing a system as a whole instead of separating the agent and environment. An **optimizing system** is then one which tends to evolve toward some special configurations (called the **target configuration set**), when starting anywhere in some larger set of configurations (called the **basin of attraction**), _even if_ the system is perturbed.

For example, in gradient descent, we start with some initial guess at the parameters θ, and then continually compute loss gradients and move θ in the appropriate direction. The target configuration set is all the local minima of the loss landscape. Such a program has a very special property: while it is running, you can change the value of θ (e.g. via a debugger), and the program will probably _still work_. This is quite impressive: certainly most programs would not work if you arbitrarily changed the value of one of the variables in the middle of execution. Thus, this is an optimizing system that is robust to perturbations in θ. Of course, it isn’t robust to arbitrary perturbations: if you change any other variable in the program, it will probably stop working. In general, we can quantify how powerful an optimizing system is by how robust it is to perturbations, and how small the target configuration set is.

The bottle cap example is _not_ an optimizing system because there is no broad basin of configurations from which we get to the bottle being full of water. The bottle cap doesn’t cause the bottle to be full of water when it didn’t start out full of water.

Optimizing systems are a superset of goal-directed agentic systems, which require a separation between the optimizer and the thing being optimized. For example, a tree is certainly an optimizing system (the target is to be a fully grown tree, and it is robust to perturbations of soil quality, or if you cut off a branch, etc). However, it does not seem to be a goal-directed agentic system, as it would be hard to separate into an “optimizer” and a “thing being optimized”.

This does mean that we can no longer ask “what is doing the optimization” in an optimizing system. This is a feature, not a bug: if you expect to always be able to answer this question, you typically get confusing results. For example, you might say that your liver is optimizing for making money, since without it you would die and fail to make money.

The full post has several other examples that help make the concept clearer.
I’ve <@previously argued@>(@Intuitions about goal-directed behavior@) that we need to take generalization into account in a definition of optimization or goal-directed behavior. This definition achieves that by primarily analyzing the robustness of the optimizing system to perturbations. While this does rely on a notion of counterfactuals, it still seems significantly better than any previous attempt to ground optimization.

I particularly like that the concept doesn’t force us to have a separate agent and environment, as that distinction does seem quite leaky upon close inspection. I gave a shot at explaining several other concepts from AI alignment within this framework in [this comment](, and it worked quite well. In particular, a computer program is a goal-directed AI system if there is an environment such that adding the computer program to the environment transforms it into an optimizing system for some “interesting” target configuration states (with one caveat explained in the comment).
Agent foundationsEmbedded Agency via AbstractionJohn WentworthAlignment Forum2019AN #83Asya<@Embedded agency problems@>(@Embedded Agents@) are a class of theoretical problems that arise as soon as an agent is part of the environment it is interacting with and modeling, rather than having a clearly-defined and separated relationship. This post makes the argument that before we can solve embedded agency problems, we first need to develop a theory of _abstraction_. _Abstraction_ refers to the problem of throwing out some information about a system while still being able to make predictions about it. This problem can also be referred to as the problem of constructing a map for some territory.

The post argues that abstraction is key for embedded agency problems because the underlying challenge of embedded world models is that the agent (the map) is smaller than the environment it is modeling (the territory), and so inherently has to throw some information away.

Some simple questions around abstraction that we might want to answer include:
- Given a map-making process, characterize the queries whose answers the map can reliably predict.
- Given some representation of the map-territory correspondence, translate queries from the territory-representation to the map-representation and vice versa.
- Given a territory, characterize classes of queries which can be reliably answered using a map much smaller than the territory itself.
- Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class.

The post argues that once we create the simple theory, we will have a natural way of looking at more challenging problems with embedded agency, like the problem of self-referential maps, the problem of other map-makers, and the problem of self-reasoning that arises when the produced map includes an abstraction of the map-making process itself.
My impression is that embedded agency problems as a class of problems are very young, extremely entangled, and characterized by a lot of confusion. I am enthusiastic about attempts to decrease confusion and intuitively, abstraction does feel like a key component to doing that.

That being said, my guess is that it’s difficult to predictably suggest the most promising research directions in a space that’s so entangled. For example, [one thread in the comments of this post]( discusses the fact that this theory of abstraction as presented looks at “one-shot” agency where the system takes in some data once and then outputs it, rather than “dynamic” agency where a system takes in data and outputs decisions repeatedly over time. [Abram Demski argues]( that the “dynamic” nature of embedded agency is a [central part of the problem]( and that it may be more valuable and neglected to put research emphasis there.
Agent foundationsTheory of Ideal Agents, or of Existing Agents?John WentworthAlignment Forum2019AN #66FloThere are at least two ways in which a theoretical understanding of agency can be useful: On one hand, such understanding can enable the **design** of an artificial agent with certain properties. On the other hand, it can be used to **describe** existing agents. While both perspectives are likely needed for successfully aligning AI, individual researchers face a tradeoff: either they focus their efforts on existence results concerning strong properties, which helps with design (e.g. most of <@MIRI's work on embedded agency@>(@Embedded Agents@)), or they work on proving weaker properties for a broad class of agents, which helps with description (e.g. [all logical inductors can be described as markets](, summarized next). The prioritization of design versus description is a likely crux in disagreements about the correct approach to developing a theory of agency.
To facilitate productive discussions it seems important to disentangle disagreements about goals from disagreements about means whenever we can. I liked the clear presentation of this attempt to identify a common source of disagreements on the (sub)goal level.
Agent foundationsTroll BridgeAbram DemskiAlignment Forum2019AN #63RohinThis is a particularly clean exposition of the Troll Bridge problem in decision theory. In this problem, an agent is determining whether to cross a bridge guarded by a troll who will blow up the agent if its reasoning is inconsistent. It turns out that an agent with consistent reasoning can prove that if it crosses, it will be detected as inconsistent and blown up, and so it decides not to cross. This is rather strange reasoning about counterfactuals -- we'd expect perhaps that the agent is uncertain about whether its reasoning is consistent or not.
Agent foundationsSelection vs ControlAbram DemskiAlignment Forum2019AN #58RohinThe previous paper focuses on mesa optimizers that are explicitly searching across a space of possibilities for an option that performs well on some objective. This post argues that in addition to this "selection" model of optimization, there is a "control" model of optimization, where the model cannot evaluate all of the options separately (as in e.g. a heat-seeking missile, which can't try all of the possible paths to the target separately). However, these are not cleanly separated categories -- for example, a search process could have control-based optimization inside of it, in the form of heuristics that guide the search towards more likely regions of the search space.
This is an important distinction, and I'm of the opinion that most of what we call "intelligence" is actually more like the "control" side of these two options.
Agent foundationsPavlov GeneralizesAbram DemskiAlignment Forum2019AN #52RohinIn the iterated prisoner's dilemma, the [Pavlov strategy]( is to start by cooperating, and then switch the action you take whenever the opponent defects. This can be generalized to arbitrary games. Roughly, an agent is "discontent" by default and chooses actions randomly. It can become "content" if it gets a high payoff, in which case it continues to choose whatever action it previously chose as long as the payoffs remain consistently high. This generalization achieves Pareto optimality in the limit, though with a very bad convergence rate. Basically, all of the agents start out discontent and do a lot of exploration, and as long as any one agent is discontent the payoffs will be inconsistent and all agents will tend to be discontent. Only when by chance all of the agents take actions that lead to all of them getting high payoffs do they all become content, at which point they keep choosing the same action and stay in the equilibrium.

Despite the bad convergence, the cool thing about the Pavlov generalization is that it only requires agents to notice when the results are good or bad for them. In contrast, typical strategies that aim to mimic Tit-for-Tat require the agent to reason about the beliefs and utility functions of other agents, which can be quite difficult to do. By just focusing on whether things are going well for themselves, Pavlov agents can get a lot of properties in environments with other agents that Tit-for-Tat strategies don't obviously get, such as exploiting agents that always cooperate. However, when thinking about <@logical time@>(@In Logical Time, All Games are Iterated Games@), it would seem that a Pavlov-esque strategy would have to make decisions based on a prediction about its own behavior, which is... not obviously doomed, but seems odd. Regardless, given the lack of work on Pavlov strategies, it's worth trying to generalize them further.
Agent foundationsCDT=EDT=UDTAbram DemskiAlignment Forum2019AN #42
Agent foundationsGrokking the Intentional StanceJack KochAlignment Forum2021AN #164RohinThis post describes takeaways from [The Intentional Stance]( by Daniel Dennett for the concept of agency. The key idea is that whether or not some system is an “agent” depends on who is observing it: for example, humans may not look like agents to superintelligent Martians who can predict our every move through a detailed understanding of the laws of physics. A system is an agent relative to an observer if the observer’s best model of the system (i.e. the one that is most predictive) is one in which the system has “goals” and “beliefs”. Thus, with AI systems, we should not ask whether an AI system “is” an agent; instead we should ask whether the AI system’s behavior is reliably predictable by the intentional stance.

How is the idea that agency only arises relative to some observer compatible with our view of ourselves as agents? This can be understood as one “part” of our cognition modeling “ourselves” using the intentional stance. Indeed, a system usually cannot model itself in full fidelity, and so it makes a lot of sense that an intentional stance would be used to make an approximate model instead.
I generally agree with the notion that whether or not something feels like an “agent” depends primarily on whether or not we model it using the intentional stance, which is primarily a statement about our understanding of the system. (For example, I expect programmers are much less likely to anthropomorphize a laptop than laypeople, because they understand the mechanistic workings of laptops better.) However, I think we do need an additional ingredient in AI risk arguments, because such arguments make claims about how an AI system will behave in novel circumstances that we’ve never seen before. To justify that claim, we need to have an argument that can predict how the agent behaves in new situations; it doesn’t seem like the intentional stance can give us that information by itself. See also [this comment](
<@The ground of optimization@>
Agent foundationsThe Accumulation of KnowledgeAlex FlintAlignment Forum2021AN #156RohinProbability theory can tell us about how we ought to build agents that have knowledge (start with a prior and perform Bayesian updates as evidence comes in). However, this is not the only way to create knowledge: for example, humans are not ideal Bayesian reasoners. As part of our quest to <@_describe_ existing agents@>(@Theory of Ideal Agents, or of Existing Agents?@), could we have a theory of knowledge that specifies when a particular physical region within a closed system is “creating knowledge”? We want a theory that <@works in the Game of Life@>(@Agency in Conway’s Game of Life@) as well as the real world.

This sequence investigates this question from the perspective of defining the accumulation of knowledge as increasing correspondence between [a map and the territory](, and concludes that such definitions are not tenable. In particular, it considers four possibilities and demonstrates counterexamples to all of them:

1. Direct map-territory resemblance: Here, we say that knowledge accumulates in a physical region of space (the “map”) if that region of space looks more like the full system (the “territory”) over time.

Problem: This definition fails to account for cases of knowledge where the map is represented in a very different way that doesn’t resemble the territory, such as when a map is represented by a sequence of zeros and ones in a computer.

2. Map-territory mutual information: Instead of looking at direct resemblance, we can ask whether there is increasing mutual information between the supposed map and the territory it is meant to represent.

Problem: In the real world, nearly _every_ region of space will have high mutual information with the rest of the world. For example, by this definition, a rock accumulates lots of knowledge as photons incident on its face affect the properties of specific electrons in the rock giving it lots of information.

3. Mutual information of an abstraction layer: An abstraction layer is a grouping of low-level configurations into high-level configurations such that transitions between high-level configurations are predictable without knowing the low-level configurations. For example, the zeros and ones in a computer are the high-level configurations of a digital abstraction layer over low-level physics. Knowledge accumulates in a region of space if that space has a digital abstraction layer, and the high-level configurations of the map have increasing mutual information with the low-level configurations of the territory.

Problem: A video camera that constantly records would accumulate much more knowledge by this definition than a human, even though the human is much more able to construct models and act on them.

4. Precipitation of action: The problem with our previous definitions is that they don’t require the knowledge to be _useful_. So perhaps we can instead say that knowledge is accumulating when it is being used to take action. To make this mechanistic, we say that knowledge accumulates when an entity’s actions become more fine-tuned to a specific environment configuration over time. (Intuitively, they learned more about the environment and so could condition their actions on that knowledge, which they previously could not do.)

Problem: This definition requires the knowledge to actually be used to count as knowledge. However, if someone makes a map of a coastline, but that map is never used (perhaps it is quickly destroyed), it seems wrong to say that during the map-making process knowledge was not accumulating.
Agent foundationsA Semitechnical Introductory Dialogue on Solomonoff InductionEliezer YudkowskyAlignment Forum2015AN #141RohinThis post is a good introduction to Solomonoff induction and why it’s interesting (though note it is quite long).
Agent foundationsTwo Alternatives to Logical CounterfactualsJessica TaylorAlignment Forum2020AN #94
Agent foundationsDissolving Confusion around Functional Decision TheoryStephen CasperLessWrong2020AN #83RohinThis post argues for functional decision theory (FDT) on the basis of the following two principles:

1. Questions in decision theory are not about what "choice" you should make with your "free will", but about what source code you should be running.
2. P "subjunctively depends" on A to the extent that P's predictions of A depend on correlations that can't be confounded by choosing the source code that A runs.
I liked these principles, especially the notion that subjunctive dependence should be cashed out as "correlations that aren't destroyed by changing the source code". This isn't a perfect criterion: FDT can and should apply to humans as well, but we _don't_ have control over our source code.
Agent foundationsConceptual Problems with UDT and Policy SelectionAbram DemskiAlignment Forum2019AN #82RohinIn Updateless Decision Theory (UDT), the agent decides "at the beginning of time" exactly how it will respond to every possible sequence of observations it could face, so as to maximize the expected value it gets with respect to its prior over how the world evolves. It is updateless because it decides ahead of time how it will respond to evidence, rather than updating once it sees the evidence. This works well when the agent can consider the full environment and react to it, and often gets the right result even when the environment can model the agent (as in Newcomblike problems), as long as the agent knows how the environment will model it.

However, it seems unlikely that UDT will generalize to logical uncertainty and multiagent settings. Logical uncertainty occurs when you haven't computed all the consequences of your actions and is reduced by thinking longer. However, this effectively is a form of updating, whereas UDT tries to know everything upfront and never update, and so it seems hard to make it compatible with logical uncertainty. With multiagent scenarios, the issue is that UDT wants to decide on its policy "before" any other policies, which may not always be possible, e.g. if another agent is also using UDT. The philosophy behind UDT is to figure out how you will respond to everything ahead of time; as a result, UDT aims to precommit to strategies assuming that other agents will respond to its commitments; so two UDT agents are effectively "racing" to make their commitments as fast as possible, reducing the time taken to consider those commitments as much as possible. This seems like a bad recipe if we want UDT agents to work well with each other.
I am no expert in decision theory, but these objections seem quite strong and convincing to me.
Agent foundationsA Critique of Functional Decision TheoryWill MacAskillAlignment Forum2019AN #82Rohin_This summary is more editorialized than most._ This post critiques [Functional Decision Theory]( (FDT). I'm not going to go into detail, but I think the arguments basically fall into two camps. First, there are situations in which there is no uncertainty about the consequences of actions, and yet FDT chooses actions that do not have the highest utility, because of their impact on counterfactual worlds which "could have happened" (but ultimately, the agent is just leaving utility on the table). Second, FDT relies on the ability to tell when someone is "running an algorithm that is similar to you", or is "logically correlated with you". But there's no such crisp concept, and this leads to all sorts of problems with FDT as a decision theory.
Like [Buck from MIRI](, I feel like I understand these objections and disagree with them. On the first argument, I agree with [Abram]( that a decision should be evaluated based on how well the agent performs with respect to the probability distribution used to define the problem; FDT only performs badly if you evaluate on a decision problem produced by conditioning on a highly improbable event. On the second class of arguments, I certainly agree that there isn't (yet) a crisp concept for "logical similarity"; however, I would be shocked if the _intuitive concept_ of logical similarity was not relevant in the general way that FDT suggests. If your goal is to hardcode FDT into an AI agent, or your goal is to write down a decision theory that in principle (e.g. with infinite computation) defines the correct action, then it's certainly a problem that we have no crisp definition yet. However, FDT can still be useful for getting more clarity on how one ought to reason, without providing a full definition.
Agent foundationsTwo senses of “optimizer”Joar SkalseAlignment Forum2019AN #63RohinThe first sense of "optimizer" is an optimization algorithm, that given some formally specified problem computes the solution to that problem, e.g. a SAT solver or linear program solver. The second sense is an algorithm that acts upon its environment to change it. Joar believes that people often conflate the two in AI safety.
I agree that this is an important distinction to keep in mind. It seems to me that the distinction is whether the optimizer has knowledge about the environment: in canonical examples of the first kind of optimizer, it does not. If we somehow encoded the dynamics of the world as a SAT formula and asked a super-powerful SAT solver to solve for the actions that accomplish some goal, it would look like the second kind of optimizer.
Agent foundationsClarifying Logical CounterfactualsChris LeongLessWrong2019AN #43
Agent foundationsCDT Dutch BookAbram DemskiAlignment Forum2019AN #42
Agent foundationsAnthropic paradoxes transposed into Anthropic Decision TheoryStuart ArmstrongAlignment Forum2018AN #38
Agent foundationsAnthropic probabilities and cost functionsStuart ArmstrongAlignment Forum2018AN #38
Agent foundationsBounded Oracle InductionDiffractorAlignment Forum2018AN #35
Agent foundationsRobust program equilibriumCaspar OesterheldSpringer2018AN #35
Agent foundationsA Rationality Condition for CDT Is That It Equal EDT (Part 2)Abram DemskiAlignment Forum2018AN #28
Agent foundationsA Rationality Condition for CDT Is That It Equal EDT (Part 1)Abram DemskiAlignment Forum2018AN #27
Agent foundationsEDT solves 5 and 10 with conditional oraclesjessicataAlignment Forum2018AN #27
Agent foundationsAsymptotic Decision Theory (Improved Writeup)DiffractorAlignment Forum2018AN #26
Agent foundationsIn Logical Time, All Games are Iterated GamesAbram DemskiAlignment Forum2018AN #25RichardThe key difference between causal and functional decision theory is that the latter supplements the normal notion of causation with "logical causation". The decision of agent A can logically cause the decision of agent B even if B made their decision before A did - for example, if B made their decision by simulating A. Logical time is an informal concept developed to help reason about which computations cause which other computations: logical causation only flows forward through logical time in the same way that normal causation only flows forward through normal time (although maybe logical time turns out to be loopy). For example, when B simulates A, B is placing themselves later in logical time than A. When I choose not to move my bishop in a game of chess because I've noticed it allows a sequence of moves which ends in me being checkmated, then I am logically later than that sequence of moves. One toy model of logical time is based on proof length - we can consider shorter proofs to be earlier in logical time than longer proofs. It's apparently surprisingly difficult to find a case where this fails badly.

In logical time, all games are iterated games. We can construct a series of simplified versions of each game where each player's thinking time is bounded. As thinking time increases, the games move later in logical time, and so we can treat them as a series of iterated games whose outcomes causally affect all longer versions. Iterated games are fundamentally different from single-shot games: the [folk theorem]( states that virtually any outcome is possible in iterated games.
I like logical time as an intuitive way of thinking about logical causation. However, the analogy between normal time and logical time seems to break down in some cases. For example, suppose we have two boolean functions F and G, such that F = not G. It seems like G is logically later than F - yet we could equally well have defined them such that G = not F, which leads to the opposite conclusion. As Abram notes, logical time is intended as an intuition pump not a well-defined theory - yet the possibility of loopiness makes me less confident in its usefulness. In general I am pessimistic about the prospects for finding a formal definition of logical causation, for reasons I described in [Realism about Rationality](, which Rohin summarised above.
Agent foundationsWhen wishful thinking worksAlex MennenAlignment Forum2018AN #23RohinSometimes beliefs can be loopy, in that the probability of a belief being true depends on whether you believe it. For example, the probability that a placebo helps you may depend on whether you believe that a placebo helps you. In the situation where you know this, you can "wish" your beliefs to be the most useful possible beliefs. In the case where the "true probability" depends continuously on your beliefs, you can use a fixed point theorem to find a consistent set of probabilities. There may be many such fixed points, in which case you can choose the one that would lead to highest expected utility (such as choosing to believe in the placebo). One particular application of this would be to think of the propositions as "you will take action a_i". In this case, you act the way you believe you act, and then every probability distribution over the propositions is a fixed point, and so we just choose the probability distribution (i.e. stochastic policy) that maximized expected utility, as usual. This analysis can also be carried to Nash equilibria, where beliefs in what actions you take will affect the actions that the other player takes.
Agent foundationsCounterfactuals and reflective oraclesNisanAlignment Forum2018AN #23
Agent foundationsBottle Caps Aren't OptimisersDaniel FilanAlignment Forum2018AN #22RohinThe previous paper detects optimizers by studying their behavior. However, if the goal is to detect an optimizer before deployment, we need to determine whether an algorithm is performing optimization by studying its source code, _without_ running it. One definition that people have come up with is that an optimizer is something such that the objective function attains higher values than it otherwise would have. However, the author thinks that this definition is insufficient. For example, this would allow us to say that a bottle cap is an optimizer for keeping water inside the bottle. Perhaps in this case we can say that there are simpler descriptions of bottle caps, so those should take precedence. But what about a liver? We could say that a liver is optimizing for its owner's bank balance, since in its absence the bank balance is not going to increase.
Here, we want a definition of optimization because we're worried about an AI being deployed, optimizing for some metric in the environment, and then doing something unexpected that we don't like but nonetheless does increase the metric (falling prey to Goodhart's law). It seems better to me to talk about "optimizer" and "agent" as models of predicting behavior, not something that is an inherent property of the thing producing the behavior. Under that interpretation, we want to figure out whether the agent model with a particular utility function is a good model for an AI system, by looking at its internals (without running it). It seems particularly important to be able to use this model to predict the behavior in novel situations -- perhaps that's what is needed to make the definition of optimizer avoid the counterexamples in this post. (A bottle cap definitely isn't going to keep water in containers if it is simply lying on a table somewhere.)
Agent foundationsComputational complexity of RL with trapsVadim KosoyAlignment Forum2018AN #22RohinA post asking about complexity theoretic results around RL, both with (unknown) deterministic and stochastic dynamics.
Agent foundationsCorrigibility doesn't always have a good action to takeStuart ArmstrongAlignment Forum2018AN #22RohinStuart has [previously argued]( that an AI could be put in situations where no matter what it does, it would affect the human's values. In this short post, he notes that if you then say that it is possible to have situations where the AI cannot act corrigibly, then other problems arise, such as how you can create a superintelligent corrigible AI that does anything at all (since any action that it takes would likely affect our values somehow).
Agent foundationsUsing expected utility for Good(hart)Stuart ArmstrongAlignment Forum2018AN #22RohinIf we include all of the uncertainty we have about human values into the utility function, then it seems possible to design an expected utility maximizer that doesn't fall prey to Goodhart's law. The post shows a simple example where there are many variables that may be of interest to humans, but we're not sure which ones. In this case, by incorporating this uncertainty into our proxy utility function, we can design an expected utility maximizer that has conservative behavior that makes sense.
On the one hand, I'm sympathetic to this view -- for example, I see risk aversion as a heuristic leading to good expected utility maximization for bounded reasoners on large timescales. On the other hand, an EU maximizer still seems hard to align, because whatever utility function it gets, or distribution over utility functions, it will act as though that input is definitely true, which means that anything we fail to model will never make it into the utility function. If you could have some sort of "unresolvable" uncertainty, some reasoning (similar to the [problem of induction]( suggesting that you can never fully trust your own thoughts to be perfectly correct, that would make me more optimistic about an EU maximization based approach, but I don't think it can be done by just changing the utility function, or by adding a distribution over them.
Agent foundationsAgents and Devices: A Relative Definition of AgencyLaurent Orseau, Simon McGregor McGill, Shane LeggarXiv2018AN #22RohinThis paper considers the problem of modeling other behavior, either as an agent (trying to achieve some goal) or as a device (that reacts to its environment without any clear goal). They use Bayesian IRL to model behavior as coming from an agent optimizing a reward function, and design their own probability model to model the behavior as coming from a device. They then use Bayes rule to decide whether the behavior is better modeled as an agent or as a device. Since they have a uniform prior over agents and devices, this ends up choosing the one that better fits the data, as measured by log likelihood.

In their toy gridworld, agents are navigating towards particular locations in the gridworld, whereas devices are reacting to their local observation (the type of cell in the gridworld that they are currently facing, as well as the previous action they took). They create a few environments by hand which demonstrate that their method infers the intuitive answer given the behavior.
In their experiments, they have two different model classes with very different inductive biases, and their method correctly switches between the two classes depending on which inductive bias works better. One of these classes is the maximization of some reward function, and so we call that the agent class. However, they also talk about using the Solomonoff prior for devices -- in that case, even if we have something we would normally call an agent, if it is even slightly suboptimal, then with enough data the device explanation will win out.

I'm not entirely sure why they are studying this problem in particular -- one reason is explained in the next post, I'll write more about it in that section.
Agent foundationsCooperative OraclesDiffractorAlignment Forum2018AN #22
Agent foundationsReducing collective rationality to individual optimization in common-payoff games using MCMCjessicataAlignment Forum2018AN #21RohinGiven how hard multiagent cooperation is, it would be great if we could devise an algorithm such that each agent is only locally optimizing their own utility (without requiring that anyone else change their policy), that still achieves the globally optimal policy. This post considers the case where all players have the same utility function in an iterated game. In this case, we can define a process where at every timestep, one agent is randomly selected, and that agent changes their action in the game uniformly at random with probability that depends on how much utility was just achieved. This depends on a rationality parameter α -- the higher α is, the more likely it is for the player to stick with a high utility action.

This process allows you to reach every possible joint action from every other possible joint action with some non-zero probability, so in the limit of running this process forever, you will end up visiting every state infinitely often. However, by cranking up the value of α, we can ensure that in the limit we spend most of the time in the high-value states and rarely switch to anything lower, which lets us get arbitrarily close to the optimal deterministic policy (and so arbitrarily close to the optimal expected value).
I like this, it's an explicit construction that demonstrates how you can play with the explore-exploit tradeoff in multiagent settings. Note that when α is set to be very high (the condition in which we get near-optimal outcomes in the limit), there is very little exploration, and so it will take a long time before we actually find the optimal outcome in the first place. It seems like this would make it hard to use in practice, but perhaps we could replace the exploration with reasoning about the game and other agents in it? The author was planning to use reflective oracles to do something like this if I understand correctly.
Agent foundationsComputational efficiency reasons not to model VNM-rational preference relations with utility functionsAlex MennenLessWrong2018AN #17RohinRealistic agents don't use utility functions over world histories to make decisions, because it is computationally infeasible, and it's quite possible to make a good decision by only considering the local effects of the decision. For example, when deciding whether or not to eat a sandwich, we don't typically worry about the outcome of a local election in Siberia. For the same computational reasons, we wouldn't want to use a utility function to model other agents. Perhaps a utility function is useful for measuring the strength of an agent's preference, but even then it is really measuring the ratio of the strength of the agent's preference to the strength of the agent's preference over the two reference points used to determine the utility function.
I agree that we certainly don't want to model other agents using full explicit expected utility calculations because it's computationally infeasible. However, as a first approximation it seems okay to model other agents as computationally bounded optimizers of some utility function. It seems like a bigger problem to me that any such model predicts that the agent will never change its preferences (since that would be bad according to the current utility function).
Agent foundationsStable Pointers to Value III: Recursive QuantilizationAbram DemskiLessWrong2018AN #17RohinWe often try to solve alignment problems by going a level meta. For example, instead of providing feedback on what the utility function is, we might provide feedback on how to best learn what the utility function is. This seems to get more information about what safe behavior is. What if we iterate this process? For example, in the case of quantilizers with three levels of iteration, we would do a quantilized search over utility function generators, then do a quantilized search over the generated utility functions, and then do a quantilized search to actually take actions.
The post mentions what seems like the most salient issue -- that it is really hard for humans to give feedback even a few meta levels up. How do you evaluate a thing that will create a distribution over utility functions? I might go further -- I'm not even sure there is good normative feedback on the meta level(s). There is feedback we can give on the meta level for any particular object-level instance, but it seems not at all obvious (to me) that this advice will generalize well to other object-level instances. On the other hand, it does seem to me that the higher up you are in meta-levels, the smaller the space of concepts and the easier it is to learn. So maybe my overall take is that it seems like we can't depend on humans to give meta-level feedback well, but if we can figure out how to either give better feedback or learn from noisy feedback, it would be easier to learn and likely generalize better.
Agent foundationsCountable Factored SpacesDiffractorAlignment Forum2021AN #164RohinThis post generalizes the math in <@Finite Factored Sets@>(@Finite Factored Sets sequence@) to (one version of) the infinite case. Everything carries over, except for one direction of the fundamental theorem. (The author suspects that direction is true, but was unable to prove it.)
Agent foundationsBuridan's ass in coordination gamesjessicataLessWrong2018AN #16RohinSuppose two agents have to coordinate to choose the same action, X or Y, where X gives utility 1 and Y gives utility u, for some u in [0, 2]. (If the agents fail to coordinate, they get zero utility.) If the agents communicate, decide on policies, then observe the value of u with some noise ϵ, and then execute their policies independently, there must be some u for which they lose out on significant utility. Intuitively, the proof is that at u = 0, you should say X, and at u = 2, you should say Y, and there is some intermediate value where you are indifferent between the two (equal probability of choosing X or Y), meaning that 50% of the time you will fail to coordinate. However, if you have a shared source of randomness (after observing the value of u), then you can correlate your decisions using the randomness in order to do much better.
Cool result, and quite easy to understand. As usual I don't want to speculate on relevance to AI alignment because it's not my area.
Agent foundationsProbability is Real, and Value is ComplexAbram DemskiLessWrong2018AN #16RohinIf you interpret events as vectors on a graph, with probability on the x-axis and probability * utility on the y-axis, then any rotation of the vectors preserves the preference relation, so that you will make the same decision. This means that from decisions, you cannot distinguish between rotations, which intuitively means that you can't tell if a decision was made because it had a low probability of high utility, or medium probability of medium utility, for example. As a result, beliefs and utilities are inextricably linked, and you can't just separate them. Key quote: "Viewing [probabilities and utilities] in this way makes it somewhat more natural to think that probabilities are more like "caring measure" expressing how much the agent cares about how things go in particular worlds, rather than subjective approximations of an objective "magical reality fluid" which determines what worlds are experienced."
I am confused. If you want to read my probably-incoherent confused opinion on it, it's [here](
Bayesian Utility: Representing Preference by Probability Measures
Agent foundationsAgency in Conway’s Game of LifeAlex FlintAlignment Forum2021AN #151RohinConway’s Game of Life (GoL) is a simple cellular automaton which is Turing-complete. As a result, it should be possible to build an “artificial intelligence” system in GoL. One way that we could phrase this is: Imagine a GoL board with 10^30 rows and 10^30 columns, where we are able to set the initial state of the top left 10^20 by 10^20 square. Can we set that initial state appropriately such that after a suitable amount of time, the full board evolves to a desired state (perhaps a giant smiley face) for the vast majority of possible initializations of the remaining area?

This requires us to find some setting of the initial 10^20 by 10^20 square that has [expandable, steerable influence]( Intuitively, the best way to do this would be to build “sensors” and “effectors” to have inputs and outputs and then have some program decide what the effectors should do based on the input from the sensors. The “goal” of the program would then be to steer the world towards the desired state. Thus, this is a framing of the problem of AI (both capabilities and alignment) in GoL, rather than in our native physics.
With the tower of abstractions we humans have built, we now naturally think in terms of inputs and outputs for the agents we build. This hypothetical seems good for shaking us out of that mindset, as we don’t really know what the analogous inputs and outputs in GoL would be, and so we are forced to consider those aspects of the design process as well.