Alignment Newsletter Database (public)
The version of the browser you are using is no longer supported. Please upgrade to a supported browser.Dismiss

View only
CategoryTitleAuthorsVenueH/TEmailSummarizerSummaryMy opinionPrerequisitesRead more
Adversarial examplesHighlightFeature Denoising for Improving Adversarial RobustnessCihang Xie et alICML 2018AN #49Dan HThis paper claims to obtain nontrivial adversarial robustness on ImageNet. Assuming an adversary can add perturbations of size 16/255 (l_infinity), previous adversarially trained classifiers could not obtain above 1% adversarial accuracy. Some groups have tried to break the model proposed in this paper, but so far it appears its robustness is close to what it claims, [around]( 40% adversarial accuracy. Vanilla adversarial training is how they obtain said adversarial robustness. There has only been one previous public attempt at applying (multistep) adversarial training to ImageNet, as those at universities simply do not have the GPUs necessary to perform adversarial training on 224x224 images. Unlike the previous attempt, this paper ostensibly uses better hyperparameters, possibly accounting for the discrepancy. If true, this result reminds us that hyperparameter tuning can be critical even in vision, and that improving adversarial robustness on large-scale images may not be possible outside industry for many years.
Adversarial examplesHighlightConstructing Unrestricted Adversarial Examples with Generative ModelsYang Song et alNeurIPS 2018AN #39This paper predates the [unrestricted adversarial examples challenge]( ([AN #24]( and shows how to generate such unrestricted adversarial examples using generative models. As a reminder, most adversarial examples research is focused on finding imperceptible perturbations to existing images that cause the model to make a mistake. In contrast, unrestricted adversarial examples allow you to find _any_ image that humans will reliably classify a particular way, where the model produces some other classification.

The key idea is simple -- train a GAN to generate images in the domain of interest, and then create adversarial examples by optimizing an image to simultaneously be "realistic" (as evaluated by the generator), while still being misclassified by the model under attack. The authors also introduce another term into the loss function that minimizes deviation from a randomly chosen noise vector -- this allows them to get diverse adversarial examples, rather than always converging to the same one.

They also consider a "noise-augmented" attack, where in effect they are running the normal attack they have, and then running a standard attack like FGSM or PGD afterwards. (They do these two things simultaneously, but I believe it's nearly equivalent.)

For evaluation, they generate adversarial examples with their method and check that humans on Mechanical Turk reliably classify the examples as a particular class. Unsurprisingly, their adversarial examples "break" all existing defenses, including the certified defenses, though to be clear existing defenses assume a different threat model where an adversarial example must be an imperceptible perturbation to one of a known set of images. You could imagine doing something similar by taking the imperceptible-perturbation attacks and raise the value of ϵ until it is perceptible -- but in this case the generated images are much less realistic.
This is the clear first thing to try with unrestricted adversarial examples, and it seems to work reasonably well. I'd love to see whether adversarial training with these sorts of adversarial examples works as a defense against both this attack and standard imperceptible-perturbation attacks. In addition, it would be interesting to see if humans could direct or control the search for unrestricted adversarial examples.
Adversarial examplesHighlightMotivating the Rules of the Game for Adversarial Example ResearchJustin Gilmer, George E. Dahl et al2018 IEEE/RSJ International Conference on Intelligent Robots and SystemsAN #28Dan HIn this position paper, the authors argue that many of the threat models which motivate adversarial examples are unrealistic. They enumerate various previously proposed threat models, and then they show their limitations or detachment from reality. For example, it is common to assume that an adversary must create an imperceptible perturbation to an example, but often attackers can input whatever they please. In fact, in some settings an attacker can provide an input from the clean test set that is misclassified. Also, they argue that adversarial robustness defenses which degrade clean test set error are likely to make systems less secure since benign or nonadversarial inputs are vastly more common. They recommend that future papers motivated by adversarial examples take care to define the threat model realistically. In addition, they encourage researchers to establish “content-preserving” adversarial attacks (as opposed to “imperceptible” l_p attacks) and improve robustness to unseen input transformations.This is my favorite paper of the year as it handily counteracts much of the media coverage and research lab PR purporting ``doom'' from adversarial examples. While there are some scenarios in which imperceptible perturbations may be a motivation---consider user-generated privacy-creating perturbations to Facebook photos which stupefy face detection algorithms---much of the current adversarial robustness research optimizing small l_p ball robustness can be thought of as tackling a simplified subproblem before moving to a more realistic setting. Because of this paper, new tasks such as [Unrestricted Adversarial Examples]( ([AN #24]( take an appropriate step toward increasing realism without appearing to make the problem too hard.
Adversarial examplesHighlightIntroducing the Unrestricted Adversarial Examples ChallengeTom B. Brown et alGoogle AI BlogAN #24There's a new adversarial examples contest, after the one from NIPS 2017. The goal of this contest is to figure out how to create a model that never confidently makes a mistake on a very simple task, even in the presence of a powerful adversary. This leads to many differences from the previous contest. The task is a lot simpler -- classifiers only need to distinguish between bicycles and birds, with an option of saying "ambiguous". Instead of using the L-infinity norm ball to define what an adversarial example is, attackers are allowed to supply any image whatsoever, as long as a team of human evaluators agrees unanimously on the classification of the image. The contest has no time bound, and will run until some defense survives for 90 days without being broken even once. A defense is not broken if it says "ambiguous" on an adversarial example. Any submitted defense will be published, which means that attackers can specialize their attacks to that specific model (i.e. it is white box).I really like this contest format, it seems like it's actually answering the question we care about, for a simple task. If I were designing a defense, the first thing I'd aim for would be to get a lot of training data, ideally from different distributions in the real world, but data augmentation techniques may also be necessary, especially for eg. images of a bicycle against an unrealistic textured background. The second thing would be to shrink the size of the model, to make it more likely that it generalizes better (in accordance with Occam's razor or the minimum description length principle). After that I'd think about the defenses proposed in the literature. I'm not sure how the verification-based approaches will work, since they are intrinsically tied to the L-infinity norm ball definition of adversarial examples, or something similar -- you can't include the human evaluators in your specification of what you want to verify.
Adversarial examplesHighlightAdversarial Attacks and Defences CompetitionAlexey Kurakin et alThe NIPS '17 Competition: Building Intelligent SystemsAN #1This is a report on a competition held at NIPS 2017 for the best adversarial attacks and defences. It includes a summary of the field and then shows the results from the competition.I'm not very familiar with the literature on adversarial examples and so I found this very useful as an overview of the field, especially since it talks about the advantages and disadvantages of different methods, which are hard to find by reading individual papers. The actual competition results are also quite interesting -- they find that the best attacks and defences are both quite successful on average, but have very bad worst-case performance (that is, the best defence is still very weak against at least one attack, and the best attack fails to attack at least one defence). Overall, this paints a bleak picture for defence, at least if the attacker has access to enough compute to actually try out different attack methods, and has a way of verifying whether the attacks succeed.
Adversarial examplesOn the Geometry of Adversarial ExamplesMarc Khoury et alProceedings of the Genetic and Evolutionary Computation Conference '18AN #36This paper analyzes adversarial examples based off a key idea: even if the data of interest forms a low-dimensional manifold, as we often assume, the ϵ-tube _around_ the manifold is still high-dimensional, and so accuracy in an ϵ-ball around true data points will be hard to learn.

For a given L_p norm, we can define the optimal decision boundary to be the one that maximizes the margin from the true data manifold. If there exists some classifier that is adversarially robust, then the optimal decision boundary is as well. Their first result is that the optimal decision boundary can change dramatically if you change p. In particular, for concentric spheres, the optimal L_inf decision boundary provides an L_2 robustness guarantee √d times smaller than the optimal L_2 decision boundary, where d is the dimensionality of the input. This explains why a classifier that is adversarially trained on L_inf adversarial examples does so poorly on L_2 adversarial examples.

I'm not sure I understand the point of the next section, but I'll give it a try. They show that a nearest neighbors classifier can achieve perfect robustness if the underlying manifold is sampled sufficiently densely (requiring samples exponential in k, the dimensionality of the manifold). However, a learning algorithm with a particular property that they formalize would require exponentially more samples in at least some cases in order to have the same guarantee. I don't know why they chose the particular property they did -- my best guess is that the property is meant to represent what we get when we train a neural net on L_p adversarial examples. If so, then their theorem suggests that we would need exponentially more training points to achieve perfect robustness with adversarial training compared to a nearest neighbor classifier.

They next turn to the fact that the ϵ-tube around the manifold is d-dimensional instead of k-dimensional. If we consider ϵ-balls around the training set X, this covers a very small fraction of the ϵ-tube, approaching 0 as d becomes much larger than k, even if the training set X covers the k-dimensional manifold sufficiently well.

Another issue is that if we require adversarial robustness, then we severely restrict the number of possible decision boundaries, and so we may need significantly more expressive models to get one of these decision boundaries. In particular, since feedforward neural nets with Relu activations have "piecewise linear" decision boundaries (in quotes because I might be using the term incorrectly), it is hard for them to separate concentric spheres. Suppose that the spheres are separated by a distance d. Then for accuracy on the manifold, we only need the decision boundary to lie entirely in the shell of width d. However, for ϵ-tube adversarial robustness, the decision boundary must lie in a shell of width d - 2ϵ. They prove a lower bound on the number of linear regions for the decision boundary that grows as τ^(-d), where τ is the width of the shell, suggesting that adversarial robustness would require more parameters in the model.

Their experiments show that for simple learning problems (spheres and planes), adversarial examples tend to be in directions orthogonal to the manifold. In addition, if the true manifold has high codimension, then the learned model has poor robustness.
I think this paper has given me a significantly better understanding of how L_p norm balls work in high dimensions. I'm more fuzzy on how this applies to adversarial examples, in the sense of any confident misclassification by the model on an example that humans agree is obvious. Should we be giving up on L_p robustness since it forms a d-dimensional manifold, whereas we can only hope to learn the smaller k-dimensional manifold? Surely though a small enough perturbation shouldn't change anything? On the other hand, even humans have _some_ decision boundary, and the points near the decision boundary have some small perturbation which would change their classification (though possibly to "I don't know" rather than some other class).

There is a phenomenon where if you train on L_inf adversarial examples, the resulting classifier fails on L_2 adversarial examples, which has previously been described as "overfitting to L_inf". The authors interpret their first theorem as contradicting this statement, since the optimal decision boundaries are very different for L_inf and L_2. I don't see this as a contradiction. The L_p norms are simply a method of label propagation, which augments the set of data points for which we know labels. Ultimately, we want the classifier to reproduce the labels that we would assign to data points, and L_p propagation captures some of that. So, we can think of there as being many different ways that we can augment the set of training points until it matches human classification, and the L_p norm balls are such methods. Then an algorithm is more robust as it works with more of these augmentation methods. Simply doing L_inf training means that by default the learned model only works on one of the methods (L_inf norm balls) and not all of them as we wanted, and we can think of this as "overfitting" to the imperfect L_inf notion of adversarial robustness. The meaning of "overfitting" here is that the learned model is too optimized for L_inf, at the cost of other notions of robustness like L_2 -- and their theorem says basically the same thing, that optimizing for L_inf comes at the cost of L_2 robustness.
Adversarial examplesA Geometric Perspective on the Transferability of Adversarial DirectionsZachary Charles et alarXivAN #34
Adversarial examplesTowards the first adversarially robust neural network model on MNISTLukas Schott, Jonas Rauber et alarXivAN #27Dan HThis recent pre-print claims to make MNIST classifiers more adversarially robust to different L-p perturbations, while the previous paper only worked for L-infinity perturbations. The basic building block in their approach is a variational autoencoder, one for each MNIST class. Each variational autoencoder computes the likelihood of the input sample, and this information is used for classification. The authors also demonstrate that binarizing MNIST images can serve as strong defense against some perturbations. They evaluate against strong attacks and not just the fast gradient sign method.This paper has generated considerable excitement among my peers. Yet inference time with this approach is approximately 100,000 times that of normal inference (10^4 samples per VAE * 10 VAEs). Also unusual is that the L-infinity "latent descent attack" result is missing. It is not clear why training a single VAE does not work. Also, could results improve by adversarially training the VAEs? As with all defense papers, it is prudent to wait for third-party reimplementations and analysis, but the range of attacks they consider is certainly thorough.
Adversarial examplesTowards Deep Learning Models Resistant to Adversarial AttacksAleksander Madry et alICLRAN #27Dan HMadry et al.'s paper is a seminal work which shows that some neural networks can attain more adversarial robustness with a well-designed adversarial training procedure. The key idea is to phrase the adversarial defense problem as minimizing the expected result of the adversarial attack problem, which is maximizing the loss on an input training point when the adversary is allowed to perturb the point anywhere within an L-infinity norm ball. They also start the gradient descent from a random point in the norm ball. Then, given this attack, to optimize the adversarial defense problem, we simply do adversarial training. When trained long enough, some networks will attain more adversarial robustness.It is notable that this paper has survived third-party security analysis, so this is a solid contribution. This contribution is limited by the fact that its improvements are limited to L-infinity adversarial perturbations on small images, as [follow-up work]( has shown.
Adversarial examplesMotivating the Rules of the Game for Adversarial Example ResearchJustin Gilmer, George E. Dahl et alarXivDaniel FilanAN #19
Adversarial examplesA learning and masking approach to secure learningLinh Nguyen et alInternational Conference on Decision and Game Theory for Security 2018N/AOne way to view the problem of adversarial examples is that adversarial attacks map "good" clean data points that are classified correctly into a nearby "bad" space that is low probability and so is misclassified. This suggests that in order to attack a model, we can use a neural net to _learn_ a transformation from good data points to bad ones. The loss function is easy -- one term encourages similarity to the original data point, and the other term encourages the new data point to have a different class label. Then, for any new input data point, we can simply feed it through the neural net to get an adversarial example.

Similarly, in order to defend a model, we can learn a neural net transformation that maps bad data points to good ones. The loss function continues to encourage similarity between the data points, but now encourages that the new data point have the correct label. Note that we need to use some attack algorithm in order to generate the bad data points that are used to train the defending neural net.
Ultimately, in the defense proposed here, the information on how to be more robust comes from the "bad" data points that are used to train the neural net. It's not clear why this would outperform adversarial training, where we train the original classifier on the "bad" data points. In fact, if the best way to deal with adversarial examples is to transform them to regular examples, then we could simply use adversarial training with a more expressive neural net, and it could learn this transformation.
Adversarial examplesThe LogBarrier adversarial attack: making effective use of decision boundary informationChris Finlay et alarXivAN #53Dan HRather than maximizing the loss of a model given a perturbation budget, this paper minimizes the perturbation size subject to the constraint that the model misclassify the example. This misclassification constraint is enforced by adding a logarithmic barrier to the objective, which they prevent from causing a loss explosion through through a few clever tricks. Their attack appears to be faster than the Carlini-Wagner attack.
[The code is here.](
Adversarial examplesQuantifying Perceptual Distortion of Adversarial ExamplesMatt Jordan et alarXivAN #48Dan HThis paper takes a step toward more general adversarial threat models by combining adversarial additive perturbations small in an l_p sense with [spatially transformed adversarial examples](, among other other attacks. In this more general setting, they measure the size of perturbations by computing the [SSIM]( between clean and perturbed samples, which has limitations but is on the whole better than the l_2 distance. This work shows, along with other concurrent works, that perturbation robustness under some threat models does not yield robustness under other threat models. Therefore the view that l_p perturbation robustness must be achieved before considering other threat models is made more questionable. The paper also contributes a large code library for testing adversarial perturbation robustness.
Adversarial examplesOn the Sensitivity of Adversarial Robustness to Input Data DistributionsGavin Weiguang Ding et alICLR 2019AN #48
Adversarial examplesTheoretically Principled Trade-off between Robustness and AccuracyHongyang Zhang et alarXivAN #44Dan HThis paper won the NeurIPS 2018 Adversarial Vision Challenge. For robustness on CIFAR-10 against l_infinity perturbations (epsilon = 8/255), it improves over the Madry et al. adversarial training baseline from 45.8% to 56.61%, making it [almost]( state-of-the-art. However, it does decrease clean set accuracy by a few percent, despite using a deeper network than Madry et al. Their technique has many similarities to Adversarial Logit Pairing, which is not cited, because they encourage the network to embed a clean example and an adversarial perturbation of a clean example similarly. I now describe Adversarial Logit Pairing. During training, ALP teaches the network to classify clean and adversarially perturbed points; added to that loss is an l_2 loss between the logit embeddings of clean examples and the logits of the corresponding adversarial examples. In contrast, in place of the l_2 loss from ALP, this paper uses the KL divergence from the softmax of the clean example to the softmax of an adversarial example. Yet the softmax distributions are given a high temperature, so this loss is not much different from an l_2 loss between logits. The other main change in this paper is that adversarial examples are generated by trying to maximize the aforementioned KL divergence between clean and adversarial pairs, not by trying to maximize the classification log loss as in ALP. This paper then shows that some further engineering to adversarial logit pairing can improve adversarial robustness on CIFAR-10.
Adversarial examplesAdversarial Vulnerability of Neural Networks Increases With Input DimensionCarl-Johann Simon-Gabriel et alarXivAN #36The key idea of this paper is that imperceptible adversarial vulnerability happens when small changes in the input lead to large changes in the output, suggesting that the gradient is large. They first recommend choosing ϵ_p to be proportional to d^(1/p). Intuitively, this is because larger values of p behave more like maxing instead of summing, and so using the same value of ϵ across values of p would lead to more points being considered for larger p. They show a link between adversarial robustness and regularization, which makes sense since both of these techniques aim for better generalization.

Their main point is that the norm of the gradient increases with the input dimension d. In particular, a typical initialization scheme will set the variance of the weights to be inversely proportional to d, which means the absolute value of each weight is inversely proportional to √d. For a single-layer neural net (that is, a perceptron), the gradient is exactly the weights. For L_inf adversarial robustness, the relevant norm for the gradient is the L_1 norm. This gives the sum of the d weights, which will be proportional to √d. For L_p adversarial robustness, the corresponding gradient is L_q with q larger than 1, which decreases the size of the gradient. However, this is exactly offset by the increase in the size of ϵ_p that they proposed. Thus, in this simple case the adversarial vulnerability increases with input dimension. They then prove theorems that show that this generalizes to other neural nets, including CNNS (albeit still only at initialization, not after training). They also perform experiments showing that their result also holds after training.
I suspect that there is some sort of connection between the explanation given in this paper and the explanation that there are many different perturbation directions in high-dimensional space which means that there are lots of potential adversarial examples, which increases the chance that you can find one. Their theoretical result comes primarily from the fact that weights are initialized with variance inversely proportional to d. We could eliminate this by having the variance be inversely proportional to d^2, in which case their result would say that adversarial vulnerability is constant with input dimension. However, in this case the variance of the activations would be inversely proportional to d, making it hard to learn. It seems like adversarial vulnerability should be the product of "number of directions", and "amount you can search in a direction", where the latter is related to the variance of the activations, making the connection to this paper.
Adversarial examplesRobustness via curvature regularization, and vice versaMoosavi-Dezfooli et alarXivAN #35Dan HThis paper proposes a distinct way to increase adversarial perturbation robustness. They take an adversarial example generated with the FGSM, compute the gradient of the loss for the clean example and the gradient of the loss for the adversarial example, and they penalize this difference. Decreasing this penalty relates to decreasing the loss surface curvature. The technique works slightly worse than adversarial training.
Adversarial examplesIs Robustness [at] the Cost of Accuracy?Dong Su, Huan Zhang et alECCVAN #32Dan HThis work shows that older architectures such as VGG exhibit more adversarial robustness than newer models such as ResNets. Here they take adversarial robustness to be the average adversarial perturbation size required to fool a network. They use this to show that architecture choice matters for adversarial robustness and that accuracy on the clean dataset is not necessarily predictive of adversarial robustness. A separate observation they make is that adversarial examples created with VGG transfers far better than those created with other architectures. All of these findings are for models without adversarial training.
Adversarial examplesAdversarial Examples Are a Natural Consequence of Test Error in NoiseAnonymousOpenReviewAN #32Dan HThis paper argues that there is a link between model accuracy on noisy images and model accuracy on adversarial images. They establish this empirically by showing that augmenting the dataset with random additive noise can improve adversarial robustness reliably. To establish this theoretically, they use the Gaussian Isoperimetric Inequality, which directly gives a relation between error rates on noisy images and the median adversarial perturbation size. Given that measuring test error on noisy images is easy, given that claims about adversarial robustness are almost always wrong, and given the relation between adversarial noise and random noise, they suggest that future defense research include experiments demonstrating enhanced robustness on nonadversarial, noisy images.
Adversarial examplesRobustness May Be at Odds with AccuracyDimitris Tsipras, Shibani Santurkar, Logan Engstrom et alOpenReviewAN #32Dan HSince adversarial training can markedly reduce accuracy on clean images, one may ask whether there exists an inherent trade-off between adversarial robustness and accuracy on clean images. They use a simple model amenable to theoretical analysis, and for this model they demonstrate a trade-off. In the second half of the paper, they show adversarial training can improve feature visualization, which has been shown in several concurrent works.
Adversarial examplesAre adversarial examples inevitable?Ali Shafahi et alInternational Conference on Learning Representations, 2019.AN #24
Adversarial examplesAdversarial Reprogramming of Sequence Classification Neural NetworksPaarth Neekhara et alAAAI-2019 Workshop on Engineering Dependable and Secure Machine Learning SystemsAN #23
Adversarial examplesFortified Networks: Improving the Robustness of Deep Networks by Modeling the Manifold of Hidden RepresentationsAlex Lamb et alarXivAN #2
Adversarial examplesAdversarial Vision ChallengeWieland Brendel et alNIPS 2018AN #19There will be a competition on adversarial examples for vision at NIPS 2018.
Adversarial examplesEvaluating and Understanding the Robustness of Adversarial Logit PairingLogan Engstrom, Andrew Ilyas and Anish AthalyearXivAN #18
Adversarial examplesBenchmarking Neural Network Robustness to Common Corruptions and Surface VariationsDan Hendrycks et alICLR 2019AN #15See [Import AI](
Adversarial examplesAdversarial Reprogramming of Neural NetworksGamaleldin F. Elsayed et alarXivAN #14
Adversarial examplesDefense Against the Dark Arts: An overview of adversarial example security research and future research directionsIan GoodfellowarXivAN #11
Adversarial examplesOn Evaluating Adversarial RobustnessNicholas Carlini et alNeurIPS SECML 2018AN #46
Adversarial examplesCharacterizing Adversarial Examples Based on Spatial Consistency Information for Semantic SegmentationChaowei Xiao et alECCVAN #29Dan HThis paper considers adversarial attacks on segmentation systems. They find that segmentation systems behave inconsistently on adversarial images, and they use this inconsistency to detect adversarial inputs. Specifically, they take overlapping crops of the image and segment each crop. For overlapping crops of an adversarial image, they find that the segmentation are more inconsistent. They defend against one adaptive attack.
Adversarial examplesSpatially Transformed Adversarial ExamplesChaowei Xiao et alICLRAN #29Dan HMany adversarial attacks perturb pixel values, but the attack in this paper perturbs the pixel locations instead. This is accomplished with a smooth image deformation which has subtle effects for large images. For MNIST images, however, the attack is more obvious and not necessarily content-preserving (see Figure 2 of the paper).
Adversarial examplesAdversarial Logit PairingarXivRecon #5
Adversarial examplesLearning to write programs that generate imagesDeepMind BlogRecon #5
Adversarial examplesIntrinsic Geometric Vulnerability of High-Dimensional Artificial IntelligenceLuca Bortolussi et alarXivAN #36
Adversarial examplesOn Adversarial Examples for Character-Level Neural Machine TranslationJavid Ebrahimi et alarXivAN #13
Adversarial examplesIdealised Bayesian Neural Networks Cannot Have Adversarial Examples: Theoretical and Empirical StudyYarin Gal et alarXivAN #10
Agent foundationsPavlov GeneralizesAbram DemskiAlignment ForumAN #52In the iterated prisoner's dilemma, the [Pavlov strategy]( is to start by cooperating, and then switch the action you take whenever the opponent defects. This can be generalized to arbitrary games. Roughly, an agent is "discontent" by default and chooses actions randomly. It can become "content" if it gets a high payoff, in which case it continues to choose whatever action it previously chose as long as the payoffs remain consistently high. This generalization achieves Pareto optimality in the limit, though with a very bad convergence rate. Basically, all of the agents start out discontent and do a lot of exploration, and as long as any one agent is discontent the payoffs will be inconsistent and all agents will tend to be discontent. Only when by chance all of the agents take actions that lead to all of them getting high payoffs do they all become content, at which point they keep choosing the same action and stay in the equilibrium.

Despite the bad convergence, the cool thing about the Pavlov generalization is that it only requires agents to notice when the results are good or bad for them. In contrast, typical strategies that aim to mimic Tit-for-Tat require the agent to reason about the beliefs and utility functions of other agents, which can be quite difficult to do. By just focusing on whether things are going well for themselves, Pavlov agents can get a lot of properties in environments with other agents that Tit-for-Tat strategies don't obviously get, such as exploiting agents that always cooperate. However, when thinking about <@logical time@>(@In Logical Time, All Games are Iterated Games@), it would seem that a Pavlov-esque strategy would have to make decisions based on a prediction about its own behavior, which is... not obviously doomed, but seems odd. Regardless, given the lack of work on Pavlov strategies, it's worth trying to generalize them further.
Agent foundationsCDT=EDT=UDTAbram DemskiAlignment ForumAN #42
Agent foundationsClarifying Logical CounterfactualsChris LeongLessWrongAN #43
Agent foundationsCDT Dutch BookAbram DemskiAlignment ForumAN #42
Agent foundationsAnthropic paradoxes transposed into Anthropic Decision TheoryStuart ArmstrongAlignment ForumAN #38
Agent foundationsAnthropic probabilities and cost functionsStuart ArmstrongAlignment ForumAN #38
Agent foundationsBounded Oracle InductionDiffractorAlignment ForumAN #35
Agent foundationsRobust program equilibriumCaspar OesterheldSpringerAN #35
Agent foundationsA Rationality Condition for CDT Is That It Equal EDT (Part 2)Abram DemskiAlignment ForumAN #28
Agent foundationsA Rationality Condition for CDT Is That It Equal EDT (Part 1)Abram DemskiAlignment ForumAN #27
Agent foundationsEDT solves 5 and 10 with conditional oraclesjessicataAlignment ForumAN #27
Agent foundationsAsymptotic Decision Theory (Improved Writeup)DiffractorAlignment ForumAN #26
Agent foundationsIn Logical Time, All Games are Iterated GamesAbram DemskiAlignment ForumAN #25RichardThe key difference between causal and functional decision theory is that the latter supplements the normal notion of causation with "logical causation". The decision of agent A can logically cause the decision of agent B even if B made their decision before A did - for example, if B made their decision by simulating A. Logical time is an informal concept developed to help reason about which computations cause which other computations: logical causation only flows forward through logical time in the same way that normal causation only flows forward through normal time (although maybe logical time turns out to be loopy). For example, when B simulates A, B is placing themselves later in logical time than A. When I choose not to move my bishop in a game of chess because I've noticed it allows a sequence of moves which ends in me being checkmated, then I am logically later than that sequence of moves. One toy model of logical time is based on proof length - we can consider shorter proofs to be earlier in logical time than longer proofs. It's apparently surprisingly difficult to find a case where this fails badly.

In logical time, all games are iterated games. We can construct a series of simplified versions of each game where each player's thinking time is bounded. As thinking time increases, the games move later in logical time, and so we can treat them as a series of iterated games whose outcomes causally affect all longer versions. Iterated games are fundamentally different from single-shot games: the [folk theorem]( states that virtually any outcome is possible in iterated games.
I like logical time as an intuitive way of thinking about logical causation. However, the analogy between normal time and logical time seems to break down in some cases. For example, suppose we have two boolean functions F and G, such that F = not G. It seems like G is logically later than F - yet we could equally well have defined them such that G = not F, which leads to the opposite conclusion. As Abram notes, logical time is intended as an intuition pump not a well-defined theory - yet the possibility of loopiness makes me less confident in its usefulness. In general I am pessimistic about the prospects for finding a formal definition of logical causation, for reasons I described in [Realism about Rationality](, which Rohin summarised above.
Agent foundationsCounterfactuals and reflective oraclesNisanAlignment ForumAN #23
Agent foundationsWhen wishful thinking worksAlex MennenAlignment ForumAN #23Sometimes beliefs can be loopy, in that the probability of a belief being true depends on whether you believe it. For example, the probability that a placebo helps you may depend on whether you believe that a placebo helps you. In the situation where you know this, you can "wish" your beliefs to be the most useful possible beliefs. In the case where the "true probability" depends continuously on your beliefs, you can use a fixed point theorem to find a consistent set of probabilities. There may be many such fixed points, in which case you can choose the one that would lead to highest expected utility (such as choosing to believe in the placebo). One particular application of this would be to think of the propositions as "you will take action a_i". In this case, you act the way you believe you act, and then every probability distribution over the propositions is a fixed point, and so we just choose the probability distribution (i.e. stochastic policy) that maximized expected utility, as usual. This analysis can also be carried to Nash equilibria, where beliefs in what actions you take will affect the actions that the other player takes.
Agent foundationsBottle Caps Aren't OptimisersDaniel FilanAlignment ForumAN #22The previous paper detects optimizers by studying their behavior. However, if the goal is to detect an optimizer before deployment, we need to determine whether an algorithm is performing optimization by studying its source code, _without_ running it. One definition that people have come up with is that an optimizer is something such that the objective function attains higher values than it otherwise would have. However, the author thinks that this definition is insufficient. For example, this would allow us to say that a bottle cap is an optimizer for keeping water inside the bottle. Perhaps in this case we can say that there are simpler descriptions of bottle caps, so those should take precedence. But what about a liver? We could say that a liver is optimizing for its owner's bank balance, since in its absence the bank balance is not going to increase.Here, we want a definition of optimization because we're worried about an AI being deployed, optimizing for some metric in the environment, and then doing something unexpected that we don't like but nonetheless does increase the metric (falling prey to Goodhart's law). It seems better to me to talk about "optimizer" and "agent" as models of predicting behavior, not something that is an inherent property of the thing producing the behavior. Under that interpretation, we want to figure out whether the agent model with a particular utility function is a good model for an AI system, by looking at its internals (without running it). It seems particularly important to be able to use this model to predict the behavior in novel situations -- perhaps that's what is needed to make the definition of optimizer avoid the counterexamples in this post. (A bottle cap definitely isn't going to keep water in containers if it is simply lying on a table somewhere.)
Agent foundationsComputational complexity of RL with trapsVadim KosoyAlignment ForumAN #22A post asking about complexity theoretic results around RL, both with (unknown) deterministic and stochastic dynamics.
Agent foundationsCooperative OraclesDiffractorAlignment ForumAN #22
Agent foundationsCorrigibility doesn't always have a good action to takeStuart ArmstrongAlignment ForumAN #22Stuart has [previously argued]( that an AI could be put in situations where no matter what it does, it would affect the human's values. In this short post, he notes that if you then say that it is possible to have situations where the AI cannot act corrigibly, then other problems arise, such as how you can create a superintelligent corrigible AI that does anything at all (since any action that it takes would likely affect our values somehow).
Agent foundationsUsing expected utility for Good(hart)Stuart ArmstrongAlignment ForumAN #22If we include all of the uncertainty we have about human values into the utility function, then it seems possible to design an expected utility maximizer that doesn't fall prey to Goodhart's law. The post shows a simple example where there are many variables that may be of interest to humans, but we're not sure which ones. In this case, by incorporating this uncertainty into our proxy utility function, we can design an expected utility maximizer that has conservative behavior that makes sense.On the one hand, I'm sympathetic to this view -- for example, I see risk aversion as a heuristic leading to good expected utility maximization for bounded reasoners on large timescales. On the other hand, an EU maximizer still seems hard to align, because whatever utility function it gets, or distribution over utility functions, it will act as though that input is definitely true, which means that anything we fail to model will never make it into the utility function. If you could have some sort of "unresolvable" uncertainty, some reasoning (similar to the [problem of induction]( suggesting that you can never fully trust your own thoughts to be perfectly correct, that would make me more optimistic about an EU maximization based approach, but I don't think it can be done by just changing the utility function, or by adding a distribution over them.
Agent foundationsAgents and Devices: A Relative Definition of AgencyLaurent Orseau et alarXivAN #22This paper considers the problem of modeling other behavior, either as an agent (trying to achieve some goal) or as a device (that reacts to its environment without any clear goal). They use Bayesian IRL to model behavior as coming from an agent optimizing a reward function, and design their own probability model to model the behavior as coming from a device. They then use Bayes rule to decide whether the behavior is better modeled as an agent or as a device. Since they have a uniform prior over agents and devices, this ends up choosing the one that better fits the data, as measured by log likelihood.

In their toy gridworld, agents are navigating towards particular locations in the gridworld, whereas devices are reacting to their local observation (the type of cell in the gridworld that they are currently facing, as well as the previous action they took). They create a few environments by hand which demonstrate that their method infers the intuitive answer given the behavior.
In their experiments, they have two different model classes with very different inductive biases, and their method correctly switches between the two classes depending on which inductive bias works better. One of these classes is the maximization of some reward function, and so we call that the agent class. However, they also talk about using the Solomonoff prior for devices -- in that case, even if we have something we would normally call an agent, if it is even slightly suboptimal, then with enough data the device explanation will win out.

I'm not entirely sure why they are studying this problem in particular -- one reason is explained in the next post, I'll write more about it in that section.
Agent foundationsReducing collective rationality to individual optimization in common-payoff games using MCMCjessicataAlignment ForumAN #21Given how hard multiagent cooperation is, it would be great if we could devise an algorithm such that each agent is only locally optimizing their own utility (without requiring that anyone else change their policy), that still achieves the globally optimal policy. This post considers the case where all players have the same utility function in an iterated game. In this case, we can define a process where at every timestep, one agent is randomly selected, and that agent changes their action in the game uniformly at random with probability that depends on how much utility was just achieved. This depends on a rationality parameter α -- the higher α is, the more likely it is for the player to stick with a high utility action.

This process allows you to reach every possible joint action from every other possible joint action with some non-zero probability, so in the limit of running this process forever, you will end up visiting every state infinitely often. However, by cranking up the value of α, we can ensure that in the limit we spend most of the time in the high-value states and rarely switch to anything lower, which lets us get arbitrarily close to the optimal deterministic policy (and so arbitrarily close to the optimal expected value).
I like this, it's an explicit construction that demonstrates how you can play with the explore-exploit tradeoff in multiagent settings. Note that when α is set to be very high (the condition in which we get near-optimal outcomes in the limit), there is very little exploration, and so it will take a long time before we actually find the optimal outcome in the first place. It seems like this would make it hard to use in practice, but perhaps we could replace the exploration with reasoning about the game and other agents in it? The author was planning to use reflective oracles to do something like this if I understand correctly.
Agent foundationsComputational efficiency reasons not to model VNM-rational preference relations with utility functionsAlex MennenLessWrongAN #17Realistic agents don't use utility functions over world histories to make decisions, because it is computationally infeasible, and it's quite possible to make a good decision by only considering the local effects of the decision. For example, when deciding whether or not to eat a sandwich, we don't typically worry about the outcome of a local election in Siberia. For the same computational reasons, we wouldn't want to use a utility function to model other agents. Perhaps a utility function is useful for measuring the strength of an agent's preference, but even then it is really measuring the ratio of the strength of the agent's preference to the strength of the agent's preference over the two reference points used to determine the utility function.I agree that we certainly don't want to model other agents using full explicit expected utility calculations because it's computationally infeasible. However, as a first approximation it seems okay to model other agents as computationally bounded optimizers of some utility function. It seems like a bigger problem to me that any such model predicts that the agent will never change its preferences (since that would be bad according to the current utility function).
Agent foundationsStable Pointers to Value III: Recursive QuantilizationAbram DemskiLessWrongAN #17We often try to solve alignment problems by going a level meta. For example, instead of providing feedback on what the utility function is, we might provide feedback on how to best learn what the utility function is. This seems to get more information about what safe behavior is. What if we iterate this process? For example, in the case of quantilizers with three levels of iteration, we would do a quantilized search over utility function generators, then do a quantilized search over the generated utility functions, and then do a quantilized search to actually take actions.The post mentions what seems like the most salient issue -- that it is really hard for humans to give feedback even a few meta levels up. How do you evaluate a thing that will create a distribution over utility functions? I might go further -- I'm not even sure there is good normative feedback on the meta level(s). There is feedback we can give on the meta level for any particular object-level instance, but it seems not at all obvious (to me) that this advice will generalize well to other object-level instances. On the other hand, it does seem to me that the higher up you are in meta-levels, the smaller the space of concepts and the easier it is to learn. So maybe my overall take is that it seems like we can't depend on humans to give meta-level feedback well, but if we can figure out how to either give better feedback or learn from noisy feedback, it would be easier to learn and likely generalize better.
Agent foundationsBuridan's ass in coordination gamesjessicataLessWrongAN #16Suppose two agents have to coordinate to choose the same action, X or Y, where X gives utility 1 and Y gives utility u, for some u in [0, 2]. (If the agents fail to coordinate, they get zero utility.) If the agents communicate, decide on policies, then observe the value of u with some noise ϵ, and then execute their policies independently, there must be some u for which they lose out on significant utility. Intuitively, the proof is that at u = 0, you should say X, and at u = 2, you should say Y, and there is some intermediate value where you are indifferent between the two (equal probability of choosing X or Y), meaning that 50% of the time you will fail to coordinate. However, if you have a shared source of randomness (after observing the value of u), then you can correlate your decisions using the randomness in order to do much better.Cool result, and quite easy to understand. As usual I don't want to speculate on relevance to AI alignment because it's not my area.
Agent foundationsProbability is Real, and Value is ComplexAbram DemskiLessWrongAN #16If you interpret events as vectors on a graph, with probability on the x-axis and probability * utility on the y-axis, then any rotation of the vectors preserves the preference relation, so that you will make the same decision. This means that from decisions, you cannot distinguish between rotations, which intuitively means that you can't tell if a decision was made because it had a low probability of high utility, or medium probability of medium utility, for example. As a result, beliefs and utilities are inextricably linked, and you can't just separate them. Key quote: "Viewing [probabilities and utilities] in this way makes it somewhat more natural to think that probabilities are more like "caring measure" expressing how much the agent cares about how things go in particular worlds, rather than subjective approximations of an objective "magical reality fluid" which determines what worlds are experienced."I am confused. If you want to read my probably-incoherent confused opinion on it, it's [here](
Bayesian Utility: Representing Preference by Probability Measures
Agent foundationsBayesian Probability is for things that are Space-like Separated from YouScott GarrabrantLessWrongAN #15When an agent has uncertainty about things that either influenced which algorithm the agent is running (the agent's "past") or about things that will be affected by the agent's actions (the agent's "future"), you may not want to use Bayesian probability. Key quote: "The problem is that the standard justifications of Bayesian probability are in a framework where the facts that you are uncertain about are not in any way affected by whether or not you believe them!" This is not the case for events in the agent's "past" or "future". So, you should only use Bayesian probability for everything else, which are "space-like separated" from you (in analogy with space-like separation in relativity).I don't know much about the justifications for Bayesianism. However, I would expect any justification to break down once you start to allow for sentences where the agent's degree of belief in the sentence affects its truth value, so the post makes sense given that intuition.
Agent foundationsComplete Class: Consequentialist FoundationsAbram DemskiLessWrongAN #15An introduction to "complete class theorems", which can be used to motivate the use of probabilities and decision theory.This is cool, and I do want to learn more about complete class theorems. The post doesn't go into great detail on any of the theorems, but from what's there it seems like these theorems would be useful for figuring out what things we can argue from first principles (akin to the VNM theorem and dutch book arguments).
Agent foundationsCounterfactual Mugging Poker GameScott GarrabrantLessWrongAN #11This is a variant of counterfactual mugging, in which an agent doesn't take the action that is locally optimal, because that would provide information in the counterfactual world where one aspect of the environment was different that would lead to a large loss in that setting.This example is very understandable and very short -- I haven't summarized it because I don't think I can make it any shorter.
Agent foundationsWeak arguments against the universal prior being malignX4vierLessWrongAN #11In an [earlier post](, Paul Christiano has argued that if you run Solomonoff induction and use its predictions for important decisions, most of your probability mass will be placed on universes with intelligent agents that make the right predictions so that their predictions will influence your decisions, and then use that influence to manipulate you into doing things that they value. This post makes a few arguments that this wouldn't actually happen, and Paul responds to the arguments in the comments.I still have only a fuzzy understanding of what's going on here, so I'm going to abstain from an opinion on this one.
What does the universal prior actually look like?
Agent foundationsPrisoners' Dilemma with Costs to ModelingScott GarrabrantLessWrongAN #10Open source game theory looks at the behavior of agents that have access to each other's source code. A major result is that we can define an agent FairBot that will cooperate with itself in the prisoner's dilemma, yet can never be exploited. Later, we got PrudentBot, which still cooperates with FairBots, but will defect against CooperateBots (which always cooperate) since it can at no cost to itself. Given this, you would expect that if you evolved a population of such bots, you'd hopefully get an equilibrium in which everyone cooperates with each other, since they can do so robustly without falling prey to DefectBots (which always defect). However, being a FairBot or PrudentBot is costly -- you have to think hard about the opponent and prove things about them, it's a lot easier to rely on everyone else to punish the DefectBots and become a CooperateBot yourself. In this post, Scott analyzes the equilibria in the two person prisoner's dilemma with small costs to play bots that have to prove things. It turns out that in addition to the standard Defect-Defect equilibirum, there are two mixed strategy equilibria, including one that leads to generally cooperative behavior -- and if you evolve agents to play this game, they generally stay in the vicinity of this good equilibrium, for a range of initial conditions.This is an interesting result. I continue to be surprised at how robust this Lobian cooperative behavior seems to be -- while I used to think that humans could only cooperate with each other because of prosocial tendencies that meant that we were not fully selfish, I'm now leaning more towards the theory that we are simply very good at reading other people, which gives us insight into them, and leads to cooperative behavior in a manner similar to Lobian cooperation.
[Robust Cooperation in the Prisoner's Dilemma]( and/or [Open-source game theory is weird](
Agent foundations2018 research plans and predictionsRob BensingerMIRI BlogAN #1Scott and Nate from MIRI score their predictions for research output in 2017 and make predictions for research output in 2018.I don't know enough about MIRI to have any idea what the predictions mean, but I'd still recommend reading it if you're somewhat familiar with MIRI's technical agenda to get a bird's-eye view of what they have been focusing on for the last year.
A basic understanding of MIRI's technical agenda (eg. what they mean by naturalized agents, decision theory, Vingean reflection, and so on).
Agent foundationsApproval-directed agency and the decision theory of Newcomb-like problemsCaspar OesterheldFRI WebsiteAN #52
Agent foundationsNo surjection onto function space for manifold XStuart ArmstrongAlignment ForumAN #41
Agent foundationsFailures of UDT-AIXI, Part 1: Improper RandomizingDiffractorAlignment ForumAN #40
Agent foundationsRobust program equilibriumCaspar OesterheldFRI WebsiteAN #39In a prisoner's dilemma where you have access to an opponent's source code, you can hope to achieve cooperation by looking at how the opponent would perform against you. Naively, you could simply simulate what the opponent would do given your source code, and use that to make your decision. However, if your opponent also tries to simulate you, this leads to an infinite loop. The key idea of this paper is to break the infinite loop by introducing a small probability of guaranteed cooperation (without simulating the opponent), so that eventually after many rounds of simulation the recursion "bottoms out" with guaranteed cooperation. They explore what happens when applying this idea to the equivalents of FairBot/Tit-for-Tat strategies when you are simulating the opponent.
Agent foundationsOracle Induction ProofsDiffractorAlignment ForumAN #35
Agent foundationsDimensional regret without resetsVadim KosoyAlignment ForumAN #33
Agent foundationsWhat are Universal Inductors, Again?DiffractorAlignment ForumAN #32
Agent foundationsWhen EDT=CDT, ADT Does WellDiffractorAlignment ForumAN #30
Agent foundationsRamsey and Joyce on deliberation and predictionYang Liu et alCSER WebsiteAN #27
Agent foundationsDoubts about UpdatelessnessAlex AppelAgent FoundationsAN #5
Agent foundationsComputing an exact quantilal policyVadim KosoyAgent FoundationsAN #3
Agent foundationsLogical Counterfactuals & the Cooperation GameChris LeongAlignment ForumAN #20
Agent foundationsNo Constant Distribution Can be a Logical InductorAlex AppelAgent FoundationsAN #2
Agent foundationsResource-Limited Reflective OraclesAlex AppelAgent FoundationsAN #2
Agent foundationsIdea: OpenAI Gym environments where the AI is a part of the environmentcrabmanLessWrongAN #2
Agent foundationsProbabilistic Tiling (Preliminary Attempt)DiffractorAlignment ForumAN #19
Agent foundations[Logical Counterfactuals for Perfect Predictors]( and [A Short Note on UDT]( LeongLessWrongAN #19
Agent foundationsCounterfactuals, thick and thinNisanLessWrongAN #18There are many different ways to formalize counterfactuals (the post suggests three such ways). Often, for any given way of formalizing counterfactuals, there are many ways you could take a counterfactual, which give different answers. When considering the physical world, we have strong causal models that can tell us which one is the "correct" counterfactual. However, there is no such method for logical counterfactuals yet.I don't think I understood this post, so I'll abstain on an opinion.
Agent foundationsConceptual problems with utility functions, second attempt at explainingDacynLessWrongAN #17Argues that there's a difference between object-level fairness (which sounds to me like fairness as a terminal value) and meta-level fairness (which sounds to me like instrumental fairness), and that this difference is not captured with single-player utility function maximization.I still think that the difference pointed out here is accounted for by traditional multiagent game theory, which has utility maximization for each player. For example, I would expect that in a repeated Ultimatum game, fairness would arise naturally, similarly to how tit-for-tat is a good strategy in an iterated prisoner's dilemma.
Conceptual problems with utility functions
Agent foundationsExorcizing the Speed Prior?Abram DemskiLessWrongAN #17Intuitively, in order to find a solution to a hard problem, we could either do an uninformed brute force search, or encode some domain knowledge and then do an informed search. Roughly, we should expect each additional bit of information to cut the required search roughly in half. The speed prior trades off a bit of complexity against a doubling of running time, so we should expect the informed and uninformed searches to be equally likely in the speed prior. So, uninformed brute force searches that can find weird edge cases (aka daemons) are only equally likely, not more likely.As the post acknowledges, this is extremely handwavy and just gesturing at an intuition, so I'm not sure what to make of it yet. One counterconsideration is that a lot of intelligence that is not just search, that still is general across domains (see [this comment]( for examples).
Agent foundationsThe Evil Genie PuzzleChris LeongLessWrongAN #17
Agent foundationsLogical uncertainty and Mathematical uncertaintyAlex MennenLessWrongAN #13
Agent foundationsForecasting using incomplete modelsVadim KosoyMIRI WebsiteAN #13
Agent foundationsA Possible Loophole for Self-Applicative Soundness?Alex AppelAgent FoundationsAN #10
Agent foundationsLogical Inductor LemmasAlex AppelAgent FoundationsAN #10
Agent foundationsLogical Inductor Tiling and Why it's HardAlex AppelAgent FoundationsAN #10
Agent foundationsLogical Inductors Converge to Correlated Equilibria (Kinda)Alex AppelAgent FoundationsAN #10
Agent foundationsTwo Notions of Best ResponseAlex AppelAgent FoundationsAN #10
Agent foundationsMusings on ExplorationAlex AppelAgent FoundationsAN #1Decision theories require some exploration in order to prevent the problem of spurious conterfactuals, where you condition on a zero-probability event. However, there are problems with exploration too, such as unsafe exploration (eg. launching a nuclear arsenal in an exploration step), and a sufficiently strong agent seems to have an incentive to self-modify to remove the exploration, because the exploration usually leads to suboptimal outcomes for the agent.I liked the linked [post]( that explains why conditioning on low-probability actions is not the same thing as a counterfactual, but I'm not knowledgeable enough to understand what's going on in this post, so I can't really say whether or not you should read it.
Agent foundationsQuantilal control for finite MDPsVadim KosoyAgent FoundationsAN #1
Agent foundationsDistributed CooperationAlex AppelAgent FoundationsRecon #5
Main menu