Singularity Summit 2011

Open Problems in Friendly Artificial Intelligence

[Video] [Audio]

For more transcripts, videos and audio of Singularity Summit talks visit

Speaker: Eliezer Yudkowsky

Transcriber(s): Rick Schwall and Ethan Dickinson

Moderator: OK, we really weren't kidding. Our next speaker is Eliezer Yudkowsky. He is a research fellow at the Singularity Institute, and a leading researcher on recursive self-improvement. His novel approach to AI safety, called "Friendly AI," emphasizes the structure of an AI's ethical processes and high-level goals, which is in contrast to the usual approach of seeking the right fixed set of rules for an AI to follow.

In 2001, he published the first technical analysis of motivationally-stable goal systems, called "Creating Friendly AI: The Analysis and Design of Benevolent Goal Architectures." He has also published papers on evolutionary psychology, the role of cognitive biases in global risk assessment, and how artificial intelligence could be a positive and/or negative factor in global risk.

Today he will give an overview of important open problems and research directions in Friendly AI. Please welcome Eliezer Yudkowsky.


Eliezer Yudkowsky: For the benefit of you who have no idea what this talk is about, very very briefly, it's about how to build stable, self-improving, artificial intelligences.

There was a recent conference on artificial general intelligence – real AI, AIs that really think – at Google in Mountain View. There was a poster there on Gödel machines, which are self-modifying AIs that prove each self-modification consistent before making it. It's the approach closest to the one we advocate. They had eleven steps for what you need to do to make one of these. Step 7 was, "Prove that making the self-modification has higher expected utility than leaving the code constant." I went up to the person by this poster and said, "I think step 7 is harder than all of your other steps." In the first open problem, I am going to try to explain why.

This is not the problem of making a full artificial general intelligence; it's a little toy problem. Let's say you've got a barrel that's one-quarter diamonds and three-quarters circles; 90 percent of the diamonds are red, and 80 percent of the circles are blue. A random object is chosen from this barrel. You're told whether it's red or blue, and you have to guess whether it's a diamond or a circle; if you guess correctly you get $10, and if you guess incorrectly you get nothing. Problem: prove that "guess diamond if red, guess circle if blue" is a better strategy than "always guess circle". Yes, we will be getting to more difficult problems shortly; I just wanted to introduce everything in the correct order.

One very bad answer you could give to this problem is, "The environment is uncertain, so nobody can prove anything. What if you see red, and it's actually a circle instead of a diamond? Oh noes! You should have just guessed circle to begin with!"

An even worse answer is, "Mere logic can't solve problems like this, you've got to use intuition!"

The correct – or, as we like to say in the field, "Bayesian" – answer is that in this barrel here, there are 25 blue objects, 1 of which is a diamond and 24 of which are circles, and 15 red objects, 6 of which are circles and 9 of which are diamonds. From this it follows that if you see blue, you should guess circle, and if you see red, you should guess diamond.
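The counts here can be checked with a short script. The numbers below are reconstructed from the stated proportions; a 40-object barrel is the smallest one that makes all the counts whole.

```python
# Reconstruct the barrel from the problem statement:
# 40 objects, 1/4 diamonds (10) and 3/4 circles (30);
# 90% of diamonds are red, 80% of circles are blue.
red_diamonds, blue_diamonds = 9, 1
blue_circles, red_circles = 24, 6

blue_total = blue_diamonds + blue_circles  # 25 blue objects
red_total = red_diamonds + red_circles     # 15 red objects

# Posterior probabilities after observing the color.
p_circle_given_blue = blue_circles / blue_total  # 24/25 = 0.96
p_diamond_given_red = red_diamonds / red_total   # 9/15 = 0.60

print(p_circle_given_blue, p_diamond_given_red)
```

So seeing blue should make you guess circle, and seeing red should make you guess diamond, exactly as stated.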

Let's talk about a more difficult challenge: build an AI that, given the previous problem specification, would write a computer program to solve it. Now, in principle, we have a pretty good idea of how to do this. By "in principle", I mean it's easy so long as you give us an infinitely large computer. Since we know what constitutes a proof that a program solves this problem, we can just search the space of all possible programs along with the space of all possible proofs that a program works, and select a program as soon as we find a proof that it works. Once you write that program, of course, the hard part is done, and then there's just the easy problem of speeding up that algorithm.
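A literal search over all programs and all proofs obviously can't be run, but a toy stand-in (mine, not from the talk) gives the flavor: here the "program space" collapses to the four mappings from observed color to guess, and the "proof" that a program works is an exhaustive check of its expected payoff over the barrel.

```python
from itertools import product

# Toy stand-in for "search programs + proofs": enumerate every decision
# rule and verify each one exhaustively against the whole barrel.
barrel = ([("red", "diamond")] * 9 + [("blue", "diamond")] * 1 +
          [("blue", "circle")] * 24 + [("red", "circle")] * 6)

def expected_payoff(rule):
    # rule maps observed color -> guessed shape; a correct guess pays $10.
    return sum(10 for color, shape in barrel if rule[color] == shape) / len(barrel)

# The entire "program space": four mappings from {red, blue} to a guess.
rules = [dict(zip(("red", "blue"), guesses))
         for guesses in product(("diamond", "circle"), repeat=2)]

best = max(rules, key=expected_payoff)
print(best, expected_payoff(best))
```

The winning rule is "diamond if red, circle if blue", at $8.25 expected per guess; in the real problem the search space is all programs and the verifier is a proof checker, which is why infinite computing power is doing all the work.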


Eliezer: The point is, we know what sort of work it takes to do this, in principle.

Let's take it up another level of meta: "Build an AI that, shown the format of this problem, invents a Bayesian updater to maximize the expected payoff in general, for problems like this, including more complicated ones." It may not look like it, but we've actually just taken a quantum leap upward in the complexity of the reasoning you need to do to understand this problem.

The standard citation for this problem is I. J. Good, "On the Principle of Total Evidence" – the standard citation for the proof that you ought to do Bayesian updating rather than refusing to look at your information, because the expected value of information for a rational agent is never negative. That is, you are never worse off, on average, because you updated on the evidence. It's easy to see this for any particular case, but the general proof is harder. It looks quite simple in retrospect, but ten years passed between when the problem was posed in a vague form and when I. J. Good put it into its exact present form and solved it – which is why he is the citation, and not the 17 distinguished philosophers of science who previously replied to the vague form of the question: "Why should we bother to update on the evidence, anyway?"

It's actually worth noting that 10 years passed between the question, "Why should we update on the evidence?", and a precise mathematical answer to it, because you occasionally hear people say, "We don't need to start on Friendly AI yet, it's too early." Well, every time you read a math textbook and there's one equation with a citation date of 1957 and another equation with a citation date of 1967, it means that 10 years passed between equation 1 and equation 2. Basic math research takes time, especially if you go about it in a loosely-organized way.

This problem is not actually that difficult to solve, despite the rather intimidating-looking solution over here [gestures to slide]. The thing was that it wasn't posed in the correct form, and nobody was specifically trying to do it, which is why it took 10 years.

I'm actually going to simplify that for you and sketch the proof, again very quickly. The problem is as follows: we have an unconditional strategy of "always guess circle", with an expected payoff of $7.50 per guess. We have a conditional strategy of guessing circle if blue, which pays $9.60 on average each time you do it, and "diamond if red", which pays $6.00 on average each time you do it. Why should it be the case that, in general, you improve your expected payoff each time you update using Bayes's Theorem?

The key thing is to notice that the unconditional strategy is a mixture of the conditional strategies, weighted by the probability of seeing that particular evidence. In other words, if you keep your eyes closed, your expected payoff is the weighted sum of playing the same strategy if you kept your eyes open.

This in turn means that you can view the conditional strategy as picking the maximizing element of each row, whereas with the unconditional strategy you have to pick a single column. And since the payoff from picking the maximizing element of each row is at least as great as the payoff from picking the best single column, that is the core of I. J. Good's proof.
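In symbols (my notation, not the slide's), the step the proof turns on is the elementary inequality that a probability-weighted sum of row maxima is at least any probability-weighted column sum:

```latex
\underbrace{\sum_{e} P(e)\,\max_{a}\,\mathbb{E}[U \mid a, e]}_{\text{look at the evidence, then act}}
\;\ge\;
\underbrace{\max_{a}\,\sum_{e} P(e)\,\mathbb{E}[U \mid a, e]}_{\text{act with eyes closed}}
```

Recomputing from the stated proportions of the barrel, the left side comes to $8.25 and the right side to $7.50; equality holds only when the same action is optimal for every possible piece of evidence, in which case looking gains you nothing but costs you nothing.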

This is just a very small part of the math that would go into step 7 of a Gödel machine, where you are trying to prove that the modifications you make to your strategies increase expected utility. Again, note that even though this is very simple mathematically, the standard citation for the answer comes 10 years after the standard citation for the question – not necessarily because it is super-difficult mathematically, but just because math actually does take 10 years if you are not asking the right questions to start with and focusing specifically on the correct avenue for the answer.

But, with that proof in hand, we can see that the challenge of building a program which executes Bayesian updates in general involves more sophisticated math, but is nonetheless still straightforward, given infinite computing power, compared to just solving that one particular problem.

Let's take it up another step of meta, and ask how you build an AI that builds an AI that builds an expected utility maximizer. Again, contrary to what you are expecting, this is not all that much more complicated if you are familiar with modern mathematical logic, because even though the previous problem was solved in 1967, we are looking here at a problem that was solved in the 1930s by Gödel. Namely, we need proofs about proofs.

To keep track of the levels of meta here: the object-level calculator just needs to execute one Bayesian update. At one meta-level up, we need to build that calculator, which means we need to do proofs about expected utility in general. To build an AI that makes that AI – and, yes, that is GLaDOS –


Eliezer: – we need to prove things about the AI proving things about expected utility.

One simple way of doing this would be to add to the base system that the first AI uses to search for its proofs an axiom which states that any proof...

Let's call the system P that the first AI uses. We then have an axiom in P+1 which says, "If something is proven in P, then it's true. In particular, if you can prove in P that something maximizes expected utility, then that thing maximizes expected utility."

Now, of course, we want the meta-meta AI that designs that first AI, but that's pretty easy again. We just have system P+2, which has an axiom stating that if there exists a proof in P+1 that something maximizes expected utility, then it maximizes expected utility.

The recursion looks something like this. P proves blah-blah-blah: that's the base AI's proof system, the sort of proof system we would use to follow I. J. Good's proof, with the sort of axioms you would need in order to prove that doing a Bayesian update maximizes expected utility. P+1, which builds that AI, has to prove that if something is provable – this little box symbol means provable – in P, then that thing is true. That is, if there exists a proof of S in P, that implies S. Then if we want the meta-meta AI that builds that AI, you just need the language P+2 that trusts the proof system P+1, stating that if there exists a proof of S in P+1, that implies S.
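Written as axiom schemas (my notation, reconstructed from the talk: the box \Box_P is the provability predicate of P, and \ulcorner S\urcorner is the Gödel number of the sentence S), each level simply trusts the level below:

```latex
P{+}1:\quad \Box_{P}\,\ulcorner S\urcorner \rightarrow S
\qquad\qquad
P{+}2:\quad \Box_{P+1}\,\ulcorner S\urcorner \rightarrow S
\qquad\text{(schemas over every sentence } S\text{)}
```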

Now of course what we really want, this is step 7 on that poster with 11 simple steps to build the Gödel machine, what we really want is a system which can prove that modifications of itself improve expected utility.

Which you might think would be as simple as having a language containing an axiom which says, "If a proof exists in this language of S, then S." It doesn't seem to be adding anything to the power of the language, right? The language already trusts itself in some sense, so why not add this axiom?

The reason you cannot do this is called Löb's theorem. Think of Gödel's theorem as the mathematical formalization of, "This sentence is false". Namely, "This sentence is not provable". Löb's theorem is based on the Santa Claus paradox. "If this sentence is true, then Santa Claus exists." Now, consider, my friends: suppose that sentence actually was true.


Eliezer: Well, if the sentence is true, then it's true that if the sentence is true, then Santa Claus exists, so if the sentence is true, then Santa Claus exists, which is just what this sentence says, so it is true, and Santa Claus does exist.


Eliezer: It also works for God.

[laughter & applause]

Eliezer: So the mathematical formalization of this is, "If the language P proves this sentence, then C." In other words, you have a recursive statement L, encoded using a fancy trick called diagonalization, saying: "If P proves L, that implies C." Now, you can always state this Löb sentence L over here. You can't always prove it, but you can always state it. And it's usually easy to show in P – any language which extends Peano arithmetic can do this – that if system P proves the sentence L, then system P also proves C. Which means that if P additionally proves the statement "if P proves C, then C", then we can chain "provability of L implies provability of C" with "provability of C implies C" to get "provability of L implies C" – which is just L itself. So P proves L, hence P proves that L is provable, and hence P proves C.
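Laid out as a derivation (my reconstruction of the standard proof, with \Box_P abbreviating "provable in P"):

```latex
\begin{align*}
&\vdash_P \Box_P C \rightarrow C
  && \text{(the assumed soundness-for-$C$ axiom)}\\
&\vdash_P L \leftrightarrow (\Box_P L \rightarrow C)
  && \text{(diagonalization)}\\
&\vdash_P \Box_P L \rightarrow \Box_P C
  && \text{(internal reasoning in $P$)}\\
&\vdash_P \Box_P L \rightarrow C
  && \text{(chain line 3 with line 1; this is exactly $L$)}\\
&\vdash_P \Box_P L
  && \text{(since $L$ is now provable)}\\
&\vdash_P C
  && \text{(modus ponens with line 4)}
\end{align*}
```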

So, Löb's theorem says that if P proves, "if C is provable then it is true", then you can use this to directly prove C immediately. For example: you might want to have a language which says, "If I can prove a statement of the form, 'x + y = z', then x + y = z", but if you have it in general, you also have, "If I can prove '2 + 2 = 5', then 2 + 2 = 5", from which we can directly obtain 2 + 2 = 5. This is bad, because two plus two does not equal five.

So, you are allowed to have a proof system P that proves a statement S; you can have a proof system P+1 that includes a general schema for "the provability of S in P implies S"; you can have P+2, which says "anything you can prove in P+1 is true"; you can have P+ω, which says "for any finite N, if you can prove this in P+N, then it is true"; and you can have P+ω+1, which, as you might guess, says, "if you can prove something in P+ω, it's true." What you can't have is the language which contains a self-referential axiom saying, "if a proof of S exists in this language, then it is true".

So, we can solve the problem of building an AI that builds an AI that builds AI that builds an AI that builds an AI that builds an AI... pretend I just kept on talking for a billion years... that builds an AI that builds an expected utility maximizer.

What we want to do is have an AI that rewrites itself directly. Marcello Herreshoff and I once spent a summer messing around with toy proof systems on paper, trying to figure out exactly what Löb's theorem was saying we couldn't do, and whether there was any way to bypass it. We came up with a couple of cheap tricks. We didn't solve the core problem.

An example of a trick that doesn't work: have P0 prove P-1 sound, P-1 prove P-2 sound, and so on, hoping that this allows an infinite descending sequence of self-modifications. It buys you nothing, because you can prove in P0 that if there exists a proof of S in P-1, you can map that onto a proof in P0.

This is an open problem in Friendly AI. We do not know how to solve this problem. We do not know how to describe even in principle, given infinite computing power, how to build an AI that could completely rewrite itself, without decreasing the amount of trust it had in math every time it executed that self-rewrite.

You might ask, "Why not just have P+2^1024, or something else that would outlast the expected age of the universe?" One sort of answer would be: "Well, what if we invent some new physics and find some way to outlast the expected age of the universe, and then our civilization dies because its central AI broke down?"


A deeper answer is that you'll run into a version of this problem any time that the AI notices it has a proof in memory. If it's a fully reflective AI that has full access to its own source code, and reasons about its own source code, the AI is going to ask itself: "Wait a minute, just because I remember proving this theorem, why should I trust it? I mean, just because something is proven in my base language doesn't mean that it's true."

The deepest answer is that a conundrum like this tells us that there are things we don't really understand yet about self-modification, that we do not understand the math of an AI modifying itself even in principle as well as we understand – well, here's the game tree of chess, you can reach this position from that position. We understood how to solve chess in principle, given infinite computing power, a long time before we could solve chess in practice. This is a problem we don't understand in principle.

All right, next open problem of Friendly AI: the prisoner's dilemma. I hope you have all heard of the prisoner's dilemma... Actually I see that I've messed up this particular slide over here... but the logic of the prisoner's dilemma is that, if both sides defect they each get $1, if both sides cooperate, they each get $100, and, keeping the opponent's move fixed, you always get one dollar more for defecting than for cooperating. You might think the obvious move here is just for both players to cooperate. I hope. If there are any of you who think it would be a good move to defect in the prisoner's dilemma, remind me not to eat any food you might have poisoned. [laughs]


Eliezer: But according to conventional game theory, in the one-shot version of the prisoner's dilemma, where both players have to choose, not knowing what the other player picked: Player 1 reasons, "Suppose Player 2 cooperates," they're each picking their move not knowing what the other player picked, they have no way of enforcing contracts, "Well if Player 2 cooperates, I get one dollar more if I defect than if I cooperate. If Player 2 defects, I get one dollar more if I defect than if I cooperate. Clearly I ought to defect."

Now that I've shown you this logic, you know that if Player 1 and Player 2 are the same sort of decision agent, probably Player 2 is going to decide exactly the same thing for exactly the same reasons and decide to defect as well. So you end up in the $1, $1 payoff instead of the $100, $100 payoff.

In real life, people have a remarkable tendency to both cooperate and end up with $100 apiece. Even in tougher versions of this game, where this is like $2 and $2 and this is $3 and $3 or something like that. But just to see that classical game theory is not being completely stupid here, imagine that you are Player 2, and Player 1 is a coin being flipped, or Player 1 is a piece of paper that has "defect" written on it, and then you would see that you indeed would want to defect.

You might think maybe you ought to realize, "Well, the other player's probably using pretty much the same line of reasoning as I am, they'll probably decide pretty much the same thing I'm deciding, so I'll cooperate, and then Player 2 will probably decide the same thing and we'll both end up getting $100." Douglas Hofstadter proposed something like this informally in a Metamagical Themas article; he called it "superrationality", a term I would actually object to. [ironically] For there is nothing above rationality!


Eliezer: If you wanted to formalize this argument, you would pull up the expected utility formula, and say, "Well, I'm supposed to choose an action such that, when I assume this action is taken, my probability of getting nice consequences goes up." In other words, you're maximizing over the utility of each outcome times the probability of that outcome, given that action.
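In symbols, that evidential expected utility formula reads (my transcription, not the slide's notation):

```latex
a^{*} \;=\; \arg\max_{a} \sum_{o} U(o)\,P(o \mid a)
```

That is, you score each action by summing the utility of every outcome weighted by the probability of that outcome conditional on the action, and pick the action with the highest score.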

Now, if you were considering the prisoner's dilemma for the first time in your life, and you didn't know whether you were going to cooperate or defect, as soon as you realized you were going to cooperate, or as soon as you realized you were going to defect, you would probably realize that the other person would do the same thing. So, wouldn't the expected utility formula say, that since you expected more cooperation assuming you yourself cooperated, you should therefore cooperate? Classical decision theory says "no."

That's due to something called the "Smoking Lesion" problem. This was a thought experiment that was very influential in the modern formulation of decision theory. It is based on an embarrassing historical episode in which somebody named Fisher, one of the great enemies of the Bayesians – that's how you know he's a bad guy –


Eliezer: – testified in front of Congress that the massive observed correlations between lung cancer and smoking didn't prove that smoking caused lung cancer, since after all, it could be that there was a gene that caused people to like the taste of tobacco and also made them more likely to get lung cancer.

Hence the saying: "Correlation doesn't prove causation, but it points a finger and makes suggestive eyebrow-wiggling motions, as if to say, 'Look over there!'" [laughs]


Eliezer: This also proves that everyone who is an enemy of the Bayesians is evil.


Eliezer: The idea here is that, suppose this was actually true – it's a bit of a confusing example but it's also a standard example, so I'm going to use it – suppose it's actually true that smoking does not cause lung cancer directly, there's just this genetic lesion which causes some people to enjoy the taste of tobacco and also get lung cancer. Should you decide to smoke? It can't cause you to develop lung cancer, but presents you with the "bad news" that you're statistically more likely to have this genetic lesion. Clearly in this case – not in real life – you ought to smoke, because it can't actually make you get lung cancer.

And the rule here is that when you calculate the effect of an action in the expected utility formula, you update the causal descendants, rather than the ancestors, in your Bayesian network. If you observe someone smoking, you update your probability that they have the genetic lesion, and then you update the probability that they will get lung cancer. So if you observe someone else, you update the straightforward way.

But if you imagine choosing to smoke, then you modify the Bayesian network by severing this little link over here [slide shows a crossed out line from p(Genetic Lesion) to do(Smoke)], and you only update the probability that you will enjoy the delicious, harmless taste of tobacco –  


Eliezer: – rather than the probability that you have your genetic lesion, since, of course, smoking can't affect that one way or the other.
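The seeing/doing distinction can be made concrete with a toy version of this network. All the probabilities below are made up for illustration; the structure is the one from the talk: the lesion causes both smoking and cancer, and smoking itself has no causal effect.

```python
# Toy smoking-lesion network with made-up probabilities.
p_lesion = 0.2
p_smoke_given = {True: 0.9, False: 0.1}    # P(smoke | lesion)
p_cancer_given = {True: 0.8, False: 0.05}  # P(cancer | lesion); no arrow from smoking

def p_cancer_seeing_smoke():
    # Conditioning: observing smoking is evidence about the lesion,
    # so we update the lesion probability first (Bayes), then cancer.
    joint = {l: (p_lesion if l else 1 - p_lesion) * p_smoke_given[l]
             for l in (True, False)}
    norm = sum(joint.values())
    return sum(joint[l] / norm * p_cancer_given[l] for l in (True, False))

def p_cancer_doing_smoke():
    # Intervening: sever the lesion -> smoke arrow; the lesion keeps
    # its prior probability, untouched by our choice.
    return sum((p_lesion if l else 1 - p_lesion) * p_cancer_given[l]
               for l in (True, False))

print(p_cancer_seeing_smoke(), p_cancer_doing_smoke())
```

With these numbers, seeing yourself smoke raises the cancer estimate to about 0.57, while deciding to smoke leaves it at the prior 0.20 – which is exactly why the causal rule says the smoking is harmless "bad news" rather than a cause.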

Or, to consider a somewhat simpler example, people who have higher incomes tend to buy larger houses, which, in turn, gives them more costly mortgage payments. If I see someone buy a large house, ceteris paribus, I should believe it's statistically more likely that they have a higher income.

On the other hand, you cannot actually make yourself have a higher income by buying a larger house, contrary to what many people seemed to believe in 2008. [laughs]


Eliezer: So the general rule in classical decision theory is that you calculate the expected utility of an action by looking at the effects descending from an action, without updating the probabilities of the parent causes.

We write this using the counterfactual conditional over there, so the real formula isn't "Probability of O given that you see A", it's "Probability of O given that you do A".
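In Pearl's do-notation, the causal version of the formula swaps the observational conditional for the intervention:

```latex
a^{*} \;=\; \arg\max_{a} \sum_{o} U(o)\,P\bigl(o \mid \mathrm{do}(a)\bigr)
```

The only change from the evidential formula is inside the probability: conditioning on do(a) propagates only downstream of the action, never back up to its causes.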

And in the prisoner's dilemma, there is some background probability that rational agents are cooperative, but according to classical decision theory you can't make the other agent be cooperative by cooperating yourself; you are only supposed to do that which physically causes good things to happen. Therefore, defect.

It was generally believed that it was impossible to write out a general computation, a general decision formula that's the same in all cases, which would cooperate on the one-shot prisoner's dilemma, without also buying larger houses in order to have higher incomes.

The odd thing here is that this says that you should defect in the prisoner's dilemma even against your own copy. In other words, let's say you're both AIs, since at the Singularity Institute we think in terms of AIs, classical decision theory says you should defect against an AI that is an identical copy of your own source code, which seems a little odd.

What we think should happen is that if the AIs know that they're copies of each other, they should prove that they will take the isomorphic action in both cases, and then decide to end up on the "cooperate" diagonal rather than the "defect" diagonal.

So there's a general principle: choose as if controlling the mathematical output of your own computation, which we tried to formalize over here. We have a sort of self-referential statement: you maximize over actions A the utility of each outcome times the probability of that outcome, assuming that this current computation outputs A.
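A rough rendering of that self-referential formula (my transcription of the idea, not the exact slide notation):

```latex
a^{*} \;=\; \arg\max_{a} \sum_{o} U(o)\,
P\bigl(o \;\big|\; \text{``this computation''} = a\bigr)
```

Here the condition is not a physical action but a mathematical fact – that the decision computation itself outputs a – so two agents running the same computation automatically condition on each other's moves.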

I had a bit more I wanted to say about this, and Newcomb's problem, and how if you have our type of decision agent working with a classical decision agent, it can blackmail it and take its lunch money...


Eliezer: ...but I'm actually going to go ahead and skip to some of the open problems of Timeless Decision Theory, like proving a blackmail-free equilibrium among timeless strategists, avoiding proving a contradiction inside the formula there, and better formalizing the hybrid of causal and mathematical inference. And here are all the other open problems in Friendly AI I totally didn't get to in this talk. [slide full of small-font text]


I'm now totally out of time, so maybe they'll gong me off the stage or maybe they'll give me time for one question?


[Q&A begins]

Man 1: I am by no means an expert in this field, but just hearing this talk now I'm wondering, why it is necessary for the AI to have literally infinite certainty in the truth of mathematically proven statements? If you treat an AI that's looking over its code as analogous to a human mathematician, when a human mathematician looks over a proof, they will not assign literally infinite odds that the proof is true. You might assign, for instance, a-thousand-to-one odds that the proof in the latest math paper is true, but you wouldn't assign infinite-to-one odds.

Eliezer: The answer is, if that turns out to be the key to solving the problem, Marcello and I did not successfully work out how it actually solves the problem, over that summer. So, it could be the key, but if so we didn't find it, and we did look in that particular place.

[next question]

Man 2: Do you think construction of AIs ought to wait until these issues are resolved?

Eliezer: Part of what I had to skip over was where I tried to explain how, if you start out with the wrong decision theory, it does not actually self-repair; it turns into a different thing that's still wrong. It's not reflectively consistent, but when it repairs itself it doesn't heal. And if you build an AI using the wrong decision theory, and it runs into another AI that uses a better decision theory, the better AI can use blackmail to take all of your AI's lunch money, by threatening to blow up the universe unless it pays up.


Eliezer: So there are penalties to just building an AI with the wrong decision theory. And if you build an AI that doesn't preserve its utility function through lots of self-modification, there's an even larger penalty, which is that it just kills you directly because its utility function ends up as some completely odd thing that isn't quite what you had in mind.


Eliezer: So, do I think it ought to wait? I think that realistically speaking, they won't wait. I mean I've talked to these people.


Eliezer: And so I'd just like to run off and solve the problems ourselves, very quickly, before they go ahead and do things anyway, which they will. That would be my answer to that, and that's pretty much what the Singularity Institute is indeed all about.