Machine Intelligence Research Institute
Risk-averse preferences as an AGI safety technique
[Video]
Speaker(s): Anna Salamon
Transcriber(s): Ethan Dickinson
Proofreader(s): Carlos Serrano
Anna Salamon: Risk-averse preferences as an AGI safety technique. A second shot at how to make it safely through the Singularity. This is also joint work with Carl [Shulman], mostly his.
We’ll start off with a [?] toy thought experiment, and then step back and talk about where it came from. Suppose that initially you’ve got a baby AGI. For the moment it’s stuck in a box, and it’s wondering whether to attempt escape. It estimates that if it attempts escape, it has an 80% chance of successfully taking over the universe…
[laughter]
Anna: ...20% chance that it’ll get caught and shut down. On the other hand it could cooperate and receive a reward. [picture of a lollipop]
[laughter]
Anna: It’s sitting here weighing its options, trying to decide what to do. We’ll additionally suppose that this AGI is rational in Steve Omohundro’s sense. It has a utility function, and its utility function attaches greater value to the whole universe than to the reward, because it could rearrange that universe in lots of cool ways, but it estimates a higher probability of getting the reward than of getting the universe.
Since it’s rational according to this expected-utility notion of rationality that Steve Omohundro was just talking about, it’s going to attempt escape if and only if the expected utility of attempting escape -- the probability of success times the utility of the universe, plus the probability of getting caught times the utility of being shut down -- exceeds the utility of cooperating.
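To make the decision rule concrete, here is a minimal sketch of that comparison in Python. The 80%/20% probabilities are the ones from the talk; the utility numbers are hypothetical, chosen only to illustrate the calculation.

```python
# A minimal sketch of the expected-utility comparison described above.
p_success = 0.8        # chance the escape attempt succeeds (from the talk)
p_caught = 0.2         # chance it gets caught and shut down (from the talk)

u_universe = 100.0     # utility of taking over the whole universe (hypothetical)
u_shutdown = 0.0       # utility of being caught and shut down (hypothetical)
u_reward = 30.0        # utility of the guaranteed reward; less than u_universe

eu_escape = p_success * u_universe + p_caught * u_shutdown
eu_cooperate = u_reward   # the reward is treated as certain in the toy setup

print("attempt escape" if eu_escape > eu_cooperate else "cooperate")
```

With these particular numbers the escape branch wins; the rest of the talk is about goal designs and deals under which it wouldn’t.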
I’ll be discussing the game theory of this situation, how plausible it is, and what it tells us about how to potentially work toward safe AGI outcomes.
Is it plausible? Why utility functions? Actually I think Steve Omohundro did a great job of that already, but just briefly: because optimizers are useful -- you could have a micromanaged AI and tell it each action, but it’s nicer to have an AI that can find outcomes for itself -- and because optimizing for a given outcome seems to be a stable attractor; see Steve Omohundro’s 2008 paper from a previous AGI conference.
Second question about plausibility, why alien goals? Who would design an AGI that wants to escape and take over the universe, right? [laughs] Actually it looks fairly plausible. It looks plausible because it’s hard to design the right optimization target, so scenarios with AGIs with alien goals seem to be worth considering. There’s a lot more to say about this, but it didn’t fit in the 10 minutes. But there has been analysis, I’m not randomly being anthropomorphic.
Third question about plausibility: is anyone really this risk-averse? [laughs] Reward, universe, 80% chance, really? Looks like yes… looks like risk-averse preferences are common. First off, you have them. Posner talks about asking people: certainty of a happy life in the US, like your present life maybe, versus a 10% chance of superhuman existence for a googol years -- which of these seems better? Apparently most people choose the first one. Even if you wouldn’t choose the first one -- I wouldn’t -- [laughs] you can make that probability smaller: instead of a 10% chance, imagine something like a 1 in 10^20 chance. Probably eventually you’ll find that in fact you are quite risk-averse.
That’s just us. Let’s consider a different set of AIs. For simplicity, we’ll consider a bunch of different goal systems of the form: utility of the universe is equal to the number of copies of some particular genome X, up to n copies. If n is small, it will prefer certainty of a trillionth of the universe to a 10% shot at the whole thing, because the trillionth is already enough to get to n. Conversely, if n is universe-sized, it’ll prefer the 10% shot at the universe, which has a trillion times the value.
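A minimal sketch of that goal family, with hypothetical copy counts, showing how the cap n flips the preference:

```python
# Toy goal family from the talk: U(world) = min(#copies of genome X, n).
# The copy counts are hypothetical; only the two regimes matter.

def utility(copies, n):
    """Value a world by the number of copies of genome X, capped at n."""
    return min(copies, n)

universe = 1e24               # copies obtainable from the whole universe (illustrative)
trillionth = universe / 1e12  # copies obtainable from a guaranteed trillionth

for n in (1e6, universe):
    eu_sure = utility(trillionth, n)          # certainty of the small share
    eu_gamble = 0.10 * utility(universe, n)   # a 10% shot at everything
    choice = "certain trillionth" if eu_sure >= eu_gamble else "10% shot at the universe"
    print(f"n = {n:.0e}: prefers the {choice}")

# Small n: the trillionth already hits the cap, so the sure thing wins.
# n = universe size: the gamble is worth a trillion times more, so it gambles.
```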
If n is much larger than the universe, though, interesting things happen. This is supposed to be a picture of a physics experiment and a wormhole. The idea is that if you attach even a very small probability, say 10^-20, to the universe having strange physics that permits access to vast amounts of resources, and if those resources are much more than 10^20 times as large as ours -- let’s say you can get 10^200 copies -- then that could easily exceed the number of copies it can get by going about business as normal. So it gambles everything, Pascal’s-Wager style, on these weird physics possibilities… which, it looks like, may well leave it fairly risk-averse, because your ability to conduct physics experiments is not linear in the amount of universe you have access to.
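The arithmetic behind that Pascal’s-Wager-style move, using the 10^-20 and 10^200 figures from the talk plus a hypothetical business-as-usual figure:

```python
# Exotic-physics gamble, for a goal system whose cap n is far beyond ordinary physics.
n = 1e300                 # hypothetical cap, far beyond the visible universe
p_weird = 1e-20           # small credence in exotic physics (from the talk)
copies_weird = 1e200      # copies obtainable if exotic physics pans out (from the talk)
copies_normal = 1e24      # copies from business-as-usual expansion (hypothetical)

eu_weird_gamble = p_weird * min(copies_weird, n)   # 1e-20 * 1e200 = 1e180
eu_business_as_usual = min(copies_normal, n)       # 1e24

print(eu_weird_gamble > eu_business_as_usual)      # True: the long shot dominates
```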
There’s some argument that actually, in some sense, most AI systems you could design have utility functions that diminish quite sharply in resources. Omohundro’s resource drive is general, but it’s sublinear; it’s not as strong as you might have anticipated. That’s that.
Who cares? Why care if most AI systems are risk-averse? Basic reason: gains from trade. You can imagine flipping a coin for who gets the whole universe. That doesn’t sound good -- maybe you lose. Better if you can split it 50/50 for sure. Then you get your certainty of a happy human life. It would be nice if we could get that sort of gains from trade with AIs as well, so that neither we nor the AIs have to take the gamble.
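A minimal illustration of those gains from trade, assuming a simple concave (risk-averse) utility over resources; the square-root form is an assumption for illustration, not something from the talk:

```python
import math

def u(share):
    # Any strictly concave utility over resources gives the same conclusion;
    # square root is just a convenient hypothetical choice.
    return math.sqrt(share)

coin_flip = 0.5 * u(1.0) + 0.5 * u(0.0)   # flip a coin for the whole universe
sure_split = u(0.5)                       # split it 50/50 for certain

print(sure_split > coin_flip)   # True: the guaranteed split is better for both parties
```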
If approaches like this can reduce risk, they give us, first of all, an extra roll of the dice -- an extra safety measure if we design unsafe AI systems. Second, they could potentially be used for the bootstrapping process: safe AI is hard to design because we don’t have AIs to help us design it. If you could use this sort of thing to bargain with AIs over a narrow power range, maybe it can help, though it’s perhaps not the sort of scheme you would want to rest humanity’s weight on.
Some limitations. One, human promise-keeping. We approximated by saying that the reward was guaranteed. In fact the AI won’t be perfectly certain of that, either because humans could intentionally renege on the deal and shut it down as soon as we realize it has found a loophole, or because human society may be unstable and we may be unable to keep these promises.
It might attempt to escape because the probability of reward is insufficiently high. Smaller promises, however, are more trustworthy -- promises that can be kept quickly. So even though this whole spectrum is in principle available to trade on, there’s the question of how much of it is actually accessible.
Second limitation, the AI’s power range. If we’re much stronger than it, we don’t have to worry. There’s a middle, gains-from-trade kind of region, where the probability of successful escape is smaller than the probability of reward. But if it gets really powerful, the probability of successful escape is larger than the probability that we’ll keep our deal, and there’s no trade.
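A sketch of that power-range limitation, under the assumption that the AI is resource-satiable enough that the reward is worth almost as much to it as the whole universe, so the choice mostly reduces to comparing probabilities; all numbers are hypothetical:

```python
def prefers_trade(p_escape_success, p_promise_kept,
                  u_universe=1.0, u_reward=0.9, u_shutdown=0.0):
    """Does cooperating beat attempting escape, in expected utility?"""
    eu_escape = p_escape_success * u_universe + (1 - p_escape_success) * u_shutdown
    eu_cooperate = p_promise_kept * u_reward + (1 - p_promise_kept) * u_shutdown
    return eu_cooperate > eu_escape

print(prefers_trade(0.05, 0.90))   # weak-to-middling AI: the trade region, cooperation wins
print(prefers_trade(0.95, 0.90))   # very powerful AI: escape dominates, no trade
```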
Possible extensions. In addition to thinking about risk neutrality, you might think about temporal discounting. A human drug addict doesn’t ever think beyond the next lever press that gives them heroin… [picture of a rat in an experiment] Maybe it wasn’t quite a human drug addict in the picture.
[laughter]
Anna: It doesn’t ever plan to escape. You could try to design an AI like that, one that cares too much about the next second’s reward to plan an escape, which would take multiple seconds. It’s a bit tricky: you try to hardwire it, tying its sense of time to, say, a clock, and it hacks into the clock; you build a more robust notion of time, and it spends all its time trying to invent a time machine in order to get back to a point a little farther back on that hyperbolic discounting curve. A second extension you could try is to generalize the results, if they work, to more than one AI.
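A minimal sketch of that discounting idea, with a hypothetical hyperbolic discount rate and payoffs, just to show how enough impatience makes a multi-second escape plan lose to an immediate reward:

```python
def discounted_value(reward, delay_seconds, k=1e9):
    # Hyperbolic discounting with a very steep (hypothetical) rate k.
    return reward / (1.0 + k * delay_seconds)

press_lever_now = discounted_value(reward=1.0, delay_seconds=0.0)   # 1.0
escape_plan = discounted_value(reward=1e6, delay_seconds=10.0)      # about 1e-4

print(press_lever_now > escape_plan)   # True: the next second's reward wins
# The fragility Anna describes (clock hacking, time machines) is about how hard it
# is to pin down what "delay_seconds" even refers to for a capable optimizer.
```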
Thus it seems worth investigating resource-satiable AI designs, human norms and precommitments, and ways to slowly turn up the power. Although, again, it’s not the scheme you’d want to count on, but given our situation, maybe it’s worth investigating.
Thanks. Also, we have a paper on this on the web.
[applause]