Singularity Summit 2012
For more transcripts, videos and audio of Singularity Summit talks visit intelligence.org/singularitysummit
Speaker: Peter Norvig
Transcriber(s): Ethan Dickinson and Jeremy Miller
Moderator: We conclude today with Peter Norvig. Peter is director of research at Google, a fellow of the American Association for Artificial Intelligence, and a fellow of the Association for Computing Machinery. He was previously the head of the computational sciences division at NASA's Ames Research Center, where he was NASA's most senior computer scientist. He has published extensively in the fields of artificial intelligence, natural language processing, and software engineering, and his textbook, "Artificial Intelligence: A Modern Approach," is considered a leader in the field.
In 2011 he created, along with colleague Sebastian Thrun, an online AI course that attracted more than 160,000 students worldwide. He has continued to teach at udacity.com. Please welcome, to conclude the 2012 Singularity Summit, Dr. Peter Norvig.
Peter Norvig: Welcome everybody. Let's see. We're at this point in the year when our two national pastimes are just wrapping up in the last couple weeks here, and one of the things I think is great about baseball is that when they have people come on and talk, they always show the numbers there. Here you see Buster was one for four with a home run and four RBIs; those are pretty good numbers.
In our other national pastime, politics, people come on and they never show the numbers, and we don't know. They're supposed to be experts, we've heard experts don't always have good judgment, and what I want to see is the numbers down at the bottom there.
[laughter and applause]
Peter: Then we can say, "This guy John looks like he's in an extended slump, maybe we should bench him and bring in a pinch hitter like they did for A-Rod."
Peter: That applies to experts, and it should apply to me. Let's go back to 2007, I was on this stage and said some stuff, and now it's time to evaluate it. Let's see how I did.
This was my concluding slide, where I talked about the prerequisites of what I thought was important to get to artificial general intelligence, and what the field had to work on in the coming years. These were my predictions. I rate them as, I think I did about five out of six. I think most of the things on there did in fact turn out to be important, and we did work on them, and we made some really nice progress.
The very first one I put up, this first-order probabilistic logic, turned out to be not as important as I thought it was. Probably the leader in that field in 2007 was Daphne Koller, but she ended up not doing much more in it; she did other really exciting stuff in genetics and in education, but turned away from this. Maybe it's because of what Daniel Kahneman was telling us this morning, that it's really the wrong side of the system: most of what we should be focusing on is System One. As for getting System Two right, well, computers are already pretty good at that, and perfecting it is maybe not the way to go.
Five out of six, still pretty good. Let me tell you what we've been doing in the subsequent five years about getting understanding better, and mostly at this System One level.
I should have known that this was going to be important, because in 2007 I knew about this paper from 1996, Olshausen and Field, about how to do computer vision, using this thing called sparse coding. Bear with me, there's some formulas here, but basically they said, we take an image, it's a picture of something, we want to be able to represent that image in terms of basis functions, which are little features, little bits of the image. The image is then a weighted sum of all those features. Those little bits represent what we know about the world. That was their model of how vision works, and they're neuroscientists, so it was backed up by some real knowledge of what goes on in the V1 area of the cortex.
They worked that out. It's unsupervised, nobody has to tell you what the image is. We don't know very much about what's happening in the world and what types of 3D structures there are, we just know that pixels that are near each other probably represent things that are near each other in the world. Then we get a bunch of images in, we do some optimization, and we get these basis functions that represent what we know about the world.
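The sparse-coding idea can be sketched in a few lines. This is a toy illustration, not Olshausen and Field's actual procedure: the dictionary here is random rather than learned from images, and the coefficients are found by plain iterative soft-thresholding, minimizing a reconstruction error plus a sparsity penalty.

```python
import numpy as np

# Toy sketch of sparse coding: represent a "patch" x as a sparse
# weighted sum of basis functions (columns of D), i.e. x ≈ D @ a,
# by minimizing  0.5 * ||x - D @ a||^2 + lam * sum(|a|).
# D and x are random stand-ins, not real image data.

rng = np.random.default_rng(0)

def sparse_code(x, D, lam=0.1, steps=200, lr=0.01):
    """Find sparse coefficients a by iterative soft-thresholding:
    a gradient step on the reconstruction error, then shrink toward 0."""
    a = np.zeros(D.shape[1])
    for _ in range(steps):
        grad = D.T @ (D @ a - x)                # gradient of 0.5*||x - D a||^2
        a = a - lr * grad
        a = np.sign(a) * np.maximum(np.abs(a) - lr * lam, 0.0)  # sparsify
    return a

D = rng.normal(size=(64, 128))                  # 128 basis functions for 8x8 patches
D /= np.linalg.norm(D, axis=0)                  # unit-norm basis vectors
x = D @ (rng.normal(size=128) * (rng.random(128) < 0.05))  # sparse ground truth
a = sparse_code(x, D)

print("reconstruction error:", np.linalg.norm(x - D @ a))
print("nonzero coefficients:", int(np.count_nonzero(a)))
```

In the real model, optimizing the dictionary D itself over many natural images is what produces the line-like basis functions shown next.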
What do they look like? They look like this. These are these basis functions, and basically they're just little lines. That says our visual system is made out of pieces that can recognize this line at this orientation, and so on. That's not very satisfying. It was able to perform some computer vision tasks, but it really didn't get us to this higher-level, hierarchical organization of the world. That was the state of the art in 1996, and in 2007. Since then, we've progressed.
I also should have known about this work by Geoff Hinton and his colleagues about deep belief networks, which was getting at the idea of imposing a hierarchical structure, so you could get something more than just lines. They did this work, and it was a big breakthrough, but in 2006 it hadn't gotten that far. They could do things like recognize digits, so you could sort your letters with zip codes on them, and they could pick out features like these circles, but it never got much higher-level than that.
It wasn't until 2009, when Andrew Ng and some of his colleagues at Stanford came up with a mathematical twist on it – I won't go into all the details of how – and they built this hierarchical method where we went from the lower level, where we're recognizing the lines, up to a higher level where we're putting those together to get more complicated features, and then up to the final image. Now something interesting really started to come out. We can see we can stack these up to multiple levels, and this is what we get out.
What we see here, the top is the second level. We start off with the first level, which is not shown here; it's pretty much the same as Olshausen and Field had, it's just collections of lines. The second level, if the input is faces, the second level becomes eyes and noses. It's figured out that these features, eyes and noses, are an important part of its environment, when its environment is faces, and that doors and wheels are an important part when its environment is cars. Then the third level is a reconstruction of canonical faces and cars. There it is for pieces of a chair, and there you see that when you mix multiple types of images together, it still works, and it mathematically efficiently divides up its representation of the world in about the right proportions to have some of it be faces, some of it be cars, and so on.
Now we're starting to get what I was asking for in 2007, which is this hierarchical representation of the world, one we can actually build up from low-level features without having to program it in by hand, just by doing all-automatic, unsupervised learning. This is really the first time that's happened. Very exciting.
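The building block being stacked can be sketched as a single autoencoder layer: learn weights so the hidden code can reconstruct its input, then train the next layer on the first layer's codes. This is a hypothetical toy with random data standing in for image patches, not the actual Stanford system (which used sparse autoencoders with pooling and contrast normalization).

```python
import numpy as np

# Toy sketch of greedy layer-wise training: one tied-weight autoencoder
# per level, each trained on the codes produced by the level below.
# Random data stands in for images; this is illustrative, not the real model.

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, steps=300, lr=0.1):
    """One tied-weight autoencoder layer trained by plain gradient descent."""
    n_in = X.shape[1]
    W = rng.normal(scale=0.1, size=(n_hidden, n_in))
    for _ in range(steps):
        H = sigmoid(X @ W.T)              # encode
        R = H @ W                         # decode (tied weights, linear output)
        err = R - X                       # reconstruction error
        dH = (err @ W.T) * H * (1 - H)    # backprop through the sigmoid
        gW = dH.T @ X + H.T @ err         # gradient w.r.t. the shared W
        W -= lr * gW / len(X)
    return W

X = rng.normal(size=(200, 32))            # 200 fake "patches"
W1 = train_autoencoder(X, n_hidden=16)    # first-level features
H1 = sigmoid(X @ W1.T)                    # codes from level one...
W2 = train_autoencoder(H1, n_hidden=8)    # ...become training data for level two

loss = np.mean((sigmoid(X @ W1.T) @ W1 - X) ** 2)
print("layer-1 reconstruction MSE:", loss)
```

The key design point is the stacking: each layer only ever sees the representation produced by the layer beneath it, which is how lines can compose into eyes and noses, and those into faces.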
We decided, let's put together the dream team at Google, so we offered to Andrew that he could come over and work with us. We were able to convince Geoff Hinton to come by this summer and work with us for a few months, and then we added Jeff Dean, a Google fellow. He's the guy that you get when you want to scale up your program from running on one computer, like Andrew had been doing before, to running in a warehouse full of computers. Jeff is the one who has done a lot of the fundamental work in how we do distributed processing at Google.
We decided that we could go ahead and do that, that we could build something at about this scale, rather than the single computer scale. We have 16,000 CPUs on 1,000 servers. We built this deep network that had a billion parameters to it that were going to be learned, and a 200 by 200 pixel field. If you compare that to the work that had been done before, it means it's about 100 times bigger than any of the work in computer vision that had been done before. Still, at least 100 or 1,000 times smaller than what the human brain does, but getting a lot closer.
This is what we did. We said, we're going to train this. We're going to give our system 10 million YouTube videos, but for the first experiment, we'll just pick out one frame from each video. You know what YouTube looks like. We're going to feed in all those images, then we're going to ask it to represent the world.
What happens? Well, this is YouTube, so there will be cats.
Peter: What I have here is a representation of two of the top-level features. The images come in, they're compressed, we build up representations of what's in all the images, and then at the top level some representations come out, these basis functions, these features that are representing the world. The one on the left here is sensitive to cats. These are the images that most excite this node in the network, the best matches to that node in the network. The other one is a bunch of faces on the right. Then there's tens of thousands of these nodes, and each one picks out a different subset of the images that it matches best.
One way to represent what is this feature is to say this one is "cats", and this one is "people", although we never gave it the words, "cats" and "people," it's able to pick those out. We can also ask this feature, this neuron or node in the network, "What would be the best possible picture, that you would be most excited about?" By a process of mathematical optimization, we can come up with that picture, and here they are. Maybe it's a little bit hard to see here, but that looks like a cat, pretty much, and that definitely looks like a face. The system just by observing the world, without being told anything, has invented these concepts.
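The "best possible picture" trick can be illustrated with gradient ascent on the input itself: hold the image norm fixed and repeatedly move the pixels in the direction that most increases the unit's response. Here a single stand-in linear unit replaces the real nine-layer network, so the optimum can be checked analytically; the real optimization is the same idea applied through the whole network.

```python
import numpy as np

# Sketch of finding the input that most excites one unit: gradient
# ascent on the *pixels*, with the input norm held fixed. The "network"
# here is a single hypothetical linear unit, not the trained model.

rng = np.random.default_rng(2)

w = rng.normal(size=100)              # stand-in "learned" weights of one unit

def activation(x):
    return w @ x                      # the unit's response to input x

x = rng.normal(size=100)
x /= np.linalg.norm(x)                # constrain the image to the unit sphere
for _ in range(100):
    x = x + 0.1 * w                   # gradient of (w @ x) w.r.t. x is just w
    x /= np.linalg.norm(x)            # project back to the unit sphere

best = w / np.linalg.norm(w)          # analytic optimum for a linear unit
print("cosine to analytic optimum:", x @ best)
```

For the deep network there is no closed-form answer, but the same ascent converges to pictures like the cat and the face shown on the slide.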
I can show you a complex journal paper and so on, but actually this webcomic explained it better than anything else, so go look that one up. [four panels from the webcomic "Abstruse Goose." First panel: no text. Second panel: Robot: "CAT"; Man: "Good boy." Third panel: Woman: "Hey, is that a robot?" Man: "Not just any robot. THIS is the current state-of-the-art in artificial intelligence." Fourth panel: Man: "We trained a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a dataset of 10 million images. It was trained for 3 days on a cluster of 1000 machines comprising 16,000 cores."]
I should say, we call these things neural networks, but they aren't really brains in any way. Some of it is motivated by things that we know about the brain, so the fact that there's these local receptive fields, that there's columns that feed forward, we do this process of local contrast normalization, so pictures that have different lighting are normalized, but we're not trying to duplicate what's going on in a brain, we're just motivated by it.
This is from a different set of inputs. Not YouTube images, this is a standard set of 16 million images, and they were all clustered together, and these are the types of things that come out. You can see some of what the system is doing is picking out textures, like small circles. Some of what it's doing is picking out lines, like diagonal lines, shapes like circles, but also higher-level concepts, like flowers, and ducks, and zebras, and wine bottles, and pizza, and so on. It's doing a very good job of separating the world out into the types of concepts that are important to us.
I should say, we've also applied the same technology to other areas besides images. We applied it to speech recognition. We took the same data we'd used for our speech recognition system before, we used this deep learning technique on it, and we got a 37 percent decrease in word error rate. One guy on the team said this was the biggest single thing he had ever seen in his years of working in speech recognition, the biggest single advance. Another one said he thought it extrapolated out to 20 years worth of research if you look at the word-error-rate curve, and then you look at the jump that we got. If we hadn't had this he thought it would take 20 years, but not everybody on the team was willing to sign up for that judgment.
What are the key features, the difference between 2007 and now? We're making progress, finally, on this idea of hierarchical learning, which I was stressing then. We are doing the learning, it is lots of data, it is important that it's efficient, that we're able to get 100 times bigger, but we've still got 100 times more to go.
We're doing this pattern-matching, but I also want to talk about the Kahneman System Two, the logical inference. A good application of that is machine translation. I've talked about that before. Here's an example of the types of things we can do with Google Translate. You put in, in this case, German, you get out English. It's pretty good. You look at that and you don't have to go more than a sentence or two before you say, "this is not quite fluent, this wasn't written by a perfect English speaker," but you can definitely understand what's going on. That's typical.
How do we do it? We start with all the data in the world, so mostly we go on the World Wide Web, and we find a pair of documents where we say, "This is an English document, and there's a link or something that tells us this is a translation of a German document." Then we find lots and lots of them. We find millions of documents, billions of words in a language pair.
Here's an example of two documents I collected when I was in my hotel. On one side they said "Dear Guest," and on the other side they gave the German translation. If anyone gave me exactly one of those sentences, I would know how to translate it. I don't know any German, but I could translate it because I was told those are translations. Of course I'm not going to get those exact sentences, so I have to figure out how phrases map to phrases so that I can combine the data I have when I get a novel sentence I've never seen before.
We do that basically like solving a jigsaw puzzle. We say, "say the word 'art' is in an English sentence and the word 'kunst' is in the German sentence; maybe these correspond to each other." From this one example I couldn't tell, but if I have thousands of examples then they start to add up, and I can count how many times "art" occurs in the English sentence and "kunst" occurs in the German sentence, and if that's much higher than you'd expect otherwise based on the frequencies of those two words, then that's evidence that that's the right translation. We just start counting those up. Then once you've accounted for one phrase, you can cross it off that sentence and look at which ones are left over, and as you go it becomes easier and easier, like doing a jigsaw puzzle. Then we just have this set of phrase correspondences. We started with correspondences between documents; now we have correspondences between phrases.
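The counting idea can be sketched directly. For each aligned sentence pair, count how often an English word and a German word appear together; pairs that co-occur far more than their individual frequencies would predict are candidate translations. The sentence pairs below are made-up toy data, not real parallel text, and real systems count phrases and use much more careful statistics.

```python
from collections import Counter

# Toy sketch of co-occurrence counting over aligned sentence pairs.
pairs = [
    ("the art of cooking", "die kunst des kochens"),
    ("modern art", "moderne kunst"),
    ("the house", "das haus"),
    ("art and music", "kunst und musik"),
]

co = Counter()        # (english_word, german_word) co-occurrence counts
en = Counter()        # english word frequencies (per sentence)
de = Counter()        # german word frequencies (per sentence)

for e_sent, g_sent in pairs:
    e_words, g_words = set(e_sent.split()), set(g_sent.split())
    en.update(e_words)
    de.update(g_words)
    for ew in e_words:
        for gw in g_words:
            co[(ew, gw)] += 1

n = len(pairs)

def score(ew, gw):
    """Ratio of observed co-occurrence to what independence would predict."""
    expected = en[ew] * de[gw] / n
    return co[(ew, gw)] / expected

print(score("art", "kunst"))   # co-occur in every sentence containing "art"
print(score("art", "haus"))    # never co-occur
```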
Then we put all this together. We're actually combining three sub-models. We have this translation model that says how often "art" goes to "kunst," versus something else. Say we're trying to translate from German into English: we want not only a good translation, but also a result that is fluent English, so we have a target model that just says how often each phrase occurs in English, the target language. Then we have a distortion model, which says we know we want one phrase to correspond to another phrase, but sometimes the order of the phrases is flipped. In English we tend to have our adjectives before the noun; in French they tend to come afterwards. So sometimes you want to move a phrase forward or backward in a sentence, and we have a model of how often phrases do that.
Then there's just a probability statement. Here's this model of language where you can figure out the probability of a translation in terms of these three models, and then we just go through and optimize that. The result that comes out is the best translation. That's all we know. We've gone out, we've collected data, we've counted. It's like everything I need to know I learned from Sesame Street. God bless the Count, who passed away recently.
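The combination step can be sketched as scoring each candidate output by the product of the three sub-model probabilities (a sum in log space) and keeping the highest-scoring one. All of the probabilities below are made-up toy numbers, not real model output, and real decoders search over vastly many candidates rather than a fixed list.

```python
import math

# Toy sketch of combining translation, target-language, and distortion
# models: score = log P_translation + log P_language + log P_distortion.

def total_score(probs):
    translation, language, distortion = probs
    return math.log(translation) + math.log(language) + math.log(distortion)

# (P_translation, P_language, P_distortion) for two hypothetical outputs.
# Both phrase translations are plausible, but the language model strongly
# prefers the fluent English ordering.
candidates = {
    "the art of the fugue": (0.020, 0.0010, 0.9),
    "the fugue of the art": (0.020, 0.0001, 0.5),
}

best = max(candidates, key=lambda c: total_score(candidates[c]))
print(best)
```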
We've done this without anybody on our team speaking the languages. There are several languages we've done where nobody on the team speaks the language, but we know enough to gather the data, we build these models, and we get pretty good translation results.
The third and final area where I want to talk about learning is education itself. I mentioned in the introduction that Sebastian and I taught this online class together; wanting to bring what we knew to the world was one part of why we did it. Sebastian's very eloquent about how he grew up in a small town in Germany where they didn't really have good access to good teachers, so he felt stymied by that. He wanted other people not to have to feel that frustration.
We also wanted to do it because we wanted to be able to collect data, and when all education is going on in the classroom and none of that data is being collected, it's hard to improve. If we have a record of each student's interaction, and what works and doesn't work, then we think we can make much more rapid iteration and improvement.
As mentioned, we offered this class. 160,000 students signed up from 200 countries. What we're really trying to achieve is this level of one-on-one tutoring. That's why we tried to make our videos look informal, as if we were just talking to a student, here, like this is my mom talking to me. We know that's the best type of education that we know how to do, and we thought we could get there by having this power of personalization that's informed by machine learning.
In the class we did, this was just a first step. We didn't have any of this iterative improvement, because this was the first time we'd done it, we hadn't gathered any data. We've set the grounds by exposing students to this first iteration, now we go back, look at our logs, see what works and what doesn't work, and we can improve from there.
[slide shows a silhouette of Afghanistan filled with the country's flag] Why do I show this? Because I came across a quote yesterday that says that "Education is the Afghanistan of technology. It's where technologies go to die." A new technology says, "I'm going to move in, I'm going to conquer this space," and then a decade later they put their tails between their legs and exit.
The first example I saw, this is Thomas Edison with Eastman from Eastman Kodak. In 1913, Edison proclaimed that "Books will soon be obsolete in public schools. Our school system will be completely changed inside of 10 years." Maybe that's one of our predictions that we can put up in the list of predictions. That didn't turn out to be true. Edison goofed up on that one.
Why didn't it turn out to be true? I think there are a lot of complicated reasons why technology by itself isn't the answer; I think the personal interaction is probably more important than the technology. I also think it's in part because of the lack of feedback, in that films only go in one direction. We don't get to see how much the student is learning from the film. Maybe the film really was better than the book, but unless we can close that feedback loop, we don't know which part's better and which isn't.
I think if we can put that feedback loop together, then we can get better personalized interactions, and maybe we can approach this level of one-on-one tutoring. That's my claim, I don't know if it will work, so invite me back in five years and we'll see how well we did. Thank you.
Man 1: I'd like to follow up on your connection to Professor Kahneman's talk, talking about System One, System Two design architecture of the human brain. My question is, is that a good architecture...
Man 1: ... if you were going to design a mind from the ground up today, would you have a fast section and a slow section, and the fast section be better at associative learning, and the slow section using probabilistic inference and those things?
Peter: That's a great question. I guess I wouldn't want to have the only distinction be between fast and slow, and I don't think Kahneman really meant to take that seriously, he was saying, "Here's two effects," rather than to say the brain is necessarily completely split that way. It's not like one hemisphere's fast and one's slow.
You do want to be able to react quickly. One of the things we have to deal with is time-limited actions, where if I'm driving a car I don't want to prove a theorem about whether I should hit the brakes or not, I just want to hit the brakes. That seems to be really important, to be able to meet this criterion of interacting with the world at the speed at which you have to do it. Being able to prove things is a luxury we probably get much less often. It makes sense to have that separated out.
If you were operating at a much faster speed, then you would probably change the barriers quite a bit, and say, "Now I can go much faster, now there's much fewer situations in which the necessity to act right now is important, and rather I can pull everything into play and make those decisions."
Maybe Watson is a good example of that, where this is a system that has something like 100 components, and each one of those is operating in parallel and coming up with its suggestions. Some of them are very deliberative and rational and work through many steps, and some of them are just associative, "do a text search in a database and come back with some solutions," and then it combines and ranks them all and comes up with a final result. It's able to do that because they measured the reaction time and they said, "We have so many milliseconds. Yes, we can do all this within the reaction time that we're given." That gives them an advantage over Ken Jennings, who doesn't have the ability to do all those calculations in the same amount of time. Depending on your speed, you get a different architecture.
Man 2: Peter, thank you for your talk. My question's about these "cat" neurons and the big amounts of data analyzed by computers. How do you think, in the next 5 or 10 years, will computers be able to learn actively, like having millions of documents and deciding which documents are the most interesting to learn from, and which to choose for learning or not? Are we close in this dimension or not?
Peter: Yeah, I think that's a good question. I graded myself five out of six, and I was considering giving myself only four, because online was one that, maybe that's only half-way. The online wasn't an important factor in our deep-learning network, rather we're training it in a batch method. There's other work we're doing that is online, so I gave myself a point for that anyways.
But you're right, that's one weakness: it's not living in the world. It's being trained once by us dumping data at it, and then it can react once it's learned, but it should be continuous, it should be exploring on its own, and it should be making trade-offs, to say, "Do I want to go out and try to learn more information, or do I want to make do with what I have?" It has to be making all those decisions.
I think before we worry about that, there's still some work to do just to scale up what we have. I think it's sensible, the approach we're taking. One thing I mentioned, we're still a couple orders of magnitude below what the human brain can do. It's also true that we're picking out one lousy still image from each of these YouTube videos. If you really want to know what a cat is, you want to see it walking around, jumping and scratching itself and doing all those other things. We got a question earlier about other modalities besides just vision. You want to get all those types of inputs. We've got a long way to go before we can process all that.
I agree that living in the world and making those choices is something we should be aiming for. There's nice work by Tom Mitchell at Carnegie Mellon, this never-ending language learning system that's reading the web and deciding what to read next. Maybe that's one good example going in that direction.
Man 3: Great talk Peter. Quick question, continuing on the theme around the visual side of things. How hard would it be to extend it to three dimensions? In other words, when we look at a cat, we have a model in our head of a three-dimensional cat as well, so if we had to look at the back of the head of a cat... how difficult would it be to implement something like that?
Peter: Yeah, that's a good question. There's certainly nothing inherently two-dimensional about the work that we did. We did lay out the pixels in a two-dimensional array, we had these local receptive fields, but we've done other things like the sound processing that's more one-dimensional rather than two-dimensional, so that's not intrinsic. We could try to reconstruct 3D models in the middle, that might make sense.
Most of the history of computer vision work recently has gone away from the 3D towards the 2D, and I think that's because of the availability of data. It's hard to get 3D models, and it's very easy to get 2D images, and this idea that, you're doing visual reconnaissance and you needed to recognize tanks on the ground so you needed a 3D model of the tank, we needed that when we only had a couple images of each tank. If you have thousands of images from all angles, all the way around, then the sum of all those images pretty much adds up to something similar to a 3D model, so that works well.
I should also say that it depends a lot on the deformability of objects over time. That's an area that people are working on. I've seen some of the work that works really well on rigid bodies like cars or robots, but you give it a jellyfish and it just can't do anything with it, because it doesn't see how there's a coherence from one frame to the next. So we've got a lot of stuff yet to work out.
Man 4: You talked about Hinton's work, which of course pretty famously got to recognize handwriting in different fonts. Then you went to machine translation, and I got really excited because I thought you were going to show results using that method for translation. I'm curious if you tried it, or did it not work as well as Bayes?
Peter: We haven't done that yet. So far it's a pretty direct mapping, because we have such rich data. I think part of the issue is that most of what we're trying to capture with translation is pretty close to the surface. Words are good representations of themselves, and they're pretty good representations of the world, whereas a single image does not say "cat" in quite the way that the letters C-A-T do, there's a bigger jump there. If you have text, you've already gone up one level of the hierarchy, so there's less need to add on additional levels.
I should say that we're probably hitting plateaus in some ways. There's not that much farther you can go just by looking at the words themselves. We've started to get past that in several ways. One of them is by adding in syntactic structure. For the first five years or so, whenever we threw in a syntactic parser, it didn't make any difference at all to our scores. Now that the scores are starting to asymptote, adding a parser does help, so having this additional structure helps there. We're also helped by triangulating, by doing multiple paths, going through different languages and so on. That's what we've been concentrating on now.
I'm not sure exactly what I'd do to get all the hierarchical models. We have some of that just by clustering words. What I showed was very literal, where I was just doing counts for individual words. You also want to be able to worry about endings of words, and be able to do generalization across those, and also across word classes, you'd like to be able to say, "If I hear a sentence that's talking about Tuesday, then I should be able to fill in Wednesday or Thursday." We've started to do some of this hierarchical clustering. So far we've done it just based on what words co-occur with other words, to be able to make that kind of substitution of one word for another.
It would probably be nice to put this all together, to be able to say, "Here's a sentence that has the word 'tiger' in it, maybe we should learn from a sentence that has the word 'lion,' or 'ocelot,' and how can we learn that?" In part because of the words we have, in part because of the images we have, if we combine them all to have these hierarchical models of all the things there are in the world, then I'm sure that would help translation, but it's just not the very next thing on the list for how to help.
Moderator: Unfortunately that does take us to the end of our time. Peter Norvig, AI teacher to the world. Thank you so much for being here.