Summit 2011 - Sharon Bertsch McGrayne

Singularity Summit 2011

A History of Bayes’ Theorem

For more transcripts, videos and audio of Singularity Summit talks visit intelligence.org/singularitysummit

Speaker: Sharon Bertsch McGrayne

Transcriber(s): Alex Vermeer, Matt Cudmore

The MC: Sharon McGrayne is next. She is the author of highly praised books about scientific discoveries and the scientists who make them. She is interested in exploring the connection between social issues and scientific progress, and in making the science clear and interesting to non-specialists.

Her first book dealt with changing patterns of discrimination faced by leading female scientists during the twentieth century, and her latest book is called “The Theory that Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy.” The book describes how Bayes’ theorem has become a driver of scientific progress, and a New York Times book review named it an editor’s choice.

McGrayne’s work has been featured on the Charlie Rose show, reviewed in Nature, and highlighted on NPR’s “Talk of the Nation,” and she is a frequent contributor to many other publications as well.

She is here to tell the story of how an 18th century approach to assessing evidence was ignored for much of the 20th century, but ultimately embraced. Please welcome Sharon McGrayne.

[applause]

Sharon McGrayne: Thank you for inviting me, and thank you for coming so early in the morning. As Nathan explained, I write books about science and scientists. I’m not a mathematician or a scientist myself.

Bayes’ rule is the foundation of the singularity and artificial intelligence, and if—and when—artificial intelligence overtakes the human brain, Bayes’ rule will be there as the foundation of that as well, I feel.

I understand that some people have said that there was never a great argument about Bayes’ rule. I am here to tell you the contrary. There was an enormous food fight over Bayes that went through most of the twentieth century, and did not subside until quite recently. So in a very real sense, you folks here are real revolutionaries. That’s what I am going to talk about today.

Exhibit A: Air France 447 took off two years ago last June from Rio de Janeiro, bound overnight for Paris. It hit a very intense electrical storm at high altitude and disappeared without a trace. Two hundred and twenty-eight aboard.

I spent the afternoon in Paris a couple of weeks ago with Olivier Ferrante, the aviation engineer, who ran a two-year search for the wreckage of Air France 447. It was the largest, longest, most high-tech undersea search ever done. Fruitless for two years. Then he hires a Bayesian search firm in Virginia. Many of its members are profiled in the chapter on the development of Bayesian naval search theory in “The Theory that Wouldn’t Die.” They put together the most probable region to find Air France 447, and a two-year search ended in an undersea search of one week. Two year search, and Bayes finds it in an undersea search of one week.

These are the black boxes they were searching for. They’re red and white. They’re the size of shoe boxes. This is the map that Ferrante used. They were searching a vast area the size of Switzerland. Here’s Zurich up here, Geneva over here. This is Switzerland, and he chose to put it over Switzerland because the mountainous terrain four thousand meters under the sea was much like Switzerland’s. The pingers they were searching for were the size of cigars, and they found it.
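The kind of updating behind that search can be sketched in a few lines. This is a toy version of Bayesian search theory, not the actual AF 447 model: the cells, priors, and detection probability below are all invented. The key move is that searching the most probable cell and finding nothing lowers that cell’s probability and raises everyone else’s, which steers the next leg of the search.

```python
# Toy Bayesian search over three seafloor cells. All numbers are invented.
# A search of cell X that finds nothing is evidence against X, but not proof:
# the sensors miss a present object with probability 1 - p_detect.

def update_after_failed_search(priors, searched, p_detect):
    """Posterior over cells after an unsuccessful search of `searched`."""
    posterior = dict(priors)
    posterior[searched] *= 1 - p_detect   # likelihood of "no find" if it is there
    total = sum(posterior.values())       # unsearched cells keep likelihood 1
    return {cell: p / total for cell, p in posterior.items()}

priors = {"A": 0.5, "B": 0.3, "C": 0.2}   # beliefs from drift models, flight data, etc.
posterior = update_after_failed_search(priors, "A", p_detect=0.9)
print(posterior)  # cell A's probability collapses; B and C rise
```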

The remarkable thing to me is, among other things, that the authorities publicly credited Bayes, because as we’re going to see, for decades of the 20th century, a lot of people didn’t even dare mention the word “Bayes.”

Exhibit B: Bayes is all around us.

The Google car, last spring, was basically Bayes. They use the Google maps to start out with, and then they update that information by what comes in through the sensors on top of the cars about new travel conditions around them, potholes, new deviations and things, and figure out the most rational way to drive at that particular time.

Bayes is all around us. On the New York Times front page last Sunday, two stories based on Bayes. Bayes is not mentioned; there’s no “Bayes” word in there. One story called “Inflating the Software Report Card” is about a Bayesian program that teaches children mathematics, and there’s a controversy about the statistics involved in proving the effectiveness of the instruction. The other one, of course as we’ve just heard, is about the rapid trades, and there is a question about that clamping down.

Two stories in the Sunday New York Times, followed by Tuesday’s New York Times. Two economists win a Nobel Prize for doing cause-and-effect studies in economics using Bayes’ rule.

There are holdouts, of course, particularly in courtrooms. A story has been circulating in the last couple of weeks, in which a Guardian newspaper reporter wrote about an appeals judge in Britain who has banned Bayesian statistics from British courtrooms. He wants firm numbers, not approximations.

Now, to appreciate how revolutionary this is and how revolutionary you all are, we have to understand how long and embittered the assault on Bayes was.

I’ll very briefly show you an equation. I’ve been told that people who write popular science must never mention an equation. But this is Thomas Bayes’ rule in modern notation: P(A|B) = P(B|A) P(A) / P(B). (It is actually not the Reverend Bayes’ own formula; he used a geometric form of Newtonian calculus.)

The P(A): that is the heart of the fight against Bayes. Thomas Bayes told us that we can start with a measure of our belief about a situation. We can assess the probability of our prior belief, and if we don’t know very much about it, then we can guess. He uses the word “guess,” and then he goes even further and says that if you don’t really know how much to guess, just guess fifty-fifty. And it was that suggestion, the “subjective prior” as they call it, that so inflamed anti-Bayesians. They called it subjectivity run amok. They said it was ignorance coined into science. And as the 1700s, 1800s, and 1900s accumulated much more trustworthy data, statisticians mostly decided that they would prefer to judge the probability of an event or a situation by how frequently it occurs, and they became the frequentists, the arch-opponents of Bayes.
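A minimal numeric sketch shows the rule and Bayes’ fifty-fifty guess in action. The “biased coin” scenario and the 0.9 bias figure are invented for illustration:

```python
# One application of Bayes' rule: is this coin biased? We follow Bayes' advice
# and guess fifty-fifty as the prior P(A). The 0.9 bias figure is invented.

p_biased = 0.5                 # P(A): the subjective prior that caused the uproar
p_heads_if_biased = 0.9        # P(B|A): a biased coin lands heads 90% of the time
p_heads_if_fair = 0.5          # P(B|not A)

# Observe one head. P(B) comes from the law of total probability.
p_heads = p_heads_if_biased * p_biased + p_heads_if_fair * (1 - p_biased)
p_biased_given_heads = p_heads_if_biased * p_biased / p_heads

print(round(p_biased_given_heads, 3))  # 0.643: one head nudges belief toward "biased"
```

One observation moves the fifty-fifty guess only modestly; it is repeated updating that does the work.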

Nevertheless, although Bayes was basically taboo among statistical sophisticates by 1939, when the Second World War broke out, a number of people, including Alan Turing and the great Soviet mathematician Andrei Kolmogorov, used Bayes during the war. It is very good for dealing with very uncertain situations in which you have to make quick one-time decisions, and that is what they had to do during wartime. However, much of that work, particularly Turing’s decryption of the Enigma code and other codes, was immediately classified by the British government after the peace.

So Bayes’ rule emerged from the Second World War as suspect as it had been when it entered the war, despite having been used for critical projects. We have a small group of maybe 100 or more fervent Bayesian believers who are stymied: they can’t prove that Bayes works, because much of the proof is classified.

Ronald Fisher, the great founder of statistical science and an anti-Bayesian, kept up a very personalized fight against Bayes’ rule starting in the 1920s and ’30s and continuing into the 1950s, when a Bayesian at the National Institutes of Health was using statistical data, and Bayes in particular, to prove that cigarette smoking actually was a cause of lung cancer.

Another example: Jack Good, who was Alan Turing’s assistant in breaking the Enigma code during the Second World War, knew Bayes worked but couldn’t say so. He gave a talk about the theory at the Royal Statistical Society, and the next speaker’s opening words were, “After that nonsense...”

During Senator McCarthy’s witch-hunt against communists in the U.S. federal government, people at the National Bureau of Standards called a Bayesian there, only half-jokingly, un-American and accused him of undermining the United States government. The Bureau actually suppressed a Bayesian study that was going to be sent [xx] to the Aberdeen Proving Ground because it was “subjective Bayesian.”

Harvard Business School professors... I have a sense that some of you have used Howard Raiffa’s decision trees, which are highly Bayesian. Howard Raiffa was a convert, first an intellectual convert and then an emotional convert to Bayes, but the Bayesians who developed the decision tree at Harvard Business School were called socialists and “so-called scientists.” The Harvard Business School at one time was called a “Bayesian hothouse.” And a Swiss visitor to Berkeley’s very anti-Bayesian statistics department in the 1950s realized that it was kind of dangerous to espouse Bayes.

During this period of the Cold War, the military of course continued to use and develop Bayes’ rule, but kept it secret. For example, in the 1950s it wrestled with the question of how you judge the probability of an event that has never occurred. Obviously an event that has never occurred has no frequency, so the frequentists couldn’t deal with that question. There had never been an accidental H-bomb explosion; there had been deliberate tests of H-bombs, but never an accidental explosion.

If you have seen Dr. Strangelove, the movie that satirizes General Curtis LeMay’s Strategic Air Command, you can appreciate the David-and-Goliath overtones of the story of a young post-doc named Albert Madansky at RAND, who used Bayes to show that expanding LeMay’s Strategic Air Command would in all probability cause 19 accidents involving H-bombs, armed or unarmed, each year. So the Kennedy Administration eventually added safeguards.

There were other Cold War projects, of course. At the National Security Agency, cryptographers used Bayes and cracked the Soviet codes. John Tukey was an immensely powerful advisor to the White House and to the National Security Agency; he was also a professor at Princeton and at Bell Labs. He used Bayes for 20 years to predict the winners of congressional and presidential elections for the Huntley-Brinkley news hour, the most popular news program at the time. But he insisted on keeping Bayes secret, apparently because he wanted to keep his Bayesian connections to cryptography secret. He was widely regarded as an enemy of Bayes all this time.

And of course the Navy used it to develop undersea search theory first for finding a hydrogen bomb that was inconveniently lost in Spain, and then for the lost nuclear sub the Scorpion in the Atlantic, and then to catch Russian submarines in the Mediterranean and the Atlantic.

As a result of this onslaught, and the fact that they couldn’t prove that Bayes worked, the Bayesians spent this period on a lot of theory. Many Bayesians of that generation remember the exact moment when the overarching logic of Bayes’ rule descended on them like an epiphany. They talk about their conversions. We’ve seen Howard Raiffa and the decision trees.

During this period, both sides were proselytizing that their version of probability was the only one that should be used, and both sides used religious terms. When Dennis Lindley, who is one of the founders of modern Bayes, became the chair of an English statistics department, the frequentists there called him a “Jehovah’s Witness elected Pope.” He, in turn, when asked how to encourage Bayes, said “Attend funerals.” The frequentists retorted, “If only the Bayesians had done as Thomas Bayes had done, and published after they were dead, we should all be saved a lot of trouble.”

As a result, there were very few visible civilian applications in the mainstream during this period. When, for example, Norman Rasmussen, a physicist from MIT, was asked in 1973 to make the first assessment of nuclear power plant safety, 20 years after the industry had been established, he had to use Bayes, because there had never been a nuclear power plant accident. He used Bayes to combine not only things like the failure rates of pipes and valves, but also expert opinion to flesh out the data, and he came up with many of the things that actually happened at Three Mile Island. But Bayes’ rule was so controversial that he had to hide the word “Bayes” in the appendix of volume three of his massive multi-volume Rasmussen Report.

There was one big Bayesian application in the mainstream that was really public. That was a study using the words in the Federalist Papers as data: a classification project. The Federalist Papers were essays written by our founding fathers to convince New York State voters to ratify the U.S. Constitution. Frederick Mosteller of Harvard and David Wallace of the University of Chicago used Bayes to conclude that the 12 anonymous Federalist Papers were almost indubitably written by James Madison, a conclusion that still holds today.

But they also discovered, as a result of their massive Bayesian study, an “awesome result,” they said: the century-long argument over the P(A), the subjective prior, is irrelevant if you have a lot of data to update it with. The problem was that Mosteller had had to organize an army of 100 Harvard students to input data and carry it across Boston and Cambridge to the MIT computer center, because Harvard didn’t have one at the time. It was a project that no one else could even imagine organizing.
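The flavor of the Mosteller–Wallace computation can be sketched with invented numbers. They compared how well each author’s habitual rates for marker words such as “upon” and “whilst” explained a disputed essay’s counts; the rates below are made up, and the real study used far more words and richer count models than the simple Poisson used here:

```python
import math

# Toy authorship comparison in the Mosteller-Wallace style. The rates below
# are invented; Hamilton really did use "upon" far more often than Madison.

rates_per_1000 = {                        # marker-word rates per 1,000 words
    "Hamilton": {"upon": 3.0, "whilst": 0.1},
    "Madison":  {"upon": 0.2, "whilst": 0.5},
}

def log_poisson(k, lam):
    """Log-probability of observing k occurrences when lam are expected."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def log_likelihood(author, counts, essay_length):
    scale = essay_length / 1000
    return sum(log_poisson(k, rates_per_1000[author][word] * scale)
               for word, k in counts.items())

counts = {"upon": 0, "whilst": 2}         # observed in a 2,000-word disputed essay
for author in rates_per_1000:
    print(author, round(log_likelihood(author, counts, 2000), 2))
# Madison's rates fit these counts far better; Bayes' rule turns that likelihood
# gap into a posterior heavily favoring Madison.
```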

This begins to change in the late 1980s. Imaging from industrial automation, from medical diagnostics, and from the military was producing blurry images, and people wanted to work back from those blurry images to their cause, the original object portrayed. There were a number of techniques floating around at the time: Bayes, of course; Gibbs sampling; Monte Carlo methods; Markov chains; iterative methods.

Alan Gelfand, an American spending his sabbatical in England with Adrian Smith, all of a sudden realized a breakthrough synthesis, putting all of these pieces together into what we call MCMC today: Markov chain Monte Carlo. They worked very fast, because they were afraid everyone else would put the pieces together too. They also wrote very carefully.

They only used the word “Bayes” five times in twelve pages, and I asked Alan Gelfand “Why?” and he said, “There was always some concern about using the B word. It was a natural defensiveness on the part of Bayesians in terms of rocking the boat. We were always an oppressed minority, trying to get some recognition, and even if we thought we were doing things the right way, we were only a small component of the statistical community, and we didn’t have much outreach into the scientific community.”
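MCMC draws samples from a posterior by wandering through parameter space. Gelfand and Smith’s synthesis centered on the Gibbs sampler; the sketch below instead uses the closely related random-walk Metropolis algorithm on a standard normal target, since it fits in a dozen lines and shows the same idea of sampling a distribution you can only evaluate up to a constant:

```python
import math
import random

# Minimal random-walk Metropolis sampler, one member of the MCMC family.
# Target: a standard normal "posterior", known only up to a constant.

def log_target(x):
    return -0.5 * x * x               # log-density of N(0, 1), up to a constant

def metropolis(steps=50_000, step_size=1.0, seed=0):
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        # Accept with probability min(1, target(proposal) / target(x)).
        accept_prob = math.exp(min(0.0, log_target(proposal) - log_target(x)))
        if rng.random() < accept_prob:
            x = proposal
        samples.append(x)             # a rejected proposal repeats the old state
    return samples

samples = metropolis()
mean = sum(samples) / len(samples)
print(round(mean, 2))                 # close to 0, the true posterior mean
```

The histogram of `samples` approximates the posterior; in real problems the same loop runs with a log-posterior that no one could integrate by hand.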

Bayesians thought their paper was an epiphany. Using the powerful new desktop workstations that became available at the same time, and the off-the-shelf BUGS programs developed by the Bayesian David Spiegelhalter, Bayesians indulged in what they refer to as a ten-year frenzy of research. They could finally, after two and a half centuries, calculate really complex real-world problems. Outsiders poured in from artificial intelligence, from computer science, from physics. Bayes was broadened and refreshed, depoliticized, secularized, and accepted almost overnight, as scientific revolutions go, because it became pragmatic and useful. Prominent frequentists even moderated their positions.

Bradley Efron, a National Medal of Science winner, who had written a classic defense of frequentism, recently said, “I’ve always been a Bayesian.” [laughter]

Thank you. [applause]

[Q&A begins]

Rick Schwall: Hi, I’m Rick Schwall of Saving Humanity from Homo Sapiens.

It occurs to me that one solution could have been simply to replace the Bayes name and say, “We have decided to use the Laplace rule.” Anybody who was a historian would know, “Oh yeah, Laplace again,” and that would have gotten around the security restrictions, because who in the government is going to be bright enough to realize, “He’s talking about Bayes’ rule, and we classified that because it works”? Thank you.

Sharon McGrayne: That’s a very clever question. Until about 50 years ago, Bayes’ rule was known through Laplace’s work. Laplace is now known mostly for the Laplace transform, but he was a French mathematician who, in the mid-to-late 1700s and early 1800s, mathematized every field of science known to his era, and he worked on Bayes over 30 or 40 years.

But this idea of avoiding the word “Bayes” actually occurred to Hans Bühlmann. He is the Swiss statistician who works particularly in insurance theory, and becomes president of ETH Zürich. He’s the one who said after a stay at Berkeley that it was really kind of dangerous to talk about Bayes in the U.S. He goes back to Europe and writes some very important essays, but he avoids using the word “Bayes.” He cooks up some generic term, very blah, and he thinks that helped the continent avoid this British-North-American furor over Bayes’ rule, by avoiding the name just as you suggested.

[next question]

Man 2: Could you walk us through a simplified Bayesian decision process? So we can get a concrete example of what it is.

Sharon McGrayne: I’m putting together some simple problems that will show important parts of Bayes’ rule. I’m putting them on my website, and eventually they’ll be in the book, either in the e-book immediately or in the paperback.

Basically, if you don’t have much data, your prior is going to become very important. The Bayesian literature is filled with stories about people in bars, or about black and white balls in urns. I’ll do the bar one.

Someone comes into a bar and says, “I’m going to flip this coin,” and you’re going to figure out whether it’s a fair coin or a false one. He keeps flipping and flipping. Your idea of whether he’s honest affects the probability you assign: if you think he’s slimy, you start with one prior that the coin is fair; if you think he’s honest, another. That subjectivity, the fact that the prior can affect the conclusion, is what enraged people.

That’s not a good explanation; it’s much better on my webpage, it’s really nice there. Albert Madansky actually suggested it to me, because he didn’t think the example used in the New York Times book review showed the crux of the issue, so he came up with a really nice one.
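The bar story can be run as code. Two observers with very different priors about the stranger watch the same flips; the assumption that a crooked coin lands heads 80% of the time is mine, invented for illustration:

```python
# Sequential Bayesian updating for the coin-in-the-bar example. "Crooked"
# means a coin assumed to land heads 80% of the time (an invented figure).

def update(prior_crooked, flip):
    """Posterior probability the coin is crooked after seeing one flip."""
    p_flip_if_crooked = 0.8 if flip == "H" else 0.2
    p_flip_if_fair = 0.5
    numerator = p_flip_if_crooked * prior_crooked
    return numerator / (numerator + p_flip_if_fair * (1 - prior_crooked))

flips = "HHHHHHHH"                  # a suspicious run of heads
trusting, suspicious = 0.1, 0.9     # two observers' priors that the coin is crooked
for flip in flips:
    trusting = update(trusting, flip)
    suspicious = update(suspicious, flip)

print(round(trusting, 3), round(suspicious, 3))
# After a few flips the two answers differ sharply; after a long run of heads
# both observers converge on "crooked", as the data wash out the priors.
```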

[next question]

Steve Kilpatrick: Hi, I’m Steve Kilpatrick [?]. As an Irishman I appreciate your bar analogy. [laughter] Makes the Bayes theory much easier to understand.

Sharon McGrayne: At least it’s not urns filled with black and white balls.

Steve Kilpatrick: Is Bayes being used right now to solve any problems that we might all be familiar with?

Sharon McGrayne: With Air France 447, they were able to retrieve the remains of 100 people, take them back to France, and give the remains to their families. That’s pretty powerful. They found the pinger the size of a cigar and retrieved it. They found that the reason for the enormous search was that the pingers had been damaged in the crash. When the plane hit, it hit in the back first, which absorbed most of the energy of the crash, and the pingers and the black boxes were back there, so the pingers never worked. What can I say, I think for those families it was... And for people like us who take planes all the time... They’ve come up with a lot of safety resolutions that I hope will make things much better.

[next question]

Man 4: Another important application of Bayes is in clinical trials. I think if Bayes’ theorem were more widely applied in adaptive clinical trials, we’d get a lot better data and have a much lower failure rate, because it’s hard enough to develop drugs as it is.

The MC: Any thought on how Bayes and frequentist statistics feed into medical trials today?

Sharon McGrayne: That’s been quite slow. Bayes hit the medical exams quite early, in the 1980s I think. They put some Bayesian probability problems on the exam, and one of the most-read series of articles in the Annals of Internal Medicine involved how to use Bayes to pass these tests. But in clinical trials, it’s been much slower.

[next question]

Man 5: If you look at the Autonomy [?] literature, they claim that their intelligent search and other things are based almost completely on Bayes’ theorem. It’s British work [?]; the company CEO who invented the technique got a lordship [?] out of it, and now they have [?] this honour of having been bought: a stock-market company valued at about 4 billion dollars was bought for 10 billion dollars by HP. So I don’t know whether that’s a plus or a minus, but here’s some more Bayesian for the...

Sharon McGrayne: A lot of artificial intelligence does not use that prior; it eliminates it, and that is still a cause of controversy among Bayesian theorists. They range across a spectrum: some think that for Bayes’ rule to be pure, you must have the prior; others say you don’t need it; and others say whatever works is okay.

[final question]

Man 6: Any thoughts on why it works? Like, I’ve got no knowledge about anything; I’m just going to incorporate my random guess about a particular probability into an equation. What are the philosophical underpinnings of why that produces a better answer?

Sharon McGrayne: It took me seven or eight years to write this book, and what kept me going all that time was answering your question. Many people see this as the natural way of learning. You start out with an idea, but you modify it as you learn more and get more data. We live in such a dogmatic age that to say we can have an initial idea about a situation, but are committed to updating it as each piece of new information arrives... I found that very congenial.

There was the British economist John Maynard Keynes who said... A knife... [video and audio end]