Bayes tutorial

Based on book: Stone JV, Bayes’ Rule: A Tutorial Introduction to Bayesian Analysis, 2013.

Bayes’ Rule: A Tutorial Introduction

by JV Stone

1 Introduction

All decisions are based on data, but the best decisions are also based on previous experience. In essence, Bayes’ rule provides a method for making use of previous experience in order to arrive at the best decision in interpreting data.

The example given below is based on speech, but Bayes’ rule can be applied to any situation where there is uncertainty regarding the value of a measured quantity (eg how much light is hitting each part of your retina, the identity of the person you are speaking to). As all measured quantities inevitably include some uncertainty, Bayes’ rule has a wide range of applicability.

If you walked into a hardware store and said, “Have you got four candles?” then you would probably be surprised to be asked “How many fork handles do you want?” (see Comedy clip on YouTube by Two Ronnies)*. Even though the two phrases:

1) “Have you got four candles?”,

2) “Have you got fork handles?”,

are acoustically almost identical, the shop assistant knows that he sells many more candles than fork handles. This, in turn, means that he probably does not even hear the words “fork handles”, but instead hears “four candles”. What has this got to do with Bayes’ rule?

Figure 1: The Reverend Thomas Bayes trying to make sense of a London accent.

The acoustic data that correspond to the sounds spoken by you, the customer, are equally consistent with two interpretations, but the assistant’s brain assigns a much higher weighting to one of these interpretations. This weighting is based on prior experience: the assistant knows that customers are much more likely to ask for candles than handles. In other words, the experience of the assistant allows him to hear what was most probably said by the customer, even though the acoustic signal uttered by the customer was ambiguous. Without knowing it, the assistant has applied Bayes’ rule (or something that approximates Bayes’ rule) in order to hear what the customer most probably said. Here’s how.

2 Conditional Probability, Likelihood, and Asking the Wrong Question

If we define the two possible phrases as

phrase1 = “fork handles”

phrase2 = “four candles”,

then we can formalise this scenario by considering the probability of the acoustic data given each of the two possible phrases. As both phrases are equally consistent with the acoustic data, the probability of the data is the same in both cases. That is, the probability of the data given that “four candles” was spoken is the same as the probability of the data given that “fork handles” was spoken. In both cases, the probability of the acoustic data depends on the words spoken, and this dependence is made explicit as two probabilities:

the probability of the acoustic data given that “four candles” was spoken,

the probability of the acoustic data given that “fork handles” was spoken.

A short-hand way of writing these is:

p(data | “four candles”),

p(data | “fork handles”), … (1)

where p stands for probability, and the vertical bar | stands for “given that”. These are known as conditional probabilities because each probability is conditional (depends) on something else (in this case, which phrase was spoken). More specifically, these particular conditional probabilities are known as likelihoods. For reasons that will become clear later, the expression p(data | “four candles”) is interpreted as “the likelihood that the phrase spoken was ‘four candles’”.

We have already established that, to all intents and purposes, the two likelihoods are equal, that is:

p(data | “four candles”) = p(data | “fork handles”) … (2)

As the data is consistent with both phrases, let’s assume that the likelihoods are both 0.8:

p(data | “four candles”) = 0.8

p(data | “fork handles”) = 0.8 … (3)

This gives a clear indication of the intrinsic ambiguity of many acoustic signals. Knowing these two likelihoods is insufficient to decide what the customer said, not only because (in this instance) these likelihoods are equal, but for a more general reason. Indeed, these likelihoods provide an answer, but it is an answer to the wrong question. The likelihoods above provide an answer to the question:

“What is the probability of the observed acoustic data given that each of the two possible phrases was spoken?”.

3 Asking the right question: Posterior probability

The right question, the question to which we (and our brains) would really like an answer is:

“What is the probability that each of the two possible phrases was spoken given the observed acoustic data?”

The answer to this, the right question, is implicit in two new conditional probabilities, the posterior probabilities

p(“four candles” | data),

p(“fork handles” | data) … (4)

Notice the subtle, but important, difference between the pairs of equations (1) and (4). Equations (1) tell us the likelihoods, the probability of the data given each of the two possible phrases, which turns out to be the same for both phrases in this example. Equations (4) tell us the posterior probabilities, the probability of each phrase given the acoustic data.

Crucially, each likelihood tells us the probability of the data given a particular phrase, but takes no account of how often that phrase has been given (ie has been encountered) in the past. In contrast, the posterior depends, not only on the data (in the form of the likelihood), but also on how frequently each phrase has been encountered in the past; that is, on prior experience.

As the likelihood depends only on the data, it is easier to evaluate than the posterior, which depends on the data and on previous experience. For the sake of brevity, let’s assume that we, and the assistant’s brain, can evaluate the likelihood.

So, what we want is the posterior, but what we have is the likelihood. Fortunately, Bayes’ rule provides a means of getting from the likelihood to the posterior by making use of extra knowledge in the form of prior experience.

4 Prior probability

Let’s suppose that the assistant has been asked for four candles a total of 90 times in the past, whereas he has been asked for fork handles only 10 times. To keep matters simple, let’s also assume that the next customer will ask either for four candles or fork handles (we’ll revisit this simplification later). Thus, before the customer has uttered a single word, the assistant estimates that the probability that the customer will say each of the two phrases is

p(“four candles”) = 90/100 = 0.9

p(“fork handles”) = 10/100 = 0.1 … (5)

These two prior probabilities represent the prior knowledge of the assistant, based on his previous experience of what customers buy.
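The arithmetic behind equations (5) can be sketched in a few lines of Python. This is only an illustration, not part of the original report: the function name priors_from_counts and the dictionary layout are assumptions made for this example, and the counts are the 90 and 10 requests quoted above.

```python
# Past requests, as quoted in the text (90 for candles, 10 for handles).
counts = {"four candles": 90, "fork handles": 10}


def priors_from_counts(counts):
    """Turn raw counts into relative frequencies that sum to one.

    Hypothetical helper for this example: it assumes the listed phrases
    are the only ones a customer can say.
    """
    total = sum(counts.values())
    return {phrase: n / total for phrase, n in counts.items()}


priors = priors_from_counts(counts)
for phrase, p in priors.items():
    print(f"p({phrase!r}) = {p:.1f}")
# p('four candles') = 0.9
# p('fork handles') = 0.1
```

Because the counts are simply divided by their total, the priors always sum to one, which is what the assumption of only two possible phrases requires.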

When confronted with an acoustic signal that is consistent with either of the two interpretations, the assistant ‘naturally’ interprets this as “four candles”, because, according to his past experience, this is what such ambiguous acoustic data usually mean in practice. To put it another way, his brain takes the two equal likelihood values, and assigns a weighting to each one, a weighting that depends on past experience.

5 Bayesian inference

Figure 2. A schematic account of Bayes’ rule.

One obvious way to implement this weighting is to simply multiply each likelihood by its corresponding prior probability, to estimate the posterior probability

p(“four candles”|data) = p(data|“four candles”) x p(“four candles”)

p(“fork handles”|data) = p(data|“fork handles”) x p(“fork handles”) …(6)

If we put the likelihood and prior probability values defined in equations (3) and (5) in (6) then we obtain

p(“four candles”|data) = p(data|“four candles”) x p(“four candles”)

= 0.8 x 0.9

= 0.72 … (7)

p(“fork handles”|data) = p(data|“fork handles”) x p(“fork handles”)

= 0.8 x 0.1

= 0.08. … (8)

As these two posterior probabilities represent the answer to the right question, we can now see that the probability that the customer said “four candles” is 0.72 whereas the probability that the customer said “fork handles” is 0.08 (strictly, these two values are proportional to the posterior probabilities, because equations (6) omit the constant p(data) described in Section 6, which is why they do not sum to one). As “four candles” is associated with the highest value of the posterior probability, it is known as the maximum a posteriori (MAP) estimate of the phrase that was spoken.
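Equations (6) to (8) and the choice of the MAP estimate can be reproduced with a short Python sketch. The variable names (likelihoods, priors, weighted, map_estimate) are assumptions chosen for this illustration; the numbers are those of equations (3) and (5).

```python
# Likelihoods from equations (3) and priors from equations (5).
likelihoods = {"four candles": 0.8, "fork handles": 0.8}
priors = {"four candles": 0.9, "fork handles": 0.1}

# Equations (6): weight each likelihood by its prior probability.
weighted = {h: likelihoods[h] * priors[h] for h in likelihoods}

# The MAP estimate is the phrase with the largest weighted value.
map_estimate = max(weighted, key=weighted.get)

for h, value in weighted.items():
    print(f"{h}: {value:.2f}")
print("MAP estimate:", map_estimate)
# four candles: 0.72
# fork handles: 0.08
# MAP estimate: four candles
```

Note that the likelihoods alone could not separate the two phrases here; the difference between 0.72 and 0.08 comes entirely from the prior.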

We have just applied Bayes’ rule. The result is our best guess at which phrase was spoken given an ambiguous acoustic signal, and the process of arriving at it is known as Bayesian inference. The line of reasoning given above tells us that Bayes’ rule seems to implement a plausible strategy, but plausibility is not mathematical proof. That proof is not given here (even though it is not especially complicated), but it was first published posthumously in a paper by the Reverend Thomas Bayes in 1763 (the modern form of Bayes’ rule was actually derived by Laplace).

6 Bayes’ rule in full

If we define h as a specific hypothesis (eg the phrase h=“four candles”) and d as some data (eg d=acoustic data) then we can write the full form of Bayes’ rule as

p(hypothesis|data) = p(data|hypothesis) x p(hypothesis) / p(data).

or (more succinctly)

p(h|d) = p(d|h) x p(h) / p(d) … (9)

where p(h|d) is the posterior probability, p(d|h) is the likelihood, p(h) is the prior probability of the hypothesis under consideration, and p(d) is the probability of the data, also known as the evidence. This term is a constant because it refers to the probability of observing the data in the first place. As there is only one observed set of data, its probability does not affect the particular h that yields the highest value of the posterior, and so its value is usually set to unity (one) for convenience. Its value can be obtained by marginalising over the hypotheses (summing the product of likelihood and prior over every possible h), but this is a detail too far for this tutorial.
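For readers who do want to see the evidence term in action, the following Python sketch computes p(d) by summing likelihood x prior over the hypotheses and then applies equation (9). It is a minimal illustration that assumes, as the tutorial does, that “four candles” and “fork handles” are the only possible phrases; the function name bayes_rule is hypothetical.

```python
def bayes_rule(likelihoods, priors):
    """Return (posteriors, evidence) for a finite set of hypotheses.

    Assumes the hypotheses listed in `priors` are exhaustive and
    mutually exclusive, so that p(d) is the sum of p(d|h) x p(h).
    """
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(joint.values())  # p(d)
    posteriors = {h: j / evidence for h, j in joint.items()}  # equation (9)
    return posteriors, evidence


likelihoods = {"four candles": 0.8, "fork handles": 0.8}  # equations (3)
priors = {"four candles": 0.9, "fork handles": 0.1}       # equations (5)

posteriors, evidence = bayes_rule(likelihoods, priors)
print(f"p(d) = {evidence:.2f}")
for h, p in posteriors.items():
    print(f"p({h!r} | d) = {p:.2f}")
# p(d) = 0.80
# p('four candles' | d) = 0.90
# p('fork handles' | d) = 0.10
```

Dividing by p(d) rescales both values by the same amount, so the posteriors now sum to one but the ranking of the hypotheses, and hence the MAP estimate, is unchanged.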

7 Maximum likelihood estimate (MLE)

At this point, we will back up a little to consider maximum likelihood estimation. If the likelihoods are not equal then the best guess at the value of h (“four candles” or “fork handles”) can be based on those likelihood values. For example, let’s assume that likelihoods are

p(d|h=“four candles”) = 0.6

p(d|h=“fork handles”) = 0.8 … (10)

That is, the acoustic data is slightly more consistent with the phrase h=“fork handles” than the phrase h=“four candles”. If we have no prior experience then we effectively do not have a prior to help us decide which h is more probable. In such cases, we are forced to rely on the data alone, and to use the likelihood values to choose one of the possible hypotheses (phrases). In this example, we would choose h=“fork handles”. As this is the value of h associated with the maximum value of the likelihood, it is known as the maximum likelihood estimate (MLE) of the phrase that was spoken.

Notice that a decision based on the MLE can be over-ruled if we have access to prior probabilities. For example, if the prior probabilities in Equations (5) and the unequal likelihoods in (10) were to be substituted in (6), giving 0.6 x 0.9 = 0.54 for “four candles” and 0.8 x 0.1 = 0.08 for “fork handles”, then the decision (fork handles) based on the MLE would be over-ruled by the decision (four candles) based on the MAP. The decision based on the MAP is based on both data and on prior experience, so it seems intuitively obvious that this decision should be better than one (the MLE) based on the data alone, and indeed this is provably true.
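The disagreement between the two estimates can be checked with a small Python sketch. It uses the unequal likelihoods of equations (10) and the priors of equations (5); the variable names are assumptions made for this illustration.

```python
# Unequal likelihoods from equations (10) and priors from equations (5).
likelihoods = {"four candles": 0.6, "fork handles": 0.8}
priors = {"four candles": 0.9, "fork handles": 0.1}

# MLE: choose the phrase with the largest likelihood (data alone).
mle_estimate = max(likelihoods, key=likelihoods.get)

# MAP: choose the phrase with the largest likelihood x prior, as in (6).
weighted = {h: likelihoods[h] * priors[h] for h in likelihoods}
map_estimate = max(weighted, key=weighted.get)

print("MLE estimate:", mle_estimate)
print("MAP estimate:", map_estimate)
# MLE estimate: fork handles
# MAP estimate: four candles
```

Here the weighted values are 0.54 for “four candles” and 0.08 for “fork handles”, so a prior of 0.9 is more than enough to outweigh the modest advantage that the data give to “fork handles”.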

Notes

Note 1: The term “prior” seems to suggest previous experience, but it can be interpreted to mean any information that is not based on the current data. So the prior can be based on instinct, or just a guess at what the underlying probabilities of various hypotheses are.

Note 2: The likelihood, prior and posterior are assumed to be one of a finite number of values in the above example. More generally, each of these can be derived from a probability density function (pdf). However, the logic that underpins Bayes’ rule is the same whether we are dealing with probabilities or probability densities.

Note 3: Bayes’ rule is also called “Bayes’ theorem”. A “theorem” is a mathematical statement which has been proven to be true (ie Bayes’ theorem is, by definition, true).

Note 4: A London accent effectively removes the h sound from words like handle, so “fork handles” sounds just like “four candles” (think of a Michael Caine accent rather than a Hugh Grant accent).

Note 5: Two lectures on Bayesian perception, which are part of a third year lecture course in psychology at Sheffield University, can be downloaded from here.

L07_bayes_v2SmallJVStone.pdf

L08_bayes_v2SmallJVStone.pdf

*See YouTube clip of four candles, a comedy sketch by The Two Ronnies (Barker and Corbett) which inspired the example used here.

JV Stone 11th December 2011.

Email: j.v.stone [at] sheffield.ac.uk

http://jim-stone.staff.shef.ac.uk/

Please cite this document as:

Stone, JV, “Bayes’ Rule: A Tutorial Introduction”, University of Sheffield, Psychology Technical Report Number 31417, January 2012.
