Assume we are handed a coin with a fixed but unknown probability (called ‘theta’) of returning heads.

(def coin {:theta 0.6})

(Pretend we don’t know :theta).

We must infer the value of theta using only:

  1. The freedom to flip the coin as many times as we like,
  2. Our superawesome probabilistic inference skills.

(defn flip-coin [coin]
  (if (< (rand) (:theta coin))
    :heads :tails))

(take 3 (repeatedly #(flip-coin coin)))

=> (:tails :heads :tails)

Let’s talk about what we know:

  1. By assumption, $P(\text{heads} \mid \theta) = \theta$ and $P(\text{tails} \mid \theta) = 1 - \theta$.
  2. Thus, the probability of a particular sequence of coin flips $D$ is $P(D \mid \theta) = \theta^{h}(1 - \theta)^{t}$, where $h$ and $t$ are the number of heads and tails in $D$ (sketched in code below).
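
To make this concrete, here’s that likelihood in Clojure (a sketch; the name likelihood is mine, not part of the original code):

(defn likelihood [theta flips]
  ;; P(flips | theta) for an i.i.d. sequence of flips:
  ;; theta^heads * (1 - theta)^tails
  (let [heads (count (filter #{:heads} flips))
        tails (count (filter #{:tails} flips))]
    (* (Math/pow theta heads)
       (Math/pow (- 1 theta) tails))))

(likelihood 0.6 [:tails :heads :tails])   ;; ~0.096, i.e. 0.6^1 * 0.4^2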

Now consider what we want: $P(\theta \mid D)$, our beliefs about theta given the flips we have observed.

Versus what we have: $P(D \mid \theta)$, the probability of the flips given a particular theta.

This looks like an inverse probability problem for which we can invoke Bayes’ theorem.

Bayes’ theorem expresses the fact that either of two conditional distributions over the same set of variables can be calculated from the other by:

  1. Recovering the joint distribution from either of the two conditionals, and then
  2. Renormalizing the joint to obtain the other conditional

For example, $P(D \mid \theta)$ can be used to calculate $P(\theta \mid D)$ by:

  1. Recovering the joint: $P(\theta, D) = P(D \mid \theta)\,P(\theta)$
  2. And then renormalizing: $P(\theta \mid D) = \frac{P(\theta, D)}{\sum_{\theta} P(\theta, D)}$

Thus, let’s invoke Bayes’ theorem:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$

Well, what do we do about $P(\theta)$ and $P(D)$? The first thing to note is that

$$P(D) = \sum_{\theta} P(D \mid \theta)\,P(\theta)$$

is constant over theta (since it’s a summation over all possible values of theta) and can therefore be ignored. (Note the summation should actually be an integral since theta is continuous, but we are all computer scientists here.)
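
Being computer scientists, we can also check this numerically with a brute-force grid approximation: score each candidate theta by its likelihood (here under a uniform prior, so the prior term drops out) and renormalize. A sketch reusing the likelihood function from above; the grid resolution and the name grid-posterior are my choices:

(defn grid-posterior [flips]
  ;; Approximate P(theta | flips) on a discrete grid of 99 thetas,
  ;; assuming a uniform prior over theta.
  (let [thetas (map #(/ % 100.0) (range 1 100))
        scores (map #(likelihood % flips) thetas)
        z      (reduce + scores)]               ;; the P(D) normalizer
    (zipmap thetas (map #(/ % z) scores))))

(apply max-key val (grid-posterior [:tails :heads :tails]))   ;; peaks at theta = 0.33, near the MLE of 1/3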

That leaves us with $P(\theta)$. This encodes our prior beliefs about the probable value of theta. We will, by (reasoned) fiat, use a beta distribution for $P(\theta)$: $\mathrm{Beta}(\theta; a, b) \propto \theta^{a-1}(1-\theta)^{b-1}$. The principal motivation for this is that a beta prior multiplied by the probability of a particular coin toss sequence (a Bernoulli likelihood) yields another beta distribution: $\mathrm{Beta}(a, b)$ becomes $\mathrm{Beta}(a + h, b + t)$ after observing $h$ heads and $t$ tails. In other words, with a beta prior the posterior is also a beta, allowing for easy math.
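
In code, the conjugate update is just counting (a sketch; update-beta is my name, and Beta(1, 1) is the uniform prior):

(defn update-beta [[a b] flip]
  ;; Beta(a, b) prior + one observed flip => Beta posterior:
  ;; a head increments a, a tail increments b.
  (if (= flip :heads)
    [(inc a) b]
    [a (inc b)]))

(reduce update-beta [1 1] (repeatedly 100 #(flip-coin coin)))

=> [63 39]

(Your counts will vary with the random flips; here 62 of 100 came up heads, giving a Beta(63, 39) posterior.)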

The following graphic (read left to right, top to bottom) illustrates this live updating process as a coin is flipped. The true value of theta is the green line and the letters represent having just observed a Head or Tail:

Notice that in the first three frames we happened to flip three heads in succession. This pushes our beliefs about theta strongly to the right. However, as soon as we observe a tail, our beliefs become much less extreme and more spread out over the space, although we remain confident that theta lies to the right of its true value. By the end of the sequence our most confident beliefs reasonably align with the true value of theta.
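
We can replay that same story at the REPL with the update-beta sketch from above (the flip sequence is hardcoded to match the frames described):

(reductions update-beta [1 1] [:heads :heads :heads :tails])

=> ([1 1] [2 1] [3 1] [4 1] [4 2])

Beta(4, 1) piles almost all of its mass toward theta = 1; the single tail pulls us back to the far less extreme Beta(4, 2).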

To predict the next flip, we could simply take the max over the posterior (the MAP estimate), or we could sum over all values of theta weighted by their posterior probability. The latter is the “bayesian” way of doing it:

$$P(\text{heads} \mid D) = \int_0^1 \theta \, P(\theta \mid D) \, d\theta$$
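
For a Beta(a, b) posterior that weighted sum collapses to the posterior mean, $\frac{a}{a+b}$. A sketch (predict-heads is my name), fed the posterior accumulated by update-beta earlier:

(defn predict-heads [[a b]]
  ;; Posterior predictive probability of heads:
  ;; the mean of the Beta(a, b) posterior.
  (/ a (+ a b)))

(predict-heads [63 39])   ;; => 21/34, about 0.62, reassuringly close to :theta 0.6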

And that is how one analyzes a coin, bayesianally.