1 of 154

Bayesian Statistics

Made Simple

Allen B. Downey

Olin College

sites.google.com/site/simplebayes

2 of 154

Follow along at home

sites.google.com/site/simplebayes

3 of 154

The plan

From Bayes's Theorem to Bayesian inference.

A computational framework.

Work on example problems.

4 of 154

Goals

At 4:40, you should be ready to:

  • Work on similar problems.
  • Learn more on your own.

5 of 154

Think Bayes

This tutorial is based on my book,

Think Bayes

Bayesian Statistics in Python

Published by O'Reilly Media

and available under a

Creative Commons license from

thinkbayes.com

6 of 154

Probability

p(A): the probability that A occurs.

p(A|B): the probability that A occurs, given that B has occurred.

p(A and B) = p(A) p(B|A)

7 of 154

Bayes's Theorem

By definition of conjoint probability:

p(A and B) = p(A) p(B|A)    (1)

p(B and A) = p(B) p(A|B)

Equate the right-hand sides:

p(B) p(A|B) = p(A) p(B|A)    (2)

Divide by p(B) and ...

8 of 154

Bayes's Theorem

p(A|B) = p(A) p(B|A) / p(B)

9 of 154

Bayes's Theorem

One way to think about it:

Bayes's Theorem is an algorithm to get

from p(B|A) to p(A|B).

Useful if p(B|A), p(A) and p(B) are easier

than p(A|B).

OR ...

10 of 154

Diachronic interpretation

H: Hypothesis

D: Data

Given p(H), the probability of the hypothesis before you saw the data.

Find p(H|D), the probability of the hypothesis after you saw the data.

11 of 154

A cookie problem

Suppose there are two bowls of cookies.

Bowl #1 has 10 chocolate and 30 vanilla.

Bowl #2 has 20 of each.

Fred picks a bowl at random, and then picks a cookie at random. The cookie turns out to be vanilla.

What is the probability that Fred picked from Bowl #1?

from Wikipedia

12 of 154

Cookie problem

H: Hypothesis that cookie came from Bowl 1.

D: Cookie is vanilla.

Given p(H), the probability of the hypothesis before you saw the data.

Find p(H|D), the probability of the hypothesis after you saw the data.

13 of 154

Diachronic interpretation

p(H|D) = p(H) p(D|H) / p(D)

p(H): prior

p(D|H): conditional likelihood of the data

p(D): total likelihood of the data

14 of 154

Diachronic interpretation

p(H|D) = p(H) p(D|H) / p(D)

p(H): prior = 1/2

p(D|H): conditional likelihood of the data = 3/4

p(D): total likelihood of the data = 5/8

15 of 154

Diachronic interpretation

p(H|D) = (1/2)(3/4) / (5/8) = 3/5

p(H): prior = 1/2

p(D|H): conditional likelihood of the data = 3/4

p(D): total likelihood of the data = 5/8
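The same arithmetic, as a quick check in plain Python (Python 3 here, no thinkbayes needed):

```python
# Cookie problem by direct arithmetic: prior times likelihood, then normalize.
prior = {'Bowl 1': 1/2, 'Bowl 2': 1/2}
likelihood = {'Bowl 1': 30/40, 'Bowl 2': 20/40}   # p(vanilla | bowl)

unnorm = {h: prior[h] * likelihood[h] for h in prior}
total = sum(unnorm.values())                      # p(D) = 5/8
posterior = {h: p / total for h, p in unnorm.items()}

print(total)                  # 0.625
print(posterior['Bowl 1'])    # 0.6
```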

16 of 154

A little intuition

p(H|D): posterior = 60%

p(H): prior = 50%

Vanilla cookie was more likely under H.

Slightly increases our degree of belief in H.

17 of 154

Computation

Pmf represents a Probability Mass Function

Maps from possible values to probabilities.

Diagram by yuml.me

18 of 154

Install test

How many of you got install_test.py running?

Don't try to fix it now!

Instead...

19 of 154

Partner up

  • If you don't have a working environment, find a neighbor who does.
  • Even if you do, try pair programming!
  • Take a minute to introduce yourself.
  • Questions? Ask your partner first (please).

20 of 154

Icebreaker

What was your first computer?

What was your first programming language?

What is the longest time you have spent finding a stupid bug?

21 of 154

Start your engines

  • You cloned BayesMadeSimple, right?
  • cd into that directory.
  • Start the Python interpreter.

$ python

>>> from thinkbayes import Pmf

22 of 154

Pmf

from thinkbayes import Pmf

# make an empty Pmf

d6 = Pmf()

# outcomes of a six-sided die

for x in [1, 2, 3, 4, 5, 6]:
    d6.Set(x, 1)

23 of 154

Pmf

d6.Print()

d6.Normalize()

d6.Print()

d6.Random()

24 of 154

Style

I know what you're thinking.

25 of 154

The Bayesian framework

1) Build a Pmf that maps from each hypothesis to a prior probability, p(H).

2) Multiply each prior probability by the likelihood of the data, p(D|H).

3) Normalize, which divides through by the total likelihood, p(D).
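The three steps can be sketched without the Pmf class, using a plain dict (the `update` helper below is hypothetical, not part of thinkbayes; Python 3 syntax):

```python
def update(prior, likelihood):
    """Bayesian update: multiply each prior by its likelihood, then normalize.

    prior: dict mapping hypothesis -> p(H)
    likelihood: dict mapping hypothesis -> p(D|H)
    Returns (posterior dict, total likelihood p(D)).
    """
    posterior = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(posterior.values())          # step 3 divisor: p(D)
    for h in posterior:
        posterior[h] /= total
    return posterior, total

# the cookie problem again
post, p_d = update({'Bowl 1': 0.5, 'Bowl 2': 0.5},
                   {'Bowl 1': 0.75, 'Bowl 2': 0.5})
print(post['Bowl 1'], p_d)    # 0.6 0.625
```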

26 of 154

Prior

pmf = Pmf()

pmf.Set('Bowl 1', 0.5)

pmf.Set('Bowl 2', 0.5)

27 of 154

Update

p(Vanilla | Bowl 1) = 30/40

p(Vanilla | Bowl 2) = 20/40

pmf.Mult('Bowl 1', 0.75)

pmf.Mult('Bowl 2', 0.5)

28 of 154

Normalize

pmf.Normalize()

0.625 # return val is p(D)

print pmf.Prob('Bowl 1')

0.6

29 of 154

Exercise

What if we select another cookie, and it’s chocolate?

The posterior (after the first cookie) becomes the prior (before the second cookie).

30 of 154

Exercise

What if we select another cookie, and it’s chocolate?

pmf.Mult('Bowl 1', 0.25)

pmf.Mult('Bowl 2', 0.5)

pmf.Normalize()

pmf.Print()

Bowl 1 0.429
Bowl 2 0.571

31 of 154

Summary

Bayes's Theorem,

Cookie problem,

Pmf class.

32 of 154

The dice problem

I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die and a 20-sided die.

Suppose I select a die from the box at random, roll it, and get a 6. What is the probability that I rolled each die?

33 of 154

Hypothesis suites

A suite is a mutually exclusive and collectively exhaustive set of hypotheses.

Represented by a Suite that maps each hypothesis to its probability.

34 of 154

Suite

class Suite(Pmf):
    "Represents a suite of hypotheses and their probabilities."

    def __init__(self, hypos):
        "Initializes the distribution."
        for hypo in hypos:
            self.Set(hypo, 1)
        self.Normalize()

35 of 154

Suite

def Update(self, data):
    "Updates the suite based on data."
    for hypo in self.Values():
        like = self.Likelihood(data, hypo)
        self.Mult(hypo, like)
    self.Normalize()

self.Likelihood?

36 of 154

Suite

Likelihood is an abstract method.

Child classes inherit Update,

provide Likelihood.

37 of 154

Likelihood

Outcome: 6

  • What is the likelihood of this outcome on a six-sided die?
  • On a ten-sided die?
  • On a four-sided die?

What is the likelihood of getting n on an m-sided die?

38 of 154

Likelihood

# hypo is the number of sides on the die

# data is the outcome

class Dice(Suite):

    def Likelihood(self, data, hypo):
        # write this method!

Write your solution in dice.py

39 of 154

Likelihood

# hypo is the number of sides on the die

# data is the outcome

class Dice(Suite):

    def Likelihood(self, data, hypo):
        if hypo < data:
            return 0
        else:
            return 1.0/hypo

40 of 154

Dice

# start with equal priors

suite = Dice([4, 6, 8, 12, 20])

# update with the data

suite.Update(6)

suite.Print()

41 of 154

Dice

Posterior distribution:

4 0.0

6 0.39

8 0.30

12 0.19

20 0.12

More data? No problem...
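These numbers can be cross-checked with a plain-Python sketch (dicts instead of the Suite class):

```python
# Posterior for the dice problem after rolling a 6, from first principles.
sides = [4, 6, 8, 12, 20]
prior = {m: 1 / len(sides) for m in sides}

# likelihood of rolling a 6 on an m-sided die
like = {m: (0 if m < 6 else 1 / m) for m in sides}

unnorm = {m: prior[m] * like[m] for m in sides}
total = sum(unnorm.values())
posterior = {m: p / total for m, p in unnorm.items()}

for m in sides:
    print(m, posterior[m])    # matches the posterior above
```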

42 of 154

Dice

for roll in [8, 7, 7, 5, 4]:
    suite.Update(roll)

suite.Print()

43 of 154

Dice

Posterior distribution:

4 0.0

6 0.0

8 0.92

12 0.080

20 0.0038

44 of 154

Summary

Dice problem,

Likelihood function,

Suite class.

45 of 154

Recess!

http://ksdcitizens.org/wp-content/uploads/2010/09/recess_time.jpg

46 of 154

Trains

The trainspotting problem:

  • You believe that a freight carrier operates between 100 and 1000 locomotives with consecutive serial numbers.
  • You spot locomotive #321.
  • How many locomotives does the carrier operate?

Modify train.py to compute your answer.

47 of 154

Trains

  • If there are m trains, what is the chance of spotting train #n?

  • What does the posterior distribution look like?

  • How would you summarize it?
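One possible solution, sketched here with plain dicts rather than the train.py scaffolding: if there are m locomotives, the chance of spotting #n is 1/m for m ≥ n, and 0 otherwise.

```python
# Train problem: uniform prior on fleet sizes 100..1000, update on seeing #321.
hypos = range(100, 1001)
n = 321

posterior = {}
for m in hypos:
    like = 1.0 / m if m >= n else 0.0    # p(spotting #n | m locomotives)
    posterior[m] = like                  # uniform prior cancels out

total = sum(posterior.values())
posterior = {m: p / total for m, p in posterior.items()}

# one summary: the posterior mean
mean = sum(m * p for m, p in posterior.items())
print(mean)
```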

48 of 154

Train

print suite.Mean()

print suite.MaximumLikelihood()

print suite.CredibleInterval(90)

49 of 154

Trains

  • What if we spot more trains?

  • Why did we do this example?

50 of 154

Trains

  • Practice using the Bayesian framework, and figuring out Likelihood().

  • Example that uses sparse data.

  • It’s a non-trivial, real problem.

51 of 154

Tanks

52 of 154

Good time for questions

53 of 154

A Euro problem

"When spun on edge 250 times, a Belgian one-euro coin came up heads 140 times and tails 110. 'It looks very suspicious to me,' said Barry Blight, a statistics lecturer at the London School of Economics. 'If the coin were unbiased, the chance of getting a result as extreme as that would be less than 7%.' "

From "The Guardian" quoted by MacKay, Information Theory, Inference, and Learning Algorithms.

54 of 154

A Euro problem

MacKay asks, "But do these data give evidence that the coin is biased rather than fair?"

Assume that the coin has probability x of landing heads.

(Forget that x is a probability; just think of it as a physical characteristic.)

55 of 154

A Euro problem

Estimation: Based on the data (140 heads, 110 tails), what is x?

Hypothesis testing: What is the probability that the coin is fair?

56 of 154

Euro

We can use the Suite template again.

We just have to figure out the likelihood function.

57 of 154

Likelihood

# hypo is the prob of heads (0-100)

# data is a string, either 'H' or 'T'

class Euro(Suite):

    def Likelihood(self, data, hypo):
        # one more, please!

Modify euro.py to compute your answer.

58 of 154

Likelihood

# hypo is the prob of heads (0-100)

# data is a string, either 'H' or 'T'

class Euro(Suite):

    def Likelihood(self, data, hypo):
        x = hypo / 100.0
        if data == 'H':
            return x
        else:
            return 1 - x

59 of 154

Prior

What do we believe about x before seeing the data?

Start with something simple; we'll come back and review.

Uniform prior: any value of x between 0% and 100% is equally likely.

60 of 154

Prior

suite = Euro(range(0, 101))

61 of 154

62 of 154

Update

Suppose we spin the coin once and get heads.

suite.Update('H')

What does the posterior distribution look like?

Hint: what is p(x=0% | D)?

63 of 154

64 of 154

Update

Suppose we spin the coin again, and get heads again.

suite.Update('H')

What does the posterior distribution look like?

65 of 154

66 of 154

Update

Suppose we spin the coin again, and get tails.

suite.Update('T')

What does the posterior distribution look like?

Hint: what's p(x=100% | D)?

67 of 154

68 of 154

Update

After 10 spins, 7 heads and 3 tails:

for outcome in 'HHHHHHHTTT':
    suite.Update(outcome)

69 of 154

70 of 154

Update

And finally, after 140 heads and 110 tails:

evidence = 'H' * 140 + 'T' * 110

for outcome in evidence:
    suite.Update(outcome)

71 of 154

72 of 154

Posterior

  • Now what?
  • How do we summarize the information in the posterior Suite?

73 of 154

Posterior

Given the posterior distribution, what is the probability that x is 50%?

print suite.Prob(50)

And the answer is... 0.021

Hmm. Maybe that's not the right question.

74 of 154

Posterior

How about the most likely value of x?

print suite.MaximumLikelihood()

And the answer is 56%.

75 of 154

Posterior

Or the expected value?

print suite.Mean()

And the answer is 55.95%.

76 of 154

Posterior

Credible interval?

print suite.CredibleInterval(90)

77 of 154

Posterior

The 5th percentile is 51.

The 95th percentile is 61.

These values form a 90% credible interval.
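One way to extract such an interval from any pmf-as-dict is to walk the CDF out to the two tail percentiles. This is a sketch (the `credible_interval` helper is hypothetical, mirroring what CredibleInterval does), applied to the Euro posterior under the uniform prior:

```python
def credible_interval(pmf, percentage=90):
    """Central credible interval: walk the CDF to the two tail percentiles."""
    alpha = (1 - percentage / 100) / 2      # 0.05 for a 90% interval
    total = 0.0
    low = high = None
    for value, prob in sorted(pmf.items()):
        total += prob
        if low is None and total >= alpha:
            low = value
        if high is None and total >= 1 - alpha:
            high = value
    return low, high

# Euro posterior after 140 heads, 110 tails, uniform prior on x = 0..100:
pmf = {x: (x / 100) ** 140 * (1 - x / 100) ** 110 for x in range(101)}
norm = sum(pmf.values())
pmf = {x: p / norm for x, p in pmf.items()}
print(credible_interval(pmf, 90))    # close to (51, 61)
```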

So can we say: "There's a 90% chance that x is between 51 and 61?"

78 of 154

Frequentist response

Thank you smbc-comics.com

79 of 154

Bayesian response

Yes, x is a random variable,

Yes, (51, 61) is a 90% credible interval,

Yes, x has a 90% chance of being in it.

Pro: Bayesian stats lends itself to decision analysis.

Con: The prior is subjective.

80 of 154

The prior is subjective

Remember the prior?

We chose it pretty arbitrarily, and reasonable people might disagree.

Is x as likely to be 1% as 50%?

Given what we know about coins, I doubt it.

81 of 154

Prior

How should we capture background knowledge about coins?

Try a triangle prior.

82 of 154

83 of 154

Posterior

What do you think the posterior distributions look like?

I was going to put an image here, but then I Googled "posterior". Never mind.

84 of 154

85 of 154

Swamp the prior

With enough data, reasonable people converge.

But if any p(Hi) = 0, no data will change that.

86 of 154

Swamp the prior

Priors can be arbitrarily low, but avoid 0.

See wikipedia.org/wiki/Cromwell's_rule

"I beseech you, in the bowels of Christ, think it possible that you may be mistaken."

87 of 154

Summary of estimation

  • Form a suite of hypotheses, Hi.
  • Choose prior distribution, p(Hi).
  • Compute likelihoods, p(D|Hi).
  • Turn off brain.
  • Compute posteriors, p(Hi|D).

88 of 154

Recess!

http://images2.wikia.nocookie.net/__cb20120116013507/recess/images/4/4c/Recess_Pic_for_the_Internet.png

89 of 154

Hypothesis testing

Remember the original question:

"But do these data give evidence that the coin is biased rather than fair?"

What does it mean to say that data give evidence for (or against) a hypothesis?

90 of 154

Hypothesis testing

D is evidence in favor of H if

p(H|D) > p(H)

which is true if

p(D|H) > p(D|~H)

or equivalently if

p(D|H) / p(D|~H) > 1

91 of 154

Hypothesis testing

This term

p(D|H) / p(D|~H)

is called the likelihood ratio, or Bayes factor.

It measures the strength of the evidence.

92 of 154

Hypothesis testing

F: hypothesis that the coin is fair

B: hypothesis that the coin is biased

p(D|F) is easy.

p(D|B) is hard because B is underspecified.

93 of 154

Bogosity

Tempting: we got 140 heads out of 250 spins, so B is the hypothesis that x = 140/250.

But,

  • Doesn't seem right to use the data twice.
  • By this process, almost any data would be evidence in favor of B.

94 of 154

We need some rules

  • You have to choose your hypothesis before you see the data.
  • You can choose a suite of hypotheses, but in that case we average over the suite.

95 of 154

96 of 154

Likelihood

def AverageLikelihood(suite, data):
    total = 0
    for hypo, prob in suite.Items():
        like = suite.Likelihood(data, hypo)
        total += prob * like
    return total

97 of 154

Hypothesis testing

F: hypothesis that x = 50%.

B: hypothesis that x is not 50%, but might be any other value with equal probability.

98 of 154

Prior

fair = Euro()

fair.Set(50, 1)

99 of 154

Prior

bias = Euro()
for x in range(0, 101):
    if x != 50:
        bias.Set(x, 1)
bias.Normalize()

100 of 154

Bayes factor

data = 140, 110

like_fair = AverageLikelihood(fair, data)

like_bias = AverageLikelihood(bias, data)

ratio = like_bias / like_fair
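The whole comparison runs self-contained with dicts. In this sketch, `average_likelihood` is a dict-based stand-in for AverageLikelihood, and the likelihood x^heads · (1-x)^tails matches the (140, 110) data representation described for euro2.py:

```python
def average_likelihood(suite, data):
    """p(D | suite): likelihood averaged over hypotheses, weighted by priors."""
    heads, tails = data
    total = 0.0
    for hypo, prob in suite.items():
        x = hypo / 100
        total += prob * x ** heads * (1 - x) ** tails
    return total

fair = {50: 1.0}                                  # F: x = 50%
bias = {x: 1 / 100 for x in range(101) if x != 50}  # B: any other x, equally likely

data = 140, 110
like_fair = average_likelihood(fair, data)
like_bias = average_likelihood(bias, data)
print(like_bias / like_fair)    # about 0.47, per the next slide
```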

101 of 154

Hypothesis testing

Read euro2.py.

Notice the new representation of the data, and corresponding Likelihood function.

Run it and interpret the results.

102 of 154

Hypothesis testing

And the answer is:

p(D|B) = 2.6 · 10⁻⁷⁶

p(D|F) = 5.5 · 10⁻⁷⁶

Likelihood ratio is about 0.47.

So this dataset is evidence against B.

103 of 154

Fair comparison?

  • Modify the code that builds bias; try out a different definition of B and run again.

bias = Euro()
for x in range(0, 49):
    bias.Set(x, x)
for x in range(51, 101):
    bias.Set(x, 100-x)
bias.Normalize()

104 of 154

Conclusion

  • The Bayes factor depends on the definition of B.

  • Depending on what “biased” means, the data might be evidence for or against B.

  • The evidence is weak either way (between 0.5 and 2).

105 of 154

Summary

Euro problem,

Bayesian estimation,

Bayesian hypothesis testing.

106 of 154

Recess!

http://ksdcitizens.org/wp-content/uploads/2010/09/recess_time.jpg

107 of 154

Word problem for geeks

ALICE: What did you get on the math SAT?

BOB: 760

ALICE: Oh, well I got a 780.  I guess that means I'm smarter than you.

NARRATOR: Really?  What is the probability that Alice is smarter than Bob?

108 of 154

Assume, define, quantify

Assume: each person has some probability, x, of answering a random SAT question correctly.

Define: "Alice is smarter than Bob" means xa > xb.

How can we quantify p(xa > xb)?

109 of 154

Be Bayesian

Treat x as a random quantity.

Start with a prior distribution.

Update it.

Compare posterior distributions.

110 of 154

Prior?

Distribution of raw scores.

111 of 154

Likelihood

def Likelihood(self, data, hypo):
    x = hypo
    score = data
    raw = self.exam.Reverse(score)

    yes, no = raw, self.exam.max_score - raw
    like = x**yes * (1-x)**no
    return like

112 of 154

Posterior

113 of 154

PmfProbGreater

def PmfProbGreater(pmf1, pmf2):
    """Returns the prob that a value from pmf1
    is greater than a value from pmf2."""

114 of 154

PmfProbGreater

def PmfProbGreater(pmf1, pmf2):
    """Returns the prob that a value from pmf1
    is greater than a value from pmf2."""

Iterate through all pairs of values.

Check whether the value from pmf1 is greater.

Add up total probability of successful pairs.

115 of 154

PmfProbGreater

def PmfProbGreater(pmf1, pmf2):
    for x1, p1 in pmf1.Items():
        for x2, p2 in pmf2.Items():
            # FILL THIS IN!

116 of 154

PmfProbGreater

def PmfProbGreater(pmf1, pmf2):
    total = 0.0
    for x1, p1 in pmf1.Items():
        for x2, p2 in pmf2.Items():
            if x1 > x2:
                total += p1 * p2
    return total
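A tiny usage example of the same logic, with dict-based pmfs (`pmf_prob_greater` is a plain-dict adaptation, `.Items()` becomes `.items()`):

```python
def pmf_prob_greater(pmf1, pmf2):
    """Prob that a draw from pmf1 exceeds a draw from pmf2 (dict version)."""
    total = 0.0
    for x1, p1 in pmf1.items():
        for x2, p2 in pmf2.items():
            if x1 > x2:
                total += p1 * p2
    return total

# two fair six-sided dice: p(die1 > die2) = 15/36
d6 = {x: 1 / 6 for x in range(1, 7)}
print(pmf_prob_greater(d6, d6))    # 0.4166...
```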

117 of 154

And the answer is...

Alice: 780

Bob: 760

Probability that Alice is "smarter": 61%

118 of 154

Why this example?

Posterior distribution is often the input to the next step in an analysis.

Real world problems start (and end!) with modeling.

119 of 154

Modeling

  • This result is based on the simplification that all SAT questions are equally difficult.
  • An alternative (in the book) is based on item response theory.

120 of 154

Modeling

  • For most real world problems, there are several reasonable models.
  • The best choice depends on your goals.
  • Modeling errors often dominate.

121 of 154

Modeling

Therefore:

  • Don't mistake the map for the territory.
  • Don't sweat approximations smaller than modeling errors.
  • Iterate.

122 of 154

Recess!

http://images2.wikia.nocookie.net/__cb20120116013507/recess/images/4/4c/Recess_Pic_for_the_Internet.png

123 of 154

Relax

124 of 154

The Red Line problem

Kendall Square to South Station, catch the 4:40 train.

Arrive Kendall/MIT at 4:15, usually arrive South Station by 4:30.

125 of 154

Counting passengers

Few passengers waiting: just missed a train. Wait 7 minutes.

Many passengers: it’s been a while. Wait 2 minutes.

Platform packed: something’s wrong. Taxi.

126 of 154

Three inferences

1. Ahead of time, watch passengers arrive, estimate arrival rate.

2. On the day, count passengers on platform, estimate time since last train.

3. Compute the distribution of remaining time, conditioned on time since last train.

127 of 154

Data

Three weeks of train arrivals at Kendall/MIT between 4pm and 5pm.

128 of 154

Observer bias

More people arrive during longer intervals, so the average interval reported by passengers is longer than the average reported by MBTA.

Example: Alternating 5 and 10 minute delays.

MBTA average: 7.5 minutes

Passenger average: 8.33 minutes
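The length-biased average can be checked directly: weight each interval by its own length, since longer gaps catch proportionally more arriving passengers (a sketch of the alternating 5/10 example):

```python
gaps = [5, 10]   # alternating intervals between trains, in minutes

# average as seen by the MBTA: each interval counts once
mbta_avg = sum(gaps) / len(gaps)                       # 7.5

# average as seen by passengers: each interval weighted by its length
passenger_avg = sum(z * z for z in gaps) / sum(gaps)   # 125/15 = 8.33...
print(mbta_avg, passenger_avg)
```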

129 of 154

Data

Three weeks of train arrivals at Kendall/MIT between 4pm and 5pm.

z = time between trains

zb = time between trains as seen by random arrival

130 of 154

131 of 154

Wait time

z = time between trains

zb = time between trains as seen by random arrival

y = wait time for a random arrival

132 of 154

Example

z: Trains run every 7.5 minutes.

zb: On average, a random passenger arrives during an 8.33-minute interval.

y: How long should the passenger expect to wait?
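For a passenger arriving at a uniformly random time, the expected wait is half the length-biased mean gap, E[z²] / (2 E[z]). A sketch for the alternating 5/10 example:

```python
gaps = [5, 10]   # alternating gaps between trains, in minutes

e_z = sum(gaps) / len(gaps)                   # E[z]   = 7.5
e_z2 = sum(z * z for z in gaps) / len(gaps)   # E[z^2] = 62.5

zb = e_z2 / e_z     # length-biased mean gap: 8.33... minutes
wait = zb / 2       # expected wait y: about 4.17 minutes
print(zb, wait)
```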

133 of 154

134 of 154

The Bayesian part

x = time since the last train

prior x: distribution of x before I see the data.

posterior x: distribution of x after.

135 of 154

Example

When I arrive, I see 15 passengers in my measurement area.

Based on previous observation, I expect 2 passengers to arrive per minute.

How long has it been since the last train?

136 of 154

137 of 154

Prediction

z: time between trains.

x: time since last train.

y: time until the next train.

z = x + y

y = z - x

138 of 154

139 of 154

Decision analysis

Having observed num passengers,

1. Compute the predictive distribution of time until next train, y.

2. Compute the probability that y exceeds 15 minutes.

140 of 154

141 of 154

Summary

Bayesian analysis consistent with intuition.

Predictive distributions guide decision analysis.

  • Just need a cost function.

142 of 154

Almost done!

https://www.surveymonkey.com/s/pycon2014_tutorials

And then some reading suggestions.

143 of 154

Think Bayes

This tutorial is based on my book,

Think Bayes

Bayesian Statistics Made Simple

Published by O'Reilly Media

and available under a

Creative Commons license from

thinkbayes.com

144 of 154

Case studies

  • Euro
  • SAT
  • Red line
  • Price is Right
  • Boston Bruins
  • Paintball
  • Variability hypothesis
  • Kidney tumor growth
  • Geiger counter
  • Unseen species

145 of 154

Think Stats

You might also like

Think Stats

Prob/Stat for Programmers

Published by O'Reilly Media

and available under a

Creative Commons license from

thinkstats.com

146 of 154

More reading

147 of 154

More reading

Davidson-Pilon,

Bayesian Methods

for Hackers

On Github.

148 of 154

More reading (not free)

Howson and Urbach, Scientific Reasoning: The Bayesian Approach

Sivia, Data Analysis: A Bayesian Tutorial

Gelman et al, Bayesian Data Analysis

149 of 154

Need help?

I am always looking for interesting projects.

  • Summer 2014.
  • Sabbatical 2015-16.

150 of 154

151 of 154

Remember

  • The Bayesian approach is a divide and conquer strategy.
  • You write Likelihood().
  • Bayes does the rest.

152 of 154

Where does this fit?

Usual approach:

  • Analytic distributions.
  • Math.
  • Multidimensional integrals.
  • Numerical methods (MCMC).

153 of 154

Where does this fit?

Problem:

  • Hard to get started.
  • Hard to develop solutions incrementally.
  • Hard to develop understanding.

154 of 154

My theory

  • Start with non-analytic distributions.
  • Use background information to choose meaningful priors.
  • Start with brute-force solutions.
  • If the results are good enough and fast enough, stop.
  • Otherwise, optimize (where analysis is one kind of optimization).
  • Use your reference implementation for regression testing.