1 of 137

Get up to speed with Bayesian data analysis in R

Rasmus Bååth

Lund University ❧ castle.io

@rabaath ❧ rasmus.baath@gmail.com

www.sumsar.net

2 of 137

The butler is dead. Who did it?

Model A

Model B

3 of 137

Which is the most likely rate with which we’ll cure zombies?

Model A

Model B

5 % cured

15 % cured

4 of 137

What is the most likely regression line?

Model B

Intercept: 0.69

Slope: 2.86

Model A

Intercept: 4.23

Slope: 2.45

5 of 137

Q1: The butler is dead. Who did it?

Q2: Which is the most likely rate with which we’ll cure zombies?

Q3: What is the most likely regression line?

Given the data we observed which of the models considered is the most probable?

6 of 137

Why Bayes?

  • A general and flexible framework for learning from data.
  • Excels at modeling uncertainty and integrating many different sources of information into the analysis.
  • A theoretical framework to understand statistics and machine learning
  • There are a lot of good tools and packages!

7 of 137

Part 1

Bayesian hello world

Theoretical foundations

Exercise: A Bayesian model from scratch

Part 2

Bayes in practice

More exercises

CausalImpact

rstanarm

8 of 137

Part 1: The Basics of Bayes

9 of 137

Thomas Bayes (1702 ‐ 1761)

print("Hello World")

10 of 137

Which is the most likely rate with which we’ll cure zombies?

11 of 137

A Bayesian model for the proportion of success

  • The data is a vector of successes and failures represented by 1s and 0s.
  • There is an unknown underlying proportion of success.
  • Whether a data point is a success is affected only by this proportion.
  • Prior to seeing any data, any underlying proportion of success is equally likely.
  • Computational method: prop_model()

The result is a probability distribution that represents what the model knows about the underlying proportion of success.
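A minimal sketch of this kind of computation (not the actual prop_model() implementation; the data below are hypothetical):

n_trials    <- 13      # hypothetical: 13 zombies treated
n_successes <- 4       # hypothetical: 4 cured
n_samples   <- 100000
proportion  <- runif(n_samples, min = 0, max = 1)                     # uniform prior
simulated   <- rbinom(n_samples, size = n_trials, prob = proportion)  # simulate data
posterior   <- proportion[simulated == n_successes]                   # keep draws matching the data
hist(posterior)        # what the model knows about the proportion of success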

12 of 137

Which is the most likely rate with which we’ll cure zombies?

13 of 137

Bååth, R. (2068). Curing Zombieism: Now a no-brainer? Advances in Zombiology, 3(2), 15-21. doi:10.1214/aos/1176345338

14 of 137

Thomas Bayes (1702 ‐ 1761)

print("Hello World")

15 of 137

Prior probability distribution

Posterior probability distribution

Prior

Posterior

16 of 137

Theoretical foundations

  • Probability as uncertainty
  • Lots of math
  • Representing probability distributions by samples
  • Bayesian inference

Pierre-Simon Laplace

(1749–1827)

17 of 137

Probability as uncertainty

  • A number between 0 and 1.
  • A statement about certainty / uncertainty.
  • 1 is complete certainty something is the case.
  • 0 is complete certainty something is not the case.
  • 50% / 50% is maximum uncertainty.

18 of 137

Coinflip example

19 of 137

Probability as uncertainty

  • It’s not a frequency
  • It’s neither subjective nor personal
  • It’s not only about yes/no events.

20 of 137

Probability distributions

21 of 137

Probability as uncertainty

  • It’s not a frequency
  • It’s neither subjective nor personal (anymore)
  • It’s not only about yes/no events.
  • It’s not only about 1D quantities

22 of 137

Joint probability distributions

23 of 137

Representing probability distributions

24 of 137

By a plot

25 of 137

By an equation

26 of 137

By enumerating cases

27 of 137

By samples

> n_sixes
 [1] 0 0 0 0 0 0 0 0 1 1 1 1 1
[14] 1 1 1 2 2 2 3

28 of 137

By random samples

> n_sixes
  [1] 0 0 1 2 0 2 2 1 1 0 0 0
 [13] 1 0 1 1 1 3 0 1 2 0 1 0
 [25] 0 0 0 0 2 0 1 1 1 0 2 1
 [37] 1 0 1 1 2 1 1 1 1 1 0 1
 [49] 1 1 1 2 1 5 0 0 0 1 1 1
 [61] 2 0 1 0 1 0 1 1 0 2 0 2
 [73] 0 0 1 2 2 0 1 2 1 1 4 0
 [85] 1 0 1 0 0 0 0 0 1 2 1 1
 [97] 1 1 2 1 1 0 1 1 0 2 0 3

> sum(n_sixes >= 2) /
+   length(n_sixes)
[1] 0.195

> mean(n_sixes)
[1] 0.826

> hist(n_sixes)
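For instance, if n_sixes counted the number of sixes in five rolls of a fair die (an assumption; the slides don’t say), random samples like the ones above could be generated directly in R:

n_sixes <- rbinom(n = 108, size = 5, prob = 1/6)   # 108 simulated experiments
sum(n_sixes >= 2) / length(n_sixes)                # ≈ 0.2: probability of two or more sixes
mean(n_sixes)                                      # ≈ 0.83: average number of sixes
hist(n_sixes)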

29 of 137

By tables of random samples

30 of 137

Why is this relevant?

  • Modern Bayesian computational methods all produce samples.
  • [Samples of data] != [Samples representing probability distributions]
  • Representing probability distributions as samples makes it easier to explain...

31 of 137

Bayesian inference

32 of 137

Bayesian inference in a nutshell

A method for figuring out unobservable quantities

given known facts

that uses probability to describe the uncertainty over what the values of the unknown quantities could be.

Parameters

Models

Future data

Data

33 of 137

34 of 137

We want to know

  • How many visitors/clicks will we get out of 100 shown ads?
  • How good is our ad really?
  • Will we get more than 5 visitors/clicks?

“Ads get clicked on 10% of the time”

35 of 137

36 of 137

37 of 137

A generative model for clicking on ads

proportion_clicks <- 0.1
n_ads_shown <- 100

clicked_on_ad <- c()
for(nth_ad_shown in 1:n_ads_shown) {
  clicked_on_ad[nth_ad_shown] <-
    proportion_clicks > runif(1, min = 0, max = 1)
}

n_visitors <- sum(clicked_on_ad)

n_visitors
[1] 12

38 of 137

A binomial model for clicking on ads

proportion_clicks <- 0.1
n_ads_shown <- 100

n_visitors <- rbinom(
  n = 1,
  size = n_ads_shown,
  prob = proportion_clicks)

n_visitors
[1] 9

39 of 137

Monte Carlo

40 of 137

A binomial model for clicking on ads

proportion_clicks <- 0.1
n_ads_shown <- 100

n_visitors <- rbinom(
  n = 1,
  size = n_ads_shown,
  prob = proportion_clicks)

41 of 137

A binomial model for clicking on ads

proportion_clicks <- 0.1
n_ads_shown <- 100

n_visitors <- rbinom(
  n = 100000,
  size = n_ads_shown,
  prob = proportion_clicks)

mean(n_visitors >= 5)
[1] 0.94

hist(n_visitors)

42 of 137

43 of 137

“Ads gets clicked on 10% of the time”

44 of 137

45 of 137

Prior uncertainty

proportion_clicks <- runif(
  n = 100000,
  min = 0.0, max = 0.2)

n_ads_shown <- 100
n_visitors <- rbinom(
  n = 100000,
  size = n_ads_shown,
  prob = 0.1)

46 of 137

Prior uncertainty

proportion_clicks <- runif(
  n = 100000,
  min = 0.0, max = 0.2)

n_ads_shown <- 100
n_visitors <- rbinom(
  n = 100000,
  size = n_ads_shown,
  prob = proportion_clicks)

hist(n_visitors)

mean(n_visitors >= 5)
[1] 0.75

47 of 137

48 of 137

Observed data: 13 clicks out of 100 shown ads.

49 of 137

Monte Carlo

Bayesian Inference

50 of 137

51 of 137

prior <- data.frame(
  proportion_clicks, n_visitors)

head(prior)
  proportion_clicks n_visitors
1              0.20         20
2              0.07          6
3              0.07          8
4              0.06          6
5              0.01          1
6              0.05          2
…

plot(prior)

52 of 137

Conditioning:

Filtering out the parts of a probability distribution that don’t fulfill a condition

53 of 137

54 of 137

55 of 137

56 of 137

57 of 137

58 of 137

prior <- data.frame(
  proportion_clicks, n_visitors)

posterior <-
  prior[prior$n_visitors == 13, ]

59 of 137

The essence of Bayesian inference

Bayesian inference is conditioning on data, in order to learn about parameter values.

60 of 137

We want to know

  • How good is our ad?
  • How many visitors/clicks will we get out of 100 shown ads?
  • Will we get more than 5 visitors/clicks?

61 of 137

prior <- data.frame(
  proportion_clicks, n_visitors)

posterior <-
  prior[prior$n_visitors == 13, ]

62 of 137

prior <- data.frame(
  proportion_clicks, n_visitors)

posterior <-
  prior[prior$n_visitors == 13, ]

hist(prior$proportion_clicks)

hist(posterior$proportion_clicks)

median(posterior$proportion_clicks)
[1] 0.13

quantile(posterior$proportion_clicks,
  probs = c(0.025, 0.975))
 2.5% 97.5%
0.078  0.19

63 of 137

prior <- data.frame(
  proportion_clicks, n_visitors)

posterior <-
  prior[prior$n_visitors == 13, ]

posterior_n_visitors <- rbinom(
  n = 100000,
  size = n_ads_shown,
  prob = posterior$proportion_clicks)

hist(prior$n_visitors)

hist(posterior_n_visitors)

mean(posterior_n_visitors >= 5)
[1] 0.97

64 of 137

[Diagram: the prior and posterior distributions and their corresponding prior predictive and posterior predictive distributions, linked by Bayesian inference]

65 of 137


66 of 137

Bayesian inference in a nutshell

A method for figuring out unobservable quantities given known facts that uses probability to describe the uncertainty over what the values of the unknown quantities could be.


67 of 137

Who probably did it?

Which % cured is most probable?

5% cured 15% cured

68 of 137

69 of 137

Pierre-Simon Laplace

(1749–1827)

70 of 137

Get up to speed with Bayesian data analysis in R

Exercises part 1

Link to slides: bit.ly/2NvFutj

71 of 137

Two good things with Bayes

  1. You can make any comparisons between groups or data sets.

72 of 137

n_samples <- 100000

n_ads_shown <- 100

proportion_clicks <- runif(n_samples,
  min = 0.0, max = 0.2)

n_visitors <- rbinom(n_samples,
  size = n_ads_shown,
  prob = proportion_clicks)

prior <- data.frame(
  proportion_clicks, n_visitors)

posterior <-
  prior[prior$n_visitors == 9, ]

posterior1 <-
  prior[prior$n_visitors == 9, ]

posterior2 <-
  prior[prior$n_visitors == 13, ]

73 of 137

prop_post <- data.frame(
  ad1 = posterior1$proportion_clicks[1:4000],
  ad2 = posterior2$proportion_clicks[1:4000])

 ad1  ad2
0.16 0.20
0.07 0.13
0.08 0.17
0.05 0.16
0.04 0.16
0.12 0.13
0.08 0.11
   …    …

prop_post$diff <-
  prop_post$ad2 - prop_post$ad1

diff
0.04
0.06
0.09
0.11
0.12
0.00
0.03
…

74 of 137

hist(prop_post$diff)

mean(prop_post$diff > 0)
[1] 0.80

75 of 137

Two good things with Bayes

  • You can make any comparisons between groups or data sets.
  • You can change the prior and include information sources in addition to the data

76 of 137

77 of 137

78 of 137

79 of 137

n_samples <- 100000

n_ads_shown <- 100

proportion_clicks <- runif(n_samples,
  min = 0.0, max = 0.2)

n_visitors <- rbinom(n_samples,
  size = n_ads_shown,
  prob = proportion_clicks)

prior <- data.frame(
  proportion_clicks, n_visitors)

posterior <-
  prior[prior$n_visitors == 9, ]

80 of 137

n_samples <- 100000

n_ads_shown <- 100

proportion_clicks <- rbeta(n_samples,
  shape1 = 5, shape2 = 95)

n_visitors <- rbinom(n_samples,
  size = n_ads_shown,
  prob = proportion_clicks)

prior <- data.frame(
  proportion_clicks, n_visitors)

posterior <-
  prior[prior$n_visitors == 9, ]

81 of 137

Old prior

Informative prior

82 of 137

Two good things with Bayes

  • You can make any comparisons between groups or data sets.
  • You can change the prior and include information sources in addition to the data

83 of 137

We start again 11:00

84 of 137

Get up to speed with Bayesian data analysis in R

Rasmus Bååth

Lund University ❧ castle.io

@rabaath ❧ rasmus.baath@gmail.com

www.sumsar.net

85 of 137

Part 2:

Bayesian Tools

86 of 137

Carl Friedrich Gauß (1777–1855)

The method of least squares (1809)

87 of 137

Part 1

Bayesian hello world

Theoretical foundations

Exercise: A Bayesian model from scratch

Part 2

Bayes in practice

More exercises

CausalImpact

rstanarm

88 of 137

89 of 137

Bayes in practice

  • Requires advanced computational methods
    • Markov chain Monte Carlo, Variational Bayes, etc.
  • Requires that the result of the generative model can be calculated directly, rather than simulated.

n_successes <- rbinom(n = 1000000, size = 100, prob = 0.1)
mean(n_successes == 13)
[1] 0.074

dbinom(x = 13, size = 100, prob = 0.1)
[1] 0.074

Likelihood function

90 of 137

Bayes in practice

  • Requires advanced computational methods
    • Markov chain Monte Carlo, Variational Bayes, etc.
  • Requires that the result of the generative model can be calculated directly, rather than simulated.
  • But! The theory stays the same. It’s still “just” defining a joint probability distribution and conditioning / filtering on the data.

91 of 137

92 of 137

Defining Bayesian models

p_success <- runif(100000,
  min = low, max = high)

x <- rbinom(100000,
  size = n_trials,
  prob = p_success)

Sampling in R

tilde(~)-notation
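A sketch of the same model in tilde notation (a reconstruction from the sampling code above; the slides only name the notation):

p_success ~ Uniform(low, high)
x ~ Binomial(n_trials, p_success)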

93 of 137

Bayesian linear regression


σ: 2.0

intercept: -0.5

slope: 0.9
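In the same tilde notation, the regression model behind these figures could be written roughly like this (a sketch; the priors are placeholders, not from the slides):

y[i] ~ Normal(intercept + slope * x[i], σ)
intercept ~ prior
slope ~ prior
σ ~ prior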

94 of 137

Markov chain Monte Carlo

A class of algorithms that samples from the posterior probability distribution by walking around the parameter space.

σ: 2.0

intercept: ???

slope: ???

95 of 137

Markov chain Monte Carlo

  • A class of algorithms that samples from the posterior probability distribution by walking around the parameter space.
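As a rough illustration (not what Stan itself uses, which is Hamiltonian Monte Carlo), here is a minimal Metropolis sampler for the ad-click model from part 1, assuming 13 clicks out of 100 ads and a Uniform(0, 0.2) prior:

n_ads_shown <- 100
n_clicks <- 13
log_post <- function(p) {
  if (p <= 0 || p >= 0.2) return(-Inf)                 # Uniform(0, 0.2) prior
  dbinom(n_clicks, size = n_ads_shown, prob = p, log = TRUE)
}
n_draws <- 10000
draws <- numeric(n_draws)
current <- 0.1
for (i in 1:n_draws) {
  proposal <- current + rnorm(1, mean = 0, sd = 0.02)  # random-walk step
  if (log(runif(1)) < log_post(proposal) - log_post(current)) {
    current <- proposal                                 # accept the move
  }
  draws[i] <- current                                   # otherwise stay put
}
hist(draws)                                             # samples from the posterior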

96 of 137

rstanarm

  • Implements direct replacements for lm, glm, lmer, glmer, polr, aov, gamm4, etc.
  • Actively developed
  • Uses STAN as the underlying computational engine.

97 of 137

lm(y ~ x)

library(rstanarm)

stan_lm(y ~ x)

98 of 137

glm(y ~ x, family = "poisson")

library(rstanarm)

stan_glm(y ~ x, family = "poisson")

99 of 137

What is the most likely regression line?

100 of 137

head(ice_cream_sales)
  daily_temperature ice_creams
1              15.5         34
2              20.9         94
3              27.9         75
4              14.3         62

lm_fit <- lm(ice_creams ~ daily_temperature,
  data = ice_cream_sales)

summary(lm_fit)
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)         0.6975     9.6615   0.072    0.943
daily_temperature   2.8685     0.4579   6.265 9.87e-08 ***

101 of 137

library(rstanarm)

stan_fit <- stan_lm(ice_creams ~ daily_temperature,
  data = ice_cream_sales,
  prior = R2(0.5, "mean"))

summary(stan_fit)
                  mean  sd  2.5%   25%  50%  75% 97.5%
(Intercept)        1.9 9.6 -17.2  -4.4  2.1  8.3  21.0
daily_temperature  2.8 0.5   1.9   2.5  2.8  3.1   3.7
sigma             18.3 1.9  15.1  16.9 18.1 19.5  22.5

lm_fit <- lm(ice_creams ~ daily_temperature,
  data = ice_cream_sales)

102 of 137

plot(stan_fit, plotfun = "hist")

103 of 137

plot(stan_fit, plotfun = "trace")

104 of 137

A quick Bayesian decision analysis

ice_creams_demand <- posterior_predict(stan_fit,
  newdata = data.frame(daily_temperature = 26))

head(ice_creams_demand)
[1,] 88.31375
[2,] 87.58424
[3,] 83.39902
[4,] 35.84145
[5,] 71.64784
[6,] 78.71173

hist(ice_creams_demand)

105 of 137

A quick Bayesian decision analysis

ice_creams_brought <- 75

posterior_profit <-
  1.8 * pmin(ice_creams_demand,
             ice_creams_brought)

hist(posterior_profit)

mean(posterior_profit)
[1] 121.6

In the exercise: A full decision analysis -- How many ice creams should we bring for optimal profit?
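One sketch of what that fuller decision analysis could look like. The unit price of 1.8 is taken from the code above; the stocking cost per ice cream is a made-up assumption:

unit_price <- 1.8
unit_cost <- 0.5                                     # hypothetical cost per ice cream brought
candidate_stock <- 0:150
expected_profit <- sapply(candidate_stock, function(n_brought) {
  profit <- unit_price * pmin(ice_creams_demand, n_brought) - unit_cost * n_brought
  mean(profit)                                       # average profit over the posterior draws
})
plot(candidate_stock, expected_profit, type = "l")
candidate_stock[which.max(expected_profit)]          # stock level with the highest expected profit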

106 of 137

prophet

  • Made by Facebook
  • “Prophet is a forecasting procedure implemented in R and Python. It is fast and provides completely automated forecasts that can be tuned by hand by data scientists and analysts.”
  • Uses STAN as the underlying computational engine.

107 of 137

prophet’s generative model
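Roughly, prophet models a time series as an additive combination of components, y(t) = g(t) + s(t) + h(t) + ε(t): a trend, periodic seasonality, holiday effects, and noise.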

108 of 137

109 of 137

head(groggbloggen)
          ds     y
1 2015-01-01 1.11
2 2015-01-02 1.43
3 2015-01-03 1.75
4 2015-01-04 1.26
5 2015-01-05 1.26
6 2015-01-06 0.903

m <- prophet(groggbloggen)
future <- make_future_dataframe(m, periods = 365)
forecast <- predict(m, future)

plot(m, forecast)

110 of 137

111 of 137

112 of 137

113 of 137

In the exercise: Add in the effect of holidays and predict how many reported crimes there will be in Chicago today.
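A hedged sketch of how holidays can enter the model (the holidays data frame follows prophet’s documented format; the chicago_crime data frame and the holiday dates are hypothetical):

library(prophet)
holidays <- data.frame(
  holiday = "new_years_day",
  ds = as.Date(c("2016-01-01", "2017-01-01", "2018-01-01")))
m <- prophet(chicago_crime, holidays = holidays)     # chicago_crime: columns ds and y
future <- make_future_dataframe(m, periods = 1)
forecast <- predict(m, future)
tail(forecast[, c("ds", "yhat", "yhat_lower", "yhat_upper")], 1)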

114 of 137

CausalImpact

  • Made by Google
  • “This R package implements an approach to estimating the causal effect of a designed intervention on a time series. For example, how many additional daily clicks were generated by an advertising campaign?”

115 of 137

[Figure: Volkswagen, Allianz, and BMW stock prices over time; source: https://rinaldif.github.io/causal-impact/]

116 of 137

CausalImpact’s generative model

[Diagram: the observed series is modeled as the effect of covariates + a structural time-series component]

117 of 137

[Figure: Volkswagen, Allianz, and BMW stock prices over time; source: https://rinaldif.github.io/causal-impact/]

118 of 137

head(stock_prices)

           volkswagen  bmw allianz
2011-01-02      104.5 42.8    57.4
2011-01-09      102.7 42.5    60.3
2011-01-16       93.1 40.2    61.8
2011-01-23       98.4 41.2    63.4

library(CausalImpact)
pre.period <- as.Date(c("2011-01-03", "2015-09-14"))
post.period <- as.Date(c("2015-09-21", "2017-03-19"))

scandal_impact <- CausalImpact(stock_prices,
  pre.period, post.period)

119 of 137

plot(scandal_impact)

In the exercise: Generate an automated report that describes the causal impact of the emissions scandal.
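The automated report mentioned in the exercise can be produced with the package’s summary method:

summary(scandal_impact)             # numerical summary of the estimated effect
summary(scandal_impact, "report")   # plain-language description of the causal impact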

120 of 137

The tools

CausalImpact

rstanarm

121 of 137

Get up to speed with Bayesian data analysis in R

Exercises part 2

Link to slides: bit.ly/2NvFutj

122 of 137

What have we done?

123 of 137

Prior probability distribution

Posterior probability distribution

Prior

Posterior

124 of 137

125 of 137

The essence of Bayesian inference

Bayesian inference is conditioning on data, in order to learn about parameter values.

126 of 137

127 of 137

CausalImpact

rstanarm

128 of 137

Get up to speed with Bayesian data analysis in R

Rasmus Bååth

Lund University ❧ castle.io

@rabaath ❧ rasmus.baath@gmail.com

www.sumsar.net

129 of 137

Scraps (leftover slides)

130 of 137

Probability theory is nothing but common sense reduced to calculation.

  • Pierre-Simon Laplace, 1814

131 of 137

Priors & Posteriors

  • A prior is a probability distribution that represents what the model knows before seeing the data.
  • A posterior is a probability distribution that represents what the model knows after having seen the data.

132 of 137

intercept slope sigma    y1    y2    y3    y4    y5
     -1.2   1.3   2.8  -6.9  -3.0  -2.9   1.5   2.2
     -1.3   1.6   1.6  -7.1  -4.2   0.5   0.5   4.7
      0.7   0.2   1.9  -1.8   0.4   1.1   0.7   0.3
     -0.5   1.4   1.7  -6.7  -2.8  -0.1   1.3   4.6
      1.1   0.8   2.3   1.7  -2.0   4.9  -0.6   4.0
      0.0   1.6   1.5  -2.9  -3.3   1.2   0.4   8.6
      0.9   1.3   2.0  -2.8  -0.3   3.5   0.0   4.8
      2.0   1.6   2.5  -1.3   1.8   4.8   7.5  13.2
     -0.5   1.1   1.2  -3.7  -2.4  -0.8   2.4   3.3
      1.1   1.1   2.8   5.2  -1.2   0.0   2.9   5.9
      1.7   1.6   1.7  -3.6  -2.8   2.7   1.2  13.3
     -1.2   0.0   2.7  -1.4  -5.1  -4.2  -1.1   1.5
      0.6   1.0   1.7  -1.4  -2.5   2.1   4.1   4.7
     -1.5   1.5   1.7  -4.1  -2.2  -2.8  -0.8   4.1
        …     …     …     …     …     …     …     …
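A sketch of how a table of prior samples like this could be generated (the priors and the five x-values are made-up assumptions, not from the slides):

x <- c(-4, -2, 0, 2, 4)                              # hypothetical predictor values
n_samples <- 1000
intercept <- rnorm(n_samples, mean = 0, sd = 1)
slope <- rnorm(n_samples, mean = 1, sd = 0.5)
sigma <- runif(n_samples, min = 1, max = 3)
y <- sapply(x, function(xi) rnorm(n_samples, mean = intercept + slope * xi, sd = sigma))
prior_samples <- data.frame(intercept, slope, sigma,
  y1 = y[, 1], y2 = y[, 2], y3 = y[, 3], y4 = y[, 4], y5 = y[, 5])
head(prior_samples)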

133 of 137


σ: 2.0

intercept: -0.5

slope: 0.9

134 of 137

135 of 137

P(θ|D) ∝ P(D|θ) × P(θ)

P(θ|D) = P(D|θ) × P(θ) / ∑θ P(D|θ) × P(θ)

P(D|θ) is the likelihood, e.g. P(Data | intercept, slope); P(θ|D) is the posterior.

136 of 137

P(θ|D) ∝ P(D|θ) × P(θ)

P(θ|D) = P(D|θ) × P(θ) / ∑θ P(D|θ) × P(θ)

P(D|θ) is the likelihood, e.g. P(Data | intercept, slope).
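As a sketch, the same formula computed by brute-force enumeration for the ad-click model from part 1 (13 clicks out of 100 ads, uniform prior on a grid over the proportion):

theta <- seq(0, 0.2, by = 0.001)                            # grid over the parameter
prior <- rep(1, length(theta))                              # uniform prior (unnormalized)
likelihood <- dbinom(13, size = 100, prob = theta)          # P(D|θ)
posterior <- likelihood * prior / sum(likelihood * prior)   # Bayes’ theorem
plot(theta, posterior, type = "h")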

137 of 137

The tools

CausalImpact

rstanarm