1 of 17

Hierarchical Probabilistic Modelling in Real Life

Benjamin Batorsky, PhD

Matt Moocarme, PhD

PyData NYC 2018

https://github.com/moocarme/pydata_2018

Slides: https://goo.gl/aBHpuX

2 of 17

Benjamin Batorsky

Data Scientist, ThriveHive

PhD, Policy Analysis

tech.thrivehive.com

bpben.github.io

@bpben2

Matthew Moocarme

Data Scientist, Viacom

PhD, Physics

github.com/moocarme

3 of 17

First - Run the Docker container

...or clone the GitHub repo

If you have Docker already

docker run -p 8888:8888 moocarme/pydata1.0
(2.5 GB download)

If not

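Linux/Mac - Install Docker

sudo yum update -y
sudo yum install -y docker

(These yum commands are for RHEL, CentOS, or Amazon Linux. On Ubuntu, use apt-get instead; on Mac, install Docker Desktop.)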

Windows

See instructions: https://docs.docker.com/docker-for-windows/install/

git clone https://github.com/moocarme/pydata_2018.git

4 of 17

Powering the conversation about marketing spend

Business questions:

What is customer X likely to spend on product Y?

Perfect world: Every salesperson knows the answer

Real world: Salesperson answers based on their experience

Perfect model: Knows the answer, exactly

Real model: Outputs realistic ranges, given characteristics

6 of 17

Can we produce a range of “realistic spends”?

Have:

Customer spend on marketing products
Various customer characteristics

Aim:

Performant predictive model for estimating spend range given certain high-level characteristics
Ability to surface this information to sales team
Provide a range of predictions to aid sales team in closing a deal

Probabilistic models in Marketing

Customer lifetime value (CLV) prediction
Marketing Mix Models

Jain, D., & Singh, S. S. (2002). Customer lifetime value research in marketing: A review and future directions.

Rossi, P. E., & Allenby, G. M. (2003). Bayesian statistics and marketing.

7 of 17

Probabilistic Modelling

A model describes data that one could observe from a system
If we use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model...
...then inverse probability (i.e. Bayes rule) allows us to infer unknown quantities, adapt our models, make predictions and learn from data.

8 of 17

What is a Bayesian model?

Bayes rule:

Have two events, A and B
P(A) : Probability of A (Prior)
P(B|A) : Probability of B, given A (Likelihood)
P(B) : Probability of B (Evidence)
P(A|B) : Probability of A, given B (Posterior)
P(A|B) = P(B|A) P(A) / P(B)
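
As a small worked example of Bayes rule with the quantities listed above, the sketch below computes a posterior for a binary event. The scenario and all numbers (share of "high spender" leads, demo request rates) are hypothetical and chosen only for illustration; the function name `posterior` is likewise an assumption, not something from the talk's repo.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) for a binary event A using Bayes rule."""
    # Evidence P(B), expanded with the law of total probability.
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / evidence


# Hypothetical numbers: 20% of leads are high spenders (A).
# 60% of high spenders request a demo (B), versus 10% of everyone else.
p_high_given_demo = posterior(
    prior_a=0.2,
    likelihood_b_given_a=0.6,
    likelihood_b_given_not_a=0.1,
)
print(round(p_high_given_demo, 3))
```

Observing the demo request moves the belief that the lead is a high spender from the prior (0.2) to the posterior, which is the "learning from data" step the slide describes.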

Applications of Bayesian models

Customer time to purchase
Estimating the contribution of different channels in marketing results
Accounting for variation in customer preferences

Discrete

Continuous

P(hypothesis | data) = P(data | hypothesis) P(hypothesis) / P(data)

9 of 17

What is a “hierarchical model”?

Account for group-level differences
Example: Estimating the effect of having a basement on radon levels across counties

Non-hierarchical (pooled): Single intercept, single effect parameter
Hierarchical (partially pooled): Per-county intercept and per-county effect
Individual (Unpooled): Separate models per county

Bayesian: County-level parameters are drawn from a shared, group-level prior distribution
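
The three pooling options above can be compared on a toy example. This is a minimal sketch, not the talk's PyMC3 model: it assumes a normal-normal model with known within-group variance (`sigma2`) and between-group variance (`tau2`), and plugs in the overall mean for the group-level mean instead of inferring it. The county names and readings are made up.

```python
from statistics import mean


def pooled_mean(groups):
    """Complete pooling: one estimate shared by every group."""
    return mean(y for ys in groups.values() for y in ys)


def unpooled_means(groups):
    """No pooling: each group is estimated from its own data only."""
    return {g: mean(ys) for g, ys in groups.items()}


def partially_pooled_means(groups, sigma2, tau2):
    """Partial pooling under a normal-normal model with known variances.

    sigma2: within-group observation variance (assumed known)
    tau2:   variance of the true group means around the overall mean
            (assumed known)
    """
    mu = pooled_mean(groups)  # plug-in for the group-level mean
    estimates = {}
    for g, ys in groups.items():
        n = len(ys)
        # Precision weight: groups with more data lean on their own mean.
        w = (n / sigma2) / (n / sigma2 + 1.0 / tau2)
        estimates[g] = w * mean(ys) + (1.0 - w) * mu
    return estimates


# Hypothetical readings: county A has five observations, county B only one.
groups = {"A": [1.0, 1.2, 0.8, 1.1, 0.9], "B": [2.5]}
pooled = pooled_mean(groups)
unpooled = unpooled_means(groups)
partial = partially_pooled_means(groups, sigma2=0.25, tau2=0.25)
print(pooled, unpooled, partial)
```

The single-observation county is pulled strongly toward the overall mean, while the well-sampled county barely moves. A full hierarchical model infers the group-level mean and variances from the data as well, which is what the MCMC machinery later in the talk is for.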

[Figure: typical (pooled) approach vs. hierarchical (partially pooled) approach]

http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/

10 of 17

Why use this approach?

Radon example

Hierarchical approach yields a consistent effect estimate across counties
Individual models are sensitive to outliers and missing data

ThriveHive context

Spend varies by product and location, but in a consistent way
Incorporate our prior beliefs and share information cross-product and cross-location

http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/

Figure caption: light lines are samples of β (the effect parameter); the dark line is the average of the samples

11 of 17

ThriveHive’s data

4,000 customer marketing campaigns across five products since 2016
“Initial spend” - Spend on first contract signed
Spend modelled as log-normally distributed
Groups: Product and Region
Covariates

NAICS industry data
Size data

12 of 17

A bit about the products (and what we expect)

SEM (Search Engine Marketing)

General term for sponsored search results
Typically used by larger companies for higher spend

SEO (Search Engine Optimization)

Focused on getting “organic” (i.e. unpaid) traffic by getting “noticed” by search engines
Typically larger spend, but wide variation

Social Advertising

Appears in Facebook and Instagram feeds
Minimum-spend options are widely used

Email

Direct email advertising
Typically lower spend, add-on for larger companies

Display

Display advertising targeting audiences
Lowest spend, usually “gateway” for small businesses

13 of 17

Sampling from the posterior

Markov Chain: Propose a step, then accept or reject it based on the current position
Monte Carlo: Random sampling
Metropolis vs Hamiltonian

Metropolis is simple, but has difficulty with complex distributions
Hamiltonian uses the gradient of the “probability surface” to guide proposals

No-U-Turn Sampler (NUTS)

Increases the efficiency of the Hamiltonian technique by tuning the trajectory length automatically
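
To make the propose/accept/reject loop concrete, here is a minimal random-walk Metropolis sampler in plain Python. It is a sketch of the simpler of the two methods named above, not Hamiltonian Monte Carlo or NUTS, and not what PyMC3 runs internally. The target (a normal distribution centred at 3), the step size, and the burn-in length are arbitrary choices for illustration.

```python
import math
import random
from statistics import mean, stdev


def metropolis(log_prob, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    rng = random.Random(seed)
    x, lp = start, log_prob(start)
    samples = []
    for _ in range(n_samples):
        # Markov chain: the proposal depends only on the current position.
        proposal = x + rng.gauss(0.0, step)
        lp_new = log_prob(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        accept_prob = math.exp(min(0.0, lp_new - lp))
        if rng.random() < accept_prob:
            x, lp = proposal, lp_new
        # A rejected proposal repeats the current position in the chain.
        samples.append(x)
    return samples


def log_target(x):
    """Unnormalized log density of a Normal(3, 1) target."""
    return -0.5 * (x - 3.0) ** 2


draws = metropolis(log_target, start=0.0, n_samples=20000)
kept = draws[2000:]  # discard early draws taken before the chain settles
print(round(mean(kept), 2), round(stdev(kept), 2))
```

Only the ratio of densities is needed, so the target never has to be normalized. That is why the evidence term P(data) can be ignored when sampling from a posterior.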

Hamiltonian MCMC

https://colcarroll.github.io/hamiltonian_monte_carlo_talk/bayes_talk.html

http://arogozhnikov.github.io/2016/12/19/markov_chain_monte_carlo.html

14 of 17

Bayesian model evaluation

PyMC3’s WAIC module
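WAIC (Widely Applicable Information Criterion)

Model M trained on data D
Log predictive density of each observation, penalized by the variance of its log-likelihood across posterior samples
The variance term V can become very large in small samples

pWAIC

Penalty term of WAIC: the effective number of parameters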

Uses

Model comparison: Lower WAIC values indicate a better posterior predictive fit
Predictive performance: Sort of a “stand in” for cross-validation
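
The calculation behind WAIC can be sketched directly from a matrix of pointwise log-likelihoods. This is an illustrative stdlib implementation on the deviance scale, not PyMC3's own code; it uses the sample variance for the penalty, which is one common convention. The toy data, the known-variance normal model, and the deliberately shifted "bad" draws are all assumptions made for the example.

```python
import math
import random
from statistics import mean, variance


def waic(log_lik):
    """Return (WAIC, p_WAIC); log_lik[s][i] = log p(y_i | theta_s)."""
    n_draws, n_obs = len(log_lik), len(log_lik[0])
    lppd, p_waic = 0.0, 0.0
    for i in range(n_obs):
        col = [log_lik[s][i] for s in range(n_draws)]
        # Log of the average likelihood, computed stably via log-sum-exp.
        top = max(col)
        lppd += top + math.log(sum(math.exp(v - top) for v in col) / n_draws)
        # Penalty: variance of the log-likelihood across posterior draws.
        p_waic += variance(col)
    return -2.0 * (lppd - p_waic), p_waic


def normal_logpdf(y, mu, sigma=1.0):
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - 0.5 * ((y - mu) / sigma) ** 2)


rng = random.Random(0)
y = [rng.gauss(0.0, 1.0) for _ in range(30)]
# Posterior draws of the mean under a flat prior with known sigma = 1.
good_mu = [rng.gauss(mean(y), 1 / math.sqrt(len(y))) for _ in range(500)]
# Shifted draws standing in for a model that fits the data poorly.
bad_mu = [m + 2.0 for m in good_mu]

waic_good, p_good = waic([[normal_logpdf(v, m) for v in y] for m in good_mu])
waic_bad, p_bad = waic([[normal_logpdf(v, m) for v in y] for m in bad_mu])
print(round(waic_good, 1), round(waic_bad, 1))
```

The well-fitting draws give the lower WAIC, matching the model comparison rule above.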

Watanabe (2010) https://dl.acm.org/citation.cfm?id=1953045

Piironen (2015) https://arxiv.org/pdf/1503.08650.pdf

15 of 17

Predictions with Bayesian models

Frequentist

Calculate point estimates of parameters
Predictions come from combination of point estimates

Bayesian

Create a distribution based on the model and its parameters: Posterior predictive distribution
Sample predictions from predictive distribution
Can then create “new” data to compare to source data in posterior predictive checks
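
A posterior predictive check along these lines can be sketched in a few lines of plain Python. Everything here is hypothetical: the "observed" spends are simulated, and the posterior draws are a rough normal approximation standing in for an MCMC trace, so this shows the mechanics only, not the talk's model.

```python
import math
import random
from statistics import mean, median, stdev

rng = random.Random(1)

# Hypothetical observed spends, simulated as log-normal for the example.
observed = [rng.lognormvariate(6.0, 0.5) for _ in range(200)]
n = len(observed)

# Stand-in for an MCMC trace: approximate posterior draws of (mu, sigma)
# for log-spend. A real trace would come from the fitted model.
log_obs = [math.log(v) for v in observed]
m, s = mean(log_obs), stdev(log_obs)
posterior_draws = [(rng.gauss(m, s / math.sqrt(n)), s) for _ in range(500)]


def replicate(mu, sigma, size):
    """Simulate one 'new' dataset from a single posterior draw."""
    return [rng.lognormvariate(mu, sigma) for _ in range(size)]


# Compare a summary statistic of the real data with its distribution
# across replicated datasets.
obs_stat = median(observed)
rep_stats = [median(replicate(mu, sigma, n)) for mu, sigma in posterior_draws]
p_value = mean(1.0 if r >= obs_stat else 0.0 for r in rep_stats)
print(round(p_value, 2))
```

A value near 0 or 1 would suggest the model struggles to reproduce that feature of the data. Each replicated dataset uses a different parameter draw, so parameter uncertainty carries through to the predictions, unlike the point-estimate approach.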

http://www.cs.princeton.edu/courses/archive/fall09/cos597A/papers/GelmanMengStern1996.pdf

http://doingbayesiandataanalysis.blogspot.com/2016/10/posterior-predictive-distribution-for.html

16 of 17

Implementing in production pipeline

Interactive application (Flask)
Requirements

Predictive samples for every region, product and industry
Low latency
Up-to-date estimates

Our solution

Daily retrain; save model and variables to AWS
Set the value of the model's shared variable to the inputs from the request
Sample from predictive distribution, display range of spend
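
The request-time step can be sketched without any web framework. The code below is a simplified stand-in for the solution above: instead of setting a shared variable on a saved PyMC3 model, it looks up stored posterior draws by (product, region) and samples spends from them. The dictionary `SAVED_POSTERIOR`, the function `spend_range`, the region name, and all numbers are hypothetical, and the Flask and AWS pieces are omitted.

```python
import random

_rng = random.Random(0)


def _fake_draws(center, n=1000):
    """Hypothetical posterior draws of (mu, sigma) for log-spend."""
    return {
        "mu": [_rng.gauss(center, 0.05) for _ in range(n)],
        "sigma": [abs(_rng.gauss(0.4, 0.02)) for _ in range(n)],
    }


# Stand-in for the artefacts a daily retrain job would save and reload.
SAVED_POSTERIOR = {
    ("SEM", "Northeast"): _fake_draws(7.0),
    ("Display", "Northeast"): _fake_draws(5.5),
}


def spend_range(product, region, n_samples=2000, lower=0.05, upper=0.95,
                seed=None):
    """Return a (low, high) spend range from posterior predictive samples."""
    draws = SAVED_POSTERIOR[(product, region)]
    rng = random.Random(seed)
    sims = []
    for _ in range(n_samples):
        # Pick one posterior draw, then simulate one spend from it.
        k = rng.randrange(len(draws["mu"]))
        sims.append(rng.lognormvariate(draws["mu"][k], draws["sigma"][k]))
    sims.sort()
    return (sims[int(lower * (n_samples - 1))],
            sims[int(upper * (n_samples - 1))])


sem_low, sem_high = spend_range("SEM", "Northeast", seed=1)
disp_low, disp_high = spend_range("Display", "Northeast", seed=1)
print(round(sem_low), round(sem_high), round(disp_low), round(disp_high))
```

Because the posterior draws are computed ahead of time, each request only does a lookup and some cheap sampling. That is one way to meet the low-latency requirement while still returning a range and not a single number.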

Spend range: SEM

Spend range: Social

17 of 17

Takeaways

Hierarchical models are a powerful way to incorporate the data’s structure
Probabilistic approaches allow inclusion of prior knowledge and output uncertainty estimates
This stuff is complicated...but usable!

THANKS: PyData, PyMC, and you!

https://xkcd.com/2059/