1 of 17

Hierarchical Probabilistic Modelling in Real Life

Benjamin Batorsky, PhD

Matt Moocarme, PhD

PyData NYC 2018

https://github.com/moocarme/pydata_2018

Slides: https://goo.gl/aBHpuX

2 of 17

Benjamin Batorsky

Data Scientist, ThriveHive

PhD, Policy Analysis

tech.thrivehive.com

bpben.github.io

@bpben2

Matthew Moocarme

Data Scientist, Viacom

PhD, Physics

github.com/moocarme

3 of 17

First - Run the Docker container

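...or clone the GitHub repo

  • If you have Docker already
    • docker run -p 8888:8888 moocarme/pydata1.0
    • Image size: 2.5 GB
  • If not
    • Linux - Install Docker with your package manager
      • yum-based distributions (e.g. CentOS, Amazon Linux): sudo yum update -y, then sudo yum install -y docker
      • Ubuntu: sudo apt-get update, then sudo apt-get install -y docker.io
    • Mac
      • Install Docker Desktop for Mac
    • Windows
      • See instructions: https://docs.docker.com/docker-for-windows/install/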

  • git clone https://github.com/moocarme/pydata_2018.git

4 of 17

Powering the conversation about marketing spend

  • Business questions:
    • What is customer X likely to spend on product Y?
  • Perfect world: Every salesperson knows the answer
    • Real world: Salesperson answers based on their experience
  • Perfect model: Knows the answer, exactly
    • Real model: Outputs realistic ranges, given characteristics

5 of 17

6 of 17

Can we produce a range of “realistic spends”?

  • Have:
    • Customer spend on marketing products
    • Various customer characteristics
  • Aim:
    • Performant predictive model for estimating spend range given certain high-level characteristics
    • Ability to surface this information to sales team
    • Provide a range of predictions to aid sales team in closing a deal
  • Probabilistic models in Marketing
    • Customer lifetime value (CLV) prediction
    • Marketing Mix Models

Jain, D., & Singh, S. S. (2002). Customer lifetime value research in marketing: A review and future directions.

Rossi, P. E., & Allenby, G. M. (2003). Bayesian statistics and marketing.

7 of 17

Probabilistic Modelling

  • A model describes data that one could observe from a system
  • If we use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model...
  • ...then inverse probability (i.e. Bayes rule) allows us to infer unknown quantities, adapt our models, make predictions and learn from data.

8 of 17

What is a Bayesian model?

  • Bayes' rule:
    • Have two events, A and B
    • P(A) : Probability of A (Prior)
    • P(B|A) : Probability of B, given A (Likelihood)
    • P(B) : Probability of B (Evidence)
    • P(A|B) : Probability of A, given B (Posterior) = P(B|A) P(A) / P(B)
  • Applications of Bayesian models
    • Customer time to purchase
    • Estimating the contribution of different channels in marketing results
    • Accounting for variation in customer preferences

[Figure panels: Discrete, Continuous]

$$P(\text{hypothesis} \mid \text{data}) = \frac{P(\text{data} \mid \text{hypothesis})\, P(\text{hypothesis})}{P(\text{data})}$$
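As a minimal sketch of Bayes' rule over a discrete set of hypotheses: the purchase rates and the "4 purchases in 10 contacts" data below are made up for illustration and do not come from the talk's dataset.

```python
from math import comb

def posterior(prior, likelihood):
    """Apply Bayes' rule to a discrete set of hypotheses.

    prior and likelihood are dicts keyed by hypothesis.
    """
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(data)
    return {h: u / evidence for h, u in unnormalized.items()}

# Hypothetical question: is a customer's purchase rate 10%, 30% or 50%?
rates = [0.1, 0.3, 0.5]
prior = {r: 1 / 3 for r in rates}  # no hypothesis favoured up front

# Made-up data: 4 purchases in 10 contacts, with a binomial likelihood
k, n = 4, 10
likelihood = {r: comb(n, k) * r**k * (1 - r) ** (n - k) for r in rates}

post = posterior(prior, likelihood)
for rate, prob in post.items():
    print(f"P(rate={rate} | data) = {prob:.3f}")
```

The evidence term P(data) only rescales the result so the posterior sums to one. With continuous parameters that sum becomes an integral, which is why sampling methods are needed later.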

9 of 17

What is a “hierarchical model”?

  • Account for group-level differences
  • Example: Estimating the effect of having a basement on radon levels across counties
    • Non-hierarchical (pooled): Single intercept, single effect parameter
    • Hierarchical (partially pooled): Per-county intercept and per-county effect
    • Individual (Unpooled): Separate models per county
  • Bayesian: County-level parameters drawn from a shared group-level prior distribution, whose parameters are also estimated

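Typical approach (pooled):

$$\text{radon}_{i,c} = \alpha + \beta \cdot \text{floor}_{i,c} + \epsilon$$

Hierarchical (partially pooled):

$$\text{radon}_{i,c} = \alpha_c + \beta_c \cdot \text{floor}_{i,c} + \epsilon, \quad \alpha_c \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2), \quad \beta_c \sim \mathcal{N}(\mu_\beta, \sigma_\beta^2)$$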

http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/
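The difference between pooled, unpooled and partially pooled estimates can be sketched on group means alone. This is a simplified illustration, not the full Bayesian model from the talk: it assumes the within-group and between-group standard deviations (`sigma_y`, `sigma_a`) are known, and the county readings are invented.

```python
from statistics import mean

def partial_pool(groups, sigma_y, sigma_a):
    """Shrink each group mean toward the overall mean.

    groups:  dict of group name -> list of observations
    sigma_y: assumed within-group standard deviation
    sigma_a: assumed between-group standard deviation
    """
    all_obs = [y for ys in groups.values() for y in ys]
    pooled = mean(all_obs)
    estimates = {}
    for name, ys in groups.items():
        w_group = len(ys) / sigma_y**2  # precision of the group's own mean
        w_pool = 1 / sigma_a**2         # precision of the group-level prior
        estimates[name] = (w_group * mean(ys) + w_pool * pooled) / (w_group + w_pool)
    return pooled, estimates

# Made-up log-radon readings for three hypothetical counties
counties = {
    "A": [1.2, 1.4, 1.1, 1.3, 1.5, 1.2, 1.3, 1.4],
    "B": [0.8, 0.9, 1.0],
    "C": [2.5],  # a single reading: the unpooled estimate is fragile
}

pooled, partial = partial_pool(counties, sigma_y=0.5, sigma_a=0.3)
unpooled = {name: mean(ys) for name, ys in counties.items()}

for name in counties:
    print(f"{name}: unpooled={unpooled[name]:.2f} partial={partial[name]:.2f} pooled={pooled:.2f}")
```

County C, with one observation, is pulled strongly toward the overall mean, while county A, with eight observations, barely moves. A hierarchical model does the same thing but also learns the two standard deviations from the data.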

10 of 17

Why use this approach?

  • Radon example
    • Hierarchical approach yields a consistent effect estimate across counties
    • Individual (unpooled) models are sensitive to outliers and missing data, especially in counties with few observations
  • ThriveHive context
    • Spend varies by product and location, but in a consistent way
    • Incorporate our prior beliefs and share information cross-product and cross-location

http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/

Light lines are samples of Beta, dark line is average of the samples

11 of 17

ThriveHive’s data

  • 4,000 customer marketing campaigns across five products since 2016
  • “Initial spend” - Spend on first contract signed
  • Spend modelled as log-normally distributed
  • Groups: Product and Region
  • Covariates
    • NAICS industry data
    • Size data
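To illustrate what the log-normal assumption implies, here is a small sketch using simulated spend. The parameters (`mu`, `sigma`) are hypothetical and not estimated from ThriveHive's data. Working on the log scale turns the problem into an ordinary normal model, and exponentiating maps a range back to spend.

```python
import math
import random
import statistics

rng = random.Random(0)

# Hypothetical log-scale parameters: median spend of 500, fairly wide spread
mu, sigma = math.log(500), 0.8
spend = [rng.lognormvariate(mu, sigma) for _ in range(4000)]

# On the log scale the data are normal, so mean and std dev summarise them
log_spend = [math.log(s) for s in spend]
mu_hat = statistics.mean(log_spend)
sigma_hat = statistics.stdev(log_spend)

# Central 80% range, computed on the log scale and mapped back to spend
z = statistics.NormalDist().inv_cdf(0.9)
low = math.exp(mu_hat - z * sigma_hat)
median = math.exp(mu_hat)
high = math.exp(mu_hat + z * sigma_hat)

print(f"80% of spends fall between {low:.0f} and {high:.0f} (median {median:.0f})")
print(f"mean spend {statistics.mean(spend):.0f} exceeds the median: right-skewed")
```

The resulting range is asymmetric around the median and never goes below zero, both of which suit spend data.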

12 of 17

A bit about the products (and what we expect)

  • SEM (Search Engine Marketing)
    • General term for sponsored search results
    • Typically used by larger companies for higher spend
  • SEO (Search Engine Optimization)
    • Focused on getting “organic” traffic (i.e. not paid) by getting “noticed” by search
    • Typically larger spend, but wide variation
  • Social Advertising
    • Appears in Facebook and Instagram feeds
    • Lowest-tier options are the most widely used
  • Email
    • Direct email advertising
    • Typically lower spend, add-on for larger companies
  • Display
    • Display advertising targeting audiences
    • Lowest spend, usually “gateway” for small businesses

13 of 17

Sampling from the posterior

  • Markov Chain: Propose a step, then accept or reject it based on the current position
  • Monte Carlo: Random sampling
  • Metropolis vs Hamiltonian
    • Metropolis is simple, but has difficulty with complex (e.g. high-dimensional or strongly correlated) distributions
    • Hamiltonian uses the gradient of the “probability surface” to guide proposals
  • No-U-Turn Sampler (NUTS)
    • Increases the efficiency of the Hamiltonian technique by choosing the trajectory length automatically

Hamiltonian MCMC
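The propose/accept/reject loop can be sketched with a random-walk Metropolis sampler. This is the simple Metropolis variant, not the Hamiltonian or NUTS sampler that PyMC3 uses by default, and the target (a normal distribution centred at 3) is chosen only so the result is easy to check.

```python
import math
import random
import statistics

def metropolis(log_prob, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_prob: function returning the unnormalized log density at a point
    """
    rng = random.Random(seed)
    x, lp = start, log_prob(start)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0, step)  # propose a step
        lp_new = log_prob(proposal)
        # Accept with probability min(1, p(proposal) / p(current))
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            x, lp = proposal, lp_new
        samples.append(x)                  # a rejected step repeats the current value
    return samples

# Target: normal with mean 3 and std dev 1, up to a constant
samples = metropolis(lambda x: -0.5 * (x - 3.0) ** 2, start=0.0, n_samples=20000)
kept = samples[2000:]  # discard burn-in

print(f"mean={statistics.mean(kept):.2f} sd={statistics.stdev(kept):.2f}")
```

Only ratios of densities are needed, so the normalizing constant P(data) never has to be computed. The cost is that neighbouring samples are correlated, which is the inefficiency that gradient-based samplers reduce.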

14 of 17

Bayesian model evaluation

  • PyMC3’s WAIC function (pm.waic)
  • WAIC (Widely Applicable Information Criterion)
    • Model M trained on data D
    • Log of each observation's posterior-averaged likelihood, summed over observations and penalized by the variance V of its log-likelihood across posterior samples
    • V becomes very large in small samples
  • pWAIC
    • The penalty term: the effective number of parameters
  • Uses
    • Model comparison: Lower WAIC values indicate better posterior predictive fit
    • Predictive performance: Sort of a “stand-in” for cross-validation

Watanabe (2010) https://dl.acm.org/citation.cfm?id=1953045

Piironen (2015) https://arxiv.org/pdf/1503.08650.pdf
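A rough sketch of how WAIC can be computed from a matrix of pointwise log-likelihoods, using the deviance-scale definition WAIC = -2 (lppd - pWAIC). This is a hand-rolled illustration, not PyMC3's implementation. The data, the "fitted" model and the deliberately wrong comparison model are all invented.

```python
import math
import random
import statistics

def log_normal_pdf(y, mu, sigma=1.0):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)

def waic(log_lik):
    """log_lik[s][i]: log-likelihood of observation i under posterior draw s."""
    n_draws, n_obs = len(log_lik), len(log_lik[0])
    lppd, p_waic = 0.0, 0.0
    for i in range(n_obs):
        column = [log_lik[s][i] for s in range(n_draws)]
        m = max(column)  # log-sum-exp trick for numerical stability
        lppd += m + math.log(sum(math.exp(v - m) for v in column) / n_draws)
        p_waic += statistics.variance(column)  # penalty: variance across draws
    return -2 * (lppd - p_waic), p_waic

rng = random.Random(1)
y = [rng.gauss(2.0, 1.0) for _ in range(50)]  # made-up data

# Model A: mean learned from data (flat prior, known sigma = 1),
# so the posterior for the mean is Normal(ybar, 1/sqrt(n))
ybar, n = statistics.mean(y), len(y)
draws_a = [rng.gauss(ybar, 1 / math.sqrt(n)) for _ in range(500)]
# Model B: mean fixed at 0, a deliberately poor model
draws_b = [0.0] * 500

waic_a, p_a = waic([[log_normal_pdf(yi, mu) for yi in y] for mu in draws_a])
waic_b, p_b = waic([[log_normal_pdf(yi, mu) for yi in y] for mu in draws_b])
print(f"Model A: WAIC={waic_a:.1f}, pWAIC={p_a:.2f}")
print(f"Model B: WAIC={waic_b:.1f}, pWAIC={p_b:.2f}")
```

Model A has one free parameter, and its pWAIC comes out close to one. Model B has none, so its penalty is zero, but its poor fit gives it the higher (worse) WAIC.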

15 of 17

Predictions with Bayesian models

  • Frequentist
    • Calculate point estimates of parameters
    • Predictions come from combination of point estimates
  • Bayesian
    • Create a distribution based on the model and its parameters: Posterior predictive distribution
    • Sample predictions from predictive distribution
    • Can then create “new” data to compare to source data in posterior predictive checks
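Sampling from a posterior predictive distribution can be sketched as follows. The posterior draws here are stand-ins generated from made-up numbers, not the output of a fitted model. In practice they would come from the sampler's trace.

```python
import math
import random
import statistics

rng = random.Random(42)

# Stand-in posterior draws for a log-normal spend model: each draw is a
# (mu, sigma) pair on the log scale. Values are invented for illustration.
posterior_draws = [
    (rng.gauss(6.0, 0.05), abs(rng.gauss(0.7, 0.03))) for _ in range(2000)
]

def sample_predictions(draws, rng):
    """Simulate one new spend per posterior draw.

    Parameter uncertainty (which draw) and outcome noise (the log-normal
    draw itself) both end up in the predictions.
    """
    return [rng.lognormvariate(mu, sigma) for mu, sigma in draws]

predictions = sample_predictions(posterior_draws, rng)

# Percentiles of the predictive samples give a range, not a single number
cuts = statistics.quantiles(predictions, n=20)  # cut points at 5%, 10%, ..., 95%
low, mid, high = cuts[1], cuts[9], cuts[17]     # 10th, 50th, 90th percentiles

# A point-estimate prediction, for contrast, collapses to one value
point = math.exp(statistics.mean(mu for mu, _ in posterior_draws))

print(f"predictive range: {low:.0f} to {high:.0f} (median {mid:.0f})")
print(f"point prediction: {point:.0f}")
```

The same simulated values can be compared against the observed data, for example by checking whether the observed median falls inside the simulated range, as a posterior predictive check.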

16 of 17

Implementing in production pipeline

  • Interactive application (Flask)
  • Requirements
    • Predictive samples for every region, product and industry
    • Low latency
    • Up-to-date estimates
  • Our solution
    • Retrain daily; save the model and variables to AWS
    • Set the value of the model's shared input variable to the values in the request
    • Sample from the predictive distribution and display the range of spend

[Figures: predicted spend range for SEM; predicted spend range for Social]
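The serving pattern might look roughly like the sketch below. Everything in it is an assumption made for illustration: `save_model` and `spend_range` are hypothetical helpers, `pickle.dumps` stands in for writing to AWS, the group keys and parameter values are invented, and a lookup into saved draws replaces the shared-variable mechanism described above.

```python
import pickle
import random
import statistics

def save_model(draws_by_group):
    """Stand-in for the daily job that writes the trained model to storage."""
    return pickle.dumps(draws_by_group)

def spend_range(blob, product, region, n=1000, seed=0):
    """Load saved draws, select the requested group, return (low, median, high)."""
    draws = pickle.loads(blob)[(product, region)]
    rng = random.Random(seed)
    # Pick a posterior draw at random, then simulate a spend from it
    sims = [rng.lognormvariate(*rng.choice(draws)) for _ in range(n)]
    cuts = statistics.quantiles(sims, n=10)  # cut points at 10%, ..., 90%
    return cuts[0], cuts[4], cuts[8]

# Made-up (mu, sigma) posterior draws per (product, region) group
blob = save_model({
    ("SEM", "Northeast"): [(7.0, 0.60), (7.1, 0.55), (6.9, 0.65)],
    ("Display", "Northeast"): [(5.0, 0.50), (5.1, 0.45), (4.9, 0.55)],
})

sem = spend_range(blob, "SEM", "Northeast")
display = spend_range(blob, "Display", "Northeast")
print("SEM spend range:     %.0f to %.0f (median %.0f)" % (sem[0], sem[2], sem[1]))
print("Display spend range: %.0f to %.0f (median %.0f)" % (display[0], display[2], display[1]))
```

To keep latency low, a real service would load the saved model once at startup rather than on every request, and refresh it after each daily retrain.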

17 of 17

Takeaways

  • Hierarchical models are powerful ways to incorporate the data’s structure
  • Probabilistic approaches allow inclusion of prior knowledge and output uncertainty estimates
  • This stuff is complicated...but usable!

THANKS: PyData, PyMC, and you!

https://xkcd.com/2059/