1 of 71

Uplift Modeling: How to Enhance Customer Targeting in Marketing with Causal Machine Learning

Hajime Takeda (Jimmy)

Data Scientist

June 2024


Agenda

  • What is Causal Inference?
  • What is Causal Inference with Machine Learning?
  • Use case #1 : Measuring Treatment Effects
  • Use case #2 : Uplift Modeling
  • Code with “CausalML”
  • Summary


Introduction


About Me

Hajime Takeda (Jimmy)


Education & Career

  • Master’s in CS from Kyoto University 🇯🇵
  • Data Analyst at Procter & Gamble 🇯🇵
  • Data Scientist at fashion company (E-commerce) 🇺🇸

Conferences

  • PyData Global Speaker (2022) 🌎
  • ODSC EAST Speaker (2023) 🇺🇸
  • PyData NYC Speaker (2023) 🇺🇸

Expected Takeaways

  1. Understand the key concepts and approaches of Causal Inference with Machine Learning
  2. Learn how to do Uplift Modeling using “CausalML”


What is Causal inference?

Typical Scenario

“Awesome! We sent coupons to some users and their purchase rate was twice as high as others! That means the coupons must have doubled the purchase rate!”

Jessy (Marketing)

                      Without Coupon   With Coupon
  # of Customers      1,000            1,000
  # of Purchasers     100              200
  Purchase rate (%)   10%              20%

Is She Really Right?

Me (Data Scientist): “Congratulations, that's great news. But may I ask who received the coupons?”

Jessy (Marketing): “We sent coupons to customers who purchased within the last 12 months.”

Pitfall: Selection Bias

Customers who received coupons might have purchased anyway.

Apparent effect from simple aggregation = true effect of the coupon + selection bias

Selection bias: the lift in purchase rate that would have occurred even without the coupon.

What is Causal Inference?

Causal Inference is the process of determining whether a cause-and-effect relationship exists between a Treatment and an Outcome.

What is the challenge?

Counterfactual: we can only observe one outcome from the same individual. In reality (the observed world) the customer purchased; the counterfactual (hypothetical world) can't be observed.

Randomized Controlled Trial (RCT)

Participants are randomly assigned to either a treatment group or a control group. RCT = A/B test.

  • Treatment group: purchase rate 40%
  • Control group: purchase rate 20%
  • Outcome: +20 points

Possible Scenario

While RCT is simple and robust, it cannot be used especially if…

  • CEO: “I can't wait much longer to reach a conclusion.”
  • Sales: “It's hard for me to convince external clients to run an RCT.”
  • Boss: “Your mission is to maximize sales.”

Limitations of RCT

RCT is time-consuming
  • Experiments take a lot of time.
  • You can't analyze past data that didn't use an RCT.

RCT can lead to opportunity loss
  • Half of the customers don't receive the treatment, leading to potential opportunity loss.

RCT is hard when external stakeholders are involved
  • Clients may disagree with running a coupon test on randomized customers.
  • Clients may not be capable of executing an RCT.

That's why we use Causal Inference.

Simpson's Paradox - Challenges of Observed Data

Question: What is the effect of exercise on cholesterol?

The aggregated data suggest that people who exercise frequently have high cholesterol. Is this true?

Image: "The Book of Why" by Judea Pearl and Dana Mackenzie

Simpson's Paradox - Challenges of Observed Data

Age is an overlooked factor. When viewed by age group, people who exercise frequently have lower cholesterol levels.

Image: "The Book of Why" by Judea Pearl and Dana Mackenzie

Confounding Variable

A confounding variable affects both the treatment and the outcome.

  • Treatment: Exercise
  • Outcome: Cholesterol
  • Confounding variable: Age


What is Causal Machine Learning?

History

  • 1970s - 1990s: Establishment of traditional causal inference by Donald Rubin (Potential Outcomes Framework) and Judea Pearl (causal diagrams)
  • 2000s - 2010s: Machine learning technology evolved rapidly with the increase in data volume (big data, neural networks)
  • 2010s - 2020s: Fusion of causal inference and machine learning: new methods such as Causal Forest; CausalML and EconML released (2019)

From Causal Inference to Causal Machine Learning

Traditional Causal Inference
  • Typically assumes low-dimensional data
  • Basically relies on linear models
  • Generally estimates the Average Treatment Effect (ATE)

Causal Machine Learning
  • Handles high-dimensional data
  • Captures complex relationships using non-linear models
  • Estimates effects at subgroup or individual levels

Causal ML handles higher-dimensional data and captures complex relationships at the individual level.

Common Questions

PyData attendees: “Why should we use causal machine learning when normal machine learning already offers high predictive accuracy?”

Because the objectives of normal ML and causal machine learning are different.

Machine Learning vs. Causal ML

Machine Learning
  • Purpose: prediction based on correlation
  • Questions: What will the customer buy next? What is the probability?
  • Variables: X (features): age, gender, coupon availability; Y (target variable): sales

Causal Machine Learning / Causal Inference
  • Purpose: estimating the treatment effect
  • Questions: Did the coupon increase sales?
  • Variables: X (control variables): age, gender; Z (treatment variable): coupon availability; Y (outcome variable): sales uplift

Machine Learning focuses on “Prediction”, while Causal Machine Learning focuses on “Causality”.

Two Use Cases in Marketing Science

Use Case #1: Measuring Treatment Effects
  • Using Meta Learners

Use Case #2: Uplift Modeling
  • The goal is to select the right users for targeting
  • Using uplift trees / uplift random forests


Use Case #1

Measuring Treatment Effects

What is Treatment Effect?

Treatment Effect: τ = Y(1) − Y(0)

  • τ (tau): treatment effect
  • Y (outcome): sales, purchase rate, etc.
  • Z (treatment): campaign, coupon, etc.
  • X (confounders): age, gender, living location, preference, past purchase history, etc.

Example: Y(1) = 20% (potential outcome with treatment, Z = 1) and Y(0) = 10% (potential outcome without treatment, Z = 0), so τ = 20% − 10% = 10 points.

Types of Treatment Effect

There are different types of treatment effects:

  • Average Treatment Effect (ATE): τ_ATE = E[Y(1) − Y(0)], across the entire customer base
  • Conditional Average Treatment Effect (CATE): τ_CATE(X) = E[Y(1) − Y(0) | X], at the segment level (e.g., gender, age groups)
  • Individual Treatment Effect (ITE): τ_i = Y_i(1) − Y_i(0), at the individual level
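The ATE and CATE definitions above can be computed directly on an RCT-style table. A minimal pandas sketch on toy data (the numbers are made up for illustration):

```python
import pandas as pd

# Toy RCT data: 'z' is the treatment flag, 'y' the outcome (purchase).
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M", "F", "M"],
    "z":      [1,   0,   1,   0,   1,   0,   1,   0],
    "y":      [1,   0,   1,   1,   1,   0,   0,   0],
})

# ATE: mean outcome of the treated minus mean outcome of the controls.
ate = df.loc[df.z == 1, "y"].mean() - df.loc[df.z == 0, "y"].mean()

# CATE: the same difference, computed within each segment.
by_segment = df.pivot_table(index="gender", columns="z", values="y", aggfunc="mean")
cate = by_segment[1] - by_segment[0]

print(ate)
print(cate)
```

Note that taking the raw difference in means like this only estimates the ATE when treatment was randomized; with observational data, confounders must be adjusted for first.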

How to Calculate Treatment Effect

Question: Did a coupon increase sales?

  Name    Age   Gender   Location   Z (Treatment)   Y (Outcome)   Y(1)   Y(0)   ITE: Y(1) − Y(0)
  Anne    30    Male     Urban      0               0             NA     0      ?
  Ben     40    Female   Urban      0               0             NA     0      ?
  Chris   45    Female   Rural      0               1             NA     1      ?
  Diana   58    Female   Rural      0               0             NA     0      ?
  Ethan   25    Female   Urban      1               1             1      NA     ?
  Faye    38    Male     Rural      1               0             0      NA     ?
  Gary    42    Male     Urban      1               1             1      NA     ?
  Helen   60    Male     Urban      1               1             1      NA     ?

Y(1) is the outcome with treatment and Y(0) the outcome without treatment. Only one of the two is ever observed for a given customer; the other is NA, so the ITE ("?") cannot be computed directly from the data.

Meta Learners

Meta learners are techniques designed to estimate treatment effects by using ML models to handle the unobserved outcomes. The missing potential outcomes (the NA cells) are imputed by a model, which makes the ITE computable:

  Name    Age   Gender   Location   Z (Treatment)   Y (Outcome)   Y(1)   Y(0)   ITE: Y(1) − Y(0)
  Anne    30    Male     Urban      0               0             0.9    0      +0.9
  Ben     40    Female   Urban      0               0             0.8    0      +0.8
  Chris   45    Female   Rural      0               1             0.9    1      -0.1
  Diana   58    Female   Rural      0               0             0.6    0      +0.6
  Ethan   25    Female   Urban      1               1             1      0.2    +0.8
  Faye    38    Male     Rural      1               0             0      0.1    -0.1
  Gary    42    Male     Urban      1               1             1      0.5    +0.5
  Helen   60    Male     Urban      1               1             1      0.4    +0.6

Types of Meta Learners

Four major methods, ordered from simple to complex & robust: S Learner, T Learner, X Learner, and R Learner.

  • S Learner (single-model approach): small to medium datasets
  • T Learner (two-model approach): large datasets; distinct treatment and control groups
  • X Learner (cross-fitting approach): heterogeneous effects; imbalanced sample sizes
  • R Learner (residual approach): high-dimensional data; robust confounder adjustment

S Learner - Single Model Approach

S Learner uses a single model μ for training and prediction: train μ on the features X plus the treatment flag Z, then predict each customer's outcome twice, once with Z = 1 and once with Z = 0.

  Name    Age   Gender   Location   Z (Treatment)   Y (Outcome)   Y(1)   Y(0)   ITE: Y(1) − Y(0)
  Anne    30    Male     Urban      0               0             0.4    0.1    0.3
  Ben     40    Female   Urban      0               0             0.6    0.2    0.4
  Chris   45    Female   Rural      0               1             0.7    0.5    0.2
  Diana   58    Female   Rural      0               0             0.9    0.6    0.3
  Ethan   25    Female   Urban      1               1             0.6    0.1    0.5
  Faye    38    Male     Rural      1               0             0.2    0.0    0.2
  Gary    42    Male     Urban      1               1             0.7    0.3    0.4
  Helen   60    Male     Urban      1               1             0.8    0.6    0.2

S Learner - Pseudo Code

import xgboost as xgb

# Setting features and target variable
X = df[['age', 'gender', 'location', 'treatment']]
y = df['sales']

# Training the model
model = xgb.XGBRegressor()
model.fit(X, y)

# Predicting sales if treated and if not treated
df['sales_pred_treated'] = model.predict(X.assign(treatment=1))
df['sales_pred_untreated'] = model.predict(X.assign(treatment=0))

# Calculating ATE
df['treatment_effect'] = df['sales_pred_treated'] - df['sales_pred_untreated']
ATE = df['treatment_effect'].mean()

T Learner - Two Model Approach

T Learner uses separate models for the treatment and control groups: train one model on treated customers and another on untreated customers, then predict both potential outcomes for everyone.

  Name    Age   Gender   Location   Z (Treatment)   Y (Outcome)   Y(1)   Y(0)   ITE: Y(1) − Y(0)
  Anne    30    Male     Urban      0               0             0.4    0.1    0.3
  Ben     40    Female   Urban      0               0             0.6    0.2    0.4
  Chris   45    Female   Rural      0               1             0.7    0.5    0.2
  Diana   58    Female   Rural      0               0             0.9    0.6    0.3
  Ethan   25    Female   Urban      1               1             0.6    0.1    0.5
  Faye    38    Male     Rural      1               0             0.2    0.0    0.2
  Gary    42    Male     Urban      1               1             0.7    0.3    0.4
  Helen   60    Male     Urban      1               1             0.8    0.6    0.2

T Learner - Pseudo Code

import xgboost as xgb

# Splitting the data into treated and control groups
df_treated = df[df['treatment'] == 1]
df_control = df[df['treatment'] == 0]

# Training the model for the treated group
model_treated = xgb.XGBRegressor()
model_treated.fit(df_treated[['age', 'gender', 'location']], df_treated['sales'])

# Training the model for the control group
model_control = xgb.XGBRegressor()
model_control.fit(df_control[['age', 'gender', 'location']], df_control['sales'])

# Predicting sales for everyone with both models
df['sales_pred_treated'] = model_treated.predict(df[['age', 'gender', 'location']])
df['sales_pred_control'] = model_control.predict(df[['age', 'gender', 'location']])

# Calculating ATE
df['treatment_effect'] = df['sales_pred_treated'] - df['sales_pred_control']
ATE = df['treatment_effect'].mean()

Types of Meta Learners

Four major methods, ordered from simple to complex & robust: S Learner, T Learner, X Learner, and R Learner.

  • S Learner (single-model approach): small to medium datasets
  • T Learner (two-model approach): large datasets; distinct treatment and control groups
  • X Learner (cross-fitting approach): heterogeneous effects; imbalanced sample sizes
  • R Learner (residual approach): high-dimensional data; robust confounder adjustment


Use Case #2

Uplift Modeling

What is Uplift Modeling?

Uplift modeling estimates the treatment effect τ = Y(1) − Y(0) for each customer and uses it to prioritize customers who are influenced positively by marketing offers.

Segmentation of Customers

  Segment            If treated   If NOT treated
  Persuadables 😄👍   Buy          Won't buy
  Sure Things 😊      Buy          Buy
  Lost Causes        Won't buy    Won't buy
  Sleeping Dogs 😔👎  Won't buy    Buy

  • Persuadables: will buy only if they receive an incentive
  • Sure Things: will buy no matter what
  • Lost Causes: won't buy regardless of the campaign
  • Sleeping Dogs: won't buy if they receive an incentive

Focus marketing efforts on the Persuadables.

Two Methods for Uplift Modeling

Meta Learners
  • Predict the outcome with treatment and the outcome without treatment separately, and then calculate the uplift
  • e.g., T Learner (two-model approach)

Decision-Tree-Based Methods
  • e.g., Uplift Trees, Uplift Random Forests
  • Some algorithms support multiple treatment groups
  • Provide feature importance

While uplift modeling can also be implemented with Meta Learners, decision-tree-based methods are a common approach.

Traditional Decision Tree to Uplift Tree

  • Traditional decision tree: predicts “Will the customer make a purchase?” (prediction). Each node splits on a customer attribute (“Is the customer XX?”: Yes / No).
  • Uplift tree: answers “Who should we give coupons to?” (causality). Each leaf compares outcomes between treated and not-treated customers.

Traditional Decision Tree

This tree tries to identify: “Will the customer make a purchase?” How can we construct a better decision tree?

  • Age split (>= 30 vs. < 30): purchasers are mixed between the two clusters
  • Location split (Urban vs. Rural): purchasers are grouped into one cluster

For split criteria, “Gini Impurity” is used

Gini impurity = 1 − p1² − p0²

Here, p1 is the probability of purchase and p0 is the probability of no purchase.

  • Age split: Gini impurity = 0.44 (high impurity)
  • Location split: Gini impurity = 0 (low impurity)
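The Gini impurity criterion fits in a few lines of Python. The 1/3 purchase probability below is an assumed example that happens to reproduce the slide's 0.44:

```python
# Gini impurity for a binary node: 1 - p1^2 - p0^2.
def gini_impurity(p1: float) -> float:
    """p1 is the probability of purchase in the node."""
    p0 = 1.0 - p1
    return 1.0 - p1**2 - p0**2

# A mixed node where 1/3 of customers purchased: high impurity.
print(round(gini_impurity(1/3), 2))  # 0.44
# A pure node where everyone purchased: zero impurity.
print(gini_impurity(1.0))            # 0.0
```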

Reference: Calculation of Gini Impurity for the Left Tree

For the Age split, the purchased / not-purchased mix within the nodes gives: Gini impurity = 0.44 (high impurity).

Reference: Calculation of Gini Impurity for the Right Tree

For the Location split, purchasers and non-purchasers are perfectly separated: Gini impurity = 0 (low impurity).

Uplift Tree

This uplift tree tries to identify: “Who should we give coupons to?” How can we construct a better causal decision tree?

  • Age split (>= 30 vs. < 30): purchasers and non-purchasers are mixed together within the same clusters
  • Location split (Urban vs. Rural): purchasers and non-purchasers are cleanly separated between the treated and not-treated groups

For split criteria, “Squared Euclidean Distance” is used

D(P, Q) = (P(1) − Q(1))² + (P(0) − Q(0))²

  • P(1) / P(0) is the probability of purchase / no purchase in the treatment group.
  • Q(1) / Q(0) is the probability of purchase / no purchase in the control group.

The larger the distance between the treatment and control outcome distributions, the better the split:

  • Age split: Distance = 1 (low divergence)
  • Location split: Distance = 4 (high divergence)
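The squared Euclidean distance criterion is also a one-liner. The example distributions below are assumed for illustration, not taken from the slides:

```python
# Squared Euclidean distance between the treatment-group and
# control-group outcome distributions after a split.
def squared_euclidean(p: list[float], q: list[float]) -> float:
    """p, q: [P(purchase), P(no purchase)] for treatment and control."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

# Identical distributions: zero divergence (a useless split).
print(squared_euclidean([0.5, 0.5], [0.5, 0.5]))  # 0.0
# Fully separated distributions: maximal divergence (a great split).
print(squared_euclidean([1.0, 0.0], [0.0, 1.0]))  # 2.0
```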

Reference: Euclidean Distance Calculation for the Left Tree

For the Age split, the treated and not-treated purchase distributions are close: Distance = 1 (low divergence).

Reference: Euclidean Distance Calculation for the Right Tree

For the Location split, the treated and not-treated purchase distributions diverge strongly: Distance = 4 (high divergence).


Code with CausalML

Useful Libraries

CausalML (uber/causalml, 4.8k stars)
  • Focus on uplift modeling and Meta Learners
  • Designed as a standalone tool
  • Developed by Uber

EconML (py-why/EconML, 3.6k stars)
  • Covers a wide range of algorithms, strong in economics
  • Part of the bigger DoWhy ecosystem
  • Developed by Microsoft Research

Code walkthrough with CausalML today!

Get the Full Code Here!

https://github.com/takechanman1228/Effective-Uplif-Modeling

https://bit.ly/uplift-modeling

Importing Necessary Libraries
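The original slide shows a screenshot of the import cell. As a minimal sketch of the stack this walkthrough relies on, the imports might look like the following; the CausalML-specific ones are shown as comments because they assume `causalml` is installed:

```python
import numpy as np
import pandas as pd

# Uplift-specific imports, assuming the `causalml` package is installed:
# from causalml.inference.meta import BaseTRegressor
# from causalml.inference.tree import UpliftRandomForestClassifier
# from causalml.metrics import plot_gain, auuc_score

print(np.__version__, pd.__version__)
```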

Criteo Uplift Prediction Dataset

  • Fetch Criteo data via the sklift library
  • Data: https://ailab.criteo.com/criteo-uplift-prediction-dataset

Data Key Figures

  • 13M rows
  • Treatment ratio: 85%


Difference in Conversion Rate

  • Difference in Conversion Rate between Treatment and Control Groups: 0.12% (Treatment: 0.31%, Control: 0.19%)
  • Note: This is the simple difference in conversion rates and not the true treatment effect.
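The simple difference above can be reproduced with plain pandas. The toy numbers below are made up and do not match the Criteo figures:

```python
import pandas as pd

# Toy data with the same shape as the Criteo dataset:
# a 'treatment' flag and a 'conversion' outcome per user.
df = pd.DataFrame({
    "treatment":  [1, 1, 1, 1, 0, 0, 0, 0],
    "conversion": [1, 0, 1, 0, 1, 0, 0, 0],
})

# Conversion rate per group, then the simple (biased) difference.
cvr = df.groupby("treatment")["conversion"].mean()
diff = cvr[1] - cvr[0]
print(f"Difference in conversion rate: {diff:.2%}")
```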


Average Treatment Effect

  • Calculating ATE with Meta Learners: Using XGBoost and T-Learner
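The slide uses CausalML's T-Learner with XGBoost; as a library-agnostic sketch of the same idea, here is a T-Learner with scikit-learn on synthetic data where the true effect is +2 (all names and numbers below are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: outcome depends on one feature plus a +2 treatment effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
z = rng.integers(0, 2, size=1000)   # treatment flag
y = X[:, 0] + 2.0 * z               # true ATE = 2

# T-Learner: one model per group, then the difference of predictions.
model_treated = LinearRegression().fit(X[z == 1], y[z == 1])
model_control = LinearRegression().fit(X[z == 0], y[z == 0])
ate = (model_treated.predict(X) - model_control.predict(X)).mean()
print(round(ate, 3))  # ~2.0
```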


Decompose Observed CVR in Treatment Group

  • Observed CVR in Treatment Group = Base CVR from Control Group + ATE (True Effect) + Selection Bias


Uplift modeling

  • Train the Uplift Random Forest model (uplift_rf) and predict the uplift (y_pred)
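The slide trains CausalML's UpliftRandomForestClassifier. As a library-agnostic stand-in for the same uplift-scoring idea, here is a two-model version with scikit-learn on synthetic data (the data-generating process and all names are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: treatment only helps users with feature 0 > 0.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 4))
z = rng.integers(0, 2, size=2000)
p = 0.05 + 0.30 * z * (X[:, 0] > 0)   # conversion probability
y = rng.binomial(1, p)

# Two-model uplift: fit one classifier per group.
m1 = RandomForestClassifier(random_state=0).fit(X[z == 1], y[z == 1])
m0 = RandomForestClassifier(random_state=0).fit(X[z == 0], y[z == 0])

# Predicted uplift = P(convert | treated) - P(convert | not treated).
y_pred = m1.predict_proba(X)[:, 1] - m0.predict_proba(X)[:, 1]
print(y_pred[:5].round(2))
```

CausalML's uplift forest optimizes the split criterion for uplift directly instead of differencing two separate models, which is why it is the preferred tool here.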


Feature Importance


Visualization of Uplift Tree Structure


Preparation for Evaluation


Uplift Curve : Total Cumulative Gain

  • Targeting just 20% of all users can achieve about 80% of the result we would get by targeting everyone


Uplift compared to random targeting
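The uplift curve's takeaway can be illustrated with a small cumulative-gain computation: sort users by predicted uplift and accumulate their true uplift (the scores below are toy values assumed for illustration):

```python
import numpy as np

# Toy per-user true uplift and a model's predicted uplift scores.
true_uplift = np.array([0.8, 0.5, 0.3, 0.1, 0.0, 0.0, -0.1, -0.2])
pred_uplift = np.array([0.7, 0.6, 0.2, 0.2, 0.1, 0.0, -0.1, -0.3])

order = np.argsort(-pred_uplift)          # best-scored users first
cum_gain = np.cumsum(true_uplift[order])  # cumulative gain curve
total = true_uplift.sum()

# Share of the total gain captured by targeting the top 25% of users.
top_share = cum_gain[1] / total
print(round(top_share, 2))
```

Here a well-ranked top quarter of users already captures most of the achievable gain, which is exactly what the uplift curve on the slide shows at scale.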


AUUC (Area Under the Uplift Curve)

  • Evaluate the modeling using AUUC score.
  • The concept is similar to AUC (Area Under the ROC Curve).
  • The closer the AUUC is to 1, the better.


Extract User ID to Be Targeted

  • Extract the customer IDs who should be targeted.
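Selecting the target list is a simple sort-and-slice on the predicted uplift scores. A minimal pandas sketch with hypothetical user IDs and scores:

```python
import pandas as pd

# Hypothetical predicted uplift per user.
scores = pd.DataFrame({
    "user_id": [101, 102, 103, 104, 105],
    "uplift":  [0.12, -0.02, 0.30, 0.05, 0.21],
})

# Target the top-k users by predicted uplift.
top_k = 2
target_ids = (
    scores.sort_values("uplift", ascending=False)
          .head(top_k)["user_id"]
          .tolist()
)
print(target_ids)  # [103, 105]
```

In practice the cutoff would come from the uplift curve (e.g., the top 20%) rather than a fixed k.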


Summary

Summary

  1. When measuring the treatment effect of marketing activities, it is important to be mindful of selection bias and to control for confounding variables.
  2. Meta learners are techniques designed to estimate treatment effects by using ML models to handle unobserved outcomes.
  3. Uplift Modeling enables the identification of customers who are most likely to respond positively to treatments, thereby improving marketing ROI.
  4. If you're new to uplift modeling, CausalML is a good first step.


Questions/Collaboration

takeda.hajime.ja@gmail.com

linkedin.com/in/hajime-takeda

Feel free to contact me!


THANK YOU

Frequently Asked Questions

  • Are causal graphs used in uplift modeling and the meta-learner approach?
    • No. There are two major frameworks for causal inference:
      • (A) Potential Outcomes Framework (Rubin Causal Model): emphasizes the estimation of causal effects. Uplift modeling and meta learners fall under this framework.
      • (B) Causal graphs (Judea Pearl): focus on visualizing and understanding the causal structure of the data.
  • Any other causal machine learning methods?
    • “CausalImpact”: effective for causal inference on time-series data.
    • Causal Discovery: utilizes neural networks to discover and estimate causal relationships.

My Use Case

  • (A) Coupon Targeting (Uplift Modeling)
    • 5% off / 10% off / 15% off
  • (B) Loyalty Program Analysis (Measuring Treatment Effects)
    • The loyalty program is highly correlated with customers' past purchase history, making unbiased evaluation challenging.
    • Used CausalML to estimate the pure relationship between the loyalty program and purchases.

My Personal Recommendation

  1. Be aware of selection bias in regular analysis
    • If selection bias may exist, consider using a causal inference approach.
    • In e-commerce and retail, baseline purchase rates can vary significantly by demographics, past purchase frequency, and referral source.
  2. Set appropriate time frames
    • Set the conversion window considering your business context.
    • For high-priced items, customers may need a long consideration period before purchasing, whereas low-priced items are often bought quickly.
  3. Use RCT if possible
    • While causal machine learning has evolved, the gold standard for answering causal questions is the RCT.
    • Using a causal inference approach first and then conducting RCTs to validate your findings can work well.