Uplift Modeling: How to Enhance Customer Targeting in Marketing with Causal Machine Learning
Hajime Takeda (Jimmy)
Data Scientist
June 2024
Agenda
2
Introduction
About Me
Hajime Takeda (Jimmy)
4
Education & Career
Conferences
Expected Takeaways
5
Understand the key concepts and approaches of Causal Inference with Machine Learning
1
Learn how to do Uplift Modeling using “CausalML”
2
What is Causal inference?
Typical Scenario
7
“Awesome! We sent coupons to some users and their purchase rate was twice as high as others! That means the coupons must have doubled the purchase rate!”
Jessy
Marketing
|                   | Without Coupon | With Coupon |
| # of Customers    | 1,000          | 1,000       |
| # of Purchasers   | 100            | 200         |
| Purchase rate (%) | 10%            | 20%         |
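Jessy's arithmetic is easy to reproduce. A minimal sketch using the numbers from the table above (the next slides show why this "doubling" is misleading):

```python
# Numbers from the table: a naive comparison of purchase rates
customers_without, customers_with = 1000, 1000
purchasers_without, purchasers_with = 100, 200

rate_without = purchasers_without / customers_without  # 0.10
rate_with = purchasers_with / customers_with           # 0.20

# The "apparent effect" Jessy sees: the rate doubles
print(rate_with / rate_without)   # 2.0
print(rate_with - rate_without)   # 0.10 (10 percentage points)
```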
Is She Really Right?
8
Jessy
Marketing
“We sent coupons to customers who purchased
within the last 12 months”
Me
Data Scientist
“Congratulations, that's great news.
But may I ask who received the coupons?”
Pitfall: Selection Bias
9
Customers who received coupons might have purchased anyway.
Apparent effect from simple aggregation (purchase rate)
= true effect of the coupon + selection bias
Selection bias: the effect that would have occurred even without the coupon.
What is Causal Inference?
10
Treatment
Outcome
Causal Inference is the process of determining whether a cause-and-effect relationship exists between Treatment and Outcome.
🤒
😄
👞
🩳
👓
What is the challenge?
11
Counterfactual: we can only observe one outcome for the same individual.
👞
🩳
👓
Reality (Observed World)
Counterfactual (Hypothetical World)
Purchased
Can’t observe
Randomized Controlled Trial (RCT)
12
Participants are randomly assigned to either a treatment group or a control group. (RCT = A/B Test)
Treatment Group
Control Group
Purchase rate: 40%
Purchase rate: 20%
+20 points
Outcome
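As a sketch of how an RCT result like this would be checked for significance: a two-proportion z-test on the slide's rates, with group sizes of 1,000 per arm assumed for illustration (the slide does not give them):

```python
import math

# Assumed RCT result: 40% vs 20% conversion, n = 1,000 per arm (illustrative)
n_t, n_c = 1000, 1000
p_t, p_c = 0.40, 0.20

# Two-proportion z-test for the +20-point lift
p_pool = (p_t * n_t + p_c * n_c) / (n_t + n_c)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
z = (p_t - p_c) / se

print(round(z, 2))  # far beyond 1.96, so significant at the 5% level
```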
Possible Scenario
13
I can’t wait much longer to reach a conclusion.
It’s hard for me to convince external clients to run RCT.
Your mission is to maximize sales.
Boss
Sales
While RCT is simple and robust, it cannot always be used, especially if…
CEO
Limitations of RCT
14
RCT is time-consuming
RCT is hard when external stakeholders are involved
RCT can lead to opportunity loss
That’s why we use Causal Inference.
Simpson's Paradox - Challenges of Observed Data
15
Questions:
What’s the effect of exercise on cholesterol?
People who exercise frequently have high cholesterol (Is this true?)
Image: "The Book of Why" by Judea Pearl and Dana Mackenzie
Simpson's Paradox - Challenges of Observed Data
16
Age is an overlooked factor.
When viewed by age group, people who exercise frequently
have lower cholesterol levels.
Image: "The Book of Why" by Judea Pearl and Dana Mackenzie
Confounding Variable
17
Confounding Variable:
Age
Treatment:
Exercise
Outcome:
Cholesterol
Confounding variable affects both treatment and outcome.
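A minimal numeric illustration of the confounding above, with made-up numbers (not from the book or the slides) in which older people both exercise more and have higher cholesterol:

```python
# Each stratum: (n_exercise, chol_exercise, n_no_exercise, chol_no_exercise)
strata = {
    "young": (20, 160, 80, 170),
    "old":   (80, 220, 20, 230),
}

# Naive (unadjusted) comparison: pool everyone together
n_ex = sum(s[0] for s in strata.values())
n_no = sum(s[2] for s in strata.values())
mean_ex = sum(s[0] * s[1] for s in strata.values()) / n_ex
mean_no = sum(s[2] * s[3] for s in strata.values()) / n_no
naive = mean_ex - mean_no   # +26: exercisers LOOK worse overall

# Age-adjusted comparison: average the within-stratum differences,
# weighting each stratum by its total size
total = sum(s[0] + s[2] for s in strata.values())
adjusted = sum((s[1] - s[3]) * (s[0] + s[2]) / total for s in strata.values())
# -10: within every age group, exercisers are better off

print(naive, adjusted)
```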
What is Causal Machine Learning?
History
19
1970s - 1990s
2000s
2010s
Judea Pearl
Donald Rubin
Establishment of Traditional Causal Inference
(e.g., causal diagrams, Potential Outcomes Framework)
Machine learning technology evolved rapidly with the increase in data volume.
Big Data
Neural Networks
Fusion of Causal Inference and Machine Learning:
- New methods such as Causal Forest
- CausalML and EconML (2019)
2020s
From Causal Inference to Causal Machine Learning
20
Traditional Causal Inference
Causal Machine Learning
Handles higher-dimensional data and
captures complex relationships at the individual level
Common Questions
21
“Why should we use causal machine learning
when normal machine learning already offers
high predictive accuracy?”
PyData
Attendees
Because the objectives of normal ML and causal machine learning are different.
Machine Learning vs. Causal ML
22
|           | Machine Learning | Causal Machine Learning / Causal Inference |
| Purpose   | Prediction based on correlation | Estimating the treatment effect |
| Questions | What will the customer buy next? What is the probability? | Did the coupon increase sales? |
| Variables | X (Features): age, gender, coupon availability; Y (Target Variable): sales | X (Control Variables): age, gender; Z (Treatment Variable): coupon availability; Y (Outcome Variable): sales uplift |
Machine Learning focuses on “Prediction”,
while Causal Machine Learning focuses on “Causality”.
Two Use Cases in Marketing Science
23
Use Case #1
Measuring Treatment Effects
Use Case #2
Uplift Modeling
Use Case #1
Measuring Treatment Effects
What is Treatment Effect?
25
Treatment Effect: τ = Y(1) - Y(0)
τ (Tau): Treatment effect
Y (Outcome): Sales, purchase rate, etc.
Z (Treatment): Campaign, coupon, etc.
X (Confounders): Age, gender, location, preferences, past purchase history, etc.
Y(1): potential outcome with treatment (Z=1), e.g., 20%
Y(0): potential outcome without treatment (Z=0), e.g., 10%
τ = 20% - 10% = 10 points
Types of Treatment Effect
26
Term | Abbreviation | Formula | Definition |
Average Treatment Effect | ATE | τ_ATE = E[Y(1) − Y(0)] | Across the entire customer base |
Conditional Average Treatment Effect | CATE | τ_CATE(X) = E[Y(1) − Y(0) ∣ X] | Segment level (e.g., gender, age groups) |
Individual Treatment Effect | ITE | τ_i = Y_i(1) − Y_i(0) | Individual level |
There are different types of treatment effects
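A small simulation sketch of the three levels; both potential outcomes are known here only because the data is synthetic:

```python
import numpy as np

# Toy data with BOTH potential outcomes known (only possible in a simulation;
# in real data one of the two is always missing)
gender = np.array(["M", "M", "F", "F", "M", "F"])
y0 = np.array([0.1, 0.2, 0.3, 0.4, 0.2, 0.5])   # outcome without treatment
y1 = np.array([0.2, 0.3, 0.6, 0.7, 0.3, 0.8])   # outcome with treatment

ite = y1 - y0                       # ITE: one effect per individual
ate = ite.mean()                    # ATE: average over everyone
cate_m = ite[gender == "M"].mean()  # CATE: average within a segment
cate_f = ite[gender == "F"].mean()

print(ate, cate_m, cate_f)
```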
How to Calculate Treatment Effect
27
X | Z | Y | Y(1) | Y(0) | Y(1) - Y(0)
Name | Age | Gender | Location | Treatment | Outcome | Outcome with treatment | Outcome without treatment | ITE : Individual Treatment Effect |
Anne | 30 | Male | Urban | 0 | 0 | NA | 0 | ? |
Ben | 40 | Female | Urban | 0 | 0 | NA | 0 | ? |
Chris | 45 | Female | Rural | 0 | 1 | NA | 1 | ? |
Diana | 58 | Female | Rural | 0 | 0 | NA | 0 | ? |
Ethan | 25 | Female | Urban | 1 | 1 | 1 | NA | ? |
Faye | 38 | Male | Rural | 1 | 0 | 0 | NA | ? |
Gary | 42 | Male | Urban | 1 | 1 | 1 | NA | ? |
Helen | 60 | Male | Urban | 1 | 1 | 1 | NA | ? |
Question: Did a coupon increase sales?
Meta Learners
28
X | Z | Y | Y(1) | Y(0) | Y(1) - Y(0)
Name | Age | Gender | Location | Treatment | Outcome | Outcome with treatment | Outcome without treatment | ITE : Individual Treatment Effect |
Anne | 30 | Male | Urban | 0 | 0 | 0.9 | 0 | +0.9 |
Ben | 40 | Female | Urban | 0 | 0 | 0.8 | 0 | +0.8 |
Chris | 45 | Female | Rural | 0 | 1 | 0.9 | 1 | -0.1 |
Diana | 58 | Female | Rural | 0 | 0 | 0.6 | 0 | +0.6 |
Ethan | 25 | Female | Urban | 1 | 1 | 1 | 0.2 | +0.8 |
Faye | 38 | Male | Rural | 1 | 0 | 0 | 0.1 | -0.1 |
Gary | 42 | Male | Urban | 1 | 1 | 1 | 0.5 | +0.5 |
Helen | 60 | Male | Urban | 1 | 1 | 1 | 0.4 | +0.6 |
Meta learners are techniques that estimate treatment effects
by using ML models to impute the unobserved (counterfactual) outcomes
Types of Meta Learners
29
Learner | Approach | When to Use |
S Learner | Single Model Approach | - Small to medium datasets |
T Learner | Two-Model Approach | - Large datasets - Distinct treatment and control groups |
X Learner | Cross-Fitting Approach | - Heterogeneous effects - Imbalanced sample sizes |
R Learner | Residual Approach | - High-dimensional data - Robust confounder adjustment |
Four major methods: S Learner, T Learner, X Learner and R Learner
Simple
Complex & Robust
S Learner - Single Model Approach
30
Train
S Learner uses a single model for training and prediction
X | Z | Y | Y(1) | Y(0) | Y(1) - Y(0)
Name | Age | Gender | Location | Treatment | Outcome | Outcome with treatment | Outcome without treatment | ITE : Individual Treatment Effect |
Anne | 30 | Male | Urban | 0 | 0 | 0.4 | 0.1 | 0.3 |
Ben | 40 | Female | Urban | 0 | 0 | 0.6 | 0.2 | 0.4 |
Chris | 45 | Female | Rural | 0 | 1 | 0.7 | 0.5 | 0.2 |
Diana | 58 | Female | Rural | 0 | 0 | 0.9 | 0.6 | 0.3 |
Ethan | 25 | Female | Urban | 1 | 1 | 0.6 | 0.1 | 0.5 |
Faye | 38 | Male | Rural | 1 | 0 | 0.2 | 0.0 | 0.2 |
Gary | 42 | Male | Urban | 1 | 1 | 0.7 | 0.3 | 0.4 |
Helen | 60 | Male | Urban | 1 | 1 | 0.8 | 0.6 | 0.2 |
Model μ
Predict
S Learner - Pseudo Code
31
import xgboost as xgb

# Setting features (including the treatment flag) and target variable
X = df[['age', 'gender', 'location', 'treatment']]
y = df['sales']

# Training a single model
model = xgb.XGBRegressor()
model.fit(X, y)

# Predicting sales if treated and if not treated
df['sales_pred_treated'] = model.predict(X.assign(treatment=1))
df['sales_pred_untreated'] = model.predict(X.assign(treatment=0))

# Calculating the ITE per customer, then averaging to get the ATE
df['treatment_effect'] = df['sales_pred_treated'] - df['sales_pred_untreated']
ATE = df['treatment_effect'].mean()
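The pseudo code above can be run end-to-end on synthetic data. This sketch swaps XGBoost for scikit-learn's LinearRegression to avoid extra dependencies; the true effect of +3 is an assumption baked into the simulation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(20, 60, n)
treatment = rng.integers(0, 2, n)
# Ground truth: the coupon adds exactly +3 to sales
sales = 2.0 + 0.1 * age + 3.0 * treatment + rng.normal(0, 1, n)

# S Learner: one model, treatment included as a feature
X = np.column_stack([age, treatment])
model = LinearRegression().fit(X, sales)

# Predict under treatment=1 and treatment=0, then average the difference
pred_treated = model.predict(np.column_stack([age, np.ones(n)]))
pred_untreated = model.predict(np.column_stack([age, np.zeros(n)]))
ate = (pred_treated - pred_untreated).mean()
print(round(ate, 2))  # close to the true effect of 3
```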
T Learner - Two Model Approach
32
Uses separate models for treatment and control groups
X | Z | Y | Y(1) | Y(0) | Y(1) - Y(0)
Name | Age | Gender | Location | Treatment | Outcome | Outcome with treatment | Outcome without treatment | ITE : Individual Treatment Effect |
Anne | 30 | Male | Urban | 0 | 0 | 0.4 | 0.1 | 0.3 |
Ben | 40 | Female | Urban | 0 | 0 | 0.6 | 0.2 | 0.4 |
Chris | 45 | Female | Rural | 0 | 1 | 0.7 | 0.5 | 0.2 |
Diana | 58 | Female | Rural | 0 | 0 | 0.9 | 0.6 | 0.3 |
Ethan | 25 | Female | Urban | 1 | 1 | 0.6 | 0.1 | 0.5 |
Faye | 38 | Male | Rural | 1 | 0 | 0.2 | 0.0 | 0.2 |
Gary | 42 | Male | Urban | 1 | 1 | 0.7 | 0.3 | 0.4 |
Helen | 60 | Male | Urban | 1 | 1 | 0.8 | 0.6 | 0.2 |
Model (Treated)
Model (Untreated)
T Learner - Pseudo Code
33
import xgboost as xgb

# Splitting the data into treated and control groups
df_treated = df[df['treatment'] == 1]
df_control = df[df['treatment'] == 0]
# Training the model for the treated group
model_treated = xgb.XGBRegressor()
model_treated.fit(df_treated[['age', 'gender', 'location']], df_treated['sales'])
# Training the model for the untreated group
model_control = xgb.XGBRegressor()
model_control.fit(df_control[['age', 'gender', 'location']], df_control['sales'])
# Predicting sales
df['sales_pred_treated'] = model_treated.predict(df[['age', 'gender', 'location']])
df['sales_pred_control'] = model_control.predict(df[['age', 'gender', 'location']])
# Calculating ATE
df['treatment_effect'] = df['sales_pred_treated'] - df['sales_pred_control']
ATE = df['treatment_effect'].mean()
Types of Meta Learners
34
Learner | Approach | When to Use |
S Learner | Single Model Approach | - Small to medium datasets |
T Learner | Two-Model Approach | - Large datasets - Distinct treatment and control groups |
X Learner | Cross-Fitting Approach | - Heterogeneous effects - Imbalanced sample sizes |
R Learner | Residual Approach | - High-dimensional data - Robust confounder adjustment |
Four major methods: S Learner, T Learner, X Learner and R Learner
Simple
Complex & Robust
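The X Learner's cross-fitting steps can be sketched from scratch. A minimal illustration with sklearn, assuming a synthetic heterogeneous effect τ(x) = 1 + 2x (CausalML's BaseXRegressor wraps the same idea):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1, 1, n)
t = rng.integers(0, 2, n)
tau = 1 + 2 * x                                  # true heterogeneous effect
y = x + tau * t + rng.normal(0, 0.5, n)

X = x.reshape(-1, 1)
treated, control = t == 1, t == 0

# Step 1: outcome models per group (same as the T Learner)
mu0 = LinearRegression().fit(X[control], y[control])
mu1 = LinearRegression().fit(X[treated], y[treated])

# Step 2: imputed individual effects using the opposite group's model
d1 = y[treated] - mu0.predict(X[treated])        # effects for the treated
d0 = mu1.predict(X[control]) - y[control]        # effects for the control

# Step 3: model the imputed effects as a function of x
tau1 = LinearRegression().fit(X[treated], d1)
tau0 = LinearRegression().fit(X[control], d0)

# Step 4: blend with the propensity score (constant here, since t is random)
e = treated.mean()
tau_hat = e * tau0.predict(X) + (1 - e) * tau1.predict(X)
print(np.mean((tau_hat - tau) ** 2))  # small: heterogeneous effect recovered
```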
Use Case #2
Uplift Modeling
What is Uplift Modeling?
36
Treatment effect
τ = Y(1) − Y(0)
Prioritized customers
Uplift modeling identifies customers who are influenced positively by marketing offers.
Segmentation of Customers
37
| Persuadables: will buy only if they receive an incentive | Sure Things: will buy no matter what |
| Lost Causes: won’t buy regardless of the campaign | Sleeping Dogs: won’t buy if they receive an incentive |
If treated
if NOT treated
😊
😄👍
😔👎
Focus marketing efforts on the Persuadables
Buy
Won’t Buy
Buy
Won’t Buy
Two Methods for Uplift Modeling
38
Meta Learners
Decision Tree Based Method
While uplift modeling can also be implemented with Meta Learners,
decision-tree-based methods are a common approach
Traditional Decision Tree to Uplift Tree
39
Will the customer make a purchase?
(Prediction)
Who should we give coupons to?
(Causality)
Traditional Decision Tree
Uplift Tree
Is the customer XX?
Yes
No
Is the customer YY?
Yes
No
Treated
Not treated
Treated
Not treated
Traditional Decision Tree
40
Age
>= 30
< 29
This tree tries to identify: “Will the customer make a purchase?”
How can we construct a better decision tree?
Location
Urban
Rural
Purchased
Not Purchased
Purchasers are mixed between the two clusters
Purchasers are grouped into one cluster
For split criteria, “Gini Impurity” is used
41
Age
>= 30
< 29
Location
Urban
Rural
Purchased
Not Purchased
Gini impurity = 1 − p1² − p0², where p1 is the probability of purchase and p0 = 1 − p1 is the probability of no purchase.
Gini impurity = 0.44
High Impurity
Gini impurity = 0
Low Impurity
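The impurity numbers on the slide can be checked with the formula. The exact class counts behind the 0.44 aren't shown, so the 2/3 split below is a guess that happens to reproduce it:

```python
# Gini impurity for a binary node: 1 - p1^2 - p0^2,
# where p1 is the purchase probability in the node and p0 = 1 - p1
def gini(p1: float) -> float:
    p0 = 1 - p1
    return 1 - p1**2 - p0**2

# A pure node (everyone purchases, or no one does) has zero impurity,
# matching the slide's right-hand split
print(gini(1.0), gini(0.0))   # 0.0 0.0

# A node where one class makes up 2/3 reproduces the slide's 0.44
print(round(gini(2 / 3), 2))  # 0.44
```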
Reference: Calculation of Gini Impurity for the Left Tree
42
Purchased
Not Purchased
Age
>= 30
< 29
Gini impurity = 0.44
High Impurity
Reference: Calculation of Gini Impurity for the Right Tree
43
Purchased
Not Purchased
Location
Urban
Rural
Gini impurity = 0
Low Impurity
Uplift Tree
44
Age
>= 30
< 29
This uplift tree tries to identify: “Who should we give coupons to?”
How can we construct a better causal decision tree?
Location
Urban
Rural
Purchased
Not Purchased
Treated
Not treated
Treated
Not treated
Treated
Not treated
Treated
Not treated
Purchasers and non-purchasers are mixed together
within the same clusters
Purchasers and non-purchasers are cleanly separated
For split criteria, “Squared Euclidean Distance” is used
45
Purchased
Not Purchased
Distance = 1
Low divergence
Distance = 4
High divergence
Age
>= 30
< 29
Location
Urban
Rural
Treated
Not treated
Treated
Not treated
Treated
Not treated
Treated
Not treated
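A sketch of the split criterion: the squared Euclidean distance between the treated and control outcome distributions in a node. The probabilities below are illustrative, not the counts behind the slide's 1 and 4:

```python
# Squared Euclidean distance between two outcome distributions:
# sum over outcomes k of (p_treated[k] - p_control[k])^2.
# The larger the divergence, the better the split isolates the uplift signal.
def squared_euclidean(p_treated, p_control):
    return sum((pt - pc) ** 2 for pt, pc in zip(p_treated, p_control))

# Distributions over (purchase, no purchase)
same = squared_euclidean([0.5, 0.5], [0.5, 0.5])   # identical groups -> 0
apart = squared_euclidean([0.9, 0.1], [0.1, 0.9])  # very different -> ~1.28
print(same, apart)
```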
Reference: Euclidean Distance Calculation for the Left Tree
46
Purchased
Not Purchased
Age
>= 30
< 29
Treated
Not treated
Treated
Not treated
Reference: Euclidean Distance Calculation for the Right Tree
47
Purchased
Not Purchased
Location
Urban
Rural
Treated
Not treated
Treated
Not treated
Code with CausalML
Useful Libraries
49
Library | Features | GitHub |
CausalML | Uplift modeling and causal inference with ML, by Uber | uber/causalml (4.8k stars) |
EconML | Heterogeneous treatment effect estimation, by Microsoft (PyWhy) | py-why/EconML (3.6k stars) |
Code Walkthrough with CausalML today!
Get the Full Code Here!
50
https://github.com/takechanman1228/Effective-Uplif-Modeling
https://bit.ly/uplift-modeling
Importing Necessary Libraries
51
Criteo Uplift Prediction Dataset
52
Data Key figures
53
13M rows
Treatment Ratio: 85%
Difference in Conversion Rate
54
Average Treatment Effect
55
Decompose Observed CVR in Treatment Group
56
Uplift modeling
57
Feature Importance
58
Visualization of Uplift Tree Structure
59
Preparation for Evaluation
60
Uplift Curve : Total Cumulative Gain
61
Uplift compared to random targeting
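A minimal sketch of the idea behind the curve: rank customers by predicted uplift and compare treated-vs-control conversion within the top fraction. The arrays and the helper name `uplift_at` are toy inventions, not the notebook's code:

```python
import numpy as np

# Uplift at the top-k fraction: sort by predicted uplift, then measure
# the conversion gap between treated and control customers in that slice
def uplift_at(score, treatment, outcome, frac):
    order = np.argsort(-score)                # best predicted uplift first
    top = order[: int(len(score) * frac)]
    t, y = treatment[top], outcome[top]
    return y[t == 1].mean() - y[t == 0].mean()

score     = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
treatment = np.array([1,   0,   1,   0,   1,   0,   1,   0  ])
outcome   = np.array([1,   0,   1,   0,   0,   0,   1,   1  ])

# In the top half, treated customers convert and control customers do not
print(uplift_at(score, treatment, outcome, 0.5))  # 1.0
```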
AUUC (Area Under the Uplift Curve)
62
Extract User ID to Be Targeted
63
Summary
Summary
65
Reference
66
Questions/Collaboration
67
takeda.hajime.ja@gmail.com
linkedin.com/in/hajime-takeda
Feel free to contact me!
THANK YOU
Frequently Asked Questions
69
My use case
70
My Personal Recommendation
71