1 of 71

Uplift Modeling: How to Enhance Customer Targeting in Marketing with Causal Machine Learning

Hajime Takeda (Jimmy)

Data Scientist

June 2024


Agenda

  • What is Causal Inference?
  • What is Causal Inference with Machine Learning?
  • Use case #1 : Measuring Treatment Effects
  • Use case #2 : Uplift Modeling
  • Code with “CausalML”
  • Summary


Introduction


About Me

Hajime Takeda (Jimmy)


Education & Career

  • Master’s in CS from Kyoto University 🇯🇵
  • Data Analyst at Procter & Gamble 🇯🇵
  • Data Scientist at fashion company (E-commerce) 🇺🇸

Conferences

  • PyData Global Speaker (2022) 🌎
  • ODSC EAST Speaker (2023) 🇺🇸
  • PyData NYC Speaker (2023) 🇺🇸

Expected Takeaways

  1. Understand the key concepts and approaches of Causal Inference with Machine Learning
  2. Learn how to do Uplift Modeling using “CausalML”


What is Causal inference?

Typical Scenario

“Awesome! We sent coupons to some users and their purchase rate was twice as high as others! That means the coupons must have doubled the purchase rate!”

Jessy (Marketing)

                      Without Coupon   With Coupon
  # of Customers      1,000            1,000
  # of Purchasers     100              200
  Purchase rate (%)   10%              20%

Is She Really Right?

Me (Data Scientist): “Congratulations, that's great news. But may I ask who received the coupons?”

Jessy (Marketing): “We sent coupons to customers who purchased within the last 12 months.”

Pitfall: Selection Bias

Customers who received coupons might have purchased anyway.

Apparent effect from simple aggregation = true effect of the coupon + selection bias

Selection bias: the lift in purchase rate that would have occurred even without the coupon.

What is Causal Inference?

Causal Inference is the process of determining whether a cause-and-effect relationship exists between a Treatment and an Outcome.

What is the challenge?

Counterfactual: we can only observe one outcome from the same individual. In reality (the observed world) the customer purchased; the counterfactual (hypothetical world) can't be observed.

Randomized Controlled Trial (RCT)

Participants are randomly assigned to either a treatment group or a control group. RCT = A/B test.

  • Treatment group: purchase rate 40%
  • Control group: purchase rate 20%
  • Outcome: +20 points

Possible Scenario

While RCT is simple and robust, it cannot be used especially if…

  • CEO: “I can't wait much longer to reach a conclusion.”
  • Sales: “It's hard for me to convince external clients to run an RCT.”
  • Boss: “Your mission is to maximize sales.”

Limitations of RCT

RCT is time-consuming
  • Experiments take a lot of time.
  • You can't analyze past data that didn't use an RCT.

RCT can lead to opportunity loss
  • Half of the customers don't receive the treatment, leading to potential opportunity loss.

RCT is hard when external stakeholders are involved
  • Clients may disagree with running a coupon test on randomized customers.
  • Clients may not be capable of executing an RCT.

That's why we use Causal Inference.

Simpson's Paradox - Challenges of Observed Data

Question: What is the effect of exercise on cholesterol?

The aggregated data suggest that people who exercise frequently have high cholesterol. Is this true?

Image: "The Book of Why" by Judea Pearl and Dana Mackenzie

Simpson's Paradox - Challenges of Observed Data

Age is an overlooked factor. When viewed by age group, people who exercise frequently have lower cholesterol levels.

Image: "The Book of Why" by Judea Pearl and Dana Mackenzie

Confounding Variable

A confounding variable affects both the treatment and the outcome.

  • Treatment: Exercise
  • Outcome: Cholesterol
  • Confounding variable: Age


What is Causal Machine Learning?

History

  • 1970s - 1990s: Establishment of traditional causal inference by Donald Rubin (Potential Outcomes Framework) and Judea Pearl (causal diagrams)
  • 2000s - 2010s: Machine learning technology evolved rapidly with the increase in data volume (big data, neural networks)
  • 2010s - 2020s: Fusion of causal inference and machine learning: new methods such as Causal Forest; CausalML and EconML released (2019)

From Causal Inference to Causal Machine Learning

Traditional Causal Inference
  • Typically assumes low-dimensional data
  • Basically relies on linear models
  • Generally estimates the Average Treatment Effect (ATE)

Causal Machine Learning
  • Handles high-dimensional data
  • Captures complex relationships using non-linear models
  • Estimates effects at subgroup or individual levels

Causal ML handles higher-dimensional data and captures complex relationships at the individual level.

Common Questions

PyData attendees: “Why should we use causal machine learning when normal machine learning already offers high predictive accuracy?”

Because the objectives of normal ML and causal machine learning are different.

Machine Learning vs. Causal ML

Machine Learning
  • Purpose: prediction based on correlation
  • Questions: What will the customer buy next? What is the probability?
  • Variables: X (features): age, gender, coupon availability; Y (target variable): sales

Causal Machine Learning / Causal Inference
  • Purpose: estimating the treatment effect
  • Questions: Did the coupon increase sales?
  • Variables: X (control variables): age, gender; Z (treatment variable): coupon availability; Y (outcome variable): sales uplift

Machine Learning focuses on “Prediction”, while Causal Machine Learning focuses on “Causality”.

Two Use Cases in Marketing Science

Use Case #1: Measuring Treatment Effects
  • Using Meta Learners

Use Case #2: Uplift Modeling
  • The goal is to select the right users for targeting
  • Using uplift trees / uplift random forests


Use Case #1

Measuring Treatment Effects

What is Treatment Effect?

Treatment Effect: τ = Y(1) − Y(0)

  • τ (tau): treatment effect
  • Y (outcome): sales, purchase rate, etc.
  • Z (treatment): campaign, coupon, etc.
  • X (confounders): age, gender, living location, preference, past purchase history, etc.

Example: Y(1) = 20% (potential outcome with treatment, Z = 1) and Y(0) = 10% (potential outcome without treatment, Z = 0), so τ = 20% − 10% = 10 points.

Types of Treatment Effect

There are different types of treatment effects:

  • Average Treatment Effect (ATE): τ_ATE = E[Y(1) − Y(0)], across the entire customer base
  • Conditional Average Treatment Effect (CATE): τ_CATE(X) = E[Y(1) − Y(0) | X], at the segment level (e.g., gender, age groups)
  • Individual Treatment Effect (ITE): τ_i = Y_i(1) − Y_i(0), at the individual level
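The ATE and CATE definitions above can be computed directly on an RCT-style table. A minimal pandas sketch on toy data (the numbers are made up for illustration):

```python
import pandas as pd

# Toy RCT data: 'z' is the treatment flag, 'y' the outcome (purchase).
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M", "F", "M"],
    "z":      [1,   0,   1,   0,   1,   0,   1,   0],
    "y":      [1,   0,   1,   1,   1,   0,   0,   0],
})

# ATE: mean outcome of the treated minus mean outcome of the controls.
ate = df.loc[df.z == 1, "y"].mean() - df.loc[df.z == 0, "y"].mean()

# CATE: the same difference, computed within each segment.
by_segment = df.pivot_table(index="gender", columns="z", values="y", aggfunc="mean")
cate = by_segment[1] - by_segment[0]

print(ate)
print(cate)
```

Note that taking the raw difference in means like this only estimates the ATE when treatment was randomized; with observational data, confounders must be adjusted for first.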

How to Calculate Treatment Effect

Question: Did a coupon increase sales?

  Name    Age   Gender   Location   Z (Treatment)   Y (Outcome)   Y(1)   Y(0)   ITE: Y(1) − Y(0)
  Anne    30    Male     Urban      0               0             NA     0      ?
  Ben     40    Female   Urban      0               0             NA     0      ?
  Chris   45    Female   Rural      0               1             NA     1      ?
  Diana   58    Female   Rural      0               0             NA     0      ?
  Ethan   25    Female   Urban      1               1             1      NA     ?
  Faye    38    Male     Rural      1               0             0      NA     ?
  Gary    42    Male     Urban      1               1             1      NA     ?
  Helen   60    Male     Urban      1               1             1      NA     ?

Y(1) is the outcome with treatment and Y(0) the outcome without treatment. Only one of the two is ever observed for a given customer; the other is NA, so the ITE ("?") cannot be computed directly from the data.

Meta Learners

Meta learners are techniques designed to estimate treatment effects by using ML models to handle the unobserved outcomes. The missing potential outcomes (the NA cells) are imputed by a model, which makes the ITE computable:

  Name    Age   Gender   Location   Z (Treatment)   Y (Outcome)   Y(1)   Y(0)   ITE: Y(1) − Y(0)
  Anne    30    Male     Urban      0               0             0.9    0      +0.9
  Ben     40    Female   Urban      0               0             0.8    0      +0.8
  Chris   45    Female   Rural      0               1             0.9    1      -0.1
  Diana   58    Female   Rural      0               0             0.6    0      +0.6
  Ethan   25    Female   Urban      1               1             1      0.2    +0.8
  Faye    38    Male     Rural      1               0             0      0.1    -0.1
  Gary    42    Male     Urban      1               1             1      0.5    +0.5
  Helen   60    Male     Urban      1               1             1      0.4    +0.6

Types of Meta Learners

Four major methods, ordered from simple to complex & robust: S Learner, T Learner, X Learner, and R Learner.

  • S Learner (single-model approach): small to medium datasets
  • T Learner (two-model approach): large datasets; distinct treatment and control groups
  • X Learner (cross-fitting approach): heterogeneous effects; imbalanced sample sizes
  • R Learner (residual approach): high-dimensional data; robust confounder adjustment

S Learner - Single Model Approach

S Learner uses a single model μ for training and prediction: train μ on the features X plus the treatment flag Z, then predict each customer's outcome twice, once with Z = 1 and once with Z = 0.

  Name    Age   Gender   Location   Z (Treatment)   Y (Outcome)   Y(1)   Y(0)   ITE: Y(1) − Y(0)
  Anne    30    Male     Urban      0               0             0.4    0.1    0.3
  Ben     40    Female   Urban      0               0             0.6    0.2    0.4
  Chris   45    Female   Rural      0               1             0.7    0.5    0.2
  Diana   58    Female   Rural      0               0             0.9    0.6    0.3
  Ethan   25    Female   Urban      1               1             0.6    0.1    0.5
  Faye    38    Male     Rural      1               0             0.2    0.0    0.2
  Gary    42    Male     Urban      1               1             0.7    0.3    0.4
  Helen   60    Male     Urban      1               1             0.8    0.6    0.2

S Learner - Pseudo Code

import xgboost as xgb

# Setting features and target variable
X = df[['age', 'gender', 'location', 'treatment']]
y = df['sales']

# Training the model
model = xgb.XGBRegressor()
model.fit(X, y)

# Predicting sales if treated and if not treated
df['sales_pred_treated'] = model.predict(X.assign(treatment=1))
df['sales_pred_untreated'] = model.predict(X.assign(treatment=0))

# Calculating ATE
df['treatment_effect'] = df['sales_pred_treated'] - df['sales_pred_untreated']
ATE = df['treatment_effect'].mean()

T Learner - Two Model Approach

T Learner uses separate models for the treatment and control groups: train one model on treated customers and another on untreated customers, then predict both potential outcomes for everyone.

  Name    Age   Gender   Location   Z (Treatment)   Y (Outcome)   Y(1)   Y(0)   ITE: Y(1) − Y(0)
  Anne    30    Male     Urban      0               0             0.4    0.1    0.3
  Ben     40    Female   Urban      0               0             0.6    0.2    0.4
  Chris   45    Female   Rural      0               1             0.7    0.5    0.2
  Diana   58    Female   Rural      0               0             0.9    0.6    0.3
  Ethan   25    Female   Urban      1               1             0.6    0.1    0.5
  Faye    38    Male     Rural      1               0             0.2    0.0    0.2
  Gary    42    Male     Urban      1               1             0.7    0.3    0.4
  Helen   60    Male     Urban      1               1             0.8    0.6    0.2

T Learner - Pseudo Code

import xgboost as xgb

# Splitting the data into treated and control groups
df_treated = df[df['treatment'] == 1]
df_control = df[df['treatment'] == 0]

# Training the model for the treated group
model_treated = xgb.XGBRegressor()
model_treated.fit(df_treated[['age', 'gender', 'location']], df_treated['sales'])

# Training the model for the control group
model_control = xgb.XGBRegressor()
model_control.fit(df_control[['age', 'gender', 'location']], df_control['sales'])

# Predicting sales for everyone with both models
df['sales_pred_treated'] = model_treated.predict(df[['age', 'gender', 'location']])
df['sales_pred_control'] = model_control.predict(df[['age', 'gender', 'location']])

# Calculating ATE
df['treatment_effect'] = df['sales_pred_treated'] - df['sales_pred_control']
ATE = df['treatment_effect'].mean()

Types of Meta Learners

Four major methods, ordered from simple to complex & robust: S Learner, T Learner, X Learner, and R Learner.

  • S Learner (single-model approach): small to medium datasets
  • T Learner (two-model approach): large datasets; distinct treatment and control groups
  • X Learner (cross-fitting approach): heterogeneous effects; imbalanced sample sizes
  • R Learner (residual approach): high-dimensional data; robust confounder adjustment


Use Case #2

Uplift Modeling

What is Uplift Modeling?

Uplift modeling estimates the treatment effect τ = Y(1) − Y(0) for each customer and uses it to prioritize customers who are influenced positively by marketing offers.

Segmentation of Customers

  Segment            If treated   If NOT treated
  Persuadables 😄👍   Buy          Won't buy
  Sure Things 😊      Buy          Buy
  Lost Causes        Won't buy    Won't buy
  Sleeping Dogs 😔👎  Won't buy    Buy

  • Persuadables: will buy only if they receive an incentive
  • Sure Things: will buy no matter what
  • Lost Causes: won't buy regardless of the campaign
  • Sleeping Dogs: won't buy if they receive an incentive

Focus marketing efforts on the Persuadables.

Two Methods for Uplift Modeling

Meta Learners
  • Predict the outcome with treatment and the outcome without treatment separately, and then calculate the uplift
  • e.g., T Learner (two-model approach)

Decision-Tree-Based Methods
  • e.g., Uplift Trees, Uplift Random Forests
  • Some algorithms support multiple treatment groups
  • Provide feature importance

While uplift modeling can also be implemented with Meta Learners, decision-tree-based methods are a common approach.

Traditional Decision Tree to Uplift Tree

  • Traditional decision tree: predicts “Will the customer make a purchase?” (prediction). Each node splits on a customer attribute (“Is the customer XX?”: Yes / No).
  • Uplift tree: answers “Who should we give coupons to?” (causality). Each leaf compares outcomes between treated and not-treated customers.

Traditional Decision Tree

This tree tries to identify: “Will the customer make a purchase?” How can we construct a better decision tree?

  • Age split (>= 30 vs. < 30): purchasers are mixed between the two clusters
  • Location split (Urban vs. Rural): purchasers are grouped into one cluster

For split criteria, “Gini Impurity” is used

Gini impurity = 1 − p1² − p0²

Here, p1 is the probability of purchase and p0 is the probability of no purchase.

  • Age split: Gini impurity = 0.44 (high impurity)
  • Location split: Gini impurity = 0 (low impurity)
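The Gini impurity criterion fits in a few lines of Python. The 1/3 purchase probability below is an assumed example that happens to reproduce the slide's 0.44:

```python
# Gini impurity for a binary node: 1 - p1^2 - p0^2.
def gini_impurity(p1: float) -> float:
    """p1 is the probability of purchase in the node."""
    p0 = 1.0 - p1
    return 1.0 - p1**2 - p0**2

# A mixed node where 1/3 of customers purchased: high impurity.
print(round(gini_impurity(1/3), 2))  # 0.44
# A pure node where everyone purchased: zero impurity.
print(gini_impurity(1.0))            # 0.0
```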

Reference: Calculation of Gini Impurity for the Left Tree

For the Age split, the purchased / not-purchased mix within the nodes gives: Gini impurity = 0.44 (high impurity).

Reference: Calculation of Gini Impurity for the Right Tree

For the Location split, purchasers and non-purchasers are perfectly separated: Gini impurity = 0 (low impurity).

Uplift Tree

This uplift tree tries to identify: “Who should we give coupons to?” How can we construct a better causal decision tree?

  • Age split (>= 30 vs. < 30): purchasers and non-purchasers are mixed together within the same clusters
  • Location split (Urban vs. Rural): purchasers and non-purchasers are cleanly separated between the treated and not-treated groups

For split criteria, “Squared Euclidean Distance” is used

D(P, Q) = (P(1) − Q(1))² + (P(0) − Q(0))²

  • P(1) / P(0) is the probability of purchase / no purchase in the treatment group.
  • Q(1) / Q(0) is the probability of purchase / no purchase in the control group.

The larger the distance between the treatment and control outcome distributions, the better the split:

  • Age split: Distance = 1 (low divergence)
  • Location split: Distance = 4 (high divergence)
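The squared Euclidean distance criterion is also a one-liner. The example distributions below are assumed for illustration, not taken from the slides:

```python
# Squared Euclidean distance between the treatment-group and
# control-group outcome distributions after a split.
def squared_euclidean(p: list[float], q: list[float]) -> float:
    """p, q: [P(purchase), P(no purchase)] for treatment and control."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

# Identical distributions: zero divergence (a useless split).
print(squared_euclidean([0.5, 0.5], [0.5, 0.5]))  # 0.0
# Fully separated distributions: maximal divergence (a great split).
print(squared_euclidean([1.0, 0.0], [0.0, 1.0]))  # 2.0
```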

Reference: Euclidean Distance Calculation for the Left Tree

For the Age split, the treated and not-treated purchase distributions are close: Distance = 1 (low divergence).

Reference: Euclidean Distance Calculation for the Right Tree

For the Location split, the treated and not-treated purchase distributions diverge strongly: Distance = 4 (high divergence).


Code with CausalML

Useful Libraries

CausalML (uber/causalml, 4.8k stars)
  • Focus on uplift modeling and Meta Learners
  • Designed as a standalone tool
  • Developed by Uber

EconML (py-why/EconML, 3.6k stars)
  • Covers a wide range of algorithms, strong in economics
  • Part of the bigger DoWhy ecosystem
  • Developed by Microsoft Research

Code walkthrough with CausalML today!

Get the Full Code Here!

https://github.com/takechanman1228/Effective-Uplif-Modeling

https://bit.ly/uplift-modeling

Importing Necessary Libraries
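The original slide shows a screenshot of the import cell. As a minimal sketch of the stack this walkthrough relies on, the imports might look like the following; the CausalML-specific ones are shown as comments because they assume `causalml` is installed:

```python
import numpy as np
import pandas as pd

# Uplift-specific imports, assuming the `causalml` package is installed:
# from causalml.inference.meta import BaseTRegressor
# from causalml.inference.tree import UpliftRandomForestClassifier
# from causalml.metrics import plot_gain, auuc_score

print(np.__version__, pd.__version__)
```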

Criteo Uplift Prediction Dataset

  • Fetch Criteo data via the sklift library
  • Data: https://ailab.criteo.com/criteo-uplift-prediction-dataset

Data Key Figures

  • 13M rows
  • Treatment ratio: 85%


Difference in Conversion Rate

  • Difference in Conversion Rate between Treatment and Control Groups: 0.12% (Treatment: 0.31%, Control: 0.19%)
  • Note: This is the simple difference in conversion rates and not the true treatment effect.
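The simple difference above can be reproduced with plain pandas. The toy numbers below are made up and do not match the Criteo figures:

```python
import pandas as pd

# Toy data with the same shape as the Criteo dataset:
# a 'treatment' flag and a 'conversion' outcome per user.
df = pd.DataFrame({
    "treatment":  [1, 1, 1, 1, 0, 0, 0, 0],
    "conversion": [1, 0, 1, 0, 1, 0, 0, 0],
})

# Conversion rate per group, then the simple (biased) difference.
cvr = df.groupby("treatment")["conversion"].mean()
diff = cvr[1] - cvr[0]
print(f"Difference in conversion rate: {diff:.2%}")
```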


Average Treatment Effect

  • Calculating ATE with Meta Learners: Using XGBoost and T-Learner
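The slide uses CausalML's T-Learner with XGBoost; as a library-agnostic sketch of the same idea, here is a T-Learner with scikit-learn on synthetic data where the true effect is +2 (all names and numbers below are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: outcome depends on one feature plus a +2 treatment effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
z = rng.integers(0, 2, size=1000)   # treatment flag
y = X[:, 0] + 2.0 * z               # true ATE = 2

# T-Learner: one model per group, then the difference of predictions.
model_treated = LinearRegression().fit(X[z == 1], y[z == 1])
model_control = LinearRegression().fit(X[z == 0], y[z == 0])
ate = (model_treated.predict(X) - model_control.predict(X)).mean()
print(round(ate, 3))  # ~2.0
```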


Decompose Observed CVR in Treatment Group

  • Observed CVR in Treatment Group = Base CVR from Control Group + ATE (True Effect) + Selection Bias


Uplift modeling

  • Train the Uplift Random Forest model (uplift_rf) and predict the uplift (y_pred)
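The slide trains CausalML's UpliftRandomForestClassifier. As a library-agnostic stand-in for the same uplift-scoring idea, here is a two-model version with scikit-learn on synthetic data (the data-generating process and all names are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: treatment only helps users with feature 0 > 0.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 4))
z = rng.integers(0, 2, size=2000)
p = 0.05 + 0.30 * z * (X[:, 0] > 0)   # conversion probability
y = rng.binomial(1, p)

# Two-model uplift: fit one classifier per group.
m1 = RandomForestClassifier(random_state=0).fit(X[z == 1], y[z == 1])
m0 = RandomForestClassifier(random_state=0).fit(X[z == 0], y[z == 0])

# Predicted uplift = P(convert | treated) - P(convert | not treated).
y_pred = m1.predict_proba(X)[:, 1] - m0.predict_proba(X)[:, 1]
print(y_pred[:5].round(2))
```

CausalML's uplift forest optimizes the split criterion for uplift directly instead of differencing two separate models, which is why it is the preferred tool here.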


Feature Importance


Visualization of Uplift Tree Structure


Preparation for Evaluation


Uplift Curve : Total Cumulative Gain

  • Targeting just 20% of all users can achieve about 80% of the result we would get by targeting everyone


Uplift compared to random targeting
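The uplift curve's takeaway can be illustrated with a small cumulative-gain computation: sort users by predicted uplift and accumulate their true uplift (the scores below are toy values assumed for illustration):

```python
import numpy as np

# Toy per-user true uplift and a model's predicted uplift scores.
true_uplift = np.array([0.8, 0.5, 0.3, 0.1, 0.0, 0.0, -0.1, -0.2])
pred_uplift = np.array([0.7, 0.6, 0.2, 0.2, 0.1, 0.0, -0.1, -0.3])

order = np.argsort(-pred_uplift)          # best-scored users first
cum_gain = np.cumsum(true_uplift[order])  # cumulative gain curve
total = true_uplift.sum()

# Share of the total gain captured by targeting the top 25% of users.
top_share = cum_gain[1] / total
print(round(top_share, 2))
```

Here a well-ranked top quarter of users already captures most of the achievable gain, which is exactly what the uplift curve on the slide shows at scale.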


AUUC (Area Under the Uplift Curve)

  • Evaluate the modeling using AUUC score.
  • The concept is similar to AUC (Area Under the ROC Curve).
  • The closer the AUUC is to 1, the better.


Extract User ID to Be Targeted

  • Extract the customer IDs who should be targeted.
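Selecting the target list is a simple sort-and-slice on the predicted uplift scores. A minimal pandas sketch with hypothetical user IDs and scores:

```python
import pandas as pd

# Hypothetical predicted uplift per user.
scores = pd.DataFrame({
    "user_id": [101, 102, 103, 104, 105],
    "uplift":  [0.12, -0.02, 0.30, 0.05, 0.21],
})

# Target the top-k users by predicted uplift.
top_k = 2
target_ids = (
    scores.sort_values("uplift", ascending=False)
          .head(top_k)["user_id"]
          .tolist()
)
print(target_ids)  # [103, 105]
```

In practice the cutoff would come from the uplift curve (e.g., the top 20%) rather than a fixed k.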


Summary

Summary

  1. When measuring the treatment effect of marketing activities, it is important to be mindful of selection bias and to control for confounding variables.
  2. Meta learners are techniques designed to estimate treatment effects by using ML models to handle unobserved outcomes.
  3. Uplift Modeling enables the identification of customers who are most likely to respond positively to treatments, thereby improving marketing ROI.
  4. If you're new to uplift modeling, CausalML is a good first step.


Questions/Collaboration

takeda.hajime.ja@gmail.com

linkedin.com/in/hajime-takeda

Feel free to contact me!


THANK YOU

Frequently Asked Questions

  • Are causal graphs used in uplift modeling and the meta-learner approach?
    • No. There are two major frameworks for causal inference:
      • (A) Potential Outcomes Framework (Rubin Causal Model): emphasizes the estimation of causal effects. Uplift modeling and meta learners fall under this framework.
      • (B) Causal graphs (Judea Pearl): focus on visualizing and understanding the causal structure of the data.
  • Any other causal machine learning methods?
    • “CausalImpact”: effective for causal inference on time-series data.
    • Causal Discovery: utilizes neural networks to discover and estimate causal relationships.

My Use Case

  • (A) Coupon Targeting (Uplift Modeling)
    • 5% off / 10% off / 15% off
  • (B) Loyalty Program Analysis (Measuring Treatment Effects)
    • The loyalty program is highly correlated with customers' past purchase history, making unbiased evaluation challenging.
    • Used CausalML to estimate the pure relationship between the loyalty program and purchases.

My Personal Recommendation

  1. Be aware of selection bias in regular analysis
    • If selection bias may exist, consider using a causal inference approach.
    • In e-commerce and retail, baseline purchase rates can vary significantly by demographics, past purchase frequency, and referral source.
  2. Set appropriate time frames
    • Set the conversion window considering your business context.
    • For high-priced items, customers may need a long consideration period before purchasing, whereas low-priced items are often bought quickly.
  3. Use RCT if possible
    • While causal machine learning has evolved, the gold standard for answering causal questions is the RCT.
    • Using a causal inference approach first and then conducting RCTs to validate your findings can work well.