1 of 33

The Dangers of Storytelling with Feature Importance

PyData NYC 2023

Roni Kobrosly, Ph.D.

2 of 33

Today’s talk

  • What is a “model data story”?
    • Feature importance in a minute
  • Correlation and causation and ice cream
    • The problem of confounding
    • Ice cream and crime
  • What good are feature importance metrics anyways?
    • Parsimonious models and mathematical explainability
  • Causal feature importance through Meta Learners
    • A closer look at customer churn, with a DAG
    • Prepping for Meta Learning
    • The simplest Meta Learner (S-Learner) explained three ways
    • Alternative Meta Learners
    • The causalml package API
    • Average and heterogeneous importance
    • There is no free lunch

3 of 33

An unfortunately common data science workflow

  1. Discuss churn problem with stakeholder team
  2. Exploratory data analysis
  3. Train and evaluate model
  4. Estimate feature importance
  5. Create “data story” from model and feature importance values
  6. Socialize story with stakeholders, with recommendations for reducing churn

4 of 33


5 of 33

6 of 33

7 of 33

The problem of confounding

[DAG: a Confounder points to both X and Y, inducing a spurious association between X and Y]

8 of 33

Summer weather induces a false association between ice cream sales and crime

[DAG: Hot Weather points to both Ice Cream Sales and Crime Rate; the direct Ice Cream Sales → Crime Rate arrow is crossed out]

9 of 33

Creating a toy dataset to illustrate this

[DAG of the toy dataset: variables are Median Income, Day of Week, Population Density, Income Inequality, Temperature, Ice Cream Sales, and # crimes committed; Temperature drives both Ice Cream Sales and # crimes committed, with no direct Ice Cream Sales → crime arrow]

10 of 33

Let’s say we don’t have access to temperature in our dataset

[The same DAG, but with Temperature unobserved: nothing now separates the spurious Ice Cream Sales → # crimes committed association from a real one]

11 of 33

12 of 33

13 of 33

Feature importance measures aren’t causal but they still serve a purpose!

  • Feature pruning
    • If you don’t care about actionable insights, prune features that contribute little predictive power
    • Simplifies your ETL
  • Explainability of predictions
    • Remember, this is not causal explainability, but rather mathematical explainability
    • E.g. “The model said this person shouldn’t be approved for a credit card because X was high, Y was low, and Z was low, and based on the model weights this produced a low outcome score.”
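Permutation importance is one common, purely mathematical importance measure: shuffle one feature's column and see how much the model's score drops. A minimal sketch, where the "trained" model is a stand-in that happens to have learned to echo x1:

```python
import random

random.seed(2)
n = 2000
x1 = [random.randint(0, 1) for _ in range(n)]
x2 = [random.randint(0, 1) for _ in range(n)]
y = list(x1)  # the outcome depends only on x1

def model(a, b):
    # Stand-in for a trained model: it learned to echo x1 and ignore x2.
    return a

def accuracy(col1, col2):
    return sum(int(model(a, b) == t) for a, b, t in zip(col1, col2, y)) / n

base = accuracy(x1, x2)

x1_shuffled = list(x1); random.shuffle(x1_shuffled)
x2_shuffled = list(x2); random.shuffle(x2_shuffled)
imp_x1 = base - accuracy(x1_shuffled, x2)  # big drop: the model relies on x1
imp_x2 = base - accuracy(x1, x2_shuffled)  # no drop: x2 is unimportant to the model
```

Here imp_x1 is large and imp_x2 is zero, which tells you the model leans on x1 and ignores x2 — and nothing more. As the talk stresses, this is mathematical, not causal, explainability.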

14 of 33

If you have a predictive model and really want to estimate the causal importance of features…

Enter: Meta Learners

15 of 33

However, before we do any modeling

  • Take stock of all of the variables/features at your disposal
  • You need to whiteboard out the proposed causal relationships between all features (like we did on the earlier slide with crimes per day)
    • Use your domain knowledge for this
  • You need to binarize all features
    • With continuous variables: assign 0 if below the median value, assign 1 if above the median value
    • With ordinal or nominal variables: create dummy variables
  • For each feature you’ll need to determine its specific set of confounders
  • Note which features are actionable / intervenable
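The binarization step might look like this (a sketch; the helper names are mine, and I treat values exactly at the median as 1, which the slide leaves unspecified):

```python
from statistics import median

def binarize_continuous(values):
    """0 if below the median, 1 if at or above it."""
    m = median(values)
    return [int(v >= m) for v in values]

def dummy_encode(values):
    """One 0/1 dummy column per category, for ordinal/nominal variables."""
    return {cat: [int(v == cat) for v in values] for cat in sorted(set(values))}

tenure = [3, 14, 7, 28, 1]                      # months, continuous
plan = ["basic", "pro", "basic", "pro", "pro"]  # nominal

tenure_bin = binarize_continuous(tenure)  # median is 7
plan_dummies = dummy_encode(plan)
```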

16 of 33

[Churn DAG: the outcome is Churn; the features are Plan type, Age, Regional coverage, Tenure, # customer service calls, Satisfaction (survey), Usage per day, and # outages in last week]
17 of 33

Confounders for a given feature

  • Tenure: ( )
  • Usage per day: (age)
  • Age: ( )
  • Plan type: (age)
  • Satisfaction: (plan type, regional coverage)
  • Regional coverage: ( )
  • # outages last week: ( )
  • # customer service calls: (# outages last week, plan type, regional coverage)
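Since each feature gets its own adjustment set, it is convenient to keep the mapping in a plain dict and loop over it when fitting one meta-learner per feature (the dict below just transcribes the slide; the column names are my own):

```python
# Feature -> confounders to adjust for, read off the whiteboarded DAG.
CONFOUNDERS = {
    "tenure": [],
    "usage_per_day": ["age"],
    "age": [],
    "plan_type": ["age"],
    "satisfaction": ["plan_type", "regional_coverage"],
    "regional_coverage": [],
    "n_outages_last_week": [],
    "n_customer_service_calls": ["n_outages_last_week", "plan_type",
                                 "regional_coverage"],
}

for feature, confounders in CONFOUNDERS.items():
    # Here you would fit one meta-learner per feature, adjusting only for
    # that feature's confounders (i.e., using confounders + [feature] as inputs).
    pass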

18 of 33

The simplest Meta Learner: S-Learner

19 of 33

1) Start with a set of customers for whom we have complete data on the feature of interest, the outcome, and the confounders

Cust ID#  Confound 1  Confound 2  Feature  Churn
1         1           0           1        0
2         1           0           1        0
3         0           0           0        0
4         1           0           0        1
5         1           1           1        0

20 of 33

2) Train a model that predicts the outcome from all confounders and the feature of interest. Aim for high recall and precision.

(Same table as in step 1.)

21 of 33

3) Copy your dataset, and in the new version force the feature variable to “1” for all observations

Cust ID#  Confound 1  Confound 2  Feature  Churn
1         1           0           1        0
2         1           0           1        0
3         0           0           1        0
4         1           0           1        1
5         1           1           1        0

22 of 33

4) Predict outcome values using the model you trained, now with the forced feature values and the same confounder values

Cust ID#  Confound 1  Confound 2  Feature  Churn  Prob1
1         1           0           1        0      0.55
2         1           0           1        0      0.45
3         0           0           1        0      0.67
4         1           0           1        1      0.28
5         1           1           1        0      0.51

23 of 33

5) Copy the original dataset again, and now force the feature to “0” for all observations. Use that same model to predict outcome values with these new feature values.

Cust ID#  Confound 1  Confound 2  Feature  Churn  Prob0
1         1           0           0        0      0.44
2         1           0           0        0      0.42
3         0           0           0        0      0.80
4         1           0           0        1      0.26
5         1           1           0        0      0.43

24 of 33

6) For each observation, calculate the delta in the outcome. Average these deltas together to get the overall effect size. You can treat this as the average causal effect of the feature on the outcome.

Cust ID#  Prob1  Prob0  Δ
1         0.55   0.44   0.11
2         0.45   0.42   0.03
3         0.67   0.80   -0.13
4         0.28   0.26   0.02
5         0.51   0.43   0.08

𝜇Δ = 0.02
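The six steps can be run end to end on simulated data. This is a sketch under invented assumptions: the "model" is a simple P(churn | confounder, feature) lookup table rather than a tuned classifier, and the true causal effect is set to +0.10 so we can check the recovered value:

```python
import random
from collections import defaultdict

random.seed(1)

# Step 1: simulate complete data. The confounder drives both the feature
# and churn; the feature's true causal effect on churn is +0.10.
rows = []
for _ in range(20000):
    c = int(random.random() < 0.5)
    f = int(random.random() < (0.7 if c else 0.3))
    y = int(random.random() < 0.2 + 0.3 * c + 0.10 * f)
    rows.append((c, f, y))

# Step 2: "train" a model of P(churn | c, f) -- here, per-cell means.
counts = defaultdict(lambda: [0, 0])
for c, f, y in rows:
    counts[(c, f)][0] += y
    counts[(c, f)][1] += 1
model = {cell: hits / total for cell, (hits, total) in counts.items()}

# Steps 3-6: score everyone with the feature forced to 1, then forced to 0,
# and average the per-observation deltas.
ate = sum(model[(c, 1)] - model[(c, 0)] for c, _, _ in rows) / len(rows)

# For contrast, the naive (confounded) difference in churn rates:
treated = [y for _, f, y in rows if f]
control = [y for _, f, y in rows if not f]
naive = sum(treated) / len(treated) - sum(control) / len(control)
```

Here ate lands near the true +0.10, while naive comes out roughly twice as large, inflated by the confounder — exactly the gap between causal and naive feature importance.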

25 of 33

26 of 33

S-Learner diagrammed

Adapted from Alves MF, “Causal Inference for the Brave and True”, 2022

[Diagram. Training: fit a single Model on (C, F) → Y. Predicting: run that same Model twice, once with F forced to 1 and once with F forced to 0, yielding pY | F=1 and pY | F=0; 𝜇Δ = average effect]

27 of 33

T-Learner diagrammed

Adapted from Alves MF, “Causal Inference for the Brave and True”, 2022

[Diagram. Training: split the data by feature value and fit two models, ModelF=1 on C → Y among the F = 1 rows and ModelF=0 on C → Y among the F = 0 rows. Predicting: run every observation’s confounders through both models to get pY | F=1 and pY | F=0; 𝜇Δ = average effect]
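The same simulated setup makes the T-Learner concrete: instead of one model that takes the feature as an input, fit one model per feature value and score every observation through both (again a sketch, with per-cell means standing in for real models):

```python
import random
from collections import defaultdict

random.seed(1)
rows = []
for _ in range(20000):
    c = int(random.random() < 0.5)
    f = int(random.random() < (0.7 if c else 0.3))
    y = int(random.random() < 0.2 + 0.3 * c + 0.10 * f)  # true effect +0.10
    rows.append((c, f, y))

def fit_mean_model(subset):
    """One 'model' of P(churn | c), fit on a single arm of the data."""
    counts = defaultdict(lambda: [0, 0])
    for c, y in subset:
        counts[c][0] += y
        counts[c][1] += 1
    return {c: hits / total for c, (hits, total) in counts.items()}

model_f1 = fit_mean_model([(c, y) for c, f, y in rows if f == 1])
model_f0 = fit_mean_model([(c, y) for c, f, y in rows if f == 0])

# Score everyone's confounders through both models and average the deltas.
ate = sum(model_f1[c] - model_f0[c] for c, _, _ in rows) / len(rows)
```

With lookup-table models the S- and T-Learners coincide; with flexible regressors they can differ, since the T-Learner lets each arm fit its own response surface.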

28 of 33

29 of 33

You can get confidence intervals via bootstrapping

[Diagram: starting from the S-Learner model, sample the full N with replacement 1000 times, run the 1000 datasets through the causal feature importance function to obtain a distribution of importance values, then calculate the mean, standard error, and finally the confidence interval, e.g. IF = 0.15 (95% CI: 0.10 to 0.20)]
30 of 33

31 of 33

You can get causal feature importance for the entire population, for subgroups, or even for individual observations

[Diagram: the S-Learner model feeds the causal feature importance function, which can be evaluated over the full population or separately over subpopulations A and B]
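Heterogeneous effects fall out of the same machinery: average the per-observation deltas within a subgroup instead of over everyone. In this sketch the true effect is deliberately made heterogeneous (+0.05 when c = 0, +0.15 when c = 1), and each row’s own delta is its individual-level estimate:

```python
import random
from collections import defaultdict

random.seed(4)
rows = []
for _ in range(20000):
    c = int(random.random() < 0.5)
    f = int(random.random() < (0.7 if c else 0.3))
    effect = 0.05 + 0.10 * c  # heterogeneous true effect of f on churn
    y = int(random.random() < 0.2 + 0.3 * c + effect * f)
    rows.append((c, f, y))

# Toy S-Learner model: P(churn | c, f) as per-cell means.
counts = defaultdict(lambda: [0, 0])
for c, f, y in rows:
    counts[(c, f)][0] += y
    counts[(c, f)][1] += 1
model = {cell: hits / total for cell, (hits, total) in counts.items()}

def avg_delta(subset):
    return sum(model[(c, 1)] - model[(c, 0)] for c, _, _ in subset) / len(subset)

full = avg_delta(rows)                             # population-level effect
sub_a = avg_delta([r for r in rows if r[0] == 0])  # subpopulation with c = 0
sub_b = avg_delta([r for r in rows if r[0] == 1])  # subpopulation with c = 1
```

The subgroup estimates recover the two different true effects, and the population estimate sits between them, weighted by subgroup size.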

32 of 33

There is no free lunch, causal assumptions must be met

  • Temporally, the feature must have occurred before the outcome.
  • The feature value of one observation must not affect the outcome of another observation.
  • For a given feature, there needs to be some variability in its values, and in the outcome values, across the confounder values.
  • All major confounding variables are included in your analysis. This is generally impossible to meet fully, but in practice you do the best you can to minimize violations of it.

33 of 33

Thank you.

Any questions?