The Dangers of Storytelling with Feature Importance
PyData NYC 2023
Roni Kobrosly, Ph.D.
Today’s talk
An, unfortunately, common data science workflow:
1) Discuss the churn problem with the stakeholder team
2) Exploratory data analysis
3) Train and evaluate a model
4) Estimate feature importance
5) Create a “data story” from the model and feature importance values
6) Socialize the story with stakeholders, with recommendations for reducing churn
The problem of confounding
[Diagram: Confounder → X, Confounder → Y; no direct X → Y arrow]
Summer weather induces a false association between ice cream sales and crime
[Diagram: Hot Weather → Ice Cream Sales, Hot Weather → Crime Rate; the apparent Ice Cream Sales → Crime Rate link is spurious]
Creating a toy dataset to illustrate this
Features: Median Income, Day of Week, Population Density, Income Inequality, Temperature, Ice Cream Sales
Outcome: # crimes committed
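A minimal sketch of how such a toy dataset might be generated (not from the talk; the coefficients, noise scales, and column names are illustrative assumptions). Hot weather drives both ice cream sales and the number of crimes, while ice cream sales has no direct effect on crime:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5_000

# Exogenous features
median_income = rng.normal(60_000, 15_000, n)
day_of_week = rng.integers(0, 7, n)
pop_density = rng.lognormal(8, 1, n)
income_inequality = rng.uniform(0.2, 0.6, n)  # Gini-like index
temperature = rng.normal(70, 15, n)           # the confounder

# Hot weather drives ice cream sales...
ice_cream_sales = 50 + 3.0 * temperature + rng.normal(0, 20, n)

# ...and also drives crime; note ice_cream_sales has NO direct effect here
n_crimes = (
    100
    + 0.8 * temperature
    - 0.0005 * median_income
    + 0.002 * pop_density
    + 80 * income_inequality
    + rng.normal(0, 10, n)
)

df = pd.DataFrame({
    "median_income": median_income,
    "day_of_week": day_of_week,
    "pop_density": pop_density,
    "income_inequality": income_inequality,
    "temperature": temperature,
    "ice_cream_sales": ice_cream_sales,
    "n_crimes": n_crimes,
})
```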
Let’s say we don’t have access to temperature in our dataset
Features: Median Income, Day of Week, Population Density, Income Inequality, Ice Cream Sales (Temperature unavailable)
Outcome: # crimes committed
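Continuing the sketch above: if the confounder is dropped and a standard model is trained, permutation importance will tend to rank ice_cream_sales highly, because it proxies for the missing temperature variable, not because it causes crime:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Train as if temperature had never been collected
features_no_temp = ["median_income", "day_of_week", "pop_density",
                    "income_inequality", "ice_cream_sales"]
X, y = df[features_no_temp], df["n_crimes"]
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# ice_cream_sales scores highly only because it proxies for temperature
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, imp in sorted(zip(features_no_temp, result.importances_mean),
                        key=lambda pair: -pair[1]):
    print(f"{name:>20}: {imp:.3f}")
```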
Feature importance measures aren’t causal, but they still serve a purpose!
If you have a predictive model and really want to estimate the causal importance of features…
Enter: Meta Learners
However, before we do any modeling, we need to identify the confounders for a given feature
[Diagram: churn causal graph relating Plan Type, Age, Regional Coverage, Tenure, # Customer Service Calls, Satisfaction (survey), Usage per Day, and # Outages in Last Week to Churn, with the confounders for a given feature highlighted]
The simplest Meta Learner: S-Learner
1) Start with a set of participants for whom we have complete data on the feature of interest, the outcome, and the confounders
Cust ID# | Confound 1 | Confound 2 | Feature | Churn |
1 | 1 | 0 | 1 | 0 |
2 | 1 | 0 | 1 | 0 |
3 | 0 | 0 | 0 | 0 |
4 | 1 | 0 | 0 | 1 |
5 | 1 | 1 | 1 | 0 |
2) Train a model that predicts the outcome from all confounders and the feature of interest. Aim for high recall and precision.
Cust ID# | Confound 1 | Confound 2 | Feature | Churn |
1 | 1 | 0 | 1 | 0 |
2 | 1 | 0 | 1 | 0 |
3 | 0 | 0 | 0 | 0 |
4 | 1 | 0 | 0 | 1 |
5 | 1 | 1 | 1 | 0 |
3) Copy your dataset, and in the new version force the value of the feature variable to “1” for all observations
Cust ID# | Confound 1 | Confound 2 | Feature | Churn |
1 | 1 | 0 | 1 | 0 |
2 | 1 | 0 | 1 | 0 |
3 | 0 | 0 | 1 | 0 |
4 | 1 | 0 | 1 | 1 |
5 | 1 | 1 | 1 | 0 |
4) Predict outcome values using the model you trained, applied to the new feature values and the same confounder values
Cust ID# | Confound 1 | Confound 2 | Feature | Churn | Prob1 |
1 | 1 | 0 | 1 | 0 | 0.55 |
2 | 1 | 0 | 1 | 0 | 0.45 |
3 | 0 | 0 | 1 | 0 | 0.67 |
4 | 1 | 0 | 1 | 1 | 0.28 |
5 | 1 | 1 | 1 | 0 | 0.51 |
5) Copy the original dataset, and now force the feature variable to take on the value “0” for all observations. Use that same earlier model to predict outcome values with these new feature values.
Cust ID# | Confound 1 | Confound 2 | Feature | Churn | Prob0 |
1 | 1 | 0 | 0 | 0 | 0.44 |
2 | 1 | 0 | 0 | 0 | 0.42 |
3 | 0 | 0 | 0 | 0 | 0.80 |
4 | 1 | 0 | 0 | 1 | 0.26 |
5 | 1 | 1 | 0 | 0 | 0.43 |
6) For each observation, calculate the delta in the predicted outcome probability (Prob1 - Prob0). Average these deltas together to get the overall effect size. You can treat this as the average causal effect of the feature on the outcome.
Cust ID# | Prob1 | Prob0 | Δ |
1 | 0.55 | 0.44 | 0.11 |
2 | 0.45 | 0.42 | 0.03 |
3 | 0.67 | 0.80 | -0.13 |
4 | 0.28 | 0.26 | 0.02 |
5 | 0.51 | 0.43 | 0.08 |
𝜇Δ = 0.02
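A minimal scikit-learn sketch of steps 1-6 (illustrative, not from the talk; the learner choice and column names are assumptions). It expects a binary feature of interest and a binary outcome:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def s_learner_effect(df, confounders, feature, outcome):
    """Average causal effect of a binary feature, per the S-Learner recipe above."""
    X, y = df[confounders + [feature]], df[outcome]

    # Step 2: a single model predicting the outcome from confounders + feature
    model = GradientBoostingClassifier().fit(X, y)

    # Steps 3-4: force the feature to 1 everywhere, keep confounders, and score
    X1 = X.copy()
    X1[feature] = 1
    prob1 = model.predict_proba(X1)[:, 1]

    # Step 5: force the feature to 0 everywhere and score with the same model
    X0 = X.copy()
    X0[feature] = 0
    prob0 = model.predict_proba(X0)[:, 1]

    # Step 6: average the per-observation deltas
    return np.mean(prob1 - prob0)

# e.g., effect = s_learner_effect(df, ["confound_1", "confound_2"], "feature", "churn")
```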
S-Learner diagrammed
Adapted from Alves MF, “Causal Inference for the Brave and True”, 2022
[Diagram: Training: a single Model learns Y from (C, F). Predicting: the same Model scores two counterfactual copies of the data, (C, F=1) and (C, F=0), producing pY | F=1 and pY | F=0; 𝜇Δ = average effect]
T-Learner diagrammed
Adapted from Alves MF, “Causal Inference for the Brave and True”, 2022
[Diagram: Training: Model F=1 learns Y from C on the F=1 observations; Model F=0 learns Y from C on the F=0 observations. Predicting: each model scores every observation’s confounders, producing pY | F=1 and pY | F=0; 𝜇Δ = average effect]
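A matching T-Learner sketch under the same assumptions: train one outcome model per feature arm (on confounders only), then score every observation with both models:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def t_learner_effect(df, confounders, feature, outcome):
    """Average causal effect via a T-Learner: separate models for F=1 and F=0."""
    C, y = df[confounders], df[outcome]
    treated = df[feature] == 1

    # Training: each model learns Y from the confounders within its own arm
    model_1 = GradientBoostingClassifier().fit(C[treated], y[treated])
    model_0 = GradientBoostingClassifier().fit(C[~treated], y[~treated])

    # Predicting: both counterfactual outcome probabilities for every observation
    prob1 = model_1.predict_proba(C)[:, 1]
    prob0 = model_0.predict_proba(C)[:, 1]
    return np.mean(prob1 - prob0)
```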
You can get confidence intervals via bootstrapping
1) Sample the full N with replacement 1,000 times
2) Run the 1,000 bootstrap datasets through the causal feature importance function (the S-Learner model) to obtain a distribution of importance values
3) Calculate the mean, standard error, and finally the confidence interval
Example: IF = 0.15 (95% CI: 0.10 - 0.20)
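A sketch of that bootstrap recipe (illustrative; it wraps any effect function, such as the S-Learner sketch above), using the normal approximation mean ± 1.96 × SE:

```python
import numpy as np

def bootstrap_effect_ci(df, effect_fn, n_boot=1000, seed=0):
    """Bootstrap a causal feature importance function into a 95% CI."""
    rng = np.random.default_rng(seed)
    estimates = np.array([
        effect_fn(df.sample(frac=1.0, replace=True, random_state=rng))
        for _ in range(n_boot)
    ])
    mean = estimates.mean()
    se = estimates.std(ddof=1)  # bootstrap standard error
    return mean, (mean - 1.96 * se, mean + 1.96 * se)

# e.g., mean, ci = bootstrap_effect_ci(
#     df, lambda d: s_learner_effect(d, ["confound_1", "confound_2"], "feature", "churn"))
```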
S-Learner Model
Can get causal feature importance for entire population, subgroups, or even individual observations
Causal
Feature
Importance
Function
Full
S-Learner Model
Subpop A
Subpop B
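For instance, because the S-Learner yields one delta per observation, averaging over different index sets gives population, subgroup, or individual effects (a sketch; model and X come from the S-Learner sketch above, and plan_type is a hypothetical column):

```python
def s_learner_deltas(model, X, feature):
    """Per-observation deltas (pY | F=1 minus pY | F=0) from a fitted S-Learner model."""
    X1, X0 = X.copy(), X.copy()
    X1[feature] = 1
    X0[feature] = 0
    return model.predict_proba(X1)[:, 1] - model.predict_proba(X0)[:, 1]

# deltas = s_learner_deltas(model, X, "feature")
# deltas.mean()                           -> entire population
# deltas[df["plan_type"] == "A"].mean()   -> Subpop A (hypothetical column)
# deltas[i]                               -> a single observation i
```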
There is no free lunch: causal assumptions must be met (most importantly, that all confounders are measured and included)
Thank you.
Any questions?