1 of 33

The Dangers of Storytelling with Feature Importance

PyData NYC 2023

Roni Kobrosly, Ph.D.

2 of 33

Today’s talk

  • What is a “model data story”?
    • Feature importance in a minute
  • Correlation and causation and ice cream
    • The problem of confounding
    • Ice cream and crime
  • What good are feature importance metrics anyways?
    • Parsimonious models and mathematical explainability
  • Causal feature importance through Meta Learners
    • A closer look at customer churn, with a DAG
    • Prepping for Meta Learning
    • The simplest Meta Learner (S-Learner) explained three ways
    • Alternative Meta Learners
    • The causalml package API
    • Average and heterogeneous importance
    • There is no free lunch

3 of 33

An unfortunately common data science workflow

  1. Discuss churn problem with stakeholder team
  2. Exploratory data analysis
  3. Train and evaluate model
  4. Estimate feature importance
  5. Create “data story” from model and feature importance values
  6. Socialize story with stakeholders, with recommendations for reducing churn

4 of 33


5 of 33

6 of 33

7 of 33

The problem of confounding

[DAG: a Confounder points to both X and Y, inducing a spurious association between X and Y]

8 of 33

Summer weather induces a false association between ice cream sales and crime

[DAG: Hot Weather points to both Ice Cream Sales and Crime Rate; the direct Ice Cream Sales → Crime Rate arrow is crossed out]

9 of 33

Creating a toy dataset to illustrate this

[DAG of the toy dataset: variables are Median Income, Day of Week, Population Density, Income Inequality, Temperature, Ice Cream Sales, and # crimes committed; Temperature drives both Ice Cream Sales and # crimes committed, with no direct Ice Cream Sales → crime arrow]

10 of 33

Let’s say we don’t have access to temperature in our dataset

[The same DAG, but with Temperature unobserved: nothing now separates the spurious Ice Cream Sales → # crimes committed association from a real one]

11 of 33

12 of 33

13 of 33

Feature importance measures aren’t causal but they still serve a purpose!

  • Feature pruning
    • If you don’t care about actionable insights, prune features that contribute little predictive power
    • Simplifies your ETL
  • Explainability of predictions
    • Remember, this is not causal explainability, but rather mathematical explainability
    • E.g. “The model said this person shouldn’t be approved for a credit card because X was high, Y was low, and Z was low, and based on the model weights this produced a low outcome score.”
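Permutation importance is one common, purely mathematical importance measure: shuffle one feature's column and see how much the model's score drops. A minimal sketch, where the "trained" model is a stand-in that happens to have learned to echo x1:

```python
import random

random.seed(2)
n = 2000
x1 = [random.randint(0, 1) for _ in range(n)]
x2 = [random.randint(0, 1) for _ in range(n)]
y = list(x1)  # the outcome depends only on x1

def model(a, b):
    # Stand-in for a trained model: it learned to echo x1 and ignore x2.
    return a

def accuracy(col1, col2):
    return sum(int(model(a, b) == t) for a, b, t in zip(col1, col2, y)) / n

base = accuracy(x1, x2)

x1_shuffled = list(x1); random.shuffle(x1_shuffled)
x2_shuffled = list(x2); random.shuffle(x2_shuffled)
imp_x1 = base - accuracy(x1_shuffled, x2)  # big drop: the model relies on x1
imp_x2 = base - accuracy(x1, x2_shuffled)  # no drop: x2 is unimportant to the model
```

Here imp_x1 is large and imp_x2 is zero, which tells you the model leans on x1 and ignores x2 — and nothing more. As the talk stresses, this is mathematical, not causal, explainability.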

14 of 33

If you have a predictive model and really want to estimate the causal importance of features…

Enter: Meta Learners

15 of 33

However, before we do any modeling

  • Take stock of all of the variables/features at your disposal
  • You need to whiteboard out the proposed causal relationships between all features (like we did on the earlier slide with crimes per day)
    • Use your domain knowledge for this
  • You need to binarize all features
    • With continuous variables: assign 0 if below the median value, assign 1 if above the median value
    • With ordinal or nominal variables: create dummy variables
  • For each feature you’ll need to determine its specific set of confounders
  • Note which features are actionable / intervenable
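The binarization step might look like this (a sketch; the helper names are mine, and I treat values exactly at the median as 1, which the slide leaves unspecified):

```python
from statistics import median

def binarize_continuous(values):
    """0 if below the median, 1 if at or above it."""
    m = median(values)
    return [int(v >= m) for v in values]

def dummy_encode(values):
    """One 0/1 dummy column per category, for ordinal/nominal variables."""
    return {cat: [int(v == cat) for v in values] for cat in sorted(set(values))}

tenure = [3, 14, 7, 28, 1]                      # months, continuous
plan = ["basic", "pro", "basic", "pro", "pro"]  # nominal

tenure_bin = binarize_continuous(tenure)  # median is 7
plan_dummies = dummy_encode(plan)
```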

16 of 33

[Churn DAG: the outcome is Churn; the features are Plan type, Age, Regional coverage, Tenure, # customer service calls, Satisfaction (survey), Usage per day, and # outages in last week]
17 of 33

Confounders for a given feature

  • Tenure: ( )
  • Usage per day: (age)
  • Age: ( )
  • Plan type: (age)
  • Satisfaction: (plan type, regional coverage)
  • Regional coverage: ( )
  • # outages last week: ( )
  • # customer service calls: (# outages last week, plan type, regional coverage)
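Since each feature gets its own adjustment set, it is convenient to keep the mapping in a plain dict and loop over it when fitting one meta-learner per feature (the dict below just transcribes the slide; the column names are my own):

```python
# Feature -> confounders to adjust for, read off the whiteboarded DAG.
CONFOUNDERS = {
    "tenure": [],
    "usage_per_day": ["age"],
    "age": [],
    "plan_type": ["age"],
    "satisfaction": ["plan_type", "regional_coverage"],
    "regional_coverage": [],
    "n_outages_last_week": [],
    "n_customer_service_calls": ["n_outages_last_week", "plan_type",
                                 "regional_coverage"],
}

for feature, confounders in CONFOUNDERS.items():
    # Here you would fit one meta-learner per feature, adjusting only for
    # that feature's confounders (i.e., using confounders + [feature] as inputs).
    pass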

18 of 33

The simplest Meta Learner: S-Learner

19 of 33

1) Start with a set of customers for whom we have complete data on the feature of interest, the outcome, and the confounders

Cust ID#  Confound 1  Confound 2  Feature  Churn
1         1           0           1        0
2         1           0           1        0
3         0           0           0        0
4         1           0           0        1
5         1           1           1        0

20 of 33

2) Train a model that predicts the outcome from all confounders and the feature of interest. Aim for high recall and precision.

(Same table as in step 1.)

21 of 33

3) Copy your dataset, and in the new version force the feature variable to “1” for all observations

Cust ID#  Confound 1  Confound 2  Feature  Churn
1         1           0           1        0
2         1           0           1        0
3         0           0           1        0
4         1           0           1        1
5         1           1           1        0

22 of 33

4) Predict outcome values using the model you trained, now with the forced feature values and the same confounder values

Cust ID#  Confound 1  Confound 2  Feature  Churn  Prob1
1         1           0           1        0      0.55
2         1           0           1        0      0.45
3         0           0           1        0      0.67
4         1           0           1        1      0.28
5         1           1           1        0      0.51

23 of 33

5) Copy the original dataset again, and now force the feature to “0” for all observations. Use that same model to predict outcome values with these new feature values.

Cust ID#  Confound 1  Confound 2  Feature  Churn  Prob0
1         1           0           0        0      0.44
2         1           0           0        0      0.42
3         0           0           0        0      0.80
4         1           0           0        1      0.26
5         1           1           0        0      0.43

24 of 33

6) For each observation, calculate the delta in the outcome. Average these deltas together to get the overall effect size. You can treat this as the average causal effect of the feature on the outcome.

Cust ID#  Prob1  Prob0  Δ
1         0.55   0.44   0.11
2         0.45   0.42   0.03
3         0.67   0.80   -0.13
4         0.28   0.26   0.02
5         0.51   0.43   0.08

𝜇Δ = 0.02
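The six steps can be run end to end on simulated data. This is a sketch under invented assumptions: the "model" is a simple P(churn | confounder, feature) lookup table rather than a tuned classifier, and the true causal effect is set to +0.10 so we can check the recovered value:

```python
import random
from collections import defaultdict

random.seed(1)

# Step 1: simulate complete data. The confounder drives both the feature
# and churn; the feature's true causal effect on churn is +0.10.
rows = []
for _ in range(20000):
    c = int(random.random() < 0.5)
    f = int(random.random() < (0.7 if c else 0.3))
    y = int(random.random() < 0.2 + 0.3 * c + 0.10 * f)
    rows.append((c, f, y))

# Step 2: "train" a model of P(churn | c, f) -- here, per-cell means.
counts = defaultdict(lambda: [0, 0])
for c, f, y in rows:
    counts[(c, f)][0] += y
    counts[(c, f)][1] += 1
model = {cell: hits / total for cell, (hits, total) in counts.items()}

# Steps 3-6: score everyone with the feature forced to 1, then forced to 0,
# and average the per-observation deltas.
ate = sum(model[(c, 1)] - model[(c, 0)] for c, _, _ in rows) / len(rows)

# For contrast, the naive (confounded) difference in churn rates:
treated = [y for _, f, y in rows if f]
control = [y for _, f, y in rows if not f]
naive = sum(treated) / len(treated) - sum(control) / len(control)
```

Here ate lands near the true +0.10, while naive comes out roughly twice as large, inflated by the confounder — exactly the gap between causal and naive feature importance.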

25 of 33

26 of 33

S-Learner diagrammed

Adapted from Alves MF, “Causal Inference for the Brave and True”, 2022

[Diagram. Training: fit a single Model on (C, F) → Y. Predicting: run that same Model twice, once with F forced to 1 and once with F forced to 0, yielding pY | F=1 and pY | F=0; 𝜇Δ = average effect]

27 of 33

T-Learner diagrammed

Adapted from Alves MF, “Causal Inference for the Brave and True”, 2022

[Diagram. Training: split the data by feature value and fit two models, ModelF=1 on C → Y among the F = 1 rows and ModelF=0 on C → Y among the F = 0 rows. Predicting: run every observation’s confounders through both models to get pY | F=1 and pY | F=0; 𝜇Δ = average effect]
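The same simulated setup makes the T-Learner concrete: instead of one model that takes the feature as an input, fit one model per feature value and score every observation through both (again a sketch, with per-cell means standing in for real models):

```python
import random
from collections import defaultdict

random.seed(1)
rows = []
for _ in range(20000):
    c = int(random.random() < 0.5)
    f = int(random.random() < (0.7 if c else 0.3))
    y = int(random.random() < 0.2 + 0.3 * c + 0.10 * f)  # true effect +0.10
    rows.append((c, f, y))

def fit_mean_model(subset):
    """One 'model' of P(churn | c), fit on a single arm of the data."""
    counts = defaultdict(lambda: [0, 0])
    for c, y in subset:
        counts[c][0] += y
        counts[c][1] += 1
    return {c: hits / total for c, (hits, total) in counts.items()}

model_f1 = fit_mean_model([(c, y) for c, f, y in rows if f == 1])
model_f0 = fit_mean_model([(c, y) for c, f, y in rows if f == 0])

# Score everyone's confounders through both models and average the deltas.
ate = sum(model_f1[c] - model_f0[c] for c, _, _ in rows) / len(rows)
```

With lookup-table models the S- and T-Learners coincide; with flexible regressors they can differ, since the T-Learner lets each arm fit its own response surface.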

28 of 33

29 of 33

You can get confidence intervals via bootstrapping

[Diagram: starting from the S-Learner model, sample the full N with replacement 1000 times, run the 1000 datasets through the causal feature importance function to obtain a distribution of importance values, then calculate the mean, standard error, and finally the confidence interval, e.g. IF = 0.15 (95% CI: 0.10 to 0.20)]
30 of 33

31 of 33

You can get causal feature importance for the entire population, for subgroups, or even for individual observations

[Diagram: the S-Learner model feeds the causal feature importance function, which can be evaluated over the full population or separately over subpopulations A and B]
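Heterogeneous effects fall out of the same machinery: average the per-observation deltas within a subgroup instead of over everyone. In this sketch the true effect is deliberately made heterogeneous (+0.05 when c = 0, +0.15 when c = 1), and each row’s own delta is its individual-level estimate:

```python
import random
from collections import defaultdict

random.seed(4)
rows = []
for _ in range(20000):
    c = int(random.random() < 0.5)
    f = int(random.random() < (0.7 if c else 0.3))
    effect = 0.05 + 0.10 * c  # heterogeneous true effect of f on churn
    y = int(random.random() < 0.2 + 0.3 * c + effect * f)
    rows.append((c, f, y))

# Toy S-Learner model: P(churn | c, f) as per-cell means.
counts = defaultdict(lambda: [0, 0])
for c, f, y in rows:
    counts[(c, f)][0] += y
    counts[(c, f)][1] += 1
model = {cell: hits / total for cell, (hits, total) in counts.items()}

def avg_delta(subset):
    return sum(model[(c, 1)] - model[(c, 0)] for c, _, _ in subset) / len(subset)

full = avg_delta(rows)                             # population-level effect
sub_a = avg_delta([r for r in rows if r[0] == 0])  # subpopulation with c = 0
sub_b = avg_delta([r for r in rows if r[0] == 1])  # subpopulation with c = 1
```

The subgroup estimates recover the two different true effects, and the population estimate sits between them, weighted by subgroup size.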

32 of 33

There is no free lunch, causal assumptions must be met

  • Temporally, the feature must have occurred before the outcome.
  • The feature value of one observation must not affect the outcome of another observation.
  • For a given feature, there needs to be some variability in its values, and in the outcome values, across the confounder values.
  • All major confounding variables are included in your analysis. This is generally impossible to meet fully, but in practice you do the best you can to minimize violations of it.

33 of 33

Thank you.

Any questions?