
One Size Doesn’t Fit All: Why We Should Tailor Data Preparation to the Algorithm

Dean Abbott

Abbott Analytics, SmarterHQ

KNIME Spring Summit

21-Mar-2019

Email: dabbott@smarterhq.com, dean@abbottanalytics.com

Twitter: @deanabb

© Abbott Analytics 2001-2019



What Do Predictive Modelers Do? The CRISP-DM Process Model

  • CRoss-Industry Standard Process for Data Mining
  • Describes the components of a complete data mining cycle from the project management perspective


[CRISP-DM cycle diagram: Business Understanding ↔ Data Understanding → Data Preparation ↔ Modeling → Evaluation → Deployment, all centered on the Data]


What We Want to Do!



Is “Global” Data Prep a Fit?



Or Does Data Prep Sometimes Fail Us?



Good Set of Data Prep Steps!


https://www.knime.com/blog/seven-techniques-for-data-dimensionality-reduction


Best Practices (this is for Greg)


https://www.kdnuggets.com/2018/12/six-steps-master-machine-learning-data-preparation.html


The Problems Are Twofold


[Diagram: Inputs → Model → Output]


First, the Inputs




Data Preparation Dependencies


Numeric algorithms — Neural Networks, Linear Regression*, Logistic Regression, K Nearest Neighbor*+, PCA*, Nearest Mean*+, Kohonen Self-Organizing Maps*+, Support Vector Machines, Radial Basis Function Networks, Discriminant Analysis:

  • Missing values MUST be filled/imputed
  • Explode categorical variables into dummies (see the sketch below)
  • Numeric data fine, but…
    • *Outliers very influential
    • +Scale (magnitude) matters

Rule-based algorithms — Decision Trees, Naïve Bayes, Rule Induction, Association Rules:

  • Missing values can be a category or auto-imputed
  • Categoricals are fine (or required)
  • Numeric data must be binned (except some decision trees)
  • Numeric magnitude and outliers don’t matter
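For the numeric algorithms, “explode categorical variables” means one-hot (dummy) encoding. A minimal sketch in pandas (the column names are illustrative, not from the deck’s data):

```python
import pandas as pd

# hypothetical data with one nominal field
df = pd.DataFrame({"DOMAIN": ["urban", "rural", "suburban", "rural"],
                   "LASTGIFT": [25.0, 10.0, 50.0, 5.0]})

# explode the categorical into 0/1 dummy columns for numeric algorithms
exploded = pd.get_dummies(df, columns=["DOMAIN"], prefix="D")
print(list(exploded.columns))  # ['LASTGIFT', 'D_rural', 'D_suburban', 'D_urban']
```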


Data Representation Problems by Algorithms


| Data Preparation Problem | Linear Regression | K-NN | K-Means Clustering | PCA | Neural Networks | Decision Trees | Naïve Bayes |
|---|---|---|---|---|---|---|---|
| Missing Values | Y | Y | Y | Y | Y | * | |
| Correlations / multi-collinearity | Y | Y | Y | * | | | * |
| Skew / data shape | Y | Y | Y | Y | | | * |
| Outliers | Y | Y | Y | Y | * | | * |
| Magnitude Bias (Scale) | * | Y | Y | Y | * | | |
| Categorical Variables | Y | Y | Y | Y | Y | | |


Missing Value Imputation

  • Delete the record (row), or delete the field (column)
  • Replace with a constant
  • Replace the missing value with the mean, median, or a draw from the distribution
  • Replace missing with random self-substitution
  • Surrogate splits (CART)
  • Make missing a category
    • Simple for “rule-based” algorithms; turn continuous into categorical for numeric algorithms
  • Replace the missing value with an estimate
    • Select a value from another field having high correlation with the variable containing missing values
    • Build a model with the variable containing missing values as the output, and variables without missing values as inputs



Missing Value Imputation is a Necessary Evil
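A minimal sketch of three of these options (pandas/scikit-learn; the toy columns are illustrative). The model-based estimate uses a field without missing values as the input, as described above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"AGE": [34, np.nan, 51, 28, np.nan],
                   "LASTGIFT": [25.0, 10.0, 50.0, 5.0, 15.0]})

# replace with a constant, or with the mean
df["AGE_const"] = df["AGE"].fillna(-1)
df["AGE_mean"] = df["AGE"].fillna(df["AGE"].mean())

# model-based: predict the variable with missing values from complete fields
known = df["AGE"].notna()
model = LinearRegression().fit(df.loc[known, ["LASTGIFT"]], df.loc[known, "AGE"])
df["AGE_model"] = df["AGE"]
df.loc[~known, "AGE_model"] = model.predict(df.loc[~known, ["LASTGIFT"]])
```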


CHAID Trees: Missing Values Are Just Another Category



Why Are Outliers, Skew, and Data Shape a Problem? → Squares ←

Linear Regression: Mean Squared Error

K-Means Clustering: Euclidean Distance


https://en.wikipedia.org/wiki/Mean_squared_error

https://en.wikipedia.org/wiki/Euclidean_distance
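Both criteria square their differences, so a single extreme value contributes quadratically and can dominate the fit or the cluster assignment:

$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2 \qquad\qquad d(\mathbf{p},\mathbf{q})=\sqrt{\sum_{j=1}^{m}\left(p_j-q_j\right)^2}$$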


Effect of Outliers on Correlations (and Regression)


  • 4,843 records


Corresponds to an R² increase from 0.42 to 0.53
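A toy illustration of the same effect (synthetic data, not the 4,843-record data shown on the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 500)
y = 0.7 * x + rng.normal(0, 1, 500)
x[:5] += 15  # plant a handful of outliers

r_all = np.corrcoef(x, y)[0, 1]
keep = np.abs(x - x.mean()) < 3 * x.std()  # crude outlier trim
r_trim = np.corrcoef(x[keep], y[keep])[0, 1]
print(f"R^2 with outliers: {r_all**2:.2f}; after trimming: {r_trim**2:.2f}")
```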


Decision Trees Can Handle It



Effect of Distance on Clusters



Heavily Skewed Variables Cause Problems with Std. Dev. Calculation



Typical: Log-Transform the Heavily Skewed Fields
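A minimal de-skew sketch (log10 with a +1 offset so zero values don’t blow up; what counts as “heavily skewed” is a judgment call):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

gifts = pd.Series([5, 10, 10, 15, 25, 50, 200, 1000], dtype=float)
gifts_log = np.log10(gifts + 1)  # +1 guards against log10(0)
print(f"skew before: {skew(gifts):.2f}, after: {skew(gifts_log):.2f}")
```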



Std. Dev. for a Dummy Variable Doesn’t Make Sense!


Note: std. devs. of dummy variables are typically 0.3–0.5
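The reason: a 0/1 dummy with proportion p of ones has standard deviation

$$\sigma=\sqrt{p\,(1-p)}\;\le\;0.5,$$

which is about 0.3 at p = 0.1 and peaks at 0.5 when p = 0.5. The number reflects class balance, not spread, so treating it like a numeric std. dev. (e.g., when z-scoring) is misleading.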


Try K-Means with Different Normalization Approaches
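A sketch of the experiment in scikit-learn (X is an assumed numeric feature table with dummies already exploded); variable importance is scored with an F statistic, as in the slides that follow:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# assumed: X is a numeric pandas DataFrame
for name, Xs in [("natural", X.values),
                 ("z-score", StandardScaler().fit_transform(X)),
                 ("min-max", MinMaxScaler().fit_transform(X))]:
    labels = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(Xs)
    f_stats, _ = f_classif(X, labels)  # which variables separate the clusters?
    top = pd.Series(f_stats, index=X.columns).nlargest(3)
    print(name, top.round(1).to_dict())
```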



K-Means Clustering Variable Importance: Natural vs. Scaled


Measurements are F statistics



K-Means Clustering Standard Deviations: Natural vs. Scaled


Scale Matters!!


PCA: Natural Units



PCA: Natural Units (zoomed)



PCA: Scaled Units



PCA: Scaled Units (zoomed)



PCA: Scaled and Dummy Scaling


Scale Matters!!
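The same comparison as a sketch (X assumed numeric): in natural units the large-magnitude fields own the leading components; after z-scoring (and, if desired, separate scaling for the dummies) the variance spreads out:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# assumed: X is a numeric feature matrix
pca_nat = PCA(n_components=2).fit(X)                                  # natural units
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X))  # scaled
print(np.round(pca_nat.explained_variance_ratio_, 3))
print(np.round(pca_std.explained_variance_ratio_, 3))
```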


Input Variable Interactions

  • Algorithms are mixed on interactions, in theory
    • Linear Regression, Logistic Regression, k-NN, k-Means Clustering, PCA, … are main-effects models
  • Decision trees are greedy searchers
    • Built to find interactions
    • But only if they can be found in sequence (one at a time, stepwise)
  • Neural networks find interactions well (XOR)
  • Naïve Bayes finds intersections, not interactions
  • In practice, algorithms don’t always identify interactions well (or well enough) on their own



Simple Interaction Function

  • Two uniform variables: x and y
  • 2,564 records
  • if (x * y > 0) return "1"; else return "0";
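The dataset is easy to reproduce (the slide doesn’t state the uniform range, so [-1, 1] is assumed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 2_564                          # record count from the slide
x = rng.uniform(-1.0, 1.0, n)      # assumed range
y = rng.uniform(-1.0, 1.0, n)
target = (x * y > 0).astype(int)   # 1 in quadrants I and III
df = pd.DataFrame({"x": x, "y": y, "target": target})
```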



Four Classifiers



[Four classifiers on the interaction data: Naïve Bayes; Decision Tree (min leaf node 50 records); Logistic Regression; Rprop Neural Net (300 epochs)]
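A sketch of the comparison in scikit-learn, using df from the sketch above (scikit-learn has no Rprop trainer, so MLPClassifier stands in for the Rprop net):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree (min leaf 50)": DecisionTreeClassifier(min_samples_leaf=50),
    "Logistic Regression": LogisticRegression(),
    "Neural Net (300 epochs)": MLPClassifier(max_iter=300),
}
for name, model in models.items():
    acc = cross_val_score(model, df[["x", "y"]], df["target"], cv=5).mean()
    print(f"{name}: {acc:.3f}")  # main-effects models hover near chance
```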


Errors


[Error plots for the same four classifiers (Naïve Bayes; Decision Tree, min leaf node 50 records; Logistic Regression; Rprop Neural Net, 300 epochs); legend: true/false class × correct/incorrect prediction]


Don’t Build Interactions Manually*

  • Too many… too many! (with just 100 inputs there are already 100 × 99 / 2 = 4,950 two-way interactions)
  • So what do you do?


* Except for those you know about


Automatic Interaction Detection

  • Trees: build 2-level trees
    • Pros: works with continuous and categoricals
    • Cons: greedy, only finds one solution at a time (Battery)
  • Association rules: build 2-antecedent rules
    • Pros: exhaustive
    • Cons: only works with categoricals
  • Use the linear/logistic regression algorithm itself, �loop over all 2-way interactions
    • Pros: context is the model you may want to use, �easy to do in R, Matlab, Python, SAS (coding)
    • Cons: slow, have to code, what to do with dummies
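The third option as a sketch (X, a DataFrame of predictors, and y, a binary target, are assumed): score every pairwise product against the main-effects baseline and keep the winners:

```python
from itertools import combinations

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

base = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
gains = {}
for a, b in combinations(X.columns, 2):
    Xi = X.assign(**{f"{a}_x_{b}": X[a] * X[b]})
    score = cross_val_score(LogisticRegression(max_iter=1000), Xi, y, cv=5).mean()
    gains[(a, b)] = score - base

# the interactions that most improve the main-effects model
top10 = sorted(gains.items(), key=lambda kv: -kv[1])[:10]
```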



Summary


| Data Preparation Step | Linear Regression | K-NN | K-Means Clustering | PCA | Neural Networks | Decision Trees (Classifiers) |
|---|---|---|---|---|---|---|
| Fill Missing Values | Y | Y | Y | Y | Y | * |
| Correlation Filtering | Y | Y | Y | * | | |
| De-Skew (log, Box-Cox) | Y | Y | Y | Y | | |
| Mitigate Outliers | Y | Y | Y | Y | * | |
| Remove Magnitude Bias (Scale) | * | Y | Y | Y | * | |
| Remove Categorical "Dummy" Bias | Y | Y | Y | Y | | |
| Find Interactions | Y | Y | Y | Y | | |


And Now, the Output




To Stratify or Not to Stratify… That Is the Question!?


5.1% of records have TARGET_B = 1: unbalanced data


Comparing Logistic Regression with and without Equal Size Sampling


[Side-by-side model results: Stratified Sampling vs. No Stratified Sampling]

https://www.predictiveanalyticsworld.com/sanfrancisco/2013/pdf/Day2_1050_Abbott.pdf
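Equal-size (stratified) sampling is a few lines in pandas (a sketch; df with the binary TARGET_B column from the deck’s donation data is assumed):

```python
import pandas as pd

# assumed: df has a 0/1 TARGET_B column (about 5.1% ones)
ones = df[df["TARGET_B"] == 1]
zeros = df[df["TARGET_B"] == 0].sample(n=len(ones), random_state=1)
balanced = pd.concat([ones, zeros]).sample(frac=1, random_state=1)  # shuffle
```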


Don’t Need to Stratify With Many Algorithms


https://www.predictiveanalyticsworld.com/sanfrancisco/2013/pdf/Day2_1050_Abbott.pdf


Know the Algorithm when Developing Sampling Strategy


Logistic regression coefficients, stratified sample vs. natural (orig) sample:

| Variable | Coeff. (stratified) | Std. Err. (stratified) | P>\|z\| (stratified) | Coeff. (natural) | Std. Err. (natural) | P>\|z\| (natural) | Coeff. diff | Coeff. compare |
|---|---|---|---|---|---|---|---|---|
| RFA_2F | -0.133532984 | 0.0338 | 0.000 | -0.1563345 | 0.024 | 0.000 | 0.023 | within SE |
| D_RFA_2A | -0.163727182 | 0.1210 | 0.176 | -0.0934212 | 0.079 | 0.237 | 0.070 | within SE |
| F_RFA_2A | 0.038231571 | 0.0884 | 0.665 | 0.0357819 | 0.062 | 0.565 | 0.002 | within SE |
| G_RFA_2A | 0.316663027 | 0.1267 | 0.012 | 0.2779701 | 0.091 | 0.002 | 0.039 | within SE |
| DOMAIN2 | -0.068966948 | 0.0767 | 0.369 | -0.1169964 | 0.056 | 0.036 | 0.048 | within SE |
| DOMAIN1 | -0.266408264 | 0.0837 | 0.001 | -0.2845323 | 0.060 | 0.000 | 0.018 | within SE |
| NGIFTALL_log10 | -0.46212497 | 0.0998 | 0.000 | -0.4444304 | 0.072 | 0.000 | 0.018 | within SE |
| LASTGIFT_log10 | 0.062766545 | 0.2044 | 0.759 | 0.1813683 | 0.141 | 0.199 | 0.119 | within SE |
| Constant | 0.695770991 | 0.2785 | 0.012 | 3.5393926 | 0.194 | 0.000 | 2.844 | outside SE |

Only the constant falls outside the standard errors: equal-size sampling shifts the intercept but leaves the other coefficients essentially unchanged.


Know the Algorithm when Developing Sampling Strategy


[Decision trees compared: Stratified (50% target rate) — i.e., lots of splits!; Natural (orig, 0.7% target rate) — i.e., 1 split]


Know the Algorithm when Developing Sampling Strategy


[Same comparison for another model: Stratified (50% target rate) vs. Natural (orig, 5% target rate) — i.e., no model at all on the natural data]


What We Want to Do!



What We Actually Did


Can We Find a Good Fit?



Conclusions

  • Know what the algorithms can do (and not do!) before deciding on data preparation
    • When are data shapes and data ranges important?
  • It’s not hard… it just requires some thought
  • Once you know what to do, you have your recipe!
