One Size Doesn’t Fit All: Why We Should Tailor Data Preparation to the Algorithm
Dean Abbott
Abbott Analytics, SmarterHQ
KNIME Spring Summit
21-Mar-2019
Email: dabbott@smarterhq.com, dean@abbottanalytics.com
Twitter: @deanabb
© Abbott Analytics 2001-2019
What Do Predictive Modelers Do? The CRISP-DM Process Model
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Data
What We Want to Do!
Is “Global” Data Prep a Fit?
Or Does Data Prep Sometimes Fail Us?
Good Set of Data Prep Steps!
https://www.knime.com/blog/seven-techniques-for-data-dimensionality-reduction
Best Practices (this is for Greg)
https://www.kdnuggets.com/2018/12/six-steps-master-machine-learning-data-preparation.html
Problems are Two Fold
Model
Output
Inputs
First, the Inputs
Model
Output
Inputs
Data Preparation Dependencies
Neural Networks
Linear Regression*
Logistic Regression
K Nearest Neighbor*+
PCA*
Nearest Mean*+
Kohonen Self-Organizing Maps*+
Support Vector Machines
Radial Basis Function Networks
Discriminant Analysis
Decision Trees
Naïve Bayes
Rule Induction
Association Rules
Data Representation Problems by Algorithm
Data Preparation Problem | Linear Regression | K-NN | K-Means Clustering | PCA | Neural Networks | Decision Trees | Naïve Bayes |
Missing Values | Y | Y | Y | Y | Y | * | |
Correlations / multi-collinearity | Y | Y | Y | * | | | * |
Skew / data shape | Y | Y | Y | Y | | | * |
Outliers | Y | Y | Y | Y | * | | * |
Magnitude Bias (Scale) | * | Y | Y | Y | * | | |
Categorical Variables | Y | Y | Y | Y | | * | |
Missing Value Imputation
Missing Value Imputation is a Necessary Evil
CHAID Trees: Missing Values are Just Another Category
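A minimal sketch of the "just another category" idea, assuming a small made-up categorical field. Instead of imputing, the missing values receive their own label so that a tree can split on them. The field name `region`, its values, and the `MISSING` label are illustrative and do not come from the slides.

```python
from collections import Counter

# Hypothetical categorical field with gaps (None marks a missing value).
region = ["east", None, "west", "east", None, "south"]

# Rather than imputing, give missing values their own category label.
region_with_missing = [value if value is not None else "MISSING" for value in region]

# The field now has four levels; a tree can split on "MISSING" like any other.
level_counts = Counter(region_with_missing)
print(level_counts)
```

Algorithms that need numeric inputs (regression, K-NN, K-Means, PCA, neural networks) cannot use this shortcut directly, which is why imputation remains necessary for them.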
Why Are Outliers, Skew, Shape a Problem? → Squares ←
Linear Regression: Mean Squared Error
K-Means Clustering
https://en.wikipedia.org/wiki/Mean_squared_error
https://en.wikipedia.org/wiki/Euclidean_distance
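A small sketch of why the squares matter, using the standard definitions of mean squared error and Euclidean distance from the two pages linked above. The numbers are made up for illustration: one large residual, or one coordinate measured in large units, ends up carrying almost the entire sum.

```python
import math

def mean_squared_error(actual, predicted):
    """Average of the squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def euclidean_distance(u, v):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Made-up residual example: four small errors and one outlier.
actual = [10.0, 12.0, 11.0, 13.0, 50.0]
predicted = [11.0, 11.0, 12.0, 12.0, 12.0]

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
outlier_share = max(squared_errors) / sum(squared_errors)
print(mean_squared_error(actual, predicted), outlier_share)

# Made-up distance example: one coordinate differs by 1, the other by 100.
distance = euclidean_distance([0.0, 0.0], [1.0, 100.0])
print(distance)
```

In this toy case the single outlier accounts for more than 99% of the squared error, and the distance is essentially the difference in the large coordinate alone.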
Effect of Outliers on Correlations (and Regression)
Corresponds to R^2 increase from 0.42 to 0.53
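The data behind the R^2 figures above is not included in the slides, so the sketch below uses made-up points instead. It only illustrates the mechanism: a single extreme point can raise the Pearson correlation of an otherwise weakly related pair of variables. It does not reproduce the 0.42 and 0.53 values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from first principles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data with a weak relationship.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0, 1.0, 4.0, 2.0, 5.0, 3.0]

r_without = pearson_r(xs, ys)
# Add one extreme point far from the rest of the data.
r_with = pearson_r(xs + [30.0], ys + [30.0])

print(round(r_without, 3), round(r_with, 3))
print(round(r_without ** 2, 3), round(r_with ** 2, 3))
```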
Decision Trees Can Handle It
Effect of Distance on Clusters
Heavily Skewed Variables Cause Problems with Std. Dev. Calculation
Typical: Log transform the heavily skewed fields
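A minimal sketch of the log transform on a made-up, heavily right-skewed field. The values and the name `gift_amounts` are illustrative. Base 10 is used here to match the `_log10` suffix on the field names that appear later in the deck. Note that a log transform requires strictly positive values, so fields containing zeros need an offset first.

```python
import math

def skewness(values):
    """Population skewness: third central moment over the cubed std. dev."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up field with a long right tail.
gift_amounts = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100, 1000]

# Log transform compresses the tail (all values must be > 0).
logged = [math.log10(v) for v in gift_amounts]

skew_raw = skewness(gift_amounts)
skew_log = skewness(logged)
print(round(skew_raw, 2), round(skew_log, 2))
```

The transformed field is still somewhat skewed in this toy case, but far less so, which makes its mean and standard deviation more representative of the bulk of the data.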
Std. Dev. For Dummy Variable Doesn’t Make Sense!
Dummy Vars
Note: std. devs. are typically 0.3 to 0.5
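The 0.3 to 0.5 range follows from the formula for the standard deviation of a 0/1 variable, sqrt(p(1 - p)), where p is the fraction of ones. The sketch below evaluates it for a few made-up values of p and also shows what z-scoring does to the "1" value of a rare dummy. The function names are illustrative.

```python
import math

def dummy_std(p):
    """Population std. dev. of a 0/1 variable whose mean is p."""
    return math.sqrt(p * (1.0 - p))

def zscore_of_one(p):
    """Where the value 1 lands after z-scoring the dummy: (1 - p) / std."""
    return (1.0 - p) / dummy_std(p)

for p in (0.01, 0.1, 0.2, 0.3, 0.5):
    print(p, round(dummy_std(p), 3), round(zscore_of_one(p), 2))
```

For p between 0.1 and 0.5 the standard deviation stays between 0.3 and 0.5, so an unscaled dummy has less spread than a z-scored continuous field (standard deviation 1). Z-scoring the dummy instead pushes a rare "1" many standard deviations from the mean (about 10 for p = 0.01), which distorts distance-based algorithms in the other direction.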
Try K-Means with Different Normalization Approaches
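A small sketch of how the choice of normalization changes what K-Means prefers. It does not run the full algorithm. Instead it evaluates the K-Means objective (within-cluster sum of squares) for two candidate two-cluster partitions of four made-up points, under natural units, z-scores, and min-max scaling. The variables `x1` and `x2` and both partitions are hypothetical.

```python
from statistics import mean, pstdev

def zscore(column):
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

def minmax(column):
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def within_cluster_ss(columns, clusters):
    """K-Means objective: squared distance of every point to its cluster centroid."""
    total = 0.0
    for members in clusters:
        for column in columns:
            centroid = mean(column[i] for i in members)
            total += sum((column[i] - centroid) ** 2 for i in members)
    return total

# Made-up data: x1 is in large units, x2 in small units.
x1 = [100.0, 200.0, 300.0, 400.0]
x2 = [1.0, 9.0, 9.0, 1.0]

# Two candidate partitions (lists of row indices per cluster).
split_on_x1 = [[0, 1], [2, 3]]  # groups rows by low vs. high x1
split_on_x2 = [[0, 3], [1, 2]]  # groups rows by low vs. high x2

scalings = {"natural": lambda c: c, "z-score": zscore, "min-max": minmax}
preferred = {}
for name, scale in scalings.items():
    columns = [scale(x1), scale(x2)]
    ss_x1 = within_cluster_ss(columns, split_on_x1)
    ss_x2 = within_cluster_ss(columns, split_on_x2)
    # The partition with the lower objective is the one K-Means favors.
    preferred[name] = "x1" if ss_x1 < ss_x2 else "x2"
    print(name, round(ss_x1, 3), round(ss_x2, 3), preferred[name])
```

In natural units the large-magnitude variable decides the clustering; after either scaling, the clearly bimodal variable does.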
K Means Clustering Variable Importance: Natural vs. Scaled
Measurements are F Statistic
K Means Clustering Standard Deviations: Natural vs. Scaled
Measurements are F Statistic
Scale Matters!!
PCA: Natural Units
PCA: Natural Units (zoomed)
PCA: Scaled Units
PCA: Scaled Units (zoomed)
PCA: Scaled and Dummy Scaling
Scale Matters!!
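A two-variable sketch of why scale matters for PCA, under the assumption of made-up data in very different units. The names `income` and `visits` are hypothetical. For two variables, the first principal component can be computed in closed form from the 2x2 covariance matrix, so no linear algebra library is needed.

```python
import math
from statistics import mean, stdev

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def first_component(xs, ys):
    """Unit-length loadings of the first principal component for two variables."""
    a, b, c = covariance(xs, xs), covariance(xs, ys), covariance(ys, ys)
    # Largest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]].
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # (lam - c, b) is an eigenvector for lam (valid here because b != 0).
    vx, vy = lam - c, b
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

def zscore(values):
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up, positively correlated variables in very different units.
income = [100.0, 200.0, 300.0, 400.0, 500.0]
visits = [1.0, 3.0, 2.0, 5.0, 4.0]

natural_loadings = first_component(income, visits)
scaled_loadings = first_component(zscore(income), zscore(visits))
print(natural_loadings)
print(scaled_loadings)
```

In natural units the first component is almost entirely the large-unit variable; after z-scoring, both variables load equally.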
Input Variable Interactions
Simple Interaction Function
Four Classifiers
Naïve Bayes
Decision Tree, min Leaf node 50 records
Logistic Regression
Rprop Neural Net, 300 epochs
Errors
Naïve Bayes
Decision Tree, min Leaf node 50 records
Logistic Regression
Rprop Neural Net, 300 epochs
True correct
False incorrect
False correct
True incorrect
Don’t Build Interactions Manually*
* Except for those you know about
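A sketch of what a known interaction looks like as an explicit feature. The interaction function used in the slides is not given in the text, so the target below (1 when `x` and `y` have the same sign) is a hypothetical stand-in chosen for illustration. Neither input is correlated with the target on its own, but their product is.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from first principles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Symmetric grid of made-up inputs.
grid = [-0.9, -0.5, -0.1, 0.1, 0.5, 0.9]
points = [(x, y) for x in grid for y in grid]

# Hypothetical interaction target: 1 when x and y share a sign.
target = [1.0 if x * y > 0 else 0.0 for x, y in points]

x_only = [x for x, _ in points]
product = [x * y for x, y in points]  # manually built interaction term

r_single = pearson_r(x_only, target)
r_product = pearson_r(product, target)
print(round(r_single, 3), round(r_product, 3))
```

A model that is linear in its inputs has nothing to work with until the product term is supplied, whereas trees and neural networks can approximate such structure from the raw inputs.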
Automatic Interaction Detection
Summary
Data Preparation Step | Linear Regression | K-NN | K-Means Clustering | PCA | Neural Networks | Decision Trees (Classifiers) |
Fill Missing Values | Y | Y | Y | Y | Y | * |
Correlation Filtering | Y | Y | Y | * | | |
De-Skew (log, box-cox) | Y | Y | Y | Y | | |
Mitigate Outliers | Y | Y | Y | Y | * | |
Remove Magnitude Bias (Scale) | | Y | Y | Y | * | |
Remove Categorical "Dummy" Bias | Y | Y | Y | Y | | |
Find Interactions | Y | Y | Y | Y | | |
And Now, the Output
Model
Output
Inputs
Stratify or Not to Stratify… That is the Question!?
5.1% TARGET_B = 1: unbalanced data
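A minimal sketch of equal-size (stratified) sampling on a made-up data set with the same 5.1% response rate. Only the field name `TARGET_B` comes from the slides. The record layout, the data set size, and the random seed are assumptions for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Made-up data: 1,000 records, 51 of them responders (5.1%).
records = [{"id": i, "TARGET_B": 1 if i < 51 else 0} for i in range(1000)]

responders = [r for r in records if r["TARGET_B"] == 1]
non_responders = [r for r in records if r["TARGET_B"] == 0]

# Keep every responder and draw an equally sized sample of non-responders.
sampled = random.sample(non_responders, len(responders))
balanced = responders + sampled
random.shuffle(balanced)

natural_rate = len(responders) / len(records)
balanced_rate = sum(r["TARGET_B"] for r in balanced) / len(balanced)
print(natural_rate, balanced_rate, len(balanced))
```

The balanced sample reaches a 50% response rate by discarding most of the non-responders, which is the trade-off the following slides examine.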
Comparing Logistic Regression with and without Equal Size Sampling
Stratified Sampling
No Stratified Sampling
https://www.predictiveanalyticsworld.com/sanfrancisco/2013/pdf/Day2_1050_Abbott.pdf
Don’t Need to Stratify With Many Algorithms
https://www.predictiveanalyticsworld.com/sanfrancisco/2013/pdf/Day2_1050_Abbott.pdf
Know the Algorithm when Developing Sampling Strategy
Variable | Coeff. | Std. Err. | P>|z| | Coeff._natural | Std. Err._natural | P>|z|_natural | coeff diff | coeff compare |
RFA_2F | -0.133532984 | 0.0338 | 0.000 | -0.1563345 | 0.024 | 0.000 | 0.023 | within SE |
D_RFA_2A | -0.163727182 | 0.1210 | 0.176 | -0.0934212 | 0.079 | 0.237 | 0.070 | within SE |
F_RFA_2A | 0.038231571 | 0.0884 | 0.665 | 0.0357819 | 0.062 | 0.565 | 0.002 | within SE |
G_RFA_2A | 0.316663027 | 0.1267 | 0.012 | 0.2779701 | 0.091 | 0.002 | 0.039 | within SE |
DOMAIN2 | -0.068966948 | 0.0767 | 0.369 | -0.1169964 | 0.056 | 0.036 | 0.048 | within SE |
DOMAIN1 | -0.266408264 | 0.0837 | 0.001 | -0.2845323 | 0.060 | 0.000 | 0.018 | within SE |
NGIFTALL_log10 | -0.46212497 | 0.0998 | 0.000 | -0.4444304 | 0.072 | 0.000 | 0.018 | within SE |
LASTGIFT_log10 | 0.062766545 | 0.2044 | 0.759 | 0.1813683 | 0.141 | 0.199 | 0.119 | within SE |
Constant | 0.695770991 | 0.2785 | 0.012 | 3.5393926 | 0.194 | 0.000 | 2.844 | outside SE |
Stratified
Natural (orig)
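The pattern in the table (slopes within one standard error, only the constant differing) is what the standard result for outcome-based sampling in logistic regression predicts: resampling by class shifts the intercept by the change in log-odds of the class rates and leaves the slopes approximately unchanged. The sketch below compares that expected shift with the gap in the table, assuming the 5.1% natural rate and a 50% stratified rate from the earlier slides. Only magnitudes are compared, since the sign depends on which class the model treats as the event.

```python
import math

def log_odds(p):
    return math.log(p / (1.0 - p))

natural_rate = 0.051     # TARGET_B = 1 rate in the natural data (from the slides)
stratified_rate = 0.50   # rate after equal-size sampling (assumed exactly 50%)

# Expected intercept shift caused purely by the change in class proportions.
expected_shift = log_odds(stratified_rate) - log_odds(natural_rate)

# Constants reported in the table for the natural and stratified models.
observed_gap = abs(3.5393926 - 0.695770991)

print(round(expected_shift, 3), round(observed_gap, 3))
```

The two values are close (roughly 2.92 versus 2.84), which suggests the stratified and natural models differ mainly by a constant offset rather than in how they rank records.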
Know the Algorithm when Developing Sampling Strategy
Stratified
Natural (orig)
i.e., 1 split
i.e., lots of splits!
0.7%
50%
Know the Algorithm when Developing Sampling Strategy
Stratified
Natural (orig)
i.e., no model
5%
50%
What We Want to Do!
What We Actually Did
Can We Find a Good Fit?
Conclusions