Integrating Data Mining Techniques: An Example of a Logistic Regression Model Integrated with TreeNet and Association Rules Analysis
Pannapa Changpetch
Department of Mathematics, Faculty of Science, Mahidol University, Bangkok, Thailand
Data mining techniques
Logistic regression
kNN
Naïve Bayes
Classification Tree
Association Rules
Cluster Analysis
Random Forest
TreeNet
Multinomial Logit model
Why do we need to integrate?
Our goal is to improve the performance of the main technique.
Main technique: Logistic regression
logit(p) = ln(p / (1 − p)) = β0 + β1X1 + β2X2 + … + βkXk,
where p is the probability of success, P(Y = 1),
the X’s are the predictors, and
the β’s are the coefficients of the predictors.
Example
The response (Y) with two levels
Level 1: Infected with disease A
Level 0: No infection
The first predictor (X1) = Age
The second predictor (X2) = Cholesterol level
The third predictor (X3) = Body mass index
Example
ID | X1 | X2 | X3 | Y |
1 | 25 | 113 | 18.5 | 0 |
2 | 32 | 189 | 25.5 | 1 |
3 | 45 | 201 | 26.4 | 1 |
. | | | | |
. | | | | |
. | | | | |
. | | | | |
. | | | | |
100 | 21 | 125 | 20.1 | 0 |
Example
After fitting the model with X1, X2, and X3, we obtain the fitted equation.
Is a new observation with X1 = 40, X2 = 200, and X3 = 28 more likely to be infected with disease A or to be free of the disease?
P(Y = 1) = 0.94 and P(Y = 0) = 0.06
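A minimal sketch of this fit-and-predict workflow, using synthetic data shaped like the table above (age, cholesterol level, body mass index); the data, labels, and resulting probability are illustrative, not the study's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 100-patient table: age, cholesterol, BMI.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(20, 70, 100),           # X1: age
    rng.integers(100, 260, 100),         # X2: cholesterol level
    rng.uniform(17, 35, 100).round(1),   # X3: body mass index
])
# Illustrative labels: infection more likely at higher predictor values.
y = (X[:, 0] + 0.2 * X[:, 1] + X[:, 2] > 110).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# P(Y = 1) for the new observation from the slide.
p = model.predict_proba([[40, 200, 28]])[0, 1]
print(f"P(Y = 1) = {p:.2f}")
```

Classifying the new observation then amounts to comparing p against 0.5 (or another chosen threshold).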
Why does logistic regression need help?
First technique that helps: TreeNet
(www.salford-systems.com/treenet.html, Friedman, 2001).
TreeNet: Partial dependency plot
Second technique that helps: Association rules analysis (ASA)
if (condition), then (result).
Association rules analysis: Two measures, support and confidence
Association rules analysis: Example
“if X1 = 1, X2 = 0, then Y = 1”
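The two measures can be computed directly for the example rule. A small sketch on a toy dataset (rule support here counts rows satisfying the whole rule, antecedent and consequent together):

```python
import pandas as pd

# Toy dataset to evaluate the rule "if X1 = 1, X2 = 0, then Y = 1".
df = pd.DataFrame({
    "X1": [1, 1, 1, 0, 1, 0, 1, 0],
    "X2": [0, 0, 0, 1, 0, 1, 1, 0],
    "Y":  [1, 1, 0, 0, 1, 1, 1, 0],
})

antecedent = (df["X1"] == 1) & (df["X2"] == 0)
rule = antecedent & (df["Y"] == 1)

support = rule.mean()                       # fraction of rows satisfying the whole rule
confidence = rule.sum() / antecedent.sum()  # P(Y = 1 | antecedent holds)
print(f"support = {support:.1%}, confidence = {confidence:.1%}")
```

Here 3 of 8 rows satisfy the whole rule (support 37.5%), and 3 of the 4 rows matching the antecedent also have Y = 1 (confidence 75%).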
Note: To perform ASA, we use the CBA program developed by the Department of Information Systems and Computer Sciences at the National University of Singapore (http://www.comp.nus.edu.sg/~dm2/).
Inputs and Outputs for the Proposed Model Selection Procedure
Data → Step 1: TreeNet → Generated categorical variables
Generated categorical variables + original categorical predictors → Step 2: ASA and Rules Selection → Potential rules
Potential rules → Step 3: Interactions Generation → Potential interactions
Potential interactions + original predictors + generated variables → Step 4: Model Selection → Optimal model
Application: Hepatitis dataset
Detail
Objective
Of the 17 predictors in the analysis, 13 are categorical predictors.
Attribute | Levels | Binary Variables |
Sex | male, female | X1 = 1 if male and X1 = 0 if female |
Steroid | yes, no | X2 = 1 if yes and X2 = 0 if no |
Antivirals | yes, no | X3 = 1 if yes and X3 = 0 if no |
Fatigue | yes, no | X4 = 1 if yes and X4 = 0 if no |
Malaise | yes, no | X5 = 1 if yes and X5 = 0 if no |
Anorexia | yes, no | X6 = 1 if yes and X6 = 0 if no |
Liver Big | yes, no | X7 = 1 if yes and X7 = 0 if no |
Liver Firm | yes, no | X8 = 1 if yes and X8 = 0 if no |
Spleen Palpable | yes, no | X9 = 1 if yes and X9 = 0 if no |
Spiders | yes, no | X10 = 1 if yes and X10 = 0 if no |
Ascites | yes, no | X11 = 1 if yes and X11 = 0 if no |
Varices | yes, no | X12 = 1 if yes and X12 = 0 if no |
Histology | yes, no | X13 = 1 if yes and X13 = 0 if no |
Quantitative variables: Bilirubin (X14), SGOT (X15), Albumin (X16), and Age (X17)
Step 1: TreeNet
Discretize the quantitative variables into categories using TreeNet. For example, Albumin (X16) is split into three levels:
X16L1 = 1 if X16 < 3.3 and X16L1 = 0 otherwise
X16L2 = 1 if 3.3 ≤ X16 < 3.9 and X16L2 = 0 otherwise
X16L3 = 1 if 3.9 ≤ X16 and X16L3 = 0 otherwise
Generated Binary Variables
Original variables | Generated binary variables |
Bilirubin (X14) | X14L1 = 1 if X14 < 1.3 and X14L1 = 0 otherwise |
X14L2 = 1 if 1.3 ≤ X14 < 1.9 and X14L2 = 0 otherwise | |
X14L3 = 1 if 1.9 ≤ X14 and X14L3 = 0 otherwise | |
SGOT (X15) | X15L1 = 1 if X15 < 65 and X15L1 = 0 otherwise |
X15L2 = 1 if 65 ≤ X15 < 120 and X15L2 = 0 otherwise | |
X15L3 = 1 if 120 ≤ X15 and X15L3 = 0 otherwise | |
Albumin (X16) | X16L1 = 1 if X16 < 3.3 and X16L1 = 0 otherwise |
X16L2 = 1 if 3.3 ≤ X16 < 3.9 and X16L2 = 0 otherwise | |
X16L3 = 1 if 3.9 ≤ X16 and X16L3 = 0 otherwise | |
Age (X17) | X17L1 = 1 if X17 < 36 and X17L1 = 0 otherwise |
X17L2 = 1 if 36 ≤ X17 < 47 and X17L2 = 0 otherwise | |
X17L3 = 1 if 47 ≤ X17 < 50 and X17L3 = 0 otherwise | |
X17L4 = 1 if 50 ≤ X17 and X17L4 = 0 otherwise |
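The table above can be reproduced mechanically once the cutpoints are read off the TreeNet partial dependency plots. A sketch, where `make_indicators` is a hypothetical helper (not from the paper) that builds one 0/1 column per interval:

```python
import pandas as pd

# Cutpoints as read off the table above.
cutpoints = {"X14": [1.3, 1.9], "X15": [65, 120], "X16": [3.3, 3.9], "X17": [36, 47, 50]}

def make_indicators(df, var, cuts):
    """Add one 0/1 column per interval: below the first cut, each half-open
    middle interval, and at-or-above the last cut."""
    edges = [-float("inf")] + list(cuts) + [float("inf")]
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:]), start=1):
        df[f"{var}L{i}"] = ((df[var] >= lo) & (df[var] < hi)).astype(int)

# A few illustrative patients.
df = pd.DataFrame({"X14": [1.0, 1.5, 2.2], "X15": [50, 100, 150],
                   "X16": [3.0, 3.5, 4.0], "X17": [30, 40, 55]})
for var, cuts in cutpoints.items():
    make_indicators(df, var, cuts)
print(df.filter(like="X16L"))
```

With m cutpoints this yields m + 1 indicator columns per variable, matching the three levels for X14–X16 and four for Age above.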
Step 2: ASA and Rules Selection
Generate rules with CBA and select potential rules from the 26 classifier rules it produces
Example Rule 1:
If X10 = 0, X11 = 0, X14L3 = 0, X16L2 = 0, then Y = 0
with support = 47.287%, confidence = 100%
Step 3: Interactions Generation
Generate interactions from selected rules
From the rule "if X10 = 0, X11 = 0, X14L3 = 0, and X16L2 = 0, then Y = 0", define the interaction
X10(0)X11(0)X14L3(0)X16L2(0) = 1 if X10 = 0, X11 = 0, X14L3 = 0, and X16L2 = 0, and 0 otherwise.
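Turning a rule's antecedent into such an indicator is a simple conjunction over its conditions. A sketch (the column name and `interaction_from_rule` helper are illustrative, not from the paper):

```python
import pandas as pd

# Antecedent of the selected rule, as (variable, required value) pairs.
rule = {"X10": 0, "X11": 0, "X14L3": 0, "X16L2": 0}

def interaction_from_rule(df, rule):
    """0/1 indicator that equals 1 exactly when every rule condition holds."""
    ind = pd.Series(1, index=df.index)
    for var, val in rule.items():
        ind &= (df[var] == val).astype(int)
    return ind

# Three illustrative patients: only the first satisfies every condition.
df = pd.DataFrame({"X10": [0, 0, 1], "X11": [0, 1, 0],
                   "X14L3": [0, 0, 0], "X16L2": [0, 0, 1]})
df["rule_interaction"] = interaction_from_rule(df, rule)
print(df["rule_interaction"].tolist())  # [1, 0, 0]
```

Each selected rule contributes one such candidate interaction term for the model-selection step.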
Step 4: Model Selection
Select optimal model based on model selection criterion
If Spiders = yes, Bilirubin is at least 1.9, and Albumin is not between 3.3 and 3.9, then the patient is predicted to be in the "die" class.
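The slides do not name the selection criterion; assuming BIC as one common choice, the step can be sketched as comparing candidate logistic models with and without a rule-based interaction on synthetic data (the data, effect sizes, and `bic` helper are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bic(model, X, y):
    """BIC = k ln(n) - 2 log-likelihood, k = coefficients plus intercept."""
    p = model.predict_proba(X)[:, 1]
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    k = X.shape[1] + 1
    return k * np.log(len(y)) - 2 * loglik

# Synthetic outcome driven by a rule-based interaction, as in Step 3.
rng = np.random.default_rng(1)
n = 300
X10, X11 = rng.integers(0, 2, n), rng.integers(0, 2, n)
inter = ((X10 == 0) & (X11 == 0)).astype(int)
y = (rng.random(n) < 0.1 + 0.8 * inter).astype(int)

candidates = {
    "main effects only": np.column_stack([X10, X11]),
    "main effects + interaction": np.column_stack([X10, X11, inter]),
}
scores = {}
for name, X in candidates.items():
    m = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)  # near-unpenalized
    scores[name] = bic(m, X, y)
    print(name, round(scores[name], 1))
```

On data like this, where the outcome truly depends on the conjunction, adding the interaction term typically lowers the BIC despite the extra parameter.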
LOOCV Accuracy for Four Methods
Dataset | Classification Tree | Random Forest | SVM | Logistic + TreeNet + ASA |
Hepatitis | 81.40% | 86.05% | 85.27% | 93.80% |
Heart Disease | 81.14% | 83.16% | 84.15% | 86.20% |
Heart Failure | 75.25% | 74.92% | 74.25% | 88.96% |
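Leave-one-out cross-validation, as used in the comparison above, fits the model n times, each time holding out a single observation for testing. A minimal sketch with scikit-learn on a synthetic stand-in dataset (the study used the Hepatitis, Heart Disease, and Heart Failure datasets):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic binary-classification data standing in for a real dataset.
X, y = make_classification(n_samples=80, n_features=6, random_state=0)

# LOOCV accuracy: one fit per held-out observation, averaged.
acc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=LeaveOneOut(), scoring="accuracy").mean()
print(f"LOOCV accuracy: {acc:.2%}")
```

For the integrated procedure, the model-selection steps would be repeated inside each of the n folds before scoring the held-out observation.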
Conclusion
Integrating TreeNet discretization and association-rules-based interactions into logistic regression yields higher LOOCV accuracy than classification trees, random forests, and SVM on all three datasets.
THANK YOU