1 of 30

Integrating Data Mining Techniques: An Example of a Logistic Regression Model Integrated with TreeNet and Association Rules Analysis

Pannapa Changpetch

 

Department of Mathematics, Faculty of Science, Mahidol University, Bangkok, Thailand

2 of 30

Data mining techniques

Logistic regression

kNN

Naïve Bayes

Classification Tree

Association Rules

Cluster Analysis

Random Forest

TreeNet

Multinomial Logit model

4 of 30

Why do we need to integrate?

Our goal is to improve the performance of the main technique.

5 of 30

Main technique: Logistic regression

log(p / (1 − p)) = β0 + β1X1 + β2X2 + … + βkXk

where p is the probability of success, P(Y = 1),

the X's are the predictors, and

the β's are the coefficients of the predictors.

 

6 of 30

Example

The response (Y) with two levels

Level 1: Infected with disease A

Level 0: No infection

The first predictor (X1) = Age

The second predictor (X2) = Cholesterol level

The third predictor (X3) = Body mass index

7 of 30

Example

ID    X1    X2     X3     Y
1     25    113    18.5   0
2     32    189    25.5   1
3     45    201    26.4   1
…     …     …      …      …
100   21    125    20.1   0

8 of 30

Example

After fitting the model with X1, X2, and X3, we obtain the estimated coefficients.

Is a new observation with X1 = 40, X2 = 200, and X3 = 28 more likely to be infected with disease A or free from the disease?

Substituting these values into the fitted model gives

P(Y = 1) = 0.94 and P(Y = 0) = 0.06,

so the new observation is predicted to be infected with disease A.
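The substitution step can be sketched in Python. The coefficients below are hypothetical placeholders (the fitted values are not reproduced on the slide), and `predict_prob` is an illustrative helper, not part of any library.

```python
import math

def predict_prob(beta0, betas, x):
    # P(Y = 1) from a fitted logistic regression:
    # log(p / (1 - p)) = beta0 + sum(beta_i * x_i)
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients; the deck's fitted values are not shown.
p = predict_prob(-10.0, [0.1, 0.02, 0.1], [40, 200, 28])
label = 1 if p > 0.5 else 0  # classify as infected if P(Y = 1) > 0.5
```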

11 of 30

Why does logistic regression need help?

  • A nonlinear relationship may exist between the quantitative predictors and the log-odds.

  • High-order interactions are typically ignored in logistic regression modeling.

12 of 30

First technique which helps: TreeNet

  • A data mining technique developed by Jerome H. Friedman at Stanford University.

  • Broad definition: a multi-tree (boosted) method.

  • We apply TreeNet models in this work

(www.salford-systems.com/treenet.html; Friedman, 2001).

13 of 30

TreeNet: Partial dependency plot

  • The vertical axis represents half of the log-odds of the response, i.e., (1/2)·log(p/(1−p)).
  • The horizontal axis represents the value of the predictor.
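Because the vertical axis is half the log-odds, a plotted value f can be mapped back to a probability via p = 1/(1 + exp(−2f)). A minimal sketch (the helper names are illustrative):

```python
import math

def half_logodds_to_prob(f):
    # If f = 0.5 * log(p / (1 - p)), then p = 1 / (1 + exp(-2 * f)).
    return 1.0 / (1.0 + math.exp(-2.0 * f))

def prob_to_half_logodds(p):
    # Inverse mapping, useful for locating a probability on the plot's axis.
    return 0.5 * math.log(p / (1.0 - p))
```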

14 of 30

Second technique which helps: Association rules analysis

  • Association rules analysis (ARA) is a methodology for exploring relationships among items in the form of rules.

  • Each rule has two parts.
    • A left-hand side item(s) or a condition
    • A right-hand side item(s) or a result

  • The rule is always represented as a statement:

if (condition), then (result).

15 of 30

Association rules analysis: Two measurements

  • Two measurements are calculated for each rule “if A, then B”.
    • Support (s) is the fraction of records that satisfy both A and B:
      s = P(A and B)
    • Confidence (c) is the fraction of records satisfying A that also satisfy B:
      c = P(B | A) = P(A and B) / P(A)

  • Find all the rules that satisfy two thresholds: minimum support and minimum confidence (Agrawal and Srikant, 1994).
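The two measures can be sketched by counting over a list of records, each a dict of attribute values. The helper names and the tiny dataset are illustrative, not from the deck.

```python
def support(records, condition, result):
    # Fraction of records satisfying both the condition and the result.
    both = sum(1 for r in records
               if all(r.get(k) == v for k, v in {**condition, **result}.items()))
    return both / len(records)

def confidence(records, condition, result):
    # Among records satisfying the condition, the fraction also satisfying the result.
    cond = [r for r in records
            if all(r.get(k) == v for k, v in condition.items())]
    both = sum(1 for r in cond
               if all(r.get(k) == v for k, v in result.items()))
    return both / len(cond) if cond else 0.0

# Hypothetical records for the rule "if X1 = 1, X2 = 0, then Y = 1"
data = [
    {"X1": 1, "X2": 0, "Y": 1},
    {"X1": 1, "X2": 0, "Y": 1},
    {"X1": 1, "X2": 0, "Y": 0},
    {"X1": 0, "X2": 0, "Y": 1},
]
s = support(data, {"X1": 1, "X2": 0}, {"Y": 1})     # 2 of 4 records
c = confidence(data, {"X1": 1, "X2": 0}, {"Y": 1})  # 2 of the 3 matching records
```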

16 of 30

Association rules analysis: Example

  • Given k binary factors, X1, X2, …, Xk, and a binary response Y

  • For the rule

“if X1 = 1, X2 = 0, then Y = 1”

    s = (number of records with X1 = 1, X2 = 0, and Y = 1) / (total number of records)

    c = (number of records with X1 = 1, X2 = 0, and Y = 1) / (number of records with X1 = 1 and X2 = 0)

Note: To perform ARA, we use the CBA program developed by the Department of Information Systems and Computer Sciences at the National University of Singapore (http://www.comp.nus.edu.sg/~dm2/).

17 of 30

Inputs and Outputs for Proposed Model Selection Procedure

Step 1: TreeNet
  Input: data
  Output: generated categorical variables

Step 2: ARA and Rules Selection
  Input: generated categorical variables + original categorical predictors
  Output: potential rules

Step 3: Interactions Generation
  Input: potential rules
  Output: potential interactions

Step 4: Model Selection
  Input: potential interactions + original predictors + generated variables
  Output: optimal model

18 of 30

Application: Hepatitis dataset

Objective

  • To classify new patients into two classes, labeled 0 (live) and 1 (die)

19 of 30

Of the 17 predictors in the analysis, 13 are categorical.

Attribute          Levels          Binary Variable
Sex                male, female    X1 = 1 if male, 0 if female
Steroid            yes, no         X2 = 1 if yes, 0 if no
Antivirals         yes, no         X3 = 1 if yes, 0 if no
Fatigue            yes, no         X4 = 1 if yes, 0 if no
Malaise            yes, no         X5 = 1 if yes, 0 if no
Anorexia           yes, no         X6 = 1 if yes, 0 if no
Liver Big          yes, no         X7 = 1 if yes, 0 if no
Liver Firm         yes, no         X8 = 1 if yes, 0 if no
Spleen Palpable    yes, no         X9 = 1 if yes, 0 if no
Spiders            yes, no         X10 = 1 if yes, 0 if no
Ascites            yes, no         X11 = 1 if yes, 0 if no
Varices            yes, no         X12 = 1 if yes, 0 if no
Histology          yes, no         X13 = 1 if yes, 0 if no

20 of 30

Quantitative variables

  • Bilirubin (X14)
  • SGOT (X15)
  • Albumin (X16)
  • Age (X17)

21 of 30

Step 1: TreeNet

Discretize the quantitative variables into categories using TreeNet

X16L1 = 1 if X16 < 3.3 and X16L1 = 0 otherwise

22 of 30

Step 1: TreeNet

Discretize the quantitative variables into categories using TreeNet

X16L2 = 1 if 3.3 ≤ X16 < 3.9 and X16L2 = 0 otherwise

23 of 30

Step 1: TreeNet

Discretize the quantitative variables into categories using TreeNet

X16L3 = 1 if 3.9 ≤ X16 and X16L3 = 0 otherwise

24 of 30

Generated Binary Variables

Bilirubin (X14):
  X14L1 = 1 if X14 < 1.3 and X14L1 = 0 otherwise
  X14L2 = 1 if 1.3 ≤ X14 < 1.9 and X14L2 = 0 otherwise
  X14L3 = 1 if 1.9 ≤ X14 and X14L3 = 0 otherwise

SGOT (X15):
  X15L1 = 1 if X15 < 65 and X15L1 = 0 otherwise
  X15L2 = 1 if 65 ≤ X15 < 120 and X15L2 = 0 otherwise
  X15L3 = 1 if 120 ≤ X15 and X15L3 = 0 otherwise

Albumin (X16):
  X16L1 = 1 if X16 < 3.3 and X16L1 = 0 otherwise
  X16L2 = 1 if 3.3 ≤ X16 < 3.9 and X16L2 = 0 otherwise
  X16L3 = 1 if 3.9 ≤ X16 and X16L3 = 0 otherwise

Age (X17):
  X17L1 = 1 if X17 < 36 and X17L1 = 0 otherwise
  X17L2 = 1 if 36 ≤ X17 < 47 and X17L2 = 0 otherwise
  X17L3 = 1 if 47 ≤ X17 < 50 and X17L3 = 0 otherwise
  X17L4 = 1 if 50 ≤ X17 and X17L4 = 0 otherwise
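The level indicators above all follow one pattern, which can be sketched with a small helper (`discretize` is an illustrative function; the cut points are the ones listed in the table):

```python
def discretize(x, cuts):
    # One-hot level indicators for a quantitative value, given sorted cut points.
    # cuts = [c1, ..., cm] defines m + 1 levels:
    #   level 1: x < c1, level 2: c1 <= x < c2, ..., level m+1: cm <= x
    levels = [0] * (len(cuts) + 1)
    levels[sum(x >= c for c in cuts)] = 1
    return levels

# Albumin (X16) with cut points 3.3 and 3.9 -> (X16L1, X16L2, X16L3)
x16l1, x16l2, x16l3 = discretize(3.5, [3.3, 3.9])  # 3.3 <= 3.5 < 3.9, so X16L2 = 1
```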

25 of 30

Step 2: ARA and Rules Selection

Generate rules from CBA and select potential rules based on the 26 classifier rules

Example Rule 1:

If X10 = 0, X11 = 0, X14L3 = 0, X16L2 = 0, then Y = 0

with support = 47.287%, confidence = 100%

26 of 30

Step 3: Interactions Generation

Generate interactions from selected rules

If X10 = 0, X11 = 0, X14L3 = 0, X16L2 = 0, then Y = 0

X10(0)X11(0)X14L3(0)X16L2(0) = 1

if X10 = 0, X11 = 0, X14L3 = 0 and X16L2 = 0,

and 0 otherwise
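The construction of the interaction indicator can be sketched as follows (`interaction` is an illustrative helper):

```python
def interaction(record, condition):
    # 1 if the record satisfies every (variable, value) pair in the condition,
    # 0 otherwise -- e.g., X10(0)X11(0)X14L3(0)X16L2(0).
    return int(all(record.get(k) == v for k, v in condition.items()))

cond = {"X10": 0, "X11": 0, "X14L3": 0, "X16L2": 0}
rec = {"X10": 0, "X11": 0, "X14L3": 0, "X16L2": 0, "Y": 0}
flag = interaction(rec, cond)  # all four conditions hold, so the indicator is 1
```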

27 of 30

Step 4: Model Selection

Select optimal model based on model selection criterion

If Spiders = yes, Bilirubin is at least 1.9, and Albumin is not between 3.3 and 3.9,

then the patient is predicted to be in the die (1) class

28 of 30

LOOCV Accuracy for Four Methods

Dataset          Classification Tree   Random Forest   SVM      Logistic + TreeNet + ARA
Hepatitis        81.40%                86.05%          85.27%   93.80%
Heart Disease    81.14%                83.16%          84.15%   86.20%
Heart Failure    75.25%                74.92%          74.25%   88.96%
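A generic LOOCV accuracy loop can be sketched in Python. Any fit/predict pair can be plugged in; here a trivial majority-class rule stands in for the classifiers above, and the tiny dataset is hypothetical.

```python
def loocv_accuracy(X, y, fit, predict):
    # Leave-one-out cross-validation: train on n - 1 cases, test on the held-out one.
    correct = 0
    for i in range(len(y)):
        Xtr = X[:i] + X[i+1:]
        ytr = y[:i] + y[i+1:]
        model = fit(Xtr, ytr)
        correct += (predict(model, X[i]) == y[i])
    return correct / len(y)

# Trivial stand-in classifier: always predict the training majority class.
fit = lambda X, y: max(set(y), key=y.count)
predict = lambda m, x: m

acc = loocv_accuracy([[0], [1], [2], [3]], [0, 0, 0, 1], fit, predict)
```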

29 of 30

Conclusion

  • The selected terms are a combination of the original variables and the newly generated variables.

  • TreeNet is an effective technique for transforming variables.

  • High-order interactions, which make a critical difference in the logistic regression model, can be found with association rules analysis.

30 of 30

THANK YOU