1 of 30

Integrating Data Mining Techniques: An Example of a Logistic Regression Model Integrated with TreeNet and Association Rules Analysis

Pannapa Changpetch

 

Department of Mathematics, Faculty of Science, Mahidol University, Bangkok, Thailand

2 of 30

Data mining techniques

Logistic regression

kNN

Naïve Bayes

Classification Tree

Association Rules

Cluster Analysis

Random Forest

TreeNet

Multinomial Logit model

4 of 30

Why do we need to integrate?

Our goal is to improve the performance of the main technique.

5 of 30

Main technique: Logistic regression

log(p / (1 − p)) = β0 + β1X1 + β2X2 + … + βkXk

where p is the probability of success, P(Y = 1),

the X's are the predictors, and

the β's are the coefficients of the predictors.

 

6 of 30

Example

The response (Y) with two levels

Level 1: Infected with disease A

Level 0: No infection

The first predictor (X1) = Age

The second predictor (X2) = Cholesterol level

The third predictor (X3) = Body mass index

7 of 30

Example

ID    X1    X2     X3     Y
1     25    113    18.5   0
2     32    189    25.5   1
3     45    201    26.4   1
…     …     …      …      …
100   21    125    20.1   0

8 of 30

Example

After fitting the model with X1, X2, and X3, we obtain the estimated coefficients.

Is a new observation with X1 = 40, X2 = 200, and X3 = 28 more likely to be infected with disease A or free from the disease?

Substituting these values into the fitted model gives

P(Y = 1) = 0.94 and P(Y = 0) = 0.06,

so the new observation is predicted to be infected with disease A.
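The substitution step can be sketched in Python. The coefficients below are hypothetical placeholders (the fitted values are not reproduced on the slide), and `predict_prob` is an illustrative helper, not part of any library.

```python
import math

def predict_prob(beta0, betas, x):
    # P(Y = 1) from a fitted logistic regression:
    # log(p / (1 - p)) = beta0 + sum(beta_i * x_i)
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients; the deck's fitted values are not shown.
p = predict_prob(-10.0, [0.1, 0.02, 0.1], [40, 200, 28])
label = 1 if p > 0.5 else 0  # classify as infected if P(Y = 1) > 0.5
```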

11 of 30

Why does logistic regression need help?

  • A nonlinear relationship may exist between the quantitative predictors and the log-odds.

  • High-order interactions are typically ignored in logistic regression modeling.

12 of 30

First technique which helps: TreeNet

  • A data mining technique developed by Jerome H. Friedman at Stanford University.

  • Broad definition: a multi-tree (boosted) method.

  • We apply TreeNet models in this work

(www.salford-systems.com/treenet.html; Friedman, 2001).

13 of 30

TreeNet: Partial dependency plot

  • The vertical axis represents half of the log-odds of the response, i.e., (1/2)·log(p/(1−p)).
  • The horizontal axis represents the value of the predictor.
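Because the vertical axis is half the log-odds, a plotted value f can be mapped back to a probability via p = 1/(1 + exp(−2f)). A minimal sketch (the helper names are illustrative):

```python
import math

def half_logodds_to_prob(f):
    # If f = 0.5 * log(p / (1 - p)), then p = 1 / (1 + exp(-2 * f)).
    return 1.0 / (1.0 + math.exp(-2.0 * f))

def prob_to_half_logodds(p):
    # Inverse mapping, useful for locating a probability on the plot's axis.
    return 0.5 * math.log(p / (1.0 - p))
```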

14 of 30

Second technique which helps: Association rules analysis

  • Association rules analysis (ARA) is a methodology for exploring relationships among items in the form of rules.

  • Each rule has two parts.
    • A left-hand side item(s) or a condition
    • A right-hand side item(s) or a result

  • The rule is always represented as a statement:

if (condition), then (result).

15 of 30

Association rules analysis: Two measurements

  • Two measurements are calculated for each rule “if A, then B”.
    • Support (s) is the fraction of records that satisfy both A and B:
      s = P(A and B)
    • Confidence (c) is the fraction of records satisfying A that also satisfy B:
      c = P(B | A) = P(A and B) / P(A)

  • Find all the rules that satisfy two thresholds: minimum support and minimum confidence (Agrawal and Srikant, 1994).
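The two measures can be sketched by counting over a list of records, each a dict of attribute values. The helper names and the tiny dataset are illustrative, not from the deck.

```python
def support(records, condition, result):
    # Fraction of records satisfying both the condition and the result.
    both = sum(1 for r in records
               if all(r.get(k) == v for k, v in {**condition, **result}.items()))
    return both / len(records)

def confidence(records, condition, result):
    # Among records satisfying the condition, the fraction also satisfying the result.
    cond = [r for r in records
            if all(r.get(k) == v for k, v in condition.items())]
    both = sum(1 for r in cond
               if all(r.get(k) == v for k, v in result.items()))
    return both / len(cond) if cond else 0.0

# Hypothetical records for the rule "if X1 = 1, X2 = 0, then Y = 1"
data = [
    {"X1": 1, "X2": 0, "Y": 1},
    {"X1": 1, "X2": 0, "Y": 1},
    {"X1": 1, "X2": 0, "Y": 0},
    {"X1": 0, "X2": 0, "Y": 1},
]
s = support(data, {"X1": 1, "X2": 0}, {"Y": 1})     # 2 of 4 records
c = confidence(data, {"X1": 1, "X2": 0}, {"Y": 1})  # 2 of the 3 matching records
```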

16 of 30

Association rules analysis: Example

  • Given k binary factors, X1, X2, …, Xk, and a binary response Y

  • For the rule

“if X1 = 1, X2 = 0, then Y = 1”

    s = (number of records with X1 = 1, X2 = 0, and Y = 1) / (total number of records)

    c = (number of records with X1 = 1, X2 = 0, and Y = 1) / (number of records with X1 = 1 and X2 = 0)

Note: To perform ARA, we use the CBA program developed by the Department of Information Systems and Computer Sciences at the National University of Singapore (http://www.comp.nus.edu.sg/~dm2/).

17 of 30

Inputs and Outputs for Proposed Model Selection Procedure

Step 1: TreeNet
  Input: data
  Output: generated categorical variables

Step 2: ARA and Rules Selection
  Input: generated categorical variables + original categorical predictors
  Output: potential rules

Step 3: Interactions Generation
  Input: potential rules
  Output: potential interactions

Step 4: Model Selection
  Input: potential interactions + original predictors + generated variables
  Output: optimal model

18 of 30

Application: Hepatitis dataset

Objective

  • To classify new patients into two classes, labeled 0 (live) and 1 (die)

19 of 30

Of the 17 predictors in the analysis, 13 are categorical.

Attribute          Levels          Binary Variable
Sex                male, female    X1 = 1 if male, 0 if female
Steroid            yes, no         X2 = 1 if yes, 0 if no
Antivirals         yes, no         X3 = 1 if yes, 0 if no
Fatigue            yes, no         X4 = 1 if yes, 0 if no
Malaise            yes, no         X5 = 1 if yes, 0 if no
Anorexia           yes, no         X6 = 1 if yes, 0 if no
Liver Big          yes, no         X7 = 1 if yes, 0 if no
Liver Firm         yes, no         X8 = 1 if yes, 0 if no
Spleen Palpable    yes, no         X9 = 1 if yes, 0 if no
Spiders            yes, no         X10 = 1 if yes, 0 if no
Ascites            yes, no         X11 = 1 if yes, 0 if no
Varices            yes, no         X12 = 1 if yes, 0 if no
Histology          yes, no         X13 = 1 if yes, 0 if no

20 of 30

Quantitative variables

  • Bilirubin (X14)
  • SGOT (X15)
  • Albumin (X16)
  • Age (X17)

21 of 30

Step 1: TreeNet

Discretize the quantitative variables into categories using TreeNet

X16L1 = 1 if X16 < 3.3 and X16L1 = 0 otherwise

22 of 30

Step 1: TreeNet

Discretize the quantitative variables into categories using TreeNet

X16L2 = 1 if 3.3 ≤ X16 < 3.9 and X16L2 = 0 otherwise

23 of 30

Step 1: TreeNet

Discretize the quantitative variables into categories using TreeNet

X16L3 = 1 if 3.9 ≤ X16 and X16L3 = 0 otherwise

24 of 30

Generated Binary Variables

Bilirubin (X14):
  X14L1 = 1 if X14 < 1.3 and X14L1 = 0 otherwise
  X14L2 = 1 if 1.3 ≤ X14 < 1.9 and X14L2 = 0 otherwise
  X14L3 = 1 if 1.9 ≤ X14 and X14L3 = 0 otherwise

SGOT (X15):
  X15L1 = 1 if X15 < 65 and X15L1 = 0 otherwise
  X15L2 = 1 if 65 ≤ X15 < 120 and X15L2 = 0 otherwise
  X15L3 = 1 if 120 ≤ X15 and X15L3 = 0 otherwise

Albumin (X16):
  X16L1 = 1 if X16 < 3.3 and X16L1 = 0 otherwise
  X16L2 = 1 if 3.3 ≤ X16 < 3.9 and X16L2 = 0 otherwise
  X16L3 = 1 if 3.9 ≤ X16 and X16L3 = 0 otherwise

Age (X17):
  X17L1 = 1 if X17 < 36 and X17L1 = 0 otherwise
  X17L2 = 1 if 36 ≤ X17 < 47 and X17L2 = 0 otherwise
  X17L3 = 1 if 47 ≤ X17 < 50 and X17L3 = 0 otherwise
  X17L4 = 1 if 50 ≤ X17 and X17L4 = 0 otherwise
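The level indicators above all follow one pattern, which can be sketched with a small helper (`discretize` is an illustrative function; the cut points are the ones listed in the table):

```python
def discretize(x, cuts):
    # One-hot level indicators for a quantitative value, given sorted cut points.
    # cuts = [c1, ..., cm] defines m + 1 levels:
    #   level 1: x < c1, level 2: c1 <= x < c2, ..., level m+1: cm <= x
    levels = [0] * (len(cuts) + 1)
    levels[sum(x >= c for c in cuts)] = 1
    return levels

# Albumin (X16) with cut points 3.3 and 3.9 -> (X16L1, X16L2, X16L3)
x16l1, x16l2, x16l3 = discretize(3.5, [3.3, 3.9])  # 3.3 <= 3.5 < 3.9, so X16L2 = 1
```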

25 of 30

Step 2: ARA and Rules Selection

Generate rules from CBA and select potential rules based on the 26 classifier rules

Example Rule 1:

If X10 = 0, X11 = 0, X14L3 = 0, X16L2 = 0, then Y = 0

with support = 47.287%, confidence = 100%

26 of 30

Step 3: Interactions Generation

Generate interactions from selected rules

If X10 = 0, X11 = 0, X14L3 = 0, X16L2 = 0, then Y = 0

X10(0)X11(0)X14L3(0)X16L2(0) = 1

if X10 = 0, X11 = 0, X14L3 = 0 and X16L2 = 0,

and 0 otherwise
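The construction of the interaction indicator can be sketched as follows (`interaction` is an illustrative helper):

```python
def interaction(record, condition):
    # 1 if the record satisfies every (variable, value) pair in the condition,
    # 0 otherwise -- e.g., X10(0)X11(0)X14L3(0)X16L2(0).
    return int(all(record.get(k) == v for k, v in condition.items()))

cond = {"X10": 0, "X11": 0, "X14L3": 0, "X16L2": 0}
rec = {"X10": 0, "X11": 0, "X14L3": 0, "X16L2": 0, "Y": 0}
flag = interaction(rec, cond)  # all four conditions hold, so the indicator is 1
```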

27 of 30

Step 4: Model Selection

Select optimal model based on model selection criterion

If Spiders = yes, Bilirubin is at least 1.9, and Albumin is not between 3.3 and 3.9,

then the patient is predicted to be in the die (1) class

28 of 30

LOOCV Accuracy for Four Methods

Dataset          Classification Tree   Random Forest   SVM      Logistic + TreeNet + ARA
Hepatitis        81.40%                86.05%          85.27%   93.80%
Heart Disease    81.14%                83.16%          84.15%   86.20%
Heart Failure    75.25%                74.92%          74.25%   88.96%
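A generic LOOCV accuracy loop can be sketched in Python. Any fit/predict pair can be plugged in; here a trivial majority-class rule stands in for the classifiers above, and the tiny dataset is hypothetical.

```python
def loocv_accuracy(X, y, fit, predict):
    # Leave-one-out cross-validation: train on n - 1 cases, test on the held-out one.
    correct = 0
    for i in range(len(y)):
        Xtr = X[:i] + X[i+1:]
        ytr = y[:i] + y[i+1:]
        model = fit(Xtr, ytr)
        correct += (predict(model, X[i]) == y[i])
    return correct / len(y)

# Trivial stand-in classifier: always predict the training majority class.
fit = lambda X, y: max(set(y), key=y.count)
predict = lambda m, x: m

acc = loocv_accuracy([[0], [1], [2], [3]], [0, 0, 0, 1], fit, predict)
```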

29 of 30

Conclusion

  • The selected terms are a combination of the original variables and the newly generated variables.

  • TreeNet is an effective technique for transforming variables.

  • High-order interactions, which make a critical difference in the logistic regression model, can be found with association rules analysis.

30 of 30

THANK YOU