LO 4.2.3.E

Learning Objective: Describe the construction of classification trees using classification error rate, Gini index, and cross-entropy.

Review:

We use recursive binary splitting to grow a classification tree. Instead of the RSS used for regression trees, the splitting criterion is the classification error rate.

The classification error rate is the fraction of the training observations in that region that do not belong to the most common class:

<math xmlns="http://www.w3.org/1998/Math/MathML"><mi>E</mi><mo>=</mo><mn>1</mn><mo>-</mo><munder><mrow><mi>m</mi><mi>a</mi><mi>x</mi></mrow><mi>k</mi></munder><mfenced><msub><mover><mi>p</mi><mo>^</mo></mover><mrow><mi>m</mi><mi>k</mi></mrow></msub></mfenced></math>

where

p̂mk represents the proportion of training observations in the mth region that are from the kth class.
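As a minimal sketch (not from the source), the classification error rate can be computed directly from a node's class proportions:

```python
def classification_error(proportions):
    """E = 1 - max_k(p_mk): the fraction of training observations
    in the region that do not belong to the most common class."""
    return 1 - max(proportions)

# Example: a node with class proportions 0.7, 0.2, 0.1 has
# classification error E = 1 - 0.7 = 0.3.
```

Note that E depends only on the largest proportion, which is why it is less sensitive to changes in the other class proportions than the measures below.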

The Gini index is an alternative measure of node purity: the lower the Gini index, the purer the node. The Gini index function is defined by

<math xmlns="http://www.w3.org/1998/Math/MathML"><mi>G</mi><mo>=</mo><munderover><mo>&#x2211;</mo><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow><mi>K</mi></munderover><msub><mover><mi>p</mi><mo>^</mo></mover><mrow><mi>m</mi><mi>k</mi></mrow></msub><mfenced><mrow><mn>1</mn><mo>-</mo><msub><mover><mi>p</mi><mo>^</mo></mover><mrow><mi>m</mi><mi>k</mi></mrow></msub></mrow></mfenced></math>

where

p̂mk represents the proportion of training observations in the mth region that are from the kth class.
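The Gini index can be sketched the same way, summing p̂mk(1 - p̂mk) over the classes (a minimal illustration, not from the source):

```python
def gini_index(proportions):
    """G = sum_k p_mk * (1 - p_mk): small when each p_mk is
    close to 0 or 1, i.e. when the node is nearly pure."""
    return sum(p * (1 - p) for p in proportions)

# A pure node (all observations in one class) has G = 0;
# a two-class node split 50/50 has G = 0.5, the two-class maximum.
```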

Cross-entropy is also used as a measure of node purity: the lower the cross-entropy, the purer the node. The cross-entropy function is defined by

<math xmlns="http://www.w3.org/1998/Math/MathML"><mi>D</mi><mo>=</mo><mo>-</mo><munderover><mo>&#x2211;</mo><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow><mi>K</mi></munderover><msub><mover><mi>p</mi><mo>^</mo></mover><mrow><mi>m</mi><mi>k</mi></mrow></msub><mi>log</mi><mo>&#x2061;</mo><msub><mover><mi>p</mi><mo>^</mo></mover><mrow><mi>m</mi><mi>k</mi></mrow></msub></math>

where

p̂mk represents the proportion of training observations in the mth region that are from the kth class.
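Cross-entropy follows the same pattern (a minimal sketch, not from the source); terms with p̂mk = 0 are skipped, following the convention that 0·log(0) is treated as 0:

```python
import math

def cross_entropy(proportions):
    """D = -sum_k p_mk * log(p_mk): like the Gini index, this is
    near zero when each p_mk is close to 0 or 1 (a pure node)."""
    return -sum(p * math.log(p) for p in proportions if p > 0)

# A pure node has D = 0; a two-class node split 50/50 has D = log(2).
```

In practice the Gini index and cross-entropy give very similar splits, and both are more sensitive to node purity than the classification error rate, which only tracks the majority class.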