1 of 45

Decision Trees


2 of 45

Decision Tree Learning

  • Decision tree representation
  • Appropriate problems for decision tree learning
  • The basic decision tree learning algorithm
  • Hypothesis space search in decision tree learning
  • Inductive bias in decision tree learning
  • Issues in decision tree learning

3 of 45

Decision Tree Learning

  • It is a method of approximating discrete-valued functions that is robust to noisy data and capable of learning disjunctive expressions
  • The learned function is represented by a decision tree.
  • Disjunctive expressions – (A ∧ B ∧ C) ∨ (D ∧ E ∧ F)

4 of 45

Decision Tree Representation

  • Each internal node tests an attribute
  • Each branch corresponds to an attribute value
  • Each leaf node assigns a classification

PlayTennis: This decision tree classifies Saturday mornings according to whether or not they are suitable for playing tennis

5 of 45

Decision Tree Representation - Classification

  • An example is classified by sorting it through the tree from the root to the leaf node
  • Example – (Outlook = Sunny, Humidity = High) => (PlayTennis = No)

PlayTennis: This decision tree classifies Saturday mornings according to whether or not they are suitable for playing tennis
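The sorting procedure can be sketched in a few lines of Python. The nested-dict tree encoding and the `classify` helper below are illustrative assumptions, not code from the slides; the example instance matches the one above.

```python
def classify(tree, instance):
    """Sort an instance through the tree: at each internal node, test the
    node's attribute and follow the branch matching the instance's value."""
    while isinstance(tree, dict):               # internal node
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree                                 # a leaf is the classification

# A fragment of the PlayTennis tree
tree = {"attribute": "Outlook",
        "branches": {"Sunny": {"attribute": "Humidity",
                               "branches": {"High": "No", "Normal": "Yes"}},
                     "Overcast": "Yes"}}

print(classify(tree, {"Outlook": "Sunny", "Humidity": "High"}))  # No
```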

6 of 45

Appropriate problems for decision tree learning

  • Instances are represented by attribute-value pairs
  • Target function has discrete output values
  • Disjunctive description may be required
  • Training data may contain error
  • Training data may contain missing attribute values
  • Examples – Classification Problems
  • Equipment malfunction or medical diagnosis
  • Loan applicants by their likelihood of defaulting on payments

7 of 45

Basic Decision Tree Learning Algorithm (ID3)

  • Top-down construction of the tree, beginning with the question "Which attribute should be tested at the root of the tree?"
  • Each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples.
  • The best attribute is selected and used as the test at the root node of the tree.
  • A descendant of the root node is then created for each possible value of this attribute.
  • The training examples are sorted to the appropriate descendant node
  • The entire process is then repeated at the descendant node using the training examples associated with each descendant node
  • GREEDY Approach
  • No Backtracking - So we may get a suboptimal solution.

8 of 45

ID3 Algorithm

9 of 45

Top-Down induction of decision trees

  1. Find A = the best decision attribute for next node
  2. Assign A as decision attribute for node
  3. For each value of A create new descendants of node
  4. Sort the training examples to the leaf nodes.
  5. If training examples classified perfectly, STOP else iterate over the new leaf nodes.
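The five steps can be sketched as a recursive Python function. This is an illustrative sketch, not the slides' code: the dict-based tree encoding and the `best_attribute` selector parameter are assumptions (information gain would be the usual selector).

```python
from collections import Counter

def id3(examples, attributes, best_attribute):
    """Top-down induction: `examples` are dicts with a 'label' key."""
    labels = [ex["label"] for ex in examples]
    if len(set(labels)) == 1:          # perfectly classified -> leaf (step 5)
        return labels[0]
    if not attributes:                 # no tests left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(examples, attributes)   # steps 1-2: pick best attribute
    node = {"attribute": a, "branches": {}}
    for value in {ex[a] for ex in examples}:   # step 3: one branch per value
        subset = [ex for ex in examples if ex[a] == value]  # step 4: sort examples
        node["branches"][value] = id3(subset,
                                      [x for x in attributes if x != a],
                                      best_attribute)       # step 5: iterate
    return node
```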

10 of 45

Which attribute is the best classifier?

  • Information Gain – A statistical property that measures how well a given attribute separates the training examples according to their target classification.
  • This measure is used to select among the candidate attributes at each step while growing the tree.

11 of 45

Entropy

  • S is a sample of training examples
  • p+ is the proportion of positive examples in S
  • p- is the proportion of negative examples in S
  • Then the entropy measures the impurity of S:

Entropy(S) = -p+ log2(p+) - p- log2(p-)

  • If the target attribute can take c different values:

Entropy(S) = Σ (i = 1 to c) -pi log2(pi)

where pi is the proportion of S belonging to class i.

The entropy varies between 0 and 1: if all the members belong to the same class, the entropy is 0; if there are equal numbers of positive and negative examples, the entropy is 1.

12 of 45

  • Entropy = 0 if all members of S belong to the same class
  • Entropy = 1 when the collection contains equal numbers of positive and negative examples

  • If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1

13 of 45

Entropy - Example

  • Entropy([29+, 35-])

= - (29/64) log2(29/64) - (35/64) log2(35/64)

= 0.994
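This calculation can be checked with a few lines of Python; the `entropy` helper below is an illustrative sketch, not code from the slides.

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                       # 0 * log2(0) is taken to be 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy(29, 35), 3))  # 0.994
```

The boundary cases match the previous slide: `entropy(10, 0)` is 0 (all one class) and `entropy(8, 8)` is 1 (an even split).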

14 of 45

Information Gain

  • Gain(S, A) = expected reduction in entropy due to sorting on A

The measure of the effectiveness of an attribute in classifying the training data is called information gain:

Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A)) (|Sv|/|S|) Entropy(Sv)

where Values(A) is the set of all possible values of attribute A, and Sv is the subset of S for which A has value v. The second term is simply the sum of the entropies of the subsets Sv, weighted by the fraction |Sv|/|S| of examples that belong to Sv.

15 of 45

Information Gain - Example

  • Gain(S, A1) = 0.994 - (26/64)(0.706) - (38/64)(0.742) = 0.266
  • Information gained by partitioning along attribute A1 is 0.266

  • Gain(S, A2) = 0.994 - (51/64)(0.937) - (13/64)(0.619) = 0.121
  • Information gained by partitioning along attribute A2 is 0.121
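Both gains can be reproduced in Python. The class counts inside each subset ([21+, 5-] and [8+, 30-] for A1; [18+, 33-] and [11+, 2-] for A2) are taken from the standard textbook figure this example comes from; they are consistent with the subset sizes and entropies shown above.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def gain(parent, subsets):
    """Gain(S, A): parent entropy minus the size-weighted subset entropies."""
    total = sum(parent)
    return entropy(parent) - sum(sum(s) / total * entropy(s) for s in subsets)

# S = [29+, 35-]; A1 splits it into [21+, 5-] and [8+, 30-]
print(round(gain([29, 35], [[21, 5], [8, 30]]), 3))   # 0.266
# A2 splits it into [18+, 33-] and [11+, 2-]
print(round(gain([29, 35], [[18, 33], [11, 2]]), 3))  # 0.121
```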

16 of 45

Hypothesis space search in decision tree learning

17 of 45


HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING

  • ID3 can be characterized as searching a space of hypotheses for one that fits the training examples.

  • The hypothesis space searched by ID3 is the set of possible decision trees.

  • ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space, beginning with the empty tree, then considering progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data.

  • The evaluation function that guides this hill-climbing search is the information gain measure.

18 of 45


19 of 45


ID3 capabilities and limitations in search space and search strategy.

  • ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued functions, relative to the available attributes.
  • ID3 maintains only a single current hypothesis as it searches through the space of decision trees.
  • ID3 in its pure form performs no backtracking in its search.
  • ID3 uses all training examples at each step in the search to make statistically based decisions regarding how to refine its current hypothesis.

20 of 45


  • An advantage of using statistical properties of all the examples (e.g., information gain) is that the resulting search is much less sensitive to errors in individual training examples.

  •  ID3 can be easily extended to handle noisy training data by modifying its termination criterion to accept hypotheses that imperfectly fit the training data

21 of 45


INDUCTIVE BIAS IN DECISION TREE LEARNING

22 of 45


INDUCTIVE BIAS IN DECISION TREE LEARNING

  • Inductive bias is the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances.

  • Describing the inductive bias of ID3 therefore consists of describing the basis by which it chooses one of the consistent hypotheses over the others in the space of all possible decision trees.

 

23 of 45


Approximate inductive bias of ID3

  • Shorter trees are preferred over larger trees.

  • This bias can be approximated by imagining an algorithm (BFS-ID3) that performs a breadth-first search through the space of decision trees and returns the shallowest tree consistent with the training data.

24 of 45


Restriction Biases and Preference Biases

  • ID3 searches a complete hypothesis space (i.e., one capable of expressing any finite discrete-valued function). It searches incompletely through this space, from simple to complex hypotheses, until its termination condition is met (e.g., until it finds a hypothesis consistent with the data).

  • Its hypothesis space introduces no additional bias.

  • The version space CANDIDATE-ELIMINATION algorithm searches an incomplete hypothesis space (i.e., one that can express only a subset of the potentially teachable concepts).

  • It searches this space completely, finding every hypothesis consistent with the training data. Its inductive bias is solely a consequence of the expressive power of its hypothesis representation.

 

25 of 45


  •  Its search strategy introduces no additional bias.

  • The inductive bias of ID3 follows from its search strategy, whereas the inductive bias of the CANDIDATE-ELIMINATION algorithm follows from the definition of its search space.

  •  The inductive bias of ID3 is thus a preference for certain hypotheses over others (e.g., for shorter hypotheses), with no hard restriction on the hypotheses that can be eventually enumerated. This form of bias is typically called a preference bias (or, alternatively, a search bias).

  • In contrast, the bias of the CANDIDATE-ELIMINATION algorithm is in the form of a categorical restriction on the set of hypotheses considered. This form of bias is typically called a restriction bias (or, alternatively, a language bias).

26 of 45


Why Prefer Short Hypotheses?

  • William of Occam was one of the first to discuss the question, around the year 1320, so this bias often goes by the name of Occam's razor.

  • Occam's razor: Prefer the simplest hypothesis that fits the data.

  • It is the problem-solving principle that the simplest solution tends to be the right one.

  • When presented with competing hypotheses to solve a problem, one should select the solution with the fewest assumptions.

27 of 45


28 of 45

ISSUES IN DECISION TREE LEARNING

29 of 45

ISSUES IN DECISION TREE LEARNING

  1. Avoiding overfitting the data
     • Reduced-error pruning
     • Rule post-pruning
  2. Incorporating continuous-valued attributes
  3. Alternative measures for selecting attributes
  4. Handling training examples with missing attribute values
  5. Handling attributes with differing costs

30 of 45


Avoiding Overfitting the Data

Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.

31 of 45


32 of 45


There are several approaches to avoiding overfitting in decision tree learning.

These can be grouped into two classes:

  1. Approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data,
  2. Approaches that allow the tree to overfit the data, and then post-prune the tree.

33 of 45

To determine the correct final tree size, approaches include:

Use a separate set of examples, distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree.

Use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set. For example, Quinlan (1986) uses a chi-square test to estimate whether further expanding a node is likely to improve performance over the entire instance distribution, or only on the current sample of training data.

Use an explicit measure of the complexity for encoding the training examples and the decision tree, halting growth of the tree when this encoding size is minimized. This approach is based on a heuristic called the Minimum Description Length principle.

34 of 45


The available data are separated into two sets of examples: a training set, which is used to form the learned hypothesis, and a separate validation set, which is used to evaluate the accuracy of this hypothesis over subsequent data and, in particular, to evaluate the impact of pruning this hypothesis.

Withhold one-third of the available examples for the validation set, using the other two-thirds for training.

35 of 45

REDUCED ERROR PRUNING

One approach, called reduced-error pruning (Quinlan 1987), is to consider each of the decision nodes in the tree to be a candidate for pruning. Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.

Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set.

36 of 45


The major drawback of this approach is that when data is limited, withholding part of it for the validation set reduces even further the number of examples available for training.

37 of 45

RULE POST-PRUNING

One quite successful method for finding high-accuracy hypotheses is a technique called rule post-pruning.

1. Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing overfitting to occur.

2. Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.

3. Prune (generalize) each rule by removing any preconditions that result in improving its estimated accuracy.

4. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.

38 of 45


In rule post-pruning, one rule is generated for each leaf node in the tree. Each attribute test along the path from the root to the leaf becomes a rule antecedent (precondition) and the classification at the leaf node becomes the rule consequent (postcondition).

IF (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No
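Step 2 of rule post-pruning (one rule per root-to-leaf path) can be sketched as follows; the nested-dict tree encoding and the `tree_to_rules` helper are illustrative assumptions, not code from the slides.

```python
def tree_to_rules(tree, preconditions=()):
    """One rule per root-to-leaf path, as (list of (attribute, value) tests,
    classification) pairs."""
    if not isinstance(tree, dict):              # leaf node -> emit one rule
        return [(list(preconditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        test = (tree["attribute"], value)       # each test becomes a precondition
        rules.extend(tree_to_rules(subtree, preconditions + (test,)))
    return rules

tree = {"attribute": "Outlook",
        "branches": {"Sunny": {"attribute": "Humidity",
                               "branches": {"High": "No", "Normal": "Yes"}},
                     "Overcast": "Yes"}}
```

Applied to this fragment, `tree_to_rules(tree)` yields three rules, including the one above: IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No.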

39 of 45

Incorporating Continuous-Valued Attributes

  • ID3's restrictions on attributes:

  • First, the target attribute whose value is predicted by the learned tree must be discrete-valued.

  • Second, the attributes tested in the decision nodes of the tree must also be discrete-valued.

  • The second restriction can be removed by dynamically defining new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals.

  • For an attribute A that is continuous-valued, the algorithm can dynamically create a new boolean attribute Ac that is true if A < c and false otherwise.

40 of 45

  • These candidate thresholds can then be evaluated by computing the information gain associated with each.

  • In the current example, there are two candidate thresholds, corresponding to the values of Temperature at which the value of PlayTennis changes: (48 + 60)/2 = 54 and (80 + 90)/2 = 85.

  • The information gain can then be computed for each of the candidate attributes, (Temperature > 54) and (Temperature > 85), and the best selected (Temperature > 54). This dynamically created boolean attribute can then compete with the other discrete-valued candidate attributes available for growing the decision tree.
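Candidate-threshold selection can be sketched in Python. The function name is illustrative; the Temperature values and PlayTennis labels below are the standard textbook example behind the (48 + 60)/2 and (80 + 90)/2 midpoints mentioned above.

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, labels))  # [54.0, 85.0]
```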

41 of 45

Alternative Measures for Selecting Attributes

  • One alternative measure that has been used successfully is the gain ratio (Quinlan 1986). The gain ratio measure penalizes attributes such as Date by incorporating a term, called split information, that is sensitive to how broadly and uniformly the attribute splits the data:

SplitInformation(S, A) = - Σ (i = 1 to c) (|Si|/|S|) log2(|Si|/|S|)

where S1 through Sc are the c subsets of examples resulting from partitioning S by the c-valued attribute A.

  • The GainRatio measure is defined in terms of the earlier Gain measure, as well as this SplitInformation, as follows:

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
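A minimal Python sketch of split information and gain ratio (helper names are illustrative). It shows the penalty on attributes like Date: splitting n examples into n singleton subsets gives SplitInformation = log2(n).

```python
import math

def split_information(subset_sizes):
    """SplitInformation(S, A) = -sum_i (|Si|/|S|) * log2(|Si|/|S|)."""
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s)

def gain_ratio(gain, subset_sizes):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return gain / split_information(subset_sizes)

# A Date-like attribute splitting 8 examples into 8 singletons: log2(8) = 3
print(split_information([1] * 8))  # 3.0
# A balanced binary split only divides the gain by 1
print(split_information([8, 8]))   # 1.0
```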

42 of 45

Handling Training Examples with Missing Attribute Values

  • Consider the situation in which Gain(S, A) is to be calculated at node n in the decision tree, to evaluate whether the attribute A is the best attribute to test at this decision node. Suppose that (x, c(x)) is one of the training examples in S and that the value A(x) is unknown.

  • One strategy for dealing with the missing attribute value is to assign it the value that is most common among training examples at node n. Alternatively, we might assign it the most common value among examples at node n that have the classification c(x). The elaborated training example using this estimated value for A(x) can then be used directly by the existing decision tree learning algorithm. This strategy is examined by Mingers (1989a).

  • A second, more complex procedure is to assign a probability to each of the possible values of A rather than simply assigning the most common value to A(x).

  • These probabilities can be estimated based on the observed frequencies of the various values for A among the examples at node n.
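The first strategy (fill in the most common observed value) can be sketched as follows; the function and attribute names are illustrative, with `None` standing in for a missing value.

```python
from collections import Counter

def impute_most_common(examples, attribute):
    """Fill missing (None) values of `attribute` with the most common
    value observed among the examples at this node."""
    observed = [ex[attribute] for ex in examples if ex[attribute] is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [{**ex, attribute: ex[attribute] if ex[attribute] is not None else fill}
            for ex in examples]

data = [{"Wind": "Strong"}, {"Wind": "Strong"}, {"Wind": "Weak"}, {"Wind": None}]
print(impute_most_common(data, "Wind")[3])  # {'Wind': 'Strong'}
```

The probabilistic variant would instead split the example into fractional pieces, one per observed value, weighted by those frequencies.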

43 of 45

Handling Attributes with Differing Costs

  • ID3 can be modified to take into account attribute costs by introducing a cost term into the attribute selection measure.

  • For example, we might divide the Gain by the cost of the attribute, so that lower-cost attributes would be preferred. While such cost-sensitive measures do not guarantee finding an optimal cost-sensitive decision tree, they do bias the search in favour of low-cost attributes.

  • In one robot perception application (Tan and Schlimmer 1990), attribute cost is measured by the number of seconds required to obtain the attribute value by positioning and operating the sonar. They demonstrate that more efficient recognition strategies are learned, without sacrificing classification accuracy, by replacing the information gain attribute selection measure with the following measure:

Gain(S, A)² / Cost(A)

44 of 45

  • Nunez (1988) describes a related approach and its application to learning medical diagnosis rules. Here the attributes are different symptoms and laboratory tests with differing costs. His system uses a somewhat different attribute selection measure:

(2^Gain(S, A) - 1) / (Cost(A) + 1)^w

where w ∈ [0, 1] is a constant that determines the relative importance of cost versus information gain. Nunez (1991) presents an empirical comparison of these two approaches over a range of tasks.
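The two cost-sensitive measures, Tan and Schlimmer's Gain²/Cost and Nunez's (2^Gain - 1)/(Cost + 1)^w, are simple to state in Python; the function names are illustrative.

```python
def tan_schlimmer(gain, cost):
    """Gain(S, A)^2 / Cost(A): squared gain divided by attribute cost."""
    return gain ** 2 / cost

def nunez(gain, cost, w):
    """(2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]."""
    return (2 ** gain - 1) / (cost + 1) ** w

# With w = 0 the cost term vanishes and only information gain matters
print(nunez(1.0, 9.0, 0.0))  # 1.0
# With w = 1 the same gain on a cost-9 attribute is penalized tenfold
print(nunez(1.0, 9.0, 1.0))  # 0.1
```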

45 of 45

Thank you