1 of 14

Machine Learning in the Real World: The C4.5 Algorithm

2 of 14

Numeric attributes

D   E   T    Class
X   1   46   a
Y   3   65   b
X   2   68   b
Y   2   69   b
X   1   70   b
Y   3   71   a
X   2   72   a
Y   1   74   a
X   3   75   a
Y   3   75   a
X   1   80   a
Y   1   81   b
X   2   83   b
Y   2   85   b

IBS (information before split) = info([7,7]) = entropy(1/2, 1/2) = 1 (all logs are base 2, so entropies are in bits)

InfoGain(D) = 0.02

InfoGain(E) = 0.05

InfoGain(T) = ???
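
Every number on these slides reduces to the base-2 entropy of a class-count vector. A minimal Python sketch of that computation (the helper name `entropy` is ours, not part of C4.5 itself):

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(entropy([7, 7]))  # 1.0 -- the IBS above: 7 a's and 7 b's, one full bit
```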

3 of 14

InfoGain of a numeric attribute

  • Similar to class-dependent discretization:
    • putting a split point between two examples of the same class can never be optimal
  • So put candidate split points only between consecutive examples of different classes (a minimal sketch follows below).
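
A minimal sketch of this rule, assuming plain Python lists for the attribute values and class labels (all names are ours):

```python
def candidate_splits(values, classes):
    """Midpoints between consecutive sorted values whose classes differ."""
    pairs = sorted(zip(values, classes))
    splits = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:        # class boundary between distinct values
            splits.append((v1 + v2) / 2)
    return splits

T = [46, 65, 68, 69, 70, 71, 72, 74, 75, 75, 80, 81, 83, 85]
cls = list("abbbbaaaaaabbb")
print(candidate_splits(T, cls))  # [55.5, 70.5, 80.5] -- see the next slide
```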

4 of 14

Sort the examples by T and note where the class changes:

T       46   65   68   69   70   71   72   74   75   75   80   81   83   85
Class   a    b    b    b    b    a    a    a    a    a    a    b    b    b

Candidate split points: 55.5, 70.5, 80.5

Split at 55.5:

entropy(1,0) = 0
entropy(6/13,7/13) = -6/13 * log(6/13) - 7/13 * log(7/13) = 0.996

T        a   b   P(a)   P(b)
< 55.5   1   0   1      0
> 55.5   6   7   6/13   7/13

T        Entropy   Probability
< 55.5   0         1/14
> 55.5   0.996     13/14

Info(T, 55.5) = 0 * 1/14 + 0.996 * 13/14 = 0.925

Split at 70.5:

entropy(1/5,4/5) = -1/5 * log(1/5) - 4/5 * log(4/5) = 0.722
entropy(6/9,3/9) = -6/9 * log(6/9) - 3/9 * log(3/9) = 0.918

T        a   b   P(a)   P(b)
< 70.5   1   4   1/5    4/5
> 70.5   6   3   6/9    3/9

T        Entropy   Probability
< 70.5   0.722     5/14
> 70.5   0.918     9/14

Info(T, 70.5) = 0.722 * 5/14 + 0.918 * 9/14 = 0.848

Split at 80.5:

entropy(7/11,4/11) = -7/11 * log(7/11) - 4/11 * log(4/11) = 0.946
entropy(0,1) = 0

T        a   b   P(a)   P(b)
< 80.5   7   4   7/11   4/11
> 80.5   0   3   0      1

T        Entropy   Probability
< 80.5   0.946     11/14
> 80.5   0         3/14

Info(T, 80.5) = 0.946 * 11/14 + 0 * 3/14 = 0.743

IBS = 1

InfoGain(T, 55.5) = 1 - 0.925 = 0.075
InfoGain(T, 70.5) = 1 - 0.848 = 0.152
InfoGain(T, 80.5) = 1 - 0.743 = 0.257
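
Putting the pieces together, a self-contained sketch (names ours) that reproduces the three gains above:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(values, classes, split):
    """InfoGain of the binary split 'value < split' vs. 'value > split'."""
    n = len(values)
    labels = sorted(set(classes))
    left  = [sum(1 for v, c in zip(values, classes) if v < split and c == lab)
             for lab in labels]
    right = [sum(1 for v, c in zip(values, classes) if v > split and c == lab)
             for lab in labels]
    before = entropy([classes.count(lab) for lab in labels])
    after = sum(left) / n * entropy(left) + sum(right) / n * entropy(right)
    return before - after

T = [46, 65, 68, 69, 70, 71, 72, 74, 75, 75, 80, 81, 83, 85]
cls = list("abbbbaaaaaabbb")
for s in (55.5, 70.5, 80.5):
    print(s, round(info_gain(T, cls, s), 3))  # 0.075, 0.152, 0.257
```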

5 of 14

InfoGain of the T attribute

  • 3 possible InfoGains for the 3 possible break points:
    • InfoGain(T, 55.5) = 0.075
    • InfoGain(T, 70.5) = 0.152
    • InfoGain(T, 80.5) = 0.257
  • Choose the highest one, at break point 80.5
  • The InfoGain of the T attribute is therefore 0.257
  • Attribute T has the highest InfoGain 🡺 it gets chosen (InfoGain(D) = 0.02, InfoGain(E) = 0.05)

[Figure: root node splits on T, with branches T < 80.5 and T > 80.5]

6 of 14

Missing values

Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
?          Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No

7 of 14

Missing values – calculation

  • n examples in the learning set
  • Attributes with no missing values: no change – take all n examples into account
  • If attribute A has k missing values:
    • calculate the InfoGain on the n - k examples for which the value is known
    • multiply this InfoGain by the factor (n - k)/n (a sketch follows below)
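
A sketch of this correction for a nominal attribute, with our own helper names; `'?'` marks a missing value:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain_missing(attr, classes, missing='?'):
    """InfoGain computed on the known examples, scaled by (n - k) / n."""
    n = len(attr)
    known = [(a, c) for a, c in zip(attr, classes) if a != missing]
    labels = sorted(set(classes))
    def dist(rows):
        return [sum(1 for c in rows if c == lab) for lab in labels]
    before = entropy(dist([c for _, c in known]))
    after = sum(len([c for a, c in known if a == v]) / len(known)
                * entropy(dist([c for a, c in known if a == v]))
                for v in set(a for a, _ in known))
    return len(known) / n * (before - after)
```

Applied to the Outlook column of the weather table, this reproduces the 0.199 worked out on the next slide.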

8 of 14

Gain("outlook")

Outlook

Yes

No

P(yes)

P(no)

Entropy�(bits)

Probability

Sunny

2

3

2/5

3/5

Overcast

3

0

3/3

0/3

Rainy

3

2

3/5

2/5

Gain("outlook") = 13/14 * (information before split – information after split)

Entropy:

Sunny: Info([2,3]) = entropy(2/5,3/5) = -2/5 * log(2/5) – 3/5 * log(3/5) = 0.971

0.971

0

0.971

5/13

3/13

5/13

Overcast: Info([3,0]) = entropy(1,0) = -1 * log(1) – 0 * log(0) = 0

Rainy: Info([3,2]) = entropy(3/5,2/5) = -3/5 * log(3/5) – 2/5 * log(2/5) = 0.971

Info([2,3],[3,0],[3,2]) = 0.971 * (5/13) + 0 * (3/13) + 0.971 * (5/13) = 0.747

Gain("outlook") = 13/14 * (0.961– 0.747) = 0.199

IBS = Info([8,5]) = entropy(8/13,5/13) = -5/13 * log(5/13) – 8/13 * log(8/13) = 0.961
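
A numeric check of this slide's arithmetic, with counts taken from the table above (the helper name is ours):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

branches = {'Sunny': [2, 3], 'Overcast': [3, 0], 'Rainy': [3, 2]}  # [Yes, No]
known = 13                                    # examples with a known Outlook
before = entropy([8, 5])                                         # 0.961 bits
after = sum(sum(c) / known * entropy(c) for c in branches.values())  # 0.747
print(round(known / 14 * (before - after), 3))                   # 0.199
```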

9 of 14

Choosing the "best" attribute

  • Information Gain:
    • Outlook: 0.199 bits
    • Temperature: 0.029 bits
    • Humidity: 0.152 bits
    • Windy: 0.048 bits

10 of 14

Splitting on Outlook: the example with the missing Outlook value is sent down all three branches, carrying a fractional weight proportional to each branch's size (5/13, 5/13, 3/13).

Outlook = Sunny:

Outlook   Temp   Humidity   Windy   Play   Weight
Sunny     Hot    High       False   No     1
Sunny     Hot    High       True    No     1
Sunny     Mild   High       False   No     1
Sunny     Cool   Normal     False   Yes    1
Sunny     Mild   Normal     True    Yes    1
?         Mild   High       True    Yes    5/13

Outlook = Rainy:

Outlook   Temp   Humidity   Windy   Play   Weight
Rainy     Mild   High       False   Yes    1
Rainy     Cool   Normal     False   Yes    1
Rainy     Cool   Normal     True    No     1
Rainy     Mild   Normal     False   Yes    1
Rainy     Mild   High       True    No     1
?         Mild   High       True    Yes    5/13

Outlook = Overcast:

Outlook    Temp   Humidity   Windy   Play   Weight
Overcast   Hot    High       False   Yes    1
Overcast   Cool   Normal     True    Yes    1
Overcast   Hot    Normal     False   Yes    1
?          Mild   High       True    Yes    3/13
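
A sketch of how these fractional weights could be assigned, assuming examples are stored as (row-dict, weight) pairs; all names are ours:

```python
from collections import defaultdict

def split_with_missing(rows, attr, missing='?'):
    """Partition weighted rows on attr; a row with a missing value goes down
    every branch, its weight scaled by that branch's share of known weight."""
    branches, unknown = defaultdict(list), []
    for row, w in rows:
        (unknown if row[attr] == missing else branches[row[attr]]).append((row, w))
    total = sum(w for branch in branches.values() for _, w in branch)
    for branch in branches.values():
        share = sum(w for _, w in branch) / total
        branch.extend((row, w * share) for row, w in unknown)
    return branches
```

For the weather data this sends the "?" example down the Sunny, Rainy and Overcast branches with weights 5/13, 5/13 and 3/13, matching the tables above.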

11 of 14

Continue in the Outlook=Sunny branch

Outlook   Temp   Humidity   Windy   Play   Weight
Sunny     Hot    High       False   No     1
Sunny     Hot    High       True    No     1
Sunny     Mild   High       False   No     1
Sunny     Cool   Normal     False   Yes    1
Sunny     Mild   Normal     True    Yes    1
?         Mild   High       True    Yes    5/13 ~ 0.4

Hum.     Yes   No   P(yes)            P(no)           Entropy (bits)   Probability
High     0.4   3    0.4/3.4 = 0.118   3/3.4 = 0.882   0.524            3.4/5.4 = 0.63
Normal   2     0    2/2 = 1           0/2 = 0         0                2/5.4 = 0.37

High: info([0.4,3]) = entropy(0.118, 0.882) = -0.118 * log(0.118) - 0.882 * log(0.882) = 0.524
Normal: info([2,0]) = entropy(1,0) = 0

IBS = Info([2.4,3]) = entropy(2.4/5.4, 3/5.4) = -0.44 * log(0.44) - 0.56 * log(0.56) = 0.99

Info([0.4,3],[2,0]) = 0.524 * 0.63 + 0 * 0.37 = 0.33

InfoGain("Humidity") = Info([2.4,3]) - Info([0.4,3],[2,0]) = 0.99 - 0.33 = 0.66
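
The same arithmetic with fractional counts, as a check (the helper name is ours):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Weighted [Yes, No] counts in the Sunny branch; 0.4 is the fractional example
high, normal = [0.4, 3.0], [2.0, 0.0]
before = entropy([2.4, 3.0])                                     # 0.99 bits
after = 3.4 / 5.4 * entropy(high) + 2.0 / 5.4 * entropy(normal)  # 0.33
print(round(before - after, 2))                                  # 0.66
```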

12 of 14

[Figure: the finished decision tree; its five leaves predict Yes (100%), Yes (100%), No (~80%), Yes (100%), and No (88%)]

13 of 14

Classifying examples with missing values

Outlook   Temp   Humidity   Windy   Play
Sunny     Cool   ?          False   ?

The example reaches the Outlook = Sunny subtree, but its Humidity is unknown, so both Humidity branches contribute, weighted by their training-set probabilities:

P(YES) = 3.4/5.4 * 12% + 2/5.4 * 100% = 44%

P(NO) = 3.4/5.4 * 88% + 2/5.4 * 0% = 56%

🡺 classify the example as No
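
A check of this calculation, blending the two Humidity leaves by their training-set weights (all names are ours):

```python
# Class distributions at the two Humidity leaves of the Sunny subtree
leaves = {'High':   {'Yes': 0.118, 'No': 0.882},   # the No (88%) leaf
          'Normal': {'Yes': 1.0,   'No': 0.0}}     # the Yes (100%) leaf
weights = {'High': 3.4 / 5.4, 'Normal': 2.0 / 5.4} # training-set shares

# Humidity is missing, so average the leaves, weighted by those shares
p_yes = sum(weights[b] * leaves[b]['Yes'] for b in leaves)
p_no  = sum(weights[b] * leaves[b]['No']  for b in leaves)
print(round(p_yes, 2), round(p_no, 2))  # 0.44 0.56 -> classify as No
```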

Class distribution in leaves: "4:3" means 4 examples with class yes and 3 examples with class no.

14 of 14

Making rules from a decision tree

  • Each path from the root to a leaf = one rule 🡺 number of rules = number of leaves in the tree (a sketch follows below)
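
A sketch of rule extraction from a nested-dict encoding of the tree; the tree structure below is our reconstruction of the deck's final tree, and all names are ours:

```python
tree = {'Outlook': {
    'Sunny':    {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
    'Overcast': 'Yes',
    'Rainy':    {'Windy': {'False': 'Yes', 'True': 'No'}},
}}

def rules(node, conditions=()):
    """Yield one (condition, class) rule per root-to-leaf path."""
    if isinstance(node, str):                  # a leaf: the path is complete
        yield ' AND '.join(conditions), node
        return
    (attr, branches), = node.items()           # one attribute test per inner node
    for value, child in branches.items():
        yield from rules(child, conditions + (f'{attr} = {value}',))

for cond, label in rules(tree):
    print(f'IF {cond} THEN Play = {label}')    # 5 rules, one per leaf
```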