Complete data set
Count = the number of candidates with this profile
Quality = the target we want to predict
a) Roles of training, validation and testing subsets
The training subset is used to build the model (here: to choose the splits of the decision tree), the validation subset is used to select among candidate models (here: to prune the tree in question e), and the testing subset is used only once, to estimate the generalization error of the final model (question f).
b) Using misclassification error as the impurity measure, what is the impurity of the training subset?
Impurity: I(S) = 1 − max_k p_k
Good: 50+5+15+1 = 71, Bad: 5+2+2+20=29
Impurity: 29/100 = 0.29 (with only two classes, the misclassification impurity is simply the fraction of the minority class)
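A minimal sketch of this computation (the class counts are those of the training subset above):

```python
def misclassification_impurity(counts):
    """I(S) = 1 - max_k p_k, where counts maps each class to its number of examples."""
    total = sum(counts.values())
    return 1 - max(counts.values()) / total

# Training subset: 71 Good and 29 Bad candidates out of 100
print(misclassification_impurity({"Good": 71, "Bad": 29}))  # 0.29
```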
c) Find the best splitting criterion to partition the training set. Give the information gain for every candidate splitting criterion.
Information gain: ΔI = I(S) − p_L · I(S_L) − p_R · I(S_R)
Using T1:
S_L : with T1 == P, P_L = (50+5+5+2)/100 = 62/100
I(S_L): Good = 55/62, Bad = 7/62
S_R: with T1 == F, P_R = 38/100
I(S_R): Good = 16/38, Bad = 22/38
Delta I = 29/100 - 62/100 * 7/62 - 38/100 * 16/38 = (29 - 7 - 16) / 100 = 6/100
Using T2:
S_L: P, P_L = (50+5+15+2)/100 = 72/100
I(S_L): Good = (50+15) = 65/72, Bad = (5+2)/72 = 7/72
S_R: F, P_R = (5+2+1+20)/100 = 28/100
I(S_R): Good = (5+1)/28 = 6/28, Bad = (2+20)/28 = 22/28
Delta I = 29/100 - 72/100 * 7/72 - 28/100 * 6/28 = (29 - 7 - 6)/100 = 16/100
Using T3:
S_L: P, P_L = (50+2+15+1+20)/100 = 88/100
I(S_L): Good = (50+15+1)/88 = 66/88, Bad = (2+20)/88 = 22/88
S_R: F, P_R = (5+5+2)/100 = 12/100
I(S_R): Good = 5/12, Bad = (5+2)/12
Delta I = 29/100 - 88/100 * 22/88 - 12/100 * 5/12 = (29 - 22 - 5)/100 = 2/100
⇒ T2 maximizes the information gain
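A short sketch that reproduces the three information gains; each pair below gives the (Good, Bad) counts of the left (Ti = P) and right (Ti = F) branches as tallied above:

```python
def impurity(counts):
    """Misclassification impurity of a tuple of class counts."""
    return 1 - max(counts) / sum(counts)

def info_gain(parent, left, right):
    """Delta I = I(S) - p_L * I(S_L) - p_R * I(S_R)."""
    n = sum(parent)
    return (impurity(parent)
            - sum(left) / n * impurity(left)
            - sum(right) / n * impurity(right))

parent = (71, 29)
print(info_gain(parent, (55, 7), (16, 22)))  # T1: 0.06
print(info_gain(parent, (65, 7), (6, 22)))   # T2: 0.16
print(info_gain(parent, (66, 22), (5, 7)))   # T3: 0.02
```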
d) Finish the construction of the decision tree using the following stopping criterion: all leaf nodes have a misclassification error (measured on the training set) less than or equal to 10 %.
Now we have two nodes:
if T2 = P then (Node1) else (Node2)
Node1
misclassification error is 7/72 < 10%
can stop here
Predict “Good” with confidence 65/72
Node2
misclassification error is 6/28 > 10%
continue splitting
Try T1, Condition T1 = P
S_L: P_L = 7/28
I(S_L): Good = 5/7, Bad = 2/7
S_R: P_R = 21/28
I(S_R): Good: 1/21, Bad = 20/21
Delta I = 6/28 - 7/28 * 2/7 - 21/28 * 1/21 = (6 - 2 - 1)/28 = 3/28
Try T3, Condition T3 = P
S_L: P_L = 23/28
I(S_L): Good = 1/23, Bad = 22/23
S_R: P_R = 5/28 (100% pure)
I(S_R): Good = 5/5, Bad = 0/5
Delta I = 6/28 - 23/28 * 1/23 - 0 = 5/28
⇒ Use T3 as a splitting criterion
Tree so far:
if T2 = P then
predict “Good” with confidence 65/72
else
if (T3 = P) then
(Node2.1)
else
(Node2.2)
Node2.1
misclassification error is 1/23 < 10%
can stop here
Predict “Bad” with confidence 22/23
Node2.2
misclassification error is 0
Predict “Good” with confidence 1
Tree:
if (T2 = P) then
predict “Good” with confidence 65/72
else
if (T3 = P) then
Predict “Bad” with confidence 22/23
else
Predict “Good” with confidence 1
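The final tree can also be written directly as a prediction function; the attribute values 'P'/'F' follow the notation of the tables above:

```python
def predict(candidate):
    """Decision tree from question d); candidate is a dict such as {'T1': 'P', 'T2': 'F', 'T3': 'P'}."""
    if candidate["T2"] == "P":
        return "Good"   # confidence 65/72
    if candidate["T3"] == "P":
        return "Bad"    # confidence 22/23
    return "Good"       # confidence 1 (5/5)

print(predict({"T1": "F", "T2": "F", "T3": "P"}))  # Bad
```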
e) Considering every possible subtree, compute the validation errors (using the validation subset). Which subtree has to be selected?
We can prune the tree in two places, so we consider 3 candidate trees:
Tree 1 - the full tree from question d) (no pruning)
Tree 2 - the tree where the node testing T3 is replaced by a leaf
Tree 3 - the tree pruned down to a single leaf at the root
Comparing their errors on the validation subset, Tree 1 has the lowest validation error, so Tree 1 is selected.
(For cost-complexity pruning, see here.)
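The comparison itself is mechanical once the validation examples are available (they are not reproduced here); a minimal helper, assuming the validation set is given as (candidate, label) pairs and each candidate tree is a predict function:

```python
def validation_error(tree, validation_set):
    """Fraction of validation examples misclassified by `tree` (a predict function)."""
    errors = sum(1 for candidate, label in validation_set if tree(candidate) != label)
    return errors / len(validation_set)

# Hypothetical usage: pick the subtree with the smallest validation error
# errors = {name: validation_error(tree, val_set) for name, tree in candidate_trees.items()}
# best = min(errors, key=errors.get)
```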
f) Finally, compute the test error of the decision tree obtained at the end of question e).
The test error is 0.16.
Dataset D is divided into 2 subsets D1 and D2
Items: {A B C D E F}
freq(X,D) - # of transactions where itemset X occurs
freq(AB,D) = 3 and freq(AB,D1) = 2
supp(X,D) = freq(X,D)/|D| - the support is the proportion of transactions containing X
a) Enumerate all frequent itemsets of dataset D whose frequency is at least 2 and represent them by means of a lattice. Specify the frequency for each itemset.
use an inverted list to represent the data set (it's simpler):
A → {T1, T2, T4}
B → {T1, T2, T4, T5}
C → {T2, T5, T6}
D → {T2, T3, T4, T5}
E → {T1}
F → {T6}
Use the Apriori algorithm.
Start with {}: the empty itemset is always frequent.
iteration 1
generate candidates: {{A} → 3, {B} → 4 , {C} → 3, {D} → 4, {E} → 1, {F} → 1}
remove the infrequent ones: {{A} → 3, {B} → 4, {C} → 3, {D} → 4}
iteration 2
generate candidates: {{AB} → 3, {AC} → 1, {AD} → 2, {BC} → 2, {BD} → 3, {CD} → 2}
remove the infrequent ones: {{AB} → 3, {AD} → 2, {BC} → 2, {BD} → 3, {CD} → 2}
iteration 3
generate candidates: {{ABD} → 2, {BCD} → 2}
remove the infrequent ones: {{ABD} → 2, {BCD} → 2}
stop
we cannot generate the candidate {ABCD}: besides {ABD} and {BCD}, its subsets {ABC} and {ACD} would also have to be frequent, but they are not ({AC} is already infrequent)
Obtained lattice: the nodes are the frequent itemsets listed above (from {} up to {ABD} and {BCD}), ordered by inclusion and annotated with their frequencies.
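A compact sketch of the Apriori run above; the six transactions are reconstructed from the inverted lists, and the minimum frequency is 2:

```python
from itertools import combinations

# Transactions T1..T6, reconstructed from the inverted lists above
transactions = [set("ABE"), set("ABCD"), set("D"), set("ABD"), set("BCD"), set("CF")]
min_freq = 2

def freq(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

frequent = [frozenset({i}) for i in "ABCDEF" if freq(frozenset({i})) >= min_freq]
all_frequent = {fs: freq(fs) for fs in frequent}

k = 2
while frequent:
    # candidate generation: unions of frequent (k-1)-itemsets that have size k
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # prune candidates having an infrequent (k-1)-subset, then count the survivors
    candidates = {c for c in candidates
                  if all(frozenset(s) in all_frequent for s in combinations(c, k - 1))}
    frequent = [c for c in candidates if freq(c) >= min_freq]
    all_frequent.update({c: freq(c) for c in frequent})
    k += 1

for fs, f in sorted(all_frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(fs)), f)   # A 3, B 4, C 3, D 4, AB 3, ..., ABD 2, BCD 2
```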
GR1(X) = supp(X,D1)/supp(X,D2) - growth rate of itemset X
X is an emerging pattern iff GR1(X) > 1
b) Describe in one sentence the semantics of emerging patterns.
Emerging patterns are patterns whose support changes significantly from one dataset to another,
i.e. the fraction of transactions containing X is significantly higher in D1 than in D2.
c) Rewrite the growth rate of an itemset X using freq(X,D1) and freq(X,D).
supp(X,D) = freq(X,D)/|D|
freq(X, D) = freq(X, D1) + freq(X, D2); freq(X, D2) = freq(X, D) - freq(X, D1)
GR1(X) = ( | D2 | * freq(X, D1) ) / ( | D1 | * freq(X, D2) ) = (| D2 | / | D1 |) * freq(X, D1) / (freq(X, D) - freq(X, D1))
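A direct transcription of the rewritten formula; returning infinity when freq(X, D2) = 0 is an assumption made here (the growth rate is then unbounded):

```python
def growth_rate(freq_d1, freq_d, n1, n2):
    """GR1(X) = (|D2| / |D1|) * freq(X, D1) / (freq(X, D) - freq(X, D1))."""
    freq_d2 = freq_d - freq_d1
    return float("inf") if freq_d2 == 0 else (n2 / n1) * freq_d1 / freq_d2

print(growth_rate(2, 3, 3, 3))  # {A}: freq 2 in D1, 3 in D, |D1| = |D2| = 3  ->  GR1 = 2.0
```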
We now look for the set of emerging patterns
EP = { X | GR1(X) >= 2 and freq(X, D) >= 2 }
d) Demonstrate that any pattern of EP has a frequency in D1 greater than or equal to 2.
GR1(X) >= 2 ⇒ supp(X, D1) / supp(X, D2) >= 2
Note: without any assumption on |D1| and |D2| this does not hold in general: if |D2| >= 2 * |D1|, we can have freq(X, D1) = freq(X, D2) = 1 and still GR1(X) >= 2 with freq(X, D) = 2.
For the given dataset, however, |D1| = |D2|, so:
Suppose freq(X, D1) < 2
We know that freq(X, D1) > 0 (otherwise GR1(X) = 0 < 2)
so freq(X, D1) = 1
then supp(X, D1) = 1 / | D1 |
for GR1(X) >= 2 to hold we must have 1 / | D1 | >= 2 * freq(X, D2) / | D2 |
since | D1 | = | D2 |, 1 >= 2 * freq(X, D2)
this is possible only if freq(X, D2) = 0,
but freq(X, D) = freq(X, D1) + freq(X, D2) >= 2,
and 1 + 0 is not >= 2
⇒ contradiction ⇒ freq(X, D1) >= 2
(Perhaps there is a simpler argument.)
e) Enumerate all the patterns of EP.
By question d), every pattern of EP has frequency >= 2 in D1, so it suffices to examine the frequent patterns of D1: {{} → 3, {A} → 2, {B} → 2, {D} → 2, {AB} → 2}
Their frequencies in D2: {{} → 3, {A} → 1, {B} → 2, {D} → 2, {AB} → 1}
Since |D1| = |D2|, GR1 is simply the ratio of these two frequencies, so only {A} and {AB} reach GR1 = 2 ⇒ EP = { {A}, {AB} }
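Checking the candidates with the frequencies listed above (since |D1| = |D2|, GR1(X) >= 2 amounts to freq(X, D1) >= 2 * freq(X, D2)):

```python
freq_d1 = {"A": 2, "B": 2, "D": 2, "AB": 2}   # frequencies in D1 (from the table above)
freq_d2 = {"A": 1, "B": 2, "D": 2, "AB": 1}   # frequencies in D2

ep = [x for x in freq_d1
      if freq_d1[x] + freq_d2[x] >= 2          # freq(X, D) >= 2
      and freq_d1[x] >= 2 * freq_d2[x]]        # GR1(X) >= 2 (since |D1| = |D2|)
print(ep)  # ['A', 'AB']
```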
Given A(2,3), B(6,5), C(7,1), D(1,1), B2(8,1), E(2,2), F(5,3), G(7,4)
Generate 3 clusters from this data set by using the agglomerative method with the Manhattan distance. Use the mean distance (i.e., the distance between cluster centers) when merging clusters.
Manhattan distance: d(X, Y) = Σ_{i=1}^{N} |X_i − Y_i|
A hierarchical method creates a hierarchical decomposition of the given set of data objects:
- agglomerative = the bottom-up approach (AGNES)
- divisive = the top-down approach (DIANA)
a) Give the clusters generated after the third iteration.
Initially, each data point forms its own cluster.
Iteration 1
Pairs of clusters with distance 1 are:
({A}, {E}), ({B2}, {C}) - join them
Calculate the centers for the new clusters:
{A, E} → (2, 2.5)
{B2, C} → (7.5, 1)
Iteration 2
The only pair of clusters at distance 2 is ({B}, {G}) - join them
Iteration 3
The smallest distance is dist({A,E}, {D}) = 2.5 (the center of {A, E} is (2, 2.5)) - join them
Recalculate the centers: {A, D, E} → (5/3, 2); {B, G} → (6.5, 4.5)
So clusters after the 3rd iteration are:
{A,D,E}, {F}, {B,G}, {C, B2}
b) Give the final 3 clusters.
Iteration 4
d({A,D,E}, {F}) = |5 − 5/3| + |3 − 2| = 10/3 + 1 = 13/3 ≈ 4.33
d({B,G}, {F}) = |5 − 6.5| + |3 − 4.5| = 1.5 + 1.5 = 3
d({B2,C}, {F}) = |5 − 7.5| + |3 − 1| = 2.5 + 2 = 4.5
The closest pair of clusters is {B,G} and {F} (all other inter-cluster distances are larger) - join them
So the final 3 clusters are: {A, D, E}, {B, F, G}, {B2, C}
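A sketch of the whole agglomeration with mean (centroid) linkage and Manhattan distance; it performs one merge per step (whereas the hand computation merges both distance-1 pairs in its first iteration), but reaches the same 3 clusters:

```python
points = {"A": (2, 3), "B": (6, 5), "C": (7, 1), "D": (1, 1),
          "B2": (8, 1), "E": (2, 2), "F": (5, 3), "G": (7, 4)}

def center(cluster):
    """Mean of the points in the cluster."""
    coords = [points[p] for p in cluster]
    return (sum(x for x, _ in coords) / len(coords),
            sum(y for _, y in coords) / len(coords))

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

clusters = [{p} for p in points]   # start: one cluster per point
while len(clusters) > 3:
    # merge the pair of clusters whose centers are closest
    i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
               key=lambda ij: manhattan(center(clusters[ij[0]]), center(clusters[ij[1]])))
    clusters[i] |= clusters[j]
    del clusters[j]
    print([sorted(c) for c in clusters])
# last line printed: [['A', 'D', 'E'], ['B', 'F', 'G'], ['B2', 'C']]
```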
Attributes: F (Fraud), A (Age), S (Sex), J (Jewelry bought), G (Gas bought). The Bayesian network has the edges F → J, A → J, S → J and F → G (consistent with the factorization given below).
Z d-separates X and Y iff every undirected path P from X to Y is blocked by Z, where P is blocked iff one or more of the following conditions is true:
- P contains a chain, x → m → y (or x ← m ← y), such that the middle node m is in Z, or
- P contains a fork, x ← m → y, such that the middle node m is in Z, or
- P contains an inverted fork, x → m ← y, such that the middle node m is not in Z and no descendant of m is in Z.
a) Using d-separation, show if the following assertions are true or not:
- The variables Fraud and Age are independent given the variable Jewelry,
- The variables Sex and Gas are independent,
- The variables Sex and Gas are independent given the variable Jewelry.
- AJF (the path A → J ← F) is not blocked by Z = {J}, so (F not ╨ A | J):
the path contains an inverted fork, x → m ← y,
but the middle node m = J is in Z, so no blocking condition applies ⇒ Fraud and Age are NOT independent given Jewelry.
- SJFG (the path S → J ← F → G) is blocked by Z = ∅, so (S ╨ G | ∅):
the path contains an inverted fork, x → m ← y,
such that the middle node m = J is not in Z and
no descendant of m is in Z ⇒ Sex and Gas are independent.
- SJFG is not blocked by Z = {J}, so (S not ╨ G | J):
the path contains an inverted fork, x → m ← y,
whose middle node m = J is now in Z, and the fork J ← F → G has middle node F ∉ Z, so no segment blocks the path ⇒ Sex and Gas are NOT independent given Jewelry.
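These three checks can be reproduced with NetworkX; the edge set below is the one implied by the factorization in the next question, and nx.d_separated is the NetworkX (>= 2.8) helper, renamed is_d_separator in the most recent releases:

```python
import networkx as nx

# Bayesian-network structure: F -> J, A -> J, S -> J, F -> G
G = nx.DiGraph([("F", "J"), ("A", "J"), ("S", "J"), ("F", "G")])

print(nx.d_separated(G, {"F"}, {"A"}, {"J"}))   # False: not independent given Jewelry
print(nx.d_separated(G, {"S"}, {"G"}, set()))   # True: independent
print(nx.d_separated(G, {"S"}, {"G"}, {"J"}))   # False: not independent given Jewelry
```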
b) Express the joint probability of the five variables F, A, S, J and G.
The joint probability is
P(F, A, S, J, G) = P(F) * P(A) * P(S) * P(J | F, A, S) * P(G | F)
c) A 40-year-old male has bought a jewel during the last 24 hours. What is the probability that this purchase is fraudulent?
Our instance:
J = y, A = [30,50], S = m; G is not observed
We need to predict F
In order to answer this question, you will use the marginal law and the definition of conditional probability.
- Conditional probability: P(X | Y) = P(X, Y) / P(Y)
- Marginal law: P(x) = ∑_i P(x, y_i) if the domain D_Y of Y is { y_1, …, y_p }
- If a directed acyclic graph is a Bayesian network of a probability distribution, then P(X1, …, XN) = ∏_i P(Xi | Parents(Xi))
Let's fix the variables:
P(F=f | J=j, S=s, A=a) = P(J=j, S=s, A=a, F=f) / P(J=j, S=s, A=a) = P(J=j | F=f, S=s, A=a) * P(S=s) * P(A=a) * P(F=f) / P(J=j, S=s, A=a)
In the denominator J, S, A are fixed but F is not (it appears because J depends on F), so we marginalize over F using the marginal law:
P(J=j, S=s, A=a) = P(S=s) * P(A=a) * [ P(F=y) * P(J=j | F=y, S=s, A=a) + P(F=n) * P(J=j | F=n, S=s, A=a) ]
From the tables: P(J=y | F=y, [30,50], m) = 0.05, P(J=y | F=n, [30,50], m) = 0.004, P(S=m) = 0.5, P(A=[30,50]) = 0.40, P(F=y) = 0.1, P(F=n) = 0.9
Denominator: P(J=y, [30,50], m) = 0.5 * 0.40 * (0.1 * 0.05 + 0.9 * 0.004) = 0.2 * 0.0086 = 0.00172
P(F=y | J=y, S=m, A=[30,50]) = 0.05 * 0.5 * 0.40 * 0.1 / 0.00172 = 0.001 / 0.00172 ≈ 0.58
P(F=n | J=y, S=m, A=[30,50]) = 0.004 * 0.5 * 0.40 * 0.9 / 0.00172 = 0.00072 / 0.00172 ≈ 0.42
Since P(F=y | J=y, S=m, A=[30,50]) ≈ 0.58 > 0.42, the purchase is classified as fraud.
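A quick numerical check of the posterior using only the probabilities quoted above (the P(A) and P(S) factors cancel in the normalization):

```python
p_f = {"y": 0.1, "n": 0.9}              # prior P(F)
p_j_given_f = {"y": 0.05, "n": 0.004}   # P(J=y | F, A=[30,50], S=m)

# Bayes: P(F=f | J=y, A, S) is proportional to P(F=f) * P(J=y | F=f, A, S)
joint = {f: p_f[f] * p_j_given_f[f] for f in ("y", "n")}
norm = sum(joint.values())
posterior = {f: p / norm for f, p in joint.items()}
print(posterior)  # {'y': 0.581..., 'n': 0.418...}
```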