KDDM Exam

Decision Tree

Complete data set

Count = the number of candidates with this profile

Quality = the target we want to predict

Cross-Validation

a) Roles of the training, validation and testing subsets

The training subset is used to fit the model (here, to grow the tree), the validation subset to choose among candidate models (here, to select the subtree kept after pruning), and the testing subset to estimate the generalization error of the final model.

Impurity

b) Using misclassification error as impurity measure, what is the impurity of the training subset?

Impurity: I(S) = 1 − max_k p_k

Good: 50+5+15+1 = 71, Bad: 5+2+2+20=29

Impurity: 29/100 (since there are only two classes, the impurity is simply the fraction of the smaller class)
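As a sanity check, a minimal Python sketch of this computation (the class counts are the ones above):

```python
from fractions import Fraction

# Misclassification impurity: I(S) = 1 - max_k p_k
def misclassification_impurity(counts):
    return 1 - Fraction(max(counts), sum(counts))

# Training subset: 71 Good and 29 Bad candidates
print(misclassification_impurity([71, 29]))  # 29/100
```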

Information Gain

c) Find the best splitting criterion to partition the training set, giving the information gain for every candidate splitting criterion.

Information gain: ΔI = I(S) − p_L · I(S_L) − p_R · I(S_R)

Using T1:

S_L : with T1 == P, P_L = (50+5+5+2)/100 = 62/100

I(S_L): Good = 55/62, Bad = 7/62

S_R: with T1 == F, P_R = 38/100

I(S_R): Good = 16/38, Bad = 22/38

Delta I = 29/100 - 62/100 * 7/62 - 38/100 * 16/38 = (29 - 7 - 16) / 100 = 6/100

Using T2:

S_L: P, P_L = (50+5+15+2)/100 = 72/100

I(S_L): Good = (50+15) = 65/72, Bad = (5+2)/72 = 7/72

S_R: F, P_R = (5+2+1+20)/100 = 28/100

I(S_R): Good = (5+1)/28 = 6/28, Bad = (2+20)/28 = 22/28

Delta I = 29/100 - 72/100 * 7/72 - 28/100 * 6/28 = (29 - 7 - 6)/100 = 16/100

Using T3:

S_L: P, P_L = (50+2+15+1+20)/100 = 88/100

I(S_L): Good = (50+15+1)/88 = 66/88, Bad = (2+20)/88 = 22/88

S_R: F, P_R = (5+5+2)/100 = 12/100

I(S_R): Good = 5/12, Bad = (5+2)/12

Delta I = 29/100 - 88/100 * 22/88 - 12/100 * 5/12  = (29 - 22 - 5)/100 = 2/100

T2 maximizes the information gain
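A small Python sketch, reusing the per-split (Good, Bad) counts derived above, that reproduces the three information gains:

```python
from fractions import Fraction

# Misclassification impurity of a node given its class counts
def impurity(counts):
    return 1 - Fraction(max(counts), sum(counts))

# Delta I = I(S) - p_L * I(S_L) - p_R * I(S_R)
def information_gain(left, right):
    parent = [l + r for l, r in zip(left, right)]
    n = sum(parent)
    p_l = Fraction(sum(left), n)
    p_r = Fraction(sum(right), n)
    return impurity(parent) - p_l * impurity(left) - p_r * impurity(right)

# (Good, Bad) counts on each side of the three candidate tests
splits = {
    "T1": ((55, 7), (16, 22)),
    "T2": ((65, 7), (6, 22)),
    "T3": ((66, 22), (5, 7)),
}
for name, (left, right) in splits.items():
    print(name, information_gain(left, right))  # T1 3/50, T2 4/25, T3 1/50
```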

 

d) Finish the construction of the decision tree using the following stopping criterion: all leaf nodes have a misclassification error (measured on the training set) less than or equal to 10%.

We now have two nodes:

if T2 = P then (Node1) else (Node2)

Node1

misclassification error is 7/72 ≈ 9.7% < 10%

can stop here

Predict “Good” with confidence 65/72

Node2

misclassification error is 6/28 ≈ 21.4% > 10%

continue splitting

Try T1, Condition T1 = P

S_L: P_L = 7/28

I(S_L): Good = 5/7, Bad = 2/7

S_R: P_R = 21/28

I(S_R): Good: 1/21, Bad = 20/21

Delta I = 6/28 - 7/28 * 2/7  - 21/28 * 1/21 = (6 - 2 - 1)/28 = 3/28

Try T3, Condition T3 = P

S_L: P_L = 23/28

I(S_L): Good = 1/23, Bad = 22/23

S_R: P_R = 5/28 (100% pure)

I(S_R): Good = 5/5, Bad = 0/5

Delta I = 6/28 - 23/28 * 1/23 - 0 = 5/28

⇒ Use T3 as a splitting criterion

Tree so far:

if T2 = P then

predict “Good” with confidence 65/72

else

if (T3 = P) then

        (Node2.1)

else  

        (Node2.2)

Node2.1

misclassification error is 1/23 < 10%

can stop here

Predict “Bad” with confidence 22/23

Node2.2

misclassification error is 0

Predict “Good” with confidence 1

Tree:

if (T2 = P) then

predict “Good” with confidence 65/72

else

if (T3 = P) then

        Predict “Bad” with confidence 22/23

else  

        Predict “Good” with confidence 1
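For readability, the same tree as a small Python function; the confidences are the leaf class fractions computed above:

```python
# Final decision tree: returns the predicted class and its confidence
# (the majority-class fraction of the corresponding leaf).
def predict(t2, t3):
    if t2 == "P":
        return "Good", 65 / 72
    if t3 == "P":
        return "Bad", 22 / 23
    return "Good", 1.0

print(predict("P", "F"))  # ('Good', 0.902...)
print(predict("F", "P"))  # ('Bad', 0.956...)
```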

Pruning

e) Considering every possible subtree, compute the validation errors (using the validation subset). Which subtree has to be selected?

We can prune the tree in two places:

So we consider 3 trees:

Tree 1 - the full tree before pruning

Tree 2 - the tree with the T3 split pruned (replaced by a leaf)

Tree 3 - the tree pruned down to a single leaf

Comparing their validation errors, Tree 1 is the best w.r.t. the validation set

(Cost-complexity pruning is an alternative way to choose the subtree.)

Testing

f)  Compute finally the test error of the decision tree obtained at the end of question e).

The test error is 0.16.

Pattern Mining

Dataset D is divided into 2 subsets D1 and D2

Items: {A B C D E F}

Frequent Itemsets

freq(X,D) - # of transactions where itemset X occurs

freq(AB,D) = 3 and freq(AB,D1) = 2

supp(X,D) = freq(X,D)/|D| -  the support is the proportion of transactions containing X

a) Enumerate all frequent itemsets of dataset D whose frequency is at least 2 and represent them by means of a lattice. Specify the frequency for each itemset.

Use an inverted-list representation of the data set (it is simpler):

A → {T1, T2, T4}

B → {T1, T2, T4, T5}

C → {T2, T5, T6}

D → {T2, T3, T4, T5}

E → {T1}

F → {T6}

Use the Apriori algorithm.

Start with {}: the empty itemset is always frequent (frequency 6)

iteration 1

generate candidates: {{A} → 3, {B} → 4 , {C} → 3, {D} → 4, {E} → 1, {F} → 1}

remove the non-frequent ones: {{A} → 3, {B} → 4, {C} → 3, {D} → 4}

iteration 2

generate candidates: {{AB} → 3, {AC} → 1, {AD} → 2, {BC} → 2, {BD} → 3,  {CD} → 2}

remove the non-frequent ones: {{AB} → 3, {AD} → 2, {BC} → 2, {BD} → 3, {CD} → 2}

iteration 3

generate candidates: {{ABD} → 2, {BCD} → 2}

remove the non-frequent ones: {{ABD} → 2, {BCD} → 2}

stop

We cannot generate {ABCD}: besides {ABD} and {BCD}, its subsets {ABC} and {ACD} would also have to be frequent, but they are not.
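A minimal Apriori sketch in Python over the inverted lists above, with the frequency threshold of 2 from the question:

```python
from itertools import combinations

# Inverted lists: item -> set of transactions containing it
tids = {
    "A": {"T1", "T2", "T4"},
    "B": {"T1", "T2", "T4", "T5"},
    "C": {"T2", "T5", "T6"},
    "D": {"T2", "T3", "T4", "T5"},
    "E": {"T1"},
    "F": {"T6"},
}
min_freq = 2

def freq(itemset):
    # frequency = size of the intersection of the items' transaction lists
    return len(set.intersection(*(tids[i] for i in itemset)))

frequent = {(): 6}  # the empty itemset occurs in all 6 transactions
level = [frozenset([i]) for i in tids]
while level:
    kept = [c for c in level if freq(c) >= min_freq]
    frequent.update({tuple(sorted(c)): freq(c) for c in kept})
    # generate (k+1)-candidates whose k-subsets are all frequent (Apriori property)
    next_level = set()
    for a, b in combinations(kept, 2):
        cand = a | b
        if len(cand) == len(a) + 1 and all(cand - {x} in kept for x in cand):
            next_level.add(cand)
    level = list(next_level)

for itemset, f in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(set(itemset) or "{}", f)
```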

Obtained lattice:

Emerging Patterns

GR1(X) = supp(X,D1)/supp(X,D2) - growth rate of itemset X

X is an emerging pattern iff GR1(X) > 1

b) Describe in one sentence the semantics of emerging patterns.

Emerging patterns are patterns whose support changes significantly from one dataset to another.

I.e., the fraction of transactions containing X is higher in D1 than in D2.

c)  Rewrite the growth rate of an itemset X using freq(X,D1) and freq(X,D).

supp(X,D) = freq(X,D)/|D|

freq(X, D) = freq(X, D1) + freq(X, D2); freq(X, D2) = freq(X, D) - freq(X, D1)

GR1(X) = (|D2| · freq(X, D1)) / (|D1| · freq(X, D2)) = (|D2| / |D1|) · freq(X, D1) / (freq(X, D) − freq(X, D1))

We now look for the emerging patterns that are also frequent:

let EP = { X | GR1(X) ≥ 2 and freq(X, D) ≥ 2 }

d)  Demonstrate that any pattern of EP has a frequency in D1 greater than or equal to 2.

GR1(X) ≥ 2 ⇒ supp(X, D1) / supp(X, D2) ≥ 2

Note that in general this does not force freq(X, D1) ≥ 2: if |D2| ≥ 2 · |D1|, we can have freq(X, D1) = freq(X, D2) = 1 and still GR1(X) ≥ 2. So the claim relies on the given data, where |D1| = |D2|:

Suppose freq(X, D1) < 2

know that freq(X, D1) > 0 (otherwise GR1(X) would be 0)

so freq(X, D1) = 1

then supp(X, D1) = 1 / | D1 |

for GR1(X) >= 2 to hold must have  1 / | D1 | >= 2 * freq(X, D2) / | D2 |

since | D1 | = | D2 |, 1 >= 2 * freq(X, D2)

this is possible only if freq(X, D2) = 0,

but freq(X, D) = freq(X, D1) + freq(X, D2) >= 2,

and 1 + 0 is not >= 2

⇒ contradiction ⇒ freq(X, D1) >= 2

(Simpler: since |D1| = |D2|, GR1(X) ≥ 2 means freq(X, D1) ≥ 2 · freq(X, D2). If freq(X, D2) ≥ 1 this gives freq(X, D1) ≥ 2 directly; if freq(X, D2) = 0, then freq(X, D1) = freq(X, D) ≥ 2.)

e)  Enumerate all the patterns of EP.

Frequent patterns of D1: {{} → 3, {A} → 2, {B} → 2, {D} → 2, {AB} → 2}

Their frequencies in D2: {{} → 3, {A} → 1, {B} → 2, {D} → 2, {AB} → 1}

So only {A} and {AB} have GR1 ≥ 2 (both equal 2), hence EP = { {A}, {AB} }
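A short sketch, assuming the per-subset frequencies listed above and |D1| = |D2| = 3, that filters the EP patterns:

```python
# GR1(X) = supp(X, D1) / supp(X, D2); with |D1| = |D2| this is freq(X, D1) / freq(X, D2)
freq_d1 = {"{}": 3, "A": 2, "B": 2, "D": 2, "AB": 2}
freq_d2 = {"{}": 3, "A": 1, "B": 2, "D": 2, "AB": 1}

EP = []
for x, f1 in freq_d1.items():
    f2 = freq_d2[x]
    gr = f1 / f2 if f2 else float("inf")
    if gr >= 2 and f1 + f2 >= 2:  # GR1(X) >= 2 and freq(X, D) >= 2
        EP.append(x)

print(EP)  # ['A', 'AB']
```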

Clustering

Given A(2,3), B(6,5), C(7,1), D(1,1), B2(8,1), E(2,2), F(5,3), G(7,4)

Generate 3 clusters from this data set by using the agglomerative method with the Manhattan distance. Use the mean distance (i.e., the distance between cluster centers) when regrouping clusters.

Manhattan distance: d(X, Y) = Σ_{i=1}^{N} |X_i − Y_i|

A hierarchical method creates a hierarchical decomposition of the given set of data objects

Agglomerative = the bottom-up approach (AGNES)

Divisive = the top-down approach (DIANA)

a)  Give the clusters generated after the third iteration.

At first, all data points form their own clusters

Iteration 1

Pairs of clusters with distance 1 are:

({A}, {E}), ({B2}, {C}) - join them

Calculate the centers for the new clusters:

{A, E} → (2, 2.5)

{B2, C} → (7.5, 1)

Iteration 2

The closest pair of clusters (distance 2) is ({B}, {G}) - join them

Iteration 3

The closest pair is ({A, E}, {D}) with distance 2.5 (the center of {A, E} is (2, 2.5)) - join them

Recalculate the centers: {A, D, E} → (5/3, 2) {B, G} → (6.5, 4.5)

So clusters after the 3rd iteration are:

{A,D,E}, {F}, {B,G}, {C, B2}

b) Give the final 3 clusters.

Iteration 4

d({A,D,E}, {F}) = |5/3 − 5| + |2 − 3| = 10/3 + 1 = 13/3 ≈ 4.33

d({B,G}, {F}) = |6.5 − 5| + |4.5 − 3| = 1.5 + 1.5 = 3

d({B2,C}, {F}) = |7.5 − 5| + |1 − 3| = 2.5 + 2 = 4.5

The closest pair is {B, G} and {F} (all other inter-cluster distances are larger) - join them

So the final 3 clusters are {A, D, E}, {B, F, G} and {B2, C}
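A compact sketch of the agglomerative procedure (Manhattan distance between cluster centroids) on the given points; ties in the merge order do not change the final three clusters here:

```python
# Agglomerative clustering, merging the two clusters with the closest
# centroids (Manhattan distance) until 3 clusters remain.
points = {"A": (2, 3), "B": (6, 5), "C": (7, 1), "D": (1, 1),
          "B2": (8, 1), "E": (2, 2), "F": (5, 3), "G": (7, 4)}

def centroid(cluster):
    xs = [points[p][0] for p in cluster]
    ys = [points[p][1] for p in cluster]
    return sum(xs) / len(xs), sum(ys) / len(ys)

def manhattan(u, v):
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

clusters = [[p] for p in points]
while len(clusters) > 3:
    pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
    i, j = min(pairs, key=lambda ij: manhattan(centroid(clusters[ij[0]]),
                                               centroid(clusters[ij[1]])))
    clusters[i] += clusters[j]
    del clusters[j]

print(clusters)  # final clusters: {A, D, E}, {B, F, G}, {B2, C}
```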

Bayesian Network

Attributes: Fraud (F), Age (A), Sex (S), Gas (G), Jewelry (J)

D-Separation

Z d-separates X and Y iff every undirected path P from X to Y is blocked by Z,

where P is blocked iff one or more of the following conditions is true:

-  P contains a chain, x ← m ← y, such that the middle node m is in Z, or

-  P contains a fork, x ← m → y, such that the middle node m is in Z, or

-  P contains an inverted fork, x → m ← y, such that the middle node m is not in Z and no descendant of m is in Z.

a)  Using d-separation, show if the following assertions are true or not:

-  The variables Fraud and Age are independent given the variable Jewelry,

-  The variables Sex and Gas are independent,

-  The variables Sex and Gas are independent given the variable Jewelry.


The path A → J ← F is not blocked, so (F not ╨ A | J):

P contains an inverted fork, A → J ← F,

but the middle node J is in Z = {Jewelry}, so the path is not blocked.

The path S → J ← F → G is blocked, so (S ╨ G | ∅):

P contains an inverted fork, S → J ← F,

such that the middle node J is not in Z = ∅ and no descendant of J is in Z, so the path is blocked.

The path S → J ← F → G is not blocked, so (S not ╨ G | J):

P contains an inverted fork, S → J ← F,

but the middle node J is in Z = {Jewelry}, so the path is not blocked.

Joint Probabilities

Express the joint probability of the five variables F, A, S, G and J.

The joint probability is

P(F, A, S, G, J) = P(F) · P(A) · P(S) · P(J | F, A, S) · P(G | F)

Classification

A 40-year-old male has bought a jewel during the last 24 hours. What is the probability that this purchase is fraudulent?

Our instance:

J = y, A = [30, 50], S = m, G unobserved

Need to predict F

In order to answer this question, you will use the marginal law and the definition of conditional probability.

-  Conditional probability: P(X | Y) = P(X, Y) / P(Y)

-  Marginal law: P(x) = ∑ P(x, y_i) if the domain D_Y of Y is { y_1, …, y_p }

-  If a directed acyclic graph is a Bayesian network of a probability distribution, then we have:

P(X1, …, XN) = ∏_i P(Xi | Parents(Xi)).

Let us fix the variables:

P(F=f | J=y, S=m, A=[30,50]) = P(J, S, A, F) / P(J, S, A) = P(J=y | F=f, S=m, A=[30,50]) · P(S=m) · P(A=[30,50]) · P(F=f) / P(J=y, S=m, A=[30,50])

(G is unobserved and depends only on F, so it sums out of both numerator and denominator.)

In the denominator J, S and A are fixed, but F is not (it appears because J depends on F), so we use the marginal law. With the table values P(J=y | F=y, A=[30,50], S=m) = 0.05, P(J=y | F=n, A=[30,50], S=m) = 0.004, P(F=y) = 0.1, P(S=m) = 0.5 and P(A=[30,50]) = 0.4:

P(J=y, S=m, A=[30,50]) = P(S=m) · P(A=[30,50]) · Σ_f P(J=y | F=f, S=m, A=[30,50]) · P(F=f) = 0.5 · 0.4 · (0.05 · 0.1 + 0.004 · 0.9) = 0.2 · 0.0086 = 0.00172

P(F=y | J=y, S=m, A=[30,50]) = 0.05 · 0.5 · 0.4 · 0.1 / 0.00172 = 0.001 / 0.00172 ≈ 0.58

P(F=n | J=y, S=m, A=[30,50]) = 0.004 · 0.5 · 0.4 · 0.9 / 0.00172 = 0.00072 / 0.00172 ≈ 0.42

(The factors P(S=m) · P(A=[30,50]) cancel, so equivalently P(F=y | evidence) = 0.005 / 0.0086 ≈ 0.58.)

Since P(F=y | evidence) > P(F=n | evidence), the purchase is classified as fraudulent.
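A sketch of the same posterior computed by enumeration over F, using the probabilities quoted above (G is unobserved and sums out):

```python
# Posterior P(F | J=y, A=[30,50], S=m) by enumeration over F
p_f = {"y": 0.1, "n": 0.9}           # P(F)
p_s_m = 0.5                           # P(S = m)
p_a = 0.4                             # P(A = [30,50])
p_j_y = {"y": 0.05, "n": 0.004}       # P(J=y | F=f, A=[30,50], S=m)

# unnormalized joint P(F=f, J=y, S=m, A=[30,50])
joint = {f: p_f[f] * p_s_m * p_a * p_j_y[f] for f in p_f}
evidence = sum(joint.values())        # P(J=y, S=m, A=[30,50]) = 0.00172

posterior = {f: joint[f] / evidence for f in joint}
print(posterior)  # {'y': 0.581..., 'n': 0.418...}
```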