Complete data set
Count = the number of candidates with this profile
Quality = the target we want to predict
a) Roles of training, validation and testing subsets
The training subset is used to build the model (here: to choose the splits of the decision tree), the validation subset is used to select among candidate models (here: to prune the tree in question e), and the testing subset is used only once, to estimate the generalization error of the final model (question f).
b) Using misclassification error as the impurity measure, what is the impurity of the training subset?
Impurity: I(S) = 1 − max_k p_k
Good: 50+5+15+1 = 71, Bad: 5+2+2+20=29
Impurity: 29/100 = 0.29 (with only two classes, the misclassification impurity is simply the fraction of the minority class)
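A minimal sketch of this computation (the class counts are those of the training subset above):

```python
def misclassification_impurity(counts):
    """I(S) = 1 - max_k p_k, where counts maps each class to its number of examples."""
    total = sum(counts.values())
    return 1 - max(counts.values()) / total

# Training subset: 71 Good and 29 Bad candidates out of 100
print(misclassification_impurity({"Good": 71, "Bad": 29}))  # 0.29
```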
c) Find the best splitting criterion to partition the training set. Give the information gain for every candidate splitting criterion.
Information gain: ΔI = I(S) − p_L · I(S_L) − p_R · I(S_R)
Using T1:
S_L : with T1 == P, P_L = (50+5+5+2)/100 = 62/100
I(S_L): Good = 55/62, Bad = 7/62
S_R: with T1 == F, P_R = 38/100
I(S_R): Good = 16/38, Bad = 22/38
Delta I = 29/100 - 62/100 * 7/62 - 38/100 * 16/38 = (29 - 7 - 16) / 100 = 6/100
Using T2:
S_L: P, P_L = (50+5+15+2)/100 = 72/100
I(S_L): Good = (50+15) = 65/72, Bad = (5+2)/72 = 7/72
S_R: F, P_R = (5+2+1+20)/100 = 28/100
I(S_R): Good = (5+1)/28 = 6/28, Bad = (2+20)/28 = 22/28
Delta I = 29/100 - 72/100 * 7/72 - 28/100 * 6/28 = (29 - 7 - 6)/100 = 16/100
Using T3:
S_L: P, P_L = (50+2+15+1+20)/100 = 88/100
I(S_L): Good = (50+15+1)/88 = 66/88, Bad = (2+20)/88 = 22/88
S_R: F, P_R = (5+5+2)/100 = 12/100
I(S_R): Good = 5/12, Bad = (5+2)/12
Delta I = 29/100 - 88/100 * 22/88 - 12/100 * 5/12 = (29 - 22 - 5)/100 = 2/100
⇒ T2 maximizes the information gain
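A short sketch that reproduces the three information gains; each pair below gives the (Good, Bad) counts of the left (Ti = P) and right (Ti = F) branches as tallied above:

```python
def impurity(counts):
    """Misclassification impurity of a tuple of class counts."""
    return 1 - max(counts) / sum(counts)

def info_gain(parent, left, right):
    """Delta I = I(S) - p_L * I(S_L) - p_R * I(S_R)."""
    n = sum(parent)
    return (impurity(parent)
            - sum(left) / n * impurity(left)
            - sum(right) / n * impurity(right))

parent = (71, 29)
print(info_gain(parent, (55, 7), (16, 22)))  # T1: 0.06
print(info_gain(parent, (65, 7), (6, 22)))   # T2: 0.16
print(info_gain(parent, (66, 22), (5, 7)))   # T3: 0.02
```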
d) Finish the construction of the decision tree using the following stopping criterion: all leaf nodes have a misclassification error (measured on the training set) less than or equal to 10 %.
Now we have two nodes:
if T2 = P then (Node1) else (Node2)
Node1
misclassification error is 7/72 < 10%
can stop here
Predict “Good” with confidence 65/72
Node2
misclassification error is 6/28 > 10%
continue splitting
Try T1, Condition T1 = P
S_L: P_L = 7/28
I(S_L): Good = 5/7, Bad = 2/7
S_R: P_R = 21/28
I(S_R): Good: 1/21, Bad = 20/21
Delta I = 6/28 - 7/28 * 2/7 - 21/28 * 1/21 = (6 - 2 - 1)/28 = 3/28
Try T3, Condition T3 = P
S_L: P_L = 23/28
I(S_L): Good = 1/23, Bad = 22/23
S_R: P_R = 5/28 (100% pure)
I(S_R): Good = 5/5, Bad = 0/5
Delta I = 6/28 - 23/28 * 1/23 - 0 = 5/28
⇒ Use T3 as a splitting criterion
Tree so far:
if T2 = P then
predict “Good” with confidence 65/72
else
if (T3 = P) then
(Node2.1)
else
(Node2.2)
Node2.1
misclassification error is 1/23 < 10%
can stop here
Predict “Bad” with confidence 22/23
Node2.2
misclassification error is 0
Predict “Good” with confidence 1
Tree:
if (T2 = P) then
predict “Good” with confidence 65/72
else
if (T3 = P) then
Predict “Bad” with confidence 22/23
else
Predict “Good” with confidence 1
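The final tree can also be written directly as a prediction function; the attribute values 'P'/'F' follow the notation of the tables above:

```python
def predict(candidate):
    """Decision tree from question d); candidate is a dict such as {'T1': 'P', 'T2': 'F', 'T3': 'P'}."""
    if candidate["T2"] == "P":
        return "Good"   # confidence 65/72
    if candidate["T3"] == "P":
        return "Bad"    # confidence 22/23
    return "Good"       # confidence 1 (5/5)

print(predict({"T1": "F", "T2": "F", "T3": "P"}))  # Bad
```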
e) Considering every possible subtree, compute the validation errors (using the validation subset). Which subtree has to be selected?
We can prune the tree in two places, so we consider 3 candidate trees:
Tree 1 - the full tree from question d) (no pruning)
Tree 2 - the tree where the node testing T3 is replaced by a leaf
Tree 3 - the tree pruned down to a single leaf at the root
Comparing their errors on the validation subset, Tree 1 has the lowest validation error, so Tree 1 is selected.
(For cost-complexity pruning, see here.)
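The comparison itself is mechanical once the validation examples are available (they are not reproduced here); a minimal helper, assuming the validation set is given as (candidate, label) pairs and each candidate tree is a predict function:

```python
def validation_error(tree, validation_set):
    """Fraction of validation examples misclassified by `tree` (a predict function)."""
    errors = sum(1 for candidate, label in validation_set if tree(candidate) != label)
    return errors / len(validation_set)

# Hypothetical usage: pick the subtree with the smallest validation error
# errors = {name: validation_error(tree, val_set) for name, tree in candidate_trees.items()}
# best = min(errors, key=errors.get)
```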
f) Finally, compute the test error of the decision tree obtained at the end of question e).
The test error is 0.16.
Dataset D is divided into 2 subsets D1 and D2
Items: {A B C D E F}
freq(X,D) - # of transactions where itemset X occurs
freq(AB,D) = 3 and freq(AB,D1) = 2
supp(X,D) = freq(X,D)/|D| - the support is the proportion of transactions containing X
a) Enumerate all frequent itemsets of dataset D whose frequency is at least 2 and represent them by means of a lattice. Specify the frequency for each itemset.
use an inverted list to represent the data set (it's simpler):
A → {T1, T2, T4}
B → {T1, T2, T4, T5}
C → {T2, T5, T6}
D → {T2, T3, T4, T5}
E → {T1}
F → {T6}
Use the Apriori algorithm.
Start with {}: the empty itemset is always frequent.
iteration 1
generate candidates: {{A} → 3, {B} → 4 , {C} → 3, {D} → 4, {E} → 1, {F} → 1}
remove the infrequent ones: {{A} → 3, {B} → 4, {C} → 3, {D} → 4}
iteration 2
generate candidates: {{AB} → 3, {AC} → 1, {AD} → 2, {BC} → 2, {BD} → 3, {CD} → 2}
remove the infrequent ones: {{AB} → 3, {AD} → 2, {BC} → 2, {BD} → 3, {CD} → 2}
iteration 3
generate candidates: {{ABD} → 2, {BCD} → 2}
remove the infrequent ones: {{ABD} → 2, {BCD} → 2}
stop
we cannot generate the candidate {ABCD}: besides {ABD} and {BCD}, its subsets {ABC} and {ACD} would also have to be frequent, but they are not ({AC} is already infrequent)
Obtained lattice: the nodes are the frequent itemsets listed above (from {} up to {ABD} and {BCD}), ordered by inclusion and annotated with their frequencies.
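A compact sketch of the Apriori run above; the six transactions are reconstructed from the inverted lists, and the minimum frequency is 2:

```python
from itertools import combinations

# Transactions T1..T6, reconstructed from the inverted lists above
transactions = [set("ABE"), set("ABCD"), set("D"), set("ABD"), set("BCD"), set("CF")]
min_freq = 2

def freq(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

frequent = [frozenset({i}) for i in "ABCDEF" if freq(frozenset({i})) >= min_freq]
all_frequent = {fs: freq(fs) for fs in frequent}

k = 2
while frequent:
    # candidate generation: unions of frequent (k-1)-itemsets that have size k
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # prune candidates having an infrequent (k-1)-subset, then count the survivors
    candidates = {c for c in candidates
                  if all(frozenset(s) in all_frequent for s in combinations(c, k - 1))}
    frequent = [c for c in candidates if freq(c) >= min_freq]
    all_frequent.update({c: freq(c) for c in frequent})
    k += 1

for fs, f in sorted(all_frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(fs)), f)   # A 3, B 4, C 3, D 4, AB 3, ..., ABD 2, BCD 2
```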
GR1(X) = supp(X,D1)/supp(X,D2) - growth rate of itemset X
X is an emerging pattern iff GR1(X) > 1
b) Describe in one sentence the semantics of emerging patterns.
Emerging patterns are patterns whose support changes significantly from one dataset to another,
i.e. the fraction of transactions containing X is significantly higher in D1 than in D2.
c) Rewrite the growth rate of an itemset X using freq(X,D1) and freq(X,D).
supp(X,D) = freq(X,D)/|D|
freq(X, D) = freq(X, D1) + freq(X, D2); freq(X, D2) = freq(X, D) - freq(X, D1)
GR1(X) = ( | D2 | * freq(X, D1) ) / ( | D1 | * freq(X, D2) ) = (| D2 | / | D1 |) * freq(X, D1) / (freq(X, D) - freq(X, D1))
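A direct transcription of the rewritten formula; returning infinity when freq(X, D2) = 0 is an assumption made here (the growth rate is then unbounded):

```python
def growth_rate(freq_d1, freq_d, n1, n2):
    """GR1(X) = (|D2| / |D1|) * freq(X, D1) / (freq(X, D) - freq(X, D1))."""
    freq_d2 = freq_d - freq_d1
    return float("inf") if freq_d2 == 0 else (n2 / n1) * freq_d1 / freq_d2

print(growth_rate(2, 3, 3, 3))  # {A}: freq 2 in D1, 3 in D, |D1| = |D2| = 3  ->  GR1 = 2.0
```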
We now look for the set of emerging patterns
EP = { X | GR1(X) >= 2 and freq(X, D) >= 2 }
d) Demonstrate that any pattern of EP has a frequency in D1 greater than or equal to 2.
GR1(X) >= 2 ⇒ supp(X, D1) / supp(X, D2) >= 2
Note: without any assumption on |D1| and |D2| this does not hold in general: if |D2| >= 2 * |D1|, we can have freq(X, D1) = freq(X, D2) = 1 and still GR1(X) >= 2 with freq(X, D) = 2.
For the given dataset, however, |D1| = |D2|, so:
Suppose freq(X, D1) < 2
We know that freq(X, D1) > 0 (otherwise GR1(X) = 0 < 2)
so freq(X, D1) = 1
then supp(X, D1) = 1 / | D1 |
for GR1(X) >= 2 to hold we must have 1 / | D1 | >= 2 * freq(X, D2) / | D2 |
since | D1 | = | D2 |, 1 >= 2 * freq(X, D2)
this is possible only if freq(X, D2) = 0,
but freq(X, D) = freq(X, D1) + freq(X, D2) >= 2,
and 1 + 0 is not >= 2
⇒ contradiction ⇒ freq(X, D1) >= 2
(Perhaps there is a simpler argument.)
e) Enumerate all the patterns of EP.
By question d), every pattern of EP has frequency >= 2 in D1, so it suffices to examine the frequent patterns of D1: {{} → 3, {A} → 2, {B} → 2, {D} → 2, {AB} → 2}
Their frequencies in D2: {{} → 3, {A} → 1, {B} → 2, {D} → 2, {AB} → 1}
Since |D1| = |D2|, GR1 is simply the ratio of these two frequencies, so only {A} and {AB} reach GR1 = 2 ⇒ EP = { {A}, {AB} }
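Checking the candidates with the frequencies listed above (since |D1| = |D2|, GR1(X) >= 2 amounts to freq(X, D1) >= 2 * freq(X, D2)):

```python
freq_d1 = {"A": 2, "B": 2, "D": 2, "AB": 2}   # frequencies in D1 (from the table above)
freq_d2 = {"A": 1, "B": 2, "D": 2, "AB": 1}   # frequencies in D2

ep = [x for x in freq_d1
      if freq_d1[x] + freq_d2[x] >= 2          # freq(X, D) >= 2
      and freq_d1[x] >= 2 * freq_d2[x]]        # GR1(X) >= 2 (since |D1| = |D2|)
print(ep)  # ['A', 'AB']
```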
Given A(2,3), B(6,5), C(7,1), D(1,1), B2(8,1), E(2,2), F(5,3), G(7,4)
Generate 3 clusters from this data set by using the agglomerative method with the Manhattan distance. Use the mean distance (i.e., the distance between cluster centers) when merging clusters.
Manhattan distance: d(X, Y) = Σ_{i=1}^{N} |X_i − Y_i|
A hierarchical method creates a hierarchical decomposition of the given set of data objects:
- agglomerative = the bottom-up approach (AGNES)
- divisive = the top-down approach (DIANA)
a) Give the clusters generated after the third iteration.
Initially, each data point forms its own cluster.
Iteration 1
Pairs of clusters with distance 1 are:
({A}, {E}), ({B2}, {C}) - join them
Calculate the centers for the new clusters:
{A, E} → (2, 2.5)
{B2, C} → (7.5, 1)
Iteration 2
The only pair of clusters at distance 2 is ({B}, {G}) - join them
Iteration 3
The smallest distance is dist({A,E}, {D}) = 2.5 (the center of {A, E} is (2, 2.5)) - join them
Recalculate the centers: {A, D, E} → (5/3, 2); {B, G} → (6.5, 4.5)
So clusters after the 3rd iteration are:
{A,D,E}, {F}, {B,G}, {C, B2}
b) Give the final 3 clusters.
Iteration 4
d({A,D,E}, {F}) = |5 − 5/3| + |3 − 2| = 10/3 + 1 = 13/3 ≈ 4.33
d({B,G}, {F}) = |5 − 6.5| + |3 − 4.5| = 1.5 + 1.5 = 3
d({B2,C}, {F}) = |5 − 7.5| + |3 − 1| = 2.5 + 2 = 4.5
The closest pair of clusters is {B,G} and {F} (all other inter-cluster distances are larger) - join them
So the final 3 clusters are: {A, D, E}, {B, F, G}, {B2, C}
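A sketch of the whole agglomeration with mean (centroid) linkage and Manhattan distance; it performs one merge per step (whereas the hand computation merges both distance-1 pairs in its first iteration), but reaches the same 3 clusters:

```python
points = {"A": (2, 3), "B": (6, 5), "C": (7, 1), "D": (1, 1),
          "B2": (8, 1), "E": (2, 2), "F": (5, 3), "G": (7, 4)}

def center(cluster):
    """Mean of the points in the cluster."""
    coords = [points[p] for p in cluster]
    return (sum(x for x, _ in coords) / len(coords),
            sum(y for _, y in coords) / len(coords))

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

clusters = [{p} for p in points]   # start: one cluster per point
while len(clusters) > 3:
    # merge the pair of clusters whose centers are closest
    i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
               key=lambda ij: manhattan(center(clusters[ij[0]]), center(clusters[ij[1]])))
    clusters[i] |= clusters[j]
    del clusters[j]
    print([sorted(c) for c in clusters])
# last line printed: [['A', 'D', 'E'], ['B', 'F', 'G'], ['B2', 'C']]
```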
Attributes: F (Fraud), A (Age), S (Sex), J (Jewelry bought), G (Gas bought). The Bayesian network has the edges F → J, A → J, S → J and F → G (consistent with the factorization given below).
Z d-separates X and Y iff every undirected path P from X to Y is blocked by Z, where P is blocked iff one or more of the following conditions is true:
- P contains a chain, x → m → y (or x ← m ← y), such that the middle node m is in Z, or
- P contains a fork, x ← m → y, such that the middle node m is in Z, or
- P contains an inverted fork, x → m ← y, such that the middle node m is not in Z and no descendant of m is in Z.
a) Using d-separation, show if the following assertions are true or not:
- The variables Fraud and Age are independent given the variable Jewelry,
- The variables Sex and Gas are independent,
- The variables Sex and Gas are independent given the variable Jewelry.
- AJF (the path A → J ← F) is not blocked by Z = {J}, so (F not ╨ A | J):
the path contains an inverted fork, x → m ← y,
but the middle node m = J is in Z, so no blocking condition applies ⇒ Fraud and Age are NOT independent given Jewelry.
- SJFG (the path S → J ← F → G) is blocked by Z = ∅, so (S ╨ G | ∅):
the path contains an inverted fork, x → m ← y,
such that the middle node m = J is not in Z and
no descendant of m is in Z ⇒ Sex and Gas are independent.
- SJFG is not blocked by Z = {J}, so (S not ╨ G | J):
the path contains an inverted fork, x → m ← y,
whose middle node m = J is now in Z, and the fork J ← F → G has middle node F ∉ Z, so no segment blocks the path ⇒ Sex and Gas are NOT independent given Jewelry.
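These three checks can be reproduced with NetworkX; the edge set below is the one implied by the factorization in the next question, and nx.d_separated is the NetworkX (>= 2.8) helper, renamed is_d_separator in the most recent releases:

```python
import networkx as nx

# Bayesian-network structure: F -> J, A -> J, S -> J, F -> G
G = nx.DiGraph([("F", "J"), ("A", "J"), ("S", "J"), ("F", "G")])

print(nx.d_separated(G, {"F"}, {"A"}, {"J"}))   # False: not independent given Jewelry
print(nx.d_separated(G, {"S"}, {"G"}, set()))   # True: independent
print(nx.d_separated(G, {"S"}, {"G"}, {"J"}))   # False: not independent given Jewelry
```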
b) Express the joint probability of the five variables F, A, S, J and G.
The joint probability is
P(F, A, S, J, G) = P(F) * P(A) * P(S) * P(J | F, A, S) * P(G | F)
c) A 40-year-old male has bought a jewel during the last 24 hours. What is the probability that this purchase is fraudulent?
Our instance:
J = y, A = [30,50], S = m; G is not observed
We need to predict F
In order to answer this question, you will use the marginal law and the definition of conditional probability.
- Conditional probability: P(X | Y) = P(X, Y) / P(Y)
- Marginal law: P(x) = ∑_i P(x, y_i) if the domain D_Y of Y is { y_1, …, y_p }
- If a directed acyclic graph is a Bayesian network of a probability distribution, then P(X1, …, XN) = ∏_i P(Xi | Parents(Xi))
Let's fix the variables:
P(F=f | J=j, S=s, A=a) = P(J=j, S=s, A=a, F=f) / P(J=j, S=s, A=a) = P(J=j | F=f, S=s, A=a) * P(S=s) * P(A=a) * P(F=f) / P(J=j, S=s, A=a)
In the denominator J, S, A are fixed but F is not (it appears because J depends on F), so we marginalize over F using the marginal law:
P(J=j, S=s, A=a) = P(S=s) * P(A=a) * [ P(F=y) * P(J=j | F=y, S=s, A=a) + P(F=n) * P(J=j | F=n, S=s, A=a) ]
From the tables: P(J=y | F=y, [30,50], m) = 0.05, P(J=y | F=n, [30,50], m) = 0.004, P(S=m) = 0.5, P(A=[30,50]) = 0.40, P(F=y) = 0.1, P(F=n) = 0.9
Denominator: P(J=y, [30,50], m) = 0.5 * 0.40 * (0.1 * 0.05 + 0.9 * 0.004) = 0.2 * 0.0086 = 0.00172
P(F=y | J=y, S=m, A=[30,50]) = 0.05 * 0.5 * 0.40 * 0.1 / 0.00172 = 0.001 / 0.00172 ≈ 0.58
P(F=n | J=y, S=m, A=[30,50]) = 0.004 * 0.5 * 0.40 * 0.9 / 0.00172 = 0.00072 / 0.00172 ≈ 0.42
Since P(F=y | J=y, S=m, A=[30,50]) ≈ 0.58 > 0.42, the purchase is classified as fraud.
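A quick numerical check of the posterior using only the probabilities quoted above (the P(A) and P(S) factors cancel in the normalization):

```python
p_f = {"y": 0.1, "n": 0.9}              # prior P(F)
p_j_given_f = {"y": 0.05, "n": 0.004}   # P(J=y | F, A=[30,50], S=m)

# Bayes: P(F=f | J=y, A, S) is proportional to P(F=f) * P(J=y | F=f, A, S)
joint = {f: p_f[f] * p_j_given_f[f] for f in ("y", "n")}
norm = sum(joint.values())
posterior = {f: p / norm for f, p in joint.items()}
print(posterior)  # {'y': 0.581..., 'n': 0.418...}
```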