
Partitioning Clustering Technique


INTRODUCTION

  • This clustering method partitions the data into multiple groups based on the characteristics and similarity of the data objects.
  • The data analyst must specify the number of clusters to be generated by the clustering method.
  • Given a database D containing N objects, the partitioning method constructs K user-specified partitions of the data, where each partition represents a cluster and a particular region.

Two common partitioning algorithms are:

  1. K-Means
  2. K-Medoids (Partitioning Around Medoids)


K-Means

  • The K-Means algorithm takes an input parameter K from the user and partitions a dataset containing N objects into K clusters, so that the similarity among data objects within a cluster (intracluster similarity) is high, while the similarity between data objects in different clusters (intercluster similarity) is low.

  • The similarity of a cluster is measured with respect to the mean value of the objects in the cluster.


K-Means Algorithm

Input:

D: A dataset containing N objects

K: The number of clusters into which the dataset is to be divided

Output:

A set of K clusters

Method:

  1. Randomly select K objects from the dataset (D) as the initial cluster centres (C).
  2. (Re)assign each object to the cluster whose mean it is most similar to.
  3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated memberships.
  4. Repeat Steps 2 and 3 until no change occurs.
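The four steps above can be sketched in Python as follows (a minimal one-dimensional sketch; the function name and the explicit initial centres are illustrative, and it assumes no cluster ever becomes empty):

```python
def k_means(data, centroids):
    """Minimal 1-D K-Means: repeat assign/update until the means stop changing."""
    while True:
        # Step 2: assign each object to the cluster with the nearest mean.
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Step 3: recalculate the mean of each cluster.
        new_centroids = [sum(c) / len(c) for c in clusters]
        # Step 4: stop when no mean changes.
        if new_centroids == centroids:
            return clusters, centroids
        centroids = new_centroids
```

For example, `k_means([1, 2, 10, 11], [1, 10])` converges to the clusters `[[1, 2], [10, 11]]` with means 1.5 and 10.5.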


Suppose we want to group the visitors to a website using just their age, as follows:

16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66

Initial Cluster:

K=2

Centroid(C1) = 16

Centroid(C2) = 22

Iteration-1:

[16, 16, 17]

C1 = 16.33

[20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]

C2 = 37.25

Iteration-2:

[16, 16, 17, 20, 20, 21, 21, 22, 23]

C1 = 19.56

[29, 36, 41, 42, 43, 44, 45, 61, 62, 66]

C2 = 46.90


Iteration-3:

[16, 16, 17, 20, 20, 21, 21, 22, 23, 29]

C1 = 20.50

[36, 41, 42, 43, 44, 45, 61, 62, 66]

C2 = 48.89

Iteration-4:

[16, 16, 17, 20, 20, 21, 21, 22, 23, 29]

C1 = 20.50

[36, 41, 42, 43, 44, 45, 61, 62, 66]

C2 = 48.89

  • There is no change between Iterations 3 and 4, so we stop.
  • Therefore, the K-Means algorithm yields the two clusters (16-29) and (36-66).
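The iterations above can be replayed with a short self-contained script (assuming the same starting centres, 16 and 22):

```python
ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
c1, c2 = 16.0, 22.0  # the initial centroids chosen above

while True:
    # assign each age to the nearer centroid (ties go to C1)
    g1 = [a for a in ages if abs(a - c1) <= abs(a - c2)]
    g2 = [a for a in ages if abs(a - c1) > abs(a - c2)]
    # recalculate both means
    n1, n2 = sum(g1) / len(g1), sum(g2) / len(g2)
    if (n1, n2) == (c1, c2):  # no change between iterations: stop
        break
    c1, c2 = n1, n2

print(g1)  # -> [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
print(g2)  # -> [36, 41, 42, 43, 44, 45, 61, 62, 66]
```

The final means, about 20.50 and 48.89, match the values in Iterations 3 and 4.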


K-Medoids (also called Partitioning Around Medoids, PAM)

  • The algorithm was proposed by Kaufman and Rousseeuw in 1987.
  • A medoid is a point in the cluster whose total dissimilarity to all the other points in the cluster is minimal.
  • The dissimilarity between a medoid (Ci) and an object (Pi) is calculated as E = |Pi - Ci|.


Algorithm

  1. Initialize: select k random points out of the n data points as the medoids.
  2. Associate each data point with the closest medoid, using any common distance metric.
  3. While the cost decreases, for each medoid m and for each non-medoid data point o:
    • Swap m and o, associate each data point with the closest medoid, and recompute the cost.
    • If the total cost is higher than in the previous step, undo the swap.
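Assuming the Manhattan distance used in the worked example below, the swap search can be sketched as follows (`pam` and `total_cost` are illustrative names, not a standard API):

```python
def manhattan(p, q):
    # Distance = |X1 - X2| + |Y1 - Y2|
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    # each point contributes its distance to its closest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, medoids):
    """Minimal PAM sketch: try every medoid/non-medoid swap, keep only cost-reducing ones."""
    best = total_cost(points, medoids)
    improved = True
    while improved:  # step 3: repeat while the cost decreases
        improved = False
        for i in range(len(medoids)):
            for o in points:
                if o in medoids:
                    continue
                # swap medoid i with o and recompute the total cost
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(points, candidate)
                if cost < best:
                    medoids, best = candidate, cost
                    improved = True
                # if the cost did not drop, the swap is simply not kept
    return medoids, best
```

On a toy dataset of two tight groups, e.g. `pam([(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)], [(0, 1), (5, 6)])`, the search converges to the medoids (0, 0) and (5, 5).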


Step 1: Select k = 2, and let the two randomly selected medoids be C1 = (4, 5) and C2 = (8, 5).


Step 2: Calculate the cost. The dissimilarity of each non-medoid point with the medoids C1 = (4, 5) and C2 = (8, 5) is calculated and tabulated.

Distance = |X1 - X2| + |Y1 - Y2| (the Manhattan distance)


  • Each point is assigned to the cluster of the medoid with the smaller dissimilarity.
  • Points 1, 2, and 5 go to cluster C1, and points 0, 3, 6, 7, and 8 go to cluster C2.

The Cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
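As a sketch of this calculation (the slide's table of coordinates is not reproduced here, so the three points below are hypothetical, chosen only to illustrate the assignment and the cost):

```python
def manhattan(p, q):
    # Distance = |X1 - X2| + |Y1 - Y2|
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

c1, c2 = (4, 5), (8, 5)            # the two medoids from Step 1
points = [(5, 8), (7, 7), (6, 2)]  # hypothetical non-medoid points

cost = 0
for p in points:
    d1, d2 = manhattan(p, c1), manhattan(p, c2)
    cost += min(d1, d2)  # each point pays its distance to the closer medoid
    print(p, "-> C1" if d1 <= d2 else "-> C2")
print("total cost:", cost)  # -> total cost: 12
```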

Step 3:

  • Randomly select one non-medoid point and recalculate the cost.

  • Let the randomly selected point be (8, 4).

  • The dissimilarity of each non-medoid point with the medoids – C1 (4, 5) and C2 (8, 4) is calculated and tabulated.


Points 1, 2, and 5 go to cluster C1, and points 0, 3, 6, 7, and 8 go to cluster C2.

The New cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22


  • Swap cost = new cost - previous cost = 22 - 20 = 2, which is greater than 0.
  • As the swap cost is not less than zero, we undo the swap.
  • Hence (4, 5) and (8, 5) are the final medoids.
  • The final clustering keeps points 1, 2, and 5 in cluster C1 and points 0, 3, 6, 7, and 8 in cluster C2.
  • The time complexity of the method is O(k*(n-k)²).


Comparing K-Means with K-Medoids:

  • Both algorithms need a predefined value of k, the number of clusters, prior to training.
  • Both algorithms choose the initial cluster centres arbitrarily.
  • The K-Medoids method is more robust than K-Means in the presence of outliers, because a medoid is less influenced by an outlier than a mean is.
  • Time complexity of K-Means: O(k*n*D)
  • Time complexity of K-Medoids: O(k*(n-k)²)


Applicability of PAM:

  • PAM does not scale well to large databases because of its computational complexity.
  • Some variants of PAM designed mainly for large datasets are:
    • CLARA (Clustering LARge Applications) and
    • CLARANS (Clustering Large Applications based on RANdomized Search)
  • CLARANS is an improvement over CLARA.