
PCA18502 – DATA MINING AND DATA WAREHOUSING

Class 8 – 02.09.2020

Mr Rajkumar D

Assistant Professor (S.G)

MCA DEPARTMENT

SRMIST, Ramapuram


CLUSTERING

What is Clustering?

  • Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data.
  • A working definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”.
  • A cluster is therefore a collection of objects that are “similar” to each other and “dissimilar” to the objects belonging to other clusters.


  • In the example, we easily identify the four clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in this case, geometric distance). This is called distance-based clustering.
  • Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if together they define a concept common to all of them. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.


Distance Functions

  • Given two p-dimensional data objects i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp), the following common distance functions can be defined:

Euclidean distance: d(i, j) = √((xi1 − xj1)² + (xi2 − xj2)² + ... + (xip − xjp)²)

Manhattan distance: d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|

  • When using the Euclidean distance function to compare distances, it is not necessary to calculate the square root, because distances are always positive numbers and, for two distances d1 and d2, √d1 > √d2 ⇔ d1 > d2.
  • If some of an object’s attributes are measured on different scales, then when using the Euclidean distance function, attributes with larger scales of measurement may overwhelm attributes measured on a smaller scale. To prevent this problem, the attribute values are often normalized to lie between 0 and 1.
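The two distance functions and the min–max normalization described above can be sketched in Python; this is a minimal illustration, and the age/income rows are made-up example values:

```python
import math

def euclidean(a, b):
    """Euclidean distance: square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Manhattan distance: sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def min_max_normalize(data):
    """Rescale every attribute to [0, 1] so no attribute dominates the distance."""
    lo = [min(col) for col in zip(*data)]
    hi = [max(col) for col in zip(*data)]
    return [tuple((x - l) / (h - l) if h > l else 0.0
                  for x, l, h in zip(row, lo, hi))
            for row in data]

# Without scaling, attribute 2 (income) would overwhelm attribute 1 (age).
raw = [(25, 20000), (40, 80000), (60, 50000)]
print(min_max_normalize(raw))
```

After normalization, both attributes contribute comparably to either distance function.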


Goals of Clustering

  • The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data.
  • But how do we decide what constitutes a good clustering? It can be shown that there is no absolute “best” criterion that is independent of the final aim of the clustering.
  • Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs.
  • For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding “natural clusters” and describing their unknown properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes), or in finding unusual data objects (outlier detection).


Applications

Clustering algorithms can be applied in many fields, for instance:

  • Marketing: Finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;
  • Biology: classification of plants and animals given their features;
  • Libraries: book ordering;
  • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
  • City-planning: Identifying groups of houses according to their house type, value and geographical location;
  • Earthquake studies: Clustering observed earthquake epicenters to identify dangerous zones;
  • WWW: Document classification; clustering weblog data to discover groups of similar access patterns.


Types of Clustering Algorithms

There are two types of clustering algorithms:

  • Nonhierarchical
  • Hierarchical
  • In nonhierarchical clustering, such as the k-means algorithm, the relationship between clusters is undetermined.
  • Hierarchical clustering repeatedly links pairs of clusters until every data object is included in the hierarchy. With both of these approaches, an important issue is how to determine the similarity between two objects, so that clusters can be formed from objects with a high similarity to each other.


Nonhierarchical Clustering Algorithms

K-means Algorithm

The k-means algorithm is one of a group of algorithms called partitioning methods.

  • The problem of partitioned clustering can be formally stated as follows:
  • Given n objects in a d-dimensional metric space, determine a partition of the objects into k groups, or clusters, such that the objects in a cluster are more similar to each other than to objects in different clusters.
  • Recall that a partition divides a set into disjoint parts that together include all members of the set. The value of k may or may not be specified and a clustering criterion, typically the squared-error criterion, must be adopted.
  • The solution to this problem is straightforward. Select a clustering criterion, then for each data object select the cluster that optimizes the criterion.
  • The k-means algorithm initializes k clusters by arbitrarily selecting one object to represent each cluster. Each of the remaining objects is assigned to a cluster and the clustering criterion is used to calculate the cluster mean.
  • These means are used as the new cluster centers, and each object is reassigned to the cluster it is most similar to. This continues until there is no longer a change when the cluster centers are recalculated.


1. Select k objects arbitrarily as the initial clusters.

2. Initialize the cluster centers with those k objects.

3. Do loop

(a) Partition by assigning or reassigning all data objects to their closest cluster center.

(b) Compute new cluster centers as the mean value of the objects in each cluster,

until no change in cluster center calculation.
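The steps above can be sketched as a minimal k-means loop in Python; for illustration it seeds the clusters with the first k objects (one of the seeding options discussed in the worked example below) and uses the Manhattan distance:

```python
def manhattan(a, b):
    """Manhattan distance between two points of equal dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def k_means(data, k, dist=manhattan):
    # Steps 1-2: take the first k objects as seed points / initial centers.
    centers = [data[i] for i in range(k)]
    while True:
        # Step 3(a): assign every object to its closest cluster center
        # (ties go to the lowest-index cluster).
        clusters = [[] for _ in range(k)]
        for p in data:
            i = min(range(k), key=lambda j: dist(p, centers[j]))
            clusters[i].append(p)
        # Step 3(b): recompute each center as the mean of its cluster.
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:      # no change: the loop terminates
            return centers, clusters
        centers = new_centers
```

Running `k_means` on the ten points of the worked example with k = 2 reproduces the final centers derived on the following slides.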


  • Suppose we are given the ten points (0,0), (0,1), (1,1), (1,0), (0.5,0.5), (5,5), (5,6), (6,6), (6,5), (5.5,5.5) as input, we choose k = 2, and we use the Manhattan distance function dis = |x2–x1| + |y2–y1|. The detailed computation is as follows.
  • Step 1: Initialize k partitions. An initial partition can be formed by first specifying a set of k seed points. The seed points could be the first k objects or k objects chosen randomly from the input objects. We choose the first two objects as seed points and initialize the clusters as C1 = {(0,0)} and C2 = {(0,1)}.
  • Step 2: Since there is only one point in each cluster, that point is the cluster center.
  • Step 3a: Calculate the distance between each object and each cluster center, assigning the object to the closest cluster.

For example, for the third object:

dis(1,3) = |1–0| + |1–0| = 2 and dis(2,3) = |1–0| + |1–1| = 1,

so this object is assigned to C2. The fifth object is equidistant from both clusters, so we arbitrarily assign it to C1.

After calculating the distance for all points, the clusters contain the following objects:

C1 = {(0,0),(1,0),(0.5,0.5)} and

C2 = {(0,1),(1,1),(5,5),(5,6),(6,6),(6,5),(5.5,5.5)}.
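Step 3a can be reproduced with a few lines of Python; ties are broken in favor of C1, as for the fifth object in the example:

```python
def manhattan(a, b):
    """Manhattan distance between two 2-D points."""
    return sum(abs(x - y) for x, y in zip(a, b))

points = [(0, 0), (0, 1), (1, 1), (1, 0), (0.5, 0.5),
          (5, 5), (5, 6), (6, 6), (6, 5), (5.5, 5.5)]
c1, c2 = (0, 0), (0, 1)     # the two seed points / initial centers

# Assign each point to the closer center; equidistant points go to C1.
C1 = [p for p in points if manhattan(p, c1) <= manhattan(p, c2)]
C2 = [p for p in points if manhattan(p, c1) > manhattan(p, c2)]
print(C1)   # [(0, 0), (1, 0), (0.5, 0.5)]
print(C2)
```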


  • Step 3b: Compute new cluster center for each cluster.

New center for C1 = (0.5,0.16)

(0+1+0.5)/3 = 0.5, (0+0+0.5)/3 ≈ 0.16

New center for C2 = (4.1,4.2)

(0+1+5+5+6+6+5.5)/7 ≈ 4.1, (1+1+5+5+6+6+5.5)/7 ≈ 4.2.
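The mean computation in Step 3b can be checked directly; the code prints the unrounded values, which the slide rounds to (0.5, 0.16) and (4.1, 4.2):

```python
def mean_point(points):
    """New cluster center: the coordinate-wise mean of the cluster's objects."""
    return tuple(sum(c) / len(points) for c in zip(*points))

C1 = [(0, 0), (1, 0), (0.5, 0.5)]
C2 = [(0, 1), (1, 1), (5, 5), (5, 6), (6, 6), (6, 5), (5.5, 5.5)]
print(mean_point(C1))   # ≈ (0.5, 0.17); shown as (0.5, 0.16) on the slide
print(mean_point(C2))   # ≈ (4.07, 4.21); shown as (4.1, 4.2) on the slide
```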

  • Step 3a’: New centers C1 = (0.5,0.16) and C2 = (4.1,4.2) differ from old centers C1={(0,0)} and C2={(0,1)}, so the loop is repeated.

Reassign the ten objects to the closest cluster center, resulting in:

C1 = {(0,0),(0,1),(1,1),(1,0),(0.5,0.5)}

C2 = {(5,5),(5,6),(6,6),(6,5),(5.5,5.5)}.

  • Step 3b’: Compute new cluster center for each cluster

New center for C1 = (0.5,0.5)

New center for C2 = (5.5,5.5).


  • Step 3a’’: New centers C1 = (0.5,0.5) and C2 = (5.5,5.5) differ from old centers C1 =

(0.5,0.16) and C2 = (4.1,4.2), so the loop is repeated.

  • Step 3b’’: Compute new cluster centers.
  • Algorithm is done: the centers are the same as in Step 3b', so the algorithm terminates. The result is the same as in Step 3b'.
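The whole iteration can be traced end to end; this sketch prints the centers after each recomputation and stops once they repeat, matching the sequence (0.5, 0.16)/(4.1, 4.2), then (0.5, 0.5)/(5.5, 5.5) derived above:

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def mean_point(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

points = [(0, 0), (0, 1), (1, 1), (1, 0), (0.5, 0.5),
          (5, 5), (5, 6), (6, 6), (6, 5), (5.5, 5.5)]
centers = [points[0], points[1]]        # Step 1: first two objects as seeds
while True:
    # Step 3a: reassign; equidistant points go to the first cluster.
    clusters = [[], []]
    for p in points:
        i = 0 if manhattan(p, centers[0]) <= manhattan(p, centers[1]) else 1
        clusters[i].append(p)
    # Step 3b: recompute the centers and show them.
    new_centers = [mean_point(c) for c in clusters]
    print(new_centers)
    if new_centers == centers:          # unchanged: algorithm is done
        break
    centers = new_centers
```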



Thank You