Data Mining_Anoop Chaturvedi

Swayam Prabha

Course Title

Multivariate Data Mining- Methods and Applications

Lecture 29

Centroid and Non-hierarchical Clustering Methods

Anoop Chaturvedi

Department of Statistics, University of Allahabad

Prayagraj (India)

Slides can be downloaded from https://sites.google.com/view/anoopchaturvedi/swayam-prabha


Agglomerative and divisive clustering are the two primary approaches to hierarchical clustering. The agglomerative approach is more commonly employed because of its simplicity and efficiency.


Centroid Clustering:

Each group formed is represented by the mean value of each variable, i.e. by its mean vector (centroid).
Inter-group distance is defined as the distance between two such mean vectors.

Example:


Individual	1	2	3	4	5
Variable 1	1	1	6	8	8
Variable 2	1	2	3	2	0
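The first fusions for this data can be reproduced with a short sketch. It assumes Euclidean distance between mean vectors and a size-weighted mean for the fused group. The helper names `closest_pair` and `fuse` are illustrative and do not come from the lecture.

```python
import math
from itertools import combinations

# Data from the example: five individuals measured on two variables.
points = {"1": (1, 1), "2": (1, 2), "3": (6, 3), "4": (8, 2), "5": (8, 0)}

def closest_pair(centroids):
    """Return (distance, name_a, name_b) for the two closest centroids."""
    return min(
        (math.dist(centroids[a], centroids[b]), a, b)
        for a, b in combinations(centroids, 2)
    )

def fuse(centroids, sizes, a, b):
    """Replace groups a and b by one group whose centroid is the size-weighted mean."""
    n = sizes[a] + sizes[b]
    merged = tuple(
        (sizes[a] * x + sizes[b] * y) / n
        for x, y in zip(centroids[a], centroids[b])
    )
    name = f"({a.strip('()')}{b.strip('()')})"
    for key in (a, b):
        del centroids[key], sizes[key]
    centroids[name], sizes[name] = merged, n
    return name

centroids = dict(points)
sizes = {name: 1 for name in points}

# Carry out the first two fusions of the example.
steps = []
for _ in range(2):
    d, a, b = closest_pair(centroids)
    name = fuse(centroids, sizes, a, b)
    steps.append((name, round(d, 2), centroids[name]))

for step in steps:
    print(step)
```

Each printed tuple gives the new group, the distance at which it was formed, and its mean vector, which can be checked against the tables that follow.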


Individual	(12)	3	4	5
Variable 1	1	6	8	8
Variable 2	1.5	3	2	0


Individual	(12)	3	(45)
Variable 1	1	6	8
Variable 2	1.5	3	1


Next, fuse (45) and 3.

Finally, fuse (12) and (345) into (12345) at a distance of 6.04.


Individual	(12)	(345)
Variable 1	1	7
Variable 2	1.5	2


Median Clustering:

Disadvantages of Centroid clustering

If sizes of groups to be fused are very different, centroid of new group will be very close to the larger group and may remain within that group.
The characteristic properties of the smaller groups are virtually lost.
It is sensitive to outliers.
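The first disadvantage can be illustrated numerically. The sketch below contrasts the size-weighted centroid with the unweighted midpoint of the two group centroids, which is the usual update rule in median clustering. The group sizes and coordinates are hypothetical and chosen only to make the effect visible.

```python
def centroid_update(c1, n1, c2, n2):
    """Size-weighted mean of two group centroids (centroid method)."""
    return tuple((n1 * a + n2 * b) / (n1 + n2) for a, b in zip(c1, c2))

def median_update(c1, c2):
    """Unweighted midpoint of two group centroids (median method)."""
    return tuple((a + b) / 2 for a, b in zip(c1, c2))

# Hypothetical groups: a large one of 98 points and a small one of 2 points.
large, n_large = (0.0, 0.0), 98
small, n_small = (10.0, 10.0), 2

weighted = centroid_update(large, n_large, small, n_small)
midpoint = median_update(large, small)

print(weighted)  # stays close to the large group
print(midpoint)  # halfway between the two groups
```

With the weighted rule the fused centroid barely moves away from the large group, whereas the midpoint rule gives both groups equal influence regardless of size.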


Some Comments on Hierarchical Procedures:

Hierarchical clustering methods are sensitive to outliers.
There is no provision for reallocation of objects that have been incorrectly grouped at an early stage. The final configuration of clusters should therefore be carefully examined.
The stability (robustness) of a clustering should be checked by applying the clustering algorithm before and after adding small errors to the data. If the groups are well distinguished, the two clusterings should agree.
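A stability check of this kind might look like the following sketch. It assumes a simple agglomerative centroid method stopped at k groups and Gaussian errors with a small standard deviation. The data set, the noise scale and the function names are all hypothetical.

```python
import math
import random
from itertools import combinations

def centroid_clusters(points, k):
    """Agglomerative centroid clustering stopped at k groups.

    Returns the partition as a set of frozensets of point indices.
    """
    # group id -> (member indices, centroid)
    groups = {i: ([i], tuple(p)) for i, p in enumerate(points)}
    next_id = len(points)
    while len(groups) > k:
        # Find the two groups whose centroids are closest.
        a, b = min(
            combinations(groups, 2),
            key=lambda pair: math.dist(groups[pair[0]][1], groups[pair[1]][1]),
        )
        (ma, ca), (mb, cb) = groups.pop(a), groups.pop(b)
        n = len(ma) + len(mb)
        centroid = tuple((len(ma) * x + len(mb) * y) / n for x, y in zip(ca, cb))
        groups[next_id] = (ma + mb, centroid)
        next_id += 1
    return {frozenset(members) for members, _ in groups.values()}

def perturb(points, scale, rng):
    """Add small Gaussian errors to every coordinate."""
    return [tuple(x + rng.gauss(0, scale) for x in p) for p in points]

# Hypothetical, well-separated data: two tight groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
rng = random.Random(0)  # fixed seed so the run is repeatable

original = centroid_clusters(data, k=2)
noisy = centroid_clusters(perturb(data, scale=0.05, rng=rng), k=2)

print(original == noisy)  # agreement suggests a stable grouping
```

For data whose groups overlap, the two partitions would be expected to disagree more often, and an agreement measure over many perturbed runs would be more informative than a single comparison.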


One should try several clustering methods and distance measures and check whether the results are consistent.
Multiple solutions may occur because of ties in the distances. Such groupings should be properly interpreted, and the different dendrograms can be compared to assess their overlap.
An inversion occurs when two clusters are fused at a smaller distance than that of an earlier fusion. Inversions sometimes occur with the centroid method or median clustering.
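A small numerical sketch shows how an inversion can arise with the centroid method. The three points are hypothetical and form a nearly equilateral triangle; Euclidean distance is assumed.

```python
import math

# Hypothetical points forming a nearly equilateral triangle.
a, b, c = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

# Step 1: the closest pair is (a, b), so they are fused first.
d_ab = math.dist(a, b)
d_ac = math.dist(a, c)
d_bc = math.dist(b, c)
first_fusion = min(d_ab, d_ac, d_bc)

# Centroid of the fused group (ab).
centroid_ab = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

# Step 2: c joins (ab) at the distance from c to that centroid.
second_fusion = math.dist(centroid_ab, c)

print(first_fusion, second_fusion)
print(second_fusion < first_fusion)  # the later fusion is at a smaller distance
```

The centroid of (ab) lies inside the triangle's base, closer to c than a and b were to each other, so the second fusion height falls below the first and the dendrogram branches cross.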


Dendrogram with crossover


Nonhierarchical or Partitioning Clustering Methods:

Number of clusters K is fixed in advance.
No distance matrix needs to be computed or stored, so the method can be applied to large data sets.
Initially select K seed points at random.
Partition the items into K clusters.
Then move points into or out of clusters according to some criterion.
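The steps above can be sketched as a minimal k-means routine. This assumes that the criterion is "assign each point to its nearest centroid", which is one common choice and not necessarily the only one intended here. The function name, the `seed` and `max_iter` parameters and the data are hypothetical.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Minimal k-means: random seeds, then alternate assignment and update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K seed points chosen at random
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Recompute each centroid as the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical data: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(data, k=2)
print(labels)
print(centroids)
```

Only the K centroids and the current labels are stored, which is why no full distance matrix is needed. Which cluster receives label 0 depends on the random seeds.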


To check stability, rerun the algorithm with a new set of initial groups.

Example:


Item	Observations
A	5	3
B	-1	1
C	1	-2
D	-3	-2



(AB)	2	2
(CD)	-1	-2


Squared distances from each item to the centroids of the final clusters A and (BCD):

Cluster	A	B	C	D
A	0	40	41	89
(BCD)	52	4	5	5
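The numbers in this example can be checked with a short sketch. It starts from the initial groups (AB) and (CD) and reassigns all items in one batch per pass, which is an assumption: the lecture may update the centroids item by item instead. The helper names are illustrative.

```python
items = {"A": (5, 3), "B": (-1, 1), "C": (1, -2), "D": (-3, -2)}

def centroid(names):
    """Mean vector of the named items."""
    pts = [items[n] for n in names]
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

clusters = [["A", "B"], ["C", "D"]]  # initial partition from the example
initial_centroids = [centroid(c) for c in clusters]

# Reassign items until no item is closer to the other cluster's centroid.
changed = True
while changed:
    changed = False
    centroids = [centroid(c) for c in clusters]
    new = [[], []]
    for name, p in items.items():
        nearest = min((0, 1), key=lambda j: sq_dist(p, centroids[j]))
        new[nearest].append(name)
    if new != clusters:
        clusters, changed = new, True

# Squared distances from every item to each final centroid.
final_centroids = [centroid(c) for c in clusters]
table = {
    "".join(c): [sq_dist(items[n], m) for n in items]
    for c, m in zip(clusters, final_centroids)
}

print(initial_centroids)
print(clusters)
print(table)
```

Every item ends up nearest to the centroid of its own cluster, so no further reassignment takes place.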


Example: k-means clustering for the Iris dataset.

[Figure: cluster plot with axes Petal Length and Petal Width]


The results can be compared by forming a table of the cluster labels against the species column of the original data.
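Such a comparison table is a cross-tabulation of cluster label against species. The sketch below shows one way to build it. The six labels are hypothetical, for illustration only, and are not actual Iris results.

```python
from collections import Counter

def cross_table(clusters, species):
    """Count how many observations fall in each (cluster, species) pair."""
    return Counter(zip(clusters, species))

# Hypothetical labels for six flowers (not real Iris output).
cluster_labels = [1, 1, 2, 2, 3, 2]
species_labels = [
    "setosa", "setosa",
    "versicolor", "versicolor",
    "virginica", "virginica",
]

table = cross_table(cluster_labels, species_labels)

# One row per species, one column per cluster.
for species in sorted(set(species_labels)):
    row = [table[(c, species)] for c in sorted(set(cluster_labels))]
    print(species, row)
```

Rows whose counts are concentrated in a single column indicate a species recovered cleanly by one cluster. Counts spread over several columns indicate a species split between clusters.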
