Data Mining_Anoop Chaturvedi

Swayam Prabha

Course Title

Multivariate Data Mining- Methods and Applications

Lecture 29

Centroid and Non-hierarchical Clustering Methods

By

Anoop Chaturvedi

Department of Statistics, University of Allahabad

Prayagraj (India)

Slides can be downloaded from https://sites.google.com/view/anoopchaturvedi/swayam-prabha

Agglomerative and divisive clustering are the two primary approaches to hierarchical clustering. The agglomerative approach is more commonly used because it is simpler and computationally cheaper.

Centroid Clustering:

  • Each group formed is represented by the mean value of each variable (its centroid).
  • The distance between two groups is defined as the distance between their mean vectors.

Example:

Individual    1    2    3    4    5
Variable 1    1    1    6    8    8
Variable 2    1    2    3    2    0
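The first step of the example is to find the closest pair of individuals. The sketch below computes all pairwise distances for the data in the table above. It assumes Euclidean distance, which the surviving slide text does not state explicitly; the names `points` and `distances` are illustrative only.

```python
import math

# Example data: individual -> (variable 1, variable 2)
points = {1: (1, 1), 2: (1, 2), 3: (6, 3), 4: (8, 2), 5: (8, 0)}

# Euclidean distance for every unordered pair of individuals (assumed metric)
distances = {
    (i, j): math.dist(points[i], points[j])
    for i in points
    for j in points
    if i < j
}

# The pair with the smallest distance is fused first
closest_pair = min(distances, key=distances.get)

for pair, d in sorted(distances.items()):
    print(pair, round(d, 2))
print("closest pair:", closest_pair, "at distance", distances[closest_pair])
```

Under this assumption individuals 1 and 2 are the closest pair, which agrees with the group (12) formed in the next table.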

Individual    (12)    3    4    5
Variable 1    1       6    8    8
Variable 2    1.5     3    2    0

Individual    (12)    3    (45)
Variable 1    1       6    8
Variable 2    1.5     3    1

Fuse (45) and 3.

Individual    (12)    (345)
Variable 1    1       7.33
Variable 2    1.5     1.67

Finally, fuse (12) and (345) at a distance of 6.34 to form (12345).

[Figure: dendrogram with leaves 1, 2, 3, 4, 5]
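The whole fusion sequence of this example can be traced with a small sketch. It assumes Euclidean distance and takes the centroid of a group to be the mean of all its members; `centroid_clustering` is a hypothetical helper written for illustration, not a library routine.

```python
import math

# Example data: individual -> (variable 1, variable 2)
points = {1: (1, 1), 2: (1, 2), 3: (6, 3), 4: (8, 2), 5: (8, 0)}

def centroid(members):
    """Mean vector of the individuals in a group."""
    n = len(members)
    return tuple(sum(points[m][k] for m in members) / n for k in range(2))

def centroid_clustering(labels):
    """Repeatedly fuse the two groups whose centroids are closest."""
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = math.dist(centroid(clusters[a]), centroid(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], round(d, 2)))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges

merges = centroid_clustering(sorted(points))
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The printed order of fusions is 1 with 2, then 4 with 5, then 3 with (45), and finally (12) with (345).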

Median Clustering:

Disadvantages of centroid clustering:

  • If the sizes of the groups to be fused are very different, the centroid of the new group will be very close to that of the larger group and may remain within that group.
  • The characteristic properties of the smaller group are virtually lost.
  • It is sensitive to outliers.
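The first disadvantage can be seen numerically. The sketch below compares the size-weighted centroid update with the update used in median (Gower) clustering, where the new group is represented by the midpoint of the two fused centroids regardless of group size. The group sizes and positions are hypothetical, chosen only to make the contrast visible.

```python
def centroid_update(mean_a, size_a, mean_b, size_b):
    """Centroid method: size-weighted mean of the two group centroids."""
    total = size_a + size_b
    return tuple(
        (size_a * a + size_b * b) / total for a, b in zip(mean_a, mean_b)
    )

def median_update(mean_a, mean_b):
    """Median method: unweighted midpoint of the two group centroids."""
    return tuple((a + b) / 2 for a, b in zip(mean_a, mean_b))

# Hypothetical groups: 50 points centred at (0, 0), 2 points centred at (10, 10)
large_group, small_group = (0.0, 0.0), (10.0, 10.0)

weighted = centroid_update(large_group, 50, small_group, 2)
midpoint = median_update(large_group, small_group)

print("centroid method:", tuple(round(v, 2) for v in weighted))
print("median method:  ", midpoint)
```

With these numbers the weighted centroid stays close to the large group, while the midpoint gives both groups equal influence.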

Some Comments on Hierarchical Procedures:

  1. Hierarchical clustering methods are sensitive to outliers.
  2. There is no provision for reallocating objects that were incorrectly grouped at an early stage, so the final configuration of clusters should be carefully examined.
  3. The stability (robustness) of a clustering should be checked by applying the clustering algorithm before and after adding small errors to the data. If the groups are well distinguished, the clusterings should agree in both cases.

  4. One should try several clustering methods and distance measures and observe whether the results are consistent.
  5. Multiple solutions may occur because of ties. Such groupings should be properly interpreted, and the different dendrograms can be compared to assess their overlaps.
  6. An inversion occurs when an object joins an existing cluster at a smaller distance than that of a previous merge. Inversions sometimes occur with the centroid method or median clustering.
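Checking whether two clusterings agree, for example before and after perturbing the data or across two methods, needs a measure that ignores how the clusters are labelled. One common choice is the Rand index, the fraction of pairs of objects on which the two clusterings agree. The lecture does not name a specific measure, so this choice and the label vectors below are assumptions for illustration.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of object pairs treated the same way by both clusterings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two objects")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical cluster labels for five objects
original = [0, 0, 1, 1, 1]
relabelled = [1, 1, 0, 0, 0]   # same grouping, different label names
perturbed = [0, 0, 0, 1, 1]    # one object has changed cluster

print(rand_index(original, relabelled))
print(rand_index(original, perturbed))
```

A value of 1 means the two groupings are identical up to relabelling; lower values indicate that the clustering is less stable.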

Dendrogram with crossover
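A crossover of this kind can be produced with three points that form a nearly equilateral triangle. The sketch below uses hypothetical coordinates and Euclidean distance: the two closest points are fused first, and their centroid then lies closer to the third point than the two fused points were to each other.

```python
import math

# Hypothetical points forming a nearly equilateral triangle
a, b, c = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

# a and b are the closest pair (distance 2.0; the other two pairs are about 2.06)
first_merge = math.dist(a, b)
assert first_merge < math.dist(a, c) and first_merge < math.dist(b, c)

# Centroid method: represent the fused group (ab) by its mean vector
centroid_ab = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

# Distance at which c joins the group (ab)
second_merge = math.dist(centroid_ab, c)

inversion = second_merge < first_merge
print("first merge at", first_merge)
print("second merge at", second_merge)
print("inversion:", inversion)
```

Because the second fusion happens at a smaller distance than the first, the dendrogram branches cross.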

Nonhierarchical or Partitioning Clustering Methods:

  • The number of clusters K is fixed in advance.
  • No distance matrix needs to be defined, so the method can be applied to large data sets.
  • Initially, select K seed points at random.
  • Partition the items into K clusters.
  • Then move points into or out of clusters according to some criterion.
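The steps above can be sketched as a K-means style procedure. The bullets leave the reallocation criterion open, so this sketch assumes the usual one: each point goes to the cluster with the nearest centroid in squared Euclidean distance, and centroids are recomputed until no point moves. The function name `k_means_sketch` and the data are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def k_means_sketch(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # K seed points chosen at random
    assignment = None
    for _ in range(max_iter):
        # Assign every point to the cluster with the nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:    # no point changed cluster
            break
        assignment = new_assignment
        # Recompute each centroid from its current members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids

# Hypothetical data: two well-separated groups of three points each
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres = k_means_sketch(data, k=2)
print(labels)
print(centres)
```

Only point-to-centroid distances are needed at each pass, which is why no full distance matrix has to be stored.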

To check stability, rerun the algorithm with a new set of initial groups.

Example:

Item    Observations
A        5     3
B       -1     1
C        1    -2
D       -3    -2

Group    Mean
(AB)      2     2
(CD)     -1    -2

Squared distances of each item from the centroids of the groups A and (BCD):

Group     A     B     C     D
A         0    40    41    89
(BCD)    52     4     5     5
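The numbers in this example can be checked directly from the four observations. The sketch below recomputes the means of the initial groups (AB) and (CD) and the squared Euclidean distance of every item from the centroids of A and (BCD); the helper names are illustrative only.

```python
# Observations from the example: item -> (first value, second value)
items = {"A": (5, 3), "B": (-1, 1), "C": (1, -2), "D": (-3, -2)}

def centroid(names):
    """Mean vector of the named items, e.g. centroid("AB")."""
    return tuple(
        sum(items[n][k] for n in names) / len(names) for k in range(2)
    )

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Means of the initial groups
initial_means = {group: centroid(group) for group in ("AB", "CD")}

# Squared distance of each item from the centroids of A and (BCD)
final_means = {group: centroid(group) for group in ("A", "BCD")}
distance_table = {
    group: [squared_distance(items[n], centre) for n in "ABCD"]
    for group, centre in final_means.items()
}

print(initial_means)
print(distance_table)
```

Every item turns out to be closest to the centroid of its own group, so no further reallocation takes place.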

Example: k-means clustering for the Iris dataset.

[Figure: plot of Petal Length against Petal Width]

22 of 25

Data Mining_Anoop Chaturvedi

22

23 of 25

Data Mining_Anoop Chaturvedi

23

Compare the results by forming a table against the species column of the original data.
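Such a table is a cross-tabulation of the known species against the cluster labels. The sketch below shows the idea using only the standard library. The six label pairs are made up for illustration and are not results from the Iris data; `cross_tabulate` is a hypothetical helper.

```python
from collections import Counter

# Made-up labels for illustration only (not actual Iris clustering output)
species = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
cluster = [1, 1, 2, 3, 3, 3]

def cross_tabulate(rows, cols):
    """Count how often each (row label, column label) pair occurs."""
    return Counter(zip(rows, cols))

table = cross_tabulate(species, cluster)

cluster_ids = sorted(set(cluster))
print("species", cluster_ids)
for name in sorted(set(species)):
    print(name, [table[(name, c)] for c in cluster_ids])
```

Rows with a single non-zero entry correspond to species that the clustering recovers cleanly; counts spread over several columns show where species are mixed.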

Cluster plot:
