1 of 22

Determining the Number of Clusters

  • Course: Databases and Data Mining
  • Cluster Validation Techniques
  • Instructor: Jamolbek Mattiev

2 of 22

Learning Objectives

  • Understand the importance of K selection
  • Learn the Elbow and Silhouette methods
  • Compare validation techniques
  • Apply the methods in WEKA

3 of 22

Why Choosing K Matters

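  • Too few clusters → underfitting (distinct groups get merged)
  • Too many clusters → overfitting (natural groups get split, noise gets modeled)
  • K affects how interpretable the clusters are
  • K affects the decisions made from the clusters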

4 of 22

Cluster Validation Categories

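  • Internal validation: uses only the clustered data (compactness, separation)
  • External validation: compares clusters against known class labels
  • Relative validation: compares results across settings, such as different K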

5 of 22

Internal Validation Metrics

  • WCSS (compactness)
  • Silhouette score
  • Davies–Bouldin index

6 of 22

Elbow Method

  • Compute WCSS for a range of K values
  • Plot WCSS against K
  • Look for the elbow, where the curve starts to flatten
  • Choose K at the elbow: beyond it, extra clusters reduce WCSS only slightly
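
One way to make "look for the elbow" less subjective is to pick the K where the WCSS curve bends most sharply, measured by the largest discrete second difference. The sketch below is only one possible heuristic. The function name `elbow_k` and the WCSS values are illustrative assumptions, not results from any dataset.

```python
def elbow_k(ks, wcss):
    """Return the K where the WCSS curve has the largest second difference.

    ks   -- consecutive candidate K values, e.g. [1, 2, 3, ...]
    wcss -- WCSS measured at each K (same length as ks)
    """
    if len(ks) != len(wcss) or len(ks) < 3:
        raise ValueError("need at least three (K, WCSS) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    # Endpoints lack a neighbour on one side, so only interior K values qualify.
    for i in range(1, len(ks) - 1):
        bend = wcss[i - 1] - 2 * wcss[i] + wcss[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical WCSS values, for illustration only.
ks = [1, 2, 3, 4, 5, 6]
wcss_values = [1000.0, 520.0, 210.0, 180.0, 160.0, 150.0]
print(elbow_k(ks, wcss_values))
```

When the curve is smooth, the largest bend may be barely larger than its neighbours. In that case the result should be treated as a hint and checked against another metric.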

7 of 22

WCSS Formula

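  • WCSS = Σ_{k=1}^{K} Σ_{x ∈ C_k} ||x − μ_k||²
  • C_k is the k-th cluster and μ_k is its centroid (mean)
  • Measures within-cluster scatter: the sum of squared distances to the centroid
  • Lower means more compact clusters, but WCSS tends to fall as K grows, so compare it across K rather than minimizing it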
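
The formula can be computed directly from points and their cluster labels. This is a minimal sketch, assuming points are tuples of numbers and using Euclidean distance. The function name `wcss` and the sample points are illustrative.

```python
def wcss(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(wcss(points, [0, 0, 1, 1]))  # two compact clusters
print(wcss(points, [0, 0, 0, 0]))  # one cluster covering everything
```

The second call gives a much larger value than the first, which is the drop the Elbow method looks for when moving from K=1 to K=2.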

8 of 22

Limitations of Elbow

  • The elbow is not always clear
  • Reading the plot is subjective
  • Depends on dataset structure (overlapping clusters give a smooth curve)

9 of 22

Silhouette Method

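  • Measures cohesion and separation: how close each point is to its own cluster versus the nearest other cluster
  • For point i: s(i) = (b(i) − a(i)) / max(a(i), b(i))
  • a(i) = mean distance to points in its own cluster; b(i) = mean distance to points in the nearest other cluster
  • Range: −1 to 1
  • Higher average value indicates better clustering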

10 of 22

Silhouette Interpretation

  • Close to 1 → point is well matched to its cluster
  • Around 0 → point lies between overlapping clusters
  • Negative → point is probably assigned to the wrong cluster
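
The interpretation above can be checked on a small example. This sketch computes the mean silhouette from its definition, assuming Euclidean distance and the usual convention that a point in a singleton cluster scores 0. The function name and sample points are illustrative, and `math.dist` requires Python 3.8 or later.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points, computed from the definition."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other points in the same cluster
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_score(points, [0, 0, 1, 1]))  # sensible grouping
print(silhouette_score(points, [0, 1, 0, 1]))  # deliberately wrong grouping
```

The sensible grouping scores close to 1, while the deliberately wrong grouping scores below 0. This version compares every pair of points, so it is only practical for small datasets.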

11 of 22

Davies–Bouldin Index

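  • For each cluster, takes the worst-case ratio of within-cluster scatter to between-centroid distance, then averages over clusters
  • Lower value → better clustering (compact, well-separated clusters)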
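
A minimal sketch of the index, assuming Euclidean distance and scatter defined as the mean distance of a cluster's points to its centroid. The function name and sample points are illustrative. Two clusters with identical centroids would cause a division by zero, which this sketch does not handle.

```python
import math


def davies_bouldin(points, labels):
    """Average over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("Davies-Bouldin needs at least two clusters")
    centroids, scatter = {}, {}
    for lab, members in clusters.items():
        centre = tuple(sum(axis) / len(members) for axis in zip(*members))
        centroids[lab] = centre
        scatter[lab] = sum(math.dist(p, centre) for p in members) / len(members)
    total = 0.0
    for i in clusters:
        # Worst case for cluster i: the most similar (least separated) other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(davies_bouldin(points, [0, 0, 1, 1]))  # compact, well separated
print(davies_bouldin(points, [0, 1, 0, 1]))  # stretched, overlapping
```

The first grouping yields a far smaller index than the second, matching the "lower is better" reading.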

12 of 22

Gap Statistic

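  • Compares log(WCSS) of the clustering with its expected value under a random (uniform) reference distribution
  • Standard rule: choose the smallest K with Gap(K) ≥ Gap(K+1) − s(K+1), where s(K+1) accounts for simulation error in the reference samples
  • Simpler alternative: choose K with maximum gap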
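
The selection step can be sketched once gap values and their error terms have been computed. The sketch assumes those values are already available. It applies the "smallest K with Gap(K) ≥ Gap(K+1) − s(K+1)" rule and falls back to the maximum gap if no K satisfies it. The function name and all numbers are hypothetical.

```python
def choose_k_by_gap(ks, gaps, errs):
    """Smallest K with Gap(K) >= Gap(K+1) - s(K+1); else the K with the largest gap.

    ks   -- consecutive candidate K values
    gaps -- gap statistic at each K
    errs -- error term s(K) at each K
    """
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - errs[i + 1]:
            return ks[i]
    return ks[max(range(len(ks)), key=lambda i: gaps[i])]


# Hypothetical gap values and error terms, for illustration only.
ks = [1, 2, 3, 4, 5]
gaps = [0.10, 0.35, 0.60, 0.62, 0.61]
errs = [0.02, 0.03, 0.03, 0.04, 0.04]
print(choose_k_by_gap(ks, gaps, errs))
```

In this made-up example the rule selects K=3 even though the largest gap is at K=4, because the improvement from 3 to 4 is within the error term.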

13 of 22

Practical Steps in WEKA

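  • Run clustering (e.g., SimpleKMeans) for K = 2 to 10
  • Record WCSS for each K (SimpleKMeans reports it as "Within cluster sum of squared errors")
  • Compare cluster sizes
  • Use the visualization tool to inspect cluster assignments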

14 of 22

Experimental Workflow

  • 1. Select clustering algorithm
  • 2. Test multiple K values
  • 3. Record evaluation metrics
  • 4. Select optimal K
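
Outside WEKA, the same workflow can be sketched in a few lines. The example below assumes plain k-means (Lloyd's algorithm) as the algorithm in step 1, WCSS as the recorded metric, and synthetic data drawn around three made-up centres. All function names and parameters are illustrative.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


def wcss(points, labels, centroids):
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def wcss_by_k(points, k_values, restarts=5):
    """Steps 2 and 3: run each K several times and record the lowest WCSS."""
    results = {}
    for k in k_values:
        results[k] = min(
            wcss(points, *kmeans(points, k, seed=s)) for s in range(restarts)
        )
    return results


# Synthetic data for illustration: three blobs around made-up centres.
rng = random.Random(42)
centres = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [
    (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for cx, cy in centres for _ in range(20)
]

results = wcss_by_k(points, range(1, 7))
for k, value in results.items():
    print(f"K={k}: WCSS={value:.1f}")
```

Step 4 is left to the reader: inspect the printed values (or plot them) and pick K where the decrease levels off, ideally confirming the choice with a second metric. Several restarts per K are used because k-means can converge to a poor local optimum from an unlucky start.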

15 of 22

Comparing Methods

  • Elbow → simple, but subjective
  • Silhouette → informative, but costly on large datasets (pairwise distances)
  • DB Index → compares scatter with separation
  • Use multiple metrics together

16 of 22

Common Mistakes

  • Choosing K arbitrarily
  • Ignoring data scaling (features with large ranges dominate distance calculations)
  • Overinterpreting weak clusters
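
The scaling mistake is usually avoided by standardizing each feature before clustering. Z-score standardization is one common option, and min-max scaling is another. This is a minimal sketch. The function name and the (age, income) records are hypothetical.

```python
import statistics


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column (stdev 0) carries no information; map it to 0.
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical records: (age in years, income in dollars)
rows = [(25, 30000), (35, 60000), (45, 90000)]
for row in standardize(rows):
    print(row)
```

Before scaling, income differences in the tens of thousands would swamp age differences in any Euclidean distance. After scaling, both features contribute on the same footing.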

17 of 22

Real-World Considerations

  • Domain knowledge matters
  • Business interpretability
  • Computational cost

18 of 22

Advanced Considerations

  • Stability analysis
  • Resampling validation
  • Consensus clustering
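
Stability analysis needs a way to measure how much two clusterings of the same points agree, for example clusterings obtained from different resamples or random starts. One simple choice, assumed here for illustration, is the Rand index: the fraction of point pairs that both clusterings treat the same way. The function name is illustrative.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree (1.0 = identical)."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two labelings of the same two or more points")
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b  # both together, or both apart
        total += 1
    return agree / total


print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, renamed labels
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # different grouping
```

The index ignores label names, so the first pair counts as fully stable. A value of K whose clusterings keep high agreement across repeated runs is a more trustworthy choice than one whose clusterings change from run to run.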

19 of 22

Discussion Questions

  • Can different methods suggest different K?
  • Which method is most reliable?
  • How should clustering results be validated?

20 of 22

Summary

  • K selection is critical
  • Use multiple validation approaches
  • Combine metrics and domain insight
  • Always validate clustering results

21 of 22

Elbow Method Visualization

22 of 22

Silhouette Score Visualization