1 of 22

Determining the Number of Clusters

Course: Databases and Data Mining
Cluster Validation Techniques
Instructor: Jamolbek Mattiev

2 of 22

Learning Objectives

• Understand importance of K selection
• Learn Elbow and Silhouette methods
• Compare validation techniques
• Apply methods in WEKA

3 of 22

Why Choosing K Matters

• Too few clusters → distinct groups are merged (underfitting)
• Too many clusters → natural groups are split and noise is modeled (overfitting)
• Affects interpretability
• Impacts decision-making

4 of 22

Cluster Validation Categories

• Internal validation
• External validation
• Relative validation

5 of 22

Internal Validation Metrics

• WCSS (compactness)
• Silhouette score
• Davies–Bouldin index

6 of 22

Elbow Method

• Compute WCSS for different K
• Plot K vs WCSS
• Look for elbow point
• Choose K at the bend, where adding clusters stops reducing WCSS by much
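The "look for the elbow" step can be approximated numerically. The sketch below uses one common heuristic, the largest second difference of the WCSS curve, which is an assumption here and not the only way to locate the bend. The function name elbow_k and the WCSS values are hypothetical and only illustrate the idea. It assumes evenly spaced K values.

```python
def elbow_k(ks, wcss_values):
    """Return the K where the WCSS curve bends most sharply.

    Heuristic (an assumption, not a standard definition): pick the K with the
    largest second difference, i.e. the biggest fall-off between the WCSS
    reduction gained by reaching K and the reduction gained by going to K + 1.
    """
    if len(ks) != len(wcss_values) or len(ks) < 3:
        raise ValueError("need at least three (K, WCSS) pairs")
    best_k, best_bend = None, float("-inf")
    # The first and last K have no neighbour on one side, so they are skipped.
    for i in range(1, len(ks) - 1):
        gain_before = wcss_values[i - 1] - wcss_values[i]
        gain_after = wcss_values[i] - wcss_values[i + 1]
        bend = gain_before - gain_after
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical WCSS values for K = 1..6 (illustrative only, not measured).
ks = [1, 2, 3, 4, 5, 6]
wcss_values = [1000.0, 600.0, 200.0, 170.0, 150.0, 140.0]
chosen = elbow_k(ks, wcss_values)
print("elbow at K =", chosen)
```

When the curve has no clear bend, this heuristic still returns some K, so the plot should be inspected as well.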

7 of 22

WCSS Formula

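WCSS = Σ_{k=1..K} Σ_{x ∈ C_k} ||x − μ_k||²
C_k is the set of points in cluster k, μ_k is its centroid
Measures within-cluster scatter (sum of squared distances to centroids)
Lower means more compact clusters, but WCSS keeps falling as K grows, so it cannot simply be minimized over K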
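A minimal sketch of the formula, assuming points are stored as tuples of numbers and cluster membership as a parallel list of labels. The function names centroid and wcss are illustrative, not taken from WEKA or any library.

```python
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def wcss(points, labels):
    """Sum, over all clusters, of squared Euclidean distances to the centroid."""
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    total = 0.0
    for members in clusters.values():
        mu = centroid(members)
        total += sum(sum((a - b) ** 2 for a, b in zip(p, mu)) for p in members)
    return total


# Four toy points forming two obvious pairs.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
two_clusters = wcss(points, [0, 0, 1, 1])
one_cluster = wcss(points, [0, 0, 0, 0])
print(two_clusters, one_cluster)
```

Putting all four points in one cluster gives a much larger value than splitting them into the two pairs, which is the drop the Elbow method looks for.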

8 of 22

Limitations of Elbow

• Elbow not always clear
• Subjective interpretation
• Depends on dataset structure

9 of 22

Silhouette Method

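• Measures both cohesion (a: mean distance to own cluster) and separation (b: mean distance to the nearest other cluster)
• Per point: s = (b − a) / max(a, b)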
• Range: -1 to 1
• Higher value indicates better clustering
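A small sketch of the silhouette computation, assuming Euclidean distance and the usual per-point definition s = (b − a) / max(a, b), where a is the mean distance to the point's own cluster and b is the mean distance to the nearest other cluster. The function name is illustrative, and scoring singleton clusters as 0 is a convention adopted here.

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette values for a labelled set of points."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention used here for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_scores(points, [0, 0, 1, 1])  # labels follow the two pairs
bad = silhouette_scores(points, [0, 1, 0, 1])   # labels cut across the pairs
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
print(round(mean_good, 3), round(mean_bad, 3))
```

The mean over all points is the score usually reported for a given K. The labelling that cuts across the natural pairs produces negative values.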

10 of 22

Silhouette Interpretation

• Close to 1 → compact, well-separated clusters
• Around 0 → overlapping clusters (points lie near a boundary)
• Negative → point is probably assigned to the wrong cluster

11 of 22

Davies–Bouldin Index

• Average, over clusters, of the worst-case ratio of within-cluster scatter to between-centroid distance
• Lower value → better clustering
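A sketch of the Davies–Bouldin index, assuming Euclidean distance and scatter defined as the mean distance of a cluster's points to its centroid. The function name is illustrative, and the code assumes no two clusters share the same centroid (that would divide by zero).

```python
import math


def davies_bouldin(points, labels):
    """Mean over clusters of the worst (S_i + S_j) / d(mu_i, mu_j) ratio."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("Davies-Bouldin needs at least two clusters")
    centroids = {
        lab: tuple(sum(coords) / len(members) for coords in zip(*members))
        for lab, members in clusters.items()
    }
    # S_i: mean distance from the members of cluster i to its centroid
    scatter = {
        lab: sum(math.dist(p, centroids[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    }
    worst_ratios = []
    for i in clusters:
        worst_ratios.append(
            max(
                (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                for j in clusters
                if j != i
            )
        )
    return sum(worst_ratios) / len(worst_ratios)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
tight = davies_bouldin(points, [0, 0, 1, 1])  # labels follow the two pairs
loose = davies_bouldin(points, [0, 1, 0, 1])  # labels cut across the pairs
print(tight, loose)
```

The labelling that follows the natural pairs gets the lower (better) value.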

12 of 22

Gap Statistic

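• Compares log(WCSS) on the data with its expected value on random reference data that has no cluster structure
• Choose the smallest K with Gap(K) ≥ Gap(K+1) − s(K+1), where s is the standard error of the reference term; taking the K with maximum gap is a common simplification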
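A rough sketch of the gap computation, under several assumptions: the reference data are drawn uniformly from the bounding box of the data, k-means is a plain Lloyd's algorithm with a deterministic farthest-point start (chosen for reproducibility, not what WEKA uses), and the toy data are hypothetical. All function names are illustrative.

```python
import math
import random
import statistics


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wcss(points, k, max_iters=50):
    """Run Lloyd's k-means (farthest-point start) and return the final WCSS."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


def gap_statistic(points, k_values, n_refs=10, seed=0):
    """Return ({K: gap}, {K: s}) using uniform bounding-box reference sets."""
    rng = random.Random(seed)
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    gaps, errors = {}, {}
    for k in k_values:
        log_w = math.log(kmeans_wcss(points, k))  # assumes WCSS > 0
        ref_logs = []
        for _ in range(n_refs):
            ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                   for _ in points]
            ref_logs.append(math.log(kmeans_wcss(ref, k)))
        gaps[k] = statistics.fmean(ref_logs) - log_w
        errors[k] = statistics.pstdev(ref_logs) * math.sqrt(1 + 1 / n_refs)
    return gaps, errors


# Hypothetical toy data: three tight groups of five points each.
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
group_centers = [(0, 0), (10, 0), (5, 9)]
points = [(cx + dx, cy + dy) for cx, cy in group_centers for dx, dy in offsets]

gaps, errors = gap_statistic(points, [1, 2, 3, 4])
for k in sorted(gaps):
    print(f"K={k}  gap={gaps[k]:.3f}  s={errors[k]:.3f}")
```

With so few points and reference sets the estimates are noisy, so the printed values should be read as an illustration of the mechanics only.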

13 of 22

Practical Steps in WEKA

• Run SimpleKMeans for K = 2 to 10 (numClusters option)
• Record WCSS (reported in the output as "Within cluster sum of squared errors")
• Compare cluster sizes
• Use visualization tool

14 of 22

Experimental Workflow

1. Select clustering algorithm
2. Test multiple K values
3. Record evaluation metrics
4. Select optimal K
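The workflow above can be sketched outside WEKA as a simple loop. This is a minimal illustration, assuming a plain Lloyd's k-means with a deterministic farthest-point start (a stand-in for whichever algorithm is selected in step 1) and hypothetical toy data. Only WCSS and cluster sizes are recorded. Other metrics would be added in the same loop.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iters=50):
    """Lloyd's k-means with a farthest-point start. Returns (labels, wcss)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, wcss


# Hypothetical toy data: three tight groups of five points each.
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
group_centers = [(0, 0), (10, 0), (5, 9)]
points = [(cx + dx, cy + dy) for cx, cy in group_centers for dx, dy in offsets]

# Steps 2 and 3: test several K values and record the metric for each.
results = {}
for k in range(1, 7):
    labels, wcss = kmeans(points, k)
    sizes = [labels.count(j) for j in range(k)]
    results[k] = wcss
    print(f"K={k}  WCSS={wcss:8.2f}  cluster sizes={sizes}")
```

Step 4 is left to the reader: on this toy data the recorded WCSS drops sharply up to K = 3 and only slightly afterwards.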

15 of 22

Comparing Methods

Elbow → simple, but subjective
Silhouette → informative, per-point diagnostics
DB Index → compactness-to-separation ratio
Use multiple metrics together

16 of 22

Common Mistakes

• Choosing arbitrary K
• Ignoring data scaling
• Overinterpreting weak clusters
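Why scaling matters can be shown with a few records. The sketch below applies z-score standardization, one common choice among several, to hypothetical (age, income) rows and compares Euclidean distances before and after. The data and the function name standardize are made up for illustration.

```python
import math
import statistics


def standardize(rows):
    """Z-score each column: subtract the mean, divide by the population std dev."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical records: (age in years, annual income in dollars).
rows = [(25, 50_000), (60, 51_000), (26, 54_000), (58, 90_000)]
scaled = standardize(rows)

# Person 0 vs person 1: very different age, almost the same income.
# Person 0 vs person 2: almost the same age, somewhat different income.
raw_to_1 = math.dist(rows[0], rows[1])
raw_to_2 = math.dist(rows[0], rows[2])
scaled_to_1 = math.dist(scaled[0], scaled[1])
scaled_to_2 = math.dist(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_to_1:.1f}  d(0,2)={raw_to_2:.1f}")
print(f"scaled: d(0,1)={scaled_to_1:.3f}  d(0,2)={scaled_to_2:.3f}")
```

On the raw data the income column dominates the distance, so the 25-year-old looks closer to the 60-year-old than to the 26-year-old. After standardization the order reverses.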

17 of 22

Real-World Considerations

• Domain knowledge matters
• Business interpretability
• Computational cost

18 of 22

Advanced Considerations

• Stability analysis
• Resampling validation
• Consensus clustering

19 of 22

Discussion Questions

• Can methods suggest different K?
• Which method is most reliable?
• How to validate results?

20 of 22

Summary

• K selection is critical
• Use multiple validation approaches
• Combine metrics and domain insight
• Always validate clustering results

21 of 22

Elbow Method Visualization

22 of 22

Silhouette Score Visualization