1 of 36

Clustering: K-Means

2 of 36

Supervised vs. Unsupervised Learning


  • Supervised Learning: building a model from labeled data
  • Unsupervised Learning: clustering from unlabeled data

3 of 36

Data Clustering


4 of 36

Data Clustering: Similarity

  • The only information clustering uses is the mutual similarity between samples

  • A good clustering is one that achieves:
    • high within-cluster similarity
    • low inter-cluster similarity


5 of 36

Unlabeled Data

  • How do we measure similarity?
    • By distance: the closer two samples are, the more similar they are
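
The distance-as-similarity idea can be made concrete with Euclidean distance; a minimal sketch (the sample values are illustrative):

```python
import numpy as np

# Two hypothetical 2-D samples
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: a small distance means high similarity
dist = np.linalg.norm(a - b)
print(dist)  # 5.0
```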


6 of 36

 


7 of 36

Assigning Points


8 of 36

Assigning Points


9 of 36

Recomputing the Cluster Centers


10 of 36

Recomputing the Cluster Centers


11 of 36

Assigning Points


12 of 36

Assigning Points


13 of 36

Recomputing the Cluster Centers


14 of 36

Recomputing the Cluster Centers


15 of 36

Repeat


16 of 36

Final Clustering


17 of 36

K-Means: (Iterative) Algorithm

1) Initialization

  • Choose k initial cluster centers (e.g., k randomly selected samples)

18 of 36

K-Means: (Iterative) Algorithm

2) Iteration

  • Repeat until convergence
    • A possible convergence criterion: the cluster centers no longer change


19 of 36

K-Means: (Iterative) Algorithm


20 of 36

Discussion: “Chicken and Egg” Dilemma


21 of 36

The Power of Continuous Improvement: Lessons from K-Means

  • The K-Means clustering algorithm demonstrates a powerful lesson in problem-solving

  • When faced with uncertainty or a dilemma, start taking action and refine as you go (even if your initial steps are imperfect)
  • Avoid trying to plan everything in exhaustive detail from the very beginning

  • This philosophy of “start, iterate, improve” applies not only to clustering but to broader problem-solving strategies:
    • In Decision-Making: When facing uncertainty, take the first step rather than waiting for perfect conditions.
    • In Learning: Start by attempting, then refine your understanding through experience.
    • In Personal Growth: Progress often begins with small, imperfect actions that improve over time.


22 of 36

K-Means in Python


23 of 36

Python: Data Generation
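
The slide's listing is not preserved in this transcript; a minimal NumPy sketch of generating clustered toy data (the centers, spread, and counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

# Three Gaussian blobs around hand-picked 2-D centers (values are illustrative)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 0.8, size=(100, 2)) for c in blob_centers])
print(X.shape)  # (300, 2)
```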


24 of 36

Python: Data Generation and Random Initialization
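
A sketch of the random-initialization step, assuming a data matrix `X` like the one generated above: pick k distinct samples as the initial cluster centers.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # stand-in for the generated dataset

k = 3
# Random initialization: choose k distinct samples as initial centers
init_idx = rng.choice(len(X), size=k, replace=False)
centers = X[init_idx]
print(centers.shape)  # (3, 2)
```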


25 of 36

Python: K-Means
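
The original listing is not preserved; one plain-NumPy way to implement the iterative algorithm described earlier (the function name and defaults are assumptions):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """K-Means: alternate nearest-center assignment and center updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Convergence: stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```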


26 of 36

Python: K-Means in Scikit-learn
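
In scikit-learn the same steps are a single estimator; a minimal usage sketch (the data is synthetic, and the parameters shown are standard `KMeans` options):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

# n_init restarts the algorithm with different random initializations
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_.shape)           # (100,)
print(km.cluster_centers_.shape)  # (2, 2)
```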


27 of 36

Some Issues in K-Means


28 of 36

Initialization Issues

  • K-Means is extremely sensitive to cluster center initialization

  • Bad initialization can lead to
    • Poor convergence speed
    • Bad overall clustering

  • Safeguarding measures:
    • Choose the first center as one of the examples, the second as the example farthest from the first, the third as the example farthest from both, and so on
    • Try multiple initializations and keep the best result
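
The farthest-point safeguard above can be sketched as follows (the function name is an assumption):

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    """Pick the first center at random; each next center is the sample
    farthest from all centers chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance from every sample to its nearest already-chosen center
        d = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                           axis=2).min(axis=1)
        centers.append(X[d.argmax()])  # take the most remote sample
    return np.array(centers)
```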


29 of 36

Choosing the Number of Clusters
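
This slide's content is not preserved; one common heuristic (an assumption here, not necessarily the slide's) is the elbow method: compute the within-cluster sum of squares for a range of k values and look for the bend where improvement levels off.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in ([0, 0], [4, 0], [0, 4])])

# Inertia = within-cluster sum of squared distances; it always decreases
# with k, but the gains flatten past the "true" number of clusters
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]
```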


30 of 36

Choosing the Number of Clusters


31 of 36

Discussion (1/2)


32 of 36

Discussion (2/2)


33 of 36

K-Means: Limitations (1/4)

  • Hard Assignments
    • A point either belongs to a cluster completely or does not belong at all
    • No notion of a soft assignment (i.e., a probability of belonging to each cluster)
    • Gaussian mixture models (which we will study later) and fuzzy K-means allow soft assignments
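
The soft assignment mentioned above can be illustrated with scikit-learn's Gaussian mixture model (the data is synthetic; `GaussianMixture` and its parameters are standard scikit-learn API):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)  # one membership probability per cluster
print(probs.shape)  # (100, 2)
```

Each row of `probs` sums to 1, so a point can belong 90% to one cluster and 10% to another instead of being forced into a hard choice.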


34 of 36

K-Means: Limitations (2/4)

  • Sensitive to outlier examples
    • The K-medians algorithm is a more robust alternative for data with outliers


35 of 36

K-Means: Limitations (3/4)


36 of 36

K-Means: Limitations (4/4)

  • Imbalanced Cluster Sizes (or Clusters with Different Densities)
