1 of 36

Clustering: K-Means

2 of 36

Supervised vs. Unsupervised Learning


  • Supervised Learning: building a model from labeled data
  • Unsupervised Learning: clustering from unlabeled data

3 of 36

Data Clustering


4 of 36

Data Clustering: Similarity

  • The only information clustering uses is the mutual similarity between samples

  • A good clustering is one that achieves:
    • high within-cluster similarity
    • low inter-cluster similarity


5 of 36

Unlabeled Data

  • How do we measure similarity?
    • By distance: the closer two samples are, the more similar they are
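
The distance-as-similarity idea can be made concrete with Euclidean distance; a minimal sketch (the sample values are illustrative):

```python
import numpy as np

# Two hypothetical 2-D samples
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: a small distance means high similarity
dist = np.linalg.norm(a - b)
print(dist)  # 5.0
```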


6 of 36

 


7 of 36

Assigning Points


8 of 36

Assigning Points


9 of 36

Recomputing the Cluster Centers


10 of 36

Recomputing the Cluster Centers


11 of 36

Assigning Points


12 of 36

Assigning Points


13 of 36

Recomputing the Cluster Centers


14 of 36

Recomputing the Cluster Centers


15 of 36

Repeat


16 of 36

Final Clustering


17 of 36

K-Means: (Iterative) Algorithm

1) Initialization

  • Choose k initial cluster centers (e.g., k randomly selected samples)

18 of 36

K-Means: (Iterative) Algorithm

2) Iteration

  • Repeat until convergence
    • A possible convergence criterion: the cluster centers no longer change


19 of 36

K-Means: (Iterative) Algorithm


20 of 36

Discussion: “Chicken and Egg” Dilemma


21 of 36

The Power of Continuous Improvement: Lessons from K-Means

  • The K-Means clustering algorithm demonstrates a powerful lesson in problem-solving

  • When faced with uncertainty or a dilemma, start taking action and refine as you go (even if your initial steps are imperfect)
  • Avoid trying to plan everything in exhaustive detail from the very beginning

  • This philosophy of “start, iterate, improve” applies not only to clustering but to broader problem-solving strategies:
    • In Decision-Making: When facing uncertainty, take the first step rather than waiting for perfect conditions.
    • In Learning: Start by attempting, then refine your understanding through experience.
    • In Personal Growth: Progress often begins with small, imperfect actions that improve over time.


22 of 36

K-Means in Python


23 of 36

Python: Data Generation
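
The slide's listing is not preserved in this transcript; a minimal NumPy sketch of generating clustered toy data (the centers, spread, and counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

# Three Gaussian blobs around hand-picked 2-D centers (values are illustrative)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 0.8, size=(100, 2)) for c in blob_centers])
print(X.shape)  # (300, 2)
```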


24 of 36

Python: Data Generation and Random Initialization
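
A sketch of the random-initialization step, assuming a data matrix `X` like the one generated above: pick k distinct samples as the initial cluster centers.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # stand-in for the generated dataset

k = 3
# Random initialization: choose k distinct samples as initial centers
init_idx = rng.choice(len(X), size=k, replace=False)
centers = X[init_idx]
print(centers.shape)  # (3, 2)
```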


25 of 36

Python: K-Means
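
The original listing is not preserved; one plain-NumPy way to implement the iterative algorithm described earlier (the function name and defaults are assumptions):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """K-Means: alternate nearest-center assignment and center updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Convergence: stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```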


26 of 36

Python: K-Means in Scikit-learn
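
In scikit-learn the same steps are a single estimator; a minimal usage sketch (the data is synthetic, and the parameters shown are standard `KMeans` options):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

# n_init restarts the algorithm with different random initializations
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_.shape)           # (100,)
print(km.cluster_centers_.shape)  # (2, 2)
```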


27 of 36

Some Issues in K-Means


28 of 36

Initialization Issues

  • K-Means is extremely sensitive to cluster center initialization

  • Bad initialization can lead to
    • Poor convergence speed
    • Bad overall clustering

  • Safeguarding measures:
    • Choose the first center as one of the examples, the second as the example farthest from the first, the third as the example farthest from both, and so on
    • Try multiple initializations and keep the best result
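
The farthest-point safeguard above can be sketched as follows (the function name is an assumption):

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    """Pick the first center at random; each next center is the sample
    farthest from all centers chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance from every sample to its nearest already-chosen center
        d = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                           axis=2).min(axis=1)
        centers.append(X[d.argmax()])  # take the most remote sample
    return np.array(centers)
```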


29 of 36

Choosing the Number of Clusters
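
This slide's content is not preserved; one common heuristic (an assumption here, not necessarily the slide's) is the elbow method: compute the within-cluster sum of squares for a range of k values and look for the bend where improvement levels off.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in ([0, 0], [4, 0], [0, 4])])

# Inertia = within-cluster sum of squared distances; it always decreases
# with k, but the gains flatten past the "true" number of clusters
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]
```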


30 of 36

Choosing the Number of Clusters


31 of 36

Discussion (1/2)


32 of 36

Discussion (2/2)


33 of 36

K-Means: Limitations (1/4)

  • Hard Assignments
    • A point either belongs to a cluster completely or does not belong at all
    • No notion of a soft assignment (i.e., a probability of belonging to each cluster)
    • Gaussian mixture models (which we will study later) and fuzzy K-means allow soft assignments
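
The soft assignment mentioned above can be illustrated with scikit-learn's Gaussian mixture model (the data is synthetic; `GaussianMixture` and its parameters are standard scikit-learn API):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)  # one membership probability per cluster
print(probs.shape)  # (100, 2)
```

Each row of `probs` sums to 1, so a point can belong 90% to one cluster and 10% to another instead of being forced into a hard choice.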


34 of 36

K-Means: Limitations (2/4)

  • Sensitive to outlier examples
    • The K-medians algorithm is a more robust alternative for data with outliers


35 of 36

K-Means: Limitations (3/4)


36 of 36

K-Means: Limitations (4/4)

  • Imbalanced Cluster Sizes (or Clusters with Different Densities)
