JSC 270 - LAb 9
Clustering
data
40 points
2 features
2 clusters
data
Original data
What we see
A subsample of 11 points
K-means - Algorithm
Decide on the number of clusters K
Pick K centroids (ie. cluster centers) at random
Repeat the following until convergence:
K-Means
A
B
C
D
E
F
G
H
I
J
K
K-Means
A
B
C
D
E
F
G
H
I
J
K
x
x
K-Means
A
B
C
D
E
F
G
H
I
J
K
x
x
K-Means
A
B
C
D
E
F
G
H
I
J
K
x
x
x
x
K-Means
A
B
C
D
E
F
G
H
I
J
K
x
x
x
x
K-Means
A
B
C
D
E
F
G
H
I
J
K
x
x
x
x
x
x
K-Means
A
B
C
D
E
F
G
H
I
J
K
x
x
x
x
x
x
K-Means
A
B
C
D
E
F
G
H
I
J
K
x
x
x
x
x
x
x
K-Means
Depends on initial conditions
How to make K-Means more robust?
How to make K-Means more robust?
How to make K-Means more robust? k-means++
k-MEANS
Advantages
Fast
Makes relatively few assumptions
Stable when robust initialization is used
Disadvantages
Depends on the initialization
Can get stuck in local minimum
A bit about selecting the number of clusters
General approach
VARIOUS STATISTICS for selecting the number of clusters
VARIOUS STATISTICS for selecting the number of clusters
Elbow
Elbow method- heuristic
Elbow in the plot of the sum of squared distances to the nearest centre
Silhouette Score
Kauffman and Rousseeuw (1990)
For each data point i in cluster Ci, let
Silhouette: , Silhouette Coefficient:
average over all points
Silhouette Score
Kauffman and Rousseeuw (1990)
k=9
GAP statistic
Di=𝝨h,k∈Ci dhk- sum of pairwise distances in cluster i
Wk = 𝝨i=1..k1/(2ni)Di - pooled within cluster sum of distances
Main idea: compare log(Wk) to expectation under the null distribution
Tibshirani et al (2000)
GAP statistic
k=5
k=9
k=5 k=9
Alluvial (SANKY) PLOT
k=3
k=9
SUMMARY