Clustering
Outline
2
Classification vs. Clustering
3
Classification: Supervised learning:
Learns a method for predicting the instance class from pre-labeled (classified) instances
Clustering
4
Unsupervised learning:
Finds “natural” grouping of instances given un-labeled data
Clustering Methods
5
Clusters: �exclusive vs. overlapping
6
Simple 2-D representation
Non-overlapping
Venn diagram
Overlapping
a
k
j
i
h
g
f
e
d
c
b
Clustering Evaluation
7
The distance function
8
Simple Clustering: K-means
Works with numeric data only
9
K-means example, step 1
10
k1
k2
k3
X
Y
Pick 3
initial
cluster
centers
(randomly)
K-means example, step 2
11
k1
k2
k3
X
Y
Assign
each point
to the closest
cluster
center
K-means example, step 3
12
X
Y
Move
each cluster center
to the mean
of each cluster
k1
k2
k2
k1
k3
k3
K-means example, step 4
13
X
Y
Reassign
points
closest to a different new cluster center
Q: Which points are reassigned?
k1
k2
k3
K-means example, step 4 …
14
X
Y
A: three points with animation
k1
k3
k2
K-means example, step 4b
15
X
Y
re-compute cluster means
k1
k3
k2
K-means example, step 5
16
X
Y
move cluster centers to cluster means
k2
k1
k3
Discussion, 1
What can be the problems with
K-means clustering?
17
Discussion, 2
18
instances
initial cluster centers
Discussion, 3
A: To increase chance of finding global optimum: restart with different random seeds.
19
K-means clustering summary
Advantages
Disadvantages
20
K-means clustering - outliers ?
What can be done about outliers?
21
K-means variations
22
5
205
5
*Hierarchical clustering
23
Agglomerative Hierarchical Clustering
… you will find the description and an example here:
https://onlinecourses.science.psu.edu/stat555/node/85/
24
*Incremental clustering
25
*Clustering weather data
1
26
ID | Outlook | Temp. | Humidity | Windy |
A | Sunny | Hot | High | False |
B | Sunny | Hot | High | True |
C | Overcast | Hot | High | False |
D | Rainy | Mild | High | False |
E | Rainy | Cool | Normal | False |
F | Rainy | Cool | Normal | True |
G | Overcast | Cool | Normal | True |
H | Sunny | Mild | High | False |
I | Sunny | Cool | Normal | False |
J | Rainy | Mild | Normal | False |
K | Sunny | Mild | Normal | True |
L | Overcast | Mild | High | True |
M | Overcast | Hot | Normal | False |
N | Rainy | Mild | High | True |
2
3
*Clustering weather data
4
27
ID | Outlook | Temp. | Humidity | Windy |
A | Sunny | Hot | High | False |
B | Sunny | Hot | High | True |
C | Overcast | Hot | High | False |
D | Rainy | Mild | High | False |
E | Rainy | Cool | Normal | False |
F | Rainy | Cool | Normal | True |
G | Overcast | Cool | Normal | True |
H | Sunny | Mild | High | False |
I | Sunny | Cool | Normal | False |
J | Rainy | Mild | Normal | False |
K | Sunny | Mild | Normal | True |
L | Overcast | Mild | High | True |
M | Overcast | Hot | Normal | False |
N | Rainy | Mild | High | True |
3
Merge best host and runner-up
5
Consider splitting the best host if merging doesn’t help
*Final hierarchy
28
ID | Outlook | Temp. | Humidity | Windy |
A | Sunny | Hot | High | False |
B | Sunny | Hot | High | True |
C | Overcast | Hot | High | False |
D | Rainy | Mild | High | False |
Oops! a and b are actually very similar
*Example: the iris data (subset)�
29
*Clustering with cutoff
30
*Category utility
31
maximum
number of attributes
*Overfitting-avoidance heuristic
Where n is number of all possible attribute values.
32
Maximum value of CU
Other Clustering Approaches
33
Discussion
34
Examples of Clustering Applications
35
Clustering Summary
36
Principal Component Analysis (PCA)
… you will find (detailed) description here:
http://setosa.io/ev/principal-component-analysis/
http://www.lauradhamilton.com/introduction-to-principal-component-analysis-pca
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
37