JavaScript isn't enabled in your browser, so this file can't be opened. Enable and reload.

1 of 37

Clustering

2 of 37

Outline

Introduction
K-means clustering
Hierarchical clustering

3 of 37

Classification vs. Clustering

Classification: Supervised learning:

Learns a method for predicting the instance class from pre-labeled (classified) instances

4 of 37

Clustering

Unsupervised learning:

Finds “natural” grouping of instances given un-labeled data

5 of 37

Clustering Methods

Many different method and algorithms:

For numeric and/or symbolic data
Deterministic vs. probabilistic
Exclusive vs. overlapping
Hierarchical vs. flat
Top-down vs. bottom-up

6 of 37

Clusters: �exclusive vs. overlapping

Simple 2-D representation

Non-overlapping

Venn diagram

Overlapping

7 of 37

Clustering Evaluation

Manual inspection
Benchmarking on existing labels
Cluster quality measures

distance measures
high similarity within a cluster, low across clusters

8 of 37

The distance function

Simplest case: one numeric attribute A

Distance(X,Y) = A(X) – A(Y)

Several numeric attributes:

Distance(X,Y) = Euclidean distance between X,Y

Nominal attributes: distance is set to 1 if values are different, 0 if they are equal
Are all attributes equally important?

Weighting the attributes might be necessary

9 of 37

Simple Clustering: K-means

Works with numeric data only

Pick a number (K) of cluster centers (at random)
Assign every item to its nearest cluster center (e.g. using Euclidean distance)
Move each cluster center to the mean of its assigned items
Repeat steps 2,3 until convergence (change in cluster assignments less than a threshold)

10 of 37

K-means example, step 1

k₁

k₂

k₃

Pick 3

initial

cluster

centers

(randomly)

11 of 37

K-means example, step 2

k₁

k₂

k₃

Assign

each point

to the closest

cluster

center

12 of 37

K-means example, step 3

Move

each cluster center

to the mean

of each cluster

k₁

k₂

k₁

k₃

13 of 37

K-means example, step 4

Reassign

points

closest to a different new cluster center

Q: Which points are reassigned?

k₁

k₂

k₃

14 of 37

K-means example, step 4 …

A: three points with animation

k₁

k₃

k₂

15 of 37

K-means example, step 4b

re-compute cluster means

k₁

k₃

k₂

16 of 37

K-means example, step 5

move cluster centers to cluster means

k₂

k₁

k₃

17 of 37

Discussion, 1

What can be the problems with

K-means clustering?

18 of 37

Discussion, 2

Result can vary significantly depending on initial choice of seeds (number and position)
Can get trapped in local minimum

Example:

Q: What can be done?

instances

initial cluster centers

19 of 37

Discussion, 3

A: To increase chance of finding global optimum: restart with different random seeds.

20 of 37

K-means clustering summary

Advantages

Simple, understandable
items automatically assigned to clusters

Disadvantages

Must pick number of clusters before hand
All items forced into a cluster
Too sensitive to outliers

21 of 37

K-means clustering - outliers ?

What can be done about outliers?

22 of 37

K-means variations

K-medoids – instead of mean, use medians of each cluster

Mean of 1, 3, 5, 7, 9 is
Mean of 1, 3, 5, 7, 1009 is
Median of 1, 3, 5, 7, 1009 is
Median advantage: not affected by extreme values

For large databases, use sampling

205

23 of 37

*Hierarchical clustering

Bottom up

Start with single-instance clusters
At each step, join the two closest clusters
Design decision: distance between clusters

E.g. two closest instances in clusters� vs. distance between means

Top down

Start with one universal cluster
Find two clusters
Proceed recursively on each subset
Can be very fast

Both methods produce a�dendrogram

24 of 37

Agglomerative Hierarchical Clustering

… you will find the description and an example here:

https://onlinecourses.science.psu.edu/stat555/node/85/

https://onlinecourses.science.psu.edu/stat555/node/86/

25 of 37

*Incremental clustering

Heuristic approach (COBWEB/CLASSIT)
Form a hierarchy of clusters incrementally
Start:

tree consists of empty root node

Then:

add instances one by one
update tree appropriately at each stage
to update, find the right leaf for an instance
May involve restructuring the tree

Base update decisions on category utility

26 of 37

*Clustering weather data

ID	Outlook	Temp.	Humidity	Windy
A	Sunny	Hot	High	False
B	Sunny	Hot	High	True
C	Overcast	Hot	High	False
D	Rainy	Mild	High	False
E	Rainy	Cool	Normal	False
F	Rainy	Cool	Normal	True
G	Overcast	Cool	Normal	True
H	Sunny	Mild	High	False
I	Sunny	Cool	Normal	False
J	Rainy	Mild	Normal	False
K	Sunny	Mild	Normal	True
L	Overcast	Mild	High	True
M	Overcast	Hot	Normal	False
N	Rainy	Mild	High	True

27 of 37

*Clustering weather data

ID	Outlook	Temp.	Humidity	Windy
A	Sunny	Hot	High	False
B	Sunny	Hot	High	True
C	Overcast	Hot	High	False
D	Rainy	Mild	High	False
E	Rainy	Cool	Normal	False
F	Rainy	Cool	Normal	True
G	Overcast	Cool	Normal	True
H	Sunny	Mild	High	False
I	Sunny	Cool	Normal	False
J	Rainy	Mild	Normal	False
K	Sunny	Mild	Normal	True
L	Overcast	Mild	High	True
M	Overcast	Hot	Normal	False
N	Rainy	Mild	High	True

Merge best host and runner-up

Consider splitting the best host if merging doesn’t help

28 of 37

*Final hierarchy

ID	Outlook	Temp.	Humidity	Windy
A	Sunny	Hot	High	False
B	Sunny	Hot	High	True
C	Overcast	Hot	High	False
D	Rainy	Mild	High	False

Oops! a and b are actually very similar

29 of 37

*Example: the iris data (subset)�

30 of 37

*Clustering with cutoff

31 of 37

*Category utility

Category utility: quadratic loss function�defined on conditional probabilities:

Every instance in different category ⇒ �numerator becomes��

maximum

number of attributes

32 of 37

*Overfitting-avoidance heuristic

If every instance gets put into a different category the numerator becomes (maximal):

Where n is number of all possible attribute values.

So without k in the denominator of the CU-formula, every cluster would consist of one instance!

Maximum value of CU

33 of 37

Other Clustering Approaches

EM – probability based clustering
Bayesian clustering
SOM – self-organizing maps
…

34 of 37

Discussion

Can interpret clusters by using supervised learning

learn a classifier based on clusters

Decrease dependence between attributes?

pre-processing step
E.g. use principal component analysis

Can be used to fill in missing values
Key advantage of probabilistic clustering:

Can estimate likelihood of data
Use it to compare different models objectively

35 of 37

Examples of Clustering Applications

Marketing: discover customer groups and use them for targeted marketing and re-organization
Astronomy: find groups of similar stars and galaxies
Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults
Genomics: finding groups of gene with similar expressions
…

36 of 37

Clustering Summary

Unsupervised
Many approaches

K-means – simple, sometimes useful

K-medoids is less sensitive to outliers

Hierarchical clustering – works for symbolic attributes

Evaluation is a problem

37 of 37

Principal Component Analysis (PCA)

… you will find (detailed) description here:

http://setosa.io/ev/principal-component-analysis/

http://www.lauradhamilton.com/introduction-to-principal-component-analysis-pca

http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf