1
Applied Data Analysis (CS401)
Robert West
Lecture 9
Learning from data: Unsupervised learning
15 Nov 2023
Announcements
2
3
Give us feedback on this lecture here: https://go.epfl.ch/ada2023-lec9-feedback
Machine learning
The clustering problem
5
Characteristics of clustering methods
Quantitative: scalability (many samples), dimensionality (many features)
Qualitative: types of features (numerical, categorical, etc.), type of shapes (polyhedra, hyperplanes, manifolds, etc.)
6
Characteristics of clustering methods
Robustness: sensitivity to noise and outliers, sensitivity to the processing order
User interaction: incorporation of user constraints (e.g., number of clusters, max size of clusters), interpretability and usability
7
Example: clusters & outliers
8
[Figure: scatter plot of data points forming several clusters, with a few isolated points marked as outliers]
A typical clustering example
Note: The example above is 2-D; real scenarios are often much higher-dimensional, e.g., 10,000 dimensions for 100x100-pixel images.
9
Input
Output
Some use cases for clustering
10
Clustering for condensation/compression
11
Here we don’t require that clusters capture meaningful structure, only that they give a coarse-grained version of the data.
Beware of “cluster bias”!
12
Cluster bias
13
Cluster bias
14
Netflix
15
Terminology
16
Clustering is a hard problem!
17
Why is it hard?
18
Clustering problem: galaxies
19
Clustering problem: movies
20
Clustering problem: movies
Space of all movies:
21
Clustering problem: documents
Finding topics:
22
Cosine, Jaccard, Euclidean distances
23
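As a quick illustration (not from the slides; it assumes NumPy and SciPy are available), here is how the three distances can be computed on toy vectors:

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean, jaccard

# Toy feature vectors (made up for illustration).
x = np.array([1.0, 0.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 1.0])

print("Euclidean:", euclidean(x, y))    # sqrt of summed squared differences
print("Cosine:", cosine(x, y))          # 1 - cos(angle between x and y)

# Jaccard distance is defined on sets, represented here as boolean vectors.
a = np.array([1, 1, 0, 1], dtype=bool)
b = np.array([1, 0, 0, 1], dtype=bool)
print("Jaccard:", jaccard(a, b))        # 1 - |intersection| / |union|
```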
Overview: Methods of clustering
24
Agglomerative hierarchical clustering
25
Agglomerative hierarchical clustering
26
Example: Hierarchical clustering
27
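A minimal sketch of how such an example can be run in code, using SciPy’s agglomerative implementation on made-up 2-D data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
# Three synthetic 2-D blobs (illustrative data, not the lecture's).
X = np.vstack([rng.normal(c, 0.3, size=(20, 2))
               for c in ([0, 0], [3, 3], [0, 4])])

Z = linkage(X, method="average")                 # repeatedly merge the two nearest clusters
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
dendrogram(Z)                                    # draw the merge tree (needs matplotlib)
```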
28
Commercial break
ada.edu.az
Data science from A to Z
Non-Euclidean case: clustroids
29
The centroid is the average of all data points in the cluster; it is therefore an “artificial” point that need not coincide with any actual data point.
The clustroid is an existing data point that is “closest” to all other points in the cluster.
[Figure: a cluster of 3 data points, with the centroid shown as an artificial point between them and the clustroid as one of the actual data points]
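A small sketch of how a clustroid can be computed under one common definition, namely the point minimizing the total distance to all other points in the cluster (the function name and metric here are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def clustroid(points, metric="cosine"):
    """Return the existing data point with the smallest total
    distance to all other points in the cluster."""
    D = cdist(points, points, metric=metric)  # pairwise distance matrix
    return points[np.argmin(D.sum(axis=1))]  # row with smallest distance sum
```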
Non-Euclidean case: cluster “nearness”
30
Cohesion
31
How many branching points are there in a dendrogram for a dataset with N data points?
32
Implementation
33
Overview: Methods of clustering
34
NEXT
K-means
The gorilla among the point-assignment clustering algorithms
35
K-means clustering
36
K-means clustering
37
K-means clustering
How long to iterate?
38
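One common answer is to iterate until the assignments stop changing. A bare-bones sketch of the Lloyd iteration (illustrative only; it assumes no cluster ever goes empty):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: k centers chosen uniformly from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # assignments stable -> converged
            break
        labels = new_labels
        # Update step: each center moves to the mean of its points
        # (assumes every cluster keeps at least one point).
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers
```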
K-means initialization
We need to pick some points for the first round of the algorithm:
Note: Finding an optimal k-means clustering is NP-hard; the initialization strategies above merely help avoid bad starting configurations.
39
K-means++ [link]
Start: Choose first cluster center at random from the data points
Iterate: Choose the next center from among the data points, with probability proportional to the squared distance to the nearest center chosen so far.
Intuitively, this yields widely spaced starting centers drawn from dense regions of the data space, avoiding a “collapse” of the clustering onto a few nearby centers.
40
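A sketch of this seeding rule as the standard k-means++ algorithm is usually stated (the function name is illustrative, not from the slides):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First center: uniformly at random from the data points.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        C = np.array(centers)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next center proportionally to d^2: far-away points
        # in dense regions are likely, already-covered regions are not.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```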
K-means properties
41
K-means drawbacks
Often terminates at a local but non-global optimum (mitigated by smart initialization such as k-means++, or by re-running multiple times with different initializations)
Requires the notion of a mean
Requires specifying k (number of clusters) in advance
Doesn’t handle noisy data and outliers well
Clusters only have convex shapes
42
How to choose k?
For k = 2, 3, 4, … (the silhouette is undefined for k = 1):
Run k-means with k clusters
For each data point i, compute its “silhouette width” s(i)
S = average of s(i) over all i
Plot S against k
Pick the k for which S is greatest
43
a(i): average distance from point i to the other points in its own cluster
b(i): average distance from point i to the points in the closest other cluster
s(i) = (b(i) − a(i)) / max(a(i), b(i))
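A sketch of this model-selection loop using scikit-learn (silhouette_score averages s(i) over all points; parameter values here are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(X, k_max=10):
    scores = {}
    for k in range(2, k_max + 1):  # silhouette needs at least 2 clusters
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # S = average of s(i)
    return max(scores, key=scores.get)           # the k with the greatest S
```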
DBSCAN
44
DBSCAN
45
DBSCAN
46
DBSCAN clusters
47
DBSCAN algorithm
48
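A minimal sketch of running DBSCAN via scikit-learn (the eps and min_samples values are illustrative and dataset-dependent):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # made-up data for illustration

db = DBSCAN(eps=0.3, min_samples=5).fit(X)    # density-based: no k needed
labels = db.labels_                           # cluster ids; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", int((labels == -1).sum()), "noise points")
```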
DBSCAN performance
49
curse of dimensionality, ahhhhh!
50
Give us feedback on this lecture here: https://go.epfl.ch/ada2023-lec9-feedback
51