1 of 51


Applied Data Analysis (CS401)

Robert West

Lecture 9

Learning from data: Unsupervised learning

15 Nov 2023

2 of 51

Announcements

  • Project milestone P2 due on Fri 17 Nov 23:59 (see Ed)
    • Reminder: we won’t answer questions asked in the final 24 hours before the deadline
  • Homework H2 to be released on Fri 17 Nov
    • Due two weeks later, on Fri 1 Dec
  • Friday’s lab session:
    • Quiz 8
    • Exercises on unsupervised learning


3 of 51


Give us feedback on this lecture here: https://go.epfl.ch/ada2023-lec9-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

4 of 51

Machine learning

  • Supervised: We are given input/output pairs (X, y) (a.k.a. “samples”) that are related via a function y = f(X). We would like to “learn” f, and evaluate it on new data.
    • Discrete y (class labels): classification
    • Continuous y: “regression” (e.g., linear regression)
  • Unsupervised: Given only samples X of the data, we compute a function f such that y = f(X) is a “simpler” representation.
    • Discrete y (cluster labels): “clustering”
    • Continuous y: “dimensionality reduction” (e.g., matrix factorization, unsupervised neural networks)

5 of 51

The clustering problem

  • Given a set of points, with a notion of distance between points, group the points into some number of clusters, such that
    • members of a cluster are close (i.e., similar) to each other
    • members of different clusters are far apart from each other
  • Usually:
    • Points live in a high-dimensional space
    • Similarity is defined via a distance measure
      • Euclidean, cosine, Jaccard, edit distance, … (cf. lecture 7)


6 of 51

Characteristics of clustering methods

Quantitative: scalability (many samples), dimensionality (many features)

Qualitative: types of features (numerical, categorical, etc.), type of shapes (polyhedra, hyperplanes, manifolds, etc.)


7 of 51

Characteristics of clustering methods

Robustness: sensitivity to noise and outliers, sensitivity to the processing order

User interaction: incorporation of user constraints (e.g., number of clusters, max size of clusters), interpretability and usability


8 of 51

Example: clusters & outliers


[Figure: a 2-D scatter of points forming several dense groups; the dense groups are labeled “Cluster” and an isolated point is labeled “Outlier”.]

9 of 51

A typical clustering example

Note: The example above is 2-D; real scenarios are often much higher-dimensional, e.g., 10,000-dimensional for 100x100-pixel images.


[Figure: “Input” point set and its clustered “Output”.]

10 of 51

Some use cases for clustering

  • Data exploration (especially for high-dimensional data, where visualization fails)
  • Partitioning of data for more fine-grained subsequent analysis
  • Marketing: building personas
  • Supporting data labeling for supervised learning
  • Supporting feature discretization for supervised learning (cf. lecture 8)
  • Data compression (next slide)


11 of 51

Clustering for condensation/compression


Here we don’t require that clusters extract meaningful structure, but that they give a coarse-grained version of the data.

12 of 51

Beware of “cluster bias”!

  • Human beings conceptualize the world through categories represented as exemplars or prototypes

  • We tend to see cluster structure whether it is there or not.
  • Works well for dogs, but…


13 of 51

Cluster bias


14 of 51

Cluster bias

  • Clustering is used more than it should be, because people assume an underlying domain has discrete classes in it
    • Especially true for characteristics of people, e.g., Myers-Briggs personality types like “ENTP”.
  • In reality the underlying data is often continuous.
  • In such cases, continuous models (e.g., matrix factorization, “soft” clustering) tend to do better (cf. next slide)


15 of 51

Netflix


  • Central portion: more of a continuum than discrete clusters
  • Other methods (e.g., dimensionality reduction) may be more appropriate than discrete clustering models

16 of 51

Terminology

  • Hierarchical clustering: clusters form a tree-shaped hierarchy. Can be computed bottom-up or top-down.
  • Flat clustering: no inter-cluster structure

  • Hard clustering: items assigned to a unique cluster
  • Soft clustering: cluster membership is a probability distribution over all clusters


17 of 51

Clustering is a hard problem!


18 of 51

Why is it hard?

  • Clustering in 2 dimensions looks easy
  • Clustering small amounts of data looks easy
  • And in these special cases, it actually is often easy, but...

  • … many applications involve not 2, but hundreds or thousands of dimensions (and large amounts of data)
  • High-dimensional spaces are different (“curse of dimensionality”): volume grows exponentially with the number of dimensions; the space is sparsely populated; clusters become less tight (see the sketch below)
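
A minimal sketch (our own illustration, not from the slides) of one face of the curse of dimensionality: as the number of dimensions grows, the nearest and farthest neighbors of a query point become almost equally far away.

```python
# Minimal sketch (not from the slides): distance contrast shrinks as the
# number of dimensions grows.
import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    X = rng.random((1000, d))                 # 1000 points in the unit hypercube
    q = rng.random(d)                         # a query point
    dists = np.linalg.norm(X - q, axis=1)     # Euclidean distances to all points
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:4d}   relative contrast (max - min) / min = {contrast:.2f}")
```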


19 of 51

Clustering problem: galaxies

  • A catalog of 2 billion “sky objects” represents objects by their radiation in 7 dimensions (frequency bands)
  • Problem: Cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
  • Sloan Digital Sky Survey [link]


20 of 51

Clustering problem: movies

  • Intuitively: Movies divide into categories/genres, and customers prefer a few categories
    • But what are categories really?
    • → take a data-driven approach!

  • Represent a movie by a set of customers who watched it (“collaborative filtering”)
  • Idea: Similar movies have similar sets of customers, and vice-versa


21 of 51

Clustering problem: movies

Space of all movies:

  • Think of a space with one dimension for each customer
    • Values in a dimension may be 0 or 1 only
    • A movie is a point in this space (x1, x2, …, xn), where xi = 1 iff the i-th customer watched the movie
  • For Amazon/Netflix, the dimensionality is in the millions
  • Task: Find clusters of similar movies


22 of 51

Clustering problem: documents

Finding topics:

  • Represent a document by a vector (x1, x2,…, xn), where xi = 1 iff the i-th word appears in the document (in any position)

  • Idea: Documents with similar sets of words are about the same topic (see the sketch below)
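
A minimal sketch of this binary bag-of-words representation, assuming scikit-learn is available (the toy documents are made up for illustration):

```python
# Minimal sketch (toy documents): binary bag-of-words vectors, where
# x_i = 1 iff the i-th vocabulary word occurs in the document.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "clustering groups similar documents together",
    "k-means is a popular clustering algorithm",
    "galaxies and quasars emit radiation in several frequency bands",
]

vectorizer = CountVectorizer(binary=True)    # binary=True: presence, not counts
X = vectorizer.fit_transform(docs)           # sparse matrix, one row per document

print(vectorizer.get_feature_names_out())
print(X.toarray())
```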


23 of 51

Cosine, Jaccard, Euclidean distances

  • In both examples (movies, documents) we have a choice when thinking of data points as sets of features (users, words); see the sketch after this list:
    • Sets as vectors:
      • Measure similarity via Euclidean distance
      • Measure similarity via cosine distance
    • Sets as sets:
      • Measure similarity via Jaccard index
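
A minimal sketch of the three options, assuming SciPy (two toy binary feature vectors stand in for the movie/document examples):

```python
# Minimal sketch (toy vectors): the same pair of binary feature vectors under
# Euclidean, cosine, and Jaccard distance. SciPy returns *distances*, i.e.,
# 1 - similarity for cosine and Jaccard.
import numpy as np
from scipy.spatial.distance import euclidean, cosine, jaccard

a = np.array([1, 1, 0, 1, 0])   # features (users/words) present in item A
b = np.array([1, 0, 0, 1, 1])   # features present in item B

print("Euclidean:", euclidean(a, b))
print("Cosine   :", cosine(a, b))
print("Jaccard  :", jaccard(a, b))
```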


24 of 51

Overview: Methods of clustering

  • Hierarchical methods:
    • Agglomerative (bottom-up):
      • Initially, each point is a cluster
      • Repeatedly combine the two “nearest” clusters into one
    • Divisive (top-down):
      • Start with one cluster and recursively split it

  • Flat methods (a.k.a. point-assignment methods):
    • Maintain a set of clusters
    • Assign points to “nearest” cluster; recompute clusters; repeat


25 of 51

Agglomerative hierarchical clustering

  • Key operation: Repeatedly combine the two nearest clusters

  • Three important questions:
    • (1) How to represent a cluster of more than one point?
    • (2) How to determine the “nearness” of clusters?
    • (3) When to stop combining clusters?


26 of 51

Agglomerative hierarchical clustering

  • Key operation: Repeatedly combine two nearest clusters
  • (1) How to represent a cluster of many points?
    • Euclidean case: represent a cluster via its centroid = average of points in cluster
    • What about non-Euclidean case?
  • (2) How to determine “nearness” of clusters?
    • Euclidean case: distance between clusters = distance between centroids
    • What about non-Euclidean case?


27 of 51

Example: Hierarchical clustering
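
The example figure does not reproduce well here; as a stand-in, a minimal sketch (assuming SciPy and Matplotlib) of agglomerative clustering with centroid linkage on toy 2-D data:

```python
# Minimal sketch (toy data): agglomerative clustering with SciPy, merging the
# two clusters with the nearest centroids, then cutting the tree into 3 clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in [(0, 0), (3, 3), (0, 3)]])   # three toy blobs

Z = linkage(X, method="centroid")                 # Euclidean centroid linkage
labels = fcluster(Z, t=3, criterion="maxclust")   # flat clustering with 3 clusters

dendrogram(Z)    # the tree of merges (cf. the dendrogram question later)
plt.show()
```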


28 of 51


ada.edu.az

Data science from A to Z

Commercial break

29 of 51

Non-Euclidean case: clustroids

  • (1) How to represent a cluster of many points? Answer: clustroid = an actual data point that is “closest” to the other points (see the sketch below)
  • Possible meanings of “closest”:
    • Smallest average distance to other points (a.k.a. medoid)
    • Smallest sum of squares of distances to other points
    • Smallest maximum distance to other points


The centroid is the average of all data points in the cluster; this makes the centroid an “artificial” point.

The clustroid is an existing data point that is “closest” to all other points in the cluster.

[Figure: a cluster of 3 data points, with the centroid (an artificial point, marked X) and the clustroid (one of the actual data points).]
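
A minimal sketch (our own hypothetical helper) of the first definition, picking the clustroid/medoid from a pairwise distance matrix:

```python
# Minimal sketch: clustroid under the "smallest average distance" definition,
# given only a pairwise distance matrix D for the points of one cluster.
import numpy as np

def clustroid(D):
    """Return the index of the point with smallest average distance to the others."""
    avg_dist = D.sum(axis=1) / (D.shape[0] - 1)
    return int(np.argmin(avg_dist))

# toy distance matrix for 4 points (any non-Euclidean distance works too)
D = np.array([[0., 2., 3., 4.],
              [2., 0., 1., 5.],
              [3., 1., 0., 2.],
              [4., 5., 2., 0.]])
print(clustroid(D))   # -> 2: the most "central" of the four points
```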

30 of 51

Non-Euclidean case: cluster “nearness”

  • (2) How do you determine the “nearness” of clusters?
    • Approach 1: Intercluster distance = minimum of the distances between any two points, one from each cluster; or average of distances; or distance between clustroids; etc.
    • Approach 2: Pick a notion of “cohesion” (“tightness”) of a cluster. Then: nearness of clusters = cohesion of their union


31 of 51

Cohesion

  • Approach 2.1: Use the diameter of the merged cluster = maximum distance between points in the merged cluster
  • Approach 2.2: Use the average distance between points in the merged cluster


32 of 51

How many branching points are there in a dendrogram for a dataset with N data points?


POLLING TIME

33 of 51

Implementation

  • Naïve implementation of hierarchical clustering:
    • At each step, compute pairwise distances between all pairs of clusters, then merge
    • O(N^3), where N is the number of data points
  • Careful implementation using a priority queue can reduce the time to O(N^2 log N)
    • Still too expensive for really big datasets


34 of 51

Overview: Methods of clustering

  • Hierarchical methods:
    • Agglomerative (bottom-up):
      • Initially, each point is a cluster
      • Repeatedly combine the two “nearest” clusters into one
    • Divisive (top-down):
      • Start with one cluster and recursively split it

  • Flat methods (a.k.a. point-assignment methods):
    • Maintain a set of clusters
    • Assign points to “nearest” cluster; recompute clusters; repeat


NEXT

35 of 51

K-means

The gorilla among the point-assignment clustering algorithms


36 of 51

K-means clustering

  • Goal: assign each data point to one of k clusters such that the total distance of points to their centroids is minimized
  • Solved by a simple greedy algorithm (Lloyd’s algorithm) that locally minimizes the “distance” (usually squared Euclidean distance) from data points to their respective centroids by alternating two steps (see the sketch after this list):
  • Find the closest cluster centroid for each data point, and assign the point to that cluster.
  • Recompute the cluster centroid (the mean of data points in the cluster) for each cluster.
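
A minimal sketch of these two alternating steps (our own implementation; in practice sklearn.cluster.KMeans does this, and more, for you):

```python
# Minimal sketch of Lloyd's algorithm: alternate (1) assign points to the
# closest centroid and (2) recompute centroids as cluster means.
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # naive random init
    for _ in range(n_iter):
        # (1) assignment step: closest centroid for every data point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (2) update step: centroid = mean of the points assigned to the cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # assignments have stabilized
            break
        centroids = new_centroids
    return labels, centroids
```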


37 of 51

K-means clustering


38 of 51

K-means clustering

How long to iterate?

  • For a fixed number of iterations
  • or until there is no change in assignments (guaranteed to happen eventually)
  • or until there is only a small change in cluster “tightness” (sum of [squared] distances from points to centroids); see the scikit-learn note below
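
For reference, a short note on how these criteria roughly map onto scikit-learn’s KMeans parameters (our mapping, not the lecture’s):

```python
# max_iter bounds the number of iterations; tol declares convergence once the
# cluster centers barely move between iterations (a proxy for the "tightness"
# changing only a little); n_init re-runs with different initializations.
from sklearn.cluster import KMeans

km = KMeans(n_clusters=5, max_iter=300, tol=1e-4, n_init=10, random_state=0)
```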


39 of 51

K-means initialization

We need to pick some points for the first round of the algorithm:

  • Random sample: Pick a random subset of k points from the dataset.
  • k-means++: Iteratively construct a random sample with good spacing across the dataset.

Note: Finding an optimal k-means clustering is NP-hard; the above initializations merely help avoid bad starting configurations.


40 of 51

K-means++ [link]

Start: Choose first cluster center at random from the data points

Iterate:

  • For every remaining data point x, compute the distance D(x) from x to the closest previously selected cluster center.
  • Choose a remaining point x randomly with probability proportional to D(x)2, and make it a new cluster center.

Intuitively, this finds a sample of widely-spaced points from dense regions of the data space, avoiding “collapsing” of the clustering into a few internal centers.
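
A minimal sketch of this seeding procedure (our own implementation; scikit-learn’s KMeans uses init="k-means++" by default):

```python
# Minimal sketch of k-means++ seeding: pick each new center with probability
# proportional to D(x)^2, the squared distance to the closest chosen center.
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                  # first center: uniform
    while len(centers) < k:
        C = np.array(centers)
        # D(x): distance from every point to its closest already-chosen center
        D = np.min(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), axis=1)
        probs = D**2 / np.sum(D**2)                      # proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```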


41 of 51

K-means properties

  • Greedy algorithm with random initialization: the solution may be suboptimal and vary significantly with different initial points
  • Very simple convergence proofs
  • Performance is O(nk) per iteration (n = number of points in the dataset, k = number of clusters): not bad, and can be heuristically improved
  • Many variants, e.g.
    • Fixed-size clusters
    • Soft clustering
  • Works well for data�condensation/compression


42 of 51

K-means drawbacks

  • Often terminates at a local but non-global optimum (mitigated by smart initialization such as k-means++, or by re-running multiple times with different initializations)
  • Requires the notion of a mean
  • Requires specifying k (the number of clusters) in advance
  • Doesn’t handle noisy data and outliers well
  • Clusters can only have convex shapes


43 of 51

How to choose k?

  • For k = 1, 2, 3, …:
    • Run k-means with k clusters
    • For each data point i, compute its “silhouette width” s(i) (defined below)
    • Compute S = average of s(i) over all i
  • Plot S against k
  • Pick the k for which S is greatest


Silhouette width of point i: s(i) = (b(i) − a(i)) / max(a(i), b(i)), where
  • a(i) = average distance from i to the points in its own cluster
  • b(i) = average distance from i to the points in the closest other cluster

[Figure: plot of S against k; pick the k at the maximum of the curve.]
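
A minimal sketch of this procedure with scikit-learn, on made-up toy blobs (note that the silhouette is only defined for k ≥ 2):

```python
# Minimal sketch (toy blobs): run k-means for several k and keep the k with
# the largest mean silhouette width S.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

scores = {}
for k in range(2, 10):                                   # silhouette needs k >= 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)              # S = mean of s(i)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```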

44 of 51

DBSCAN

  • “Density-based spatial clustering of applications with noise”
  • Motivation: centroid-based clustering methods like k-means favor clusters that are spherical, and have great difficulty with anything else

  • But with real data we often have clusters of arbitrary, non-spherical shape.


45 of 51

DBSCAN

  • DBSCAN performs density-based clustering, and follows the shape of dense neighborhoods of points.

  • Def.: core points have at least minPts neighbors within a sphere of radius ε around them.
  • The red points here are core points with at least minPts = 3 neighbors in an ε-sphere around them.


46 of 51

DBSCAN

  • More definitions (!):
    • Core points can directly reach neighbors in their ε-sphere
    • From non-core points, no other points can be reached
    • Point q is density-reachable from p if there is a series of points p = p1, …, pn = q such that pi+1 is directly reachable from pi
    • All points not density-reachable from any other points are outliers/noise


47 of 51

DBSCAN clusters

  • Even more definitions:
    • Points p, q are density-connected if there is a point o such that both p and q are density-reachable from o.
    • A cluster is a set of points that are mutually density-connected.
    • That is, if a point is density-reachable from a cluster point, it is part of the cluster as well.
    • In the above figure, red points are mutually density-reachable; B and C are density-connected; N is an outlier.


48 of 51

DBSCAN algorithm
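
The slide’s pseudocode is not reproduced here; below is a compact sketch of the standard DBSCAN procedure using the definitions from the previous slides. In practice, sklearn.cluster.DBSCAN(eps=..., min_samples=...).fit_predict(X) does the same job.

```python
# Compact sketch of DBSCAN: label core points, then grow each cluster by
# density-reachability; points reachable from no core point stay noise (-1).
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)                              # -1 = noise
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    visited = np.zeros(n, dtype=bool)

    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:                  # not a core point
            continue                                     # stays noise unless reached
        labels[i] = cluster                              # start a new cluster
        queue = list(neighbors[i])
        while queue:                                     # expand via reachability
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster                      # border or core point joins
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:         # j is itself a core point:
                    queue.extend(neighbors[j])           # keep expanding through it
        cluster += 1
    return labels
```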


49 of 51

DBSCAN performance

  • DBSCAN uses all-pairs point distances, but using an efficient indexing structure, each RangeQuery (for finding neighbors within ε-sphere) takes only O(log n) time
  • The algorithm overall can be made to run in O(n log n)
  • Fast neighbor search becomes progressively harder (higher constants) in higher dimensions, due to the curse of dimensionality (ahhhhh!)

50 of 51


Give us feedback on this lecture here: https://go.epfl.ch/ada2023-lec9-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

51 of 51
