FoCS: Clustering

Niema Moshiri

UC San Diego SPIS 2022

Clustering

Genes

What genes have similar function?

Clustering

Patients

What patients have similar prognoses?

Cancer Research at UCSD (at a glance)

Jill Mesirov

Trey Ideker

Hannah Carter

Cancer Genomics Cross-Lab Meeting

What is clustering?

Clustering

  • Given data points, we want to “maximally separate” them
    • Low intra-cluster (i.e., within-cluster) pairwise distances
    • High inter-cluster (i.e., across clusters) pairwise distances (both sketched below)
  • Generally, we also want to minimize the number of clusters
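
A minimal sketch in Python of the two quantities above, using Euclidean distance (the helper names are illustrative, not from the slides):

import math
from itertools import combinations

def intra_cluster_distances(cluster):
    # Pairwise distances WITHIN one cluster (we want these to be low)
    return [math.dist(u, v) for u, v in combinations(cluster, 2)]

def inter_cluster_distances(a, b):
    # Pairwise distances ACROSS two clusters (we want these to be high)
    return [math.dist(u, v) for u in a for v in b]

a = [(0, 0), (0, 1)]
b = [(5, 5), (6, 5)]
print(max(intra_cluster_distances(a)))     # 1.0
print(min(inter_cluster_distances(a, b)))  # ~6.40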

Supervised vs. Unsupervised Learning

  • Previously, we looked at the classification problem
    • Given labeled points, train a computer to classify new data
    • That was an example of supervised learning (we know labels)
  • Clustering is an example of unsupervised learning
    • We don’t know any labels in advance
    • We want the computer to learn the underlying structure of the data

Cluster These Tolkien Characters

Legolas, Frodo, Aragorn, Galadriel, Bard

  • Male vs. Female
  • Part of the Fellowship of the Ring
  • Royalty vs. Not Royalty
  • Starred in a Fast & Furious Movie

Clustering is subjective!

We need to define a pairwise distance function!

Defining a Pairwise Distance Function

  • We need to define a pairwise distance function
    • Let u and v denote two objects from the universe of possibilities
    • D(u,v) denotes the distance between u and v
    • e.g., Euclidean distance (sketched below)
  • We can alternatively define a similarity function (and negate it)
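
A minimal sketch of a pairwise distance function in Python, assuming points are equal-length sequences of numbers (the function names are illustrative, not from the slides):

import math

def euclidean_distance(u, v):
    # D(u,v): straight-line distance between two points
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def negated_similarity(similarity, u, v):
    # Turn a similarity function into a distance-like score by negating it
    return -similarity(u, v)

print(euclidean_distance((0, 0), (3, 4)))  # 5.0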

Desirable Properties of a Clustering Algorithm

  • Scalability (in terms of both time and space)
  • Ability to deal with different data types
  • Minimal requirements for domain knowledge to determine inputs
  • Interpretability and usability
  • Optional: Incorporation of user-specified constraints

Partitional vs. Hierarchical Clustering

  • Partitional: Construct partitions and evaluate them by some criterion
  • Hierarchical: Create a hierarchy using some criterion
    • A hierarchical clustering can be “cut” into a partitional clustering (see the sketch below)
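
As a sketch of the “cut” idea, assuming SciPy is installed: linkage builds a bottom-up hierarchy, and fcluster cuts it into a partitional clustering with a requested number of clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
Z = linkage(points, method='average')            # hierarchical clustering
labels = fcluster(Z, t=2, criterion='maxclust')  # "cut" into 2 clusters
print(labels)  # e.g., [1 1 2 2]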

Bottom-Up Hierarchical Clustering

How do we know which clusters are closest??
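
The next section covers the options; as one concrete sketch (my own illustrative code, not from the slides), here is a bottom-up loop that repeatedly merges the two closest clusters, using the minimum pairwise distance between clusters, until k clusters remain:

import math

def cluster_distance(a, b):
    # One choice (minimum pairwise distance); see the next section for others
    return min(math.dist(u, v) for u in a for v in b)

def bottom_up_cluster(points, k):
    # Start with every point in its own cluster, then repeatedly merge
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i].extend(clusters.pop(j))  # merge cluster j into cluster i
    return clusters

print(bottom_up_cluster([(0, 0), (0, 1), (5, 5), (6, 5)], k=2))
# [[(0, 0), (0, 1)], [(5, 5), (6, 5)]]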

Computing Distances Between Clusters

  • Given our pairwise distance function D, we know how to compute the distance between two individual objects u and v
    • It’s just D(u,v)
  • How do we compute the distance between two clusters A and B? Some options (each sketched below):
    • Minimum distance between any element u in A and any element v in B
    • Maximum distance between any element u in A and any element v in B
    • Average distance over all pairs of elements u in A and v in B
    • Something else?
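
A minimal sketch of those three options in Python (illustrative names; dist can be any pairwise distance function, e.g., the Euclidean one above):

def min_distance(A, B, dist):
    # "Single linkage": closest pair across the two clusters
    return min(dist(u, v) for u in A for v in B)

def max_distance(A, B, dist):
    # "Complete linkage": farthest pair across the two clusters
    return max(dist(u, v) for u in A for v in B)

def avg_distance(A, B, dist):
    # "Average linkage": mean over all |A|*|B| cross-cluster pairs (used by UPGMA)
    return sum(dist(u, v) for u in A for v in B) / (len(A) * len(B))

import math
A, B = [(0, 0), (0, 1)], [(5, 5), (6, 5)]
print(min_distance(A, B, math.dist))  # ~6.40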

Soft vs. Hard Partitional Clustering

  • In a hard clustering, each point is assigned to a single cluster
    • What if a point looks like a mix of clusters?
  • In a soft clustering, each point is partially assigned to multiple clusters
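
An illustration with made-up numbers: a hard clustering stores one label per point, while a soft clustering stores per-point membership weights that sum to 1.

hard = [0, 1, 1]   # point i belongs to exactly one cluster

soft = [           # row i: point i's membership weight in each cluster
    [0.9, 0.1],    # point 0 is mostly cluster 0
    [0.2, 0.8],    # point 1 is mostly cluster 1
    [0.5, 0.5],    # point 2 looks like an even mix of both clusters
]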

Some Common Clustering Algorithms

  • K-means Clustering: Cluster points into k clusters
    • Need to specify k (or can try multiple different values)
  • UPGMA: A common bottom-up hierarchical clustering algorithm
  • Gaussian Mixture Models: Fit a mixture of distributions, and assign each point to its closest distribution
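
A sketch of two of these via scikit-learn, assuming it is installed (KMeans and GaussianMixture are its real class names; UPGMA-style average linkage is available through SciPy's linkage(..., method='average') shown earlier):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]])

kmeans = KMeans(n_clusters=2, n_init=10).fit(X)  # k must be specified
print(kmeans.labels_)                            # hard labels, e.g., [0 0 1 1]

gmm = GaussianMixture(n_components=2).fit(X)     # fit a mixture of Gaussians
print(gmm.predict(X))        # hard assignment: most likely component
print(gmm.predict_proba(X))  # soft assignment: per-component probabilities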