Modeling: Clustering
BIOS 611: Introduction to Data Science
Instructor: Matt Biggs
Overview of today
Feedback Test Run
Review
Objectives
Discovering categories
Motivating example:
Suppose you are provided with gene expression data for tumor explants from 100 patients. How many types of cancer are in this culture collection? Which types have the most representatives? What are the characteristics of each type?
How would you answer these questions?
Clustering your data
Grouping, categorizing, finding patterns of similarity between observations.
Clustering is an “unsupervised” machine learning technique.
Example clustering algorithms (there are many):
k-means, hierarchical clustering, expectation maximization (EM)
What makes a good cluster?
Ideally:
K-Means Algorithm
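A minimal sketch of the standard k-means loop (Lloyd's algorithm), written in base R to make the steps visible. The helper name simple_kmeans and the simulated two-blob data are assumptions for illustration only; in practice you would call R's built-in kmeans().

```r
# simple_kmeans is a hypothetical teaching helper, not part of any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k distinct observations as the initial centers.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every center,
    # then assign each point to its nearest center.
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")
    # Stop once the assignments no longer change.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 3: move each center to the mean of the points assigned to it.
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

# Simulated data (an assumption): two well-separated blobs of 20 points each.
set.seed(611)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(x, k = 2)
table(fit$cluster)
```

Because the starting centers are random, different seeds can give different labelings, which is one reason kmeans() offers the nstart argument.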
Distance—can mean many things
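Metrics available in R's dist() function:
Euclidean: $\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Maximum: $\max_{i} |x_i - y_i|$
Manhattan: $\sum_{i=1}^{n} |x_i - y_i|$
Canberra: $\sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}$
Binary (aka Jaccard distance): $1 - \frac{|X \cap Y|}{|X \cup Y|}$, where X and Y are the sets of nonzero positions in x and y
Minkowski: $\left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$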
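To see how much the choice of metric matters, the sketch below computes each distance for the same pair of vectors with dist(). The vectors a and b are made-up values chosen so the results are easy to check by hand.

```r
# Two made-up observations; dist() compares the rows of a matrix.
a <- c(1, 2, 3)
b <- c(4, 0, 3)
m <- rbind(a, b)

metrics <- c("euclidean", "maximum", "manhattan", "canberra", "binary")
d <- sapply(metrics, function(meth) as.numeric(dist(m, method = meth)))
d
# euclidean: sqrt(9 + 4 + 0) = sqrt(13)
# maximum:   3
# manhattan: 3 + 2 + 0 = 5
# canberra:  3/5 + 2/2 + 0/6 = 1.6
# binary:    1 of the 3 positions that are nonzero in a or b is nonzero in only one = 1/3

# Minkowski takes an extra power argument p; p = 2 would reproduce Euclidean.
d_mink <- as.numeric(dist(m, method = "minkowski", p = 3))
d_mink  # (27 + 8)^(1/3)
```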
How many clusters?
There are many ways to estimate how many clusters to group your data into.
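Today you’ll use an “elbow plot”: the total within-group sum of squares plotted against the number of clusters. Look for the bend where adding another cluster stops reducing it by much.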
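One way to draw such a plot is sketched below. It uses the built-in iris measurements as stand-in data (an assumption; substitute your own numeric matrix) and reads the total within-group sum of squares from the tot.withinss component returned by kmeans().

```r
# Stand-in data (an assumption): the four numeric iris columns, standardized.
x <- scale(iris[, 1:4])

set.seed(611)
ks <- 1:8
# Fit k-means for each candidate k and record the total within-group sum of squares.
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-group sum of squares")
```

The curve always trends downward as k grows, so the elbow is a judgment call about diminishing returns, not a minimum.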
Hierarchical Agglomerative Clustering Algorithm
Agglomeration rules can change the results of the clustering:
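A quick way to see this is to cluster the same distance matrix under several linkage rules and compare the resulting group sizes. The sketch assumes standardized iris measurements as stand-in data. Single linkage merges on the smallest pairwise distance between two groups, complete linkage on the largest, and average linkage on the mean.

```r
# Stand-in data (an assumption): standardized iris measurements.
x <- scale(iris[, 1:4])
d <- dist(x, method = "euclidean")

linkages <- c("single", "complete", "average")
# For each rule, build the tree, cut it into 3 groups, and count group sizes.
sizes <- lapply(linkages, function(m) table(cutree(hclust(d, method = m), k = 3)))
names(sizes) <- linkages
sizes
```

Plotting each tree with plot(hclust(d, method = "single")) and so on shows the same differences as dendrograms.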
Expectation Maximization algorithm
(Probabilistic version of K-means)
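The "probabilistic k-means" idea can be sketched with a hand-rolled EM loop for a two-component, one-dimensional Gaussian mixture. This is an illustration under simplifying assumptions (simulated data, fixed iteration count, no convergence check), not the procedure any particular package uses. Each point gets a soft membership weight for every cluster instead of one hard label.

```r
# Simulated data (an assumption): two Gaussian groups centered at 0 and 6.
set.seed(611)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 6, sd = 1))

# Initial guesses: equal weights, means at the extremes, a shared spread.
w  <- c(0.5, 0.5)
mu <- c(min(x), max(x))
s  <- c(sd(x), sd(x))

for (iter in 1:200) {
  # E-step: responsibility of each component for each point (rows sum to 1).
  dens <- cbind(w[1] * dnorm(x, mu[1], s[1]),
                w[2] * dnorm(x, mu[2], s[2]))
  r <- dens / rowSums(dens)
  # M-step: re-estimate weights, means, and standard deviations
  # as responsibility-weighted averages.
  nk <- colSums(r)
  w  <- nk / length(x)
  mu <- colSums(r * x) / nk
  s  <- sqrt(colSums(r * outer(x, mu, "-")^2) / nk)
}

# Hard labels, if needed: the component with the larger responsibility.
cluster <- max.col(r, ties.method = "first")
round(rbind(weight = w, mean = mu, sd = s), 2)
```

If every responsibility were forced to 0 or 1 and the spreads held equal, the loop would reduce to the k-means assign-and-update steps.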
Clustering in R
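# x is a placeholder for your numeric data matrix or data frame
kfit <- kmeans(x, centers = 3)                  # k-means with 3 clusters
dist_mat <- dist(x, method = "euclidean")       # pairwise distance matrix
hfit <- hclust(dist_mat, method = "average")    # hierarchical clustering, average linkage
cutree(hfit, k = 3)                             # cut the tree into 3 groups
library(mclust)
mfit <- Mclust(x, G = 3)                        # Gaussian mixture (EM) with 3 components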
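Put together, the calls above might be used as in the following sketch, which assumes standardized iris measurements as stand-in data and cross-tabulates the labels from the different methods. The Mclust() step runs only if the mclust package is installed.

```r
# Stand-in data (an assumption): standardized iris measurements.
x <- scale(iris[, 1:4])

set.seed(611)
kfit <- kmeans(x, centers = 3, nstart = 20)

dist_mat <- dist(x, method = "euclidean")
hfit <- hclust(dist_mat, method = "average")
hgroups <- cutree(hfit, k = 3)

# Cluster numbers are arbitrary labels, so compare methods with a cross-table.
agreement <- table(kmeans = kfit$cluster, hclust = hgroups)
agreement

# Optional: model-based clustering, only if mclust is available.
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)
  mfit <- Mclust(x, G = 3)
  print(table(kmeans = kfit$cluster, mclust = mfit$classification))
}
```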
QUIZ
Evaluating the feedback
Clustering in-class practice
Homework 3 (due in one week)