1 of 22

Modeling: Clustering

BIOS 611: Introduction to Data Science

Instructor: Matt Biggs

2 of 22

Overview of today

  • Logistics:
    • Homework 2 due today!
    • Draft of Project 1 due next class
  • Feedback test run
  • Clustering

BIOS 611

3 of 22

Feedback Test Run

4 of 22

Review

  • Wrangling
  • Agile data science

5 of 22

Objectives

  • We will have tested the feedback anonymization system
  • Students will be able to articulate the purpose of clustering
  • Students will be familiar with common clustering algorithms
  • Students will perform clustering operations in R

6 of 22

Discovering categories

Motivating example:

Suppose you are provided with gene expression data for tumor explants from 100 patients. How many types of cancer are in this culture collection? Which types have the most representatives? What are the characteristics of each type?

How would you answer these questions?

7 of 22

Discovering categories

Motivating example:

Suppose you are provided with gene expression data for tumor explants from 100 patients. How many types of cancer are in this culture collection? Which types have the most representatives? What are the characteristics of each type?

Clustering your data

Grouping, categorizing, finding patterns of similarity between observations.

Clustering is an “unsupervised” machine learning technique.

Example clustering algorithms (there are many):

k-means, hierarchical clustering, expectation maximization (EM)

8 of 22

What makes a good cluster?

Ideally:

  • Observations within a cluster are very similar to each other
  • Different clusters are well separated from each other

9 of 22

K-Means Algorithm

  1. Randomly place K “centroids”
  2. Assign every data point to its nearest centroid
  3. Calculate a new centroid position for each “cluster” (the mean of its points)
  4. Repeat 2-3 until the centroids are stable
  5. Repeat 1-4 many times with different random centroid starting positions, and assign points to clusters based on the consensus across runs
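
The steps above are what R’s kmeans() implements; its nstart argument handles step 5 by rerunning from many random starts and keeping the solution with the lowest within-group sum of squares (a common stand-in for averaging runs). A minimal sketch on simulated data (all object names here are illustrative):

```r
# Simulated 2-D data: two well-separated groups (for illustration only)
set.seed(611)
pts <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 5), ncol = 2))

# nstart = 25 repeats the algorithm from 25 random centroid starts
# and keeps the best result (lowest total within-cluster sum of squares)
fit <- kmeans(pts, centers = 2, nstart = 25)

table(fit$cluster)   # cluster sizes
fit$centers          # final centroid positions
```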

10 of 22

Metrics available in R’s dist() function:

Euclidean - sqrt( sum_i (x_i - y_i)^2 )

Maximum - max_i |x_i - y_i|

Manhattan - sum_i |x_i - y_i|

Canberra - sum_i |x_i - y_i| / (|x_i| + |y_i|)

Binary (aka Jaccard distance) - 1 - |X ∩ Y| / |X ∪ Y|

Minkowski - ( sum_i |x_i - y_i|^p )^(1/p)
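
These metrics are selected through the method argument of dist(). A quick check on two points, where the answers are easy to verify by hand:

```r
# Two points in 2-D: (0, 0) and (3, 4)
x <- rbind(c(0, 0), c(3, 4))

dist(x, method = "euclidean")         # sqrt(3^2 + 4^2) = 5
dist(x, method = "manhattan")         # |3| + |4| = 7
dist(x, method = "maximum")           # max(3, 4) = 4
dist(x, method = "minkowski", p = 3)  # (3^3 + 4^3)^(1/3)
```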

11 of 22

Euclidean - sqrt( sum_i (x_i - y_i)^2 )

12 of 22

Manhattan - sum_i |x_i - y_i|

13 of 22

Binary (aka Jaccard Distance) - (|X ∪ Y| - |X ∩ Y|) / |X ∪ Y|

14 of 22

How many clusters?

There are many ways to estimate how many clusters to group your data into.

Today you’ll use an “elbow plot” of the total within-group sum of squares.
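
One common way to build such an elbow plot with kmeans() (a sketch; mydata is placeholder simulated data standing in for your own matrix):

```r
set.seed(611)
mydata <- matrix(rnorm(200), ncol = 2)  # placeholder data

# Total within-group sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(mydata, centers = k, nstart = 10)$tot.withinss
})

plot(1:10, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-group sum of squares")
# Look for the "elbow": the k where the curve stops dropping sharply
```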

15 of 22

Hierarchical Agglomerative Clustering Algorithm

  1. Measure the distances between all data points
  2. Join the closest points together into a cluster, and identify the cluster centroid (i.e., its average position)
  3. Repeat with the next closest pair of points or clusters, and so on, until everything is joined into a single tree
  4. To assign clusters, cut the tree at the appropriate level

16 of 22

Hierarchical Agglomerative Clustering Algorithm

Agglomeration rules can change the results of the clustering:

  • Define the distance between clusters as the average of all pairwise distances between their members (average linkage)
  • Define it as the distance between the two closest data points (single linkage)
  • Define it as the distance between the two farthest data points (complete linkage)
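
These linkage rules map directly onto hclust()’s method argument, and the same distance matrix can give different trees. A small sketch on simulated data (object names are illustrative):

```r
set.seed(611)
pts <- matrix(rnorm(40), ncol = 2)  # 20 simulated points in 2-D
d <- dist(pts)                      # Euclidean distances by default

fit_single   <- hclust(d, method = "single")    # closest-pair linkage
fit_complete <- hclust(d, method = "complete")  # farthest-pair linkage
fit_average  <- hclust(d, method = "average")   # mean pairwise distance

# Cutting each tree at k = 3 can produce different groupings
table(cutree(fit_single, k = 3), cutree(fit_complete, k = 3))
```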

17 of 22

Expectation Maximization algorithm

(Probabilistic version of K-means)

  1. Randomly initialize K probability distributions
  2. Calculate the probability of each data point having been generated by each distribution
  3. Assign each data point to its highest-likelihood generating distribution
  4. Re-calculate the distribution parameters based on cluster membership
  5. Repeat 2-4 until the distribution parameters and cluster memberships are stable
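
The mclust package fits Gaussian mixture models by EM. A minimal sketch, assuming mclust is installed (the data here is simulated for illustration):

```r
library(mclust)  # install.packages("mclust") if needed

set.seed(611)
pts <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 4), ncol = 2))

# G = 2 asks for a two-component Gaussian mixture
fit <- Mclust(pts, G = 2)

table(fit$classification)  # hard cluster assignments
head(fit$z)                # soft (posterior) membership probabilities
```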

18 of 22

Clustering in R

scripts/cluster_example.R

kmeans(data, centers = 3)

dist(data, method = "euclidean")

hclust(dist_mat, method = "average")

cutree(hfit, k = 3)

library(mclust)

Mclust(data, G = 3)

19 of 22

QUIZ

20 of 22

Evaluating the feedback

  • If everything worked as expected, you should have anonymous feedback pushed to the feedback folder in your homework repository
  • Take a couple of minutes to score the quality of the feedback and create a feedback evaluation file.

21 of 22

Clustering in-class practice

  • Download the police residence data set from FiveThirtyEight
    • Reminder: include na = "**" in read_csv()

  • Cluster the cities based on the variables “all”, “white”, and “non-white”
  • Use the within groups sum of squares to estimate a reasonable number of clusters for this data set
  • Cluster the cities using K-means, HAC, and EM algorithm
  • Plot the cities in 2D using the variables “white” and “non-white”, and color city points by their cluster assignments
  • Save the resulting script in the “class-examples” folder in the homework repo on GitHub

22 of 22

Homework 3 (due in one week)

  1. Instructions:
    1. Load the “nycflights13” package
    2. Join flights and airports into a single dataframe by joining flights.origin to airports.faa
    3. Keep only the flight, carrier, arr_delay, dep_delay, origin_lat, and origin_lon columns
    4. Cluster airline carriers based on dep_delay, arr_delay, latitude and longitude using k-means and place into 3 groups
    5. Plot results as a scatter plot by dep_delay and arr_delay colored by cluster assignment
    6. Plot results as a box plot with arr_delay as the y-axis and cluster assignment as the x-axis. Describe the three clusters in terms of the arr_delay variable.
    7. Which airlines contribute the most flights to each cluster? Are the results surprising? How would you change the analysis based on this result? Write your responses as comments in your script.
  2. Push the resulting script, titled “HW3_your_github_id.R”, to GitHub Classroom under the directory called “homework”
  3. Submit by 11:59 pm on the due date
