1 of 22

Modeling: Clustering

BIOS 611: Introduction to Data Science

Instructor: Matt Biggs

2 of 22

Overview of today

  • Logistics:
    • Homework 2 due today!
    • Draft of Project 1 due next class
  • Feedback test run
  • Clustering

BIOS 611

3 of 22

Feedback Test Run

4 of 22

Review

  • Wrangling
  • Agile data science

5 of 22

Objectives

  • We will have tested the feedback anonymization system
  • Students will be able to articulate the purpose of clustering
  • Students will be familiar with common clustering algorithms
  • Students will perform clustering operations in R

6 of 22

Discovering categories

Motivating example:

Suppose you are provided with gene expression data for tumor explants from 100 patients. How many types of cancer are in this culture collection? Which types have the most representatives? What are the characteristics of each type?

How would you answer these questions?

7 of 22

Discovering categories

Motivating example:

Suppose you are provided with gene expression data for tumor explants from 100 patients. How many types of cancer are in this culture collection? Which types have the most representatives? What are the characteristics of each type?

Clustering your data

Grouping, categorizing, finding patterns of similarity between observations.

Clustering is an “unsupervised” machine learning technique.

Example clustering algorithms (there are many):

k-means, hierarchical clustering, expectation maximization (EM)

8 of 22

What makes a good cluster?

Ideally:

  • Observations within a cluster are very similar to each other
  • Different clusters are well separated from each other

9 of 22

K-Means Algorithm

  1. Randomly place K “centroids”
  2. Assign every data point to its nearest centroid
  3. Calculate a new centroid position for each “cluster” (the mean of its points)
  4. Repeat 2-3 until the centroids are stable
  5. Repeat 1-4 many times with different random centroid starting positions, and assign points to clusters based on the consensus across runs
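
The steps above are what R’s kmeans() implements; its nstart argument handles step 5 by rerunning from many random starts and keeping the solution with the lowest within-group sum of squares (a common stand-in for averaging runs). A minimal sketch on simulated data (all object names here are illustrative):

```r
# Simulated 2-D data: two well-separated groups (for illustration only)
set.seed(611)
pts <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 5), ncol = 2))

# nstart = 25 repeats the algorithm from 25 random centroid starts
# and keeps the best result (lowest total within-cluster sum of squares)
fit <- kmeans(pts, centers = 2, nstart = 25)

table(fit$cluster)   # cluster sizes
fit$centers          # final centroid positions
```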

10 of 22

Metrics available in R’s dist() function:

Euclidean - sqrt( sum_i (x_i - y_i)^2 )

Maximum - max_i |x_i - y_i|

Manhattan - sum_i |x_i - y_i|

Canberra - sum_i |x_i - y_i| / (|x_i| + |y_i|)

Binary (aka Jaccard distance) - 1 - |X ∩ Y| / |X ∪ Y|

Minkowski - ( sum_i |x_i - y_i|^p )^(1/p)
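
These metrics are selected through the method argument of dist(). A quick check on two points, where the answers are easy to verify by hand:

```r
# Two points in 2-D: (0, 0) and (3, 4)
x <- rbind(c(0, 0), c(3, 4))

dist(x, method = "euclidean")         # sqrt(3^2 + 4^2) = 5
dist(x, method = "manhattan")         # |3| + |4| = 7
dist(x, method = "maximum")           # max(3, 4) = 4
dist(x, method = "minkowski", p = 3)  # (3^3 + 4^3)^(1/3)
```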

11 of 22

Euclidean - sqrt( sum_i (x_i - y_i)^2 )

12 of 22

Manhattan - sum_i |x_i - y_i|

13 of 22

Binary (aka Jaccard Distance) - (|X ∪ Y| - |X ∩ Y|) / |X ∪ Y|

14 of 22

How many clusters?

There are many ways to estimate how many clusters to group your data into.

Today you’ll use an “elbow plot” of the total within-group sum of squares.
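
One common way to build such an elbow plot with kmeans() (a sketch; mydata is placeholder simulated data standing in for your own matrix):

```r
set.seed(611)
mydata <- matrix(rnorm(200), ncol = 2)  # placeholder data

# Total within-group sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(mydata, centers = k, nstart = 10)$tot.withinss
})

plot(1:10, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-group sum of squares")
# Look for the "elbow": the k where the curve stops dropping sharply
```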

15 of 22

Hierarchical Agglomerative Clustering Algorithm

  1. Measure the distances between all data points
  2. Join the closest points together into a cluster, and identify the cluster centroid (i.e., its average position)
  3. Repeat with the next closest pair of points or clusters, and so on, until everything is joined into a single tree
  4. To assign clusters, cut the tree at the appropriate level

16 of 22

Hierarchical Agglomerative Clustering Algorithm

Agglomeration rules can change the results of the clustering:

  • Define the distance between clusters as the average of all pairwise distances between their members (average linkage)
  • Define it as the distance between the two closest data points (single linkage)
  • Define it as the distance between the two farthest data points (complete linkage)
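
These linkage rules map directly onto hclust()’s method argument, and the same distance matrix can give different trees. A small sketch on simulated data (object names are illustrative):

```r
set.seed(611)
pts <- matrix(rnorm(40), ncol = 2)  # 20 simulated points in 2-D
d <- dist(pts)                      # Euclidean distances by default

fit_single   <- hclust(d, method = "single")    # closest-pair linkage
fit_complete <- hclust(d, method = "complete")  # farthest-pair linkage
fit_average  <- hclust(d, method = "average")   # mean pairwise distance

# Cutting each tree at k = 3 can produce different groupings
table(cutree(fit_single, k = 3), cutree(fit_complete, k = 3))
```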

17 of 22

Expectation Maximization algorithm

(Probabilistic version of K-means)

  1. Randomly initialize K probability distributions
  2. Calculate the probability of each data point having been generated by each distribution
  3. Assign each data point to its highest-likelihood generating distribution
  4. Re-calculate the distribution parameters based on cluster membership
  5. Repeat 2-4 until the distribution parameters and cluster memberships are stable
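
The mclust package fits Gaussian mixture models by EM. A minimal sketch, assuming mclust is installed (the data here is simulated for illustration):

```r
library(mclust)  # install.packages("mclust") if needed

set.seed(611)
pts <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 4), ncol = 2))

# G = 2 asks for a two-component Gaussian mixture
fit <- Mclust(pts, G = 2)

table(fit$classification)  # hard cluster assignments
head(fit$z)                # soft (posterior) membership probabilities
```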

18 of 22

Clustering in R

scripts/cluster_example.R

kmeans(data, centers = 3)

dist(data, method = "euclidean")

hclust(dist_mat, method = "average")

cutree(hfit, k = 3)

library(mclust)

Mclust(data, G = 3)

19 of 22

QUIZ

20 of 22

Evaluating the feedback

  • If everything worked as expected, you should have anonymous feedback pushed to the feedback folder in your homework repository
  • Take a couple of minutes to score the quality of the feedback and create a feedback evaluation file.

21 of 22

Clustering in-class practice

  • Download the police residence data set from FiveThirtyEight
    • Reminder: include na = "**" in read_csv()

  • Cluster the cities based on the variables “all”, “white”, and “non-white”
  • Use the within groups sum of squares to estimate a reasonable number of clusters for this data set
  • Cluster the cities using K-means, HAC, and EM algorithm
  • Plot the cities in 2D using the variables “white” and “non-white”, and color city points by their cluster assignments
  • Save the resulting script in the “class-examples” folder in the homework repo on GitHub

22 of 22

Homework 3 (due in one week)

  1. Instructions:
    1. Load the “nycflights13” package
    2. Join flights and airports into a single dataframe by joining flights.origin to airports.faa
    3. Keep only the flight, carrier, arr_delay, dep_delay, origin_lat, and origin_lon columns
    4. Cluster airline carriers based on dep_delay, arr_delay, latitude and longitude using k-means and place into 3 groups
    5. Plot results as a scatter plot by dep_delay and arr_delay colored by cluster assignment
    6. Plot results as a box plot with arr_delay as the y-axis and cluster assignment as the x-axis. Describe the three clusters in terms of the arr_delay variable.
    7. Which airlines contribute the most flights to each cluster? Are the results surprising? How would you change the analysis based on this result? Write your responses as comments in your script.
  2. Push the resulting script, titled “HW3_your_github_id.R”, to GitHub Classroom under the directory called “homework”
  3. Submit by 11:59 pm on the due date
