Meme of the Day (Katherine Comito)
CSE 30124: Unsupervised Learning and Clustering
Margin Classifiers (review)
Support Vector Machines (review)
Supervised Learning
Supervised learning is when we have ___________ for all of our sample vectors.
So we learn by comparing our __________ output to our __________ output and trying to bring the two closer together, typically by minimizing a cost function (in parametric models).
Unsupervised Learning
With unsupervised learning, we don’t have ground-truth labels.
Instead we ______________ by __________________________________
The Goal of Unsupervised Learning
With unsupervised learning we want to discover _______________ in our data to help answer questions such as:
Clustering
The most common unsupervised learning task is clustering.
clustering
Using clustering
k-means clustering
k-means is the “most common” clustering algorithm, but it relies on some assumptions…
But because it makes so many assumptions, it always comes up with an answer!
Visualizing k-means
The k-means algorithm
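A minimal sketch of the usual k-means iteration (often called Lloyd's algorithm), assuming Euclidean distance and random initialization from the samples. The function name `kmeans`, its parameters, and the toy data are illustrative choices, not taken from the slides.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct samples chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each sample joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its samples
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs (made-up data for illustration).
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
```

Because the starting centroids are random, different seeds can give different answers on harder data, which is one reason to sanity check the result.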
k-means clustering (exercise)
Draw clusters for the following sample points
[Figure: panels labeled "input samples", "k = 2", and "k = 3"]
Evaluating unsupervised learning
Because k-means is unsupervised, we don’t have __________________
This makes it tricky to judge the “goodness” of our discovered solution
Always sanity check unsupervised solutions yourself
Elbow Plots (reading)
[Figure: elbow plot of Sum of Squared Distances (y-axis) against Number of clusters (k) (x-axis)]
“There is no hard and fast rule here, as it’s often up to the discretion of the data scientist”
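The numbers behind an elbow plot can be computed by fitting k-means for a range of k and recording the sum of squared distances each time. This is a sketch assuming scikit-learn, whose `KMeans` exposes that sum as `inertia_`; the three synthetic blobs are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((30, 2)) for c in centers])

ks = list(range(1, 7))
ssd = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances to the nearest centroid.
    ssd.append(model.inertia_)

for k, s in zip(ks, ssd):
    print(f"k={k}: SSD={s:.1f}")
```

Plotting `ssd` against `ks` gives the elbow plot. On data like this the curve should flatten after k = 3, but on real data the bend is often much less obvious.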
k-means
Pros:
Cons:
k-means in code
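A minimal sketch of k-means in code, assuming scikit-learn's `KMeans` estimator is the tool being used. The toy samples and the choice of `n_clusters=2` are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six toy samples forming two obvious groups.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

# n_init restarts the algorithm from several random initialisations
# and keeps the best run; random_state makes the result repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)          # cluster index for each sample
centers = kmeans.cluster_centers_       # one centroid per cluster

# New samples are assigned to the nearest learned centroid.
new_labels = kmeans.predict(np.array([[0.0, 0.0], [10.0, 10.0]]))
```

The cluster indices are arbitrary: which group is called 0 and which is called 1 carries no meaning.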
hierarchical clustering
Hierarchical Clustering
In hierarchical clustering we build a ____________, a tree-like representation of our samples
Getting clusters from a dendrogram
Any points joined below the dividing line _________________________
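Cutting the tree at a chosen height can be sketched with SciPy, assuming `scipy.cluster.hierarchy` is available. The five 1-D samples stand in for points A through E, and the cut heights 3.0 and 5.0 are arbitrary choices for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five made-up 1-D samples standing in for points A..E.
X = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])

# Build the merge tree bottom-up (single linkage chosen arbitrarily here).
Z = linkage(X, method="single")

# Draw the dividing line at height 3.0: merges above it are undone.
labels_low = fcluster(Z, t=3.0, criterion="distance")

# A higher dividing line leaves fewer, larger clusters.
labels_high = fcluster(Z, t=5.0, criterion="distance")
```

Moving the dividing line up or down changes the number of clusters without rebuilding the tree.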
Methods for Building Dendrograms
Agglomerative:
“Start small, end big”
“bottom up”
Divisive:
“Start big, end small”
“top-down”
Agglomerative Hierarchical Clustering
Agglomerative Dendrogram (exercise)
[Figure: sample points A, B, C, D, E, with dendrogram axes labeled A B C D E]
Agglomerative Dendrogram (exam question)
Closest Clusters
Single Linkage is the minimum distance between any two elements, one from each cluster
Complete Linkage is the maximum distance between any two elements, one from each cluster
Closest Clusters
Average Linkage is the average of the distances over all pairs of nodes, one from each cluster
Centroid Method computes the distance between the centroids of the clusters
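The four cluster-to-cluster distances can be compared on a tiny example. This is a sketch assuming Euclidean distance; the clusters `A` and `B` are made-up points.

```python
import numpy as np

# Two made-up clusters of 2-D points.
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[4.0, 0.0], [4.0, 3.0]])

# All pairwise Euclidean distances between one element of A and one of B.
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = pairwise.min()      # closest cross-cluster pair
complete = pairwise.max()    # farthest cross-cluster pair
average = pairwise.mean()    # mean over all cross-cluster pairs
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # distance between centroids

print(single, complete, average, centroid)
```

The four values differ for the same pair of clusters, so the choice of linkage can change which clusters are merged next.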
sklearn hierarchical clustering
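A minimal sketch of agglomerative clustering in scikit-learn, assuming its `AgglomerativeClustering` estimator and the `linkage` options it documents. The toy samples and `n_clusters=2` are made up for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Six toy samples forming two obvious groups.
X = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.6],
              [6.0, 6.0], [6.3, 5.8], [5.9, 6.4]])

# Fit the same data with each linkage criterion and keep the labels.
results = {}
for method in ["single", "complete", "average", "ward"]:
    model = AgglomerativeClustering(n_clusters=2, linkage=method)
    results[method] = model.fit_predict(X)

for method, labels in results.items():
    print(method, labels)
```

On data this well separated every linkage should agree. The criteria tend to disagree on elongated, noisy, or unevenly sized clusters.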
Closest Clusters - Ward
The Ward Method merges the pair of clusters whose merger gives the smallest increase in the total within-cluster sum of squared distances to the cluster centroids
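Ward's criterion can be checked numerically: the cost of a merge is how much the within-cluster sum of squared distances grows. This is a sketch with hypothetical helper names (`sse`, `ward_cost`, `ward_cost_closed_form`) and made-up clusters, and it also evaluates the standard closed form based on cluster sizes and centroids.

```python
import numpy as np

def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

def ward_cost(A, B):
    """Increase in total within-cluster SSE caused by merging A and B."""
    return sse(np.vstack([A, B])) - sse(A) - sse(B)

def ward_cost_closed_form(A, B):
    """Closed form: |A||B| / (|A|+|B|) * ||centroid_A - centroid_B||^2."""
    diff = A.mean(axis=0) - B.mean(axis=0)
    return len(A) * len(B) / (len(A) + len(B)) * float(diff @ diff)

# Three made-up clusters; C is a single point close to A.
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[4.0, 0.0], [4.0, 3.0]])
C = np.array([[1.0, 1.0]])

cost_ab = ward_cost(A, B)
cost_ac = ward_cost(A, C)
print(cost_ab, ward_cost_closed_form(A, B), cost_ac)
```

Ward's method would merge A with C before A with B here, because that merge increases the total squared distance the least.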
Hierarchical Clustering
Pros:
Cons:
Various Clustering Algorithms
Clustering for Image Segmentation (thought exercise)
How could you use clustering to segment this image?
Image segmentation is the process of separating the pixels in an image into different objects or regions
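One possible approach, sketched under assumptions: treat every pixel as a sample whose features are its color channels, cluster the pixels with k-means, and read the cluster labels back as a segment map. The synthetic two-region image, the use of scikit-learn's `KMeans`, and k = 2 are all illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 20x20 RGB image: bluish left half, yellowish right half, plus noise.
rng = np.random.default_rng(0)
image = np.zeros((20, 20, 3))
image[:, :10] = [0.1, 0.1, 0.4]
image[:, 10:] = [0.9, 0.8, 0.2]
image += 0.02 * rng.standard_normal(image.shape)

# Treat every pixel as a sample whose features are its colour channels.
pixels = image.reshape(-1, 3)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixels)

# Reshape the cluster labels back into the image grid to get a segment map.
segmentation = labels.reshape(image.shape[:2])
```

Clustering on color alone ignores where a pixel sits in the image. Adding the row and column as extra features is one way to encourage spatially connected regions.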
Evaluation of Image Segmentation
To measure how well our segmentation method worked, we need to compare it to some ground-truth.
We call the ground-truth for image segmentation a __________
Segmentation Evaluation Metric
A common metric for evaluating segmentation tasks is called the IoU - _______________ ______, or if you’re coming from stats, the Jaccard Index
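The Jaccard index of two binary masks is the number of pixels they share divided by the number of pixels covered by either. This is a minimal sketch: the function name `iou`, the toy masks, and the convention for two empty masks are assumptions made for illustration.

```python
import numpy as np

def iou(pred_mask, true_mask):
    """Jaccard index of two binary masks: |pred AND true| / |pred OR true|."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(np.logical_and(pred, true).sum() / union)

# Toy masks: the prediction is the ground truth shifted one column right.
true_mask = np.array([[1, 1, 0, 0],
                      [1, 1, 0, 0],
                      [0, 0, 0, 0]])
pred_mask = np.array([[0, 1, 1, 0],
                      [0, 1, 1, 0],
                      [0, 0, 0, 0]])

score = iou(pred_mask, true_mask)
```

Here the masks share 2 pixels and together cover 6, so the score is 2/6. A perfect segmentation scores 1 and a completely disjoint one scores 0.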
Looking ahead
After fall break we will look at how you can use unsupervised learning to visualize samples that have more than two features