K-Means and Probability
Lecture 4
An introduction to unsupervised learning and a review of core concepts in probability
EECS 189/289, Fall 2025 @ UC Berkeley
Joseph E. Gonzalez and Narges Norouzi
Roadmap
K-means Clustering
Prof. Gonzalez’s Messy Biking Records
Professor Gonzalez has too many bikes
He has been recording the speed and length of his bike rides but not which bike he used. He would like to answer questions like:
Learning Problem
We have unlabeled data and we would like to divide the records into 4 groups (clusters) corresponding to the four bikes.
Let’s try k-means clustering.
Supervised Learning (labeled data):
Regression – quantitative label (e.g., stock prediction)
Classification – categorical label
Unsupervised Learning (unlabeled data):
Clustering
Dimensionality Reduction
Reinforcement Learning (reward, e.g., AlphaGo)
Demo
Using scikit-learn K-Means Clustering
Clustering With K-Means in Scikit-Learn
In the demo we wrote a few lines of code and obtained a reasonable clustering of the ride-times.
Today we will learn how this algorithm works.
We will review concepts in probability and explore more general density estimation techniques.
from sklearn.cluster import KMeans
# Create a KMeans model with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=42)
# Fit the model to the data
kmeans.fit(bikes[['Speed', 'Length']])
# Predict the cluster assignments
bikes['c'] = kmeans.predict(bikes[['Speed', 'Length']])
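After fitting, the model exposes the learned centers and the within-cluster sum of squares. Here is a minimal self-contained sketch, using synthetic data in place of the actual ride records (the `bikes` DataFrame lives in the demo and is not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the (Speed, Length) ride records (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(5, 1), (10, 2), (15, 1), (20, 3)]])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)  # one (speed, length) center per cluster
print(kmeans.inertia_)          # sum of squared distances to nearest center
```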
Lloyd’s Algorithm
K-Means Clustering
K-Means Clustering (Lloyd’s) Algorithm
Why is this the mean of each cluster?
Updating the Cluster Centers
J = Σₖ Σ_{i ∈ Cₖ} ‖xᵢ − μₖ‖²  (outer sum over clusters; inner sum over points in cluster k)
Minimizing the Transformed Objective
Setting the derivative with respect to μₖ to zero gives the cluster mean (average point):
μₖ = (1 / |Cₖ|) Σ_{i ∈ Cₖ} xᵢ
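The two alternating steps can be sketched directly in NumPy. This is an illustrative implementation of Lloyd’s algorithm, not the one scikit-learn uses internally:

```python
import numpy as np

def lloyds(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):  # assignments stable: converged
            break
        centers = new_centers
    return centers, labels
```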
Is Lloyd's algorithm guaranteed to converge? Does it always produce an optimal clustering?
Convergence of K-Means
Demo
Animated K-Means Clustering & K-Means for Pixels
See live animation in the demo.
Illustration of Steps (for PDF version)
Steps 0–9
Choosing the Number of Clusters
The “Elbow”
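A common heuristic: fit k-means for a range of k and plot the inertia (within-cluster sum of squared distances); the “elbow” is where adding clusters stops paying off. A sketch on hypothetical data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data with 4 well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5), (5, 0)]])

# Inertia for each candidate k; look for the bend when plotted against k.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)]
```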
Interpreting the Clusters
We used k-means to compute a cluster assignment for each ride.
Which bike is each cluster?
Does each cluster represent a bike?
Use caution when interpreting clusters!
Pixel K-Means
Pixel K-Means
The pixels in an image can be treated as vectors in an RGB vector space.
Pixels plotted in RGB vector space.
Flatten the (height × width × 3) image into a (width · height) × 3 matrix of RGB vectors.
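The flattening step can be sketched as follows, using a random array as a hypothetical stand-in for a real image:

```python
import numpy as np
from sklearn.cluster import KMeans

# A tiny synthetic "image": height x width x 3 RGB values (hypothetical data).
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Flatten to (width * height, 3): one RGB vector per pixel.
pixels = img.reshape(-1, 3).astype(float)

# Cluster pixels into 4 representative colors and rebuild a quantized image.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)
quantized = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)
```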
Demo
Visualizing K-Means Clustering on Pixels
Hard Cluster Assignments
K-means assigns each data point to exactly one cluster.
Do all the points belong in exactly one cluster?
Should these points be red or blue?
Uncertainty
ML is ultimately about inference – making predictions.
Source of Uncertainty:
Need a framework for uncertainty – Probability!
Probability (Review)
Probability
Probability provides a framework for quantifying and manipulating uncertainty.
The probability of an event is:
Which one is correct?
How do you interpret probabilities?
Basics of Probability
A brief review of the basics of probability.
The Joint Probability Distribution
| P(X, Y) | Y = y₁ | Y = y₂ | P(X) |
| X = x₁ | 0.20 | 0.10 | 0.30 |
| X = x₂ | 0.10 | 0.10 | 0.20 |
| X = x₃ | 0.15 | 0.35 | 0.50 |
| P(Y) | 0.45 | 0.55 | 1 |
The joint probability satisfies the following two properties:
Non-negativity: P(X = x, Y = y) ≥ 0
Normalization: Σ_x Σ_y P(X = x, Y = y) = 1
The Joint Probability Distribution
| P(X, Y) | Y = y₁ | Y = y₂ | P(X) |
| X = x₁ | 0.20 | 0.10 | 0.30 |
| X = x₂ | 0.10 | 0.10 | 0.20 |
| X = x₃ | 0.15 | 0.35 | 0.50 |
| P(Y) | 0.45 | 0.55 | 1 |
The Sum Rule (Marginalization) defines the distribution over a subset of the random variables: P(X = x) = Σ_y P(X = x, Y = y).
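The sum rule applied to the table above, sketched with NumPy:

```python
import numpy as np

# Joint distribution from the table: rows index X, columns index Y.
joint = np.array([[0.20, 0.10],
                  [0.10, 0.10],
                  [0.15, 0.35]])

p_x = joint.sum(axis=1)  # marginalize out Y: P(X) = [0.3, 0.2, 0.5]
p_y = joint.sum(axis=0)  # marginalize out X: P(Y) = [0.45, 0.55]
```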
Conditional Probability
| P(X, Y) | Y = y₁ | Y = y₂ | P(X) |
| X = x₁ | 0.20 | 0.10 | 0.30 |
| X = x₂ | 0.10 | 0.10 | 0.20 |
| X = x₃ | 0.15 | 0.35 | 0.50 |
| P(Y) | 0.45 | 0.55 | 1 |
Conditioning on X = x₃ divides that row by its marginal P(X = x₃) = 0.5:
P(Y | X = x₃) = (0.15, 0.35) / 0.5 = (0.30, 0.70)
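The conditional distribution can be checked numerically (same joint distribution as above):

```python
import numpy as np

joint = np.array([[0.20, 0.10],
                  [0.10, 0.10],
                  [0.15, 0.35]])

# Condition on the third value of X: divide its row by the marginal P(X = x3) = 0.5.
p_y_given_x3 = joint[2] / joint[2].sum()  # P(Y | X = x3) = [0.3, 0.7]
```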
Empirical Probability Distributions
The Product Rule of Probability
Independent Random Variables
Wake Word Example
Example: Wake Words
A wake word is a verbal cue that triggers �voice assistants to start actively listening.
Example: “Alexa, set an alarm for 6:00AM”
Most voice assistants continuously run a wake word detector model on every sound they hear.
Over a full day of audio, wake word events are rare.
Streamed to the cloud for processing.
1041260
Analyzing the Wake Word Detector
Bayes’ Theorem
Bayes’ Theorem: P(A | B) = P(B | A) P(A) / P(B)
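Plugging hypothetical detector rates into Bayes’ theorem (the lecture’s actual numbers are not preserved in this text) shows how a rare prior dominates the posterior:

```python
# Hypothetical rates -- assumptions, not the lecture's actual figures.
p_wake = 1e-5   # prior: any given audio frame contains the wake word
recall = 0.99   # P(detect | wake)
fpr = 0.01      # P(detect | no wake), the false positive rate

# Law of total probability, then Bayes' theorem.
p_detect = recall * p_wake + fpr * (1 - p_wake)
p_wake_given_detect = recall * p_wake / p_detect

# Even with a good detector, most detections are false alarms,
# because wake words are so rare.
```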
Analyzing the Wake Word Detector
How could we improve the Wake Word Detector?
Demo
Analysis of Recall and False Positive Rates
Bayesian Updates: Wake Word Detector
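Repeated detections can be folded in sequentially: each positive detection re-applies Bayes’ theorem with the previous posterior as the new prior. A sketch using the same hypothetical rates as before:

```python
def bayes_update(prior, p_pos_given_wake, p_pos_given_no_wake):
    """Posterior P(wake) after observing one more positive detection."""
    num = p_pos_given_wake * prior
    return num / (num + p_pos_given_no_wake * (1 - prior))

# Hypothetical rates (assumptions, not the lecture's figures).
p = 1e-5                       # prior probability of a wake word
for _ in range(3):             # three consecutive positive detections
    p = bayes_update(p, 0.99, 0.01)
# p is now large: repeated evidence overwhelms the rare prior.
```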
K-Means and Probability
Lecture 4
Credit: Joseph E. Gonzalez and Narges Norouzi
Reference Book Chapters: