1 of 41

K-Means Clustering, Probability & MLE

Discussion Mini Lecture 3

K-Means Clustering, Probability Theory & Maximum Likelihood Estimation

CS 189/289A, Fall 2025 @ UC Berkeley

Sara Pohland

2 of 41

Concepts Covered

  1. K-Means Clustering
  2. Probability Theory
    1. Probability Fundamentals
    2. Common Distributions
    3. Multiple Random Variables
  3. Maximum Likelihood Estimation (MLE)
  4. Convex Optimization

3 of 41

K-Means Clustering

  1. K-Means Clustering
  2. Probability Theory
    1. Probability Fundamentals
    2. Common Distributions
    3. Multiple Random Variables
  3. Maximum Likelihood Estimation (MLE)
  4. Convex Optimization

4 of 41

Goal of K-Means Clustering

  •  Partition a set of unlabeled data points into k clusters, grouping each point with the other points it is closest to.

[Figure: scatter plot of unlabeled data (left) and the same data partitioned into clusters (right)]

5 of 41

K-Means Clustering Problem

Given data points \(x_1, \dots, x_n \in \mathbb{R}^d\), choose clusters \(C_1, \dots, C_k\) and cluster means \(\mu_1, \dots, \mu_k\) to minimize the within-cluster sum of squared distances:

\[ \min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert_2^2 \]

where each cluster mean is the sum of the points in the cluster divided by the total number of points in the cluster:

\[ \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i \]

6 of 41

Lloyd’s Algorithm (“naïve k-means”)

Initialize the cluster means \(\mu_1, \dots, \mu_k\) (e.g., as k randomly chosen data points), then repeat the following two steps until the assignments stop changing:

1. Update Assignments: Assign each data point to the cluster with the nearest mean:

\[ c_i \leftarrow \arg\min_{j \in \{1, \dots, k\}} \lVert x_i - \mu_j \rVert_2^2 \]

2. Update Means: Recalculate means based on data points assigned to each cluster:

\[ \mu_j \leftarrow \frac{1}{|C_j|} \sum_{i \,:\, c_i = j} x_i \]

(A short Python sketch of these two steps appears below.)
You’ll show this algorithm converges in Problem 3 of Disc. 3.
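As a rough illustration (not the course's reference implementation), here is a minimal Python sketch of Lloyd's algorithm, assuming the data is a NumPy array X of shape (n, d) and the number of clusters k is given:

import numpy as np

def lloyds_algorithm(X, k, max_iters=100, seed=0):
    """Minimal sketch of Lloyd's algorithm (naive k-means)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialize the means as k distinct randomly chosen data points.
    means = X[rng.choice(n, size=k, replace=False)].astype(float)
    assignments = np.full(n, -1)
    for _ in range(max_iters):
        # Step 1 (update assignments): distance from every point to every mean,
        # then assign each point to its nearest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # assignments stopped changing, so the algorithm has converged
        assignments = new_assignments
        # Step 2 (update means): recompute each mean from its assigned points.
        for j in range(k):
            if np.any(assignments == j):
                means[j] = X[assignments == j].mean(axis=0)
    return means, assignments

For example, means, labels = lloyds_algorithm(np.random.randn(200, 2), k=3) clusters 200 random 2-D points into 3 groups.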

7 of 41

Random Variables & Statistical Measures

  1. K-Means Clustering
  2. Probability Theory
    1. Probability Fundamentals
    2. Common Distributions
    3. Multiple Random Variables
  3. Maximum Likelihood Estimation (MLE)
  4. Convex Optimization

8 of 41

Probability Mass Function (PMF)

  •  The probability mass function (PMF) of a discrete random variable \(X\) gives the probability that \(X\) takes each possible value:

\[ p_X(x) = P(X = x) \]

A valid PMF is nonnegative and sums to one over all possible values:

\[ p_X(x) \ge 0 \quad \text{and} \quad \sum_x p_X(x) = 1 \]
9 of 41

Probability Density Function (PDF)

  •  The probability density function (PDF) of a continuous random variable \(X\) describes the relative likelihood of each value; probabilities come from integrating the density:

\[ P(a \le X \le b) = \int_a^b f_X(x)\, dx \]

A valid PDF is nonnegative and integrates to one:

\[ f_X(x) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty} f_X(x)\, dx = 1 \]
10 of 41

Expectation of a Random Variable

Expected Value:

Discrete: \( \mathbb{E}[X] = \sum_x x \, p_X(x) \)

Continuous: \( \mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f_X(x)\, dx \)

Linearity of Expectation:

\[ \mathbb{E}[aX + bY + c] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y] + c \]

Law of the Unconscious Statistician:

\[ \mathbb{E}[g(X)] = \sum_x g(x)\, p_X(x) \;\; \text{(discrete)} \qquad \mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx \;\; \text{(continuous)} \]

11 of 41

Variance & Standard Deviation

Variance – measures the dispersion around the expected value

\[ \mathrm{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] \]

A Useful Property:

\[ \mathrm{Var}(X) = \mathbb{E}[X^2] - \big(\mathbb{E}[X]\big)^2 \]

Standard Deviation – another measure of dispersion/spread

\[ \sigma_X = \sqrt{\mathrm{Var}(X)} \]
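As an added illustration (not from the slides), a quick Python check for a fair six-sided die confirms that both expressions for the variance agree:

import numpy as np

# PMF of a fair six-sided die: values 1..6, each with probability 1/6.
values = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

mean = np.sum(values * pmf)               # E[X] = 3.5
second_moment = np.sum(values**2 * pmf)   # E[X^2], computed via LOTUS

var_definition = np.sum((values - mean) ** 2 * pmf)  # E[(X - E[X])^2]
var_property = second_moment - mean**2               # E[X^2] - E[X]^2

print(var_definition, var_property)  # both print 2.916..., i.e., 35/12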

 

12 of 41

Common Probability Distributions

  1. K-Means Clustering
  2. Probability Theory
    1. Probability Fundamentals
    2. Common Distributions
    3. Multiple Random Variables
  3. Maximum Likelihood Estimation (MLE)
  4. Convex Optimization

13 of 41

Bernoulli Random Variable

Bernoulli random variables are used to represent the outcome of a single binary experiment (e.g., one coin flip) with success probability \(p\):

\[ X \sim \mathrm{Bernoulli}(p) \]

PMF: \( P(X = 1) = p, \quad P(X = 0) = 1 - p \)

Expected Value: \( \mathbb{E}[X] = p \)

Variance: \( \mathrm{Var}(X) = p(1 - p) \)

14 of 41

Binomial Random Variable

Binomial random variables are used to represent the outcome of multiple binary (Bernoulli) experiments; specifically, the number of successes in \(n\) independent trials, each with success probability \(p\):

\[ X \sim \mathrm{Binomial}(n, p) \]

PMF: \( P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \dots, n \)

Expected Value: \( \mathbb{E}[X] = np \)

Variance: \( \mathrm{Var}(X) = np(1 - p) \)

You’ll work with Binomial RVs in Problem 2 of Disc. 3.
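As a quick added illustration of linearity of expectation (not on the original slide), the binomial mean follows from writing \(X\) as a sum of independent Bernoulli indicators:

\[ X = \sum_{i=1}^{n} X_i, \quad X_i \sim \mathrm{Bernoulli}(p) \quad \Longrightarrow \quad \mathbb{E}[X] = \sum_{i=1}^{n} \mathbb{E}[X_i] = np \]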

15 of 41

Exponential Random Variable

Exponential random variables are used to model the amount of time that passes before an event occurs, where events occur at rate \(\lambda > 0\):

\[ X \sim \mathrm{Exponential}(\lambda) \]

PDF: \( f_X(x) = \lambda e^{-\lambda x} \ \text{for } x \ge 0 \) (and \( f_X(x) = 0 \) for \( x < 0 \))

Expected Value: \( \mathbb{E}[X] = \dfrac{1}{\lambda} \)

Variance: \( \mathrm{Var}(X) = \dfrac{1}{\lambda^2} \)

16 of 41

Gaussian Random Variable

Gaussian (normal) random variables are used to represent many continuous real-world processes and are parameterized by a mean \(\mu\) and variance \(\sigma^2\):

\[ X \sim \mathcal{N}(\mu, \sigma^2) \]

PDF: \( f_X(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\dfrac{(x - \mu)^2}{2\sigma^2}\right) \)

Expected Value: \( \mathbb{E}[X] = \mu \)

Variance: \( \mathrm{Var}(X) = \sigma^2 \)

You’ll work with Gaussian RVs in Problem 1 of Disc. 3.
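As an added sanity check (not on the original slide), sampling from a Gaussian with NumPy and comparing the empirical mean and variance against \(\mu\) and \(\sigma^2\):

import numpy as np

mu, sigma = 2.0, 3.0  # illustrative parameter values
rng = np.random.default_rng(0)
samples = rng.normal(loc=mu, scale=sigma, size=100_000)

# The empirical estimates should be close to mu = 2 and sigma^2 = 9.
print(samples.mean())  # approximately 2.0
print(samples.var())   # approximately 9.0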

17 of 41

Standard Normal Random Variable

The standard normal random variable is the Gaussian random variable with zero mean and unit variance:

\[ Z \sim \mathcal{N}(0, 1) \]

PDF: \( f_Z(z) = \dfrac{1}{\sqrt{2\pi}} e^{-z^2 / 2} \)

Expected Value: \( \mathbb{E}[Z] = 0 \)

Variance: \( \mathrm{Var}(Z) = 1 \)

Any Gaussian \( X \sim \mathcal{N}(\mu, \sigma^2) \) can be standardized as \( Z = \frac{X - \mu}{\sigma} \).

18 of 41

Functions of Multiple Random Variables

  1. K-Means Clustering
  2. Probability Theory
    1. Probability Fundamentals
    2. Common Distributions
    3. Multiple Random Variables
  3. Maximum Likelihood Estimation (MLE)
  4. Convex Optimization

19 of 41

Joint Probability Mass Function (PMF)

  •  The joint PMF of two discrete random variables \(X\) and \(Y\) gives the probability of each pair of values:

\[ p_{X,Y}(x, y) = P(X = x, Y = y) \]

We can find the marginal PMFs from the joint PMF by summing over the other variable:

\[ p_X(x) = \sum_y p_{X,Y}(x, y) \qquad p_Y(y) = \sum_x p_{X,Y}(x, y) \]

20 of 41

Joint Probability Density Function (PDF)

  •  The joint PDF of two continuous random variables \(X\) and \(Y\) describes their joint density; probabilities come from integrating over a region:

\[ P\big((X, Y) \in A\big) = \iint_A f_{X,Y}(x, y)\, dx\, dy \]

We can find the marginal PDFs from the joint PDF by integrating out the other variable:

\[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy \qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx \]

21 of 41

Expected Value for Multiple RVs

  •  For a function \(g\) of two random variables \(X\) and \(Y\), the expected value is computed from the joint distribution:

Discrete: \( \mathbb{E}[g(X, Y)] = \sum_x \sum_y g(x, y)\, p_{X,Y}(x, y) \)

Continuous: \( \mathbb{E}[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y)\, f_{X,Y}(x, y)\, dx\, dy \)

In particular, expectation is linear regardless of dependence: \( \mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y] \).

22 of 41

Variance & Covariance for Multiple RVs

  •  The covariance measures how two random variables vary together:

\[ \mathrm{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] \]

The variance of a sum picks up a covariance term:

\[ \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y) \]

23 of 41

Independent Random Variables

Two random variables are independent if knowing information about one does not tell us information about the other. Formally, \(X\) and \(Y\) are independent if their joint distribution factorizes into the product of the marginals:

\[ p_{X,Y}(x, y) = p_X(x)\, p_Y(y) \;\; \text{(discrete)} \qquad f_{X,Y}(x, y) = f_X(x)\, f_Y(y) \;\; \text{(continuous)} \]

Useful consequences of independence:

\[ \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y] \qquad \mathrm{Cov}(X, Y) = 0 \qquad \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) \]

You’ll work with independent RVs in Problems 1-2 of Disc. 3.
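As a small added example, for two independent fair coin flips \(X\) and \(Y\) (each Bernoulli with \(p = 1/2\)), the joint PMF factorizes and the variances add:

\[ P(X = 1, Y = 1) = P(X = 1)\, P(Y = 1) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4} \qquad \mathrm{Var}(X + Y) = \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2} \]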

24 of 41

Independent and Identically Distributed

  •  A collection of random variables \(X_1, \dots, X_n\) is independent and identically distributed (IID) if the variables are mutually independent and each one has the same distribution.

For IID random variables, the joint distribution factorizes into a product of identical marginals:

\[ p_{X_1, \dots, X_n}(x_1, \dots, x_n) = \prod_{i=1}^{n} p_X(x_i) \]

This is the standard modeling assumption for a dataset of \(n\) samples drawn from the same distribution.

25 of 41

Conditional Probability Mass Function (PMF)

The conditional PMF tells us the probability of one discrete random variable given our observation of another and is defined as:

\[ p_{X \mid Y}(x \mid y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}, \quad \text{defined wherever } p_Y(y) > 0 \]

Equivalently, the joint PMF factors as \( p_{X,Y}(x, y) = p_{X \mid Y}(x \mid y)\, p_Y(y) \).

26 of 41

Conditional Probability Density Function (PDF)

The conditional PDF tells us the probability density of one continuous random variable given our observation of another and is defined as:

\[ f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}, \quad \text{defined wherever } f_Y(y) > 0 \]

Equivalently, the joint PDF factors as \( f_{X,Y}(x, y) = f_{X \mid Y}(x \mid y)\, f_Y(y) \).

You’ll get practice with conditional PDFs in Problem 1 of Disc. 3.

27 of 41

Maximum Likelihood Estimation (MLE)

  1. K-Means Clustering
  2. Probability Theory
    1. Probability Fundamentals
    2. Common Distributions
    3. Multiple Random Variables
  3. Maximum Likelihood Estimation (MLE)
  4. Convex Optimization

28 of 41

Maximum Likelihood Estimation (MLE)

  •  Maximum likelihood estimation (MLE) assumes the observed data \(x_1, \dots, x_n\) was drawn from a probability distribution with unknown parameter(s) \(\theta\), and estimates \(\theta\) as the value that makes the observed data most likely:

\[ \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \; \mathcal{L}(\theta; x_1, \dots, x_n) \]

29 of 41

What is the Assumed Distribution?

Usually, one of the common distributions listed in the probability theory section of these slides:

Distribution    Parameters
Bernoulli       success probability \(p\)
Binomial        number of trials \(n\), success probability \(p\)
Exponential     rate \(\lambda\)
Gaussian        mean \(\mu\), variance \(\sigma^2\)

30 of 41

How do we Estimate these Parameters?

Approach: Maximize the likelihood of our data under the assumed probability distribution.

Likelihood:* \( \mathcal{L}(\theta; x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i; \theta) \)

Optimization problem: \( \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \; \prod_{i=1}^{n} p(x_i; \theta) \)

* This assumes that our data is independent and identically distributed (IID) – a very common though rarely true assumption.

Problems 1 & 2 on Disc. 3 cover examples of MLE.
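As an added illustration (with made-up data, not from the slides), the Gaussian MLE has a closed form: the sample mean and the (biased, divide-by-n) sample variance maximize the likelihood. A quick NumPy sketch:

import numpy as np

rng = np.random.default_rng(189)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)  # pretend this is our observed dataset

# MLE for a Gaussian: sample mean and sample variance (dividing by n, not n - 1).
mu_hat = data.mean()
sigma2_hat = np.mean((data - mu_hat) ** 2)

print(mu_hat, sigma2_hat)  # close to the true mu = 5 and sigma^2 = 4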

31 of 41

Introduction to Convex Optimization

  1. K-Means Clustering
  2. Probability Theory
    1. Probability Fundamentals
    2. Common Distributions
    3. Multiple Random Variables
  3. Maximum Likelihood Estimation (MLE)
  4. Convex Optimization

32 of 41

Two Types of Problems

 

Minimization of a convex function:

\[ \min_{x} \; f(x) \quad \text{where } f \text{ is a convex function} \]

Maximization of a concave function:

\[ \max_{x} \; f(x) \quad \text{where } f \text{ is a concave function} \]

33 of 41

Convex and Concave Functions

Convex Functions

A function \(f\) is convex if the line segment between any two points on its graph lies on or above the graph:

\[ f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2) \quad \text{for all } x_1, x_2 \text{ and } \lambda \in [0, 1] \]

If \(f\) is twice differentiable, this is equivalent to \(f''(x) \ge 0\) everywhere (the function curves upward, like a bowl).

Concave Functions

A function \(f\) is concave if the line segment between any two points on its graph lies on or below the graph:

\[ f(\lambda x_1 + (1 - \lambda) x_2) \ge \lambda f(x_1) + (1 - \lambda) f(x_2) \quad \text{for all } x_1, x_2 \text{ and } \lambda \in [0, 1] \]

If \(f\) is twice differentiable, this is equivalent to \(f''(x) \le 0\) everywhere; \(f\) is concave exactly when \(-f\) is convex.

[Figure: example plots of convex (upward-curving) and concave (downward-curving) functions]

34 of 41

Common Convex/Concave Functions

Name           Form                                   Property
Affine         \( f(x) = ax + b \)                    both convex and concave
Quadratic      \( f(x) = ax^2 + bx + c \)             convex if \(a \ge 0\), concave if \(a \le 0\)
Power          \( f(x) = x^a \) on \(x > 0\)          convex if \(a \ge 1\) or \(a \le 0\), concave if \(0 \le a \le 1\)
Exponential    \( f(x) = e^{ax} \)                    convex
Logarithmic    \( f(x) = \log x \) on \(x > 0\)       concave
Entropy        \( f(x) = -x \log x \) on \(x > 0\)    concave

35 of 41

(Unconstrained) Convex Optimization Problems

 

Minimization of a convex function:

\[ \min_{x} \; f(x) \quad \text{where } f \text{ is a convex function} \]

Maximization of a concave function:

\[ \max_{x} \; f(x) \quad \text{where } f \text{ is a concave function} \]

For unconstrained problems of this form, any point where the derivative is zero is a global optimum.

36 of 41

Solving Convex Optimization Problems

  1. Take the derivative of the objective w.r.t. the optimization variable:

\[ \frac{df}{dx}(x) \]

  2. Set the derivative equal to zero and solve for the optimal solution:

\[ \frac{df}{dx}(x^*) = 0 \;\;\Rightarrow\;\; x^* \]

  3. (Optionally) Compute the optimal value by plugging the optimal solution into the objective function:

\[ p^* = f(x^*) \]

(A small worked example appears below.)
You’ll practice this in Problems 1 & 2 of Disc. 3.
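As an added worked example (not from the slides), minimizing the convex quadratic \( f(x) = (x - 3)^2 + 1 \):

\[ \frac{df}{dx}(x) = 2(x - 3) \qquad 2(x^* - 3) = 0 \;\Rightarrow\; x^* = 3 \qquad p^* = f(3) = 1 \]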

37 of 41

Equivalent Optimization Problems

These problems have the same optimal solution! (And the optimal value of one is the negation of the other.)

Minimization of a convex function:

\[ \min_{x} \; f(x), \quad f \text{ convex} \]

Maximization of a concave function:

\[ \max_{x} \; -f(x), \quad -f \text{ concave} \]

Both problems are solved by the same \(x^*\), and \( \min_x f(x) = -\max_x \big(-f(x)\big) \).

38 of 41

Equivalent Optimization Problems

These problems have the same optimal solution!

Monotonically increasing function: \(g\) is monotonically increasing if \(a < b\) implies \(g(a) < g(b)\), so applying \(g\) to the objective preserves the ordering of objective values.

[Figure: plot of a monotonically increasing function]

Equivalent problems via monotone objective: if \(g\) is monotonically increasing, then

\[ \arg\max_{x} \; f(x) = \arg\max_{x} \; g\big(f(x)\big) \]

(and likewise for minimization), so the two problems share the same optimal solution.

Remember this for Problem 1 of Disc. 3!

39 of 41

Equivalent MLE Problems

Because the natural log is a monotonically increasing function,

\[ \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \; \prod_{i=1}^{n} p(x_i; \theta) = \arg\max_{\theta} \; \log \prod_{i=1}^{n} p(x_i; \theta) = \arg\max_{\theta} \; \sum_{i=1}^{n} \log p(x_i; \theta) \]

We very often will solve this second optimization problem!

Remember this for Problem 2 of Disc. 3!
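As an added worked example (not from the slides), for IID Bernoulli data \(x_1, \dots, x_n \in \{0, 1\}\) the log-likelihood is much easier to maximize than the likelihood itself:

\[ \ell(p) = \sum_{i=1}^{n} \Big[ x_i \log p + (1 - x_i) \log(1 - p) \Big] \qquad \frac{d\ell}{dp} = \frac{\sum_i x_i}{p} - \frac{n - \sum_i x_i}{1 - p} = 0 \;\Rightarrow\; \hat{p}_{\mathrm{MLE}} = \frac{1}{n} \sum_{i=1}^{n} x_i \]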

40 of 41

K-Means Clustering, Probability & MLE

Discussion Mini Lecture 3

Contributors: Sara Pohland

41 of 41

Additional Resources

  1. K-Means Clustering
    1. Deep Learning Foundations and Concepts – Chapter 15.1
    2. Sara’s notes on Machine Learning – Section 15.2
  2. Probability Theory
  3. Maximum Likelihood Estimation
  4. Convex Optimization
    • Sara’s notes on Convex Optimization – Sections 2-3