1 of 41

K-Means Clustering, Probability & MLE

Discussion Mini Lecture 3

K-Means Clustering, Probability Theory & Maximum Likelihood Estimation

CS 189/289A, Fall 2025 @ UC Berkeley

Sara Pohland

2 of 41

Concepts Covered

  1. K-Means Clustering
  2. Probability Theory
    1. Probability Fundamentals
    2. Common Distributions
    3. Multiple Random Variables
  3. Maximum Likelihood Estimation (MLE)
  4. Convex Optimization

3 of 41

K-Means Clustering

  1. K-Means Clustering
  2. Probability Theory
    1. Probability Fundamentals
    2. Common Distributions
    3. Multiple Random Variables
  3. Maximum Likelihood Estimation (MLE)
  4. Convex Optimization

4 of 41

Goal of K-Means Clustering

  •  Partition a set of unlabeled data points into k clusters, grouping each point with the other points it is closest to.

[Figure: scatter plot of unlabeled data (left) and the same data partitioned into clusters (right)]

5 of 41

K-Means Clustering Problem

Given data points \(x_1, \dots, x_n \in \mathbb{R}^d\), choose clusters \(C_1, \dots, C_k\) and cluster means \(\mu_1, \dots, \mu_k\) to minimize the within-cluster sum of squared distances:

\[ \min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert_2^2 \]

where each cluster mean is the sum of the points in the cluster divided by the total number of points in the cluster:

\[ \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i \]

6 of 41

Lloyd’s Algorithm (“naïve k-means”)

Initialize the cluster means \(\mu_1, \dots, \mu_k\) (e.g., as k randomly chosen data points), then repeat the following two steps until the assignments stop changing:

1. Update Assignments: Assign each data point to the cluster with the nearest mean:

\[ c_i \leftarrow \arg\min_{j \in \{1, \dots, k\}} \lVert x_i - \mu_j \rVert_2^2 \]

2. Update Means: Recalculate means based on data points assigned to each cluster:

\[ \mu_j \leftarrow \frac{1}{|C_j|} \sum_{i \,:\, c_i = j} x_i \]

(A short Python sketch of these two steps appears below.)
You’ll show this algorithm converges in Problem 3 of Disc. 3.
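As a rough illustration (not the course's reference implementation), here is a minimal Python sketch of Lloyd's algorithm, assuming the data is a NumPy array X of shape (n, d) and the number of clusters k is given:

import numpy as np

def lloyds_algorithm(X, k, max_iters=100, seed=0):
    """Minimal sketch of Lloyd's algorithm (naive k-means)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialize the means as k distinct randomly chosen data points.
    means = X[rng.choice(n, size=k, replace=False)].astype(float)
    assignments = np.full(n, -1)
    for _ in range(max_iters):
        # Step 1 (update assignments): distance from every point to every mean,
        # then assign each point to its nearest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # assignments stopped changing, so the algorithm has converged
        assignments = new_assignments
        # Step 2 (update means): recompute each mean from its assigned points.
        for j in range(k):
            if np.any(assignments == j):
                means[j] = X[assignments == j].mean(axis=0)
    return means, assignments

For example, means, labels = lloyds_algorithm(np.random.randn(200, 2), k=3) clusters 200 random 2-D points into 3 groups.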

7 of 41

Random Variables & Statistical Measures

  1. K-Means Clustering
  2. Probability Theory
    1. Probability Fundamentals
    2. Common Distributions
    3. Multiple Random Variables
  3. Maximum Likelihood Estimation (MLE)
  4. Convex Optimization

8 of 41

Probability Mass Function (PMF)

  •  The probability mass function (PMF) of a discrete random variable \(X\) gives the probability that \(X\) takes each possible value:

\[ p_X(x) = P(X = x) \]

A valid PMF is nonnegative and sums to one over all possible values:

\[ p_X(x) \ge 0 \quad \text{and} \quad \sum_x p_X(x) = 1 \]
9 of 41

Probability Density Function (PDF)

  •  The probability density function (PDF) of a continuous random variable \(X\) describes the relative likelihood of each value; probabilities come from integrating the density:

\[ P(a \le X \le b) = \int_a^b f_X(x)\, dx \]

A valid PDF is nonnegative and integrates to one:

\[ f_X(x) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty} f_X(x)\, dx = 1 \]
10 of 41

Expectation of a Random Variable

Expected Value:

Discrete: \( \mathbb{E}[X] = \sum_x x \, p_X(x) \)

Continuous: \( \mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f_X(x)\, dx \)

Linearity of Expectation:

\[ \mathbb{E}[aX + bY + c] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y] + c \]

Law of the Unconscious Statistician:

\[ \mathbb{E}[g(X)] = \sum_x g(x)\, p_X(x) \;\; \text{(discrete)} \qquad \mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx \;\; \text{(continuous)} \]

11 of 41

Variance & Standard Deviation

Variance – measures the dispersion around the expected value

\[ \mathrm{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] \]

A Useful Property:

\[ \mathrm{Var}(X) = \mathbb{E}[X^2] - \big(\mathbb{E}[X]\big)^2 \]

Standard Deviation – another measure of dispersion/spread

\[ \sigma_X = \sqrt{\mathrm{Var}(X)} \]
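As an added illustration (not from the slides), a quick Python check for a fair six-sided die confirms that both expressions for the variance agree:

import numpy as np

# PMF of a fair six-sided die: values 1..6, each with probability 1/6.
values = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

mean = np.sum(values * pmf)               # E[X] = 3.5
second_moment = np.sum(values**2 * pmf)   # E[X^2], computed via LOTUS

var_definition = np.sum((values - mean) ** 2 * pmf)  # E[(X - E[X])^2]
var_property = second_moment - mean**2               # E[X^2] - E[X]^2

print(var_definition, var_property)  # both print 2.916..., i.e., 35/12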

 

12 of 41

Common Probability Distributions

  1. K-Means Clustering
  2. Probability Theory
    1. Probability Fundamentals
    2. Common Distributions
    3. Multiple Random Variables
  3. Maximum Likelihood Estimation (MLE)
  4. Convex Optimization

13 of 41

Bernoulli Random Variable

Bernoulli random variables are used to represent the outcome of a single binary experiment (e.g., one coin flip) with success probability \(p\):

\[ X \sim \mathrm{Bernoulli}(p) \]

PMF: \( P(X = 1) = p, \quad P(X = 0) = 1 - p \)

Expected Value: \( \mathbb{E}[X] = p \)

Variance: \( \mathrm{Var}(X) = p(1 - p) \)

14 of 41

Binomial Random Variable

Binomial random variables are used to represent the outcome of multiple binary (Bernoulli) experiments; specifically, the number of successes in \(n\) independent trials, each with success probability \(p\):

\[ X \sim \mathrm{Binomial}(n, p) \]

PMF: \( P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \dots, n \)

Expected Value: \( \mathbb{E}[X] = np \)

Variance: \( \mathrm{Var}(X) = np(1 - p) \)

You’ll work with Binomial RVs in Problem 2 of Disc. 3.
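As a quick added illustration of linearity of expectation (not on the original slide), the binomial mean follows from writing \(X\) as a sum of independent Bernoulli indicators:

\[ X = \sum_{i=1}^{n} X_i, \quad X_i \sim \mathrm{Bernoulli}(p) \quad \Longrightarrow \quad \mathbb{E}[X] = \sum_{i=1}^{n} \mathbb{E}[X_i] = np \]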

15 of 41

Exponential Random Variable

Exponential random variables are used to model the amount of time that passes before an event occurs, where events occur at rate \(\lambda > 0\):

\[ X \sim \mathrm{Exponential}(\lambda) \]

PDF: \( f_X(x) = \lambda e^{-\lambda x} \ \text{for } x \ge 0 \) (and \( f_X(x) = 0 \) for \( x < 0 \))

Expected Value: \( \mathbb{E}[X] = \dfrac{1}{\lambda} \)

Variance: \( \mathrm{Var}(X) = \dfrac{1}{\lambda^2} \)

16 of 41

Gaussian Random Variable

Gaussian (normal) random variables are used to represent many continuous real-world processes and are parameterized by a mean \(\mu\) and variance \(\sigma^2\):

\[ X \sim \mathcal{N}(\mu, \sigma^2) \]

PDF: \( f_X(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\dfrac{(x - \mu)^2}{2\sigma^2}\right) \)

Expected Value: \( \mathbb{E}[X] = \mu \)

Variance: \( \mathrm{Var}(X) = \sigma^2 \)

You’ll work with Gaussian RVs in Problem 1 of Disc. 3.
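As an added sanity check (not on the original slide), sampling from a Gaussian with NumPy and comparing the empirical mean and variance against \(\mu\) and \(\sigma^2\):

import numpy as np

mu, sigma = 2.0, 3.0  # illustrative parameter values
rng = np.random.default_rng(0)
samples = rng.normal(loc=mu, scale=sigma, size=100_000)

# The empirical estimates should be close to mu = 2 and sigma^2 = 9.
print(samples.mean())  # approximately 2.0
print(samples.var())   # approximately 9.0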

17 of 41

Standard Normal Random Variable

The standard normal random variable is the Gaussian random variable with zero mean and unit variance:

\[ Z \sim \mathcal{N}(0, 1) \]

PDF: \( f_Z(z) = \dfrac{1}{\sqrt{2\pi}} e^{-z^2 / 2} \)

Expected Value: \( \mathbb{E}[Z] = 0 \)

Variance: \( \mathrm{Var}(Z) = 1 \)

Any Gaussian \( X \sim \mathcal{N}(\mu, \sigma^2) \) can be standardized as \( Z = \frac{X - \mu}{\sigma} \).

18 of 41

Functions of Multiple Random Variables

  1. K-Means Clustering
  2. Probability Theory
    1. Probability Fundamentals
    2. Common Distributions
    3. Multiple Random Variables
  3. Maximum Likelihood Estimation (MLE)
  4. Convex Optimization

19 of 41

Joint Probability Mass Function (PMF)

  •  The joint PMF of two discrete random variables \(X\) and \(Y\) gives the probability of each pair of values:

\[ p_{X,Y}(x, y) = P(X = x, Y = y) \]

We can find the marginal PMFs from the joint PMF by summing over the other variable:

\[ p_X(x) = \sum_y p_{X,Y}(x, y) \qquad p_Y(y) = \sum_x p_{X,Y}(x, y) \]

20 of 41

Joint Probability Density Function (PDF)

  •  The joint PDF of two continuous random variables \(X\) and \(Y\) describes their joint density; probabilities come from integrating over a region:

\[ P\big((X, Y) \in A\big) = \iint_A f_{X,Y}(x, y)\, dx\, dy \]

We can find the marginal PDFs from the joint PDF by integrating out the other variable:

\[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy \qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx \]

21 of 41

Expected Value for Multiple RVs

  •  For a function \(g\) of two random variables \(X\) and \(Y\), the expected value is computed from the joint distribution:

Discrete: \( \mathbb{E}[g(X, Y)] = \sum_x \sum_y g(x, y)\, p_{X,Y}(x, y) \)

Continuous: \( \mathbb{E}[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y)\, f_{X,Y}(x, y)\, dx\, dy \)

In particular, expectation is linear regardless of dependence: \( \mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y] \).

22 of 41

Variance & Covariance for Multiple RVs

  •  The covariance measures how two random variables vary together:

\[ \mathrm{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] \]

The variance of a sum picks up a covariance term:

\[ \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y) \]

23 of 41

Independent Random Variables

Two random variables are independent if knowing information about one does not tell us information about the other. Formally, \(X\) and \(Y\) are independent if their joint distribution factorizes into the product of the marginals:

\[ p_{X,Y}(x, y) = p_X(x)\, p_Y(y) \;\; \text{(discrete)} \qquad f_{X,Y}(x, y) = f_X(x)\, f_Y(y) \;\; \text{(continuous)} \]

Useful consequences of independence:

\[ \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y] \qquad \mathrm{Cov}(X, Y) = 0 \qquad \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) \]

You’ll work with independent RVs in Problems 1-2 of Disc. 3.
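As a small added example, for two independent fair coin flips \(X\) and \(Y\) (each Bernoulli with \(p = 1/2\)), the joint PMF factorizes and the variances add:

\[ P(X = 1, Y = 1) = P(X = 1)\, P(Y = 1) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4} \qquad \mathrm{Var}(X + Y) = \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2} \]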

24 of 41

Independent and Identically Distributed

  •  A collection of random variables \(X_1, \dots, X_n\) is independent and identically distributed (IID) if the variables are mutually independent and each one has the same distribution.

For IID random variables, the joint distribution factorizes into a product of identical marginals:

\[ p_{X_1, \dots, X_n}(x_1, \dots, x_n) = \prod_{i=1}^{n} p_X(x_i) \]

This is the standard modeling assumption for a dataset of \(n\) samples drawn from the same distribution.

25 of 41

Conditional Probability Mass Function (PMF)

The conditional PMF tells us the probability of one discrete random variable given our observation of another and is defined as:

\[ p_{X \mid Y}(x \mid y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}, \quad \text{defined wherever } p_Y(y) > 0 \]

Equivalently, the joint PMF factors as \( p_{X,Y}(x, y) = p_{X \mid Y}(x \mid y)\, p_Y(y) \).

26 of 41

Conditional Probability Density Function (PDF)

The conditional PDF tells us the probability density of one continuous random variable given our observation of another and is defined as:

\[ f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}, \quad \text{defined wherever } f_Y(y) > 0 \]

Equivalently, the joint PDF factors as \( f_{X,Y}(x, y) = f_{X \mid Y}(x \mid y)\, f_Y(y) \).

You’ll get practice with conditional PDFs in Problem 1 of Disc. 3.

27 of 41

Maximum Likelihood Estimation (MLE)

  1. K-Means Clustering
  2. Probability Theory
    1. Probability Fundamentals
    2. Common Distributions
    3. Multiple Random Variables
  3. Maximum Likelihood Estimation (MLE)
  4. Convex Optimization

28 of 41

Maximum Likelihood Estimation (MLE)

  •  Maximum likelihood estimation (MLE) assumes the observed data \(x_1, \dots, x_n\) was drawn from a probability distribution with unknown parameter(s) \(\theta\), and estimates \(\theta\) as the value that makes the observed data most likely:

\[ \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \; \mathcal{L}(\theta; x_1, \dots, x_n) \]

29 of 41

What is the Assumed Distribution?

Usually, one of the common distributions listed in the probability theory section of these slides:

Distribution    Parameters
Bernoulli       success probability \(p\)
Binomial        number of trials \(n\), success probability \(p\)
Exponential     rate \(\lambda\)
Gaussian        mean \(\mu\), variance \(\sigma^2\)

30 of 41

How do we Estimate these Parameters?

Approach: Maximize the likelihood of our data under the assumed probability distribution.

Likelihood:* \( \mathcal{L}(\theta; x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i; \theta) \)

Optimization problem: \( \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \; \prod_{i=1}^{n} p(x_i; \theta) \)

* This assumes that our data is independent and identically distributed (IID) – a very common though rarely true assumption.

Problems 1 & 2 on Disc. 3 cover examples of MLE.
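As an added illustration (with made-up data, not from the slides), the Gaussian MLE has a closed form: the sample mean and the (biased, divide-by-n) sample variance maximize the likelihood. A quick NumPy sketch:

import numpy as np

rng = np.random.default_rng(189)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)  # pretend this is our observed dataset

# MLE for a Gaussian: sample mean and sample variance (dividing by n, not n - 1).
mu_hat = data.mean()
sigma2_hat = np.mean((data - mu_hat) ** 2)

print(mu_hat, sigma2_hat)  # close to the true mu = 5 and sigma^2 = 4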

31 of 41

Introduction to Convex Optimization

  1. K-Means Clustering
  2. Probability Theory
    1. Probability Fundamentals
    2. Common Distributions
    3. Multiple Random Variables
  3. Maximum Likelihood Estimation (MLE)
  4. Convex Optimization

32 of 41

Two Types of Problems

 

Minimization of a convex function:

\[ \min_{x} \; f(x) \quad \text{where } f \text{ is a convex function} \]

Maximization of a concave function:

\[ \max_{x} \; f(x) \quad \text{where } f \text{ is a concave function} \]

33 of 41

Convex and Concave Functions

Convex Functions

A function \(f\) is convex if the line segment between any two points on its graph lies on or above the graph:

\[ f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2) \quad \text{for all } x_1, x_2 \text{ and } \lambda \in [0, 1] \]

If \(f\) is twice differentiable, this is equivalent to \(f''(x) \ge 0\) everywhere (the function curves upward, like a bowl).

Concave Functions

A function \(f\) is concave if the line segment between any two points on its graph lies on or below the graph:

\[ f(\lambda x_1 + (1 - \lambda) x_2) \ge \lambda f(x_1) + (1 - \lambda) f(x_2) \quad \text{for all } x_1, x_2 \text{ and } \lambda \in [0, 1] \]

If \(f\) is twice differentiable, this is equivalent to \(f''(x) \le 0\) everywhere; \(f\) is concave exactly when \(-f\) is convex.

[Figure: example plots of convex (upward-curving) and concave (downward-curving) functions]

34 of 41

Common Convex/Concave Functions

Name           Form                                   Property
Affine         \( f(x) = ax + b \)                    both convex and concave
Quadratic      \( f(x) = ax^2 + bx + c \)             convex if \(a \ge 0\), concave if \(a \le 0\)
Power          \( f(x) = x^a \) on \(x > 0\)          convex if \(a \ge 1\) or \(a \le 0\), concave if \(0 \le a \le 1\)
Exponential    \( f(x) = e^{ax} \)                    convex
Logarithmic    \( f(x) = \log x \) on \(x > 0\)       concave
Entropy        \( f(x) = -x \log x \) on \(x > 0\)    concave

35 of 41

(Unconstrained) Convex Optimization Problems

 

Minimization of a convex function:

\[ \min_{x} \; f(x) \quad \text{where } f \text{ is a convex function} \]

Maximization of a concave function:

\[ \max_{x} \; f(x) \quad \text{where } f \text{ is a concave function} \]

For unconstrained problems of this form, any point where the derivative is zero is a global optimum.

36 of 41

Solving Convex Optimization Problems

  1. Take the derivative of the objective w.r.t. the optimization variable:

\[ \frac{df}{dx}(x) \]

  2. Set the derivative equal to zero and solve for the optimal solution:

\[ \frac{df}{dx}(x^*) = 0 \;\;\Rightarrow\;\; x^* \]

  3. (Optionally) Compute the optimal value by plugging the optimal solution into the objective function:

\[ p^* = f(x^*) \]

(A small worked example appears below.)
You’ll practice this in Problems 1 & 2 of Disc. 3.
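As an added worked example (not from the slides), minimizing the convex quadratic \( f(x) = (x - 3)^2 + 1 \):

\[ \frac{df}{dx}(x) = 2(x - 3) \qquad 2(x^* - 3) = 0 \;\Rightarrow\; x^* = 3 \qquad p^* = f(3) = 1 \]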

37 of 41

Equivalent Optimization Problems

These problems have the same optimal solution! (And the optimal value of one is the negation of the other.)

Minimization of a convex function:

\[ \min_{x} \; f(x), \quad f \text{ convex} \]

Maximization of a concave function:

\[ \max_{x} \; -f(x), \quad -f \text{ concave} \]

Both problems are solved by the same \(x^*\), and \( \min_x f(x) = -\max_x \big(-f(x)\big) \).

38 of 41

Equivalent Optimization Problems

These problems have the same optimal solution!

Monotonically increasing function: \(g\) is monotonically increasing if \(a < b\) implies \(g(a) < g(b)\), so applying \(g\) to the objective preserves the ordering of objective values.

[Figure: plot of a monotonically increasing function]

Equivalent problems via monotone objective: if \(g\) is monotonically increasing, then

\[ \arg\max_{x} \; f(x) = \arg\max_{x} \; g\big(f(x)\big) \]

(and likewise for minimization), so the two problems share the same optimal solution.

Remember this for Problem 1 of Disc. 3!

39 of 41

Equivalent MLE Problems

Because the natural log is a monotonically increasing function,

\[ \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \; \prod_{i=1}^{n} p(x_i; \theta) = \arg\max_{\theta} \; \log \prod_{i=1}^{n} p(x_i; \theta) = \arg\max_{\theta} \; \sum_{i=1}^{n} \log p(x_i; \theta) \]

We very often will solve this second optimization problem!

Remember this for Problem 2 of Disc. 3!
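As an added worked example (not from the slides), for IID Bernoulli data \(x_1, \dots, x_n \in \{0, 1\}\) the log-likelihood is much easier to maximize than the likelihood itself:

\[ \ell(p) = \sum_{i=1}^{n} \Big[ x_i \log p + (1 - x_i) \log(1 - p) \Big] \qquad \frac{d\ell}{dp} = \frac{\sum_i x_i}{p} - \frac{n - \sum_i x_i}{1 - p} = 0 \;\Rightarrow\; \hat{p}_{\mathrm{MLE}} = \frac{1}{n} \sum_{i=1}^{n} x_i \]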

40 of 41

K-Means Clustering, Probability & MLE

Discussion Mini Lecture 3

Contributors: Sara Pohland

41 of 41

Additional Resources

  1. K-Means Clustering
    1. Deep Learning Foundations and Concepts – Chapter 15.1
    2. Sara’s notes on Machine Learning – Section 15.2
  2. Probability Theory
  3. Maximum Likelihood Estimation
  4. Convex Optimization
    • Sara’s notes on Convex Optimization – Sections 2-3