1 of 19

CS365: Foundations of Data Science

Lecture 9 (2/17)

Charalampos E. Tsourakakis (ctsourak@bu.edu)

2 of 19

MLE vs MAP

  • MLE Goal: Find θ maximizing the (log-)likelihood Pr(x; θ) (also denoted as Pr(x|θ))
  • MAP Goal: Find θ maximizing the posterior Pr(θ|x)

In some cases solving analytically for θ is hard. Today’s agenda: the EM algorithm, an iterative approach to solving hard parameter learning problems.
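As a reminder, the two objectives can be written side by side (a standard formulation, stated here for reference):

\hat{\theta}_{MLE} = \arg\max_{\theta} \log \Pr(x \mid \theta)

\hat{\theta}_{MAP} = \arg\max_{\theta} \Pr(\theta \mid x) = \arg\max_{\theta} \left[ \log \Pr(x \mid \theta) + \log \Pr(\theta) \right]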

3 of 19

Two coin problem

Process

  • Suppose we choose a coin (A or B) uniformly at random (uar)
  • We toss the coin n times, and record the total number of heads

We repeat the process k times
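A minimal Python sketch of this generative process (the parameter names theta_A, theta_B and the use of NumPy are illustrative assumptions, not part of the slides):

    import numpy as np

    def simulate_two_coins(theta_A, theta_B, n=10, k=5, seed=0):
        """Simulate k rounds: each round picks coin A or B uniformly at random,
        tosses it n times, and records the total number of heads."""
        rng = np.random.default_rng(seed)
        heads, coins = [], []
        for _ in range(k):
            coin = rng.choice(["A", "B"])                  # choose a coin uar
            theta = theta_A if coin == "A" else theta_B    # its (unknown) head probability
            heads.append(rng.binomial(n, theta))           # total heads in n tosses
            coins.append(coin)
        return np.array(heads), np.array(coins)

    # x = number of heads per round, z = coin identity per round
    x, z = simulate_two_coins(theta_A=0.8, theta_B=0.45)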

4 of 19

Two coin problem

  • We have two unknown parameters θ_A, θ_B
  • Suppose k = 5, n = 10
  • The data are two vectors
    • x = (x1, x2, x3, x4, x5): number of heads per round
    • z = (z1, z2, z3, z4, z5): coin identity per round
  • E.g., x=(5,9,8,4,7), z=(B, A, A, B, A)

5 of 19

Two coin problem

Comprehension question: What are the MLEs for θ_A, θ_B in this case?

6 of 19

Two coin problem

Maximum likelihood estimates for the unknown parameters θ_A, θ_B
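With z observed, the MLEs are just the per-coin head frequencies; written out (and evaluated on the example data above):

\hat{\theta}_A = \frac{\sum_{i: z_i = A} x_i}{n \cdot |\{ i : z_i = A \}|}, \qquad \hat{\theta}_B = \frac{\sum_{i: z_i = B} x_i}{n \cdot |\{ i : z_i = B \}|}

For x = (5, 9, 8, 4, 7) and z = (B, A, A, B, A): \hat{\theta}_A = (9+8+7)/30 = 0.80 and \hat{\theta}_B = (5+4)/20 = 0.45.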

7 of 19

Two coin problem

  • Suppose we only see the number of heads per round.
    • In other words, we do not have access to z
  • We refer to z as hidden variables or latent factors
  • Remarks
    1. This is a common setting in data science applications ⇒ missing data!
    2. Clearly, maximizing Pr(x|θ) is much harder than maximizing Pr(x,z|θ) in the presence of missing data z.

8 of 19

The Expectation-Maximization algorithm

  • Also known as the EM algorithm
    • In reality, it is an algorithm design methodology rather than a given algorithm

Let’s see the key idea in the context of the two-coin problem before the full description.

9 of 19

Two coin problem

  • We will proceed iteratively, updating our estimates of the unknown parameters
  • Each iteration starts with a guess of the unknown parameters
  • E-step: a probability distribution over possible completions is computed using the current parameters
  • M-step: the new parameters are determined using the current completions.

10 of 19

Two coin problem

Suppose our initial guesses for the unknown parameters are θ_A = 0.6 and θ_B = 0.5.

E-step: what is the probability that round i comes from coin A / coin B?

Why?
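This follows from Bayes' rule (a short derivation; the 1/2 prior comes from choosing the coin uniformly at random, and the binomial coefficients cancel):

\Pr(z_i = A \mid x_i, \theta') = \frac{\tfrac{1}{2}\, (\theta_A')^{x_i} (1-\theta_A')^{n-x_i}}{\tfrac{1}{2}\, (\theta_A')^{x_i} (1-\theta_A')^{n-x_i} + \tfrac{1}{2}\, (\theta_B')^{x_i} (1-\theta_B')^{n-x_i}}

For round 1 (x_1 = 5): 0.6^5 · 0.4^5 ≈ 0.00080 versus 0.5^10 ≈ 0.00098, giving Pr(A | x_1) ≈ 0.45 and Pr(B | x_1) ≈ 0.55.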

11 of 19

Two coin problem

M-step: In order to learn θ_A, θ_B, we first need to estimate the expected number of heads/tails from coins A and B, given our estimates of the latent variables.

  • Notice that instead of being 100% certain whether a round was due to coin A or B, we have a probability distribution.
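Concretely, writing w_i = Pr(z_i = A | x_i, θ') from the E-step, the expected counts attributed to each coin in round i are (a sketch of the bookkeeping used on the next slides):

E[heads from A in round i] = w_i · x_i,    E[tails from A in round i] = w_i · (n − x_i)

and analogously with (1 − w_i) for coin B. These are exactly the entries such as 0.45×5 in the tables that follow.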

12 of 19

Two coin problem

Round 1: 5H, 5T

         Coin A    Coin B
Heads    0.45×5    0.55×5
Tails    0.45×5    0.55×5

We repeat this for all five rounds

13 of 19

One more round

Round 2: 9H, 1T
From the E-step we have

         Coin A    Coin B
Heads    0.8×9     0.2×9
Tails    0.8×1     0.2×1

14 of 19

Two coin problem

  • Having done this for all 5 rounds, we obtain the expected head/tail counts for coins A and B across all rounds
  • The M-step now simply computes the MLEs from these expected counts (a full EM sketch for this problem follows below)
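Putting the two steps together, a minimal NumPy sketch of the full EM loop for this problem (the function name, the initial guesses, and the fixed number of iterations are illustrative assumptions):

    import numpy as np

    def em_two_coins(x, n, theta_A=0.6, theta_B=0.5, iters=10):
        """EM for the two-coin problem: x[i] = number of heads in round i out of n tosses."""
        x = np.asarray(x, dtype=float)
        for _ in range(iters):
            # E-step: posterior probability that each round came from coin A
            # (binomial coefficients and the 1/2 prior cancel in the ratio)
            lik_A = theta_A ** x * (1 - theta_A) ** (n - x)
            lik_B = theta_B ** x * (1 - theta_B) ** (n - x)
            w_A = lik_A / (lik_A + lik_B)   # responsibility of coin A per round
            w_B = 1.0 - w_A
            # M-step: MLEs from the expected head/tail counts
            theta_A = np.sum(w_A * x) / np.sum(w_A * n)
            theta_B = np.sum(w_B * x) / np.sum(w_B * n)
        return theta_A, theta_B

    # On the lecture's data this ends up near theta_A ≈ 0.80, theta_B ≈ 0.52,
    # the values reported in the "What is the EM algorithm?" reading.
    print(em_two_coins(x=[5, 9, 8, 4, 7], n=10))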

15 of 19

16 of 19

EM algorithm

  • Suppose maximizing Pr(x|θ) has no closed-form solution or is intractable.
  • Key idea: introduce latent variables z that make likelihood computations tractable
  • Intuition: Pr(x,z|θ) and Pr(z|x,θ) should be easy to compute after introducing the “right” latent variables.
  • EM never decreases the likelihood across iterations and is guaranteed to converge to a local maximum (more precisely, a stationary point) of Pr(x|θ).

17 of 19

EM algorithm

Define the “expected log-likelihood” Q(θ|θ'), where θ' is the current estimate of θ:

Q(θ|θ') = E_{z ~ Pr(z|x,θ')} [ log Pr(x, z | θ) ]

The EM algorithm is an iterative method consisting of two steps.

1. E-step: Find Q(θ|θ') by taking the expectation over the latent variables z

2. M-step: Find θ* maximizing Q(θ|θ')

18 of 19

EM algorithm

Some remarks

1. The quantity Q is typically not computed explicitly, but it is useful for proving the convergence of the EM algorithm.

2. What we really need is the distribution of z given x and our current guess θ'; frequently, knowing the expectation of z under this distribution suffices, as we saw in our example.

3. The M-step maximizes Q over all possible θ values. Ideally, knowing x and z allows an easy maximization over θ.

19 of 19

Readings

  1. What is the EM algorithm?
  2. Wikipedia article
  3. Original EM paper
  4. Section 24.4 from the “Understanding ML” book