1 of 19

CS365: Foundations of Data Science

Lecture 9 (2/17)

Charalampos E. Tsourakakis (ctsourak@bu.edu)

2 of 19

MLE vs MAP

  • MLE Goal: Find θ maximizing the (log-)likelihood Pr(x; θ) (also denoted as Pr(x|θ))
  • MAP Goal: Find θ maximizing the posterior Pr(θ|x)

In some cases solving analytically for θ is hard. Today’s agenda: the EM algorithm, an iterative approach to solving hard parameter learning problems.
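As a reminder, the two objectives can be written side by side (a standard formulation, stated here for reference):

\hat{\theta}_{MLE} = \arg\max_{\theta} \log \Pr(x \mid \theta)

\hat{\theta}_{MAP} = \arg\max_{\theta} \Pr(\theta \mid x) = \arg\max_{\theta} \left[ \log \Pr(x \mid \theta) + \log \Pr(\theta) \right]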

3 of 19

Two coin problem

Process

  • Suppose we choose a coin (A or B) uniformly at random (uar)
  • We toss the coin n times, and record the total number of heads

We repeat the process k times
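A minimal Python sketch of this generative process (the parameter names theta_A, theta_B and the use of NumPy are illustrative assumptions, not part of the slides):

    import numpy as np

    def simulate_two_coins(theta_A, theta_B, n=10, k=5, seed=0):
        """Simulate k rounds: each round picks coin A or B uniformly at random,
        tosses it n times, and records the total number of heads."""
        rng = np.random.default_rng(seed)
        heads, coins = [], []
        for _ in range(k):
            coin = rng.choice(["A", "B"])                  # choose a coin uar
            theta = theta_A if coin == "A" else theta_B    # its (unknown) head probability
            heads.append(rng.binomial(n, theta))           # total heads in n tosses
            coins.append(coin)
        return np.array(heads), np.array(coins)

    # x = number of heads per round, z = coin identity per round
    x, z = simulate_two_coins(theta_A=0.8, theta_B=0.45)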

4 of 19

Two coin problem

  • We have two unknown parameters θ_A, θ_B
  • Suppose k = 5, n = 10
  • The data are two vectors
    • x = (x1, x2, x3, x4, x5): number of heads per round
    • z = (z1, z2, z3, z4, z5): coin identity per round
  • E.g., x=(5,9,8,4,7), z=(B, A, A, B, A)

5 of 19

Two coin problem

Comprehension question: What are the MLEs for θ_A, θ_B in this case?

6 of 19

Two coin problem

Maximum likelihood estimates for the unknown parameters θ_A, θ_B
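With z observed, the MLEs are just the per-coin head frequencies; written out (and evaluated on the example data above):

\hat{\theta}_A = \frac{\sum_{i: z_i = A} x_i}{n \cdot |\{ i : z_i = A \}|}, \qquad \hat{\theta}_B = \frac{\sum_{i: z_i = B} x_i}{n \cdot |\{ i : z_i = B \}|}

For x = (5, 9, 8, 4, 7) and z = (B, A, A, B, A): \hat{\theta}_A = (9+8+7)/30 = 0.80 and \hat{\theta}_B = (5+4)/20 = 0.45.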

7 of 19

Two coin problem

  • Suppose we only see the number of heads per round.
    • In other words, we do not have access to z
  • We refer to z as hidden variables or latent factors
  • Remarks
    1. This is a common setting in data science applications ⇒ missing data!
    2. Clearly, maximizing Pr(x|θ) is much harder than maximizing Pr(x,z|θ) in the presence of missing data z.

8 of 19

The Expectation-Maximization algorithm

  • Also known as the EM algorithm
    • In reality, it is an algorithm design methodology rather than a given algorithm

Let’s see the key idea in the context of the two-coin problem before the full description.

9 of 19

Two coin problem

  • We will proceed iteratively, updating our estimates of the unknown parameters
  • Each iteration starts with a guess of the unknown parameters
  • E-step: a probability distribution over possible completions is computed using the current parameters
  • M-step: the new parameters are determined using the current completions.

10 of 19

Two coin problem

Suppose our initial guesses for the unknown parameters are θ_A = 0.6 and θ_B = 0.5.

E-step: what is the probability that round i comes from coin A / coin B?

Why?
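This follows from Bayes' rule (a short derivation; the 1/2 prior comes from choosing the coin uniformly at random, and the binomial coefficients cancel):

\Pr(z_i = A \mid x_i, \theta') = \frac{\tfrac{1}{2}\, (\theta_A')^{x_i} (1-\theta_A')^{n-x_i}}{\tfrac{1}{2}\, (\theta_A')^{x_i} (1-\theta_A')^{n-x_i} + \tfrac{1}{2}\, (\theta_B')^{x_i} (1-\theta_B')^{n-x_i}}

For round 1 (x_1 = 5): 0.6^5 · 0.4^5 ≈ 0.00080 versus 0.5^10 ≈ 0.00098, giving Pr(A | x_1) ≈ 0.45 and Pr(B | x_1) ≈ 0.55.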

11 of 19

Two coin problem

M-step: In order to learn θ_A, θ_B, we first need to estimate the expected number of heads/tails from coins A and B, given our estimates of the latent variables.

  • Notice that instead of being 100% certain whether a round was due to coin A or B, we have a probability distribution.
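Concretely, writing w_i = Pr(z_i = A | x_i, θ') from the E-step, the expected counts attributed to each coin in round i are (a sketch of the bookkeeping used on the next slides):

E[heads from A in round i] = w_i · x_i,    E[tails from A in round i] = w_i · (n − x_i)

and analogously with (1 − w_i) for coin B. These are exactly the entries such as 0.45×5 in the tables that follow.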

12 of 19

Two coin problem

Round 1: 5H, 5T

         Coin A    Coin B
Heads    0.45×5    0.55×5
Tails    0.45×5    0.55×5

We repeat this for all five rounds

13 of 19

One more round

Round 2: 9H, 1T
From the E-step we have

         Coin A    Coin B
Heads    0.8×9     0.2×9
Tails    0.8×1     0.2×1

14 of 19

Two coin problem

  • Having done this for all 5 rounds, we obtain the expected head/tail counts for coins A and B across all rounds
  • The M-step now simply computes the MLEs from these expected counts (a full EM sketch for this problem follows below)
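Putting the two steps together, a minimal NumPy sketch of the full EM loop for this problem (the function name, the initial guesses, and the fixed number of iterations are illustrative assumptions):

    import numpy as np

    def em_two_coins(x, n, theta_A=0.6, theta_B=0.5, iters=10):
        """EM for the two-coin problem: x[i] = number of heads in round i out of n tosses."""
        x = np.asarray(x, dtype=float)
        for _ in range(iters):
            # E-step: posterior probability that each round came from coin A
            # (binomial coefficients and the 1/2 prior cancel in the ratio)
            lik_A = theta_A ** x * (1 - theta_A) ** (n - x)
            lik_B = theta_B ** x * (1 - theta_B) ** (n - x)
            w_A = lik_A / (lik_A + lik_B)   # responsibility of coin A per round
            w_B = 1.0 - w_A
            # M-step: MLEs from the expected head/tail counts
            theta_A = np.sum(w_A * x) / np.sum(w_A * n)
            theta_B = np.sum(w_B * x) / np.sum(w_B * n)
        return theta_A, theta_B

    # On the lecture's data this ends up near theta_A ≈ 0.80, theta_B ≈ 0.52,
    # the values reported in the "What is the EM algorithm?" reading.
    print(em_two_coins(x=[5, 9, 8, 4, 7], n=10))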

15 of 19

16 of 19

EM algorithm

  • Suppose maximizing Pr(x|θ) has no closed-form solution or is intractable.
  • Key idea: introduce latent variables z that make likelihood computations tractable
  • Intuition: Pr(x,z|θ) and Pr(z|x,θ) should be easy to compute after introducing the “right” latent variables.
  • EM never decreases the likelihood across iterations and is guaranteed to converge to a local maximum (more precisely, a stationary point) of Pr(x|θ).

17 of 19

EM algorithm

Define the “expected log-likelihood” Q(θ|θ'), where θ' is the current estimate of θ:

Q(θ|θ') = E_{z ~ Pr(z|x,θ')} [ log Pr(x, z | θ) ]

The EM algorithm is an iterative method consisting of two steps.

1. E-step: Find Q(θ|θ') by taking the expectation over the latent variables z

2. M-step: Find θ* maximizing Q(θ|θ')

18 of 19

EM algorithm

Some remarks

1. The quantity Q is typically not computed explicitly, but it is useful for proving the convergence of the EM algorithm.

2. What we really need is the distribution of z given x and our current guess θ'; frequently, knowing the expectation of z under this distribution suffices, as we saw in our example.

3. The M-step maximizes Q over all possible θ values. Ideally, knowing x and z allows an easy maximization over θ.

19 of 19

Readings

  1. What is the EM algorithm?
  2. Wikipedia article
  3. Original EM paper
  4. Section 24.4 from the “Understanding ML” book