1 of 24

CS365: Foundations of Data Science

Lecture 8 (2/15)

Charalampos E. Tsourakakis (ctsourak@bu.edu)

2 of 24

Reminder

  • Midterm: next Thursday, 2/24.
    • Material up to 2/17 (included).
  • Time: please come by 4:50pm, so we can start precisely at 5.
  • No cheat sheets; you will be provided with all the formulas needed.
  • Place: same auditorium as usual.

3 of 24

The Thumbtack problem

  • A billionaire from the suburbs of Boston asks you a question:
    – He says: I have a thumbtack; if I flip it, what’s the probability it will fall with the nail up?
    – You say: Please flip it a few times.

[Figure: outcomes of the seven flips; 4 of 7 land heads (nail up)]

  • You say: Pr(Heads)=4/7.
  • He says: Why?

4 of 24

The Thumbtack problem

Assumptions

  1. Pr(Heads) = p, Pr(Tails) = 1 − p.
  2. Flips are iid.
  3. Consider the sequence HTHHTHT. Then

     Pr(HTHHTHT) = p·(1−p)·p·p·(1−p)·p·(1−p) = p^4 (1−p)^3

     Comprehension question: Where is the “n choose k” factor in the binomial distribution?
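A minimal sketch of the distinction the comprehension question points at (the function name and the choice p = 4/7 are illustrative, not from the slides): the probability of one specific ordered sequence carries no binomial coefficient; the “n choose k” factor appears only when we ask for the probability of getting 4 heads in 7 flips in any order.

```python
from math import comb

def sequence_likelihood(seq: str, p: float) -> float:
    """Probability of one specific ordered H/T sequence under iid Bernoulli(p) flips."""
    heads = seq.count("H")
    tails = seq.count("T")
    return p**heads * (1 - p)**tails

p = 4 / 7  # the estimate the billionaire was given
print(sequence_likelihood("HTHHTHT", p))               # p^4 (1-p)^3: no binomial factor
print(comb(7, 4) * sequence_likelihood("HTHHTHT", p))  # Pr(exactly 4 heads in 7 flips)
```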

5 of 24

The Thumbtack problem

  • We shall denote Pr(HTHHTHT) as Pr(HTHHTHT; p).
    • p is not a random variable (at least for now…).
  • D denotes the data and θ the parameter(s). We refer to Pr(D|θ) as the likelihood of the data under the model.
  • In our problem: Pr(D; p) = p^αH (1−p)^αT, where αH = 4 is the number of heads and αT = 3 the number of tails.

6 of 24

Maximum Likelihood Estimate (MLE)

  • Data: thumbtack tosses.
  • Hypotheses: a flip is a Bernoulli distributed variable; flips are independent.
  • Learning: find the p* that maximizes the data likelihood, p* = argmax_p Pr(D; p).

7 of 24

Maximum Likelihood Estimate (MLE)

[Figure: the likelihood p^4 (1−p)^3 plotted as a function of p, with the maximizing p* value marked]

8 of 24

Maximum Likelihood Estimate (MLE)

  • Choose the p that maximizes the likelihood:

    p* = argmax_p Pr(D; p) = argmax_p p^αH (1−p)^αT

9 of 24

Optimize to find p*

Work with the log-likelihood, which has the same maximizer:

  log Pr(D; p) = αH log p + αT log(1 − p)

Setting the derivative to zero, αH/p − αT/(1 − p) = 0, gives p_MLE = αH/(αH + αT) = 4/7.

Almost done:

- We need to verify that we get a maximum at p_MLE (whiteboard).
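A quick numerical check of the closed form (illustrative code, not from the lecture), maximizing the log-likelihood over a grid of p values:

```python
import numpy as np

alpha_H, alpha_T = 4, 3  # heads and tails in HTHHTHT

# Log-likelihood of the iid Bernoulli model: alpha_H*log(p) + alpha_T*log(1-p).
p_grid = np.linspace(0.001, 0.999, 10_000)
log_lik = alpha_H * np.log(p_grid) + alpha_T * np.log(1 - p_grid)

print(p_grid[np.argmax(log_lik)])     # numerical maximizer, approx. 0.5714
print(alpha_H / (alpha_H + alpha_T))  # closed form 4/7, approx. 0.5714
```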

10 of 24

Maximum Likelihood Estimate (MLE)

11 of 24

Confidence intervals (reminder)

  • Boston billionaire says: I want to know the true p within 0.01 accuracy, with confidence at least 95%.

Sampling theorem: Given n independent 0-1 RVs Xi such that Pr(Xi=1) = p (i = 1,…,n), where p̂ = (X1 + … + Xn)/n, the following holds:

  Pr(|p̂ − p| ≥ ε) ≤ 2e^(−2nε²)
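Under the bound above, requiring 2e^(−2nε²) ≤ δ and solving for n gives the sample size; a small sketch (the helper name is illustrative):

```python
from math import ceil, log

def flips_needed(eps: float, delta: float) -> int:
    """Smallest n with 2*exp(-2*n*eps^2) <= delta, i.e. n >= ln(2/delta) / (2*eps^2)."""
    return ceil(log(2 / delta) / (2 * eps**2))

# Accuracy eps = 0.01 with confidence at least 95% (delta = 0.05).
print(flips_needed(0.01, 0.05))  # 18445 flips suffice under this bound
```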

12 of 24

Two important properties of the MLE

  • Consistent: p̂_MLE converges in probability to the true parameter as the sample size n → ∞.
  • Equivariant: if p̂ is the MLE of p, then g(p̂) is the MLE of g(p).

13 of 24

Billionaire with prior beliefs

  • He says: Wait! I know that the thumbtack should be close to 50-50.
  • You say: Let’s be Bayesian!
  • Rather than learn a single value for p, we learn a probability distribution.

p now becomes a random variable

14 of 24

Inference using Bayes’ rule

Pr(p | D) = Pr(D | p) · Pr(p) / Pr(D)

The denominator Pr(D) does not depend on p.

Notice the notation Pr(D|p) instead of Pr(D;p): p is now a random variable, so we condition on it.

15 of 24

Bayesian inference - Summary

  1. We choose the prior distribution f(θ). This distribution expresses our prior beliefs on the parameter θ.
  2. We choose the statistical model for the likelihood function f(D|θ).
  3. After observing the data D = X1,…,Xn, we update our beliefs and calculate the posterior distribution f(θ|D).

The maximum a posteriori (MAP) estimate is the mode of the posterior distribution (as we computed in this lecture).
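A compact sketch of steps 1-3 on the thumbtack data (illustrative code; the grid discretization and the Beta(2,2)-shaped prior are assumptions, not the lecture’s choices):

```python
import numpy as np

alpha_H, alpha_T = 4, 3                    # observed data D: 4 heads, 3 tails

theta = np.linspace(0.001, 0.999, 9_999)   # grid over the parameter p

prior = theta * (1 - theta)                # step 1: f(theta) favoring values near 1/2
likelihood = theta**alpha_H * (1 - theta)**alpha_T  # step 2: f(D|theta), iid Bernoulli

posterior = prior * likelihood             # step 3: f(theta|D) is proportional to this
posterior /= posterior.sum() * (theta[1] - theta[0])  # normalize (Riemann sum)

print(theta[np.argmax(posterior)])  # MAP approx. 5/9, between MLE 4/7 and prior mode 1/2
```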

16 of 24

Important observation

  • If we impose a uniform prior on p, then the MAP estimate coincides with the MLE (see the derivation below).
  • Image denoising lecture:
    • Had we imposed a uniform prior on images x, our MAP inference would have been the same as the MLE.
    • Choosing a good prior is important in applications of Bayesian inference.
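The one-step derivation behind this observation (a standard argument, written out here for completeness):

```latex
f(\theta \mid D) \;=\; \frac{f(D \mid \theta)\, f(\theta)}{f(D)}
\;\propto\; f(D \mid \theta) \quad \text{when } f(\theta) \text{ is constant, so}
\quad \hat{\theta}_{\mathrm{MAP}} \;=\; \arg\max_{\theta} f(\theta \mid D)
\;=\; \arg\max_{\theta} f(D \mid \theta) \;=\; \hat{\theta}_{\mathrm{MLE}}.
```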

17 of 24

Conjugate priors

Definition

“If the posterior distribution p(θ | x) is in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function p(x | θ).”

- For the binomial likelihood, the conjugate prior is the Beta distribution.

18 of 24

Bayesian inference for the thumbtack problem

  • The probability density of the Beta distribution is

    f(p; α, β) = Γ(α + β) / (Γ(α) Γ(β)) · p^(α−1) (1−p)^(β−1),  for 0 ≤ p ≤ 1,

    where Γ is the gamma function, Γ(z) = ∫₀^∞ t^(z−1) e^(−t) dt.

19 of 24

Bayesian inference for the thumbtack problem

Pr(p | D) ∝ Pr(D | p) · f(p; α, β) ∝ p^αH (1−p)^αT · p^(α−1) (1−p)^(β−1) = p^(αH+α−1) (1−p)^(αT+β−1)

so the posterior is Beta(αH + α, αT + β): the posterior has the same form as the prior!
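A minimal conjugate-update sketch (uses scipy.stats.beta; the Beta(5, 5) prior is an illustrative choice encoding “close to 50-50”):

```python
from scipy.stats import beta

a, b = 5, 5              # prior Beta(a, b): belief that p is near 1/2
alpha_H, alpha_T = 4, 3  # thumbtack data: 4 heads, 3 tails

# Conjugacy: the posterior is again a Beta, with updated hyperparameters.
a_post, b_post = a + alpha_H, b + alpha_T
posterior = beta(a_post, b_post)

p_map = (a_post - 1) / (a_post + b_post - 2)  # mode of Beta(a', b') for a', b' > 1
print(p_map)             # 8/15, approx. 0.533: pulled from the MLE 4/7 toward the prior's 1/2
print(posterior.mean())  # posterior mean 9/17, approx. 0.529
```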

20 of 24

Method of moments

Suppose our model has parameters θ = (θ1,…,θk).

  • Recall that the j-th moment of a RV X is E[X^j]. To denote that this is a function of the model parameters, we write E_θ[X^j].
    • Compute these analytically.
  • The j-th sample moment for data x1,…,xn is (1/n) Σ_{i=1}^n x_i^j.
  • Equate the analytical moment expressions with the sample moments, and solve a system of k equations with k unknowns to learn θ = (θ1,…,θk).

21 of 24

Method of moments: example 1

Let X1,…,Xn ~ Bernoulli(p).

  • We have one parameter, so k = 1.
  • The first moment is the mean: α1 = E_p[X] = p.
  • The first sample moment is the sample mean.
  • Thus we directly get p̂ = (1/n) Σ_{i=1}^n Xi, the sample mean.

Remark: Here, the MoM estimate is the same as the MLE, but this is not always the case!
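A tiny simulation of this example (the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 4 / 7
x = rng.binomial(1, p_true, size=10_000)  # X1,...,Xn ~ Bernoulli(p_true)

# k = 1: equate the first analytical moment E_p[X] = p
# with the first sample moment (the sample mean) and solve for p.
p_mom = x.mean()
print(p_mom)  # close to 4/7; for the Bernoulli model, identical to the MLE
```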

22 of 24

Method of moments: example 2

23 of 24

MLE for Gaussian

24 of 24

Readings

  1. The thumbtack problem
  2. Method of moments (worked out examples)
  3. MLE examples (worked out examples)
  4. Miscellaneous: Julia package on conjugate priors