1 of 24

CS365: Foundations of Data Science

Lecture 8 (2/15)

Charalampos E. Tsourakakis (ctsourak@bu.edu)

2 of 24

Reminder

  • Midterm: next Thursday, 2/24.
    • Material up to 2/17 (included).
  • Time: please come by 4:50pm, so we can start precisely at 5.
  • No cheat sheets; you will be provided with all the formulas needed.
  • Place: same auditorium as usual.

3 of 24

The Thumbtack problem

  • A billionaire from the suburbs of Boston asks you a question:
    – He says: I have a thumbtack; if I flip it, what’s the probability it will fall with the nail up?
    – You say: Please flip it a few times.

[Figure: outcomes of the seven flips; 4 of 7 land heads (nail up)]

  • You say: Pr(Heads)=4/7.
  • He says: Why?

4 of 24

The Thumbtack problem

Assumptions

  1. Pr(Heads) = p, Pr(Tails) = 1 − p.
  2. Flips are iid.
  3. Consider the sequence HTHHTHT. Then

     Pr(HTHHTHT) = p·(1−p)·p·p·(1−p)·p·(1−p) = p^4 (1−p)^3

     Comprehension question: Where is the “n choose k” factor in the binomial distribution?
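A minimal sketch of the distinction the comprehension question points at (the function name and the choice p = 4/7 are illustrative, not from the slides): the probability of one specific ordered sequence carries no binomial coefficient; the “n choose k” factor appears only when we ask for the probability of getting 4 heads in 7 flips in any order.

```python
from math import comb

def sequence_likelihood(seq: str, p: float) -> float:
    """Probability of one specific ordered H/T sequence under iid Bernoulli(p) flips."""
    heads = seq.count("H")
    tails = seq.count("T")
    return p**heads * (1 - p)**tails

p = 4 / 7  # the estimate the billionaire was given
print(sequence_likelihood("HTHHTHT", p))               # p^4 (1-p)^3: no binomial factor
print(comb(7, 4) * sequence_likelihood("HTHHTHT", p))  # Pr(exactly 4 heads in 7 flips)
```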

5 of 24

The Thumbtack problem

  • We shall denote Pr(HTHHTHT) as Pr(HTHHTHT; p).
    • p is not a random variable (at least for now…).
  • D denotes the data and θ the parameter(s). We refer to Pr(D|θ) as the likelihood of the data under the model.
  • In our problem: Pr(D; p) = p^αH (1−p)^αT, where αH = 4 is the number of heads and αT = 3 the number of tails.

6 of 24

Maximum Likelihood Estimate (MLE)

  • Data: thumbtack tosses.
  • Hypotheses: a flip is a Bernoulli distributed variable; flips are independent.
  • Learning: find the p* that maximizes the data likelihood, p* = argmax_p Pr(D; p).

7 of 24

Maximum Likelihood Estimate (MLE)

[Figure: the likelihood p^4 (1−p)^3 plotted as a function of p, with the maximizing p* value marked]

8 of 24

Maximum Likelihood Estimate (MLE)

  • Choose the p that maximizes the likelihood:

    p* = argmax_p Pr(D; p) = argmax_p p^αH (1−p)^αT

9 of 24

Optimize to find p*

Work with the log-likelihood, which has the same maximizer:

  log Pr(D; p) = αH log p + αT log(1 − p)

Setting the derivative to zero, αH/p − αT/(1 − p) = 0, gives p_MLE = αH/(αH + αT) = 4/7.

Almost done:

- We need to verify that we get a maximum at p_MLE (whiteboard).
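A quick numerical check of the closed form (illustrative code, not from the lecture), maximizing the log-likelihood over a grid of p values:

```python
import numpy as np

alpha_H, alpha_T = 4, 3  # heads and tails in HTHHTHT

# Log-likelihood of the iid Bernoulli model: alpha_H*log(p) + alpha_T*log(1-p).
p_grid = np.linspace(0.001, 0.999, 10_000)
log_lik = alpha_H * np.log(p_grid) + alpha_T * np.log(1 - p_grid)

print(p_grid[np.argmax(log_lik)])     # numerical maximizer, approx. 0.5714
print(alpha_H / (alpha_H + alpha_T))  # closed form 4/7, approx. 0.5714
```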

10 of 24

Maximum Likelihood Estimate (MLE)

11 of 24

Confidence intervals (reminder)

  • Boston billionaire says: I want to know the true p within 0.01 accuracy, with confidence at least 95%.

Sampling theorem: Given n independent 0-1 RVs Xi such that Pr(Xi=1) = p (i = 1,…,n), where p̂ = (X1 + … + Xn)/n, the following holds:

  Pr(|p̂ − p| ≥ ε) ≤ 2e^(−2nε²)
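Under the bound above, requiring 2e^(−2nε²) ≤ δ and solving for n gives the sample size; a small sketch (the helper name is illustrative):

```python
from math import ceil, log

def flips_needed(eps: float, delta: float) -> int:
    """Smallest n with 2*exp(-2*n*eps^2) <= delta, i.e. n >= ln(2/delta) / (2*eps^2)."""
    return ceil(log(2 / delta) / (2 * eps**2))

# Accuracy eps = 0.01 with confidence at least 95% (delta = 0.05).
print(flips_needed(0.01, 0.05))  # 18445 flips suffice under this bound
```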

12 of 24

Two important properties of the MLE

  • Consistent: p̂_MLE converges in probability to the true parameter as the sample size n → ∞.
  • Equivariant: if p̂ is the MLE of p, then g(p̂) is the MLE of g(p).

13 of 24

Billionaire with prior beliefs

  • He says: Wait! I know that the thumbtack should be close to 50-50.
  • You say: Let’s be Bayesian!
  • Rather than learn a single value for p, we learn a probability distribution.

p now becomes a random variable

14 of 24

Inference using Bayes’ rule

Pr(p | D) = Pr(D | p) · Pr(p) / Pr(D)

The denominator Pr(D) does not depend on p.

Notice the notation Pr(D|p) instead of Pr(D;p): p is now a random variable, so we condition on it.

15 of 24

Bayesian inference - Summary

  1. We choose the prior distribution f(θ). This distribution expresses our prior beliefs on the parameter θ.
  2. We choose the statistical model for the likelihood function f(D|θ).
  3. After observing the data D = X1,…,Xn, we update our beliefs and calculate the posterior distribution f(θ|D).

The maximum a posteriori (MAP) estimate is the mode of the posterior distribution (as we computed in this lecture).
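A compact sketch of steps 1-3 on the thumbtack data (illustrative code; the grid discretization and the Beta(2,2)-shaped prior are assumptions, not the lecture’s choices):

```python
import numpy as np

alpha_H, alpha_T = 4, 3                    # observed data D: 4 heads, 3 tails

theta = np.linspace(0.001, 0.999, 9_999)   # grid over the parameter p

prior = theta * (1 - theta)                # step 1: f(theta) favoring values near 1/2
likelihood = theta**alpha_H * (1 - theta)**alpha_T  # step 2: f(D|theta), iid Bernoulli

posterior = prior * likelihood             # step 3: f(theta|D) is proportional to this
posterior /= posterior.sum() * (theta[1] - theta[0])  # normalize (Riemann sum)

print(theta[np.argmax(posterior)])  # MAP approx. 5/9, between MLE 4/7 and prior mode 1/2
```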

16 of 24

Important observation

  • If we impose a uniform prior on p, then the MAP estimate coincides with the MLE (see the derivation below).
  • Image denoising lecture:
    • Had we imposed a uniform prior on images x, our MAP inference would have been the same as the MLE.
    • Choosing a good prior is important in applications of Bayesian inference.
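The one-step derivation behind this observation (a standard argument, written out here for completeness):

```latex
f(\theta \mid D) \;=\; \frac{f(D \mid \theta)\, f(\theta)}{f(D)}
\;\propto\; f(D \mid \theta) \quad \text{when } f(\theta) \text{ is constant, so}
\quad \hat{\theta}_{\mathrm{MAP}} \;=\; \arg\max_{\theta} f(\theta \mid D)
\;=\; \arg\max_{\theta} f(D \mid \theta) \;=\; \hat{\theta}_{\mathrm{MLE}}.
```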

17 of 24

Conjugate priors

Definition

“If the posterior distribution p(θ | x) is in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function p(x | θ).”

- For the binomial likelihood, the conjugate prior is the Beta distribution.

18 of 24

Bayesian inference for the thumbtack problem

  • The probability density of the Beta distribution is

    f(p; α, β) = Γ(α + β) / (Γ(α) Γ(β)) · p^(α−1) (1−p)^(β−1),  for 0 ≤ p ≤ 1,

    where Γ is the gamma function, Γ(z) = ∫₀^∞ t^(z−1) e^(−t) dt.

19 of 24

Bayesian inference for the thumbtack problem

Pr(p | D) ∝ Pr(D | p) · f(p; α, β) ∝ p^αH (1−p)^αT · p^(α−1) (1−p)^(β−1) = p^(αH+α−1) (1−p)^(αT+β−1)

so the posterior is Beta(αH + α, αT + β): the posterior has the same form as the prior!
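A minimal conjugate-update sketch (uses scipy.stats.beta; the Beta(5, 5) prior is an illustrative choice encoding “close to 50-50”):

```python
from scipy.stats import beta

a, b = 5, 5              # prior Beta(a, b): belief that p is near 1/2
alpha_H, alpha_T = 4, 3  # thumbtack data: 4 heads, 3 tails

# Conjugacy: the posterior is again a Beta, with updated hyperparameters.
a_post, b_post = a + alpha_H, b + alpha_T
posterior = beta(a_post, b_post)

p_map = (a_post - 1) / (a_post + b_post - 2)  # mode of Beta(a', b') for a', b' > 1
print(p_map)             # 8/15, approx. 0.533: pulled from the MLE 4/7 toward the prior's 1/2
print(posterior.mean())  # posterior mean 9/17, approx. 0.529
```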

20 of 24

Method of moments

Suppose our model has parameters θ = (θ1,…,θk).

  • Recall that the j-th moment of a RV X is E[X^j]. To denote that this is a function of the model parameters, we write E_θ[X^j].
    • Compute these analytically.
  • The j-th sample moment for data x1,…,xn is (1/n) Σ_{i=1}^n x_i^j.
  • Equate the analytical moment expressions with the sample moments, and solve a system of k equations with k unknowns to learn θ = (θ1,…,θk).

21 of 24

Method of moments: example 1

Let X1,…,Xn ~ Bernoulli(p).

  • We have one parameter, so k = 1.
  • The first moment is the mean: α1 = E_p[X] = p.
  • The first sample moment is the sample mean.
  • Thus we directly get p̂ = (1/n) Σ_{i=1}^n Xi, the sample mean.

Remark: Here, the MoM estimate is the same as the MLE, but this is not always the case!
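A tiny simulation of this example (the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 4 / 7
x = rng.binomial(1, p_true, size=10_000)  # X1,...,Xn ~ Bernoulli(p_true)

# k = 1: equate the first analytical moment E_p[X] = p
# with the first sample moment (the sample mean) and solve for p.
p_mom = x.mean()
print(p_mom)  # close to 4/7; for the Bernoulli model, identical to the MLE
```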

22 of 24

Method of moments: example 2

23 of 24

MLE for Gaussian

24 of 24

Readings

  1. The thumbtack problem
  2. Method of moments (worked out examples)
  3. MLE examples (worked out examples)
  4. Miscellaneous: Julia package on conjugate priors