1 of 56

PGMs - 1

Overview

@AishFenton @NetflixResearch

2 of 56

Why PGMs?

  • Probabilistic models provide an elegant framework for modeling
    • Assumptions are made explicit (and can be played with)
    • Elegant way of incorporating “prior” knowledge
    • Full machinery of probability lets you do more (explore/exploit, factor in uncertainty, etc.)
    • More whitebox than deep learning; results are often interpretable
  • Graphical models provide tools for producing complex hierarchical (deep?) probabilistic models

3 of 56

Example / Motivation

4 of 56

Example / Motivation

5 of 56

Example / Motivation

6 of 56

Layout of tutorial

  • Intro
    • Probability reminders
    • Probabilistic modeling intro
  • Mixture models
    • Understanding the model
    • First look at Expectation Maximization (EM)
  • Inference
    • EM revisited
    • Variational Bayes
    • Samplers
  • Topic models

7 of 56

Probability Reminders

8 of 56

Basic Identities

Sum rule: p(X) = Σ_Y p(X, Y)

Product rule: p(X, Y) = p(Y | X) p(X)

Or the other way around: p(X, Y) = p(X | Y) p(Y)

9 of 56

Joint distribution p(food, day), with marginals:

            Tues   Wed    Thur  |  p(food)
  🍏        0.2    0.1    0.1   |  0.4
  🍗        0.1    0.1    0.1   |  0.3
  🧀        0.1    0.1    0.1   |  0.3
  p(day)    0.4    0.3    0.3   |

10 of 56

Conditional p(food | day): each column of the joint divided by its p(day):

            Tues   Wed    Thur  |  p(food)
  🍏        0.50   0.33   0.33  |  0.4
  🍗        0.25   0.33   0.33  |  0.3
  🧀        0.25   0.33   0.33  |  0.3
  p(day)    0.4    0.3    0.3   |
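As a quick numeric check of the sum and product rules on this table (a sketch using numpy; the array layout and variable names are mine):

import numpy as np

# Joint p(food, day): rows = [apple, chicken, cheese], cols = [Tues, Wed, Thur]
joint = np.array([[0.2, 0.1, 0.1],
                  [0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.1]])

p_food = joint.sum(axis=1)            # sum rule: marginal over days  -> [0.4, 0.3, 0.3]
p_day  = joint.sum(axis=0)            # sum rule: marginal over foods -> [0.4, 0.3, 0.3]

p_food_given_day = joint / p_day      # product rule rearranged: p(food|day) = p(food,day) / p(day)
print(np.round(p_food_given_day, 2))  # columns match the table above (0.50/0.25/0.25, 0.33, ...)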

11 of 56

Bayes Rule

Want this: p(event | obs)

But we have this: p(obs | event)

Bayes' rule: p(event | obs) = p(obs | event) p(event) / p(obs)

12 of 56

Conditional p(food | day) again, with the marginals:

            Tues   Wed    Thur  |  p(food)
  🍏        0.50   0.33   0.33  |  0.4
  🍗        0.25   0.33   0.33  |  0.3
  🧀        0.25   0.33   0.33  |  0.3
  p(day)    0.4    0.3    0.3   |

Useful for swapping from: p(obs|event) -> p(event|obs)
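And Bayes' rule on the same numbers, going back from p(food | day) to p(day | food) (again a numpy sketch, not from the slides):

import numpy as np

p_food_given_day = np.array([[0.50, 0.33, 0.33],
                             [0.25, 0.33, 0.33],
                             [0.25, 0.33, 0.33]])   # p(obs | event), columns = days
p_day = np.array([0.4, 0.3, 0.3])

joint = p_food_given_day * p_day                    # product rule: p(food, day)
p_food = joint.sum(axis=1)                          # sum rule: the evidence p(food)
p_day_given_food = joint / p_food[:, None]          # Bayes: p(event | obs), rows sum to 1
print(np.round(p_day_given_food, 2))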

13 of 56

Entropy

Information content of an outcome x: -log p(x). Entropy is its expectation: H(X) = -Σ_x p(x) log p(x)

More uncertainty = more information
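A tiny sketch of this in code (my own example distributions, entropy measured in bits):

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))    # entropy in bits

print(entropy([0.5, 0.5]))            # 1.0 bit: maximum uncertainty for two outcomes
print(entropy([0.99, 0.01]))          # ~0.08 bits: nearly certain, so little information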

14 of 56

Kullback–Leibler Divergence

KL(P ‖ Q) = Σ_x P(x) log ( P(x) / Q(x) )

NB: the definition isn't symmetric: KL(P ‖ Q) ≠ KL(Q ‖ P)

“Divergence” of Q from P
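A minimal sketch showing the asymmetry numerically (the distributions p and q below are made up; q is assumed positive wherever p is):

import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))              # two different values: KL is not symmetric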

15 of 56

Kullback–Leibler Divergence

16 of 56

Distributions we’ll need

17 of 56

Bernoulli

x ∈ {0,1}
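For reference, the Bernoulli pmf with parameter μ (the probability that x = 1):

\mathrm{Bern}(x \mid \mu) = \mu^{x}\,(1-\mu)^{1-x}, \qquad x \in \{0,1\}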

18 of 56

Multinomial (1 draw)

1 of K encoding

(Figure: categories A, B, C with their probabilities; the one-hot vector [0, 1, 0] selects category B, so p([0,1,0]) is simply B's probability.)
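Under the 1-of-K encoding, with x a one-hot vector and μ the category probabilities, the pmf is:

p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}, \qquad \sum_{k=1}^{K} \mu_k = 1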

19 of 56

Multinomial (n draws)

N! ways to order the draws

Divide by mk! to account for repeats within each category

The Gamma function acts as a smooth factorial: Γ(n+1) = n!
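Putting the callouts together (standard form; m_k is the count for category k and N = Σ_k m_k):

\mathrm{Mult}(m_1, \ldots, m_K \mid \boldsymbol{\mu}, N) = \frac{N!}{m_1!\,\cdots\,m_K!} \prod_{k=1}^{K} \mu_k^{m_k}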

20 of 56

Beta / Dirichlet Distributions

Beta: α counts “successes”, β counts “failures”

Dirichlet: the generalization to K > 2 states
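For reference, the two densities (standard forms):

\mathrm{Beta}(\mu \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\; \mu^{\alpha-1}(1-\mu)^{\beta-1}

\mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}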

21 of 56

Bernoulli-Beta Conjugacy

Bernoulli (likelihood) × Beta (prior)

Posterior after observing x = 1:  = Beta(α+1, β)
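The one-line derivation behind this update, for the case x = 1:

p(\mu \mid x=1) \;\propto\; \underbrace{\mu^{1}(1-\mu)^{0}}_{\mathrm{Bern}} \;\underbrace{\mu^{\alpha-1}(1-\mu)^{\beta-1}}_{\mathrm{Beta}} \;=\; \mu^{(\alpha+1)-1}(1-\mu)^{\beta-1} \;\Rightarrow\; \mathrm{Beta}(\alpha+1, \beta)

(and Beta(α, β+1) for x = 0).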

22 of 56

Multinomial-Dirichlet Conjugacy

Multi (likelihood) × Dir (prior)

= Dir(α1+x1, ..., αK+xK)
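A quick numerical sanity check of both conjugate updates, as a sketch assuming numpy/scipy; the counts below are made up and use the usual pseudo-count form:

import numpy as np
from scipy import stats

# Bernoulli-Beta: prior Beta(2, 2), then observe 7 successes and 3 failures
a, b = 2.0, 2.0
successes, failures = 7, 3
posterior = stats.beta(a + successes, b + failures)
print(posterior.mean())                     # (2+7) / (2+7+2+3) = 0.643

# Multinomial-Dirichlet: prior Dir([1,1,1]), then observe counts x = [5, 3, 2]
alpha = np.array([1.0, 1.0, 1.0])
x = np.array([5, 3, 2])
posterior_mean = (alpha + x) / (alpha + x).sum()
print(posterior_mean)                       # approx [0.462, 0.308, 0.231]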

23 of 56

Bayes Nets

Directed Graphical Models

24 of 56

25 of 56

(Figure: example Bayes net over nodes A, B, C, D.)

26 of 56

(Figure: example Bayes net over nodes A, B, C, D.)

27 of 56

(Figure: example Bayes net over nodes A, B, C, D.)

28 of 56

29 of 56

Directed graphical models

  • In general, the joint factorizes into a product of conditionals, one per node (see the formula below)
  • Each “CPD” (conditional probability distribution) has an associated distribution
  • Cycles disallowed (i.e. the graph is a DAG)
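The factorization referred to in the first bullet, in standard form, where pa(x_i) denotes the parents of node x_i:

p(x_1, \ldots, x_N) = \prod_{i=1}^{N} p\left(x_i \mid \mathrm{pa}(x_i)\right)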

30 of 56

Plate Notation

(Plate diagram of the example model: random variables A, B, C, D with parameters α, β, and nested plates whose sizes include P and 2.)

Key: unshaded node = random variable; shaded node = observed R.V.; plate = repeat K times

31 of 56

Plate Notation

(Non-standard, but useful)

(Same plate diagram, with annotations: α and β are fixed params; the output of B is a “switch”.)

32 of 56

Plate Notation

(Same plate diagram, with annotations: B is a special case, used to index into D; the draw of this R.V. becomes the param of the next R.V.)

What’s missing?

33 of 56

Generative Story

for k = 1 to 2:                  // AK & WK
    Dk ~ Dir(β)                  // topic list

foreach ML conference, i:
    Ai ~ Beta(α)                 // bias!
    foreach paper, j:
        Bij ~ Bern(Ai)           // index D
        Cij ~ Mult(D_Bij)        // draw topic
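A runnable sketch of this generative story in numpy; the plate sizes, number of topics, hyperparameter values, and variable names below are my own stand-ins, not values from the slides:

import numpy as np

rng = np.random.default_rng(0)
alpha, beta = (1.0, 1.0), np.ones(5)      # hypothetical hyperparameters (5 topics)
n_conferences, n_papers = 3, 10           # made-up plate sizes

D = rng.dirichlet(beta, size=2)           # D_k ~ Dir(beta), k = 1..2 (the two topic lists)

papers = []
for i in range(n_conferences):
    A_i = rng.beta(*alpha)                # A_i ~ Beta(alpha): conference bias
    for j in range(n_papers):
        B_ij = rng.binomial(1, A_i)       # B_ij ~ Bern(A_i): which topic list to use
        C_ij = rng.choice(len(beta), p=D[B_ij])   # C_ij ~ Mult(D_{B_ij}): draw a topic
        papers.append((i, j, B_ij, C_ij))

print(papers[:5])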

34 of 56

What we observe...

(Plate diagram repeated.)

Observe:

  • List of papers per conference
  • Authors blinded out

35 of 56

What we observe...

(Plate diagram repeated.)

Can infer:

  • Conference bias
  • Where the paper likely came from
  • AK vs. WK topic preferences

36 of 56

Conditional Independence

37 of 56

A ⫫̸ C | ∅  (A and C are dependent)

(Figure: graph over A, B, C — 🌧 ☂️ 📺.)

38 of 56

A ⫫ C | B

(Figure: graph over A, B, C — 🌧 ☂️ 📺.)

39 of 56

A ⫫ C | B

(Figure: graph over A, B, C — 🌬 ❄️.)

40 of 56

A ⫫ C | ∅

(Figure: graph over A, B, C — 💦 🌧 🚿.)

41 of 56

A ⫫̸ C | B  (A and C become dependent once B is observed)

Explained away

(Figure: graph over A, B, C — 💦 🌧 🚿.)

42 of 56

Markov Blanket

(Figure: node xi with its Markov blanket: its parents, children, and co-parents.)

Co-parents are included because of the explaining-away problem

43 of 56

Bayesian Modeling

44 of 56

Maximum Likelihood (MLE)

Optimize parameters to best fit the data: θ̂_MLE = argmax_θ p(X | θ)

45 of 56

Example: MLE of Multinomial

Constraint: the parameters must sum to 1

Enforce it with a Lagrange multiplier (see the sketch below)
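A sketch of that derivation, with m_k the observed count for category k and N = Σ_k m_k:

\mathcal{L}(\boldsymbol{\mu}, \lambda) = \sum_k m_k \ln \mu_k + \lambda\Big(\sum_k \mu_k - 1\Big), \qquad \frac{\partial \mathcal{L}}{\partial \mu_k} = \frac{m_k}{\mu_k} + \lambda = 0 \;\Rightarrow\; \hat{\mu}_k = \frac{m_k}{N}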

46 of 56

But…

  • Gives only a point estimate of the parameters
  • Can overfit (why?)
  • Still the question of how to maximize p(X|θ)
    • Closed form for many well-known distributions
    • Can also use standard gradient-based optimizers
    • But sometimes parameters are coupled. What to do then?

47 of 56

Going Bayesian…

Apply Bayes’ rule

Maybe a complex distribution now

Now a full distribution over θ

48 of 56

Anatomy of a model

Posterior = Likelihood × Prior / Evidence

The evidence (the normalizer) is the “Ouch!” term (see the labelled equation below)
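Spelling that out, with the slide's labels attached to the pieces of Bayes' rule:

p(\theta \mid X) \;=\; \frac{\overbrace{p(X \mid \theta)}^{\text{likelihood}}\;\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(X)}_{\text{evidence}\,=\,\int p(X \mid \theta)\,p(\theta)\,d\theta}}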

49 of 56

Conjugacy revisited

Multi (likelihood) × Beta (prior)

We know this. Why?

For example:

50 of 56

Pseudo counts

= Dir(α1+x1, ..., αK+xK)

51 of 56

Why go Bayesian?

  • Factors in uncertainty of parameters
    • Marginalization gives a more robust estimate
    • Can be used for explore/exploit
  • Priors provide an elegant way to encode our prior assumptions (connection to regularization).

52 of 56

Posterior Predictive Distribution

We have this one now: the posterior p(θ | X)

New data x*

Marginalize over θ (see below)
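In symbols (standard form), for new data x*:

p(x^{*} \mid X) = \int p(x^{*} \mid \theta)\; p(\theta \mid X)\; d\theta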

53 of 56

Monte Carlo Estimate

Average over samples from Posterior
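A minimal sketch for a Beta posterior (toy numbers of my own), where the Monte Carlo average over posterior samples recovers the exact posterior predictive mean:

import numpy as np

rng = np.random.default_rng(0)
a_post, b_post = 9.0, 5.0                               # e.g. a posterior Beta(9, 5)

theta_samples = rng.beta(a_post, b_post, size=10_000)   # theta^(s) ~ p(theta | X)
p_new_is_1 = np.mean(theta_samples)                     # (1/S) sum_s p(x*=1 | theta^(s)), since p(x*=1|theta)=theta
print(p_new_is_1, a_post / (a_post + b_post))           # MC estimate vs exact 9/14 ≈ 0.643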

54 of 56

Modeling tips

  • Latent variables normally come first; the graph flows top to bottom, with observations at the bottom.
  • A discrete latent variable per observation can act as a “switch” between model components.
  • It’s common to add a latent variable per observation to model mixtures of well-known (i.e. easier to work with) distributions.
  • Nice if parent & child are conjugate (more later).

55 of 56

Markov Random Fields

(Undirected Graphical Models)

56 of 56