1 of 56

PGMs - 1

Overview

@AishFenton @NetflixResearch

2 of 56

Why PGMs?

  • Probabilistic models provide an elegant framework for modeling
    • Assumptions are made explicit (and can be played with)
    • Elegant way of incorporating “prior” knowledge
    • Full machinery of probability lets you do more (explore/exploit, factor in uncertainty, etc.)
    • More whitebox than deep learning; results are often interpretable
  • Graphical models provide tools for producing complex hierarchical (deep?) probabilistic models

3 of 56

Example / Motivation

4 of 56

Example / Motivation

5 of 56

Example / Motivation

6 of 56

Layout of tutorial

  • Intro
    • Probability reminders
    • Probabilistic modeling intro
  • Mixture models
    • Understanding the model
    • First look at Expectation Maximization (EM)
  • Inference
    • EM revisited
    • Variational Bayes
    • Samplers
  • Topic models

7 of 56

Probability Reminders

8 of 56

Basic Identities

Sum rule: p(X) = Σ_Y p(X, Y)

Product rule: p(X, Y) = p(Y | X) p(X)

Or the other way around: p(X, Y) = p(X | Y) p(Y)

9 of 56

Joint distribution p(food, day), with marginals:

            Tues   Wed    Thur  |  p(food)
  🍏        0.2    0.1    0.1   |  0.4
  🍗        0.1    0.1    0.1   |  0.3
  🧀        0.1    0.1    0.1   |  0.3
  p(day)    0.4    0.3    0.3   |

10 of 56

Conditional p(food | day): each column of the joint divided by its p(day):

            Tues   Wed    Thur  |  p(food)
  🍏        0.50   0.33   0.33  |  0.4
  🍗        0.25   0.33   0.33  |  0.3
  🧀        0.25   0.33   0.33  |  0.3
  p(day)    0.4    0.3    0.3   |
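As a quick numeric check of the sum and product rules on this table (a sketch using numpy; the array layout and variable names are mine):

import numpy as np

# Joint p(food, day): rows = [apple, chicken, cheese], cols = [Tues, Wed, Thur]
joint = np.array([[0.2, 0.1, 0.1],
                  [0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.1]])

p_food = joint.sum(axis=1)            # sum rule: marginal over days  -> [0.4, 0.3, 0.3]
p_day  = joint.sum(axis=0)            # sum rule: marginal over foods -> [0.4, 0.3, 0.3]

p_food_given_day = joint / p_day      # product rule rearranged: p(food|day) = p(food,day) / p(day)
print(np.round(p_food_given_day, 2))  # columns match the table above (0.50/0.25/0.25, 0.33, ...)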

11 of 56

Bayes Rule

Want this: p(event | obs)

But we have this: p(obs | event)

Bayes' rule: p(event | obs) = p(obs | event) p(event) / p(obs)

12 of 56

Conditional p(food | day) again, with the marginals:

            Tues   Wed    Thur  |  p(food)
  🍏        0.50   0.33   0.33  |  0.4
  🍗        0.25   0.33   0.33  |  0.3
  🧀        0.25   0.33   0.33  |  0.3
  p(day)    0.4    0.3    0.3   |

Useful for swapping from: p(obs|event) -> p(event|obs)
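And Bayes' rule on the same numbers, going back from p(food | day) to p(day | food) (again a numpy sketch, not from the slides):

import numpy as np

p_food_given_day = np.array([[0.50, 0.33, 0.33],
                             [0.25, 0.33, 0.33],
                             [0.25, 0.33, 0.33]])   # p(obs | event), columns = days
p_day = np.array([0.4, 0.3, 0.3])

joint = p_food_given_day * p_day                    # product rule: p(food, day)
p_food = joint.sum(axis=1)                          # sum rule: the evidence p(food)
p_day_given_food = joint / p_food[:, None]          # Bayes: p(event | obs), rows sum to 1
print(np.round(p_day_given_food, 2))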

13 of 56

Entropy

Information content of an outcome x: -log p(x). Entropy is its expectation: H(X) = -Σ_x p(x) log p(x)

More uncertainty = more information
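A tiny sketch of this in code (my own example distributions, entropy measured in bits):

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))    # entropy in bits

print(entropy([0.5, 0.5]))            # 1.0 bit: maximum uncertainty for two outcomes
print(entropy([0.99, 0.01]))          # ~0.08 bits: nearly certain, so little information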

14 of 56

Kullback–Leibler Divergence

KL(P ‖ Q) = Σ_x P(x) log ( P(x) / Q(x) )

NB: the definition isn't symmetric: KL(P ‖ Q) ≠ KL(Q ‖ P)

“Divergence” of Q from P
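A minimal sketch showing the asymmetry numerically (the distributions p and q below are made up; q is assumed positive wherever p is):

import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))              # two different values: KL is not symmetric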

15 of 56

Kullback–Leibler Divergence

16 of 56

Distributions we’ll need

17 of 56

Bernoulli

x ∈ {0,1}
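For reference, the Bernoulli pmf with parameter μ (the probability that x = 1):

\mathrm{Bern}(x \mid \mu) = \mu^{x}\,(1-\mu)^{1-x}, \qquad x \in \{0,1\}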

18 of 56

Multinomial (1 draw)

1 of K encoding

(Figure: categories A, B, C with their probabilities; the one-hot vector [0, 1, 0] selects category B, so p([0,1,0]) is simply B's probability.)
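Under the 1-of-K encoding, with x a one-hot vector and μ the category probabilities, the pmf is:

p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}, \qquad \sum_{k=1}^{K} \mu_k = 1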

19 of 56

Multinomial (n draws)

N! ways to order the draws

Divide by mk! to account for repeats within each category

The Gamma function acts as a smooth factorial: Γ(n+1) = n!
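Putting the callouts together (standard form; m_k is the count for category k and N = Σ_k m_k):

\mathrm{Mult}(m_1, \ldots, m_K \mid \boldsymbol{\mu}, N) = \frac{N!}{m_1!\,\cdots\,m_K!} \prod_{k=1}^{K} \mu_k^{m_k}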

20 of 56

Beta / Dirichlet Distributions

Beta: α counts “successes”, β counts “failures”

Dirichlet: the generalization to K > 2 states
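For reference, the two densities (standard forms):

\mathrm{Beta}(\mu \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\; \mu^{\alpha-1}(1-\mu)^{\beta-1}

\mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}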

21 of 56

Bernoulli-Beta Conjugacy

Bernoulli (likelihood) × Beta (prior)

Posterior after observing x = 1:  = Beta(α+1, β)
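The one-line derivation behind this update, for the case x = 1:

p(\mu \mid x=1) \;\propto\; \underbrace{\mu^{1}(1-\mu)^{0}}_{\mathrm{Bern}} \;\underbrace{\mu^{\alpha-1}(1-\mu)^{\beta-1}}_{\mathrm{Beta}} \;=\; \mu^{(\alpha+1)-1}(1-\mu)^{\beta-1} \;\Rightarrow\; \mathrm{Beta}(\alpha+1, \beta)

(and Beta(α, β+1) for x = 0).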

22 of 56

Multinomial-Dirichlet Conjugacy

Multi (likelihood) × Dir (prior)

= Dir(α1+x1, ..., αK+xK)
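A quick numerical sanity check of both conjugate updates, as a sketch assuming numpy/scipy; the counts below are made up and use the usual pseudo-count form:

import numpy as np
from scipy import stats

# Bernoulli-Beta: prior Beta(2, 2), then observe 7 successes and 3 failures
a, b = 2.0, 2.0
successes, failures = 7, 3
posterior = stats.beta(a + successes, b + failures)
print(posterior.mean())                     # (2+7) / (2+7+2+3) = 0.643

# Multinomial-Dirichlet: prior Dir([1,1,1]), then observe counts x = [5, 3, 2]
alpha = np.array([1.0, 1.0, 1.0])
x = np.array([5, 3, 2])
posterior_mean = (alpha + x) / (alpha + x).sum()
print(posterior_mean)                       # approx [0.462, 0.308, 0.231]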

23 of 56

Bayes Nets

Directed Graphical Models

24 of 56

25 of 56

(Figure: example Bayes net over nodes A, B, C, D.)

26 of 56

(Figure: example Bayes net over nodes A, B, C, D.)

27 of 56

(Figure: example Bayes net over nodes A, B, C, D.)

28 of 56

29 of 56

Directed graphical models

  • In general, the joint factorizes into a product of conditionals, one per node (see the formula below)
  • Each “CPD” (conditional probability distribution) has an associated distribution
  • Cycles disallowed (i.e. the graph is a DAG)
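The factorization referred to in the first bullet, in standard form, where pa(x_i) denotes the parents of node x_i:

p(x_1, \ldots, x_N) = \prod_{i=1}^{N} p\left(x_i \mid \mathrm{pa}(x_i)\right)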

30 of 56

Plate Notation

(Plate diagram of the example model: random variables A, B, C, D with parameters α, β, and nested plates whose sizes include P and 2.)

Key: unshaded node = random variable; shaded node = observed R.V.; plate = repeat K times

31 of 56

Plate Notation

(Non-standard, but useful)

(Same plate diagram, with annotations: α and β are fixed params; the output of B is a “switch”.)

32 of 56

Plate Notation

(Same plate diagram, with annotations: B is a special case, used to index into D; the draw of this R.V. becomes the param of the next R.V.)

What’s missing?

33 of 56

Generative Story

for k = 1 to 2:                  // AK & WK
    Dk ~ Dir(β)                  // topic list

foreach ML conference, i:
    Ai ~ Beta(α)                 // bias!
    foreach paper, j:
        Bij ~ Bern(Ai)           // index D
        Cij ~ Mult(D_Bij)        // draw topic
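A runnable sketch of this generative story in numpy; the plate sizes, number of topics, hyperparameter values, and variable names below are my own stand-ins, not values from the slides:

import numpy as np

rng = np.random.default_rng(0)
alpha, beta = (1.0, 1.0), np.ones(5)      # hypothetical hyperparameters (5 topics)
n_conferences, n_papers = 3, 10           # made-up plate sizes

D = rng.dirichlet(beta, size=2)           # D_k ~ Dir(beta), k = 1..2 (the two topic lists)

papers = []
for i in range(n_conferences):
    A_i = rng.beta(*alpha)                # A_i ~ Beta(alpha): conference bias
    for j in range(n_papers):
        B_ij = rng.binomial(1, A_i)       # B_ij ~ Bern(A_i): which topic list to use
        C_ij = rng.choice(len(beta), p=D[B_ij])   # C_ij ~ Mult(D_{B_ij}): draw a topic
        papers.append((i, j, B_ij, C_ij))

print(papers[:5])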

34 of 56

What we observe...

(Plate diagram repeated.)

Observe:

  • List of papers per conference
  • Authors blinded out

35 of 56

What we observe...

(Plate diagram repeated.)

Can infer:

  • Conference bias
  • Where the paper likely came from
  • AK vs. WK topic preferences

36 of 56

Conditional Independence

37 of 56

A ⫫̸ C | ∅  (A and C are dependent)

(Figure: graph over A, B, C — 🌧 ☂️ 📺.)

38 of 56

A ⫫ C | B

(Figure: graph over A, B, C — 🌧 ☂️ 📺.)

39 of 56

A ⫫ C | B

(Figure: graph over A, B, C — 🌬 ❄️.)

40 of 56

A ⫫ C | ∅

(Figure: graph over A, B, C — 💦 🌧 🚿.)

41 of 56

A ⫫̸ C | B  (A and C become dependent once B is observed)

Explained away

(Figure: graph over A, B, C — 💦 🌧 🚿.)

42 of 56

Markov Blanket

(Figure: node xi with its Markov blanket: its parents, children, and co-parents.)

Co-parents are included because of the explaining-away problem

43 of 56

Bayesian Modeling

44 of 56

Maximum Likelihood (MLE)

Optimize parameters to best fit the data: θ̂_MLE = argmax_θ p(X | θ)

45 of 56

Example: MLE of Multinomial

Constraint: the parameters must sum to 1

Enforce it with a Lagrange multiplier (see the sketch below)
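A sketch of that derivation, with m_k the observed count for category k and N = Σ_k m_k:

\mathcal{L}(\boldsymbol{\mu}, \lambda) = \sum_k m_k \ln \mu_k + \lambda\Big(\sum_k \mu_k - 1\Big), \qquad \frac{\partial \mathcal{L}}{\partial \mu_k} = \frac{m_k}{\mu_k} + \lambda = 0 \;\Rightarrow\; \hat{\mu}_k = \frac{m_k}{N}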

46 of 56

But…

  • Gives only a point estimate of the parameters
  • Can overfit (why?)
  • Still the question of how to maximize p(X|θ)
    • Closed form for many well-known distributions
    • Can also use standard gradient-based optimizers
    • But sometimes parameters are coupled. What to do then?

47 of 56

Going Bayesian…

Apply Bayes’ rule

Maybe a complex distribution now

Now a full distribution over θ

48 of 56

Anatomy of a model

Posterior = Likelihood × Prior / Evidence

The evidence (the normalizer) is the “Ouch!” term (see the labelled equation below)
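Spelling that out, with the slide's labels attached to the pieces of Bayes' rule:

p(\theta \mid X) \;=\; \frac{\overbrace{p(X \mid \theta)}^{\text{likelihood}}\;\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(X)}_{\text{evidence}\,=\,\int p(X \mid \theta)\,p(\theta)\,d\theta}}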

49 of 56

Conjugacy revisited

Multi (likelihood) × Beta (prior)

We know this. Why?

For example:

50 of 56

Pseudo counts

= Dir(α1+x1, ..., αK+xK)

51 of 56

Why go Bayesian?

  • Factors in uncertainty of parameters
    • Marginalization gives a more robust estimate
    • Can be used for explore/exploit
  • Priors provide an elegant way to encode our prior assumptions (connection to regularization).

52 of 56

Posterior Predictive Distribution

We have this one now: the posterior p(θ | X)

New data x*

Marginalize over θ (see below)
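In symbols (standard form), for new data x*:

p(x^{*} \mid X) = \int p(x^{*} \mid \theta)\; p(\theta \mid X)\; d\theta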

53 of 56

Monte Carlo Estimate

Average over samples from Posterior
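A minimal sketch for a Beta posterior (toy numbers of my own), where the Monte Carlo average over posterior samples recovers the exact posterior predictive mean:

import numpy as np

rng = np.random.default_rng(0)
a_post, b_post = 9.0, 5.0                               # e.g. a posterior Beta(9, 5)

theta_samples = rng.beta(a_post, b_post, size=10_000)   # theta^(s) ~ p(theta | X)
p_new_is_1 = np.mean(theta_samples)                     # (1/S) sum_s p(x*=1 | theta^(s)), since p(x*=1|theta)=theta
print(p_new_is_1, a_post / (a_post + b_post))           # MC estimate vs exact 9/14 ≈ 0.643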

54 of 56

Modeling tips

  • Latent variables normally come first; the graph flows top to bottom, with observations at the bottom.
  • A discrete latent variable per observation can act as a “switch” between model components.
  • It’s common to add a latent variable per observation to model mixtures of well-known (i.e. easier to work with) distributions.
  • Nice if parent & child are conjugate (more later).

55 of 56

Markov Random Fields

(Undirected Graphical Models)

56 of 56