1 of 28

Parallel tempering with a variational reference

Saifuddin Syed

Nikola Surjanovic

Alexandre Bouchard-Côté

Trevor Campbell

2 of 28

Parallel tempering

  • Class of MCMC algorithms suited for complex target distributions


  • Constructs a path of distributions that interpolates between a tractable reference, π₀, and an intractable target, π₁
  • States along the path are swapped to improve mixing in the target

3 of 28

Parallel tempering

Annealing path: (π_β), β ∈ [0, 1], is a path of distributions between the reference π₀ and the target π₁, e.g. π_β(x) ∝ π₀(x)^(1−β) π₁(x)^β


Run N chains, with chain n targeting π_{β_n} for an annealing schedule 0 = β_1 < ⋯ < β_N = 1

Alternate local exploration and communication
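The alternation of local exploration and communication can be sketched as a single PT scan; `explore` and `maybe_swap` are hypothetical placeholders for the per-chain MCMC kernel and the PT swap move, and the even/odd pairing of swap attempts follows the non-reversible swap scheme:

```python
def pt_scan(chains, explore, maybe_swap, scan_index):
    """One PT scan: a local-exploration pass followed by a communication
    pass. `explore(n, x)` applies one MCMC move for chain n; `maybe_swap
    (chains, n)` swaps chains n and n+1 with the PT acceptance
    probability. Non-reversible PT alternates even and odd swap pairs
    across successive scans."""
    # Local exploration: update every chain with its own MCMC kernel
    for n in range(len(chains)):
        chains[n] = explore(n, chains[n])
    # Communication: attempt swaps between adjacent chains,
    # alternating even pairs (0-1, 2-3, ...) and odd pairs (1-2, ...)
    parity = scan_index % 2
    for n in range(parity, len(chains) - 1, 2):
        maybe_swap(chains, n)
    return chains
```

The deterministic even/odd alternation (rather than picking a random pair) is what makes the communication scheme non-reversible.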

4 of 28

Parallel tempering

Local exploration: Update each chain according to an MCMC algorithm


Communication: Swap the states of adjacent chains n and n+1 (spanning from reference to target) with the Metropolis acceptance probability

min{1, [π_{β_n}(x_{n+1}) π_{β_{n+1}}(x_n)] / [π_{β_n}(x_n) π_{β_{n+1}}(x_{n+1})]}
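The swap acceptance probability can be computed from the annealed log densities; a minimal sketch, assuming a helper `log_pi(beta, x)` that returns the log density of π_β (up to a β-independent constant):

```python
import numpy as np

def swap_acceptance(log_pi, betas, states, n):
    """Metropolis acceptance probability for swapping the states of
    chains n and n+1, where chain k targets the annealed density
    pi_beta_k(x) ∝ pi0(x)^(1 - beta_k) * pi1(x)^beta_k.
    `log_pi(beta, x)` is an assumed helper for the annealed log density."""
    b0, b1 = betas[n], betas[n + 1]
    x0, x1 = states[n], states[n + 1]
    # Ratio of the two densities evaluated at the swapped vs. current states
    log_ratio = (log_pi(b0, x1) + log_pi(b1, x0)
                 - log_pi(b0, x0) - log_pi(b1, x1))
    return min(1.0, float(np.exp(log_ratio)))
```

For example, with a standard normal reference and an N(3, 1) target, a swap between typical reference and target states is accepted with very low probability, which is exactly the inefficiency the next slides address.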

5 of 28

Parallel tempering

Problem:

  • Past work on PT has used only paths with a fixed reference (usually the prior)
  • The fixed reference is often very different from the target
  • As a result, PT swapping is inefficient

[Figure: a fixed reference with little overlap with the target]

6 of 28

Parallel tempering


Contribution:

  • Extend to annealing paths with an optimized variational reference
  • Improve PT swapping significantly (e.g., up to 40x more efficient in real problems)
  • No tuning required from the user
  • Theoretical performance guarantees

[Figure: an optimized variational reference close to the target]

7 of 28

Parallel tempering

Annealing path: (π_β), β ∈ [0, 1], is a path of distributions between the reference π₀ and the target π₁

The target π₁ is usually the posterior; the fixed reference π₀ is usually the prior. (Can we do better?)

8 of 28

Quantifying PT performance

Goal: Assess whether a choice of reference π₀ will perform well


Communication swap

We use the restart rate to assess PT communication performance

13 of 28

Suboptimality of a fixed reference

If reference and target do not overlap much → PT swapping is inefficient

[Figure: as the number of observations grows, the posterior concentrates to a point mass far from the prior, giving a low restart rate]

Proposition (problem with standard PT): For an exchangeable* Bayesian model with conditionally i.i.d. data points, the restart rate of PT with a fixed reference converges to zero as the number of data points grows

*I.e., satisfies usual technical assumptions for a Bernstein-von Mises result, etc.

14 of 28

Annealing paths with a variational reference

We introduce a variational reference family {q_φ}, and the new annealing path π_β(x) ∝ q_φ(x)^(1−β) π₁(x)^β

Consider exponential family references q_φ with variational parameters φ (e.g., the mean and variance of a normal distribution)

How do we choose φ so that the reference is close to the target and achieves a high restart rate?

15 of 28

Variational reference tuning

We minimize the forward (inclusive) KL divergence, KL(π₁ ‖ q_φ),

with a gradient-free procedure (i.e., no user tuning effort!)


Use samples from the target chain to tune the reference with moment matching: choose φ so that the moments of q_φ match the empirical moments of the target samples

Repeat with an exponentially increasing number of samples (each repetition is a new tuning round)
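A minimal sketch of the moment-matching step for a full-covariance normal reference, assuming the target-chain samples are available as an array:

```python
import numpy as np

def fit_normal_reference(samples):
    """Moment-match a full-covariance normal reference q_phi to samples
    from the target chain. For an exponential family, matching the
    moments of the sufficient statistics minimizes the forward
    (inclusive) KL divergence KL(target || q_phi)."""
    samples = np.asarray(samples)
    mu = samples.mean(axis=0)            # matches E[x]
    cov = np.cov(samples, rowvar=False)  # matches the second moment (centered)
    return mu, cov
```

No gradients are needed: the forward-KL minimizer within the family is characterized by these moment conditions, which is what makes the procedure tuning-free for the user.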

16 of 28

Variational reference tuning

Theorem (sketch): The variational parameter estimates converge to the forward KL minimizer almost surely.


The restart rate at the forward KL minimum is bounded in terms of the flexibility of the variational family. (More details in paper.)

17 of 28

Optimality of PT with a variational reference

Proposition: For an exchangeable Bayesian model with conditionally i.i.d. data points, the restart rate of PT with a normal variational reference that minimizes the forward KL remains bounded away from zero as the number of data points grows

[Figure: unlike the prior, the variational reference stays close to the posterior as it concentrates, giving a high restart rate]

18 of 28

Experiments

PT with a variational reference empirically outperforms PT with a fixed reference. We compare variational PT with a normal reference (mean-field and full-covariance variants) against NRPT [Syed et al. 2019].

Tested variational PT on 14 models (phylogenetic inference, sparse CAR models, logistic regression, ODEs)

Up to 40x increase in the number of restarts compared to the current state of the art

[Figure: results on a simple mixture model]

19 of 28

Experiments

[Figure: posterior cuts for the simple mixture model and the mRNA transfection model*]

*Ballnus et al. (2017)

20 of 28

Experiments

[Figure: restarts vs. compute time for the simple mixture model and the mRNA transfection model; higher = more efficient communication]

21 of 28

Experiments

[Figure: posterior cut for the simple mixture model]

22 of 28

Experiments

[Figure: restarts vs. compute time for the simple mixture model; higher = more efficient communication]

23 of 28

Summary

Problem:

  • Past work on PT has used only paths with a fixed reference
  • The resulting PT swapping is inefficient

Contribution:

  • Extend to annealing paths with an optimized variational reference
  • Improve PT swapping significantly (e.g., up to 40x more efficient in real problems)
  • No tuning required from the user
  • Theoretical performance guarantees


24 of 28

Thank you


25 of 28

Questions?


26 of 28

Parallel tempering

Objective: Minimize the global communication barrier (GCB)


The GCB characterizes the efficiency of PT [Syed et al. 2019]

27 of 28

Restart rate

Minimize the global communication barrier → Maximize the restart rate, i.e., the rate at which samples from the reference travel to the target

With an optimal annealing schedule, the restart rate is approximately (2 + 2Λ)⁻¹, where Λ is the GCB [Syed et al. 2019]


28 of 28

Variational reference tuning

Algorithm:

1) Run the non-reversible PT algorithm (NRPT) [Syed et al. 2019]

2) Use obtained samples from the target chain to update the annealing schedule using the procedure in [Syed et al. 2019]

3) Update the variational parameters φ so that the moments of q_φ match the empirical moments of the target-chain samples

4) Repeat 1-3 until the computational budget is depleted
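Steps 1-4 can be sketched as a loop; all callables here are hypothetical placeholders for the NRPT run, the schedule update, and the moment-matching step named above, with the number of NRPT scans doubling each round to give the exponentially increasing sample counts:

```python
def variational_pt(run_nrpt, update_schedule, fit_reference,
                   phi, schedule, n_rounds):
    """Sketch of the tuning loop from steps 1-4. All callables are
    hypothetical placeholders: `run_nrpt` runs NRPT and returns
    target-chain samples plus swap statistics, `update_schedule`
    adapts the annealing schedule from those statistics, and
    `fit_reference` moment-matches the variational reference."""
    n_scans = 2
    for _ in range(n_rounds):
        # Step 1: run NRPT with the current reference and schedule
        target_samples, swap_stats = run_nrpt(phi, schedule, n_scans)
        # Step 2: update the annealing schedule from swap statistics
        schedule = update_schedule(schedule, swap_stats)
        # Step 3: moment-match the variational reference
        phi = fit_reference(target_samples)
        # Step 4: repeat with exponentially more samples per round
        n_scans *= 2
    return phi, schedule
```

Doubling the scan count means roughly half of the total budget is spent in the final round, so the last (best-tuned) reference sees the most samples.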
