Parallel tempering with a variational reference
Saifuddin Syed
Nikola Surjanovic
Alexandre Bouchard-Côté
Trevor Campbell
Parallel tempering
Annealing path: $(\pi_\beta)_{\beta \in [0,1]}$ is a path of distributions between the reference $\pi_0$ and the target $\pi_1$
Run $N$ chains, each targeting a distribution $\pi_{\beta_i}$ along the path
Alternate between local exploration and communication
Local exploration: update each chain according to an MCMC algorithm targeting its own $\pi_{\beta_i}$
[Figure: chains interpolating from the reference to the target]
Communication: swap the states of adjacent chains $i$ and $i+1$ with probability $\alpha = 1 \wedge \dfrac{\pi_{\beta_i}(x_{i+1})\,\pi_{\beta_{i+1}}(x_i)}{\pi_{\beta_i}(x_i)\,\pi_{\beta_{i+1}}(x_{i+1})}$
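The two moves above can be sketched in a minimal self-contained toy example (the 1-D reference and target densities, chain count, and step sizes here are illustrative assumptions, not from the paper):

```python
import math
import random

random.seed(0)

# Hypothetical 1-D example: reference N(0, 1), target N(5, 1).
def log_pi0(x):
    return -0.5 * x * x

def log_pi1(x):
    return -0.5 * (x - 5.0) ** 2

n_chains = 6
betas = [i / (n_chains - 1) for i in range(n_chains)]  # annealing schedule

def log_pi(beta, x):
    # Log-linear path: pi_beta ∝ pi0^(1-beta) * pi1^beta
    return (1.0 - beta) * log_pi0(x) + beta * log_pi1(x)

x = [0.0] * n_chains          # one state per chain
target_samples = []
for scan in range(3000):
    # Local exploration: one random-walk Metropolis step per chain
    for i, b in enumerate(betas):
        prop = x[i] + random.gauss(0.0, 1.0)
        if math.log(random.random()) < log_pi(b, prop) - log_pi(b, x[i]):
            x[i] = prop
    # Communication: propose swapping the states of adjacent chains
    for i in range(n_chains - 1):
        log_alpha = (log_pi(betas[i], x[i + 1]) + log_pi(betas[i + 1], x[i])
                     - log_pi(betas[i], x[i]) - log_pi(betas[i + 1], x[i + 1]))
        if math.log(random.random()) < log_alpha:
            x[i], x[i + 1] = x[i + 1], x[i]
    if scan >= 500:           # discard burn-in
        target_samples.append(x[n_chains - 1])
```

The target chain's samples should concentrate around the target mode near 5.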
Problem: a fixed reference can be far from the target, so chains rarely communicate
[Figure: fixed reference far from the target]
Contribution: a variational reference tuned to be close to the target
[Figure: variational reference close to the target]
The target $\pi_1$ is usually the posterior; the fixed reference $\pi_0$ is usually the prior. (Can we do better?)
Quantifying PT performance
Goal: assess whether a given choice of reference will perform well
We use the restart rate to assess PT communication performance
[Figure: communication swaps carry a sample from the reference chain to the target chain, ending in a restart]
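The restart rate can be estimated empirically by tracking which chain a replica currently occupies; a restart is one trip from the reference chain to the target chain. A minimal sketch (the helper name and trajectory are illustrative, not from the paper):

```python
def count_restarts(traj, n_chains):
    """Count reference-to-target trips in a replica's chain-index trajectory."""
    restarts, at_reference = 0, False
    for idx in traj:
        if idx == 0:                              # replica visits the reference
            at_reference = True
        elif idx == n_chains - 1 and at_reference:  # ...then reaches the target
            restarts += 1
            at_reference = False
    return restarts

# Hypothetical trajectory over 4 chains: two full reference-to-target trips.
traj = [0, 1, 2, 3, 2, 3, 0, 1, 2, 3]
n_restarts = count_restarts(traj, 4)
```

Dividing the restart count by the number of PT scans gives the restart rate.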
Suboptimality of a fixed reference
If the reference and target do not overlap much, PT swapping is inefficient
[Figure: in the large-data limit the posterior concentrates to a point mass far from the prior (low restart rate)]
Proposition (problem with standard PT): for an exchangeable* Bayesian model with $n$ conditionally i.i.d. data points, PT with a fixed reference has a restart rate that tends to zero as $n \to \infty$ (see the paper for the precise statement)
*I.e., satisfying the usual technical assumptions for a Bernstein-von Mises result, etc.
Annealing paths with a variational reference
We introduce a variational reference family $\{q_\phi\}$ and the new annealing path from $q_\phi$ to the target
Consider exponential family references $q_\phi$ with variational parameters $\phi$ (e.g., the mean and variance of a normal distribution)
How do we choose $\phi$ so that $q_\phi$ is close to the target and yields a high restart rate?
Variational reference tuning
We minimize the forward (inclusive) KL divergence, $\mathrm{KL}(\pi_1 \,\|\, q_\phi)$, with a gradient-free procedure (i.e., no user tuning effort!)
Use samples from the target chain to tune the reference with moment matching: choose $\phi$ so that the moments of $q_\phi$ match the empirical moments of the target samples
Repeat with an exponentially increasing number of samples (new tuning round)
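For a normal variational family, the moment-matching update is simply the empirical mean and variance of the target-chain samples. A minimal 1-D sketch (the sample-generating stand-in below is hypothetical):

```python
import random

random.seed(1)

# Hypothetical stand-in for samples collected from the target chain.
target_samples = [random.gauss(2.0, 0.5) for _ in range(4000)]

# Moment matching: for a normal family, the minimizer of the forward KL
# (target || q_phi) has the same mean and variance as the target.
mu_hat = sum(target_samples) / len(target_samples)
var_hat = sum((s - mu_hat) ** 2 for s in target_samples) / len(target_samples)
```

The tuned reference is then $q_\phi = \mathcal{N}(\hat\mu, \hat\sigma^2)$.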
Variational reference tuning
Theorem (sketch): The variational parameter estimates converge to the forward KL minimizer almost surely.
The restart rate at the forward KL minimum is bounded in terms of the flexibility of the variational family. (More details in paper.)
Optimality of PT with a variational reference
Proposition: for an exchangeable Bayesian model with conditionally i.i.d. data points, PT with a normal variational reference that minimizes the forward KL attains a restart rate that does not vanish in the large-data limit (see the paper for the precise statement)
[Figure: the variational reference tracks the concentrating posterior, unlike the prior (high restart rate)]
Experiments
PT with a variational reference empirically outperforms PT with a fixed reference. We compare variational PT with a normal reference (mean-field approximation and full covariance) against NRPT.
Tested variational PT on 14 models (phylogenetic inference, sparse CAR models, logistic regression, ODEs)
Up to a 40x increase in the number of restarts compared to the current state of the art
[Figures: posterior cuts and restart counts for a simple mixture model and an mRNA transfection model (Ballnus et al. 2017); higher restart count = more efficient communication, plotted against compute time]
Summary
Problem: a fixed reference far from the target makes PT communication inefficient (low restart rate)
Contribution: a variational reference tuned with a gradient-free moment-matching procedure, with guarantees on the restart rate and up to a 40x increase in restarts in practice
Thank you
Questions?
Parallel tempering
Objective: Minimize the global communication barrier (GCB)
The GCB characterizes the efficiency of PT [Syed et al. 2019]
Restart rate: the rate at which samples from the reference travel to the target
Minimizing the global communication barrier maximizes the restart rate
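Following Syed et al. (2019) (a sketch; see that paper for the exact conditions), with an optimal annealing schedule the restart (round trip) rate $\tau$ satisfies, as the number of chains $N$ grows,

```latex
\tau \;\longrightarrow\; \frac{1}{2 + 2\Lambda}
\qquad \text{as } N \to \infty,
```

so minimizing the GCB $\Lambda$ is equivalent to maximizing the restart rate.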
Variational reference tuning
Algorithm:
1) Run the non-reversible PT algorithm (NRPT) [Syed et al. 2019]
2) Use the obtained samples from the target chain to update the annealing schedule using the procedure in [Syed et al. 2019]
3) Update $\phi$ by moment matching so that the moments of $q_\phi$ match the empirical moments of the target samples
4) Repeat 1)-3) until the computational budget is depleted
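The four steps above can be sketched as a loop with an exponentially growing sample budget per round. Here `run_nrpt` is a hypothetical stand-in for steps 1-2 (a real implementation would run NRPT and update the schedule):

```python
import random

random.seed(2)

def run_nrpt(mu, sigma, n_samples):
    # Stand-in for steps 1-2: pretend the target chain returns n_samples
    # draws from N(3, 2) regardless of the current reference parameters.
    return [random.gauss(3.0, 2.0) for _ in range(n_samples)]

def tune_reference(rounds=6, n0=100):
    mu, sigma = 0.0, 1.0                   # initial 1-D normal reference
    for r in range(rounds):
        n = n0 * 2 ** r                    # exponentially increasing budget
        samples = run_nrpt(mu, sigma, n)   # steps 1-2: NRPT + schedule update
        mu = sum(samples) / n              # step 3: moment matching
        sigma = (sum((s - mu) ** 2 for s in samples) / n) ** 0.5
    return mu, sigma

mu, sigma = tune_reference()
```

In this toy setup the tuned reference parameters approach the stand-in target's mean 3 and standard deviation 2.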