Locally-Adaptive Boosting Variational Inference (LBVI)
Gian Carlo Diluvi
Trevor Campbell
Ben Bloem-Reddy
UBC Statistics
Background: Bayesian inference
Bayesian inference: use data to update beliefs
prior + likelihood + data = posterior
sampling (MCMC): local characterization of the posterior, but does not discover or properly weight modes
optimization (VI): properly discovers and weights modes, but unable to locally refine the approximation
this work: VI with a mixture of components that locally adapt to the posterior density
Cataloging the visible universe
given an image of (part of) the visible universe:
which light points are galaxies and which are stars? which are background noise?
if they are galaxies, are they elliptical or spiral?
even then, we have to account for...
recent work on this: [Regier 18, 19a, 19b]
(image credits: Sloan Digital Sky Survey Consortium, R. Lupton; ESA, Hubble, NASA, J. Schmidt)
Background: VI
variational inference: approximate the posterior $\pi$ by minimizing the Kullback–Leibler (KL) divergence over a family of distributions $\mathcal{Q}$, i.e., find $q^\star = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}(q \,\|\, \pi)$
problem: a simple family produces a limited-fidelity approximation
solution: use a nonparametric family of mixtures of simple components
problem: optimizing an infinite number of parameters is not feasible...
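To make the basic KL-minimization step concrete, here is a minimal sketch (not from the paper): a single Gaussian $q = \mathcal{N}(m, s^2)$ is fit to a toy two-mode target by minimizing a Monte Carlo estimate of the KL with reparameterized samples; the target, sample size, and optimizer are illustrative assumptions.

```python
# Minimal VI sketch (illustrative, not the paper's setup): fit q = N(m, s^2) to
# a 1-d two-mode target by minimizing a Monte Carlo estimate of KL(q || pi)
# using reparameterized samples x = m + s * eps.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
eps = rng.standard_normal(1000)                        # fixed base draws (common random numbers)

def log_target(x):
    # toy two-mode target: 0.7 * N(-2, 0.5^2) + 0.3 * N(3, 1)
    return logsumexp(np.stack([np.log(0.7) + norm.logpdf(x, -2.0, 0.5),
                               np.log(0.3) + norm.logpdf(x, 3.0, 1.0)]), axis=0)

def kl_estimate(params):
    m, log_s = params
    x = m + np.exp(log_s) * eps                        # samples from q
    entropy = 0.5 * np.log(2 * np.pi * np.e) + log_s   # closed-form entropy of q
    return -entropy - np.mean(log_target(x))           # KL(q || pi) up to a constant

res = minimize(kl_estimate, x0=np.zeros(2), method="Nelder-Mead")
print("fitted mean and sd:", res.x[0], np.exp(res.x[1]))
```

A single Gaussian necessarily trades off the two modes, which is exactly the limited-fidelity problem the mixture family is meant to address.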
Background: Boosting VI (BVI)
solution: build the mixture by iteratively adding components
[Guo 17, Miller 17, Locatello 18a,b, Campbell 19, Giaquinto 20, Dresdner 21]
step 0: choose first component
step 1: choose new component
step 2: reweight
repeat!
problem: many simple components are needed to refine the approximation
problem: optimization gets more difficult as more components are added
insight: simple components do not capture the local behaviour of the target
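A schematic sketch of that loop, under simplifying assumptions not taken from the paper: the new-component step just centres a unit-variance Gaussian where the target density most exceeds the current mixture (rather than solving the actual BVI subproblem), and the reweighting step is a 1-d search; the target, grid, and iteration count are toys.

```python
# Schematic boosting-VI loop on a 1-d toy target: greedily add Gaussian
# components and re-solve a 1-d weight problem after each addition.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp
from scipy.stats import norm

grid = np.linspace(-10.0, 10.0, 2001)
dx = grid[1] - grid[0]
log_target = lambda x: logsumexp(np.stack([np.log(0.7) + norm.logpdf(x, -2.0, 0.5),
                                           np.log(0.3) + norm.logpdf(x, 3.0, 1.0)]), axis=0)

def mix_logpdf(x, mus, sds, ws):
    return logsumexp(np.stack([np.log(w) + norm.logpdf(x, m, s)
                               for m, s, w in zip(mus, sds, ws)]), axis=0)

def kl(mus, sds, ws):
    lq = mix_logpdf(grid, mus, sds, ws)
    return np.sum(np.exp(lq) * (lq - log_target(grid))) * dx   # grid estimate of KL(q || pi)

mus, sds, ws = [0.0], [1.0], [1.0]                              # step 0: first component
for _ in range(4):
    # step 1: place a new component where the mixture misses mass (simplified rule)
    gap = np.exp(log_target(grid)) - np.exp(mix_logpdf(grid, mus, sds, ws))
    mus, sds = mus + [grid[np.argmax(gap)]], sds + [1.0]
    # step 2: reweight via a 1-d search over the new component's weight
    obj = lambda g: kl(mus, sds, [(1 - g) * w for w in ws] + [g])
    g = minimize_scalar(obj, bounds=(1e-4, 0.999), method="bounded").x
    ws = [(1 - g) * w for w in ws] + [g]
print("means:", np.round(mus, 2), "weights:", np.round(ws, 2))
```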
Our work: boosting VI with locally-adaptive components
solution: use components that locally adapt to the target density, each initialized at a simple distribution
bonus: can use a finite-component family and still achieve high accuracy
example: a single adaptive component properly captures a heavy tail as its adaptivity increases; no need to add many light-tailed Gaussians
Locally-adaptive boosting variational inference (LBVI)
greedily add components (discover modes) or increase adaptivity (local refinement)
step 0: choose first component
step 1: either
a. increase adaptivity
b. add new component
step 2: reweight
repeat!
fewer components: approximation is refined by increasing adaptivity
more accurate: yields smaller divergence due to better local approximation
problem: how to define such locally-adaptive components?
MCMC components naturally adapt to local regions of target
natural solution: use the distribution of an MCMC chain run for a given number of steps, initialized at a simple (e.g., Gaussian) distribution
recovers the shape of the target
low probability of mass “leaking”
problem: this component does not have a tractable density (i.e., it cannot be used with the KL)
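A sketch of what such a component looks like, under illustrative assumptions (random-walk Metropolis, a Cauchy stand-in for the target): drawing from the component is just running the chain, but the marginal density of the final state has no closed form.

```python
# Sketch of an MCMC component: draw x0 from a simple Gaussian, then take k
# random-walk Metropolis steps targeting pi. Sampling is easy; the marginal
# density of the k-th state is intractable, which rules out KL-based weighting.
# The Cauchy target, step size, and values of k are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
log_target = lambda x: -np.log1p(x ** 2)            # unnormalized standard Cauchy

def mcmc_component_sample(k, mu0=0.0, sd0=1.0, step=1.0):
    x = rng.normal(mu0, sd0)                        # initialize at a simple distribution
    for _ in range(k):
        prop = x + step * rng.standard_normal()     # symmetric random-walk proposal
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop
    return x

# the component's tails broaden towards the target's as k grows
for k in (0, 10, 100):
    draws = np.array([mcmc_component_sample(k) for _ in range(2000)])
    print(k, np.round(np.quantile(draws, [0.05, 0.95]), 2))
```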
The kernelized Stein discrepancy (KSD)
solution: the KSD only requires samples and the score function
[Gorham 15 & 17, Chwialkowski 16, Liu 16]
no need to evaluate the density of $q$; only i.i.d. samples from $q$ are needed
only depends on $\pi$ through its score, $\nabla \log \pi$
theorem [Chwialkowski 16, T2.2]: under regularity conditions, $\mathrm{KSD}(q, \pi) = 0$ if and only if $q = \pi$
theorem [Liu 16, T3.8]: the KSD is an integral probability metric: $\mathrm{KSD}(q, \pi) = \sup_{\|f\| \le 1} \big| \mathbb{E}_q[(\mathcal{A}_\pi f)(X)] \big|$, with the supremum over the unit ball of the Stein reproducing kernel Hilbert space and $\mathcal{A}_\pi$ the Stein operator
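For concreteness, a minimal 1-d KSD estimator (a sketch under illustrative assumptions: IMQ kernel, V-statistic form, toy Gaussian example). Note that it touches $q$ only through samples and $\pi$ only through its score.

```python
# Minimal 1-d kernelized Stein discrepancy estimate with an IMQ kernel
# k(x, y) = (c^2 + (x - y)^2)^(-1/2). Only samples from q and the score of pi
# appear; the density of q is never evaluated. Kernel and c are illustrative.
import numpy as np

def ksd2(samples, score, c=1.0):
    x = np.asarray(samples)
    d = x[:, None] - x[None, :]
    sx, sy = score(x)[:, None], score(x)[None, :]
    base = c ** 2 + d ** 2
    k = base ** (-0.5)
    dk_dx, dk_dy = -d * base ** (-1.5), d * base ** (-1.5)   # kernel gradients
    d2k = base ** (-1.5) - 3.0 * d ** 2 * base ** (-2.5)     # mixed second derivative
    u = sx * sy * k + sx * dk_dy + sy * dk_dx + d2k          # Stein kernel u_pi(x, y)
    return u.mean()                                          # V-statistic estimate of KSD^2

rng = np.random.default_rng(2)
score = lambda x: -x                                         # score of pi = N(0, 1)
print(ksd2(rng.normal(0.0, 1.0, 500), score))                # small: q matches pi
print(ksd2(rng.normal(2.0, 1.0, 500), score))                # larger: q is shifted
```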
KSD is inappropriate for boosting
problem: KSD will not add modes
theorem (informal): let the target be a well-separated multimodal distribution, let the bad approximation miss one of its modes, and let the good approximation cover it; then the bad approximation attains the smaller KSD
the KSD does not penalize the bad approximation, and adding the missing mode would actually increase the KSD...
...so KSD-based boosting algorithms would prefer not to add missing modes, even though adding them would decrease the KL
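A quick numerical illustration of this point (a toy sketch, not the theorem itself): for a well-separated two-mode target, samples that cover only one mode already attain a near-zero KSD estimate, so a KSD-driven boosting step has little incentive to add the missing mode.

```python
# Toy illustration: the KSD barely notices a missing mode of a well-separated
# mixture target. Kernel, mode separation, and sample sizes are illustrative;
# ksd2 is the same IMQ-kernel V-statistic sketched earlier, repeated here so
# the snippet is self-contained.
import numpy as np
from scipy.stats import norm

def score_mixture(x, mu=6.0):
    # score of pi = 0.5 N(-mu, 1) + 0.5 N(mu, 1)
    p1, p2 = norm.pdf(x, -mu, 1.0), norm.pdf(x, mu, 1.0)
    return (-(x + mu) * p1 - (x - mu) * p2) / (p1 + p2)

def ksd2(samples, score, c=1.0):
    x = np.asarray(samples)
    d = x[:, None] - x[None, :]
    sx, sy = score(x)[:, None], score(x)[None, :]
    base = c ** 2 + d ** 2
    dk = base ** (-1.5)
    u = sx * sy * base ** (-0.5) + (sx - sy) * d * dk + dk - 3.0 * d ** 2 * base ** (-2.5)
    return u.mean()

rng = np.random.default_rng(5)
one_mode = rng.normal(-6.0, 1.0, 500)                         # misses half the mass
both_modes = np.where(rng.uniform(size=500) < 0.5,
                      rng.normal(-6.0, 1.0, 500), rng.normal(6.0, 1.0, 500))
print("KSD^2, one mode  :", ksd2(one_mode, score_mixture))    # tiny despite the missing mode
print("KSD^2, both modes:", ksd2(both_modes, score_mixture))  # also tiny
```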
SMC provides flexible components with tractable densities
solution: use sequential Monte Carlo (SMC) components
[Gordon 93, Chopin 02, Del Moral 06]
setup: temper a reference towards the target
reference: a simple, tractable distribution $\pi_0$ (e.g., Gaussian)
target: the posterior $\pi$
tempering: e.g., the geometric path $\pi_\beta \propto \pi_0^{1-\beta}\, \pi^{\beta}$, with $\beta$ increasing from 0 to 1
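A small sketch of that path (the Gaussian reference and stand-in target are illustrative assumptions): the tempered density is an explicit formula in the unnormalized reference and target, which is behind the slide's point that SMC yields components with tractable densities.

```python
# Geometric tempering path: log pi_beta interpolates between a simple reference
# and the target, and can be evaluated directly from their unnormalized log
# densities. Reference and target below are illustrative assumptions.
import numpy as np
from scipy.stats import norm

log_ref = lambda x: norm.logpdf(x, 0.0, 3.0)                 # reference pi_0 (simple)
log_tgt = lambda x: norm.logpdf(x, 4.0, 0.5)                 # stand-in for the target pi
log_tempered = lambda x, beta: (1.0 - beta) * log_ref(x) + beta * log_tgt(x)

x = np.linspace(-4.0, 6.0, 6)
for beta in (0.0, 0.25, 0.5, 1.0):                           # reference -> target
    print(beta, np.round(log_tempered(x, beta), 2))
```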
Locally-adaptive BVI with SMC components
setup: a finite family of SMC components
step 1a: increase a component's adaptivity: maximize the temperature increase subject to the effective sample size (ESS) staying above a threshold
(the ESS measures sample-quality degradation and can be calculated within SMC)
desideratum: take a large adaptivity step without compromising SMC sample quality
step 1b: modify the weight of component n: a 1-d weight update that evenly down-weights the other components
(minimize the KL over the weight of component n by minimizing a second-order approximation to it; closed-form solution)
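A sketch of what step 1a can look like in practice, under assumptions not taken from the paper (a 1-d Gaussian reference and target, an ESS threshold of half the particle count, and bisection over the next temperature): starting from particles at the current temperature, push the temperature as far as the ESS of the incremental importance weights allows.

```python
# Sketch of an ESS-constrained adaptivity step: given particles at temperature
# beta, find the largest beta' <= 1 whose incremental importance weights
# (pi/pi_0)^(beta' - beta) keep the effective sample size above a threshold.
# Reference, target, threshold, and the bisection are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
log_ref = lambda x: norm.logpdf(x, 0.0, 3.0)                 # reference pi_0
log_tgt = lambda x: norm.logpdf(x, 4.0, 0.5)                 # stand-in target pi

def ess(logw):
    w = np.exp(logw - logw.max())
    return w.sum() ** 2 / (w ** 2).sum()

def next_beta(x, beta, thresh):
    delta = log_tgt(x) - log_ref(x)                          # log pi - log pi_0 at the particles
    if ess((1.0 - beta) * delta) >= thresh:
        return 1.0                                           # can jump straight to the target
    lo, hi = beta, 1.0
    for _ in range(50):                                      # bisection on the temperature
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if ess((mid - beta) * delta) >= thresh else (lo, mid)
    return lo

particles = rng.normal(0.0, 3.0, 1000)                       # particles at beta = 0 (the reference)
print("largest admissible next temperature:", next_beta(particles, 0.0, thresh=500.0))
```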
Synthetic experiment
target: a mixture of two bananas and four Gaussians
[figure: approximations from LBVI (ours), Universal BVI (minimizes squared Hellinger instead of KL), and BBBVI (regularized BVI); BVI runs into degeneracy issues and is not shown]
LBVI consistently produces better approximations with fewer components
Conclusion
boosting VI enables Bayesian inference with difficult, multimodal posterior distributions...
...but many simple Gaussian components are needed to refine the approximation
this work: boosting VI with components that adapt locally to the target density, and an example instantiation with SMC components
bonus: the KSD fails to detect missing mass and cannot be used in boosting (e.g., with MCMC components)
coming soon!
slides
The score function provides insufficient information
problem: the KSD fails to detect missing modes
intuition: the score function carries no information about the mixing weights of a (well-separated) multimodal target
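To see why (a standard calculation, not specific to this work): for a mixture target $\pi = \sum_i w_i \phi_i$, the score can be written in terms of the responsibilities $r_i$,
\[
\nabla \log \pi(x) = \frac{\sum_i w_i \nabla \phi_i(x)}{\sum_j w_j \phi_j(x)} = \sum_i r_i(x)\, \nabla \log \phi_i(x),
\qquad r_i(x) = \frac{w_i \phi_i(x)}{\sum_j w_j \phi_j(x)}.
\]
If the components are well separated, then near the $i$-th mode $r_i(x) \approx 1$ and $r_{j \neq i}(x) \approx 0$, so
\[
\nabla \log \pi(x) \approx \nabla \log \phi_i(x),
\]
which does not involve the mixing weights $w$: score-based discrepancies such as the KSD cannot locally distinguish differently weighted, or missing, modes.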
SMC provides adaptive components with tractable densities
solution: use sequential Monte Carlo (SMC) components [Gordon 93, Chopin 02, Del Moral 06]
setup: temper a reference $\pi_0$ towards the target $\pi$ along $\pi_\beta \propto \pi_0^{1-\beta}\, \pi^{\beta}$, with $\beta$ increasing from 0 to 1
idea: move the particles through the sequence of tempered distributions until the target is reached; the particle distribution at each intermediate temperature is the local approximation
step 1: reweight
step 2: resample
step 3: rejuvenate
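A minimal sketch of one such SMC sweep, under illustrative assumptions (a fixed linear temperature schedule, a Gaussian reference, a toy Gaussian target, and a single random-walk Metropolis rejuvenation step per temperature):

```python
# Minimal SMC sweep: move particles from the reference to the target along a
# tempering schedule with the reweight / resample / rejuvenate steps above.
# Schedule, kernels, and the toy Gaussian target are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
log_ref = lambda x: norm.logpdf(x, 0.0, 3.0)                 # reference pi_0
log_tgt = lambda x: norm.logpdf(x, 4.0, 0.5)                 # toy target pi
log_tempered = lambda x, b: (1.0 - b) * log_ref(x) + b * log_tgt(x)

betas = np.linspace(0.0, 1.0, 11)
x = rng.normal(0.0, 3.0, 2000)                               # particles drawn from pi_0
for b_prev, b in zip(betas[:-1], betas[1:]):
    # step 1: reweight towards the next tempered density
    logw = log_tempered(x, b) - log_tempered(x, b_prev)
    w = np.exp(logw - logw.max()); w /= w.sum()
    # step 2: resample proportionally to the weights
    x = x[rng.choice(x.size, size=x.size, p=w)]
    # step 3: rejuvenate with one Metropolis step targeting the new tempered density
    prop = x + 0.5 * rng.standard_normal(x.size)
    accept = np.log(rng.uniform(size=x.size)) < log_tempered(prop, b) - log_tempered(x, b)
    x = np.where(accept, prop, x)
print("final particle mean and sd:", round(x.mean(), 2), round(x.std(), 2))  # roughly 4 and 0.5
```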