
Locally-Adaptive Boosting Variational Inference (LBVI)

Gian Carlo Diluvi

Trevor Campbell

Ben Bloem-Reddy

UBC Statistics



Background: Bayesian inference

Bayesian inference: use data to update beliefs

  • quantifies uncertainty in model parameters
  • posterior is hard to simulate from for most models

[diagram: prior + likelihood + data = posterior]

sampling (MCMC): local characterization of the posterior, but does not discover/properly weight modes

optimization (VI): properly discovers and weights modes, but unable to locally refine the approximation

this work: VI with a mixture of components that locally adapt to the posterior density


Cataloging the visible universe

[image credit: Sloan Digital Sky Survey Consortium, R. Lupton]

given an image of (part of) the visible universe...

  • which light points are galaxies and which are stars?
  • which are background noise?
  • if they are galaxies, are they elliptical or spiral?

even then, we have to account for...

  • multiple “shots” of the same area
  • some lights being very dim
  • pixelation, radiation, finite exposure

[image credit: ESA, Hubble, NASA, J. Schmidt]

recent work on this: [Regier 18, 19a, 19b]


Background: VI


variational inference:

minimize the Kullback–Leibler (KL) divergence to the posterior $\pi$ over a family of distributions $\mathcal{Q}$

problem: simple family produces limited-fidelity approximation

solution: use nonparametric family of mixtures of simple components

problem: optimizing infinite number of parameters not feasible...
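In symbols (a standard formulation; the slide's own formulas were lost in extraction), the objective and the nonparametric mixture family above are:

```latex
\[
  q^\star = \operatorname*{arg\,min}_{q \in \mathcal{Q}} \mathrm{KL}(q \,\|\, \pi),
  \qquad
  \mathrm{KL}(q \,\|\, \pi) = \mathbb{E}_{q}\!\left[\log \tfrac{q(X)}{\pi(X)}\right],
  \qquad
  \mathcal{Q} = \Big\{ \textstyle\sum_{k} w_k\, g_k \;:\; w_k \ge 0,\ \textstyle\sum_k w_k = 1,\ g_k \text{ simple} \Big\}.
\]
```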


Background: Boosting VI (BVI)

solution: build the mixture by iteratively adding components

[Guo 17, Miller 17, Locatello 18a,b, Campbell 19, Giaquinto 20, Dresdner 21]

step 0: choose first component

step 1: choose new component

step 2: reweight

repeat!

problem: many simple components are needed to refine approximation

problem: optimization gets more difficult as more components are added

insight: simple components do not capture the local behaviour of target
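A common form of the greedy update in the cited BVI literature (notation mine, not from the slide): each iteration fits one new component and a mixing weight.

```latex
\[
  g_{t+1} \in \operatorname*{arg\,min}_{g}\;
    \mathrm{KL}\big( (1-\gamma_t)\, q_t + \gamma_t\, g \,\big\|\, \pi \big),
  \qquad
  q_{t+1} = (1-\gamma_t)\, q_t + \gamma_t\, g_{t+1}.
\]
```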


Our work: boosting VI with locally-adaptive components

solution: use components that locally adapt to the target density, initialized at a simple distribution

bonus: can use a finite-component family and still achieve high accuracy

  • properly captures heavy tails as the components' adaptivity increases
  • no need to add many light-tailed Gaussians


Locally-adaptive boosting variational inference (LBVI)

greedily add components (discover modes) or increase adaptivity (local refinement)

step 0: choose first component

step 1: either

a. increase adaptivity

b. add new component

step 2: reweight

repeat!
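A minimal sketch of this loop. The scoring callables `refine_gain` and `add_gain` are hypothetical placeholders standing in for the divergence-decrease estimates developed on the following slides; none of these names come from the talk.

```python
def lbvi_step(components, adaptivity, candidates, refine_gain, add_gain):
    """One greedy LBVI iteration: refine an existing component or add a new one."""
    # step 1a candidate: existing component whose local refinement
    # is estimated to decrease the divergence the most
    best_n = max(range(len(components)), key=refine_gain)
    # step 1b candidate: new component (mode discovery) with the largest
    # estimated divergence decrease
    best_g = max(candidates, key=add_gain)
    if refine_gain(best_n) >= add_gain(best_g):
        adaptivity[best_n] += 1        # step 1a: increase adaptivity
    else:
        components.append(best_g)      # step 1b: add a new component
        adaptivity.append(0)
    return components, adaptivity      # step 2 (reweighting) follows

# toy usage with dummy gains:
# lbvi_step(['g0'], [0], ['g1'], refine_gain=lambda n: 0.1, add_gain=lambda g: 0.3)
```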

fewer components: approximation is refined by increasing adaptivity

more accurate: yields smaller divergence due to better local approximation

problem: how should the locally-adaptive components be defined?


MCMC components naturally adapt to local regions of target

natural solution: use the distribution of an MCMC chain run for a given number of steps

initialized at a simple (e.g., Gaussian) distribution

recovers shape of target

low probability of mass “leaking”

problem: the resulting component does not have a tractable density (i.e., cannot be used with the KL)
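To make the idea concrete, a toy sketch (my example, not from the talk): the component is the law of a random-walk Metropolis chain initialized from a Gaussian; sampling is easy, and more steps adapt the component further toward the target, but the chain's marginal density after a fixed number of steps is intractable.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # illustrative heavy-tailed target: standard Cauchy log-density
    return -np.log(np.pi * (1.0 + x ** 2))

def mcmc_component_sample(n_steps, step_size=1.0):
    """One draw from the 'n_steps-step MCMC' component: initialize at a
    simple Gaussian, then run random-walk Metropolis for n_steps."""
    x = rng.normal()                       # simple reference distribution
    for _ in range(n_steps):
        prop = x + step_size * rng.normal()
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop
    return x

# more steps => the component captures more of the heavy tails
samples = np.array([mcmc_component_sample(n_steps=50) for _ in range(1000)])
```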


The kernelized Stein discrepancy (KSD)


solution: the KSD only requires samples and the score function

[Gorham 15&17, Chwialkowski 16, Liu 16]

no need to evaluate the density of the approximation $q$; only i.i.d. samples from it are needed

only depends on the target $\pi$ through its score, $\nabla \log \pi$

theorem [Chwialkowski 16, T2.2]: under mild conditions, $\mathrm{KSD}(q, \pi) = 0$ if and only if $q = \pi$

theorem [Liu 16, T3.8]: the KSD is an integral probability metric: $\mathrm{KSD}(q, \pi) = \sup_{\|f\|_{\mathcal{H}} \le 1} \mathbb{E}_{X \sim q}\big[\mathcal{A}_\pi f(X)\big]$, where $\mathcal{A}_\pi$ is the Stein operator
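The KSD can indeed be estimated from samples plus the score alone. A numpy sketch using the common IMQ base kernel $k(x,y) = (c^2 + \|x-y\|^2)^{\beta}$; the kernel choice, names, and example target here are my assumptions, not from the talk.

```python
import numpy as np

def ksd_squared(X, score, c=1.0, beta=-0.5):
    """U-statistic estimate of KSD^2 with the IMQ base kernel
    k(x, y) = (c^2 + ||x - y||^2)^beta. Needs only samples X ~ q
    and the target score, grad log pi."""
    n, d = X.shape
    S = score(X)                                   # (n, d) target scores at X
    diff = X[:, None, :] - X[None, :, :]           # (n, n, d) pairwise x - y
    r2 = (diff ** 2).sum(-1)                       # squared distances
    u = c ** 2 + r2
    k = u ** beta                                  # base kernel values
    gx = 2 * beta * u[..., None] ** (beta - 1) * diff  # grad_x k; grad_y k = -gx
    trace = (-4 * beta * (beta - 1) * u ** (beta - 2) * r2
             - 2 * beta * d * u ** (beta - 1))     # sum_i d^2 k / dx_i dy_i
    U = ((S @ S.T) * k                             # s(x)·s(y) k(x,y)
         + np.einsum('jd,ijd->ij', S, gx)          # s(y)·grad_x k
         - np.einsum('id,ijd->ij', S, gx)          # s(x)·grad_y k
         + trace)
    np.fill_diagonal(U, 0.0)                       # U-statistic: drop diagonal
    return U.sum() / (n * (n - 1))

# example: target = standard normal, so score(x) = -x
rng = np.random.default_rng(1)
print(ksd_squared(rng.normal(size=(200, 1)), lambda X: -X))            # near 0
print(ksd_squared(rng.normal(2.0, 1.0, size=(200, 1)), lambda X: -X))  # larger
```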


KSD is inappropriate for boosting

problem: KSD will not add modes

theorem (informal): let $q_{\text{bad}}$ be an approximation that misses a mode of the target $\pi$, and $q_{\text{good}}$ one that covers all modes; then $\mathrm{KSD}(q_{\text{bad}}, \pi) \le \mathrm{KSD}(q_{\text{good}}, \pi)$

adding a new mode would increase the KSD... but it would decrease the KL

hence the KSD does not penalize the bad approximation, and KSD-based boosting algorithms would prefer not to add missing modes


SMC provides flexible components with tractable densities


solution: use sequential Monte Carlo (SMC) components

[Gordon 93, Chopin 02, Del Moral 06]

setup: temper a reference towards the target

reference: a simple distribution (e.g., Gaussian)

target: the posterior

tempering: a path of intermediate distributions from the reference to the target (one standard choice shown below)

  • easy to sample from
  • sampling naturally provides density estimates
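The tempering formulas did not survive extraction; a standard choice consistent with the cited SMC literature is the geometric path between the reference $\pi_0$ and the target $\pi$:

```latex
\[
  \pi_\beta(x) \;\propto\; \pi_0(x)^{1-\beta}\, \pi(x)^{\beta},
  \qquad 0 = \beta_0 < \beta_1 < \cdots < \beta_T = 1,
\]
```

so $\beta = 0$ recovers the reference and $\beta = 1$ the target.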


Locally-adaptive BVI with SMC components

setup: a finite family of SMC components

step 1a: increase adaptivity

desideratum: take a large adaptivity step without compromising SMC sample quality

maximize the adaptivity step subject to an effective-sample-size (ESS) constraint, ESS ≥ threshold; the ESS measures sample quality degradation and can be calculated in SMC

  • can solve using a bisection-like algorithm
  • also automatically adapts the SMC discretization

step 1b: modify weight

1-d weight update: minimize the KL over the weight of component n, evenly down-weighting the other components

minimize a second-order approximation to the KL (closed-form solution)

  • use the approximation to choose the component
  • then optimize the weight via Newton's method
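A sketch of the ESS-constrained adaptivity step, following the standard adaptive-SMC recipe the bullets point to; the geometric-path incremental weights and all names here are my assumptions.

```python
import numpy as np

def ess(logw):
    """Effective sample size (sum w)^2 / sum w^2 from unnormalized log-weights."""
    w = np.exp(logw - logw.max())
    return w.sum() ** 2 / np.sum(w ** 2)

def next_beta(log_ref, log_target, beta, tau, tol=1e-6):
    """Largest tempering step beta -> beta' keeping ESS >= tau * n, by bisection.
    Assumes the geometric path pi_b ∝ pi_0^(1-b) pi^b, so the incremental
    log-weights at the current particles are (b' - b) * (log_target - log_ref)."""
    delta = log_target - log_ref          # evaluated at the current particles
    n = len(delta)
    if ess((1.0 - beta) * delta) >= tau * n:
        return 1.0                        # can jump straight to the target
    lo, hi = beta, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess((mid - beta) * delta) >= tau * n:
            lo = mid                      # constraint satisfied: step further
        else:
            hi = mid                      # constraint violated: step back
    return lo

# usage: with particle-wise log-densities in hand,
# beta_new = next_beta(log_ref_vals, log_target_vals, beta=0.3, tau=0.9)
```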


Synthetic experiment


mixture of two bananas and four Gaussians

methods compared: LBVI (ours); Universal BVI (minimizes squared Hellinger instead of KL); BBBVI (regularized BVI)

(BVI runs into degeneracy issues and is not shown)

LBVI consistently produces better approximations with fewer components


Conclusion

boosting VI enables Bayesian inference with difficult, multimodal posterior distributions...

...but many simple Gaussian components are needed to refine the approximation

this work: boosting VI with components that adapt locally to the target density, and an example instantiation with SMC components

bonus: the KSD fails to detect missing mass and so cannot be used in boosting (e.g., with MCMC components)



The score function provides insufficient information


intuition: score function has no info about the mixing weights of multimodal targets

problem: KSD fails to detect missing modes
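One way to make the intuition precise (notation mine): for a two-component mixture with well-separated components, the score near either mode is essentially independent of the mixing weight $w$.

```latex
\[
  \nabla_x \log\big( w\, p_1(x) + (1-w)\, p_2(x) \big)
  = \frac{w\, \nabla p_1(x) + (1-w)\, \nabla p_2(x)}{w\, p_1(x) + (1-w)\, p_2(x)}
  \;\approx\; \nabla_x \log p_1(x)
  \quad \text{wherever } p_2(x) \approx 0.
\]
```

Since the KSD sees the target only through its score, it is nearly blind to the weights of well-separated modes.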


SMC provides adaptive components with tractable densities

solution: use sequential Monte Carlo (SMC) components

[Gordon 93, Chopin 02, Del Moral 06]

setup: temper a reference towards the target (same reference/target/tempering path as before)

idea: move particles through the sequence of tempered distributions, each a local approximation, until the target is reached

step 1: reweight

step 2: resample

step 3: rejuvenate

  • local adaptivity
  • tractable density estimates
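A self-contained toy of the reweight/resample/rejuvenate loop with a fixed temperature ladder; the Gaussian reference, example target, and tuning constants are illustrative choices of mine, and the density-estimation step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_ref(x):                      # reference pi_0 = N(0, 3^2), up to a constant
    return -0.5 * (x / 3.0) ** 2

def log_target(x):                   # example target = N(4, 0.5^2), up to a constant
    return -0.5 * ((x - 4.0) / 0.5) ** 2

def smc(n=1000, betas=np.linspace(0.0, 1.0, 11), step=0.5, n_mh=5):
    x = 3.0 * rng.normal(size=n)     # sample the reference exactly
    for b0, b1 in zip(betas[:-1], betas[1:]):
        # step 1: reweight by the incremental tempered ratio (pi/pi_0)^(b1-b0)
        logw = (b1 - b0) * (log_target(x) - log_ref(x))
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # step 2: resample (multinomial)
        x = x[rng.choice(n, size=n, p=w)]
        # step 3: rejuvenate with random-walk Metropolis targeting pi_{b1}
        def logp(z, b=b1):
            return (1.0 - b) * log_ref(z) + b * log_target(z)
        for _ in range(n_mh):
            prop = x + step * rng.normal(size=n)
            accept = np.log(rng.uniform(size=n)) < logp(prop) - logp(x)
            x = np.where(accept, prop, x)
    return x

print(smc().mean())  # ≈ 4: the particles end up at the target
```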