Universal Boosting Variational Inference

Trevor Campbell, Xinglong Li
UBC Department of Statistics


Bayesian Data Analysis


- simple black-box greedy algorithm

- works with any mixture family, any target

- quality guarantees via the Hellinger distance

- properties of Hellinger as a variational objective

This work: VI with a computation/quality tradeoff

Data + Model = Belief (distribution)

Statisticians need reliable uncertainty quantification: tails, moments, modes, etc.

- sampling (MCMC): computation/quality tradeoff, but slow and hard to design correctly
- optimization (parametric VI): fast and easy to design, but fundamentally limited and unreliable
- boosting (nonparametric mixture VI): fast, easy to design, and keeps the computation/quality tradeoff

Background: VI


Variational Inference:

minimize the Kullback-Leibler (KL) divergence to the target p over a family of distributions Q (see the objective below)

Problem: for any parametric Q, the approximation is fundamentally limited

Solution: nonparametric mixture family

Problem: optimizing infinitely many parameters is intractable...
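For reference, a minimal statement of the standard VI objective under the notation assumed above (target p, variational family Q):

```latex
% Standard VI objective: find the member of the family Q closest to the target p in KL.
\[
  q^\star
  = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}(q \,\|\, p)
  = \arg\min_{q \in \mathcal{Q}} \mathbb{E}_{q}\!\left[ \log \frac{q(\theta)}{p(\theta)} \right]
\]
```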

Background: Boosting VI


Problem: direct optimization over a nonparametric mixture family is intractable

Solution: iteratively add one component & reweight (variational boosting)

[Miller 16, Guo 16, Wang 16, Locatello 18a,b]

Iterate:

Step 1) optimize component

Step 2) reweight
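A schematic of the two-step boosting iteration, sketched under assumed notation (q_n is the current mixture, H the component family, D the chosen divergence, γ a step size); the cited papers differ in the exact objectives used in each step:

```latex
% Step 1 (component): greedily pick the next component to add to the mixture.
\[
  h_{n+1} \in \arg\min_{h \in \mathcal{H}}
  D\!\left( (1-\gamma)\, q_n + \gamma\, h \;\middle\|\; p \right)
\]
% Step 2 (reweight): re-optimize the weights over all components chosen so far.
\[
  w \in \arg\min_{w \ge 0,\ \sum_i w_i = 1}
  D\!\left( \textstyle\sum_{i=1}^{n+1} w_i h_i \;\middle\|\; p \right),
  \qquad
  q_{n+1} = \textstyle\sum_{i=1}^{n+1} w_i h_i
\]
```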

Black-Box KL Boosting VI (BBVI)


Boosting with KL doesn't work well: it needs ad-hoc entropy regularization to prevent degeneracy (see the sketch below)

Problem:

- the entropy regularization is hard to tune
- even with it, the iterates still don't converge to p

[Miller 16, Guo 16, Wang 16, Locatello 18a,b]

Theorem: There are settings where the target p is in the mixture family and BBVI either

  • immediately returns a degenerate distribution, or
  • never improves on the initialization
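The sketch referenced above: one common shape of the regularized greedy step in KL boosting, written schematically (λ and the entropy term H(h) are assumed notation; the cited papers' exact objectives differ):

```latex
% Greedy KL component step with entropy regularization (schematic).
% With lambda = 0 the minimizer collapses toward a point mass where p/q_n is
% largest (a degenerate component); lambda is the hard-to-tune knob.
\[
  h_{n+1} \in \arg\min_{h \in \mathcal{H}}
  \mathbb{E}_{h}\!\left[ \log \frac{q_n(\theta)}{p(\theta)} \right] - \lambda\, \mathbb{H}(h),
  \qquad
  \mathbb{H}(h) = -\,\mathbb{E}_{h}\left[ \log h(\theta) \right]
\]
```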


Hellinger for VI: Moments, Probabilities, Estimation


Theorem [e.g. Pollard p.51]: Hellinger controls error in probabilities

Theorem: Hellinger controls forward & reverse KL (importance sampling error)

Theorem: Hellinger controls error in moments

this work: Hellinger distance
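As a concrete instance of the probability statement, the textbook definition and total-variation bound (standard constants, e.g. Pollard; the slide's exact statement may differ):

```latex
% Squared Hellinger distance and the standard total variation bound:
% small Hellinger distance forces every event probability to be close.
\[
  H^2(p, q)
  = \tfrac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx
  = 1 - \int \sqrt{p(x)\, q(x)}\, dx,
  \qquad
  \sup_{A} \left| P(A) - Q(A) \right| \le \sqrt{2}\, H(p, q)
\]
```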

Universal Boosting VI


Hellinger is a natural choice for boosting

Step 1: component selection

sqrt densities have norm 1 in L²

i.e. they are all on the unit Hilbert sphere!

greedily minimize Hellinger (see the identities below)

[Figure: the space of (sqrt) densities, drawn as the unit Hilbert sphere]
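The identities behind the sphere picture, under the assumed L² notation: square-root densities are unit vectors, and the squared Hellinger distance is affine in their inner product, so minimizing Hellinger is maximizing an inner product on the sphere.

```latex
% Square-root densities are unit vectors in L^2, and minimizing Hellinger over
% the sphere is the same as maximizing the inner product with sqrt(p).
\[
  \left\| \sqrt{q} \right\|_{L^2}^2 = \int q(x)\, dx = 1,
  \qquad
  H^2(p, q) = 1 - \left\langle \sqrt{p}, \sqrt{q} \right\rangle_{L^2}
            = 1 - \int \sqrt{p(x)\, q(x)}\, dx
\]
```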

Universal Boosting VI


Hellinger is a natural choice for boosting

Step 2: optimal reweighting

minimize Hellinger distance over weights
= minimize squared L2 distance
= nonnegative least squares (very easy)
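A minimal sketch of how the reweighting step can be posed as nonnegative least squares, assuming (Monte Carlo) estimates of the pairwise inner products Z[i, j] ≈ ⟨√h_i, √h_j⟩ between component sqrt-densities and b[i] ≈ ⟨√p, √h_i⟩ against the target; the function and setup here are illustrative, not the implementation in the ubvi repository.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import nnls


def reweight(Z, b):
    """Nonnegative weights minimizing the L2 distance between sqrt(p) and the
    weighted sum of component sqrt-densities, then rescaled to unit norm.

    Z : (N, N) array, Z[i, j] ~ <sqrt(h_i), sqrt(h_j)>, assumed positive definite
    b : (N,)   array, b[i]    ~ <sqrt(p),   sqrt(h_i)>, assumed not all zero
    """
    # ||sqrt(p) - sum_i w_i sqrt(h_i)||^2 = 1 - 2 b^T w + w^T Z w, so with the
    # Cholesky factor Z = L L^T and c = L^{-1} b this is ||L^T w - c||^2 + const,
    # which is exactly the form scipy's NNLS solver accepts.
    L = cholesky(Z, lower=True)
    c = solve_triangular(L, b, lower=True)
    w, _ = nnls(L.T, c)
    # Rescale so the mixture sqrt-density has unit L2 norm (back onto the sphere).
    return w / np.sqrt(w @ Z @ w)


# Illustrative usage with two components:
Z = np.array([[1.0, 0.3], [0.3, 1.0]])
b = np.array([0.9, 0.7])
print(reweight(Z, b))
```

Projecting onto the cone of nonnegative combinations and then rescaling gives the direction that maximizes the inner product with √p, which under this setup is the Hellinger-optimal choice of weights.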

Convergence Guarantee


Theorem: After N greedy steps, the squared Hellinger error of the UBVI approximation converges (in N) to that of the best possible approximation,


for any mixture family and any target

(hence universal boosting VI, UBVI)
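Schematically, with Q denoting the set of all mixtures of the component family (the explicit rate is stated in the paper and not reproduced here):

```latex
% UBVI's squared Hellinger error approaches that of the best mixture
% approximation as the number of greedy steps N grows.
\[
  H^2(p, q_N) \;\xrightarrow[\,N \to \infty\,]{}\; \inf_{q \in \mathcal{Q}} H^2(p, q)
\]
```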


MCMC-like computation/quality tradeoff!

Synthetic Tests


Banana (30 components)

BBVI regularization either forces high variance (when set high) or fails entirely (when set low)

UBVI just works (no tuning)

[Figure legend: Hellinger (UBVI), KL (BBVI, low reg), KL (BBVI, high reg), Target]

Cauchy (30 components)

Universal Boosting VI (UBVI)

boosting variational inference via the Hellinger metric

black-box, no tuning, statistically sound


code will (soon) be at www.github.com/trevorcampbell/ubvi

Empirical Results


[Figure: logistic regression, 3 datasets, varying number of components]

Empirical Results


[Figure, two panels; lower is better in both]

Banana distribution, 30 components: when BBVI regularization doesn't fail, it forces high variance

Cauchy distribution, 30 components: BBVI struggles with heavy-tailed distributions (degeneracy)

________ Boosting VI?


practitioners need:

  • control of error in posterior moments, probabilities, importance sampling
  • easy to estimate and optimize
  • black-box, no tuning

Total variation: controls probability error but hard to estimate / optimize

Wasserstein: controls moment error but hard to estimate / optimize

Stein: controls RKHS error, can estimate / optimize but requires choosing a kernel

Kullback-Leibler: not “smooth” enough, degenerate optimization

this work: Hellinger distance
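For reference, textbook definitions of three of the candidates above (standard forms; Γ(p, q) denotes the set of couplings, and the KL orientation shown is the one typically optimized in VI):

```latex
% Total variation, 1-Wasserstein, and Kullback-Leibler (standard definitions).
\[
  \mathrm{TV}(p, q) = \sup_{A} \left| P(A) - Q(A) \right|
                    = \tfrac{1}{2} \int \left| p(x) - q(x) \right| dx
\]
\[
  W_1(p, q) = \inf_{\gamma \in \Gamma(p, q)} \mathbb{E}_{(X, Y) \sim \gamma} \left\| X - Y \right\|,
  \qquad
  \mathrm{KL}(q \,\|\, p) = \mathbb{E}_{q}\!\left[ \log \frac{q(\theta)}{p(\theta)} \right]
\]
```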

Hellinger for VI: Moments, Probabilities, Estimation


Theorem: Hellinger can be estimated well with self-normalizing importance sampling

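A sketch of one such self-normalized estimator under assumed notation (p̃ is the unnormalized target with unknown normalizer Z, and x_1, …, x_n are draws from q); this illustrates the idea rather than reproducing the paper's exact estimator:

```latex
% The Hellinger affinity 1 - H^2(p, q) = \int \sqrt{p q} involves p only through
% sqrt(p~ / Z), and Z = E_q[p~(x)/q(x)], so both pieces can be estimated from the
% same draws x_i ~ q (a self-normalized estimator).
\[
  1 - H^2(p, q)
  = \frac{\mathbb{E}_q\!\left[ \sqrt{\tilde{p}(x)/q(x)} \right]}
         {\sqrt{\mathbb{E}_q\!\left[ \tilde{p}(x)/q(x) \right]}}
  \approx
  \frac{\tfrac{1}{n}\sum_{i=1}^n \sqrt{\tilde{p}(x_i)/q(x_i)}}
       {\sqrt{\tfrac{1}{n}\sum_{i=1}^n \tilde{p}(x_i)/q(x_i)}},
  \qquad x_1, \dots, x_n \sim q
\]
```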

Hellinger for VI: Importance Sampling


[Figure: fitting a correlated Gaussian target p with an isotropic Gaussian under KL(p || q), KL(q || p), and Hellinger]

