Universal Boosting Variational Inference

Trevor Campbell Xinglong Li
UBC Department of Statistics


Bayesian Data Analysis


This work: VI with a computation/quality tradeoff

Data + Model = Belief (distribution)

Statisticians need reliable uncertainty quantification: tails, moments, modes, etc.

- sampling (MCMC): slow, hard to design correctly
- optimization (parametric VI): fast, easy to design; fundamentally limited, unreliable
- boosting (nonparametric VI): fast, easy to design

- simple black-box greedy algorithm

- quality guarantees via the Hellinger distance

- properties of Hellinger as a variational objective
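For concreteness (an illustrative sketch, not from the original deck): the squared Hellinger distance is d_H^2(p, q) = 1 - ∫ sqrt(p(x) q(x)) dx, which in 1-D can be estimated by simple quadrature. The snippet below checks the estimate against the closed-form value for two Gaussians; all function names here are hypothetical.

```python
import numpy as np

def gauss(mu, sig):
    # 1-D Gaussian density N(mu, sig^2)
    return lambda x: np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

def hellinger_sq(p, q, grid):
    # quadrature estimate of d_H^2 = 1 - integral of sqrt(p * q)
    dx = grid[1] - grid[0]
    return 1.0 - np.sum(np.sqrt(p(grid) * q(grid))) * dx

grid = np.linspace(-20.0, 20.0, 20001)
d2 = hellinger_sq(gauss(0.0, 1.0), gauss(2.0, 1.5), grid)

# closed form for N(m1, s1^2) vs N(m2, s2^2):
# d_H^2 = 1 - sqrt(2 s1 s2 / (s1^2 + s2^2)) * exp(-(m1 - m2)^2 / (4 (s1^2 + s2^2)))
s1, s2, dm = 1.0, 1.5, 2.0
d2_exact = 1.0 - np.sqrt(2 * s1 * s2 / (s1**2 + s2**2)) * np.exp(-dm**2 / (4 * (s1**2 + s2**2)))
```

Because d_H^2 is bounded in [0, 1] and defined for any pair of densities, this estimate is usable as a black-box variational objective.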

Motivating Problem: KL Divergence Boosting VI


Problem: entropy regularization (needed to prevent degeneracy) is hard to tune

- when it works: poor approximations
- when it doesn't: still have degeneracy...

Synthetic 2-component mixture example

Past work: greedily minimize KL

Theorem: There are settings where the target is in the mixture family and BBVI either

• immediately returns a degenerate distribution
• never improves the initialization
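A toy numerical sketch of this failure mode (illustrative only, not the paper's construction): the linearized greedy KL-boosting objective E_f[log p - log q] for a candidate component f keeps improving as f shrinks toward a point mass at the mode of log(p/q), so without entropy regularization the greedy step is driven to a degenerate component.

```python
import numpy as np

grid = np.linspace(-10.0, 15.0, 50001)
dx = grid[1] - grid[0]

def normal(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

# toy target with two modes; the current approximation q covers only the first
p = 0.5 * normal(grid, 0.0, 1.0) + 0.5 * normal(grid, 4.0, 0.5)
q = normal(grid, 0.0, 1.0)
log_ratio = np.log(p) - np.log(q)

def greedy_kl_obj(mu, sig):
    # linearized greedy KL-boosting objective E_f[log p - log q], f = N(mu, sig^2)
    return np.sum(normal(grid, mu, sig) * log_ratio) * dx

mu_star = grid[np.argmax(log_ratio)]  # mode of the log density ratio
vals = [greedy_kl_obj(mu_star, s) for s in (1.0, 0.5, 0.1, 0.01)]
# vals increases monotonically as sig -> 0: the greedy step prefers a point mass
```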

Solution: Boosting, but not KL


We develop boosting VI based on the Hellinger distance

• better behaved than KL: no regularization tuning required (black-box)
• error bounds for tail probabilities, moments, importance sampling
• simple theoretical analysis:

Theorem: After N greedy steps, UBVI's squared Hellinger error converges to that of the best possible approximation, for any mixture family and any target

(hence universal boosting VI, UBVI)
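A minimal sketch of the greedy idea, under simplifying assumptions (1-D grid search, quadrature objective; UBVI's actual algorithm works with square-root densities and closed-form weight updates, see the paper and repository): each step adds the Gaussian component and mixture weight that most reduce the estimated squared Hellinger distance.

```python
import numpy as np

grid = np.linspace(-12.0, 12.0, 4001)
dx = grid[1] - grid[0]

def normal(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

# bimodal toy target
p = 0.5 * normal(grid, -3.0, 1.0) + 0.5 * normal(grid, 3.0, 1.0)

def hellinger_sq(q):
    # quadrature estimate of d_H^2(p, q) = 1 - integral of sqrt(p * q)
    return 1.0 - np.sum(np.sqrt(p * q)) * dx

# greedy boosting: each step picks the new component (mu, sig) and weight w
# that most reduce the estimated squared Hellinger distance of the mixture
q, errs = np.zeros_like(grid), []
for step in range(3):
    weights = (1.0,) if step == 0 else (0.5, 0.25, 0.0)  # w = 0 allows "no change"
    best_err, best_q = np.inf, q
    for mu in np.linspace(-5.0, 5.0, 41):
        for sig in (0.5, 1.0, 2.0):
            for w in weights:
                cand = (1.0 - w) * q + w * normal(grid, mu, sig)
                err = hellinger_sq(cand)
                if err < best_err:
                    best_err, best_q = err, cand
    q = best_q
    errs.append(best_err)
# errs is non-increasing: each added component can only improve the approximation
```

No entropy regularization or tuning is needed: the boundedness of d_H^2 keeps the greedy step well behaved.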

Synthetic Tests (see paper for quantitative results, real data/models)

5

Banana (30 components)

BBVI regularization either forces high variance (when high) or fails entirely (when low); UBVI just works (no tuning)

Cauchy (30 components)

[Figure legend: Hellinger (UBVI), KL (BBVI, low reg), KL (BBVI, high reg), Target]

Universal Boosting VI (UBVI)

boosting variational inference via the Hellinger metric

black-box, no tuning, statistically sound


code will (soon) be at www.github.com/trevorcampbell/ubvi


Hellinger boosting


Empirical Results


Logistic regression, 3 datasets, varying numbers of components


Solution: Boosting, but not KL


[Figure: divergence comparison on a toy target: Target, KL(p || q), KL(q || p), Hellinger]

We develop boosting VI based on the Hellinger distance

• better behaved than KL in general, no regularization required
• better control of forward KL, TV, Wasserstein, & more
• simple boosting error analysis

Theorem: After N greedy steps, UBVI's squared Hellinger error converges to that of the best possible approximation, for any mixture family and any target

(hence universal boosting VI, UBVI)
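The TV-control claim reflects the standard sandwich d_H^2 <= d_TV <= sqrt(2) * d_H (with d_H^2 = 1 - ∫ sqrt(pq)); a quick numerical spot check on two Gaussians (illustrative only):

```python
import numpy as np

grid = np.linspace(-30.0, 30.0, 60001)
dx = grid[1] - grid[0]

def normal(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

p = normal(grid, 0.0, 1.0)
q = normal(grid, 1.0, 2.0)

d_h = np.sqrt(1.0 - np.sum(np.sqrt(p * q)) * dx)  # Hellinger distance
d_tv = 0.5 * np.sum(np.abs(p - q)) * dx           # total variation distance
# sandwich inequality: d_h**2 <= d_tv <= sqrt(2) * d_h
```

So a small Hellinger error directly bounds the error of event probabilities (TV), which is what makes the guarantee statistically meaningful.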

Empirical Results


Banana distribution, 30 components: when BBVI regularization doesn't fail, it forces high variance

Cauchy distribution, 30 components: BBVI struggles with heavy-tailed distributions (degeneracy)

[Plots: approximation error vs. number of components; lower is better]

UBVI, CAIDA 2019