1 of 13

The Computational Asymptotics of Gaussian Variational Inference

Zuheng (David) Xu

Trevor Campbell

2 of 13

Variational Inference (VI), pre-2013


Target (unnormalized) posterior density

  1. Pick a parametrized family
  2. Find the member of that family closest to the posterior (the optimization problem is sketched below)

Usually can’t compute the required expectations...

...except with the Mean Field Exponential Family:

  • expectations in closed-form!
  • fast optimization! (sometimes…)

But:

  • underestimates variance
  • limited applicability
  • hard to implement
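For concreteness, a sketch of the optimization problem described above in standard notation (the symbols q, 𝒬, π, π̄, and θ are mine: π is the posterior and π̄ its unnormalized density, whose normalizing constant does not affect the minimizer):

q^\star \;=\; \operatorname*{arg\,min}_{q \in \mathcal{Q}} \; \mathrm{KL}\!\left(q \,\middle\|\, \pi\right)
\;=\; \operatorname*{arg\,min}_{q \in \mathcal{Q}} \; \mathbb{E}_{q}\!\left[\log q(\theta)\right] \;-\; \mathbb{E}_{q}\!\left[\log \bar{\pi}(\theta)\right].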

3 of 13

Variational Inference (VI), post-2013

Two big innovations opened the floodgates:

  1. Stochastic optimization (Hoffman et al 13, Ranganath et al 13, others)
     • Problem: exact KL gradients involve intractable expectations
     • Solution: use unbiased gradient estimates
  2. Automatic differentiation (Kucukelbir et al 16, others)
     • Problem: those estimates require gradients of the log densities
     • Solution: just code log densities; software tracks gradients for you via chain rule

Together, these make VI “black-box” (applicable to a wide variety of families and models) and enable more complex families (less human input/effort/errors). A minimal sketch of the resulting recipe follows.
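To make that recipe concrete, here is a minimal sketch of reparameterized, autodiff-driven Gaussian VI, written in JAX as an assumed choice of autodiff library. It is illustrative only: the target log_joint, the step size, the parameterization, and all names are my assumptions rather than the slides' notation. The point is that only log densities are coded by hand; single-sample unbiased gradient estimates and the chain rule do the rest.

# Minimal sketch of "black-box" Gaussian VI: reparameterized samples give
# unbiased KL gradient estimates, and autodiff (JAX here) supplies gradients.
import jax
import jax.numpy as jnp

def log_joint(theta):
    # hypothetical unnormalized target log density (stands in for a real model)
    return -0.5 * jnp.sum((theta - 3.0) ** 2)

def neg_elbo_estimate(params, eps):
    # q = N(mu, L L^T); reparameterization: theta = mu + L @ eps, eps ~ N(0, I)
    mu, log_diag, lower = params
    L = jnp.diag(jnp.exp(log_diag)) + jnp.tril(lower, k=-1)
    theta = mu + L @ eps
    entropy = jnp.sum(log_diag)  # Gaussian entropy, up to an additive constant
    # single-sample unbiased estimate of KL(q || posterior), up to a constant
    return -(log_joint(theta) + entropy)

d = 2
params = (jnp.zeros(d), jnp.zeros(d), jnp.zeros((d, d)))
grad_fn = jax.jit(jax.grad(neg_elbo_estimate))  # autodiff: no hand-derived gradients
key, step = jax.random.PRNGKey(0), 0.05
for _ in range(2000):
    key, sub = jax.random.split(key)
    eps = jax.random.normal(sub, (d,))
    grads = grad_fn(params, eps)
    params = tuple(p - step * g for p, g in zip(params, grads))

print(params[0])  # the variational mean should approach the target mode (≈ 3)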

Many families are based on:

  • diffeomorphisms (Dinh et al 16)
  • Markov chains (Salimans et al 14)
  • nonparametric mixtures (Guo et al 16, Miller et al 16)
  • hierarchies (Hoffman et al 14)
  • amortization (Kingma et al 13)

4 of 13

Variational Inference (VI), today

Given a model/family, black-box VI now produces an approximate posterior automatically.

A key challenge in statistical applications is reliability:

  1. How good is the optimal approximation?
     • asymptotics (Wang et al 19, Yang et al 20, Alquier et al 20, etc)
     • nonparametric families (Guo et al 16, Miller et al 16, etc)
  2. Can it be found tractably?
     • nothing general yet

This work: VI with a multivariate Gaussian family

  • (non)asymptotic conditions for (local) strong convexity, smoothness
  • Consistent VI: a new method that asymptotically finds the optimal approximation

5 of 13

Setup: Gaussian variational inference


model: n i.i.d. data, joint density
variational family: multivariate Gaussians, parametrized by a mean and a covariance Cholesky factor
problem: optimize the mean and covariance Cholesky factor to minimize the KL divergence to the posterior

The KL objective splits into two pieces (sketched below):

  • an “objective” term: prefers higher target log density; generally nonconvex
  • a “regularization” term: stops variance collapse; convex
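In symbols (my notation, not the slides'): writing the variational approximation as q_{μ,L} = N(μ, LLᵀ) and the unnormalized joint density as π̄_n, the KL objective decomposes as

\mathrm{KL}\!\left(q_{\mu,L} \,\middle\|\, \pi_n\right)
\;=\;
\underbrace{-\,\mathbb{E}_{q_{\mu,L}}\!\left[\log \bar{\pi}_n(\theta)\right]}_{\text{prefers high target log density (nonconvex)}}
\;\;\underbrace{-\,\sum_{i=1}^{d} \log L_{ii}}_{\text{stops variance collapse (convex)}}
\;+\; \text{const},

where the constant collects the posterior normalizing constant and the dimension-dependent part of the Gaussian entropy, neither of which depends on (μ, L).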

6 of 13

Variational Bernstein von Mises

  1. How good is the optimal approximation?

Asymptotics (Wang & Blei, 2019): under Bernstein von Mises conditions, the total variation error of the optimal variational approximation converges to 0.

For Gaussians, a corollary: the optimal variational parameters converge too, i.e., the optimal mean converges to the true likelihood parameter, and the (rescaled) optimal covariance converges to the inverse Fisher information.
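A notational sketch of the asymptotic statement above (my symbols: q*_n is the optimal variational approximation after n observations and π_n the exact posterior; the precise regularity conditions and mode of convergence are those of Wang & Blei, 2019):

D_{\mathrm{TV}}\!\left(q^\star_n,\; \pi_n\right) \;\longrightarrow\; 0 \qquad \text{as } n \to \infty.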

7 of 13

Nonasymptotic convexity

  2. Can it be found tractably?

Theorem: If the negative target log density

  1. is strongly convex in a ball around its minimizer (the posterior mode), and
  2. is globally Lipschitz smooth,

then the KL objective is

  • globally Lipschitz smooth
  • locally strongly convex within a region where the mean is near that mode and the covariance Cholesky factor is “small”
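For reference, the two properties invoked above, in their standard forms for a twice-differentiable function f (textbook definitions, stated in my notation): λ-strong convexity and M-Lipschitz smoothness on a region correspond to two-sided bounds on the Hessian,

\lambda I \;\preceq\; \nabla^2 f(x) \quad \text{(strong convexity)},
\qquad\qquad
\nabla^2 f(x) \;\preceq\; M I \quad \text{(Lipschitz smoothness)},

for all x in the region in question: the first lower-bounds the curvature so gradient methods make steady progress, and the second upper-bounds it so fixed step sizes are safe.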

When does this hold?

8 of 13

Asymptotic convexity

  2. Can it be found tractably?

Theorem: If

  1. the Bernstein von Mises conditions hold,

then there exists an n beyond which the KL objective satisfies the conclusions above: it is globally Lipschitz smooth and locally strongly convex near the optimum.

“Just” need to:

  1. Initialize close to the optimum, and
  2. Stay in that local region during optimization;

then standard theoretical tools guarantee convergence (one such tool is sketched below).
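One standard tool of this kind, stated in my notation as a textbook stochastic-optimization fact: if the objective is λ-strongly convex on a region the iterates remain confined to, and the stochastic gradients are unbiased with uniformly bounded second moments, then projected SGD with step sizes γ_k = 1/(λk) satisfies

\mathbb{E}\!\left[\lVert x_k - x^\star \rVert^2\right] \;=\; O(1/k),

so initializing inside the locally strongly convex region and staying there is exactly what is needed to invoke a guarantee of this type.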

9 of 13

Initialization via smoothed MAP


[Figure: the target density, the smoothed density, and the log-smoothed-density]

Theorem: under conditions given in the paper, the smoothed MAP objective is eventually convex.

Furthermore, the smoothed optimum converges to the true parameter, at a rate given in the paper.
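A common construction, assumed here for illustration (my notation throughout): convolve the unnormalized posterior density π̄_n with a Gaussian kernel of width α, giving

\bar{\pi}_{n,\alpha}(\theta) \;=\; \mathbb{E}_{Z \sim \mathcal{N}(0, I)}\!\left[\bar{\pi}_n(\theta + \alpha Z)\right],
\qquad
\hat{\theta}_{n,\alpha} \;=\; \operatorname*{arg\,max}_{\theta} \; \log \bar{\pi}_{n,\alpha}(\theta),

so that small, spiky modes are averaged away, the log-smoothed-density flattens, and the smoothed MAP estimate θ̂_{n,α} serves as the initialization point used on the next slide.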

10 of 13

Consistent VI


  1. Solve the (eventually convex) smoothed maximum a posteriori problem to find the optimal mode.
  2. Initialize Gaussian VI there (with the initial scaling given in the paper).
  3. Run stochastic gradient descent (with careful gradient scaling; see the paper) to minimize the KL divergence.

To get consistency, need to show local confinement of the iterates to that mode.

Theorem: as n → ∞, the sequence of iterates produced by Consistent VI converges to the optimal variational approximation.

(A toy end-to-end sketch of steps 1-3 follows.)
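A toy, self-contained sketch of the three steps above. Everything here is an illustrative assumption rather than the paper's algorithm: the 1-D bimodal target, the Gaussian-convolution smoothing, the step sizes, and the plain (unscaled) SGD update, which omits the careful gradient scaling the slide refers to.

# Step 1: smoothed MAP; Step 2: initialize Gaussian VI there; Step 3: SGD on the KL.
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def log_post(theta):
    # hypothetical unnormalized 1-D target: wide mode at +2, narrow spike at -2
    return jnp.log(0.7 * jnp.exp(-0.5 * (theta - 2.0) ** 2)
                   + 0.3 * jnp.exp(-12.5 * (theta + 2.0) ** 2))

def log_smoothed(theta, key, alpha=1.0, num_samples=128):
    # Monte Carlo estimate of the Gaussian-smoothed log density (assumed smoothing)
    z = alpha * jax.random.normal(key, (num_samples,))
    return logsumexp(jax.vmap(log_post)(theta + z)) - jnp.log(num_samples)

key = jax.random.PRNGKey(0)

# Step 1: ascend the smoothed log density to find the dominant mode
mu = jnp.array(0.0)
smap_grad = jax.jit(jax.grad(log_smoothed))
for _ in range(300):
    key, sub = jax.random.split(key)
    mu = mu + 0.1 * smap_grad(mu, sub)

# Step 2: initialize Gaussian VI at the smoothed MAP estimate with a small scale
params = (mu, jnp.log(jnp.array(0.1)))          # (mean, log standard deviation)

# Step 3: SGD on single-sample unbiased estimates of the KL divergence
def neg_elbo(params, eps):
    m, log_s = params
    theta = m + jnp.exp(log_s) * eps             # reparameterized sample
    return -(log_post(theta) + log_s)            # entropy = log_s + constant

kl_grad = jax.jit(jax.grad(neg_elbo))
for _ in range(2000):
    key, sub = jax.random.split(key)
    g = kl_grad(params, jax.random.normal(sub))
    params = tuple(p - 0.02 * gi for p, gi in zip(params, g))

print(params)  # the mean should sit near the dominant mode at +2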

11 of 13

Results


Key Takeaway: careful gradient scaling + smoothed MAP initialization leads to more reliable variational inference in practice.

[Figure: experimental comparison; higher is better. Labels from the original figure: random L init (standard method), SVI + smoothed MAP, proposed, synthetic 1D mixture, 100 trials, 20 trials.]
(see the paper for more results!)

12 of 13

Conclusion


Recent asymptotic theory:

  • Error of optimal variational approximation → 0 with more data
  • ...but can we find it in practice?

This work:

  • (non)asymptotic convexity properties of Gaussian variational inference
  • Consistent VI: asymptotically finds the optimal variational approximation

Tonnes of work left...

  • use smoothed posterior directly in VI?
  • non-Gaussian families + smoothed MAP initialization?
  • finding robust optima / nonasymptotic smoothing?

Zuheng (David) Xu
https://trevorcampbell.me/
https://arxiv.org/abs/2104.05886

13 of 13

Effect of smoothing
