The Computational Asymptotics of Gaussian Variational Inference
Zuheng (David) Xu
Trevor Campbell
Variational Inference (VI), pre-2013
Target: an (unnormalized) posterior density
Approximation: a mean-field exponential family
Usually can't compute the expectations in the KL objective...
But with a mean-field exponential family:
expectations in closed-form!
fast optimization!
(sometimes…)
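As background on why exponential families give closed-form expectations (a standard identity, not specific to this talk): for natural parameter η, sufficient statistic t, and log-partition A,

```latex
% Standard exponential-family identity (background; not from the talk):
% the expectations needed for mean-field updates come from derivatives of the log-partition.
\[
  q_\eta(\theta) = h(\theta)\,\exp\!\big\{\eta^\top t(\theta) - A(\eta)\big\}
  \quad\Longrightarrow\quad
  \mathbb{E}_{q_\eta}\!\big[t(\theta)\big] = \nabla_\eta A(\eta) .
\]
```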
Variational Inference (VI), post-2013
Two big innovations opened the floodgates:
Problem: exact KL gradients involve intractable expectations.
Solution: use unbiased (Monte Carlo) gradient estimates.
Problem: those estimates involve computing gradients of the log densities.
Solution: just code the log densities; autodiff software tracks gradients for you via the chain rule (see the sketch after this slide).
Enables more complex families (less human input/effort/errors)
Makes VI “black-box” (applicable to a wide variety of families, models)
Many families based on:
diffeomorphisms (Dinh et al 16)
Markov chains (Salimans et al 14)
nonparametric mixtures (Guo et al 16, Miller et al 16)
hierarchies (Hoffman et al 14)
amortization (Kingma et al 13)
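A minimal sketch of both innovations together in JAX; the toy log_target, the diagonal Gaussian family, and the step sizes below are illustrative stand-ins, not choices from the talk:

```python
# Minimal sketch: unbiased reparameterization gradients + automatic differentiation.
import jax
import jax.numpy as jnp

def log_target(theta):
    # toy unnormalized log posterior (hypothetical stand-in for the user's model)
    return -0.5 * jnp.sum((theta - 1.0) ** 2)

def neg_elbo_estimate(params, key, num_samples=16):
    # q = N(mu, diag(sigma^2)) with sigma = exp(log_sigma); reparameterize theta = mu + sigma * z
    mu, log_sigma = params
    z = jax.random.normal(key, (num_samples, mu.shape[0]))
    theta = mu + jnp.exp(log_sigma) * z
    # Monte Carlo estimate of E_q[-log pi(theta)] minus the Gaussian entropy (up to a constant)
    return -jnp.mean(jax.vmap(log_target)(theta)) - jnp.sum(log_sigma)

# autodiff turns the Monte Carlo objective into an unbiased stochastic KL gradient
grad_fn = jax.jit(jax.grad(neg_elbo_estimate))

params = (jnp.zeros(2), jnp.zeros(2))
key = jax.random.PRNGKey(0)
for step in range(200):
    key, subkey = jax.random.split(key)
    g_mu, g_ls = grad_fn(params, subkey)
    params = (params[0] - 0.05 * g_mu, params[1] - 0.05 * g_ls)
```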
Variational Inference (VI), today
Given a model and a variational family, black-box software now handles the rest.
This work: VI with a multivariate Gaussian family.
A key challenge in statistical applications is reliability.
Setup: Gaussian variational inference
model: n i.i.d. data, joint (unnormalized) density π_n(θ)
variational family: multivariate Gaussians N(μ, LLᵀ), with L a lower-triangular Cholesky factor
problem: optimize the mean μ and the covariance Cholesky factor L
objective: KL( N(μ, LLᵀ) || π_n ) = E[ −log π_n(μ + LZ) ] − log|det L| + const, with Z ~ N(0, I)  (see the sketch below)
first term: prefers higher target log density; generally nonconvex
second term: regularization that stops variance collapse; convex
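A minimal Monte Carlo sketch of this objective in JAX; the helper name, sample count, and the omission of the paper's exact scaling with n are all choices of this sketch:

```python
# Monte Carlo estimate of the Gaussian VI objective with a full Cholesky factor.
# Illustrative only; the paper's exact scaling with n is omitted here.
import jax
import jax.numpy as jnp

def gaussian_vi_objective(mu, L, key, log_target, num_samples=32):
    d = mu.shape[0]
    L = jnp.tril(L)                          # keep L lower triangular
    z = jax.random.normal(key, (num_samples, d))
    theta = mu + z @ L.T                     # theta ~ N(mu, L L^T)
    # data-fit term: prefers higher target log density (generally nonconvex)
    fit = -jnp.mean(jax.vmap(log_target)(theta))
    # regularization: -log|det L| stops variance collapse (convex in L)
    reg = -jnp.sum(jnp.log(jnp.abs(jnp.diag(L))))
    return fit + reg                         # = KL(N(mu, LL^T) || pi_n) up to a constant

# gradients with respect to both mu and L via autodiff:
grad_mu_L = jax.grad(gaussian_vi_objective, argnums=(0, 1))
# example usage (with the toy log_target from the previous sketch):
# g_mu, g_L = grad_mu_L(mu, L, jax.random.PRNGKey(1), log_target)
```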
Variational Bernstein von Mises
Asymptotics (Wang & Blei, 2019): under Bernstein von Mises conditions, the total variation error of the optimal variational approximation converges to 0.
For Gaussian families: the optimal approximation concentrates at the true likelihood parameter θ₀, with covariance given by the scaled inverse Fisher information I(θ₀)⁻¹/n.
Corollary: given this, the optimal variational parameters (mean and covariance Cholesky factor) converge too.
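For reference, a schematic statement of the classical Bernstein von Mises limit this builds on (standard background; the exact conditions are in Wang & Blei 2019 and the paper), with θ̂_n an efficient estimator such as the MLE:

```latex
% Schematic Bernstein von Mises statement (background; precise conditions in the references):
% the posterior is asymptotically Gaussian, centred near theta_0, covariance n^{-1} I(theta_0)^{-1}.
\[
  \big\| \Pi_n(\cdot \mid X_{1:n}) \;-\; \mathcal{N}\!\big(\hat{\theta}_n,\; n^{-1} I(\theta_0)^{-1}\big) \big\|_{\mathrm{TV}}
  \;\xrightarrow{\;P\;}\; 0,
  \qquad \hat{\theta}_n \;\xrightarrow{\;P\;}\; \theta_0 .
\]
```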
Nonasymptotic convexity
Theorem (informal): if the target log density has a Lipschitz Hessian and is locally strongly concave around a point, the mean μ is near that point, and the scale ‖L‖ is sufficiently “small”, then the KL objective is locally strongly convex and smooth there.
When does this hold?
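Before answering that, a purely numerical illustration of the claim above (an illustration only, not the paper's theorem or constants): for a toy 1D bimodal target, the Hessian of a Monte Carlo KL objective in (mean, log-scale) is positive definite near a mode with a small scale, but has a negative eigenvalue between the modes.

```python
# Numerical illustration only: inspect local convexity of the 1D Gaussian-VI objective
# for a toy bimodal target via the Hessian in (mu, log-scale).
import jax
import jax.numpy as jnp

def log_target_1d(theta):
    # toy unnormalized log density: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)
    return jax.scipy.special.logsumexp(-0.5 * ((theta - jnp.array([-2.0, 2.0])) / 0.5) ** 2)

z = jax.random.normal(jax.random.PRNGKey(0), (4096,))   # fixed draws: smooth MC surface

def kl_objective(params):
    mu, log_s = params[0], params[1]
    theta = mu + jnp.exp(log_s) * z
    # E[-log pi(mu + s Z)] minus the Gaussian entropy term (up to a constant)
    return -jnp.mean(jax.vmap(log_target_1d)(theta)) - log_s

hess = jax.hessian(kl_objective)

# near a mode with a small scale: both eigenvalues positive (locally convex)
print(jnp.linalg.eigvalsh(hess(jnp.array([2.0, jnp.log(0.1)]))))
# between the modes with the same small scale: a negative eigenvalue (not convex)
print(jnp.linalg.eigvalsh(hess(jnp.array([0.0, jnp.log(0.1)]))))
```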
Asymptotic convexity
Theorem (informal): if Bernstein von Mises-type conditions hold, then there exists an N such that for all n ≥ N, the KL objective is locally strongly convex in a neighbourhood of its optimum.
“Just” need to:
initialize in that locally convex region, and
keep the iterates confined to it;
then standard theoretical tools guarantee convergence.
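One example of such a standard tool (textbook stochastic optimization, not a result of this paper): projected SGD on a μ-strongly convex objective over a convex set containing the optimum x*, with unbiased gradients of bounded second moment and step sizes on the order of 1/(μt), satisfies

```latex
% Textbook projected-SGD rate under strong convexity (background, not from the paper):
\[
  \mathbb{E}\,\big\| x_t - x^{\ast} \big\|^2 \;=\; O(1/t).
\]
```

This is the sense in which local strong convexity plus confinement to the convex region yields a convergence guarantee.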
Initialization via smoothed MAP
[Figure: target density, smoothed density, and log-smoothed-density]
Theorem (informal): under conditions on the target and the smoothing level, the smoothed MAP optimization problem is well behaved (the precise conditions are in the paper).
Furthermore, the smoothed optimum converges to the true parameter, with a rate given in the paper.
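A minimal Monte Carlo sketch of the smoothed MAP idea in JAX: estimate the log of the Gaussian-smoothed density by a log-mean-exp over samples and ascend it with autodiff. This is an illustration only, not the paper's estimator, smoothing schedule, or step sizes:

```python
# Smoothed MAP sketch: maximize log E_Z[ pi_n(mu + sigma * Z) ] via a simple
# log-mean-exp Monte Carlo estimate (biased but illustrative).
import jax
import jax.numpy as jnp

def smoothed_log_density(mu, key, log_target, sigma=1.0, num_samples=256):
    z = jax.random.normal(key, (num_samples,) + mu.shape)
    logp = jax.vmap(log_target)(mu + sigma * z)
    return jax.scipy.special.logsumexp(logp) - jnp.log(num_samples)

def smoothed_map(key, log_target, dim, sigma=1.0, steps=500, lr=0.1):
    mu = jnp.zeros(dim)                       # hypothetical starting point at the origin
    grad_fn = jax.grad(smoothed_log_density)  # gradient with respect to mu
    for _ in range(steps):
        key, subkey = jax.random.split(key)
        mu = mu + lr * grad_fn(mu, subkey, log_target, sigma)  # gradient ascent
    return mu
```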
Consistent VI
1. Find the smoothed MAP estimate (previous slide).
2. Initialize Gaussian VI there (mean set to the smoothed MAP estimate).
3. Run stochastic gradient descent (with careful gradient scaling; see the paper) to minimize the KL divergence.
To get consistency, need to show local confinement of the iterates to that mode.
Theorem (informal): as n → ∞, the sequence of iterates produced by consistent VI converges to the optimal variational parameters (see the paper for the precise statement).
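Putting the steps together as a sketch, reusing the hypothetical gaussian_vi_objective and smoothed_map helpers from the earlier sketches; the paper's careful gradient scaling and projection details are deliberately omitted here:

```python
# Consistent-VI-style pipeline sketch (illustrative only; omits the paper's gradient
# scaling and projection steps). Assumes gaussian_vi_objective and smoothed_map from
# the earlier sketches are in scope.
import jax
import jax.numpy as jnp

def consistent_vi_sketch(log_target, dim, key, steps=1000, lr=0.01):
    # 1. find the dominant mode via the smoothed MAP problem
    key, k1 = jax.random.split(key)
    mu = smoothed_map(k1, log_target, dim)
    # 2. initialize Gaussian VI there with a modest scale
    L = 0.1 * jnp.eye(dim)
    # 3. stochastic gradient descent on the KL objective
    grad_fn = jax.grad(gaussian_vi_objective, argnums=(0, 1))
    for _ in range(steps):
        key, subkey = jax.random.split(key)
        g_mu, g_L = grad_fn(mu, L, subkey, log_target)
        mu, L = mu - lr * g_mu, jnp.tril(L - lr * g_L)
    return mu, L
```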
Results
Key takeaway: careful gradient scaling + smoothed MAP initialization leads to more reliable variational inference in practice.
[Figure: synthetic 1D mixture. Comparison of SVI with random L init (standard method), SVI + smoothed MAP, and the proposed method; higher is better; 100 trials / 20 trials.]
(see the paper for more results!)
Conclusion
Recent asymptotic theory: error of the optimal variational approximation → 0 with more data.
...but can we find it in practice?
This work: smoothed MAP initialization + careful gradient scaling → consistent Gaussian VI.
Tonnes of work left...
[Figure: effect of smoothing]
Paper: https://arxiv.org/abs/2104.05886
Zuheng (David) Xu, Trevor Campbell (https://trevorcampbell.me/)