Automated, Scalable Bayesian Inference via Coresets
Trevor Campbell Tamara Broderick
Large Scale Inference
The challenges of science have shifted from data generation to data analysis: new technologies enable generating and recording massive amounts of data in fields such as physics, biology, computer security, and finance
Desiderata: good point estimates/uncertainties, scalable, easy to use
Challenges: existing methods often slow, unreliable, difficult/tedious to implement
This talk: automated, scalable inference with guarantees via data summarization
Bayesian Inference
Why be Bayesian?
Given: data $x_{1:N}$ and a model, i.e. a likelihood $p(x_{1:N} \mid \theta)$ with prior $\pi_0(\theta)$
Infer: posterior $\pi(\theta) \propto \pi_0(\theta)\, p(x_{1:N} \mid \theta)$
but we typically don't have access to the exact posterior (the normalization is intractable)
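For reference (standard Bayes' rule, assuming conditionally independent observations as in the rest of the talk), the posterior and its intractable normalizer are:

```latex
\pi(\theta \mid x_{1:N})
  = \frac{\pi_0(\theta)\,\prod_{n=1}^N p(x_n \mid \theta)}
         {\int \pi_0(\theta')\,\prod_{n=1}^N p(x_n \mid \theta')\,\mathrm{d}\theta'}
\qquad \text{(the integral in the denominator is what makes exact inference intractable).}
```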
Inference Algorithm Desiderata
automated: applies to a wide class of models with no tuning
scalable: complexity O(# data) with a small constant
theoretically sound: bounded discrepancy between the true posterior and the output
Past Work (nonexhaustive)
automation, scalability, guarantees: pick 2
[diagram: existing methods arranged by which of automation, scalability, and guarantees they achieve, spanning VB and MCMC approaches]
e.g. Black-Box VB (Ranganath et al 2014), Auto-Diff VB (Kucukelbir et al 2016), Consensus MC (Scott et al 2016), SDA-Bayes (Broderick et al 2013)
e.g. Hamiltonian MC (Neal 2011), NUTS (Hoffman & Gelman 2014), Stan (Carpenter et al 2017), Approximate Diffusions (Huggins & Zou 2017)
This Work
run MCMC on a small, representative, weighted subset of the data (a coreset)
[diagram: full dataset, MCMC is slow; coreset, MCMC is fast]
Bayesian Coresets (Huggins, Campbell, Broderick 2016; Campbell & Broderick 2017, 2018)
Outline
Bayesian Coresets
Coresets via Subsampling
Coresets via Optimization
Experimental Results
The “core” of a dataset
redundancies exist in large datasets (e.g. K-clustering)
coresets: summarize the data with a weighted subset, then run standard MCMC
similar ideas: coresets in optimization (Feldman & Langberg 2011), data squashing (DuMouchel et al 1999), GP inducing points (Snelson & Ghahramani 2005)
Bayesian Coresets (Huggins, Campbell, Broderick ‘16)
A coreset is a weighted subset of the data, i.e. a sparse weighted sum of log-likelihoods
Goal: the posterior distribution $\pi(\theta) \propto \exp\!\left(\sum_{n=1}^N \mathcal{L}_n(\theta)\right) \pi_0(\theta)$, where $\mathcal{L}_n(\theta) = \log p(x_n \mid \theta)$
Coreset posterior approximation: $\pi_w(\theta) \propto \exp\!\left(\sum_{n=1}^N w_n \mathcal{L}_n(\theta)\right) \pi_0(\theta)$, with $w \geq 0$ and only a few nonzero entries
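As an illustration only (not code from the talk), a minimal sketch of evaluating the coreset log-posterior above, assuming user-supplied log_likelihood(x, theta) and log_prior(theta) functions:

```python
import numpy as np

def coreset_log_posterior(theta, data, weights, log_likelihood, log_prior):
    """Unnormalized log-density of the coreset posterior pi_w.

    Only data points with nonzero weight contribute, so an MCMC step
    targeting this density costs O(coreset size) rather than O(N).
    """
    idx = np.flatnonzero(weights)
    return log_prior(theta) + np.sum(
        weights[idx] * np.array([log_likelihood(data[i], theta) for i in idx])
    )
```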
Not just any coreset...
we need weights for which the coreset posterior closely approximates the true posterior, i.e. $\pi_w \approx \pi$ (equivalently, the weighted log-likelihood sum is close to the full one)
Quality / Cost Tradeoff
tradeoff between low error and small coreset size
e.g. K-clustering: N points is too many (costly), 1 is too few (poor quality), K is about right
How do we perform this tradeoff automatically for an arbitrary model / dataset?
Outline
Bayesian Coresets
Coresets via Subsampling
Coresets via Optimization
Experimental Results
Subsampling
Idea: subsample the dataset, drawing $M$ points with probabilities $(p_n)_{n=1}^N$, and reweight so the log-likelihood estimate is unbiased
Intuitive reasoning:
the coreset has bounded size $M$
the approximation becomes exact as the coreset grows ($M \to \infty$)
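Written out (my notation, consistent with the definitions above), subsampling replaces the full log-likelihood sum with an unbiased weighted estimate:

```latex
\mathcal{L}(\theta) = \sum_{n=1}^N \mathcal{L}_n(\theta), \qquad
\mathcal{L}_w(\theta) = \frac{1}{M} \sum_{m=1}^{M} \frac{\mathcal{L}_{n_m}(\theta)}{p_{n_m}},
\quad n_m \overset{\text{iid}}{\sim} (p_1,\dots,p_N), \qquad
\mathbb{E}\!\left[\mathcal{L}_w(\theta)\right] = \mathcal{L}(\theta).
```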
Uniform Subsampling
high variance
misses critical data
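A minimal sketch of the uniform variant (illustrative, not from the slides): sample M indices uniformly at random and give each draw weight N/M, so the weighted sum matches the full sum in expectation.

```python
import numpy as np

def uniform_subsample_coreset(N, M, seed=0):
    """Uniformly subsample M of N points; each draw gets weight N/M."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(N, size=M, replace=True)
    w = np.zeros(N)
    np.add.at(w, idx, N / M)  # accumulate weight per draw (duplicates allowed)
    return w
```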
Importance Sampling (HCB16, CB17)
Intuition: sample unique/important data more frequently, but then downweight
procedure: on the original dataset, compute the data (log-likelihood) norms $\sigma_n = \|\mathcal{L}_n\|$, i.i.d. subsample with probabilities proportional to the norms, then reweight by the inverse norms
Theorem [CB17]: with high probability, the coreset error decays at the Monte Carlo rate $O(1/\sqrt{M})$
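A hedged sketch of this construction, assuming the per-datapoint norms have already been computed (e.g. from the random-feature vectors introduced later):

```python
import numpy as np

def importance_sample_coreset(norms, M, seed=0):
    """Importance-sampling coreset: sample proportionally to norms,
    then reweight so the weighted log-likelihood sum is unbiased."""
    rng = np.random.default_rng(seed)
    p = norms / norms.sum()
    idx = rng.choice(len(norms), size=M, replace=True, p=p)
    w = np.zeros(len(norms))
    np.add.at(w, idx, 1.0 / (M * p[idx]))
    return w
```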
Importance Sampling (HCB16, CB17)
finds critical data, but still high variance
no notion of residual error
Questions Remain
need an inner product to get a residual error direction, and need to consider the residual when adding points to the coreset
Outline
Bayesian Coresets
Coresets via Subsampling
Coresets via Optimization
Experimental Results
A Proposed Norm
problem: need a notion of residual error direction
solution: use a norm with an inner product, defined by a weighting distribution $\hat\pi$ (ideally $\hat\pi \approx$ the posterior); a concrete candidate is sketched below
more problems remain:
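One concrete choice consistent with the weighted 2-norm of [CB17] (the paper also considers a gradient-based Fisher norm); here $\hat\pi$ is the weighting distribution:

```latex
\langle f, g \rangle_{\hat\pi} = \mathbb{E}_{\theta \sim \hat\pi}\!\left[ f(\theta)\, g(\theta) \right],
\qquad
\| f \|_{\hat\pi} = \sqrt{ \mathbb{E}_{\theta \sim \hat\pi}\!\left[ f(\theta)^2 \right] }.
```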
Posterior Distance
problem: what does a small norm error say about the posterior itself?
solution: the norm error upper bounds the posterior 1-Wasserstein distance
Theorem [Huggins (Personal Communication)]: if the posterior is strongly log-concave, the 1-Wasserstein distance between the coreset posterior and the true posterior is bounded in terms of the norm of the log-likelihood error
Random Features
problem: the norm and inner products are not tractable / automated (the expectations under $\hat\pi$ are generally unavailable in closed form)
solution: random feature projection: draw samples $\theta_1, \dots, \theta_J \sim \hat\pi$ and represent each $\mathcal{L}_n$ by the finite-dimensional vector $v_n = \tfrac{1}{\sqrt{J}}\left(\mathcal{L}_n(\theta_1), \dots, \mathcal{L}_n(\theta_J)\right)$; we construct the coreset using these finite-dimensional vectors
Theorem [CB17]: with high probability, a good coreset for the projected vectors is a good coreset for the original log-likelihoods
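An illustrative sketch of the projection, assuming samples thetas drawn from the weighting distribution and a user-supplied log_likelihood(x, theta):

```python
import numpy as np

def random_feature_vectors(data, thetas, log_likelihood):
    """Project each log-likelihood function onto J random features.

    Inner products of the returned rows are Monte Carlo estimates of the
    weighted inner products E[L_n(theta) L_m(theta)] under the weighting
    distribution that generated thetas.
    """
    J = len(thetas)
    V = np.array([[log_likelihood(x, th) for th in thetas] for x in data])
    return V / np.sqrt(J)  # shape (N, J); row n represents L_n
```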
Coresets via Optimization (Campbell, Broderick ‘17)
problem: how to perform the cost/quality tradeoff?
solution: formulate coreset construction as an optimization: find the best posterior approximation given a particular MCMC computational budget, while making sure the approximation is a valid likelihood (nonnegative weights)
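In the notation above, the optimization takes roughly the following form (sparsity budget $M$, nonnegative weights); this is a paraphrase of [CB17], not a verbatim statement:

```latex
\min_{w \in \mathbb{R}^N}\;
\Bigl\| \sum_{n=1}^N w_n \mathcal{L}_n - \sum_{n=1}^N \mathcal{L}_n \Bigr\|^2
\quad \text{s.t.} \quad w \ge 0, \qquad \sum_{n=1}^N \mathbb{1}\left[ w_n > 0 \right] \le M.
```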
Frank-Wolfe
convex optimization over a polytope: minimize a convex $f(x)$ over a polytope $\mathcal{D}$
Frank-Wolfe iteration: pick the vertex $s_t = \arg\min_{s \in \mathcal{D}} \langle \nabla f(x_t), s \rangle$, then update $x_{t+1} = (1 - \gamma_t)\, x_t + \gamma_t\, s_t$
sparsity: the solution is a convex combination of at most $M$ vertices after $M - 1$ iterations
useful for coreset construction!
Convex Coreset Construction
Recall: the objective is convex, but the cardinality-constrained feasible set is not a polytope
solution: replace the constraint set with a (scaled) simplex, a polytope with one vertex per data point
Convex Coreset Construction
constraint vertices: one per data point (a single fully reweighted datapoint)
optimal solution: the full-data weight vector ($w_n = 1$ for all $n$) is feasible and achieves zero error
so the modified optimization error will approach 0 as the number of iterations grows
Frank-Wolfe for Coreset Construction
using a Hilbert norm makes FW tractable
Frank-Wolfe for Coreset Construction
coreset construction via FW is iterative greedy selection & optimal reweighting
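A hedged sketch of that greedy-select-and-reweight loop, operating on the random-feature matrix V from the earlier sketch; the scaled-simplex vertices and exact line search follow the description above, but details may differ from the paper's implementation:

```python
import numpy as np

def frank_wolfe_coreset(V, M):
    """Frank-Wolfe Bayesian coreset construction (illustrative sketch).

    V: (N, J) array; row n is the random-feature vector for log-likelihood L_n.
    M: number of FW iterations (roughly the coreset size budget).
    Returns a sparse, nonnegative weight vector w of length N.
    """
    N = V.shape[0]
    sig = np.linalg.norm(V, axis=1)          # per-datapoint norms sigma_n
    sig_total = sig.sum()                    # scale of the simplex
    L = V.sum(axis=0)                        # target: full log-likelihood vector

    # initialize at the vertex best aligned with the target
    f = np.argmax(V @ L / sig)
    w = np.zeros(N)
    w[f] = sig_total / sig[f]

    for _ in range(M - 1):
        Lw = V.T @ w                         # current coreset approximation
        r = L - Lw                           # residual error direction
        f = np.argmax(V @ r / sig)           # greedy vertex selection
        s = np.zeros(N)
        s[f] = sig_total / sig[f]
        d = s[f] * V[f] - Lw                 # direction toward the chosen vertex
        gamma = np.clip(r @ d / (d @ d), 0.0, 1.0)  # exact line search
        w = (1.0 - gamma) * w + gamma * s    # optimal reweighting step
    return w
```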
Exponential Convergence
problem: past coreset constructions have error decaying only at the $O(1/\sqrt{M})$ subsampling rate
solution: we show FW converges exponentially fast
Theorem [CB'17]: after $M$ iterations, the coreset error satisfies a bound whose exponential term dominates for large $M$ and whose linear term dominates for small $M$
[plot: posterior 1-Wasserstein bound vs. coreset size, FW vs. subsampling]
Summary & Scalability
pipeline: random feature projection, then Frank-Wolfe coreset construction, then MCMC on the coreset
cost drivers: the projection scales with the dataset size and projection dimension; Frank-Wolfe with the dataset size, coreset size, and projection dimension; MCMC with the coreset size and number of MCMC steps rather than the dataset size
Outline
Bayesian Coresets
Hilbert Coresets
Algorithms & Guarantees
Experimental Results
Experiments
Gaussian Inference
Logistic Regression
Poisson Regression
Gaussian Inference
[figure: coreset posterior approximations for M = 5, 50, 500, comparing uniformly random subsampling, importance sampling, and Frank-Wolfe]
Gaussian Inference
FW Hilbert coresets: orders of magnitude reduction in KL divergence
[plot: KL divergence vs. coreset construction iterations; lower error is better]
Logistic Regression
[figure: coreset posterior approximations for M = 10, 100, 1000, comparing uniformly random subsampling, importance sampling, and Frank-Wolfe]
Poisson Regression
[figure: coreset posterior approximations for M = 10, 100, 1000, comparing uniformly random subsampling, importance sampling, and Frank-Wolfe]
Logistic Regression & Poisson Regression
FW Hilbert coresets: orders of magnitude reduction in 1-Wasserstein distance
[plot: 1-Wasserstein bound vs. iterations, Frank-Wolfe vs. uniform subsampling]
Conclusion
Bayesian Coresets: automated, scalable Bayesian inference with finite-sample guarantees
Lots of work currently in progress
Campbell, Broderick. Automated Scalable Bayesian Inference via Hilbert Coresets. https://arxiv.org/abs/1710.05053
Code: trevorcampbell/bayesian-coresets (coming soon!)
New Work
Campbell, Broderick. Bayesian Coreset Construction via Greedy Iterative Geodesic Ascent. ICML 2018. https://arxiv.org/abs/1802.01737
We chose the simplex to work well with FW
Q: Is there a better optimization manifold / algorithm?
A: Yes
Streaming/Distributed Coresets
[diagram: split the dataset across K machines, construct a coreset on each shard, then merge the coresets]
Theorem [CB17]: Using K machines gives the same error bound, but coreset size KM
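A minimal sketch of the split/construct/merge pattern, reusing the frank_wolfe_coreset sketch above (the sharding scheme and merge-by-concatenation rule here are illustrative assumptions):

```python
import numpy as np

def distributed_coreset(V, K, M):
    """Build a coreset on each of K shards independently, then merge by
    writing each shard's weights into a global weight vector.
    Requires frank_wolfe_coreset from the earlier sketch."""
    N = V.shape[0]
    shards = np.array_split(np.arange(N), K)
    w = np.zeros(N)
    for idx in shards:
        w[idx] = frank_wolfe_coreset(V[idx], M)  # per-shard coreset
    return w  # merged coreset: at most K * M nonzero weights
```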
Experimental Results Summary
FW Hilbert coresets: lower posterior error across numerous models/datasets; often also lower computational cost for the same M
[plots: error and runtime vs. coreset construction iterations; lower error, faster]
Logistic/Poisson Regression Test Log-Likelihood
The Logistic Map
Lemma [CB'17]: if a sequence satisfies a logistic-map-type recursion, then it admits two simultaneous bounds: one that dominates for large $t$ (the exponential term) and one that dominates for small $t$ (the linear term); previously one could only get one or the other
Gaussian Projection Error
Related Recent Work
Unified analysis of common Bayesian nonparametric models for streaming combinatorial data:
Exchangeable Trait Allocations (2016). T. Campbell, D. Cai, T. Broderick. https://arxiv.org/abs/1609.09147
Automated approximations of Bayesian nonparametric priors with theoretical quality guarantees:
Truncated Random Measures (2016). T. Campbell*, J. Huggins*, J. How, T. Broderick. https://arxiv.org/abs/1603.00861