1 of 44

Automated, Scalable Bayesian Inference via Coresets

Trevor Campbell Tamara Broderick


2 of 44

Large Scale Inference

The challenges of science have shifted from data generation to data analysis, as new technologies have enabled generating and recording massive amounts of data (e.g. in physics, computer security, finance, biology)

Desiderata: good point estimates/uncertainties, scalable, easy to use

Challenges: existing methods often slow, unreliable, difficult/tedious to implement

This talk: automated, scalable inference with guarantees via data summarization


3 of 44

Bayesian Inference

Why be Bayesian?

  • posterior provides point estimation, uncertainty quantification
  • prior enables incorporation of domain expertise


Given: data x_1, …, x_N and a model: a likelihood p(x_n | θ) with a prior π_0(θ)

Infer: the posterior distribution of θ given the data

but we typically don't have access to the exact posterior (it is intractable)
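For reference, the posterior in this standard setup (the symbols x_n, θ, π_0 are my notation, chosen to match the coreset formulas later in the deck) is:

\[
\pi(\theta) \;=\; \frac{1}{Z}\,\pi_0(\theta)\prod_{n=1}^{N} p(x_n \mid \theta),
\]

where the normalizing constant Z is what makes exact computation intractable in general.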

4 of 44

Inference Algorithm Desiderata

automated: wide class of applicable models & no tuning

enables:

  • ease of use
  • experimental reproducibility
  • sophisticated modeling
  • fast exploratory analysis

scalable: complexity O(# data) with a small constant

enables:

  • application to real massive data problems

theoretically sound: bounded discrepancy between the true posterior and the output

enables:

  • confidence in results
  • control of the computation/quality tradeoff

5 of 44

Past Work (nonexhaustive)

automation, scalability, guarantees: pick 2

e.g. Black-Box VB (Ranganath et al 2014), Auto-Diff VB (Kucukelbir et al 2016), Consensus MC (Scott et al 2016), SDA Bayes (Broderick et al 2013)

e.g. Hamiltonian MC (Neal 2011), NUTS (Hoffman & Gelman 2014), Stan (Carpenter et al 2017), Approx Diffusions (Huggins & Zou 2017)

[diagram: VB and MCMC methods placed along the automation / scalability / guarantees axes]

6 of 44

This Work

run MCMC on a small representative subset of the data (a coreset)

[figure: on the full dataset MCMC is slow; on the coreset MCMC is fast]

Bayesian Coresets (Huggins, Campbell, Broderick 2016; Campbell & Broderick 2017, 2018)

  • Automated: model-agnostic, no tuning
  • Scalable: complexity is O(# data) with a small constant
  • Sound: posterior discrepancy decays exponentially in the coreset size M

7 of 44

Outline


Bayesian Coresets

Coresets via Subsampling

Coresets via Optimization

Experimental Results

8 of 44

The “core” of a dataset


redundancies exist in large datasets

e.g. K-clustering

coresets: summarize the data with a weighted subset, then run standard MCMC

similar ideas: coresets in optimization (Feldman & Langberg 2011), data squashing (DuMouchel et al 1999), GP inducing points (Snelson & Ghahramani 2005)

9 of 44

Bayesian Coresets (Huggins, Campbell, Broderick ‘16)


A coreset is a weighted subset sum of log-likelihoods

Goal: the full-data posterior distribution

Coreset posterior approximation: the same posterior with the full log-likelihood sum replaced by the weighted coreset sum (formulas below)
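In the notation of Campbell & Broderick (2017), with L_n(θ) := log p(x_n | θ), the two distributions on this slide can be written as (my transcription of the formulas shown on the slide):

\[
\pi(\theta) \;\propto\; \exp\!\Big(\sum_{n=1}^N \mathcal{L}_n(\theta)\Big)\,\pi_0(\theta)
\qquad\text{(goal)},
\]

\[
\pi_w(\theta) \;\propto\; \exp\!\Big(\sum_{n=1}^N w_n\,\mathcal{L}_n(\theta)\Big)\,\pi_0(\theta),
\quad w \ge 0 \ \text{with few nonzero entries}
\qquad\text{(coreset posterior)}.
\]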

10 of 44

Not just any coreset...

we need weights w for which (writing L(w) := Σ_n w_n L_n and L := Σ_n L_n):

  1. ||L(w) − L|| is small for some norm, hopefully implying π_w ≈ π
  2. w has at most M ≪ N nonzero entries, so that MCMC for π_w is fast: each step takes O(M) rather than O(N) time

11 of 44

Quality / Cost Tradeoff

tradeoff between low error and small coreset size

e.g. K-clustering: keeping all N points is too many (costly), keeping 1 point is too few (poor quality), keeping K points is about right

How do we perform this tradeoff automatically for an arbitrary model / dataset?

12 of 44

Outline


Bayesian Coresets

Coresets via Subsampling

Coresets via Optimization

Experimental Results

13 of 44

Subsampling

Idea: randomly subsample the dataset according to a set of probabilities, then reweight the sampled points to form the coreset

Intuitive reasoning:

  • the coreset has bounded size (at most M sampled points)
  • the approximation becomes exact as the coreset grows (M → ∞)

14 of 44

Uniform Subsampling


high variance

misses critical data

15 of 44

Importance Sampling (HCB16, CB17)

Intuition: sample unique/important data more frequently, but then downweight

pipeline: original dataset → compute data norms σ_n = ||L_n|| → i.i.d. subsample with probabilities proportional to the norms → reweight by the inverse norms

Theorem [CB17]: With high probability, the coreset error ||L(w) − L|| decreases at the Monte Carlo rate O(1/√M)
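A minimal sketch of this importance-sampling construction (it assumes the per-datapoint norms have already been computed, and is an illustration rather than the authors' released code):

import numpy as np

def importance_sampling_coreset(sigma, M, rng=None):
    # sigma: length-N array of per-datapoint norms ||L_n||;  M: number of i.i.d. draws
    rng = np.random.default_rng() if rng is None else rng
    total = sigma.sum()
    p = sigma / total                     # sample unique/important data more frequently...
    counts = rng.multinomial(M, p)        # M i.i.d. draws over the N data points
    w = counts * total / (M * sigma)      # ...then downweight by the inverse norms
    return w                              # unbiased: E[w_n] = 1, at most M nonzeros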

16 of 44

Importance Sampling (HCB16, CB17)


finds critical data, but still high variance

no notion of residual error

17 of 44

Questions Remain

  1. Which norm should we choose? (e.g. the one used in HCB16)
  2. How does the norm relate to familiar statistical notions of posterior discrepancy?
  3. How do we improve upon subsampling?

For the last question, we need an inner product to get a residual error direction, and we need to account for that residual when adding points to the coreset

18 of 44

Outline


Bayesian Coresets

Coresets via Subsampling

Coresets via Optimization

Experimental Results

19 of 44

A Proposed Norm

problem: we need a notion of residual error direction

solution: use a norm induced by an inner product, e.g. a weighted L2 norm ||f||^2 = E_{θ∼π̂}[f(θ)^2] with a weighting distribution π̂ (ideally π̂ close to the true posterior π)

more problems:

  • not automated: can't compute the norm/inner products exactly for arbitrary models
  • not scalable: using the residual when selecting coreset points requires inner products between all pairs of data points
  • not standard: does this norm bound a known posterior discrepancy?

20 of 44

Posterior Distance

problem: what does this norm mean statistically?

solution: it upper bounds the posterior 1-Wasserstein distance

Theorem [Huggins (personal communication)]: If the coreset posterior is strongly log-concave, then the 1-Wasserstein distance between the coreset posterior and the true posterior is bounded, up to a constant depending on the log-concavity parameter, by a divergence between the two that the coreset error norm controls

21 of 44

Random Features

problem: the exact norm/inner products are not tractable, so the approach is not yet automated

solution: random feature projection; we construct the coreset using finite-dimensional feature vectors, one per data point

Theorem [CB17]: With high probability, a good coreset for the projected vectors is a good coreset for the original log-likelihoods
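A minimal sketch of one way to build such finite-dimensional vectors by Monte Carlo (the function names, the choice of weighting distribution, and the lack of centering are my simplifications; the paper's projections differ in details):

import numpy as np

def random_feature_projection(loglike, data, theta_samples):
    # loglike(x, theta): evaluates the per-datapoint log-likelihood L_n(theta)
    # theta_samples: J draws from a weighting distribution (e.g. a Laplace approximation)
    J = len(theta_samples)
    # row n stands in for L_n: v_n . v_m Monte-Carlo-estimates the weighted L2 inner product
    V = np.array([[loglike(x, th) for th in theta_samples] for x in data]) / np.sqrt(J)
    return V    # shape (N, J)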

22 of 44

Coresets via Optimization (Campbell, Broderick ‘17)

problem: how do we perform the cost/quality tradeoff?

solution: formulate coreset construction as an optimization:

  • find the best posterior approximation (minimize the coreset error in the chosen norm)
  • given a particular MCMC computational budget (at most M nonzero weights)
  • while making sure the approximation is a valid likelihood (nonnegative weights)
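Written out, this is the sparse optimization problem from the cited paper (my transcription; notation as in the coreset slides):

\[
\min_{w \in \mathbb{R}^N} \; \big\| L(w) - L \big\|^2
\quad \text{s.t.} \quad w \ge 0, \;\; \|w\|_0 \le M,
\qquad
L(w) := \sum_{n=1}^N w_n \mathcal{L}_n, \quad L := \sum_{n=1}^N \mathcal{L}_n.
\]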

23 of 44

Frank-Wolfe

convex optimization over a polytope: minimize a convex function f(w) over a polytope D

Frank-Wolfe iteration:

  1. linear optimization: find the vertex of D most aligned with the negative gradient at the current iterate
  2. line search along the segment from the current iterate to that vertex
  3. update: move to the resulting convex combination

sparsity: the solution is a convex combination of at most M vertices after M-1 iterations

useful for coreset construction!
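In symbols, the standard Frank-Wolfe update (textbook form, not specific to this paper) is:

\[
s_t \;=\; \arg\min_{s \in \mathcal{D}} \;\langle \nabla f(w_t),\, s\rangle,
\qquad
w_{t+1} \;=\; (1-\gamma_t)\, w_t + \gamma_t\, s_t,
\]

with the step size \(\gamma_t \in [0,1]\) chosen by exact line search.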

24 of 44

Convex Coreset Construction

Recall: the coreset objective ||L(w) − L||^2 is convex, but the feasible set {w ≥ 0, ||w||_0 ≤ M} is not a polytope, so Frank-Wolfe does not directly apply

solution: replace the feasible set with a simplex, a polytope whose vertices each put all their weight on a single data point

25 of 44

Convex Coreset Construction

optimal solution: the full-data weighting w = (1, …, 1) lies in the simplex, so the modified optimization can drive the error to 0

constraint vertices: each vertex of the simplex corresponds to a single (rescaled) data point

after M iterations of Frank-Wolfe, the iterate involves at most M vertices (i.e. at most M data points) and the modified optimization error approaches 0
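Concretely, with σ_n := ||L_n|| and σ := Σ_n σ_n, the simplex used in the paper (my transcription of the formulas on the slide) is:

\[
\mathcal{D} \;=\; \Big\{\, w \in \mathbb{R}^N \;:\; w \ge 0,\;\; \sum_{n=1}^N \sigma_n w_n = \sigma \,\Big\},
\qquad
\text{vertices: } \ \tfrac{\sigma}{\sigma_n}\,\mathbb{1}_n, \;\; n = 1,\dots,N,
\]

and w = (1, …, 1) is feasible since \(\sum_n \sigma_n \cdot 1 = \sigma\).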

26 of 44

Frank-Wolfe for Coreset Construction

  • linear optimization → greedy selection of the next coreset point
  • line search → closed-form step size
  • update the weights

using a Hilbert norm makes FW tractable
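A minimal numpy sketch of these steps, operating on the projected vectors from the earlier sketch (an illustration under those assumptions, with no data point of zero norm; not the authors' released bayesian-coresets package):

import numpy as np

def fw_coreset(V, M):
    # V: (N, J) array; row n is the projected vector standing in for log-likelihood L_n
    # M: number of Frank-Wolfe iterations (the coreset uses at most M data points)
    N, J = V.shape
    sigma = np.linalg.norm(V, axis=1)        # sigma_n = ||L_n||
    sigma_total = sigma.sum()                # sigma = sum_n sigma_n
    L_full = V.sum(axis=0)                   # stand-in for the full log-likelihood sum
    w = np.zeros(N)
    for _ in range(M):
        residual = L_full - V.T.dot(w)       # current residual L - L(w)
        # greedy selection: point whose normalized vector best aligns with the residual
        f = int(np.argmax(V.dot(residual) / sigma))
        # direction from the current iterate toward the simplex vertex for point f
        d = (sigma_total / sigma[f]) * V[f] - V.T.dot(w)
        if d.dot(d) == 0.0:                  # already exact; stop early
            break
        # closed-form line search, clipped to stay in the simplex
        gamma = np.clip(d.dot(residual) / d.dot(d), 0.0, 1.0)
        # convex-combination update of the weights
        w *= (1.0 - gamma)
        w[f] += gamma * sigma_total / sigma[f]
    return w                                 # at most M nonzero weights

Each iteration touches at most one new data point, so after M iterations w has at most M nonzero weights and the coreset posterior π_w can be handed to any standard MCMC routine.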

27 of 44

Frank-Wolfe for Coreset Construction


coreset construction via FW is iterative greedy selection & optimal reweighting

28 of 44

Exponential Convergence

problem: past coreset constructions have error decaying only polynomially in the coreset size (e.g. the subsampling rate above)

solution: we show FW converges exponentially fast

Theorem [CB'17]: After M iterations, the coreset error, and hence the posterior 1-Wasserstein bound, satisfies a bound with two regimes: one term dominates for small M, while an exponentially decaying term dominates for large M

[plot: error bound vs. M, Frank-Wolfe compared with subsampling]

29 of 44

Summary & Scalability

pipeline: random feature projection → Frank-Wolfe coreset construction → MCMC on the coreset

costs scale with the dataset size N, the coreset size M, the projection dimension J, and the number of MCMC steps; N enters only the cheap projection and construction stages, while each MCMC step depends only on the coreset size M

30 of 44

Outline


Bayesian Coresets

Coresets via Subsampling

Coresets via Optimization

Experimental Results

31 of 44

Experiments

Gaussian Inference

  • Data: 2D synthetic
  • Inference: closed-form
  • Comparison: KL-divergence
  • Norms: closed-form

Logistic Regression & Poisson Regression

  • Data: 1 synthetic, 2 real datasets for each model
  • Inference: Random-Walk Metropolis-Hastings
  • Comparison: test log-likelihood, 1-Wasserstein
  • Norms: 500-dimensional random projection, multivariate normal weighting via Laplace approximation

32 of 44

Gaussian Inference

[figure: coreset posterior approximations for M = 5, 50, 500, comparing Uniformly Random Subsampling, Importance Sampling, and Frank-Wolfe]

33 of 44

Gaussian Inference

FW Hilbert coresets: orders of magnitude reduction in KL-divergence

[plot: KL-divergence (lower is better) vs. coreset construction iterations]

34 of 44

Logistic Regression

[figure: coresets for M = 10, 100, 1000, comparing Uniformly Random Subsampling, Importance Sampling, and Frank-Wolfe]

35 of 44

Poisson Regression

[figure: coresets for M = 10, 100, 1000, comparing Uniformly Random Subsampling, Importance Sampling, and Frank-Wolfe]

36 of 44

FW Hilbert coresets: orders of magnitude reduction in 1-Wasserstein distance

[plots: 1-Wasserstein distance vs. coreset construction iterations for Logistic Regression and Poisson Regression, Frank-Wolfe vs. uniform subsampling]

37 of 44

Conclusion

Lots of work currently in progress:

  • reducing projection error
  • different constraint sets
  • non-conditionally i.i.d. models
  • investigating other norms
  • streaming/distributed

Bayesian Coresets: automated, scalable Bayesian inference with finite-sample guarantees

Campbell, Broderick. Automated Scalable Bayesian Inference via Hilbert Coresets. https://arxiv.org/abs/1710.05053

Code: trevorcampbell/bayesian-coresets (coming soon!)

38 of 44

New Work

Campbell, Broderick. Bayesian Coreset Construction via Greedy Iterative Geodesic Ascent. ICML 2018. https://arxiv.org/abs/1802.01737

We chose the simplex to work well with FW

Q: Is there a better optimization manifold / algorithm?

A: Yes

39 of 44

Streaming/Distributed Coresets

[diagram: split the dataset across machines, construct a coreset on each split in parallel, then merge the coresets]

Theorem [CB17]: Using K machines gives the same error bound, but coreset size KM
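A sketch of this split / construct / merge pattern, reusing the hypothetical fw_coreset and random_feature_projection helpers from the earlier sketches (again an illustration of the idea, not the released implementation):

import numpy as np

def distributed_coreset(loglike, data_shards, theta_samples, M):
    # Build a coreset independently on each of the K shards, then merge by
    # concatenating the selected points and their weights (total size <= K*M).
    merged_points, merged_weights = [], []
    for shard in data_shards:                      # one shard per machine
        V = random_feature_projection(loglike, shard, theta_samples)
        w = fw_coreset(V, M)
        keep = np.nonzero(w)[0]
        merged_points.extend(shard[i] for i in keep)
        merged_weights.extend(w[keep])
    return merged_points, np.array(merged_weights)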

40 of 44

Experimental Results Summary

FW Hilbert coresets: lower posterior error across numerous models/datasets; often also lower computational cost for the same M

[plots: posterior error (lower is better) and construction time (faster is better) vs. coreset construction iterations]

41 of 44

Logistic/Poisson Regression Test Log-Likelihood


42 of 44

The Logistic Map

Lemma [CB'17]: If a sequence satisfies a logistic-map-type recursion, then it obeys a single bound in which one term dominates for small t and a different term dominates for large t

previously, analyses could only obtain one of the two regimes, not both

43 of 44

Gaussian Projection Error


44 of 44

Related Recent Work


Unified analysis of common Bayesian nonparametric models for streaming combinatorial data


Exchangeable Trait Allocations (2016). T. Campbell, D. Cai, T. Broderick. https://arxiv.org/abs/1609.09147

Automated approximations of Bayesian nonparametric priors with theoretical quality guarantees


Truncated Random Measures (2016). T. Campbell*, J. Huggins*, J. How, T. Broderick. https://arxiv.org/abs/1603.00861