1 of 53

Scalable Bayesian Inference: New Tools for New Challenges

Robert Bamler

Department of Computer Science, Cluster of Excellence "Machine Learning for Science", Tübingen AI Center

Cluster Colloquium "Machine Learning" @ Tübingen, 3 February 2021

2 of 53

My Journey

• PhD in Theoretical Physics, Cologne, Germany (until 2016)

• 2016–2018: Postdoc in Machine Learning, Disney Research, USA

• 2018–2020: Postdoc in Machine Learning, UC Irvine, CA, USA

• since 2020: Assistant Professor, Uni Tübingen, Germany (ML Cluster of Excellence)

3 of 53

My Research Plan

Common Umbrella: scalable Bayesian inference

New Tools (models & algorithms) allow us to address New Challenges (resource-constrained applications); the challenges, in turn, guide research & evaluation metrics.

4 of 53

Teaser: Evolution of Language

[RB & Mandt, ICML 2017]

[images: "Computer" in 1961 vs. "Computer" today; © 20th Century Fox]

5 of 53

Teaser: Evolution of Language

[RB & Mandt, ICML 2017]

Süddeutsche Zeitung, last Saturday: "20 Years of Gibberish"

6 of 53

Top 10 Most Mobile Words

(Training data: Google Books corpus [Michel et al., 2011])

[RB & Mandt, ICML 2017]

[figure: similarity to "peer" over time]


8 of 53

Peeking Behind the Curtains:

Probabilistic Models & Scalable Bayesian Inference

9 of 53

Previous Approaches

[e.g., Kim et al., 2014; Hamilton et al., 2016]

[diagram: books written in 1850, 1851, …, 2008 → word2vec, fitted separately per year → word embeddings for the years 1850, 1851, …, 2008 → align the embedding spaces across years]

10 of 53

Problem: Subtle Signals in the Noise

[figure: word embeddings fitted to data from 1998 vs. 1999]

Our Solution [RB & Mandt, ICML 2017]:
Probabilistic time series model + Scalable approximate Bayesian inference + Symmetry-aware optimization algorithm

11 of 53

Dynamic Word Embeddings

[RB & Mandt, ICML 2017]

[diagram: a probabilistic generative model in which the word embeddings for the years 1850, 1851, …, 2008 are coupled through a probabilistic time series model (Ornstein-Uhlenbeck process) and generate the books written in each year; inference runs in the opposite direction, from the books back to the embeddings]
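To make the time-series prior concrete, here is a minimal sketch of how one could sample an embedding trajectory from an Ornstein-Uhlenbeck prior (illustrative only: the discretization, function names, and hyperparameter values are my assumptions, not the paper's implementation).

import numpy as np

def sample_ou_trajectory(num_years, dim, theta=0.1, sigma=0.05, rng=None):
    # Ornstein-Uhlenbeck prior over one word's embedding trajectory u_1, ..., u_T:
    # du = -theta * u * dt + sigma * dW, discretized with dt = 1 year (Euler-Maruyama).
    rng = np.random.default_rng() if rng is None else rng
    u = np.zeros((num_years, dim))
    u[0] = rng.normal(0.0, sigma / np.sqrt(2 * theta), size=dim)  # stationary marginal
    for t in range(1, num_years):
        u[t] = u[t - 1] - theta * u[t - 1] + sigma * rng.normal(size=dim)
    return u

trajectory = sample_ou_trajectory(num_years=159, dim=100)  # e.g., one vector per year 1850-2008
print(trajectory.shape)  # (159, 100)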


12 of 53

Probabilistic Models And Bayesian Inference

Generative probabilistic model: p(x, z) = p(z) · p(x | z)
• x: observed data
• z: "latent variables"
• p(z): prior knowledge (e.g., scientific model)
• p(x | z): likelihood ("observation model")

Inference: find the posterior  p(z | x) = p(x, z) / ∫ p(x, z) dz

What they will tell you: the normalizing integral is intractable.

What they won't tell you (usually): the exact posterior often has thousands or millions of terms.
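A toy illustration of why this becomes intractable (hypothetical example, not from the slides): for N binary latent variables, computing the posterior exactly means summing over 2^N configurations.

import itertools
import numpy as np

def exact_posterior(log_joint, num_latents):
    # Enumerate p(z | x) over all 2^N binary configurations z.
    configs = list(itertools.product([0, 1], repeat=num_latents))
    log_p = np.array([log_joint(np.array(z)) for z in configs])
    log_p -= np.logaddexp.reduce(log_p)        # normalize: divide by p(x)
    return configs, np.exp(log_p)

# Tiny Ising-like toy model: feasible for N = 10 (1024 terms),
# hopeless for N = 300 (more configurations than atoms in the universe).
coupling = 0.5
log_joint = lambda z: coupling * np.sum(z[:-1] * z[1:]) + 0.1 * np.sum(z)
configs, posterior = exact_posterior(log_joint, num_latents=10)
print(len(configs), posterior.sum())  # 1024, 1.0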



14 of 53

Approximate Bayesian Inference

Claim: You don't want the exact posterior.
You want a scalable method that provides a compact approximate posterior.

For example:
• Sampling methods (HMC, Langevin, …) → a finite set of samples from the posterior
• Mean field variational inference → a fully factorized variational distribution
• Structured variational inference → a distribution with some correlations
• Laplace approximation → posterior mode + curvature at the mode

Which approximation works best? → Answer: best on what task?
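To make one entry of the list above concrete, here is a minimal sketch of a Laplace approximation on a toy target (the target density, step sizes, and helper names are illustrative assumptions, not from the talk).

import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(z):
    # unnormalized toy posterior: a slightly skewed 2D distribution
    return 0.5 * z[0] ** 2 + 0.5 * (z[1] - 0.3 * z[0] ** 2) ** 2

def laplace_approximation(f, z0, eps=1e-3):
    mode = minimize(f, z0).x                    # posterior mode
    d = len(mode)
    hessian = np.zeros((d, d))                  # finite-difference curvature at the mode
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            hessian[i, j] = (f(mode + e_i + e_j) - f(mode + e_i)
                             - f(mode + e_j) + f(mode)) / eps ** 2
    return mode, np.linalg.inv(hessian)         # mean and covariance of the Gaussian q(z)

mean, cov = laplace_approximation(neg_log_posterior, z0=np.zeros(2))
print(mean, cov)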


15 of 53

Claim: Whether a certain approximation (of the exact posterior) is good or bad depends on the constraints of the application.

Corollary: Foundational research on scalable Bayesian inference methods should be guided by resource-constrained applications.

16 of 53

My Research Interests

Scalable Bayesian Inference Methods, under different resource constraints:
• Expensive Data: Bayesian ML for Natural Sciences
• Low Bandwidth: Neural Compression
• Low Compute & High Latency: Decentralized ML

17 of 53

Focus of Today’s Talk

Scalable Bayesian Inference Methods:
• Expensive Data: Bayesian ML for Natural Sciences
• Low Bandwidth: Neural Compression
• Low Compute & High Latency: Decentralized ML

Collaborators: Stephan Mandt (UC Irvine), Cheng Zhang (Microsoft Research, Cambridge, UK), Manfred Opper (TU Berlin)

18 of 53

Variational Inference: The Basic Idea

[recommended reviews: Blei et al., 2017; Zhang et al., 2019]

[diagram: within the space of all probability distributions over latent variables z, search a variational family for the member closest to the true posterior under some distance measure, moving from an initial candidate to the best approximation]

Our contributions:
• Extend variational family: [RB, Mandt, ICML 2017]
• Better distance measure: [RB, Zhang, Opper, Mandt, NeurIPS 2017]
• Faster convergence in presence of continuous symmetries: [RB, Mandt, ICML 2018]
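A minimal sketch of this search in code (illustrative, not the implementation from any of the cited papers): a diagonal-Gaussian variational family, fitted by stochastic maximization of the ELBO with the reparameterization trick; the toy target and all hyperparameters are assumptions.

import torch

def log_joint(z):
    # toy unnormalized log p(x, z); replace with your model
    return -0.5 * ((z - 2.0) ** 2).sum(-1) / 0.3 ** 2

dim = 2
mu = torch.zeros(dim, requires_grad=True)           # initial candidate
log_std = torch.zeros(dim, requires_grad=True)
optimizer = torch.optim.Adam([mu, log_std], lr=0.05)

for step in range(2000):
    optimizer.zero_grad()
    eps = torch.randn(32, dim)                       # 32 Monte Carlo samples
    z = mu + log_std.exp() * eps                     # reparameterization trick
    log_q = torch.distributions.Normal(mu, log_std.exp()).log_prob(z).sum(-1)
    elbo = (log_joint(z) - log_q).mean()             # E_q[log p(x, z) - log q(z)]
    (-elbo).backward()                               # maximize the ELBO
    optimizer.step()

print(mu.detach(), log_std.exp().detach())           # ≈ posterior mean 2.0, std 0.3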


19 of 53

Our Proposal in 100 Seconds

[RB, Zhang, Opper, Mandt, NeurIPS 2017]

▷ Reminder: we want to evaluate p(x) = ∫ p(x, z) dz

▷ Standard Variational Inference: maximize a lower bound (the ELBO) on log p(x)

▷ Our new lower bound: a generalization of the ELBO (next slides)

▷ Observation: similar to "perturbation theory" in theoretical physics
[image: The Big Bang Theory, Warner Bros. Television]


20 of 53

VI = Biased Importance Sampling

▷ Importance sampling:  p(x) = E_{z~q(z)}[ p(x, z) / q(z) ]   (exact equality, but high variance)

▷ Variational inference reduces variance at the cost of introducing a bias:
  log p(x) ⩾ E_{z~q(z)}[ log p(x, z) − log q(z) ]   (the ELBO)

▷ Can we make the bound tighter?
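A small numerical illustration of this trade-off (the toy model, proposal, and all numbers are my own assumptions, not from the slides): with a mismatched proposal in many dimensions, the importance weights span many orders of magnitude, so the importance sampling estimates scatter widely, while the ELBO estimate is stable but biased low.

import numpy as np

rng = np.random.default_rng(0)
dim, x = 100, 1.5
# prior z_d ~ N(0, 1), likelihood x_d | z_d ~ N(z_d, 1), all x_d = 1.5, proposal q(z_d) = N(0.5, 1)
log_p_exact = dim * (-0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / (2 * 2.0))

def log_weight(z):                                    # log p(x, z) - log q(z), summed over dims
    log_prior = -0.5 * np.log(2 * np.pi) - z ** 2 / 2
    log_lik = -0.5 * np.log(2 * np.pi) - (x - z) ** 2 / 2
    log_q = -0.5 * np.log(2 * np.pi) - (z - 0.5) ** 2 / 2
    return (log_prior + log_lik - log_q).sum(-1)

is_estimates, elbo_estimates = [], []
for _ in range(100):
    z = rng.normal(0.5, 1.0, size=(10, dim))          # 10 samples from q per estimate
    log_w = log_weight(z)
    is_estimates.append(np.log(np.mean(np.exp(log_w))))   # importance sampling estimate of log p(x)
    elbo_estimates.append(np.mean(log_w))                 # ELBO estimate
print("exact:", log_p_exact)
print("IS:    mean %.1f  std %.1f" % (np.mean(is_estimates), np.std(is_estimates)))
print("ELBO:  mean %.1f  std %.1f" % (np.mean(elbo_estimates), np.std(elbo_estimates)))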


21 of 53

Generalized Lower Bounds

▷ Conventional ELBO uses Jensen’s inequality:

▷ Holds not only for the logarithm but for any concave function:
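A worked version of both steps, in the deck's notation (standard derivation, not copied from the slide):

log p(x) = log E_{z~q}[ p(x, z) / q(z) ]  ⩾  E_{z~q}[ log( p(x, z) / q(z) ) ]  =  ELBO(q)   (Jensen, log is concave)

f(p(x)) = f( E_{z~q}[ p(x, z) / q(z) ] )  ⩾  E_{z~q}[ f( p(x, z) / q(z) ) ]   (Jensen, any concave f)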


22 of 53

Variance-Bias Trade-Off

[Bamler, Zhang, Opper, and Mandt, NIPS 2017; Bamler, Zhang, Opper, and Mandt, JSTAT 2019]

Generalized lower bound: f(p(x)) ⩾ E_{z~q(z)}[ f( p(x, z) / q(z) ) ]  for any concave f
(turns exact importance sampling into biased importance sampling with lower variance)

Spectrum of choices for f:
• f = id (importance sampling): small bias, high variance
• f(ξ) = ξ^(1−α) (alpha-VI) [e.g., Li & Turner, 2016]
• f = log (standard VI): large bias, low variance

Idea: set f(ξ) = Polynomial(log ξ) such that f ≈ id

23 of 53

Perturbative Black Box VI

[Bamler, Zhang, Opper, and Mandt, NIPS 2017; Bamler, Zhang, Opper, and Mandt, JSTAT 2019]

▷ Taylor series in β:

→ ELBO + perturbative corrections, i.e., as in a cumulant expansion, but: a lower bound on p(x) for any V0 and any odd order K.
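One way to write a bound of this family, reconstructed from the description above (signs and conventions may differ from the paper):

f(ξ) = e^{V0} · Σ_{k=0}^{K} (log ξ − V0)^k / k!   (Taylor expansion of ξ = e^{log ξ} around log ξ = V0)

For odd K, the truncation error of the exponential series is non-negative, so f(ξ) ⩽ ξ for all ξ > 0, and therefore

p(x) = E_{z~q}[ p(x, z) / q(z) ]  ⩾  E_{z~q}[ e^{V0} · Σ_{k=0}^{K} ( log p(x, z) − log q(z) − V0 )^k / k! ].

For K = 1 and optimally chosen V0 this reduces to e^{ELBO}, i.e., standard VI; K = 3, 5, … add the perturbative correction terms.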


24 of 53

Properties of the Perturbative Bound

▷ Tighter than ELBO due to correction terms.

[figure: bias-variance spectrum between f = id (importance sampling: small bias, high variance) and f = log (standard VI: large bias, low variance)]

25 of 53

Properties of the Perturbative Bound

▷ Tighter than ELBO due to correction terms.

▷ Lower gradient variance than the α-bound (because the bound is only polynomial in V).

26 of 53

Results for Variational Inference

▷ Less prone to underestimation of posterior variances than standard VI with the KL divergence.

27 of 53

Results for Variational EM

Also: better estimates of p(x).

Variational Autoencoder: better predictive performance with small training sets (here: subsets of MNIST).

28 of 53

Enough Equations … back to pretty pictures.

[photo: Nymphenburg Park, Munich #nofilter]

29 of 53

Outline

Scalable Bayesian Inference Methods:
• Expensive Data: Bayesian ML for Natural Sciences
• Low Bandwidth: Neural Compression
• Low Compute & High Latency: Decentralized ML

Collaborators: Stephan Mandt (UC Irvine), Yibo Yang (UC Irvine)

30 of 53

A New Perspective for Compression

[home.cern]

[Ahrens et al., Nature, 2013]


31 of 53

Example: Lossy Image Compression

[Yang, RB, Mandt, NeurIPS 2020]

[rate-distortion comparison:
• ours
• ours (with lossy bits-back coding)
• reference: best non-neural codec (BPG 4:4:4) [Bellard, 2014]
• previously best neural method [Minnen et al., 2018]
• our starting point [Ballé et al., 2018]]

32 of 53

Compression & Latent Variable Models

[pioneered by Ballé et al., 2017; Theis et al., 2017]

a) Data Compression: infer latents z* for the data x, entropy-code z* into a compressed bitstring, and reconstruct x' = g(z*).

b) Model Compression: compress the latent parameters z of a probabilistic model of data x (example: Bayesian word embeddings [Barkan, 2017]).

33 of 53

Probabilities Matter in Compression

Don't transmit what you can predict.
→ Need a good generative probabilistic model:  E[bitrate] ⩾ CrossEntropy(data ‖ model)

Classical Example: Arithmetic Coding
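A toy sketch of the arithmetic-coding idea (an exact-fraction version without the renormalization tricks of a production codec; the symbol probabilities and the message are made up): each symbol narrows an interval in [0, 1) in proportion to its predicted probability, so predictable symbols cost few bits.

from fractions import Fraction
from math import ceil, log2

probs = {"a": Fraction(7, 10), "b": Fraction(2, 10), "c": Fraction(1, 10)}
symbols = list(probs)

def encode(message):
    low, width = Fraction(0), Fraction(1)
    for s in message:
        cum = sum((probs[t] for t in symbols[:symbols.index(s)]), Fraction(0))
        low, width = low + cum * width, probs[s] * width
    num_bits = ceil(-log2(width)) + 1        # enough bits to point inside the final interval
    code = ceil(low * 2 ** num_bits)         # integer with code / 2**num_bits in [low, low + width)
    return code, num_bits

def decode(code, num_bits, length):
    point, out = Fraction(code, 2 ** num_bits), []
    for _ in range(length):
        cum = Fraction(0)
        for s in symbols:                    # find the symbol whose subinterval contains the point
            if point < cum + probs[s]:
                out.append(s)
                point = (point - cum) / probs[s]   # rescale and continue with the next symbol
                break
            cum += probs[s]
    return "".join(out)

msg = "aaabaca"
code, num_bits = encode(msg)
print(num_bits, "bits for", len(msg), "symbols ->", decode(code, num_bits, len(msg)))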


34 of 53

Probabilities Matter in Compression

Don't transmit what you can predict.
→ Need a good generative probabilistic model:  E[bitrate] ⩾ CrossEntropy(data ‖ model)

Our New Aspect [Yang, RB & Mandt, ICML 2020]:
Don't transmit what you're not sure about.
→ Need good posterior uncertainty estimates. (→ Bayesian inference)

35 of 53

What’s the Population of Rome?

© David Iliff, CC BY-SA 2.5

  • In the year 500 AD: 100,000

  • On 30 April 2018: 2,879,728

36 of 53

Variational Bayesian Quantization

[figure: the same quantity quantized with 2 bits vs. 3 bits]

[Yang, RB & Mandt, ICML 2020]
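A toy illustration of the principle only, not the VBQ algorithm from the paper: round each posterior mean to a grid whose spacing grows with its posterior standard deviation, so uncertain parameters get fewer bits (all names and numbers below are assumptions).

import numpy as np

def quantize(post_mean, post_std, lam=0.25):
    # Grid spacing proportional to the posterior std: a coarser grid has fewer
    # points in any fixed range, hence fewer bits are needed to index a point.
    grid = lam * post_std
    return np.round(post_mean / grid) * grid

rng = np.random.default_rng(0)
post_mean = rng.normal(size=5)
post_std = np.array([0.01, 0.05, 0.1, 0.5, 1.0])    # from very certain to very uncertain
print(quantize(post_mean, post_std))
print(np.log2(8.0 / (0.25 * post_std)).round(1))    # approx. bits to index a grid covering a range of 8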


37 of 53

Example: JPEG

original


38 of 53

Example: JPEG

JPEG @ 0.24 bits per pixel


39 of 53

Example: VBQ

Ours @ 0.24 bits per pixel


40 of 53

Scaling it Up

[Yang, RB & Mandt, ICML 2020]

Example: Bayesian word embedding model (10⁷ parameters) [Barkan, 2017]

[figure: semantic reasoning benchmark performance vs. compressed model size; legend includes VBQ (proposed)]

41 of 53

Reminder: Dynamic Word Embeddings

[RB & Mandt, ICML 2017]


42 of 53

Dynamic Word Embeddings Model

➜ About 600 million model parameters ➜ 2.5 GB

➜ But: we know the posterior ➜ "Don't transmit what you're not sure about" ➜ 34 MB (compressed)
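For scale (back-of-the-envelope, assuming 32-bit floats): 600 × 10⁶ parameters × 4 bytes ≈ 2.4 GB uncompressed; 34 MB ≈ 272 × 10⁶ bits, i.e., roughly 0.45 bits per parameter after compression.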


43 of 53

The Linguistic Flux Capacitor


44 of 53

Outline

Scalable Bayesian Inference Methods:
• Expensive Data: Bayesian ML for Natural Sciences
• Low Bandwidth: Neural Compression
• Low Compute & High Latency: Decentralized ML

45 of 53

Super Powers That Never Wanted to Be


46 of 53

ML Solidifies The New Super Powers

[diagram: many users each send data to a central service and receive predictions in return]

Consolidation of Power: more users → more training data → better ML models → more users → …

47 of 53

Unrelated: Federated Learning

[diagram: clients send updates to a central server and receive the model in return]

Typically controlled by 1 institution: a central authority is needed for model consensus.
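For contrast with the decentralized vision that follows, a minimal sketch of the usual federated-averaging pattern (illustrative linear-regression clients; all names and numbers are assumptions, not tied to any particular framework): the server remains the single point of control.

import numpy as np

def local_update(weights, local_x, local_y, lr=0.1, steps=10):
    w = weights.copy()
    for _ in range(steps):                      # plain linear regression via local SGD
        grad = 2 * local_x.T @ (local_x @ w - local_y) / len(local_y)
        w -= lr * grad
    return w

def federated_round(weights, clients):
    # server-side aggregation: weighted average of the clients' locally updated models
    sizes = np.array([len(y) for _, y in clients])
    updates = np.stack([local_update(weights, x, y) for x, y in clients])
    return (sizes[:, None] * updates).sum(0) / sizes.sum()

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
clients = []
for _ in range(5):                              # 5 clients, each with private data
    x = rng.normal(size=(20, 2))
    clients.append((x, x @ true_w + 0.1 * rng.normal(size=20)))

weights = np.zeros(2)
for _ in range(20):
    weights = federated_round(weights, clients)
print(weights)                                   # ≈ [1.0, -2.0]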


48 of 53

Claim: Current ML research is mainly oriented around the prevalent business model of centralized services.

Working Thesis: To overcome centralization, we should explore decentralized learning algorithms in lockstep with decentralized business models.

49 of 53

Vision: Decentralized Machine Learning

[diagram: peers exchange model updates directly with each other, without a central server]

  • Model is a shared data structure.
  • Each user keeps track of a partial view of the model.
  • The partial views are strongly correlated. → "lazy consensus"
  • Both training and consensus are incentivized economically. → Blockchain


50 of 53

Computer Science is About Abstractions

[diagram: research questions, models & learning algorithms, and an experimentation platform for decentralized ML]

(MVP: https://github.com/robamler/blockchain-machine-learning-demo)

51 of 53

Recap

• An example model: Dynamic Word Embeddings
• A new inference algorithm: Perturbative Black Box Variational Inference
• An emerging applied field: neural compression
• An outlook: decentralization of power in machine learning

52 of 53

Proof of Concept: Custom ML Blockchain

