Scalable Bayesian Inference:
New Tools for New Challenges
Robert Bamler
Department of Computer Science,
Cluster of Excellence “Machine Learning for Science”,
Tübingen AI Center
Cluster Colloquium "Machine Learning" @ Tübingen • 3 February 2021
My Journey
PhD in Theoretical Physics, Cologne, Germany
→ 2016 → Postdoc in Machine Learning, Disney Research, USA
→ 2018 → Postdoc in Machine Learning, UC Irvine, CA, USA
→ 2020 → Assistant Professor, Uni Tübingen, Germany (ML Cluster of Excellence)
My Research Plan
Common Umbrella: scalable Bayesian inference
New Tools (models & algorithms) allow us to address New Challenges (resource-constrained applications),
which in turn guide research & evaluation metrics.
Teaser: Evolution of Language
[RB & Mandt, ICML 2017]
“Computer” today
© 20th century FOX
“Computer”
in 1961
Teaser: Evolution of Language
[RB & Mandt, ICML 2017]
Süddeutsche Zeitung, last Saturday
“20 Years of Gibberish”
Top 10 Most Mobile Words
(Training data: Google Books corpus [Michel et al., 2011])
[RB & Mandt, ICML 2017]
[Plot with y-axis: similarity to “peer”.]
Peeking Behind the Curtains:
Probabilistic Models & Scalable Bayesian Inference
Previous Approaches
[e.g., Kim et al., 2014; Hamilton et al., 2016]
[Diagram: word2vec is trained separately on the books written in each year (1850, 1851, ..., 2008), giving one set of word embeddings per year; the embeddings of consecutive years are then aligned.]
Problem: Subtle Signals in the Noise
[Figure: word embeddings fitted independently to adjacent years (1998, 1999); the genuine year-to-year signal is subtle compared to the training noise.]
Our Solution: [RB & Mandt, ICML 2017]
Probabilistic time series model + scalable approximate Bayesian inference + symmetry-aware optimization algorithm
Dynamic Word Embeddings
[RB & Mandt, ICML 2017]
[Diagram: probabilistic generative model. A probabilistic time series model (Ornstein-Uhlenbeck process) couples the word embeddings for the years 1850, 1851, ..., 2008; the embeddings for each year generate the books written in that year. Inference runs in the reverse direction, from the observed books to the embeddings.]
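To make the time-series prior concrete, here is a sketch of a discretized Ornstein-Uhlenbeck prior over embedding trajectories (illustrative notation; the exact parameterization and the skip-gram likelihood are spelled out in the paper):

\begin{align*}
  p(Z_{1850}) &= \mathcal{N}\big(Z_{1850};\, 0,\, \sigma_0^2 I\big) \\
  p(Z_{t+1} \mid Z_t) &= \mathcal{N}\big(Z_{t+1};\, \gamma\, Z_t,\, \sigma^2 I\big), \qquad 0 < \gamma < 1 \\
  p(\text{books}_t \mid Z_t) &= \text{skip-gram (word2vec) observation model}
\end{align*}

Here Z_t is the embedding matrix for year t, the factor γ pulls embeddings back toward zero, and σ² sets the allowed drift per year.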
Probabilistic Models And Bayesian Inference
Generative probabilistic model: p(x, z) = p(z) p(x | z)
x: observed data; z: “latent variables”; p(z): prior knowledge (e.g., scientific model); p(x | z): likelihood (“observation model”)
Inference: find the posterior p(z | x) = p(x, z) / ∫p(x, z) dz
What they will tell you: the integral in the denominator is intractable.
What they won’t tell you (usually): the exact posterior often has thousands or millions of terms.
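To see why the denominator is the hard part, here is a minimal illustration (not from the talk; all numbers are made up) that computes a one-dimensional posterior by brute force on a grid. This works in one dimension, but the grid grows exponentially with the number of latent variables, which is why approximate inference is needed.

import numpy as np

# Toy model: z ~ N(0, 1) (prior), x | z ~ N(z, 0.5^2) (likelihood).
# We observe x = 0.8 and want the posterior p(z | x).
x_obs = 0.8
z_grid = np.linspace(-5.0, 5.0, 10_001)          # brute-force grid over z

log_prior = -0.5 * z_grid**2 - 0.5 * np.log(2 * np.pi)
log_lik = -0.5 * ((x_obs - z_grid) / 0.5)**2 - np.log(0.5 * np.sqrt(2 * np.pi))
log_joint = log_prior + log_lik                   # log p(x, z) on the grid

# The "intractable" part: the normalizer p(x) = integral of p(x, z) over z,
# here approximated by a Riemann sum over the grid.
dz = z_grid[1] - z_grid[0]
evidence = np.sum(np.exp(log_joint)) * dz
posterior = np.exp(log_joint) / evidence          # p(z | x) on the grid

print(f"p(x) ~ {evidence:.4f}")
print(f"posterior mean ~ {np.sum(z_grid * posterior) * dz:.3f}")  # exact: 0.8 * 1/(1 + 0.25) = 0.64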
Approximate Bayesian Inference
Claim: You don’t want the exact posterior. You want a scalable method that provides a compact approximate posterior.
For example:
▷ Sampling methods (HMC, Langevin, …) → finite set of samples from the posterior
▷ Mean field variational inference → fully factorized variational distribution
▷ Structured variational inference → distribution with some correlations
▷ Laplace approximation → posterior mode + curvature at the mode
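For concreteness, the approximating families in the list above can be written roughly as follows (a sketch in standard notation, not taken from the slides):

\begin{align*}
  \text{sampling methods:} \quad & p(z \mid x) \approx \tfrac{1}{S} \textstyle\sum_{s=1}^{S} \delta\big(z - z^{(s)}\big), \quad z^{(s)} \sim p(z \mid x) \\
  \text{mean field VI:} \quad & q(z) = \textstyle\prod_i q_i(z_i) \\
  \text{structured VI:} \quad & q(z) = \text{partial factorization that keeps selected correlations} \\
  \text{Laplace:} \quad & q(z) = \mathcal{N}\big(z;\, z^*,\, H^{-1}\big), \quad z^* = \arg\max_z \log p(x, z), \; H = -\nabla_z^2 \log p(x, z)\big|_{z^*}
\end{align*}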
Which approximation works best? → Answer: best on what task?
Claim: Whether a certain approximation (of the exact posterior) is good or bad depends on the constraints of the application.
Corollary: Foundational research on scalable Bayesian inference methods should be guided by resource-constrained applications.
My Research Interests
Scalable Bayesian Inference Methods, applied under resource constraints:
▷ Expensive Data: Bayesian ML for Natural Sciences
▷ Low Bandwidth: Neural Compression
▷ Low Compute & High Latency: Decentralized ML
Focus of Today’s Talk
Scalable Bayesian Inference Methods (part 1), applied to:
▷ Expensive Data: Bayesian ML for Natural Sciences
▷ Low Bandwidth: Neural Compression (part 2)
▷ Low Compute & High Latency: Decentralized ML (part 3)
Collaborators: Cheng Zhang (Microsoft Research, Cambridge, UK), Manfred Opper (TU Berlin), Stephan Mandt (UC Irvine)
Variational Inference: The Basic Idea
[recommended reviews: Blei et al., 2017; Zhang et al., 2019]
[Figure: the space of all probability distributions over latent variables z, with the variational family as a subset; starting from an initial candidate, minimize a distance measure to the true posterior to find the best approximation within the family.]
Our contributions in this picture:
▷ Extend the variational family: [RB, Mandt, ICML 2017]
▷ Faster convergence in presence of continuous symmetries: [RB, Mandt, ICML 2018]
▷ Better distance measure: [RB, Zhang, Opper, Mandt, NeurIPS 2017]
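In formulas (standard VI with the KL divergence as the distance measure; this is the textbook formulation, see the reviews cited above):

\[
  \log p(x) \;=\; \underbrace{\mathbb{E}_{q(z)}\!\big[\log p(x, z) - \log q(z)\big]}_{\text{ELBO}(q)}
  \;+\; \underbrace{\mathrm{KL}\!\big(q(z) \,\|\, p(z \mid x)\big)}_{\ge 0}
\]

so maximizing the ELBO over the variational family is the same as minimizing the KL divergence from q to the true posterior.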
Our Proposal in 100 Seconds
[RB, Zhang, Opper, Mandt, NeurIPS 2017]
▷ Reminder: we want to evaluate p(x) = ∫p(x, z) dz
▷ Standard variational inference: maximize a lower bound (the ELBO) on this quantity.
▷ Our new lower bound: tighter, with correction terms (details on the next slides).
▷ Observation: similar to “perturbation theory” in theoretical physics.
(image: The Big Bang Theory, Warner Bros. Television)
VI = Biased Importance Sampling
▷ Importance sampling gives an exact equality, p(x) = E_q[ p(x, z) / q(z) ], but its estimates have high variance.
▷ Variational inference reduces variance at the cost of introducing a bias: the ELBO is only a lower bound.
▷ Can we make the bound tighter?
Generalized Lower Bounds
▷ Conventional ELBO uses Jensen’s inequality:
▷ Holds not only for the logarithm but for any concave function:
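Spelled out (a sketch in the notation used above, where f is applied to the importance weight p(x, z)/q(z)):

\[
  \log p(x) \;=\; \log \mathbb{E}_{q(z)}\!\left[\frac{p(x, z)}{q(z)}\right]
  \;\ge\; \mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z)}{q(z)}\right] \;=\; \text{ELBO}(q)
  \qquad\text{(Jensen, $\log$ concave)}
\]
\[
  f\big(p(x)\big) \;=\; f\!\left(\mathbb{E}_{q(z)}\!\left[\frac{p(x, z)}{q(z)}\right]\right)
  \;\ge\; \mathbb{E}_{q(z)}\!\left[f\!\left(\frac{p(x, z)}{q(z)}\right)\right]
  \qquad\text{(for any concave $f$)}
\]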
Variance-Bias Trade-Off
[Bamler, Zhang, Opper, and Mandt, NIPS 2017; Bamler, Zhang, Opper, and Mandt, JSTAT 2019]
Generalized lower bound: f(p(x)) ⩾ E_q[ f( p(x, z) / q(z) ) ], i.e., biased importance sampling (with lower variance).
Spectrum of choices for f:
▷ f = id (importance sampling): small bias, high variance
▷ f(ξ) = ξ^(1−α) (alpha-VI) [e.g., Li & Turner, 2016]
▷ f = log (standard VI): large bias, low variance
▷ Idea: set f(ξ) = Polynomial(log ξ) such that f ≈ id
Perturbative Black Box VI
[Bamler, Zhang, Opper, and Mandt, NIPS 2017; Bamler, Zhang, Opper, and Mandt, JSTAT 2019]
▷ Taylor series in β:
→ ELBO + perturbative corrections, i.e., as in cumulant expansion, but:
→ lower bound on p(x) for any V0 and any odd order K.
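A sketch of the resulting bound, written with V(z) := log p(x, z) − log q(z) and a free reference point V0 (the β-scaling and the exact parameterization are in the papers):

\[
  p(x) \;=\; \mathbb{E}_{q(z)}\!\big[e^{V(z)}\big]
  \;\ge\; \mathbb{E}_{q(z)}\!\left[ e^{V_0} \sum_{k=0}^{K} \frac{\big(V(z) - V_0\big)^k}{k!} \right]
  \qquad\text{for any } V_0 \text{ and any odd } K,
\]

because the Taylor remainder of the exponential is nonnegative at odd orders. For K = 1 and V0 = E_q[V(z)] the right-hand side equals exp(ELBO), so the higher-order terms are the perturbative corrections on top of standard VI.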
Properties of the Perturbative Bound
▷ Tighter than ELBO due to correction terms.
[Figure: the same bias-variance spectrum as before, from f = id (importance sampling; small bias, high variance) to f = log (standard VI; large bias, low variance), with the perturbative bound in between.]
▷ Lower gradient variance than the α-bound (because it is only polynomial in V).
Results for Variational Inference
▷ Less prone to underestimation of posterior variances than standard VI with KL divergence.
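A minimal numerical illustration of the variance-underestimation issue mentioned above (a standard textbook fact, not an experiment from the paper): fit a fully factorized Gaussian to a correlated 2-D Gaussian posterior by minimizing KL(q || p). The optimal mean-field variances are the inverse diagonal of the precision matrix, which is smaller than the true marginal variances.

import numpy as np

# "True" posterior: a correlated 2-D Gaussian with covariance Sigma.
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Lambda = np.linalg.inv(Sigma)            # precision matrix

# Mean-field VI with KL(q || p): the optimal factorized Gaussian has
# variances 1 / Lambda_ii (a standard closed-form result for Gaussian targets).
mean_field_var = 1.0 / np.diag(Lambda)

true_marginal_var = np.diag(Sigma)
print("true marginal variances:", true_marginal_var)   # [1.0, 1.0]
print("mean-field variances:   ", mean_field_var)      # [0.19, 0.19] -> underestimated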
Results for Variational EM
Also: better estimates of p(x).
▷ Variational Autoencoder: better predictive performance with small training sets (here: subsets of MNIST).
Enough Equations ...
… back to pretty pictures.
(Photo: Nymphenburg Park, Munich #nofilter)
Outline
Scalable Bayesian Inference Methods (part 1), applied to:
▷ Expensive Data: Bayesian ML for Natural Sciences
▷ Low Bandwidth: Neural Compression (part 2)
▷ Low Compute & High Latency: Decentralized ML (part 3)
Collaborators: Stephan Mandt (UC Irvine), Yibo Yang (UC Irvine)
A New Perspective for Compression
[home.cern]
[Ahrens et al., Nature, 2013]
Example: Lossy Image Compression
[Yang, RB, Mandt, NeurIPS 2020]
[Rate-distortion comparison: ours; ours (with lossy bits-back coding); reference: best non-neural codec (BPG 4:4:4) [Bellard, 2014]; previously best neural method [Minnen et al., 2018]; our starting point [Ballé et al., 2018].]
Compression & Latent Variable Models
[pioneered by Ballé et al., 2017; Theis et al., 2017]
a) Data Compression: infer a latent representation z* from the data x, entropy-code z* into a compressed bitstring, and reconstruct x' = g(z*).
b) Model Compression. Example: Bayesian word embeddings [Barkan, 2017] (latent variables z with N observations x).
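A toy sketch of the pipeline in panel (a), with made-up components (a fixed linear "decoder" g and uniform quantization; real neural codecs learn g and the entropy model):

import numpy as np

rng = np.random.default_rng(0)

# Made-up linear "decoder" g: x' = g(z) = G @ z  (real methods use a neural network).
G = rng.normal(size=(8, 3))
g = lambda z: G @ z

x = g(np.array([1.2, -0.7, 0.3])) + 0.01 * rng.normal(size=8)   # some data to compress

# "infer": pick the latent z* that reconstructs x well (least squares here).
z_star, *_ = np.linalg.lstsq(G, x, rcond=None)

# Quantize z* so that it can be entropy-coded into a short bitstring.
step = 0.1
z_quantized = np.round(z_star / step).astype(int)

# Entropy coding itself is omitted; under a model q over the integer symbols,
# an ideal entropy coder would need about sum(-log2 q(symbol)) bits.
x_reconstructed = g(z_quantized * step)
print("quantized latents:", z_quantized)
print("reconstruction error:", np.linalg.norm(x - x_reconstructed))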
Probabilities Matter in Compression
Classical Example: Arithmetic Coding
Don’t transmit what you can predict.
E[bitrate] ⩾ CrossEntropy(data || model)
➜ Need a good generative probabilistic model.
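A small numerical check of the inequality above (toy symbol distributions, nothing from the talk): with an ideal entropy coder, a symbol with model probability q costs about −log2 q bits, so the expected bitrate is the cross-entropy between the data distribution and the model, minimized when the model matches the data.

import numpy as np

p_data = np.array([0.5, 0.25, 0.125, 0.125])   # true symbol frequencies
q_good = p_data                                 # model that matches the data
q_bad = np.array([0.25, 0.25, 0.25, 0.25])      # mismatched model

def expected_bitrate(p, q):
    # Ideal entropy coder: a symbol with model probability q_i costs -log2(q_i) bits.
    return float(np.sum(p * -np.log2(q)))

print("entropy of the data:", expected_bitrate(p_data, p_data))  # 1.75 bits/symbol
print("bitrate, good model:", expected_bitrate(p_data, q_good))  # 1.75 bits/symbol
print("bitrate, bad model: ", expected_bitrate(p_data, q_bad))   # 2.00 bits/symbol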
Our New Aspect: [Yang, RB & Mandt, ICML 2020]
Don’t transmit what you’re not sure about.
➜ Need good posterior uncertainty estimates (→ Bayesian inference).
What’s the Population of Rome?
© David Iliff, CC BY-SA 2.5
100,000
2,879,728
Variational Bayesian Quantization
[Figure: adaptive quantization; different latent coordinates are quantized with different precision (2 bits vs. 3 bits) depending on their posterior uncertainty.]
[Yang, RB & Mandt, ICML 2020]
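A toy illustration of the idea only, not the actual VBQ algorithm from the paper: spend fewer bits on parameters whose posterior is wide, here by quantizing each parameter on a grid whose spacing is proportional to its posterior standard deviation.

import numpy as np

# Posterior means and standard deviations of a few model parameters (made up).
post_mean = np.array([0.812, -1.274, 0.031, 2.456])
post_std = np.array([0.01, 0.30, 0.05, 0.90])

# Quantization step proportional to the posterior std: uncertain parameters
# get a coarser grid and therefore cheaper (lower-entropy) quantized values.
step = post_std / 2.0
quantized = np.round(post_mean / step) * step

for m, s, q in zip(post_mean, post_std, quantized):
    print(f"mean={m:+.3f}  std={s:.2f}  ->  transmitted value={q:+.3f}")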
Example: JPEG
original
Example: JPEG
JPEG @ 0.24 bits per pixel
Example: VBQ
Ours @ 0.24 bits per pixel
Scaling it Up
[Plot: performance on a semantic reasoning benchmark vs. compressed model size for a Bayesian word embedding model (10⁷ parameters) [Barkan, 2017]; VBQ (proposed) compared against baselines.]
[Yang, RB, Mandt, ICML 2020]
Reminder: Dynamic Word Embeddings
[RB & Mandt, ICML 2017]
Dynamic Word Embeddings Model
➜ About 600 million model parameters: 2.5 GB uncompressed, 34 MB compressed.
➜ But: we know the posterior, so we can apply “Don’t transmit what you’re not sure about”.
The Linguistic Flux Capacitor
➜ Try it out: https://robamler.github.io/linguistic-flux-capacitor
Outline
Scalable Bayesian Inference Methods (part 1), applied to:
▷ Expensive Data: Bayesian ML for Natural Sciences
▷ Low Bandwidth: Neural Compression (part 2)
▷ Low Compute & High Latency: Decentralized ML (part 3)
Super Powers That Never Wanted to Be
ML Solidifies The New Super Powers
[Diagram: users send their data to a handful of centralized services and get predictions back.]
Consolidation of Power: more users ➜ more training data ➜ better ML models ➜ more users ➜ …
Unrelated: Federated Learning
[Diagram: clients send updates to a central server and receive the model back; the server is typically controlled by 1 institution.]
➜ Central authority for model consensus.
Claim: Current ML research is mainly oriented around the prevalent business model of centralized services.
Working Thesis: To overcome centralization, we should explore decentralized learning algorithms in lockstep with decentralized business models.
Vision: Decentralized Machine Learning
[Diagram: peers exchange model updates directly with each other, without a central server.]
→ Blockchain
Computer Science is About Abstractions
[Diagram: an experimentation platform for decentralized ML provides the abstraction on top of which research questions about models & learning algorithms can be studied.]
Recap
An example model: Dynamic Word Embeddings
A new inference algorithm: Perturbative Black Box Variational Inference
An emerging applied field: Neural Compression
An outlook: decentralization of power in machine learning
Proof of Concept: Custom ML Blockchain