1 of 15

Variational Bayesian Quantization

Yibo Yang*, Robert Bamler*, Stephan Mandt

(*equal contribution)

University of California, Irvine

International Conference on Machine Learning • June 13, 2020

Our paper was previously titled Variable-bitrate Neural Compression via Bayesian Arithmetic Coding. We changed the title to Variational Bayesian Quantization based on reviewer feedback.

2 of 15

Data is abundant

Latent variable models

[Figure: a latent variable model. A latent vector z = (0.123, −0.987, 0.151, …) is drawn from the prior, z ~ p(z), and mapped through a generative network g to data x ~ p(x|z).]
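To make the generative process concrete, here is a minimal sampling sketch. The Gaussian prior, the tanh decoder g, the dimensions, and the noise level are placeholder assumptions for illustration, not the models used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent variable model: z ~ p(z) = N(0, I), x ~ p(x | z) = N(g(z), 0.1^2 I).
# The decoder g is an arbitrary fixed nonlinearity standing in for a trained
# generative network.
latent_dim, data_dim = 3, 8
W = rng.normal(size=(data_dim, latent_dim))

def g(z):
    """Decoder g: maps a latent vector to the data-space mean."""
    return np.tanh(W @ z)

z = rng.normal(size=latent_dim)              # z ~ p(z)
x = g(z) + 0.1 * rng.normal(size=data_dim)   # x ~ p(x | z)
print("latent z :", np.round(z, 3))
print("sample x :", np.round(x, 3))
```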


3 of 15

Latent variable models + compression

Data Compression: infer a latent representation z* from data x; the receiver reconstructs x’ = g(z*).

Model Compression: compress the latent variables z of a generative model g fit to N observations x.

Example: Bayesian word embedding model [Barkan 2017], fit to word co-occurrence counts:

  counts   queen  woman  girl  boy  man  ...
  queen        0      3     2    0    1
  woman        0      0     7    2    5
  ...
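As an executable caricature of the data-compression panel (infer z*, then reconstruct x’ = g(z*)), the sketch below uses a linear-Gaussian model; this hypothetical stand-in is chosen only because its posterior is available in closed form, whereas the paper's experiments use VAEs and Bayesian word embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-Gaussian latent variable model: z ~ N(0, I), x = W z + noise.
latent_dim, data_dim, noise_std = 3, 8, 0.1
W = rng.normal(size=(data_dim, latent_dim))

z_true = rng.normal(size=latent_dim)
x = W @ z_true + noise_std * rng.normal(size=data_dim)

# "infer": the exact Gaussian posterior p(z | x) = N(mu, Sigma)
Sigma = np.linalg.inv(np.eye(latent_dim) + W.T @ W / noise_std**2)
mu = Sigma @ W.T @ x / noise_std**2    # z* = posterior mean (to be quantized and transmitted)

x_reconstructed = W @ mu               # x' = g(z*); here g is just the linear map W
print("reconstruction error:", np.linalg.norm(x - x_reconstructed))
```

The diagonal of Sigma is the per-coordinate posterior uncertainty that the talk exploits later.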


4 of 15

Quantizing continuous latent variables

[Figure: quantization of a latent vector, z* = (0.123, −0.987, 0.151, …) → (0.13, −0.98, 0.13, …).]
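The rounding operation sketched above is the naive baseline: snap every coordinate to the same fixed grid, regardless of how uncertain the model is about it. A one-line version (the grid step 0.01 is an arbitrary choice here):

```python
import numpy as np

z_star = np.array([0.123, -0.987, 0.151])
step = 0.01                              # same grid spacing for every coordinate
z_hat = step * np.round(z_star / step)   # -> approximately [0.12, -0.99, 0.15]
print(z_hat, "max error:", np.max(np.abs(z_hat - z_star)))
```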

Current neural compression methods:

  • Optimize a rate-distortion objective end-to-end;
  • Embed quantization into training, approximately:
      • Straight-through estimator [Bengio et al., 2013]
      • Stochastic binarization [Toderici et al., 2016]
      • Soft-to-hard VQ [Agustsson et al., 2017]
      • Adding uniform noise [Ballé et al., 2017] (see the numerical check after this list)

  • Require tailoring the training procedure, or even the generative model itself, to quantization;
  • Most require retraining a new model for each rate-distortion trade-off.
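As a quick numerical check of why the uniform-noise relaxation works as a training-time stand-in for rounding (a generic illustration, not code from the cited papers): for a smooth variable on a fine grid, the rounding error is approximately uniform over half a grid cell in each direction, so replacing rounding with added uniform noise matches its error statistics while remaining differentiable. The grid step and sample size below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.1                                        # quantization step (arbitrary for the demo)
z = rng.normal(size=100_000)

rounding_error = delta * np.round(z / delta) - z                 # error of hard quantization
noise_proxy = delta * rng.uniform(-0.5, 0.5, size=z.shape)       # differentiable proxy [Ballé et al., 2017]

print("rounding error std:", rounding_error.std())   # both close to delta / sqrt(12) ~= 0.029
print("uniform noise std :", noise_proxy.std())
```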


5 of 15

Contributions

A new algorithm, Variational Bayesian Quantization (VBQ), for compressing latent representations in a large class of generative models. VBQ:

  1. Operates completely after training:
    • plug-and-play compression for pre-trained models;
    • works for both data compression and model compression;
    • separates the data modeling (and model training) task from quantization and compression;
  2. Performs variable-bitrate compression with a single model, outperforming JPEG with a single standard variational autoencoder (VAE);
  3. Exploits posterior uncertainty for compression; the only other work we know of that does so is bits-back coding for lossless compression [Wallace, 1990; Hinton and Van Camp, 1993].

What’s the best way to quantize latent variables in a given generative model?


6 of 15

Data Compression: When Probabilities Matter

Don’t transmit what you can predict.

Better generative probabilistic models lead to better compression rates:

minimal bitrate = −log₂ p_model(message)

E[bitrate] ⩾ Crossentropy(data || model)

Classical Example: Arithmetic Coding
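To make "minimal bitrate = −log₂ p_model(message)" concrete, here is a toy arithmetic encoder: it narrows an interval by each symbol's probability and emits just enough bits to identify a point inside the final interval. Exact rational arithmetic keeps the sketch short, unlike the renormalized integer arithmetic of practical coders; the alphabet and probabilities in the demo are made up.

```python
import math
from fractions import Fraction

def arithmetic_encode(message, probs):
    """Toy arithmetic coder: code length lands within ~2 bits of -log2 P(message)."""
    symbols = sorted(probs)
    low, width = Fraction(0), Fraction(1)
    for s in message:
        cum = sum((probs[t] for t in symbols if t < s), Fraction(0))
        low += width * cum                  # narrow the interval to the symbol's slice
        width *= probs[s]
    n_bits = math.ceil(-math.log2(width)) + 1   # enough bits to land inside [low, low + width)
    code = math.ceil(low * 2**n_bits)           # binary expansion of a point in the interval
    return format(code, f"0{n_bits}b")

probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
message = "abcaab"
codeword = arithmetic_encode(message, probs)
ideal = -math.log2(math.prod(probs[s] for s in message))
print(f"codeword {codeword} ({len(codeword)} bits), ideal rate {ideal:.1f} bits")
```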


7 of 15

Lossy Data Compression: When Uncertainty Matters

Don’t transmit what you can predict.

Better generative probabilistic models lead to better compression rates:

minimal bitrate = −log₂ p_model(message)

E[bitrate] ⩾ Crossentropy(data || model)

Don’t transmit what you’re not sure about.

Better estimates of posterior uncertainties allow for more efficient use of bandwidth.
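To make "don't transmit what you're not sure about" precise, here is a standard decomposition (not specific to the paper): if a latent coordinate has Gaussian posterior q(z | x) = N(μ, σ²) and the receiver gets a quantized value ẑ, the posterior-expected squared error splits into an irreducible part and a quantization part,

```latex
\[
  \mathbb{E}_{q(z \mid x)}\!\left[(z-\hat z)^2\right]
    \;=\; \underbrace{\sigma^2}_{\text{irreducible}}
      \;+\; \underbrace{(\mu-\hat z)^2}_{\text{quantization error}},
  \qquad q(z \mid x) = \mathcal{N}(\mu, \sigma^2).
\]
```

Spending more bits on ẑ only shrinks the second term, so coordinates with large posterior variance gain little from precise codes and can be quantized coarsely, freeing bandwidth for coordinates the model is confident about.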


8 of 15

What’s the Population of Rome?

© David Iliff, CC BY-SA 2.5

  • In the year 500 AD: 100,000
  • On April 30, 2018: 2,879,728


9 of 15

Variational Bayesian Quantization

Now: continuous observation

(Reminder: Classical Arithmetic Coding)
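The slide's animation is not reproduced here. As a rough, unofficial sketch of the idea suggested by the titles, combine the two ingredients above: place candidate reconstruction points where arithmetic coding over the prior would put k-bit codewords (prior quantiles of dyadic intervals), and pick, per latent coordinate, the candidate that trades code length against mismatch with that coordinate's posterior. The specific grid, objective, and constants below are our assumptions for illustration; see the paper for the actual VBQ algorithm.

```python
import math
from statistics import NormalDist

PRIOR = NormalDist(0.0, 1.0)   # standard Gaussian prior over one latent coordinate

def vbq_like_quantize(mu, sigma, lam, max_bits=8):
    """Pick a reconstruction point and code length for a coordinate with
    posterior N(mu, sigma^2). Simplified sketch, not the paper's exact method."""
    best = (math.inf, 0.0, 0)                                   # (objective, reconstruction, bits)
    for k in range(max_bits + 1):
        for i in range(2**k):
            z_hat = PRIOR.inv_cdf((i + 0.5) / 2**k)             # k-bit grid point under the prior CDF
            distortion = (z_hat - mu)**2 / (2 * sigma**2)       # posterior neg. log-density (up to a constant)
            objective = distortion + lam * k                    # rate-distortion trade-off; lam is a free knob
            if objective < best[0]:
                best = (objective, z_hat, k)
    return best[1], best[2]

# Confident coordinates get long codes; uncertain ones get short codes.
for sigma in (0.02, 0.5):
    z_hat, bits = vbq_like_quantize(mu=0.8, sigma=sigma, lam=0.5)
    print(f"posterior std {sigma}: send {bits} bits, reconstruct {z_hat:.3f}")
```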


10 of 15

Real Example: Compressing Neural Word Embeddings

[Figure: word-analogy probe “king” − “man” + “woman” ≈ “queen”; compression results for VBQ (proposed) on a Bayesian word embedding model with 10⁷ parameters.]
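One common way to probe how much lossy compression hurts such embeddings is the analogy test shown in the figure. The sketch below is a generic evaluation recipe; the vectors are random placeholders, not trained or compressed embeddings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embedding table; in the experiment these would be the (compressed)
# means of the Bayesian word embedding model.
vocab = ["king", "man", "woman", "queen", "girl", "boy"]
emb = {w: rng.normal(size=50) for w in vocab}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query = emb["king"] - emb["man"] + emb["woman"]   # should land near "queen" for good embeddings
candidates = [w for w in vocab if w not in {"king", "man", "woman"}]
best = max(candidates, key=lambda w: cosine(query, emb[w]))
print("analogy completion:", best)   # meaningless for random vectors; the point is the procedure
```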


11 of 15

Data Compression With Deep Probabilistic Models

Results for Image Compression:


12 of 15

original


13 of 15

JPEG @ 0.24 BPP


14 of 15

ours @ 0.24 BPP


15 of 15

Conclusion

  • Goal: quantizing latent variables in post-processing
  • Solution: new algorithm, Variational Bayesian Quantization (VBQ), for quantizing latent representations in a wide class of latent variable models:
    • takes posterior uncertainty into account;
    • modular, plug-and-play compression of data and model;
    • separates modeling from compression;
    • variable-rate compression with a single model.
  • Consequences
    • model compression: improved lossy compression of Bayesian word embeddings, outperforming all baselines that use uniform quantization;
    • data compression: lossy image compression by quantizing a single Gaussian VAE, outperforming all baselines including JPEG;
    • potentially offers a new way of evaluating/comparing latent variable models.
