1 of 17

On Variational Bounds of Mutual Information

Ben Poole, Sherjil Ozair, Aäron van den Oord, Alexander A. Alemi, George Tucker

ICML 2019


2 of 17

Mutual Information

Define:

  • I(X;Y): the amount of information gained about X by observing Y; higher MI means more shared information.
    I(X;Y) = E_{p(x,y)}[ log p(x,y) / (p(x) p(y)) ]
  • Used for
    • Representation learning: I(data; representation)
    • ..
  • Challenge
    • p(x,y), p(x), and p(y) are unknown
    • Prefer (scalable) bounds that can be optimized (see the sketch below)
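
To make the challenge concrete, here is a toy sketch (my own illustration, not from the slides or the paper's code): for correlated Gaussians the densities p(y|x) and p(y) are known in closed form, so I(X;Y) can be estimated directly by Monte Carlo; the correlation rho and all variable names are assumptions for illustration. When the densities are unknown, as in representation learning, this direct route is unavailable and we need the bounds that follow.

```python
# Toy sketch: Monte Carlo MI for correlated Gaussians with known densities.
import numpy as np
from scipy.stats import norm

rho = 0.8                                    # assumed correlation between X and Y
true_mi = -0.5 * np.log(1 - rho ** 2)        # closed-form MI for this joint Gaussian

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(100_000)

# I(X;Y) = E_{p(x,y)}[log p(y|x) - log p(y)], estimated by averaging over samples
log_p_y_given_x = norm.logpdf(y, loc=rho * x, scale=np.sqrt(1 - rho ** 2))
log_p_y = norm.logpdf(y, loc=0.0, scale=1.0)
mc_mi = np.mean(log_p_y_given_x - log_p_y)

print(f"true MI = {true_mi:.3f}, Monte Carlo MI = {mc_mi:.3f}")
```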


3 of 17

Contribution

  • Review and present recent variational estimators of MI.
  • Propose an interpolated lower bound for a better trade-off between bias and variance



5 of 17

Lower Bound of I(X;Y)

  • Introduce tractable variational distribution q:


Challenge:

Require a tractable q(x|y) when x has high dimension

Barber & Agakov (2003)

  I(X;Y) ≥ E_{p(x,y)}[log q(x|y)] + h(X) ≡ I_BA    (1)
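
A minimal sketch of the BA bound (1) on the same correlated-Gaussian toy problem; this is my own illustration, not the authors' code, and the decoder q(x|y) = N(x; b*y, s^2) with hand-picked b and s is purely an assumption. Any choice of q gives a valid lower bound; it becomes tight when q(x|y) matches the true posterior p(x|y).

```python
# Toy sketch of I_BA = E_{p(x,y)}[log q(x|y)] + h(X) with a Gaussian decoder q.
import numpy as np
from scipy.stats import norm

rho = 0.8
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(100_000)

h_x = 0.5 * np.log(2 * np.pi * np.e)         # differential entropy of X ~ N(0, 1)
true_mi = -0.5 * np.log(1 - rho ** 2)

def ba_bound(b, s):
    """E_{p(x,y)}[log q(x|y)] + h(X) with q(x|y) = N(x; b*y, s^2)."""
    return np.mean(norm.logpdf(x, loc=b * y, scale=s)) + h_x

print(f"true MI               = {true_mi:.3f}")
print(f"I_BA, mis-specified q = {ba_bound(0.5, 1.0):.3f}")                     # below true MI
print(f"I_BA, optimal q       = {ba_bound(rho, np.sqrt(1 - rho ** 2)):.3f}")   # tight
```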

6 of 17

Lower Bound of I(X;Y)

  • Introduce tractable variational distribution q:
    • Require tractable q(x|y)
  • Use energy based q:


Introduce critic function f(x,y), tractable

Partition function Z(y), intractable

  I(X;Y) ≥ E_{p(x,y)}[log q(x|y)] + h(X) ≡ I_BA    (1)

  q(x|y) = p(x) e^{f(x,y)} / Z(y),   where Z(y) = E_{p(x)}[e^{f(x,y)}]    (2)

  I(X;Y) ≥ E_{p(x,y)}[f(x,y)] - E_{p(y)}[log Z(y)] ≡ I_UBA    (3)

7 of 17

Lower Bound of I(X;Y)

  • Introduce tractable variational distribution q:
    • Require tractable q(x|y)
  • Use energy based family for q:
    • Learn tractable critic function f(x,y)
    • Intractable log Z(y)
  • Introduce a lower bound on -log Z(y) with a baseline a(y) > 0:

  • Tight when a(y) = Z(y)


  I(X;Y) ≥ E_{p(x,y)}[log q(x|y)] + h(X) ≡ I_BA    (1)

  I(X;Y) ≥ E_{p(x,y)}[f(x,y)] - E_{p(y)}[log Z(y)] ≡ I_UBA    (3)

  log Z(y) ≤ Z(y)/a(y) + log a(y) - 1    (4)

  I(X;Y) ≥ E_{p(x,y)}[f(x,y)] - E_{p(y)}[ E_{p(x)}[e^{f(x,y)}]/a(y) + log a(y) - 1 ] ≡ I_TUBA    (5)
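
A minimal sketch of how I_TUBA in (5) can be estimated from a mini-batch; the function name, the scores matrix convention scores[i, j] = f(x_i, y_j), and the per-example log-baseline array log_a are my own assumptions, not the authors' implementation. Diagonal entries of scores play the role of samples from p(x,y); off-diagonal entries approximate samples from p(x)p(y).

```python
import numpy as np

def tuba_lower_bound(scores, log_a):
    """I_TUBA = E_{p(x,y)}[f] - E_{p(y)}[E_{p(x)}[e^f]/a(y) + log a(y) - 1].

    scores[i, j] = f(x_i, y_j) for a mini-batch (x_i, y_i) ~ p(x, y);
    log_a[j] is an estimate of log a(y_j).
    """
    n = scores.shape[0]
    joint_term = np.mean(np.diag(scores))                   # (x_i, y_i) pairs ~ p(x, y)
    ratios = np.exp(scores - log_a[None, :])                # e^{f(x_i, y_j)} / a(y_j)
    marginal_term = ratios[~np.eye(n, dtype=bool)].mean()   # i != j pairs ~ p(x) p(y)
    return joint_term - marginal_term - log_a.mean() + 1.0
```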

8 of 17

Closer look at TUBA

Two components, each learned by a neural network:

  • f(x,y), the critic function, aims to approximate the log density ratio log p(x|y)/p(x) (up to additive terms)
  • a(y), the baseline, aims to approximate the partition function Z(y)
    • NWJ* estimator: a(y) set to the constant e
    • NCE estimator: a(y) set to the average of e^{f(x,y)} over the different x in the mini-batch


*In addition, NWJ uses a self-normalized f(x,y)

(Nguyen et al., 2010) (Belghazi et al., 2018)

  I(X;Y) ≥ E_{p(x,y)}[f(x,y)] - E_{p(y)}[ E_{p(x)}[e^{f(x,y)}]/a(y) + log a(y) - 1 ] ≡ I_TUBA    (5)
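
Under the same scores[i, j] = f(x_i, y_j) convention (again a sketch of mine, not the paper's code), the two special cases above reduce to short expressions. Note that the NCE value can never exceed log(batch size), which matters on the following slides.

```python
import numpy as np
from scipy.special import logsumexp

def nwj_lower_bound(scores):
    """NWJ: TUBA with the constant baseline a(y) = e."""
    n = scores.shape[0]
    joint_term = np.mean(np.diag(scores))
    marginal_term = np.exp(scores[~np.eye(n, dtype=bool)] - 1.0).mean()  # e^{f - 1} over p(x)p(y)
    return joint_term - marginal_term

def nce_lower_bound(scores):
    """InfoNCE: a(y) is the in-batch average of e^{f(x, y)} over x.

    Equivalent to mean_i [ f(x_i, y_i) - log mean_j e^{f(x_j, y_i)} ],
    and therefore capped at log(batch size).
    """
    n = scores.shape[0]
    return np.mean(np.diag(scores) - logsumexp(scores, axis=0) + np.log(n))
```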

9 of 17

Recap - Lower Bound of I(X;Y)

  • BA
    • Tractable to optimize
    • Requires a tractable q(x|y)
  • UBA
    • Learns a tractable critic function f(x,y)
    • Intractable log Z(y)
  • TUBA
    • NWJ* estimator: a(y) set to the constant e
    • NCE estimator: a(y) set to the average of e^{f(x,y)} over the different x in the mini-batch


*In addition, NWJ uses a self-normalized f(x,y)

(Nguyen et al., 2010) (Belghazi et al., 2018) (van den Oord et al., 2018)

  I(X;Y) ≥ E_{p(x,y)}[log q(x|y)] + h(X) ≡ I_BA    (1)

  I(X;Y) ≥ E_{p(x,y)}[f(x,y)] - E_{p(y)}[log Z(y)] ≡ I_UBA    (3)

  I(X;Y) ≥ E_{p(x,y)}[f(x,y)] - E_{p(y)}[ E_{p(x)}[e^{f(x,y)}]/a(y) + log a(y) - 1 ] ≡ I_TUBA    (5)

10 of 17

Proposed bound

Existing lower bounds have either high bias or high variance

  • NWJ* estimator, single-sample based
    • high variance, low bias
  • NCE estimator, multi-sample based
    • low variance, high bias
    • estimate is capped at log(batch size)

⇒ Interpolate between the NWJ and NCE estimators (see the sketch below)
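
A rough sketch of the interpolation idea only; the exact I_alpha bound in the paper is more involved (it relies on a multi-sample argument), so treat this as an assumption-laden illustration rather than the authors' estimator. The idea: mix the in-batch Monte Carlo estimate of Z(y) with a learned baseline q(y) via alpha, then plug the mixed baseline into the NWJ/TUBA form. alpha, log_q_y, and the function name are my own notation.

```python
import numpy as np
from scipy.special import logsumexp

def interpolated_lower_bound(scores, log_q_y, alpha):
    """Sketch: alpha -> 1 behaves like NCE (low variance, capped near log batch size),
    alpha -> 0 behaves like TUBA/NWJ with baseline q(y) (low bias, high variance).
    Requires 0 < alpha < 1; scores[i, j] = f(x_i, y_j); log_q_y is log q(y_j)."""
    n = scores.shape[0]
    # log of the mixed baseline: alpha * mean_i e^{f(x_i, y_j)} + (1 - alpha) * q(y_j)
    log_batch_z = logsumexp(scores, axis=0) - np.log(n)
    log_a = np.logaddexp(np.log(alpha) + log_batch_z, np.log1p(-alpha) + log_q_y)
    joint_term = np.mean(np.diag(scores) - log_a)
    marginal_term = np.exp(scores - log_a[None, :])[~np.eye(n, dtype=bool)].mean()
    return 1.0 + joint_term - marginal_term
```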


11 of 17

Experiments

  • Study bias-variance tradeoff of Mutual Information lower bounds
  • Evaluate the accuracy of gradients of these bounds
  • Decoder-free representation learning on the dSprites dataset


12 of 17

Notebook


13 of 17

Experiments with Gaussian data

  • Study the bias-variance tradeoff of the MI estimators using the optimal critic (sketch below)
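
A rough reproduction sketch of this setup (my own code; rho, dimension, batch size, and trial count are assumed values): draw correlated Gaussians with known MI, form the optimal critic from the known log density ratio, and compare the mean and variance of the NWJ and NCE estimates (reusing nwj_lower_bound and nce_lower_bound from the earlier sketch) against the true MI. With high MI and a modest batch, NCE has low variance but is capped below log(batch size), while NWJ is closer to the truth on average but noisier.

```python
import numpy as np
from scipy.stats import norm

rho, dim, batch, trials = 0.8, 10, 64, 200
true_mi = -0.5 * dim * np.log(1 - rho ** 2)
rng = np.random.default_rng(0)

def optimal_scores(x, y):
    """scores[i, j] = log p(y_j | x_i) - log p(y_j), the optimal critic up to constants."""
    log_p_y_given_x = norm.logpdf(y[None, :, :], loc=rho * x[:, None, :],
                                  scale=np.sqrt(1 - rho ** 2)).sum(-1)
    log_p_y = norm.logpdf(y, loc=0.0, scale=1.0).sum(-1)
    return log_p_y_given_x - log_p_y[None, :]

nwj_vals, nce_vals = [], []
for _ in range(trials):
    x = rng.standard_normal((batch, dim))
    y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal((batch, dim))
    s = optimal_scores(x, y)
    nwj_vals.append(nwj_lower_bound(s + 1.0))   # NWJ's optimal critic includes a +1 shift
    nce_vals.append(nce_lower_bound(s))

print(f"true MI {true_mi:.2f} | NWJ mean {np.mean(nwj_vals):.2f}, var {np.var(nwj_vals):.3f}"
      f" | NCE mean {np.mean(nce_vals):.2f}, var {np.var(nce_vals):.3f}")
```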



15 of 17

dSprites Representation Learning

  • Decoder-free representation learning to recover spatial variables
    • Include additional regularization terms

    • Recovers position and scale variables but not rotation
    • Uses a structured estimator with a known p(y|x)


16 of 17

Key Ideas & Discussion

  • Survey upper and lower bounds of MI
  • Demonstrate the bias-variance trade-off
    • No low-variance, low-bias estimator for high MI and small batch sizes
  • Present the interpolated bound for MI

  • Infinite-data and i.i.d. assumptions are made
  • Consider estimating MI gradients rather than MI itself


17 of 17

Related Work & Follow-up

Contrastive Predictive Coding (van den Oord et al., 2018): https://arxiv.org/pdf/1807.03748.pdf

On Mutual Information Maximization for Representation Learning (Tschannen et al., 2019): https://arxiv.org/pdf/1907.13625.pdf

Learn the gradient directly: https://openreview.net/pdf?id=ByxaUgrFvH

Original VMI paper (Poole et al., 2019): https://arxiv.org/abs/1905.06922
