1 of 17

On Variational Bounds of Mutual Information

Ben Poole, Sherjil Ozair, Aäron van den Oord, Alexander A. Alemi, George Tucker

ICML 2019


2 of 17

Mutual Information

Define:

  • I(X;Y): the amount of information gained about X by observing Y; higher MI means more shared information.
    I(X;Y) = E_{p(x,y)}[ log p(x,y) / (p(x) p(y)) ]
  • Used for
    • Representation learning: I(data; representation)
    • ..
  • Challenge
    • p(x,y), p(x), and p(y) are unknown
    • Prefer (scalable) bounds that can be optimized (see the sketch below)
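
To make the challenge concrete, here is a toy sketch (my own illustration, not from the slides or the paper's code): for correlated Gaussians the densities p(y|x) and p(y) are known in closed form, so I(X;Y) can be estimated directly by Monte Carlo; the correlation rho and all variable names are assumptions for illustration. When the densities are unknown, as in representation learning, this direct route is unavailable and we need the bounds that follow.

```python
# Toy sketch: Monte Carlo MI for correlated Gaussians with known densities.
import numpy as np
from scipy.stats import norm

rho = 0.8                                    # assumed correlation between X and Y
true_mi = -0.5 * np.log(1 - rho ** 2)        # closed-form MI for this joint Gaussian

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(100_000)

# I(X;Y) = E_{p(x,y)}[log p(y|x) - log p(y)], estimated by averaging over samples
log_p_y_given_x = norm.logpdf(y, loc=rho * x, scale=np.sqrt(1 - rho ** 2))
log_p_y = norm.logpdf(y, loc=0.0, scale=1.0)
mc_mi = np.mean(log_p_y_given_x - log_p_y)

print(f"true MI = {true_mi:.3f}, Monte Carlo MI = {mc_mi:.3f}")
```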


3 of 17

Contribution

  • Review and present recent variational estimators of MI.
  • Propose an interpolated lower bound for a better trade-off between bias and variance



5 of 17

Lower Bound of I(X;Y)

  • Introduce tractable variational distribution q:


Challenge:

Require a tractable q(x|y) when x has high dimension

Barber & Agakov (2003)

  I(X;Y) ≥ E_{p(x,y)}[log q(x|y)] + h(X) ≡ I_BA    (1)
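
A minimal sketch of the BA bound (1) on the same correlated-Gaussian toy problem; this is my own illustration, not the authors' code, and the decoder q(x|y) = N(x; b*y, s^2) with hand-picked b and s is purely an assumption. Any choice of q gives a valid lower bound; it becomes tight when q(x|y) matches the true posterior p(x|y).

```python
# Toy sketch of I_BA = E_{p(x,y)}[log q(x|y)] + h(X) with a Gaussian decoder q.
import numpy as np
from scipy.stats import norm

rho = 0.8
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(100_000)

h_x = 0.5 * np.log(2 * np.pi * np.e)         # differential entropy of X ~ N(0, 1)
true_mi = -0.5 * np.log(1 - rho ** 2)

def ba_bound(b, s):
    """E_{p(x,y)}[log q(x|y)] + h(X) with q(x|y) = N(x; b*y, s^2)."""
    return np.mean(norm.logpdf(x, loc=b * y, scale=s)) + h_x

print(f"true MI               = {true_mi:.3f}")
print(f"I_BA, mis-specified q = {ba_bound(0.5, 1.0):.3f}")                     # below true MI
print(f"I_BA, optimal q       = {ba_bound(rho, np.sqrt(1 - rho ** 2)):.3f}")   # tight
```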

6 of 17

Lower Bound of I(X;Y)

  • Introduce tractable variational distribution q:
    • Require tractable q(x|y)
  • Use energy based q:


Introduce critic function f(x,y), tractable

Partition function Z(y), intractable

  I(X;Y) ≥ E_{p(x,y)}[log q(x|y)] + h(X) ≡ I_BA    (1)

  q(x|y) = p(x) e^{f(x,y)} / Z(y),   where Z(y) = E_{p(x)}[e^{f(x,y)}]    (2)

  I(X;Y) ≥ E_{p(x,y)}[f(x,y)] - E_{p(y)}[log Z(y)] ≡ I_UBA    (3)

7 of 17

Lower Bound of I(X;Y)

  • Introduce tractable variational distribution q:
    • Require tractable q(x|y)
  • Use energy based family for q:
    • Learn tractable critic function f(x,y)
    • Intractable log Z(y)
  • Introduce a lower bound on -log Z(y) with a baseline a(y) > 0:

  • Tight when a(y) = Z(y)


  I(X;Y) ≥ E_{p(x,y)}[log q(x|y)] + h(X) ≡ I_BA    (1)

  I(X;Y) ≥ E_{p(x,y)}[f(x,y)] - E_{p(y)}[log Z(y)] ≡ I_UBA    (3)

  log Z(y) ≤ Z(y)/a(y) + log a(y) - 1    (4)

  I(X;Y) ≥ E_{p(x,y)}[f(x,y)] - E_{p(y)}[ E_{p(x)}[e^{f(x,y)}]/a(y) + log a(y) - 1 ] ≡ I_TUBA    (5)
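
A minimal sketch of how I_TUBA in (5) can be estimated from a mini-batch; the function name, the scores matrix convention scores[i, j] = f(x_i, y_j), and the per-example log-baseline array log_a are my own assumptions, not the authors' implementation. Diagonal entries of scores play the role of samples from p(x,y); off-diagonal entries approximate samples from p(x)p(y).

```python
import numpy as np

def tuba_lower_bound(scores, log_a):
    """I_TUBA = E_{p(x,y)}[f] - E_{p(y)}[E_{p(x)}[e^f]/a(y) + log a(y) - 1].

    scores[i, j] = f(x_i, y_j) for a mini-batch (x_i, y_i) ~ p(x, y);
    log_a[j] is an estimate of log a(y_j).
    """
    n = scores.shape[0]
    joint_term = np.mean(np.diag(scores))                   # (x_i, y_i) pairs ~ p(x, y)
    ratios = np.exp(scores - log_a[None, :])                # e^{f(x_i, y_j)} / a(y_j)
    marginal_term = ratios[~np.eye(n, dtype=bool)].mean()   # i != j pairs ~ p(x) p(y)
    return joint_term - marginal_term - log_a.mean() + 1.0
```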

8 of 17

Closer look at TUBA

Two components, each learned by a neural network:

  • f(x,y), the critic function, aims to approximate the log density ratio log p(x|y)/p(x) (up to additive terms)
  • a(y), the baseline, aims to approximate the partition function Z(y)
    • NWJ* estimator: a(y) set to the constant e
    • NCE estimator: a(y) set to the average of e^{f(x,y)} over the different x in the mini-batch


*In addition, NWJ uses a self-normalized f(x,y)

(Nguyen et al., 2010) (Belghazi et al., 2018)

  I(X;Y) ≥ E_{p(x,y)}[f(x,y)] - E_{p(y)}[ E_{p(x)}[e^{f(x,y)}]/a(y) + log a(y) - 1 ] ≡ I_TUBA    (5)
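
Under the same scores[i, j] = f(x_i, y_j) convention (again a sketch of mine, not the paper's code), the two special cases above reduce to short expressions. Note that the NCE value can never exceed log(batch size), which matters on the following slides.

```python
import numpy as np
from scipy.special import logsumexp

def nwj_lower_bound(scores):
    """NWJ: TUBA with the constant baseline a(y) = e."""
    n = scores.shape[0]
    joint_term = np.mean(np.diag(scores))
    marginal_term = np.exp(scores[~np.eye(n, dtype=bool)] - 1.0).mean()  # e^{f - 1} over p(x)p(y)
    return joint_term - marginal_term

def nce_lower_bound(scores):
    """InfoNCE: a(y) is the in-batch average of e^{f(x, y)} over x.

    Equivalent to mean_i [ f(x_i, y_i) - log mean_j e^{f(x_j, y_i)} ],
    and therefore capped at log(batch size).
    """
    n = scores.shape[0]
    return np.mean(np.diag(scores) - logsumexp(scores, axis=0) + np.log(n))
```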

9 of 17

Recap - Lower Bound of I(X;Y)

  • BA
    • Tractable to optimize
    • Requires a tractable q(x|y)
  • UBA
    • Learns a tractable critic function f(x,y)
    • Intractable log Z(y)
  • TUBA
    • NWJ* estimator: a(y) set to the constant e
    • NCE estimator: a(y) set to the average of e^{f(x,y)} over the different x in the mini-batch


*In addition, NWJ uses a self-normalized f(x,y)

(Nguyen et al., 2010) (Belghazi et al., 2018) (van den Oord et al., 2018)

  I(X;Y) ≥ E_{p(x,y)}[log q(x|y)] + h(X) ≡ I_BA    (1)

  I(X;Y) ≥ E_{p(x,y)}[f(x,y)] - E_{p(y)}[log Z(y)] ≡ I_UBA    (3)

  I(X;Y) ≥ E_{p(x,y)}[f(x,y)] - E_{p(y)}[ E_{p(x)}[e^{f(x,y)}]/a(y) + log a(y) - 1 ] ≡ I_TUBA    (5)

10 of 17

Proposed bound

Existing lower bounds have either high bias or high variance

  • NWJ* estimator, single-sample based
    • high variance, low bias
  • NCE estimator, multi-sample based
    • low variance, high bias
    • estimate is capped at log(batch size)

⇒ Interpolate between the NWJ and NCE estimators (see the sketch below)
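
A rough sketch of the interpolation idea only; the exact I_alpha bound in the paper is more involved (it relies on a multi-sample argument), so treat this as an assumption-laden illustration rather than the authors' estimator. The idea: mix the in-batch Monte Carlo estimate of Z(y) with a learned baseline q(y) via alpha, then plug the mixed baseline into the NWJ/TUBA form. alpha, log_q_y, and the function name are my own notation.

```python
import numpy as np
from scipy.special import logsumexp

def interpolated_lower_bound(scores, log_q_y, alpha):
    """Sketch: alpha -> 1 behaves like NCE (low variance, capped near log batch size),
    alpha -> 0 behaves like TUBA/NWJ with baseline q(y) (low bias, high variance).
    Requires 0 < alpha < 1; scores[i, j] = f(x_i, y_j); log_q_y is log q(y_j)."""
    n = scores.shape[0]
    # log of the mixed baseline: alpha * mean_i e^{f(x_i, y_j)} + (1 - alpha) * q(y_j)
    log_batch_z = logsumexp(scores, axis=0) - np.log(n)
    log_a = np.logaddexp(np.log(alpha) + log_batch_z, np.log1p(-alpha) + log_q_y)
    joint_term = np.mean(np.diag(scores) - log_a)
    marginal_term = np.exp(scores - log_a[None, :])[~np.eye(n, dtype=bool)].mean()
    return 1.0 + joint_term - marginal_term
```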


11 of 17

Experiments

  • Study bias-variance tradeoff of Mutual Information lower bounds
  • Evaluate the accuracy of gradients of these bounds
  • Decoder-free representation learning on the dSprites dataset


12 of 17

Notebook


13 of 17

Experiments with Gaussian data

  • Study the bias-variance tradeoff of the MI estimators using the optimal critic (sketch below)
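
A rough reproduction sketch of this setup (my own code; rho, dimension, batch size, and trial count are assumed values): draw correlated Gaussians with known MI, form the optimal critic from the known log density ratio, and compare the mean and variance of the NWJ and NCE estimates (reusing nwj_lower_bound and nce_lower_bound from the earlier sketch) against the true MI. With high MI and a modest batch, NCE has low variance but is capped below log(batch size), while NWJ is closer to the truth on average but noisier.

```python
import numpy as np
from scipy.stats import norm

rho, dim, batch, trials = 0.8, 10, 64, 200
true_mi = -0.5 * dim * np.log(1 - rho ** 2)
rng = np.random.default_rng(0)

def optimal_scores(x, y):
    """scores[i, j] = log p(y_j | x_i) - log p(y_j), the optimal critic up to constants."""
    log_p_y_given_x = norm.logpdf(y[None, :, :], loc=rho * x[:, None, :],
                                  scale=np.sqrt(1 - rho ** 2)).sum(-1)
    log_p_y = norm.logpdf(y, loc=0.0, scale=1.0).sum(-1)
    return log_p_y_given_x - log_p_y[None, :]

nwj_vals, nce_vals = [], []
for _ in range(trials):
    x = rng.standard_normal((batch, dim))
    y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal((batch, dim))
    s = optimal_scores(x, y)
    nwj_vals.append(nwj_lower_bound(s + 1.0))   # NWJ's optimal critic includes a +1 shift
    nce_vals.append(nce_lower_bound(s))

print(f"true MI {true_mi:.2f} | NWJ mean {np.mean(nwj_vals):.2f}, var {np.var(nwj_vals):.3f}"
      f" | NCE mean {np.mean(nce_vals):.2f}, var {np.var(nce_vals):.3f}")
```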



15 of 17

dSprites Representation Learning

  • Decoder-free representation learning to recover spatial variables
    • Include additional regularization terms

    • Recovers position and scale variables but not rotation
    • Uses a structured estimator with a known p(y|x)


16 of 17

Key Ideas & Discussion

  • Survey upper and lower bounds of MI
  • Demonstrate the bias-variance trade-off
    • No low-variance, low-bias estimator for high MI and small batch sizes
  • Present the interpolated bound for MI

  • Infinite-data and i.i.d. assumptions are made
  • Consider estimating MI gradients rather than MI itself


17 of 17

Related Work & Follow-up

Contrastive Predictive Coding (van den Oord et al., 2018): https://arxiv.org/pdf/1807.03748.pdf

On Mutual Information Maximization for Representation Learning (Tschannen et al., 2019): https://arxiv.org/pdf/1907.13625.pdf

Learn the gradient directly: https://openreview.net/pdf?id=ByxaUgrFvH

Original VMI paper (Poole et al., 2019): https://arxiv.org/abs/1905.06922
