Variational Autoencoders pursue PCA directions (by accident)

July 2, 2019

Michal Rolínek

Max-Planck Institute for Intelligent Systems

1.

A LITTLE BIT ABOUT ME

BRIEF TRAJECTORY

  • 2012 - Master’s degree in topology and function spaces at Charles University, Prague

  • 2017 - PhD in discrete optimization theory at IST Austria, Vienna

  • 2017 - ? PostDoc in the Autonomous Learning group of the Max-Planck Institute for Intelligent Systems, Tübingen

  • Along the way: mathematical education, mathematical olympiads, a cryptography startup

MY WORK BEFORE...

… AND AFTER

  • Efficient Optimization of Rank-Based Loss Functions -- CVPR 2018 (Best Paper Award Honorable Mention) [Mohapatra*, Rolínek*, Jawahar, Kolmogorov, Kumar]

  • L4: Practical Loss-based Stepsize Adaptation for Deep Learning -- NeurIPS 2018 [Rolínek, Martius]

  • Variational Autoencoders Pursue PCA Directions (by Accident) -- CVPR 2019 (and ongoing) [Rolínek*, Zietlow*, Martius]

2.

WE NEED TO TALK ABOUT DISENTANGLEMENT

WHAT IS DISENTANGLEMENT?

“Learning meaningful and compact representations with structurally disentangled semantics.”

[Higgins et al., Beta-VAE: Learning Visual Concepts with a Constrained Variational Framework, ICLR 2017]

Formal approaches to the definition also exist:

[Higgins et al., Towards a Definition of Disentangled Representations, 2018]

BRIEF HISTORY

  • 2017 [Higgins et al.]: fully unsupervised learning of disentangled representations is somewhat possible.

  • 2017-2018: Disorganized interest - new architectures, metrics, benchmarks, datasets

  • ICML 2019 [Locatello et al.]: madness resolved

  • 2019 - ?: Organized interest - ICML best paper award, disentanglement challenge, fairness, sim2real

We were here (and failing)

Figure from [Locatello et al.], ICML 2019

Rotations matter

3.

MAIN RESULT

FORMALIZED...

USUAL REACTIONS

WHY NOT OBVIOUS?

  • Typical PCA-autoencoder connections do not discuss alignment of the latent space.

  • [Burgess et al., Understanding Disentangling in Beta-VAE, 2018] offers only a high-level intuitive explanation.

  • The PCA connection goes away without the “diagonal posterior” (and so does disentanglement).

4.

FROM UNDER THE RUG

THE CLASSICAL VAE STORY

  • Choice of prior: p(z) ~ N(0, I)
  • Gaussian decoder (unit variance) => MSE reconstruction loss
  • Gaussian encoder with diagonal covariance matrix (!!)

CANONICAL IMPLEMENTATION

L_β(x) = E_{q(z|x)}[ ||x - Dec(z)||² ] + β · KL( q(z|x) || N(0, I) )
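For concreteness, a minimal PyTorch sketch of this canonical implementation (the architecture, layer sizes, and β value are illustrative assumptions, not taken from the paper): a diagonal-Gaussian encoder, a unit-variance Gaussian decoder scored by MSE, and a β-weighted KL term against the N(0, I) prior.

```python
import torch
import torch.nn as nn

class BetaVAE(nn.Module):
    """Toy beta-VAE with the canonical choices: N(0, I) prior,
    unit-variance Gaussian decoder (=> MSE), diagonal Gaussian encoder."""

    def __init__(self, x_dim=784, z_dim=10, hidden=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)       # posterior mean
        self.logvar = nn.Linear(hidden, z_dim)   # log of the *diagonal* posterior variance
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparametrization trick
        return self.dec(z), mu, logvar

def beta_vae_loss(x, x_rec, mu, logvar, beta=4.0):
    # Gaussian decoder with unit variance => reconstruction term is MSE.
    rec = ((x - x_rec) ** 2).sum(dim=1).mean()
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1).mean()
    return rec + beta * kl
```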

What doesn’t explain the choice of alignment...

  • Log-likelihood is invariant to latent-space rotations

  • ELBO is invariant to latent-space rotations

  • But… the invariance proof breaks once a diagonal posterior is enforced
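A compact way to see both claims (a sketch of the standard argument, not the paper's exact formulation): rotating the latent space by an orthogonal matrix R leaves the N(0, I) prior unchanged and maps a full-covariance Gaussian posterior to another Gaussian, so the ELBO is unchanged; a diagonal-covariance posterior, however, is not closed under this rotation.

```latex
% For any orthogonal R, define a rotated model
%   Dec'(z) := Dec(R^{\top} z),   q'(z \mid x) := \mathcal{N}(R\mu(x),\, R\Sigma(x)R^{\top}).
% Since z \sim \mathcal{N}(0, I) implies Rz \sim \mathcal{N}(0, I),
\begin{align*}
\mathrm{ELBO}'(x)
  &= \mathbb{E}_{q'(z \mid x)}\big[\log p(x \mid \mathrm{Dec}'(z))\big]
     - D_{\mathrm{KL}}\big(q'(z \mid x) \,\|\, \mathcal{N}(0, I)\big) \\
  &= \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid \mathrm{Dec}(z))\big]
     - D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, \mathcal{N}(0, I)\big)
   = \mathrm{ELBO}(x).
\end{align*}
% The argument breaks with an enforced *diagonal* posterior:
% R \Sigma R^{\top} is generally not diagonal, so the rotated posterior
% is no longer an admissible encoder output.
```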

PROOF STRATEGY

  • Operate explicitly with the loss as implemented.

  • Compare the locally linearized decoder with a PCA decoder, in terms of its singular value decomposition (sketched below).
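The comparison can also be probed numerically. Below is a hedged PyTorch sketch (the toy decoder and the alignment score are illustrative assumptions, not the paper's exact diagnostics): linearize the decoder at a latent point via its Jacobian, take the SVD J = U Σ Vᵀ, and check how close V is to the latent coordinate axes, which is what a PCA-like decoder would give locally.

```python
import torch

def local_svd_of_decoder(decoder, z):
    """Locally linearize the decoder at latent point z and return the SVD
    J = U @ diag(S) @ Vt of its Jacobian.  A 'PCA-like' decoder has Vt close
    to a signed permutation (no mixing of latent axes) and orthogonal columns
    in U (locally orthogonal traversal directions in data space)."""
    J = torch.autograd.functional.jacobian(decoder, z)   # shape (x_dim, z_dim)
    U, S, Vt = torch.linalg.svd(J, full_matrices=False)
    return U, S, Vt

def axis_alignment(Vt):
    """Crude score in (0, 1]: equals 1.0 exactly when Vt is a signed
    permutation, i.e. the SVD basis coincides with the latent axes."""
    return Vt.abs().max(dim=1).values.mean().item()

# Illustrative usage with an untrained toy decoder (an assumption, not the
# paper's experimental setup):
decoder = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.Tanh(),
                              torch.nn.Linear(64, 32))
U, S, Vt = local_svd_of_decoder(decoder, torch.zeros(4))
print("singular values:", [round(s, 3) for s in S.tolist()])
print("axis alignment of V:", round(axis_alignment(Vt), 3))
```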

PROOF STRATEGY II

  • For a Beta-VAE, isolate incentives on U, Σ, and V.

  • Compare to what PCA would do “locally”.

  • Comparison of U was missing in the conference paper (journal version is in preparation).

  • Keep verifying experimentally.

WHERE THE MATH IS FRAGILE

  • Requires a particular loss term to be negligible (depends on Beta, in agreement with experiments)

  • For the U-case of the SVD, the local/global interplay isn’t fully faithful; for the V-case, it is only “approximate”

  • Some degenerate cases remain (see later)

5.

THE HAPPY EXPERIMENTS

ORTHOGONALITY VS. DISENTANGLEMENT

COMMON DEGENERATE CASE WITH PCA

What are the two principal components? Ambiguous!

COMMON DEGENERATE CASE WITH PCA

Figure: four restarts of a linear Beta-VAE.

In particular, (Beta-)VAE does not optimize for statistical independence.
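To make the ambiguity concrete, a small numpy sketch (illustrative isotropic data, not the paper's experiment): when the covariance eigenvalues coincide, the "principal directions" reported by an eigendecomposition are arbitrary up to rotation and change from draw to draw, just as the latent directions of the linear Beta-VAE change between restarts.

```python
import numpy as np

rng = np.random.default_rng(0)

for trial in range(4):
    # Isotropic 2-D Gaussian: both covariance eigenvalues equal 1, so the
    # principal directions are only defined up to an arbitrary rotation.
    X = rng.normal(size=(10_000, 2))
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    leading = eigvecs[:, np.argmax(eigvals)]
    angle = np.degrees(np.arctan2(leading[1], leading[0])) % 180.0
    print(f"trial {trial}: eigenvalues ~ {eigvals.round(3)}, "
          f"leading direction at {angle:5.1f} degrees")
```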

DISCUSSION

  • Is it “real disentanglement” or just statistical properties of the dataset?

  • Is (Beta-)VAE a “good implementation” of nonlinear PCA?

  • Forget disentanglement! This shows the existence of a non-linear PCA (in a deep sense)

6.

FINAL WORD ABOUT AN ONGOING PROJECT

BACK TO BASICS - WHAT IS THE POWER OF DEEP LEARNING?

  • Flexibility, composability

  • Automated feature extraction (remember pre-DL vision?)

EXACT ALGORITHMS ARE COOL!

  • Combinatorial algorithms (A*, MAX-CUT, (s,t)-MIN-CUT, MAX-WEIGHT-MATCHING…) are powerful and hard for NNs to mimic.

  • They used to be part of vision pipelines; now there is no gradient for end-to-end training.

PRELIMINARY RESULTS

  • Suitable for combinatorial optimization problems (also usable for others).

  • Runs the (blackbox) solver on the forward pass AND on the backward pass.

  • The math says how to construct the “backward instance” and how to compute a “gradient” from it (sketched below).
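The slides do not spell out the construction, so the PyTorch sketch below should be read only as an illustration of the general pattern: the solver is called on the forward pass and again on a perturbed "backward instance" during the backward pass, yielding a finite-difference-style "gradient". The specific perturbation formula, the λ hyperparameter, and the choice of min-cost matching (via scipy's linear_sum_assignment) as the blackbox solver are assumptions made here for concreteness.

```python
import torch
from scipy.optimize import linear_sum_assignment

def matching_solver(costs):
    """Blackbox combinatorial solver: min-cost perfect matching on a square
    cost matrix, returned as a 0/1 indicator matrix."""
    rows, cols = linear_sum_assignment(costs.detach().cpu().numpy())
    y = torch.zeros_like(costs)
    y[rows, cols] = 1.0
    return y

class BlackboxMatching(torch.autograd.Function):
    """Calls the solver on the forward pass AND on the backward pass; the
    backward pass solves a perturbed 'backward instance' and returns a
    finite-difference-style gradient w.r.t. the costs (illustrative scheme,
    not necessarily the project's exact construction)."""

    LAMBDA = 10.0  # interpolation strength (hypothetical hyperparameter)

    @staticmethod
    def forward(ctx, costs):
        y = matching_solver(costs)
        ctx.save_for_backward(costs, y)
        return y

    @staticmethod
    def backward(ctx, grad_output):
        costs, y = ctx.saved_tensors
        # Construct the 'backward instance' and call the solver once more.
        perturbed = costs + BlackboxMatching.LAMBDA * grad_output
        y_perturbed = matching_solver(perturbed)
        return -(y - y_perturbed) / BlackboxMatching.LAMBDA

# Usage: cost matrices produced by a network stay end-to-end trainable.
costs = torch.rand(5, 5, requires_grad=True)
assignment = BlackboxMatching.apply(costs)
loss = (assignment * torch.rand(5, 5)).sum()
loss.backward()
print(costs.grad)
```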

(FOR NOW) SYNTHETIC EXAMPLES

THANK YOU!