Variational Autoencoders pursue PCA directions (by accident)

July 2, 2019

Michal Rolínek

Max-Planck Institute for Intelligent Systems

1.

A LITTLE BIT ABOUT ME

BRIEF TRAJECTORY

  • 2012 - Master’s degree in topology and function spaces at Charles University, Prague

  • 2017 - PhD in discrete optimization theory at IST Austria, Vienna

  • 2017 - ? PostDoc in the Autonomous Learning group of the Max-Planck Institute for Intelligent Systems, Tübingen

  • Along the way: mathematical education, mathematical olympiads, a cryptography startup

MY WORK BEFORE...

… AND AFTER

  • Efficient Optimization of Rank-Based Loss Functions -- CVPR 2018 (Best Paper Award Honorable Mention) [Mohapatra*, Rolínek*, Jawahar, Kolmogorov, Kumar]

  • L4: Practical Loss-based Stepsize Adaptation for Deep Learning -- NeurIPS 2018 [Rolínek, Martius]

  • Variational Autoencoders Pursue PCA Directions (by Accident) -- CVPR 2019 (and ongoing) [Rolínek*, Zietlow*, Martius]

2.

WE NEED TO TALK ABOUT DISENTANGLEMENT

WHAT IS DISENTANGLEMENT?

“Learning meaningful and compact representations with structurally disentangled semantics.”

[Higgins et al., Beta-VAE: Learning Visual Concepts with a Constrained Variational Framework, ICLR 2017]

Formal approaches to the definition also exist:

[Higgins et al., Towards a Definition of Disentangled Representations, 2018]

BRIEF HISTORY

  • 2017 [Higgins et al.]: fully unsupervised learning of disentangled representations is somewhat possible.

  • 2017-2018: Disorganized interest - new architectures, metrics, benchmarks, datasets

  • ICML 2019 [Locatello et al.]: madness resolved

  • 2019 - ?: Organized interest - ICML best paper award, disentanglement challenge, fairness, sim2real

We were here (and failing)

Figure from [Locatello et al.], ICML 2019

Rotations matter

3.

MAIN RESULT

FORMALIZED...

USUAL REACTIONS

WHY NOT OBVIOUS?

  • Typical PCA-autoencoder connections do not discuss alignment of the latent space.

  • [Burgess et al., Understanding Disentangling in Beta-VAE, 2018] offers only a high-level intuitive explanation.

  • The PCA connection goes away without the “diagonal posterior” (and so does disentanglement).

4.

FROM UNDER THE RUG

THE CLASSICAL VAE STORY

  • Choice of prior: p(z) ~ N(0, I)
  • Gaussian decoder (unit variance) => MSE reconstruction loss
  • Gaussian encoder with diagonal covariance matrix (!!)

CANONICAL IMPLEMENTATION

L_β(x) = E_{q(z|x)}[ ||x - Dec(z)||² ] + β · KL( q(z|x) || N(0, I) )
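For concreteness, a minimal PyTorch sketch of this canonical implementation (the architecture, layer sizes, and β value are illustrative assumptions, not taken from the paper): a diagonal-Gaussian encoder, a unit-variance Gaussian decoder scored by MSE, and a β-weighted KL term against the N(0, I) prior.

```python
import torch
import torch.nn as nn

class BetaVAE(nn.Module):
    """Toy beta-VAE with the canonical choices: N(0, I) prior,
    unit-variance Gaussian decoder (=> MSE), diagonal Gaussian encoder."""

    def __init__(self, x_dim=784, z_dim=10, hidden=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)       # posterior mean
        self.logvar = nn.Linear(hidden, z_dim)   # log of the *diagonal* posterior variance
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparametrization trick
        return self.dec(z), mu, logvar

def beta_vae_loss(x, x_rec, mu, logvar, beta=4.0):
    # Gaussian decoder with unit variance => reconstruction term is MSE.
    rec = ((x - x_rec) ** 2).sum(dim=1).mean()
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1).mean()
    return rec + beta * kl
```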

What doesn’t explain the choice of alignment...

  • Log-likelihood is invariant to latent-space rotations

  • ELBO is invariant to latent-space rotations

  • But… the invariance proof breaks once a diagonal posterior is enforced
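A compact way to see both claims (a sketch of the standard argument, not the paper's exact formulation): rotating the latent space by an orthogonal matrix R leaves the N(0, I) prior unchanged and maps a full-covariance Gaussian posterior to another Gaussian, so the ELBO is unchanged; a diagonal-covariance posterior, however, is not closed under this rotation.

```latex
% For any orthogonal R, define a rotated model
%   Dec'(z) := Dec(R^{\top} z),   q'(z \mid x) := \mathcal{N}(R\mu(x),\, R\Sigma(x)R^{\top}).
% Since z \sim \mathcal{N}(0, I) implies Rz \sim \mathcal{N}(0, I),
\begin{align*}
\mathrm{ELBO}'(x)
  &= \mathbb{E}_{q'(z \mid x)}\big[\log p(x \mid \mathrm{Dec}'(z))\big]
     - D_{\mathrm{KL}}\big(q'(z \mid x) \,\|\, \mathcal{N}(0, I)\big) \\
  &= \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid \mathrm{Dec}(z))\big]
     - D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, \mathcal{N}(0, I)\big)
   = \mathrm{ELBO}(x).
\end{align*}
% The argument breaks with an enforced *diagonal* posterior:
% R \Sigma R^{\top} is generally not diagonal, so the rotated posterior
% is no longer an admissible encoder output.
```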

PROOF STRATEGY

  • Operate explicitly with the loss as implemented.

  • Compare the locally linearized decoder with a PCA decoder, in terms of its singular value decomposition (sketched below).
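The comparison can also be probed numerically. Below is a hedged PyTorch sketch (the toy decoder and the alignment score are illustrative assumptions, not the paper's exact diagnostics): linearize the decoder at a latent point via its Jacobian, take the SVD J = U Σ Vᵀ, and check how close V is to the latent coordinate axes, which is what a PCA-like decoder would give locally.

```python
import torch

def local_svd_of_decoder(decoder, z):
    """Locally linearize the decoder at latent point z and return the SVD
    J = U @ diag(S) @ Vt of its Jacobian.  A 'PCA-like' decoder has Vt close
    to a signed permutation (no mixing of latent axes) and orthogonal columns
    in U (locally orthogonal traversal directions in data space)."""
    J = torch.autograd.functional.jacobian(decoder, z)   # shape (x_dim, z_dim)
    U, S, Vt = torch.linalg.svd(J, full_matrices=False)
    return U, S, Vt

def axis_alignment(Vt):
    """Crude score in (0, 1]: equals 1.0 exactly when Vt is a signed
    permutation, i.e. the SVD basis coincides with the latent axes."""
    return Vt.abs().max(dim=1).values.mean().item()

# Illustrative usage with an untrained toy decoder (an assumption, not the
# paper's experimental setup):
decoder = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.Tanh(),
                              torch.nn.Linear(64, 32))
U, S, Vt = local_svd_of_decoder(decoder, torch.zeros(4))
print("singular values:", [round(s, 3) for s in S.tolist()])
print("axis alignment of V:", round(axis_alignment(Vt), 3))
```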

PROOF STRATEGY II

  • For a Beta-VAE, isolate incentives on U, Σ, and V.

  • Compare to what PCA would do “locally”.

  • Comparison of U was missing in the conference paper (journal version is in preparation).

  • Keep verifying experimentally.

WHERE THE MATH IS FRAGILE

  • Requires a particular loss term to be negligible (depends on Beta, in agreement with experiments)

  • For the U-case of the SVD, the local/global interplay isn’t fully faithful; for the V-case, it is only “approximate”

  • Some degenerate cases remain (see later)

5.

THE HAPPY EXPERIMENTS

ORTHOGONALITY VS. DISENTANGLEMENT

COMMON DEGENERATE CASE WITH PCA

What are the two principal components? Ambiguous!

COMMON DEGENERATE CASE WITH PCA

Figure: four restarts of a linear Beta-VAE.

In particular, (Beta-)VAE does not optimize for statistical independence.
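To make the ambiguity concrete, a small numpy sketch (illustrative isotropic data, not the paper's experiment): when the covariance eigenvalues coincide, the "principal directions" reported by an eigendecomposition are arbitrary up to rotation and change from draw to draw, just as the latent directions of the linear Beta-VAE change between restarts.

```python
import numpy as np

rng = np.random.default_rng(0)

for trial in range(4):
    # Isotropic 2-D Gaussian: both covariance eigenvalues equal 1, so the
    # principal directions are only defined up to an arbitrary rotation.
    X = rng.normal(size=(10_000, 2))
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    leading = eigvecs[:, np.argmax(eigvals)]
    angle = np.degrees(np.arctan2(leading[1], leading[0])) % 180.0
    print(f"trial {trial}: eigenvalues ~ {eigvals.round(3)}, "
          f"leading direction at {angle:5.1f} degrees")
```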

DISCUSSION

  • Is it “real disentanglement” or just statistical properties of the dataset?

  • Is (Beta-)VAE a “good implementation” of nonlinear PCA?

  • Forget disentanglement! This shows the existence of a non-linear PCA (in a deep sense)

6.

FINAL WORD ABOUT AN ONGOING PROJECT

BACK TO BASICS - WHAT IS THE POWER OF DEEP LEARNING?

  • Flexibility, composability

  • Automated feature extraction (remember pre-DL vision?)

EXACT ALGORITHMS ARE COOL!

  • Combinatorial algorithms (A*, MAX-CUT, (s,t)-MIN-CUT, MAX-WEIGHT-MATCHING…) are powerful and hard for NNs to mimic.

  • They used to be part of vision pipelines; now there is no gradient for end-to-end training.

PRELIMINARY RESULTS

  • Suitable for combinatorial optimization problems (also usable for others).

  • Runs the (blackbox) solver on the forward pass AND on the backward pass.

  • The math says how to construct the “backward instance” and how to compute a “gradient” from it (sketched below).
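The slides do not spell out the construction, so the PyTorch sketch below should be read only as an illustration of the general pattern: the solver is called on the forward pass and again on a perturbed "backward instance" during the backward pass, yielding a finite-difference-style "gradient". The specific perturbation formula, the λ hyperparameter, and the choice of min-cost matching (via scipy's linear_sum_assignment) as the blackbox solver are assumptions made here for concreteness.

```python
import torch
from scipy.optimize import linear_sum_assignment

def matching_solver(costs):
    """Blackbox combinatorial solver: min-cost perfect matching on a square
    cost matrix, returned as a 0/1 indicator matrix."""
    rows, cols = linear_sum_assignment(costs.detach().cpu().numpy())
    y = torch.zeros_like(costs)
    y[rows, cols] = 1.0
    return y

class BlackboxMatching(torch.autograd.Function):
    """Calls the solver on the forward pass AND on the backward pass; the
    backward pass solves a perturbed 'backward instance' and returns a
    finite-difference-style gradient w.r.t. the costs (illustrative scheme,
    not necessarily the project's exact construction)."""

    LAMBDA = 10.0  # interpolation strength (hypothetical hyperparameter)

    @staticmethod
    def forward(ctx, costs):
        y = matching_solver(costs)
        ctx.save_for_backward(costs, y)
        return y

    @staticmethod
    def backward(ctx, grad_output):
        costs, y = ctx.saved_tensors
        # Construct the 'backward instance' and call the solver once more.
        perturbed = costs + BlackboxMatching.LAMBDA * grad_output
        y_perturbed = matching_solver(perturbed)
        return -(y - y_perturbed) / BlackboxMatching.LAMBDA

# Usage: cost matrices produced by a network stay end-to-end trainable.
costs = torch.rand(5, 5, requires_grad=True)
assignment = BlackboxMatching.apply(costs)
loss = (assignment * torch.rand(5, 5)).sum()
loss.backward()
print(costs.grad)
```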

(FOR NOW) SYNTHETIC EXAMPLES

THANK YOU!