Unsupervised Deep
Single-Image Intrinsic Decomposition
using Illumination-Varying Image Sequences

Louis Lettry1 Kenneth Vanhoey1,2 Luc Van Gool1,3

1 Computer Vision Lab @ ETH Zurich 2 Unity Technologies 3 PSI-ESAT @ KU Leuven

This presentation covers the content of a technical paper presented at Pacific Graphics 2018.
The paper and additional material (including code) are available on Unity Labs’ publication website and on Kenneth Vanhoey’s website.

The topic is intrinsic decomposition of a single image. Our solution leverages deep learning, trained in an unsupervised manner.
Note that a concurrent work proposes a very similar idea.

Introduction: Intrinsic Image Decomposition


Rendering

An image results from a process in which photons emitted by light sources interact with the world and, after multiple reflections and transformations, arrive at a sensor (e.g., a camera, a human eye, …).

Introduction: Intrinsic Image Decomposition


Intrinsic Decomposition

Intrinsic decomposition, in essence, is the inverse of this process: one tries to separate (hence the word “decompose”) the result into the “intrinsic” properties of the world underlying a captured image.
For example, one can try to infer the location and physical properties of light sources and objects in the photographed scene, given an image of it.

Knowing such a decomposition is useful: it allows one to edit the scene in a virtual world. Indeed, one can for example edit the lighting, edit materials, soften shadows, etc., while recomputing a consistent and plausible image for visualisation.

Introduction: Intrinsic Image Decomposition


Intrinsic Decomposition

[Bonneel et al., 17]

The most general 2D formulation of this problem is the following:

Given an image I, infer two images (the albedo image A and the shading image S) that, when pixelwise multiplied together, form the given image.
The albedo image should contain the “base color” of what is visible in the scene, regardless of any reflections, shadows or specularities.
Conversely, the shading image should contain all lighting-induced effects.
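
In equation form, with ⊙ denoting the pixelwise (Hadamard) product:

$$I = A \odot S$$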

Goal: Single Image Intrinsic Decomposition


CNN

In this work, we use a deep convolutional neural network to decompose an input image into its A and S layers.

Training such a network in a supervised manner requires a lot of training data.

This data is, however, nearly impossible to obtain in sufficient variety and photorealism: indeed, albedo and shading images cannot be observed directly in the real world.

Goal: Single Image Intrinsic Decomposition


CNN

Our method: train on time-lapses without groundtruth

[Diagram: albedo constant, shading varying across the sequence]

We propose an unsupervised alternative, in which ground-truth decompositions are not required for training.

Instead, we train our network on images alone, but of a particular kind: static image sequences, better known as time-lapses or static webcams.

Such sequences contain lighting variations with static content.

This inherently means a fixed Albedo, and a varying Shading.

Our contribution can be summarized as “how to set up an effective learning scheme that exploits these natural relationships between images”.

The answer consists of a siamese training with an appropriate set of loss functions, all of which we will detail in the following slides.

Our way of proceeding has three key advantages: no ground truth is needed, results compete with the supervised state of the art, and the resulting decompositions are consistent (see the conclusion).

Related Work


Human-devised optimizations, based on Retinex theory [Land et al., 1971]:
  • Sparse albedo
  • Smooth shading

(Supervised) learning-based approaches: [Narihira et al., ‘15], [Zhou et al., ‘15], [Shi et al., ‘17], [Lettry et al., ‘18]

Let’s review some related work.

Traditionally, optimization approaches are used. They leverage the Retinex theory of [Land et al., 1971], which assumes that albedo is often sparse and piecewise constant, while shading is piecewise smooth, with discontinuities strongly correlated with geometric discontinuities.
These observations are translated into mathematical functions to optimize for. As these constraints still leave some degrees of freedom, other observations are used to regularize the optimization, e.g., minimizing the number of distinct albedos used (see the related work section in our paper).
These approaches work fine on a limited scope of scenes, particularly human-made ones, in which the assumptions are valid. In general however, and particularly in natural environments, the assumptions do not hold, and the optimization fails.
[Bonneel et al., 2017] presented a state-of-the-art report concluding that the problem remains unsolved, as no solution is usable in practice.

Related Work


Supervised training: need for ground truth

Training CNNs requires lots of data!


Learning-based approaches have recently been attempted: they try to replace arbitrary human-devised observations by learning from data.

Most recent methods leverage deep learning in a supervised way: this requires a lot of data.

Intrinsic Decomposition Data


Real dense GT [Grosse et al., 09]
  • Cumbersome
  • Small objects
  • Not scalable

Real sparse human annotations [Bell et al., 14]
  • Sparse
  • Not scalable

Synthetic dense GT [Bonneel et al., 17]
  • Expert modeler required
  • Rendering time
  • Domain transfer

Hence, ground truth data is acquired using a tedious lab-constrained process [Grosse et al., 09], by extrapolating from collected sparse human annotations [Bell et al., 14], or by synthesizing images using physically-based rendering [Bonneel et al., 17].

None of these is fully satisfying, because none of the processes scales to date.

While deep-learning-based solutions numerically improve over classical methods, it is difficult to define their scope of applicability: results are often biased towards the datasets used, which may lack photorealism (synthetic scenes) or carry human perceptual bias (human annotations).

Static Timelapses


Timelapse = Sequence of images acquired at a low frequency

Static = Same viewpoint & static scene (Albedo constant)

[Lalonde et al., 09]

Our goal is to replace any form of fabricated ground-truth decomposition. This is appealing, as it removes human bias and/or assumptions from the loop.

We therefore leverage timelapse data, like the sequences illustrated here.

In each sequence, the albedo is assumed constant at each pixel, while the shading varies.

Disclaimer: we do not tackle sanitization of webcam streams (i.e., detection and removal/correction of non-static content). Instead, we propose a synthetic dataset (see slide 18).

Static Timelapses


[Laffont et al., ‘15]

Temporally-varying data has been used before for intrinsic image decomposition, but not in a train-on-many/apply-on-one fashion: all solutions required multiple inputs at inference time.

Decomposed Albedo Uniqueness


Training on timelapse pairs, but inference on a single image!

[Diagram: a pair of inputs yields a pair of shadings but a unique albedo]

Conversely, we train a CNN that decomposes a single image, which effectively defines a single-image intrinsic decomposition method.

Our goal now is to define a procedure to train this CNN so that it learns from temporal variations within a timelapse.

That is why we will define losses that compare the responses of this CNN on pairs of images at train time.

The question now becomes: how do we design a principled training procedure?

Decomposed Albedo Uniqueness


First, let us define the exact setting we are dealing with, as this is important for introducing the constraints (loss functions) that will drive the deep learning optimization.

Input images are color images: they have a limited range that can be normalized to the unit cube (for RGB images).

The albedo inherently is also a color, and thus lives in the same unit cube.

Defining the space shading pixel values live in is a more complex problem.

Shading Model: Colored & Unbounded


[Lalonde et al., 09]

In many papers, it also lives in the unit cube, or even in the unit interval (greyscale shading).
This would mean that shading, when multiplied with the albedo, can only make the albedo darker: it cannot alter its color, nor can it brighten it (e.g., add specularities).
However, in the real world, it can. Slides 15 to 17 illustrate this.

Therefore, we will treat it as a more general quantity: 3 dimensions and no upper bound.
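
In symbols, the per-pixel setting is thus:

$$I \in [0,1]^3, \qquad A \in [0,1]^3, \qquad S \in \mathbb{R}_{\geq 0}^3, \qquad I = A \odot S.$$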

Shading Model: Colored & Unbounded


Colored Shading is needed for real scenes

[Lalonde et al., 09]

Light effects can alter color.

Shading Model: Colored & Unbounded


[Lalonde et al., 09]

Colored Shading is needed for real scenes

Camera effects can alter color.

Shading Model: Colored & Unbounded


No upper bound shading is needed for real scenes

[Lalonde et al., 09]

Light effects can make images brighter than the albedo.

The result is that we consider a more expressive and general model for shading.

Yet, this also makes the optimization much harder, as many degrees of freedom are added. This explains why related work rarely tackles this.

Training Data:

  • Timelapses in the Wild: hard to sanitize + No camera processing control

Own synthetic sandbox framework based on SUNCG:


  • Static scenes
  • Fully customizable
  • Random lighting
  • Random acquisition processing (tone-mapping parameters)

To train our network, we introduce a new synthetically rendered dataset.
Why synthetic? Because:
1. It avoids the problems of non-static content and varying camera parameters in real webcam footage.
2. It gives us control over lighting, viewpoints and image-processing parameters, like tone-mapping.

As a result, we generate many images, for which we render variants in terms of lighting and tone-mapping, in the hope that this data encompasses real-world variations of these two parameters.
This should benefit generalization of the learned CNN.

To render the images, we employ physically-based rendering with the Mitsuba renderer on the SUNCG dataset, which consists of many indoor environments.
In these environments, we placed randomly positioned point light sources and randomly positioned cameras (at approximately 1.5 m height and not facing nearby obstacles). We pruned overly dark images (e.g., when a light source ended up inside a cupboard).
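
To make the “random acquisition processing” concrete, here is a minimal sketch of a random tone-mapping augmentation, assuming a simple gain-plus-gamma model (the actual curve and parameter ranges used for our dataset may differ):

```python
import numpy as np

def random_tonemap(hdr, rng):
    """Randomly tone-map a linear (HDR) rendering to an LDR image.

    Hypothetical gain + gamma model with illustrative parameter ranges.
    """
    gain = rng.uniform(0.5, 2.0)    # random exposure
    gamma = rng.uniform(1.8, 2.4)   # random display gamma
    return np.clip(gain * hdr, 0.0, 1.0) ** (1.0 / gamma)

# Each lighting variant of a scene receives its own random tone-mapping,
# so the network also sees camera-processing variation at train time.
rng = np.random.default_rng(0)
hdr = rng.uniform(0.0, 4.0, size=(240, 320, 3))  # stand-in for a rendering
ldr = random_tonemap(hdr, rng)
```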

Convolutional Neural Networks


U-Net architecture: auto-encoder with skip connections

For the CNN, we use a fairly standard U-Net architecture, i.e., an auto-encoder with skip connections.
It takes the image as input and outputs the shading image (3 channels of positive numbers per pixel).
The albedo image is then obtained by pixelwise division of I by S, followed by clipping to the valid interval.
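
As a minimal sketch of this parameterization (assuming a PyTorch-style network `unet` whose output is constrained to be positive, e.g. via a softplus; the names are illustrative, and the actual architecture details are in the paper):

```python
import torch

def decompose(unet, image, eps=1e-6):
    """Predict shading with the network, then recover albedo as I / S."""
    shading = unet(image)                                # (B,3,H,W): S >= 0, unbounded above
    albedo = (image / (shading + eps)).clamp(0.0, 1.0)   # A = I / S, clipped to [0,1]
    return albedo, shading
```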

Albedo Siamese Constraint


Training happens in a siamese setting: pairs of images taken from the same viewpoint, but with varying lighting and tone-mapping, are fed to the network in batches.

Albedo Siamese Constraint


The CNN processes both images independently, and infers a pair of Albedo and Shading for each of them.

Albedo Siamese Constraint


Siamese Albedo


In this setting, we can now express loss functions comparatively on the outputs of both processed images.
Therefore, our first loss function is an L2 norm on the albedo difference, as both albedos should be similar regardless of the illumination and tone-mapping conditions of the inputs.
This is a physically valid measure: it is true for any real-world albedo. Hence, it generalizes by definition to any physically valid data.
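
A minimal sketch of this siamese constraint, reusing the albedo recovery from the previous slide (illustrative; the exact weighting follows the paper):

```python
import torch

def siamese_albedo_loss(unet, img1, img2, eps=1e-6):
    # Decompose both images of the pair independently (shared weights).
    albedo1 = (img1 / (unet(img1) + eps)).clamp(0.0, 1.0)
    albedo2 = (img2 / (unet(img2) + eps)).clamp(0.0, 1.0)
    # Same static scene, hence same albedo: penalize any difference (L2).
    return ((albedo1 - albedo2) ** 2).mean()
```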

Albedo Siamese Constraint


Siamese Albedo


  • Underconstrained:
    • S is not constrained
    • Huge space of possibilities for A

This is of course not sufficient: the problem is highly ambiguous and the optimisation has lots of freedom.

A typical outcome that is valid in the sense of this metric would be to put the full image in the shading, multiplied by a unit constant (white) albedo.

Shading Regularizations


Based on Retinex theory

Therefore, we need to constrain the optimisation further, in particular because our shading variables are unbounded (on the upper side).
Yet we wish to minimize human-devised regularisations as much as possible: we therefore add low-weighted regularization terms to the former albedo term.

We regularize the shading with two terms based on observations of shading images:

1) the shading itself varies smoothly, in line with the Retinex assumption of piecewise-smooth shading;

Shading Regularizations


[Bonneel et al., 17]

2) its chroma varies smoothly as well, as illustrated in these chroma images (lower row).

We can clearly see shading is colored, but also that color variation is sparse and smooth.
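
A sketch of what such regularizers could look like; the finite-difference form and the luminance-based chroma are assumptions, and the exact norms and weights used in the paper may differ:

```python
import torch

def smoothness(x):
    # Mean squared finite-difference gradients: a generic smoothness prior.
    dx = x[..., :, 1:] - x[..., :, :-1]
    dy = x[..., 1:, :] - x[..., :-1, :]
    return (dx ** 2).mean() + (dy ** 2).mean()

def shading_regularizers(shading, eps=1e-6):
    # 1) the shading itself should vary smoothly...
    r_intensity = smoothness(shading)
    # 2) ...and so should its chroma (shading divided by its luminance).
    intensity = shading.mean(dim=1, keepdim=True)   # (B,1,H,W) luminance
    r_chroma = smoothness(shading / (intensity + eps))
    return r_intensity, r_chroma
```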

Albedo Initialization


Initialization with a decaying weight along the training

Finally, we regularize the albedo too.

More precisely, we initialize the albedo with a soft initialization method: a loss function that is strongly weighted at the beginning of the training procedure, and whose weight then decays and nearly vanishes after a while.
This loss says that the full input image is a good starting point for the albedo, as the two have a lot in common.


Note that there is a subtlety here: we cross-initialize (lower row). That is, the albedo is initialized with the value of the “other” image in the siamese setting.

Since both albedos should be identical in the end, this proved beneficial, as the result on this slide illustrates.
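
A sketch of the decaying cross-initialization term (the exponential schedule and the constant `tau` are illustrative assumptions):

```python
import math

def cross_init_loss(albedo1, albedo2, img1, img2, step, tau=10_000.0):
    # Decaying weight: strong at the start of training, nearly zero later on.
    w = math.exp(-step / tau)
    # Cross-initialization: pull each albedo towards the *other* input image.
    return w * (((albedo1 - img2) ** 2).mean() + ((albedo2 - img1) ** 2).mean())
```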

Qualitative Results


  • First unsupervised method
    • Never seen any GT decomposition

  • Compete with state of the art supervised learning methods

[Result figure, scene from [Bonneel et al., 17]: GT | Ours | [Zhou et al., 15] | [Zhao et al., 15]]

Let’s take a look at some results. Our method is the first (along with concurrent work) to tackle this problem with unsupervised learning: it has never seen any ground truth decomposition at train time.
Our goal is thus to approach the quality of results of the supervised methods, trained with knowledge of ground truth decompositions.

This slide shows a synthetic scene taken from [Bonneel et al., 17] (left column), which comes with known decompositions (which we did not use for training, unlike the other works). Bonneel et al. state in their report that the problem is unsolved: no result is fully satisfying.

Our work (2nd column) is compared against [Zhou et al., 15] (a state of the art supervised method, 3rd column) and [Zhao et al., 15] (a state of the art human-devised optimization method, right column).

One can observe several things.
First, it is hard to say that one method is clearly better than another. Each fails in its own way, and none fully solves the problem.
On the plus-side for our method, we can notice that:
- GT shading is colored, as is ours, unlike related work,
- GT albedo is textured, as is ours (slightly), whereas related work relies on an assumption of sparse and piecewise constant albedo,
- The flaws of our method are also present in the other methods’ results: shadows in the albedo, textures spilling in the shading.
On the down-side for any algorithm:
- None fully solve the problem,
- Texture spilling in the shading and vice versa.

Qualitative Results: Colored Shading


[Bell et al., 14]

This slide shows an example of applying our method to a real image.
One can notice this is a very complex scene, with multiple light sources both indoors and outdoors, and many smooth and textured objects.

Overall, it is hard to assess quality. The decompositions are surely not perfect, and may not even be usable, depending on the application.

But on the other hand, there are several positive aspects unseen in related work:
- the colored shading is indispensable in this scene: the lights and reflections are colored. Our method does put these dominant lighting colors in the shading. Unfortunately, the yellow lamps are also still colored in the albedo. But the white outdoor light is well pushed to the shading. Recall: with a bounded 1D monochrome shading, all these effects would be in the albedo image.
- the input image is very textured, and our method puts most of the colors related to texture (e.g., the canvases on the wall) in the albedo image. Some of it spills in the shading, but this is mostly monochrome light attenuation (shadow) or emphasis (specularity), which is a reasonable result.

Qualitative Results


[Result figure columns: Groundtruth | Ours | [Zhou et al., 15] | [Bonneel et al., 17]]

Finally, let’s take a look at a very complex synthetic scene taken again from [Bonneel et al., 17]. It is very dark.

Thanks to training on dark images and varying tone-mappings, our method is able to recover a non-dark albedo from this very dark scene. Of course, most of the albedo is “guessed”, as any multiplication with a black shading yields a black pixel.

Quantitative Results: Metrics


Existing metrics:
  • WHDR: Weighted Human Disagreement Rate, based on human judgments of relative reflectance [>, <, =] (perceptual)
  • SAW: Shading Annotations in the Wild (perceptual)
  • LMSE: (scale-invariant) Local Mean Squared Error between GT and estimated layers (groundtruth proximity)

Our new consistency metrics:
  • MRE: Mean Reconstruction Error
  • MACE: Mean Albedo Consistency Error

How do things look on the quantitative side? We performed an extensive numerical study in which we compare several methods on 5 metrics grouped into 3 categories: groundtruth proximity, perceptual quality, and consistency. Depending on the application a decomposition is used for, these metrics and categories can be more or less important. By ranking different methods with respect to them, we propose a multi-dimensional view of the numerical results of intrinsic image decomposition methods.

GT proximity is always a good scientific feature to have. Perceptual quality can be very important when humans are directly involved. Finally, consistency can be very important for applications that treat video data or multi-view reconstruction, for example.
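
For concreteness, here is one plausible instantiation of the two new consistency metrics (a sketch; the exact normalizations in the paper may differ). `albedos` stacks the albedos predicted for all frames of one static sequence:

```python
import numpy as np

def mre(img, albedo, shading):
    # Mean Reconstruction Error: how well A * S reproduces the input image.
    return np.mean((img - albedo * shading) ** 2)

def mace(albedos):
    # Mean Albedo Consistency Error: disagreement between the albedos
    # inferred from different images of one static sequence.
    mean_albedo = albedos.mean(axis=0)          # albedos: (N, H, W, 3)
    return np.mean((albedos - mean_albedo) ** 2)
```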

Quantitative Results

[Radar plots comparing recent algorithms along five axes: Losslessness (MRE), Albedo Consistency (MACE), Albedo Quality (WHDR), Shading Quality (SAW), Proximity to GT (LMSE)]

Here one can see a plot of four recent and state of the art algorithms. Ours is in blue.

On the bottom-left is the proximity to GT, in which we are among the three top-ranked ones, in the middle.

On the top-left are the consistency metrics, in which our method leads significantly. This is unsurprising, as consistency is inherently incorporated in our training scheme.

Finally, on the right is the perceptual quality of albedo (top) and shading (bottom). Our method scores poorly on albedo quality (the WHDR metric disregards textures, one of our strong points), and is very successful on shading quality.

Overall, (this is subjective), our method ranks among the state of the art methods, despite it being trained in a completely unsupervised manner.
As it is one of the first attempts (along with concurrently published work), we believe there is great potential for improvement.

If one had to choose a particular algorithm today, it would depend on the application and the guarantees it requires.

Limitations: Albedo Bleeding


Highlighted by trichromatic shading model

To conclude the results, let us show some limitations.

The biggest drawback of our method comes from the albedo similarity formulation (slide 23): it possesses many minima that motivate the CNN to push albedo into the shading, as can be seen in these images.
This is particularly visible because of our trichromatic (colored) shading model: the shading is too colored, as can be seen in the first row on this slide. (Note: this example was generated by removing the shading constraints, so as to illustrate the limitations of the albedo similarity formulation alone.)

The second pair of images shows that, despite the shading constraints, texture still bleeds into the shading, and a particularly problematic dark-albedo bleeding arises: the table should be much brighter in the shading image, as it is brightly lit in this scene.

Conclusion

  • Unsupervised approach to Single Image Intrinsic Decomposition
    • No GT needed
    • Close to supervised State-of-the-art
    • First choice for consistency & losslessness
  • General trichromatic unbounded shading model used (and usable !)
    • More difficult optimization
    • Requires new regularizations: our set can probably be improved
  • Opens up new research directions for Intrinsic Decomposition
    • (and closes cumbersome groundtruth gathering !)


Let us conclude with a small recap of what we have done and achieved. Our method is the first (along with concurrent work) to attempt a deep-learning method for single-image intrinsic decomposition that is trained in a completely unsupervised manner. There is no need for ground truth at train time.

Yet, results come close to those of state-of-the-art methods trained in a supervised manner, in particular when looking for consistency and invertibility of the decomposition.

On a more scientific note, this opens a new avenue of work along the unsupervised training trend for intrinsic decomposition, and we have shown that using a trichromatic unbounded shading model is tractable. It brings along more ambiguity to resolve, and hence requires regularization terms. We believe our first attempt is not optimal, and we encourage novel contributions in this regard.

Unsupervised Deep
Single-Image Intrinsic Decomposition
using Illumination-Varying Image Sequences

Louis Lettry1 Kenneth Vanhoey1,2 Luc Van Gool1,3

1 Computer Vision Lab @ ETH Zurich 2 Unity Technologies 3 PSI-ESAT @ KU Leuven

Code available @ https://git.ee.ethz.ch/lettryl/UnsupervisedIntrinsicDecomposition

Thank you for following this presentation.

Source code and several other documents are available on the websites of Unity Labs’ publications and of Kenneth Vanhoey.
In case of questions, please contact Louis and/or Kenneth by email.
