
Context Encoders

Niharika Vadlamudi 2018122008

Sravya Vardhani S 201970008

Dhawal Sirikonda 2019201089


Contents

Abstract

Introduction

Context encoders for image generation

Loss function

Results

Timeline


Abstract

A visual feature learning algorithm based on context-based pixel prediction (predicting a region from its surroundings).

Two main requirements:

  • The model must capture the content of the entire image.
  • It must produce a plausible hypothesis for the missing parts.

Better results are obtained when a reconstruction loss plus an adversarial loss is used instead of the standard pixel-wise reconstruction loss alone.


Introduction

  • Exploring whether state-of-the-art computer vision algorithms can make sense of image structure the way humans do.
  • In this paper, the structure is learned and predicted using convolutional neural networks.


Model

In autoencoders, the learned feature is likely to be just a compressed version of the image, without any semantically meaningful representation.

Denoising autoencoders corrupt the input image and learn to undo the corruption, but since the corruptions are localized and low-level, undoing them does not require understanding the semantics of the scene.

Here, large missing parts of the image must be filled in, which requires a deeper semantic understanding.

(Diagram: Encoder → compact latent feature capturing the image context → Decoder.)


This task, however, is inherently multi-modal: there are multiple ways to fill the missing region while maintaining coherence with the given context.

This is handled in the loss function by jointly training the context encoder to minimize both a reconstruction loss and an adversarial loss.

Encoder - the context of an image patch is encoded into a compact feature; nearest neighbours in this feature space yield patches that are semantically similar to the original patch.

Fine-tuning - the quality of the encoder features was validated on a variety of image-understanding tasks (classification, object recognition).

Decoder - the method is often able to fill in realistic image content.


Region Masks

The input to a context encoder is an image with one or more of its regions “dropped out”, i.e., set to zero (assuming zero-centered inputs). The removed regions can be of any shape; the paper presents three strategies: a central square block, random rectangular blocks, and arbitrarily shaped random regions. A sketch of the first strategy follows.
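A minimal PyTorch sketch of the central-block strategy (the function name and exact block size are our assumptions; the central block here covers a quarter of the pixels):

```python
import torch

def drop_central_block(images):
    """Zero out the central square block (side H/2 x W/2) of a
    zero-centered batch of shape (N, C, H, W). Returns the corrupted
    images and a binary mask that is 1 on the dropped region."""
    n, c, h, w = images.shape
    mask = torch.zeros(n, 1, h, w, device=images.device)
    top, left = h // 4, w // 4
    mask[:, :, top:top + h // 2, left:left + w // 2] = 1.0
    corrupted = images * (1.0 - mask)  # context only; the hole reads as zero
    return corrupted, mask
```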


Current network


(Diagram: Encoder → Decoder produces the inpainted image; a Discriminator classifies samples as real or fake.)


Context encoders for image generation


Encoder-decoder pipeline

  • A simple encoder-decoder pipeline.
  • The encoder takes an input image with missing regions and produces a latent feature representation of that image.
  • The decoder takes this feature representation and produces the missing image content.
  • The encoder and decoder are connected by a channel-wise fully-connected layer, so the decoder can reason about the entire image content (see the sketch after this list).
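A minimal PyTorch sketch of such a pipeline (layer sizes are illustrative, and a 1×1 convolution stands in for the channel-wise fully-connected bottleneck; this is not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Minimal encoder-decoder sketch: 3x128x128 context in,
    3x64x64 missing region out."""
    def __init__(self):
        super().__init__()
        # Encoder: strided convolutions shrink 128 -> 8 spatially
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),    # 64 x 64 x 64
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 128 x 32 x 32
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(), # 256 x 16 x 16
            nn.Conv2d(256, 512, 4, stride=2, padding=1), nn.ReLU(), # 512 x 8 x 8
        )
        # Stand-in for the channel-wise fully-connected bottleneck
        self.bottleneck = nn.Conv2d(512, 512, kernel_size=1)
        # Decoder: up-convolutions grow 8 -> 64 spatially
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(),  # 256 x 16 x 16
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 128 x 32 x 32
            nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1), nn.Tanh(),    # 3 x 64 x 64
        )

    def forward(self, x):
        return self.decoder(self.bottleneck(self.encoder(x)))
```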


Pool-free encoders - they experimented with replacing all pooling layers with convolutions of the same kernel size and stride.

The overall stride of the network remains the same, but the result is finer inpainting; a sketch of the substitution is shown below.
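The substitution, with channel counts as our own illustrative choice:

```python
import torch.nn as nn

# Pooling variant: convolution followed by a 2x2 max-pool
with_pooling = nn.Sequential(
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

# Pool-free variant: the pooling layer is replaced by a learned
# convolution with the same kernel size and stride, so the overall
# stride of the network is unchanged
pool_free = nn.Sequential(
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=2, stride=2),
)
```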

Intuitively, there is no reason to use pooling in reconstruction-based networks.

The original AlexNet architecture (with pooling) is used for all feature-learning results.


Encoder

The encoder is derived from the AlexNet architecture. For an image of size 227×227 (32×32), the first five convolutional layers and the following pooling layer compute an abstract 6×6×256 (1024) dimensional feature representation.

If only convolutional layers are used, information cannot propagate from one corner of a feature map to another: convolutions connect the feature maps to each other, but only over local spatial windows, so there are no long-range connections within a specific map.

To handle this, a fully-connected layer is used.

(Diagram: Encoder: 227×227 input → 5 convolutional layers → 6×6×256 feature.)


Decoder

The decoder uses up-convolutions, which can be understood as upsampling followed by convolution, or as convolution with fractional stride.

The intuition is straightforward: non-linear, weighted upsampling of the feature produced by the encoder until roughly the original target size is reached.

(Diagram: Decoder: 6×6×256 feature → 5 up-convolutional layers with ReLU activations.)
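Both views mentioned above can be written down directly; a single decoder stage might look like this sketch (channel counts are illustrative):

```python
import torch.nn as nn

# View 1: convolution with fractional stride ("up-convolution"),
# which doubles the spatial size of the feature map
upconv = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1)

# View 2: fixed upsampling followed by an ordinary convolution,
# a roughly equivalent formulation
upsample_then_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(256, 128, kernel_size=3, padding=1),
)
```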


Loss Function


Loss function

Reconstruction loss - a normalized, masked L2 distance; it encourages the decoder to produce a rough outline of the predicted object but does not capture high-frequency detail.
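Written out, with F the context encoder and M̂ the binary mask (1 where pixels were dropped), the paper's reconstruction loss is:

\mathcal{L}_{rec}(x) = \lVert \hat{M} \odot \big( x - F\big((1 - \hat{M}) \odot x\big) \big) \rVert_2^2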

Adversarial loss - based on GANs: a generative model G of the data distribution is learned jointly with an adversarial discriminative model D, which provides loss gradients to the generative model.

G and D are parametric functions (e.g., deep networks), where G : Z → X maps samples from a noise distribution Z to the data distribution X.

Understanding Adversarial Loss Function
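In the notation above, the standard GAN objective is the two-player minimax game

\min_G \max_D \; \mathbb{E}_{x \sim \mathcal{X}}[\log D(x)] + \mathbb{E}_{z \sim \mathcal{Z}}[\log(1 - D(G(z)))]

where D is trained to assign high probability to real samples and low probability to generated ones, while G is trained to make D misclassify its outputs.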


Adversarial Loss

The learning procedure is a two-player game: an adversarial discriminator D takes in both the predictions of G and ground-truth samples and tries to distinguish them, while G tries to confuse D by producing samples that appear as “real” as possible (G - generated image, GT - ground truth).

(Diagram: the generated image G and the ground truth GT are both fed to the Discriminator, whose output drives the loss.)

The discriminator's objective is the logistic likelihood of whether the input is a real sample or a predicted one:
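\mathcal{L}_{adv} = \max_D \; \mathbb{E}_{x \in \mathcal{X}} \big[ \log D(x) + \log\big(1 - D\big(F\big((1 - \hat{M}) \odot x\big)\big)\big) \big]

Here the generator is conditioned on the image context rather than on noise: F((1 − M̂) ⊙ x) plays the role of G(z), with F the context encoder and M̂ the mask from before.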


Joint loss

We define the overall loss function as:
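\mathcal{L} = \lambda_{rec}\, \mathcal{L}_{rec} + \lambda_{adv}\, \mathcal{L}_{adv}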

  • NOTE: the weight λ_rec of the L2 reconstruction loss is set to 0.999 and λ_adv to 0.001. There is no rule for deciding these values; they were chosen empirically.
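Putting the pieces together, one training step under the joint loss might look like this sketch (netG, netD, their optimizers, and a sigmoid-output discriminator are all our assumptions; the generator is assumed to output a full-size image so the mask applies directly):

```python
import torch
import torch.nn.functional as F

lambda_rec, lambda_adv = 0.999, 0.001  # weights from the note above

def train_step(netG, netD, optG, optD, x, mask):
    """One joint-loss step. `x` is a batch of real images; `mask` is
    1 on the dropped region; netD ends in a sigmoid, so its output is
    the probability that its input is real."""
    context = x * (1 - mask)          # corrupted input: hole zeroed out
    pred = netG(context)              # generator fills in the image

    # Discriminator update: real images vs. detached predictions
    optD.zero_grad()
    d_real, d_fake = netD(x), netD(pred.detach())
    loss_d = (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    loss_d.backward()
    optD.step()

    # Generator update: masked L2 reconstruction + adversarial term
    optG.zero_grad()
    loss_rec = ((mask * (x - pred)) ** 2).mean()
    loss_adv = F.binary_cross_entropy(netD(pred), torch.ones_like(d_fake))
    loss_g = lambda_rec * loss_rec + lambda_adv * loss_adv
    loss_g.backward()
    optG.step()
    return loss_d.item(), loss_g.item()
```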


Results


Results on the Paris dataset


Results shown in the paper vs our results

(Figure: side-by-side comparison of our result, the paper's result, and the original image.)


Results on the Cat dataset

(Figure: inpainting results after the first few epochs, after 20-25k epochs, and after 40-50k epochs.)


Results over number of epochs


Timeline


Feb 1-14

First evaluation: understanding the paper, analysis.

Feb 15 - March 1

Collect data, analyse, and implement code for part 1 (Block Cut).

April 1-15

Completion of code implementation and results for part 2 (Additional).

April 15-30

Testing on different datasets and documentation of results.


Channel-wise fully-connected layer

This layer is essentially a fully-connected layer with groups, intended to propagate information within the activations of each feature map.

However, unlike a full fully-connected layer (which would need m²n⁴ parameters here), it has no parameters connecting different feature maps and only propagates information within each feature map.

Note: we mailed the author about this and received a reply that these layers are optional.

Number of parameters = m·n⁴ (excluding bias terms), for m input feature maps of size n×n mapped to m output feature maps of size n×n.
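A sketch of how such a layer could be implemented (the class name and initialization scheme are our assumptions):

```python
import torch
import torch.nn as nn

class ChannelWiseFC(nn.Module):
    """Channel-wise fully-connected layer: each of the m feature maps
    gets its own n^2 -> n^2 dense mapping, so information propagates
    within a map but not across maps (m * n^4 weights, versus
    m^2 * n^4 for a full fully-connected layer)."""
    def __init__(self, m, n):
        super().__init__()
        self.m, self.n = m, n
        # one (n^2 x n^2) weight matrix per feature map
        self.weight = nn.Parameter(torch.randn(m, n * n, n * n) * (n * n) ** -0.5)

    def forward(self, x):                    # x: (N, m, n, n)
        b = x.shape[0]
        flat = x.view(b, self.m, -1)         # (N, m, n^2)
        out = torch.einsum('bmi,mij->bmj', flat, self.weight)
        return out.view(b, self.m, self.n, self.n)
```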


THANK YOU