Context Encoders
Niharika Vadlamudi 2018122008
Sravya Vardhani S 201970008
Dhawal Sirikonda 2019201089
Contents
Abstract
Introduction
Context encoders for image generation
Loss function
Results
Timeline
Abstract
A visual feature learning algorithm based on context-based pixel prediction (predicting a region from its surroundings).
Two main parts
Better results are obtained when a reconstruction loss combined with an adversarial loss is used instead of the standard pixel-wise reconstruction loss alone.
Introduction
Model
In autoencoders, the feature we get is likely to be just a compressed version of the image, without any meaningful representation being learned.
Denoising autoencoders corrupt the input image and try to undo the corruption, but this does not require understanding the semantics of the scene, since the corruptions are localized and low-level.
Here we need to fill in large missing parts of the image, which requires a deeper semantic understanding.
[Diagram: Encoder → context of the image in a compact latent feature → Decoder.]
This task, however, is inherently multi-modal, as there are multiple plausible ways to fill in a missing region.
This problem is decoupled in the loss function by jointly training our context encoders to minimize both a reconstruction loss and an adversarial loss.
Encoder - The context of an image patch is encoded, and nearest-neighbour contexts produce patches that are semantically similar to the original patch.
Fine-tune - They validated the quality of the encoder features on a variety of image understanding tasks (classification, object recognition).
Decoder - This method is often able to fill in realistic image content.
Region Masks
The input to a context encoder is an image with one or more of its regions “dropped out”, i.e., set to zero, assuming zero-centered inputs. The removed regions could be of any shape; we present three different strategies here:
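Whatever the mask shape, corrupting the input reduces to element-wise multiplication with a binary mask. A minimal PyTorch sketch for a central square region (the mask size and helper name are our own illustrative choices):

import torch

def drop_center_region(images, hole=64):
    # images: (B, C, H, W); mask is 1 inside the dropped region, 0 elsewhere
    b, c, h, w = images.shape
    mask = torch.zeros(b, 1, h, w, device=images.device)
    top, left = (h - hole) // 2, (w - hole) // 2
    mask[:, :, top:top + hole, left:left + hole] = 1.0
    corrupted = images * (1.0 - mask)   # dropped pixels set to zero (zero-centered inputs)
    return corrupted, mask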
[Diagram: current network — the encoder and decoder produce the inpainted (fake) image, and a discriminator distinguishes it from the real image.]
Context encoders for image generation
Encoder-decoder pipeline
Pool-free encoders - They experimented with replacing all pooling layers with convolutions of the same kernel size and stride.
The overall stride of the network remains the same, but it results in finer inpainting.
Intuitively, there is no reason to use pooling for reconstruction-based networks.
However, the original AlexNet architecture (with pooling) is used for all feature-learning results.
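For illustration, a pooling layer can be swapped in PyTorch for a learned convolution with the same kernel size and stride; the channel counts below are our own assumption, not the authors' configuration:

import torch.nn as nn

# A 2x2 max-pool with stride 2 ...
pooled = nn.MaxPool2d(kernel_size=2, stride=2)
# ... replaced by a convolution with the same kernel size and stride,
# so the overall stride of the network stays the same.
pool_free = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=2, stride=2)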
Encoder
The encoder is derived from the AlexNet architecture. Given an image of size 227×227 (32×32), we use the first five convolutional layers and the following pooling layer to compute an abstract 6×6×256 (1024) dimensional feature representation.
If only convolutional layers are used, all feature maps are connected together, but there are no direct connections between locations within a specific feature map.
To handle this, a channel-wise fully-connected layer is used (described at the end).
[Diagram: Encoder — 227×227 input → 5 convolutional layers → 6×6×256 feature.]
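As a rough PyTorch sketch of such an encoder (layer hyper-parameters follow torchvision's AlexNet and are our assumption, not necessarily the exact configuration used in the paper):

import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),   # final pooling: 227x227 input -> 6x6x256 feature
        )

    def forward(self, x):                # x: (B, 3, 227, 227)
        return self.features(x)          # -> (B, 256, 6, 6)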
Decoder
It can be understood as upsampling followed by convolution, or convolution with fractional stride.
The intuition behind this is straightforward: a non-linear, weighted upsampling of the feature produced by the encoder until we roughly reach the original target size.
[Diagram: Decoder — 6×6×256 feature → 5 up-convolutional layers with ReLU activations → reconstructed image.]
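A minimal PyTorch sketch of such a decoder, assuming five up-convolutions (ConvTranspose2d) with ReLU activations; the channel counts and the final Tanh for image-range outputs are our own choices:

import torch.nn as nn

decoder = nn.Sequential(
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(True),  #  6 -> 12
    nn.ConvTranspose2d(128,  64, kernel_size=4, stride=2, padding=1), nn.ReLU(True),  # 12 -> 24
    nn.ConvTranspose2d( 64,  32, kernel_size=4, stride=2, padding=1), nn.ReLU(True),  # 24 -> 48
    nn.ConvTranspose2d( 32,  16, kernel_size=4, stride=2, padding=1), nn.ReLU(True),  # 48 -> 96
    nn.ConvTranspose2d( 16,   3, kernel_size=4, stride=2, padding=1), nn.Tanh(),      # 96 -> 192
)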
Loss Function
Loss function
Reconstruction Loss - Normalized L2 distance; it encourages the decoder to produce a rough outline of the predicted object, but does not capture high-frequency detail (see the formula after this list).
Adversarial Loss - Based on GANs: learn a generative model G of a data distribution together with an adversarial discriminative model D that provides loss gradients to the generative model.
G and D are parametric functions (e.g., deep networks), where G : Z → X maps samples from a noise distribution Z to the data distribution X.
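As a sketch, the masked reconstruction loss can be written in LaTeX as below, where F denotes the context encoder, x the input image, \hat{M} the binary mask of the dropped region (1 where pixels are removed, 0 elsewhere), and \odot element-wise multiplication:

L_{rec}(x) = \left\| \hat{M} \odot \left( x - F\left( (1 - \hat{M}) \odot x \right) \right) \right\|_2^2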
Understanding Adversarial Loss Function
Adversarial Loss
The learning procedure is a two-player game: an adversarial discriminator D takes in both the prediction of G and ground-truth samples and tries to distinguish them, while G tries to confuse D by producing samples that appear as “real” as possible (G - generated image, GT - ground truth).
[Diagram: the discriminator receives both G (the generated image) and GT (the ground truth) and produces the loss.]
The objective for the discriminator is the logistic likelihood indicating whether the input is a real sample or a predicted one:
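A LaTeX sketch of this objective, with F the context encoder and \hat{M} the region mask as defined for the reconstruction loss:

L_{adv} = \max_{D} \; \mathbb{E}_{x \in \mathcal{X}} \left[ \log D(x) + \log\left( 1 - D\left( F\left( (1 - \hat{M}) \odot x \right) \right) \right) \right]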
Joint loss
We define the overall loss function as:
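In LaTeX, with weighting hyper-parameters \lambda_{rec} and \lambda_{adv}:

L = \lambda_{rec} L_{rec} + \lambda_{adv} L_{adv}

In code this is a simple weighted sum; the weights below are what we recall the paper using for inpainting and should be treated as an assumption:

# assumed weights, heavily favouring reconstruction as in the paper's inpainting setup
lambda_rec, lambda_adv = 0.999, 0.001
loss = lambda_rec * rec_loss + lambda_adv * adv_loss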
Results
Results on the Paris dataset
Results shown in paper vs our results
[Figure: original image, result shown in the paper, and our result, side by side.]
Results on the Cat dataset
Results over number of epochs
[Figure: outputs after the first few epochs, 20-25k epochs, and 40-50k epochs.]
Timeline
April 15 to 30
Testing on different datasets and documentation of results
April 1st-15th
Completion of code implementation and results for part 2 ( Additional )
Feb 15 - March 1
Collect data, analyse, and implement code for part 1 (Block Cut)
Feb 1-14
First evaluation: understanding the paper, analysis.
Channel-wise fully-connected layer
This layer is essentially a fully-connected layer with groups, intended to propagate information within activations of each feature map.
However, unlike a fully-connected layer (which would have m^2 * n^4 parameters), it has no parameters connecting different feature maps and only propagates information within each feature map.
Note: we had mailed the author about this and received a reply that these layers were optional.
Number of parameters = m * n^4 (ignoring the bias term).
[Diagram: input — m feature maps of size n×n → channel-wise fully-connected layer → output — m feature maps of size n×n.]
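A minimal PyTorch sketch (our own illustration, not the authors' implementation) of a channel-wise fully-connected layer over m feature maps of size n×n:

import torch
import torch.nn as nn

class ChannelWiseFC(nn.Module):
    def __init__(self, m, n):
        super().__init__()
        self.m, self.n = m, n
        # m independent (n*n x n*n) weight matrices -> m * n^4 parameters (plus bias)
        self.weight = nn.Parameter(torch.randn(m, n * n, n * n) * 0.01)
        self.bias = nn.Parameter(torch.zeros(m, n * n))

    def forward(self, x):                        # x: (B, m, n, n)
        b = x.size(0)
        x = x.view(b, self.m, self.n * self.n)   # flatten each feature map
        # batched per-channel matrix multiply: no connections across feature maps
        out = torch.einsum('bmi,mij->bmj', x, self.weight) + self.bias
        return out.view(b, self.m, self.n, self.n)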
THANK YOU