1 of 23

Semantic 3D-aware portrait synthesis and manipulation using CNeRF

2 of 23

Semantic Manipulation

  • The ability to selectively edit certain semantic regions of an image is called semantic manipulation. 

  • This gives finer control over the generation of synthetic images.

[Figure: a generated input portrait with separately manipulated semantic regions – eyes, background, hair]

3 of 23

Problem Statement and Motivation

Problem Statement: 

  • Most current 3D-aware GAN methods based on neural fields model the entire image as a single overall neural radiance field, which limits the partial semantic editability of the generated results.
  • This work proposes a new framework, the Compositional Neural Radiance Field (CNeRF), which allows independent manipulation of semantic regions in the synthetic (generated) image. 

4 of 23

Problem Statement and Motivation

"Semantic 3D-aware Portrait Synthesis and Manipulation Based on Compositional Neural Radiance Field (CNeRF)" by Tianxiang Ma, Bingchuan Li, Qian He, Jing Dong, Tieniu Tan

 Link: https://arxiv.org/pdf/2302.01579v1.pdf

5 of 23

Implementation

6 of 23

Stage 1: Mapping Network

  • The mapping network generates latent codes from a noise input sampled from a standard Gaussian distribution.
  • The network is a 3-layer MLP with leaky ReLU activations and 256 channels.
  • The input is a 100-dimensional sample from a standard Gaussian distribution.
  • The output is 256 × 5, i.e., five 256-dimensional latent codes (3 for the shape network + 2 for the texture network); see the sketch below.
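
A minimal PyTorch sketch of such a mapping network, assuming the sizes given above; the class and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """3-layer MLP with leaky ReLU and 256 channels (illustrative sketch)."""
    def __init__(self, z_dim=100, hidden=256, num_codes=5):
        super().__init__()
        self.num_codes = num_codes
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden * num_codes),  # 256*5 outputs
        )

    def forward(self, z):
        w = self.net(z)                          # (B, 256*5)
        return w.view(-1, self.num_codes, 256)   # 5 codes: 3 shape + 2 texture

z = torch.randn(4, 100)   # noise sampled from a standard Gaussian
w = MappingNetwork()(z)   # (4, 5, 256)
```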

7 of 23

Stage 1: Local Generator

    • Each local generator is responsible for generating the content of one semantic region.
    • The number of generators equals the number of semantic regions.
    • Each takes as input the coordinates x, the view direction v (angles θ and φ), and the latent codes w from the mapping network. 
    • The shape and texture networks have depths 3 and 2 respectively, and both use modulated layers.
    • What are modulated networks? The activations are modulated sine functions (FiLM-SIREN, as introduced in pi-GAN): φ(x) = sin(γ · (Wx + b) + β), where the frequencies γ and phase shifts β are produced from the latent code. A sketch follows below.
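
A minimal sketch of a modulated sine (FiLM-SIREN) layer in PyTorch, assuming the pi-GAN-style formulation above; this is illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FiLMSiren(nn.Module):
    """Sine layer whose frequency γ and phase β are produced from the
    latent code w (illustrative FiLM-SIREN sketch)."""
    def __init__(self, in_dim, out_dim, w_dim=256):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.to_gamma = nn.Linear(w_dim, out_dim)  # frequencies
        self.to_beta = nn.Linear(w_dim, out_dim)   # phase shifts

    def forward(self, x, w):
        # x: (B, N, in_dim) sample points; w: (B, w_dim) latent code
        gamma = self.to_gamma(w).unsqueeze(1)      # (B, 1, out_dim)
        beta = self.to_beta(w).unsqueeze(1)
        return torch.sin(gamma * self.linear(x) + beta)
```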

8 of 23

Stage 1: Fusion & Volume Aggregation

    • Here we fuse the outputs of the individual generators. Following the standard composition operator, the fused density is the sum of the per-region densities and the fused color is their density-weighted average: σ(x) = Σ_k σ_k(x), c(x, v) = Σ_k (σ_k(x) / σ(x)) · c_k(x, v).

  • For each pixel, we sample points along the camera ray r(t) = o + t·d and aggregate them with the volume rendering equation: C(r) = Σ_i T_i (1 − exp(−σ_i δ_i)) c_i, where T_i = exp(−Σ_{j<i} σ_j δ_j) and δ_i is the distance between adjacent samples. A sketch of the fusion and aggregation follows below.
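
A sketch of the density-weighted fusion and the volume aggregation above in PyTorch; the shapes and the fusion rule are assumptions based on standard compositional NeRF practice, not the authors' exact code.

```python
import torch

def fuse_and_render(sigmas, colors, deltas):
    """Fuse K local radiance fields and integrate along rays (sketch).

    sigmas: (K, B, R, S)    per-region densities at S samples on R rays
    colors: (K, B, R, S, 3) per-region colors
    deltas: (B, R, S)       distances between adjacent samples
    """
    sigma = sigmas.sum(dim=0)                                # fused density
    weights_k = sigmas / sigma.clamp(min=1e-8).unsqueeze(0)  # per-region weights
    color = (weights_k.unsqueeze(-1) * colors).sum(dim=0)    # fused color

    alpha = 1.0 - torch.exp(-sigma * deltas)                 # opacity per sample
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]),
                   1.0 - alpha + 1e-10], dim=-1), dim=-1)[..., :-1]  # T_i
    w = alpha * trans                                        # render weights
    return (w.unsqueeze(-1) * color).sum(dim=-2)             # (B, R, 3) pixels
```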

9 of 23

Stage 1: Global and Semantic Discriminators

  • The global discriminator (GD) judges the overall rendering result, while the semantic discriminator (SD) judges the individual semantic parts.
  • The GD takes as input the rendered 2D mask and 2D color image, and outputs a predicted real/fake label as well as the estimated view direction of the generated input image.

10 of 23

Stage 1: Global and Semantic Discriminators

  • The SD is designed to enhance semantic region disentanglement. In each training iteration, the model randomly selects one of the k semantic categories and extracts the corresponding region of the 2D color image based on the generated 2D mask. The SD learns to judge whether the extracted semantic part is real and to predict its semantic category; see the sketch below.
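
A minimal sketch of the region extraction feeding the SD, assuming integer-labeled masks; the names and shapes are illustrative.

```python
import torch

def extract_semantic_region(image, seg_mask, k):
    """Keep only the k-th semantic region of a color image, using the
    rendered segmentation mask (illustrative sketch).

    image:    (B, 3, H, W) rendered color image
    seg_mask: (B, H, W)    per-pixel semantic labels in [0, K)
    """
    region = (seg_mask == k).unsqueeze(1).float()  # (B, 1, H, W) binary mask
    return image * region                          # non-k pixels zeroed out

# Each training iteration: pick a random category, feed its region to the SD.
B, K, H, W = 4, 5, 64, 64
image = torch.rand(B, 3, H, W)
seg = torch.randint(0, K, (B, H, W))
k = torch.randint(0, K, (1,)).item()
sd_input = extract_semantic_region(image, seg, k)
```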

11 of 23

Experiments#1

  1. Built the low-resolution CNeRF – Stage 1 
  2. Trained the low-resolution CNeRF – Stage 1 

Issues with training:

  1. The Stage-1 model, the low-resolution CNeRF (64 × 64), appears to face non-convergence issues.
  2. The discriminator seems to dominate. The next slides show plots of the various losses.

Note: Training ran for nearly 5,000 iterations on a training set of 100 images. In this setup I ran a few experiments varying the weight of the R1 regularization loss (see the sketch below), but it did not have much effect.
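
For reference, the R1 penalty tuned here is the standard gradient penalty on the discriminator's real inputs; a minimal sketch, with an illustrative weight:

```python
import torch

def r1_penalty(d_real, real_images, weight=10.0):
    """R1 regularization: squared gradient norm of the discriminator
    output w.r.t. real images (common formulation; weight is illustrative)."""
    grad, = torch.autograd.grad(
        outputs=d_real.sum(), inputs=real_images, create_graph=True)
    return weight * 0.5 * grad.pow(2).reshape(grad.shape[0], -1).sum(1).mean()

# Note: real_images must have requires_grad_(True) before the D forward pass.
```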

12 of 23

Experiments#1 - Losses

This is a plot of the global discriminator loss, the semantic (local) discriminator loss, and the generator loss. Note: we use the total loss suggested in the paper.

13 of 23

Experiments#1 - Losses

This is a plot of the global discriminator's GAN loss on fake images, its GAN loss on real images, and the generator's GAN loss (for overall image generation).

14 of 23

Experiments#1 – Intermediate results

[Figure: generated images and segmentation maps after 1000, 1500, 2500, 3500, and 4500 iterations]

15 of 23

Experiments#2 – Modified the Original Repo

  1. Modified the original repo
  2. 40K images as training data – FFHQ dataset
  3. Nearly 100K iterations of training (the paper recommends 300K)
  4. Training took 56 hours for 100K iterations
  5. Batch size of 4 samples

16 of 23

Experiments#2 – Results

[Figure: generated results at iteration 0 and iteration 10,000]

17 of 23

Experiments#2 – Results

[Figure: generated results at iteration 60,000 and iteration 90,000]

18 of 23

Results – diversity

19 of 23

Results – view consistency

20 of 23

Results – semantic maps

21 of 23

Results – semantic manipulation

Latent codes are edited by moving along a semantic direction: W_new = W_old + λ * Direction, where Direction is an editing direction found in latent space and λ controls the edit strength. A sketch follows below.
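
A minimal sketch of this latent edit; `direction` here is a placeholder for a semantic direction found in latent space, not a value from the paper.

```python
import torch

def manipulate(w_old, direction, lam):
    """Move a latent code along a semantic editing direction:
    w_new = w_old + lambda * direction (the slide's formula)."""
    return w_old + lam * direction

w_old = torch.randn(1, 256)      # latent code of one semantic region
direction = torch.randn(1, 256)  # illustrative placeholder direction
w_new = manipulate(w_old, direction, lam=1.5)
```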

22 of 23

Results – semantic manipulation

23 of 23

Results – semantic manipulation