1 of 60

Implicit Neural Representation Tutorial

Zhenyu Jiang

ICRA 2021, Xi’an, China

2 of 60

Signal Representations

How do we represent signals?

  1. Image -> Discrete pixels
  2. 3D shape -> Voxels, point clouds, meshes
  3. Sound wave -> Discrete samples of intensity

We lose detail when representing signals in a discrete manner!

Sitzmann, Vincent, et al. "Implicit neural representations with periodic activation functions." NeurIPS 2020.

3 of 60

Implicit Neural Representation

Instead of representing signals in a discrete manner, a new approach called implicit neural representation has been studied.

Explicit representation

Spatial coordinate (x, y, z) in N^3 (discrete grid) -> Occupancy

1. The explicit way tends to lose details

2. Memory expensive (scales with resolution)

Implicit representation

Spatial coordinate (x, y, z) in R^3 (continuous) -> Occupancy
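The contrast above can be sketched in code. Here a hand-written sphere stands in for a trained network: the implicit function answers occupancy queries at any real coordinate, while the explicit voxel grid must store values at a fixed resolution (all names are illustrative):

```python
# A minimal sketch contrasting explicit (discretized) and implicit
# (continuous) representations of the same shape: a unit sphere.
# The analytic sphere test is a stand-in for a trained network.

def occupancy(x, y, z):
    """Implicit representation: occupancy at ANY real coordinate."""
    return 1.0 if x*x + y*y + z*z <= 1.0 else 0.0

def voxelize(resolution):
    """Explicit representation: sample occupancy on a fixed grid.
    Memory grows as O(resolution^3)."""
    step = 2.0 / resolution  # grid spans [-1, 1] on each axis
    return [[[occupancy(-1 + (i + 0.5) * step,
                        -1 + (j + 0.5) * step,
                        -1 + (k + 0.5) * step)
              for k in range(resolution)]
             for j in range(resolution)]
            for i in range(resolution)]

# The implicit function answers queries at arbitrary coordinates...
print(occupancy(0.3, 0.4, 0.5))   # inside the sphere -> 1.0
# ...while the explicit grid stores resolution**3 values regardless.
grid = voxelize(8)
print(len(grid) ** 3)             # 512 values stored at resolution 8
```

In the real setting the analytic test is replaced by a neural network, so the memory cost is the (fixed) number of network weights rather than a grid that grows cubically with resolution.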

4 of 60

Applications

Implicit neural representations are applicable in a variety of fields:

  1. Image/video/audio processing
  2. 3D data processing
  3. Solving boundary value problems

5 of 60

Overview

1. 3D Shape Representation


2. Structured Implicit Functions


3. Neural Rendering

6 of 60

Implicit Functions for 3D Reconstruction

7 of 60

3D Representation

Mescheder, Lars, et al. "Occupancy networks: Learning 3d reconstruction in function space." CVPR 2019.

8 of 60

3D Representation

 

9 of 60

3D Representation

Point Cloud

Pros

  • Flexible and memory efficient

Cons

  • No topology
  • Limited number of points

10 of 60

3D Representation

Mesh

Pros

  • Flexible and memory efficient
  • Well-defined topology

Cons

  • Need template or primitive
  • Restricted to specific domain
  • Hard to process with neural networks

11 of 60

3D Representation

What we want: a representation that

  • Can represent meshes of arbitrary topology at arbitrary resolution,
  • Is not restricted to a Manhattan world,
  • Is not limited by excessive memory requirements,
  • Preserves connectivity information,
  • Is not restricted to a specific domain (e.g., an object class), and
  • Blends well with deep learning techniques.

12 of 60

Implicit occupancy field

(Figure: a spatial coordinate and a feature vector are fed to a neural network, which outputs the occupancy at that point.)

13 of 60

Occupancy Networks: Training objective

(Figure: an encoder maps the input observation to a feature vector; conditioned on this feature, the neural network predicts occupancy at sampled points and is trained against the ground-truth occupancy.)

14 of 60

Marching Cubes

  1. Divide the input volume into a discrete set of cubes.
  2. For each cube containing a section of the iso-surface, generate a triangular mesh that approximates the behavior of the trilinear interpolant in the cube's interior.
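The first step can be sketched as follows. The scalar field here is a hand-written sphere SDF standing in for real data, and the per-cube triangulation (the lookup-table step) is omitted; a cube is flagged as containing the iso-surface when its corner values straddle the iso-level:

```python
# Sketch of the first Marching Cubes step: divide the volume into
# cubes and flag those the iso-surface passes through, i.e. cubes
# whose corner values disagree about inside vs. outside.

def sdf(x, y, z):
    # stand-in scalar field: signed distance to a sphere of radius 0.6
    return (x*x + y*y + z*z) ** 0.5 - 0.6

def surface_cubes(resolution, level=0.0):
    step = 2.0 / resolution   # volume spans [-1, 1] on each axis
    corners = [(dx, dy, dz) for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)]
    crossing = []
    for i in range(resolution):
        for j in range(resolution):
            for k in range(resolution):
                signs = set()
                for dx, dy, dz in corners:
                    v = sdf(-1 + (i + dx) * step,
                            -1 + (j + dy) * step,
                            -1 + (k + dz) * step)
                    signs.add(v > level)
                if len(signs) == 2:   # corners disagree: surface inside
                    crossing.append((i, j, k))
    return crossing

cubes = surface_cubes(16)
print(len(cubes))   # only a thin shell of cubes touches the surface
```

Only the flagged cubes need meshing, which is why the surface-crossing set stays far smaller than the full voxel count.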

15 of 60

Occupancy Networks: Mesh Extraction

Multiresolution Iso-surface Extraction

  1. Build an octree by incrementally querying the occupancy network.
  2. Extract a triangular mesh using marching cubes.
  3. Refine via first- and second-order gradient information.

16 of 60

Memory Complexity

17 of 60

Single Image 3D Reconstruction

18 of 60

Testing on Real Data

19 of 60

DeepSDF

Occupancy

Whether a point is occupied by the object

Signed Distance Function

Signed distance to nearest surface

Park, Jeong Joon, et al. "DeepSDF: Learning continuous signed distance functions for shape representation." CVPR 2019.
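The occupancy/SDF contrast above can be made concrete with a sphere of radius 1 (an analytic stand-in for a learned function): occupancy is a binary inside/outside label, while the SDF additionally says how far the surface is and on which side the query point lies:

```python
import math

# Occupancy vs. signed distance for the same shape (a unit sphere).

def occupancy(p):
    """1.0 if p is inside or on the sphere, else 0.0."""
    return 1.0 if math.dist(p, (0, 0, 0)) <= 1.0 else 0.0

def sdf(p):
    """Negative inside, zero on the surface, positive outside."""
    return math.dist(p, (0, 0, 0)) - 1.0

p = (0.0, 0.0, 0.5)
print(occupancy(p))   # 1.0  (inside)
print(sdf(p))         # -0.5 (0.5 units inside the surface)
```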

20 of 60

Single Shape DeepSDF

(Figure: a spatial coordinate (x, y, z) is fed to a neural network, which outputs the SDF value at that point.)

21 of 60

Coded Shape DeepSDF

(Figure: a spatial coordinate and a latent feature code are fed to a neural network, which outputs the SDF value; one latent code represents one shape.)

22 of 60

Auto-decoder-based DeepSDF

Auto-decoder

Each latent code z is paired with a training shape X.

Posterior over z: p_θ(z_i | X_i) ∝ p(z_i) Π_j p_θ(s_j | z_i; x_j)

SDF likelihood: p_θ(s_j | z_i; x_j) = exp(-L(f_θ(z_i, x_j), s_j))

Optimization: argmin over θ, {z_i} of Σ_i [ Σ_j L(f_θ(z_i, x_j), s_j) + (1/σ²) ||z_i||² ]

23 of 60

Auto-decoder-based DeepSDF

Auto-decoder

Training

  • A randomly initialized latent vector is assigned to each training shape.
  • Both the latent code and the decoder weights are optimized.

Inference

  • The decoder weights are fixed, and an optimal latent vector is estimated.
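The inference step can be sketched with a toy linear decoder so the gradient is analytic (all names and shapes are illustrative; DeepSDF itself uses an MLP and backpropagation, plus a latent-code prior term omitted here):

```python
import numpy as np

# Toy sketch of auto-decoder inference: the decoder weights are frozen,
# and we recover a latent code z for a new shape purely by gradient
# descent on the reconstruction loss over observed SDF samples.
# The "decoder" is linear here (sdf_hat = F @ z), standing in for an MLP.

rng = np.random.default_rng(0)
F = rng.normal(size=(50, 4))               # frozen decoder (one row per sample)
z_true = np.array([1.0, -2.0, 0.5, 3.0])   # latent code of the new shape
sdf_obs = F @ z_true                       # observed SDF samples of that shape

z = np.zeros(4)                            # initialize the latent vector
lr = 0.01
for _ in range(3000):
    residual = F @ z - sdf_obs
    z -= lr * (F.T @ residual) / len(sdf_obs)   # descend the squared-error loss

print(np.round(z, 2))                      # close to z_true
```

The same loop, with the decoder weights also updated, is the training procedure described above.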

24 of 60

Shape Reconstruction

25 of 60

Shape Completion

26 of 60

ONet VS DeepSDF

Similarities

  • Implicit representations
  • Decoder architecture
  • Mesh extraction

Differences

Training data sampling

  • ONet: uniform samples
  • DeepSDF: uniform + near-surface samples

Condition on input

  • ONet: encoder. Can handle any input modality, but needs training for each.
  • DeepSDF: auto-decoder. Can only handle inputs that provide SDF-related information, but the decoder needs no re-training.

27 of 60

Structured Implicit Functions

28 of 60

Limitation of ONet

Peng, Songyou, et al. "Convolutional occupancy networks." ECCV 2020.

29 of 60

ConvONets

ONet

  • Encodes the input into a single global feature.
  • Limited expressiveness.

ConvONets

  • Encode the input into a structured feature grid.
  • Exploit convolutions to process features.
  • Query local features for occupancy prediction.

30 of 60

Encoders

  1. Take 3D input: point clouds or voxels.
  2. Convert the inputs into point features.
  3. Project point features onto one or multiple plane grids (Plane Encoder) or into a volume grid (Volume Encoder).
  4. Average the features that fall into the same pixel/voxel cell and use the result as the pixel/voxel feature.
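Steps 3–4 (project, then average per cell) can be sketched as a scatter-mean onto one plane; the function name and shapes are illustrative, and the random per-point features stand in for those a PointNet-style encoder would produce:

```python
import numpy as np

# Sketch of the ConvONet plane-encoder step: project point features
# onto the xy-plane grid and average the features landing in each cell.

def plane_encode(points, feats, res):
    """points: (N, 3) in [0, 1); feats: (N, C). Returns (res, res, C)."""
    C = feats.shape[1]
    grid_sum = np.zeros((res, res, C))
    grid_cnt = np.zeros((res, res, 1))
    # project to the xy-plane: drop z, discretize x and y to cell indices
    ij = np.minimum((points[:, :2] * res).astype(int), res - 1)
    for (i, j), f in zip(ij, feats):
        grid_sum[i, j] += f
        grid_cnt[i, j] += 1
    return grid_sum / np.maximum(grid_cnt, 1)   # average pooling per cell

rng = np.random.default_rng(0)
pts = rng.random((100, 3))          # stand-in point cloud
feats = rng.normal(size=(100, 8))   # stand-in per-point features
grid = plane_encode(pts, feats, res=4)
print(grid.shape)                   # (4, 4, 8)
```

The resulting grid is what the 2D U-Net in the decoder then processes.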

31 of 60

Decoders

  1. Process the structured feature grids with 2D/3D U-Nets.
  2. Query the local feature for input x at point p (bilinear/trilinear sampling).
  3. Predict occupancy conditioned on the local feature.
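The local feature query in step 2 can be sketched as plain bilinear interpolation of the four cells surrounding a continuous query location (names are illustrative):

```python
import numpy as np

# Sketch of the local feature query: sample a processed feature plane
# at a continuous location (u, v) by bilinear interpolation.

def bilinear(plane, u, v):
    """plane: (H, W, C); (u, v) are continuous coords in cell units."""
    H, W, _ = plane.shape
    i0, j0 = int(np.floor(u)), int(np.floor(v))
    i1, j1 = min(i0 + 1, H - 1), min(j0 + 1, W - 1)
    du, dv = u - i0, v - j0
    return ((1 - du) * (1 - dv) * plane[i0, j0] +
            (1 - du) * dv       * plane[i0, j1] +
            du       * (1 - dv) * plane[i1, j0] +
            du       * dv       * plane[i1, j1])

# Toy plane where cell (i, j) stores the value 4*i + j:
plane = np.arange(16, dtype=float).reshape(4, 4, 1)
print(bilinear(plane, 1.5, 1.5))   # [7.5]: average of cells 5, 6, 9, 10
```

Trilinear sampling of a volume grid follows the same pattern with eight corner cells instead of four.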

32 of 60

Quantitative Results

  1. Better IoU than ONet with less GPU memory usage.
  2. Three 2D planes work better than a 3D volume, with less GPU memory usage.

33 of 60

Object Reconstruction

34 of 60

Voxel Super-resolution

35 of 60

Scene Reconstruction

36 of 60

Patch-based Scene Reconstruction

  1. Structured feature grid representations are translation equivariant.
  2. Crop the large input into patches and process each patch separately.

37 of 60

PIFu: Pixel-Aligned Implicit Function for

High-Resolution Clothed Human Digitization

Saito, Shunsuke, et al. "PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization." ICCV 2019.

38 of 60

Surface reconstruction from image

 

39 of 60

Texture reconstruction from image

 

40 of 60

Inference

41 of 60

Multi-view PIFu

 

42 of 60

Multi-view Results

43 of 60

Implicit Function for Neural Rendering

44 of 60

Neural Radiance Field

Mildenhall, Ben, et al. "NeRF: Representing scenes as neural radiance fields for view synthesis." ECCV 2020.

45 of 60

Problem and Approach

Novel view synthesis: given one or more images of a scene, synthesize images from novel viewpoints.

NeRF first reconstructs a neural radiance field of the scene, then renders the reconstructed scene to produce novel view images.

46 of 60

Neural radiance field with implicit functions

 

47 of 60

Volume Rendering

  1. Each pixel on the image corresponds to a ray through the volume.
  2. Sample points inside the volume along the ray.
  3. Accumulate the color and density of the sampled points to get the color of the pixel.
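The accumulation in step 3 follows the standard volume rendering quadrature, C = Σ_i T_i (1 - exp(-σ_i δ_i)) c_i with transmittance T_i = exp(-Σ_{j<i} σ_j δ_j). A sketch for one ray, with hand-picked densities and colors standing in for network outputs:

```python
import numpy as np

# Volume rendering quadrature along one ray: per-sample opacity
# alpha_i = 1 - exp(-sigma_i * delta_i), transmittance T_i is the
# product of (1 - alpha_j) for samples in front, and the pixel color
# is the transmittance-weighted sum of sample colors.

def render_ray(sigmas, colors, deltas):
    alpha = 1.0 - np.exp(-sigmas * deltas)                          # opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]   # T_i
    weights = trans * alpha
    return weights @ colors                                          # pixel color

sigmas = np.array([0.0, 10.0, 10.0])   # empty space, then dense matter
colors = np.array([[1.0, 0.0, 0.0],    # red (never seen: density is 0)
                   [0.0, 1.0, 0.0],    # green (hit first, nearly opaque)
                   [0.0, 0.0, 1.0]])   # blue (occluded by the green sample)
deltas = np.array([0.5, 0.5, 0.5])     # spacing between samples
print(render_ray(sigmas, colors, deltas))   # mostly green
```

The empty first sample contributes nothing, and the occluded blue sample contributes almost nothing, which is exactly the occlusion behavior the renderer needs.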

48 of 60

Volume Rendering in NeRF

  1. Query the color and density of the sampled points with the learned implicit function.
  2. Color is view-dependent while density is not.
  3. Use a pixel-level loss between the rendered view and the ground truth to optimize the implicit function.

49 of 60

Differentiable Volume Rendering - Formulation

 

50 of 60

Differentiable Volume Rendering - Implementation

 

51 of 60

Technique – Positional encoding

  1. Directly passing coordinates and viewing angles into the implicit function (a neural network) performs poorly at representing high-frequency variation in color and geometry.
  2. Instead, first map the inputs into a high-dimensional space using sinusoidal (sine and cosine) functions.
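The mapping is γ(p) = (sin(2⁰πp), cos(2⁰πp), …, sin(2^(L-1)πp), cos(2^(L-1)πp)), applied to each coordinate separately. A minimal sketch:

```python
import math

# NeRF-style positional encoding of a single scalar coordinate p:
# sin/cos pairs at L exponentially spaced frequencies, lifting the
# input into a higher-dimensional space so the MLP can fit
# high-frequency detail.

def positional_encoding(p, L):
    out = []
    for k in range(L):
        freq = (2.0 ** k) * math.pi
        out += [math.sin(freq * p), math.cos(freq * p)]
    return out

gamma = positional_encoding(0.5, L=4)
print(len(gamma))   # 8 values: sin/cos pairs at 4 frequencies
```

With L = 10 for the three position coordinates (as in the paper), each 3D point becomes a 60-dimensional input to the network.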

52 of 60

Technique – Hierarchical volume rendering

  1. First sample a set of points along the ray using stratified sampling.
  2. Given the coarse network's output, compute the weight of each point.
  3. Normalize the weights to produce a piecewise-constant PDF along the ray.
  4. Samples drawn from this distribution are biased towards the relevant parts of the volume.
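Steps 3–4 can be sketched as inverse-transform sampling from the piecewise-constant PDF (names and the toy weights are illustrative):

```python
import numpy as np

# Hierarchical sampling sketch: normalize coarse weights into a
# piecewise-constant PDF over ray bins, then draw fine samples by
# inverting its CDF. High-weight bins (likely surface regions)
# receive most of the new samples.

def sample_pdf(bin_edges, weights, n_samples, rng):
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.random(n_samples)                       # uniform draws in [0, 1)
    idx = np.searchsorted(cdf, u, side='right') - 1 # which bin each u lands in
    left, right = bin_edges[idx], bin_edges[idx + 1]
    t = (u - cdf[idx]) / pdf[idx]                   # position within the bin
    return left + t * (right - left)

rng = np.random.default_rng(0)
edges = np.linspace(0.0, 1.0, 6)                   # 5 coarse bins on the ray
weights = np.array([0.01, 0.02, 0.9, 0.05, 0.02])  # surface around bin 3
samples = sample_pdf(edges, weights, 64, rng)
# most fine samples fall in the high-weight bin [0.4, 0.6)
print(((samples >= 0.4) & (samples < 0.6)).mean())
```

The fine network is then evaluated at the union of coarse and fine samples, concentrating computation near the surface.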

53 of 60

Optimization

  1. Loss: total squared error between the rendered and true pixel colors, for both the coarse and fine renderings.
  2. Needs over 100 images for training.
  3. Convergence: a single scene takes 1-2 days on a single NVIDIA V100.

54 of 60

View dependent radiance

55 of 60

Synthetic scenes

56 of 60

Real world scenes

57 of 60

Summary

Implicit Neural Representations:

  1. Effective output representation for shape, appearance, and material
  2. No discretization; can model arbitrary topology
  3. Structured implicit functions improve accuracy
  4. Can be learned from images via differentiable rendering
  5. Many applications: 3D reconstruction, view synthesis, etc.

58 of 60

Discussion

  1. Geometry must be extracted in a post-processing step, which is slow and produces artifacts. (Possible remedies: hybrid representations, Neural Marching Cubes, ...)

  2. What are the applications of implicit neural representations in robotics? (Representing affordances, local ad-hoc queries, ...)

59 of 60

More on implicit neural representations

  • 3D reconstruction
    • IM-Net https://arxiv.org/abs/1812.02822
  • Structured implicit function
  • Neural rendering
    • Scene Representation Networks https://arxiv.org/pdf/1906.01618
    • Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision https://arxiv.org/abs/1912.07372

60 of 60

More on implicit neural representations

  • Affordance representation