1 of 60

Implicit Neural Representation Tutorial

Zhenyu Jiang

ICRA 2021, Xi’an, China

2 of 60

Signal Representations

How do we represent signals?

  1. Image -> Discrete pixels
  2. 3D shape -> Voxels, point clouds, meshes
  3. Sound wave -> Discrete samples of intensity

We lose detail when representing signals in a discrete manner!

Sitzmann, Vincent, et al. "Implicit neural representations with periodic activation functions." NeurIPS 2020.

3 of 60

Implicit Neural Representation

Instead of representing signals in a discrete manner, a new approach called implicit neural representation has been studied.

Explicit representation

Spatial coordinate (x, y, z) in N^3 (discrete grid) -> Occupancy

1. The explicit way tends to lose details

2. Memory expensive (scales with resolution)

Implicit representation

Spatial coordinate (x, y, z) in R^3 (continuous) -> Occupancy
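The contrast above can be sketched in code. Here a hand-written sphere stands in for a trained network: the implicit function answers occupancy queries at any real coordinate, while the explicit voxel grid must store values at a fixed resolution (all names are illustrative):

```python
# A minimal sketch contrasting explicit (discretized) and implicit
# (continuous) representations of the same shape: a unit sphere.
# The analytic sphere test is a stand-in for a trained network.

def occupancy(x, y, z):
    """Implicit representation: occupancy at ANY real coordinate."""
    return 1.0 if x*x + y*y + z*z <= 1.0 else 0.0

def voxelize(resolution):
    """Explicit representation: sample occupancy on a fixed grid.
    Memory grows as O(resolution^3)."""
    step = 2.0 / resolution  # grid spans [-1, 1] on each axis
    return [[[occupancy(-1 + (i + 0.5) * step,
                        -1 + (j + 0.5) * step,
                        -1 + (k + 0.5) * step)
              for k in range(resolution)]
             for j in range(resolution)]
            for i in range(resolution)]

# The implicit function answers queries at arbitrary coordinates...
print(occupancy(0.3, 0.4, 0.5))   # inside the sphere -> 1.0
# ...while the explicit grid stores resolution**3 values regardless.
grid = voxelize(8)
print(len(grid) ** 3)             # 512 values stored at resolution 8
```

In the real setting the analytic test is replaced by a neural network, so the memory cost is the (fixed) number of network weights rather than a grid that grows cubically with resolution.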

4 of 60

Applications

Implicit neural representations are applicable in a variety of fields:

  1. Image/video/audio processing
  2. 3D data processing
  3. Solving boundary value problems

5 of 60

Overview

1. 3D Shape Representation


2. Structured Implicit Functions


3. Neural Rendering

6 of 60

Implicit Functions for 3D Reconstruction

7 of 60

3D Representation

Mescheder, Lars, et al. "Occupancy networks: Learning 3d reconstruction in function space." CVPR 2019.

8 of 60

3D Representation

 

9 of 60

3D Representation

Point Cloud

Pros

  • Flexible and memory efficient

Cons

  • No topology
  • Limited number of points

10 of 60

3D Representation

Mesh

Pros

  • Flexible and memory efficient
  • Well-defined topology

Cons

  • Need template or primitive
  • Restricted to specific domain
  • Hard to process with neural networks

11 of 60

3D Representation

What we want: a representation that

  • Can represent meshes of arbitrary topology at arbitrary resolution,
  • Is not restricted to a Manhattan world,
  • Is not limited by excessive memory requirements,
  • Preserves connectivity information,
  • Is not restricted to a specific domain (e.g., an object class), and
  • Blends well with deep learning techniques.

12 of 60

Implicit occupancy field

(Figure: a spatial coordinate and a feature vector are fed to a neural network, which outputs the occupancy at that point.)

13 of 60

Occupancy Networks: Training objective

(Figure: an encoder maps the input observation to a feature vector; conditioned on this feature, the neural network predicts occupancy at sampled points and is trained against the ground-truth occupancy.)

14 of 60

Marching Cubes

  1. Divide the input volume into a discrete set of cubes.
  2. For each cube containing a section of the iso-surface, generate a triangular mesh that approximates the behavior of the trilinear interpolant in the cube's interior.
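The first step can be sketched as follows. The scalar field here is a hand-written sphere SDF standing in for real data, and the per-cube triangulation (the lookup-table step) is omitted; a cube is flagged as containing the iso-surface when its corner values straddle the iso-level:

```python
# Sketch of the first Marching Cubes step: divide the volume into
# cubes and flag those the iso-surface passes through, i.e. cubes
# whose corner values disagree about inside vs. outside.

def sdf(x, y, z):
    # stand-in scalar field: signed distance to a sphere of radius 0.6
    return (x*x + y*y + z*z) ** 0.5 - 0.6

def surface_cubes(resolution, level=0.0):
    step = 2.0 / resolution   # volume spans [-1, 1] on each axis
    corners = [(dx, dy, dz) for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)]
    crossing = []
    for i in range(resolution):
        for j in range(resolution):
            for k in range(resolution):
                signs = set()
                for dx, dy, dz in corners:
                    v = sdf(-1 + (i + dx) * step,
                            -1 + (j + dy) * step,
                            -1 + (k + dz) * step)
                    signs.add(v > level)
                if len(signs) == 2:   # corners disagree: surface inside
                    crossing.append((i, j, k))
    return crossing

cubes = surface_cubes(16)
print(len(cubes))   # only a thin shell of cubes touches the surface
```

Only the flagged cubes need meshing, which is why the surface-crossing set stays far smaller than the full voxel count.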

15 of 60

Occupancy Networks: Mesh Extraction

Multiresolution Iso-surface Extraction

  1. Build an octree by incrementally querying the occupancy network.
  2. Extract a triangular mesh using marching cubes.
  3. Refine via first- and second-order gradient information.

16 of 60

Memory Complexity

17 of 60

Single Image 3D Reconstruction

18 of 60

Testing on Real Data

19 of 60

DeepSDF

Occupancy

Whether a point is occupied by the object

Signed Distance Function

Signed distance to nearest surface

Park, Jeong Joon, et al. "DeepSDF: Learning continuous signed distance functions for shape representation." CVPR 2019.
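The occupancy/SDF contrast above can be made concrete with a sphere of radius 1 (an analytic stand-in for a learned function): occupancy is a binary inside/outside label, while the SDF additionally says how far the surface is and on which side the query point lies:

```python
import math

# Occupancy vs. signed distance for the same shape (a unit sphere).

def occupancy(p):
    """1.0 if p is inside or on the sphere, else 0.0."""
    return 1.0 if math.dist(p, (0, 0, 0)) <= 1.0 else 0.0

def sdf(p):
    """Negative inside, zero on the surface, positive outside."""
    return math.dist(p, (0, 0, 0)) - 1.0

p = (0.0, 0.0, 0.5)
print(occupancy(p))   # 1.0  (inside)
print(sdf(p))         # -0.5 (0.5 units inside the surface)
```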

20 of 60

Single Shape DeepSDF

(Figure: a spatial coordinate (x, y, z) is fed to a neural network, which outputs the SDF value at that point.)

21 of 60

Coded Shape DeepSDF

(Figure: a spatial coordinate and a latent feature code are fed to a neural network, which outputs the SDF value; one latent code represents one shape.)

22 of 60

Auto-decoder-based DeepSDF

Auto-decoder

Each latent code z is paired with a training shape X.

Posterior over z: p_θ(z_i | X_i) ∝ p(z_i) Π_j p_θ(s_j | z_i; x_j)

SDF likelihood: p_θ(s_j | z_i; x_j) = exp(-L(f_θ(z_i, x_j), s_j))

Optimization: argmin over θ, {z_i} of Σ_i [ Σ_j L(f_θ(z_i, x_j), s_j) + (1/σ²) ||z_i||² ]

23 of 60

Auto-decoder-based DeepSDF

Auto-decoder

Training

  • A randomly initialized latent vector is assigned to each training shape.
  • Both the latent code and the decoder weights are optimized.

Inference

  • The decoder weights are fixed, and an optimal latent vector is estimated.
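The inference step can be sketched with a toy linear decoder so the gradient is analytic (all names and shapes are illustrative; DeepSDF itself uses an MLP and backpropagation, plus a latent-code prior term omitted here):

```python
import numpy as np

# Toy sketch of auto-decoder inference: the decoder weights are frozen,
# and we recover a latent code z for a new shape purely by gradient
# descent on the reconstruction loss over observed SDF samples.
# The "decoder" is linear here (sdf_hat = F @ z), standing in for an MLP.

rng = np.random.default_rng(0)
F = rng.normal(size=(50, 4))               # frozen decoder (one row per sample)
z_true = np.array([1.0, -2.0, 0.5, 3.0])   # latent code of the new shape
sdf_obs = F @ z_true                       # observed SDF samples of that shape

z = np.zeros(4)                            # initialize the latent vector
lr = 0.01
for _ in range(3000):
    residual = F @ z - sdf_obs
    z -= lr * (F.T @ residual) / len(sdf_obs)   # descend the squared-error loss

print(np.round(z, 2))                      # close to z_true
```

The same loop, with the decoder weights also updated, is the training procedure described above.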

24 of 60

Shape Reconstruction

25 of 60

Shape Completion

26 of 60

ONet VS DeepSDF

Similarities

  • Implicit representations
  • Decoder architecture
  • Mesh extraction

Differences

Training data sampling

  • ONet: uniform samples
  • DeepSDF: uniform + near-surface samples

Condition on input

  • ONet: encoder. Can handle any input modality, but needs training for each.
  • DeepSDF: auto-decoder. Can only handle inputs that provide SDF-related information, but the decoder needs no re-training.

27 of 60

Structured Implicit Functions

28 of 60

Limitation of ONet

Peng, Songyou, et al. "Convolutional occupancy networks." ECCV 2020.

29 of 60

ConvONets

ONet

  • Encodes the input into a single global feature.
  • Limited expressiveness.

ConvONets

  • Encode the input into a structured feature grid.
  • Exploit convolutions to process features.
  • Query local features for occupancy prediction.

30 of 60

Encoders

  1. Take 3D input: point clouds or voxels.
  2. Convert the inputs into point features.
  3. Project point features onto one or multiple plane grids (Plane Encoder) or into a volume grid (Volume Encoder).
  4. Average the features that fall into the same pixel/voxel cell and use the result as the pixel/voxel feature.
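Steps 3–4 (project, then average per cell) can be sketched as a scatter-mean onto one plane; the function name and shapes are illustrative, and the random per-point features stand in for those a PointNet-style encoder would produce:

```python
import numpy as np

# Sketch of the ConvONet plane-encoder step: project point features
# onto the xy-plane grid and average the features landing in each cell.

def plane_encode(points, feats, res):
    """points: (N, 3) in [0, 1); feats: (N, C). Returns (res, res, C)."""
    C = feats.shape[1]
    grid_sum = np.zeros((res, res, C))
    grid_cnt = np.zeros((res, res, 1))
    # project to the xy-plane: drop z, discretize x and y to cell indices
    ij = np.minimum((points[:, :2] * res).astype(int), res - 1)
    for (i, j), f in zip(ij, feats):
        grid_sum[i, j] += f
        grid_cnt[i, j] += 1
    return grid_sum / np.maximum(grid_cnt, 1)   # average pooling per cell

rng = np.random.default_rng(0)
pts = rng.random((100, 3))          # stand-in point cloud
feats = rng.normal(size=(100, 8))   # stand-in per-point features
grid = plane_encode(pts, feats, res=4)
print(grid.shape)                   # (4, 4, 8)
```

The resulting grid is what the 2D U-Net in the decoder then processes.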

31 of 60

Decoders

  1. Process the structured feature grids with 2D/3D U-Nets.
  2. Query the local feature for input x at point p (bilinear/trilinear sampling).
  3. Predict occupancy conditioned on the local feature.
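The local feature query in step 2 can be sketched as plain bilinear interpolation of the four cells surrounding a continuous query location (names are illustrative):

```python
import numpy as np

# Sketch of the local feature query: sample a processed feature plane
# at a continuous location (u, v) by bilinear interpolation.

def bilinear(plane, u, v):
    """plane: (H, W, C); (u, v) are continuous coords in cell units."""
    H, W, _ = plane.shape
    i0, j0 = int(np.floor(u)), int(np.floor(v))
    i1, j1 = min(i0 + 1, H - 1), min(j0 + 1, W - 1)
    du, dv = u - i0, v - j0
    return ((1 - du) * (1 - dv) * plane[i0, j0] +
            (1 - du) * dv       * plane[i0, j1] +
            du       * (1 - dv) * plane[i1, j0] +
            du       * dv       * plane[i1, j1])

# Toy plane where cell (i, j) stores the value 4*i + j:
plane = np.arange(16, dtype=float).reshape(4, 4, 1)
print(bilinear(plane, 1.5, 1.5))   # [7.5]: average of cells 5, 6, 9, 10
```

Trilinear sampling of a volume grid follows the same pattern with eight corner cells instead of four.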

32 of 60

Quantitative Results

  1. Better IoU than ONet with less GPU memory usage.
  2. Three 2D planes work better than a 3D volume, with less GPU memory usage.

33 of 60

Object Reconstruction

34 of 60

Voxel Super-resolution

35 of 60

Scene Reconstruction

36 of 60

Patch-based Scene Reconstruction

  1. Structured feature grid representations are translation equivariant.
  2. Crop the large input into patches and process each patch separately.

37 of 60

PIFu: Pixel-Aligned Implicit Function for

High-Resolution Clothed Human Digitization

Saito, Shunsuke, et al. "PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization." ICCV 2019.

38 of 60

Surface reconstruction from image

 

39 of 60

Texture reconstruction from image

 

40 of 60

Inference

41 of 60

Multi-view PIFu

 

42 of 60

Multi-view Results

43 of 60

Implicit Function for Neural Rendering

44 of 60

Neural Radiance Field

Mildenhall, Ben, et al. "NeRF: Representing scenes as neural radiance fields for view synthesis." ECCV 2020.

45 of 60

Problem and Approach

Novel view synthesis: given one or more images of a scene, synthesize images from novel viewpoints.

NeRF first reconstructs a neural radiance field of the scene, then renders the reconstructed scene to produce novel view images.

46 of 60

Neural radiance field with implicit functions

 

47 of 60

Volume Rendering

  1. Each pixel on the image corresponds to a ray through the volume.
  2. Sample points inside the volume along the ray.
  3. Accumulate the color and density of the sampled points to get the color of the pixel.
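The accumulation in step 3 follows the standard volume rendering quadrature, C = Σ_i T_i (1 - exp(-σ_i δ_i)) c_i with transmittance T_i = exp(-Σ_{j<i} σ_j δ_j). A sketch for one ray, with hand-picked densities and colors standing in for network outputs:

```python
import numpy as np

# Volume rendering quadrature along one ray: per-sample opacity
# alpha_i = 1 - exp(-sigma_i * delta_i), transmittance T_i is the
# product of (1 - alpha_j) for samples in front, and the pixel color
# is the transmittance-weighted sum of sample colors.

def render_ray(sigmas, colors, deltas):
    alpha = 1.0 - np.exp(-sigmas * deltas)                          # opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]   # T_i
    weights = trans * alpha
    return weights @ colors                                          # pixel color

sigmas = np.array([0.0, 10.0, 10.0])   # empty space, then dense matter
colors = np.array([[1.0, 0.0, 0.0],    # red (never seen: density is 0)
                   [0.0, 1.0, 0.0],    # green (hit first, nearly opaque)
                   [0.0, 0.0, 1.0]])   # blue (occluded by the green sample)
deltas = np.array([0.5, 0.5, 0.5])     # spacing between samples
print(render_ray(sigmas, colors, deltas))   # mostly green
```

The empty first sample contributes nothing, and the occluded blue sample contributes almost nothing, which is exactly the occlusion behavior the renderer needs.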

48 of 60

Volume Rendering in NeRF

  1. Query the color and density of the sampled points with the learned implicit function.
  2. Color is view-dependent while density is not.
  3. Use a pixel-level loss between the rendered view and the ground truth to optimize the implicit function.

49 of 60

Differentiable Volume Rendering - Formulation

 

50 of 60

Differentiable Volume Rendering - Implementation

 

51 of 60

Technique – Positional encoding

  1. Directly passing coordinates and viewing angles into the implicit function (a neural network) performs poorly at representing high-frequency variation in color and geometry.
  2. Instead, first map the inputs into a high-dimensional space using sinusoidal (sine and cosine) functions.
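The mapping is γ(p) = (sin(2⁰πp), cos(2⁰πp), …, sin(2^(L-1)πp), cos(2^(L-1)πp)), applied to each coordinate separately. A minimal sketch:

```python
import math

# NeRF-style positional encoding of a single scalar coordinate p:
# sin/cos pairs at L exponentially spaced frequencies, lifting the
# input into a higher-dimensional space so the MLP can fit
# high-frequency detail.

def positional_encoding(p, L):
    out = []
    for k in range(L):
        freq = (2.0 ** k) * math.pi
        out += [math.sin(freq * p), math.cos(freq * p)]
    return out

gamma = positional_encoding(0.5, L=4)
print(len(gamma))   # 8 values: sin/cos pairs at 4 frequencies
```

With L = 10 for the three position coordinates (as in the paper), each 3D point becomes a 60-dimensional input to the network.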

52 of 60

Technique – Hierarchical volume rendering

  1. First sample a set of points along the ray using stratified sampling.
  2. Given the coarse network's output, compute the weight of each point.
  3. Normalize the weights to produce a piecewise-constant PDF along the ray.
  4. Samples drawn from this distribution are biased towards the relevant parts of the volume.
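Steps 3–4 can be sketched as inverse-transform sampling from the piecewise-constant PDF (names and the toy weights are illustrative):

```python
import numpy as np

# Hierarchical sampling sketch: normalize coarse weights into a
# piecewise-constant PDF over ray bins, then draw fine samples by
# inverting its CDF. High-weight bins (likely surface regions)
# receive most of the new samples.

def sample_pdf(bin_edges, weights, n_samples, rng):
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.random(n_samples)                       # uniform draws in [0, 1)
    idx = np.searchsorted(cdf, u, side='right') - 1 # which bin each u lands in
    left, right = bin_edges[idx], bin_edges[idx + 1]
    t = (u - cdf[idx]) / pdf[idx]                   # position within the bin
    return left + t * (right - left)

rng = np.random.default_rng(0)
edges = np.linspace(0.0, 1.0, 6)                   # 5 coarse bins on the ray
weights = np.array([0.01, 0.02, 0.9, 0.05, 0.02])  # surface around bin 3
samples = sample_pdf(edges, weights, 64, rng)
# most fine samples fall in the high-weight bin [0.4, 0.6)
print(((samples >= 0.4) & (samples < 0.6)).mean())
```

The fine network is then evaluated at the union of coarse and fine samples, concentrating computation near the surface.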

53 of 60

Optimization

  1. Loss: total squared error between the rendered and true pixel colors, for both the coarse and fine renderings.
  2. Needs over 100 images for training.
  3. Convergence: a single scene takes 1-2 days on a single NVIDIA V100.

54 of 60

View dependent radiance

55 of 60

Synthetic scenes

56 of 60

Real world scenes

57 of 60

Summary

Implicit Neural Representations:

  1. Effective output representation for shape, appearance, and material
  2. No discretization; can model arbitrary topology
  3. Structured implicit functions improve accuracy
  4. Can be learned from images via differentiable rendering
  5. Many applications: 3D reconstruction, view synthesis, etc.

58 of 60

Discussion

  1. Geometry must be extracted in a post-processing step, which is slow and produces artifacts. (Possible remedies: hybrid representations, Neural Marching Cubes, ...)

  2. What are the applications of implicit neural representations in robotics? (Representing affordances, local ad-hoc queries, ...)

59 of 60

More on implicit neural representations

  • 3D reconstruction
    • IM-Net https://arxiv.org/abs/1812.02822
  • Structured implicit function
  • Neural rendering
    • Scene Representation Networks https://arxiv.org/pdf/1906.01618
    • Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision https://arxiv.org/abs/1912.07372

60 of 60

More on implicit neural representations

  • Affordance representation