1 of 25

MonoNeRF: Learning a Generalizable Dynamic Radiance Field from Monocular Videos

Published in ICCV 2023

We propose a radiance field that generalizes to multiple dynamic scenes

Fengrui Tian1, Shaoyi Du1, Yueqi Duan2

1College of Artificial Intelligence, Xi’an Jiaotong University

2Department of Electrical Engineering, Tsinghua University

Code: https://github.com/tianfr/MonoNeRF

arXiv: https://arxiv.org/abs/2212.13056


2 of 25

Introduction – NeRF

NeRF pipeline

Novel view synthesis of orchid

Implicit radiance field

[1] Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R., NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, 2020.

Vanilla NeRF cannot render dynamic scenes

3 of 25

Introduction – NeRF in Dynamic Scenes

[1] Li, Z., Niklaus, S., Snavely, N., & Wang, O., Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes. In CVPR, 2021.

[2] Video link: https://www.bilibili.com/video/BV1US4y1G7Qu/?spm_id_from=333.337.search-card.all.click&vd_source=ae30a8dad8d04331531446de3242367a

Novel view synthesis of dynamic scenes from monocular videos

Bullet time in movie production and sports events

Space-time interpolation

4 of 25

Challenge – Ambiguity from Monocular Video

Only one 2D video frame and its 2D optical flow are available at each timestamp

No precise 3D information

Which one?

Multiview

Monocular

5 of 25

Previous Works – Use Positions

Break the ambiguity with positions, which do not transfer across scenes


Can we learn a dynamic radiance field that generalizes to multiple scenes?

The two scenes are mixed together because positions carry no transferable information

6 of 25

Our Solution – MonoNeRF

2D video frames and optical flows form a pair of complementary constraints for jointly estimating 3D point information and trajectories (sketched as a joint objective below).

2D video frames – spatial constraint

  • correct unreasonable trajectories in space

Optical flows – temporal constraint

  • constrain how point trajectories relate over time
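
To make the complementarity concrete, the two constraints can be written as one joint objective; this is a sketch, with the weight $\lambda$ and the per-term loss forms chosen for illustration rather than taken from the paper:

$$\mathcal{L} = \underbrace{\sum_{\mathbf{r}} \left\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \right\|_2^2}_{\text{spatial: rendered vs. observed pixels}} + \lambda \underbrace{\sum_{\mathbf{r}} \left\| \hat{\mathbf{f}}(\mathbf{r}) - \mathbf{f}(\mathbf{r}) \right\|_1}_{\text{temporal: rendered vs. estimated flow}}$$

where $\hat{C}(\mathbf{r})$ is the color rendered along ray $\mathbf{r}$, $C(\mathbf{r})$ the observed pixel, $\hat{\mathbf{f}}(\mathbf{r})$ the 2D motion induced by the estimated point trajectories, and $\mathbf{f}(\mathbf{r})$ the precomputed optical flow.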

7 of 25

Our Solution – MonoNeRF

[Pipeline diagram: Monocular Video → Encoder → Frame-wise Features → Flow Field → Flow-based Feature Aggregation (break the ambiguity) → NeRF Rendering → Dynamic Scene; the rendered results are supervised by the 2D video frames (spatial constraint) and the optical flows (temporal constraint)]

  1. Build an implicit velocity field from the video features
  2. Sample features from multiple frames to break the ambiguity
  3. Render images from the sampled point features
  4. Optimize the point features with spatial-temporal constraints


8 of 25

Pipeline – Build Flow Field

[Diagram: Monocular Video → Encoder → Frame-wise Features → Flow Field]

  • Build an implicit velocity field from the frame-wise features.
  • Integrate the velocity field to obtain point trajectories.

9 of 25

Pipeline – Build Flow Field


Velocity field $\mathbf{v}(\mathbf{x}, t)$: the velocity of a point at position $\mathbf{x}$ and time $t$. Integrating it yields the point trajectory:

$$\mathbf{x}(t) = \mathbf{x}(t_0) + \int_{t_0}^{t} \mathbf{v}\big(\mathbf{x}(s), s\big)\, ds$$
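
A minimal PyTorch sketch of this step, assuming an MLP velocity field conditioned on the frame-wise features and simple forward-Euler integration; the class and function names are hypothetical, not from the released code:

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Implicit velocity field: (position, time, video feature) -> 3D velocity."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 1 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x, t, feat):
        # x: (N, 3) positions, t: (N, 1) times, feat: (N, feat_dim) features
        return self.mlp(torch.cat([x, t, feat], dim=-1))

def integrate_trajectory(vel_field, x0, t0, t1, feat, n_steps=8):
    """Forward-Euler integration of the velocity field from time t0 to t1."""
    x, dt = x0, (t1 - t0) / n_steps
    for k in range(n_steps):
        t = torch.full((x.shape[0], 1), t0 + k * dt)  # current time per point
        x = x + dt * vel_field(x, t, feat)            # advect points one step
    return x  # estimated positions at time t1
```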

10 of 25

Pipeline – Sample Point Features

[Diagram: Frame-wise Features + Flow Field → Flow-based Feature Aggregation (break the ambiguity)]

  • Project the points onto the image frames to locate local image patches and extract their features
  • Fuse the local features from each frame into the point feature, as sketched below
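
A hedged sketch of the projection-and-sampling step, assuming known camera matrices and bilinear feature lookup via grid_sample; the function name and the mean fusion at the end are illustrative placeholders:

```python
import torch
import torch.nn.functional as F

def sample_point_features(points, feat_maps, intrinsics, extrinsics):
    """Project 3D points into every frame and bilinearly sample local features.

    points:     (N, 3) sample points (already warped to each frame's time
                by the flow field in the full method)
    feat_maps:  (T, C, H, W) frame-wise feature maps from the encoder
    intrinsics: (3, 3) camera intrinsic matrix
    extrinsics: (T, 3, 4) world-to-camera matrices, one per frame
    """
    T, C, H, W = feat_maps.shape
    per_frame = []
    for i in range(T):
        # World -> camera -> pixel coordinates.
        cam = points @ extrinsics[i, :, :3].T + extrinsics[i, :, 3]
        pix = cam @ intrinsics.T
        uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                            2 * uv[:, 1] / (H - 1) - 1], dim=-1)
        f = F.grid_sample(feat_maps[i:i + 1], grid.view(1, -1, 1, 2),
                          align_corners=True)   # (1, C, N, 1)
        per_frame.append(f.view(C, -1).T)       # (N, C) local features
    # Fuse across frames; a simple mean stands in for the paper's aggregation.
    return torch.stack(per_frame).mean(dim=0)
```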

11 of 25

Pipeline – Render Dynamic Scenes

[Diagram: aggregated point features → NeRF Rendering → Dynamic Scene]

  • Render images from the sampled point features with the NeRF pipeline, as sketched below
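
The rendering itself follows the standard NeRF quadrature; a minimal sketch for a single ray, assuming colors and densities have already been decoded from the sampled point features:

```python
import torch

def volume_render(rgb, sigma, z_vals):
    """Standard NeRF volume rendering along a single ray.

    rgb:    (S, 3) colors decoded from the sampled point features
    sigma:  (S,)   densities decoded from the sampled point features
    z_vals: (S,)   depths of the S samples along the ray
    """
    # Distance between consecutive samples; pad the last interval.
    dists = torch.cat([z_vals[1:] - z_vals[:-1], torch.tensor([1e10])])
    alpha = 1.0 - torch.exp(-sigma * dists)  # per-sample opacity
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)  # composited pixel color
```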

12 of 25

Pipeline – Spatial-Temporal Constraint

[Diagram: full pipeline; rendered images are compared with the 2D video frames (spatial constraint) and rendered point motion with the optical flows (temporal constraint)]

  • Optimize the point features and the flow field with the spatial-temporal constraints, as sketched below
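
A minimal sketch of the two constraints as training losses, assuming one rendered color and one rendered 2D flow per ray; the loss forms and the weight lam are illustrative, not the paper's exact choices:

```python
import torch

def spatial_temporal_loss(pred_rgb, gt_rgb, pred_flow, gt_flow, lam=0.1):
    """Joint objective over a batch of rays.

    pred_rgb:  (R, 3) rendered colors        gt_rgb:  (R, 3) video-frame pixels
    pred_flow: (R, 2) rendered 2D motion     gt_flow: (R, 2) estimated optical flow
    """
    # Spatial constraint: rendered images must match the 2D video frames.
    l_spatial = ((pred_rgb - gt_rgb) ** 2).mean()
    # Temporal constraint: projected point motion must match the optical flow.
    l_temporal = (pred_flow - gt_flow).abs().mean()
    return l_spatial + lam * l_temporal
```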

13 of 25

Traditional Setting – Single Scene

[1] Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R., NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, 2020.

[2] Qiao, Y.-L., Gao, A., & Lin, M. C., NeuPhysics: Editable Neural Geometry and Physics from Monocular Videos. In NeurIPS, 2022.

MonoNeRF outperforms other state-of-the-art methods on the traditional novel view synthesis task.

14 of 25

Application #1 – Video Stream

Training Frames

 

Unseen Frames

 

Novel view synthesis on unseen frames containing new foreground motions

15 of 25

[1] Gao, C., Saraf, A., Kopf, J., & Huang, J.-B., Dynamic View Synthesis from Dynamic Monocular Video. In ICCV, 2021.

Application #1 – Video Stream

Ours

DynNeRF

MonoNeRF transfers to new motions, whereas DynNeRF only interpolates within the training frames.

Free-viewpoint video rendering

16 of 25

Application #2 – General Radiance Field

Multiple Monocular Videos

MonoNeRF

General Dynamic Radiance Field

A general dynamic radiance field of multiple scenes

17 of 25

Application #2 – General Radiance Field

Ours – Balloon2

Ours – Umbrella

[1] Gao, C., Saraf, A., Kopf, J., & Huang, J.-B., Dynamic View Synthesis from Dynamic Monocular Video. In ICCV, 2021.

DynNeRF

MonoNeRF learns a general dynamic radiance field from multiple scenes, whereas other methods fail to distinguish the scenes.

18 of 25

Application #3 – Novel Scene Adaptation

Novel Scene Adaptation

Training Scene

Novel Scene

Finetuning on novel scenes takes 500 steps (about 10 minutes)

19 of 25

Application #3 – Novel Scene Adaptation

[1] Gao, C., Saraf, A., Kopf, J., & Huang, J.-B., Dynamic View Synthesis from Dynamic Monocular Video. In ICCV, 2021.

Ours (10-minute finetuning)

DynNeRF (10-minute finetuning)

MonoNeRF transfers to a novel scene with 10 minutes of finetuning, whereas other methods need about one day to train from scratch.

Free-viewpoint video rendering

20 of 25

Application #4 – Editing

Change Background

Change Background + Flip Foreground

Change Background + Scale Foreground

Move Foreground

21 of 25

Code

22 of 25

Future Work #1 – Finetuning

Still needs 500-step finetuning due to the limited dataset and computational resources

[1] Gao, C., Saraf, A., Kopf, J., & Huang, J.-B., Dynamic View Synthesis from Dynamic Monocular Video. In ICCV, 2021.

Ours (500 Steps)

DynNeRF (500 Steps)

23 of 25

Future Work #2 – Flow Generalization Ability

Limited generalization ability of flow estimation

500-step finetuning

24 of 25

In Summary

  • MonoNeRF learns a dynamic radiance field that generalizes across scenes.
  • MonoNeRF exploits generalizable features with joint spatial and temporal optimization.
  • MonoNeRF supports new applications such as novel scene adaptation and editing.

MonoNeRF: Learning a Generalizable Dynamic Radiance Field from Monocular Videos

25 of 25

Published in ICCV 2023

Fengrui Tian, Shaoyi Du, Yueqi Duan

Code: https://github.com/tianfr/MonoNeRF

arXiv: https://arxiv.org/abs/2212.13056

MonoNeRF: Learning a Generalizable Dynamic Radiance Field from Monocular Videos

Thank You