1 of 25

MonoNeRF: Learning a Generalizable Dynamic Radiance Field from Monocular Videos

Published in ICCV 2023

We propose a radiance field that generalizes to multiple dynamic scenes

Fengrui Tian1, Shaoyi Du1, Yueqi Duan2

1College of Artificial Intelligence, Xi’an Jiaotong University

2Department of Electrical Engineering, Tsinghua University

Code: https://github.com/tianfr/MonoNeRF

arXiv: https://arxiv.org/abs/2212.13056


2 of 25

Introduction – NeRF

NeRF pipeline

Novel view synthesis of orchid

Implicit radiance field

[1] Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R., NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, 2020.

Vanilla NeRF cannot render dynamic scenes

3 of 25

Introduction – NeRF in Dynamic Scenes

[1] Li, Z., Niklaus, S., Snavely, N., & Wang, O., Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes. In CVPR, 2021.

[2] Video link: https://www.bilibili.com/video/BV1US4y1G7Qu/?spm_id_from=333.337.search-card.all.click&vd_source=ae30a8dad8d04331531446de3242367a

Novel view synthesis of dynamic scenes from monocular videos

Bullet time in movie production and sports events

Space-time interpolation

4 of 25

Challenge – Ambiguity from Monocular Video

Only one 2D video frame and its 2D optical flow are available at each timestamp

No precise 3D information

Which one?

Multiview

Monocular

5 of 25

Previous Works – Use Positions

Break the ambiguity with positions, which do not transfer across scenes


Can we learn a dynamic radiance field that generalizes to multiple scenes?

The two scenes are mixed together because positions carry no transferable information

6 of 25

Our Solution – MonoNeRF

2D video frames and optical flows form a pair of complementary constraints for jointly estimating 3D point information and trajectories (sketched as a joint objective below).

2D video frames – spatial constraint

  • correct unreasonable trajectories in space

Optical flows – temporal constraint

  • constrain how point trajectories relate over time
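
To make the complementarity concrete, the two constraints can be written as one joint objective; this is a sketch, with the weight $\lambda$ and the per-term loss forms chosen for illustration rather than taken from the paper:

$$\mathcal{L} = \underbrace{\sum_{\mathbf{r}} \left\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \right\|_2^2}_{\text{spatial: rendered vs. observed pixels}} + \lambda \underbrace{\sum_{\mathbf{r}} \left\| \hat{\mathbf{f}}(\mathbf{r}) - \mathbf{f}(\mathbf{r}) \right\|_1}_{\text{temporal: rendered vs. estimated flow}}$$

where $\hat{C}(\mathbf{r})$ is the color rendered along ray $\mathbf{r}$, $C(\mathbf{r})$ the observed pixel, $\hat{\mathbf{f}}(\mathbf{r})$ the 2D motion induced by the estimated point trajectories, and $\mathbf{f}(\mathbf{r})$ the precomputed optical flow.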

7 of 25

Our Solution – MonoNeRF

[Pipeline diagram: Monocular Video → Encoder → Frame-wise Features → Flow Field → Flow-based Feature Aggregation (break the ambiguity) → NeRF Rendering → Dynamic Scene; the rendered results are supervised by the 2D video frames (spatial constraint) and the optical flows (temporal constraint)]

  1. Build an implicit velocity field from the video features
  2. Sample features from multiple frames to break the ambiguity
  3. Render images from the sampled point features
  4. Optimize the point features with spatial-temporal constraints


8 of 25

Pipeline – Build Flow Field

[Diagram: Monocular Video → Encoder → Frame-wise Features → Flow Field]

  • Build an implicit velocity field from the frame-wise features.
  • Integrate the velocity field to obtain point trajectories.

9 of 25

Pipeline – Build Flow Field


Velocity field $\mathbf{v}(\mathbf{x}, t)$: the velocity of a point at position $\mathbf{x}$ and time $t$. Integrating it yields the point trajectory:

$$\mathbf{x}(t) = \mathbf{x}(t_0) + \int_{t_0}^{t} \mathbf{v}\big(\mathbf{x}(s), s\big)\, ds$$
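
A minimal PyTorch sketch of this step, assuming an MLP velocity field conditioned on the frame-wise features and simple forward-Euler integration; the class and function names are hypothetical, not from the released code:

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Implicit velocity field: (position, time, video feature) -> 3D velocity."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 1 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x, t, feat):
        # x: (N, 3) positions, t: (N, 1) times, feat: (N, feat_dim) features
        return self.mlp(torch.cat([x, t, feat], dim=-1))

def integrate_trajectory(vel_field, x0, t0, t1, feat, n_steps=8):
    """Forward-Euler integration of the velocity field from time t0 to t1."""
    x, dt = x0, (t1 - t0) / n_steps
    for k in range(n_steps):
        t = torch.full((x.shape[0], 1), t0 + k * dt)  # current time per point
        x = x + dt * vel_field(x, t, feat)            # advect points one step
    return x  # estimated positions at time t1
```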

10 of 25

Pipeline – Sample Point Features

[Diagram: Frame-wise Features + Flow Field → Flow-based Feature Aggregation (break the ambiguity)]

  • Project the points onto the image frames to locate local image patches and extract their features
  • Fuse the local features from each frame into the point feature, as sketched below
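
A hedged sketch of the projection-and-sampling step, assuming known camera matrices and bilinear feature lookup via grid_sample; the function name and the mean fusion at the end are illustrative placeholders:

```python
import torch
import torch.nn.functional as F

def sample_point_features(points, feat_maps, intrinsics, extrinsics):
    """Project 3D points into every frame and bilinearly sample local features.

    points:     (N, 3) sample points (already warped to each frame's time
                by the flow field in the full method)
    feat_maps:  (T, C, H, W) frame-wise feature maps from the encoder
    intrinsics: (3, 3) camera intrinsic matrix
    extrinsics: (T, 3, 4) world-to-camera matrices, one per frame
    """
    T, C, H, W = feat_maps.shape
    per_frame = []
    for i in range(T):
        # World -> camera -> pixel coordinates.
        cam = points @ extrinsics[i, :, :3].T + extrinsics[i, :, 3]
        pix = cam @ intrinsics.T
        uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                            2 * uv[:, 1] / (H - 1) - 1], dim=-1)
        f = F.grid_sample(feat_maps[i:i + 1], grid.view(1, -1, 1, 2),
                          align_corners=True)   # (1, C, N, 1)
        per_frame.append(f.view(C, -1).T)       # (N, C) local features
    # Fuse across frames; a simple mean stands in for the paper's aggregation.
    return torch.stack(per_frame).mean(dim=0)
```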

11 of 25

Pipeline – Render Dynamic Scenes

[Diagram: aggregated point features → NeRF Rendering → Dynamic Scene]

  • Render images from the sampled point features with the NeRF pipeline, as sketched below
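
The rendering itself follows the standard NeRF quadrature; a minimal sketch for a single ray, assuming colors and densities have already been decoded from the sampled point features:

```python
import torch

def volume_render(rgb, sigma, z_vals):
    """Standard NeRF volume rendering along a single ray.

    rgb:    (S, 3) colors decoded from the sampled point features
    sigma:  (S,)   densities decoded from the sampled point features
    z_vals: (S,)   depths of the S samples along the ray
    """
    # Distance between consecutive samples; pad the last interval.
    dists = torch.cat([z_vals[1:] - z_vals[:-1], torch.tensor([1e10])])
    alpha = 1.0 - torch.exp(-sigma * dists)  # per-sample opacity
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)  # composited pixel color
```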

12 of 25

Pipeline – Spatial-Temporal Constraint

[Diagram: full pipeline; rendered images are compared with the 2D video frames (spatial constraint) and rendered point motion with the optical flows (temporal constraint)]

  • Optimize the point features and the flow field with the spatial-temporal constraints, as sketched below
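
A minimal sketch of the two constraints as training losses, assuming one rendered color and one rendered 2D flow per ray; the loss forms and the weight lam are illustrative, not the paper's exact choices:

```python
import torch

def spatial_temporal_loss(pred_rgb, gt_rgb, pred_flow, gt_flow, lam=0.1):
    """Joint objective over a batch of rays.

    pred_rgb:  (R, 3) rendered colors        gt_rgb:  (R, 3) video-frame pixels
    pred_flow: (R, 2) rendered 2D motion     gt_flow: (R, 2) estimated optical flow
    """
    # Spatial constraint: rendered images must match the 2D video frames.
    l_spatial = ((pred_rgb - gt_rgb) ** 2).mean()
    # Temporal constraint: projected point motion must match the optical flow.
    l_temporal = (pred_flow - gt_flow).abs().mean()
    return l_spatial + lam * l_temporal
```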

13 of 25

Traditional Setting – Single Scene

[1] Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R., NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, 2020.

[2] Qiao, Y.-L., Gao, A., & Lin, M. C., NeuPhysics: Editable Neural Geometry and Physics from Monocular Videos. In NeurIPS, 2022.

MonoNeRF outperforms other state-of-the-art methods on the traditional novel view synthesis task.

14 of 25

Application #1 – Video Stream

Training Frames

 

Unseen Frames

 

Novel view synthesis on unseen frames containing new foreground motions

15 of 25

[1] Gao, C., Saraf, A., Kopf, J., & Huang, J.-B., Dynamic View Synthesis from Dynamic Monocular Video. In ICCV, 2021.

Application #1 – Video Stream

Ours

DynNeRF

MonoNeRF transfers to new motions, whereas DynNeRF only interpolates within the training frames.

Free-viewpoint video rendering

16 of 25

Application #2 – General Radiance Field

Multiple Monocular Videos

MonoNeRF

General Dynamic Radiance Field

A general dynamic radiance field of multiple scenes

17 of 25

Application #2 – General Radiance Field

Ours – Balloon2

Ours – Umbrella

[1] Gao, C., Saraf, A., Kopf, J., & Huang, J.-B., Dynamic View Synthesis from Dynamic Monocular Video. In ICCV, 2021.

DynNeRF

MonoNeRF learns a general dynamic radiance field from multiple scenes, whereas other methods fail to distinguish the scenes.

18 of 25

Application #3 – Novel Scene Adaptation

Novel Scene Adaptation

Training Scene

Novel Scene

Finetuning on novel scenes takes 500 steps (about 10 minutes)

19 of 25

Application #3 – Novel Scene Adaptation

[1] Gao, C., Saraf, A., Kopf, J., & Huang, J.-B., Dynamic View Synthesis from Dynamic Monocular Video. In ICCV, 2021.

Ours (10-minute finetuning)

DynNeRF (10-minute finetuning)

MonoNeRF transfers to a novel scene with 10 minutes of finetuning, whereas other methods need about one day to train from scratch.

Free-viewpoint video rendering

20 of 25

Application #4 – Editing

Change Background

Change Background + Flip Foreground

Change Background + Scale Foreground

Move Foreground

21 of 25

Code

22 of 25

Future Work #1 – Finetuning

Still needs 500-step finetuning due to the limited dataset and computational resources

[1] Gao, C., Saraf, A., Kopf, J., & Huang, J.-B., Dynamic View Synthesis from Dynamic Monocular Video. In ICCV, 2021.

Ours (500 Steps)

DynNeRF (500 Steps)

23 of 25

Future Work #2 – Flow Generalization Ability

Limited generalization ability of flow estimation

500-step finetuning

24 of 25

In Summary

  • MonoNeRF learns a dynamic radiance field that generalizes across scenes.
  • MonoNeRF exploits generalizable features with joint spatial and temporal optimization.
  • MonoNeRF supports new applications such as novel scene adaptation and editing.

MonoNeRF: Learning a Generalizable Dynamic Radiance Field from Monocular Videos

25 of 25

Published in ICCV 2023

Fengrui Tian, Shaoyi Du, Yueqi Duan

Code: https://github.com/tianfr/MonoNeRF

arXiv: https://arxiv.org/abs/2212.13056

MonoNeRF: Learning a Generalizable Dynamic Radiance Field from Monocular Videos

Thank You