1 of 15

DeepSmooth: Efficient and Smooth Depth Completion

VOCVALC 2023

Sriram Krishna

Samsung Research

Bangalore, India

sriram.sk@samsung.com

Basavaraja Shanthappa Vandrotti

Samsung Research

Bangalore, India

b.vandrotti@samsung.com

2 of 15

Introduction

  • Depth Estimation - A fundamental task in Computer Vision that involves estimating the distance from the camera to every pixel in an image.

  • Useful across domains from Augmented Reality to Robotics; a prerequisite step in Spatial Mapping, Video Portrait effects, etc.

  • Many paradigms exist for acquiring depth:
    • Monocular Depth Estimation - Estimating depth from a single RGB image
    • Multi View Stereo - Utilizing a sequence of RGB images and known 6DoF poses
    • Commodity Depth Sensors - Extracting depth directly from a depth sensor
    • Depth Completion - Using RGB and sparse depth to refine the depth image

3 of 15

The Rationale for Depth Completion

Drawbacks of various depth estimation paradigms:

  • Monocular Depth Estimation - Geometrically impossible to extract accurate metric depth from a single image.
  • Stereo Imaging - Tends to fail in textureless areas, e.g. plain walls
  • Depth Sensors - The different types of depth sensors (ToF, LiDAR, etc.) have their own drawbacks - sparse depth, errors in highly reflective areas, holes in areas of low reflectivity, etc.

Depth Completion aims to overcome these issues by combining the best of both worlds - refining a sparse/noisy depth image with the semantic cues from an RGB image.

4 of 15

Contemporary Work

  • Contemporary works utilize neural networks that take an RGB image and sparse depth as input to produce a dense depth map.

  • These models are typically designed to work on individual images rather than video streams, resulting in depth that flickers over time, especially when the holes in the input are large and vary across frames.

  • This noise in the depth maps propagates errors into downstream tasks such as 3D Scene Reconstruction.

5 of 15

Our Contributions

  • We propose a novel architecture that is carefully designed to model RGB-D streams of indoor scenes with semi-dense input depth. Our network makes use of Atrous Gated Convolutions to encode the input depth and R2+1D convolutions for spatio-temporal fusion.

  • We develop a novel loss function, Temporal Planar Consistency Loss, that propagates planes in predicted depth over time and minimizes the distance between planes in consecutive predictions, enforcing spatial and temporal consistency in depth predictions.

  • We validate our approach with extensive experiments, showing competitive results while improving upon other models in terms of temporal consistency.

6 of 15

Model Architecture - I

7 of 15

Model Architecture - II

A lightweight dual branch encoder-decoder, enhanced with temporal propagation, consisting of the following components:

  • Color Encoder - The recent monocular depth estimation model MiDaS is used as the color backbone, to extract features more useful for depth completion.

  • Depth Encoder - Semi-dense depth contains holes, and performance suffers when naive convolutions are used to encode it. We therefore make use of atrous gated convolutions (a short sketch follows this list):
    • Gated Convolutions - A variation of a traditional convolution where the output is “gated” using a second convolution. This formulation has shown improvements in image inpainting tasks. [Ref - Free-Form Image Inpainting with Gated Convolution, Yu et al., CVPR 2019]
    • Gated convolutions effectively double the size of the depth encoder, so atrous/dilated convolutions are used to maintain a moderately sized model.
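A minimal sketch of an atrous gated convolution, assuming a PyTorch-style implementation; the channel counts, kernel size, and dilation rate below are illustrative and not DeepSmooth's exact configuration.

```python
import torch
import torch.nn as nn

class AtrousGatedConv2d(nn.Module):
    """Gated convolution (Yu et al., 2019) with a dilated kernel: a feature
    branch is modulated per-pixel by a sigmoid-activated gating branch."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=2):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size,
                                 padding=pad, dilation=dilation)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=pad, dilation=dilation)

    def forward(self, x):
        # The gate can suppress responses around holes in the semi-dense depth.
        return torch.relu(self.feature(x)) * torch.sigmoid(self.gate(x))

# Usage: encode a single-channel semi-dense depth map.
depth = torch.randn(1, 1, 240, 320)
out = AtrousGatedConv2d(1, 32)(depth)   # -> (1, 32, 240, 320)
```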

8 of 15

Model Architecture - III

  • Temporal Encoder - Consists of a sequence of R2+1D convolutions. R2+1D convolutions are a factorized form of 3D convolutions, comprising a 2D convolution followed by a 1D convolution that operate over the spatial and temporal dimensions independently. This allows convolving over a 4D volume with less compute than a full 3D convolution. The temporal vector from the previous timestep is also provided as input (a minimal sketch of the factorization follows the Decoder bullet below).

[Ref - A Closer Look at Spatiotemporal Convolutions for Action Recognition, Tran et al., CVPR 2018]

  • Decoder - The decoder takes the RGB-D temporal vector as input and feeds it to a lightweight variant of the RefineNet architecture, with skip connections from both the color and the depth encoders.
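A minimal sketch of the R(2+1)D factorization used in the temporal encoder, assuming a PyTorch-style implementation; the channel counts are illustrative and the block is not DeepSmooth's exact configuration.

```python
import torch
import torch.nn as nn

class R2Plus1DConv(nn.Module):
    """Factorizes a 3D (T x H x W) convolution into a 2D spatial convolution
    followed by a 1D temporal convolution (Tran et al., CVPR 2018)."""
    def __init__(self, in_ch, out_ch, mid_ch=None):
        super().__init__()
        mid_ch = mid_ch or out_ch
        # (1, 3, 3): spatial-only kernel applied to each frame independently.
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))
        # (3, 1, 1): temporal-only kernel mixing information across frames.
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))
        self.act = nn.ReLU()

    def forward(self, x):                 # x: (B, C, T, H, W)
        return self.act(self.temporal(self.act(self.spatial(x))))

# Usage: fuse encoder features over a short window of frames.
feats = torch.randn(1, 32, 4, 60, 80)    # 4 timesteps
fused = R2Plus1DConv(32, 32)(feats)      # -> (1, 32, 4, 60, 80)
```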

9 of 15

Loss Function - I

  • Desired characteristics of an ideal depth completion model:
    • The predicted depth must be consistent over time - Temporally Smooth
    • The predicted depth must be well-fitted by planes in the scene - Spatially Smooth

  • Solution - Temporal Planar Consistency (TPC) Loss:

  1. The depth prediction at time t_(i-1), denoted D_(i-1), is warped/reprojected to t_i, giving D^warp_(i-1), using the available 6DoF pose.
  2. Planes are fitted to the reprojected depth using the RANSAC algorithm.
  3. Depth values of pixels are flattened (z_flat) such that they fit the plane exactly. This value is obtained by solving the plane equation Ax + By + Cz + D = 0 for z:

z_flat = −(Ax + By + D) / C

where [A, B, C, D] are the coefficients of the plane equation estimated by RANSAC.
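A minimal sketch of the plane-flattening step above, assuming the plane coefficients [A, B, C, D] have already been estimated by RANSAC on the warped depth and that points are expressed in camera coordinates; the warping itself and the inlier selection are omitted for brevity.

```python
import numpy as np

def flatten_to_plane(points_xyz, plane, inlier_mask):
    """Snap inlier points onto the plane Ax + By + Cz + D = 0 by recomputing
    their z from the plane equation: z_flat = -(Ax + By + D) / C."""
    A, B, C, D = plane
    x, y, z = points_xyz[..., 0], points_xyz[..., 1], points_xyz[..., 2]
    z_flat = -(A * x + B * y + D) / C
    # Keep the original depth for points that are not inliers of the plane.
    return np.where(inlier_mask, z_flat, z)

# Usage with a toy plane z = 2 (i.e. 0x + 0y + 1z - 2 = 0):
pts = np.array([[0.1, 0.2, 2.03], [0.3, -0.1, 1.98]])
mask = np.array([True, True])
print(flatten_to_plane(pts, (0.0, 0.0, 1.0, -2.0), mask))  # -> [2. 2.]
```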

10 of 15

Loss Function - II

  • The final Temporal Planar Consistency Loss is the L1 norm of the difference between the warped-and-flattened previous prediction and the depth prediction at the current timestep:

L_TPC = || D_i − D^(warp+flat)_(i-1) ||_1

  • L_TPC enforces that the planes present in D_(i-1) continue to be present in D_i, enforcing spatial and temporal consistency simultaneously (a short sketch of the loss follows the note below).

  • NOTE: This loss encourages the prediction of idealized depth, i.e. minor irregularities in the real world are flattened into smooth planes. By design, this discourages our model from predicting finer details, leading to a slight increase in error. We argue that, in most applications, consistency is more desirable than absolute accuracy beyond a certain point.
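A short sketch of the TPC loss computation, assuming d_prev_warp_flat is the previous prediction after warping to the current frame and plane flattening (as in the steps on the previous slide), and valid is a hypothetical mask excluding pixels that leave the image after warping.

```python
import torch

def tpc_loss(d_curr, d_prev_warp_flat, valid):
    """L1 distance between the current prediction and the warped-and-flattened
    previous prediction, averaged over valid pixels."""
    valid = valid.float()
    diff = torch.abs(d_curr - d_prev_warp_flat) * valid
    return diff.sum() / valid.sum().clamp(min=1.0)

# Usage with toy tensors of shape (B, 1, H, W):
d_curr = torch.rand(1, 1, 240, 320)
d_prev_warp_flat = torch.rand(1, 1, 240, 320)
loss = tpc_loss(d_curr, d_prev_warp_flat, torch.ones_like(d_curr))
```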

11 of 15

Results - I

  • Quantitative Results on the ScanNet dataset:

  • RMSE and MAE are measured in metres
  • Temporal Consistency - Evaluates the smoothness of predictions over time. It quantifies the percentage of pixels that are stable over time, i.e. whose relative change between consecutive frames does not exceed a threshold (see the sketch after the table below).

[Ref - Enforcing temporal consistency in video depth estimation, Li et al., ICCVW 2021]

Method              | Temporal Consistency ↑ | RMSE ↓ | MAE ↓
CostDCNet           | 0.989                  | 0.145  | 0.039
DM-LRN              | 0.990                  | 0.137  | 0.036
inDepth             | 0.990                  | 0.137  | 0.035
DeepSmooth (ours)   | 0.992                  | 0.142  | 0.043
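A minimal sketch of the temporal consistency metric described above (after Li et al., ICCVW 2021): the fraction of pixels whose relative depth change between consecutive frames stays below a threshold. The threshold value and the use of the previous frame as reference are assumptions, not the exact evaluation protocol.

```python
import numpy as np

def temporal_consistency(d_prev, d_curr, thresh=0.05, eps=1e-6):
    """Fraction of valid pixels whose relative change between consecutive
    frames is below `thresh` (threshold chosen here for illustration)."""
    rel_change = np.abs(d_curr - d_prev) / (d_prev + eps)
    valid = d_prev > 0                      # ignore empty/invalid pixels
    stable = (rel_change < thresh) & valid
    return stable.sum() / max(valid.sum(), 1)

# Usage: two nearly identical depth frames give a score close to 1.0.
d0 = np.full((240, 320), 2.0)
d1 = d0 + np.random.normal(0.0, 0.01, d0.shape)
print(round(float(temporal_consistency(d0, d1)), 3))
```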

12 of 15

Results - II

  • Qualitative Results on the ScanNet dataset:

  • DeepSmooth produces smoother output, with real-world noise being abstracted into planar surfaces. Inevitably, this also leads to a loss of detail on “almost planar” surfaces, e.g. the bumps in the bedsheet.

13 of 15

Results - III

  • Qualitative Results on the ScanNet dataset:

14 of 15

Conclusion

  • We propose DeepSmooth, a depth completion method that generates dense depth maps from semi-dense depth, using a novel dual-branch encoder-decoder network.

  • Our model is designed for video streams by integrating R2+1D convolutions into the network for stable predictions over time. We enforce spatial and temporal consistency by means of a novel loss function, Temporal Planar Consistency Loss.

  • Crucially, we argue that temporal consistency is the more desirable characteristic in applications where millimeter-level accuracy is not required. Depth sensors are limited in their accuracy by engineering constraints and cost; providing high quality depth maps through depth completion paves the way for applications requiring 3D understanding of the world around us.

15 of 15

Thank You