
  • Baseline 3DGS yields better depth maps than NeRF-based methods for transparent objects
  • Sequentially learning Background Splats followed by Residual Splats as a way of capturing a scene prior, as shown by Clear-Splatting, yields better performance
  • We will work towards reducing floaters by generating solid backgrounds for the scenes
  • Future work includes directly using depth supervision in the optimization process

Clear-Splatting: Learning Residual Gaussian Splats for Transparent Object Manipulation

Aviral Agrawal, Ritaban Roy, Bardienus P. Duisterhof, Keerthan Bhat Hekkadka, Hongyi Chen, Jeffrey Ichnowski

Motivation

Proposed Approach

Conclusion and Future Work

Figure 1: DepthAnything [4] (left two) and the Intel RealSense™ camera (right) perform poorly on transparent objects

Results

Robotics Implications

Clear-Splatting leverages mostly-static scenes to improve depth perception.

  • It begins by training Background Splats (2) on the entire scene without transparent objects (1).
  • The transparent object is then added onto the background scene (3).
  • We then learn Residual Splats (5) to complement the Background Splats from (4).
  • The rendered depth is shown in (6).
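The sequential background-then-residual scheme can be sketched on a toy 1-D analogue, where "splats" are 1-D Gaussians and "rendering" is summing them. This is only an illustrative sketch of the two-stage idea, not the actual 3DGS optimization; all names and the least-squares fitting are our own simplification.

```python
import numpy as np

def render(x, centers, widths, amps):
    """Sum of 1-D Gaussians evaluated at x — a toy stand-in for splatting."""
    basis = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / widths[None, :]) ** 2)
    return basis @ amps

def fit_amps(x, target, centers, widths):
    """Least-squares fit of Gaussian amplitudes to a target signal."""
    basis = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / widths[None, :]) ** 2)
    amps, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return amps

x = np.linspace(0.0, 1.0, 200)
background = np.sin(2 * np.pi * x)                             # static scene, no object
scene = background + np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2)    # transparent object added

# Stage 1: Background Splats are fit to the empty scene alone.
bg_c, bg_w = np.linspace(0.0, 1.0, 10), np.full(10, 0.08)
bg_a = fit_amps(x, background, bg_c, bg_w)

# Stage 2: background is frozen; Residual Splats fit only what it cannot explain.
res_c, res_w = np.linspace(0.3, 0.7, 8), np.full(8, 0.04)
residual_target = scene - render(x, bg_c, bg_w, bg_a)
res_a = fit_amps(x, residual_target, res_c, res_w)

# Full render = frozen background + learned residual.
full = render(x, bg_c, bg_w, bg_a) + render(x, res_c, res_w, res_a)
```

Freezing the background after stage 1 means the residual splats only need to model the object, which is the scene-prior intuition behind the sequential training above.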

Our two main goals for experimentation are addressed as follows:

  • Accuracy of depth estimation: how well each method estimates depth objectively. We use root-mean-squared error (RMSE) to quantify the error against ground truth.
  • Completeness of depth maps: how complete the estimated depth maps are. Depth maps with holes pose a significant problem for robot grippers.
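Both metrics are straightforward to compute from a predicted and a ground-truth depth map. A minimal sketch (our own helper names; the completeness definition here — the fraction of pixels with a valid, positive depth — is an assumption about how "holes" are counted):

```python
import numpy as np

def depth_rmse(pred, gt, valid=None):
    """RMSE between predicted and ground-truth depth over valid pixels."""
    if valid is None:
        valid = np.isfinite(gt) & (gt > 0)   # ignore missing ground truth
    diff = pred[valid] - gt[valid]
    return float(np.sqrt(np.mean(diff ** 2)))

def depth_completeness(pred):
    """Fraction of pixels with a finite, positive depth estimate."""
    return float(np.mean(np.isfinite(pred) & (pred > 0)))
```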

Figure 2: Clear-Splatting approach

  • Clear-Splatting outperforms prior works and baseline 3DGS for scene depth estimation for transparent objects
  • Across all views in the test set, it achieves convergence later than other methods
  • For the top view, the most relevant to robotics, our method achieves convergence in real time and greatly outperforms other methods

References

Grasping and manipulating transparent objects poses a significant challenge for robots, since depth sensors and off-the-shelf monocular depth estimators fail on transparent objects.

Premise: Leverage multi-view 3D reconstruction methods to model depth for transparent objects in a short time span

Problem statement: Neural Radiance Field (NeRF) methods [1,2] have shown the ability to estimate depth for transparent objects given multi-view images. However, they still struggle with challenging objects and lighting conditions. With this work, we aim to show that novel scene-prior learning combined with novel depth-based Gaussian pruning for 3D Gaussian Splatting can outperform existing works.
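The depth-based pruning mentioned above targets floaters: Gaussians that sit far from any real surface. One plausible criterion (our own hypothetical sketch — the poster does not specify the exact rule) is to drop Gaussians whose camera-space depth disagrees with the rendered depth at their projected pixel:

```python
import numpy as np

def prune_by_depth(centers, depth_map, K, tol=0.1):
    """Hypothetical depth-based pruning criterion.

    centers:   (N, 3) Gaussian means in camera coordinates (z > 0).
    depth_map: (H, W) rendered depth image.
    K:         (3, 3) pinhole intrinsics.
    Returns a boolean keep-mask of shape (N,): a Gaussian is pruned if its
    depth deviates from the rendered depth at its pixel by more than tol
    (relative). Off-screen Gaussians are kept untouched.
    """
    z = centers[:, 2]
    u = np.round(K[0, 0] * centers[:, 0] / z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * centers[:, 1] / z + K[1, 2]).astype(int)
    H, W = depth_map.shape
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    keep = np.ones(len(z), dtype=bool)
    idx = np.where(inside)[0]
    d = depth_map[v[idx], u[idx]]
    keep[idx] = np.abs(z[idx] - d) <= tol * d
    return keep
```

In practice such a check would be applied periodically during optimization, alongside the usual opacity-based pruning of 3DGS [3].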

[1] J. Ichnowski*, Y. Avigal*, J. Kerr, and K. Goldberg, “Dex-NeRF: Using a neural radiance field to grasp transparent objects,” in Conference on Robot Learning (CoRL), 2021.

[2] B. P. Duisterhof, Y. Mao, S. H. Teng, and J. Ichnowski, “Residual-NeRF: Learning residual NeRFs for transparent object manipulation,” in ICRA, 2024.

[3] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3D Gaussian Splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, July 2023.

[4] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth Anything: Unleashing the power of large-scale unlabeled data,” arXiv:2401.10891, 2024.

Figure 5: Convergence vs. time for top view

Figure 6: Convergence vs. time for all views

All-view vs. top-view performance:

  • All-view performance indicates how well the model has learned the scene and its geometry
  • Top-view performance is specifically important for robot grasping

Figure 3: Depth map objective metric comparison

Figure 4: Depth map qualitative comparison

(avirala, ritabanr, bduister, kbhathek, hongyic, jichnows) @andrew.cmu.edu