
  • Baseline 3DGS yields better depth maps than NeRF-based methods for transparent objects
  • Sequentially learning Background Splats followed by Residual Splats as a way of capturing a scene prior, as shown by Clear-Splatting, yields better performance
  • We will work towards reducing floaters by generating solid backgrounds for the scenes
  • Future work includes directly using depth supervision in the optimization process

Clear-Splatting: Learning Residual Gaussian Splats for Transparent Object Manipulation

Aviral Agrawal, Ritaban Roy, Bardienus P. Duisterhof, Keerthan Bhat Hekkadka, Hongyi Chen, Jeffrey Ichnowski

Motivation

Proposed Approach

Conclusion and Future Work

Figure 1: DepthAnything [4] (left two) and the Intel RealSense™ camera (right) perform poorly on transparent objects

Results

Robotics Implications

Clear-Splatting leverages mostly-static scenes to improve depth perception.

  • It begins by training Background Splats (2) on the entire scene without transparent objects (1).
  • The transparent object is then added onto the background scene (3).
  • We then learn Residual Splats (5) to complement the Background Splats from (4).
  • The rendered depth is shown in (6).
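The sequential background-then-residual scheme can be sketched on a toy 1-D analogue, where "splats" are 1-D Gaussians and "rendering" is summing them. This is only an illustrative sketch of the two-stage idea, not the actual 3DGS optimization; all names and the least-squares fitting are our own simplification.

```python
import numpy as np

def render(x, centers, widths, amps):
    """Sum of 1-D Gaussians evaluated at x — a toy stand-in for splatting."""
    basis = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / widths[None, :]) ** 2)
    return basis @ amps

def fit_amps(x, target, centers, widths):
    """Least-squares fit of Gaussian amplitudes to a target signal."""
    basis = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / widths[None, :]) ** 2)
    amps, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return amps

x = np.linspace(0.0, 1.0, 200)
background = np.sin(2 * np.pi * x)                             # static scene, no object
scene = background + np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2)    # transparent object added

# Stage 1: Background Splats are fit to the empty scene alone.
bg_c, bg_w = np.linspace(0.0, 1.0, 10), np.full(10, 0.08)
bg_a = fit_amps(x, background, bg_c, bg_w)

# Stage 2: background is frozen; Residual Splats fit only what it cannot explain.
res_c, res_w = np.linspace(0.3, 0.7, 8), np.full(8, 0.04)
residual_target = scene - render(x, bg_c, bg_w, bg_a)
res_a = fit_amps(x, residual_target, res_c, res_w)

# Full render = frozen background + learned residual.
full = render(x, bg_c, bg_w, bg_a) + render(x, res_c, res_w, res_a)
```

Freezing the background after stage 1 means the residual splats only need to model the object, which is the scene-prior intuition behind the sequential training above.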

Our two main goals for experimentation are addressed as follows:

  • Accuracy of depth estimation: how well each method estimates depth objectively. We use root-mean-squared error (RMSE) to quantify the error against ground truth.
  • Completeness of depth maps: how complete the estimated depth maps are. Depth maps with holes pose a significant problem for robot grippers.
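Both metrics are straightforward to compute from a predicted and a ground-truth depth map. A minimal sketch (our own helper names; the completeness definition here — the fraction of pixels with a valid, positive depth — is an assumption about how "holes" are counted):

```python
import numpy as np

def depth_rmse(pred, gt, valid=None):
    """RMSE between predicted and ground-truth depth over valid pixels."""
    if valid is None:
        valid = np.isfinite(gt) & (gt > 0)   # ignore missing ground truth
    diff = pred[valid] - gt[valid]
    return float(np.sqrt(np.mean(diff ** 2)))

def depth_completeness(pred):
    """Fraction of pixels with a finite, positive depth estimate."""
    return float(np.mean(np.isfinite(pred) & (pred > 0)))
```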

Figure 2: Clear-Splatting approach

  • Clear-Splatting outperforms prior works and baseline 3DGS for scene depth estimation for transparent objects
  • Across all views in the test set, it achieves convergence later than other methods
  • For the top view, the most relevant to robotics, our method achieves convergence in real time and greatly outperforms other methods

References

Grasping and manipulating transparent objects poses a significant challenge for robots, since depth sensors and off-the-shelf monocular depth estimators fail on transparent objects.

Premise: Leverage multi-view 3D reconstruction methods to model depth for transparent objects in a short time span

Problem statement: Neural Radiance Field (NeRF) methods [1,2] have shown the ability to estimate depth for transparent objects given multi-view images. However, they still struggle with challenging objects and lighting conditions. With this work, we aim to show that novel scene-prior learning combined with novel depth-based Gaussian pruning for 3D Gaussian Splatting can outperform existing works.
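The depth-based pruning mentioned above targets floaters: Gaussians that sit far from any real surface. One plausible criterion (our own hypothetical sketch — the poster does not specify the exact rule) is to drop Gaussians whose camera-space depth disagrees with the rendered depth at their projected pixel:

```python
import numpy as np

def prune_by_depth(centers, depth_map, K, tol=0.1):
    """Hypothetical depth-based pruning criterion.

    centers:   (N, 3) Gaussian means in camera coordinates (z > 0).
    depth_map: (H, W) rendered depth image.
    K:         (3, 3) pinhole intrinsics.
    Returns a boolean keep-mask of shape (N,): a Gaussian is pruned if its
    depth deviates from the rendered depth at its pixel by more than tol
    (relative). Off-screen Gaussians are kept untouched.
    """
    z = centers[:, 2]
    u = np.round(K[0, 0] * centers[:, 0] / z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * centers[:, 1] / z + K[1, 2]).astype(int)
    H, W = depth_map.shape
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    keep = np.ones(len(z), dtype=bool)
    idx = np.where(inside)[0]
    d = depth_map[v[idx], u[idx]]
    keep[idx] = np.abs(z[idx] - d) <= tol * d
    return keep
```

In practice such a check would be applied periodically during optimization, alongside the usual opacity-based pruning of 3DGS [3].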

[1] J. Ichnowski*, Y. Avigal*, J. Kerr, and K. Goldberg, “Dex-NeRF: Using a neural radiance field to grasp transparent objects,” in Conference on Robot Learning (CoRL), 2021.

[2] B. P. Duisterhof, Y. Mao, S. H. Teng, and J. Ichnowski, “Residual-NeRF: Learning residual NeRFs for transparent object manipulation,” in ICRA, 2024.

[3] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3D Gaussian Splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, July 2023.

[4] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth Anything: Unleashing the power of large-scale unlabeled data,” arXiv:2401.10891, 2024.

Figure 5: Convergence vs. time for top view

Figure 6: Convergence vs. time for all views

All-view vs. top-view performance:

  • All-view performance indicates how well the model has learned the scene and its geometry
  • Top-view performance is specifically important for robot grasping

Figure 3: Depth map objective metric comparison

Figure 4: Depth map qualitative comparison

(avirala, ritabanr, bduister, kbhathek, hongyic, jichnows) @andrew.cmu.edu