1 of 16

D(R,O) Grasp: A Unified Representation of Robot and Object Interaction for Cross-Embodiment Dexterous Grasping
ICRA 2025 Best Paper Award on Robot Manipulation and Locomotion

Samuel Chua

4 November 2025

2 of 16

Contents

  • Background
  • Related Works
  • Motivation
  • Method
  • Experiments
  • Conclusion


3 of 16

Background – Cross-Embodiment Dexterous Grasping


  • There is increasing availability and diversity of commercial and open-source dexterous hands
  • A unified framework is required to efficiently equip robot hands with dexterous grasping capabilities

4 of 16

Background – Cross-Embodiment Dexterous Grasping


  • Embodiment Gaps
    • Requires efficient computation sharing across embodiments
  • Precise Contact
    • Requires accurate predictions of hand configurations that conform to object geometries

5 of 16

Related Works

  • Robot-Centric Approach
    • Directly infer robot poses or joint angles
    • RL in high-dimensional action spaces suffers from sample inefficiency
    • Sim-to-real gap complicates policy transfer
  • Object-Centric Representation
    • Infers grasp poses by solving IK based on predicted contact points/maps
    • Slow optimization speed
    • Struggles with partial object point clouds


6 of 16

Motivation


  • D(R,O)
    • An interaction-centric representation that captures the spatial relationship between the robot hand and the object
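At its core, the D(R,O) representation is a matrix of pairwise distances between a robot-hand point cloud and an object point cloud. A minimal numpy sketch (point counts and the `dro_matrix` name are illustrative, not from the paper):

```python
import numpy as np

def dro_matrix(robot_pts, object_pts):
    """Pairwise Euclidean distances between robot-hand and object points.

    D[i, j] is the distance from robot point i to object point j, giving
    a description of the hand-object spatial relationship that is
    invariant to rigid transforms applied to both clouds together.
    """
    # (N, 1, 3) - (1, M, 3) -> (N, M, 3) via broadcasting
    diff = robot_pts[:, None, :] - object_pts[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Toy example: 4 robot points, 3 object points
robot = np.random.rand(4, 3)
obj = np.random.rand(3, 3)
D = dro_matrix(robot, obj)
print(D.shape)  # (4, 3)
```

Because D depends only on relative geometry, moving hand and object rigidly together leaves it unchanged, which is what makes it embodiment- and pose-agnostic.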

7 of 16

Method


Pretrain

D(R,O) Prediction

D(R,O) Execution

8 of 16

Method


Pretrain

D(R,O) Prediction

D(R,O) Execution

9 of 16

Method


Pretrain

D(R,O) Prediction

D(R,O) Execution

Two point clouds are passed through the encoder to produce point-wise features
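The encoder maps each point to a feature vector. As a stand-in for the paper's actual point-cloud encoder, here is a minimal shared two-layer per-point MLP (weight shapes and the function name are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def pointwise_encoder(points, w1, w2):
    """Apply a shared two-layer MLP to every point independently,
    producing one feature vector per point (a toy stand-in for the
    paper's point-cloud encoder)."""
    h = np.maximum(points @ w1, 0.0)  # shared linear layer + ReLU
    return h @ w2                     # per-point feature vector

# Random weights: 3-D points -> 16 hidden units -> 8-D features
w1 = rng.standard_normal((3, 16))
w2 = rng.standard_normal((16, 8))

robot_feats = pointwise_encoder(rng.random((40, 3)), w1, w2)
object_feats = pointwise_encoder(rng.random((60, 3)), w1, w2)
print(robot_feats.shape, object_feats.shape)  # (40, 8) (60, 8)
```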

10 of 16

Method


Pretrain

D(R,O) Prediction

D(R,O) Execution

  • The model takes point-cloud inputs of the robot and the object and extracts two sets of correlated features
  • A CVAE predicts the D(R,O) representation, capturing the variation across the many combinations of hand, object, and grasp configuration
  • The output D(R,O) is produced from the latent variable z, the learned features, and a kernel function
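The CVAE's generative step hinges on the reparameterization trick: sampling different latents z yields different predicted D(R,O) matrices, i.e. diverse grasps for the same hand-object pair. A minimal sketch of just that step (the 16-dim latent is an assumption; the decoder and kernel function are as described in the paper and omitted here):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_z(mu, logvar, rng):
    """CVAE reparameterization: z = mu + sigma * eps, eps ~ N(0, I).
    Each sampled z would be decoded into one D(R,O) matrix."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

# Two samples from the same (mu, logvar) give two distinct grasps
mu, logvar = np.zeros(16), np.zeros(16)
z1 = sample_z(mu, logvar, rng)
z2 = sample_z(mu, logvar, rng)
print(z1.shape)  # (16,)
```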

11 of 16

Method


Pretrain

D(R,O) Prediction

D(R,O) Execution

  • Repeating this process for each row of D(R,O) yields the complete predicted robot point cloud in the grasp pose.
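Each row of D(R,O) gives one robot point's distances to the known object points, so its 3-D position can be recovered by multilateration. A linear least-squares sketch (the `multilaterate` helper and anchor counts are illustrative; the paper's exact solver may differ):

```python
import numpy as np

def multilaterate(anchors, dists):
    """Recover a 3-D point from its distances to known anchor points
    (one row of D(R,O) -> one robot point) via linear least squares.

    Linearize |p - q_j|^2 = d_j^2 against the first anchor:
        2 (q_j - q_0) . p = |q_j|^2 - |q_0|^2 + d_0^2 - d_j^2
    """
    q0, d0 = anchors[0], dists[0]
    A = 2.0 * (anchors[1:] - q0)
    b = (np.sum(anchors[1:] ** 2, axis=1) - np.sum(q0 ** 2)
         + d0 ** 2 - dists[1:] ** 2)
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# Verify: recover a known point from exact distances to 6 anchors
rng = np.random.default_rng(0)
anchors = rng.random((6, 3))
p_true = np.array([0.2, 0.5, 0.1])
d = np.linalg.norm(anchors - p_true, axis=1)
print(np.allclose(multilaterate(anchors, d), p_true))  # True
```

Solving this once per row reconstructs the full robot point cloud in the grasp pose, from which joint angles can be obtained.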

12 of 16

Experiments


13 of 16

Experiments


14 of 16

Experiments


15 of 16

Experiments


16 of 16

Conclusion


  • D(R,O) captures the essential interaction between robotic hands and objects.
  • Unlike existing methods that rely heavily on either object-specific or robot-specific representations, D(R,O) bridges the gap with a unified framework that generalizes across different robot hands and object geometries.