1 of 16

D(R,O) Grasp: A Unified Representation of Robot and Object Interaction for Cross-Embodiment Dexterous Grasping
ICRA 2025 Best Paper Award on Robot Manipulation and Locomotion

Samuel Chua

4 November 2025

2 of 16

Contents

  • Background
  • Related Works
  • Motivation
  • Method
  • Experiments
  • Conclusion


3 of 16

Background – Cross-Embodiment Dexterous Grasping


  • There is increasing availability and diversity of commercial and open-source dexterous hands
  • A unified framework is required to efficiently equip robot hands with dexterous grasping capabilities

4 of 16

Background – Cross-Embodiment Dexterous Grasping


  • Embodiment Gaps
    • Requires efficient computation sharing across embodiments
  • Precise Contact
    • Requires accurate predictions of hand configurations that conform to object geometries

5 of 16

Related Works

  • Robot-Centric Approach
    • Directly infer robot poses or joint angles
    • RL in high-dimensional action spaces suffers from sample inefficiency
    • Sim-to-real gap complicates policy transfer
  • Object-Centric Representation
    • Infers grasp poses by solving IK based on predicted contact points/maps
    • Slow optimization speed
    • Struggles with partial object point clouds


6 of 16

Motivation


  • D(R,O)
    • An interaction-centric representation that captures the spatial relationship between the robot hand and the object
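At its core, the D(R,O) representation is a matrix of pairwise distances between a robot-hand point cloud and an object point cloud. A minimal numpy sketch (point counts and the `dro_matrix` name are illustrative, not from the paper):

```python
import numpy as np

def dro_matrix(robot_pts, object_pts):
    """Pairwise Euclidean distances between robot-hand and object points.

    D[i, j] is the distance from robot point i to object point j, giving
    a description of the hand-object spatial relationship that is
    invariant to rigid transforms applied to both clouds together.
    """
    # (N, 1, 3) - (1, M, 3) -> (N, M, 3) via broadcasting
    diff = robot_pts[:, None, :] - object_pts[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Toy example: 4 robot points, 3 object points
robot = np.random.rand(4, 3)
obj = np.random.rand(3, 3)
D = dro_matrix(robot, obj)
print(D.shape)  # (4, 3)
```

Because D depends only on relative geometry, moving hand and object rigidly together leaves it unchanged, which is what makes it embodiment- and pose-agnostic.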

7 of 16

Method


Pretrain

D(R,O) Prediction

D(R,O) Execution

8 of 16

Method


Pretrain

D(R,O) Prediction

D(R,O) Execution

9 of 16

Method


Pretrain

D(R,O) Prediction

D(R,O) Execution

Two point clouds are passed through the encoder to produce point-wise features
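The encoder maps each point to a feature vector. As a stand-in for the paper's actual point-cloud encoder, here is a minimal shared two-layer per-point MLP (weight shapes and the function name are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def pointwise_encoder(points, w1, w2):
    """Apply a shared two-layer MLP to every point independently,
    producing one feature vector per point (a toy stand-in for the
    paper's point-cloud encoder)."""
    h = np.maximum(points @ w1, 0.0)  # shared linear layer + ReLU
    return h @ w2                     # per-point feature vector

# Random weights: 3-D points -> 16 hidden units -> 8-D features
w1 = rng.standard_normal((3, 16))
w2 = rng.standard_normal((16, 8))

robot_feats = pointwise_encoder(rng.random((40, 3)), w1, w2)
object_feats = pointwise_encoder(rng.random((60, 3)), w1, w2)
print(robot_feats.shape, object_feats.shape)  # (40, 8) (60, 8)
```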

10 of 16

Method


Pretrain

D(R,O) Prediction

D(R,O) Execution

  • The model takes point-cloud inputs of the robot and the object and extracts two sets of correlated features
  • A CVAE predicts the D(R,O) representation, capturing the variation across the many combinations of hand, object, and grasp configuration
  • The output D(R,O) is produced from the latent variable z, the learned features, and a kernel function
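The CVAE's generative step hinges on the reparameterization trick: sampling different latents z yields different predicted D(R,O) matrices, i.e. diverse grasps for the same hand-object pair. A minimal sketch of just that step (the 16-dim latent is an assumption; the decoder and kernel function are as described in the paper and omitted here):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_z(mu, logvar, rng):
    """CVAE reparameterization: z = mu + sigma * eps, eps ~ N(0, I).
    Each sampled z would be decoded into one D(R,O) matrix."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

# Two samples from the same (mu, logvar) give two distinct grasps
mu, logvar = np.zeros(16), np.zeros(16)
z1 = sample_z(mu, logvar, rng)
z2 = sample_z(mu, logvar, rng)
print(z1.shape)  # (16,)
```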

11 of 16

Method


Pretrain

D(R,O) Prediction

D(R,O) Execution

  • Repeating this process for each row of D(R,O) yields the complete predicted robot point cloud in the grasp pose.
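Each row of D(R,O) gives one robot point's distances to the known object points, so its 3-D position can be recovered by multilateration. A linear least-squares sketch (the `multilaterate` helper and anchor counts are illustrative; the paper's exact solver may differ):

```python
import numpy as np

def multilaterate(anchors, dists):
    """Recover a 3-D point from its distances to known anchor points
    (one row of D(R,O) -> one robot point) via linear least squares.

    Linearize |p - q_j|^2 = d_j^2 against the first anchor:
        2 (q_j - q_0) . p = |q_j|^2 - |q_0|^2 + d_0^2 - d_j^2
    """
    q0, d0 = anchors[0], dists[0]
    A = 2.0 * (anchors[1:] - q0)
    b = (np.sum(anchors[1:] ** 2, axis=1) - np.sum(q0 ** 2)
         + d0 ** 2 - dists[1:] ** 2)
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# Verify: recover a known point from exact distances to 6 anchors
rng = np.random.default_rng(0)
anchors = rng.random((6, 3))
p_true = np.array([0.2, 0.5, 0.1])
d = np.linalg.norm(anchors - p_true, axis=1)
print(np.allclose(multilaterate(anchors, d), p_true))  # True
```

Solving this once per row reconstructs the full robot point cloud in the grasp pose, from which joint angles can be obtained.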

12 of 16

Experiments


13 of 16

Experiments


14 of 16

Experiments


15 of 16

Experiments


16 of 16

Conclusion


  • D(R,O) captures the essential interaction between robotic hands and objects.
  • Unlike existing methods that rely heavily on either object-specific or robot-specific representations, D(R,O) bridges the gap with a unified framework that generalizes across different robot hands and object geometries.