Synthesize 3D Data from 2D for Enhanced Robotic Decision-Making
Mingtian Tan, Haolin Liu, Tianle Zhong, Xuhui Kang
(Names listed in presentation order)
Motivation: Final Research Goal
Final Goal: Build a Robot with 3D Decision-Making Capabilities for Executing Real-World Instructions
Pour Water from the Bottle into the Cup
Challenges and our Approach
Challenge: Training directly on 3D data is highly complex, and the resulting models are brittle.
(For instance, a model may fail to recognize a cup simply because its color changes.)
Challenges and our Solution
Indirect Approach: Instead, we use robotic affordance data as an intermediate representation and base decision-making on that representation.
Affordance
[Pipeline diagram] Input: 3D point clouds of objects and a natural-language instruction (e.g., "Place the straw into the Cup."). PointNet++ encodes each object into 512 tokens, a language projector encodes the instruction, and a generator (2D text-to-image Stable Diffusion) outputs labeled affordance data.
Problem Setting: Pipeline Overview
Input: 3D point clouds of objects and natural-language instructions. Output: labeled affordance data.
[Pipeline diagram] PointNet++ encodes each object into 512 tokens; a language projector encodes the instruction (e.g., "Place the straw into the Cup."); a generator produces the labeled affordance data.
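At the shape level, the pipeline above can be sketched as follows. The `encode_points` and `encode_text` functions are random stand-ins for PointNet++ and the language projector (illustrative assumptions, not the trained models), and the sigmoid readout stands in for the generator purely to show the output shape.

```python
import numpy as np

N_POINTS, DIM = 512, 256
rng = np.random.default_rng(0)

def encode_points(pc):
    """Stand-in for PointNet++: 512 xyz points -> 512 feature tokens."""
    W = rng.standard_normal((3, DIM)) / np.sqrt(3.0)
    return pc @ W                          # (512, DIM)

def encode_text(n_tokens=512):
    """Stand-in for the language projector: instruction -> 512 tokens."""
    return np.zeros((n_tokens, DIM))

pc = rng.random((N_POINTS, 3))             # one object's point cloud
tokens = np.concatenate([encode_points(pc), encode_text()], axis=0)  # (1024, DIM)
# The generator maps the fused tokens to a per-point affordance in [0, 1];
# a sigmoid over mean features is used here only to illustrate the shape.
affordance = 1.0 / (1.0 + np.exp(-tokens[:N_POINTS].mean(axis=1)))   # (512,)
```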
Related Work: Current Affordance Prediction Methods
Affordance Learning from Human Videos [1]:
- Models identify contact points (where human hands interact with objects) and post-contact trajectories (how the object moves after contact).
- Tools like Gaussian Mixture Models (GMMs) help predict multiple affordances, allowing flexibility in manipulation tasks.
[1] Bahl, Shikhar, et al. "Affordances from human videos as a versatile representation for robotics." CVPR, 2023.
Robo-ABC [2]:
- It leverages 2D diffusion models to establish semantic correspondences between objects, enabling robots to apply learned affordances from familiar objects to novel ones.
[2] Ju, Yuanchen, et al. "Robo-ABC: Affordance generalization beyond categories via semantic correspondence for robot manipulation." arXiv preprint arXiv:2401.07487, 2024.
Related Work: 3D Diffusion Methods in Robotics
3D Diffusion Policy [3]:
- The method demonstrates significant improvements in learning efficiency, generalization, and safety over 2D-based approaches.
[3] Ze, Yanjie, et al. "3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations." RSS, 2024.
Dataset: Annotating Object Pair Affordance
Task: Place the straw into the cup
Object point clouds are from the PartNet-Mobility dataset.
Task: Place the book onto the cup
We set the affordance of the contact area to 1 and assign the affordance of other areas based on their Euclidean distance to the contact area.
In total, we annotated 300 object-pair affordance maps as the training dataset.
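The distance-based labeling rule can be sketched as follows. The slides only state that non-contact affordance is decided by Euclidean distance; the exponential falloff and the `sigma` value are illustrative assumptions.

```python
import numpy as np

def label_affordance(points, contact_idx, sigma=0.05):
    """Affordance is 1.0 at the contact area and decays with Euclidean
    distance to the nearest contact point (exponential falloff assumed)."""
    contact = points[contact_idx]                                   # (k, 3)
    # distance from every point to its nearest contact point
    d = np.linalg.norm(points[:, None, :] - contact[None, :, :], axis=-1).min(axis=1)
    aff = np.exp(-d / sigma)
    aff[contact_idx] = 1.0
    return aff

rng = np.random.default_rng(0)
pts = rng.random((512, 3))                 # one object as 512 points
aff = label_affordance(pts, contact_idx=np.arange(16))
```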
Model Architecture: VAE + Diffusion Transformer
[Architecture diagram] Each object is represented as 512 points and encoded by PointNet++ into 512 tokens; the instruction ("Place the straw into the Cup.") is embedded with a CLIP-based language encoder; both condition the Diffusion Transformer.
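The slides name a Diffusion Transformer but do not give its formulation; the standard DDPM forward (noising) process such a model would be trained to invert can be sketched as below. The linear beta schedule and 1000 steps are assumptions, not stated in the slides.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """DDPM forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    alphas_bar = np.cumprod(1.0 - betas)
    a = alphas_bar[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)      # linear schedule (assumed)
x0 = rng.random(512)                       # per-point affordance labels
x_t, eps = forward_diffuse(x0, t=999, betas=betas, rng=rng)
# The Diffusion Transformer would be trained to predict eps from x_t,
# conditioned on the point tokens and the language embedding.
```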
Training Process:
[Training curve: loss vs. steps]
Optimizer: AdamW
Batch size: 64
GPU: 1× A100
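The slide names AdamW; a single update step with decoupled weight decay looks like the sketch below. The learning rate and weight-decay values are illustrative assumptions (only the optimizer, batch size 64, and 1× A100 are stated).

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update: Adam moment estimates plus decoupled weight decay."""
    m = b1 * m + (1 - b1) * g              # first-moment estimate
    v = b2 * v + (1 - b2) * g * g          # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

w, m, v = np.ones(4), np.zeros(4), np.zeros(4)
g = np.full(4, 0.5)
w, m, v = adamw_step(w, g, m, v, t=1)
```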
Evaluation: Successful Model Outputs
Put the pen into the cup
Cover the Cup
Conclusion
Current Progress
Future Plans