1 of 16

Synthesize 3D Data from 2D for Enhanced Robotic Decision-Making

Mingtian Tan, Haolin Liu, Tianle Zhong, Xuhui Kang

(Names listed in presentation order)

2 of 16

Motivation: Final Research Goal

Final Goal: Build a Robot with 3D Decision-Making Capabilities for Executing Real-World Instructions

Example task: "Pour water from the bottle into the cup."

3 of 16

Challenges and our Approach

Challenges: Training directly on 3D data is highly complex.

  1. 3D data is scarce, and
  2. real-world 3D data often contains significant noise, which greatly increases the difficulty of model generalization.

(For instance, a model may fail to recognize a cup simply because its color changes.)

4 of 16

Challenges and our Solution

Indirect Approach: An alternative is to use robotic affordance data as an intermediate representation, so that decision-making operates on this representation rather than on raw 3D input.

[Figure: affordance illustration]

7 of 16

Problem Setting: Pipeline Overview

Input: 3D point clouds of objects and natural language instructions.

Output: labeled affordance data.

[Pipeline diagram: a PointNet++ encoder maps each object's point cloud to 512 tokens; a Language Projector maps the instruction (e.g., "Place the straw into the Cup.") to 512 tokens; both condition the generator G, a 2D text-to-image Stable Diffusion model, which outputs the labeled affordance data.]
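To make the I/O concrete, here is a hypothetical interface sketch of the pipeline; the function names, token width, and random-projection stand-ins are our assumptions for illustration, not the authors' code.

```python
# Hypothetical pipeline interface (names and shapes assumed for illustration).
import numpy as np

D = 256  # assumed token width

def encode_point_cloud(points: np.ndarray) -> np.ndarray:
    """Stand-in for PointNet++: (512, 3) object points -> (512, D) tokens."""
    rng = np.random.default_rng(0)
    return points @ rng.standard_normal((3, D))

def project_language(instruction: str) -> np.ndarray:
    """Stand-in for the Language Projector: text -> (512, D) tokens."""
    rng = np.random.default_rng(abs(hash(instruction)) % 2**32)
    return rng.standard_normal((512, D))

def generate_affordance(point_tokens: np.ndarray, text_tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the Stable Diffusion generator: tokens -> per-point affordance."""
    return np.zeros(point_tokens.shape[0])  # placeholder output

affordance = generate_affordance(
    encode_point_cloud(np.zeros((512, 3))),
    project_language("Place the straw into the Cup."),
)
```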

8 of 16

Related Work: Current Affordance Prediction Methods

Affordance Learning from Human Videos [1]:

  • (1) This method uses videos of human interactions to train robots to predict affordances.
  • (2) Models identify contact points (where human hands interact with objects) and post-contact trajectories (how the object moves after contact).
  • (3) Tools like Gaussian Mixture Models (GMMs) help predict multiple affordances, allowing flexibility in manipulation tasks (see the sketch below).
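As a toy illustration of point (3) (our own sketch, not code from [1]): each GMM component can represent one contact mode, and sampling yields multiple affordance proposals rather than a single point estimate.

```python
# Toy sketch: fit a GMM to hypothetical 2D contact points so each
# component captures one contact mode (placeholder data, not real annotations).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
contact_points = np.vstack([
    rng.normal([0.2, 0.8], 0.02, size=(100, 2)),  # e.g., handle grasps
    rng.normal([0.7, 0.3], 0.03, size=(100, 2)),  # e.g., rim grasps
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(contact_points)

# Sampling yields diverse contact proposals for manipulation.
proposals, modes = gmm.sample(5)
print(proposals, modes)
```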

  • Limitations:
    • Human video-based approaches struggle to generalize affordances to novel or unseen object categories due to the domain shift from human to robot.
    • Limited generalization to novel objects or tasks outside the training set, often requiring additional fine-tuning.

[1] Bahl, Shikhar, et al. "Affordances from human videos as a versatile representation for robotics." CVPR, 2023.

9 of 16

Related Work: Current Affordance Prediction Methods

Robo-ABC [2]:

  • (1) Robo-ABC focuses on zero-shot affordance generalization, allowing robots to transfer manipulation skills across different object categories.
  • (2) It leverages 2D diffusion models to establish semantic correspondences between objects, enabling robots to apply learned affordances from familiar objects to novel ones (see the toy sketch below).
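The following is a toy sketch (our illustration, not Robo-ABC's actual pipeline) of transferring a contact point from a familiar object to a novel one by matching per-point features with cosine similarity.

```python
# Toy semantic-correspondence transfer: pick the target point whose feature
# is most similar to the source object's known contact point.
import numpy as np

def transfer_contact(src_feats: np.ndarray, src_contact_idx: int, tgt_feats: np.ndarray) -> int:
    """src_feats: (N, D) features of the known object; tgt_feats: (M, D) of the novel one."""
    q = src_feats[src_contact_idx]
    sims = tgt_feats @ q / (np.linalg.norm(tgt_feats, axis=1) * np.linalg.norm(q) + 1e-8)
    return int(np.argmax(sims))

# Usage with random stand-ins for diffusion features:
rng = np.random.default_rng(0)
idx = transfer_contact(rng.standard_normal((100, 64)), 7, rng.standard_normal((120, 64)))
```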

  • Limitations:
    • While it effectively handles generalization within known object categories, challenges arise when dealing with highly diverse or complex environments.
    • It relies on 2D affordance predictions and is therefore limited in its understanding of 3D object structure, which can hinder precise robotic interaction.


[2] Ju, Yuanchen, et al. "Robo-ABC: Affordance generalization beyond categories via semantic correspondence for robot manipulation." arXiv preprint arXiv:2401.07487 (2024).

10 of 16

Related Work: 3D Diffusion Methods in Robotics

3D Diffusion Policy [3]:

  • (1) This is an end-to-end method that leverages 3D representations extracted from point clouds, combined with diffusion models, to output actions for robotic control.
  • (2) The method demonstrates significant improvements in learning efficiency, generalization, and safety over 2D-based approaches.

  • Limitations:
    • The system still faces challenges with generalization when there are significant changes in camera perspectives.
    • Performance on complex tasks still leaves considerable room for improvement.

[3] Ze, Yanjie, et al. "3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations." RSS, 2024.

11 of 16

Dataset: Annotating Object Pair Affordance

Task: Place the straw into the cup

Object point clouds are from the PartNet-Mobility dataset.

12 of 16

Dataset: Annotating Object Pair Affordance

Task: Place the book onto the cup

We set the affordance of the contact area to 1 and assign the affordance of other areas based on their Euclidean distance to the contact region (see the sketch below).

In total, we annotated 300 object-pair affordances as the training dataset.
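A minimal sketch of this labeling rule follows. The slides state only that contact points are set to 1 and other areas are scored by Euclidean distance; the exponential falloff and its scale are our assumptions.

```python
# Distance-based affordance labeling (falloff shape is an assumption).
import numpy as np

def label_affordance(points: np.ndarray, contact_mask: np.ndarray, scale: float = 0.05) -> np.ndarray:
    """points: (N, 3) object point cloud; contact_mask: (N,) bool for the contact region."""
    contact_pts = points[contact_mask]
    # Distance from every point to its nearest contact point.
    dists = np.linalg.norm(points[:, None, :] - contact_pts[None, :, :], axis=-1).min(axis=1)
    scores = np.exp(-dists / scale)  # decays with Euclidean distance
    scores[contact_mask] = 1.0       # contact area is labeled 1
    return scores

# Usage on a toy cloud where the first 10 points form the contact area:
pts = np.random.default_rng(0).random((512, 3))
mask = np.zeros(512, dtype=bool); mask[:10] = True
labels = label_affordance(pts, mask)
```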

13 of 16

Model Architecture: VAE + Diffusion Transformer

[Architecture diagram: each object is represented as 512 points; PointNet++ encodes each object into 512 tokens; a CLIP-based language embedding encodes the instruction ("Place the straw into the Cup."); both condition the Diffusion Transformer.]
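Below is a minimal PyTorch sketch of such conditioning. All dimensions, the Linear stand-ins for PointNet++ and CLIP, and the single prepended condition token are illustrative assumptions, not the authors' architecture.

```python
# Illustrative diffusion-transformer conditioning on point tokens + text embedding.
import torch
import torch.nn as nn

class AffordanceDiT(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, layers: int = 4):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.point_proj = nn.Linear(3, dim)      # stand-in for PointNet++ tokens
        self.text_proj = nn.Linear(512, dim)     # stand-in for the CLIP embedding
        self.time_emb = nn.Embedding(1000, dim)  # diffusion timestep embedding
        self.head = nn.Linear(dim, 1)            # per-point prediction

    def forward(self, noisy_points, text_emb, t):
        tokens = self.point_proj(noisy_points)                           # (B, 512, dim)
        cond = self.text_proj(text_emb) + self.time_emb(t)               # (B, dim)
        h = self.backbone(torch.cat([cond[:, None, :], tokens], dim=1))  # prepend condition
        return self.head(h[:, 1:, :])                                    # (B, 512, 1)

# Usage:
model = AffordanceDiT()
eps = model(torch.randn(2, 512, 3), torch.randn(2, 512), torch.randint(0, 1000, (2,)))
```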

14 of 16

Training Process:

[Figure: training loss vs. steps]

Optimizer: AdamW

Batch size: 64

GPU: 1× A100
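A minimal training-step sketch matching the stated setup (AdamW, batch size 64). The learning rate and the simple MSE noise-prediction loss are our assumptions; AffordanceDiT is the illustrative module from the architecture sketch above.

```python
# One training step under the assumed noise-prediction objective.
import torch
import torch.nn.functional as F

model = AffordanceDiT()  # illustrative module from the architecture sketch
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

points = torch.randn(64, 512, 3)   # placeholder object points
text = torch.randn(64, 512)        # placeholder language embeddings
t = torch.randint(0, 1000, (64,))  # random diffusion timesteps
noise = torch.randn(64, 512, 1)    # target noise to predict

loss = F.mse_loss(model(points, text, t), noise)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```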

15 of 16

Evaluation: Successful Model Outputs

[Figure: model outputs for the tasks "Put the pen to the cup" and "Cover the Cup"]

16 of 16

Conclusion

Current Progress

  • Utilized diffusion models to address affordance problems.
  • Successfully validated the feasibility of this approach.

Future Plans

  • Automate affordance data annotation using models like ChatGPT.
  • Integrate predicted affordances with robot action policies to guide action generation.