1 of 29

Modern Grasping

Zhengxiao Han & Harrison Bounds

from MSR

2 of 29

Task

Agent (Robotic Arm with RGB-D Camera)

“Pour out water from the bottle.”

Text input: {task}

3 of 29

Task

Agent (Robotic Arm with RGB-D Camera)

“Pour out water from the bottle.”

Text input: {task}

4 of 29

Task

Agent (Robotic Arm with RGB-D Camera)

“Pour out water from the bottle.”

Text input: {task}

One agent for grasping arbitrary objects.

5 of 29

Pipeline

VLM

(Gemini 1.5)

“Find the {cup}.”

Foundation Model

(SAM 2)

“Find the best grasping part of the {cup}.”

VLM

(Gemini 1.5)

Foundation Model

(GraspNet)

RGB Image Input

Depth Image Input

Text input

Grasping Poses Filtering

RGB-D Camera

Robotic Arm

6 of 29

Vision-Language Model (VLM)

VLM

(Gemini 1.5)

“Find the {cup}.”

Foundation Model

(SAM 2)

“Find the best grasping part of the {cup} for the task: {pour out water}.”

VLM

(Gemini 1.5)

Foundation Model

(GraspNet)

RGB Image Input

Depth Image Input

Text input

Grasping Poses Filtering

RGB-D Camera

Robotic Arm

7 of 29

VLM: Related Work

https://github.com/leggedrobotics/darknet_ros

YOLO gives 2D bounding boxes for specific objects.

It can only detect objects it was trained on.

What about objects outside the training dataset?

8 of 29

VLM: Related Work

An example using Google Gemini: detecting "key A".

Can YOLO do this?

(Yes, but only by adding every such object to its training set, which would blow up the dataset size.)

9 of 29

VLM

Vision-Language Models (VLMs) can detect arbitrary objects and parts from a text prompt.

Example queries: "key A", "cup", "handle of cup".
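Below is a minimal sketch of such an open-vocabulary query, assuming the google-generativeai Python client; the prompt wording, the JSON reply format, and the 0-1000 coordinate convention are assumptions to verify against the model's actual responses.

```python
# Hedged sketch: open-vocabulary 2D detection with Gemini 1.5.
# Assumptions: the `google-generativeai` client, a placeholder API key, and
# that the model answers with normalized [ymin, xmin, ymax, xmax] JSON.
import json
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")            # placeholder
model = genai.GenerativeModel("gemini-1.5-flash")

def detect(image_path: str, query: str):
    """Ask the VLM for a 2D bounding box of an arbitrary object or part."""
    image = Image.open(image_path)
    prompt = (
        f"Find the {query} in the image. Reply only with JSON: "
        '{"box_2d": [ymin, xmin, ymax, xmax]} normalized to 0-1000.'
    )
    response = model.generate_content([prompt, image])
    ymin, xmin, ymax, xmax = json.loads(response.text)["box_2d"]
    w, h = image.size                               # rescale to pixel coords
    return (xmin * w / 1000, ymin * h / 1000, xmax * w / 1000, ymax * h / 1000)

# Unlike YOLO, the same call works for "cup" and for "handle of cup"
# without any retraining.
print(detect("scene.png", "handle of cup"))
```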

10 of 29

Segment Anything Model (SAM)

VLM

(Gemini 1.5)

“Find the {cup}.”

Foundation Model

(SAM 2)

“Find the best grasping part of the {cup} for the task: {pour out water}.”

VLM

(Gemini 1.5)

Foundation Model

(GraspNet)

RGB Image Input

Depth Image Input

Text input

Grasping Poses Filtering

RGB-D Camera

Robotic Arm

11 of 29

Segment Anything Model (SAM): Related Work

Mask R-CNN gives 2D bounding boxes and masks for specific objects.

It can only detect objects it was trained on.

What about objects outside the training dataset?

https://github.com/matterport/Mask_RCNN

12 of 29

Segment Anything Model (SAM): Related Work

Can Mask R-CNN segment an arbitrary part such as a "cup handle"?

(Yes, but only by adding every such object and part to its training set, which would blow up the dataset size.)

13 of 29

Segment Anything Model (SAM)

The Segment Anything Model (SAM) is a zero-shot, promptable image segmentation model designed to produce high-quality object masks for arbitrary objects.

https://arxiv.org/pdf/2304.02643
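As a rough illustration of the promptable interface, here is a sketch using the released segment_anything package with the ViT-H checkpoint; the image path and prompt point are placeholders.

```python
# Hedged sketch of SAM's zero-shot, promptable interface, assuming the
# `segment_anything` package and the released ViT-H checkpoint.
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("scene.png").convert("RGB"))  # placeholder image
predictor.set_image(image)

# A single foreground point prompt yields a mask for an arbitrary object,
# with no object-specific training.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),   # (x, y) pixel prompt, placeholder
    point_labels=np.array([1]),            # 1 = foreground
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]       # (H, W) boolean mask
```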

14 of 29

Segment Anything Model (SAM)

15 of 29

Segment Anything Model (SAM)

VLM

“Find the mustard bottle.”

SAM 2

Bounding Box

Mask

We combine a Vision-Language Model with SAM, using the VLM's language ability to achieve open-vocabulary segmentation from text input.
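A minimal sketch of this hand-off, assuming the sam2 package's SAM2ImagePredictor and the facebook/sam2-hiera-large checkpoint; the box below is a placeholder for what the VLM detection step returns.

```python
# Hedged sketch: the VLM's 2D bounding box becomes the box prompt for SAM 2,
# which returns a pixel-accurate mask. Assumes the `sam2` package and the
# "facebook/sam2-hiera-large" checkpoint; the interface mirrors SAM v1's predictor.
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

def segment_from_box(image_path: str, box_xyxy):
    """Turn a VLM bounding box into a binary segmentation mask."""
    image = np.array(Image.open(image_path).convert("RGB"))
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        box=np.array(box_xyxy, dtype=np.float32),
        multimask_output=False,
    )
    return masks[0].astype(bool)           # (H, W) mask of the requested object

# Placeholder box, e.g. what the VLM returned for "Find the mustard bottle."
mask = segment_from_box("scene.png", (412, 188, 655, 530))
```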

16 of 29

Segment Anything Model (SAM)

We’re utilizing Vision-Language Models’ language ability for open-vocabulary segmentation.

17 of 29

Segment Anything Model (SAM)

Vision-Language Models can also reason about the appropriate grasping part using their language ability.
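The second VLM query in the pipeline can be as simple as the sketch below, again assuming the google-generativeai client; the prompt wording is an assumption.

```python
# Hedged sketch of the task-conditioned reasoning query to the VLM.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")            # placeholder
model = genai.GenerativeModel("gemini-1.5-flash")

def best_grasp_part(obj: str, task: str) -> str:
    """Ask the VLM which part of the object should be grasped for the task."""
    prompt = (
        f"Find the best grasping part of the {obj} for the task: {task}. "
        "Answer with the part name only, e.g. 'handle of cup'."
    )
    return model.generate_content(prompt).text.strip()

# The returned part name then goes through the same open-vocabulary
# detection and segmentation steps as any other text query.
part = best_grasp_part("cup", "pour out water")
```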

18 of 29

GraspNet

VLM

(Gemini 1.5)

“Find the {cup}.”

Foundation Model

(SAM 2)

“Find the best grasping part of the {cup} for the task: {pour out water}.”

VLM

(Gemini 1.5)

Foundation Model

(GraspNet)

RGB Image Input

Depth Image Input

Text input

Grasping Poses Filtering

RGB-D Camera

Robotic Arm

19 of 29

GraspNet

https://arxiv.org/pdf/2103.02184

20 of 29

GraspNet

GraspNet

Grasping Poses

21 of 29

GraspNet

Grasping Poses

Segmentation Masks

Project the grasping poses back into the 2D image and keep the best-scoring pose that falls within the segmentation mask.
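A minimal sketch of this filtering step; the array shapes and variable names are assumptions, since the real pipeline works with GraspNet's own grasp representation.

```python
# Hedged sketch of mask-based grasp filtering: project each candidate grasp
# centre (camera frame, metres) into the image with the pinhole intrinsics,
# keep the ones landing inside the SAM mask, return the highest-scoring one.
import numpy as np

def filter_grasps(centers, scores, mask, K):
    """centers: (N, 3) grasp centres; scores: (N,); mask: (H, W) bool; K: (3, 3)."""
    # Pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy.
    x, y, z = centers[:, 0], centers[:, 1], centers[:, 2]
    u = (K[0, 0] * x / z + K[0, 2]).round().astype(int)
    v = (K[1, 1] * y / z + K[1, 2]).round().astype(int)

    h, w = mask.shape
    in_image = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (z > 0)
    in_mask = np.zeros(len(centers), dtype=bool)
    in_mask[in_image] = mask[v[in_image], u[in_image]]

    if not in_mask.any():
        return None                      # no grasp lands on the requested part
    return int(np.flatnonzero(in_mask)[np.argmax(scores[in_mask])])

# best_idx = filter_grasps(grasp_centers, grasp_scores, part_mask, camera_K)
```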

22 of 29

GraspNet

Due to hardware problems with the gripper,

we’re only demonstrating the grasping pose.

23 of 29

GraspNet

Due to hardware problems with the gripper,

we’re only demonstrating the grasping pose.

24 of 29

GraspNet

Due to hardware problems with the gripper,

we’re only demonstrating the grasping pose.

25 of 29

Pipeline

VLM

(Gemini 1.5)

“Find the {cup}.”

Foundation Model

(SAM 2)

“Find the best grasping part of the {cup}.”

VLM

(Gemini 1.5)

Foundation Model

(GraspNet)

RGB Image Input

Depth Image Input

Text input

Grasping Poses Filtering

RGB-D Camera

Robotic Arm

26 of 29

Future Work: BundleSDF

BundleSDF is a neural method that simultaneously tracks the 6-DoF pose of an unknown object and reconstructs its 3D geometry and appearance from a monocular RGB-D video.

https://bundlesdf.github.io

27 of 29

Future Work: Beyond grasping arbitrary objects.

https://berkeleyautomation.github.io/POGS/

28 of 29

Future Work: Beyond grasping arbitrary objects.

https://omnimanip.github.io

29 of 29

Conclusion

Contribution:

  • Open-vocabulary object segmentation.
  • Automatically reason about appropriate grasping parts.
  • 6-DoF grasping pose generation and filtering.

Future Work:

  • 6-DoF pose estimation.
  • Online AIGC.
  • Not only grasping, but manipulation.