Modern Grasping
Zhengxiao Han & Harrison Bounds
from MSR
Task
The agent (a robotic arm with an RGB-D camera) receives a natural-language task as text input, e.g. "Pour out water from the bottle."
One agent for grasping arbitrary objects.
Pipeline
Inputs: an RGB image and a depth image from the RGB-D camera, plus the task as text input.
1. VLM (Gemini 1.5), prompted with "Find the {cup}.", locates the target object in the RGB image.
2. Foundation model (SAM 2) segments the detected object into a mask.
3. VLM (Gemini 1.5), prompted with "Find the best grasping part of the {cup} for the task: {pour out water}.", selects which part of the object to grasp.
4. Foundation model (GraspNet) generates candidate grasping poses from the depth image.
5. Grasping pose filtering keeps the poses that land on the selected part and sends the best one to the robotic arm.
A minimal code sketch of this flow follows.
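In code, the pipeline reads roughly as a single function from (RGB, depth, task text) to a grasp pose. This is only a sketch of the data flow: every helper name below is a hypothetical placeholder for the Gemini 1.5, SAM 2, and GraspNet stages described in the following sections, and intersecting the part mask with the object mask is an illustrative choice, not a detail from the original pipeline.

    # Hypothetical orchestration of the pipeline above; detect, segment,
    # generate_grasps and filter_grasps are placeholders for the stages
    # covered in the following sections.
    def grasp_from_task(rgb, depth, task, target_object):
        # 1. VLM (Gemini 1.5): locate the target object in the RGB image.
        object_box = detect(rgb, f"Find the {target_object}.")
        # 2. SAM 2: turn the bounding box into a pixel mask of the object.
        object_mask = segment(rgb, object_box)
        # 3. VLM (Gemini 1.5): pick the part of the object best suited to the task.
        part_box = detect(rgb, f"Find the best grasping part of the {target_object} "
                               f"for the task: {task}.")
        part_mask = segment(rgb, part_box) & object_mask
        # 4. GraspNet: candidate 6-DoF grasping poses from the depth image.
        grasps = generate_grasps(depth)
        # 5. Keep only grasps that project back into the part mask; the best one goes to the arm.
        return filter_grasps(grasps, part_mask)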
Vision-Language Model (VLM)
(Pipeline diagram repeated; this section covers the VLM (Gemini 1.5) stages.)
VLM: Related Work
https://github.com/leggedrobotics/darknet_ros
YOLO gives 2D bounding boxes for specific objects, but it can only detect the classes it was trained on.
What about objects outside the training dataset?
VLM: Related Work
Example query: "key A". Can YOLO detect this?
(In principle yes, but only by growing the training dataset to an impractical size.)
An example using Google Gemini:
VLM
Vision-Language Models (VLMs) can detect arbitrary objects or object parts from text input, e.g. "key A", "cup", or "handle of cup".
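As a rough illustration, a VLM such as Gemini 1.5 can be queried with an image plus a text prompt and asked to return a bounding box. The sketch below assumes the google-generativeai Python package; the prompt wording and the JSON response parsing are assumptions, not the pipeline's actual code.

    import json
    import PIL.Image
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")          # assumption: API key supplied here
    model = genai.GenerativeModel("gemini-1.5-flash")

    image = PIL.Image.open("scene.png")              # RGB frame from the camera
    prompt = (
        "Find the handle of cup in this image. "
        "Answer only with JSON: {\"box\": [x_min, y_min, x_max, y_max]} in pixels."
    )

    response = model.generate_content([image, prompt])
    # Parsing assumes the model followed the requested JSON format.
    box = json.loads(response.text)["box"]
    print("bounding box:", box)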
Segment Anything Model (SAM)
(Pipeline diagram repeated; this section covers the SAM 2 stage.)
Segment Anything Model (SAM): Related Work
Mask R-CNN gives 2D bounding boxes and masks for specific objects, but it can only detect the classes it was trained on.
What about objects outside the training dataset?
https://github.com/matterport/Mask_RCNN
Segment Anything Model (SAM): Related Work
Could Mask R-CNN segment something as specific as a "cup handle"?
(In principle yes, but only by growing the training dataset to an impractical size.)
Segment Anything Model (SAM)
The Segment Anything Model (SAM) is a zero-shot, promptable image segmentation model designed to produce high-quality object masks for arbitrary objects.
https://arxiv.org/pdf/2304.02643
Segment Anything Model (SAM)
The VLM is prompted with "Find the mustard bottle." and returns a 2D bounding box; SAM 2 then converts that bounding box into a segmentation mask.
We combine the Vision-Language Model with SAM, using the VLM's language ability for open-vocabulary segmentation driven by text input.
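A rough sketch of this hand-off, assuming the sam2 package's image predictor API; the checkpoint name and the bounding-box values below are placeholders, and the box would come from the VLM query above.

    import numpy as np
    import PIL.Image
    from sam2.sam2_image_predictor import SAM2ImagePredictor

    # Assumption: the sam2 package is installed and the checkpoint is fetched from Hugging Face.
    predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

    image = np.array(PIL.Image.open("scene.png").convert("RGB"))
    predictor.set_image(image)

    # Bounding box from the VLM ("Find the mustard bottle."), as [x_min, y_min, x_max, y_max] pixels.
    vlm_box = np.array([310, 120, 470, 400])

    masks, scores, _ = predictor.predict(box=vlm_box, multimask_output=False)
    mask = masks[0].astype(bool)   # binary mask of the mustard bottle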
Segment Anything Model (SAM)
Vision-Language Models can also use their language ability to reason about which part of the object is appropriate to grasp for the given task.
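For example, the second VLM query in the pipeline can be phrased as a task-conditioned question. The prompt wording below is an illustrative assumption, and the snippet reuses the Gemini model and RGB image from the detection sketch above.

    # Reuses `model` and `image` from the Gemini detection sketch.
    target_object, task = "cup", "pour out water"
    part_prompt = (
        f"Find the best grasping part of the {target_object} for the task: {task}. "
        "Answer only with JSON: {\"part\": \"<name>\", \"box\": [x_min, y_min, x_max, y_max]} in pixels."
    )
    response = model.generate_content([image, part_prompt])
    # The returned box is passed to SAM 2, as in the previous sketch, to obtain the part mask.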
GraspNet
(Pipeline diagram repeated; this section covers the GraspNet stage and grasping pose filtering.)
GraspNet
https://arxiv.org/pdf/2103.02184
GraspNet
GraspNet generates candidate grasping poses from the depth input. The poses are projected back into the 2D image and filtered against the segmentation masks, keeping the best grasping pose that lies within the selected part's mask. A sketch of this filtering step follows.
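A minimal sketch of the filtering step, assuming grasp centers are available as 3D points in the camera frame (e.g. taken from the GraspNet output) together with per-grasp scores, and that the camera intrinsics fx, fy, cx, cy are known; all variable names are placeholders.

    import numpy as np

    def filter_grasps_by_mask(centers, scores, mask, fx, fy, cx, cy):
        """Keep grasps whose 3D center projects into the 2D part mask.

        centers: (N, 3) grasp centers in the camera frame (meters).
        scores:  (N,)  confidence scores from the grasp generator.
        mask:    (H, W) boolean mask of the selected grasping part.
        """
        x, y, z = centers[:, 0], centers[:, 1], centers[:, 2]
        u = np.round(fx * x / z + cx).astype(int)   # pinhole projection to pixel coordinates
        v = np.round(fy * y / z + cy).astype(int)

        h, w = mask.shape
        in_image = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (z > 0)
        keep = np.zeros(len(centers), dtype=bool)
        keep[in_image] = mask[v[in_image], u[in_image]]

        if not keep.any():
            return None                             # no grasp lands on the requested part
        best = np.flatnonzero(keep)[np.argmax(scores[keep])]
        return centers[best], best                  # best grasp center and its index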
GraspNet
Due to hardware problems with the gripper,
we’re only demonstrating the grasping pose.
Pipeline
(Pipeline diagram repeated as a recap of the full system.)
Future Work: BundleSDF
BundleSDF is a neural method that simultaneously tracks the 6-DoF pose of an unknown object and reconstructs its 3D geometry and appearance from a monocular RGB-D video.
https://bundlesdf.github.io
Future Work: Not only focus on grasping arbitrary objects.
https://berkeleyautomation.github.io/POGS/
https://omnimanip.github.io
Conclusion
Contribution: a single agent that combines a VLM (Gemini 1.5), SAM 2, and GraspNet to grasp arbitrary objects from a text-specified task, filtering grasping poses with task-relevant part masks.
Future Work: track object poses with BundleSDF, and extend the pipeline beyond grasping (POGS, OmniManip).