1 of 26

Language Embedded Radiance Fields for Zero-Shot Task-Oriented Grasping

Presented by Ahmet Firat Gamsiz

Adam Rashid*, Satvik Sharma*, Chung Min Kim, Justin Kerr, Lawrence Yunliang Chen

Angjoo Kanazawa, Ken Goldberg

2 of 26

Introduction

  • Many common objects must be grasped appropriately to avoid damage or facilitate performing a task
  • a knife by its handle
  • a flower by its stem
  • sunglasses by their frame
  • Ability to grasp an object part based on a desired task and constraints is called task-oriented grasping

3 of 26

Introduction

4 of 26

Motivation

  • Learning-based grasping systems are robust at grasping arbitrary objects.
  • But they typically measure grasp success based on whether the object was lifted.
  • They ignore an object’s semantic properties:
    • Even if a robot could locate your favorite sunglasses, rather than safely grasping them by the frame it may shatter the lenses

5 of 26

Motivation

  • Previous methods collect specific object affordance datasets and struggle to scale to a diverse set of objects.

  • The flexibility of natural language has the potential for specifying what and where to grasp.
  • We can use large vision-language models in a zero-shot manner for task-oriented grasping

6 of 26

Related Work - Task-Oriented Grasping

  • Probabilistically modeling human grasps
  • Extracting geometric features from labeled object parts
  • Training on part-affordance datasets in simulation
  • Transferring category-specific part grasps to new instances
  • Training object-part grasp networks by leveraging object part and manipulation affordance datasets
  • Using videos of humans interacting with objects to guide grasps
  • Using pretrained vision features to discover common parts within sets of objects

7 of 26

Related Work - Neural Radiance Fields (NeRF)

  • NeRF represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location.
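As a rough illustration only (not the paper's actual architecture), a NeRF query can be sketched as a small MLP mapping a positionally encoded 5D input to density and view-dependent color; the layer sizes and encoding frequencies below are assumptions:

    import torch
    import torch.nn as nn

    def positional_encoding(x, n_freqs=10):
        # Map each coordinate to [x, sin(2^k x), cos(2^k x)] features.
        feats = [x]
        for k in range(n_freqs):
            feats += [torch.sin(2.0**k * x), torch.cos(2.0**k * x)]
        return torch.cat(feats, dim=-1)

    class TinyNeRF(nn.Module):
        # Minimal sketch: position (x, y, z) -> density; position + view direction -> RGB.
        def __init__(self, n_freqs=10, hidden=256):
            super().__init__()
            pos_dim = 3 * (1 + 2 * n_freqs)
            self.trunk = nn.Sequential(nn.Linear(pos_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden), nn.ReLU())
            self.density_head = nn.Linear(hidden, 1)
            self.rgb_head = nn.Sequential(nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
                                          nn.Linear(hidden // 2, 3), nn.Sigmoid())

        def forward(self, xyz, view_dir):
            h = self.trunk(positional_encoding(xyz))
            sigma = torch.relu(self.density_head(h))           # volume density
            rgb = self.rgb_head(torch.cat([h, view_dir], -1))  # view-dependent radiance
            return sigma, rgb

    # Query density and color at one 3D point along one viewing direction.
    model = TinyNeRF()
    sigma, rgb = model(torch.rand(1, 3), torch.rand(1, 3))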

8 of 26

Related Work - Contrastive Language–Image Pre-training (CLIP)

  • CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples.
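A minimal sketch of the symmetric contrastive objective behind this pairing; the batch size, embedding dimension, and temperature below are placeholders rather than CLIP's actual values:

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
        # Normalize embeddings so dot products are cosine similarities.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        # Pairwise similarity logits between every image and every text in the batch.
        logits = img_emb @ txt_emb.t() / temperature
        # The correct pairing is the diagonal: image i matches text i.
        targets = torch.arange(img_emb.shape[0])
        loss_i2t = F.cross_entropy(logits, targets)      # predict text given image
        loss_t2i = F.cross_entropy(logits.t(), targets)  # predict image given text
        return 0.5 * (loss_i2t + loss_t2i)

    # Toy batch of 8 (image, text) embedding pairs with dimension 512.
    loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))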

9 of 26

Related Work - Self-DIstillation with NO Labels (DINO)

  • Self-supervised pre-training of Vision Transformers (ViT)
  • ViT features explicitly contain the scene layout and, in particular, object boundaries

10 of 26

Related Work - Language Embedded Radiance Fields (LERF)

  • A method for grounding language embeddings from off-the-shelf models like CLIP into NeRF, which enables open-ended language queries in 3D

11 of 26

Related Work - Language Embedded Radiance Fields (LERF)

  • To create LERF, augment NeRF’s outputs with a language embedding F_lang(x, s) ∈ R^d, which takes an input position x and physical scale s, and outputs a d-dimensional language embedding.

12 of 26

Related Work - Language Embedded Radiance Fields (LERF)

  • Relevancy Score: To assign each rendered language embedding φ_lang a score, compute the CLIP embedding of the text query φ_query along with the embeddings of a set of canonical phrases φ_canon^i. Compute the cosine similarity between the rendered embedding and each of these, then take the pairwise softmax between the query and each canonical phrase. The relevancy score is the minimum over canonical phrases:
      min_i  exp(φ_lang · φ_query) / (exp(φ_lang · φ_canon^i) + exp(φ_lang · φ_query))
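A sketch of this score over already-computed embeddings; the arrays here are placeholders, and the canonical phrase set is whatever LERF uses:

    import torch
    import torch.nn.functional as F

    def relevancy_score(phi_lang, phi_query, phi_canon):
        # phi_lang:  (d,)   rendered language embedding for a ray/point
        # phi_query: (d,)   CLIP text embedding of the query
        # phi_canon: (k, d) CLIP text embeddings of the canonical phrases
        phi_lang = F.normalize(phi_lang, dim=-1)
        phi_query = F.normalize(phi_query, dim=-1)
        phi_canon = F.normalize(phi_canon, dim=-1)
        sim_query = phi_lang @ phi_query   # cosine similarity with the query
        sim_canon = phi_canon @ phi_lang   # cosine similarity with each canonical phrase
        # Pairwise softmax of the query against each canonical phrase; take the minimum.
        pairwise = torch.exp(sim_query) / (torch.exp(sim_canon) + torch.exp(sim_query))
        return pairwise.min()

    score = relevancy_score(torch.randn(512), torch.randn(512), torch.randn(4, 512))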

13 of 26

Related Work - Open-vocabulary object detection (OVD)

  • Open-vocabulary object detection attempts to output masks or bounding-box detections given text prompts as input.
  • First learn a visual-semantic space using image-text pairs, then learn object detection using object annotations from a set of base classes

14 of 26

Problem

  • Given a planar surface (table or workbench) containing a set of objects, the objective is for a robot to grasp and lift a target object specified using natural language.

  • Example query: “sunglasses; ear hooks.”

15 of 26

Method - Overview

  • First, the robot captures the scene and reconstructs a LERF. Given a text query, LERF can generate a 3D relevancy map that highlights the relevant parts of the scene.

  • Second, a 3D object mask is generated using the LERF relevancy for the object query and DINO-based semantic grouping.

  • Third, a 3D part relevancy map is generated with a conditional LERF query over the object part query and the 3D object mask. The part relevancy map is used to produce a semantic grasp distribution.

16 of 26

Method - Overview

17 of 26

Method - 3D Object Extraction

  • An important limitation of LERF is its lack of spatial grouping within objects.
  • LERF-TOGO overcomes this by finding a 3D object mask given a language query, which groups the whole object together around the LERF activation.
  • To create the object mask, we leverage the 3D DINO embeddings (self-DIstillation with NO labels) present within LERF during inference, because DINO embeddings have been shown to exhibit strong object awareness and foreground-background distinction

18 of 26

Method - 3D Object Extraction

  • First, obtain a coarse object localization from LERF by rendering a top-down view of the scene and querying the object.
  • Second, produce a foreground mask by thresholding the first principal component of the top-down rendered DINO embeddings, and constrain the relevancy query to this mask to find the most relevant 3D point.
  • Third, iteratively grow the object mask by adding neighboring points of the frontier that lie within a threshold DINO similarity (see the sketch below). The output of this process is a set of 3D points lying on the target object
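A rough sketch of the region-growing step on a point cloud with per-point DINO features. The neighborhood radius, the similarity threshold, and comparing each candidate against the seed feature (rather than the frontier point) are all assumptions; scipy's KD-tree stands in for however LERF-TOGO actually finds neighbors:

    import numpy as np
    from scipy.spatial import cKDTree

    def grow_object_mask(points, dino_feats, seed_idx, radius=0.01, sim_thresh=0.8):
        # points:     (N, 3) 3D points sampled from the scene
        # dino_feats: (N, D) DINO embedding per point, assumed L2-normalized
        # seed_idx:   index of the most relevant point from the constrained LERF query
        tree = cKDTree(points)
        in_mask = np.zeros(len(points), dtype=bool)
        in_mask[seed_idx] = True
        frontier = [seed_idx]
        while frontier:
            idx = frontier.pop()
            # Candidate neighbors within a small spatial radius of the frontier point.
            for nbr in tree.query_ball_point(points[idx], r=radius):
                if in_mask[nbr]:
                    continue
                # Accept the neighbor if its DINO feature is similar enough to the seed's.
                if dino_feats[nbr] @ dino_feats[seed_idx] > sim_thresh:
                    in_mask[nbr] = True
                    frontier.append(nbr)
        return points[in_mask]   # 3D points lying on the target object

    obj_pts = grow_object_mask(np.random.rand(1000, 3),
                               np.random.randn(1000, 64),
                               seed_idx=0)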

19 of 26

Method - Conditional LERF Queries

  • CLIP has a tendency to behave as a bag-of-words: the activation for “mug” behaves very similarly to “mug handle”.

  • To condition the LERF query on the part (the second half of the query), evaluate it only over the points within the 3D object mask. This results in a distribution over the object’s 3D geometry representing the likelihood that a given point is the desired object part.
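Conceptually, the conditional query just restricts where the part relevancy is evaluated and renormalizes over the object mask. A sketch with placeholder arrays; normalizing via a softmax is one possible choice, not necessarily what the paper does:

    import numpy as np

    def conditional_part_distribution(part_relevancy, object_mask):
        # part_relevancy: (N,) LERF relevancy of the part query at each scene point
        # object_mask:    (N,) boolean mask of points belonging to the queried object
        masked = np.where(object_mask, part_relevancy, -np.inf)
        # Softmax over masked points only: a distribution over the object's geometry
        # giving the likelihood that each point is the desired part.
        exp = np.exp(masked - masked[object_mask].max())
        return exp / exp.sum()

    relevancy = np.random.rand(1000)
    mask = np.random.rand(1000) > 0.9
    part_dist = conditional_part_distribution(relevancy, mask)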

20 of 26

Method - Grasp Sampling

  • Ensuring complete coverage of grasps on objects is critical to avoid missing specific object parts.
  • GraspNet can generate 6-DOF parallel jaw grasps from a monocular RGBD point cloud.
  • Create a hemisphere of virtual cameras oriented towards the scene’s center, render point clouds from LERF at these views, and pass them to GraspNet.
  • Combine the generated grasps across views, using non-maximum suppression to remove duplicates (see the sketch below)
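A sketch of the de-duplication step: grasps from the different virtual views are pooled and near-identical grasps are suppressed, keeping the higher-confidence one. Representing each grasp by its center and ignoring orientation, as well as the distance threshold, are simplifying assumptions:

    import numpy as np

    def nms_grasps(centers, scores, dist_thresh=0.02):
        # centers: (G, 3) grasp center positions pooled from all virtual camera views
        # scores:  (G,)   GraspNet confidence for each grasp
        order = np.argsort(-scores)   # visit grasps from most to least confident
        keep, kept_centers = [], []
        for idx in order:
            c = centers[idx]
            # Suppress this grasp if it is too close to an already-kept grasp.
            if all(np.linalg.norm(c - k) > dist_thresh for k in kept_centers):
                keep.append(idx)
                kept_centers.append(c)
        return keep

    kept = nms_grasps(np.random.rand(200, 3), np.random.rand(200))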

21 of 26

Method - Grasp Ranking

  • The semantic score s_sem for a given grasp is computed as the median LERF relevancy of points within the grasp volume.
  • The geometric score s_geom is the confidence output from GraspNet, indicating grasp quality based on geometric cues.
  • To balance relevance and success likelihood, combine the two as s = 0.95·s_sem + 0.05·s_geom, so the most relevant grasps are considered first while slightly biasing towards confident grasps (see the sketch below).
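A sketch of the ranking, assuming each grasp already comes with the LERF relevancies of the points inside its grasp volume and a GraspNet confidence:

    import numpy as np

    def rank_grasps(relevancy_per_grasp, graspnet_conf, w_sem=0.95, w_geom=0.05):
        # relevancy_per_grasp: list of arrays, LERF relevancy of points inside each grasp volume
        # graspnet_conf:       (G,) geometric grasp-quality confidence from GraspNet
        s_sem = np.array([np.median(r) for r in relevancy_per_grasp])  # semantic score
        s = w_sem * s_sem + w_geom * np.asarray(graspnet_conf)         # combined score
        return np.argsort(-s)   # grasp indices from best to worst

    ranking = rank_grasps([np.random.rand(30) for _ in range(10)], np.random.rand(10))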

22 of 26

Method - Scene Reconstruction

  • Robot uses a wrist-mounted camera to capture the scene with a hemispherical trajectory centered at the workspace.
  • The capture has a radius of 45 cm, arcs ±100° around the workspace horizontally, and covers an inclination range of 30° to 75°. We capture images while the arm moves at 15 cm/s at a rate of 3 Hz, resulting in around 60 images per capture (see the trajectory sketch below).
  • While the robot moves, each image is pre-processed to extract DINO features, multi-scale CLIP, and ZoeDepth, which are used during LERF training.
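A sketch of generating camera waypoints on such a hemispherical arc (45 cm radius, ±100° azimuth, 30° to 75° inclination). The 12×5 discretization is an assumption chosen to give roughly 60 poses; how the real controller interleaves these with image capture is not shown:

    import numpy as np

    def hemisphere_waypoints(center, radius=0.45, n_azim=12, n_incl=5,
                             azim_range=(-100, 100), incl_range=(30, 75)):
        # center: (3,) workspace center that the camera should look at
        waypoints = []
        for incl in np.radians(np.linspace(*incl_range, n_incl)):
            for azim in np.radians(np.linspace(*azim_range, n_azim)):
                # Spherical -> Cartesian offset from the workspace center.
                offset = radius * np.array([np.cos(incl) * np.cos(azim),
                                            np.cos(incl) * np.sin(azim),
                                            np.sin(incl)])
                position = center + offset
                look_dir = (center - position) / np.linalg.norm(center - position)
                waypoints.append((position, look_dir))  # where the camera is, where it points
        return waypoints

    poses = hemisphere_waypoints(np.array([0.5, 0.0, 0.0]))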

23 of 26

Experiments

Part-Oriented Grasping: Object, part queries

Task-Oriented Grasping: Asking an LLM to create object part queries

Integration with an LLM Planner: Integrate as a module with an LLM planner

24 of 26

Results

  • Overall achieves a 69% success rate for physically grasping and lifting objects.
  • LERF-TOGO is able to differentiate between:
    • very fine-grained language queries like color and appearance (“matte” vs “shiny”)
    • semantically similar categories (“popsicle” vs “lollipop”)
    • long-tail object queries like “ethernet dongle”, “cork”, or “martini glass”, owing to its zero-shot use of CLIP

25 of 26

Limitations and Future Work

  • One limitation of LERF-TOGO is speed: the entire end-to-end process takes a few minutes, which can be impractical for time-sensitive applications
  • Another key limitation of LERF-TOGO is with groups of connected foreground objects, for example a bouquet of multiple flowers.
  • If there are multiple objects that match the prompt, the system will arbitrarily choose only one of them
  • LERF-TOGO is not designed for referring/comparative expressions (e.g., “mug next to the plate”, “biggest mug”).

26 of 26

References

  • Language Embedded Radiance Fields for Zero-Shot Task-Oriented Grasping
  • NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
  • Learning Transferable Visual Models From Natural Language Supervision
  • Emerging Properties in Self-Supervised Vision Transformers
  • LERF: Language Embedded Radiance Fields
  • Open-Vocabulary Object Detection Using Captions
  • GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping