1 of 26

Language Embedded Radiance Fields for Zero-Shot Task-Oriented Grasping

Presented by Ahmet Firat Gamsiz

Adam Rashid*, Satvik Sharma*, Chung Min Kim, Justin Kerr, Lawrence Yunliang Chen

Angjoo Kanazawa, Ken Goldberg

2 of 26

Introduction

  • Many common objects must be grasped appropriately to avoid damage or facilitate performing a task
  • a knife by its handle
  • a flower by its stem
  • sunglasses by their frame
  • Ability to grasp an object part based on a desired task and constraints is called task-oriented grasping

3 of 26

Introduction

4 of 26

Motivation

  • Learning-based grasping systems are robust at grasping arbitrary objects.
  • But they typically measure grasp success based on whether the object was lifted.
  • They ignore an object’s semantic properties:
    • Even if a robot could locate your favorite sunglasses, rather than safely grasping them by the frame it may shatter the lenses

5 of 26

Motivation

  • Previous methods collect specific object affordance datasets and struggle to scale to a diverse set of objects.

  • The flexibility of natural language has the potential for specifying what and where to grasp.
  • We can use large vision-language models in a zero-shot manner for task-oriented grasping

6 of 26

Related Work - Task-Oriented Grasping

  • Probabilistically modeling human grasps
  • Extracting geometric features from labeled object parts
  • Training on part-affordance datasets in simulation
  • Transferring category-specific part grasps to new instances
  • Training object-part grasp networks by leveraging object part and manipulation affordance datasets
  • Using videos of humans interacting with objects to guide grasps
  • Using pretrained vision features to discover common parts within sets of objects

7 of 26

Related Work - Neural Radiance Fields (NeRF)

  • NeRF represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location.
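As a rough illustration only (not the paper's actual architecture), a NeRF query can be sketched as a small MLP mapping a positionally encoded 5D input to density and view-dependent color; the layer sizes and encoding frequencies below are assumptions:

    import torch
    import torch.nn as nn

    def positional_encoding(x, n_freqs=10):
        # Map each coordinate to [x, sin(2^k x), cos(2^k x)] features.
        feats = [x]
        for k in range(n_freqs):
            feats += [torch.sin(2.0**k * x), torch.cos(2.0**k * x)]
        return torch.cat(feats, dim=-1)

    class TinyNeRF(nn.Module):
        # Minimal sketch: position (x, y, z) -> density; position + view direction -> RGB.
        def __init__(self, n_freqs=10, hidden=256):
            super().__init__()
            pos_dim = 3 * (1 + 2 * n_freqs)
            self.trunk = nn.Sequential(nn.Linear(pos_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden), nn.ReLU())
            self.density_head = nn.Linear(hidden, 1)
            self.rgb_head = nn.Sequential(nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
                                          nn.Linear(hidden // 2, 3), nn.Sigmoid())

        def forward(self, xyz, view_dir):
            h = self.trunk(positional_encoding(xyz))
            sigma = torch.relu(self.density_head(h))           # volume density
            rgb = self.rgb_head(torch.cat([h, view_dir], -1))  # view-dependent radiance
            return sigma, rgb

    # Query density and color at one 3D point along one viewing direction.
    model = TinyNeRF()
    sigma, rgb = model(torch.rand(1, 3), torch.rand(1, 3))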

8 of 26

Related Work - Contrastive Language–Image Pre-training (CLIP)

  • CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples.
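A minimal sketch of the symmetric contrastive objective behind this pairing; the batch size, embedding dimension, and temperature below are placeholders rather than CLIP's actual values:

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
        # Normalize embeddings so dot products are cosine similarities.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        # Pairwise similarity logits between every image and every text in the batch.
        logits = img_emb @ txt_emb.t() / temperature
        # The correct pairing is the diagonal: image i matches text i.
        targets = torch.arange(img_emb.shape[0])
        loss_i2t = F.cross_entropy(logits, targets)      # predict text given image
        loss_t2i = F.cross_entropy(logits.t(), targets)  # predict image given text
        return 0.5 * (loss_i2t + loss_t2i)

    # Toy batch of 8 (image, text) embedding pairs with dimension 512.
    loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))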

9 of 26

Related Work - Self-DIstillation with NO Labels (DINO)

  • Self-supervised pre-training of Vision Transformers (ViT)
  • ViT features explicitly contain the scene layout and, in particular, object boundaries

10 of 26

Related Work - Language Embedded Radiance Fields (LERF)

  • A method for grounding language embeddings from off-the-shelf models like CLIP into NeRF, which enables open-ended language queries in 3D

11 of 26

Related Work - Language Embedded Radiance Fields (LERF)

  • To create LERF, augment NeRF’s outputs with a language embedding F_lang(x, s) ∈ R^d, which takes an input position x and physical scale s, and outputs a d-dimensional language embedding.

12 of 26

Related Work - Language Embedded Radiance Fields (LERF)

  • Relevancy Score: To assign each rendered language embedding φ_lang a score, compute the CLIP embedding of the text query φ_query along with the embeddings of a set of canonical phrases φ_canon^i. Compute the cosine similarity between the rendered embedding and each of these, then take the pairwise softmax between the query and each canonical phrase. The relevancy score is the minimum over canonical phrases:
      min_i  exp(φ_lang · φ_query) / (exp(φ_lang · φ_canon^i) + exp(φ_lang · φ_query))
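A sketch of this score over already-computed embeddings; the arrays here are placeholders, and the canonical phrase set is whatever LERF uses:

    import torch
    import torch.nn.functional as F

    def relevancy_score(phi_lang, phi_query, phi_canon):
        # phi_lang:  (d,)   rendered language embedding for a ray/point
        # phi_query: (d,)   CLIP text embedding of the query
        # phi_canon: (k, d) CLIP text embeddings of the canonical phrases
        phi_lang = F.normalize(phi_lang, dim=-1)
        phi_query = F.normalize(phi_query, dim=-1)
        phi_canon = F.normalize(phi_canon, dim=-1)
        sim_query = phi_lang @ phi_query   # cosine similarity with the query
        sim_canon = phi_canon @ phi_lang   # cosine similarity with each canonical phrase
        # Pairwise softmax of the query against each canonical phrase; take the minimum.
        pairwise = torch.exp(sim_query) / (torch.exp(sim_canon) + torch.exp(sim_query))
        return pairwise.min()

    score = relevancy_score(torch.randn(512), torch.randn(512), torch.randn(4, 512))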

13 of 26

Related Work - Open-vocabulary object detection (OVD)

  • Open-vocabulary object detection attempts to output masks or bounding-box detections given text prompts as input.
  • First learn a visual-semantic space using image-text pairs, then learn object detection using object annotations from a set of base classes

14 of 26

Problem

  • Given a planar surface (table or workbench) containing a set of objects, the objective is for a robot to grasp and lift a target object specified using natural language.

  • Example query: “sunglasses; ear hooks.”

15 of 26

Method - Overview

  • First, the robot captures the scene and reconstructs a LERF. Given a text query, LERF can generate a 3D relevancy map that highlights the relevant parts of the scene.

  • Second, a 3D object mask is generated using the LERF relevancy for the object query and DINO-based semantic grouping.

  • Third, a 3D part relevancy map is generated with a conditional LERF query over the object part query and the 3D object mask. The part relevancy map is used to produce a semantic grasp distribution.

16 of 26

Method - Overview

17 of 26

Method - 3D Object Extraction

  • An important limitation of LERF is its lack of spatial grouping within objects.
  • LERF-TOGO overcomes this by finding a 3D object mask given a language query, which groups the whole object together around the LERF activation.
  • To create the object mask, we leverage the 3D DINO embeddings (self-DIstillation with NO labels) present within LERF during inference, because DINO embeddings have been shown to exhibit strong object awareness and foreground-background distinction

18 of 26

Method - 3D Object Extraction

  • First, obtain a coarse object localization from LERF by rendering a top-down view of the scene and querying the object.
  • Second, produce a foreground mask by thresholding the first principal component of the top-down rendered DINO embeddings, and constrain the relevancy query to this mask to find the most relevant 3D point.
  • Third, iteratively grow the object mask by adding neighboring points of the frontier that lie within a threshold DINO similarity (see the sketch below). The output of this process is a set of 3D points lying on the target object
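A rough sketch of the region-growing step on a point cloud with per-point DINO features. The neighborhood radius, the similarity threshold, and comparing each candidate against the seed feature (rather than the frontier point) are all assumptions; scipy's KD-tree stands in for however LERF-TOGO actually finds neighbors:

    import numpy as np
    from scipy.spatial import cKDTree

    def grow_object_mask(points, dino_feats, seed_idx, radius=0.01, sim_thresh=0.8):
        # points:     (N, 3) 3D points sampled from the scene
        # dino_feats: (N, D) DINO embedding per point, assumed L2-normalized
        # seed_idx:   index of the most relevant point from the constrained LERF query
        tree = cKDTree(points)
        in_mask = np.zeros(len(points), dtype=bool)
        in_mask[seed_idx] = True
        frontier = [seed_idx]
        while frontier:
            idx = frontier.pop()
            # Candidate neighbors within a small spatial radius of the frontier point.
            for nbr in tree.query_ball_point(points[idx], r=radius):
                if in_mask[nbr]:
                    continue
                # Accept the neighbor if its DINO feature is similar enough to the seed's.
                if dino_feats[nbr] @ dino_feats[seed_idx] > sim_thresh:
                    in_mask[nbr] = True
                    frontier.append(nbr)
        return points[in_mask]   # 3D points lying on the target object

    obj_pts = grow_object_mask(np.random.rand(1000, 3),
                               np.random.randn(1000, 64),
                               seed_idx=0)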

19 of 26

Method - Conditional LERF Queries

  • CLIP has a tendency to behave as a bag-of-words: the activation for “mug” behaves very similarly to “mug handle”.

  • To condition the LERF query on the part (the second half of the query), evaluate it only over the points within the 3D object mask. This results in a distribution over the object’s 3D geometry representing the likelihood that a given point is the desired object part.
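Conceptually, the conditional query just restricts where the part relevancy is evaluated and renormalizes over the object mask. A sketch with placeholder arrays; normalizing via a softmax is one possible choice, not necessarily what the paper does:

    import numpy as np

    def conditional_part_distribution(part_relevancy, object_mask):
        # part_relevancy: (N,) LERF relevancy of the part query at each scene point
        # object_mask:    (N,) boolean mask of points belonging to the queried object
        masked = np.where(object_mask, part_relevancy, -np.inf)
        # Softmax over masked points only: a distribution over the object's geometry
        # giving the likelihood that each point is the desired part.
        exp = np.exp(masked - masked[object_mask].max())
        return exp / exp.sum()

    relevancy = np.random.rand(1000)
    mask = np.random.rand(1000) > 0.9
    part_dist = conditional_part_distribution(relevancy, mask)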

20 of 26

Method - Grasp Sampling

  • Ensuring complete coverage of grasps on objects is critical to avoid missing specific object parts.
  • GraspNet can generate 6-DOF parallel jaw grasps from a monocular RGBD point cloud.
  • Create a hemisphere of virtual cameras oriented towards the scene’s center, render point clouds from LERF at these views, and pass them to GraspNet.
  • Combine the generated grasps across views, using non-maximum suppression to remove duplicates (see the sketch below)
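A sketch of the de-duplication step: grasps from the different virtual views are pooled and near-identical grasps are suppressed, keeping the higher-confidence one. Representing each grasp by its center and ignoring orientation, as well as the distance threshold, are simplifying assumptions:

    import numpy as np

    def nms_grasps(centers, scores, dist_thresh=0.02):
        # centers: (G, 3) grasp center positions pooled from all virtual camera views
        # scores:  (G,)   GraspNet confidence for each grasp
        order = np.argsort(-scores)   # visit grasps from most to least confident
        keep, kept_centers = [], []
        for idx in order:
            c = centers[idx]
            # Suppress this grasp if it is too close to an already-kept grasp.
            if all(np.linalg.norm(c - k) > dist_thresh for k in kept_centers):
                keep.append(idx)
                kept_centers.append(c)
        return keep

    kept = nms_grasps(np.random.rand(200, 3), np.random.rand(200))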

21 of 26

Method - Grasp Ranking

  • The semantic score s_sem for a given grasp is computed as the median LERF relevancy of points within the grasp volume.
  • The geometric score s_geom is the confidence output from GraspNet, indicating grasp quality based on geometric cues.
  • To balance relevance and success likelihood, combine the two as s = 0.95·s_sem + 0.05·s_geom, so the most relevant grasps are considered first while slightly biasing towards confident grasps (see the sketch below).
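A sketch of the ranking, assuming each grasp already comes with the LERF relevancies of the points inside its grasp volume and a GraspNet confidence:

    import numpy as np

    def rank_grasps(relevancy_per_grasp, graspnet_conf, w_sem=0.95, w_geom=0.05):
        # relevancy_per_grasp: list of arrays, LERF relevancy of points inside each grasp volume
        # graspnet_conf:       (G,) geometric grasp-quality confidence from GraspNet
        s_sem = np.array([np.median(r) for r in relevancy_per_grasp])  # semantic score
        s = w_sem * s_sem + w_geom * np.asarray(graspnet_conf)         # combined score
        return np.argsort(-s)   # grasp indices from best to worst

    ranking = rank_grasps([np.random.rand(30) for _ in range(10)], np.random.rand(10))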

22 of 26

Method - Scene Reconstruction

  • Robot uses a wrist-mounted camera to capture the scene with a hemispherical trajectory centered at the workspace.
  • The capture has a radius of 45 cm, arcs ±100° around the workspace horizontally, and covers an inclination range of 30° to 75°. We capture images while the arm moves at 15 cm/s at a rate of 3 Hz, resulting in around 60 images per capture (see the trajectory sketch below).
  • While the robot moves, each image is pre-processed to extract DINO features, multi-scale CLIP, and ZoeDepth, which are used during LERF training.
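A sketch of generating camera waypoints on such a hemispherical arc (45 cm radius, ±100° azimuth, 30° to 75° inclination). The 12×5 discretization is an assumption chosen to give roughly 60 poses; how the real controller interleaves these with image capture is not shown:

    import numpy as np

    def hemisphere_waypoints(center, radius=0.45, n_azim=12, n_incl=5,
                             azim_range=(-100, 100), incl_range=(30, 75)):
        # center: (3,) workspace center that the camera should look at
        waypoints = []
        for incl in np.radians(np.linspace(*incl_range, n_incl)):
            for azim in np.radians(np.linspace(*azim_range, n_azim)):
                # Spherical -> Cartesian offset from the workspace center.
                offset = radius * np.array([np.cos(incl) * np.cos(azim),
                                            np.cos(incl) * np.sin(azim),
                                            np.sin(incl)])
                position = center + offset
                look_dir = (center - position) / np.linalg.norm(center - position)
                waypoints.append((position, look_dir))  # where the camera is, where it points
        return waypoints

    poses = hemisphere_waypoints(np.array([0.5, 0.0, 0.0]))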

23 of 26

Experiments

Part-Oriented Grasping: Object, part queries

Task-Oriented Grasping: Asking an LLM to create object part queries

Integration with an LLM Planner: Integrate as a module with an LLM planner

24 of 26

Results

  • Overall achieves a 69% success rate for physically grasping and lifting objects.
  • LERF-TOGO is able to differentiate between:
    • very fine-grained language queries like color and appearance (“matte” vs “shiny”)
    • semantically similar categories (“popsicle” vs “lollipop”)
    • long-tail object queries like “ethernet dongle”, “cork”, or “martini glass”, owing to its zero-shot use of CLIP

25 of 26

Limitations and Future Work

  • One limitation of LERF-TOGO is speed: the entire end-to-end process takes a few minutes, which can be impractical for time-sensitive applications
  • Another key limitation of LERF-TOGO is with groups of connected foreground objects, for example a bouquet of multiple flowers.
  • If there are multiple objects that match the prompt, the system will arbitrarily choose only one of them
  • LERF-TOGO is not designed for referring/comparative expressions (e.g., “mug next to the plate”, “biggest mug”).

26 of 26

References

  • Language Embedded Radiance Fields for Zero-Shot Task-Oriented Grasping
  • NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
  • Learning Transferable Visual Models From Natural Language Supervision
  • Emerging Properties in Self-Supervised Vision Transformers
  • LERF: Language Embedded Radiance Fields
  • Open-Vocabulary Object Detection Using Captions
  • GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping