Paper: OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding
Date: 10/28/2024
🏆: Aryan Magoon
🧑🏻⚖️: Aryan Magoon
👩🏼🚀: Aaryan Dhore
🧑🏽💻: Dhriti Gampa
2 of 23
🏆: Motivation
Goal: Learn multi-modal joint representation of text, images, and 3D point clouds for open-world shape understanding
Generalization!
Motivation
Problem with Current SOTA
Small datasets; difficulty aligning 3D representations with text/images due to noisy data
Applications in autonomous driving, robotics, etc.
3 of 23
🏆: OpenShape Methods
4 of 23
5 of 23
🏆: OpenShape Methods
6 of 23
Penalize Across 4 Different Pairs
3D-Text
Text-3D
3D-Image
Image-3D
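The four penalty terms above are the two directions of a CLIP-style contrastive (InfoNCE) loss between the 3D embedding and each of the text and image embeddings. A minimal NumPy sketch of this idea, with an illustrative temperature, batch size, and embedding dimension (not the paper's values):

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Contrastive cross-entropy over cosine-similarity logits:
    row i of `a` should match row i of `b` (positives on the diagonal)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Row-wise log-softmax, then take the diagonal (the matching pair).
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def multimodal_loss(f_3d, f_text, f_img):
    """Average of the four directional terms:
    3D->text, text->3D, 3D->image, image->3D."""
    return (info_nce(f_3d, f_text) + info_nce(f_text, f_3d)
            + info_nce(f_3d, f_img) + info_nce(f_img, f_3d)) / 4.0

rng = np.random.default_rng(0)
f_3d, f_text, f_img = (rng.normal(size=(8, 32)) for _ in range(3))
loss = multimodal_loss(f_3d, f_text, f_img)
```

With random (unaligned) embeddings the loss is near log(batch size); as the 3D embeddings line up with their text/image counterparts, it falls toward zero.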
7 of 23
🏆: OpenShape Methods
8 of 23
Experiments: Zero-Shot
9 of 23
Experiments: Few-Shot
10 of 23
🏆: OpenShape Strengths
The motivation is strong, and OpenShape's zero-shot results are excellent
The retrieval visualizations are interesting and clearly illustrate the quality of the embeddings produced by the encoder
Borrows a strategy from pure text-embedding training to improve performance:
Hard negative mining
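One common form of hard negative mining, shown here only as an illustrative sketch rather than the paper's exact procedure, selects for each anchor the non-matching candidates with the highest embedding similarity:

```python
import numpy as np

def hardest_negatives(anchors, candidates, k=2):
    """For each anchor, return the indices of the k most similar
    non-matching candidates (index i of `candidates` is anchor i's
    positive, so it is masked out)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sim = a @ c.T
    np.fill_diagonal(sim, -np.inf)          # exclude the positive pair
    return np.argsort(-sim, axis=1)[:, :k]  # most similar = hardest
```

These near-miss examples contribute the largest gradients in a contrastive loss, which is why oversampling them tends to sharpen the embedding space.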
11 of 23
🧑🏻⚖️: Summary and Importance
Problem Tackled:
Scalable, generalizable 3D shape representations with zero-shot capabilities for open-world understanding
Importance:
Has the potential to enhance various applications, like 3D shape retrieval, captioning, and multimodal integration
12 of 23
🧑🏻⚖️: Paper Critique
Limited improvement from alterations aside from backbone scaling
The majority of performance gains come from using a larger model
Zero-shot capabilities are overstated
Linear-probing results are extremely similar across methods
This could mean the underlying shape-representation quality is similar
13 of 23
14 of 23
15 of 23
16 of 23
🧑🏻⚖️: Experiments
Although the experiments are thorough, the comparisons are somewhat unfair.
OpenShape uses a massive backbone ("OpenCLIP ViT-bigG-14") that competing methods might not use
17 of 23
🧑🏻⚖️: Limitations
Limitations from authors
The training set of ~876K 3D shapes is still very limited
Doesn't include part-level information and focuses only on global features
Model trained purely on synthetic data
Limited improvement from text enhancement
Using models like GPT-4 for text enrichment can introduce errors and outliers into the training data
18 of 23
👩🏼🚀: Paper Summary from Pioneer
You are a pioneer. Your goal is to think how the paper being discussed could be used to accelerate other findings, help in other disciplines (e.g. robotics, science), and be combined with other techniques you have seen to create a novel result worthy of a solid publication.
Think of two or three novel applications of the work and present them
Tell us how you would go about pursuing these ideas to showcase their efficacy
19 of 23
👩🏼🚀: Medical Imaging: Cross-Modal Retrieval
Many types of medical imaging are 3D today
fMRI, computed tomography (CT), cone-beam CT, etc.
Can we adapt the 3D encoder to medical imaging datasets and align it with medical language and the relevant imaging modalities?
Clinicians could input a 3D medical scan and retrieve similar cases based on textual reports or diagnoses, or input a textual description of symptoms to find relevant 3D scans.
Use embeddings to highlight areas of interest by correlating textual findings with 3D regions
Assess retrieval accuracy using precision and recall on a held-out test set to measure the system's ability to suggest correct diagnoses
Pilot in a clinical setting with radiologists to collect feedback on real cases and gauge the method's success.
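The retrieval evaluation described above could be scored with a simple recall@k over paired scan/report embeddings. The embeddings below are placeholders for what a hypothetical medically fine-tuned encoder would produce:

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, gt_idx, k=5):
    """Fraction of queries whose ground-truth gallery item appears in
    the top-k retrievals under cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    topk = np.argsort(-(q @ g.T), axis=1)[:, :k]
    hits = [gt in row for gt, row in zip(gt_idx, topk)]
    return float(np.mean(hits))
```

Here `gt_idx[i]` is the index of the gallery item (e.g., the report matching scan `i`) that counts as a correct retrieval; precision@k follows analogously when a query has multiple relevant items.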
20 of 23
👩🏼🚀: Robotics: Voice-Controlled Manipulation in Unstructured Environments
Utilize 3D shape representations to enable a robot to understand and execute voice commands
Integrate OpenShape into the robot's SLAM pipeline and use the encoder to process LiDAR point clouds to build a map of the environment.
Integrate pre-existing speech recognition to convert voice to text and extract intent and object references
Leverage the alignment with language embeddings to match commands to objects in the area.
Ex: “Go to the table”
Create ROS environment and test with
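The command-to-object step above could be sketched as a nearest-neighbor lookup in the shared embedding space. The vectors here are toy stand-ins for the outputs of OpenShape's text and 3D encoders:

```python
import numpy as np

def match_command(command_emb, object_embs, object_labels):
    """Return the label of the mapped object whose embedding is closest
    (cosine similarity) to the encoded voice command."""
    c = command_emb / np.linalg.norm(command_emb)
    o = object_embs / np.linalg.norm(object_embs, axis=1, keepdims=True)
    return object_labels[int(np.argmax(o @ c))]
```

In the pipeline, `object_embs` would come from encoding the segmented point clouds in the SLAM map, and `command_emb` from encoding the transcribed command (e.g., "go to the table") with the aligned text encoder.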
21 of 23
🧑🏽💻: Paper Summary from Entrepreneur
You are an entrepreneur. This means you are constantly on the lookout for cool new ideas and to build new products (which will hopefully make profit!). Your goal is to think how the paper being discussed could be used to build a new product – remember the product does not have to be “novel” but it should have high chances of working well and robustly.
Think of one or two products derived from the work
Tell us how you would go about building a demo to showcase each idea – this is the demo you would show for your seed or round A funding.
22 of 23
🧑🏽💻: 3D Asset Search and Recommendation
A platform for game developers, designers, and other 3D modelers that lets users search for 3D models using text or image queries.
Demo:
Collect a diverse 3D-model dataset, compute OpenShape embeddings, and store them in a vector database
Implement vector similarity search using FAISS to efficiently retrieve 3D models from their embeddings; handle text, image, and even 3D-model queries
Fetch results and show that they are relevant to the user's query; report recall@k scores to validate the method
Since we have triplets of image, text, and model, we can easily recommend models that are similar but not identical to what the user searched for before
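The FAISS step in the demo amounts to an exact inner-product search over stored embeddings. A minimal NumPy stand-in for `faiss.IndexFlatIP` (assumed here; vectors are L2-normalized so inner product equals cosine similarity) illustrates the retrieval logic:

```python
import numpy as np

class FlatIPIndex:
    """Minimal NumPy stand-in for FAISS's exact inner-product index
    (faiss.IndexFlatIP); embeddings are L2-normalized on insertion so
    the inner product equals cosine similarity."""

    def __init__(self):
        self.vectors = None

    def add(self, x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        self.vectors = x if self.vectors is None else np.vstack([self.vectors, x])

    def search(self, queries, k):
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        sims = q @ self.vectors.T
        idx = np.argsort(-sims, axis=1)[:, :k]
        return np.take_along_axis(sims, idx, axis=1), idx
```

Because text, image, and 3D queries all land in the same embedding space, the same index serves every query modality; in production this class would be replaced by an actual FAISS index for scale.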