Paper: OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding
Date: 10/28/2024
🏆: Aryan Magoon
🧑🏻⚖️: Aryan Magoon
👩🏼🚀: Aaryan Dhore
🧑🏽💻: Dhriti Gampa
2 of 23
🏆: Motivation
Goal: Learn multi-modal joint representation of text, images, and 3D point clouds for open-world shape understanding
Generalization!
Motivation
Problem with Current SOTA
Small datasets; difficulty aligning 3D representations with text/images due to noisy data
Applications in autonomous driving, robotics, etc.
3 of 23
🏆: OpenShape Methods
4 of 23
5 of 23
🏆: OpenShape Methods
6 of 23
Penalize Across 4 Different Pairs
3D-Text
Text-3D
3D-Image
Image-3D
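The four penalty terms above are the two directions of a CLIP-style contrastive (InfoNCE) loss between the 3D embedding and each of the text and image embeddings. A minimal NumPy sketch of this idea, with an illustrative temperature, batch size, and embedding dimension (not the paper's values):

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Contrastive cross-entropy over cosine-similarity logits:
    row i of `a` should match row i of `b` (positives on the diagonal)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Row-wise log-softmax, then take the diagonal (the matching pair).
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def multimodal_loss(f_3d, f_text, f_img):
    """Average of the four directional terms:
    3D->text, text->3D, 3D->image, image->3D."""
    return (info_nce(f_3d, f_text) + info_nce(f_text, f_3d)
            + info_nce(f_3d, f_img) + info_nce(f_img, f_3d)) / 4.0

rng = np.random.default_rng(0)
f_3d, f_text, f_img = (rng.normal(size=(8, 32)) for _ in range(3))
loss = multimodal_loss(f_3d, f_text, f_img)
```

With random (unaligned) embeddings the loss is near log(batch size); as the 3D embeddings line up with their text/image counterparts, it falls toward zero.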
7 of 23
🏆: OpenShape Methods
8 of 23
Experiments: Zero-Shot
9 of 23
Experiments: Few-Shot
10 of 23
🏆: OpenShape Strengths
The motivation is strong, and OpenShape's zero-shot results are excellent
The retrieval visualizations are interesting and clearly illustrate the quality of the embeddings produced by the encoder
Borrows a strategy from pure text-embedding training to improve performance:
Hard negative mining
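One common form of hard negative mining, shown here only as an illustrative sketch rather than the paper's exact procedure, selects for each anchor the non-matching candidates with the highest embedding similarity:

```python
import numpy as np

def hardest_negatives(anchors, candidates, k=2):
    """For each anchor, return the indices of the k most similar
    non-matching candidates (index i of `candidates` is anchor i's
    positive, so it is masked out)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sim = a @ c.T
    np.fill_diagonal(sim, -np.inf)          # exclude the positive pair
    return np.argsort(-sim, axis=1)[:, :k]  # most similar = hardest
```

These near-miss examples contribute the largest gradients in a contrastive loss, which is why oversampling them tends to sharpen the embedding space.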
11 of 23
🧑🏻⚖️: Summary and Importance
Problem Tackled:
Scalable, generalizable 3D shape representations with zero-shot capabilities for open-world understanding
Importance:
Has the potential to enhance various applications, like 3D shape retrieval, captioning, and multimodal integration
12 of 23
🧑🏻⚖️: Paper Critique
Limited improvement from alterations aside from backbone scaling
The majority of performance gains come from using a larger model
Zero-shot capabilities are overstated
Linear-probing results are extremely similar across methods
This could mean the underlying shape-representation quality is similar
13 of 23
14 of 23
15 of 23
16 of 23
🧑🏻⚖️: Experiments
Although the experiments are thorough, the comparisons are somewhat unfair.
OpenShape uses a massive backbone ("OpenCLIP ViT-bigG-14") that competing methods might not use
17 of 23
🧑🏻⚖️: Limitations
Limitations from authors
The training set of ~876K 3D shapes is still very limited
Doesn't include part-level information and focuses only on global features
Model trained purely on synthetic data
Limited improvement from text enhancement
Using models like GPT-4 for text enrichment can introduce errors and outliers into the training data
18 of 23
👩🏼🚀: Paper Summary from Pioneer
You are a pioneer. Your goal is to think how the paper being discussed could be used to accelerate other findings, help in other disciplines (e.g. robotics, science), and be combined with other techniques you have seen to create a novel result worthy of a solid publication.
Think of two or three novel applications of the work and present them
Tell us how you would go about pursuing these ideas to showcase their efficacy
19 of 23
👩🏼🚀: Medical Imaging: Cross-Modal Retrieval
Many types of medical imaging are 3D today
fMRI, computed tomography (CT), cone-beam CT, etc.
Can we adapt the 3D encoder to medical imaging datasets and align it with medical language and the relevant imaging modalities?
Clinicians could input a 3D medical scan and retrieve similar cases based on textual reports or diagnoses, or input a textual description of symptoms to find relevant 3D scans.
Use embeddings to highlight areas of interest by correlating textual findings with 3D regions
Assess retrieval accuracy using precision and recall on a held-out test set to measure the system's ability to suggest correct diagnoses
Pilot in a clinical setting with radiologists to collect feedback on real cases and gauge the method's success.
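The retrieval evaluation described above could be scored with a simple recall@k over paired scan/report embeddings. The embeddings below are placeholders for what a hypothetical medically fine-tuned encoder would produce:

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, gt_idx, k=5):
    """Fraction of queries whose ground-truth gallery item appears in
    the top-k retrievals under cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    topk = np.argsort(-(q @ g.T), axis=1)[:, :k]
    hits = [gt in row for gt, row in zip(gt_idx, topk)]
    return float(np.mean(hits))
```

Here `gt_idx[i]` is the index of the gallery item (e.g., the report matching scan `i`) that counts as a correct retrieval; precision@k follows analogously when a query has multiple relevant items.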
20 of 23
👩🏼🚀: Robotics: Voice-Controlled Manipulation in Unstructured Environments
Utilize 3D shape representations to enable a robot to understand and execute voice commands
Integrate OpenShape into the robot's SLAM pipeline and use the encoder to process LiDAR point clouds to build a map of the environment.
Integrate pre-existing speech recognition to convert voice to text and extract intent and object references
Leverage the alignment with language embeddings to match commands to objects in the area.
Ex: “Go to the table”
Create ROS environment and test with
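The command-to-object step above could be sketched as a nearest-neighbor lookup in the shared embedding space. The vectors here are toy stand-ins for the outputs of OpenShape's text and 3D encoders:

```python
import numpy as np

def match_command(command_emb, object_embs, object_labels):
    """Return the label of the mapped object whose embedding is closest
    (cosine similarity) to the encoded voice command."""
    c = command_emb / np.linalg.norm(command_emb)
    o = object_embs / np.linalg.norm(object_embs, axis=1, keepdims=True)
    return object_labels[int(np.argmax(o @ c))]
```

In the pipeline, `object_embs` would come from encoding the segmented point clouds in the SLAM map, and `command_emb` from encoding the transcribed command (e.g., "go to the table") with the aligned text encoder.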
21 of 23
🧑🏽💻: Paper Summary from Entrepreneur
You are an entrepreneur. This means you are constantly on the lookout for cool new ideas and to build new products (which will hopefully make profit!). Your goal is to think how the paper being discussed could be used to build a new product – remember the product does not have to be “novel” but it should have high chances of working well and robustly.
Think of one or two products derived from the work
Tell us how you would go about building a demo to showcase each idea – this is the demo you would show for your seed or round A funding.
22 of 23
🧑🏽💻: 3D Asset Search and Recommendation
A platform for game developers, designers, and other 3D modelers that lets users search for 3D models using text or image queries.
Demo:
Collect a diverse 3D-model dataset, compute OpenShape embeddings, and store them in a vector database
Implement vector similarity search using FAISS to efficiently retrieve 3D models from their embeddings; handle text, image, and even 3D-model queries
Fetch results and show that they are relevant to the user's query; report recall@k scores to validate the method
Since we have triplets of image, text, and model, we can easily recommend models that are similar but not identical to what the user searched for before
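The FAISS step in the demo amounts to an exact inner-product search over stored embeddings. A minimal NumPy stand-in for `faiss.IndexFlatIP` (assumed here; vectors are L2-normalized so inner product equals cosine similarity) illustrates the retrieval logic:

```python
import numpy as np

class FlatIPIndex:
    """Minimal NumPy stand-in for FAISS's exact inner-product index
    (faiss.IndexFlatIP); embeddings are L2-normalized on insertion so
    the inner product equals cosine similarity."""

    def __init__(self):
        self.vectors = None

    def add(self, x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        self.vectors = x if self.vectors is None else np.vstack([self.vectors, x])

    def search(self, queries, k):
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        sims = q @ self.vectors.T
        idx = np.argsort(-sims, axis=1)[:, :k]
        return np.take_along_axis(sims, idx, axis=1), idx
```

Because text, image, and 3D queries all land in the same embedding space, the same index serves every query modality; in production this class would be replaced by an actual FAISS index for scale.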