1 of 23

Paper: OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding

Date: 10/28/2024

🏆: Aryan Magoon

🧑🏻‍⚖️: Aryan Magoon

👩🏼‍🚀: Aaryan Dhore

🧑🏽‍💻: Dhriti Gampa

2 of 23

🏆: Motivation

  • Goal: Learn a joint multi-modal representation of text, images, and 3D point clouds for open-world shape understanding
    • Generalization!
  • Motivation
    • Problem with Current SOTA
      • Small datasets; difficulty aligning 3D representations with text/image embeddings due to noisy text annotations
    • Applications in Autonomous Driving, Robotics etc.

3 of 23

🏆: OpenShape Methods

4 of 23

5 of 23

🏆: OpenShape Methods

6 of 23

Penalize Across 4 Different Pairs

3D-Text

Text-3D

3D-Image

Image-3D
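A minimal sketch of what penalizing across the four pairs could look like, assuming a CLIP-style symmetric contrastive (InfoNCE) loss averaged over the four directions. The function names, the temperature value of 0.07, and the toy 2-D embeddings are illustrative assumptions, not taken from the paper's code.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys in the batch."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)

def four_way_loss(shape, text, image, temperature=0.07):
    """Average the loss over 3D-Text, Text-3D, 3D-Image, and Image-3D."""
    pairs = [(shape, text), (text, shape), (shape, image), (image, shape)]
    return sum(info_nce(a, b, temperature) for a, b in pairs) / 4.0

# Toy batch of two objects (hypothetical embeddings).
shape = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9]]
image = [[1.0, 0.2], [0.2, 1.0]]

aligned = four_way_loss(shape, text, image)
shuffled = four_way_loss(shape, text[::-1], image[::-1])
```

With matched triplets the loss is close to zero; swapping the text and image rows so that they no longer match the shapes makes it much larger.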

7 of 23

🏆: OpenShape Methods

8 of 23

Experiments: Zero-Shot
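A rough sketch of how zero-shot classification with aligned embeddings typically works: embed each class name as text, embed the shape, and pick the class whose text embedding is most similar. The 3-D vectors below are hypothetical stand-ins for real encoder outputs, and `zero_shot_classify` is an illustrative name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(shape_embedding, class_text_embeddings):
    """Return the class whose text embedding is closest to the shape embedding."""
    scores = {name: cosine(shape_embedding, emb)
              for name, emb in class_text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical text embeddings for three class names.
class_text_embeddings = {
    "chair": [0.9, 0.1, 0.0],
    "table": [0.1, 0.9, 0.0],
    "lamp": [0.0, 0.1, 0.9],
}
# Hypothetical embedding of an unseen point cloud.
shape_embedding = [0.8, 0.3, 0.1]

predicted, scores = zero_shot_classify(shape_embedding, class_text_embeddings)
```

No classifier is trained here: new categories are added simply by adding their names to the dictionary.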

9 of 23

Experiments: Few-Shot

10 of 23

🏆: OpenShape Strengths

  • Motivation is very strong, and OpenShape’s zero-shot results are excellent
  • Retrieval visualizations are interesting and clearly illustrate the quality of the embeddings generated by the encoder
  • Borrows a strategy from text-embedding training to improve performance
    • Hard Negative Mining
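One common form of hard negative mining, sketched under assumptions: for each anchor, pick the most similar items that carry a different label, so training batches contain confusable pairs. This is a generic illustration; the paper's exact mining procedure may differ, and the embeddings and labels below are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mine_hard_negatives(embeddings, labels, anchor_index, k=2):
    """Return indices of the k items most similar to the anchor that have a different label."""
    anchor = embeddings[anchor_index]
    candidates = [
        (cosine(anchor, emb), i)
        for i, emb in enumerate(embeddings)
        if labels[i] != labels[anchor_index]
    ]
    candidates.sort(reverse=True)  # highest similarity first
    return [i for _, i in candidates[:k]]

# Hypothetical embeddings: the stool sits close to the chairs, the lamp does not.
embeddings = [[1.0, 0.0], [0.9, 0.3], [0.8, 0.6], [0.0, 1.0]]
labels = ["chair", "chair", "stool", "lamp"]

hard = mine_hard_negatives(embeddings, labels, anchor_index=0, k=1)
```

Here the stool is selected as the hard negative for the first chair, while the second chair is skipped because it shares the anchor's label.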

11 of 23

🧑🏻‍⚖️: Summary and Importance

  • Problem Tackled:
    • Scalable, generalizable 3D shape representations with zero-shot capabilities for open-world understanding
  • Importance:
    • Has the potential to enhance various applications, like 3D Shape retrieval, captioning, and integration

12 of 23

🧑🏻‍⚖️: Paper Critique

  • Limited improvement from alterations aside from backbone scaling
    • Majority of performance improvements come from using larger model

  • Zero-Shot Capabilities are overstated

  • Linear-probing results are extremely similar across methods
    • Could mean the actual shape representation quality is similar

13 of 23


15 of 23

16 of 23

🧑🏻‍⚖️: Experiments

  • Although the depth of the experiments is great, the comparisons are somewhat unfair.
    • Uses a massive backbone, “OpenCLIP ViT-bigG-14”, which other methods might not use

17 of 23

🧑🏻‍⚖️: Limitations

  • Limitations from authors
    • 876K 3D shapes is still very limited
    • Doesn’t include part-level information and only focuses on global features
    • Model trained purely on synthetic data
  • Limited Improvement from text enhancement
  • Relying on models like GPT-4 in the data pipeline can introduce errors and outliers into the training dataset

18 of 23

👩🏼‍🚀: Paper Summary from Pioneer

You are a pioneer. Your goal is to think how the paper being discussed could be used to accelerate other findings, help in other disciplines (e.g. robotics, science), and be combined with other techniques you have seen to create a novel result worthy of a solid publication.

  • Think of two or three novel applications of the work and present them
  • Tell us how you would go about pursuing these ideas to showcase their efficacy

19 of 23

👩🏼‍🚀: Medical Imaging: Cross-Modal Retrieval

  • Many types of medical imaging are 3D today
    • fMRI, computed tomography (CT), cone-beam computed tomography (CBCT), etc.
  • Can we adapt the 3D encoder to medical imaging datasets and align it with medical language and relevant imaging modalities?
  • Clinicians could input a 3D medical scan to retrieve similar cases based on textual reports or diagnoses, or input a textual description of symptoms to find relevant 3D scans.
  • Use embeddings to highlight areas of interest by correlating textual findings with 3D regions
  • Assess retrieval accuracy using precision and recall on a held-out test set to measure the system’s ability to correctly suggest diagnoses based on results
  • Pilot in a clinical setting with radiologists to collect feedback on how well the method works on real patient cases.
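The precision and recall evaluation mentioned above could be computed per query roughly as follows. This is a sketch: the case identifiers are hypothetical, and "relevant" is assumed to mean held-out cases sharing the query's diagnosis.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall over the top-k retrieved items for a single query."""
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Ranked retrieval result for one query scan (hypothetical case IDs).
retrieved = ["case_07", "case_02", "case_11", "case_05"]
# Held-out cases that share the query's diagnosis.
relevant = {"case_02", "case_05", "case_09"}

p, r = precision_recall_at_k(retrieved, relevant, k=3)
```

Averaging these values over all held-out queries gives the system-level numbers to report.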

20 of 23

👩🏼‍🚀: Robotics: Voice-Controlled Manipulation in Unstructured Environments

  • Utilize 3D shape representations to enable a robot to understand and execute voice commands
  • Integrate OpenShape into the robot’s SLAM pipeline and use the encoder to process point clouds from LiDAR to build a map of the environment.
  • Integrate pre-existing speech recognition to convert voice to text and extract intent and object references
  • Leverage the Alignment with language embeddings to match commands to objects in the area.
    • Ex: “Go to the table”
  • Create ROS environment and test with

21 of 23

🧑🏽‍💻: Paper Summary from Entrepreneur

You are an entrepreneur. This means you are constantly on the lookout for cool new ideas and to build new products (which will hopefully make profit!). Your goal is to think how the paper being discussed could be used to build a new product – remember the product does not have to be “novel” but it should have high chances of working well and robustly.

  • Think of one or two products derived from the work
  • Tell us how you would go about building a demo to showcase each idea – this is the demo you would show for your seed or round A funding.

22 of 23

🧑🏽‍💻: 3D Asset Search and Recommendation

  • Platform for 3D artists, game developers, designers, and other 3D modelers to search for 3D models using text and image descriptions.
  • Demo:
    • Collect a diverse 3D model dataset, then compute OpenShape embeddings and store them in a vector database
    • Implement vector similarity search using FAISS to efficiently retrieve 3D models based on embeddings; support text, image, and even 3D-model queries
    • Fetch results and show that they are relevant to the user query; report recall@k scores to demonstrate the validity of the method
    • Since we have the triplets of image, text, and model, we can easily provide recommendations for new models that are similar but not identical to what they were searching for before
      • Similar to hard-negative mining strategy
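The demo pipeline above can be sketched without any dependencies as a brute-force cosine search plus a recall@k check. In a real demo, an inner-product index such as FAISS's `IndexFlatIP` over normalized vectors would likely replace the `BruteForceIndex` class, which is a hypothetical stand-in; the asset names and vectors are made up.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

class BruteForceIndex:
    """Tiny stand-in for a vector database: stores unit vectors, scans all on search."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, item_id, vector):
        self.ids.append(item_id)
        self.vectors.append(normalize(vector))

    def search(self, query, k):
        """Return the ids of the k stored items most similar to the query."""
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), item_id)
            for item_id, v in zip(self.ids, self.vectors)
        ]
        scored.sort(reverse=True)
        return [item_id for _, item_id in scored[:k]]

def recall_at_k(index, queries, k):
    """Fraction of (query_vector, target_id) pairs whose target appears in the top k."""
    hits = sum(1 for query, target in queries if target in index.search(query, k))
    return hits / len(queries)

index = BruteForceIndex()
index.add("oak_chair", [0.9, 0.1, 0.0])
index.add("desk_lamp", [0.0, 0.2, 0.9])
index.add("bar_stool", [0.8, 0.4, 0.1])

# Queries could come from the text, image, or 3D encoder, since all share one space.
queries = [([1.0, 0.0, 0.0], "oak_chair"), ([0.0, 0.0, 1.0], "desk_lamp")]

top2 = index.search([1.0, 0.0, 0.0], k=2)
score = recall_at_k(index, queries, k=1)
```

The second-ranked result for the chair query is the bar stool, which is the kind of "similar but not identical" item the recommendation feature would surface.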

23 of 23

Questions?