
LLaMA-Mesh:

Unifying 3D Mesh Generation with Language Models

Zhengyi Wang*, Jonathan Lorraine*, Yikai Wang,

Hang Su, Jun Zhu, Sanja Fidler*, Xiaohui Zeng*

*NVIDIA

12/14/2024



All assets shown are available on our project webpage:

https://research.nvidia.com/labs/toronto-ai/LLaMA-Mesh/


1-min Overview

  • We enable LLMs to understand and generate 3D meshes by representing them as text and fine-tuning. This unifies the 3D and text modalities in a single model, preserves its language abilities, and unlocks conversational 3D creation with mesh understanding.


Our Text Representation of Meshes - OBJ Files

  • We represent 3D meshes using the OBJ file format, converting vertex coordinates and face definitions into plain text sequences that LLMs can process directly without modifying tokenizers or vocabularies.
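
  • Below is a minimal sketch of this text representation (our illustration, not the paper's exact code); the `vertices` and `faces` inputs are hypothetical:

```python
# A minimal sketch of serializing a triangle mesh into the plain-text OBJ
# format the LLM reads and writes: "v" lines are vertices, "f" lines are
# 1-indexed faces.
def mesh_to_obj_text(vertices, faces):
    lines = []
    for x, y, z in vertices:
        lines.append(f"v {x} {y} {z}")   # vertex line, e.g. "v 0.5 0.5 0.5"
    for a, b, c in faces:
        lines.append(f"f {a} {b} {c}")   # face line, e.g. "f 1 2 3"
    return "\n".join(lines)

# A single triangle: three vertices and one face referencing them.
print(mesh_to_obj_text([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [(1, 2, 3)]))
```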


Mesh Generation Ability without Finetuning

  • Pretrained LLMs show promise for generating simple 3D objects in text format.
  • However, mesh quality and complexity are often unsatisfactory.
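
  • As a hedged sketch of this zero-shot setup, one can prompt an off-the-shelf instruction-tuned LLM for OBJ text via Hugging Face `transformers`; the checkpoint name and prompt below are assumptions, not the paper's exact configuration:

```python
# Sketch of zero-shot mesh generation: ask a pretrained, instruction-tuned
# LLM for OBJ text. Checkpoint and prompt are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user",
             "content": "Write an OBJ file ('v' and 'f' lines only) for a unit cube."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```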


Our Finetuning Strategy

  • We use a combination of rule-based methods ((a) and (b)) and LLM-augmented methods ((c) and (d)) to construct a supervised fine-tuning (SFT) dataset for mesh generation and understanding.
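
  • A sketch of the rule-based construction (the templates are illustrative, not the exact wording we use): each captioned mesh yields one generation turn and one understanding turn.

```python
# Sketch of rule-based SFT example construction: pair a caption with OBJ
# text for generation, and OBJ text with a caption for understanding.
import random

GEN_TEMPLATES = ["Create a 3D model of {caption}.", "Build a 3D mesh of {caption}."]
UND_TEMPLATES = ["What is this 3D object?\n{obj}", "Describe the following mesh:\n{obj}"]

def make_generation_example(caption, obj_text):
    prompt = random.choice(GEN_TEMPLATES).format(caption=caption)
    return {"conversations": [{"role": "user", "content": prompt},
                              {"role": "assistant", "content": obj_text}]}

def make_understanding_example(caption, obj_text):
    prompt = random.choice(UND_TEMPLATES).format(obj=obj_text)
    return {"conversations": [{"role": "user", "content": prompt},
                              {"role": "assistant", "content": caption}]}

example = make_generation_example("a wooden chair", "v 0 0 0\nv 1 0 0\nv 0 1 0\nf 1 2 3")
print(example)
```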


Finetuning Details: Dataset Mixtures

  • We use a mixture of generation, understanding, and general conversation data when finetuning.

  • We show that our model preserves language capabilities on standard metrics.
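
  • A sketch of weighted mixture sampling (the datasets and ratios below are illustrative placeholders, not the exact mixture we report):

```python
# Sketch of weighted dataset mixing for SFT batches; weights are illustrative.
import random

datasets = {
    "mesh_generation":    [{"task": "gen", "id": i} for i in range(100)],
    "mesh_understanding": [{"task": "und", "id": i} for i in range(100)],
    "general_chat":       [{"task": "chat", "id": i} for i in range(100)],
}
weights = {"mesh_generation": 0.4, "mesh_understanding": 0.2, "general_chat": 0.4}

def sample_training_example():
    # Pick a dataset by mixture weight, then a uniform example within it.
    name = random.choices(list(weights), weights=list(weights.values()))[0]
    return random.choice(datasets[name])

print(sample_training_example())
```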


Finetuning Details: Quantization for Context Length

  • We quantize vertex coordinates in the OBJ files, reducing the token count with minimal impact on geometric fidelity.
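
  • A sketch of the quantization step (the bin count here is illustrative):

```python
# Sketch of vertex quantization: normalize the mesh into a unit cube and
# snap each coordinate to a small integer so it tokenizes to few characters.
import numpy as np

def quantize_vertices(vertices, n_bins=64):  # bin count is a tunable parameter
    v = np.asarray(vertices, dtype=np.float64)
    lo, hi = v.min(axis=0), v.max(axis=0)
    scale = (hi - lo).max() or 1.0           # uniform scale; avoid divide-by-zero
    unit = (v - lo) / scale                  # fit into the unit cube [0, 1]^3
    return np.clip((unit * (n_bins - 1)).round(), 0, n_bins - 1).astype(int)

verts = [(0.12, -0.40, 0.55), (0.90, 0.10, -0.33)]
print(quantize_vertices(verts))  # short integers instead of long decimal floats
```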


Results: Sampled Dialogues

  • The model retains its language-understanding abilities, producing coherent, contextually appropriate dialogue and describing meshes in natural language.


Results: Comparison to Mesh-Generation Methods

  • LLaMA-Mesh achieves mesh generation quality comparable to specialized models trained from scratch on 3D data, as evidenced by qualitative comparisons with state-of-the-art methods like MeshXL.


Results: Sample Diversity Per-Prompt

  • We generate diverse 3D meshes for each prompt.


Blender Addon powered by LLaMA-Mesh: https://github.com/huggingface/meshgen


Interactive Demo on Hugging Face: https://huggingface.co/spaces/Zhengyi/LLaMA-Mesh


Limitations

  • Data:
    • Scarcity: only ~25k shapes
      • We filter Objaverse (with Cap3D captions) for objects with <500 faces (see the sketch after this list)
    • Limited amount of sophisticated multi-step dialogues for 3D objects
      • We use UltraChat as our general conversational dataset
  • Problem difficulty:
    • Generating at an 8k context length is harder than at our current 4k
    • More rounds of dialogue (e.g., chain-of-thought) are harder to support
  • Infrastructure:
    • We use an 8B-parameter model; a larger one could do better
    • Training on more GPUs with more data could help
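
  • A sketch of the face-count filter mentioned above (assumed tooling via the `trimesh` library, not our exact pipeline):

```python
# Sketch of filtering meshes by face count with trimesh (assumed tooling).
import trimesh

def keep_mesh(path, max_faces=500):
    mesh = trimesh.load(path, force="mesh")  # load the file as a single mesh
    return len(mesh.faces) < max_faces       # keep only low-poly objects
```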


Future

  • Additional modalities, like textures or physical properties
  • Part-based generation, augmenting, labeling
  • More sophisticated dialogues
  • Training with longer context lengths
  • Chain of thought reasoning for the LLM (or VLM)


Jonathan Lorraine

Zhengyi Wang

Xiaohui Zeng

Sanja Fidler

Jun Zhu

Hang Su

Yikai Wang


More Info


Questions


Spare Slides


Training Details


Results Gallery

  • We can generate high-quality, diverse meshes with artist-like topology.