
LLaMA-Mesh:

Unifying 3D Mesh Generation with Language Models

Zhengyi Wang*, Jonathan Lorraine*, Yikai Wang,

Hang Su, Jun Zhu, Sanja Fidler*, Xiaohui Zeng*

*NVIDIA

12/14/2024



All assets shown are available on our project webpage:

https://research.nvidia.com/labs/toronto-ai/LLaMA-Mesh/


1-min Overview

  • We enable LLMs to understand and generate 3D meshes by representing them as text and fine-tuning. This unifies the 3D and text modalities in a single model, preserves its language abilities, and unlocks conversational 3D creation with mesh understanding.


Our Text Representation of Meshes - OBJ Files

  • We represent 3D meshes using the OBJ file format, converting vertex coordinates and face definitions into plain text sequences that LLMs can process directly without modifying tokenizers or vocabularies.
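
  • Below is a minimal sketch of this text representation (our illustration, not the paper's exact code); the `vertices` and `faces` inputs are hypothetical:

```python
# A minimal sketch of serializing a triangle mesh into the plain-text OBJ
# format the LLM reads and writes: "v" lines are vertices, "f" lines are
# 1-indexed faces.
def mesh_to_obj_text(vertices, faces):
    lines = []
    for x, y, z in vertices:
        lines.append(f"v {x} {y} {z}")   # vertex line, e.g. "v 0.5 0.5 0.5"
    for a, b, c in faces:
        lines.append(f"f {a} {b} {c}")   # face line, e.g. "f 1 2 3"
    return "\n".join(lines)

# A single triangle: three vertices and one face referencing them.
print(mesh_to_obj_text([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [(1, 2, 3)]))
```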


Mesh Generation Ability without Finetuning

  • Pretrained LLMs show promise for generating simple 3D objects in text format.
  • However, mesh quality and complexity are often unsatisfactory.
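
  • As a hedged sketch of this zero-shot setup, one can prompt an off-the-shelf instruction-tuned LLM for OBJ text via Hugging Face `transformers`; the checkpoint name and prompt below are assumptions, not the paper's exact configuration:

```python
# Sketch of zero-shot mesh generation: ask a pretrained, instruction-tuned
# LLM for OBJ text. Checkpoint and prompt are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user",
             "content": "Write an OBJ file ('v' and 'f' lines only) for a unit cube."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```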


Our Finetuning Strategy

  • We use a combination of rule-based methods ((a) and (b)) and LLM-augmented methods ((c) and (d)) to construct a supervised fine-tuning (SFT) dataset for mesh generation and understanding.
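
  • A sketch of the rule-based construction (the templates are illustrative, not the exact wording we use): each captioned mesh yields one generation turn and one understanding turn.

```python
# Sketch of rule-based SFT example construction: pair a caption with OBJ
# text for generation, and OBJ text with a caption for understanding.
import random

GEN_TEMPLATES = ["Create a 3D model of {caption}.", "Build a 3D mesh of {caption}."]
UND_TEMPLATES = ["What is this 3D object?\n{obj}", "Describe the following mesh:\n{obj}"]

def make_generation_example(caption, obj_text):
    prompt = random.choice(GEN_TEMPLATES).format(caption=caption)
    return {"conversations": [{"role": "user", "content": prompt},
                              {"role": "assistant", "content": obj_text}]}

def make_understanding_example(caption, obj_text):
    prompt = random.choice(UND_TEMPLATES).format(obj=obj_text)
    return {"conversations": [{"role": "user", "content": prompt},
                              {"role": "assistant", "content": caption}]}

example = make_generation_example("a wooden chair", "v 0 0 0\nv 1 0 0\nv 0 1 0\nf 1 2 3")
print(example)
```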


Finetuning Details: Dataset Mixtures

  • We use a mixture of generation, understanding, and general conversation data when finetuning.

  • We show that our model preserves language capabilities on standard metrics.
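
  • A sketch of weighted mixture sampling (the datasets and ratios below are illustrative placeholders, not the exact mixture we report):

```python
# Sketch of weighted dataset mixing for SFT batches; weights are illustrative.
import random

datasets = {
    "mesh_generation":    [{"task": "gen", "id": i} for i in range(100)],
    "mesh_understanding": [{"task": "und", "id": i} for i in range(100)],
    "general_chat":       [{"task": "chat", "id": i} for i in range(100)],
}
weights = {"mesh_generation": 0.4, "mesh_understanding": 0.2, "general_chat": 0.4}

def sample_training_example():
    # Pick a dataset by mixture weight, then a uniform example within it.
    name = random.choices(list(weights), weights=list(weights.values()))[0]
    return random.choice(datasets[name])

print(sample_training_example())
```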


Finetuning Details: Quantization for Context Length

  • We quantize vertex coordinates in the OBJ files, reducing the token count with minimal impact on geometric fidelity.
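
  • A sketch of the quantization step (the bin count here is illustrative):

```python
# Sketch of vertex quantization: normalize the mesh into a unit cube and
# snap each coordinate to a small integer so it tokenizes to few characters.
import numpy as np

def quantize_vertices(vertices, n_bins=64):  # bin count is a tunable parameter
    v = np.asarray(vertices, dtype=np.float64)
    lo, hi = v.min(axis=0), v.max(axis=0)
    scale = (hi - lo).max() or 1.0           # uniform scale; avoid divide-by-zero
    unit = (v - lo) / scale                  # fit into the unit cube [0, 1]^3
    return np.clip((unit * (n_bins - 1)).round(), 0, n_bins - 1).astype(int)

verts = [(0.12, -0.40, 0.55), (0.90, 0.10, -0.33)]
print(quantize_vertices(verts))  # short integers instead of long decimal floats
```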


Results: Sampled Dialogues

  • The model retains its language-understanding abilities, producing coherent, contextually appropriate dialogue and describing meshes in natural language.


Results: Comparison to Mesh-Generation Methods

  • LLaMA-Mesh achieves mesh generation quality comparable to specialized models trained from scratch on 3D data, as evidenced by qualitative comparisons with state-of-the-art methods like MeshXL.


Results: Sample Diversity Per-Prompt

  • We generate diverse 3D meshes for each prompt.


Blender Addon powered by LLaMA-Mesh: https://github.com/huggingface/meshgen


Interactive Demo on Hugging Face: https://huggingface.co/spaces/Zhengyi/LLaMA-Mesh


Limitations

  • Data:
    • Scarcity: only ~25k shapes
      • We filter Objaverse (with Cap3D captions) for objects with <500 faces (see the sketch after this list)
    • Limited amount of sophisticated multi-step dialogues for 3D objects
      • We use UltraChat as our general conversational dataset
  • Problem difficulty:
    • Generating at an 8k context length is harder than at our current 4k
    • More rounds of dialogue (e.g., chain-of-thought) are harder to support
  • Infrastructure:
    • We use an 8B-parameter model; a larger one could do better
    • Training on more GPUs with more data could help
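
  • A sketch of the face-count filter mentioned above (assumed tooling via the `trimesh` library, not our exact pipeline):

```python
# Sketch of filtering meshes by face count with trimesh (assumed tooling).
import trimesh

def keep_mesh(path, max_faces=500):
    mesh = trimesh.load(path, force="mesh")  # load the file as a single mesh
    return len(mesh.faces) < max_faces       # keep only low-poly objects
```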


Future

  • Additional modalities, like textures or physical properties
  • Part-based generation, augmenting, labeling
  • More sophisticated dialogues
  • Training with longer context lengths
  • Chain of thought reasoning for the LLM (or VLM)


Jonathan Lorraine

Zhengyi Wang

Xiaohui Zeng

Sanja Fidler

Jun Zhu

Hang Su

Yikai Wang


More Info


Questions


Spare Slides


Training Details


Results Gallery

  • We can generate high-quality, diverse meshes with artist-like topology.