An overview of the Landscape of 3D Generation with PyTorch
Suvaditya Mukherjee, University of Southern California & Magnopus
suvadity@usc.edu
Evolving Landscape of 3D and 4D Generation
PyTorch 3D Ecosystem - Tools & Frameworks
Balancing Quality, Memory, and Inference Latency
Integrating PyTorch into Creative Workflows
Relevant PyTorch Optimizations for 3D-native stacks
- Current 3D Generation techniques render into one of three standard representations: a NeRF (Neural Radiance Field), a Mesh/Point Cloud, or a 3D Gaussian Splat
- Works like TRELLIS and SF3D are bringing the Text-to-3D paradigm to the forefront now
- The field has simultaneously started making the jump into Text-to-4D generation (3D Generation with a time component) and world models to mimic environments
- PyTorch is a great fit due to well-implemented tensor operations that scale well to GPUs and other accelerators
- These models are generally more compute-intensive per unit than LLMs because of their sparse representations
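The three representations named above differ mainly in which tensors they store. The sketch below is illustrative only: the class names (`PointCloud`, `TriangleMesh`, `GaussianSplats`, `TinyRadianceField`) are hypothetical and do not come from any library, and the radiance field omits positional encoding and volume rendering for brevity.

```python
from dataclasses import dataclass

import torch


@dataclass
class PointCloud:
    points: torch.Tensor      # (N, 3) xyz positions


@dataclass
class TriangleMesh:
    vertices: torch.Tensor    # (V, 3) xyz positions
    faces: torch.Tensor       # (F, 3) integer vertex indices, one row per triangle


@dataclass
class GaussianSplats:
    means: torch.Tensor       # (N, 3) Gaussian centres
    scales: torch.Tensor      # (N, 3) per-axis extents
    rotations: torch.Tensor   # (N, 4) unit quaternions
    opacities: torch.Tensor   # (N,) values in [0, 1]
    colors: torch.Tensor      # (N, 3) RGB; real systems often store spherical harmonics


class TinyRadianceField(torch.nn.Module):
    """NeRF-style field: maps a 3D position and a view direction to (rgb, density)."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(6, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, 4),
        )

    def forward(self, xyz: torch.Tensor, view_dir: torch.Tensor):
        out = self.net(torch.cat([xyz, view_dir], dim=-1))
        rgb = torch.sigmoid(out[..., :3])                         # colours in [0, 1]
        density = torch.nn.functional.softplus(out[..., 3])       # non-negative density
        return rgb, density


# Query the (untrained) field at 8 sample points along hypothetical rays.
field = TinyRadianceField()
rgb, density = field(torch.rand(8, 3), torch.rand(8, 3))
```

The explicit representations (points, meshes, splats) are plain tensors that can be edited and exported directly, while the radiance field keeps the scene inside network weights and has to be queried point by point.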
- PyTorch3D is a library maintained by FAIR at Meta, offering some of the most complete PyTorch-centric 3D components
- First-class support for torch.Tensor
- The NeRFStudio Project has been one of the most important contributors to the 3D PyTorch ecosystem
- Libraries like nerfstudio, gsplat, and nerfacc are built on PyTorch and serve as essential building blocks for the community
- Useful for interop between PyTorch and NumPy
- Essential for file I/O and numerical operations on meshes and point clouds
- NVIDIA maintains Kaolin, an important tool for 3D research with features like full-fledged physics simulations, visualizers that support Jupyter, and differentiable renderers
- Kaolin Wisp is another PyTorch-native tool that makes it easy to work with Neural Fields
- Text-to-3D models benefit substantially from quantization because their representations are inherently sparse, which allows high-quality generations even after extreme quantization
- The experiment (below) compares a standard TRELLIS pipeline against a pipeline with Int4 quantization (through torchao). We generate 5 samples and average the statistics over them to get the final results
- We see up to ~20% savings in memory at the cost of higher inference latency due to the dequantization overhead
- Using torch.compile on the quantized model brought down inference time, but came with higher VRAM consumption due to kernel caching on the GPU
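A comparison like the one described above needs a harness that runs each pipeline several times and averages latency and peak memory. The sketch below is one possible shape for such a harness, not the code used for the poster's experiment: `benchmark` and its `generate` argument are hypothetical names, and the stand-in workload only exists to make the example runnable. Peak memory is reported only when a CUDA device is available.

```python
import statistics
import time

import torch


def benchmark(generate, n_samples: int = 5) -> dict:
    """Call `generate()` n_samples times and average latency and peak GPU memory."""
    use_cuda = torch.cuda.is_available()
    latencies, peaks_mib = [], []
    for _ in range(n_samples):
        if use_cuda:
            torch.cuda.reset_peak_memory_stats()
            torch.cuda.synchronize()          # finish pending GPU work before timing
        start = time.perf_counter()
        generate()
        if use_cuda:
            torch.cuda.synchronize()          # wait for the generation kernels to finish
        latencies.append(time.perf_counter() - start)
        if use_cuda:
            peaks_mib.append(torch.cuda.max_memory_allocated() / 2**20)
    return {
        "n_samples": n_samples,
        "mean_latency_s": statistics.mean(latencies),
        "mean_peak_mib": statistics.mean(peaks_mib) if peaks_mib else None,
    }


# Stand-in workload; in practice `generate` would wrap one full pipeline call.
weights = torch.rand(256, 256)
stats = benchmark(lambda: weights @ weights)
```

Running the same harness on the baseline, the quantized pipeline, and the quantized-plus-compiled pipeline gives numbers that are directly comparable. For compiled models, a warm-up call before measuring keeps one-off compilation time out of the averages.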
- Cheap & free optimization through use of torch.compile
- Caching latents from the text encoder and VAE can also help repeated calls
- Quick drop in precision with torch.amp (Automatic Mixed-Precision) can also help in reducing compute requirements
- Finding a balance between quality and speed by increasing/decreasing number of rays sampled and/or image resolution can be vital
- Quantization through torchao, or through other libraries such as bitsandbytes and quanto, is another option
- Production-scale inference can be optimized with ExecuTorch and AOT-compilation
- Advanced optimization strategies would include the use of custom Triton kernels for inference, CUDA-compatible rasterizers (like nvdiffrast) for differentiable rendering in case of Gaussians or polygon renders, or using Occupancy Grids to speed up NeRF-style renders
- Profiling with tlparse or torch.profiler is beneficial in finding gaps and bottlenecks
- The recent announcement of torchax makes TPUs easier to use for 3D Generation, with higher compute availability and a larger ecosystem through JAX
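Two of the cheaper optimizations listed above, caching text-encoder outputs and dropping precision with automatic mixed precision, can be sketched as follows. Everything model-related here is a hypothetical stand-in (`TinyTextEncoder`, `CachedEncoder`, and the linear `decoder` are not part of TRELLIS or any other library), and the example assumes CPU autocast with bfloat16 so that it runs without a GPU.

```python
import torch


class TinyTextEncoder(torch.nn.Module):
    """Hypothetical stand-in for a text encoder: averages byte embeddings of the prompt."""

    def __init__(self, vocab_size: int = 256, dim: int = 32):
        super().__init__()
        self.embed = torch.nn.EmbeddingBag(vocab_size, dim)  # default mode is "mean"

    def forward(self, prompt: str) -> torch.Tensor:
        ids = torch.tensor([list(prompt.encode("utf-8"))])   # (1, num_bytes)
        return self.embed(ids)                               # (1, dim)


class CachedEncoder:
    """Stores encoder outputs per prompt so repeated calls skip the encoder."""

    def __init__(self, encoder: torch.nn.Module):
        self.encoder = encoder
        self.cache = {}
        self.misses = 0

    @torch.inference_mode()
    def __call__(self, prompt: str) -> torch.Tensor:
        if prompt not in self.cache:
            self.misses += 1
            self.cache[prompt] = self.encoder(prompt)
        return self.cache[prompt]


def generate(prompt, encoder, decoder, device_type="cpu"):
    latent = encoder(prompt)
    # Autocast runs eligible ops (such as Linear) in lower precision.
    with torch.inference_mode(), torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        return decoder(latent)


encoder = CachedEncoder(TinyTextEncoder())
decoder = torch.nn.Linear(32, 24)            # hypothetical stand-in for the 3D decoder
first = generate("a wooden chair", encoder, decoder)
second = generate("a wooden chair", encoder, decoder)   # served from the cache
```

On a GPU the same pattern applies with `device_type="cuda"` and, typically, `torch.float16` or `torch.bfloat16`. Wrapping the decoder in `torch.compile` is a separate, complementary step.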
- OpenUSD standards need more adoption for 3D Generation to help move 3D generation artifacts across tools more easily
- Rise of World Models with persistent memory for generating worlds on-the-fly will lead to stronger interest in this space
- Model latency needs to come down through better algorithms and a stronger software stack for 3D and Graphics researchers to build on
- Better support within existing VFX and 3D tools like Blender, Unreal Engine, Unity etc. through plugins
- Need for better benchmarks of performance with 3D Generation models
- Biggest win for injecting PyTorch into creative workflows is for 3D Asset Generation in VFX pipelines
- Material/Geometry matching with differentiable rendering for 3D objects against reference plates is also an important application
- Monocular Depth Estimation for hard-to-solve shots is useful in Nuke compositor pipelines to simulate relighting and camera vignettes/mattes
- Scene Reconstruction with partial data is useful for recreating scenes in Unity/Unreal for CGI workflows
- Mix-and-match textures and generate your own with Blender plugins using PyTorch under the hood to create designs on the fly
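Material matching against a reference plate, as mentioned above, comes down to optimizing scene parameters through a differentiable image-formation model. The sketch below is a toy version under strong assumptions: it uses a hand-written Lambertian shader instead of a real differentiable renderer such as nvdiffrast or PyTorch3D, the normals and light direction are synthetic, and `shade` and `fit_albedo` are hypothetical helper names.

```python
import torch
import torch.nn.functional as F


def shade(albedo, normals, light_dir):
    """Toy Lambertian shading: per-pixel colour = albedo * max(n . l, 0)."""
    n_dot_l = (normals @ light_dir).clamp(min=0.0)   # (H, W)
    return n_dot_l.unsqueeze(-1) * albedo            # (H, W, 3)


def fit_albedo(reference, normals, light_dir, steps=300, lr=0.05):
    """Recover the albedo by gradient descent on the image-space error."""
    albedo = torch.full((3,), 0.5, requires_grad=True)   # neutral grey starting guess
    optimizer = torch.optim.Adam([albedo], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.mse_loss(shade(albedo, normals, light_dir), reference)
        loss.backward()                                   # gradients flow through the shader
        optimizer.step()
    return albedo.detach()


torch.manual_seed(0)
normals = F.normalize(torch.randn(16, 16, 3), dim=-1)          # synthetic surface normals
light_dir = F.normalize(torch.tensor([0.3, 0.5, 1.0]), dim=0)  # assumed known light
true_albedo = torch.tensor([0.8, 0.3, 0.1])
reference = shade(true_albedo, normals, light_dir)             # stands in for the plate
estimated = fit_albedo(reference, normals, light_dir)
```

A production setup would swap the toy shader for a full differentiable rasterizer and optimize richer parameters (textures, roughness, geometry), but the loop of render, compare to the plate, and backpropagate stays the same.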
Experiments (Colab Notebook)