1 of 18

MeshFormer: High-Quality Mesh Generation with a 3D-Guided Reconstruction Model

Based on the work of M. Liu, C. Zeng, X. Wei, et al.

Shubhan Pawar | University of Pennsylvania | September 25, 2025

2 of 18

The Goal - Democratizing High-Quality 3D Content

  • The Problem: The paper tackles sparse-view reconstruction—generating a high-quality, textured 3D mesh from just a few 2D images.
  • Why It's Hard: With only a few views, there's a lot of missing information, creating ambiguity about the object's true shape. The model needs extensive prior knowledge to fill in these gaps.

Shubhan Pawar | University of Pennsylvania | Fall 2025 | MeshFormer, High-Quality Mesh Generation with a 3D-Guided Reconstruction Model | September 2025 | 1

3 of 18


Approach 1: Per-Shape Optimization (e.g., DreamFusion)

    • Method: Uses large 2D diffusion models as priors to optimize a 3D representation (like NeRF) on a per-object basis through Score Distillation Sampling (SDS) losses.
    • Limitation: This process is extremely slow, often taking hours for a single object, and can suffer from 3D inconsistency issues.

Approach 2: Feed-Forward NeRF Models (e.g., One-2-3-45)

    • Method: A feed-forward network is trained to directly predict a Neural Radiance Field (NeRF) representation from the input views.
    • Limitation: While fast, extracting a clean, high-quality polygonal mesh from a NeRF's density field is a non-trivial problem. The resulting meshes are often noisy, suffer from artifacts, or lack fine-grained detail.

Approach 3: Large Reconstruction Models (LRMs) on Triplanes (e.g., MeshLRM)

    • Method: These models use large-scale transformers to process a "triplane" representation—decomposing a 3D field into three 2D feature planes.
    • Limitation: They require massive computational resources for training (often over 100 GPUs). Furthermore, the triplane representation itself can introduce axis-aligned artifacts and has less explicit 3D structure.

The Landscape - Prior Approaches & Their Limitations

4 of 18

  • The Core Idea: High-quality meshes come from a single pipeline where inputs, architecture, and supervision are all grounded in 3D. Normals help twice (as early guidance and as a training target), while the SDF provides the stable geometric core the renderer builds on.
  • Inputs & Fusion (early): Multi-view RGB and predicted normal maps are encoded into 2D features. Via projection-aware cross-attention, each 3D voxel pulls consistent evidence from the views (using both features and raw values). This builds a 3D feature volume that respects visibility and geometry.
  • Geometry Core (middle): A small head predicts a high-resolution SDF volume, from which a mesh is extracted. SDF and occupancy losses supervise the 3D shape directly, giving the stable geometric frame the rest of the pipeline depends on.
  • Detail & Supervision (late): Separate heads predict color texture and a normal texture. A differentiable renderer produces RGB and normal maps; color/normal losses (MSE + LPIPS) compare those renders to the inputs.
  • Key clarification: Normals are used twice, as inputs (guidance during fusion) and as targets (via the predicted normal texture); the loss never back-propagates through mesh-derived normals.


The MeshFormer Philosophy - A "3D-Native" Approach

5 of 18

  • Triplane Representation:
    • Represents a 3D scene by decomposing it into multiple 2D planes, which is memory-efficient compared to dense grids.
    • Drawback: It lacks explicit 3D spatial structure, making it difficult to establish precise correspondence between a 3D location and its projected 2D pixels. This can lead to costly all-to-all attention mechanisms and cumbersome training.
  • Dense Voxel Representation:
    • Discretizes space into a regular 3D grid, directly compatible with 3D convolutions.
    • Drawback: Extremely memory-intensive, as it stores information for every point in space, including empty regions. Processing dense grids is computationally demanding.
  • MeshFormer's Choice: Sparse Voxels
    • Method: Stores information only in occupied regions of space, often using a hierarchical structure.
    • Advantage: This provides the best of both worlds. It explicitly preserves 3D spatial structure like a dense grid but is significantly more memory- and computationally-efficient by ignoring empty space. This allows for higher effective feature resolutions.
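The memory argument can be made concrete with a toy sketch (pure Python with NumPy; this is not the paper's data structure, and the resolution, shell shape, and dict-based store are illustrative assumptions): a dense grid allocates every cell, while a sparse grid stores only the occupied voxels, keyed by coordinate.

```python
import numpy as np

def dense_grid(res):
    # Dense representation: one feature slot per cell, occupied or not.
    return np.zeros((res, res, res), dtype=np.float32)

def sparse_grid(occupied_coords):
    # Sparse representation: store features only for occupied voxels,
    # keyed by their (x, y, z) integer coordinates.
    return {tuple(c): 0.0 for c in occupied_coords}

res = 64
# A thin spherical shell (a stand-in for an object surface) occupies
# only a small fraction of the full volume.
xs, ys, zs = np.indices((res, res, res))
r = np.sqrt((xs - 32) ** 2 + (ys - 32) ** 2 + (zs - 32) ** 2)
shell = np.argwhere(np.abs(r - 20) < 1.0)

dense_cells = res ** 3
sparse_cells = len(sparse_grid(shell))
print(dense_cells, sparse_cells)  # the sparse store holds a few % of the dense count
```

Because only surface-adjacent cells are stored, the same memory budget buys a much higher effective resolution, which is the advantage claimed above.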


Representation - Why Voxels over Triplanes?

6 of 18

  • The Challenge: How to process high-resolution 3D data while also capturing the long-range dependencies needed for complex object priors?
  • The Solution: A hybrid UNet architecture that leverages the complementary strengths of 3D Convolutions and Transformers.
  • 3D Convolutions for Spatial Encoding/Decoding:
    • The encoder and decoder branches of the UNet use 3D convolutions (specifically, sparse convolutions in the fine stage).
    • 3D convolutions excel at detecting local geometric patterns like edges, corners, and textures within a confined neighborhood. Their inherent weight-sharing property makes them parameter-efficient and encourages generalization.
  • Transformer at the Bottleneck: At the most compressed, low-resolution stage of the UNet (the bottleneck), the voxel features are treated as a sequence of tokens and fed into a standard Transformer.
    • The Transformer's self-attention mechanism is uniquely capable of capturing global object priors and long-range dependencies between distant parts of the object. Placing it at the bottleneck makes this global reasoning computationally feasible.
  • Synergy: This hybrid model learns local details efficiently with convolutions while using the powerful Transformer for global shape understanding, outperforming pure CNN or pure ViT approaches.
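A minimal NumPy sketch of the bottleneck step (illustrative only: the sizes and random weights are assumptions, and the real model uses learned, multi-head attention inside a sparse UNet): low-resolution voxel features are flattened into a token sequence and mixed globally by self-attention.

```python
import numpy as np

def self_attention(tokens, rng):
    # Single-head self-attention: every voxel token attends to every other
    # token, which is affordable only at the UNet's low-resolution bottleneck.
    d = tokens.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

rng = np.random.default_rng(0)
res, d = 4, 8                        # bottleneck: 4^3 voxels, 8-dim features
voxel_feats = rng.standard_normal((res, res, res, d))
tokens = voxel_feats.reshape(-1, d)  # flatten the 3D grid into 64 tokens
out = self_attention(tokens, rng)
print(out.shape)                     # (64, 8): same sequence, globally mixed
```

At full resolution this all-to-all mixing would be quadratic in the number of voxels; placing it at the bottleneck is what makes the global reasoning tractable.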


Architecture - A Hybrid of Convolution and Transformer

7 of 18

  • The Problem: How does a 3D voxel feature effectively aggregate information from multiple 2D input views, especially in the presence of occlusion?
  • Naive Methods: Simple mean or max pooling of features is suboptimal and struggles with visibility issues.
  • MeshFormer's Solution: A Projection-Aware Cross-Attention mechanism.
    • Project each 3D voxel into every input view.
    • At each projected pixel, sample 2D features (from RGB and normal maps) plus the raw RGB/normal values.
    • The voxel is the query; sampled per-view evidence is key/value. Attention weights down-weight occluded/inconsistent views and aggregate the rest.
    • Benefit: a robust, learned fusion of multi-view data that degrades gracefully under occlusion.
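The mechanism can be sketched as follows (a NumPy toy, not the paper's code; the pinhole intrinsics, nearest-pixel sampling, and single-query attention are simplifying assumptions, and the real model uses learned key/value projections): each voxel center is projected into every view, per-view evidence is gathered at the projected pixels, and attention fuses it.

```python
import numpy as np

def project(point_world, K, R, t):
    # Pinhole projection of a 3D point into one view:
    # x_cam = R p + t, then pixel = K x_cam divided by depth.
    p_cam = R @ point_world + t
    uvw = K @ p_cam
    return uvw[:2] / uvw[2], p_cam[2]       # (u, v) pixel coords and depth

def sample_views(voxel_center, views, query):
    # Gather per-view evidence at the voxel's projected pixels, then let
    # the voxel (the query) weight the views via dot-product attention.
    feats = []
    for K, R, t, feat_map in views:
        (u, v), depth = project(voxel_center, K, R, t)
        iu = int(np.clip(round(u), 0, feat_map.shape[1] - 1))
        iv = int(np.clip(round(v), 0, feat_map.shape[0] - 1))
        feats.append(feat_map[iv, iu])
    feats = np.stack(feats)                  # (n_views, d)
    scores = feats @ query / np.sqrt(len(query))
    w = np.exp(scores - scores.max())
    w /= w.sum()                             # attention weights over views
    return w @ feats                         # fused voxel feature

rng = np.random.default_rng(1)
d, h, wpx = 8, 16, 16
K = np.array([[16.0, 0, 8], [0, 16.0, 8], [0, 0, 1]])
views = [(K, np.eye(3), np.array([0.0, 0, 4]), rng.standard_normal((h, wpx, d)))
         for _ in range(4)]
fused = sample_views(np.array([0.1, -0.2, 0.0]), views, rng.standard_normal(d))
print(fused.shape)  # (8,)
```

In the real model the per-pixel evidence concatenates encoder features with raw RGB/normal values, and the attention weights are what learn to down-weight occluded views.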


The 3D-2D Connection

Projection-Aware Cross-Attention (RGB + Normal Inputs)

8 of 18

  • Why bring normals in early? RGB alone is ambiguous; normal maps carry explicit surface orientation, which helps each voxel resolve the local surface.
  • Inputs & encoding: Multi-view RGB and normal maps (e.g., from Zero123++) are taken as input; normals are converted from the camera frame to the world frame, and both RGB and normal images pass through the same 2D encoder (DINOv2) to produce patch features.
  • Per-voxel projection & sampling: Each 3D voxel projects into all input views, samples DINOv2 features + raw RGB/normal at those pixels, and uses cross-attention to weight visible, consistent evidence and fuse it into the voxel’s 3D feature.
  • Normals are used again later as a training target.


Normals as Input (Guidance) — used before any loss

9 of 18

  • The Problem: Training a model to output a mesh is notoriously unstable.
    • Relying only on a differentiable rendering loss makes the optimization landscape difficult to navigate without a good initialization.
    • Relying only on a geometric loss may not produce photorealistic textures.
  • Solution: A unified, single-stage training process that combines two complementary loss signals.
    • Direct SDF Loss (Geometric Accuracy):
      • The network is trained to predict a Signed Distance Function (SDF) field, which measures the distance of any point to the nearest surface.
      • This provides a powerful, explicit, and continuous 3D geometric signal that acts as a strong regularizer, ensuring the model learns a coherent and valid shape from the very beginning.
    • Differentiable Surface Rendering Loss (Visual Fidelity):
      • A mesh is extracted from the predicted SDF using Marching Cubes.
      • This mesh is then differentiably rendered, and the resulting images (both color and normal) are compared to the input views.
      • This loss refines the high-frequency surface details and the texture to ensure the final output is visually faithful to the input.
      • Normal rendering: use the predicted normal texture to render normal maps and compare to input normals (MSE + LPIPS).
      • Stability note: we do not back-prop through mesh-derived normals; learning a normal texture decouples detail from geometry.


Supervision - Unified Training with SDF and Rendering

10 of 18

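The equation on this slide did not survive the export. Below is a hedged reconstruction of the overall objective, assembled from the loss terms described on the supervision slide; the weights $\lambda$ are placeholders, not the paper's values.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{SDF}} \;+\; \mathcal{L}_{\mathrm{occ}}
\;+\; \lambda_{c}\,\mathcal{L}_{\mathrm{color}} \;+\; \lambda_{n}\,\mathcal{L}_{\mathrm{normal}},
\qquad
\mathcal{L}_{\mathrm{color}},\,\mathcal{L}_{\mathrm{normal}}
\;=\; \mathcal{L}_{\mathrm{MSE}} + \lambda_{p}\,\mathcal{L}_{\mathrm{LPIPS}}
```

The first two terms supervise geometry directly in 3D; the last two compare differentiably rendered color and normal maps against the input views.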


Mathematical Formulation

The Overall Training Objective

11 of 18

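This slide's formulas are missing from the export; for reference, the standard signed distance function that the geometry core predicts is:

```latex
f(\mathbf{x}) \;=\;
\begin{cases}
-\,d(\mathbf{x}, \partial\Omega) & \mathbf{x} \in \Omega \ \text{(inside)} \\
\hphantom{-}\,d(\mathbf{x}, \partial\Omega) & \mathbf{x} \notin \Omega \ \text{(outside)}
\end{cases}
\qquad
d(\mathbf{x}, \partial\Omega) \;=\; \min_{\mathbf{y} \in \partial\Omega} \lVert \mathbf{x} - \mathbf{y} \rVert_2
```

Marching Cubes then extracts the zero level set $\{\mathbf{x} : f(\mathbf{x}) = 0\}$ as the output mesh.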


Mathematical Formulation

The Geometric Core - SDF

12 of 18

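The equation on this slide is missing from the export. A standard form of the projection-aware cross-attention described earlier is given below; the exact construction of the keys and values is my assumption.

```latex
\mathbf{f}_{v} \;=\; \mathrm{softmax}\!\left(\frac{\mathbf{q}_{v} K_{v}^{\top}}{\sqrt{d}}\right) V_{v}
```

Here $\mathbf{q}_{v}$ is the query derived from voxel $v$'s current feature, and $K_{v}, V_{v}$ stack, one row per input view, the 2D features and raw RGB/normal values sampled at voxel $v$'s projected pixel in that view.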


Mathematical Formulation

Projection-Aware Cross-Attention

13 of 18

  • Unstable (what we don’t do)
    • Render mesh-derived normals and back-prop into vertex positions
    • Small vertex moves → large, non-linear normal flips across faces
    • Leads to noisy updates and unstable mesh learning
  • Stable (what we do)
    • Predict a 3D normal texture
    • Render normal maps from that texture; compare to input normals
    • Decouples fine detail from geometry → stable training


Stable vs. Unstable Normal Supervision

14 of 18

  1. State-of-the-Art Performance
    1. MeshFormer significantly outperforms prior and concurrent methods like One-2-3-45++, TripoSR, and MeshLRM across two challenging, unseen test datasets (GSO and OmniObject3D).
    2. It achieves the best scores on most 3D metrics (F-Score, Chamfer Distance) and 2D rendering metrics (PSNR).
  2. The Key Differentiator: Training Efficiency
    • Many competing triplane-based LRM models, such as TripoSR and InstantMesh, require massive compute clusters for training, often cited as 128 or 176 high-end GPUs running for several days.
    • MeshFormer can be trained effectively on just 8 GPUs in about two days.
    • When a competing model (MeshLRM) was trained under the same limited 8-GPU, 2-day budget, it performed poorly, while MeshFormer achieved results close to its fully-trained version.


Results

Quantitative & Unprecedented Training Efficiency

15 of 18



Results

Qualitative Analysis

16 of 18

  • The authors conducted ablation studies to prove the importance of each component of their system:
    • Without Normal Map Input: Performance drops significantly, especially in geometric detail, confirming that normal maps provide crucial, unambiguous geometric guidance.
    • Without SDF Supervision: Training with only a surface rendering loss becomes unstable, and the geometry quickly deteriorates. This proves the SDF loss is essential for providing a stable geometric foundation.
    • Without the Transformer Layer: Texture quality metrics drop, indicating that the Transformer's ability to model global priors is most critical for learning complex appearance.
    • Without Geometry Enhancement: The final F-score and CD metrics are slightly worse, quantitatively verifying that the post-processing step effectively sharpens and improves the final geometry.


Ablation Studies

Validating Each Design Choice

17 of 18

  • Summary: MeshFormer presents a highly efficient and effective model for sparse-view 3D reconstruction. By integrating 3D-native principles into its architecture, supervision, and input, it generates state-of-the-art textured meshes.
  • Key Contributions:
    • A novel, efficient hybrid architecture combining sparse 3D convolutions and a Transformer.
    • A stable, unified training strategy using both SDF and surface rendering losses.
    • A novel use of multi-view normal maps for both input guidance and post-processing geometry enhancement.
  • Limitations & Future Work:
    • The model's performance is heavily reliant on the quality and consistency of the multi-view images and normal maps generated by upstream 2D models.
    • Errors or inconsistencies from these 2D models can propagate and cause defects in the final 3D reconstruction.
    • Future work could focus on improving the model’s robustness to handle such imperfect inputs.


Conclusion & Future Work

18 of 18

Thank You


MeshFormer integrates RGB and normal inputs with a 3D-guided architecture, using SDF for stable geometry and rendering losses for fine detail. By leveraging projection-aware cross-attention, it produces high-quality textured meshes from sparse views while remaining computationally efficient, outperforming prior methods in accuracy and training scalability.

Shubhan Pawar

Graduate Student – Electrical & Systems Engineering

E: Shubhan@thepawars.com

University of Pennsylvania | Fall 2025