1 of 18

MeshFormer: High-Quality Mesh Generation with a 3D-Guided Reconstruction Model

Based on the work of M. Liu, C. Zeng, X. Wei, et al.

Shubhan Pawar | University of Pennsylvania | September 25, 2025

2 of 18

The Goal - Democratizing High-Quality 3D Content

  • The Problem: The paper tackles sparse-view reconstruction—generating a high-quality, textured 3D mesh from just a few 2D images.
  • Why It's Hard: With only a few views, there's a lot of missing information, creating ambiguity about the object's true shape. The model needs extensive prior knowledge to fill in these gaps.

Shubhan Pawar | University of Pennsylvania | Fall 2025 | MeshFormer, High-Quality Mesh Generation with a 3D-Guided Reconstruction Model | September 2025 | 1

3 of 18


Approach 1: Per-Shape Optimization (e.g., DreamFusion)

    • Method: Uses large 2D diffusion models as priors to optimize a 3D representation (like NeRF) on a per-object basis through Score Distillation Sampling (SDS) losses.
    • Limitation: This process is extremely slow, often taking hours for a single object, and can suffer from 3D inconsistency issues.

Approach 2: Feed-Forward NeRF Models (e.g., One-2-3-45)

    • Method: A feed-forward network is trained to directly predict a Neural Radiance Field (NeRF) representation from the input views.
    • Limitation: While fast, extracting a clean, high-quality polygonal mesh from a NeRF's density field is a non-trivial problem. The resulting meshes are often noisy, suffer from artifacts, or lack fine-grained detail.

Approach 3: Large Reconstruction Models (LRMs) on Triplanes (e.g., MeshLRM)

    • Method: These models use large-scale transformers to process a "triplane" representation—decomposing a 3D field into three 2D feature planes.
    • Limitation: They require massive computational resources for training (often over 100 GPUs). Furthermore, the triplane representation itself can introduce axis-aligned artifacts and has less explicit 3D structure.

The Landscape - Prior Approaches & Their Limitations

4 of 18

  • The Core Idea: High-quality meshes come from a single pipeline where inputs, architecture, and supervision are all grounded in 3D. Normals help twice (as early guidance and as a training target), while the SDF provides the stable geometric core the renderer builds on.
  • Inputs & Fusion (early): Multi-view RGB and predicted normal maps are encoded into 2D features. Via projection-aware cross-attention, each 3D voxel pulls consistent evidence from the views (using both features and raw values). This builds a 3D feature volume that respects visibility and geometry.
  • Geometry Core (middle): A small head predicts a high-resolution SDF volume, from which a mesh is extracted. SDF and occupancy losses supervise the 3D shape directly, giving the stable geometric frame the rest of the pipeline depends on.
  • Detail & Supervision (late): Separate heads predict color texture and a normal texture. A differentiable renderer produces RGB and normal maps; color/normal losses (MSE + LPIPS) compare those renders to the inputs.
  • Key clarification: Normals are used twice, as inputs (guidance during fusion) and as targets (via the predicted normal texture); the loss never back-propagates through mesh-derived normals.


The MeshFormer Philosophy - A "3D-Native" Approach

5 of 18

  • Triplane Representation:
    • Represents a 3D scene by decomposing it into multiple 2D planes, which is memory-efficient compared to dense grids.
    • Drawback: It lacks explicit 3D spatial structure, making it difficult to establish precise correspondence between a 3D location and its projected 2D pixels. This can lead to costly all-to-all attention mechanisms and cumbersome training.
  • Dense Voxel Representation:
    • Discretizes space into a regular 3D grid, directly compatible with 3D convolutions.
    • Drawback: Extremely memory-intensive, as it stores information for every point in space, including empty regions. Processing dense grids is computationally demanding.
  • MeshFormer's Choice: Sparse Voxels
    • Method: Stores information only in occupied regions of space, often using a hierarchical structure.
    • Advantage: This provides the best of both worlds. It explicitly preserves 3D spatial structure like a dense grid but is significantly more memory- and computationally-efficient by ignoring empty space. This allows for higher effective feature resolutions.
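The memory argument can be made concrete with a toy sketch (pure Python with NumPy; this is not the paper's data structure, and the resolution, shell shape, and dict-based store are illustrative assumptions): a dense grid allocates every cell, while a sparse grid stores only the occupied voxels, keyed by coordinate.

```python
import numpy as np

def dense_grid(res):
    # Dense representation: one feature slot per cell, occupied or not.
    return np.zeros((res, res, res), dtype=np.float32)

def sparse_grid(occupied_coords):
    # Sparse representation: store features only for occupied voxels,
    # keyed by their (x, y, z) integer coordinates.
    return {tuple(c): 0.0 for c in occupied_coords}

res = 64
# A thin spherical shell (a stand-in for an object surface) occupies
# only a small fraction of the full volume.
xs, ys, zs = np.indices((res, res, res))
r = np.sqrt((xs - 32) ** 2 + (ys - 32) ** 2 + (zs - 32) ** 2)
shell = np.argwhere(np.abs(r - 20) < 1.0)

dense_cells = res ** 3
sparse_cells = len(sparse_grid(shell))
print(dense_cells, sparse_cells)  # the sparse store holds a few % of the dense count
```

Because only surface-adjacent cells are stored, the same memory budget buys a much higher effective resolution, which is the advantage claimed above.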


Representation - Why Voxels over Triplanes?

6 of 18

  • The Challenge: How to process high-resolution 3D data while also capturing the long-range dependencies needed for complex object priors?
  • The Solution: A hybrid UNet architecture that leverages the complementary strengths of 3D Convolutions and Transformers.
  • 3D Convolutions for Spatial Encoding/Decoding:
    • The encoder and decoder branches of the UNet use 3D convolutions (specifically, sparse convolutions in the fine stage).
    • 3D convolutions excel at detecting local geometric patterns like edges, corners, and textures within a confined neighborhood. Their inherent weight-sharing property makes them parameter-efficient and encourages generalization.
  • Transformer at the Bottleneck: At the most compressed, low-resolution stage of the UNet (the bottleneck), the voxel features are treated as a sequence of tokens and fed into a standard Transformer.
    • The Transformer's self-attention mechanism is uniquely capable of capturing global object priors and long-range dependencies between distant parts of the object. Placing it at the bottleneck makes this global reasoning computationally feasible.
  • Synergy: This hybrid model learns local details efficiently with convolutions while using the powerful Transformer for global shape understanding, outperforming pure CNN or pure ViT approaches.
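A minimal NumPy sketch of the bottleneck step (illustrative only: the sizes and random weights are assumptions, and the real model uses learned, multi-head attention inside a sparse UNet): low-resolution voxel features are flattened into a token sequence and mixed globally by self-attention.

```python
import numpy as np

def self_attention(tokens, rng):
    # Single-head self-attention: every voxel token attends to every other
    # token, which is affordable only at the UNet's low-resolution bottleneck.
    d = tokens.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

rng = np.random.default_rng(0)
res, d = 4, 8                        # bottleneck: 4^3 voxels, 8-dim features
voxel_feats = rng.standard_normal((res, res, res, d))
tokens = voxel_feats.reshape(-1, d)  # flatten the 3D grid into 64 tokens
out = self_attention(tokens, rng)
print(out.shape)                     # (64, 8): same sequence, globally mixed
```

At full resolution this all-to-all mixing would be quadratic in the number of voxels; placing it at the bottleneck is what makes the global reasoning tractable.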


Architecture - A Hybrid of Convolution and Transformer

7 of 18

  • The Problem: How does a 3D voxel feature effectively aggregate information from multiple 2D input views, especially in the presence of occlusion?
  • Naive Methods: Simple mean or max pooling of features is suboptimal and struggles with visibility issues.
  • MeshFormer's Solution: A Projection-Aware Cross-Attention mechanism.
    • Project each 3D voxel into every input view.
    • At each projected pixel, sample 2D features (from RGB and normal maps) plus the raw RGB/normal values.
    • The voxel is the query; sampled per-view evidence is key/value. Attention weights down-weight occluded/inconsistent views and aggregate the rest.
    • Benefit: a robust, learned fusion of multi-view data that degrades gracefully under occlusion.
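The mechanism can be sketched as follows (a NumPy toy, not the paper's code; the pinhole intrinsics, nearest-pixel sampling, and single-query attention are simplifying assumptions, and the real model uses learned key/value projections): each voxel center is projected into every view, per-view evidence is gathered at the projected pixels, and attention fuses it.

```python
import numpy as np

def project(point_world, K, R, t):
    # Pinhole projection of a 3D point into one view:
    # x_cam = R p + t, then pixel = K x_cam divided by depth.
    p_cam = R @ point_world + t
    uvw = K @ p_cam
    return uvw[:2] / uvw[2], p_cam[2]       # (u, v) pixel coords and depth

def sample_views(voxel_center, views, query):
    # Gather per-view evidence at the voxel's projected pixels, then let
    # the voxel (the query) weight the views via dot-product attention.
    feats = []
    for K, R, t, feat_map in views:
        (u, v), depth = project(voxel_center, K, R, t)
        iu = int(np.clip(round(u), 0, feat_map.shape[1] - 1))
        iv = int(np.clip(round(v), 0, feat_map.shape[0] - 1))
        feats.append(feat_map[iv, iu])
    feats = np.stack(feats)                  # (n_views, d)
    scores = feats @ query / np.sqrt(len(query))
    w = np.exp(scores - scores.max())
    w /= w.sum()                             # attention weights over views
    return w @ feats                         # fused voxel feature

rng = np.random.default_rng(1)
d, h, wpx = 8, 16, 16
K = np.array([[16.0, 0, 8], [0, 16.0, 8], [0, 0, 1]])
views = [(K, np.eye(3), np.array([0.0, 0, 4]), rng.standard_normal((h, wpx, d)))
         for _ in range(4)]
fused = sample_views(np.array([0.1, -0.2, 0.0]), views, rng.standard_normal(d))
print(fused.shape)  # (8,)
```

In the real model the per-pixel evidence concatenates encoder features with raw RGB/normal values, and the attention weights are what learn to down-weight occluded views.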


The 3D-2D Connection

Projection-Aware Cross-Attention (RGB + Normal Inputs)

8 of 18

  • Why bring normals in early? RGB alone is ambiguous; normal maps carry explicit surface orientation, which helps each voxel resolve the local surface.
  • Inputs & encoding: Multi-view RGB and normal maps (e.g., from Zero123++) are taken as input; normals are converted from the camera frame to the world frame, and both RGB and normal images pass through the same 2D encoder (DINOv2) to produce patch features.
  • Per-voxel projection & sampling: Each 3D voxel projects into all input views, samples DINOv2 features + raw RGB/normal at those pixels, and uses cross-attention to weight visible, consistent evidence and fuse it into the voxel’s 3D feature.
  • Normals are used again later as a training target.


Normals as Input (Guidance) — used before any loss

9 of 18

  • The Problem: Training a model to output a mesh is notoriously unstable.
    • Relying only on a differentiable rendering loss makes the optimization landscape difficult to navigate without a good initialization.
    • Relying only on a geometric loss may not produce photorealistic textures.
  • Solution: A unified, single-stage training process that combines two complementary loss signals.
    • Direct SDF Loss (Geometric Accuracy):
      • The network is trained to predict a Signed Distance Function (SDF) field, which measures the distance of any point to the nearest surface.
      • This provides a powerful, explicit, and continuous 3D geometric signal that acts as a strong regularizer, ensuring the model learns a coherent and valid shape from the very beginning.
    • Differentiable Surface Rendering Loss (Visual Fidelity):
      • A mesh is extracted from the predicted SDF using Marching Cubes.
      • This mesh is then differentiably rendered, and the resulting images (both color and normal) are compared to the input views.
      • This loss refines the high-frequency surface details and the texture to ensure the final output is visually faithful to the input.
      • Normal rendering: use the predicted normal texture to render normal maps and compare to input normals (MSE + LPIPS).
      • Stability note: we do not back-prop through mesh-derived normals; learning a normal texture decouples detail from geometry.


Supervision - Unified Training with SDF and Rendering

10 of 18

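The equation on this slide did not survive the export. Below is a hedged reconstruction of the overall objective, assembled from the loss terms described on the supervision slide; the weights $\lambda$ are placeholders, not the paper's values.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{SDF}} \;+\; \mathcal{L}_{\mathrm{occ}}
\;+\; \lambda_{c}\,\mathcal{L}_{\mathrm{color}} \;+\; \lambda_{n}\,\mathcal{L}_{\mathrm{normal}},
\qquad
\mathcal{L}_{\mathrm{color}},\,\mathcal{L}_{\mathrm{normal}}
\;=\; \mathcal{L}_{\mathrm{MSE}} + \lambda_{p}\,\mathcal{L}_{\mathrm{LPIPS}}
```

The first two terms supervise geometry directly in 3D; the last two compare differentiably rendered color and normal maps against the input views.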


Mathematical Formulation

The Overall Training Objective

11 of 18

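This slide's formulas are missing from the export; for reference, the standard signed distance function that the geometry core predicts is:

```latex
f(\mathbf{x}) \;=\;
\begin{cases}
-\,d(\mathbf{x}, \partial\Omega) & \mathbf{x} \in \Omega \ \text{(inside)} \\
\hphantom{-}\,d(\mathbf{x}, \partial\Omega) & \mathbf{x} \notin \Omega \ \text{(outside)}
\end{cases}
\qquad
d(\mathbf{x}, \partial\Omega) \;=\; \min_{\mathbf{y} \in \partial\Omega} \lVert \mathbf{x} - \mathbf{y} \rVert_2
```

Marching Cubes then extracts the zero level set $\{\mathbf{x} : f(\mathbf{x}) = 0\}$ as the output mesh.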


Mathematical Formulation

The Geometric Core - SDF

12 of 18

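The equation on this slide is missing from the export. A standard form of the projection-aware cross-attention described earlier is given below; the exact construction of the keys and values is my assumption.

```latex
\mathbf{f}_{v} \;=\; \mathrm{softmax}\!\left(\frac{\mathbf{q}_{v} K_{v}^{\top}}{\sqrt{d}}\right) V_{v}
```

Here $\mathbf{q}_{v}$ is the query derived from voxel $v$'s current feature, and $K_{v}, V_{v}$ stack, one row per input view, the 2D features and raw RGB/normal values sampled at voxel $v$'s projected pixel in that view.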


Mathematical Formulation

Projection-Aware Cross-Attention

13 of 18

  • Unstable (what we don’t do)
    • Render mesh-derived normals and back-prop into vertex positions
    • Small vertex moves → large, non-linear normal flips across faces
    • Leads to noisy updates and unstable mesh learning
  • Stable (what we do)
    • Predict a 3D normal texture
    • Render normal maps from that texture; compare to input normals
    • Decouples fine detail from geometry → stable training


Stable vs. Unstable Normal Supervision

14 of 18

  1. State-of-the-Art Performance
    1. MeshFormer significantly outperforms prior and concurrent methods like One-2-3-45++, TripoSR, and MeshLRM across two challenging, unseen test datasets (GSO and OmniObject3D).
    2. It achieves the best scores on most 3D metrics (F-Score, Chamfer Distance) and 2D rendering metrics (PSNR).
  2. The Key Differentiator: Training Efficiency
    • Many competing triplane-based LRM models, such as TripoSR and InstantMesh, require massive compute clusters for training, often cited as 128 or 176 high-end GPUs running for several days.
    • MeshFormer can be trained effectively on just 8 GPUs in about two days.
    • When a competing model (MeshLRM) was trained under the same limited 8-GPU, 2-day budget, it performed poorly, while MeshFormer achieved results close to its fully-trained version.


Results

Quantitative & Unprecedented Training Efficiency

15 of 18



Results

Qualitative Analysis

16 of 18

  • The authors conducted ablation studies to prove the importance of each component of their system:
    • Without Normal Map Input: Performance drops significantly, especially in geometric detail, confirming that normal maps provide crucial, unambiguous geometric guidance.
    • Without SDF Supervision: Training with only a surface rendering loss becomes unstable, and the geometry quickly deteriorates. This proves the SDF loss is essential for providing a stable geometric foundation.
    • Without the Transformer Layer: Texture quality metrics drop, indicating that the Transformer's ability to model global priors is most critical for learning complex appearance.
    • Without Geometry Enhancement: The final F-score and CD metrics are slightly worse, quantitatively verifying that the post-processing step effectively sharpens and improves the final geometry.


Ablation Studies

Validating Each Design Choice

17 of 18

  • Summary: MeshFormer presents a highly efficient and effective model for sparse-view 3D reconstruction. By integrating 3D-native principles into its architecture, supervision, and input, it generates state-of-the-art textured meshes.
  • Key Contributions:
    • A novel, efficient hybrid architecture combining sparse 3D convolutions and a Transformer.
    • A stable, unified training strategy using both SDF and surface rendering losses.
    • A novel use of multi-view normal maps for both input guidance and post-processing geometry enhancement.
  • Limitations & Future Work:
    • The model's performance is heavily reliant on the quality and consistency of the multi-view images and normal maps generated by upstream 2D models.
    • Errors or inconsistencies from these 2D models can propagate and cause defects in the final 3D reconstruction.
    • Future work could focus on improving the model’s robustness to handle such imperfect inputs.


Conclusion & Future Work

18 of 18

Thank You


MeshFormer integrates RGB and normal inputs with a 3D-guided architecture, using SDF for stable geometry and rendering losses for fine detail. By leveraging projection-aware cross-attention, it produces high-quality textured meshes from sparse views while remaining computationally efficient, outperforming prior methods in accuracy and training scalability.

Shubhan Pawar

Graduate Student – Electrical & Systems Engineering

E: Shubhan@thepawars.com

University of Pennsylvania | Fall 2025